Gigantua: 1.5 Giganodes per Second per Core move generator

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Gigantua: 1.5 Giganodes per Second per Core move generator

Post by dangi12012 »

So for the last 2½ I was obsessed with the move generation problem of chess.
I originally wrote a few move generators in C# and C before even looking into www.chessprogramming.org
My movegenerator generates 1500 MNodes on a single thread without hashing for kiwipete and needs 5300ms for depth 6.

Not looking into chessprogramming.org and other solutions before gave me the great opportunity to develop some original ideas that I saw were not present in any existing movegenerator. I was happy to see that hashing with different seeds (or as you call it magic hashing) was a good idea - but as it turns out it will soon be obsolete by fast pext hardware on all processors (seed lookup will still be important for gpus)

There is a way to write a legal movegenerator without a movelist and without the concept of make/unmaking a move - as it turns out its way faster to directly enumerate all moves without a between step of generating and storing and looping over the results + making/unmaking moves.

This Movegenerator is a Incremental Bitboard generator. Which only ever looks at the 2/3 bits(enpassant)/4 bits(castling) that change during all possible moves. All checks are handled by a single & instruction. All pins are handled by maximum two & instrustions. All Combinations of all Enpassant / Pinned / Check moves are handled the same code.

So this engine has:
- abscent movelist (all moves still get a callback for possible ordering)
- Checks and Pins need 3 Instructions max
- Branchless Metaprogramming
- Bulk Counting for leaf nodes (can be disabled)
- Probably the fastest Slider + Slider Xray lookup possible with a total of 4 instructions per slider move (pext)

Possible Perfomance improvements:
- Hashing
- Multithreading

What you need to run this:
- Ryzen 5000 or Intel Processor
- Zen2/3 build will come soon and will be 25% slower.

If you have a Ryzen 5000 / Intel give it a go: (github page and sourcecode will follow soon)
https://1drv.ms/u/s!AoDIyNlDG2QTg8YI8qM ... A?e=TlOcXb

Use it with your own Fen string like this:
Gigantua_Zen3.exe <FEN> <depth>
Gigantua_Zen3.exe "r3k2r/Pppp1ppp/1b3nbN/nP6/BBP1P3/q4N2/Pp1P2PP/R2Q1RK1 w kq - 0 1" 6


So anyway here is the output:
Perft Start 1: 20 0ms 1.05263 MNodes/s
Perft Start 2: 400 0ms 30.7692 MNodes/s
Perft Start 3: 8902 0ms 127.171 MNodes/s
Perft Start 4: 197281 0ms 583.672 MNodes/s
Perft Start 5: 4865609 5ms 861.322 MNodes/s
Perft Start 6: 119060324 131ms 903.972 MNodes/s
Perft Start 7: 3195901860 3286ms 972.469 MNodes/s
OK

Perft Kiwi 1: 48 0ms 6.85714 MNodes/s
Perft Kiwi 2: 2039 0ms 169.917 MNodes/s
Perft Kiwi 3: 97862 0ms 843.638 MNodes/s
Perft Kiwi 4: 4085603 3ms 1356.44 MNodes/s
Perft Kiwi 5: 193690690 119ms 1621.86 MNodes/s
Perft Kiwi 6: 8031647685 5363ms 1497.43 MNodes/s
OK

Perft Midgame 1: 46 0ms 6.57143 MNodes/s
Perft Midgame 2: 2079 0ms 346.5 MNodes/s
Perft Midgame 3: 89890 0ms 1123.62 MNodes/s
Perft Midgame 4: 3894594 2ms 1570.4 MNodes/s
Perft Midgame 5: 164075551 102ms 1598.63 MNodes/s
Perft Midgame 6: 6923051137 4255ms 1626.67 MNodes/s
OK

Perft Endgame 1: 38 0ms 19 MNodes/s
Perft Endgame 2: 1129 0ms 225.8 MNodes/s
Perft Endgame 3: 37035 0ms 787.979 MNodes/s
Perft Endgame 4: 1023977 0ms 1572.93 MNodes/s
Perft Endgame 5: 31265700 17ms 1764.23 MNodes/s
Perft Endgame 6: 849167880 515ms 1647.11 MNodes/s
OK

Perft aggregate: 18999768562 13817ms 1375.05 MNodes/s
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
R. Tomasi
Posts: 307
Joined: Wed Sep 01, 2021 4:08 pm
Location: Germany
Full name: Roland Tomasi

Re: Gigantua: 1.5 Giganodes per Second per Core move generator

Post by R. Tomasi »

That's indeed some impressive speed. The trouble I see when using that in an actual chess engine (opposed to using in a pure perft counter) is that you will need the actual move for things like TT storage, killers/history heuristics, etc.
pedrojdm2021
Posts: 157
Joined: Fri Apr 30, 2021 7:19 am
Full name: Pedro Duran

Re: Gigantua: 1.5 Giganodes per Second per Core move generator

Post by pedrojdm2021 »

Looking forward into this, that speed is insane, i have been doing experiments and tests to improve my engine speed, so it will be of great help to many of us :mrgreen: amazing job!
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Gigantua: 1.5 Giganodes per Second per Core move generator

Post by dangi12012 »

Thats the cool part - the actual move is there too. The solution is a template inlined "callback" function which recieves the from and to bitboards (which means it knows the move type and the current and next board) and not try to compress it into a list which gets enumerated a second time later.

That function which will compile away to almost nothing - can still store the move in a list if you want to (but dont need to!). The ordering can happen too!

I mean 1.5 Billion NPs means that on a 4.5ghz Zen3 that can do 4-16 instructions per clock means that the code needs to compile away to about 30-50 instructions on average per node Board position (queens rooks checks pins etc...) no matter how complicated the position is.

Just for fun I disabled everything: (which means a leaf node increments the count by ONE for each position and everything is really expanded)
No Bulk counting + No Hashtable + No multithreading + Collecting count stats per type of move (pawn, ep, castle, rook moves etc)
Perft Kiwi 6: 8031647685 10136ms 792.316 MNodes/s


Templates helped a lot. But I think the most speed improvements comes from solving Checkmasks and Pinmasks the optimal way.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
R. Tomasi
Posts: 307
Joined: Wed Sep 01, 2021 4:08 pm
Location: Germany
Full name: Roland Tomasi

Re: Gigantua: 1.5 Giganodes per Second per Core move generator

Post by R. Tomasi »

dangi12012 wrote: Wed Sep 22, 2021 5:14 pm Thats the cool part - the actual move is there too. The solution is a template inlined "callback" function which recieves the from and to bitboards (which means it knows the move type and the current and next board) and not try to compress it into a list which gets enumerated a second time later.
That might work then :) it will however still be necessary to compress the move into (typically) 16bits for TT storage. The compressed information should contain anything you would need to later make/unmake the move if you get a TT hit (like is it a capture? is it an en passant capture? is it a castling move? is it a promotion? promotion to what piece?). So I guess you will not get around having a few instructions for that. Also you will need a second routine (on top of the speedy incremental movegen routine) that will allow to make/unmake the moves from TT or killer tables.
tcusr
Posts: 323
Joined: Tue Aug 31, 2021 10:32 pm
Full name: tcusr

Re: Gigantua: 1.5 Giganodes per Second per Core move generator

Post by tcusr »

so you run a functor in your move generator and make the move on the fly?
how do you order the moves with these setup? the only way i can think of is by updating a move list though these functors
i'm looking forward to read your code, good job!
klx
Posts: 179
Joined: Tue Jun 15, 2021 8:11 pm
Full name: Emanuel Torres

Re: Gigantua: 1.5 Giganodes per Second per Core move generator

Post by klx »

Nice, this is around the speed I'd expect.
dangi12012 wrote: Wed Sep 22, 2021 12:19 pm Not looking into chessprogramming.org and other solutions before gave me the great opportunity to develop some original ideas
Exactly! That's how I operate too, and that's what real success looks like. You'll find that almost everybody is more or less copying Stockfish though.
dangi12012 wrote: Wed Sep 22, 2021 5:14 pm a 4.5ghz Zen3 that can do 4-16 instructions per clock
How do you get to 16 instructions per clock cycle? Are you referring to SIMD?
[Moderation warning] This signature violated the rule against commercial exhortations.
tcusr
Posts: 323
Joined: Tue Aug 31, 2021 10:32 pm
Full name: tcusr

Re: Gigantua: 1.5 Giganodes per Second per Core move generator

Post by tcusr »

klx wrote: Wed Sep 22, 2021 7:51 pm Exactly! That's how I operate too, and that's what real success looks like. You'll find that almost everybody is more or less copying Stockfish though.
you have never posted any of your enormous improvements
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Gigantua: 1.5 Giganodes per Second per Core move generator

Post by dangi12012 »

klx wrote: Wed Sep 22, 2021 7:51 pm How do you get to 16 instructions per clock cycle? Are you referring to SIMD?
Agner Fog Intstruction Tables
https://www.agner.org/optimize/instruction_tables.pdf

During writing of this I used this table extensively. In the 3rd column you have the "reciprocal throughput" listed which is 0.25 for most chess relevant instructions. Meaning one clock of 4.5 ghz can do 4 at once per Execution Unit. Also modern processors have more than one Execution Unit per thread.
Its also good to write code that does not depend on previous results and keep it as simple as possible so that the compiler can optimize too 👍
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
klx
Posts: 179
Joined: Tue Jun 15, 2021 8:11 pm
Full name: Emanuel Torres

Re: Gigantua: 1.5 Giganodes per Second per Core move generator

Post by klx »

dangi12012 wrote: Wed Sep 22, 2021 8:05 pm During writing of this I used this table extensively. In the 3rd column you have the "reciprocal throughput" listed which is 0.25 for most chess relevant instructions. Meaning one clock of 4.5 ghz can do 4 at once per Execution Unit. Also modern processors have more than one Execution Unit per thread.
Yeah I'm aware of that table. Can you give an example of a piece of code that executes 16 instructions per cycle?
[Moderation warning] This signature violated the rule against commercial exhortations.