Comparison of all known Sliding lookup algorithms [CUDA]

dangi12012 · Post by **dangi12012** » Sat Jan 08, 2022 8:11 pm

CUDA Performance Release

Now I took this Saturday to translate all algos we talked about in the CPU thread to CUDA. Now I am very interested in the performance you get.
If you read this - run this cuda executable and post results here: https://github.com/Gigantua/Chess_Moveg ... ompare.exe
Nvidia only for now.

Results:

Code: Select all

NVIDIA GeForce RTX 3080
Fancy Magic:    6.82 GigaQueens/s
QBB Algo:       58.02 GigaQueens/s
Bob Lookup:     0.77 GigaQueens/s
Kogge Stone:    40.43 GigaQueens/s
Hyperbola Qsc:  17.43 GigaQueens/s
Switch Lookup:  4.40 GigaQueens/s
Slide Arithm:   18.38 GigaQueens/s
Pext Lookup:    16.85 GigaQueens/s
SISSY Lookup:   8.08 GigaQueens/s
Hypercube Alg:  1.31 GigaQueens/s

I am very suprised how some algos performend. Seems to me that cuda really does not care about dependency chains (makes sense because it doesnt do uOp fusing like modern x64) so Kogge Stone and QBB really are good performers here.

Now everyone here can see that switching hardware from the usual x64 to some other architecture really changes the usual performance metrics we are used to. PEXT emulated is faster than fancy magic! This is something i did not expect.

The general performance is much much higher than the maximum on 32 Threads with a Ryzen 3cpu which was 10 Gigalookups/s.
I am looking forward to some discussions. All memory accessing algorithms seem to be much slower than the zero table lookups we have.
Sourcecode: https://github.com/Gigantua/Chess_Movegen

Winners: QBB, Kogge Stone
Losers: Fancy Magic, Bob Lookup

CPU discussion: http://www.talkchess.com/forum3/posting ... 7&p=917782

lithander · Post by **lithander** » Sat Jan 08, 2022 10:50 pm

Interesting! My results:

Code: Select all

NVIDIA GeForce RTX 2070
FancyHash:      3.79 GigaQueens/s
QBB Algo:       31.65 GigaQueens/s
Bob Lookup:     0.44 GigaQueens/s
Kogge Stone:    20.78 GigaQueens/s
Hyperbola Qsc:  9.05 GigaQueens/s
Switch Lookup:  0.18 GigaQueens/s
Slide Arithm:   9.48 GigaQueens/s
Pext Lookup:    8.41 GigaQueens/s
SISSY Lookup:   4.04 GigaQueens/s
Hypercube Alg:  0.61 GigaQueens/s

dangi12012 · Post by **dangi12012** » Sun Jan 16, 2022 1:44 am

Obstruction difference and NO HEADACHE coming soon!
Also I see that ankan-bans perft_gpu is using magic - which is by far not a good gpu algorithm. (You only find stuff like this with a seperate benchmarking software like this thread)

If someone wants to add an algo for CUDA please let me know!
I am esp. interested in a vectorized version of Kogge-Stone (which apperently should exist somewhere)

smatovic · Post by **smatovic** » Sun Jan 16, 2022 5:59 am

dangi12012 wrote: ↑Sun Jan 16, 2022 1:44 am Obstruction difference and NO HEADACHE coming soon!
Also I see that ankan-bans perft_gpu is using magic - which is by far not a good gpu algorithm. (You only find stuff like this with a seperate benchmarking software like this thread)

If someone wants to add an algo for CUDA please let me know!
I am esp. interested in a vectorized version of Kogge-Stone (which apperently should exist somewhere)

not sure if you mean something like this:

https://zeta-chess.app26.de/post/64-bit ... tor-based/

note:

https://zeta-chess.app26.de/post/three- ... sm-on-gpu/

https://zeta-chess.app26.de/post/how-co ... n-on-gpus/

https://zeta-chess.app26.de/post/why-co ... n-on-gpus/

Perft on gpu is one thing, a chess playing engine another.

What about Dumb7Fill?

--
Srdja

dangi12012 · Post by **dangi12012** » Sun Jan 16, 2022 2:35 pm

smatovic wrote: ↑Sun Jan 16, 2022 5:59 am https://zeta-chess.app26.de/post/how-co ... n-on-gpus/

What about Dumb7Fill?

--
Srdja

I would actually like to talk about that link because the most optimized chess gpgu option is not listed there.

you have the definitions for Rotateleft? This could actually be very fast on the gpu.
The wiki gets this wrong sometimes - here I would move dir8 to a template - and have no branch inside rotateleft with if constexpr. Would be the same as writing all 8 rays by hand - but with 1/8 of the amount of C++ code.

Code: Select all

U64 slidingAttacks (U64 sliders, U64 empty, int dir8) {
   U64 flood = sliders;
   int r = shift[dir8]; // {+-1,7,8,9}
   empty &= avoidWrap[dir8];
   flood |= sliders = rotateLeft(sliders , r) & empty;
   flood |= sliders = rotateLeft(sliders , r) & empty;
   flood |= sliders = rotateLeft(sliders , r) & empty;
   flood |= sliders = rotateLeft(sliders , r) & empty;
   flood |= sliders = rotateLeft(sliders , r) & empty;
   flood |=         = rotateLeft(sliders , r) & empty;
   return   rotateLeft(flood, r)  &   avoidWrap[dir8];
}

Then dirshift and dirmask is a consteval (meaning compiletime constant) and not the array lookup that people seem to use so much.

tcusr · Post by **tcusr** » Sun Jan 16, 2022 5:53 pm

Rotateleft is std::rotl in C++20

smatovic · Post by **smatovic** » Sun Jan 16, 2022 6:19 pm

dangi12012 wrote: ↑Sun Jan 16, 2022 2:35 pm ...
you have the definitions for Rotateleft? This could actually be very fast on the gpu.
...

Back then there were no Rotateleft instructions on gpu arch present, idk about current ones, hence I implemented them by myself:

http://www.talkchess.com/forum3/viewtop ... 74#p504208

Code: Select all

            // do kogge stone
            gen8 |= pro8 & ((gen8 << r8) | (gen8 >> (64-r8)));
            pro8 &=       ((pro8 << r8) | (pro8 >> (64-r8)));
            gen8 |= pro8 & ((gen8 << 2*r8) | (gen8 >> (64-2*r8)));
            pro8 &=       ((pro8 << 2*r8) | (pro8 >> (64-2*r8)));
            gen8 |= pro8 & ((gen8 << 4*r8) | (gen8 >> (64-4*r8)));

            // Shift one for Captures
            gen8 = ((gen8 << r8) | (gen8 >> (64-r8))) & avoidWrap8[piece];

--
Srdja

tcusr · Post by **tcusr** » Sun Jan 16, 2022 6:27 pm

tcusr wrote: ↑Sun Jan 16, 2022 5:53 pm Rotateleft is std::rotl in C++20

sorry, i forgot you're working with vectors

dangi12012 · Post by **dangi12012** » Sat Mar 05, 2022 8:17 pm

Comparison of GPU Sliding Lookup Algorithms Github Release!

Github:
https://github.com/Gigantua/Chess_Movegen_GPU

So the top performance is around 60 Billion Lookups/s which is around 5x faster than a 16 Core 5950X CPU can manage on all threads concurrently with optimized PEXT. That result is really great!

Lookup algorithms generally perform really bad compared to "No Lookup versions" (7 Billion Lookups/s). Thats why I was so keen on developing 0kb versions of all known algorithms wherever possible - to get good performance on totally different hardware. (this thread: https://www.talkchess.com/forum3/viewto ... =7&t=79005) Seems my work payed off and porting most of these algos was a breeze!

The code is not "production ready" in any way - correctness verification and different performance metrics need to be calculated. This is Visual Studio 2019 for now - and make sure to have Cuda SDK installed. Makefile and linux/clang support coming soon.

Please give some feedback for the code - and maybe even send me a PR if you discover something -

Output:

Code: Select all

NVIDIA GeForce RTX 3080
Black Magic - Fixed shift:      6.79 GigaQueens/s
QBB Algo                 :      55.65 GigaQueens/s
Bob Lookup               :      1.63 GigaQueens/s
Kogge Stone              :      39.95 GigaQueens/s
Hyperbola Quiescence     :      17.33 GigaQueens/s
Switch Lookup            :      4.33 GigaQueens/s
Slide Arithm             :      18.28 GigaQueens/s
Pext Lookup              :      16.79 GigaQueens/s
SISSY Lookup             :      8.06 GigaQueens/s
Hypercube Alg            :      1.31 GigaQueens/s
Dumb 7 Fill              :      24.43 GigaQueens/s
Obstruction Difference   :      45.28 GigaQueens/s
Leorik                   :      59.06 GigaQueens/s
SBAMG o^(o-3cbn)         :      62.87 GigaQueens/s
NO HEADACHE              :      29.37 GigaQueens/s
AVX Branchless Shift     :      28.96 GigaQueens/s
Slide Arithmetic Inline  :      62.31 GigaQueens/s

Now:
I need help from the really smart people here - probably someone with a background in rotated bitboards.
So probably one of the best thing about CUDA is that there are other instrinsics available compared to a x64 cpu. https://docs.nvidia.com/cuda/cuda-math- ... cf536974cd
For instance we have a "Reverse the bit (not byte) order of a 64-bit unsigned integer" instruction.
So any algorithm that works on "positive rays only" could possibly outperform any and all algorithms above.

Do you see any possibilities for that instruction?

smatovic · Post by **smatovic** » Sun Mar 06, 2022 7:28 am

dangi12012 wrote: ↑Sat Mar 05, 2022 8:17 pm [...]
So the top performance is around 60 Billion Lookups/s...
[...]
NVIDIA GeForce RTX 3080
[...]

May I ask how many threads you run concurrently on the device, multiple waves of Warps/SIMT? Lookups/thread? The RTX 3080 has 8704 Cuda-cores according to Wikipedia. What does the profiler say about the GPU utilization?

--
Srdja

Comparison of all known Sliding lookup algorithms [CUDA]

Comparison of all known Sliding lookup algorithms [CUDA]

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Re: Comparison of all known Sliding lookup algorithms [CUDA]