Comparison of all known Sliding lookup algorithms [CUDA]

Discussion of chess software programming and technical issues.

Moderator: Ras

dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Post by dangi12012 »

Final Release Part 2
I was going through the PTX and comparing the pipeline bottlenecks of all algorithms. Bitrotation needs to rotate 3x per ray.
GeneticObstructionDifference (the AST-sifted improvement of Obstruction Difference) did not emit optimal PTX.

So I eliminated all locals and moved the shared memory into a struct. (This also answers the Array of Structs vs. Struct of Arrays question for these algorithms.)

It makes me happy to announce a new best overall algorithm:

Code: Select all

NVIDIA GeForce RTX 3080
Black Magic - Fixed shift:      6.53 GigaQueens/s
QBB Algo                 :      60.49 GigaQueens/s
Bob Lookup               :      58.17 GigaQueens/s
Kogge Stone              :      40.33 GigaQueens/s
Hyperbola Quiescence     :      16.91 GigaQueens/s
Switch Lookup            :      5.52 GigaQueens/s
Slide Arithm             :      87.93 GigaQueens/s
Pext Lookup              :      15.92 GigaQueens/s
SISSY Lookup             :      8.33 GigaQueens/s
Dumb 7 Fill              :      26.51 GigaQueens/s
Obstruction Difference   :      67.78 GigaQueens/s
Genetic Obstruction Diff :      121.13 GigaQueens/s
Leorik                   :      61.69 GigaQueens/s
SBAMG o^(o-3cbn)         :      71.36 GigaQueens/s
NO HEADACHE              :      30.62 GigaQueens/s
AVX Branchless Shift     :      29.45 GigaQueens/s
Slide Arithmetic Inline  :      71.04 GigaQueens/s
C++ Tree Sifter - 8 Rays :      88.21 GigaQueens/s
Bitrotation o^(o-2r)     :      113.19 GigaQueens/s
Genetic Obstruction Difference is now the overall fastest algorithm. This was a great journey, and the performance achieved is really something else.
I also applied the same optimization to Bitrotation to make everything as fast as possible, but 3x bitrotation is simply more work than 1x countlzero!

The code is quite pleasing, and the masks array is shared among many of the algorithms!

I also created a GitHub release:
https://github.com/Gigantua/Chess_Moveg ... e_2022.exe
Please share your results!

Last words: when looking up 121 billion queens per second, that is not the number of calculated squares. It is the number of queen positions. So the actual number of target squares calculated (relevant for actual perft or movegen) is the sum of set bits over each and every one of the 121 billion results per second.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Post by dangi12012 »

For Linux users: please share your results. I provided an update to compile under Linux / WSL2.

Handy two-liner:

Code: Select all

git clone https://github.com/Gigantua/Chess_Movegen_GPU.git && cd Chess_Movegen_GPU
make
Prerequisite: NVIDIA Pascal (10xx) or later
https://developer.nvidia.com/cuda-downloads
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
pferd
Posts: 134
Joined: Thu Jul 24, 2014 2:49 pm

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Post by pferd »

Two minor typos in the source code:

Code: Select all

git diff
diff --git a/kernel.cu b/kernel.cu
index 79d641e..4375aba 100644
--- a/kernel.cu
+++ b/kernel.cu
@@ -1,5 +1,5 @@
 
-#include "Cu_Common.h"
+#include "cu_Common.h"
 
 #include <numeric>
 #include <iostream>
@@ -27,7 +27,7 @@
 #include "cu_SlideArithmInline.h"
 #include "cu_Genetic8Ray.h"
 #include "cu_Bitrotation.h"
-#include "cu_foldingHash.h"
+#include "cu_FoldingHash.h"
 #include "kernel.h"
 
 /// <summary>
@@ -281,4 +281,4 @@ int main()
     TestChessprocessor<18>(blocks, threadsperblock);
     TestChessprocessor<19>(blocks, threadsperblock);
     TestChessprocessor<20>(blocks, threadsperblock);
-}
\ No newline at end of file
+}
Here are my results:

Code: Select all

make
nvcc -gencode arch=compute_52,code=sm_52 -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_61,code=sm_61 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_87,code=sm_87 --expt-relaxed-constexpr -std=c++17 --run --threads 8 -O3 kernel.cu -o movegen_gpu
NVIDIA GeForce GTX 1070
Black Magic - Fixed shift:      2.25 GigaQueens/s
QBB Algo                 :      21.73 GigaQueens/s
Bob Lookup               :      17.83 GigaQueens/s
Kogge Stone              :      14.71 GigaQueens/s
Hyperbola Quiescence     :      34.45 GigaQueens/s
Switch Lookup            :      1.16 GigaQueens/s
Slide Arithm             :      35.02 GigaQueens/s
Pext Lookup              :      7.35 GigaQueens/s
SISSY Lookup             :      3.66 GigaQueens/s
Dumb 7 Fill              :      9.81 GigaQueens/s
Obstruction Difference   :      23.07 GigaQueens/s
Genetic Obstruction Diff :      32.55 GigaQueens/s
Leorik                   :      22.95 GigaQueens/s
SBAMG o^(o-3cbn)         :      24.41 GigaQueens/s
NO HEADACHE              :      7.85 GigaQueens/s
AVX Branchless Shift     :      11.08 GigaQueens/s
Slide Arithmetic Inline  :      25.78 GigaQueens/s
C++ Tree Sifter - 8 Rays :      30.21 GigaQueens/s
Bitrotation o^(o-2r)     :      43.87 GigaQueens/s
FoldingHash (uncomplete) :      16.65 GigaQueens/s
make  43,77s user 3,36s system 185% cpu 25,383 total
ColonelPhantom
Posts: 6
Joined: Fri Mar 12, 2021 3:48 pm
Full name: Quinten Kock

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Post by ColonelPhantom »

I ported the code to AMD's HIP framework so that it runs on AMD Radeon GPUs using ROCm.

ROCm's portability isn't too great (in terms of e.g. legacy GPU support, and iirc it also requires some host-side features like PCIe atomics), but HIP luckily also works on CUDA (so a HIP codebase runs on both ROCm and CUDA, and shouldn't lose much if any performance over a real CUDA implementation).

AMD even provides an automated translation tool for CUDA->HIP, and I only needed to manually fix up some things :D

(Do note, I did not check in any way that the translated code is correct, other than that it compiles. I give zero guarantees.)

I also ran the code on my own GPU (an RX 580 4GB):

Code: Select all

AMD Radeon RX 580 Series
Black Magic - Fixed shift:      5.38 GigaQueens/s
QBB Algo                 :      13.01 GigaQueens/s
Bob Lookup               :      21.90 GigaQueens/s
Kogge Stone              :      6.22 GigaQueens/s
Hyperbola Quiescence     :      23.40 GigaQueens/s
Switch Lookup            :      0.42 GigaQueens/s
Slide Arithm             :      22.48 GigaQueens/s
Pext Lookup              :      4.28 GigaQueens/s
SISSY Lookup             :      1.42 GigaQueens/s
Dumb 7 Fill              :      4.05 GigaQueens/s
Obstruction Difference   :      11.82 GigaQueens/s
Genetic Obstruction Diff :      23.64 GigaQueens/s
Leorik                   :      14.86 GigaQueens/s
SBAMG o^(o-3cbn)         :      14.33 GigaQueens/s
NO HEADACHE              :      3.54 GigaQueens/s
AVX Branchless Shift     :      4.55 GigaQueens/s
Slide Arithmetic Inline  :      15.95 GigaQueens/s
C++ Tree Sifter - 8 Rays :      18.20 GigaQueens/s
Bitrotation o^(o-2r)     :      29.29 GigaQueens/s
FoldingHash (uncomplete) :      17.44 GigaQueens/s
It's interesting to see that some algorithms are (relatively) better on AMD hardware (black magic for example), and some on NVIDIA hardware. It does seem Bitrotation o^(o-2r) is universally quite strong, being the best on my RX 580 and also on pferd's GTX 1070. On dangi's RTX 3080, Genetic Obstruction Diff is slightly faster but bitrotation is still close.

If you want to run it yourself, note that you need a HIP runtime and also hipRAND (which wraps either cuRAND or rocRAND). Here are the commands I use to compile and run on my AMD setup:

Code: Select all

hipcc -I/opt/rocm/hiprand/include --std=c++17 hip/kernel.cu -o movegen_hip
./movegen_hip
Test results from RDNA or perhaps even CDNA would be especially interesting!
ColonelPhantom
Posts: 6
Joined: Fri Mar 12, 2021 3:48 pm
Full name: Quinten Kock

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Post by ColonelPhantom »

Oops, I forgot to include a link to my GitHub repository, it is at https://github.com/ColonelPhantom/Chess_Movegen_GPU

I also could not find an Edit button, so apologies for the double post.
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Post by dangi12012 »

If anyone gets their hands on an RTX 4080 or above, please contact me.

The AMD post above will get merged into my repo in time (I got the author's approval).

I have some new algorithms that are not ready yet, but they should still be quite a step forward.
Going from below one billion queens/s (which is rook + bishop) to over 100 billion was quite the fun journey; let's see what Ada Lovelace can do in this domain.

Example for binary matrix multiplication:
https://github.com/NVIDIA/cutlass/blob/ ... 32_sm80.cu
Which is essentially what makes tensor cores so useful here: you can expand from a native 8x8 uint1b_t bitboard into int or half by matrix multiplication, or stay with uint1b_t and have it rotated, mirrored, weighted, or transformed by any other matrix operation you can think of.
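To make the matrix-operation point concrete, here is a plain host C++ sketch of an 8x8 boolean bitboard product (the operation a uint1b_t tensor-core path computes in hardware; the row-major layout and names are my assumptions, not CUTLASS's API):

```cpp
#include <cassert>
#include <cstdint>

// Boolean 8x8 matrix product of two bitboards; bit (r,c) lives at r*8 + c.
uint64_t bmm8x8(uint64_t a, uint64_t b) {
    uint64_t out = 0;
    for (int r = 0; r < 8; ++r) {
        uint64_t rowA = (a >> (r * 8)) & 0xFF;   // row r of a as an 8-bit mask
        for (int c = 0; c < 8; ++c) {
            uint64_t colB = 0;                   // column c of b gathered into 8 bits
            for (int k = 0; k < 8; ++k)
                colB |= ((b >> (k * 8 + c)) & 1ull) << k;
            if (rowA & colB)                     // boolean dot product: any overlap
                out |= 1ull << (r * 8 + c);
        }
    }
    return out;
}
```

Multiplying on the right by the anti-diagonal permutation matrix (bit `k*8 + (7-k)` set for each k), for example, mirrors the board's files, illustrating the rotate/mirror point above.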
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Post by dangi12012 »

New breakthrough:

Through better hardware understanding I could optimize Bitrotation further. Bitrotation already becomes a unified solution for both ray directions when true bit-reverse hardware instructions are available.

Code: Select all

Bitrotation: 4.6437G/s (32 Threads)
cu_Bitrotation: 215.806G/s
For context: this is an RTX 3080; the CPU is a 5950X, which reaches 13 billion lookups with optimized pext.
Results are verified against reference move generators with 10 million positions from actual games.
This new code finds all moves for all rooks, queens and bishops in under 1 ms for all 10 million positions.
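For readers following along, a host-side C++ sketch of the o^(o-2r) idea behind Bitrotation (on the GPU the bit reversal is the single __brevll instruction; this portable loop and the function names are my own stand-ins):

```cpp
#include <cassert>
#include <cstdint>

// Portable stand-in for CUDA's __brevll (a single instruction on the GPU).
uint64_t bit_reverse(uint64_t b) {
    uint64_t r = 0;
    for (int i = 0; i < 64; ++i)
        r |= ((b >> i) & 1ull) << (63 - i);
    return r;
}

// One line of the o^(o-2r) trick: mask is the line through sq, excluding sq.
uint64_t ray_attacks(uint64_t occ, uint64_t mask, int sq) {
    uint64_t o = occ & mask;
    uint64_t r = 1ull << sq;
    uint64_t forward = o - 2 * r;              // positive direction
    uint64_t reverse =                         // negative direction: the "3x rotate"
        bit_reverse(bit_reverse(o) - 2 * bit_reverse(r));
    return (forward ^ reverse) & mask;
}
```

The three bit_reverse calls per line are exactly the "rotate 3x per ray" cost mentioned earlier; with __brevll each of them is one instruction.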
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
krunch
Posts: 10
Joined: Sat Sep 18, 2021 9:36 pm
Full name: Tony Schwebs

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Post by krunch »

And numbers for an RTX 4090

Code: Select all

NVIDIA GeForce RTX 4090
Black Magic - Fixed shift:      15.75 GigaQueens/s
QBB Algo                 :      152.80 GigaQueens/s
Bob Lookup               :      187.73 GigaQueens/s
Kogge Stone              :      100.24 GigaQueens/s
Hyperbola Quiescence     :      42.12 GigaQueens/s
Switch Lookup            :      13.60 GigaQueens/s
Slide Arithm             :      222.92 GigaQueens/s
Pext Lookup              :      38.83 GigaQueens/s
SISSY Lookup             :      9.44 GigaQueens/s
Dumb 7 Fill              :      65.31 GigaQueens/s
Obstruction Difference   :      171.22 GigaQueens/s
Genetic Obstruction Diff :      291.15 GigaQueens/s
Leorik                   :      157.57 GigaQueens/s
SBAMG o^(o-3cbn)         :      175.14 GigaQueens/s
NO HEADACHE              :      75.57 GigaQueens/s
AVX Branchless Shift     :      72.70 GigaQueens/s
Slide Arithmetic Inline  :      173.02 GigaQueens/s
C++ Tree Sifter - 8 Rays :      210.50 GigaQueens/s
Bitrotation o^(o-2r)     :      299.20 GigaQueens/s
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Post by dangi12012 »

In this thread we mentioned a new sifted solution for positive and negative rays. The cute thing about this would be a board structure where we maintain both occ and bit_reversed(occ), applying sq ^ 63 to all move squares as well. This gives rise to the possibility of using Bitrotation without ever rotating bits.
https://www.talkchess.com/forum3/viewto ... =7&t=82547

New algo added: Bitray 2023 version

Code: Select all

NVIDIA GeForce RTX 3080
Black Magic - Fixed shift:      40.88 GigaQueens/s
QBB Algo                 :      58.07 GigaQueens/s
Bob Lookup               :      71.90 GigaQueens/s
Kogge Stone              :      39.93 GigaQueens/s
Hyperbola Quiescence     :      99.89 GigaQueens/s
Switch Lookup            :      5.41 GigaQueens/s
Slide Arithm             :      84.16 GigaQueens/s
Pext Lookup              :      15.75 GigaQueens/s
SISSY Lookup             :      8.08 GigaQueens/s
Dumb 7 Fill              :      26.11 GigaQueens/s
Obstruction Difference   :      66.50 GigaQueens/s
Genetic Obstruction Diff :      98.69 GigaQueens/s
Leorik                   :      59.21 GigaQueens/s
SBAMG o^(o-3cbn)         :      69.46 GigaQueens/s
NO HEADACHE              :      30.53 GigaQueens/s
AVX Branchless Shift     :      29.43 GigaQueens/s
Slide Arithmetic Inline  :      68.82 GigaQueens/s
C++ Tree Sifter - 8 Rays :      81.49 GigaQueens/s
Bitrotation o^(o-2r)     :      134.14 GigaQueens/s
FoldingHash (uncomplete) :      68.51 GigaQueens/s
Bitray 2023 version      :      110.84 GigaQueens/s

Code: Select all

#define bitswap(X) __brevll(X)  // 64-bit bit reversal is a single instruction on CUDA
// occ/mask: occupancy and ray mask in normal orientation
// OCC/MASK: occupancy and ray mask in bit-reversed orientation
BitFunction ray(uint64_t occ, uint64_t mask, uint64_t OCC, uint64_t MASK) {
	uint64_t o = occ & mask;
	uint64_t O = OCC & MASK;

	// o ^ (o - 1) sets every bit up to and including the lowest blocker;
	// the reversed half handles the opposite ray direction
	return ((o ^ (o - 1ull)) & mask) | bitswap((O ^ (O - 1ull)) & MASK);
}
We don't use the maintained-reversed-occupancy improvement here, because on CUDA bit reversal is a single instruction.
The interesting part, and the true never-seen-before novelty of this algorithm, is this:
We can enumerate the left-hand side of the return value normally via a bitloop.
We can also enumerate the right-hand side of the return value WITHOUT a bitswap, by normalizing each target square with sq = sq' ^ 63.

110.84 GigaQueens/s is a proven big improvement over Hyperbola, even in this slower form. Hyperbola, as a reminder, does not have this property: its middle operation is a XOR, and its left- and right-hand sides are not disjoint.

These things become important because NNUE evaluation is, to my own surprise, a natural fit for GPUs: faster memory, forward propagation fits in shared memory, we can solve many thousands of accumulators at once, and feature activation is insanely fast. But there is a deeper layer to be explored with the CUDA instructions in section 7.24; more on that in the thread where I will release the 60% faster CPU inference, and hopefully surprise with the performance of GPU inference. So no discussion of that here, please; this thread is for GPU movegen only.

Check out bitray here!
https://github.com/Gigantua/Chess_Moveg ... u_Bitray.h

Next up, when I find the time: KGSSB by our friend Mike. KGSSB will not fit in __shared__ memory, which is the preferred memory location for chess. I found a workaround using the __ldg() intrinsic, which routes a global memory load through the read-only L1/texture cache; that is as fast as shared memory and may fit KGSSB perfectly.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
ColonelPhantom
Posts: 6
Joined: Fri Mar 12, 2021 3:48 pm
Full name: Quinten Kock

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Post by ColonelPhantom »

I found out about chipStar, which allows running CUDA/HIP code on Intel GPUs. Out of curiosity, I decided to try running it on my laptop and test this thing out. I've got pretty old and low-end integrated graphics, so I wasn't expecting much.

After removing a few calls to specific features, as well as cuRAND (returning 0 instead of an actually random number :( ), I got it to compile and run. I also reduced the number of blocks to 256 because otherwise it was simply taking too long to compute.

https://github.com/ColonelPhantom/Chess ... e/chipstar

Code: Select all

Intel(R) UHD Graphics 620
Black Magic - Fixed shift:      0.50 GigaQueens/s
Bob Lookup               :      0.58 GigaQueens/s
Kogge Stone              :      0.31 GigaQueens/s
Hyperbola Quiescence     :      0.35 GigaQueens/s
Switch Lookup            :      0.09 GigaQueens/s
Slide Arithm             :      0.72 GigaQueens/s
Pext Lookup              :      0.28 GigaQueens/s
SISSY Lookup             :      0.18 GigaQueens/s
Dumb 7 Fill              :      0.22 GigaQueens/s
Obstruction Difference   :      0.51 GigaQueens/s
Genetic Obstruction Diff :      0.73 GigaQueens/s
Leorik                   :      0.62 GigaQueens/s
SBAMG o^(o-3cbn)         :      0.42 GigaQueens/s
NO HEADACHE              :      0.38 GigaQueens/s
AVX Branchless Shift     :      0.23 GigaQueens/s
Slide Arithmetic Inline  :      0.62 GigaQueens/s
C++ Tree Sifter - 8 Rays :      0.38 GigaQueens/s
Bitrotation o^(o-2r)     :      0.59 GigaQueens/s
FoldingHash (uncomplete) :      0.25 GigaQueens/s
Bitray 2023 version      :      0.44 GigaQueens/s
Those results are absolutely pathetic, but keep in mind this is a sub-TFLOP GPU that is multiple generations behind current Xe/Arc GPUs.

In addition, chipStar is only at version 1.0 (the version I used) and is known not to be very well optimized yet in general, so newer versions should hopefully make at least some of these results faster.

chipSTAR might also be nice on other platforms, as it should work on any OpenCL platform that supports SPIR-V ingestion, as far as I know.