Comparison of all known Sliding lookup algorithms [CUDA]

Discussion of chess software programming and technical issues.

Moderator: Ras

smatovic
Posts: 3189
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Post by smatovic »

smatovic wrote: Tue Apr 19, 2022 5:18 pm
dangi12012 wrote: Tue Apr 19, 2022 2:53 pm [...]
What made it much faster:
Constant memory is really slow - don't use it. It serves a different purpose and is not applicable to chess programming.
Shared memory. This is what should be used to replace general ray lookups. I could improve Bob Lookup from 1.6B to 52.79B just from this change alone. Shared memory is limited to around 48 KB - so Black Magic - Fixed Shift and Pext Lookup really do not belong on GPUs.
[...]
2 cents: constant memory is about 64 KB but only 8 to 16 KB are cached; latency is higher than registers but should be in the range of shared/scratch-pad memory... depends on the concrete architecture ofc... but yes, on GPU, computation over lookup, or the like.

--
Srdja
Followup note: the constant cache might be serialized on Nvidia architectures when different threads of a warp access different addresses; AMD differs here. But in real-world practice, lookup tables in chess will also produce differing addresses with scratch-pad memory, which likewise results in serialized access because the accesses do not coalesce into distinct memory banks (64 squares 8-byte aligned vs. 32 threads 4-byte aligned)... depends on the real-world use case. Anyway, in most cases constant cache or scratch-pad memory should be faster than global memory (VRAM) access.

--
Srdja
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Post by dangi12012 »

Mike Sherwin wrote: Wed Apr 20, 2022 2:41 am tcusr reminded me of a technique that I once used. Using that technique finally allowed me to get rid of the stop bit in Classical Bob-Mike.

Code: Select all

struct Rays {
  u64 rayNW;
  u64 rayNN;
  u64 rayNE;
  u64 rayEE;
  u64 raySE;
  u64 raySS;
  u64 raySW;
  u64 rayWW;
};

Rays rays[66];
Rays* ray = rays + 1;

// Each lookup returns the part of the ray BEYOND the first blocker.
// countl_zero of an empty masked ray is 64, so 63 - 64 = -1 and the
// extra slot ray[-1] (rays[0]) is all zero; likewise countr_zero of an
// empty ray is 64 and ray[64] is zero - no blocker tests needed.
// fs = from-square, occ = occupancy, qob[fs] = queen rays on an empty board.

u64 rayNW = ray[std::countr_zero(ray[fs].rayNW & occ)].rayNW;
u64 rayNE = ray[std::countr_zero(ray[fs].rayNE & occ)].rayNE;
u64 rayNN = ray[std::countr_zero(ray[fs].rayNN & occ)].rayNN;
u64 rayEE = ray[std::countr_zero(ray[fs].rayEE & occ)].rayEE;
u64 raySE = ray[63 - std::countl_zero(ray[fs].raySE & occ)].raySE;
u64 raySW = ray[63 - std::countl_zero(ray[fs].raySW & occ)].raySW;
u64 raySS = ray[63 - std::countl_zero(ray[fs].raySS & occ)].raySS;
u64 rayWW = ray[63 - std::countl_zero(ray[fs].rayWW & occ)].rayWW;
u64 bb = qob[fs] ^ (rayNN | rayEE | raySS | rayWW | rayNW | rayNE | raySE | raySW);

This should be noticeably faster. 
This looks good - I will definitely take a look at it also!
Some quick ideas for that algo:
- Maybe there is even a way to get rid of the 63 - xxx by having a second reversed Rays array.
- Maybe we don't even need countr_zero, just a fast instruction that perfectly hashes the value of ray[fs].rayNW & occ into 0..63.
- Maybe it's worth taking a look if this can be one 4x bigger lookup array with only 4 rays instead of 8 (with a separate solution lookup).
I promised myself a 1-week break - so I will continue research into all new algorithms and improvements then. What I wanted to achieve is mostly done.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
Mike Sherwin
Posts: 965
Joined: Fri Aug 21, 2020 1:25 am
Location: Planet Earth, Sol system
Full name: Michael J Sherwin

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Post by Mike Sherwin »

dangi12012 wrote: Wed Apr 20, 2022 2:20 pm
Mike Sherwin wrote: Wed Apr 20, 2022 2:41 am tcusr reminded me of a technique that I once used. Using that technique finally allowed me to get rid of the stop bit in Classical Bob-Mike.

Code: Select all

struct Rays {
  u64 rayNW;
  u64 rayNN;
  u64 rayNE;
  u64 rayEE;
  u64 raySE;
  u64 raySS;
  u64 raySW;
  u64 rayWW;
};

Rays rays[66];
Rays* ray = rays + 1;

// Each lookup returns the part of the ray BEYOND the first blocker.
// countl_zero of an empty masked ray is 64, so 63 - 64 = -1 and the
// extra slot ray[-1] (rays[0]) is all zero; likewise countr_zero of an
// empty ray is 64 and ray[64] is zero - no blocker tests needed.
// fs = from-square, occ = occupancy, qob[fs] = queen rays on an empty board.

u64 rayNW = ray[std::countr_zero(ray[fs].rayNW & occ)].rayNW;
u64 rayNE = ray[std::countr_zero(ray[fs].rayNE & occ)].rayNE;
u64 rayNN = ray[std::countr_zero(ray[fs].rayNN & occ)].rayNN;
u64 rayEE = ray[std::countr_zero(ray[fs].rayEE & occ)].rayEE;
u64 raySE = ray[63 - std::countl_zero(ray[fs].raySE & occ)].raySE;
u64 raySW = ray[63 - std::countl_zero(ray[fs].raySW & occ)].raySW;
u64 raySS = ray[63 - std::countl_zero(ray[fs].raySS & occ)].raySS;
u64 rayWW = ray[63 - std::countl_zero(ray[fs].rayWW & occ)].rayWW;
u64 bb = qob[fs] ^ (rayNN | rayEE | raySS | rayWW | rayNW | rayNE | raySE | raySW);

This should be noticeably faster. 
This looks good - I will definitely take a look at it also!
Some quick ideas for that algo:
- Maybe there is even a way to get rid of the 63 - xxx by having a second reversed Rays array.
- Maybe we don't even need countr_zero, just a fast instruction that perfectly hashes the value of ray[fs].rayNW & occ into 0..63.
- Maybe it's worth taking a look if this can be one 4x bigger lookup array with only 4 rays instead of 8 (with a separate solution lookup).
I promised myself a 1-week break - so I will continue research into all new algorithms and improvements then. What I wanted to achieve is mostly done.
"- Maybe there is even a way to get rid of the 63 - xxx by having a second reversed Rays array."
Yes, thank you for reminding me. I already posted that possibility but I forgot because that is what I do best. I forget.

"- Maybe we don't even need countr_zero, just a fast instruction that perfectly hashes the value of ray[fs].rayNW & occ into 0..63."
Sounds like a job for PEXT but eight times using far smaller tables. Does not sound good.

"- Maybe it's worth taking a look if this can be one 4x bigger lookup array with only 4 rays instead of 8 (with a separate solution lookup)."
I wouldn't even know where to start.

"I promised myself a 1-week break - so I will continue research into all new algorithms and improvements then."
I take breaks all the time even when I do not want to. It is just that my brain quits working from time to time.

" What I wanted to achieve is mostly done."
I'm nowhere close to being done. I hope in my next life I won't be born with a memory disability and yet be able to keep the abilities that I do possess. Then I might be really really dangerous.

I wish someone would try my idea for real time learning before I die. It would be very simple for someone without memory impediments blocking their way. You saw how long it took me to optimize one line of pawn code. Most of the code already exists in a chess engine. Just take a portion of the search time and play as many shallow searched games (or partial games) as possible. This will get accumulated info from much farther away than the normal search horizon. Then the moves of the games are placed in the TT with adjusted scores using RL. And this will influence what the normal search calculates to be best. And it should result in better move ordering during normal search. There are many things lurking beyond the horizon for RT learning to discover.
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Post by dangi12012 »

Mike Sherwin wrote: Wed Apr 20, 2022 8:13 pm I wish someone would try my idea for real time learning before I die. It would be very simple for someone without memory impediments blocking their way. You saw how long it took me to optimize one line of pawn code. Most of the code already exists in a chess engine. Just take a portion of the search time and play as many shallow searched games (or partial games) as possible. This will get accumulated info from much farther away than the normal search horizon. Then the moves of the games are placed in the TT with adjusted scores using RL. And this will influence what the normal search calculates to be best. And it should result in better move ordering during normal search. There are many things lurking beyond the horizon for RT learning to discover.
If you describe it further in a separate thread we can discuss it!

This algo improvement is for you - I will repost it in the appropriate non-CUDA thread.
I could halve the needed memory and increase performance by around 20%.

Code: Select all

Classical Bob-Mike                 315.235323                    1024    [8kb]       countr_zero, countl_zero yes       Robert Hyatt and Michael Sherwin             https://www.chessprogramming.org/Classical_Approach
Advanced Bob-Mike                  375.742091                    520     [4kb]       countr_zero, countl_zero no        Michael Sherwin and Daniel Inführ            http://www.talkchess.com/forum3/viewtopic.php?f=7&t=79078&start=50#p924653
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Post by dangi12012 »

CUDA Final Release

My work with sliding piece algorithms is done, and with further research this is the final result (RTX 3080):

Code: Select all

Black Magic - Fixed shift:      6.55 GigaQueens/s
QBB Algo                 :      60.21 GigaQueens/s
Bob Lookup               :      58.23 GigaQueens/s
Kogge Stone              :      40.41 GigaQueens/s
Hyperbola Quiescence     :      16.92 GigaQueens/s
Switch Lookup            :      5.53 GigaQueens/s
Slide Arithm             :      86.92 GigaQueens/s
Pext Lookup              :      16.12 GigaQueens/s
SISSY Lookup             :      8.37 GigaQueens/s
Dumb 7 Fill              :      26.55 GigaQueens/s
Obstruction Difference   :      67.33 GigaQueens/s
Genetic Obstruction Diff :      101.77 GigaQueens/s
Leorik                   :      62.14 GigaQueens/s
SBAMG o^(o-3cbn)         :      66.79 GigaQueens/s
NO HEADACHE              :      30.93 GigaQueens/s
AVX Branchless Shift     :      29.32 GigaQueens/s
Slide Arithmetic Inline  :      70.60 GigaQueens/s
C++ Tree Sifter - 8 Rays :      86.03 GigaQueens/s
Bitrotation o^(o-2r)     :      115.40 GigaQueens/s
Many of these optimisations were non-trivial; for example, the algorithm "Genetic Obstruction Difference" was a new discovery I created with my sifting tool. Please feel free to compare with the results on page 1 to see how far this has come!

With more time spent reading all the CUDA docs, my understanding of the hardware came a long way, so I could tweak all parts so that everything is at its optimum. (NSight reports 102% hardware occupancy.)
[image: NSight occupancy tuning graphs]

So it's not a coincidence that the blue dots happen to be at the optimum for all 3 graphs.
https://github.com/Gigantua/Chess_Movegen_GPU

There is a huge mismatch between the potential of this technology for chess programming and its actual usage (because the learning curve is steep, and without hardware understanding you get nowhere). The only other person I know of who works on this is the author of ZetaChess. I work in the 3D industry, and still I needed to learn many new tricks to achieve these new levels of performance.

I might look into all the possible tree search algorithms next and compare them the same way. I have the feeling that is the right way to make progress: slow like a steamroller, but exploring every branch and rabbit hole that presents itself until there are none left, improving algorithms to their own extreme while keeping the core idea intact. That way, at the end, you can say with 100% confidence:
E pluribus unum
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
Mike Sherwin
Posts: 965
Joined: Fri Aug 21, 2020 1:25 am
Location: Planet Earth, Sol system
Full name: Michael J Sherwin

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Post by Mike Sherwin »

dangi12012 wrote: Sat Apr 23, 2022 1:11 pm This algo improvement is for you - I will repost it in the appropriate non-CUDA thread.
I could halve the needed memory and increase performance by around 20%.

Code: Select all

Classical Bob-Mike                 315.235323                    1024    [8kb]       countr_zero, countl_zero yes       Robert Hyatt and Michael Sherwin             https://www.chessprogramming.org/Classical_Approach
Advanced Bob-Mike                  375.742091                    520     [4kb]       countr_zero, countl_zero no        Michael Sherwin and Daniel Inführ            http://www.talkchess.com/forum3/viewtopic.php?f=7&t=79078&start=50#p924653
Thanks!
I agree that tree search is a better metric just because it is real world and like you said explores every nook and cranny.
I have perft running but not perfectly. On an R9-3950x at 4.3 GHz the node rate is 70 million nodes per second. And that is making and unmaking all moves, with no speed-up tricks like bulk counting or whatever else is done to make the numbers look better. And that is using Optimized Classical Bob-Mike. If I simply did not make/unmake the leaf nodes it would be ~180 million nodes per second. And if I had a PEXT-efficient CPU and used PEXT it would be faster yet. I think these numbers are really good, but it is hard to tell because perft node rates are often reported without giving the CPU, clock rate, or tricks used, making it impossible to compare results between engines.
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Post by dangi12012 »

Final Release Part 2
I was going through the PTX and comparing the pipeline bottlenecks of all algos. Bitrotation needs to rotate 3x per ray.
GeneticObstructionDifference (the AST-sifted improvement of Obstruction Difference) did not emit optimal PTX.

So I eliminated all locals and optimized the shared memory into a struct. This answers the question of Array of Structs vs. Struct of Arrays for these algos.

It makes me happy to announce a new best overall algorithm:

Code: Select all

NVIDIA GeForce RTX 3080
Black Magic - Fixed shift:      6.53 GigaQueens/s
QBB Algo                 :      60.49 GigaQueens/s
Bob Lookup               :      58.17 GigaQueens/s
Kogge Stone              :      40.33 GigaQueens/s
Hyperbola Quiescence     :      16.91 GigaQueens/s
Switch Lookup            :      5.52 GigaQueens/s
Slide Arithm             :      87.93 GigaQueens/s
Pext Lookup              :      15.92 GigaQueens/s
SISSY Lookup             :      8.33 GigaQueens/s
Dumb 7 Fill              :      26.51 GigaQueens/s
Obstruction Difference   :      67.78 GigaQueens/s
Genetic Obstruction Diff :      121.13 GigaQueens/s
Leorik                   :      61.69 GigaQueens/s
SBAMG o^(o-3cbn)         :      71.36 GigaQueens/s
NO HEADACHE              :      30.62 GigaQueens/s
AVX Branchless Shift     :      29.45 GigaQueens/s
Slide Arithmetic Inline  :      71.04 GigaQueens/s
C++ Tree Sifter - 8 Rays :      88.21 GigaQueens/s
Bitrotation o^(o-2r)     :      113.19 GigaQueens/s
Genetic Obstruction Difference is now the overall fastest algorithm. This was a great journey, and the performance achieved is really something else.
I also made sure to do the same thing for Bitrotation to make everything as optimized as possible. But 3x bitrotation is simply more work than 1x countl_zero!

The code is quite pleasing and the masks array is the same among many algorithms!

[image: code listing]


I also created a GitHub release:
https://github.com/Gigantua/Chess_Moveg ... e_2022.exe
Please share your results!

Last words: when looking up 121 billion queens per second, that is not the number of calculated squares - it is the number of queen positions. So the actual number of target squares calculated (the figure relevant for actual perft or movegen) is the sum of set bits over each of the 121 billion results per second.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Post by dangi12012 »

For Linux users: please share your results. I provided an update to compile under Linux / WSL2.

Handy two-liner:

Code: Select all

git clone https://github.com/Gigantua/Chess_Movegen_GPU.git && cd Chess_Movegen_GPU
make
Prerequisite: Nvidia Pascal (10xx) or later
https://developer.nvidia.com/cuda-downloads
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
pferd
Posts: 134
Joined: Thu Jul 24, 2014 2:49 pm

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Post by pferd »

Two minor typos in the source code:

Code: Select all

git diff
diff --git a/kernel.cu b/kernel.cu
index 79d641e..4375aba 100644
--- a/kernel.cu
+++ b/kernel.cu
@@ -1,5 +1,5 @@
 
-#include "Cu_Common.h"
+#include "cu_Common.h"
 
 #include <numeric>
 #include <iostream>
@@ -27,7 +27,7 @@
 #include "cu_SlideArithmInline.h"
 #include "cu_Genetic8Ray.h"
 #include "cu_Bitrotation.h"
-#include "cu_foldingHash.h"
+#include "cu_FoldingHash.h"
 #include "kernel.h"
 
 /// <summary>
@@ -281,4 +281,4 @@ int main()
     TestChessprocessor<18>(blocks, threadsperblock);
     TestChessprocessor<19>(blocks, threadsperblock);
     TestChessprocessor<20>(blocks, threadsperblock);
-}
\ No newline at end of file
+}
Here are my results:

Code: Select all

make
nvcc -gencode arch=compute_52,code=sm_52 -gencode=arch=compute_60,code=sm_60 -gencode=arch=compute_61,code=sm_61 -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_75,code=sm_75 -gencode=arch=compute_80,code=sm_80 -gencode=arch=compute_86,code=sm_86 -gencode=arch=compute_87,code=sm_87 --expt-relaxed-constexpr -std=c++17 --run --threads 8 -O3 kernel.cu -o movegen_gpu
NVIDIA GeForce GTX 1070
Black Magic - Fixed shift:      2.25 GigaQueens/s
QBB Algo                 :      21.73 GigaQueens/s
Bob Lookup               :      17.83 GigaQueens/s
Kogge Stone              :      14.71 GigaQueens/s
Hyperbola Quiescence     :      34.45 GigaQueens/s
Switch Lookup            :      1.16 GigaQueens/s
Slide Arithm             :      35.02 GigaQueens/s
Pext Lookup              :      7.35 GigaQueens/s
SISSY Lookup             :      3.66 GigaQueens/s
Dumb 7 Fill              :      9.81 GigaQueens/s
Obstruction Difference   :      23.07 GigaQueens/s
Genetic Obstruction Diff :      32.55 GigaQueens/s
Leorik                   :      22.95 GigaQueens/s
SBAMG o^(o-3cbn)         :      24.41 GigaQueens/s
NO HEADACHE              :      7.85 GigaQueens/s
AVX Branchless Shift     :      11.08 GigaQueens/s
Slide Arithmetic Inline  :      25.78 GigaQueens/s
C++ Tree Sifter - 8 Rays :      30.21 GigaQueens/s
Bitrotation o^(o-2r)     :      43.87 GigaQueens/s
FoldingHash (uncomplete) :      16.65 GigaQueens/s
make  43,77s user 3,36s system 185% cpu 25,383 total
ColonelPhantom
Posts: 6
Joined: Fri Mar 12, 2021 3:48 pm
Full name: Quinten Kock

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Post by ColonelPhantom »

I ported the code to AMD's HIP framework so that it runs on AMD Radeon GPUs using ROCm.

ROCm's portability isn't too great (in terms of e.g. legacy GPU support, and IIRC it also requires some host-side features like PCIe atomics), but HIP luckily also works on CUDA (so a HIP codebase runs on both ROCm and CUDA, and shouldn't lose much, if any, performance over a real CUDA implementation).

AMD even provides an automated translation tool for CUDA->HIP, and I only needed to manually fix up some things :D

(Do note, I did not check in any way that the translated code is correct, other than that it compiles. I give zero guarantees.)

I also ran the code on my own GPU (an RX 580 4GB):

Code: Select all

AMD Radeon RX 580 Series
Black Magic - Fixed shift:      5.38 GigaQueens/s
QBB Algo                 :      13.01 GigaQueens/s
Bob Lookup               :      21.90 GigaQueens/s
Kogge Stone              :      6.22 GigaQueens/s
Hyperbola Quiescence     :      23.40 GigaQueens/s
Switch Lookup            :      0.42 GigaQueens/s
Slide Arithm             :      22.48 GigaQueens/s
Pext Lookup              :      4.28 GigaQueens/s
SISSY Lookup             :      1.42 GigaQueens/s
Dumb 7 Fill              :      4.05 GigaQueens/s
Obstruction Difference   :      11.82 GigaQueens/s
Genetic Obstruction Diff :      23.64 GigaQueens/s
Leorik                   :      14.86 GigaQueens/s
SBAMG o^(o-3cbn)         :      14.33 GigaQueens/s
NO HEADACHE              :      3.54 GigaQueens/s
AVX Branchless Shift     :      4.55 GigaQueens/s
Slide Arithmetic Inline  :      15.95 GigaQueens/s
C++ Tree Sifter - 8 Rays :      18.20 GigaQueens/s
Bitrotation o^(o-2r)     :      29.29 GigaQueens/s
FoldingHash (uncomplete) :      17.44 GigaQueens/s
It's interesting to see that some algorithms are (relatively) better on AMD hardware (Black Magic, for example), and some on NVIDIA hardware. It does seem Bitrotation o^(o-2r) is universally quite strong, being the best on my RX 580 and also on pferd's GTX 1070. On dangi's RTX 3080, Genetic Obstruction Diff is slightly faster, but Bitrotation is still close.

If you want to run it yourself, note that you need a HIP runtime and also hipRAND (which wraps either cuRAND or rocRAND). Here are the commands I use to compile and run on my AMD setup:

Code: Select all

hipcc -I/opt/rocm/hiprand/include --std=c++17 hip/kernel.cu -o movegen_hip
./movegen_hip
Especially test results from RDNA or perhaps even CDNA would be very interesting!