Comparison of all known Sliding lookup algorithms [CUDA]

Discussion of chess software programming and technical issues.

Moderator: Ras

dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Post by dangi12012 »

64-bit is no issue for the GPU. Memory addressing is 64-bit, so many simple operations use native 64-bit instructions, similar to how LEA is used on your CPU for math.

Notable exception: 64x64-bit multiply is slow, and that's one reason why CUDA fancy magic is so slow.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
smatovic
Posts: 3189
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Post by smatovic »

dangi12012 wrote: Wed Mar 23, 2022 5:58 pm 64-bit is no issue for the GPU. Memory addressing is 64-bit, so many simple operations use native 64-bit instructions, similar to how LEA is used on your CPU for math.

Notable exception: 64x64-bit multiply is slow, and that's one reason why CUDA fancy magic is so slow.
Registers are 32-bit, and local and global coalesced memory access is optimized for a warp/wavefront at 32-bit too...

--
Srdja
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Post by dangi12012 »

Yes, we all know that! I tested it all too; nothing beats normal bitboards on CUDA. (for now)
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Post by dangi12012 »

smatovic wrote: Wed Mar 23, 2022 6:22 pm Registers are 32-bit, and local and global coalesced memory access is optimized for a warp/wavefront at 32-bit too...

Srdja
Also, accessing in multiples of 64 bits implies perfect 32-bit alignment, so no problem there!
What you can literally do is divide int32 performance by 2, and that's your performance for 64-bit in most cases, except for some specific operations like multiply or shift, where it's worse.
But you also get many intrinsics that take a 64-bit input and work as fast as 32-bit, just because that's a trickle-down from scientific FP64 and INT64 professional GPUs into our end-user cards.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
Dann Corbit
Posts: 12768
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Post by Dann Corbit »

dangi12012 wrote: Wed Mar 23, 2022 5:58 pm 64-bit is no issue for the GPU. Memory addressing is 64-bit, so many simple operations use native 64-bit instructions, similar to how LEA is used on your CPU for math.

Notable exception: 64x64-bit multiply is slow, and that's one reason why CUDA fancy magic is so slow.
I remember a trick from long ago in my 8- and 16-bit assembly programming: we would use address calculation to perform a multiply, since the CPUs were not as good at it.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Post by dangi12012 »

UPDATES
  • With some time to improve my CUDA programming skills, I optimized the algorithms further.
  • Added the newly found GeneticObstructionDifference algorithm
  • Updated to the latest NVIDIA toolkit (general 2-5% improvement)
  • Added support for Visual Studio 2022
What made it much faster:
Constant memory is really slow; don't use it. It serves a different purpose and is not applicable to chess programming.
Shared memory is what should be used to replace general ray lookups. I could improve Bob Lookup from 1.6 to 52.79 GigaQueens/s just from this change alone. Shared memory is limited to around 48 KB, so Black Magic - Fixed Shift and Pext Lookup really do not belong on GPUs.

Code: Select all

NVIDIA GeForce RTX 3080
Black Magic - Fixed shift:      4.91 GigaQueens/s
QBB Algo                 :      42.15 GigaQueens/s
Bob Lookup               :      52.79 GigaQueens/s
Kogge Stone              :      29.14 GigaQueens/s
Hyperbola Quiescence     :      12.63 GigaQueens/s
Switch Lookup            :      3.91 GigaQueens/s
Slide Arithm             :      13.42 GigaQueens/s
Pext Lookup              :      12.18 GigaQueens/s
SISSY Lookup             :      6.08 GigaQueens/s
Hypercube Alg            :      0.98 GigaQueens/s
Dumb 7 Fill              :      18.52 GigaQueens/s
Obstruction Difference   :      45.99 GigaQueens/s
Genetic Obstruction Diff :      73.06 GigaQueens/s
Leorik                   :      43.16 GigaQueens/s
SBAMG o^(o-3cbn)         :      47.15 GigaQueens/s
NO HEADACHE              :      20.95 GigaQueens/s
AVX Branchless Shift     :      20.22 GigaQueens/s
Slide Arithmetic Inline  :      48.86 GigaQueens/s
Bitrotation o^(o-2r)     :      84.51 GigaQueens/s
Also finding new algorithms really helped a lot.

Code: Select all

Obstruction Difference   :      45.99 GigaQueens/s
Genetic Obstruction Diff :      73.06 GigaQueens/s
This is now faster than the previous record holder, but Bitrotation will stay on top.
Bitrotation, which was created at the end of 2021, is also a HUGE improvement over Hyperbola Quiescence.

So all in all, the very fastest versions of all algorithms, and many new ones, were created this year, and I think my work on sliding algorithms is close to completion. I will also add the new 8Ray algorithm.

If you read this, please share your results, especially if you have an RTX 3090 or 2080 Ti. My results are for an RTX 3080.
https://github.com/Gigantua/Chess_Movegen_GPU
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
smatovic
Posts: 3189
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Post by smatovic »

dangi12012 wrote: Tue Apr 19, 2022 2:53 pm [...]
What made it much faster:
Constant memory is really slow; don't use it. It serves a different purpose and is not applicable to chess programming.
Shared memory is what should be used to replace general ray lookups. I could improve Bob Lookup from 1.6 to 52.79 GigaQueens/s just from this change alone. Shared memory is limited to around 48 KB, so Black Magic - Fixed Shift and Pext Lookup really do not belong on GPUs.
[...]
My 2 cents: constant memory is about 64 KB, but only 8 to 16 KB are cached; latency is higher than registers, but should be in the range of shared/scratch-pad memory... it depends on the concrete architecture, of course. But yes, on GPU: computation over lookup, or the like.

--
Srdja
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Post by dangi12012 »

smatovic wrote: Tue Apr 19, 2022 5:18 pm
Mister Matovic, please share the results for your GPU.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
Mike Sherwin
Posts: 965
Joined: Fri Aug 21, 2020 1:25 am
Location: Planet Earth, Sol system
Full name: Michael J Sherwin

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Post by Mike Sherwin »

tcusr reminded me of a technique that I once used. Using that technique finally allowed me to get rid of the stop bit in Classical Bob-Mike.

Code: Select all

struct Rays {
  u64 rayNW;
  u64 rayNN;
  u64 rayNE;
  u64 rayEE;
  u64 raySE;
  u64 raySS;
  u64 raySW;
  u64 rayWW;
};

Rays rays[66];
Rays* ray = rays + 1;

// countr_zero(0) returns 64, so ray[64] (a zero entry) is read when no blocker
// is found on a north ray; and 63 - countl_zero(0) returns -1, so ray[-1] (the
// other zero entry) is read on a south ray. The two sentinels replace the stop bit.

rayNW = ray[std::countr_zero(ray[fs].rayNW & occ)].rayNW;
rayNE = ray[std::countr_zero(ray[fs].rayNE & occ)].rayNE;
rayNN = ray[std::countr_zero(ray[fs].rayNN & occ)].rayNN;
rayEE = ray[std::countr_zero(ray[fs].rayEE & occ)].rayEE;
raySE = ray[63 - std::countl_zero(ray[fs].raySE & occ)].raySE;
raySW = ray[63 - std::countl_zero(ray[fs].raySW & occ)].raySW;
raySS = ray[63 - std::countl_zero(ray[fs].raySS & occ)].raySS;
rayWW = ray[63 - std::countl_zero(ray[fs].rayWW & occ)].rayWW;
bb = qob[fs] ^ (rayNN | rayEE | raySS | rayWW | rayNW | rayNE | raySE | raySW);

This should be noticeably faster. 
smatovic
Posts: 3189
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Post by smatovic »

dangi12012 wrote: Tue Apr 19, 2022 9:58 pm
smatovic wrote: Tue Apr 19, 2022 5:18 pm
Mister Matovic, please share the results for your GPU.
To be honest, I do not even know how to compile this, running Linux with gcc, clang, mingw. Windows is only for testing GUIs and the OpenCL stack, and I doubt my GTX 750 will impress anybody with its numbers ;) ...your benchmark tests absolute performance across different architectures, not the relative performance of a single GPU thread, warp, or compute unit.

--
Srdja