Comparison of all known Sliding lookup algorithms [CUDA]

Discussion of chess software programming and technical issues.

Moderator: Ras

dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Post by dangi12012 »

64-bit is no issue for the GPU. Memory addressing is 64-bit, so many simple operations use native 64-bit instructions, similar to how LEA is used on your CPU for math.

Notable exception: 64x64-bit multiply is slow, and that's one reason why CUDA fancy magic is so slow.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
smatovic
Posts: 3189
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Post by smatovic »

dangi12012 wrote: Wed Mar 23, 2022 5:58 pm 64-bit is no issue for the GPU. Memory addressing is 64-bit, so many simple operations use native 64-bit instructions, similar to how LEA is used on your CPU for math.

Notable exception: 64x64-bit multiply is slow, and that's one reason why CUDA fancy magic is so slow.
Registers are 32-bit, and local and global coalesced memory access is optimized for a warp/wavefront at 32-bit too...

--
Srdja
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Post by dangi12012 »

Yes, we all know that! I tested it all too; nothing beats normal bitboards on CUDA. (for now)
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Post by dangi12012 »

smatovic wrote: Wed Mar 23, 2022 6:22 pm Registers are 32-bit, and local and global coalesced memory access is optimized for a warp/wavefront at 32-bit too...

Srdja
Also, accessing in multiples of 64 bits implies perfect 32-bit alignment, so no problem there!
What you can literally do is divide int32 performance by 2, and that's your performance for 64-bit in most cases, except for some specific operations like multiply or shift, where it's worse.
But you also get many intrinsics that take a 64-bit input and work as fast as 32-bit, just because that's a trickle-down from scientific FP64 and INT64 professional GPUs into our end-user cards.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
Dann Corbit
Posts: 12768
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Post by Dann Corbit »

dangi12012 wrote: Wed Mar 23, 2022 5:58 pm 64-bit is no issue for the GPU. Memory addressing is 64-bit, so many simple operations use native 64-bit instructions, similar to how LEA is used on your CPU for math.

Notable exception: 64x64-bit multiply is slow, and that's one reason why CUDA fancy magic is so slow.
I remember a trick from long ago in my 8- and 16-bit assembly programming: we would use address calculation to perform a multiply, since the CPUs were not as good at it.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Post by dangi12012 »

UPDATES
  • With some time to improve my CUDA programming skills, I optimized the algorithms further.
  • Added the newly found GeneticObstructionDifference algorithm
  • Updated to the latest NVIDIA toolkit (general 2-5% improvement)
  • Added support for Visual Studio 2022
What made it much faster:
Constant memory is really slow; don't use it. It serves a different purpose and is not applicable to chess programming.
Shared memory is what should be used to replace general ray lookups. I could improve Bob Lookup from 1.6 to 52.79 GigaQueens/s just from this change alone. Shared memory is limited to around 48 KB, so Black Magic - Fixed Shift and Pext Lookup really do not belong on GPUs.

Code: Select all

NVIDIA GeForce RTX 3080
Black Magic - Fixed shift:      4.91 GigaQueens/s
QBB Algo                 :      42.15 GigaQueens/s
Bob Lookup               :      52.79 GigaQueens/s
Kogge Stone              :      29.14 GigaQueens/s
Hyperbola Quiescence     :      12.63 GigaQueens/s
Switch Lookup            :      3.91 GigaQueens/s
Slide Arithm             :      13.42 GigaQueens/s
Pext Lookup              :      12.18 GigaQueens/s
SISSY Lookup             :      6.08 GigaQueens/s
Hypercube Alg            :      0.98 GigaQueens/s
Dumb 7 Fill              :      18.52 GigaQueens/s
Obstruction Difference   :      45.99 GigaQueens/s
Genetic Obstruction Diff :      73.06 GigaQueens/s
Leorik                   :      43.16 GigaQueens/s
SBAMG o^(o-3cbn)         :      47.15 GigaQueens/s
NO HEADACHE              :      20.95 GigaQueens/s
AVX Branchless Shift     :      20.22 GigaQueens/s
Slide Arithmetic Inline  :      48.86 GigaQueens/s
Bitrotation o^(o-2r)     :      84.51 GigaQueens/s
Also finding new algorithms really helped a lot.

Code: Select all

Obstruction Difference   :      45.99 GigaQueens/s
Genetic Obstruction Diff :      73.06 GigaQueens/s
This is now faster than the previous record holder, but Bitrotation will stay on top.
Bitrotation, which was created at the end of 2021, is also a HUGE improvement over Hyperbola Quiescence.

So all in all, the very fastest versions of all algorithms, and many new ones, were created this year, and I think my work on sliding algorithms is close to completion. I will also add the new 8Ray algorithm.

If you read this, please share your results, especially if you have an RTX 3090 or 2080 Ti. My results are for an RTX 3080.
https://github.com/Gigantua/Chess_Movegen_GPU
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
smatovic
Posts: 3189
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Post by smatovic »

dangi12012 wrote: Tue Apr 19, 2022 2:53 pm [...]
What made it much faster:
Constant memory is really slow; don't use it. It serves a different purpose and is not applicable to chess programming.
Shared memory is what should be used to replace general ray lookups. I could improve Bob Lookup from 1.6 to 52.79 GigaQueens/s just from this change alone. Shared memory is limited to around 48 KB, so Black Magic - Fixed Shift and Pext Lookup really do not belong on GPUs.
[...]
My 2 cents: constant memory is about 64 KB, but only 8 to 16 KB are cached; latency is higher than registers, but should be in the range of shared/scratch-pad memory... it depends on the concrete architecture, of course. But yes, on GPU: computation over lookup, or the like.

--
Srdja
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Post by dangi12012 »

smatovic wrote: Tue Apr 19, 2022 5:18 pm
Mister Matovic, please share the results for your GPU.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
Mike Sherwin
Posts: 965
Joined: Fri Aug 21, 2020 1:25 am
Location: Planet Earth, Sol system
Full name: Michael J Sherwin

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Post by Mike Sherwin »

tcusr reminded me of a technique that I once used. Using that technique finally allowed me to get rid of the stop bit in Classical Bob-Mike.

Code: Select all

struct Rays {
  u64 rayNW;
  u64 rayNN;
  u64 rayNE;
  u64 rayEE;
  u64 raySE;
  u64 raySS;
  u64 raySW;
  u64 rayWW;
};

Rays rays[66];
Rays* ray = rays + 1;

// countr_zero(0) returns 64, so ray[64] (a zero entry) is read when no blocker
// is found on a north ray; and 63 - countl_zero(0) returns -1, so ray[-1] (the
// other zero entry) is read on a south ray. The two sentinels replace the stop bit.

rayNW = ray[std::countr_zero(ray[fs].rayNW & occ)].rayNW;
rayNE = ray[std::countr_zero(ray[fs].rayNE & occ)].rayNE;
rayNN = ray[std::countr_zero(ray[fs].rayNN & occ)].rayNN;
rayEE = ray[std::countr_zero(ray[fs].rayEE & occ)].rayEE;
raySE = ray[63 - std::countl_zero(ray[fs].raySE & occ)].raySE;
raySW = ray[63 - std::countl_zero(ray[fs].raySW & occ)].raySW;
raySS = ray[63 - std::countl_zero(ray[fs].raySS & occ)].raySS;
rayWW = ray[63 - std::countl_zero(ray[fs].rayWW & occ)].rayWW;
bb = qob[fs] ^ (rayNN | rayEE | raySS | rayWW | rayNW | rayNE | raySE | raySW);

This should be noticeably faster. 
smatovic
Posts: 3189
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Post by smatovic »

dangi12012 wrote: Tue Apr 19, 2022 9:58 pm
smatovic wrote: Tue Apr 19, 2022 5:18 pm
Mister Matovic, please share the results for your GPU.
To be honest, I do not even know how to compile this, running Linux with gcc, clang, mingw. Windows is only for testing GUIs and the OpenCL stack, and I doubt my GTX 750 will impress anybody with its numbers ;) ...your benchmark tests absolute performance across different architectures, not the relative performance of a single GPU thread, warp, or compute unit.

--
Srdja