64-bit is no issue for the GPU. Memory is 64-bit, so many simple operations use native 64-bit instructions - similar to how LEA is used on your CPU for math.
Notable exception: 64x64-bit multiply is slow, and that's one reason why CUDA fancy magic is so slow.
Comparison of all known Sliding lookup algorithms [CUDA]
Moderator: Ras
-
- Posts: 1062
- Joined: Tue Apr 28, 2020 10:03 pm
- Full name: Daniel Infuehr
Re: Comparison of all known Sliding lookup algorithms [CUDA]
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
-
- Posts: 3189
- Joined: Wed Mar 10, 2010 10:18 pm
- Location: Hamburg, Germany
- Full name: Srdja Matovic
Re: Comparison of all known Sliding lookup algorithms [CUDA]
dangi12012 wrote: ↑Wed Mar 23, 2022 5:58 pm
64-bit is no issue for the GPU. Memory is 64-bit, so many simple operations use native 64-bit instructions - similar to how LEA is used on your CPU for math.
Notable exception: 64x64-bit multiply is slow, and that's one reason why CUDA fancy magic is so slow.
Registers are 32-bit, and local and global coalesced memory access is optimized for a warp/wavefront with 32-bit too...
--
Srdja
-
- Posts: 1062
- Joined: Tue Apr 28, 2020 10:03 pm
- Full name: Daniel Infuehr
Re: Comparison of all known Sliding lookup algorithms [CUDA]
Yes, we all know that! I tested it all too - nothing beats normal bitboards on CUDA. (for now)
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
-
- Posts: 1062
- Joined: Tue Apr 28, 2020 10:03 pm
- Full name: Daniel Infuehr
Re: Comparison of all known Sliding lookup algorithms [CUDA]
Also, accessing memory in multiples of 64 bits implies perfect 32-bit alignment - so no problem there!
What you can literally do is divide int32 performance by 2, and that's your 64-bit performance in most cases - except for some specific operations like multiply or shift, where it's worse.
But you also get many intrinsics that take a 64-bit input and run as fast as the 32-bit versions - a trickle-down from the fp64/int64 hardware of scientific professional GPUs into our end-user cards.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
-
- Posts: 12768
- Joined: Wed Mar 08, 2006 8:57 pm
- Location: Redmond, WA USA
Re: Comparison of all known Sliding lookup algorithms [CUDA]
dangi12012 wrote: ↑Wed Mar 23, 2022 5:58 pm
64-bit is no issue for the GPU. Memory is 64-bit, so many simple operations use native 64-bit instructions - similar to how LEA is used on your CPU for math.
Notable exception: 64x64-bit multiply is slow, and that's one reason why CUDA fancy magic is so slow.
I remember a trick from long ago in my 8- and 16-bit assembly programming: we would use address calculation to perform a multiply, since the CPUs were not as good at it.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
-
- Posts: 1062
- Joined: Tue Apr 28, 2020 10:03 pm
- Full name: Daniel Infuehr
Re: Comparison of all known Sliding lookup algorithms [CUDA]
UPDATES
Finding new algorithms also helped a lot: the new Genetic Obstruction Difference is now faster than the previous record holder - but Bitrotation will stay on top.
- With some time to improve CUDA programming skills I optimized the algorithms further.
- Added the newly found GeneticObstructionDifference algorithm
- Updated to latest NVIDIA Toolkit (general 2-5% improvement)
- Support for Visual Studio 2022
What made it much faster:
Constant Memory is really slow - don't use it. It serves a different purpose and is not applicable to chess programming.
Shared memory is what should be used to replace general ray lookups. I could improve Bob Lookup from 1.6 to 52.79 GigaQueens/s just from this change alone. Shared memory is limited to around 48 KB - so Black Magic - Fixed Shift and Pext Lookup really do not belong on GPUs.
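As a sketch of the shared-memory idea (my illustration, not the actual Chess_Movegen_GPU kernels - names and layout are made up): each block copies a small ray table from global memory into shared memory once, and every lookup after that hits fast on-chip storage.

```cuda
#include <cstdint>

// Hypothetical sketch: stage a 64-square x 8-direction ray table (4 KB,
// well under the ~48 KB shared-memory limit) into on-chip shared memory.
__global__ void sliding_lookup_kernel(const uint64_t* __restrict__ global_rays,
                                      const uint64_t* occ, uint64_t* out, int n)
{
    __shared__ uint64_t rays[64 * 8];
    // Cooperative copy: the block's threads each load a slice of the table.
    for (int i = threadIdx.x; i < 64 * 8; i += blockDim.x)
        rays[i] = global_rays[i];
    __syncthreads(); // the table must be complete before anyone reads it

    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= n) return;
    // ... compute sliding attacks for position t from rays[] and occ[t] ...
}
```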
Code: Select all
NVIDIA GeForce RTX 3080
Black Magic - Fixed shift: 4.91 GigaQueens/s
QBB Algo : 42.15 GigaQueens/s
Bob Lookup : 52.79 GigaQueens/s
Kogge Stone : 29.14 GigaQueens/s
Hyperbola Quiescence : 12.63 GigaQueens/s
Switch Lookup : 3.91 GigaQueens/s
Slide Arithm : 13.42 GigaQueens/s
Pext Lookup : 12.18 GigaQueens/s
SISSY Lookup : 6.08 GigaQueens/s
Hypercube Alg : 0.98 GigaQueens/s
Dumb 7 Fill : 18.52 GigaQueens/s
Obstruction Difference : 45.99 GigaQueens/s
Genetic Obstruction Diff : 73.06 GigaQueens/s
Leorik : 43.16 GigaQueens/s
SBAMG o^(o-3cbn) : 47.15 GigaQueens/s
NO HEADACHE : 20.95 GigaQueens/s
AVX Branchless Shift : 20.22 GigaQueens/s
Slide Arithmetic Inline : 48.86 GigaQueens/s
Bitrotation o^(o-2r) : 84.51 GigaQueens/s
Code: Select all
Obstruction Difference : 45.99 GigaQueens/s
Genetic Obstruction Diff : 73.06 GigaQueens/s
Bitrotation, created at the end of 2021, is also a HUGE improvement over Hyperbola Quiescence.
So all in all, the very fastest versions of all algorithms and many new ones were created this year, and I think my work on sliding algorithms is close to completion. I will also add the new 8Ray algorithm.
If you read this - please share your results, especially if you have an RTX 3090 or 2080 Ti. My results are for an RTX 3080.
https://github.com/Gigantua/Chess_Movegen_GPU
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
-
- Posts: 3189
- Joined: Wed Mar 10, 2010 10:18 pm
- Location: Hamburg, Germany
- Full name: Srdja Matovic
Re: Comparison of all known Sliding lookup algorithms [CUDA]
dangi12012 wrote: ↑Tue Apr 19, 2022 2:53 pm
[...]
What made it much faster:
Constant Memory is really slow - don't use it. It serves a different purpose and is not applicable to chess programming.
Shared memory is what should be used to replace general ray lookups. I could improve Bob Lookup from 1.6 to 52.79 GigaQueens/s just from this change alone. Shared memory is limited to around 48 KB - so Black Magic - Fixed Shift and Pext Lookup really do not belong on GPUs.
[...]
2 cents: constant memory is about 64 KB, but only 8 to 16 KB are cached; latency is higher than registers but should be in the range of shared/scratch-pad memory... depends on the concrete architecture ofc... but yes, on GPU: computation over lookup, or alike.
--
Srdja
-
- Posts: 1062
- Joined: Tue Apr 28, 2020 10:03 pm
- Full name: Daniel Infuehr
Re: Comparison of all known Sliding lookup algorithms [CUDA]
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
-
- Posts: 965
- Joined: Fri Aug 21, 2020 1:25 am
- Location: Planet Earth, Sol system
- Full name: Michael J Sherwin
Re: Comparison of all known Sliding lookup algorithms [CUDA]
tcusr reminded me of a technique that I once used. Using that technique finally allowed me to get rid of the stop bit in Classical Bob-Mike.
Code: Select all
#include <bit>       // std::countr_zero / std::countl_zero (C++20)
#include <cstdint>

using u64 = uint64_t;

struct Rays {
    u64 rayNW;
    u64 rayNN;
    u64 rayNE;
    u64 rayEE;
    u64 raySE;
    u64 raySS;
    u64 raySW;
    u64 rayWW;
};

Rays rays[66];
Rays* ray = rays + 1; // valid indices -1..64; ray[-1] and ray[64] are all-zero sentinels
// std::countr_zero(0) == 64, so an empty positive ray lands on ray[64], which is zero.
// 63 - std::countl_zero(0) == -1, so an empty negative ray lands on ray[-1], which is zero.
rayNW = ray[std::countr_zero(ray[fs].rayNW & occ)].rayNW;
rayNE = ray[std::countr_zero(ray[fs].rayNE & occ)].rayNE;
rayNN = ray[std::countr_zero(ray[fs].rayNN & occ)].rayNN;
rayEE = ray[std::countr_zero(ray[fs].rayEE & occ)].rayEE;
raySE = ray[63 - std::countl_zero(ray[fs].raySE & occ)].raySE;
raySW = ray[63 - std::countl_zero(ray[fs].raySW & occ)].raySW;
raySS = ray[63 - std::countl_zero(ray[fs].raySS & occ)].raySS;
rayWW = ray[63 - std::countl_zero(ray[fs].rayWW & occ)].rayWW;
bb = qob[fs] ^ (rayNN | rayEE | raySS | rayWW | rayNW | rayNE | raySE | raySW);
This should be noticeably faster.
-
- Posts: 3189
- Joined: Wed Mar 10, 2010 10:18 pm
- Location: Hamburg, Germany
- Full name: Srdja Matovic
Re: Comparison of all known Sliding lookup algorithms [CUDA]
To be honest, I do not even know how to compile this - running Linux with gcc, clang, mingw. Windows is only for testing GUIs and the OpenCL stack, and I doubt my GTX 750 will impress anybody with some numbers.
--
Srdja