Oops, I forgot to include a link to my GitHub repository; it is at https://github.com/ColonelPhantom/Chess_Movegen_GPU
I also did not find an edit button, so apologies for the double post.
Comparison of all known Sliding lookup algorithms [CUDA]
Moderator: Ras
Quinten Kock - Posts: 6 - Joined: Fri Mar 12, 2021 3:48 pm
Daniel Infuehr - Posts: 1062 - Joined: Tue Apr 28, 2020 10:03 pm
Re: Comparison of all known Sliding lookup algorithms [CUDA]
If anyone gets their hands on an RTX 4080 or above, please contact me.
The AMD post above this will get merged into my repo in time (I got the author's approval).
I have some new algorithms that are not ready yet, but they should still be quite a step forward.
Going from below a billion Queens/s (which is rook + bishop) to over 100 billion was quite the fun journey; let's see what Ada Lovelace can do in this domain.
Example for binary matrix multiplication:
https://github.com/NVIDIA/cutlass/blob/ ... 32_sm80.cu
This is essentially what is so useful about tensor cores: you can expand a native 8x8 bitboard from uint1b_t into int or half by matrix multiplication. Or stay with uint1b_t and have it rotated, mirrored, weighted, or transformed by any other matrix operation you can think of.
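To make the mirroring claim concrete, here is a CPU-side sketch (my illustration, not tensor-core code): treating the 8x8 board as a 0/1 matrix and multiplying it by the anti-identity permutation matrix reverses the columns, i.e. mirrors the board horizontally, matching the classic SWAR bit trick.

```cpp
#include <cstdint>
#include <cassert>

// Bit (r, c) of an 8x8 board stored rank-major in a uint64_t.
static int board_bit(uint64_t b, int r, int c) { return (b >> (r * 8 + c)) & 1; }

// Multiply the 0/1 board matrix B by the anti-identity permutation P
// (P[k][c] = 1 iff c == 7 - k). (B*P)[r][c] = B[r][7-c], which reverses
// the columns -- a horizontal mirror, done purely as a matrix product.
static uint64_t mirror_by_matmul(uint64_t b) {
    uint64_t out = 0;
    for (int r = 0; r < 8; ++r)
        for (int c = 0; c < 8; ++c)
            out |= (uint64_t)board_bit(b, r, 7 - c) << (r * 8 + c);
    return out;
}

// Classic SWAR horizontal mirror, used here as the reference result.
static uint64_t mirror_swar(uint64_t x) {
    x = ((x >> 1) & 0x5555555555555555ull) | ((x & 0x5555555555555555ull) << 1);
    x = ((x >> 2) & 0x3333333333333333ull) | ((x & 0x3333333333333333ull) << 2);
    x = ((x >> 4) & 0x0F0F0F0F0F0F0F0Full) | ((x & 0x0F0F0F0F0F0F0F0Full) << 4);
    return x;
}
```

On a tensor core the same product runs as one tiny uint1b_t MMA; rotation and weighting are just different choices of the right-hand matrix.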
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
Daniel Infuehr - Posts: 1062 - Joined: Tue Apr 28, 2020 10:03 pm
Re: Comparison of all known Sliding lookup algorithms [CUDA]
New Breakthrough:
Through better hardware understanding I could optimize the bitrotation algorithm further. Bitrotation already unifies the positive and negative rays, which becomes possible when true bitreverse hardware instructions are available.
For context: this is an RTX 3080; the CPU is a 5950X, which can reach 13 billion lookups with optimized pext.
Results are verified against reference movegens with 10 million positions from actual games.
This new code can find all moves for all rooks, queens and bishops in under 1 ms for all 10 million positions.
Code:
Bitrotation: 4.6437G/s (32 Threads)
cu_Bitrotation: 215.806G/s
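For readers following along, here is a portable CPU sketch of the bitrotation idea: the o - 2r subtraction sweeps one ray direction, and bit-reversing lets the same subtraction handle the other. This is my illustration (with a software loop standing in for the hardware bitreverse instruction, and assuming mask is a single ray line through the slider), not the repository's CUDA kernel.

```cpp
#include <cstdint>
#include <cassert>

// Software stand-in for the single-instruction hardware bit reversal
// (__brevll on CUDA).
static uint64_t brev64(uint64_t v) {
    uint64_t r = 0;
    for (int i = 0; i < 64; ++i) { r = (r << 1) | (v & 1); v >>= 1; }
    return r;
}

// Sliding attacks along one line through the slider: o - 2r flips bits
// from the slider up to the first blocker in the positive direction;
// doing it again on the reversed board covers the negative direction.
static uint64_t line_attacks(uint64_t occ, uint64_t slider, uint64_t mask) {
    uint64_t o   = occ & mask & ~slider;                   // blockers on this line
    uint64_t fwd = o - 2 * slider;                         // positive direction
    uint64_t rev = brev64(brev64(o) - 2 * brev64(slider)); // negative direction
    return (fwd ^ rev) & mask;
}
```

For instance, a rook on d1 with blockers on b1 and f1 attacks b1, c1, e1 and f1 along the first rank.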
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
Tony Schwebs - Posts: 10 - Joined: Sat Sep 18, 2021 9:36 pm
Re: Comparison of all known Sliding lookup algorithms [CUDA]
And numbers for an RTX 4090:
Code:
NVIDIA GeForce RTX 4090
Black Magic - Fixed shift: 15.75 GigaQueens/s
QBB Algo : 152.80 GigaQueens/s
Bob Lookup : 187.73 GigaQueens/s
Kogge Stone : 100.24 GigaQueens/s
Hyperbola Quiescence : 42.12 GigaQueens/s
Switch Lookup : 13.60 GigaQueens/s
Slide Arithm : 222.92 GigaQueens/s
Pext Lookup : 38.83 GigaQueens/s
SISSY Lookup : 9.44 GigaQueens/s
Dumb 7 Fill : 65.31 GigaQueens/s
Obstruction Difference : 171.22 GigaQueens/s
Genetic Obstruction Diff : 291.15 GigaQueens/s
Leorik : 157.57 GigaQueens/s
SBAMG o^(o-3cbn) : 175.14 GigaQueens/s
NO HEADACHE : 75.57 GigaQueens/s
AVX Branchless Shift : 72.70 GigaQueens/s
Slide Arithmetic Inline : 173.02 GigaQueens/s
C++ Tree Sifter - 8 Rays : 210.50 GigaQueens/s
Bitrotation o^(o-2r) : 299.20 GigaQueens/s
Daniel Infuehr - Posts: 1062 - Joined: Tue Apr 28, 2020 10:03 pm
Re: Comparison of all known Sliding lookup algorithms [CUDA]
In this thread we mentioned a new sifted solution for positive and negative rays. The cute thing about this would be a board structure where we maintain both occ and bit_reversed(occ), applying sq ^ 64 to all moves as well. This gives rise to the possibility of using bitrotation without ever rotating bits.
https://www.talkchess.com/forum3/viewto ... =7&t=82547
New algo added: Bitray 2023 version
We don't use the improvement of maintaining the reversed occupancy on CUDA, because there bit reversal is a single instruction.
Code:
NVIDIA GeForce RTX 3080
Black Magic - Fixed shift: 40.88 GigaQueens/s
QBB Algo : 58.07 GigaQueens/s
Bob Lookup : 71.90 GigaQueens/s
Kogge Stone : 39.93 GigaQueens/s
Hyperbola Quiescence : 99.89 GigaQueens/s
Switch Lookup : 5.41 GigaQueens/s
Slide Arithm : 84.16 GigaQueens/s
Pext Lookup : 15.75 GigaQueens/s
SISSY Lookup : 8.08 GigaQueens/s
Dumb 7 Fill : 26.11 GigaQueens/s
Obstruction Difference : 66.50 GigaQueens/s
Genetic Obstruction Diff : 98.69 GigaQueens/s
Leorik : 59.21 GigaQueens/s
SBAMG o^(o-3cbn) : 69.46 GigaQueens/s
NO HEADACHE : 30.53 GigaQueens/s
AVX Branchless Shift : 29.43 GigaQueens/s
Slide Arithmetic Inline : 68.82 GigaQueens/s
C++ Tree Sifter - 8 Rays : 81.49 GigaQueens/s
Bitrotation o^(o-2r) : 134.14 GigaQueens/s
FoldingHash (uncomplete) : 68.51 GigaQueens/s
Bitray 2023 version : 110.84 GigaQueens/s
Code:
#define bitswap(X) __brevll(X)

// occ/mask: occupancy and ray mask in normal orientation (positive ray)
// OCC/MASK: the same, bit-reversed (negative ray)
BitFunction ray(uint64_t occ, uint64_t mask, uint64_t OCC, uint64_t MASK) {
    uint64_t o = occ & mask;
    uint64_t O = OCC & MASK;
    return ((o ^ (o - 1ull)) & mask) | bitswap((O ^ (O - 1ull)) & MASK);
}
The interesting part, and the truly never-seen-before novelty of this algorithm, is this:
We can enumerate the left-hand side of the return normally via a bitloop.
We can also enumerate the right-hand side of the return WITHOUT a bitswap, by normalizing the target square: sq = (sq' ^ 64)
110.84 GigaQueens/s is a proven big improvement over Hyperbola even in this slower form. Hyperbola, as a reminder, does not have this property: there the middle operation is an XOR, and the left- and right-hand sides are not disjoint.
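A CPU-side sketch of that enumeration (my illustration, assuming plain 0..63 square indexing and a full __brevll-style reversal, under which bit i of the reversed word corresponds to square 63 - i, i.e. i ^ 63):

```cpp
#include <cstdint>
#include <cassert>
#include <vector>

// Enumerate the target squares of ray() without materializing the final
// bitswap: walk the still-reversed word directly and normalize each index.
static std::vector<int> ray_targets(uint64_t occ, uint64_t mask,
                                    uint64_t OCC, uint64_t MASK) {
    uint64_t o = occ & mask, O = OCC & MASK;
    uint64_t lhs = (o ^ (o - 1ull)) & mask;  // normal orientation
    uint64_t rhs = (O ^ (O - 1ull)) & MASK;  // left bit-reversed on purpose
    std::vector<int> sq;
    for (uint64_t b = lhs; b; b &= b - 1)
        sq.push_back(__builtin_ctzll(b));        // square index as-is
    for (uint64_t b = rhs; b; b &= b - 1)
        sq.push_back(__builtin_ctzll(b) ^ 63);   // 63 - i == i ^ 63
    return sq;
}
```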
The reason these things become important is that NNUE evaluation is, to my own surprise, a natural fit for GPUs: faster memory, forward propagation fits in shared memory, we can solve many thousands of accumulators at once, and feature activation is insanely fast. But there is a deeper layer to be explored with the CUDA instructions in section 7.24; more on that in the thread where I will release the 60% faster CPU inference, and hopefully surprise with the performance of GPU inference. So please, no discussion about that here; this thread is for GPU movegen only.
Check out bitray here!
https://github.com/Gigantua/Chess_Moveg ... u_Bitray.h
Next up, when I find the time: KGSSB by our friend Mike. KGSSB will not fit in __shared__ memory, which is the preferred memory location for chess. I found a workaround using the __ldg() intrinsic, which bypasses the cache hierarchy and puts any global memory access into the L1 read-only texture cache, which is as fast as shared memory and may fit KGSSB perfectly.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
Quinten Kock - Posts: 6 - Joined: Fri Mar 12, 2021 3:48 pm
Re: Comparison of all known Sliding lookup algorithms [CUDA]
I found out about chipSTAR, which allows running CUDA/HIP on Intel GPUs. Out of curiosity, I decided to try running it on my laptop and test this thing out. I've got pretty old and low-end integrated graphics, so I wasn't expecting much.
After removing a few calls to specific features, as well as cuRAND (returning 0 instead of an actually random number), I got it to compile and work. I also reduced the number of blocks to 256 because otherwise it was just taking too long to compute.
https://github.com/ColonelPhantom/Chess ... e/chipstar
Those results are absolutely pathetic, but keep in mind this is a sub-TFLOP GPU that is multiple generations behind current Xe/Arc GPUs.
In addition, the chipSTAR software is only at version 1.0 (which is the version I used) and is known to not be very well optimized yet in general, so newer versions would hopefully make some of these at least a bit faster.
chipSTAR might also be nice on other platforms, as it should work on any OpenCL platform that supports SPIR-V ingestion, as far as I know.
Code:
Intel(R) UHD Graphics 620
Black Magic - Fixed shift: 0.50 GigaQueens/s
Bob Lookup : 0.58 GigaQueens/s
Kogge Stone : 0.31 GigaQueens/s
Hyperbola Quiescence : 0.35 GigaQueens/s
Switch Lookup : 0.09 GigaQueens/s
Slide Arithm : 0.72 GigaQueens/s
Pext Lookup : 0.28 GigaQueens/s
SISSY Lookup : 0.18 GigaQueens/s
Dumb 7 Fill : 0.22 GigaQueens/s
Obstruction Difference : 0.51 GigaQueens/s
Genetic Obstruction Diff : 0.73 GigaQueens/s
Leorik : 0.62 GigaQueens/s
SBAMG o^(o-3cbn) : 0.42 GigaQueens/s
NO HEADACHE : 0.38 GigaQueens/s
AVX Branchless Shift : 0.23 GigaQueens/s
Slide Arithmetic Inline : 0.62 GigaQueens/s
C++ Tree Sifter - 8 Rays : 0.38 GigaQueens/s
Bitrotation o^(o-2r) : 0.59 GigaQueens/s
FoldingHash (uncomplete) : 0.25 GigaQueens/s
Bitray 2023 version : 0.44 GigaQueens/s
Daniel Infuehr - Posts: 1062 - Joined: Tue Apr 28, 2020 10:03 pm
Re: Comparison of all known Sliding lookup algorithms [CUDA]
You know, I took 4 weeks of my life and went through the pain of implementing this in Vulkan already.
1500 lines of boilerplate code (and you have to understand Vulkan's mental model to be effective).
Then there is writing the GLSL for it, and you don't get C++ or C, or pointers, or recursion.
BUT: it runs on nearly any platform: on GPU and CPU devices; on Android, iOS, consoles, PC... Some important 64-bit intrinsics don't exist via core 1.3 or VK extensions. The host implementation is my reference, SwiftShader covers the CPU part, and both run against VK on an RTX 3080. All assertions hold.
Most of the VK API is extremely verbose, but kind of elegant once you get the gist of it.
For example:
Code:
uint64_t bit_reverse(uint64_t value){
    //uvec2 vec = unpackUint2x32(value);
    //return packUint2x32(bitfieldReverse(vec).yx);
    //Equivalent and 20% faster:
    return (uint64_t(bitfieldReverse(uint(value))) << 32) | bitfieldReverse(uint(value >> 32));
}
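A quick CPU check of the identity that shader relies on: reversing each 32-bit half and then swapping the halves equals a full 64-bit reversal (the software loops below are my stand-ins for GLSL's bitfieldReverse, used only to verify the math).

```cpp
#include <cstdint>
#include <cassert>

// Software stand-in for GLSL's 32-bit bitfieldReverse.
static uint32_t rev32(uint32_t v) {
    uint32_t r = 0;
    for (int i = 0; i < 32; ++i) { r = (r << 1) | (v & 1); v >>= 1; }
    return r;
}

// Reference full 64-bit reversal.
static uint64_t rev64_reference(uint64_t v) {
    uint64_t r = 0;
    for (int i = 0; i < 64; ++i) { r = (r << 1) | (v & 1); v >>= 1; }
    return r;
}

// The shader's formulation: reverse each half, then swap the halves.
static uint64_t rev64_by_halves(uint64_t v) {
    return ((uint64_t)rev32((uint32_t)v) << 32) | rev32((uint32_t)(v >> 32));
}
```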
But you get access to constant memory, shared memory, storage buffers and uniform buffers. The pain of registering a memory-mapped file's memory space with the device is also there.
In my tests CUDA is much, much faster, but that of course won't hold true on AMD, Intel, phones and whatnot.
Also, GLSL being essentially a const char* string loaded at runtime opens the door to some cool tricks when you load specialized codepaths. This only holds true in compute-bound applications, though. Some chess problems are entirely memory-bound, and there CUDA and Vulkan perform the same. Think of movegen: if we produce 150 billion uint64_t per second, that is already at the limit of global memory. You have to be smarter with GPU programming and keep everything in shared memory or registers.
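The bandwidth arithmetic behind that claim, as a sketch (the ~760 GB/s figure is the RTX 3080's theoretical DRAM bandwidth; treat it as an assumption for the estimate):

```cpp
#include <cstdint>

// 150 billion 64-bit results per second is 1.2 TB/s of output, well above
// a ~760 GB/s DRAM interface: the results cannot all be streamed to global
// memory, so they must stay in shared memory or registers.
constexpr double results_per_s = 150e9;
constexpr double bytes_per_s   = results_per_s * sizeof(uint64_t); // 1.2e12
constexpr double dram_bytes_s  = 760e9;
static_assert(bytes_per_s > dram_bytes_s, "output exceeds DRAM bandwidth");
```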
Long story short: I will shortly release all my cool stuff on a website that links to a GitHub repo with GLSL equivalents of these algorithms.
This includes a "Computestream" class which makes life as easy as CUDA. The compute shaders in GLSL still have to be written by you.
A minimal setup might look like this; "movegen.comp" is for the many algorithms that require 64x4 masks:
Code:
#version 450
#extension GL_EXT_shader_explicit_arithmetic_types : require
#extension GL_EXT_scalar_block_layout : require
#extension GL_EXT_debug_printf : require

layout(local_size_x_id = 0) in;
layout(local_size_y_id = 1) in;

layout(set = 0, binding = 0, scalar) buffer OutputBuffer_T {
    uint64_t moves[];
} outputBuffer;

layout(set = 0, binding = 1) uniform MaskBuffer_T {
    u64vec4 mask[64];
} maskBuffer;

shared u64vec4 masks[64];
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
Srdja Matovic - Posts: 3189 - Joined: Wed Mar 10, 2010 10:18 pm - Location: Hamburg, Germany
Re: Comparison of all known Sliding lookup algorithms [CUDA]
Vulkan, cool. Even the Raspberry Pi speaks Vulkan these days....
--
Srdja