Comparison of all known Sliding lookup algorithms [CUDA]

Discussion of chess software programming and technical issues.

Moderator: Ras

Dann Corbit
Posts: 12777
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Post by Dann Corbit »

Has anyone tried porting to a Google TPU like these:
https://coral.ai/products/#prototyping-products
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Post by dangi12012 »

Dann Corbit wrote: Mon Mar 21, 2022 3:18 am Has anyone tried porting to a Google TPU like these:
https://coral.ai/products/#prototyping-products
There is no sliding algorithm that works with matrix multiplication except this one:
https://www.talkchess.com/forum3/viewto ... =7&t=79332

But that is not a conventional matrix, since it's a single 64-bit integer treated as a binary 8x8 matrix - so it is not supported by TPUs.
Question: Is there a TPU that supports XNOR networks?
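For context, a product of two 8x8 binary matrices packed into single 64-bit integers needs only shifts, masks, and two multiplies per step. A minimal scalar sketch of that "single 64-bit integer as binary matrix" idea (boolean OR-AND product; the linked algorithm's actual formulation may differ):

```cpp
#include <cassert>
#include <cstdint>

// Boolean (OR-AND) product of two 8x8 bit matrices, each packed row-major
// into a 64-bit integer: bit (r*8 + c) holds element [r][c].
uint64_t bmm8x8(uint64_t A, uint64_t B) {
    uint64_t C = 0;
    for (int k = 0; k < 8; ++k) {
        // isolate bit k of every row of A (one bit per byte)
        uint64_t colA = (A >> k) & 0x0101010101010101ULL;
        // expand each set bit into a full-byte row mask
        uint64_t rowMask = colA * 0xFFULL;
        // broadcast row k of B into every byte
        uint64_t rowB = ((B >> (8 * k)) & 0xFFULL) * 0x0101010101010101ULL;
        // rows of A whose bit k is set pick up row k of B
        C |= rowMask & rowB;
    }
    return C;
}
```

Multiplying by the packed identity matrix 0x8040201008040201 returns the other operand unchanged, which makes for a quick sanity check.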

A mailbox matrix algo will get memory-bound very fast - a bitboard algo is much more space-efficient.
But I don't need a dedicated TPU if an RTX 3080 is faster with GEMM than 99.9% of the hardware out there.

What GLSL (as opposed to CUDA) enables us to do is natively use even better hardware than GPUs.
A quote for the fastest existing PCIe 4.0 FPGA:

Code:

Card applications can be created using the Vitis™ unified software platform which uses high-level languages
such as C, C++, and OpenCL™. For experienced programmable logic developers, the card can be developed with
the Vivado® Design Suite where the full resources of the programmable logic device are made available for
development.
But going to exotic hardware is not needed at this stage. First of all, there may still be undiscovered improvements to the classical approaches. Second of all, FPGAs would be their own thread. Since they are used more and more in the financial industry, they are also worth looking into - the last relevant FPGA post here is roughly 10 years old.

If someone knows a way to use GEMM (general matrix multiply) to solve the sliders in chess, let me know.
But first I want to try out the tensor cores for sliding algos.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Post by dangi12012 »

Let me summarise this. The road is clear - but actually taking it is hard work.
Sentences like "just use cuda cores" hide dozens of hours of reading documentation for technology that has no dedicated forum or big developer pool to ask questions in.
Also, getting a good Vulkan GLSL project running is not as easy as a C++ hello world.
Fun times are ahead.
Dann Corbit
Posts: 12777
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Post by Dann Corbit »

dangi12012 wrote: Mon Mar 21, 2022 1:48 pm
Dann Corbit wrote: Mon Mar 21, 2022 3:18 am Has anyone tried porting to a Google TPU like these:
https://coral.ai/products/#prototyping-products
There is no sliding algorithm that works with matrix multiplication except this one:
https://www.talkchess.com/forum3/viewto ... =7&t=79332

But that is not a conventional matrix, since it's a single 64-bit integer treated as a binary 8x8 matrix - so it is not supported by TPUs.
Question: Is there a TPU that supports XNOR networks?

A mailbox matrix algo will get memory-bound very fast - a bitboard algo is much more space-efficient.
But I don't need a dedicated TPU if an RTX 3080 is faster with GEMM than 99.9% of the hardware out there.

What GLSL (as opposed to CUDA) enables us to do is natively use even better hardware than GPUs.
A quote for the fastest existing PCIe 4.0 FPGA:

Code:

Card applications can be created using the Vitis™ unified software platform which uses high-level languages
such as C, C++, and OpenCL™. For experienced programmable logic developers, the card can be developed with
the Vivado® Design Suite where the full resources of the programmable logic device are made available for
development.
But going to exotic hardware is not needed at this stage. First of all, there may still be undiscovered improvements to the classical approaches. Second of all, FPGAs would be their own thread. Since they are used more and more in the financial industry, they are also worth looking into - the last relevant FPGA post here is roughly 10 years old.

If someone knows a way to use GEMM (general matrix multiply) to solve the sliders in chess, let me know.
But first I want to try out the tensor cores for sliding algos.
The thing is, the TPUs are cheap. Here is an M.2 module with two of them for $39:
https://coral.ai/products/m2-accelerator-dual-edgetpu

That is 8 TOPS. Since the original Google experiment went well, I assume there must be some way to write an efficient chess program using them. Maybe move generation is not the best fit, but other things might be.

There are PCIe 4.0 boards that you can put 8 M.2 units into. So that would be 32 TOPS for $156. Compare that to GPU prices.
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Post by dangi12012 »

Dann Corbit wrote: Mon Mar 21, 2022 8:18 pm
That is 8 TOPS. Since the original Google experiment went well, I assume there must be some way to write an efficient chess program using them. Maybe move generation is not the best fit, but other things might be.

There are PCIe 4.0 boards that you can put 8 M.2 units into. So that would be 32 TOPS for $156. Compare that to GPU prices.
Well Dann - if you find me an algorithm that works with 32-bit floating point or bfloat16, this can work.
An algorithm that runs on bfloat16 is probably interesting - but the board would have to be mailbox, which is a regression in my book.

But yes - if you can transform chess problems into GEMM + activation function, you suddenly get access to hardware that is an order of magnitude faster than the GPU again - but you lose versatility.
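To make the shape of that pipeline concrete (everything below is a hypothetical illustration of the data flow, not a known slider solution): unpack the board into a float vector, run one GEMM, apply an activation.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical GEMM + activation pipeline: bitboard -> 64-float vector
// -> 64x64 matrix multiply -> ReLU. The weight matrix is a placeholder;
// no known weights actually solve sliding attacks this way.
std::array<float, 64> unpack(uint64_t bb) {
    std::array<float, 64> v{};
    for (int i = 0; i < 64; ++i)
        v[i] = ((bb >> i) & 1) ? 1.0f : 0.0f;
    return v;
}

std::array<float, 64> gemmRelu(const std::array<std::array<float, 64>, 64>& W,
                               const std::array<float, 64>& x) {
    std::array<float, 64> y{};
    for (int i = 0; i < 64; ++i) {
        float acc = 0.0f;
        for (int j = 0; j < 64; ++j)
            acc += W[i][j] * x[j];       // the GEMM a TPU/tensor core would accelerate
        y[i] = acc > 0.0f ? acc : 0.0f;  // activation function (ReLU)
    }
    return y;
}
```

The open question in this thread is whether any choice of W (or a stack of such layers) yields attack sets or table indices; the code above only shows where the hardware-friendly work would sit.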

I can see myself implementing all the algorithms in Vulkan first - and maybe creating a new bfloat16 algo (which could also use tensor cores)!

We are at an interesting position in computer chess. Fast hardware exists - but latency overhead means porting the whole code over to the GPU, or staying on the CPU altogether. The Apple M1 does not have a latency issue, but PCs and Android have it.
Also, FPGAs have evolved a LOT. But one step after another - big numbers are not the goal here, but achieving something new that brings computer chess forward. bfloat16 is a good starting point.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
Dann Corbit
Posts: 12777
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Post by Dann Corbit »

dangi12012 wrote: Mon Mar 21, 2022 9:30 pm
Dann Corbit wrote: Mon Mar 21, 2022 8:18 pm
That is 8 TOPS. Since the original Google experiment went well, I assume there must be some way to write an efficient chess program using them. Maybe move generation is not the best fit, but other things might be.

There are PCIe 4.0 boards that you can put 8 M.2 units into. So that would be 32 TOPS for $156. Compare that to GPU prices.
Well Dann - if you find me an algorithm that works with 32-bit floating point or bfloat16, this can work.
An algorithm that runs on bfloat16 is probably interesting - but the board would have to be mailbox, which is a regression in my book.

But yes - if you can transform chess problems into GEMM + activation function, you suddenly get access to hardware that is an order of magnitude faster than the GPU again - but you lose versatility.

I can see myself implementing all the algorithms in Vulkan first - and maybe creating a new bfloat16 algo (which could also use tensor cores)!

We are at an interesting position in computer chess. Fast hardware exists - but latency overhead means porting the whole code over to the GPU, or staying on the CPU altogether. The Apple M1 does not have a latency issue, but PCs and Android have it.
Also, FPGAs have evolved a LOT. But one step after another - big numbers are not the goal here, but achieving something new that brings computer chess forward. bfloat16 is a good starting point.
The new architecture for the AMD GPUs shows a planned shared memory model where the GPU can read CPU RAM and the CPU can read GPU VRAM, using Infinity Fabric. No latency or copy problems.
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Post by dangi12012 »

This also seems interesting:
https://www.anandtech.com/show/17327/nv ... -announced

So I'd better speed up my research and publish a tensor core algo soon! :twisted:
My creation of a single-bit matrix algorithm was a good idea after all.

I currently see no way of using this for mailbox boards - but one might exist.
If someone knows a way, let me know.
Even though APIs for matrices with 8 bytes per element are much better than a binary matrix, a mailbox "occupation" is an array of 64x8-bit values, so it also contains the piece type and is non-standard. Of course one can maintain neighbour tables etc. - but that is proven to be many times slower than bitboard approaches.

If you are smart and read this - try to solve this:
Find an 8-bit-per-element algebra way of solving sliding pieces for either representation: 64x8 bit (mailbox) or 1x64 bit (bitboard).
It can be one solution per square, so there can be 64 different matrices that, when multiplied with the full board position, create a unique index corresponding either to the solution directly or to a 14-bit index into a table.

Ideally it would be an idea that is already supported by TPUs (so multiple matrix layers with activation functions in between are allowed).
Pio
Posts: 335
Joined: Sat Feb 25, 2012 10:42 pm
Location: Stockholm

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Post by Pio »

dangi12012 wrote: Tue Mar 22, 2022 10:10 pm This also seems interesting:
https://www.anandtech.com/show/17327/nv ... -announced

So I'd better speed up my research and publish a tensor core algo soon! :twisted:
My creation of a single-bit matrix algorithm was a good idea after all.

I currently see no way of using this for mailbox boards - but one might exist.
If someone knows a way, let me know.
Even though APIs for matrices with 8 bytes per element are much better than a binary matrix, a mailbox "occupation" is an array of 64x8-bit values, so it also contains the piece type and is non-standard. Of course one can maintain neighbour tables etc. - but that is proven to be many times slower than bitboard approaches.

If you are smart and read this - try to solve this:
Find an 8-bit-per-element algebra way of solving sliding pieces for either representation: 64x8 bit (mailbox) or 1x64 bit (bitboard).
It can be one solution per square, so there can be 64 different matrices that, when multiplied with the full board position, create a unique index corresponding either to the solution directly or to a 14-bit index into a table.

Ideally it would be an idea that is already supported by TPUs (so multiple matrix layers with activation functions in between are allowed).
Hi Daniel!

I don’t exactly understand the problem, so if you could give more information that would be great.

One really interesting property of matrices is that you can divide a matrix into a matrix of matrices. For example, if you have worked out the matrix algebra for 64 bits, it will still be valid for 64 bytes. You can also see 64 bytes as 64 x 4 x 2 bits, so each 8-bit element can in turn be seen as a 2 x 2 matrix where each element is 2 bits.

Also don’t forget that you can use the fact that the determinant of A multiplied by the determinant of B equals the determinant of AB.

Matrix algebra is valid not only for ordinary addition and multiplication but for other abelian groups and rings, e.g. submatrices with matrix addition as + (addition) and matrix multiplication as x (multiplication).
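A tiny numeric check of the multiplicativity of determinants mentioned above, using 2x2 integer matrices (the values are arbitrary, chosen only for illustration):

```cpp
#include <cassert>

// A 2x2 integer matrix: [a b; c d].
struct Mat2 { long a, b, c, d; };

// standard 2x2 matrix product
Mat2 mul(Mat2 x, Mat2 y) {
    return { x.a * y.a + x.b * y.c, x.a * y.b + x.b * y.d,
             x.c * y.a + x.d * y.c, x.c * y.b + x.d * y.d };
}

// det [a b; c d] = a*d - b*c
long det(Mat2 m) { return m.a * m.d - m.b * m.c; }
```

For A = [3 1; 2 5] and B = [4 2; 1 3], det(A) = 13, det(B) = 10, and det(AB) = 130 = det(A) * det(B).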

Good luck!!!
smatovic
Posts: 3220
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Comparison of all known Sliding lookup algorithms [CUDA]

Post by smatovic »

dangi12012 wrote: Tue Mar 22, 2022 10:10 pm [...]
I currently see no way of using this for mailbox boards - but it might exist.
[...]
Not into tensor cores, but gamer GPUs are 32-bit machines, so you get a penalty for 64-bit operations. I once tried to figure out a 0x88 SWAR-like move gen and board representation, to let a vector of 4 - or all 8 - 0x88 directions perform at once. It should also be SIMD-friendly, but I could not implement an efficient, vectorized board representation... imagine the idea as a vector of eight 8-bit 0x88 Dumb7Fill passes, or alike.
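For reference, the scalar Dumb7Fill this alludes to looks as follows for one direction (north) on an ordinary 64-bit bitboard with a1 = bit 0; the SWAR idea above would run eight such 8-bit rays in parallel instead:

```cpp
#include <cassert>
#include <cstdint>

// Dumb7Fill, north direction: flood the rook set one rank at a time
// through empty squares, then shift once more so the attack set
// includes the first blocker.
uint64_t northAttacks(uint64_t rooks, uint64_t empty) {
    uint64_t fill = rooks;
    for (int i = 0; i < 7; ++i)          // at most 7 steps on an 8x8 board
        fill |= (fill << 8) & empty;     // advance one rank into empty squares
    return fill << 8;                    // attacked squares, blocker included
}
```

The same loop with different shift amounts (1, 7, 9, and their right-shift mirrors, plus file wrap masks for the diagonal and horizontal rays) covers the other seven directions.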

--
Srdja