Zeta with NNUE on GPU?

Discussion of chess software programming and technical issues.

Moderator: Ras

dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Zeta with NNUE on GPU?

Post by dangi12012 »

Why do you want to copy NNUE to the GPU? It is an imperfect format: it has an insanely big first layer to support incremental updates, and a layout created for AVX2.

My advice: try to create something new that leverages the architecture. It could still support incremental updates.
Otherwise you are just copying conventional wisdom.
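
To make that concrete, here is a minimal C++ sketch of what that huge first layer buys - all sizes and names are illustrative (HalfKP-like), not the actual Stockfish layout:

Code:
// Minimal sketch of why NNUE's first layer is so large: it makes
// incremental updates cheap. Sizes/names are illustrative (HalfKP-like),
// not the actual Stockfish layout.
#include <cstdint>

constexpr int kInputs = 41024; // sparse king x piece x square features
constexpr int kAccum  = 256;   // accumulator width per perspective

// One first-layer weight column per input feature (loaded from a net file).
extern int16_t weights[kInputs][kAccum];

struct Accumulator { int16_t v[kAccum]; };

// A quiet move flips only a handful of input features, so instead of a
// full matrix-vector product we add/subtract two weight columns. This is
// what the huge first layer buys, and it vectorizes naturally with AVX2.
void update(Accumulator& acc, int removedFeature, int addedFeature) {
    for (int i = 0; i < kAccum; ++i)
        acc.v[i] += int16_t(weights[addedFeature][i] - weights[removedFeature][i]);
}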
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
smatovic
Posts: 3234
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Zeta with NNUE on GPU?

Post by smatovic »

I simply do not have the resources (time/hardware) to create/train my own NNs and experiment with alternative NN architectures...I will happily clone any technique I consider useful.

--
Srdja
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Zeta with NNUE on GPU?

Post by dangi12012 »

No, it's the other way around - you will save time by using your own format instead of porting over NNUE.
At least that's what I have in mind for my very similar goal :D

https://developer.nvidia.com/cudnn
You also get access to existing training libraries that need only a few hundred iterations to converge to a good result.
My gut feeling is that a deeper network is computationally much more expensive - but on the GPU it almost does not matter, since you get access to 1-2 orders of magnitude more compute.
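
As a sketch of why depth is cheap on the GPU, here is a deliberately naive CUDA dense layer, one position per block - real code would use cuBLAS/cuDNN as linked above, and the sizes are made up:

Code:
#include <cuda_runtime.h>

// Hypothetical sizes: 512-wide layers. One block per position in the
// batch, one thread per output neuron. With thousands of positions in
// flight, the cost of extra layers is hidden by sheer parallelism.
constexpr int kIn = 512, kOut = 512;

__global__ void dense_relu(const float* __restrict__ in,  // [batch][kIn]
                           const float* __restrict__ w,   // [kOut][kIn]
                           const float* __restrict__ b,   // [kOut]
                           float* __restrict__ out)       // [batch][kOut]
{
    const float* x = in  + blockIdx.x * kIn;   // this block's position
    float*       y = out + blockIdx.x * kOut;
    int o = threadIdx.x;                       // output neuron index
    float acc = b[o];
    for (int i = 0; i < kIn; ++i)
        acc += w[o * kIn + i] * x[i];
    y[o] = fmaxf(acc, 0.0f);                   // ReLU
}

// launch: dense_relu<<<batchSize, kOut>>>(dIn, dW, dB, dOut);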
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Zeta with NNUE on GPU?

Post by dangi12012 »

What really kills these kinds of projects is latency - and warp divergence.
The details matter a lot for good performance.
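
A toy CUDA example of the divergence problem (all names hypothetical):

Code:
#include <cuda_runtime.h>

// The 32 threads of a warp execute in lockstep: when they take different
// branches, the hardware runs BOTH paths serially, halving throughput
// (or worse with more paths).

__device__ int evalLight(int x) { return x * 3 + 1; }
__device__ int evalDark (int x) { return x * 5 - 7; }

// Divergent version: within a warp, odd and even inputs take different
// branches, so the two branches execute one after the other.
__global__ void divergent(const int* sq, int* score, int n) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= n) return;
    if (sq[t] & 1) score[t] = evalLight(sq[t]);
    else           score[t] = evalDark(sq[t]);
}

// Branch-free version: compute both results and select, so the warp stays
// converged. (For functions this cheap the compiler may predicate anyway;
// the point is the pattern. For heavy work, sorting the work queue so
// whole warps take the same path achieves the same effect.)
__global__ void converged(const int* sq, int* score, int n) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= n) return;
    int a = evalLight(sq[t]), b = evalDark(sq[t]);
    score[t] = (sq[t] & 1) ? a : b;
}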
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
smatovic
Posts: 3234
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Zeta with NNUE on GPU?

Post by smatovic »

dangi12012 wrote: Wed Mar 23, 2022 9:02 pm What really kills these kinds of projects is latency - and warp divergence.
The details matter a lot for good performance.
Ever since GPGPU frameworks appeared and chess programmers started pondering how to use a GPU for chess, these are the main issues that have been mentioned over the last decade:

- CPU-GPU latency
- SIMD architecture
- SIMT architecture
- VRAM memory latency
- integer performance
- no recursion
- syncing of GPU threads

https://zeta-chess.app26.de/post/zeta-o ... mpossible/

With Zeta v099 I ported a classic, selective, parallel AB engine to OpenCL to run on a GPU. I solved all of these issues, but that is not the way to utilize a GPU: my NPS throughput per worker is too low. You really have to run multiple waves of SIMT, hundreds of thousands of threads, to utilize a GPU, and that is simply not how to run a game tree search for chess, considering that selective AB engines have an effective branching factor of ~2. With brute force you cannot compete against selectivity; that is the common wisdom of modern AB chess engines, and I guess with NNs present in chess we will see even more selectivity. Feel free to prove otherwise with your Gigantua project, bonne chance.
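
To put a number on "multiple waves of SIMT", a back-of-the-envelope CUDA snippet - the 4-wave factor is just a rule of thumb, and the actual numbers vary per device:

Code:
#include <cuda_runtime.h>
#include <cstdio>

// To hide memory latency a GPU wants far more resident threads than it
// has cores; one "wave" is every SM filled to its thread limit.
int main() {
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, 0);
    int threadsPerWave = p.multiProcessorCount * p.maxThreadsPerMultiProcessor;
    // e.g. 80 SMs x 2048 threads = 163,840 resident threads per wave;
    // a healthy grid launches several waves' worth of blocks on top.
    printf("one full wave   = %d threads\n", threadsPerWave);
    printf("suggested grid >= %d blocks of 256 threads (~4 waves)\n",
           4 * threadsPerWave / 256);
    return 0;
}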

--
Srdja
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Zeta with NNUE on GPU?

Post by dangi12012 »

https://zeta-chess.app26.de/post/zeta-n ... enchmarks/

There is no way to port this to the GPU.
There is really no way to do AB multithreaded either - multithreading is only a trick to do more work and hope the TT already contains the entry.

AB is algorithmically forced to be single-threaded code.

What is a TT? It is a 1D lookup table for the game tree, indexed by a calculated hash of the position - it works in reverse, because with depth-first approaches you technically don't store a tree in memory at all. You just visit nodes - so it is depth first.
I have done some preliminary tests, and the purest, most perfect C++ AB implementation caps out at around 1.2 billion NPS, and only if you use heavy template machinery. If not, you will only get 400 Mnps. There is no way to use more threads to be faster.
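
For the record, a minimal C++ sketch of that "1D lookup" view of a TT - the field layout is illustrative, not taken from any particular engine:

Code:
#include <cstdint>
#include <vector>

// The tree is never materialized: visited nodes are memoized into a flat
// array indexed by the position's Zobrist hash.
struct TTEntry {
    uint64_t key;    // full hash, for verification against index collisions
    int16_t  score;
    int8_t   depth;
    uint8_t  flag;   // EXACT / LOWERBOUND / UPPERBOUND
};

struct TT {
    std::vector<TTEntry> slots;
    explicit TT(size_t n) : slots(n) {}

    TTEntry* probe(uint64_t key) {
        TTEntry& e = slots[key % slots.size()];   // hash -> 1D index
        return e.key == key ? &e : nullptr;
    }
    void store(uint64_t key, int16_t score, int8_t depth, uint8_t flag) {
        slots[key % slots.size()] = { key, score, depth, flag };
    }
};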

What will work on the GPU: a top-down algorithm like Monte Carlo Tree Search or others that don't use recursion the way it is used in AB.
There the number of threads is unlimited and parallelisation is unlimited, because threads don't need synchronisation at all.
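
A minimal C++ sketch of that top-down structure - one MCTS selection is a plain loop from root to leaf, no recursion and no call stack (names are made up, plain UCB1):

Code:
#include <cmath>
#include <vector>

struct Node {
    std::vector<Node*> children;
    float value  = 0.0f;   // accumulated playout results
    int   visits = 0;
};

// Descend from the root to a leaf by repeatedly picking the child with
// the best UCB1 score. Expansion/rollout/backprop happen from the leaf.
Node* select_leaf(Node* root, float c = 1.4f) {
    Node* n = root;
    while (!n->children.empty()) {      // iterative descent, no recursion
        Node* best = nullptr;
        float bestUcb = -1e30f;
        for (Node* ch : n->children) {
            float ucb = (ch->visits == 0) ? 1e30f  // try unvisited first
                : ch->value / ch->visits
                  + c * std::sqrt(std::log((float)n->visits) / ch->visits);
            if (ucb > bestUcb) { bestUcb = ucb; best = ch; }
        }
        n = best;
    }
    return n;
}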
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
smatovic
Posts: 3234
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Zeta with NNUE on GPU?

Post by smatovic »

dangi12012 wrote: Thu Mar 24, 2022 4:33 pm What will work on the GPU: a top-down algorithm like Monte Carlo Tree Search or others that don't use recursion the way it is used in AB.
There the number of threads is unlimited and parallelisation is unlimited, because threads don't need synchronisation at all.
Well, how does the saying go, the proof is in the market place...looking forward to Gigantua MCTS!

--
Srdja
Daniel Shawul
Posts: 4186
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Zeta with NNUE on GPU?

Post by Daniel Shawul »

dangi12012 wrote: Thu Mar 24, 2022 4:33 pm What will work on the GPU: a top-down algorithm like Monte Carlo Tree Search or others that don't use recursion the way it is used in AB.
There the number of threads is unlimited and parallelisation is unlimited, because threads don't need synchronisation at all.
That is not correct, but at least you are now convinced of the hopelessness of a strictly sequential AB implementation on the GPU.
The Elo you can get from parallel MCTS is capped too, because once you go above a few hundred threads, collisions will start to dominate, so you will have to do an extremely wide search (high cpuct) to counteract it, and that loses Elo! This is similar to the YBW vs. ABDADA/LazySMP search debate, of which the latter is now proven to be slightly better. You have to pay the "overhead fee" to get good parallelization whether you use AB or MCTS. That MCTS doesn't use recursion helps some, because now a thread searches one leaf node per round from root to leaf, but if the way you reach the leaf nodes is still constrained by the AB criteria, or in the case of MCTS a very low cpuct (high selectivity), you are bound to reach the same leaf nodes or internal nodes again and again with different threads, so collisions happen!
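
To illustrate the collision mechanism, a minimal C++ sketch of PUCT selection - names are illustrative, and the virtual-loss line at the end is the usual mitigation, added here for context:

Code:
#include <cmath>
#include <vector>
#include <atomic>

// PUCT: score(child) = Q + cpuct * P * sqrt(N_parent) / (1 + N_child).
// With a small cpuct the Q term dominates, every worker picks the same
// child, and parallel threads pile onto the same leaf ("collisions").
struct PNode {
    std::vector<PNode*> children;
    float prior = 0.0f;                  // P from the policy net
    std::atomic<int>   visits{0};
    std::atomic<float> valueSum{0.0f};
};

PNode* pick_child(PNode* parent, float cpuct) {  // parent must be internal
    PNode* best = nullptr;
    float bestScore = -1e30f;
    for (PNode* ch : parent->children) {
        int   n = ch->visits.load();
        float q = n ? ch->valueSum.load() / n : 0.0f;
        float u = cpuct * ch->prior
                  * std::sqrt((float)parent->visits.load()) / (1 + n);
        if (q + u > bestScore) { bestScore = q + u; best = ch; }
    }
    // "Virtual loss": bump the chosen child's visit count before its
    // evaluation returns, so concurrent threads are steered elsewhere.
    best->visits.fetch_add(1);
    return best;
}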
smatovic
Posts: 3234
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Zeta with NNUE on GPU?

Post by smatovic »

dangi12012 wrote: Thu Mar 24, 2022 4:33 pm ....
Three Ways of GPGPU Chess...
https://zeta-chess.app26.de/post/three- ... pu-chess-/
[...]
So, the question which remains open is how to run thousands to millions of threads, all performing the same computation, in a parallel game tree search.
[...]
As I already mentioned, MCTS was proposed for GPGPU as early as 2008, but AFAIK no one has implemented a massively parallel MCTS for chess on a GPU and posted results. As you mentioned, plain MCTS playouts are blind to forced, tactical lines in chess, so you have to enhance the vanilla brute-force approach, for example with known MCTS-AB or MCTS-PUCT techniques, or come up with some yet unknown method...

--
Srdja

PS: It really helps to write a simple AB engine with standard techniques from scratch to get a feeling for how computer chess works; Ankan, for example, wrote his Paladin, and I wrote Zeta Dva...