Using Neural or Graphical Processor Units in NNUE engines

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

RogerC
Posts: 41
Joined: Tue Oct 29, 2019 8:33 pm
Location: French Polynesia
Full name: Roger C.

Using Neural or Graphical Processor Units in NNUE engines

Post by RogerC »

Hi,

I wonder why NNUE chess engines don't use the neural processing units or GPUs of modern Apple and Qualcomm SoCs in smartphones, instead of using their CPUs.

Those units are extremely powerful and would undoubtedly increase nps, if it's possible to use them.

I'm not a programmer at all, just a chess player.

Best regards.
Ras
Posts: 2509
Joined: Tue Aug 30, 2016 8:19 pm
Full name: Rasmus Althoff

Re: Using Neural or Graphical Processor Units in NNUE engines

Post by Ras »

RogerC wrote: Thu Aug 03, 2023 11:57 am I wonder why NNUE chess engines don't use the neural processing units or GPUs of modern Apple and Qualcomm SoCs in smartphones, instead of using their CPUs.
Because the "UE" part, i.e. efficiently updated, eliminates most of the calculations to begin with. For the remaining part, the overhead for transfer and synchronisation with several threads would more than eat up any potential gains.
Rasmus Althoff
https://www.ct800.net
RogerC
Posts: 41
Joined: Tue Oct 29, 2019 8:33 pm
Location: French Polynesia
Full name: Roger C.

Re: Using Neural or Graphical Processor Units in NNUE engines

Post by RogerC »

Ras wrote: Thu Aug 03, 2023 12:09 pm Because the "UE" part, i.e. efficiently updated, eliminates most of the calculations to begin with. For the remaining part, the overhead for transfer and synchronisation with several threads would more than eat up any potential gains.
Thank you, Ras, it's clearer to me now.
syzygy
Posts: 5577
Joined: Tue Feb 28, 2012 11:56 pm

Re: Using Neural or Graphical Processor Units in NNUE engines

Post by syzygy »

Ras wrote: Thu Aug 03, 2023 12:09 pm
RogerC wrote: Thu Aug 03, 2023 11:57 am I wonder why NNUE chess engines don't use the neural processing units or GPUs of modern Apple and Qualcomm SoCs in smartphones, instead of using their CPUs.
Because the "UE" part, i.e. efficiently updated, eliminates most of the calculations to begin with. For the remaining part, the overhead for transfer and synchronisation with several threads would more than eat up any potential gains.
Is the hardware really unsuitable for NNUE or is it (also?) a problem of lack of (low-level) documentation?
ColonelPhantom
Posts: 6
Joined: Fri Mar 12, 2021 3:48 pm
Full name: Quinten Kock

Re: Using Neural or Graphical Processor Units in NNUE engines

Post by ColonelPhantom »

I think the main/real/only problem is transfer and synchronization overhead. It is not practical to do a round trip to the GPU for an evaluation that needs to be really fast (which an NNUE evaluation is).
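To get a feel for the size of that overhead, here is a rough sketch (assuming a CUDA setup; the exact numbers depend on the system) that just measures an empty kernel launch plus synchronisation, i.e. the minimum price of one GPU round trip. On typical hardware this already comes out at several microseconds, usually more than an entire NNUE evaluation takes on the CPU.

Code: Select all

// Rough micro-benchmark of GPU round-trip cost (sketch, numbers vary by system).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void empty_kernel() {}

int main() {
    const int iters = 10000;
    empty_kernel<<<1, 1>>>();          // warm-up, so context setup is excluded
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i) {
        empty_kernel<<<1, 1>>>();
        cudaDeviceSynchronize();       // force a full round trip every time
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    std::printf("average round trip: %.2f us per launch\n", ms * 1000.0f / iters);
    return 0;
}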

Bigger networks such as those used by Leela Chess are evaluated on GPU! Because these take so much longer to evaluate, the overhead is more than worth it given the considerable speedup.

If you want to do NNUE on a GPU, you probably want the entire engine to be running on the GPU. There has not been a successful GPU-only chess engine so far, although there is definitely interest in it! There's Zeta by smatovic, which has a few releases out but none too strong (note that none of the published versions use NN evaluation). Looking at the website it seems it's still being worked on, which is encouraging. I think Zeta is the only public GPU-based engine, with other projects only being capable of perft or not having materialized yet.
syzygy
Posts: 5577
Joined: Tue Feb 28, 2012 11:56 pm

Re: Using Neural or Graphical Processor Units in NNUE engines

Post by syzygy »

ColonelPhantom wrote: Fri Aug 04, 2023 1:04 am I think the main/real/only problem is transfer and synchronization overhead. It is not practical to do a round trip to the GPU for an evaluation that needs to be really fast (which an NNUE evaluation is).
But Apple's neural engine is not a GPU. Perhaps it has the same disadvantages as a GPU, but this I don't know.
Ras
Posts: 2509
Joined: Tue Aug 30, 2016 8:19 pm
Full name: Rasmus Althoff

Re: Using Neural or Graphical Processor Units in NNUE engines

Post by Ras »

syzygy wrote: Fri Aug 04, 2023 2:43 am But Apple's neural engine is not a GPU.
That's correct, the issue with Apple is that its neural engine is badly documented, only usable through a proprietary framework, and limited in what nets it can deal with. It seems that it's not meant to be used by third parties, unless that has changed in the meantime. Here's the best I've come across so far: https://github.com/hollance/neural-engine
Rasmus Althoff
https://www.ct800.net
syzygy
Posts: 5577
Joined: Tue Feb 28, 2012 11:56 pm

Re: Using Neural or Graphical Processor Units in NNUE engines

Post by syzygy »

Ras wrote: Fri Aug 04, 2023 7:10 am
syzygy wrote: Fri Aug 04, 2023 2:43 am But Apple's neural engine is not a GPU.
That's correct, the issue with Apple is that its neural engine is badly documented, only usable through a proprietary framework, and limited in what nets it can deal with. It seems that it's not meant to be used by third parties, unless that has changed in the meantime. Here's the best I've come across so far: https://github.com/hollance/neural-engine
Thanks for the link! So indeed there is a severe lack of documentation.

Despite its shared memory, I suppose it cannot be used to speed up NNUE due to synchronisation issues. But perhaps there is room for a GPU- or Neural-Engine-based alpha-beta approach with many, many threads, something in between LC0 and SF.
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Using Neural or Graphical Processor Units in NNUE engines

Post by dangi12012 »

GPUs can never be used to improve incremental NNs in the traditional sense. The latency from host to device is too high, even in integrated GPUs or other accelerators.
The important part would be to shift the incremental NNUE network, meaning the accumulators, to the device as well. The difficulty is thread divergence and the inherently single-threaded nature of alpha-beta, which clashes with the GPU paradigm.
Incidentally, GPU memory is many times faster than CPU memory, which helps with feature activation/deactivation for thousands of accumulators in parallel.

To answer the specific question of how fast a single 32-thread warp (which executes one instruction across 32 lanes per clock, shares registers, etc.) can be made to execute ONE evaluation starting from an accumulator, I am running a research project whose result will be exactly this comparison.

I wrote it many times before, and __viaddmin_s16x2_relu for example looks very juicy for parts of this system.

Open question:
Whether we use 32 threads for one NNUE eval or have 32 evals executing concurrently in lockstep is still open. I reckon the second approach will win, since the later parts of NNUE work on 15 elements, leaving half of the threads idle with the first approach.
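For illustration, a stripped-down sketch of the first layout (32 threads cooperating on one eval) could look something like this. It is a toy 32x32 int16 layer with a plain shuffle reduction rather than the DPX intrinsics mentioned above; all sizes and the quantisation shift are hypothetical.

Code: Select all

// Sketch: 32 threads of one warp cooperate on a single small NNUE-style layer
// (hypothetical 32x32 int16 layer; just the layout idea, not a complete engine).
#include <cstdint>
#include <cstdio>
#include <cuda_runtime.h>

constexpr int N = 32;  // layer width, chosen to match the warp size

__global__ void warp_layer(const int16_t* __restrict__ acc,      // [N] accumulator
                           const int16_t* __restrict__ weights,  // [N*N], one row per output
                           const int32_t* __restrict__ bias,     // [N]
                           int16_t* __restrict__ out) {          // [N]
    int lane = threadIdx.x & 31;

    // Clipped ReLU on this lane's accumulator entry.
    int16_t a = acc[lane];
    int16_t clipped = a < 0 ? 0 : (a > 127 ? 127 : a);

    // Each output neuron is a dot product over all 32 inputs; every lane
    // contributes its own input * weight, then the warp reduces the sum.
    for (int o = 0; o < N; ++o) {
        int32_t partial = int32_t(clipped) * int32_t(weights[o * N + lane]);
        for (int offset = 16; offset > 0; offset >>= 1)       // butterfly reduction
            partial += __shfl_xor_sync(0xffffffffu, partial, offset);
        if (lane == o)
            out[o] = int16_t((partial + bias[o]) >> 6);        // toy quantisation shift
    }
}

int main() {
    int16_t *acc, *w, *out; int32_t *b;
    cudaMallocManaged(&acc, N * sizeof(int16_t));
    cudaMallocManaged(&w, N * N * sizeof(int16_t));
    cudaMallocManaged(&b, N * sizeof(int32_t));
    cudaMallocManaged(&out, N * sizeof(int16_t));
    for (int i = 0; i < N; ++i) { acc[i] = int16_t(i); b[i] = 0; }
    for (int i = 0; i < N * N; ++i) w[i] = 1;
    warp_layer<<<1, 32>>>(acc, w, b, out);
    cudaDeviceSynchronize();
    std::printf("out[0] = %d\n", int(out[0]));
    return 0;
}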

Summary:
NNUE fits in 50 lines of code and cannot be accelerated when crossing any device boundary. What you need is to do the incremental update on the same cache-local device as well. Using unsupported proprietary APIs is never a good idea since they can change on a whim.
CUDA and GPUs are insanely good for any type of unconditional integer or float arithmetic; leveraging them in any way, shape or form needs a paradigm shift from existing engine approaches.

Unrelated fun fact: memory-mapped tablebase probing from the GPU is supportable too.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
smatovic
Posts: 2749
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Using Neural or Graphical Processor Units in NNUE engines

Post by smatovic »

One thing is using TPUs or neural engines for CNNs as in Lc0; this probably can benefit. Ankan, for example, did some microbenchmarking for running Lc0 on Apple's neural engine instead of the CPU or the Metal GPU. But it seems vendors use proprietary frameworks to program these.

As already mentioned, NNUE is optimized to run on the SIMD unit of a CPU: the first layer, where most of the weights are, is incrementally updated, and I suppose the further layers rest in cache, so the offload latency to an external compute unit might cancel out any speedup.
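For reference, those later layers are just small int16/int32 dot products that fit in cache and map directly onto CPU SIMD. A minimal AVX2 sketch of one output neuron (hypothetical 256-wide layer, not actual Stockfish code) looks like this:

Code: Select all

// Sketch: one output neuron of a small int16 dense layer with AVX2
// (hypothetical 256-wide layer; illustrative only).
#include <cstdint>
#include <immintrin.h>

// Dot product of 256 int16 inputs with 256 int16 weights, accumulated in int32.
int32_t dense_neuron(const int16_t* input, const int16_t* weights) {
    __m256i sum = _mm256_setzero_si256();
    for (int i = 0; i < 256; i += 16) {
        __m256i in = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(input + i));
        __m256i wt = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(weights + i));
        // madd: 16x (int16*int16) products, adjacent pairs summed into 8x int32.
        sum = _mm256_add_epi32(sum, _mm256_madd_epi16(in, wt));
    }
    // Horizontal sum of the 8 int32 lanes.
    __m128i lo = _mm256_castsi256_si128(sum);
    __m128i hi = _mm256_extracti128_si256(sum, 1);
    __m128i s  = _mm_add_epi32(lo, hi);
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(1, 0, 3, 2)));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtsi128_si32(s);
}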

There is Intel's Xeon Sapphire Rapids arch with so-called AMX, a TMUL (tile matrix multiply) compute unit for matrix math, but until now these are not present in consumer-brand CPUs.

AMD also seems to be adding neural engines in some CPU variants, with their own programming framework/API.

I think this is something to look into in the future: if Intel, AMD and ARM have TPUs sitting near the CPU, how do we use them for chess neural networks? It might need a network architecture in between Lc0's CNN and SF's NNUE to benefit, and it might turn out that CPU+SIMD with higher clocks and broader bit-width (AVX-512) is the way to go.

Another thing is the GPU: Lc0 works with big CNNs and batches -> parallelization, while SF's alpha-beta search algorithm is by nature serial. The fewer workers a parallel AB search runs, the more efficient it is, and we see a trend toward increased single-thread performance, less parallelism, in AB engines.

AFAIK Daniel Shawul has/had some batched AB running with Scorpio, to offload AB tasks to the GPU; I am not into the details though.

Zeta itself is intended to run completely on the GPU, but I need to implement NNUE/NNOM eval first, to see if a massively parallel AB with low NPS/worker can compete with something like SF on a CPU with high NPS/worker.

As Dann Corbit wrote, to make use of CPU+SIMD+GPU+TPU+HBM via unified memory (the whole compute power of some newer devices), you will need a new programming paradigm in chess.

--
Srdja