Using Neural or Graphical Processor Units in NNUE engines

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

RogerC
Posts: 41
Joined: Tue Oct 29, 2019 8:33 pm
Location: French Polynesia
Full name: Roger C.

Re: Using Neural or Graphical Processor Units in NNUE engines

Post by RogerC »

Since the implementation of AffineTransformSparseInput for armv8 and specific compilation for armv8-dotprod architectures, Android users have seen a huge speed improvement.

This shows us that a small change in coding and/or compilation can impact speed a lot on modern architectures.
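For readers who haven't looked at the patch: the idea behind AffineTransformSparseInput is that after the clipped ReLU most inputs of the following dense layer are zero, so the code first collects the indices of the non-zero inputs and only accumulates those weight columns. A minimal scalar C++ sketch of that idea (sizes and names here are hypothetical, and the real Stockfish code vectorizes both passes with NEON/dotprod intrinsics):

Code: Select all

#include <cstdint>
#include <vector>

// Sketch of a sparse-input affine layer: out = W * in + bias, where most
// entries of 'in' are zero after the clipped ReLU. Sizes are illustrative.
constexpr int InDims  = 1024;
constexpr int OutDims = 16;

void affine_sparse(const std::uint8_t* in,            // InDims activations, mostly zero
                   const std::int8_t (*W)[OutDims],   // one weight column per input
                   const std::int32_t* bias,
                   std::int32_t* out)
{
    // Pass 1: collect indices of non-zero inputs (the real code does this
    // with movemask/lookup-table tricks).
    std::vector<int> nnz;
    nnz.reserve(InDims);
    for (int i = 0; i < InDims; ++i)
        if (in[i]) nnz.push_back(i);

    for (int o = 0; o < OutDims; ++o)
        out[o] = bias[o];

    // Pass 2: accumulate only the columns of the non-zero inputs instead of
    // walking all InDims columns as a dense layer would.
    for (int i : nnz)
        for (int o = 0; o < OutDims; ++o)
            out[o] += int(in[i]) * int(W[i][o]);
}

When only a small fraction of the inputs is non-zero, the inner loop touches far fewer weight columns, which is where the reported speed-up comes from.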
Magnum
Posts: 188
Joined: Thu Feb 04, 2021 10:24 pm
Full name: Arnold Magnum

Re: Using Neural or Graphical Processor Units in NNUE engines

Post by Magnum »

RogerC wrote: Wed Aug 09, 2023 3:48 am Since the implementation of AffineTransformSparseInput for armv8 and specific compilation for armv8-dotprod architectures, Android users have seen a huge speed improvement.

This shows us that a small change in coding and/or compilation can impact speed a lot on modern architectures.
That’s not a huge speed improvement.

+75.63% on Apple devices. This is a huge speed improvement :mrgreen:
It's almost 3x the armv8-dotprod gain on a Cortex-X1: a 27.1% speed-up.

https://en.wikipedia.org/wiki/ARM_Cortex-X1
ARMv8.2-A
https://en.wikipedia.org/wiki/AArch64#ARMv8.2-A
https://en.wikipedia.org/wiki/ARM_architecture_family

Stockfish developers should try: ARMv9.2
https://www.anandtech.com/show/18871/ar ... -exclusive
Ras
Posts: 2500
Joined: Tue Aug 30, 2016 8:19 pm
Full name: Rasmus Althoff

Re: Using Neural or Graphical Processor Units in NNUE engines

Post by Ras »

Magnum wrote: Sat Aug 12, 2023 6:29 pm Stockfish developers should try: ARMv9.2
How would they try it? With what hardware as of now? ARM themselves design the CPUs without actually manufacturing them.
Rasmus Althoff
https://www.ct800.net
smatovic
Posts: 2727
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Using Neural or Graphical Processor Units in NNUE engines

Post by smatovic »

dangi12012 wrote: Sun Aug 06, 2023 11:04 pm [...]
Open question:
Whether we have 32 threads for one NNUE eval or 32 evals executing concurrently in lockstep is an open question. I reckon the second approach will win, since the later parts of NNUE work on 15 elements, which leaves half of the threads idle with the first approach.
[...]
My take: you will need to couple a Warp with 32 threads or a Wavefront with 64 threads to compete with AVX2, and you might want to use vector-packed math with char4, a vector of 4x8-bit values. Hence, to utilize a Warp with 32x char4 you will need an optimized neural net architecture, which differs from the current SF one. Generally, Lc0 has different net sizes, and I can imagine that it would make sense to use 128b, 256b, 512b...2048b optimized net archs for NNUE to run on different SIMD units with different bit-widths.
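To make the char4 point concrete, here is a scalar C++ model (my own sketch, not anyone's actual kernel) of the arithmetic one Warp would perform for a single output neuron: 32 lanes, each holding four packed int8 inputs and weights, each doing a dp4a-style dot product, followed by a reduction across lanes. It also shows why the layer width naturally wants to be a multiple of 32 x 4 = 128 inputs:

Code: Select all

#include <cstdint>

// Model of one output neuron computed by a 32-lane Warp, each lane doing a
// 4x8-bit packed multiply-accumulate (what CUDA's __dp4a or a 32-bit SIMD
// dot-product instruction does). Sizes are illustrative.
constexpr int Lanes   = 32;                // threads per Warp (64 for a Wavefront)
constexpr int PerLane = 4;                 // int8 values packed into one 32-bit register
constexpr int InDims  = Lanes * PerLane;   // = 128: natural input width for this layout

// dp4a-style helper: multiply-accumulate four int8 pairs into a 32-bit accumulator.
inline std::int32_t dp4a(const std::int8_t* a, const std::int8_t* b, std::int32_t acc)
{
    for (int k = 0; k < PerLane; ++k)
        acc += int(a[k]) * int(b[k]);
    return acc;
}

std::int32_t neuron(const std::int8_t* in, const std::int8_t* w, std::int32_t bias)
{
    std::int32_t lane_acc[Lanes];

    // Each "lane" handles its own char4 slice of the input; on a GPU all 32
    // lanes would execute this in lockstep.
    for (int lane = 0; lane < Lanes; ++lane)
        lane_acc[lane] = dp4a(in + lane * PerLane, w + lane * PerLane, 0);

    // On a GPU this reduction would be done with warp shuffles; here a plain sum.
    std::int32_t sum = bias;
    for (int lane = 0; lane < Lanes; ++lane)
        sum += lane_acc[lane];
    return sum;
}

If a layer is not a multiple of 128 inputs wide (256 for a Wavefront with char4), some lanes sit idle, which is the motivation for net architectures sized to the SIMD width.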

--
Srdja
Sopel
Posts: 389
Joined: Tue Oct 08, 2019 11:39 pm
Full name: Tomasz Sobczyk

Re: Using Neural or Graphical Processor Units in NNUE engines

Post by Sopel »

smatovic wrote: Mon Aug 07, 2023 1:17 pm There is Intel's Xeon Sapphire Rapids arch with so-called AMX, a TMUL matrix-math compute unit, but until now these are not present in consumer-brand CPUs.
We tried using it. It's trash for level-2 BLAS routines. It's so bad it's worse than SSSE3 for NNUE. Because of this I don't have much hope in matmul accelerators being helpful for NNUE apart from very specialized networks that would be useless on other hardware.
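For context on the level-2 BLAS remark: during search NNUE evaluates one position at a time, so each hidden layer is a small int8 matrix-vector product (GEMV, level-2 BLAS) rather than the large matrix-matrix products (GEMM, level-3) that tile engines like AMX are designed for. Roughly, with hypothetical dimensions:

Code: Select all

#include <cstdint>

// One NNUE hidden layer for a single position is just a small GEMV:
//   out[o] = bias[o] + sum_i W[o][i] * in[i]
// with int8 weights and activations. Dimensions here are illustrative only.
constexpr int InDims  = 32;
constexpr int OutDims = 32;

void gemv_int8(const std::int8_t  W[OutDims][InDims],
               const std::uint8_t in[InDims],
               const std::int32_t bias[OutDims],
               std::int32_t       out[OutDims])
{
    for (int o = 0; o < OutDims; ++o) {
        std::int32_t acc = bias[o];
        for (int i = 0; i < InDims; ++i)
            acc += int(W[o][i]) * int(in[i]);
        out[o] = acc;
    }
}

Each weight is used only once per evaluation, so there is very little arithmetic to amortize a tile engine's setup cost against, which is presumably part of why SSSE3 ends up faster here.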
dangi12012 wrote: No one wants to touch anything you have posted. That proves you now have a negative reputation, since everyone already knows you are a forum troll.

Maybe you copied your Stockfish commits from someone else too?
I will look into that.