Zeta with NNUE on GPU?

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

smatovic
Posts: 2642
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Zeta with NNUE on GPU?

Post by smatovic »

I think it is possible to add the new neural network technique 'NNUE' to Zeta for upcoming GPU architectures like Nvidia Lovelace, Intel Xe and AMD RDNA3, which will probably all support higher-throughput INT8 (8-bit integer) math and maybe offer some 10 to 20 MB of L3 cache per SIMD unit for the network weights file.

With INT8-optimized data types and instructions, one could build a vectorized 8-bit 0x88 move generator which operates over the eight directions as a vector and, with the 32 parallel GPU threads of one SIMD unit, handles all pieces at once, maybe reaching 1 to 2 million nodes per second per SIMD unit in a Zeta-engine-like framework.
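A minimal scalar sketch of that idea, in plain C rather than OpenCL: on the GPU each of the eight 0x88 direction offsets would occupy one SIMD lane, here emulated with a loop. The function and board layout are illustrative, not Zeta code:

```c
#include <assert.h>

/* The eight 0x88 direction offsets; on a GPU each offset would live in
 * one SIMD lane, here we emulate the lanes with a plain loop. */
static const int DIR[8] = { 1, -1, 16, -16, 15, -15, 17, -17 };

/* In the 0x88 scheme a square index is on the board iff (sq & 0x88) == 0. */
static int on_board(int sq) { return (sq & 0x88) == 0; }

/* Hypothetical generator: collect the king moves from 'from' into 'moves'
 * and return the count. A real generator would also mask out own pieces
 * and iterate the slider directions until blocked. */
static int gen_king_moves(int from, int moves[8])
{
    int n = 0;
    for (int d = 0; d < 8; d++) {   /* one GPU lane per direction */
        int to = from + DIR[d];
        if (on_board(to))
            moves[n++] = to;
    }
    return n;
}
```

The off-board test is the whole appeal of 0x88 here: it is one AND per lane, with no per-direction branching on file or rank edges.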

With 32 SIMD GPU threads performing 32x FP32 or 32x2 FP16 operations per clock, NNUE inference could be 2 to 4 times faster than current NNUE on CPUs with AVX2 (roughly estimated), assuming a switch from integer to float weights.

Volta/Turing/Ampere currently have 16 cores per FP SIMD and support doubled throughput for FP16 operations; I guess Nvidia will move back to a 32-core-per-SIMD design with unified INT/FP16 cores with Lovelace. RDNA has 32 cores per SIMD, also with doubled throughput for FP16. Intel seems to use SIMD8 with 8 FP cores for its Xe GPUs (with support for higher throughput at lower precision); maybe Intel will also add some kind of SIMD32, coupling 4 EUs into one compute unit.

So...

- up to 2 Mnps per SIMD unit possible
- up to 4x faster inference for NNUE possible
- up to 160 parallel workers (SIMD units) on current high-end GPUs

Again, just some roughly estimated numbers, to be taken with a big grain of salt...

If all of the above holds, then you get a hell of an NNUE monster on high-end GPUs.

Zeta v099 already has a simple AB framework implemented, with ABDADA or, as an option, RMO Lazy SMP parallel search across SIMD units; hence the main part would be to implement, in an iterative way, all those funny search extensions and tricks Stockfish does in Zeta for GPU - a full-time job ;)

--
Srdja
smatovic
Posts: 2642
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Zeta with NNUE on GPU?

Post by smatovic »

Followup:

I wrote 10 to 20 MB L3 cache per SIMD unit, assuming the whole net should fit
in cache. I doubt that this is common practice with NNUE on CPU; maybe the
first layer, with most of the weights, resides in RAM for the incremental
updates, and only the further layers get cached? Dunno.
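The incremental-update part can be sketched as follows - a toy version with made-up sizes (real NNUE first layers are 256+ neurons wide): when one input feature flips, only a single weight column is added to or subtracted from the accumulator, instead of recomputing the whole first layer, which is why the first-layer weights are the ones touched from memory.

```c
#include <assert.h>

#define HIDDEN   8   /* toy width; real NNUE uses 256+ */
#define FEATURES 4   /* toy feature count */

static short W[FEATURES][HIDDEN];  /* int16 first-layer weights, as in NNUE */
static short acc[HIDDEN];          /* the accumulator */

/* Flip one input feature on/off: touch exactly one weight column
 * instead of recomputing the whole first layer. */
static void feature_on(int f)
{
    for (int i = 0; i < HIDDEN; i++) acc[i] += W[f][i];
}

static void feature_off(int f)
{
    for (int i = 0; i < HIDDEN; i++) acc[i] -= W[f][i];
}
```

A move then becomes a handful of `feature_off`/`feature_on` calls for the squares a piece leaves and enters, which is cheap even if `W` itself lives in RAM.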

--
Srdja
Dann Corbit
Posts: 12538
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Zeta with NNUE on GPU?

Post by Dann Corbit »

Do you have a card you can test it on?
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
smatovic
Posts: 2642
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Zeta with NNUE on GPU?

Post by smatovic »

Dann Corbit wrote: Fri Apr 02, 2021 12:44 am Do you have a card you can test it on?
Nope, I decommissioned my GPU workstation some time ago; the last run of Zeta was
with a Volta V100 rented via Google Cloud Platform. AMD's current RDNA2 with its
L3 cache would be the candidate to verify my estimated numbers.

--
Srdja
smatovic
Posts: 2642
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Zeta with NNUE on GPU?

Post by smatovic »

From the RDNA whitepaper:
More importantly, the compute unit vector registers natively support packed data including two half-precision (16-bit) FP values, four 8-bit integers, or eight 4-bit integers.
https://www.amd.com/system/files/docume ... epaper.pdf

This was called RPM, Rapid Packed Math, on the GCN architecture, but it also has to be supported in the OpenCL software stack -> hands-on microbenchmarking necessary.
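What such a packed-math instruction computes can be emulated in plain C; this is a sketch of a dp4a-style four-way INT8 dot product with INT32 accumulate, which is the operation NNUE's int8 layers would want. Whether the OpenCL stack actually exposes it as a single instruction is exactly what the microbenchmarking has to show:

```c
#include <assert.h>
#include <stdint.h>

/* Emulate a dp4a-style instruction: interpret two 32-bit words as four
 * signed 8-bit lanes each, multiply lane-wise and add to the accumulator.
 * In hardware this is one instruction; here it is a reference model. */
static int32_t dp4a(uint32_t a, uint32_t b, int32_t acc)
{
    for (int i = 0; i < 4; i++) {
        int8_t av = (int8_t)(a >> (8 * i));
        int8_t bv = (int8_t)(b >> (8 * i));
        acc += (int32_t)av * (int32_t)bv;
    }
    return acc;
}
```

With this, a 32-wide int8 dot product is eight `dp4a` calls per thread instead of 32 scalar multiply-adds, which is where the hoped-for 4x comes from.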

--
Srdja
Dann Corbit
Posts: 12538
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Zeta with NNUE on GPU?

Post by Dann Corbit »

Just looking at these two numbers:

Code:

- up to 2 Mnps per SIMD unit possible
...
- up to 160 parallel workers (SIMD units) on current highend-gpus
Seems to indicate 320M NPS on a single card.
Now that's cooking with gas.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
smatovic
Posts: 2642
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Zeta with NNUE on GPU?

Post by smatovic »

I just looked it up, and AMD has the little Navi 24 in the pipeline, with 1024 or
1280 cores, i.e. 32 resp. 40 SIMD units; maybe I will give it a try when this
GPU price madness settles...

--
Srdja
smatovic
Posts: 2642
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Zeta with NNUE on GPU?

Post by smatovic »

Followup:

- I mixed up NNUE's first-layer INT16 and further INT8 weights, so the possible
4x inference speedup holds only if we assume 8-bit vector packed math on GPU.

- I was not able to work out an efficient 8-bit 0x88 vector-based board
representation on pen and paper, hence no 8-bit speedup for move gen in sight.

- Even if I keep the current v099 bitboard design, a switch to 32 GPU threads
per piece-wise worker may pay off, with certain architecture improvements of
AMD's RDNA and increasing GPU clocks in mind; benchmarks will tell.

--
Srdja
smatovic
Posts: 2642
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Zeta with NNUE on GPU?

Post by smatovic »

Zeta NNUE for AMD RDNA2...

I took a look at the RDNA white paper and I pretty much like the architecture.

One SIMD unit has 32 cores, which would fit a piece-wise move generator; it clocks up to 2.5 GHz, and the cache hierarchy of RDNA2 seems to fit NNUE networks with 10 million weights, i.e. about 20 MB.

I am not yet sure if the caches are for textures only or if they can also be used for program data, and according to some benchmarks the latencies are about an order of magnitude higher than on CPUs, hence it remains open how the NNUE inference will perform.

The scratch-pad memory (LDS) shared across one work-group is 32 KB, which is enough for me to store the iterative search variable stacks; there is a constant cache of 16 KB for the lookup tables; the L0 cache is 16 KB, L1 128 KB, L2 4 MB, and L3 varies from 16 to 128 MB. If I use a short2 data type for the first NNUE layer and char4 for the further layers, this should fit across the L0 to L3 caches. An alternative is to store the INT8 weights in the vector register file, 128 KB per SIMD.
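A back-of-the-envelope check of that budget, with made-up but NNUE-like shapes (a dominant int16 first layer of roughly 10 million weights, plus small int8 further layers; the exact layer sizes are assumptions):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical NNUE-like weight footprint: a dominant int16 first layer
 * of roughly 10 million weights (stored as short2 pairs) plus small int8
 * (char4) further layers. */
static size_t net_bytes(void)
{
    size_t first = (size_t)10 * 1000 * 1000 * sizeof(short); /* ~20 MB */
    size_t rest  = (size_t)(512 * 32 + 32 * 32 + 32) * sizeof(char);
    return first + rest;
}
```

So the net clearly overflows L0 through L2 and has to lean on the L3, which is why the 16 MB versus 128 MB L3 configurations matter.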

Still not sure if 8-bit vector packed math is supported via OpenCL, to speed up NN inference.

I set up a machine with a GTX 750 for development again, with the aim of purchasing the little Navi 24 with AMD RDNA2 arch when available.

--
Srdja
smatovic
Posts: 2642
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Zeta with NNUE on GPU? - basic benchmarks

Post by smatovic »

Haha, okay, I did some basic benchmarks with my v099 bitboard, 64-GPU-threads-per-worker design; I am stuck at 100 Knps to max 200 Knps per worker, w/o any NNUE implementation - far too slow to compete with NNUE engines on CPUs...

--
Srdja