AVX-512 and NNUE

Gian-Carlo Pascutto
Posts: 1243
Joined: Sat Dec 13, 2008 7:00 pm

AVX-512 and NNUE

Post by Gian-Carlo Pascutto »

Has anyone benchmarked Stockfish NNUE's performance on an AVX-512 system with an AVX-512 build? How does it compare to a Zen2 AVX2 build?
mmt
Posts: 343
Joined: Sun Aug 25, 2019 8:33 am
Full name: .

Re: AVX-512 and NNUE

Post by mmt »

Is there a build available? I could compile but it'd be easier if somebody had it ready.
Vinvin
Posts: 5228
Joined: Thu Mar 09, 2006 9:40 am
Full name: Vincent Lejeune

Re: AVX-512 and NNUE

Post by Vinvin »

Gian-Carlo Pascutto wrote: Tue Sep 08, 2020 8:15 pm Has anyone benchmarked Stockfish NNUE's performance on an AVX-512 system with an AVX-512 build? How does it compare to a Zen2 AVX2 build?
Some fun facts about Intel and reduced Turbo Boost frequency:
https://en.wikipedia.org/wiki/Advanced_ ... wnclocking
Downclocking

Since AVX instructions are wider and generate more heat, Intel processors have provisions to reduce the Turbo Boost frequency limit when such instructions are being executed. The throttling is divided into three levels:[43][44]

L0 (100%): The normal turbo boost limit.
L1 (~85%): The "AVX boost" limit. Soft-triggered by 256-bit "heavy" (floating-point unit: FP math and integer multiplication) instructions. Hard-triggered by "light" (all other) 512-bit instructions.
L2 (~60%): The "AVX-512 boost" limit. Soft-triggered by 512-bit heavy instructions.

The frequency transition can be soft or hard. Hard transition means the frequency is reduced as soon as such an instruction is spotted; soft transition means that the frequency is reduced only after reaching a threshold number of matching instructions. The limit is per-thread.[43]

Downclocking means that using AVX in a mixed workload with an Intel processor can incur a frequency penalty despite it being faster in a "pure" context. Avoiding the use of wide and heavy instructions helps minimize the impact in these cases. AVX-512VL allows for using 256-bit or 128-bit operands in AVX-512, making it a sensible default for mixed loads.[45]
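To put the quoted throttling levels into numbers, here is a rough Python sketch of the break-even arithmetic. It assumes throughput scales linearly with clock speed and takes the approximate percentages (100%, ~85%, ~60%) straight from the text above; the names `LICENSE_LEVELS` and `break_even_speedup` are made up for illustration.

```python
# Approximate turbo limits per level, as fractions of the normal limit.
# Values are the rough percentages quoted above, not measurements.
LICENSE_LEVELS = {"L0": 1.00, "L1": 0.85, "L2": 0.60}


def break_even_speedup(clock_fraction: float) -> float:
    """Per-cycle speedup the wider code needs just to match the
    unthrottled code, assuming throughput is proportional to clock."""
    return 1.0 / clock_fraction


if __name__ == "__main__":
    for level, fraction in LICENSE_LEVELS.items():
        needed = break_even_speedup(fraction)
        print(f"{level}: needs at least {needed:.2f}x work per cycle")
```

Under these assumptions, code that drops the core to L1 has to do about 1.18x the work per cycle to break even, and code that drops it to L2 about 1.67x.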
MMarco
Posts: 195
Joined: Sun Apr 12, 2020 1:09 am
Full name: Marc-O Moisan-Plante

Re: AVX-512 and NNUE

Post by MMarco »

mmt wrote: Tue Sep 08, 2020 10:15 pm Is there a build available? I could compile but it'd be easier if somebody had it ready.
I don't have an appropriate CPU for this, but on the Stockfish Discord a few people do, and they have shared information and binaries. Here is one: https://gofile.io/d/G3mkFy
Navs wrote: Some highly optimised Binaries for Intel Skylake CPU's
Have increased nps by 38% and now running at 70% of SF speed
Otherwise, the nodchip pure NNUE release from 07/19 also has AVX-512 binaries: https://github.com/nodchip/Stockfish/re ... 2020-07-19
Last edited by MMarco on Wed Sep 09, 2020 12:03 am, edited 1 time in total.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: AVX-512 and NNUE

Post by Daniel Shawul »

Not sure if it is using AVX-512, but on Navs' stream there are two Stockfish NNUE builds playing, one with vnni256 and one with vnni512.
syzygy
Posts: 5566
Joined: Tue Feb 28, 2012 11:56 pm

Re: AVX-512 and NNUE

Post by syzygy »

Gian-Carlo Pascutto wrote: Tue Sep 08, 2020 8:15 pm Has anyone benchmarked Stockfish NNUE's performance on an AVX-512 system with an AVX-512 build? How does it compare to a Zen2 AVX2 build?
This seems to get close:
https://github.com/official-stockfish/S ... -678426702
voffka
Posts: 288
Joined: Sat Jun 30, 2018 10:58 pm
Location: Ukraine
Full name: Volodymyr Shcherbyna

Re: AVX-512 and NNUE

Post by voffka »

I could produce such a build, but for Igel :)

Unfortunately none of my machines is AVX-512 capable, so I have not produced this build yet because I could not test it. If someone could test it, that would be great.
mmt
Posts: 343
Joined: Sun Aug 25, 2019 8:33 am
Full name: .

Re: AVX-512 and NNUE

Post by mmt »

MMarco wrote: Tue Sep 08, 2020 11:58 pm I don't have an appropriate CPU for this, but on the Stockfish Discord a few people do, and they have shared information and binaries. Here is one: https://gofile.io/d/G3mkFy
This build crashes on startup for me.
MMarco wrote: Tue Sep 08, 2020 11:58 pm Otherwise, the nodchip pure NNUE release from 07/19 also has AVX-512 binaries: https://github.com/nodchip/Stockfish/re ... 2020-07-19
Here are the benchmark results for builds with 256x2 for 2 x Xeon Gold 6246, 512 MB cache (24 cores, 48 hyperthreads), average of 3 runs with depth 16:
Ratio of NPS for AVX-512 as compared to AVX2 and BMI2:
Threads  AVX-512/AVX2  AVX-512/BMI2
1        1.011         1.110
2        1.004         1.095
4        0.935         1.045
8        0.946         1.035
16       0.946         1.029
32       0.934         1.025
45       0.963         1.051

Raw NPS:
Threads  AVX2      BMI2      AVX-512
1        1284745   1169957   1298495
2        2600270   2383164   2610592
4        5316033   4752857   4968256
8        10672936  9757955   10096322
16       20227740  18598880  19139049
32       34599384  31519777  32309726
45       39815839  36463132  38324966

So AVX-512 was best at only 1 and 2 threads. The AVX2 build was best otherwise.
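For anyone who wants to check the ratio table against the raw numbers, a small Python sketch follows. The NPS figures are copied from the raw table above; the names `RAW_NPS` and `ratios` are just illustrative.

```python
# Raw NPS from the table above: threads -> (AVX2, BMI2, AVX-512).
RAW_NPS = {
    1: (1284745, 1169957, 1298495),
    2: (2600270, 2383164, 2610592),
    4: (5316033, 4752857, 4968256),
    8: (10672936, 9757955, 10096322),
    16: (20227740, 18598880, 19139049),
    32: (34599384, 31519777, 32309726),
    45: (39815839, 36463132, 38324966),
}


def ratios(threads: int) -> tuple[float, float]:
    """Return (AVX-512/AVX2, AVX-512/BMI2), rounded to 3 decimals."""
    avx2, bmi2, avx512 = RAW_NPS[threads]
    return round(avx512 / avx2, 3), round(avx512 / bmi2, 3)


if __name__ == "__main__":
    print("Threads  AVX-512/AVX2  AVX-512/BMI2")
    for threads in RAW_NPS:
        vs_avx2, vs_bmi2 = ratios(threads)
        print(f"{threads:<8} {vs_avx2:<13.3f} {vs_bmi2:.3f}")
```

Only the thread counts where the first ratio is above 1.0 (1 and 2 threads) favour the AVX-512 build over AVX2.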

syzygy
Posts: 5566
Joined: Tue Feb 28, 2012 11:56 pm

Re: AVX-512 and NNUE

Post by syzygy »

mmt wrote: Wed Sep 09, 2020 5:35 am
So AVX-512 was best at only 1 and 2 threads. The AVX2 build was best otherwise.
What puzzles me about these results is that the AVX2 build is faster than the BMI2 build. The BMI2 build should include the AVX2 code (and add pext-based move generation), so it should be faster than AVX2 on Intel (not on AMD/Zen, but you are testing on Intel).
mmt
Posts: 343
Joined: Sun Aug 25, 2019 8:33 am
Full name: .

Re: AVX-512 and NNUE

Post by mmt »

syzygy wrote: Wed Sep 09, 2020 1:30 pm What puzzles me about these results is that the AVX2 build is faster than the BMI2 build. The BMI2 build should include the AVX2 code (and add pext-based move generation), so it should be faster than AVX2 on Intel (not on AMD/Zen, but you are testing on Intel).
I was surprised too. But I re-ran the test to double-check and the result is the same, so I don't think I made a mistake.
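In case it helps others repeat the comparison, here is a minimal Python sketch of averaging several bench runs. It is not the script used for the numbers above. It assumes the binary accepts the usual Stockfish-style `bench <hash> <threads> <depth>` arguments and prints a `Nodes/second : N` summary line; the binary path, the default hash size and the sample output in the demo are hypothetical.

```python
import re
import subprocess
from statistics import mean

# Assumed format of the summary line printed at the end of a bench run.
NPS_PATTERN = re.compile(r"Nodes/second\s*:\s*(\d+)")


def parse_nps(bench_output: str) -> int:
    """Extract the nodes-per-second figure from bench output text."""
    match = NPS_PATTERN.search(bench_output)
    if match is None:
        raise ValueError("no 'Nodes/second' line found in bench output")
    return int(match.group(1))


def run_bench(binary: str, threads: int, hash_mb: int = 16, depth: int = 16) -> int:
    """Run one bench and return its NPS. The argument order is an assumption."""
    result = subprocess.run(
        [binary, "bench", str(hash_mb), str(threads), str(depth)],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge streams so the summary is captured
        text=True,
        check=True,
    )
    return parse_nps(result.stdout)


def average_nps(binary: str, threads: int, runs: int = 3) -> float:
    """Average NPS over several runs to smooth out run-to-run noise."""
    return mean(run_bench(binary, threads) for _ in range(runs))


if __name__ == "__main__":
    # Made-up sample output, only to show the parser working.
    sample = "Total time (ms) : 2345\nNodes searched  : 3456789\nNodes/second    : 1474110\n"
    print(parse_nps(sample))
```

Calling something like `average_nps("./stockfish-avx2", threads=4)` for each build and thread count would give figures comparable in form to the raw NPS table, with the binary name here being a placeholder.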