Not long ago, many of us, myself included, believed that using hyperthreads for engine play was sub-optimal. At one point Fishtest's wiki advised users not to use hyperthreads, and Houdini once shipped with a similar disclaimer. This has since been refuted by this test, in which Noobpwnftw ran a set of games on a machine with 192 real cores and 384 hyperthreads. His setup is the worst case of core doubling: the core count doubled, but each individual core was weakened by being split into two threads. Under this method, the resulting elo gain is a lower bound on the elo per core doubling. From it, we can take +20 elo as a reasonable figure for any core doubling, even in the worst case.
To check these claims, I took the most recent Ethereal playing at TCEC and CCC, which uses a 2x128 Shogi-HalfKP NNUE as its primary evaluation feature. Like all other NNUE engines, Ethereal will evaluate with the NNUE in most positions, aside from those with massive material or positional imbalance.
Below is a template for each cutechess command used to generate the games; N, K, and T are placeholders for the two thread counts and the time control. The tests use the widely used 4moves_noob opening book, which was recently the default book on Fishtest (and may still be), as well as the default book for OpenBench testing.
Code:

./cutechess -repeat -recover -variant standard \
    -resign movecount=3 score=400 -draw movenumber=40 movecount=8 score=20 \
    -concurrency 8 -games 64000 \
    -engine cmd=Ethereal option.Threads=N proto=uci tc=T name=Ethereal-NNUE-NC \
    -engine cmd=Ethereal option.Threads=K proto=uci tc=T name=Ethereal-NNUE-KC \
    -openings file=4moves_noob.pgn format=pgn order=random
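As a concrete example of the template filled in (my own instantiation, not a command copied from the original runs), here is the 2-thread vs 1-thread match at 10.0+0.1, with -games set to the 1000 games actually reported for that pairing:

Code:

./cutechess -repeat -recover -variant standard \
    -resign movecount=3 score=400 -draw movenumber=40 movecount=8 score=20 \
    -concurrency 8 -games 1000 \
    -engine cmd=Ethereal option.Threads=2 proto=uci tc=10.0+0.1 name=Ethereal-NNUE-2C \
    -engine cmd=Ethereal option.Threads=1 proto=uci tc=10.0+0.1 name=Ethereal-NNUE-1C \
    -openings file=4moves_noob.pgn format=pgn order=random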
Code:
10.0s+0.1s
Ethereal-NNUE-2C vs Ethereal-NNUE-1C: 314 - 73 - 613 [0.621] 1000 +86
Ethereal-NNUE-4C vs Ethereal-NNUE-2C: 290 - 74 - 636 [0.608] 1000 +76
Ethereal-NNUE-8C vs Ethereal-NNUE-4C: 621 - 171 - 1708 [0.590] 2500 +63
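For anyone who wants to check the elo figures above, they follow from the standard logistic model: with score fraction s = (wins + draws/2) / games, elo = -400 * log10(1/s - 1). A quick awk one-liner (my sketch, not part of the original runs) reproduces the first match's result to within a point of rounding:

Code:

awk 'BEGIN { w = 314; l = 73; d = 613; n = w + l + d
             s = (w + d/2) / n                    # score fraction
             printf "score %.3f -> %+.0f elo\n",  # awk log() is natural log
                    s, -400 * log(1/s - 1) / log(10) }'

This prints "score 0.621 -> +85 elo"; the other two lines work out the same way.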
I believe that what other users are calling a scaling issue in NNUE is actually a live example of the diminishing returns of superior software as one approaches the elo ceiling. It is well known that the draw rate of chess increases as the time control lengthens, and this compresses elo. Stockfish gained an inordinate amount of strength with the introduction of NNUE, as did every engine that has followed in Stockfish's footsteps. As strength increases, so does the draw rate, and thus the elo difference derived from a fixed set of games decreases.
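To put numbers on that compression, consider a toy model of my own (an illustration, not data from the runs above): fix the stronger side's win:loss ratio among decisive games at 2:1 and vary only the draw rate.

Code:

awk 'BEGIN { split("0.40 0.80", dr)               # two hypothetical draw rates
             for (i = 1; i <= 2; i++) {
                 d = dr[i]
                 s = (2.0/3.0) * (1 - d) + d/2    # 2:1 wins:losses among decisive games
                 printf "draw rate %.0f%%: score %.3f -> %+.0f elo\n",
                        100 * d, s, -400 * log(1/s - 1) / log(10) } }'

The same 2:1 superiority measures as roughly +70 elo at a 40% draw rate but only about +23 elo at 80%: doubling the draw rate cuts the measured gap by two thirds, with no change in the underlying skill difference.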
As of writing this, only two data-based arguments have been presented to me. One rests on a sample of fewer than 100 games. Anyone who has worked on a chess engine, or watched SPRT values fluctuate on Fishtest or OpenBench, knows that small samples are not strong indicators of final results. Such is the way of statistics.
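A back-of-envelope error bar shows why (a sketch under assumed conditions: a dead-even match with a 60% draw rate). The per-game variance of the score, combined with the slope of the elo curve near 50%, gives the 95% confidence interval on 100 games:

Code:

awk 'BEGIN { n = 100; s = 0.5; d = 0.6            # games, score, assumed draw rate
             pw = (1 - d) / 2                     # equal wins and losses
             var = pw * 1 + d * 0.25 - s * s      # E[x^2] - s^2, for x in {0, 0.5, 1}
             se = sqrt(var / n)
             eps = 400 / (log(10) * s * (1 - s))  # elo per unit of score near s
             printf "95%% interval: +/- %.0f elo\n", 1.96 * se * eps }'

That comes to roughly +/- 43 elo, an uncertainty comparable in size to the per-doubling gains measured above, which is why a sub-100-game sample settles nothing.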
The other argument I have seen states that CCRL supports these claims, but no specific data has been pointed to, nor any explanation of it offered. CCRL has additional variance built in: the choice of openings differs by tester, the hardware differs by tester, the pool of opponents varies by engine, and the presence of multiple versions of individual engines distorts the ratings of others by over-representing engines with particular features. I ask those who view CCRL as evidence for their claims: why could I not reproduce such a result in a test with strong controls?
PGNs of the games played can be found here: