Not long ago, many of us, myself included, believed that using hyperthreads for engine play was sub-optimal. At one point Fishtest's wiki advised users not to use hyperthreads, and Houdini once shipped with a similar disclaimer. This has since been refuted by this test, in which Noobpwnftw ran a set of games on a machine with 192 real cores and 384 hyperthreads. His setup is the worst case of core doubling: the core count doubled, but each individual core was weakened by being split into two threads. Under this method, the resulting elo gain is a lower bound on the elo per core doubling. From it, we can take +20 elo as a reasonable figure for any core doubling, even in the worst case.
To check these claims, I took the most recent Ethereal playing at TCEC and CCC, which uses a 2x128 Shogi-HalfKP NNUE as its primary evaluation feature. Like all other NNUE engines, Ethereal will evaluate with the NNUE in most positions, aside from those with massive material or positional imbalance.
Below is a template for each cutechess command used to generate the games; N, K, and T are placeholders for the two thread counts and the time control. The tests use the widely used 4moves_noob opening book, which was recently the default book on Fishtest (and may still be), as well as the default book for OpenBench testing.
Code:

./cutechess -repeat -recover -variant standard \
    -resign movecount=3 score=400 -draw movenumber=40 movecount=8 score=20 \
    -concurrency 8 -games 64000 \
    -engine cmd=Ethereal option.Threads=N proto=uci tc=T name=Ethereal-NNUE-NC \
    -engine cmd=Ethereal option.Threads=K proto=uci tc=T name=Ethereal-NNUE-KC \
    -openings file=4moves_noob.pgn format=pgn order=random
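As a concrete example of the template filled in (my own instantiation, not a command copied from the original runs), here is the 2-thread vs 1-thread match at 10.0+0.1, with -games set to the 1000 games actually reported for that pairing:

Code:

./cutechess -repeat -recover -variant standard \
    -resign movecount=3 score=400 -draw movenumber=40 movecount=8 score=20 \
    -concurrency 8 -games 1000 \
    -engine cmd=Ethereal option.Threads=2 proto=uci tc=10.0+0.1 name=Ethereal-NNUE-2C \
    -engine cmd=Ethereal option.Threads=1 proto=uci tc=10.0+0.1 name=Ethereal-NNUE-1C \
    -openings file=4moves_noob.pgn format=pgn order=random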
Code:
10.0s+0.1s
Ethereal-NNUE-2C vs Ethereal-NNUE-1C: 314 - 73 - 613 [0.621] 1000 +86
Ethereal-NNUE-4C vs Ethereal-NNUE-2C: 290 - 74 - 636 [0.608] 1000 +76
Ethereal-NNUE-8C vs Ethereal-NNUE-4C: 621 - 171 - 1708 [0.590] 2500 +63
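For anyone who wants to check the elo figures above, they follow from the standard logistic model: with score fraction s = (wins + draws/2) / games, elo = -400 * log10(1/s - 1). A quick awk one-liner (my sketch, not part of the original runs) reproduces the first match's result to within a point of rounding:

Code:

awk 'BEGIN { w = 314; l = 73; d = 613; n = w + l + d
             s = (w + d/2) / n                    # score fraction
             printf "score %.3f -> %+.0f elo\n",  # awk log() is natural log
                    s, -400 * log(1/s - 1) / log(10) }'

This prints "score 0.621 -> +85 elo"; the other two lines work out the same way.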
I believe that what other users are calling a scaling issue in NNUE is actually a live example of the diminishing returns of superior software as one approaches the elo ceiling. It is well known that the draw rate of chess increases as the time control lengthens, and this compresses elo. Stockfish gained an inordinate amount of strength with the introduction of NNUE, as did every engine that has followed in Stockfish's footsteps. As strength increases, so does the draw rate, and thus the elo difference derived from a fixed set of games decreases.
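To put numbers on that compression, consider a toy model of my own (an illustration, not data from the runs above): fix the stronger side's win:loss ratio among decisive games at 2:1 and vary only the draw rate.

Code:

awk 'BEGIN { split("0.40 0.80", dr)               # two hypothetical draw rates
             for (i = 1; i <= 2; i++) {
                 d = dr[i]
                 s = (2.0/3.0) * (1 - d) + d/2    # 2:1 wins:losses among decisive games
                 printf "draw rate %.0f%%: score %.3f -> %+.0f elo\n",
                        100 * d, s, -400 * log(1/s - 1) / log(10) } }'

The same 2:1 superiority measures as roughly +70 elo at a 40% draw rate but only about +23 elo at 80%: doubling the draw rate cuts the measured gap by two thirds, with no change in the underlying skill difference.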
As of writing this, only two data-based arguments have been presented to me. One rests on a sample of fewer than 100 games. Anyone who has worked on a chess engine, or watched SPRT values fluctuate on Fishtest or OpenBench, knows that small samples are not strong indicators of final results. Such is the way of statistics.
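A back-of-envelope error bar shows why (a sketch under assumed conditions: a dead-even match with a 60% draw rate). The per-game variance of the score, combined with the slope of the elo curve near 50%, gives the 95% confidence interval on 100 games:

Code:

awk 'BEGIN { n = 100; s = 0.5; d = 0.6            # games, score, assumed draw rate
             pw = (1 - d) / 2                     # equal wins and losses
             var = pw * 1 + d * 0.25 - s * s      # E[x^2] - s^2, for x in {0, 0.5, 1}
             se = sqrt(var / n)
             eps = 400 / (log(10) * s * (1 - s))  # elo per unit of score near s
             printf "95%% interval: +/- %.0f elo\n", 1.96 * se * eps }'

That comes to roughly +/- 43 elo, an uncertainty comparable in size to the per-doubling gains measured above, which is why a sub-100-game sample settles nothing.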
The other argument I have seen states that CCRL supports these claims, but no specific data has been pointed to, nor any explanation of it offered. CCRL has additional variance built in: the choice of openings differs by tester, the hardware differs by tester, the pool of opponents varies by engine, and the presence of multiple versions of individual engines distorts the ratings of others by over-representing engines with particular features. I ask those who view CCRL as evidence for their claims: why could I not reproduce such a result in a test with strong controls?
PGNs of the games played can be found here: