CCRL flawed testing : SF12 above SF12 8CPU

Modern Times · Post by **Modern Times** » Sat Oct 10, 2020 12:11 pm

hgm wrote: ↑Sat Oct 10, 2020 11:26 am If SF 8-core has 'underperformed' in terms of Elo because it had too many NN opponents, this can be considered evidence for the above hypothesis.

There will be a wider range of opponents in the update later today. It was probably unwise to publish just the games we did, but that is how we work: we upload the games as they are played, and people need to be mindful that the ratings can change as games accumulate and the opponent pool widens. It has pulled ahead of 1CPU now, but not by as much as you'd expect.

It also depends on the rating tool you use, and you did touch on that a bit. With the list as it stands, bayeselo has 8CPU behind 1CPU. If you run that same database through Ordo, that switches and Ordo has 8CPU as stronger. But we are talking about small numbers.

Guenther · Post by **Guenther** » Sat Oct 10, 2020 1:04 pm

Modern Times wrote: ↑Sat Oct 10, 2020 12:11 pm
hgm wrote: ↑Sat Oct 10, 2020 11:26 am If SF 8-core has 'underperformed' in terms of Elo because it had too many NN opponents, this can be considered evidence for the above hypothesis.
There will be a wider range of opponents in the update later today. It was probably unwise to publish just the games we did, but that is how we work: we upload the games as they are played, and people need to be mindful that the ratings can change as games accumulate and the opponent pool widens. It has pulled ahead of 1CPU now, but not by as much as you'd expect.

It also depends on the rating tool you use, and you did touch on that a bit. With the list as it stands, bayeselo has 8CPU behind 1CPU. If you run that same database through Ordo, that switches and Ordo has 8CPU as stronger. But we are talking about small numbers.

Can you release games with depth/eval/time info from the 8core tests vs. non GPU/non 8core hardware, if those exist already?
(note for others: games for the blitz list are always released as bare PGN without any info, probably to save server space)

What is the actual TC currently used on the 8-core machine(s) in blitz testing, according to your formula?

Modern Times · Post by **Modern Times** » Sat Oct 10, 2020 1:49 pm

According to Ordo we now have 8CPU as +48 Elo to 1CPU. The ratings list will not show that because it uses bayeselo - that I think will show it as +30 Elo. However 1CPU has currently not played the GPU engines that 8CPU has. That is planned for the coming week, and I suspect that may depress the 1CPU rating a little and widen the gap between 1CPU and 8CPU.

We are getting very variable and strange results from Stockfish 12 on more than 1CPU and that is causing a lot of concern and investigation, especially on the blitz list with the GPU engines. Not so much of an issue on 40/15. In the pre NN and NNUE world opponent selection and common opponents weren't as critical as they are now.

hgm · Post by **hgm** » Sat Oct 10, 2020 4:20 pm

That it is critical proves the underlying model is wrong: for a correct model it should not matter how you select opponents!

As data accumulates we should be able to get a statistically significant impression for how average result depends on rating difference for NN engines against each other, as well as for AB vs NN engines. And if this is very different from how this is for AB engines against each other (which we should already know), we should switch to using a rating calculator that takes this into account. It could be that this requires labeling each engine with a pair of ratings, one they would get in the pool of AB engines, and another they would get in a pool of NN engines. And that these ratings can be substantially different. But then neither of those would be dependent on opponent choice anymore.
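One possible reading of the "pair of ratings" idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the `Engine` class, the `rating_against` helper and all rating numbers are made up here, and the ordinary logistic Elo curve is assumed to hold within each pairing. It is not how bayeselo or Ordo work.

```python
from dataclasses import dataclass


@dataclass
class Engine:
    name: str
    kind: str            # "AB" (alpha-beta) or "NN" (neural network)
    rating_vs_ab: float  # rating it would get in a pool of AB engines
    rating_vs_nn: float  # rating it would get in a pool of NN engines


def rating_against(engine: Engine, opponent: Engine) -> float:
    """Pick the rating that matches the opponent's type."""
    return engine.rating_vs_ab if opponent.kind == "AB" else engine.rating_vs_nn


def expected_score(a: Engine, b: Engine) -> float:
    """Expected score of a against b, logistic curve on the type-matched ratings."""
    diff = rating_against(a, b) - rating_against(b, a)
    return 1.0 / (1.0 + 10 ** (-diff / 400))


# Hypothetical engines with made-up ratings, for illustration only.
ab_engine = Engine("ab_engine", "AB", rating_vs_ab=3400, rating_vs_nn=3350)
nn_engine = Engine("nn_engine", "NN", rating_vs_ab=3450, rating_vs_nn=3400)

print(f"{expected_score(ab_engine, nn_engine):.3f}")
```

With such a labeling, the prediction for a given pairing uses only the two ratings relevant to that pairing, so it would no longer shift with the mix of opponents each engine happened to meet.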

mwyoung · Post by **mwyoung** » Sat Oct 10, 2020 7:02 pm

Modern Times wrote: ↑Sat Oct 10, 2020 1:49 pm According to Ordo we now have 8CPU as +48 Elo to 1CPU. The ratings list will not show that because it uses bayeselo - that I think will show it as +30 Elo. However 1CPU has currently not played the GPU engines that 8CPU has. That is planned for the coming week, and I suspect that may depress the 1CPU rating a little and widen the gap between 1CPU and 8CPU.

We are getting very variable and strange results from Stockfish 12 on more than 1CPU and that is causing a lot of concern and investigation, especially on the blitz list with the GPU engines. Not so much of an issue on 40/15. In the pre NN and NNUE world opponent selection and common opponents weren't as critical as they are now.

Yes! And I have raised the same concerns here as well. One-tier testing at bullet time controls, and testing Stockfish NNUE with only 1 core, will give you an overinflated impression of Stockfish NNUE's true strength. The reason is simple: Stockfish NNUE does not scale like other engines with more cores and time.

Ajedrecista · Post by **Ajedrecista** » Sat Oct 10, 2020 7:52 pm

Hello Kai:

Third and hopefully last attempt at the minimum draw ratio D_min. The other two are wrong. It is easy to check that +122 =0 -78 after 200 games translates into +77.7 Elo and the draw ratio is zero, so D_min. is neither 20.43% nor 10.22%:

Code:

Elo_diff. ~ 77.7 Elo after 200 games. Possible outcomes:

Wins   Draws   Losses
 44     156       0
 45     154       1
 46     152       2
[...]
120       4      76
121       2      77
122       0      78

D_min. will be 0 or 1/n (0 or 1 draws after n games), depending on W - L, which is fixed by the Elo difference. I use the following approach:

Code:

W - L = [10^(Elo_diff./400) - 1]/[10^(Elo_diff./400) + 1]

Assuming Elo_diff. >= 0 Elo after n games (Elo_diff. < 0 Elo swaps wins and losses):

Losses --> Wins = Losses + n*(W - L) --> Draws = n - Losses - Wins

Losses         Wins                  Draws
  0       0 + n*(W - L)     n - 0 - [0 + n*(W - L)]
  1       1 + n*(W - L)     n - 1 - [1 + n*(W - L)]
  2       2 + n*(W - L)     n - 2 - [2 + n*(W - L)]
[...]
  k       k + n*(W - L)     n - k - [k + n*(W - L)]

Draws = n*[1 - (W - L)] - 2*k     // k is a non-negative integer.

Minimum number of draws: 0 or 1? --> mod 2

{n*[1 - (W - L)] - 2*k} mod 2 = {n*[1 - (W - L)]} mod 2 - (2*k) mod 2 = {n*[1 - (W - L)]} mod 2 - 0 = {n*[1 - (W - L)]} mod 2

D_min. = (1/n)*({n*[1 - (W - L)]} mod 2)
D_min. = (1/n) * { [ n * ( 1 - { [ 10^(Elo_diff./400) - 1] / [10^(Elo_diff./400) + 1] } ) ] mod 2 }

In the particular case of Elo_diff. ~ 77.7 Elo after n = 200 games:

Code:

W - L = [10^(Elo_diff./400) - 1]/[10^(Elo_diff./400) + 1]
// Elo_diff. = 77.7 ==> W - L = 0.22

{n*[1 - (W - L)]} mod 2 = [200*(1 - 0.22)] mod 2 = 156 mod 2 = 0

D_min. = 0
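The derivation above can be checked with a short Python sketch. The function names are mine, and one assumption is added that the post only makes implicitly: n*(W - L) is rounded to a whole number of games (the post rounds W - L to 0.22, which gives 44).

```python
def win_minus_loss(elo_diff: float) -> float:
    """W - L as a fraction of games, from the logistic Elo formula."""
    x = 10 ** (elo_diff / 400)
    return (x - 1) / (x + 1)


def outcomes(elo_diff: float, n: int) -> list[tuple[int, int, int]]:
    """All (wins, draws, losses) triples over n games for elo_diff >= 0."""
    margin = round(n * win_minus_loss(elo_diff))  # wins - losses, in whole games
    triples = []
    losses = 0
    while 2 * losses + margin <= n:
        wins = losses + margin
        triples.append((wins, n - wins - losses, losses))
        losses += 1
    return triples


def min_draw_ratio(elo_diff: float, n: int) -> float:
    """D_min = (1/n) * ({n * [1 - (W - L)]} mod 2), with n*(W - L) rounded."""
    margin = round(n * win_minus_loss(elo_diff))
    return ((n - margin) % 2) / n


table = outcomes(77.7, 200)
print(table[0], table[-1])        # first and last rows of the table
print(min_draw_ratio(77.7, 200))  # minimum draw ratio
```

For 77.7 Elo and 200 games the first row is 44 wins, 156 draws, 0 losses and the last row is 122 wins, 0 draws, 78 losses, matching the table above.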

Back on topic, I will keep an eye on the research into what is happening with SF 12 ratings.

Regards from Spain.

Ajedrecista.

Modern Times · Post by **Modern Times** » Sat Oct 10, 2020 8:03 pm

mwyoung wrote: ↑Sat Oct 10, 2020 7:02 pm
Yes! And I have raised the same concerns here as well. One-tier testing at bullet time controls, and testing Stockfish NNUE with only 1 core, will give you an overinflated impression of Stockfish NNUE's true strength. The reason is simple: Stockfish NNUE does not scale like other engines with more cores and time.

It is doing my head in for sure. The somewhat unstructured and ad-hoc nature of our testing doesn't help in this situation either, although with enough games that usually eventually resolves itself. To get to the bottom of it you need to do some structured testing with exactly the same opponents, same hardware and testing conditions - which you and others have done or are doing.

mwyoung · Post by **mwyoung** » Sat Oct 10, 2020 8:23 pm

Modern Times wrote: ↑Sat Oct 10, 2020 8:03 pm
mwyoung wrote: ↑Sat Oct 10, 2020 7:02 pm
Yes! And I have raised the same concerns here as well. One-tier testing at bullet time controls, and testing Stockfish NNUE with only 1 core, will give you an overinflated impression of Stockfish NNUE's true strength. The reason is simple: Stockfish NNUE does not scale like other engines with more cores and time.
It is doing my head in for sure. The somewhat unstructured and ad-hoc nature of our testing doesn't help in this situation either, although with enough games that usually eventually resolves itself. To get to the bottom of it you need to do some structured testing with exactly the same opponents, same hardware and testing conditions - which you and others have done or are doing.

Do not be dismayed! There are only a few of us who are crazy enough to do multi-tier testing, and testing with more than one core, under the current testing paradigm. You are discovering what I have already discovered, and our results agree.

mwyoung · Post by **mwyoung** » Sat Oct 10, 2020 11:22 pm

mwyoung wrote: ↑Sat Oct 10, 2020 7:23 am Lasko's Law----What's not clear? 3 doublings in cores mean nowadays at least 2.5 real effective doublings in TC. Each effective doubling in TC in these blitz conditions means at very least 40 Elo points, therefore at very least 80 Elo points 1 core -> 8 cores. In fact more likely 120 - 140 Elo points. That result posted in OP and discrepancy beyond doubt break the Elo model.

It is clear to me that Stockfish NNUE does not obey Lasko's law as stated above. CCRL most likely does not have flawed testing. As suspected, the issue is with Stockfish NNUE. It took me many hours of testing to show this result, and the full results will be shown soon, when the testing is completed. The bottom line is that the issue is with Stockfish NNUE, and not with CCRL testing. As you know, testing can take days to resolve this kind of anomaly or false assumption.

All matches were run under the same conditions: TC = 2m+1s, the same book (Perfect Book 2019) and the same settings. The CPU was a 2950X with all cores locked to 4.1 GHz.

Stockfish 11 with a classical evaluation obeys Lasko's Law. But assuming that Stockfish 12, a hybrid with the new NN evaluation, would also follow Stockfish's classical pattern was an error. Stockfish 12 does not obey Lasko's Law.

I tested two versions of Stockfish 12, version 12 and version 12 (051020), to make sure this behavior was not limited to the original Stockfish 12.

Stockfish 11 1 vs 8 cores +147.2 Elo
Stockfish 12 1 vs 8 cores +77.7 Elo
Stockfish 051020 1 vs 8 cores +54.3 Elo
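These three figures can be reproduced from the win/draw/loss counts in the result tables of this post. A minimal Python sketch, assuming the usual logistic Elo formula; the function name and the "per core doubling" column (the total divided by the 3 doublings from 1 to 8 cores) are my own additions:

```python
import math


def elo_from_results(wins: int, draws: int, losses: int) -> float:
    """Elo difference implied by a match score (logistic model)."""
    games = wins + draws + losses
    score = (wins + 0.5 * draws) / games
    return 400 * math.log10(score / (1 - score))


# (wins, draws, losses) of the 8-core side, copied from the result tables.
matches = {
    "Stockfish 11": (81, 118, 1),
    "Stockfish 12": (44, 156, 0),
    "Stockfish 051020": (31, 169, 0),
}

for name, wdl in matches.items():
    elo = elo_from_results(*wdl)
    print(f"{name}: {elo:+.1f} Elo total, {elo / 3:+.1f} Elo per core doubling")
```

The per-doubling column is only a rough average, since the gain from 1 to 2 cores need not equal the gain from 4 to 8.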

Code:

Result:
------------------------------------------------------------------------------------------------
  #  name                                games    wins   draws  losses   score    los%  elo+/-
  1. Stockfish 11 64 POPCNT dup 8 cores    200      81     118       1   140.0   100.0   147.2
  2. Stockfish 11 64 POPCNT dup 1 core     200       1     118      81    60.0     0.0  -147.2

Cross table:
------------------------------------------------------------------------------------------------
  #  name                                   score   games                                                                                                                                                                                                        1                                                                                                                                                                                                        2
  1. Stockfish 11 64 POPCNT dup 8 cores     140.0     200                                                                                                                                                                                                        x 111===1===111=1==1=1==1===11=1==11=====1====11=11==1=111==111====1111==11===1==11=========1===1====1111=111=1======1=1=1=0===1==1==1====11=11==11=11=1=11=1==1===1===1=11====11=====1==11=1==11==11==1==
  2. Stockfish 11 64 POPCNT dup 1 core       60.0     200 000===0===000=0==0=0==0===00=0==00=====0====00=00==0=000==000====0000==00===0==00=========0===0====0000=000=0======0=0=0=1===0==0==0====00=00==00=00=0=00=0==0===0===0=00====00=====0==00=0==00==00==0==                                                                                                                                                                                                        x

Tech:
------------------------------------------------------------------------------------------------

Tech (average nodes, depths, time/m per move, others per game), counted for computing moves only, ignored moves with zero nodes:
  #  name                                  nodes/m         NPS  depth/m   time/m    moves     time
  1. Stockfish 11 64 POPCNT dup 8 cores     35492K    13595007     31.7      2.6     61.0    159.1
  2. Stockfish 11 64 POPCNT dup 1 core       4551K     1662287     27.1      2.7     61.1    167.4
     all ---                                19530K     7478160     29.4      2.7     61.0    163.3

Tournament finished! Elapsed: 18:23:36

Code:

Result:
------------------------------------------------------------------------------------------
  #  name                          games    wins   draws  losses   score    los%  elo+/-
  1. Stockfish 051020 dup 8 cores    200      31     169       0   115.5   100.0    54.3
  2. Stockfish 051020 dup 1 core     200       0     169      31    84.5     0.0   -54.3

Cross table:
------------------------------------------------------------------------------------------
  #  name                             score   games                                                                                                                                                                                                        1                                                                                                                                                                                                        2
  1. Stockfish 051020 dup 8 cores     115.5     200                                                                                                                                                                                                        x =1==1======1========111=====1=====1=================1====1======1===1==1========1=====================1=1======1==========1==1====1========1========1==1=====1============1===1==========1==11====1==1==
  2. Stockfish 051020 dup 1 core       84.5     200 =0==0======0========000=====0=====0=================0====0======0===0==0========0=====================0=0======0==========0==0====0========0========0==0=====0============0===0==========0==00====0==0==                                                                                                                                                                                                        x

Tech:
------------------------------------------------------------------------------------------

Tech (average nodes, depths, time/m per move, others per game), counted for computing moves only, ignored moves with zero nodes:
  #  name                            nodes/m         NPS  depth/m   time/m    moves     time
  1. Stockfish 051020 dup 8 cores     30556K    10784132     37.2      2.8     49.1    139.1
  2. Stockfish 051020 dup 1 core       3659K     1282020     29.7      2.9     49.2    140.4
     all ---                          16695K     6012180     33.4      2.8     49.1    139.8

Tournament finished! Elapsed: 15:50:53

Code:

Result:
--------------------------------------------------------------------------------------
  #  name                      games    wins   draws  losses   score    los%  elo+/-
  1. Stockfish 12 dup 8 cores    200      44     156       0   122.0   100.0    77.7
  2. Stockfish 12 dup 1 core     200       0     156      44    78.0     0.0   -77.7

Cross table:
--------------------------------------------------------------------------------------
  #  name                         score   games                                                                                                                                                                                                        1                                                                                                                                                                                                        2
  1. Stockfish 12 dup 8 cores     122.0     200                                                                                                                                                                                                        x 1===1===============1======11======1==1===========1=1=====1===11===1===11==1===1========1=1=1==1=1========1===1=========1==1===========1===11======1====1====1==1====1=====1======1====111==1=11===1===1
  2. Stockfish 12 dup 1 core       78.0     200 0===0===============0======00======0==0===========0=0=====0===00===0===00==0===0========0=0=0==0=0========0===0=========0==0===========0===00======0====0====0==0====0=====0======0====000==0=00===0===0                                                                                                                                                                                                        x

Tech:
--------------------------------------------------------------------------------------

Tech (average nodes, depths, time/m per move, others per game), counted for computing moves only, ignored moves with zero nodes:
  #  name                        nodes/m         NPS  depth/m   time/m    moves     time
  1. Stockfish 12 dup 8 cores     28475K     9905390     35.2      2.9     48.0    137.8
  2. Stockfish 12 dup 1 core       3474K     1180298     29.1      2.9     48.1    141.4
     all ---                      15585K     5486571     32.1      2.9     48.0    139.6

Tournament finished! Elapsed: 15:46:49
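Since each match is only 200 games, it may help to put rough error bars on the three Elo figures. The sketch below is not part of the tournament tool's output. It assumes independent games and a normal approximation for the match score (z = 1.96 for roughly 95%), which is only approximate here, especially if openings were played in pairs with reversed colors.

```python
import math


def elo(score: float) -> float:
    """Elo difference for a given expected score (logistic model)."""
    return 400 * math.log10(score / (1 - score))


def elo_interval(wins: int, draws: int, losses: int, z: float = 1.96):
    """Return (low, mid, high) Elo for a match result, normal approximation."""
    n = wins + draws + losses
    s = (wins + 0.5 * draws) / n
    # Per-game variance of the score around its mean s.
    var = (wins * (1 - s) ** 2 + draws * (0.5 - s) ** 2 + losses * s ** 2) / n
    se = math.sqrt(var / n)  # standard error of the mean score
    return elo(s - z * se), elo(s), elo(s + z * se)


# (wins, draws, losses) of the 8-core side, copied from the tables above.
results = {
    "Stockfish 11": (81, 118, 1),
    "Stockfish 12": (44, 156, 0),
    "Stockfish 051020": (31, 169, 0),
}

for name, wdl in results.items():
    low, mid, high = elo_interval(*wdl)
    print(f"{name}: {mid:+.1f} Elo (approx. 95% interval {low:+.1f} to {high:+.1f})")
```

Under these assumptions the Stockfish 11 interval lies entirely above both Stockfish 12 intervals, while the intervals for the two Stockfish 12 versions overlap each other.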

Laskos · Post by **Laskos** » Sun Oct 11, 2020 12:22 pm

hgm wrote: ↑Sat Oct 10, 2020 4:20 pm That it is critical proves the underlying model is wrong: for a correct model it should not matter how you select opponents!

As data accumulates we should be able to get a statistically significant impression for how average result depends on rating difference for NN engines against each other, as well as for AB vs NN engines. And if this is very different from how this is for AB engines against each other (which we should already know), we should switch to using a rating calculator that takes this into account. It could be that this requires labeling each engine with a pair of ratings, one they would get in the pool of AB engines, and another they would get in a pool of NN engines. And that these ratings can be substantially different. But then neither of those would be dependent on opponent choice anymore.

I tried to keep Leela MCTS and regular engines together by introducing a new parameter describing the degree of Leela's Elo schizophrenia. I studied the problem a bit in this thread almost two years ago:
http://talkchess.com/forum3/viewtopic.php?f=2&t=69672
