CCRL flawed testing : SF12 above SF12 8CPU

Graham Banks · Post by **Graham Banks** » Fri Oct 09, 2020 12:23 am

mwyoung wrote: ↑Fri Oct 09, 2020 12:18 am
Raphexon wrote: ↑Fri Oct 09, 2020 12:16 am
mwyoung wrote: ↑Thu Oct 08, 2020 10:34 pm
Jouni wrote: ↑Thu Oct 08, 2020 10:14 pm The Fastgm page has a test with SF8, and there, after 3000 games, 8 threads vs 1 thread was +158 Elo. The SF wiki has a value of +178 Elo after 1000 games. Both are 60+0.6 games. But we are at a much higher level now.
Remember we are making a big assumption here. SF 12 is not an A/B engine. Just because SF 8 scales this way does not mean SF NNUE scales the same way. It is still possible this is normal.
lol.
Then I guess CCRL has a lot to explain...

We're looking into it for sure.

AndrewGrant · Post by **AndrewGrant** » Fri Oct 09, 2020 12:42 am

mwyoung wrote: ↑Thu Oct 08, 2020 10:34 pm Remember we are making a big assumption here. SF 12 is not an A/B engine. Just because SF 8 scales this way does not mean SF NNUE scales the same way. It is still possible this is normal.

Has someone told KingCrusher? The Content has been getting dry ....

Laskos · Post by **Laskos** » Fri Oct 09, 2020 12:45 am

Ajedrecista wrote: ↑Thu Oct 08, 2020 9:10 pm Hello Kai and Alayan:

Laskos wrote: ↑Thu Oct 08, 2020 7:37 pm What was the draw rate? Your openings favor huge draw rates and compress Elo differences. Can you write WDL numbers?

Alayan wrote: ↑Thu Oct 08, 2020 8:45 pm[...]

I saw SF 12 8CPU with zero losses vs the 1 core after 60+ games but I don't know the final results. Anyway, 8CPU had a huge win:loss ratio.
While we wait for Mark's answer, I did some math. Based on my own post when SF 12 was released, the draw ratio is:
Code: Select all
W = K*L >= L     // K >= 1
K*L + D + L = 1
(K + 1)*L = 1 - D
L = (1 - D)/(K + 1)

Elo_diff. = 400*log10{[2*K - (K - 1)*D]/[2 + (K - 1)*D]}

D = 2*[10^(Elo_diff./400) - K]/{(1 - K)*[10^(Elo_diff./400) + 1]}
I computed draw ratios for some K values just to get an idea:
Code: Select all
Elo difference = 72 Elo.
W = K*L

 K       W          D          L
 2     40.86%     38.71%     20.43%
 3     30.65%     59.14%     10.22%
 4     27.24%     65.95%      6.81%
 5     25.54%     69.35%      5.11%
 6     24.52%     71.40%      4.09%
 7     23.84%     72.76%      3.41%
 8     23.35%     73.73%      2.92%
 9     22.99%     74.46%      2.55%
10     22.70%     75.03%      2.27%
11     22.47%     75.48%      2.04%
12     22.29%     75.85%      1.86%
13     22.13%     76.16%      1.70%
14     22.00%     76.43%      1.57%
15     21.89%     76.65%      1.46%
16     21.79%     76.84%      1.36%
17     21.71%     77.01%      1.28%
18     21.63%     77.16%      1.20%
19     21.57%     77.30%      1.14%
20     21.51%     77.42%      1.08%
Of course, the extreme cases are:
Code: Select all
Assuming Elo_diff. >= 0 Elo:

Elo_diff. = 400*log10[score/(1 - score)]
score = 1/[1 + 10^(-Elo_diff./400)] = W + D = 1/2 + D_min.
D_min. = 1/[1 + 10^(-Elo_diff./400)] - 1/2
// Elo_diff. = 72 ==> D_min. ~ 10.22%

W + D_max. = 1     // L = 0
Elo_diff. = 400*log10[(W + D_max./2)/(D_max./2)]
Elo_diff. = 400*log10[(1 - D_max. + D_max./2)/(D_max./2)]
10^(Elo_diff./400) = (1 - D_max./2)/(D_max./2) = 2/D_max. - 1
D_max. = 2/[1 + 10^(Elo_diff./400)]
// Elo_diff. = 72 ==> D_max. ~ 79.57%
So 10.22% < D < 79.57% in this case, and D is most likely around 75%. If there were 200 games, then WDL figures must be in steps of 0.5%, and some values of my table, like K = 7, can be discarded. K is unlikely to be an integer, after all. I picked integer values for K just to get a rough idea of the draw ratio.

Regards from Spain.

Ajedrecista.

Very nice!
His result was 78% draws, very close to your likely 75%. And his wins : losses were 44 : 0, a quite telling W/L ratio with the associated Elo compression.
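As a rough check of those figures, here is a minimal Python sketch. It assumes the match was 200 games (so 78% draws would be 156 games) and uses the same logistic Elo formula as the quoted post; the function name `elo_from_wdl` is hypothetical.

```python
import math


def elo_from_wdl(wins: int, draws: int, losses: int) -> float:
    """Elo difference implied by a match result under the logistic model."""
    games = wins + draws + losses
    # A draw counts as half a point for each side.
    score = (wins + 0.5 * draws) / games
    return 400.0 * math.log10(score / (1.0 - score))


# Assumed 200-game match: 44 wins, 156 draws (78%), 0 losses.
print(f"{elo_from_wdl(44, 156, 0):.1f} Elo")
```

The function is undefined for a 100% or 0% score, where the logistic model gives an infinite difference.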

mwyoung · Post by **mwyoung** » Fri Oct 09, 2020 1:46 am

Graham Banks wrote: ↑Fri Oct 09, 2020 12:23 am
mwyoung wrote: ↑Fri Oct 09, 2020 12:18 am
Raphexon wrote: ↑Fri Oct 09, 2020 12:16 am
mwyoung wrote: ↑Thu Oct 08, 2020 10:34 pm
Jouni wrote: ↑Thu Oct 08, 2020 10:14 pm The Fastgm page has a test with SF8, and there, after 3000 games, 8 threads vs 1 thread was +158 Elo. The SF wiki has a value of +178 Elo after 1000 games. Both are 60+0.6 games. But we are at a much higher level now.
Remember we are making a big assumption here. SF 12 is not an A/B engine. Just because SF 8 scales this way does not mean SF NNUE scales the same way. It is still possible this is normal.
lol.
Then I guess CCRL has a lot to explain...
We're looking into it for sure.

Thanks,

Keep us up to date. Interesting for testers.

Ajedrecista · Post by **Ajedrecista** » Fri Oct 09, 2020 7:16 pm

Hello Kai:

Laskos wrote: ↑Fri Oct 09, 2020 12:45 am Very nice!
His result was 78% draws, very close to your likely 75%. And his wins : losses were 44 : 0, a quite telling W/L ratio with the associated Elo compression.

I found a typo in my formula for D_min. The true value is:

Code: Select all

/* WRONG IN MY PREVIOUS POST:
score = 1/[1 + 10^(-Elo_diff./400)] = W + D = 1/2 + D_min.
D_min. = 1/[1 + 10^(-Elo_diff./400)] - 1/2
// Elo_diff. = 72 ==> D_min. ~ 10.22%
*/

// IT IS THE DOUBLE:
score = 1/[1 + 10^(-Elo_diff./400)] = W + D = 1/2 + D_min./2     // D_min./2 instead of D_min.
D_min. = 2*score - 1 = 2/[1 + 10^(-Elo_diff./400)] - 1
// Elo_diff. = 72 ==> D_min. ~ 20.43%

Furthermore, I did not realize that +72 Elo could not be a result after 200 games because W - L would not be a multiple of 1/200 = 0.005 = 0.5%:

Code: Select all

Elo_diff. = 400*log10{[1/2 + (W - L)/2]/[1/2 - (W - L)/2]}
10^(Elo_diff./400) = [1 + (W - L)]/[1 - (W - L)]
W - L = [10^(Elo_diff./400) - 1]/[10^(Elo_diff./400) + 1]
// Elo_diff. = 72 ==> W - L ~ 20.43%
// W - L = D_min.
// n*(W - L) = 200*(W - L) ~ 40.86 (far enough from the closest integer, but it could be due to Elo_diff. rounding).
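The integer check above can be sketched in Python. This is only an illustration under the post's own assumption of n = 200 games; the function names are hypothetical. It converts an Elo difference to W - L, then lists the Elo difference that each nearby whole-game margin would give.

```python
import math


def win_minus_loss(elo_diff: float) -> float:
    """W - L (as a fraction of games) implied by an Elo difference."""
    x = 10.0 ** (elo_diff / 400.0)
    return (x - 1.0) / (x + 1.0)


def elo_from_margin(margin_games: int, n_games: int) -> float:
    """Elo difference implied by a whole-number margin of wins minus losses."""
    m = margin_games / n_games
    return 400.0 * math.log10((1.0 + m) / (1.0 - m))


n = 200  # assumed number of games, as in the post
print(f"n*(W - L) at 72 Elo: {n * win_minus_loss(72.0):.2f}")
for margin in range(40, 45):
    print(f"margin {margin:2d} games -> {elo_from_margin(margin, n):.1f} Elo")
```

Printing a few neighbouring margins makes it easy to see how much a rounded Elo figure can move the implied game count.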

Computing again with 200*(W - L) = 44 ==> Elo_diff. ~ 77.7:

Code: Select all

Elo difference = 77.7 Elo.
W = K*L

 K       W          D          L
 2     44.00%     34.01%     22.00%
 3     33.00%     56.00%     11.00%
 4     29.33%     63.34%      7.33%
 5     27.50%     67.00%      5.50%
 6     26.40%     69.20%      4.40%
 7     25.66%     70.67%      3.67%
 8     25.14%     71.72%      3.14%
 9     24.75%     72.50%      2.75%
10     24.44%     73.11%      2.44%
11     24.20%     73.60%      2.20%
12     24.00%     74.00%      2.00%
13     23.83%     74.34%      1.83%
14     23.69%     74.62%      1.69%
15     23.57%     74.86%      1.57%
16     23.46%     75.07%      1.47%
17     23.37%     75.25%      1.37%
18     23.29%     75.41%      1.29%
19     23.22%     75.56%      1.22%
20     23.16%     75.69%      1.16%
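A table like the one above can be regenerated with a short Python sketch. It applies the post's relations W = K*L and W + D + L = 1 together with the draw-ratio formula from the earlier post; the function name `wdl_from_elo` is hypothetical, and K must be greater than 1 to avoid dividing by zero.

```python
def wdl_from_elo(elo_diff: float, k: float) -> tuple[float, float, float]:
    """Return (W, D, L) as fractions for a given Elo difference and K = W/L."""
    x = 10.0 ** (elo_diff / 400.0)
    # Draw ratio from the earlier post: D = 2*(x - K)/((1 - K)*(x + 1)).
    d = 2.0 * (x - k) / ((1.0 - k) * (x + 1.0))
    l = (1.0 - d) / (k + 1.0)
    w = k * l
    return w, d, l


elo = 77.7
print(f"Elo difference = {elo} Elo.")
print(" K       W          D          L")
for k in range(2, 21):
    w, d, l = wdl_from_elo(elo, k)
    print(f"{k:2d}   {w:7.2%}    {d:7.2%}    {l:7.2%}")
```

Non-integer K values can be passed in the same way.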

Code: Select all

Assuming Elo_diff. >= 0 Elo:

D_min. = 2/[1 + 10^(-Elo_diff./400)] - 1
// Elo_diff. = 77.7 ==> D_min. = 22%

D_max. = 2/[1 + 10^(Elo_diff./400)]
// Elo_diff. = 77.7 ==> D_max. = 78%

D_min. + D_max. = {2/[1 + 10^(-Elo_diff./400)] - 1} + 2/[1 + 10^(Elo_diff./400)]
D_min. + D_max. = 1
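The zero-loss case (L = 0, which gives D_max.) can be checked with a round trip in Python. This is a sketch with hypothetical function names: it computes D_max. for a given Elo difference, sets W = 1 - D_max., and recomputes the Elo difference from that result.

```python
import math


def max_draw_ratio(elo_diff: float) -> float:
    """Draw ratio when the stronger side loses no game (L = 0)."""
    return 2.0 / (1.0 + 10.0 ** (elo_diff / 400.0))


def elo_from_fractions(w: float, d: float, l: float) -> float:
    """Elo difference from W, D, L given as fractions that sum to 1."""
    score = w + d / 2.0
    return 400.0 * math.log10(score / (1.0 - score))


for elo in (72.0, 77.7):
    d_max = max_draw_ratio(elo)
    w = 1.0 - d_max  # with L = 0, every non-drawn game is a win
    back = elo_from_fractions(w, d_max, 0.0)
    print(f"Elo {elo}: D_max = {d_max:.2%}, W = {w:.2%}, round trip = {back:.1f}")
```

With L = 0, W equals W - L, which is the quantity labelled D_min. above, so the printed W and D_max. add up to 1.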

I would have chosen a draw ratio of around 72.5% or 74% in this case, falling even further short than with the wrong Elo_diff. of 72 Elo. The stronger engine completed a perfect score regarding losses, which is something I value a lot in many games and sports: chess, checkers, goals against in football (soccer), games lost in football (Arsenal F.C. in the 2003-04 Premier League season)...

Regards from Spain.

Ajedrecista.

h1a8 · Post by **h1a8** » Fri Oct 09, 2020 8:49 pm

Alayan wrote: ↑Tue Oct 06, 2020 6:00 pm Both got a very different mix of opponents. Neither has that many games, so the small sample size doesn't help, but:
Code: Select all
Stockfish 12 64-bit        3666	+22	−22	89.3%	−325.7	21.0%
Stockfish 12 64-bit 8CPU   3658	+28	−28	63.4%	−75.1	70.7%
Elo transitivity flat out doesn't work, and we can get absurd results like this if the opponent mix is different enough.

If a different mix makes a significant difference, then there is a flaw with the Elo rating system, period.

Nay Lin Tun · Post by **Nay Lin Tun** » Fri Oct 09, 2020 11:14 pm

Alayan wrote: ↑Tue Oct 06, 2020 6:00 pm Both got a very different mix of opponents. Neither has that many games, so the small sample size doesn't help, but:
Code: Select all
Stockfish 12 64-bit        3666	+22	−22	89.3%	−325.7	21.0%
Stockfish 12 64-bit 8CPU   3658	+28	−28	63.4%	−75.1	70.7%
Elo transitivity flat out doesn't work, and we can get absurd results like this if the opponent mix is different enough.

So, we don't need to use more than 1 core for SF 12.
Environmentally friendly, good job in the NNUE revolution!

mwyoung · Post by **mwyoung** » Sat Oct 10, 2020 7:23 am

Laskos's Law: What's not clear? 3 doublings in cores mean nowadays at least 2.5 real effective doublings in TC. Each effective doubling in TC in these blitz conditions means at very least 40 Elo points, therefore at very least 80 Elo points 1 core -> 8 cores. In fact more likely 120 - 140 Elo points. That result posted in OP and discrepancy beyond doubt break the Elo model.

It is clear to me that Stockfish NNUE does not obey Laskos's law as stated above. CCRL most likely does not have flawed testing. As suspected, the issue is with Stockfish NNUE. It took me many hours of testing to show this result, and the full results will be shown soon, when the testing is completed. The bottom line is that the issue is with Stockfish NNUE, not with CCRL testing. As you know, testing can take days to answer this kind of anomaly or false assumption.

hgm · Post by **hgm** » Sat Oct 10, 2020 11:26 am

One thing should be clear: there is nothing flawed in CCRL testing. They just play games, and record the results.

What could be flawed is the Elo model used to analyze the testing data. This cannot be blamed on CCRL. It just means we have to develop better rating models, and use these instead to extract useful information (which might not be representable as a single rating number) from the data set.

I have suspected for a long time that conventional ratings are artifacts caused by 'incestuous testing': the players tested are too similar, so that the test only measures a single aspect of their performance, almost completely ignoring other aspects, as there is no opponent in the pool that would punish you for being bad at the others. So I have always wondered whether an A-B engine that tests (say) 100 Elo stronger than another one would also perform 100 Elo better in a gauntlet against a more varied mix of opponents. I never excluded the possibility that it might actually perform worse, because the 100-Elo advantage was only reached by specializing it more on the single aspect that A-B engines are good at, at the expense of other aspects.

If SF 8-core has 'underperformed' in terms of Elo because it had too many NN opponents, this can be considered evidence for the above hypothesis. Deeper search through more cores might not help against strategically superior opponents, because the tactical mistakes the latter make can already be seen at the lower depth. The only thing that would help is to not fall for their strategic traps. The idea that better search beyond some point makes the engine stronger might just be an artifact of the rating pool being dominated by A-B engines.

smatovic · Post by **smatovic** » Sat Oct 10, 2020 11:36 am

hgm wrote: ↑Sat Oct 10, 2020 11:26 am One thing should be clear: there is nothing flawed in CCRL testing. They just play games, and record the results.

What could be flawed is the Elo model used to analyze the testing data. This cannot be blamed on CCRL. It just means we have to develop better rating models, and use these instead to extract useful information (which might not be representable as a single rating number) from the data set.

I have suspected for a long time that conventional ratings are artifacts caused by 'incestuous testing': the players tested are too similar, so that the test only measures a single aspect of their performance, almost completely ignoring other aspects, as there is no opponent in the pool that would punish you for being bad at the others. So I have always wondered whether an A-B engine that tests (say) 100 Elo stronger than another one would also perform 100 Elo better in a gauntlet against a more varied mix of opponents. I never excluded the possibility that it might actually perform worse, because the 100-Elo advantage was only reached by specializing it more on the single aspect that A-B engines are good at, at the expense of other aspects.

If SF 8-core has 'underperformed' in terms of Elo because it had too many NN opponents, this can be considered evidence for the above hypothesis. Deeper search through more cores might not help against strategically superior opponents, because the tactical mistakes the latter make can already be seen at the lower depth. The only thing that would help is to not fall for their strategic traps. The idea that better search beyond some point makes the engine stronger might just be an artifact of the rating pool being dominated by A-B engines.

Just as a side remark, it is not clear to me whether SF's parallel search goes deeper with more cores or just widens the search. Maybe NNUE does not profit from more cores the same way classic SF does, and they have to rework the parallel search again, or the like...

--
Srdja
