CCRL flawed testing : SF12 above SF12 8CPU

Laskos · Post by **Laskos** » Thu Oct 08, 2020 7:37 pm

mwyoung wrote: ↑Thu Oct 08, 2020 7:17 pm
Alayan wrote: ↑Thu Oct 08, 2020 6:19 pm What's your final result ?
It was not 140 Elo, or 120 Elo, and not even 80 Elo. It was 72. So if Lasko's numbers are correct. Then this may not be a CCRL issue in total.

What was the draw rate? Your openings favor huge draw rates and compress Elo differences. Can you write WDL numbers?

Alayan · Post by **Alayan** » Thu Oct 08, 2020 8:45 pm

mvanthoor wrote: ↑Thu Oct 08, 2020 7:22 pm Can't it just be a scaling problem with Stockfish? According to CEGT, Stockfish 11 is only 6 ELO stronger @ 8CPU than Stockfish 11 @ 4CPU:

http://www.cegt.net/40_40%20Rating%20Li ... liste.html

This would probably mean that, if you run a long enough test between Stockfish 11 @ 4CPU and 8CPU, the result would be almost, if not equal.

These elo differences are completely unreliable because of the error bars.

The 1 vs 8 CPU at CCRL is so wrong that it's rather easy to show in a h2h test or a test vs common opponents that it's incorrect.

Laskos wrote: ↑Thu Oct 08, 2020 7:37 pm
mwyoung wrote: ↑Thu Oct 08, 2020 7:17 pm
Alayan wrote: ↑Thu Oct 08, 2020 6:19 pm What's your final result ?
It was not 140 Elo, or 120 Elo, and not even 80 Elo. It was 72. So if Lasko's numbers are correct. Then this may not be a CCRL issue in total.
What was the draw rate? Your openings favor huge draw rates and compress Elo differences. Can you write WDL numbers?

I saw SF 12 8CPU with zero losses vs the 1 core after 60+ games but I don't know the final results. Anyway, 8CPU had a huge win:loss ratio.

Ajedrecista · Post by **Ajedrecista** » Thu Oct 08, 2020 9:10 pm

Hello Kai and Alayan:

Laskos wrote: ↑Thu Oct 08, 2020 7:37 pm What was the draw rate? Your openings favor huge draw rates and compress Elo differences. Can you write WDL numbers?

Alayan wrote: ↑Thu Oct 08, 2020 8:45 pm [...]

I saw SF 12 8CPU with zero losses vs the 1 core after 60+ games but I don't know the final results. Anyway, 8CPU had a huge win:loss ratio.

While we wait for Mark's answer, I did some math. Based on my own post from when SF 12 was released, the draw ratio is:

Code:

W = K*L >= L     // K >= 1
K*L + D + L = 1
(K + 1)*L = 1 - D
L = (1 - D)/(K + 1)

Elo_diff. = 400*log10{[2*K - (K - 1)*D]/[2 + (K - 1)*D]}

D = 2*[10^(Elo_diff./400) - K]/{(1 - K)*[10^(Elo_diff./400) + 1]}

I computed draw ratios for some K values just to get an idea:

Code:

Elo difference = 72 Elo.
W = K*L

 K       W          D          L
 2     40.86%     38.71%     20.43%
 3     30.65%     59.14%     10.22%
 4     27.24%     65.95%      6.81%
 5     25.54%     69.35%      5.11%
 6     24.52%     71.40%      4.09%
 7     23.84%     72.76%      3.41%
 8     23.35%     73.73%      2.92%
 9     22.99%     74.46%      2.55%
10     22.70%     75.03%      2.27%
11     22.47%     75.48%      2.04%
12     22.29%     75.85%      1.86%
13     22.13%     76.16%      1.70%
14     22.00%     76.43%      1.57%
15     21.89%     76.65%      1.46%
16     21.79%     76.84%      1.36%
17     21.71%     77.01%      1.28%
18     21.63%     77.16%      1.20%
19     21.57%     77.30%      1.14%
20     21.51%     77.42%      1.08%
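
The table above can be reproduced with a few lines of Python. This is only a sketch of the formulas from this post; the function name `wdl_from_elo_and_ratio` is an illustrative assumption and does not come from any chess tool.

```python
def wdl_from_elo_and_ratio(elo_diff: float, k: float) -> tuple[float, float, float]:
    """Return (W, D, L) as fractions, given an Elo difference and K = W/L."""
    x = 10 ** (elo_diff / 400)  # equals score / (1 - score)
    if k <= 1 or k < x:
        # K = 1 divides by zero; K < x would give a negative draw ratio.
        raise ValueError("need K > 1 and K >= 10^(Elo_diff/400)")
    d = 2 * (x - k) / ((1 - k) * (x + 1))
    l = (1 - d) / (k + 1)
    w = k * l
    return w, d, l


if __name__ == "__main__":
    # Print a few rows of the table for Elo_diff. = 72.
    for k in (2, 3, 5, 10, 20):
        w, d, l = wdl_from_elo_and_ratio(72, k)
        print(f"{k:2d}   {w:7.2%}   {d:7.2%}   {l:7.2%}")
```

For K = 2 this gives roughly 40.86% / 38.71% / 20.43%, in line with the first row of the table.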

Of course, the extreme cases are:

Code:

Assuming Elo_diff. >= 0 Elo:

Elo_diff. = 400*log10[score/(1 - score)]
score = 1/[1 + 10^(-Elo_diff./400)] = W + D/2

D_min. = 0     // W = score, L = 1 - score, K = W/L = 10^(Elo_diff./400)
// Elo_diff. = 72 ==> W ~ 60.22%, L ~ 39.78%, K ~ 1.51

W + D_max. = 1     // L = 0
Elo_diff. = 400*log10[(W + D_max./2)/(D_max./2)]
Elo_diff. = 400*log10[(1 - D_max. + D_max./2)/(D_max./2)]
10^(Elo_diff./400) = (1 - D_max./2)/(D_max./2) = 2/D_max. - 1
D_max. = 2/[1 + 10^(Elo_diff./400)]
// Elo_diff. = 72 ==> D_max. ~ 79.57%

So 0% <= D <= 79.57% in this case, with a good chance of being around 75% given the huge win:loss ratio. If there were 200 games, then the WDL figures must be in steps of 0.5%, and some values of my table, like K = 7, can be discarded. K is unlikely to be an integer, after all. I picked integer values for K just to get a rough idea of the draw ratio.
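
The remark about 0.5% steps can be checked by brute force. The sketch below assumes exactly 200 games and that the reported 72 Elo is a rounded value; both are assumptions, and the helper names are illustrative.

```python
import math


def elo_from_wdl(wins: int, draws: int, losses: int) -> float:
    """Elo difference implied by a W/D/L record (logistic model)."""
    games = wins + draws + losses
    score = (wins + draws / 2) / games
    return 400 * math.log10(score / (1 - score))


def candidates(games: int = 200, target_elo: int = 72) -> list[tuple[int, int, int]]:
    """All (W, D, L) with W + D + L = games whose Elo rounds to target_elo."""
    found = []
    for draws in range(games + 1):
        for wins in range(games - draws + 1):
            losses = games - draws - wins
            points = wins + draws / 2
            if points == 0 or points == games:
                continue  # Elo is infinite for 0% or 100% scores
            if round(elo_from_wdl(wins, draws, losses)) == target_elo:
                found.append((wins, draws, losses))
    return found


if __name__ == "__main__":
    results = candidates()
    most_drawish = max(results, key=lambda record: record[1])
    print(len(results), "candidate records; most draws:", most_drawish)
```

Under these assumptions every candidate scores 120.5 points out of 200, and the most drawish one is 41 wins, 159 draws and 0 losses (79.5% draws), just under the D_max. bound.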

Regards from Spain.

Ajedrecista.

mwyoung · Post by **mwyoung** » Thu Oct 08, 2020 9:16 pm

Laskos wrote: ↑Thu Oct 08, 2020 7:37 pm
mwyoung wrote: ↑Thu Oct 08, 2020 7:17 pm
Alayan wrote: ↑Thu Oct 08, 2020 6:19 pm What's your final result ?
It was not 140 Elo, or 120 Elo, and not even 80 Elo. It was 72. So if Lasko's numbers are correct. Then this may not be a CCRL issue in total.
What was the draw rate? Your openings favor huge draw rates and compress Elo differences. Can you write WDL numbers?

I will post this when I get home. The book should not matter, as I use the same opening standards as CCRL.

Jouni · Post by **Jouni** » Thu Oct 08, 2020 10:14 pm

The Fastgm page has a test with SF8 where, after 3000 games, 8 threads vs 1 thread was +158 Elo. The SF wiki gives +178 Elo after 1000 games. Both were 60+0.6 games. But we are at a much higher level now.

mwyoung · Post by **mwyoung** » Thu Oct 08, 2020 10:34 pm

Jouni wrote: ↑Thu Oct 08, 2020 10:14 pm Fastgm page has test with SF8 and there after 3000 games 8th vs 1th was +158 ELO. SF wiki has value of +178 ELO after 1000 games. Both 60+0.6 games. But we are so much higher level now.

Remember we are making a big assumption here. SF 12 is not an A/B engine. Just because SF 8 scales this way does not mean SF NNUE scales the same way. It is still possible this is normal.

mwyoung · Post by **mwyoung** » Thu Oct 08, 2020 11:07 pm

mwyoung wrote: ↑Tue Oct 06, 2020 9:11 pm
Laskos wrote: ↑Tue Oct 06, 2020 8:53 pm
mwyoung wrote: ↑Tue Oct 06, 2020 8:35 pm
Laskos wrote: ↑Tue Oct 06, 2020 8:21 pm
mwyoung wrote: ↑Tue Oct 06, 2020 8:09 pm
Laskos wrote: ↑Tue Oct 06, 2020 8:02 pm
mwyoung wrote: ↑Tue Oct 06, 2020 7:43 pm
Laskos wrote: ↑Tue Oct 06, 2020 7:37 pm
mwyoung wrote: ↑Tue Oct 06, 2020 6:48 pm
Laskos wrote: ↑Tue Oct 06, 2020 6:16 pm
Alayan wrote: ↑Tue Oct 06, 2020 6:00 pm Both got a very different mix of opponents. Both don't have that much games so small sample size doesn't help, but :
Code:
Stockfish 12 64-bit        3666	+22	−22	89.3%	−325.7	21.0%
Stockfish 12 64-bit 8CPU   3658	+28	−28	63.4%	−75.1	70.7%
Elo transitivity flat out doesn't work, and we can get absurd results like this if the opponent mix is different enough.
5 out of 6 opponents of SF12 8CPU are Leela-like MCTS engines which compress Elo differences when playing against AB engines (was discussed more than a year ago here). Yes, underperformance of 8CPU SF12 is statistically significant, despite not that large number of games.
"Yes, underperformance of 8CPU SF12 is statistically significant"

Why is this true?
What should be the Elo difference with testing between SF 12 on 1 core vs SF 12 on 8 cores at this fast TC?
By CCRL own testing results. SF 12 on 8 cores could have a rating of 3686, and SF 12 on 1 core could have a rating of 3644.

This is not even considering the hardware CCRL is using. And is it configured correctly. As in did they lock the cpu core speed of the CPUs, CPU ramping, and other considerations. This can have a big impact on performance at these TC.
The difference should be (much) in excess of 80 Elo points in these conditions, here it is -8 +/- 36 Elo points 2 standard deviations, therefore the mismatch is highly statistically significant. The explanation is that Leela-like MCTS engines in a pool of AB engines don't obey the Elo model, and this was discussed awhile ago here.
"The difference should be (much) in excess of 80 Elo points in these conditions."

Why should it be over 80 Elo with SF 12. 1 core vs 8 cores. I have not tested this. What results are you looking at that do not agree with CCRL.

If you are correct. Then why is it so off. As I said before...

This is not even considering the hardware CCRL is using. And is it configured correctly. As in did they lock the cpu core speed of the CPUs, CPU ramping, and other considerations. This can have a big impact on performance at these TC.
What's not clear? 3 doublings in cores mean nowadays at least 2.5 real effective doublings in TC. Each effective doubling in TC in these blitz conditions means at very least 40 Elo points, therefore at very least 80 Elo points 1 core -> 8 cores. In fact more likely 120 - 140 Elo points. That result posted in OP and discrepancy beyond doubt break the Elo model.
Then you are assuming this is true then with STOCKFISH 12. So you have no data! This is why you always fall off the rails.
So this could be a issue with Stockfish 12 with 8 cores, and CCRL testing could be correct.
I checked on 4 cores with my i7 CPU and the result against 1 core in bullet was in excess of 100 Elo points. It's very unlikely that it regresses to 8 cores. In fact I saw results on SF testing framework on many cores (64?) showing good scaling with cores of SF NNUE, at least at short time controls.
Now we have some limited data to work with. The data said we have a issue if true. Why? Bad testing, bad hardware configuration, or still is there a issue with SF 12 on 8 cores. Since we have only CCRL data for 8 cores with SF 12.

We need to rule out a SF 12 issue first. Before looking at other reasons like CCRL.
You are free to rule out anything you want.
Thanks,

If CCRL really has bad data here, I would like to be fair and show it with some kind of data.

Stockfish 12 (1 core) vs Stockfish 12 (8 cores) (TC=2m+1s) (200 Rounds)

Live:

Code:

Result:
--------------------------------------------------------------------------------------
  #  name                      games    wins   draws  losses   score    los%  elo+/-
  1. Stockfish 12 dup 8 cores    200      44     156       0   122.0   100.0    77.7
  2. Stockfish 12 dup 1 core     200       0     156      44    78.0     0.0   -77.7

Cross table:
--------------------------------------------------------------------------------------
  #  name                         score   games  1 2
  1. Stockfish 12 dup 8 cores     122.0     200  x 1===1===============1======11======1==1===========1=1=====1===11===1===11==1===1========1=1=1==1=1========1===1=========1==1===========1===11======1====1====1==1====1=====1======1====111==1=11===1===1
  2. Stockfish 12 dup 1 core       78.0     200  0===0===============0======00======0==0===========0=0=====0===00===0===00==0===0========0=0=0==0=0========0===0=========0==0===========0===00======0====0====0==0====0=====0======0====000==0=00===0===0 x

Tech:
--------------------------------------------------------------------------------------

Tech (average nodes, depths, time/m per move, others per game), counted for computing moves only, ignored moves with zero nodes:
  #  name                        nodes/m         NPS  depth/m   time/m    moves     time
  1. Stockfish 12 dup 8 cores     28475K     9905390     35.2      2.9     48.0    137.8
  2. Stockfish 12 dup 1 core       3474K     1180298     29.1      2.9     48.1    141.4
     all ---                      15585K     5486571     32.1      2.9     48.0    139.6

Tournament finished! Elapsed: 15:46:49
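
As a rough cross-check of the result above, the Elo difference and an approximate 95% interval can be recomputed from the W/D/L counts. The sketch below uses a normal approximation on the per-game score, which is an assumption here and not necessarily how the tournament manager computes its figures; the function names are illustrative.

```python
import math


def elo(score: float) -> float:
    """Elo difference for an expected score strictly between 0 and 1."""
    return 400 * math.log10(score / (1 - score))


def elo_with_interval(wins: int, draws: int, losses: int, z: float = 1.96):
    """Return (low, estimate, high) Elo using a normal approximation."""
    games = wins + draws + losses
    score = (wins + draws / 2) / games
    # Per-game variance of the result (1, 0.5 or 0) around the mean score.
    variance = (
        wins * (1 - score) ** 2
        + draws * (0.5 - score) ** 2
        + losses * score ** 2
    ) / games
    margin = z * math.sqrt(variance / games)
    return elo(score - margin), elo(score), elo(score + margin)


if __name__ == "__main__":
    low, estimate, high = elo_with_interval(44, 156, 0)
    print(f"Elo: {estimate:.1f} (approx. 95% interval {low:.0f} to {high:.0f})")

    # Raw speed ratio from the "Tech" table (NPS of 8 cores / NPS of 1 core).
    nps_ratio = 9905390 / 1180298
    print(f"NPS ratio: {nps_ratio:.2f}x, {math.log2(nps_ratio):.2f} doublings")
```

The point estimate comes out at about 77.7 Elo, matching the table, and the NPS ratio is about 8.4x, roughly three doublings in raw speed.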

Raphexon · Post by **Raphexon** » Fri Oct 09, 2020 12:16 am

mwyoung wrote: ↑Thu Oct 08, 2020 10:34 pm
Jouni wrote: ↑Thu Oct 08, 2020 10:14 pm Fastgm page has test with SF8 and there after 3000 games 8th vs 1th was +158 ELO. SF wiki has value of +178 ELO after 1000 games. Both 60+0.6 games. But we are so much higher level now.
Remember we are making a big assumption here. SF 12 is not a A/B engine. Just because SF 8 scales this way. Does not mean SF NNUE scales the same way. It is still possible this is normal.

lol.

mwyoung · Post by **mwyoung** » Fri Oct 09, 2020 12:18 am

Raphexon wrote: ↑Fri Oct 09, 2020 12:16 am
mwyoung wrote: ↑Thu Oct 08, 2020 10:34 pm
Jouni wrote: ↑Thu Oct 08, 2020 10:14 pm Fastgm page has test with SF8 and there after 3000 games 8th vs 1th was +158 ELO. SF wiki has value of +178 ELO after 1000 games. Both 60+0.6 games. But we are so much higher level now.
Remember we are making a big assumption here. SF 12 is not a A/B engine. Just because SF 8 scales this way. Does not mean SF NNUE scales the same way. It is still possible this is normal.
lol.

If this is not the fault of SF 12, then I guess CCRL has a lot to explain...

Raphexon · Post by **Raphexon** » Fri Oct 09, 2020 12:21 am

mwyoung wrote: ↑Fri Oct 09, 2020 12:18 am
Raphexon wrote: ↑Fri Oct 09, 2020 12:16 am
mwyoung wrote: ↑Thu Oct 08, 2020 10:34 pm
Jouni wrote: ↑Thu Oct 08, 2020 10:14 pm Fastgm page has test with SF8 and there after 3000 games 8th vs 1th was +158 ELO. SF wiki has value of +178 ELO after 1000 games. Both 60+0.6 games. But we are so much higher level now.
Remember we are making a big assumption here. SF 12 is not a A/B engine. Just because SF 8 scales this way. Does not mean SF NNUE scales the same way. It is still possible this is normal.
lol.
Then I guess CCRL has a lot to explain...

Compare SF9 and SF10... (4CPU)
For a long time SF9 was actually ahead of SF10 on the CCRL 40/4 list.

As Alayan has already said, when they don't share the same opponent pool, things get fuzzy.
Is SF10 only 2 elo stronger than SF9?

http://ccrl.chessdom.com/ccrl/404/cgi/c ... +opponents
