Invariance with time control of rating schemes

Laskos · Post by **Laskos** » Sat Jul 22, 2017 8:12 pm

It can be important to have a rating scheme that is invariant with time control. Engines' ratings could then be compared across rating lists that use different time controls, and the scaling of engines could be inferred directly.

In a post in a thread here:
http://www.talkchess.com/forum/viewtopi ... 8&start=56 ,
I suggested that "Normalized ELO" (Michel's paper http://hardy.uhasselt.be/Toga/normalized_elo.pdf ), basically (score - 1/2)/(sigma*sqrt(N)), is the correct time-control-invariant measure of engine strength. It also has a nice statistical interpretation: its inverse square gives, up to a constant factor, the number of games needed to reach a desired LOS, p-value or SPRT stop. I was on vacation and left my PC running to check the invariance hypothesis, and meanwhile I remembered a very important experiment by Andreas Strangmüller:
http://www.talkchess.com/forum/viewtopic.php?t=61784
So I basically checked his results over a limited span. The test is the following: take a good engine, play self-games at double versus single time control, and measure the difference under each rating scheme, for different time controls. I took Komodo 11.01 and Stockfish dev, with 3000 self-games per match, at 60''+ 0.6'' vs 30''+ 0.3'' and at 300''+ 3'' vs 150''+ 1.5'', to see whether Normalized ELO stays roughly constant and the W/L ratio increases. The opening suite was 3moves_Elo2200.epd. The results are here:

Code:

K 60''+ 0.6'' vs 30''+ 0.3'':
Score of K2 vs K1: 1007 - 127 - 1866  [0.647] 3000
ELO difference: 105.00 +/- 7.34
W/L: 7.93
Normalized ELO: 0.543 +/- 0.0358

K 300''+ 3'' vs 150''+ 1.5'':
Score of K2 vs K1: 703 - 104 - 2193  [0.600] 3000
ELO difference: 70.32 +/- 6.19
W/L: 6.76
Normalized ELO: 0.417 +/- 0.0358



SF 60''+ 0.6'' vs 30''+ 0.3'':
Score of SF2 vs SF1: 890 - 100 - 2010  [0.632] 3000
ELO difference: 93.70 +/- 6.81
W/L: 8.90
Normalized ELO: 0.516 +/- 0.0358

SF 300''+ 3'' vs 150''+ 1.5'':
Score of SF2 vs SF1: 547 - 96 - 2357  [0.575] 3000
ELO difference: 52.63 +/- 5.56
W/L: 5.70
Normalized ELO: 0.343 +/- 0.0358

This is not what I expected. W/L decreases with longer time control instead of increasing, and Normalized ELO decreases even more instead of staying constant. This contradicts the model I built in http://www.talkchess.com/forum/viewtopi ... 8&start=56 .
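The figures in the result blocks above can be recomputed from the raw counts. The following is a minimal sketch, assuming the score lines are in W - L - D order, the ELO difference uses the usual logistic formula, and the variance is the trinomial one; the function name `match_stats` and the dictionary keys are illustrative, not taken from any tool used in the thread.

```python
import math

def match_stats(wins, losses, draws):
    """Summarise one match given as W - L - D counts (losses must be > 0)."""
    n = wins + losses + draws
    w, l, d = wins / n, losses / n, draws / n
    score = w + 0.5 * d
    # Logistic ELO difference implied by the score.
    elo = 400.0 * math.log10(score / (1.0 - score))
    # Trinomial per-game variance; sigma' = sigma * sqrt(N) is its square root.
    var = w * (1.0 - score) ** 2 + d * (0.5 - score) ** 2 + l * score ** 2
    sigma_prime = math.sqrt(var)
    return {
        "score": score,
        "elo": elo,
        "w_over_l": wins / losses,
        "normalized_elo": (score - 0.5) / sigma_prime,
        # 95% margin on Normalized ELO, approximated as 1.96 / sqrt(N).
        "normalized_elo_margin": 1.96 / math.sqrt(n),
    }

# Komodo, 60''+ 0.6'' vs 30''+ 0.3'' (counts from the post above).
for key, value in match_stats(1007, 127, 1866).items():
    print(f"{key}: {value:.4f}")
```

Run on the four matches above, this sketch agrees with the posted ELO difference, W/L and Normalized ELO values to the printed precision, which suggests the trinomial variance is what was used there.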

Laskos · Post by **Laskos** » Sat Jul 22, 2017 8:14 pm

Then I took Andreas' comprehensive test with Komodo 9.3. He has every doubling in time control from 5'' vs 2.5'' to 5120'' vs 2560'', with 3000 games per datapoint, covering basically the entire useful span of time controls. His results and mine needn't match exactly, as I used different hardware and a different opening suite (both opening suites are balanced, so the pentanomial variance is not significantly different from the trinomial one). The plots for Andreas' data (11 datapoints, red line) and mine (2 datapoints for SF, blue line; 2 datapoints for Komodo, green line) for the different proposed rating schemes are here (notation: sigma' = sigma * sqrt(N)):

We see that all the proposed rating schemes (ELO, WiLo, Normalized ELO) deflate with longer time control. That is contrary to my previous guess that W/L increases at reasonably long time controls and that Normalized ELO is pretty stable. I introduced a new, hard-to-interpret ad hoc quantity, (score - 1/2) / sigma'**4, which seems to stay stable over the whole span of time controls (a factor of 1024). I cannot find a statistical interpretation of this quantity.

The caveats of these tests:
1/ I am using self-games
2/ The ELO differences in these doublings of time control are large. It is not clear that the result translates directly to small ELO differences.

For small ELO differences, in Michel's notation, we have the following:

With eps small

(w,d,l)=(a+eps,1-2*a,a-eps)

and we look at the dominant term in eps.

The result is as follows

ELO is proportional to eps
Normalized ELO is proportional to eps/sqrt(a)
WiLo is proportional to eps/a
(score - 1/2) / sigma'**4 is proportional to eps/a**2
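The four proportionalities just listed can be checked by expanding each quantity to first order in eps. The sketch below assumes the logistic ELO formula, the trinomial variance for sigma', and WiLo taken as the logarithm of W/L up to a constant factor (that last normalization is an assumption here).

```latex
\begin{align*}
s - \tfrac12 &= w + \tfrac{d}{2} - \tfrac12 = \varepsilon, \\
\sigma'^2 &= w(1-s)^2 + d\left(\tfrac12 - s\right)^2 + l\,s^2
           = \tfrac{a}{2} - \varepsilon^2 \approx \tfrac{a}{2}, \\
\text{ELO} &= 400\log_{10}\frac{s}{1-s} \approx \frac{1600}{\ln 10}\,\varepsilon, \\
\text{Normalized ELO} &= \frac{s-\tfrac12}{\sigma'} \approx \sqrt{2}\,\frac{\varepsilon}{\sqrt{a}}, \\
\text{WiLo} &\propto \ln\frac{w}{l} = \ln\frac{a+\varepsilon}{a-\varepsilon} \approx \frac{2\,\varepsilon}{a}, \\
\frac{s-\tfrac12}{\sigma'^4} &\approx \frac{4\,\varepsilon}{a^2}.
\end{align*}
```

The variance line is exact before the approximation: the terms linear in eps cancel, so the correction to a/2 is only of second order.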

It seems here that eps/a**2 stays constant. Then

ELO, WiLo, Normalized ELO all go down with TC
(score - 1/2) / sigma'**4 stays constant

The Davidson model has eps/a behavior like WiLo; Bayeselo (Rao-Kupper) has eps/(a*(1-a)).

Even if the results hold over a large span of strength differences, (score - 1/2) / sigma'**4 cannot yet be considered a rating scheme until we verify that for 3 engines dif(1,2) + dif(2,3) = dif(1,3). That is also testable.
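The additivity test could be set up along these lines. This is a minimal sketch, assuming the trinomial variance and W - L - D counts for each of the three pairings; `quartic` and `additivity_gap` are hypothetical helper names, and the counts in the usage lines are invented placeholders, not measured results.

```python
def quartic(wins, losses, draws):
    """The ad hoc quantity (score - 1/2) / sigma'**4 for one match."""
    n = wins + losses + draws
    w, l, d = wins / n, losses / n, draws / n
    score = w + 0.5 * d
    # Trinomial per-game variance, i.e. sigma'**2; sigma'**4 is its square.
    var = w * (1.0 - score) ** 2 + d * (0.5 - score) ** 2 + l * score ** 2
    return (score - 0.5) / var ** 2

def additivity_gap(match_12, match_23, match_13, measure=quartic):
    """dif(1,2) + dif(2,3) - dif(1,3); zero for a perfectly additive scheme."""
    return measure(*match_12) + measure(*match_23) - measure(*match_13)

# Placeholder counts only, to show the call shape (engine 1 vs 2, 2 vs 3, 1 vs 3).
gap = additivity_gap((400, 300, 300), (380, 310, 310), (450, 250, 300))
print(f"additivity gap: {gap:+.3f}")
```

With real data the gap would have to be judged against its error margin, since each of the three terms carries its own statistical noise.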

Laskos · Post by **Laskos** » Sat Jul 22, 2017 8:16 pm

From the FGRL rating list of April 2017 (more excellent work by Andreas), I computed the Top 10 ratings with the ad hoc rating scheme (score - 1/2) / sigma'**4. The results are not that different from the normal ELO ratings; Komodo seems a bit disadvantaged by its Contempt factor, which harms it a bit here:

Code:

60''+ 0.6''

 #    Name               (s-1/2) / sigma'**4

 1. Stockfish 8                35.4
 2. Houdini 5.01               29.5
 3. Komodo 10.4                22.5
 4. Shredder 13                 0.2
 5. Fire 5                     -1.2
 6. Fizbo 1.9                  -2.9
 7. Gull 3                     -5.4
 8. Andscacs 0.90             -10.1
 9. Fritz 15                  -10.6
10. Chiron 4                  -11.5



60'+ 15''

 #    Name               (s-1/2) / sigma'**4

 1. Stockfish 8                37.8
 2. Komodo 10.4                28.8
 3. Houdini 5.01               28.4
 4. Shredder 13                 1.8
 5. Fire 5                     -0.8
 6. Fizbo 1.9                  -7.6
 7. Gull 3                     -7.7
 8. Andscacs 0.90              -7.8
 9. Chiron 4                  -16.0
10. Fritz 15                  -19.8

These ratings are hopefully directly comparable, despite being separated by a factor of 60 in time control. That's the purpose of a time-control-invariant rating scheme.

kbhearn · Post by **kbhearn** » Sat Jul 22, 2017 9:18 pm

Perhaps the noted decrease of W/L with time control is related to the opening selections - i.e. that any openings that have advantages (whether the engines realise they're there or not) are more likely to be converted by both engines at longer time control since the weaker engine can scale to the point where it can better utilise that advantage? What happens if you discard the wins where the same color won both sides of an opening position?

Laskos · Post by **Laskos** » Sat Jul 22, 2017 9:33 pm

kbhearn wrote:Perhaps the noted decrease of W/L with time control is related to the opening selections - i.e. that any openings that have advantages (whether the engines realise they're there or not) are more likely to be converted by both engines at longer time control since the weaker engine can scale to the point where it can better utilise that advantage? What happens if you discard the wins where the same color won both sides of an opening position?

I also thought of that, but I used pentanomial too, which takes care of that, and the difference in variance was 2-5% across the time controls compared to trinomial variance, not really significant and without a clear trend to longer time controls. I specifically asked Andreas for PGNs to check that. We both use sufficiently balanced openings (note I avoided 2moves_v1.epd which contains quite a lot of unbalanced positions), and our results seem congruent.

Michel · Post by **Michel** » Sat Jul 22, 2017 10:57 pm

Thanks,

So your experiments show that for every reasonable measure ultimately the difference in strength between engines decreases with increasing time control. It seems that in this case common wisdom is correct!

Since the decrease also happens for normalized elo this means that engines become objectively(!) harder to separate experimentally (for a given level of significance) with increasing time control.

Laskos · Post by **Laskos** » Sun Jul 23, 2017 10:18 am

Laskos wrote:
kbhearn wrote:Perhaps the noted decrease of W/L with time control is related to the opening selections - i.e. that any openings that have advantages (whether the engines realise they're there or not) are more likely to be converted by both engines at longer time control since the weaker engine can scale to the point where it can better utilise that advantage? What happens if you discard the wins where the same color won both sides of an opening position?
I also thought of that, but I used pentanomial too, which takes care of that, and the difference in variance was 2-5% across the time controls compared to trinomial variance, not really significant and without a clear trend to longer time controls. I specifically asked Andreas for PGNs to check that. We both use sufficiently balanced openings (note I avoided 2moves_v1.epd which contains quite a lot of unbalanced positions), and our results seem congruent.

I addressed the issue of pentanomial versus trinomial variance. Pentanomial takes care of unbalanced positions, whether or not the engines realize that an opening is decided. Nothing special happens at LTC. I plotted here the pentanomial (correct) and trinomial Normalized ELO from Andreas' PGNs, kindly provided to me.

As you see, the differences are minimal, meaning that Andreas used a fair, balanced set of opening positions.
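For readers who want to repeat the pentanomial versus trinomial comparison, here is a minimal sketch. It assumes games are played in pairs (same opening, colors reversed), that a pair's total score is one of 0, 0.5, 1, 1.5 or 2, and that the per-game variance is twice the variance of the pair's average score; the function names are hypothetical.

```python
import math

def nelo_trinomial(wins, draws, losses):
    """Normalized ELO from single-game frequencies (counts or probabilities)."""
    total = wins + draws + losses
    w, d, l = wins / total, draws / total, losses / total
    score = w + 0.5 * d
    var = w * (1.0 - score) ** 2 + d * (0.5 - score) ** 2 + l * score ** 2
    return (score - 0.5) / math.sqrt(var)

def nelo_pentanomial(pair_freqs):
    """Normalized ELO from pair frequencies for totals 0, 0.5, 1, 1.5, 2."""
    total = sum(pair_freqs)
    probs = [f / total for f in pair_freqs]
    # Average score per game within a pair.
    pair_avg = [0.0, 0.25, 0.5, 0.75, 1.0]
    score = sum(p * x for p, x in zip(probs, pair_avg))
    var_pair = sum(p * (x - score) ** 2 for p, x in zip(probs, pair_avg))
    # A pair consumes two games, so the equivalent per-game variance is doubled.
    sigma_prime = math.sqrt(2.0 * var_pair)
    return (score - 0.5) / sigma_prime

# If the two games of a pair were independent with (w, d, l) = (0.3, 0.6, 0.1),
# the pair distribution would be the product below and both values coincide.
independent_pairs = [0.01, 0.12, 0.42, 0.36, 0.09]
print(nelo_trinomial(0.3, 0.6, 0.1), nelo_pentanomial(independent_pairs))
```

Unbalanced openings correlate the two games of a pair, which is where the two values start to differ; a small difference is consistent with a balanced opening set.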

Laskos · Post by **Laskos** » Sun Jul 23, 2017 10:48 am

Michel wrote:Thanks,

So your experiments show that for every reasonable measure ultimately the difference in strength between engines decreases with increasing time control. It seems that in this case common wisdom is correct!

Since the decrease also happens for normalized elo this means that engines become objectively(!) harder to separate experimentally (for a given level of significance) with increasing time control.

Yes, that's true. Normalized ELO gives the power of separation in number of games necessary for needed statistical significance. And this power of separation decreases with longer time control.

Still, the seemingly constant empirical quantity with time control (s-1/2)/sigma'**4 can be used to compare rating lists and scaling at different time controls, although it doesn't seem to have a reasonable interpretation. I checked it for error margins versus magnitude:

SF8 - Houdini5: TCEC (100 games):
Normalized ELO (pentanomial): 0.216 +/- 0.196 (95% confidence)
(s - 1/2) / sigma'**4 (pentanomial): 23.84 +/- 21.6 (95% confidence)

SF8 - Houdini5: FGRL 60min + 15sec (150 games):
Normalized ELO (pentanomial): 0.044 +/- 0.160 (95% confidence)
(s - 1/2) / sigma'**4 (pentanomial): 3.8 +/- 13.7 (95% confidence)

I used pentanomial because TCEC openings are often unbalanced. Both error margins scale as 1/sqrt(N), so if central values stay the same with more games, the magnitude of difference between rating schemes shows (s-1/2)/sigma'**4 as magnifying the difference. Here, although within 1.96*sigma confidence, SF8 seems to scale better than Houdini5 to TCEC conditions.
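The error margins quoted above follow a simple pattern that can be sketched in code. This assumes the 95% margin on Normalized ELO is 1.96 / sqrt(N), which matches the 0.196 and 0.160 figures for 100 and 150 games, and that the margin on (s - 1/2) / sigma'**4 is obtained by dividing by sigma'**3 with sigma' treated as known; the function names are illustrative, and the game-count estimate ignores sequential testing.

```python
import math

Z_95 = 1.96  # two-sided 95% normal quantile

def nelo_margin(n_games, z=Z_95):
    """Approximate 95% margin on Normalized ELO after n_games."""
    return z / math.sqrt(n_games)

def games_for_significance(nelo, z=Z_95):
    """Games after which the margin has shrunk to the size of nelo itself."""
    return math.ceil((z / nelo) ** 2)

def quartic_margin(n_games, sigma_prime, z=Z_95):
    """Margin on (s - 1/2) / sigma'**4, treating sigma' as known (assumption)."""
    return nelo_margin(n_games, z) / sigma_prime ** 3

print(nelo_margin(100), nelo_margin(150))
print(games_for_significance(0.216))
```

Because both margins shrink as 1/sqrt(N), the ratio of central value to margin is the same for the two quantities under this approximation.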

Laskos · Post by **Laskos** » Sun Jul 23, 2017 11:25 am

Laskos wrote:
Michel wrote:Thanks,

So your experiments show that for every reasonable measure ultimately the difference in strength between engines decreases with increasing time control. It seems that in this case common wisdom is correct!

Since the decrease also happens for normalized elo this means that engines become objectively(!) harder to separate experimentally (for a given level of significance) with increasing time control.
Yes, that's true. Normalized ELO gives the power of separation in number of games necessary for needed statistical significance. And this power of separation decreases with longer time control.

Still, the seemingly constant empirical quantity with time control (s-1/2)/sigma'**4 can be used to compare rating lists and scaling at different time controls, although it doesn't seem to have a reasonable interpretation. I checked it for error margins versus magnitude:

SF8 - Houdini5: TCEC (100 games):
Normalized ELO (pentanomial): 0.216 +/- 0.196 (95% confidence)
(s - 1/2) / sigma'**4 (pentanomial): 23.84 +/- 21.6 (95% confidence)

SF8 - Houdini5: FGRL 60min + 15sec (150 games):
Normalized ELO (pentanomial): 0.044 +/- 0.160 (95% confidence)
(s - 1/2) / sigma'**4 (pentanomial): 3.8 +/- 13.7 (95% confidence)

I used pentanomial because TCEC openings are often unbalanced. Both error margins scale as 1/sqrt(N), so if central values stay the same with more games, the magnitude of difference between rating schemes shows (s-1/2)/sigma'**4 as magnifying the difference. Here, although within 1.96*sigma confidence, SF8 seems to scale better than Houdini5 to TCEC conditions.

Another issue:
If it is true that eps/a**2 stays constant across time controls for a doubling in time control, then the Normalized ELO gain from a doubling (which gives the power of separation) scales as a**(3/2), and at the higher draw rates (lower "a") of the future it diminishes quite drastically. So time odds will not be a very feasible way to separate engines in the future. I got a similar result when discussing "Drawish Chess" (Endgame Chess) here, where I saw very meager improvement for time odds and reversed. It seems that the only solution in the future is using unbalanced positions and pentanomial, with the unbalance chosen such that the draw rate is 50%. That gives a significant boost in sensitivity (Normalized ELO) for high draw rates.
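The a**(3/2) scaling can be illustrated numerically. This is a minimal sketch, assuming (w, d, l) = (a + eps, 1 - 2*a, a - eps) with eps = c * a**2 for a constant c; the value of c is a hypothetical placeholder, and the variance a/2 - eps**2 is what the trinomial formula reduces to for this parametrization.

```python
import math

def nelo_at_constant_ratio(a, c):
    """Normalized ELO when eps / a**2 is held at the constant c."""
    eps = c * a ** 2
    var = a / 2.0 - eps ** 2  # trinomial variance for (a + eps, 1 - 2a, a - eps)
    return eps / math.sqrt(var)

c = 0.5  # hypothetical constant, chosen only for illustration
for a in (0.20, 0.10, 0.05):
    nelo = nelo_at_constant_ratio(a, c)
    draw_rate = 1.0 - 2.0 * a
    # Games needed for a fixed significance scale as 1 / nelo**2.
    print(f"draw rate {draw_rate:.0%}: Normalized ELO {nelo:.4f}, "
          f"relative games {1.0 / nelo ** 2:.0f}")
```

Going from a = 0.20 to a = 0.05 (a 60% to a 90% draw rate) cuts Normalized ELO by about a factor of 8 in this sketch, so the number of games needed grows by roughly a factor of 64.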

BeyondCritics · Post by **BeyondCritics** » Sun Jul 23, 2017 7:08 pm

So what is the gist of that?
The highly important question is: What is the best time control to test engine differences? Think of a developer with a machine or two, who wants to test the effect of some minor changes as efficiently as possible.
Can you give advice?
