Opening testing suites efficiency

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Opening testing suites efficiency

Post by Laskos »

If we trust that Normalized ELO is invariant under a doubling of the time control in self-play games, especially at longer time controls, we can build empirical models.

We take, for small "eps" (it is not that small in the case of a doubling, but say we increase the time control by 10% for one opponent in self-play games):
(w,d,l) = (a+eps, 1-2*a, a-eps)

We look at the dominant term in eps.

The model which accounts both for this expansion and for my past empirical evidence is as follows:

Normalized ELO is proportional to 1 + f(t), with f(t) -> 0 as t -> infinity and f(t) increasing as t -> 0 (assumption and evidence)
ELO is proportional to sqrt(a) * (1+f(t))
WiLo is proportional to 1/sqrt(a) * (1+f(t))
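As a sanity check on the dominant term in eps (my own sketch; the helper function below is not from the post), one can evaluate the trinomial normalized Elo numerically and confirm it scales as eps/sqrt(a) to leading order: doubling eps doubles it, and quadrupling a halves it.

```python
import math

def normalized_elo(w, d, l):
    """Trinomial normalized Elo, (score - 0.5) / sigma, for
    (w, d, l) given as probabilities summing to 1."""
    mu = w + d / 2
    var = (w + d / 4) - mu ** 2  # E[x^2] - mu^2, with x in {1, 0.5, 0}
    return (mu - 0.5) / math.sqrt(var)

# With (w,d,l) = (a+eps, 1-2a, a-eps): score = 0.5 + eps and
# variance = a/2 - eps^2, so the dominant term is eps * sqrt(2/a).
a, eps = 0.1, 1e-4
base = normalized_elo(a + eps, 1 - 2 * a, a - eps)
r_eps = normalized_elo(a + 2 * eps, 1 - 2 * a, a - 2 * eps) / base  # doubling eps
r_a = normalized_elo(4 * a + eps, 1 - 8 * a, 4 * a - eps) / base    # quadrupling a
print(round(r_eps, 3), round(r_a, 3))  # 2.0 0.5
```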

Empirically: d = 1 - 2*c/log(t) => a = c/log(t)


Normalized ELO ~ 1 + f(t)
ELO ~ (1+f(t)) / sqrt(log(t))
WiLo ~ (1+f(t)) * sqrt(log(t))

From empirical evidence, take f(t) = 1/(log(t))**2.

Then, up to constants, we have to plot:

Normalized ELO ~ 1 + 1/x**2
ELO ~ (1+1/x**2) / sqrt(x)
WiLo ~ (1+1/x**2) * sqrt(x)

with x ~ log(t+constant)
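Up to constants, the three curves can be sketched numerically (my own illustration, not from the post); note that the WiLo curve (1 + 1/x**2) * sqrt(x) has its minimum at x = sqrt(3) ≈ 1.73, while the other two decrease monotonically:

```python
# The three model curves, with all constants dropped (only shapes matter).
def norm_elo(x): return 1 + 1 / x**2
def elo(x):      return (1 + 1 / x**2) / x**0.5
def wilo(x):     return (1 + 1 / x**2) * x**0.5

xs = [1 + i / 1000 for i in range(4001)]  # grid over x in [1, 5]
x_min = min(xs, key=wilo)                 # locate the WiLo minimum
print(round(x_min, 3))  # 1.732, i.e. x = sqrt(3)
print(round(norm_elo(10), 2), round(elo(10), 3))  # 1.01 0.319
```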

[Image: plot of the three model curves, Normalized ELO, ELO and WiLo, as functions of x]

I also played some rather quick-and-dirty self-play matches at doubled time control between Komodo versions, so as not to overfit to Stockfish. The opening suite was the fairly balanced 3moves_Elo2200.epd, but observe that even in its case the pentanomial model gave 5-6% better results than the trinomial one.

Code: Select all

6s vs 3s
Score of K2 vs K1: 1054 - 112 - 834  [0.736] 2000
ELO difference: 177.66 +/- 11.75
Win/Loss: 9.41
Normalized ELO trinomial: 0.784 +/- 0.044
Normalized ELO pentanomial: 0.803 +/- 0.044

20s vs 10s
Score of K2 vs K1: 232 - 34 - 334  [0.665] 600
ELO difference: 119.11 +/- 18.04
Win/Loss: 6.82
Normalized ELO trinomial: 0.571 +/- 0.080
Normalized ELO pentanomial: 0.609 +/- 0.080

60s vs 30s
Score of K2 vs K1: 195 - 19 - 386  [0.647] 600
ELO difference: 105.00 +/- 15.81
Win/Loss: 10.26
Normalized ELO trinomial: 0.564 +/- 0.080
Normalized ELO pentanomial: 0.603 +/- 0.080
It seems to confirm the model at the point where WiLo has a minimum, and where Normalized ELO starts to stabilize under a doubling of the time control in self-play games. I will soon depart on vacation for a week, and I will leave my home computer running more serious tests of the stability of Normalized ELO at longer time controls.
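The trinomial Normalized ELO figures above can be reproduced from the (W, D, L) counts alone (a sketch of the standard per-game formula; the pentanomial version additionally needs the per-game-pair counts, which are not listed here):

```python
import math

def normalized_elo_trinomial(w, d, l):
    """Per-game normalized Elo: (mean score - 0.5) / standard deviation."""
    n = w + d + l
    mu = (w + d / 2) / n
    var = (w + d / 4) / n - mu ** 2
    return (mu - 0.5) / math.sqrt(var)

# Counts are read from the match output above, printed as W - L - D.
print(round(normalized_elo_trinomial(1054, 834, 112), 3))  # 0.784 (6s vs 3s)
print(round(normalized_elo_trinomial(232, 334, 34), 3))    # 0.571 (20s vs 10s)
print(round(normalized_elo_trinomial(195, 386, 19), 3))    # 0.564 (60s vs 30s)
```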
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: Opening testing suites efficiency

Post by Michel »

I updated the normalized elo document

http://hardy.uhasselt.be/Toga/normalized_elo.pdf

with a proof that normalized elo is a measure of the amount of effort it takes to separate two engines (this is Section 5).

First of all I had to think about the formal meaning of this statement, so I introduced the notion of a context. A context will typically consist of an opening book and a time control, but it may include other things as well, such as contempt settings. One then defines the relative sensitivity of contexts C, D as the ratio of the normalized elos of two engines X, Y with respect to those contexts. The weak dependency hypothesis then says that the relative sensitivity of C, D does not depend strongly on the engines X, Y used to measure it.

Assuming the weak dependency hypothesis, one then shows (Theorem 5.1.4) that the relative expected duration of two SPRTs, using contexts C, D, that have the same power to separate two engines is inversely proportional to the square of the relative sensitivity of C, D.
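A back-of-the-envelope illustration of this scaling (my own sketch, using a fixed-sample approximation rather than an actual SPRT, whose expected duration behaves similarly): the number of games needed to resolve a given per-game normalized Elo scales as its inverse square, so a context that is twice as sensitive needs a quarter of the games.

```python
def games_needed(norm_elo, z_alpha=1.645, z_beta=1.645):
    """Fixed-sample approximation: games needed to detect a per-game
    normalized Elo with 5% type-I error and 95% power."""
    return ((z_alpha + z_beta) / norm_elo) ** 2

# Twice the sensitivity -> a quarter of the games:
ratio = games_needed(0.02) / games_needed(0.01)
print(round(ratio, 6))  # 0.25
```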

Here is a trivial example. There were regression tests of sf9->sf10 using the 2 moves and the 8 moves books. The outcomes were

Code: Select all

W,L,D=9754,3612,26634 # LTC test (sf9->sf10) with 8 moves book.
W,L,D=12041,4583,23376 # LTC test (sf9->sf10) with 2 moves book
A simple computation shows that the relative sensitivity of the 2 moves book versus the 8 moves book is with 95% confidence in the interval

Code: Select all

[1.04375629346, 1.14932517816]
In other words (assuming the weak dependency hypothesis) the reduction in games achievable by using the 2 moves book, without sacrificing power, would be between 8% and 24%.
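The point estimate behind this interval is easy to check (my own sketch; the confidence interval itself requires the error propagation worked out in the document, which is omitted here):

```python
import math

def normalized_elo(w, l, d):
    """Per-game normalized Elo from (wins, losses, draws) counts."""
    n = w + l + d
    mu = (w + d / 2) / n
    var = (w + d / 4) / n - mu ** 2
    return (mu - 0.5) / math.sqrt(var)

ne_8moves = normalized_elo(9754, 3612, 26634)   # LTC test, 8 moves book
ne_2moves = normalized_elo(12041, 4583, 23376)  # LTC test, 2 moves book
sensitivity = ne_2moves / ne_8moves
print(round(sensitivity, 3))               # 1.097, inside the interval
print(round(1 - 1 / sensitivity ** 2, 3))  # 0.168, i.e. roughly a 17% game reduction
```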

There is one possible caveat however in interpreting these results. Since fishtest uses the 2 moves book for testing patches, there may be a form of selection bias going on: patches that work well with the 2 moves book are more likely to make it into master, possibly inflating normalized elo when measured with the 2 moves book.
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.
jorose
Posts: 360
Joined: Thu Jan 22, 2015 3:21 pm
Location: Zurich, Switzerland
Full name: Jonathan Rosenthal

Re: Opening testing suites efficiency

Post by jorose »

Thanks for reviving this thread, I had missed it =)

The links to the opening suites seem dead. Does anybody know where I could find them now?
-Jonathan
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Opening testing suites efficiency

Post by Laskos »

Michel wrote: Tue Jul 04, 2017 3:58 pm I posted an update of the document containing the formula for the pentanomial model. The factors of 2 are confusing but I think I got it right.

http://hardy.uhasselt.be/Toga/normalized_elo.pdf
Michel, I will have a look; I am on vacation now, on my phone, so it is hard to do anything. Also, this formal mathematical language is hard for me to translate into more pragmatic terms.
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: Opening testing suites efficiency

Post by Michel »

Laskos wrote: Sat Jan 05, 2019 9:50 am
Michel wrote: Tue Jul 04, 2017 3:58 pm I posted an update of the document containing the formula for the pentanomial model. The factors of 2 are confusing but I think I got it right.

http://hardy.uhasselt.be/Toga/normalized_elo.pdf
Michel, I will have a look; I am on vacation now, on my phone, so it is hard to do anything. Also, this formal mathematical language is hard for me to translate into more pragmatic terms.
Don't worry. Enjoy your vacation. I am sorry about the somewhat mathematical style of the text (although it is far below the standards of a genuine paper in mathematics or statistics). The thing is that when making statements about the efficiency of tests one has to be quite careful not to compare apples and oranges. Partial statements that certain setups "amplify elo" or "reduce noise", even if true, are insufficient. So I simply tried to be precise about the assumptions and the conclusions - which can be used without reading the proof.
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.