Properties of unbalanced openings using Bayeselo model

Laskos · Post by **Laskos** » Sat Aug 27, 2016 10:31 am

At the risk of boring people, I will post this, just to have it somewhere. The Bayeselo model defines an ELO model, a draw model and a model for the bias (unbalancedness) of the openings:

f(Delta) = 1 / (1 + 10^(Delta/400)),
PW = f(-eloDelta - eloBias + eloDraw),
PB = f(eloDelta + eloBias + eloDraw),
PD = 1 - PW - PB.

Using this model and the 5-nomial variance for a side-and-reversed pair of games on an opening position, one can derive the important properties of the t-value (directly related to the p-value or LOS), defined as (w-l)/sqrt(Variance), where w and l are the win and loss ratios in games between 2 closely matched engines (eloDelta is assumed small, and the limit eloDelta -> 0 is taken at the end). The 5-nomial variance is computed "naively", without the correlations between side-and-reversed openings, which cannot be accounted for in an ELO model. The results are:

1/ For eloDraw above 222.4 ELO points, unbalanced openings with eloBias of the order of eloDraw show a better t-value (sensitivity or resolution) than balanced ones for the same number of games.
2/ The optimal eloBias for sensitivity converges towards eloDraw for large values of eloDraw.
3/ With increasing eloDraw, the sensitivity in the unbalanced case converges to a constant, while in the balanced case it decreases drastically as eloDraw grows. The plot is here:

For unbalanced case we have asymptotically:

t-value = sqrt(Number of Games)*(Elo Difference)*log(10)/(400*sqrt(3))

or

Number of Games = 3*[400*(t-value)/{(Elo Difference)*(log(10))}]^2

It is independent of eloDraw (log here is the natural logarithm). For example, for t-value = 2 (or LOS of 97.5%) and an Elo Difference of 3 Bayeselo points, the expected number of games needed is ~ 40,000.
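As a quick check of the arithmetic, the two asymptotic formulas above can be evaluated directly. This is only a sketch of those formulas as written in this post; the function names `games_needed` and `t_value_for` are made up for illustration, and `log` is taken to be the natural logarithm, which is what reproduces the ~40,000 figure.

```python
import math


def games_needed(t_value: float, elo_diff: float) -> float:
    """Number of games from the asymptotic unbalanced-case formula:
    N = 3 * [400 * t / (elo_diff * ln 10)]^2
    """
    return 3.0 * (400.0 * t_value / (elo_diff * math.log(10))) ** 2


def t_value_for(games: float, elo_diff: float) -> float:
    """Inverse relation: t = sqrt(N) * elo_diff * ln 10 / (400 * sqrt 3)."""
    return math.sqrt(games) * elo_diff * math.log(10) / (400.0 * math.sqrt(3.0))


if __name__ == "__main__":
    n = games_needed(t_value=2.0, elo_diff=3.0)
    print(f"games needed for t=2, 3 Bayeselo points: {n:.0f}")
    print(f"round trip t-value: {t_value_for(n, 3.0):.3f}")
```

Halving the Elo Difference quadruples the number of games, since N goes as 1/(Elo Difference)^2.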

For the balanced case, the sensitivity or t-value decreases with increasing eloDraw, to the point that one cannot measure a 3 Bayeselo points difference with balanced openings for eloDraw above, say, 1000. This will be relevant in the future, when draw rates increase. On the plot I marked where SF testing and the TCEC final stand in terms of eloDraw. It is apparent that SF testing can already be shortened by a factor of 2, and the TCEC match by a factor of 4, for the same resolution (t-value or sensitivity) by using unbalanced openings.

4/ The draw rate for balanced case with increasing eloDraw tends to 100%. For unbalanced case and eloBias=eloDraw = large, the draw rate tends to 50%.
5/ Including correlations between the openings pairwise to 5-nomial variance only seems to accentuate the effect, and unbalanced are slightly even more favored for large eloDraw.
6/ Would be useful if someone checks the results, it was done maybe too quickly.
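For anyone who wants to play with the model, here is a minimal sketch of the three Bayeselo probabilities given at the top of this post, used to look at point 4/ (draw rates for large eloDraw). The function names are hypothetical; only the formulas for f, PW, PB and PD come from the post. It does not compute the 5-nomial variance or the t-value.

```python
def f(delta: float) -> float:
    """Bayeselo logistic: f(Delta) = 1 / (1 + 10^(Delta/400))."""
    return 1.0 / (1.0 + 10.0 ** (delta / 400.0))


def bayeselo_probs(elo_delta: float, elo_bias: float, elo_draw: float):
    """Return (PW, PD, PB) for White, using the model equations of this post."""
    pw = f(-elo_delta - elo_bias + elo_draw)
    pb = f(elo_delta + elo_bias + elo_draw)
    return pw, 1.0 - pw - pb, pb


if __name__ == "__main__":
    # Equal engines (eloDelta = 0): balanced openings vs eloBias = eloDraw.
    for elo_draw in (200, 400, 800, 1600):
        _, draw_balanced, _ = bayeselo_probs(0.0, 0.0, elo_draw)
        _, draw_unbalanced, _ = bayeselo_probs(0.0, elo_draw, elo_draw)
        print(f"eloDraw={elo_draw:5d}  "
              f"balanced draw rate={draw_balanced:.4f}  "
              f"unbalanced draw rate={draw_unbalanced:.4f}")
```

With eloBias = eloDraw and eloDelta = 0, PW is exactly f(0) = 1/2 and PB = f(2*eloDraw) vanishes as eloDraw grows, which is where the 50% draw rate comes from.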

cdani · Post by **cdani** » Sat Aug 27, 2016 1:46 pm

Thanks! (I suppose)

Could you explain this without mathematical terms, if such a thing is possible?

Laskos · Post by **Laskos** » Sat Aug 27, 2016 2:52 pm

cdani wrote:Thanks! (I suppose)
Could you explain this without mathematical terms, if such a thing is possible?

Sure it's possible, I probably should have explained it a bit. Suppose you are testing Andscacs in self-games (like the databases you sent me). Say you are getting about a 75% draw rate with a balanced set of opening positions. This means an "eloDraw" of about 350 (there is a simple formula to compute it from the 75%). And you are getting a result of 6.4+/-3.2 ELO points difference. This means a "t-value" (or sensitivity or resolution) of 6.4/3.2=2.0. If you use an unbalanced set of opening positions (side-and-reversed) with an unbalance of order 100cp (meaning roughly an "eloBias" of 350 ELO points), then the expected resolution will be higher than 2.0; looking at my plot, about 3.0. So the difference between the same Andscacs versions with the unbalanced set of opening positions will be, for example, 9.6+/-3.2 ELO points or 6.0+/-2.0 ELO points. The difference between the engines is more pronounced relative to the error margins, which also means a higher LOS (likelihood of superiority) in the second case.
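The conversion from draw rate to eloDraw is not written out here, so the following is a sketch derived from the model equations in the first post, assuming two equal engines (eloDelta = 0) and balanced openings (eloBias = 0). Then PW = PB = f(eloDraw), so the draw rate is D = 1 - 2*f(eloDraw), which inverts to eloDraw = 400*log10((1+D)/(1-D)). The function names are made up for illustration.

```python
import math


def draw_rate_from_elo_draw(elo_draw: float) -> float:
    """Balanced, equal engines: D = 1 - 2 * f(eloDraw)."""
    return 1.0 - 2.0 / (1.0 + 10.0 ** (elo_draw / 400.0))


def elo_draw_from_draw_rate(draw_rate: float) -> float:
    """Inverse of the above: eloDraw = 400 * log10((1 + D) / (1 - D))."""
    if not 0.0 <= draw_rate < 1.0:
        raise ValueError("draw_rate must be in [0, 1)")
    return 400.0 * math.log10((1.0 + draw_rate) / (1.0 - draw_rate))


if __name__ == "__main__":
    for d in (0.55, 0.60, 0.75, 0.85, 0.90):
        print(f"draw rate {d:.0%} -> eloDraw ~ {elo_draw_from_draw_rate(d):.0f}")
```

For a 75% draw rate this gives 400*log10(7), roughly 338, in line with the "about 350" quoted above.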

It is more important in the following case: with balanced openings you get 2.8+/-3.0 ELO points with a 75% draw rate. With an unbalanced set of positions you will get, say, a 2.6+/-1.9 result or similar, which you could call "conclusive". Also, in the TCEC superfinal or in the future of engine testing, the draw rate for balanced openings will reach 85-90%, or an "eloDraw" of 450-500 ELO points. There the "resolution" is higher by at least a factor of 2 with unbalanced openings, meaning at least 2^2 = 4 times fewer games needed for the same confidence or LOS.

cdani · Post by **cdani** » Sat Aug 27, 2016 3:09 pm

So we should make books of more unbalanced positions to test the engines. I will try. Thanks!

Laskos · Post by **Laskos** » Sat Aug 27, 2016 3:32 pm

cdani wrote:So we should make books of more unbalanced positions to test the engines. I will try. Thanks!

Probably not yet, your draw rates are about 54-56%. For unbalanced opening sets to be beneficial, the draw rate with balanced ones should be above 60%. It becomes really important for draw rates above say 80%, and I bet in 5-10 years we will see such draw rates even in fast self-games. In TCEC it's already important. Also, as you must know better than I do, the set of openings might influence and distort the eval, so some tests with balanced openings will still be necessary as a "sanity of eval check".

cdani · Post by **cdani** » Sat Aug 27, 2016 3:53 pm

Laskos wrote:
cdani wrote:So we should make books of more unbalanced positions to test the engines. I will try. Thanks!
Probably not yet, your draw rates are about 54-56%. For unbalanced opening sets to be beneficial, the draw rate with balanced ones should be above 60%. It becomes really important for draw rates above say 80%, and I bet in 5-10 years we will see such draw rates even in fast self-games. In TCEC it's already important. Also, as you must know better than I do, the set of openings might influence and distort the eval, so some tests with balanced openings will still be necessary as a "sanity of eval check".

Thanks!! Now I understand, and I'm sure others do too.

Michel · Post by **Michel** » Sat Aug 27, 2016 5:50 pm

Kai wrote:Would be useful if someone checks the results, it was done maybe too quickly

Yes, I get a graph similar to yours. It is perhaps good to reiterate that, as you discovered, the variance needs to be computed correctly. Otherwise unbalanced positions become advantageous only for much larger values of draw_elo.

Laskos · Post by **Laskos** » Sun Aug 28, 2016 2:33 pm

Michel wrote:
Kai wrote:Would be useful if someone checks the results, it was done maybe too quickly
Yes, I get a graph similar to yours. It is perhaps good to reiterate that, as you discovered, the variance needs to be computed correctly. Otherwise unbalanced positions become advantageous only for much larger values of draw_elo.

Yes, it should be stressed: the variance (error margins) here is computed correctly for side-and-reversed games, using the 5-nomial for pairs of (w,d,l) outcomes. The correct 5-nomial formulation is the idea of Michel Van den Bergh from this thread http://www.talkchess.com/forum/viewtopic.php?t=61105 , where I observed that the empirical variance in chess matches of side-and-reversed games is significantly smaller than that computed by the trinomial, especially in the unbalanced case. Because of correlations between openings, even the "naive" 5-nomial computation is not perfect.

Therefore: the error margins are not those shown by ELO calculators using trinomials. Hence, direct application of the results shown here is, for interested people, pending some tools to compute the variance. Also, SPRT as used in the SF testing framework may be improved by the new variance and by the "ELO-model-less" (G)SPRT exemplified by Michel in this thread: http://www.talkchess.com/forum/viewtopic.php?t=57465 .

I also computed the same quantities using the Davidson model (ELO, draw, bias models), which is empirically even better validated than Bayeselo (Rao-Kupper). The resolution (t-value) for balanced and unbalanced openings as a function of draw rate is almost identical to that in Bayeselo, although the computations are pretty different.

I will write down some of the analytical premises and results for the Davidson model:

f(δ) = 1 / [1 + 10^(-δ/400)];

d(δ) = ν(f(+δ)f(−δ))^(1/2) ;
P(W|δ) = f(+δ)/(1 + d(δ)) ;
P(L|δ) = f(−δ)/(1 + d(δ)) ;
P(D|δ) = d(δ)/(1 + d(δ)) = ν[P(W|δ)*P(L|δ)]^(1/2);

where ν is a parameter controlling the draw ratio, akin to eloDraw. The equations can be solved analytically and the results are:

1/ Unbalanced positions show a better resolution for draw rate (at eloBias=0) above 2/3, or ν > 4.
2/ Optimal eloBias is given by the equation

eloBias = (800 * log[(ν^2 - 8 + sqrt[64 - 20 ν^2 + ν^4])/(2*ν)])/log[10]

3/ The shape of resolution (t-value) is almost identical to Rao-Kupper (Bayeselo) case.
4/ Needs again to be checked.
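To make checking easier, here is a minimal sketch of the Davidson equations and the optimal-eloBias formula exactly as written above. The function names are hypothetical. Two things are assumptions on my side and not stated in the post: `nu_from_draw_rate` is derived from P(D|0) = ν/(2+ν), which follows from f(0) = 1/2, and the formula for the optimal eloBias is only evaluated for ν >= 4, since the square root is not real for 2 < ν < 4 and point 1/ says balanced openings are better below that threshold. How eloBias enters δ is not spelled out in the post, so `davidson_probs` simply takes δ as its input.

```python
import math


def f(delta: float) -> float:
    """Davidson-section logistic: f(delta) = 1 / (1 + 10^(-delta/400))."""
    return 1.0 / (1.0 + 10.0 ** (-delta / 400.0))


def davidson_probs(delta: float, nu: float):
    """Return (P(W), P(D), P(L)) for rating offset delta and draw parameter nu."""
    fp, fm = f(delta), f(-delta)
    d = nu * math.sqrt(fp * fm)
    return fp / (1.0 + d), d / (1.0 + d), fm / (1.0 + d)


def nu_from_draw_rate(draw_rate: float) -> float:
    """Assumption: derived from P(D|0) = nu / (2 + nu) for equal engines."""
    return 2.0 * draw_rate / (1.0 - draw_rate)


def optimal_elo_bias(nu: float) -> float:
    """Optimal eloBias formula from point 2/, evaluated only for nu >= 4."""
    if nu < 4.0:
        raise ValueError("formula is only evaluated here for nu >= 4")
    disc = max(64.0 - 20.0 * nu**2 + nu**4, 0.0)  # guard against rounding
    return 800.0 * math.log10((nu**2 - 8.0 + math.sqrt(disc)) / (2.0 * nu))


if __name__ == "__main__":
    for draw_rate in (2.0 / 3.0, 0.75, 0.85, 0.90):
        nu = nu_from_draw_rate(draw_rate)
        print(f"draw rate {draw_rate:.3f}  nu={nu:.2f}  "
              f"optimal eloBias={optimal_elo_bias(nu):.1f}")
```

At ν = 4 (draw rate 2/3) the argument of the logarithm is exactly 1, so the optimal eloBias is 0, consistent with the threshold in point 1/.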

Laskos · Post by **Laskos** » Mon Aug 29, 2016 8:07 am

Laskos wrote: 5/ Including correlations between the openings pairwise to 5-nomial variance only seems to accentuate the effect, and unbalanced are slightly even more favored for large eloDraw.

To show this, I tentatively fitted the empirical 5-nomial variance which includes correlations. The fit was done using several databases of games, including those provided by Daniel. The advantage of unbalanced openings for resolution seems to be significantly accentuated by correlations between the openings in a side-and-reversed pair:

The length of SF matches to SPRT stop can be shortened by a factor of two (it goes as resolution^2), and the length of the TCEC superfinal by a factor of at least 4 for the same resolution (or the resolution can be improved by a factor of 2 for the same length of match). It seems TCEC is the first important event already using somewhat unbalanced openings. I would suggest that they use an even higher unbalance, until the draw rate gets close to 50%.

brtzsnr · Post by **brtzsnr** » Thu Sep 01, 2016 12:32 pm

Hi, Kai!

I think most people here are interested in knowing how to implement the stopping rule. Most of the statistics is way beyond me, despite taking a few introductory statistics courses. Would it be possible to provide some code?

I'm looking forward to implementing this in my testing framework and reducing the number of games by 50%. I'm already using your other idea for SPRT (alpha = 3%, beta = 15%), which saves a lot of time on bad patches.

On my end, I could generate a set of many unbalanced openings (e.g. 1 to 5 random moves from the start + 1m eval). Would that help you in any way?
