Quick question: are the types of game results in a test important for the outcome?
Or, explained with an example: let's suppose two tests of an engine, A'' being a mod of A':
a) Version A': 20,000 games. Score: 10,000 points: 5,000 wins and 10,000 draws.
b) Version A'': 20,000 games. Score: 10,000 points: 20,000 draws.
Of course this example is exaggerated, but it is valid for orientation.
In terms of results, both are the same. We could say the change from A' to A'' does not gain any Elo points. But from a quality standpoint, the change in A'' makes the playing style more conservative.
So my question is: supposing we want to look at the result types (wins, draws, losses), at what margin or difference can we say one engine plays more conservatively than another version? (E.g. if A'' scores 5,001 wins and 9,998 draws it means little, but in case b) it is very clear the playing style has changed a lot.)
Or is this a matter of non-mathematical, subjective perception? Should result types be important when deciding whether to accept a change? How?
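One way to put a number on "more conservative" (a sketch of my own, not something established in engine testing): treat the draw counts of the two versions as binomial samples and run a two-proportion z-test on the draw fraction. The counts below are the hypothetical ones from the example above.

```python
import math

def draw_rate_z(draws_a, games_a, draws_b, games_b):
    """Two-proportion z-test: is B's draw fraction different from A's?"""
    pa = draws_a / games_a
    pb = draws_b / games_b
    p = (draws_a + draws_b) / (games_a + games_b)  # pooled draw rate
    se = math.sqrt(p * (1 - p) * (1 / games_a + 1 / games_b))
    return (pb - pa) / se

# The example above: A' has 10k draws in 20k games, A'' has 20k draws in 20k.
z = draw_rate_z(10_000, 20_000, 20_000, 20_000)
print(round(z, 1))  # a huge z-score: the style change is unmistakable

# The marginal case: 9,998 draws vs 10,000 draws.
z2 = draw_rate_z(10_000, 20_000, 9_998, 20_000)
print(round(z2, 2))  # well inside +/-2, i.e. no detectable style change
```

A |z| above roughly 2 would mean the draw rates differ beyond sampling noise, which gives at least a non-subjective threshold for "the style changed".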
Are all tests the same?
-
Kempelen
- Posts: 620
- Joined: Fri Feb 08, 2008 10:44 am
- Location: Madrid - Spain
-
kbhearn
- Posts: 411
- Joined: Thu Dec 30, 2010 4:48 am
Re: Are all tests the same?
You could optimise performance on a 3-1-0 scale instead of 1-0.5-0 if you wanted to bias towards taking chances, but I'd think including opposition a bit weaker than your engine would suffice (against which you quite clearly need wins to optimise Elo).
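A quick sketch of what the 3-1-0 scale does (the W/D/L splits are the hypothetical ones from the opening post, not real data): a win is worth three draws instead of two, so the all-draw version scores strictly worse.

```python
def score(wins, draws, losses, win_pts=1.0, draw_pts=0.5, loss_pts=0.0):
    """Total score of a W/D/L result under a given point scale."""
    return wins * win_pts + draws * draw_pts + losses * loss_pts

# A': 5k wins / 10k draws / 5k losses;  A'': 0 / 20k / 0.
classic_a  = score(5_000, 10_000, 5_000)            # 1 / 0.5 / 0 scale
classic_b  = score(0, 20_000, 0)
football_a = score(5_000, 10_000, 5_000, 3, 1, 0)   # 3 / 1 / 0 scale
football_b = score(0, 20_000, 0, 3, 1, 0)

print(classic_a, classic_b)    # 10000.0 10000.0 -> identical under 1-0.5-0
print(football_a, football_b)  # 25000 20000    -> A' comes out ahead under 3-1-0
```

So tuning against a 3-1-0 objective is one concrete way to make the tester prefer the risk-taking version even when the classic scores tie.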
-
hgm
- Posts: 28419
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Are all tests the same?
This is a very good question, and AFAIK the answer is completely unknown. BayesElo effectively double-counts draws, so that a 0+/20k=/0- result produces a significantly smaller error bar in the rating than a 5k+/10k=/5k- result. Part of this is due to the increase of the 'effective number' of games (40k vs. 30k, i.e. 33% more, which in itself is good for a 15% reduction of the error bar).
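The double-counting arithmetic can be sketched directly (assuming, as above, that the error bar scales as 1/sqrt(effective games)):

```python
import math

def effective_games(wins, draws, losses):
    """BayesElo effectively counts each draw as two games."""
    return wins + losses + 2 * draws

n_all_draws = effective_games(0, 20_000, 0)          # 0+/20k=/0-
n_mixed     = effective_games(5_000, 10_000, 5_000)  # 5k+/10k=/5k-

print(n_all_draws, n_mixed)      # 40000 30000
print(n_all_draws / n_mixed - 1) # ~0.33: 33% more effective games
# The mixed result's error bar is sqrt(40k/30k) ~ 1.155x larger
# from the game count alone.
print(math.sqrt(n_all_draws / n_mixed))
```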
But also the estimated standard deviation gets lower if the draw percentage increases. In fact for the extreme 0+/20k=/0- case, the SD drops to zero, and there will be no error at all. This is a bit of a pathological case, but of course it indeed suggests that this engine can only draw against these opponents. (I mean, if there were another 20k games to be played, and you could bet on the result of those being again 0+/20k=/0-, or against it, how would you bet?)
Of more practical relevance would be to compare 5k+/10k=/5k- to 10k+/0=/10k-. The SD is proportional to sqrt(score*(1-score)-drawFraction/4), which for 50% draws would be 1/sqrt(8), and for 0% draws 1/2. I.e. the draws cause an increase of the effective number of games from 20k to 30k, reducing the overall SD by a factor sqrt(3/2), while the drop in single-game SD is a factor sqrt(2). So in the end the error bar on the rating (as reported by BayesElo) will be sqrt(3) = 1.73 times smaller for the result with draws.
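Those numbers can be checked directly; this is just a numerical sketch of the formula quoted above, not anything beyond it:

```python
import math

def single_game_sd(score, draw_fraction):
    """Per-game standard deviation of the result: sqrt(s*(1-s) - d/4)."""
    return math.sqrt(score * (1 - score) - draw_fraction / 4)

sd_draws    = single_game_sd(0.5, 0.5)  # 5k+/10k=/5k-: equals 1/sqrt(8)
sd_decisive = single_game_sd(0.5, 0.0)  # 10k+/0=/10k-: equals 1/2

# Rating error bar ~ SD / sqrt(effective games); draws double-count,
# so the effective N is 30k for the drawish result vs 20k for the decisive one.
bar_draws    = sd_draws / math.sqrt(30_000)
bar_decisive = sd_decisive / math.sqrt(20_000)

print(bar_decisive / bar_draws)  # sqrt(3) ~ 1.73
```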
Unfortunately the idea that draws are more significant than wins/losses is model-dependent, and no one ever verified the validity of the Elo model for BayesElo. The reduction of the single-game SD is a fundamental statistical effect, though.
-
Kempelen
- Posts: 620
- Joined: Fri Feb 08, 2008 10:44 am
- Location: Madrid - Spain
Re: Are all tests the same?
What I was thinking about is not the error bar, but whether in the long term it would be better to tune with win-oriented results or with draw-oriented ones. At present programmers only look at the total score or Elo gain, without looking at the result-type statistics. Or are you trying to say that the two are related?

hgm wrote:
This is a very good question, and AFAIK the answer is completely unknown. BayesElo effectively double-counts draws, so that a 0+/20k=/0- result produces a significantly smaller error bar in the rating than a 5k+/10k=/5k- result. Part of this is due to the increase of the 'effective number' of games (40k vs. 30k, i.e. 33% more, which in itself is good for a 15% reduction of the error bar).
But also the estimated standard deviation gets lower if the draw percentage increases. In fact for the extreme 0+/20k=/0- case, the SD drops to zero, and there will be no error at all. This is a bit of a pathological case, but of course it indeed suggests that this engine can only draw against these opponents. (I mean, if there were another 20k games to be played, and you could bet on the result of those being again 0+/20k=/0-, or against it, how would you bet?)
Of more practical relevance would be to compare 5k+/10k=/5k- to 10k+/0=/10k-. The SD is proportional to sqrt(score*(1-score)-drawFraction/4), which for 50% draws would be 1/sqrt(8), and for 0% draws 1/2. I.e. the draws cause an increase of the effective number of games from 20k to 30k, reducing the overall SD by a factor sqrt(3/2), while the drop in single-game SD is a factor sqrt(2). So in the end the error bar on the rating (as reported by BayesElo) will be sqrt(3) = 1.73 times smaller for the result with draws.
Unfortunately the idea that draws are more significant than wins/losses is model-dependent, and no one ever verified the validity of the Elo model for BayesElo. The reduction of the single-game SD is a fundamental statistical effect, though.