LOS calculation: Does the same result is always the same?

bob · Post by **bob** » Mon Oct 01, 2012 5:57 pm

Uri Blass wrote:
hgm wrote:The only ways that results for a given number of games could be more 'volatile' (i.e. have larger standard deviation) is if they had a significantly lower draw rate, or if the results of individual games are somehow dependent. (So that N games do not really count as N games, but, say, as N/2 pairs of games, where the result of game 1 from a pair would predict the result of game 2.)

The latter is hard to imagine, and points to a severe flaw in testing methodology. It means there should be some memory from one game to the next, so that games could influence each other. This could lead to 'winning streaks' and 'losing streaks'. A way to check that would be to Fourier-transform the result sequence of a test, to see if the power spectrum looks like white noise, or decays at high frequencies.
If you play a match from the same position with opposite colors then it is obvious that results of individual games are dependent and I expect the correlation to be higher at long time control.

I think that long time control has more draws so it makes sense that 10 elo difference at long time control with the same number of games is more
significant in the meaning that you can have more confidence that it is an improvement.

An extreme example

20-0 and 980 draws is clearly a significant result
510-490 with no draws is not a significant result.

Both results have the same estimate for the elo difference.

I don't follow your reasoning. In either case, one program won 20 games more than the other, out of 1,000 games total. why is a win and a loss worse than 2 draws?

abulmo · Post by **abulmo** » Mon Oct 01, 2012 6:42 pm

bob wrote:
Uri Blass wrote: An extreme example

20-0 and 980 draws is clearly a significant result
510-490 with no draws is not a significant result.

Both results have the same estimate for the elo difference.
I don't follow your reasoning. In either case, one program won 20 games more than the other, out of 1,000 games total. why is a win and a loss worse than 2 draws?

If we make the assumption that the theoretical outcome of the game is a draw, a perfect player can get the first result, not the second. In other words, a strong player making perfect moves can draw against any other player, whatever his strength is, if this one also accidentally plays perfectly. In that case, a draw has a value close to a win, and a loss is much worse, IMHO.

Ajedrecista · Post by **Ajedrecista** » Mon Oct 01, 2012 6:50 pm

Hello Bob:

bob wrote:
Uri Blass wrote:
hgm wrote:The only ways that results for a given number of games could be more 'volatile' (i.e. have larger standard deviation) is if they had a significantly lower draw rate, or if the results of individual games are somehow dependent. (So that N games do not really count as N games, but, say, as N/2 pairs of games, where the result of game 1 from a pair would predict the result of game 2.)

The latter is hard to imagine, and points to a severe flaw in testing methodology. It means there should be some memory from one game to the next, so that games could influence each other. This could lead to 'winning streaks' and 'losing streaks'. A way to check that would be to Fourier-transform the result sequence of a test, to see if the power spectrum looks like white noise, or decays at high frequencies.
If you play a match from the same position with opposite colors then it is obvious that results of individual games are dependent and I expect the correlation to be higher at long time control.

I think that long time control has more draws so it makes sense that 10 elo difference at long time control with the same number of games is more
significant in the meaning that you can have more confidence that it is an improvement.

An extreme example

20-0 and 980 draws is clearly a significant result
510-490 with no draws is not a significant result.

Both results have the same estimate for the elo difference.
I don't follow your reasoning. In either case, one program won 20 games more than the other, out of 1,000 games total. why is a win and a loss worse than 2 draws?

I agree with Uri. I am not gifted in Statistics but I calculate LOS in two different ways: both of them give very similar results when the number of games is large enough.

The first way (my own way) is intuitive: I calculate the mean and the standard deviation from the numbers of wins, draws and losses:

Code:

(Number of games) = n = wins + draws + loses.
(Draw ratio) = D = draws/n

Mean = mu = (wins + draws/2)/n
1 - mu = (draws/2 + loses)/n

(Standard deviation) = sigma = sqrt{[mu*(1 - mu) - D/4]/n}

The approximation of mu and sigma from a trinomial distribution is explained in section 3.2 of this paper.

Then I compute k = (mu - 0.5)/sigma, which comes from solving mu - k*sigma = 0.5. In other words, k is the number of standard deviations by which the mean exceeds 50%: for any z < k, mu - z*sigma is still above 0.5. With my approach, LOS is the integral of the standard normal density from -infinity to z = k. As you can see, a higher draw ratio means a lower standard deviation, so for a given mean, k must be larger to keep k*sigma = mu - 0.5.
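A minimal Python sketch of this first method, following the formulas exactly as written in this post. It is not the Fortran 95 programme mentioned in the thread; the function name `los_with_draws` is made up for the illustration, and the normal CDF is taken from the standard library.

```python
from math import sqrt
from statistics import NormalDist


def los_with_draws(wins, losses, draws):
    """LOS from the normal approximation that takes draws into account.

    Hypothetical helper: mu, sigma and k are defined as in the post.
    """
    n = wins + draws + losses
    draw_ratio = draws / n
    mu = (wins + draws / 2) / n
    # Variance of the mean score; draws shrink it through the D/4 term.
    variance = (mu * (1 - mu) - draw_ratio / 4) / n
    if variance <= 0:
        raise ValueError("degenerate result: all games ended the same way")
    sigma = sqrt(variance)
    k = (mu - 0.5) / sigma          # solves mu - k*sigma = 0.5
    return NormalDist().cdf(k)      # integral from -infinity to k


# Uri's two extreme examples: same score, very different draw ratio.
for w, l, d in [(20, 0, 980), (510, 490, 0)]:
    print(f"+{w} -{l} ={d}: LOS = {100 * los_with_draws(w, l, d):.2f}%")
```

The two calls differ only in the draw ratio, so any difference in the printed LOS comes from the D/4 term in the variance.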

------------------------

The second way for calculating LOS is described by Rémi in the last equation of this post.

In Uri's example, {wins, loses} = {20, 0} and {510, 490}, which are clearly different even though wins - loses = 20 in both cases, which gives the same score because n = 1000 in both cases.

If I run my own Fortran 95 programme, I get these results for Uri's example:

Code:

LOS values are rounded up to 0.01%:

+20 -0 =980.

    My LOS ~ 100%
Rémi's LOS ~ 100%
-------------------

+510 -490 =0.

    My LOS ~ 73.65%
Rémi's LOS ~ 73.64%

I understand that the probability of being wrong in assumptions with a given LOS value is min.(LOS, 1 - LOS): this probability is near zero in the first case, while it is more than 26% in the second case (this is the reason why Uri said a significant result and a not significant result). Please keep in mind that I have shown you two models for calculating LOS and nothing more: they are only models.
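The second model is not reproduced in this thread. The sketch below assumes it is the commonly quoted approximation LOS = 0.5*[1 + erf((wins - losses)/sqrt(2*(wins + losses)))], which ignores draws entirely. Whether that is exactly the equation in the linked post is an assumption, and `los_ignoring_draws` is a hypothetical name.

```python
from math import erf, sqrt


def los_ignoring_draws(wins, losses):
    """LOS from decisive games only (assumed form of the second method)."""
    decisive = wins + losses
    if decisive == 0:
        return 0.5  # no decisive games: no evidence either way
    return 0.5 * (1 + erf((wins - losses) / sqrt(2 * decisive)))


# Uri's two examples again; draws play no role in this model.
for w, l in [(20, 0), (510, 490)]:
    print(f"+{w} -{l}: LOS = {100 * los_ignoring_draws(w, l):.2f}%")
```

Under that assumption the two printed values should land close to the figures quoted above (about 100% and about 73.64%).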

bob wrote:I don't follow your reasoning. In either case, one program won 20 games more than the other, out of 1,000 games total. why is a win and a loss worse than 2 draws?

I do not know an exact answer for your question but you will see that it is true when you plug the numbers into the equations... I know it is a poor answer and maybe someone can answer better than me. You can see that my method is dependent on draw ratio, as I explained before, so it could be a reason.

Of course you can divide by (n - 1) instead of by n when computing the standard deviation, but the difference is negligible for n >> 1.

I hope that you follow my reasoning.

Regards from Spain.

Ajedrecista.

Uri Blass · Post by **Uri Blass** » Mon Oct 01, 2012 6:51 pm

bob wrote:
Uri Blass wrote:
hgm wrote:The only ways that results for a given number of games could be more 'volatile' (i.e. have larger standard deviation) is if they had a significantly lower draw rate, or if the results of individual games are somehow dependent. (So that N games do not really count as N games, but, say, as N/2 pairs of games, where the result of game 1 from a pair would predict the result of game 2.)

The latter is hard to imagine, and points to a severe flaw in testing methodology. It means there should be some memory from one game to the next, so that games could influence each other. This could lead to 'winning streaks' and 'losing streaks'. A way to check that would be to Fourier-transform the result sequence of a test, to see if the power spectrum looks like white noise, or decays at high frequencies.
If you play a match from the same position with opposite colors then it is obvious that results of individual games are dependent and I expect the correlation to be higher at long time control.

I think that long time control has more draws so it makes sense that 10 elo difference at long time control with the same number of games is more
significant in the meaning that you can have more confidence that it is an improvement.

An extreme example

20-0 and 980 draws is clearly a significant result
510-490 with no draws is not a significant result.

Both results have the same estimate for the elo difference.
I don't follow your reasoning. In either case, one program won 20 games more than the other, out of 1,000 games total. why is a win and a loss worse than 2 draws?

Because the probability for draw is different in both cases.

If you flip a coin 1000 times
it is often going to fall more than 510 times on the same side so you cannot say that the coin is not fair because it fell 510 times on the same side.

The case with 980 draws is different because 20-0 with no draws is significant and
I can also say that the winner won 20-0 if I define a game that draw means to play another game until somebody wins.
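The coin-flip argument can be checked with an exact binomial tail. This is a small Python sketch, not something posted in the thread; `fair_coin_tail` is a made-up helper name.

```python
from math import comb


def fair_coin_tail(k, n):
    """Exact P(X >= k) for n flips of a fair coin."""
    # Integer arithmetic throughout; only the final division produces a float.
    return sum(comb(n, i) for i in range(k, n + 1)) / 2**n


# 510 or more heads out of 1000 flips: happens in a sizeable share of trials.
print(f"P(at least 510 of 1000) = {fair_coin_tail(510, 1000):.4f}")

# 20 wins out of 20 decisive games: essentially never happens by chance.
print(f"P(20 of 20)             = {fair_coin_tail(20, 20):.2e}")
```

Treating the 980 draws as "replay until somebody wins" leaves 20 decisive games, which is the second case.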

mcostalba · Post by **mcostalba** » Mon Oct 01, 2012 8:09 pm

hgm wrote: A way to check that would be to Fourier-transform the result sequence of a test, to see if the power spectrum looks like white noise, or decays at high frequencies.

This is an interesting idea. I predict that with everything else equal (same opponents, same number of games, etc.) but different TC, the fast TC sequence of games will show more energy in the high frequencies than a slower TC.
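The Fourier check from the quote could be sketched as follows in Python. The encoding of results (1 = win, 0.5 = draw, 0 = loss), the function names and the two toy sequences are assumptions made for the illustration. The plain O(n^2) transform is only meant for short sequences; for real test runs an FFT such as `numpy.fft.rfft` would be the usual choice.

```python
import cmath


def power_spectrum(results):
    """Power at frequencies k = 1 .. n//2 of the mean-removed result sequence."""
    n = len(results)
    mean = sum(results) / n
    centered = [r - mean for r in results]
    spectrum = []
    for k in range(1, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(centered))
        spectrum.append(abs(coeff) ** 2 / n)
    return spectrum


def high_frequency_share(results):
    """Fraction of total power in the upper half of the frequency range."""
    spectrum = power_spectrum(results)
    total = sum(spectrum)
    if total == 0:
        return 0.0  # constant sequence: no variation to analyse
    half = len(spectrum) // 2
    return sum(spectrum[half:]) / total


# Hypothetical result sequences, coded 1 = win, 0.5 = draw, 0 = loss.
streaky = ([1.0] * 8 + [0.0] * 8) * 4   # long winning and losing streaks
alternating = [1.0, 0.0] * 32           # result flips every game
print(high_frequency_share(streaky), high_frequency_share(alternating))
```

Independent games should give a roughly flat spectrum, so a share near one half. Streaks push the power towards low frequencies, and strictly alternating results push it towards high frequencies.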

bob · Post by **bob** » Tue Oct 02, 2012 6:17 pm

abulmo wrote:
bob wrote:
Uri Blass wrote: An extreme example

20-0 and 980 draws is clearly a significant result
510-490 with no draws is not a significant result.

Both results have the same estimate for the elo difference.
I don't follow your reasoning. In either case, one program won 20 games more than the other, out of 1,000 games total. why is a win and a loss worse than 2 draws?
If we make the assumption that the theoretical outcome of the game is a draw, a perfect player can get the first result, not the second. In other words, a strong player making perfect moves can draw against any other player, whatever his strength is, if this one also accidentally plays perfectly. In that case, a draw has a value close to a win, and a loss is much worse, IMHO.

Is that not factored in when you consider the rating difference of the two opponents? A draw hurts the rating of the stronger player more than the weaker player. But in general, why is two draws better or worse than a win and a loss?

bob · Post by **bob** » Tue Oct 02, 2012 6:23 pm

Uri Blass wrote:
bob wrote:
Uri Blass wrote:
hgm wrote:The only ways that results for a given number of games could be more 'volatile' (i.e. have larger standard deviation) is if they had a significantly lower draw rate, or if the results of individual games are somehow dependent. (So that N games do not really count as N games, but, say, as N/2 pairs of games, where the result of game 1 from a pair would predict the result of game 2.)

The latter is hard to imagine, and points to a severe flaw in testing methodology. It means there should be some memory from one game to the next, so that games could influence each other. This could lead to 'winning streaks' and 'losing streaks'. A way to check that would be to Fourier-transform the result sequence of a test, to see if the power spectrum looks like white noise, or decays at high frequencies.
If you play a match from the same position with opposite colors then it is obvious that results of individual games are dependent and I expect the correlation to be higher at long time control.

I think that long time control has more draws so it makes sense that 10 elo difference at long time control with the same number of games is more
significant in the meaning that you can have more confidence that it is an improvement.

An extreme example

20-0 and 980 draws is clearly a significant result
510-490 with no draws is not a significant result.

Both results have the same estimate for the elo difference.
I don't follow your reasoning. In either case, one program won 20 games more than the other, out of 1,000 games total. why is a win and a loss worse than 2 draws?
Because the probability for draw is different in both cases.

If you flip a coin 1000 times
it is often going to fall more than 510 times on the same side so you cannot say that the coin is not fair because it fell 510 times on the same side.

The case with 980 draws is different because 20-0 with no draws is significant and
I can also say that the winner won 20-0 if I define a game that draw means to play another game until somebody wins.

My question was, however, why does one suggest a difference between the two players that doesn't agree with the other? I'm not arguing the point, I am trying to understand the reasoning.

A simple question:

In my cluster testing, I play 6,000 games between Crafty and each opponent. 3,000 different positions, with Crafty playing two games per position, one black and one white. Does this mean that if I see some of my 3,000 positions where both games are draws, I can toss those positions out to speed up the test, and still get the same level of accuracy? Does it also mean that I should try to pick test positions that offer the fewest draws? What about positions that are always won by white. Does the win/loss for each mean more than a pair of draws on a different position?

Uri Blass · Post by **Uri Blass** » Tue Oct 02, 2012 7:43 pm

bob wrote:
Uri Blass wrote:
bob wrote:
Uri Blass wrote:
hgm wrote:The only ways that results for a given number of games could be more 'volatile' (i.e. have larger standard deviation) is if they had a significantly lower draw rate, or if the results of individual games are somehow dependent. (So that N games do not really count as N games, but, say, as N/2 pairs of games, where the result of game 1 from a pair would predict the result of game 2.)

The latter is hard to imagine, and points to a severe flaw in testing methodology. It means there should be some memory from one game to the next, so that games could influence each other. This could lead to 'winning streaks' and 'losing streaks'. A way to check that would be to Fourier-transform the result sequence of a test, to see if the power spectrum looks like white noise, or decays at high frequencies.
If you play a match from the same position with opposite colors then it is obvious that results of individual games are dependent and I expect the correlation to be higher at long time control.

I think that long time control has more draws so it makes sense that 10 elo difference at long time control with the same number of games is more
significant in the meaning that you can have more confidence that it is an improvement.

An extreme example

20-0 and 980 draws is clearly a significant result
510-490 with no draws is not a significant result.

Both results have the same estimate for the elo difference.
I don't follow your reasoning. In either case, one program won 20 games more than the other, out of 1,000 games total. why is a win and a loss worse than 2 draws?
Because the probability for draw is different in both cases.

If you flip a coin 1000 times
it is often going to fall more than 510 times on the same side so you cannot say that the coin is not fair because it fell 510 times on the same side.

The case with 980 draws is different because 20-0 with no draws is significant and
I can also say that the winner won 20-0 if I define a game that draw means to play another game until somebody wins.
My question was, however, why does one suggest a difference between the two players that doesn't agree with the other? I'm not arguing the point, I am trying to understand the reasoning.

A simple question:

In my cluster testing, I play 6,000 games between Crafty and each opponent. 3,000 different positions, with Crafty playing two games per position, one black and one white. Does this mean that if I see some of my 3,000 positions where both games are draws, I can toss those positions out to speed up the test, and still get the same level of accuracy? Does it also mean that I should try to pick test positions that offer the fewest draws? What about positions that are always won by white. Does the win/loss for each mean more than a pair of draws on a different position?

If you have positions that are always won for white or are always draws then it is better not to use them.

Note that I agree that 510-490 can be significant even with 510 wins and 490 losses, if the 500 two-game matches consist of 490 1-1 results and 10 2-0 results.

If you get something like 200 2-0 results, 190 0-2 results and 110 1-1 results, then it is not significant.

Edit: some thoughts about it:

1) I think that you should try to pick test positions that give more variety (draws in 90% of the games is not good, and the same goes for a white win in 90% of the games).

2) It is possible that 1 is not a good idea, because some balanced positions may also give 90% wins for white simply because the computer does not understand how to play them with black. So it may be a bad idea to drop them, because a later improvement may teach your program how to play them with black. So basically you need balanced positions where all results are possible (even if in practice, at this point in time, chess programs score clearly more than 50% with white in some of them).
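The pair-based view of the 510-490 example can be illustrated with a short Python sketch. It treats each two-game mini-match as a single observation (2-0 as a win, 0-2 as a loss, 1-1 as a draw) and applies the same normal approximation used for single games earlier in the thread. That modelling choice and the name `los_from_pairs` are assumptions for the illustration, not something stated in this post.

```python
from math import sqrt
from statistics import NormalDist


def los_from_pairs(pairs_2_0, pairs_0_2, pairs_1_1):
    """LOS when each two-game mini-match is treated as one observation.

    Assumes at least two different pair outcomes occur (non-zero variance).
    """
    n = pairs_2_0 + pairs_0_2 + pairs_1_1
    mu = (pairs_2_0 + pairs_1_1 / 2) / n
    # Same variance formula as for single games, with 1-1 pairs as "draws".
    variance = (mu * (1 - mu) - pairs_1_1 / n / 4) / n
    return NormalDist().cdf((mu - 0.5) / sqrt(variance))


# Both splits add up to 510 wins and 490 losses over 1000 games.
print(f"10 x 2-0, 0 x 0-2, 490 x 1-1:    LOS = {los_from_pairs(10, 0, 490):.4f}")
print(f"200 x 2-0, 190 x 0-2, 110 x 1-1: LOS = {los_from_pairs(200, 190, 110):.4f}")
```

The first split gives a LOS very close to 1, while the second stays well short of conventional significance, even though the raw game totals are identical.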

Ajedrecista · Post by **Ajedrecista** » Tue Oct 02, 2012 7:45 pm

Hello Marco:

mcostalba wrote:I am really clueless in this field of statistical calculations.

Nevertheless I noted some things when looking at a matches between engine A and B, that I'd like experts to comment on:

1) The current algorithm to calculate ELO, LOS, etc. uses the "final" result, the numbers of wins, losses, draws

2) Experimenting with testing at different TC I noted that at fast TC results are more "volatile" than at longer TC. At fast TC a +10 ELO after 1000 games could easily be reverted in the next 10K games, instead at long TC already after 300 games the current winner has a big potential to be the final winner even after 10K games.

3) In all the ELO calculations the probability for the stronger engine to win a match is assumed to be independent from the TC used. This IMHO is a flaw.

My understanding is that in any match there is a noise level that alters the natural result, i.e. the stronger wins. The reason why we need a lot of games is to average out this noise that is assumed with zero median. But I also think that this noise is not independent from the TC used.

When playing a match we have much more information than the final result. We have all the single games results series. My understanding is that analyzing the sequence of single results a "variance" or noise level estimation could be calculated. The level of noise could be then used to reach the final LOS: my guess is that at long TC fewer games are needed than at very fast TC (note that total time could be the same or even longer for long TC) to reach a given LOS with a given accuracy.

So my final question is, has anybody ever considered modelling the single game result noise using the games series and then using it to calculate the LOS?

Sorry for being a little off-topic; I want to write a few words about your last commit on GitHub:

Code:

ext = rBeta >= beta ? ONE_PLY + ONE_PLY / 2 : ONE_PLY;

I know NOTHING about C++, but is this line wrong?

Code:

ext = rBeta >= beta ? 1.5 * ONE_PLY : ONE_PLY;

I mean, replace 'ONE_PLY + ONE_PLY / 2' by '1.5 * ONE_PLY' with the aim of speeding this line up a little... I think that a multiplication is faster than an addition and a division. Surely this change is negligible, if valid... but I point this out, just in case. Please remember that I do not know C++ at all so I would be grateful if nobody blames me.

Regarding LOS:

Further push singular extension

Extend for an extra half-ply in case the node is (probably)
going to fail high. In this case the added overhead is limited.

A novelty is the way this patch has been tested: Always in
self-play but with a much longer TC to allow the singular
extension to fully kick in and also (my impression) to have
less noisy results.

After 1015 games on my QUAD at 60"+0.05
Mod vs Orig 173 - 150 - 692 ELO +8

Code:

LOS_and_Elo_uncertainties_calculator, ® 2012.

----------------------------------------------------------------
Calculation of Elo uncertainties in a match between two engines:
----------------------------------------------------------------

(The input and output data is referred to the first engine).

Please write down non-negative integers.

Maximum number of games supported: 2147483647.

Write down the number of wins (up to 1825361100):

173

Write down the number of loses (up to 1825361100):

150

Write down the number of draws (up to 2147483324):

692

 Write down the confidence level (in percentage) between 65% and 99.9% (it will be rounded up to 0.01%):

95

Write down the clock rate of the CPU (in GHz), only for timing the elapsed time of the calculations:

3

---------------------------------------
Elo interval for 95.00 % confidence:

Elo rating difference:      7.87 Elo

Lower rating difference:   -4.18 Elo
Upper rating difference:   19.94 Elo

Lower bound uncertainty:  -12.05 Elo
Upper bound uncertainty:   12.07 Elo
Average error:        +/-  12.06 Elo

K = (average error)*[sqrt(n)] =  384.18

Elo interval: ]  -4.18,   19.94[
---------------------------------------

Number of games of the match:      1015
Score: 51.13 %
Elo rating difference:    7.87 Elo
Draw ratio: 68.18 %

*********************************************************
Standard deviation:  1.7338 % of the points of the match.
*********************************************************

 Error bars were calculated with two-sided tests; values are rounded up to 0.01
Elo, or 0.01 in the case of K.

-------------------------------------------------------------------
Calculation of likelihood of superiority (LOS) in a one-sided test:
-------------------------------------------------------------------

LOS (taking into account draws) is always calculated, if possible.

LOS (not taking into account draws) is only calculated if wins + loses < 16001.

LOS (average value) is calculated only when LOS (not taking into account draws) is calculated.
______________________________________________

LOS:  89.99 % (taking into account draws).
LOS:  89.94 % (not taking into account draws).
LOS:  89.96 % (average value).
______________________________________________

These values of LOS are rounded up to 0.01%

End of the calculations. Approximated elapsed time:   76 ms.

Thanks for using LOS_and_Elo_uncertainties_calculator. Press Enter to exit.

LOS ~ 90% so I estimate that this commit is not good with a probability of 10%, which is not insignificant. But n = 1015, so error bars are still around ± 12 Elo with 95% confidence using my model. I know that the idea in this thread is to play longer games (but fewer of them) to get a more stable LOS evolution after each game, but if you can play a few more games at that TC (60" + 0.05"/move) it would be better IMHO.
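The Elo figures in the output above can be approximately reproduced with a short Python sketch. It assumes the calculator converts scores with the logistic Elo formula and builds the interval from a two-sided normal approximation of the score. That is a guess based on the printed values, not on the program's source, and `score_to_elo` and `elo_interval` are hypothetical names.

```python
from math import log10, sqrt
from statistics import NormalDist


def score_to_elo(score):
    """Logistic Elo difference for a score strictly between 0 and 1."""
    return 400 * log10(score / (1 - score))


def elo_interval(wins, losses, draws, confidence=0.95):
    """Return (elo, lower, upper) from a two-sided normal approximation."""
    n = wins + losses + draws
    mu = (wins + draws / 2) / n
    # Standard deviation of the mean score, with the draw-ratio correction.
    sigma = sqrt((mu * (1 - mu) - draws / n / 4) / n)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return (score_to_elo(mu),
            score_to_elo(mu - z * sigma),
            score_to_elo(mu + z * sigma))


elo, lower, upper = elo_interval(173, 150, 692)
print(f"Elo {elo:+.2f}, 95% interval [{lower:+.2f}, {upper:+.2f}]")
```

Because the interval is computed on the score and then mapped to Elo, it is slightly asymmetric around the central estimate, as in the quoted output.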

Good luck with the development of SF... it is a hard nut to crack!

Regards from Spain.

Ajedrecista.

mcostalba · Post by **mcostalba** » Tue Oct 02, 2012 8:25 pm

Ajedrecista wrote:
I know NOTHING about C++, but is this line wrong?
Code:
ext = rBeta >= beta ? 1.5 * ONE_PLY : ONE_PLY;
I mean, replace 'ONE_PLY + ONE_PLY / 2' by '1.5 * ONE_PLY' with the aim of speeding this line up a little... I think that a multiplication is faster than an addition and a division. Surely this change is negligible, if valid... but I point this out, just in case. Please remember that I do not know C++ at all so I would be grateful if nobody blames me.

Thanks for your comment. Actually the line is good as is (I will explain below), but I want to thank you for dedicating time to reading and commenting on my code.

Now to the code:

1) ONE_PLY is a constant integer value, so the compiler is able to work out the sum and the division at compile time. This means that there is not actually any addition or division in the real code; everything is precalculated during compilation.

2) '1.5 * ONE_PLY' is wrong because 1.5 is a floating point number, not an integer like ONE_PLY. This means that the result of '1.5 * ONE_PLY' is still a floating point number and this is not what we want. In this particular case you also get some compile errors because the ternary operator (condition ? a : b) requires 'a' and 'b' to be of the same type.
