LOS calculation: Does the same result is always the same?

zamar · Post by **zamar** » Tue Oct 02, 2012 9:32 pm

LOS is only based on statistical error.

The only way this could be incorrect is that something causes systematic error.
For example some other process is making heavy use of CPU from time to time.
This kind of random systematic error of course favours the weaker engine.

Unfortunately there is no statistical way (AFAIK) to measure systematic error...

lucasart · Post by **lucasart** » Wed Oct 03, 2012 12:54 am

mcostalba wrote:
Ajedrecista wrote:
I know NOTHING about C++, but is this line wrong?
Code: Select all
ext = rBeta >= beta ? 1.5 * ONE_PLY &#58; ONE_PLY;
I mean, replace 'ONE_PLY + ONE_PLY / 2' by '1.5 * ONE_PLY' with the aim of speed up a little this line... I think that a multiplication is faster than an addition and a division. Surely this change is negligible, if valid... but I point this, just in case. Please remember that I do not know C++ at all so I will thank if nobody blames me.
Thanks for you comment. Actually the line is good as is (I will explain below), but I want to thank you for dedicating time in reading and commenting my code.

Now to the code:

1) ONE_PLY is a constant integer value, so the compiler is able to workout the sum and the division at compile time, it means that there is no actually any addition or division in the real code, but everything is precalculated during compile.

2) '1.5 * ONE_PLY' is wrong becuase 1.5 is a floating point number, not an integer as ONE_PLY. This means that the result of '1.5 * ONE_PLY' is still a floating point number and this is not what we want. In this particular case you have also some compile errors because the ternary operator ( condition ? a : b) requires 'a' and 'b' to be of the same type.

Indeed, and also thinking that 1.5 * x is faster than x + x/2 is wrong. x + x / 2 translates as an add and an shl in assembly (2 cycles), while 1.5 * x is a far more complex operation: even an integer multiplication is much slower than add + shl, so a floating point one with a conversion back to integer...

mcostalba · Post by **mcostalba** » Wed Oct 03, 2012 9:12 am

zamar wrote:LOS is only based on statistical error.

The only way this could be incorrect is that something causes systematic error.
For example some other process is making heavy use of CPU from time to time.
This kind of random systematic error of course favours the weaker engine.

Unfortunately there is no statistical way (AFAIK) to measure systematic error...

IMHO a noise is a noise, it can be due to an external source (like uneven CPU usage) or due to intrinsic chess game noise, but always a noise is and bigger it is bigger is the uncertainty of a given result.

My guess is that intrinsic game noise depened on the TC used: it is bigger at fast TC than at longer TC.

Milos · Post by **Milos** » Thu Oct 04, 2012 5:26 pm

Ajedrecista wrote:I agree with Uri. I am not gifted in Statistics but I calculate LOS in two different ways: both of them give very similar results when the number of games is large enough.

The first way (my own way) is intuitive: I calculate the mean and the standard deviation with the number of wins, draws and loses:
Code: Select all
&#40;Number of games&#41; = n = wins + draws + loses.
&#40;Draw ratio&#41; = D = draws/n

Mean = mu = &#40;wins + draws/2&#41;/n
1 - mu = &#40;draws/2 + loses&#41;/n

&#40;Standard deviation&#41; = sigma = sqrt&#123;&#91;mu*&#40;1 - mu&#41; - D/4&#93;/n&#125;
The explanation of the approximation of mu and sigma from a trinomial distribution is explained at section 3.2 of this paper.

Then I do k = (mu - 0.5)/sigma (the original equation is mu - k*sigma = 0.5); the original equation establishes the point k (a kind of limit), when mu - z*sigma < 0.5, then z < k tells you that the standard deviation z*sigma is larger than mu - 0.5 (LOS is the integral from -infinity to z = k in a normal distribution with my approach). As you can see, a higher draw ratio means a lower standard deviation so, with a given mean, k must be larger for maintaining k*sigma = mu - 0.5.

------------------------

The second way for calculating LOS is described by Rémi in the last equation of this post.

In Uri's example, {wins, loses} = {20, 0} and {510, 490}, which are clearly different in spite of the fact that wins - loses = 20 in both cases, which make the same score because n = 1000 in both cases.

If I run my own Fortran 95 programme, I get these results for Uri's example:
Code: Select all
LOS values are rounded up to 0.01%&#58;

+20 -0 =980.

    My LOS ~ 100%
Rémi's LOS ~ 100%
-------------------

+510 -490 =0.

    My LOS ~ 73.65%
Rémi's LOS ~ 73.64%
I understand that the probability of being wrong in assumptions with a given LOS value is min.(LOS, 1 - LOS): this probability is near zero in the first case, while it is more than 26% in the second case (this is the reason why Uri said a significant result and a not significant result). Please take in mind that I have exposed you two models of calculating LOS and nothing more: they are only models.

bob wrote:I don't follow your reasoning. In either case, one program won 20 games more than the other, out of 1,000 games total. why is a win and a loss worse than 2 draws?
I do not know an exact answer for your question but you will see that it is true when you introduce the numbers in the equations... I know it is a poor answer and maybe someone can answer better than me. You can see that my method is dependant on draw ratio, as I explained before, so it could be a reason.

Of course you can divide the standard deviation by (n - 1) instead of by n, but you know that differences will be too small with n >> 1.

I hope that you follow my reasoning.

Regards from Spain.

Ajedrecista.

First the above example is not good one since win/loss ratio in both examples are quite different - in first (510/490/0) win ratio is 0.51, loss ratio is 0.49; in second (20/0/1000) win ratio is 0.02, loss ratio is 0. This gives totally different LOS and both of your calculations are correct.

The actual question is do draws have any impact on the result (good example for that would be 1/0/1000 vs 1/0/0) and this is probably as old question as this forum and Remi is WRONG about that one.
Draws do have impact on LOS but in general small.
There is a long discussion on that here and this is the post that summarizes it. (it also gives an accurate formula for your 'k')
The catch is that even though number of draws doesn't play the role on LOS in Remi's equation, the draw probability does play (implicitly contained in variable 'u' in Remi's equation). In cases 1/0/1000 and 1/0/0 draw probability is certainly different (in first case it is close to 100% and in second we don't have a clue about it).

In reality when there are more draws than what is difference between number of wins and losses LOS has really low sensitivity to it so you can basically neglect the draws (only number of wins and losses plays the role). However, when number of draws is very small (smaller than difference in number of wins and losses) the impact can be significant.

Ajedrecista · Post by **Ajedrecista** » Thu Oct 04, 2012 7:42 pm

Hello Milos:

Milos wrote:
Ajedrecista wrote:I agree with Uri. I am not gifted in Statistics but I calculate LOS in two different ways: both of them give very similar results when the number of games is large enough.

The first way (my own way) is intuitive: I calculate the mean and the standard deviation with the number of wins, draws and loses:
Code: Select all
&#40;Number of games&#41; = n = wins + draws + loses.
&#40;Draw ratio&#41; = D = draws/n

Mean = mu = &#40;wins + draws/2&#41;/n
1 - mu = &#40;draws/2 + loses&#41;/n

&#40;Standard deviation&#41; = sigma = sqrt&#123;&#91;mu*&#40;1 - mu&#41; - D/4&#93;/n&#125;
The explanation of the approximation of mu and sigma from a trinomial distribution is explained at section 3.2 of this paper.

Then I do k = (mu - 0.5)/sigma (the original equation is mu - k*sigma = 0.5); the original equation establishes the point k (a kind of limit), when mu - z*sigma < 0.5, then z < k tells you that the standard deviation z*sigma is larger than mu - 0.5 (LOS is the integral from -infinity to z = k in a normal distribution with my approach). As you can see, a higher draw ratio means a lower standard deviation so, with a given mean, k must be larger for maintaining k*sigma = mu - 0.5.

------------------------

The second way for calculating LOS is described by Rémi in the last equation of this post.

In Uri's example, {wins, loses} = {20, 0} and {510, 490}, which are clearly different in spite of the fact that wins - loses = 20 in both cases, which make the same score because n = 1000 in both cases.

If I run my own Fortran 95 programme, I get these results for Uri's example:
Code: Select all
LOS values are rounded up to 0.01%&#58;

+20 -0 =980.

    My LOS ~ 100%
Rémi's LOS ~ 100%
-------------------

+510 -490 =0.

    My LOS ~ 73.65%
Rémi's LOS ~ 73.64%
I understand that the probability of being wrong in assumptions with a given LOS value is min.(LOS, 1 - LOS): this probability is near zero in the first case, while it is more than 26% in the second case (this is the reason why Uri said a significant result and a not significant result). Please take in mind that I have exposed you two models of calculating LOS and nothing more: they are only models.

bob wrote:I don't follow your reasoning. In either case, one program won 20 games more than the other, out of 1,000 games total. why is a win and a loss worse than 2 draws?
I do not know an exact answer for your question but you will see that it is true when you introduce the numbers in the equations... I know it is a poor answer and maybe someone can answer better than me. You can see that my method is dependant on draw ratio, as I explained before, so it could be a reason.

Of course you can divide the standard deviation by (n - 1) instead of by n, but you know that differences will be too small with n >> 1.

I hope that you follow my reasoning.

Regards from Spain.

Ajedrecista.
First the above example is not good one since win/loss ratio in both examples are quite different - in first (510/490/0) win ratio is 0.51, loss ratio is 0.49; in second (20/0/1000) win ratio is 0.02, loss ratio is 0. This gives totally different LOS and both of your calculations are correct.

The actual question is do draws have any impact on the result (good example for that would be 1/0/1000 vs 1/0/0) and this is probably as old question as this forum and Remi is WRONG about that one.
Draws do have impact on LOS but in general small.
There is a long discussion on that here and this is the post that summarizes it. (it also gives an accurate formula for your 'k')
The catch is that even though number of draws doesn't play the role on LOS in Remi's equation, the draw probability does play (implicitly contained in variable 'u' in Remi's equation). In cases 1/0/1000 and 1/0/0 draw probability is certainly different (in first case it is close to 100% and in second we don't have a clue about it).

In reality when there are more draws than what is difference between number of wins and losses LOS has really low sensitivity to it so you can basically neglect the draws (only number of wins and losses plays the role). However, when number of draws is very small (smaller than difference in number of wins and losses) the impact can be significant.

My model of LOS takes into account the draw ratio as you can see in the formula of sigma I use.

Milos wrote:Draws do have impact on LOS but in general small.

[...]

In reality when there are more draws than what is difference between number of wins and losses LOS has really low sensitivity to it so you can basically neglect the draws (only number of wins and losses plays the role). However, when number of draws is very small (smaller than difference in number of wins and losses) the impact can be significant.

I agree with you: I noticed that draw ratio has low sensitivity in LOS for usual tests and significant differences come for huge changes in draw ratio (for example, from almost 0% to almost 100%). Of course the formula of sigma has its drawback with a draw ratio of 100%, when sigma = 0; furthermore, this formula of sigma does not work well with too unbalanced matches (scores near 0% and 100%), this is why I restrict my programme for calculate these cases. I arbitraly choosed the range mu = [0.15, 0.85] as valid for my calculations and I probably was right by chance when I did the following graph plot later:

Code: Select all

n&#58; number of games.
w&#58; number of wins.
d&#58; number of draws.
l&#58; number of loses.

Rating performance in a chess tournament&#58; +400 against each player you win, 0 against each player you draw, -400 against each player you lose; then the average is the desired result if the average of opponents' ratings is 0&#58;

&#40;Rating performance&#41; = 400*&#40;w - l&#41;/n

------------------------

Rating difference by score&#58;

&#40;Rating difference&#41; = 400*log&#123;&#91;n + &#40;w - l&#41;&#93;/&#91;n - &#40;w - l&#41;&#93;&#125; = 400*log&#123;&#91;1 + &#40;w - l&#41;/n&#93;/&#91;1 - &#40;w - l&#41;/n&#93;&#125;

http&#58;//talkchess.com/forum/viewtopic.php?topic_view=threads&p=473889&t=44420

If you compare both graphs in terms of (w - l)/n , then you will see that both graphs are similar in (w - l)/n ~ [-2/3, 2/3]; I took [-0.7, 0.7] for being a little more flexible... so if |w - l|/n = 0.7, then d/n = D = 1 - 0.7 = 0.3 and scores are 0 + 0.3/2 = 0.15 and 0.7 + 0.3/2 = 0.85 in extreme cases of no wins or no loses, respectively.

Milos wrote:Draws do have impact on LOS but in general small.
There is a long discussion on that here and this is the post that summarizes it. (it also gives an accurate formula for your 'k')

In his formula (w, d, l and N are number of wins, draws, losses and games, respectively):
LOS=Phi(x), where x=(w-l)/sqrt(N-d)=(w-l)/sqrt(w+l) where x really does not depend on d.
However, accurately
x=(mu-0.5)/sigma=(w-l)/(2N*sigma)=...=(w-l)/sqrt((2w+d)(2l+d)/N-d)=(w-l)/sqrt(N-d-((w-l)/N)^2)

We agree in x = k = (mu - 0.5)/sigma:

Code: Select all

2*&#40;mu - 0.5&#41; = 2*mu - 1 = mu - &#40;1 - mu&#41; = &#40;w + d/2&#41;/n - &#40;d/2 + l&#41;/n = &#40;w - l&#41;/n

&#40;mu - 0.5&#41;/sigma = &#40;w - l&#41;/&#40;2n*sigma&#41;

2w + d = mu*2n
2l + d = &#40;1 - mu&#41;*2n

&#40;2w + d&#41;*&#40;2l + d&#41;/n - d = 4n*mu*&#40;1 - mu&#41; - nD = &#91;4*mu*&#40;1 - mu&#41; - D&#93;*n

As you introduced &#40;w - l&#41; instead of &#40;mu - 0.5&#41;, the constant 2n is present, which is &#40;2n&#41;² = 4n² inside a square root.

&#123;&#91;4*mu*&#40;1 - mu&#41; - D&#93;*n&#125;/4n² = &#91;mu*&#40;1 - mu&#41; - D/4&#93;/n &#40;which is the variance I use&#41;.

(w-l)/sqrt((2w+d)(2l+d)/N-d)=(w-l)/sqrt(N-d-((w-l)/N)^2) is inconsistent regarding units. The left member of the equation is OK while the right member is wrong: (N - d) has units of games while [(w - l)/n]² is dimensionless. If I am not wrong, the correct thing is (w - l)/sqrt[n - d - (w - l)²/n] = (w - l)/sqrt[w + l - (w - l)²/n].

Milos wrote:The more the games are played, and the less is the difference between engines, the less is the impact of number of draws on LOS.

I agree with you again. Disgracefully, I am not smart enough to provide a mathematical proof.

Your post was really interesting for me. Good work!

Regards from Spain.

Ajedrecista.

Rémi Coulom · Post by **Rémi Coulom** » Thu Oct 04, 2012 8:01 pm

Milos wrote:The actual question is do draws have any impact on the result (good example for that would be 1/0/1000 vs 1/0/0) and this is probably as old question as this forum and Remi is WRONG about that one.
Draws do have impact on LOS but in general small.
There is a long discussion on that here and this is the post that summarizes it. (it also gives an accurate formula for your 'k')
The catch is that even though number of draws doesn't play the role on LOS in Remi's equation, the draw probability does play (implicitly contained in variable 'u' in Remi's equation). In cases 1/0/1000 and 1/0/0 draw probability is certainly different (in first case it is close to 100% and in second we don't have a clue about it).

In reality when there are more draws than what is difference between number of wins and losses LOS has really low sensitivity to it so you can basically neglect the draws (only number of wins and losses plays the role). However, when number of draws is very small (smaller than difference in number of wins and losses) the impact can be significant.

Your formula is using a Gaussian approximation of the binomial distribution, whereas mine makes no approximation. When using the exact binomial distribution, draws have absolutely no effect at all, as proved by my calculation above. You might have the feeling that there is a small effect for a small number of games, but that is just an illusion caused by the error you make with the Gaussian approximation. There is in fact no effect at all.

Rémi

asanjuan · Post by **asanjuan** » Fri Oct 05, 2012 10:13 am

zamar wrote:LOS is only based on statistical error.

The only way this could be incorrect is that something causes systematic error.
For example some other process is making heavy use of CPU from time to time.
This kind of random systematic error of course favours the weaker engine.

Unfortunately there is no statistical way (AFAIK) to measure systematic error...

¿Why not? I'm not a statistical expert, but... correct me if I'm wrong.

If you choose a time control (say 10+0.01), a set of fixed openings (say 100 positions) an run a match between A and B 100 times. I'm pretty sure that you will find that each match reports different results (different wins/loses/draws in each run).

You can calculate a mean and standard deviation for each parameter (W/L/D).
For those who says that 100 games is not enough, take care that we are not measuring which engine is better, we are measuring the noise level.

Repeat this with a different time control (say 1 min/game) with the same fixed openings and you will get your "sistematic error" depending on TC.

In my opinion you will get different values depending on hardware, too. And I suspect that longer TC will get smaller standard deviations.

¿has anyone tried to measure this?

Michel · Post by **Michel** » Fri Oct 05, 2012 10:39 am

Your formula is using a Gaussian approximation of the binomial distribution, whereas mine makes no approximation. When using the exact binomial distribution, draws have absolutely no effect at all, as proved by my calculation above. You might have the feeling that there is a small effect for a small number of games, but that is just an illusion caused by the error you make with the Gaussian approximation. There is in fact no effect at all.

I think the issue is rather that "likelihood of superiority" seems to be a purely Bayesian concept. Your derivation shows that it is indeed independent of the number of draws.

The frequentist analogue would be something like "confidence of superiority" which does not seem to have a unique definition (it depends on the test to measure it). In general it seems to me that it does depend somewhat on the number of draws. I think the other people are really talking about "confidence of superiority".

The corresponding ambiguity in the Bayesian setting is the choice of the prior. Your derivation works for some priors but not for all of them. It is possible to choose a prior such that there is a dependence on the number of draws.

Rémi Coulom · Post by **Rémi Coulom** » Fri Oct 05, 2012 11:14 am

Michel wrote:
Your formula is using a Gaussian approximation of the binomial distribution, whereas mine makes no approximation. When using the exact binomial distribution, draws have absolutely no effect at all, as proved by my calculation above. You might have the feeling that there is a small effect for a small number of games, but that is just an illusion caused by the error you make with the Gaussian approximation. There is in fact no effect at all.

I think the issue is rather that "likelihood of superiority" seems to be a purely Bayesian concept. Your derivation shows that it is indeed independent of the number of draws.

The frequentist analogue would be something like "confidence of superiority" which does not seem to have a unique definition (it depends on the test to measure it). In general it seems to me that it does depend somewhat on the number of draws. I think the other people are really talking about "confidence of superiority".

The corresponding ambiguity in the Bayesian setting is the choice of the prior. You derivation works for some priors but not for all of them. It is possible to choose a prior such that there is a dependence on the number of draws.

Of course it is possible to incorporate a dependence in the prior. But that would really be a strange prior. And frequentist approaches are usually equivalent to Bayesian with no prior, anyway.

The only hypothesis I make in my calculation is that game results are independent and identically distributed. The result of Milos is based on an approach formally similar to mine, but makes a Gaussian approximation of the score distribution instead of making no approximation at all. That approximation is the source of the error. Using the approach of Milos without the Gaussian approximation produces the same equation as mine.

Besides, there is the chess* intuitive explanation that should be really convincing. (being stronger at chess is equivalent to being stronger at chess*, where the game is restarted whenever it ends in a draw, so that no draw is possible in chess*).

It might be possible to find a small dependence on draws if you consider player colors. Since White has an advantage, a draw as White indicates that Black might be slightly stronger. But if we ignore this and consider only the results regardless of color, then draws really do not count at all.

Rémi

Michel · Post by **Michel** » Fri Oct 05, 2012 11:26 am

And frequentist approaches are usually equivalent to Bayesian with no prior, anyway.

This happens to be true for MLE but not in general. Frequentists usually maximize over the parameter space rather than averaging as in the Bayesian approach.

I don't really see a good frequentist derivation of the independence of draws result. If you have one then I'd be happy to see it (I am not a statistician so I may overlook something).

Besides, there is the chess* intuitive explanation that should be really convincing. (being stronger at chess is equivalent to being stronger at chess*, where the game is restarted whenever it ends in a draw, so that no draw is possible in chess*).

I must admit this does sound very good (although it is not a 100% proof since not every sequence of chess games is a sequence of chess* games).

LOS calculation: Does the same result is always the same?

Re: LOS calculation: Does the same result is always the same

Re: LOS calculation: does the same result is always the same

Re: LOS calculation: Does the same result is always the same

Re: LOS calculation: does the same result is always the same

LOS calculation: does the same result is always the same?

Re: LOS calculation: does the same result is always the same

Re: LOS calculation: Does the same result is always the same

Re: LOS calculation: does the same result is always the same

Re: LOS calculation: does the same result is always the same

Re: LOS calculation: does the same result is always the same