Throwing out draws to calculate Elo

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

User avatar
Ovyron
Posts: 4556
Joined: Tue Jul 03, 2007 4:30 am

Re: Throwing out draws to calculate Elo

Post by Ovyron »

Dann Corbit wrote: Thu Jul 02, 2020 5:40 am Take Cfish, and make two copies, naming one Purple and the other Gold.
That's still trying to hit nails with a shoe.

You're the only person in the world who has attempted to use LOS to measure the difference between identical engines with their names changed, engines you already know have the same Elo. It's as if I tried to use LOS for protein folding; it's no wonder it does terribly at something it was never designed to do.

So I'll go and give you an example of how to use LOS properly:

I go to the CCRL page and find that Stockfish 11 isn't at the top of the list. Why is this? Why is some random version from October above it?

I go to the page that lists them all, here:

https://ccrl.chessdom.com/ccrl/4040/rat ... t_all.html

Now I can take a look at the LOS of the October version over SF11: it's 62.3%. What does this mean, exactly?

It means that if this test was repeated, 62.3% of the time the October version would appear over SF11, and 37.7% of the time SF11 would appear over the October version.

That's all.

Okay, cool, those are some shoes I can walk in.

And taking draws into account wouldn't change these numbers.
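
For the record, an LOS figure like that 62.3% can be reproduced from decisive-game counts alone. A minimal sketch of the formula engine-testing tools commonly use; the win/loss counts below are invented for illustration, chosen to land near the CCRL number (CCRL's own computation may differ in detail):

Code: Select all

# LOS = 0.5 * (1 + erf((wins - losses) / sqrt(2 * (wins + losses))))
from math import erf, sqrt

def los(wins, losses):
    """Probability that the engine ahead in the match is the stronger one."""
    return 0.5 * (1.0 + erf((wins - losses) / sqrt(2.0 * (wins + losses))))

print(los(505, 495))  # hypothetical counts -> ~0.624, near the 62.3% above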
Dann Corbit
Posts: 12540
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Throwing out draws to calculate Elo

Post by Dann Corbit »

Ovyron wrote: Thu Jul 02, 2020 8:40 am
Dann Corbit wrote: Thu Jul 02, 2020 5:40 am Take Cfish, and make two copies, naming one Purple and the other Gold.
That's still trying to hit nails with a shoe.

You're the only person in the world who has attempted to use LOS to measure the difference between identical engines with their names changed, engines you already know have the same Elo. It's as if I tried to use LOS for protein folding; it's no wonder it does terribly at something it was never designed to do.
Yes, and that is a big part of the problem.
People did some math and the result sounded good, but nobody bothered to test whether it actually works.
If you run a test to see if engine x is superior to engine x and it says that over and over again, will you trust it when it says that engine y is superior to engine x?

We know an engine is identical to itself. If the tool does not confirm this, that is a clue that the tool is broken.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
User avatar
hgm
Posts: 27800
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Throwing out draws to calculate Elo

Post by hgm »

The problem is that the test never says "this engine is superior to that". The LOS only gives a probability that this is the case, and will always leave a remaining probability that it is not the case.

You seem to be upset by the fact that, when testing equal engines, you will quite frequently get an LOS of 60% or 70%, apparently because you think that an LOS of 51% already 'proves' the engine must be superior. But that is a misinterpretation of the result.

LOS = 60% means that the chance that the winning engine is better is only 1.5 times the chance that the losing engine is better, which is still 40%. That is almost no difference. It is not a perfectly fair coin toss which of the two is better, but it is nearly so. The only conclusion one can draw from an LOS so close to 50% is that more testing is needed. Usually engine developers only accept a change if the test shows it gives an LOS of at least 95%, and preferably 99%, since at 95% 1 out of every 20 patches you accept would still be a regression.
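
To put numbers on that acceptance rule, here is a minimal sketch (assuming the usual normal approximation LOS = Phi((wins - losses)/sqrt(decisive games)); not the code of any particular testing tool) of the net-win margin a patch needs before it reaches a given LOS:

Code: Select all

from math import ceil, sqrt
from statistics import NormalDist

def net_wins_needed(decisive_games, los_target):
    z = NormalDist().inv_cdf(los_target)   # z-score matching the target LOS
    return ceil(z * sqrt(decisive_games))  # required (wins - losses)

for d in (100, 1000, 10000):
    print(d, net_wins_needed(d, 0.95), net_wins_needed(d, 0.99))
# 100 decisive games: 17 / 24 net wins; 10000: 165 / 233 net wins

The absolute margin grows like sqrt(n), but as a fraction of the games played it shrinks, which is why only long tests can certify small improvements.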

And when the engines are truly equal, no amount of testing will alter the situation. The LOS will never get far enough from 50% to be considered decisive, unless you have an extreme fluke. And yes, flukes can happen. That is an intrinsic problem of processes that involve randomness. This is why the LOS will never get to exactly 1. Any conceivable result could in principle have been achieved by exactly equal engines. There doesn't exist any tool that can do anything about that.
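
Both points are easy to see in a simulation: for literally equal engines the reported LOS spreads roughly uniformly over (0, 1), so 60-70% values are routine and about 1 match in 20 exceeds 95% purely by luck. A sketch, with draw rate and match length chosen arbitrarily:

Code: Select all

import random
from math import erf, sqrt

def match_los(games=1000, draw_rate=0.57):
    wins = losses = 0
    for _ in range(games):
        r = random.random()
        if r < draw_rate:
            continue                                  # draw
        elif r < draw_rate + (1 - draw_rate) / 2:
            wins += 1
        else:
            losses += 1
    if wins + losses == 0:                            # all-draw match
        return 0.5
    return 0.5 * (1 + erf((wins - losses) / sqrt(2 * (wins + losses))))

results = [match_los() for _ in range(2000)]
print(sum(results) / len(results))                    # ~0.50 on average
print(sum(r > 0.95 for r in results) / len(results))  # ~0.05: the flukes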

This of course assumes fair testing. If you doctor the test results by repeating the test 100k times and discarding the results until you get one that you want, it isn't really testing.
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: Throwing out draws to calculate Elo

Post by Milos »

Dann, what you are writing in this thread is frankly embarrassing.
Draws play a role in LOS too. Implicit, but a role nevertheless.
If you play 10 games and 5 decisive games go to one opponent, your LOS would be 97%. If you play 1000 games and 5 decisive games go to one opponent, your LOS would be 57%. How you are unable to see this is beyond comprehension.
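
Those two figures can be reproduced, at least approximately, by putting all games (not just the decisive ones) into the variance of the normal approximation; that is one way the draws enter implicitly. A sketch; the exact percentages depend on which variance model a tool uses, so treat the outputs as illustrative:

Code: Select all

from math import erf, sqrt

def los_all_games(wins, losses, games):
    return 0.5 * (1 + erf((wins - losses) / sqrt(2 * games)))

print(los_all_games(5, 0, 10))    # ~0.94: 5 decisive games out of 10
print(los_all_games(5, 0, 1000))  # ~0.56: the same 5 out of 1000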
User avatar
Ajedrecista
Posts: 1969
Joined: Wed Jul 13, 2011 9:04 pm
Location: Madrid, Spain.

Re: Throwing out draws to calculate Elo.

Post by Ajedrecista »

Hello Dann:

AFAIK, LOS is intended to be applied to a single n-game match, not across k n-game matches (you are running k = n, that is, n matches of n games each). I think this is the key. If you average the n LOS results from the n matches, you will get <LOS> = (1/n)*SUM(LOS_i ; i = 1, ..., n) = 50% (or very close to it), since the score distribution is symmetric around 50%.

As explained before by some people, LOS does not quantify an Elo difference, just the probability that Elo > 0 (whether it is +0.01 or +9000; that is, likelihood of superiority, not how much superiority). In other words, LOS is the probability of the score being greater than 0.5 (50%), as in the image I posted before in this thread:

Image

This follows the classical definition of probability: the cases we are interested in (superiority, score from 0.5 to 1) divided by the whole range of scores (from 0 to 1); just look at the limits of the definite integrals in the last equation of the image.
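
That definition can also be sketched numerically. Assuming a uniform prior on the decisive-game win probability p (an assumption of this sketch, not something from the thread), the posterior after observing (wins, losses) is Beta(wins + 1, losses + 1), and LOS is the posterior mass on p > 0.5:

Code: Select all

import random

def los_bayes(wins, losses, samples=200_000):
    # Monte Carlo estimate of P(p > 0.5) under the Beta posterior.
    hits = sum(random.betavariate(wins + 1, losses + 1) > 0.5
               for _ in range(samples))
    return hits / samples

print(los_bayes(60, 40))  # hypothetical counts -> ~0.98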

------------

Regarding tails, your examples follow the binomial distribution, which approaches the normal distribution for large samples. I wrote a little about the binomial distribution in a reply to a recent post of Dann's some days ago:

Re: How good is your engine?

With dimensionless numbers (just dividing the mean and the standard deviation by the number of games), the score will be [0.5*n + z*0.5*sqrt(n)]/n = 0.5 + 0.5*z/sqrt(n), where z is the z-score: z = (2*score - 1)*sqrt(n). As the number of games increases, the tails are expected to be closer to the central point, and so is the Elo difference:

Code: Select all

Elo = 400*log[score/(1 - score)] = 400*log( [ 0.5 + 0.5*z/sqrt(n) ] / { 1 - [ 0.5 + 0.5*z/sqrt(n) ] } )

Elo = 400*log[0.5 + 0.5*z/sqrt(n)] - 400*log[0.5 - 0.5*z/sqrt(n)]

Elo = 400*log{[1 + z/sqrt(n)]/2} - 400*log{[1 - z/sqrt(n)]/2} = 400*log[1 + z/sqrt(n)] - 400*log(2) - 400*log[1 - z/sqrt(n)] + 400*log(2)

Elo = [400/ln(10)]*ln[1 + z/sqrt(n)] - [400/ln(10)]*ln[1 - z/sqrt(n)]

n >> 1 ; 1/sqrt(n) << 1

Taylor series (x << 1): ln(1 + x) ~ x - x²/2 + x³/3 - ... ~ x

Elo ~ [400/ln(10)]*z/sqrt(n) - [400/ln(10)]*[-z/sqrt(n)] = [800/ln(10)]*z/sqrt(n)
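
A quick numerical check of this approximation (a sketch; the z and n values are arbitrary):

Code: Select all

from math import log, log10, sqrt

z = 2.0  # a two-sigma tail
for n in (100, 10_000, 1_000_000):
    score = 0.5 + 0.5 * z / sqrt(n)
    exact = 400 * log10(score / (1 - score))         # exact logistic Elo
    approx = (800 / log(10)) * z / sqrt(n)           # linearized form
    print(n, round(exact, 5), round(approx, 5))
# n = 100: 70.44 vs 69.49; n = 1_000_000: 0.69489 vs 0.69487 -> both -> 0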

------------

Other form:

Elo = 400*log(wins/losses) since draws do not exist in this example.

Elo = 400*log( [ 0.5 + 0.5*(wins - losses)/(wins + losses) ] / { 1 - [ 0.5 + 0.5*(wins - losses)/(wins + losses) ] } )

Elo = 400*log{ [ 1 + (wins - losses)/n ] / [ 1 - (wins - losses)/n ] }

Elo = [400/ln(10)]*ln[ 1 + (wins - losses)/n ] - [400/ln(10)]*ln[ 1 - (wins - losses)/n ]

Taylor series again because (wins - losses)/n << 1:

Elo ~ [800/ln(10)]*(wins - losses)/n

------------

Another form is Elo ~ [1600/ln(10)]*(score - 0.5) when the score is close to 50%. I hope there are no typos. The Elo difference tends to zero, as expected.

Why is LOS not close to 50%, then? Because LOS = 0.5*{1 + erf[z/sqrt(2)]} by definition, where erf is the error function. It was seen before that z is proportional to sqrt(n), so, knowing the shape of erf, LOS is expected to be far from 50% after such a huge number of games, even if the score is close to 50%. How large can |z| = abs(z) get in n-game matches? This can give a hint. An equivalent definition of LOS in terms of wins and losses is 0.5*( 1 + erf{ (wins - losses)/sqrt[2*(wins + losses)] } ), which reinforces the latter statement.

Last but not least, your desired LOS = 50% (other than by averaging the n LOS values, as at the start of this post) would be reached when wins = losses, which becomes less and less probable as n grows. Using the binomial distribution once again, and letting the number of games be even (otherwise wins <> losses when draws are not a possible result), the central binomial coefficient plays a role:

Code: Select all

Prob.(wins) = W = 0.5
Prob.(losses) = L = 1 - W = 0.5

Prob.(wins = losses) = (2*wins over wins) * W^wins * L^losses

Prob.(wins = losses) = {[(2*wins)!]/[(wins!)²]} * 0.5^wins * 0.5^wins = {[(2*wins)!]/[(wins!)²]} * 0.5^(wins + wins) = {[(2*wins)!]/[(wins!)²]} * 0.25^wins

{[(2*wins)!]/[(wins!)²]} ~ (4^wins)/sqrt(pi*wins), applying Stirling's formula x! ~ x^x * exp(-x) * sqrt(2*pi*x).

Prob.(wins = losses) ~ [(4^wins)/sqrt(pi*wins)] * 0.25^wins = 1/sqrt(pi*wins)
A very small value, even though you would get n*Prob.(wins = losses) = n*[1/sqrt(pi*wins)] = 2*sqrt(wins/pi) outcomes (or nearly this value) if you run n simulations of n games each.
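
A quick check of that Stirling estimate against the exact central binomial probability (a sketch; match sizes arbitrary):

Code: Select all

from math import comb, pi, sqrt

for wins in (5, 50, 500, 5000):
    exact = comb(2 * wins, wins) / 4 ** wins   # exact P(wins = losses)
    approx = 1 / sqrt(pi * wins)               # Stirling estimate
    print(wins, round(exact, 6), round(approx, 6))
# Already below 1% for a 10000-game match (wins = 5000), and still shrinking.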

Summary:

a) LOS cares about the sign of Elo difference: sign(LOS - 0.5) = sign(Elo difference).

b) Since LOS can be related to the z-score, which is proportional to sqrt(n): as n tends to infinity, the z-score tends to ±infinity and LOS tends to 0 or 1.

c) The case wins = losses (LOS = 50%) becomes less and less probable with more games: probability ~ 1/sqrt(pi*wins) = sqrt[2/(pi*games)].

d) LOS is intended to be applied to a single n-game match, not across k n-game matches (in your case, n n-game matches). If you want to obtain 50% by any means, try averaging those n LOS values. I expect <LOS> = (1/n)*SUM(LOS_i ; i = 1, ..., n) = 50% or very close, given the symmetric distribution of scores with respect to score = 50% (the score that gives LOS = 50%).

Regards from Spain.

Ajedrecista.
Dann Corbit
Posts: 12540
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Throwing out draws to calculate Elo

Post by Dann Corbit »

The purple versus gold battle ran long enough for me to run out of patience after only 13,945 games.
So here we have it, purple is stronger than gold, with almost absolute certainty:
losses: 2928 wins: 3085 ties: 7932 LOS: 0.978549 Elo diff: 2.49297
Here are the games for anyone who would like to perform their own calculation:


So now we know, without a shadow of a doubt (due to the unflappable and unfailing accuracy of math, and knowing that 1.01 is bigger than one and all that) that Cfish is stronger than Cfish.

Sure glad that is settled.

Elostat output:

Code: Select all

  Program      Elo    +    -   Games   Score   Av.Op.  Draws
1 purple  :   3335    4    4   13945   50.6 %   3331   56.9 %
2 gold    :   3331    4    4   13945   49.4 %   3335   56.9 %
The main thing that is interesting about this output is that 3335 - 4 = 3331. Kind of an interesting symmetry.
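
For what it is worth, the quoted LOS is exactly what the standard decisive-games formula yields for these counts, so the tool applied its formula faithfully; the argument is about interpretation, not arithmetic:

Code: Select all

from math import erf, sqrt

wins, losses = 3085, 2928
print(0.5 * (1 + erf((wins - losses) / sqrt(2 * (wins + losses)))))
# -> 0.97855..., matching the 0.978549 above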
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
Dann Corbit
Posts: 12540
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Throwing out draws to calculate Elo.

Post by Dann Corbit »

Ajedrecista wrote: Thu Jul 02, 2020 1:17 pm Summary:

a) LOS cares about the sign of Elo difference: sign(LOS - 0.5) = sign(Elo difference).

b) Since LOS can be related to the z-score, which is proportional to sqrt(n): as n tends to infinity, the z-score tends to ±infinity and LOS tends to 0 or 1.

c) The case wins = losses (LOS = 50%) becomes less and less probable with more games: probability ~ 1/sqrt(pi*wins) = sqrt[2/(pi*games)].

d) LOS is intended to be applied to a single n-game match, not across k n-game matches (in your case, n n-game matches). If you want to obtain 50% by any means, try averaging those n LOS values. I expect <LOS> = (1/n)*SUM(LOS_i ; i = 1, ..., n) = 50% or very close, given the symmetric distribution of scores with respect to score = 50% (the score that gives LOS = 50%).

Regards from Spain.

Ajedrecista.
I think this analysis is pretty good. Like I said, long matches destroy LOS.
Now, here is the burning question:
In a short, N-game match, do you trust LOS, given that there is randomness in the outcome of chess games?
IOW, is there any reason to believe that the accuracy of the wins, losses and draws of a ten-game contest is higher for an LOS calculation than for an Elo calculation?
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
User avatar
hgm
Posts: 27800
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Throwing out draws to calculate Elo

Post by hgm »

Your result is indeed suspect. Did you have to cheat to get it?

Between equal engines, with a 56% draw rate, the standard deviation of the match result (as a percentage) over 13,945 games would be 0.277%. You got a 50.563% result, i.e. more than two standard deviations from the (in this case known) 50% expected result.

This is quite rare. Normally you would have to play such a match 40 times to get one result that is skewed this much (and one opposite result, of 49.437%). That is why there is only a 1-in-40 probability that it is just luck, and a 39-in-40 probability (i.e. 97.5%) that it has another reason. In a well-designed test the only other possible reason is that the engine is stronger. Hence the LOS.
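
A minimal sketch for anyone who wants to reproduce that arithmetic (the 56.9% draw rate is taken from the Elostat table above):

Code: Select all

from math import sqrt

games, draw_rate = 13945, 0.569
var_per_game = (1 - draw_rate) * 0.25        # scores {0, 0.5, 1}, mean 0.5
sd_percent = sqrt(var_per_game / games) * 100
print(sd_percent)                            # ~0.278%
print((50.563 - 50.0) / sd_percent)          # ~2.0 standard deviations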

So now what exactly is your problem? Are you surprised that when your test produces an extremely atypical result, it can create a wrong impression?

Of course in this case, where you know the engines to be equal, one might start to wonder about the soundness of your testing; after all, the 97.85% probability that there was a reason other than luck covers both 'stronger engine' and 'flawed testing'. Are you sure all the games were independent? Were all 13K opening positions sufficiently different to never transpose into the same game? Were all middle-game positions reached in those matches unique? Did you restart the engines between games, randomly deciding which started first, to exclude the possibility that one of the two got a less favorable memory assignment from the paging unit, slowed down because of it, and then had to play many games in a row (or perhaps the entire match) with that disadvantage?

It is not trivial to do measurements at the 0.1% accuracy level. There are many issues that could potentially have an effect of that magnitude.
Dann Corbit
Posts: 12540
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Throwing out draws to calculate Elo

Post by Dann Corbit »

I suggest you read the analysis of Ajedrecista.
I did not cheat, and it is more lopsided than I thought (I was expecting about 0.8).
If you examine the body of games, I think you will admit it would be a pretty sophisticated job to fake all that data so quickly.
But thank you for your confidence in my abilities.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
Dann Corbit
Posts: 12540
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Throwing out draws to calculate Elo

Post by Dann Corbit »

Another interesting factor is that the Elocalc output shows that the engines have the same Elo within the confidence interval.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.