Throwing out draws to calculate Elo

syzygy · Post by **syzygy** » Thu Jul 02, 2020 2:30 am

Dann Corbit wrote: ↑Thu Jul 02, 2020 1:42 am
syzygy wrote: ↑Wed Jul 01, 2020 11:42 pm
Dann Corbit wrote: ↑Wed Jul 01, 2020 2:08 am Nice discussion Ovyron, but I don't think anyone understands what I am saying (probably because I am not communicating very effectively). Lots of intelligent people do not understand what I am saying, which means I am not doing a good job explaining.
No, you are simply making the mistake of thinking that higher LOS means higher difference in strength, and being rather stubborn.
No, I think it means that it is supposed to be more likely that the engine with the bigger LOS is superior.

A LOS of 1 means it is absolutely certain to be superior.
A LOS of .999 means it is almost certainly superior
A LOS of 0.5 means that it is a coin toss if it is superior or not

And if engine A draws engine B 99.9999999% of the time and beats engine B the remaining 0.0000001% of the time, would you agree that A is superior?

If these numbers can be established with 100% certainty, would you agree that the LOS is 1?

syzygy · Post by **syzygy** » Thu Jul 02, 2020 2:40 am

Instead of looking at LOS you could test the hypothesis that engines A and B are equal in strength.

We run a match until we have 8 decided games.
The match results in N games, i.e. N-8 drawn games and 8 decided games. It turns out that A has won all decided games.

What are the chances of this if A and B are indeed equal in strength?
Clearly, it is 1 in 256. This strongly suggests that the hypothesis that A and B are equal in strength is not correct.

Do you agree? (I would hope you do.)

Does any of this depend on the value of N?
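
The 1 in 256 figure can be checked directly. Below is a minimal sketch, assuming that under the equal-strength hypothesis each decided game is an independent fair coin flip; the function name `prob_all_decided_won` is made up for illustration. Draws never enter the calculation, so N does not appear anywhere.

```python
from fractions import Fraction


def prob_all_decided_won(decided: int) -> Fraction:
    """Chance that one named engine wins every decided game.

    Assumption (hypothetical model): the engines are equal in strength,
    so each decided game is an independent fair coin flip.
    """
    return Fraction(1, 2) ** decided


one_sided = prob_all_decided_won(8)  # A specifically wins all 8 decided games
two_sided = 2 * one_sided            # either engine sweeps all 8 decided games
print(one_sided, two_sided)          # 1/256 1/128
```

The one-sided value matches the figure in the post; the two-sided value is what you would quote if a sweep by either engine had counted as evidence against equality.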

Ovyron · Post by **Ovyron** » Thu Jul 02, 2020 2:58 am

Dann Corbit wrote: ↑Wed Jul 01, 2020 1:48 am I don't think I have ever been more sure of anything in my life.

So now it becomes very interesting, if you ever see the light, if from all the people participating in this thread, I were the one that made you see it.

You continue to run simulations where all opponents are equal. LOS wasn't designed for that. It was designed to be used when opponents are of different strengths, not for self-play, because there you already know it's the same entity copied and the Superiority (let's not confuse it with LOS) is 0.

Please go and rerun your simulations with entities of slightly different strength and show us how the LOS model fails on those. Then a new LOS formula can be created that reflects your simulation.

Otherwise, all you're doing is complaining about how a shoe is really poor at hitting nails, and you go and create simulations of the nails, and then make us imagine how hitting a googol nails would just destroy the shoe. You need to actually simulate what shoes are designed for.

syzygy · Post by **syzygy** » Thu Jul 02, 2020 3:02 am

Ovyron wrote: ↑Thu Jul 02, 2020 2:58 am
Dann Corbit wrote: ↑Wed Jul 01, 2020 1:48 am I don't think I have ever been more sure of anything in my life.
So now it becomes very interesting, if you ever see the light, if from all the people participating in this thread, I were the one that made you see it.

In the meantime perhaps Dann could at least admit that the title he has chosen for this thread and the first sentence of his opening post are completely wrong, since nobody has ever suggested that draws don't count for (regular) Elo calculations. The obvious misconceptions on Dann's side started really early here and so far I have not seen any indication by Dann that he is willing to admit at least those.

Ovyron · Post by **Ovyron** » Thu Jul 02, 2020 3:26 am

Oh yeah, to make it clear LOS means "it's this likely that this entity's Elo is higher than this other entity's Elo", and no draws are needed to say that.

Dann Corbit · Post by **Dann Corbit** » Thu Jul 02, 2020 5:40 am

syzygy wrote: ↑Thu Jul 02, 2020 2:30 am
Dann Corbit wrote: ↑Thu Jul 02, 2020 1:42 am
syzygy wrote: ↑Wed Jul 01, 2020 11:42 pm
Dann Corbit wrote: ↑Wed Jul 01, 2020 2:08 am Nice discussion Ovyron, but I don't think anyone understands what I am saying (probably because I am not communicating very effectively). Lots of intelligent people do not understand what I am saying, which means I am not doing a good job explaining.
No, you are simply making the mistake of thinking that higher LOS means higher difference in strength, and being rather stubborn.
No, I think it means that it is supposed to be more likely that the engine with the bigger LOS is superior.

A LOS of 1 means it is absolutely certain to be superior.
A LOS of .999 means it is almost certainly superior
A LOS of 0.5 means that it is a coin toss if it is superior or not
And if engine A draws engine B 99.9999999% of the time and beats engine B the remaining 0.0000001% of the time, would you agree that A is superior?

If the 0.0000001% were reliable, I would agree.

If these numbers can be established with 100% certainty, would you agree that the LOS is 1?

Yes. The problem is the first 9 words. I invite you to run the program I attached. It flips pennies in a fair and unbiased way. Then it calculates the odds that heads (a random number between 0 and 1 that is 1/2 or more) occurs more often than tails (a random number between 0 and 1 that is less than 1/2). This is a totally fair simulation with a Gaussian distribution. In the simulation, you will see that almost all the time the LOS algorithm says one of the two players (Mr. Heads or Mr. Tails) is superior. We know that this is not true. And the larger the number of trials you run, the more outlandish the probability becomes (on average). So LOS cannot reliably discern that two players are equal.

You may not like my experiment. So I suggest an alternative. Take Cfish, and make two copies, naming one Purple and the other Gold. Have Purple play Gold at game in one second for 100,000 games (I am actually running this simulation right now, but at a bit slower pace). If you don't have patience for that many, try some smaller sizes. Anyway, run LOS on it and find out how likely it is that Purple is superior to Gold (or conversely that Gold is superior to Purple). You will find that most of the time, it will say one is stronger. Now, we simply know this is not true. And yet it will say "definitely stronger" with a significant number like .75 or above. The more games you run, the larger the number is likely to be (because although the values are tending towards the mean, the spread is increasing in absolute value). If, then, we cannot tell whether an engine is the same strength as itself, how are we going to use it to determine whether a change offers an improvement? And if we cannot tell what the numbers it emits mean, then how can it be useful?

Don't trust me. Try it for yourself and see.
The program offers a shortcut. It can do the 100,000 trials 100,000 times in just a couple minutes. But maybe you do not trust my code.
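
The attached program is not reproduced in the thread, so what follows is only a sketch of the kind of experiment described, not Dann's code. It assumes the commonly used normal-approximation formula LOS = 0.5 * (1 + erf((W - L) / sqrt(2 * (W + L)))); the names `los` and `fair_match`, the seed, and the sizes are illustrative choices.

```python
import math
import random


def los(wins: int, losses: int) -> float:
    """Likelihood of superiority via the normal approximation (assumed formula)."""
    if wins + losses == 0:
        return 0.5  # no decided games: no evidence either way
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * (wins + losses))))


def fair_match(flips: int, rng: random.Random) -> float:
    """Flip a fair coin `flips` times (heads when the draw is >= 0.5) and return LOS for heads."""
    heads = sum(rng.random() >= 0.5 for _ in range(flips))
    return los(heads, flips - heads)


rng = random.Random(2020)  # fixed seed so the run is repeatable
values = [fair_match(1000, rng) for _ in range(1000)]

# Share of fair matches where LOS favours one side at the 0.75 level or beyond.
extreme = sum(v >= 0.75 or v <= 0.25 for v in values) / len(values)
print(f"share of fair matches with LOS outside (0.25, 0.75): {extreme:.2f}")
```

Under the normal approximation, the LOS of two equal players is spread roughly uniformly between 0 and 1, so about half of such matches should land at or beyond 0.75 or 0.25 whatever the number of flips. Changing the `1000` passed to `fair_match` is a quick way to check how that share behaves as the number of trials grows.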

Dann Corbit · Post by **Dann Corbit** » Thu Jul 02, 2020 6:03 am

Ovyron wrote: ↑Thu Jul 02, 2020 3:26 am Oh yeah, to make it clear LOS means "it's this likely that this entity's Elo is higher than this other entity's Elo", and no draws are needed to say that.

It's actually less than that, because Elo is drastically affected by draws, and LOS is not.
From here:
http://www.ewbilliards.com/EloCalculater/
We can discover that:
100 wins and 10 losses gives an Elo difference of +400
100 wins and 10 losses and 500 draws gives an Elo difference of +52

Same number of wins, but wildly different Elo factor (almost 8 times larger without the draws).
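
Both figures can be reproduced with the usual logistic Elo relation, diff = 400 * log10(s / (1 - s)), where s is the score fraction with each draw counted as half a point. This is a sketch under that assumption (the linked calculator's internals are not shown in the thread), and `elo_diff` is an illustrative name.

```python
import math


def elo_diff(wins: int, losses: int, draws: int = 0) -> float:
    """Elo difference implied by a match score (logistic model, assumed).

    Undefined for a 100% or 0% score, where the formula diverges.
    """
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games
    return 400.0 * math.log10(score / (1.0 - score))


print(round(elo_diff(100, 10)))       # 400
print(round(elo_diff(100, 10, 500)))  # 52
```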

So the LOS tells us nothing about Elo.

The idea is simply this:
"What is the probability that A is stronger than B?"
Note that we have no idea how much stronger. Only a boolean yes or no is being analyzed, and the value returned is a probability between 0 and 1. If the value returned is 0.5, then that means "it is a coin toss which one is stronger" and therefore they are the same strength.
If it returns 1.0, that means it is absolutely certain that A is stronger than B.
If it returns less than 1/2, then B is probably stronger than A and you should reverse the calculation or subtract the returned number from 1 to find out how likely it is that B is stronger than A.

Dann Corbit · Post by **Dann Corbit** » Thu Jul 02, 2020 6:59 am

syzygy wrote: ↑Thu Jul 02, 2020 2:30 am
Dann Corbit wrote: ↑Thu Jul 02, 2020 1:42 am
syzygy wrote: ↑Wed Jul 01, 2020 11:42 pm
Dann Corbit wrote: ↑Wed Jul 01, 2020 2:08 am Nice discussion Ovyron, but I don't think anyone understands what I am saying (probably because I am not communicating very effectively). Lots of intelligent people do not understand what I am saying, which means I am not doing a good job explaining.
No, you are simply making the mistake of thinking that higher LOS means higher difference in strength, and being rather stubborn.
No, I think it means that it is supposed to be more likely that the engine with the bigger LOS is superior.

A LOS of 1 means it is absolutely certain to be superior.
A LOS of .999 means it is almost certainly superior
A LOS of 0.5 means that it is a coin toss if it is superior or not
And if engine A draws engine B 99.9999999% of the time and beats engine B the remaining 0.0000001% of the time, would you agree that A is superior?

If these numbers can be established with 100% certainty, would you agree that the LOS is 1?

If you run the experiment twice, that is not enough.
You seem to think that an engine emitting a win is deterministic. It is not.
Otherwise, we could run a single game and know if an engine is stronger.
The reason we have to run a thousand games to get any sort of reasonable idea of strength is that there is a lot of randomness involved, especially when the engines are evenly matched.

I am not arguing with mathematics.
Everyone on planet earth agrees that 1.1 is bigger than one, and it does not matter how many zeros are in between unless the count is infinite.
The question is: if I do it again, will it be 1.1 again, and if it is, was that also a fluke?
I see that nobody had the guts to answer my question about whether or not computer chess game outcomes have randomness involved.
Maybe because that is what makes the LOS house of cards take a tumble.

Of course I am not arguing math.
And a contest with 100 games that has 10 wins and 7 losses is a measured datum, and I do not seriously question the datum (though the recording of the datum also has randomness associated with it, as do all electrical or mechanical processes).
I do not think that there are math errors involved in the formula.
I do think that the formula fails to deliver as promised, for two reasons:
1. It does not take the randomness of the trials into proper account.
2. It discards evidence of equality.
When it does provide the right answer, it is an accident.

Dann Corbit · Post by **Dann Corbit** » Thu Jul 02, 2020 7:19 am

syzygy wrote: ↑Thu Jul 02, 2020 2:40 am Instead of looking at LOS you could test the hypothesis that engines A and B are equal in strength.

We run a match until we have 8 decided games.
The match results in N games, i.e. N-8 drawn games and 8 decided games. It turns out that A has won all decided games.

What are the chances of this if A and B are indeed equal in strength?
Clearly, it is 1 in 256. This strongly suggests that the hypothesis that A and B are equal in strength is not correct.

Do you agree? (I would hope you do.)

Does any of this depend on the value of N?

I partly agree with it.
A single experiment has shown a data point that indicates a 1/256 chance that they are equal in strength.
If you run a coin toss tool, you will see that outcome with a fair penny one time out of 256.
But if I run the test again, it may be the same or it may be different.
With 8 data points, the error bar is as big as the figure returned.
Clearly we know that answer can be completely wrong.
If I were to say, "There is a 0.390625% chance that the engine with fewer wins is stronger," that is wrong. The reason it is wrong is that I am basing my conclusion on so few data points that it is meaningless data. That is why it is severely frowned upon to do any sort of statistical operation with fewer than 32 measurements. It is simply coughing up bologna. Now, we can increase the number of games to a big number. But when you do that, the ordinary Gaussian randomness sends out a wider and wider band of wins and/or losses that overwhelms LOS. So increasing the game count with LOS actually decreases accuracy.
In addition LOS ignores evidence of equality.
If I ran 8 games and then told you the Elo of my engine, you would say, "You're so full of it your eyes are turning brown. You cannot base the Elo of an engine on 8 games." However, the LOS seems somehow magically different? Why is that? Because we are assuming that the 8 data points are perfect and sent from God himself and cannot be wrong.
The wins and losses in LOS are exactly as iffy as the wins and losses in an Elo calculation. Both are affected by randomness, so it takes a great many trials to be sure of the result. But LOS has a built-in defect.

Dann Corbit · Post by **Dann Corbit** » Thu Jul 02, 2020 7:59 am

A surprising fact about randomness with coin tosses:
Every single sequence has exactly the same probability.
So that:
H-H-H-H-H-H-H-H
is exactly the same probability as:
H-T-H-T-H-T-H-T
is exactly the same probability as:
T-T-T-T-H-H-H-H
is exactly the same probability as:
Etc.
So if we see 8 heads in a row we should not think that is more surprising than any other 8 flip sequence.
The reason these sequences average out to half heads and half tails is that when we total the heads and tails for all possible sequences, the number of heads and tails is half and half.
So:
HHHHHHHH
is balanced out by:
TTTTTTTT
etc.
You see 8 wins in the output of an experimental trial. Do not think it magic proof of superiority.
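
The counting in this post can be checked by brute force. A small sketch, assuming a fair coin (the variable names are illustrative), that enumerates all 2^8 sequences and tallies them by number of heads:

```python
from collections import Counter
from itertools import product

# Every possible sequence of 8 flips; under a fair coin each has probability 1/256.
sequences = list(product("HT", repeat=8))

# Group the sequences by how many heads they contain.
by_heads = Counter(seq.count("H") for seq in sequences)

print(len(sequences))            # 256
print(by_heads[8], by_heads[4])  # 1 70
```

Each individual sequence is equally likely, but only one of the 256 has eight heads while 70 have exactly four, which is why head totals cluster around the middle.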
