Throwing out draws to calculate Elo

syzygy · Post by **syzygy** » Tue Jun 30, 2020 10:00 pm

Dann Corbit wrote: ↑Tue Jun 30, 2020 1:44 pm A tie or a draw is, by its very definition, an indicator of equality.
The more ties between opponents, the more likely that they are of the same strength.
This is obvious by the very definition of a tie or draw: Neither opponent was able to overcome the other.

So you have difficulty with the idea that, even though 1.00000000001 is very close to 1.00000000000, it is still larger with (in this case) mathematical certainty - 100% LOS.

The Tic-Tac-Toe example as explained by HGM seems to be a realistic scenario where your sequence of results of 8 wins, 0 losses and an enormous number of draws could happen. Indeed the best inference one could make is that the engine with 8 losses is somehow more susceptible to cosmic particles or other kinds of very rare hardware failures and for that reason inferior. But in "strength" as measured by Elo they will be very, very close.

So please stop confusing higher LOS with higher difference in strength.

Dann Corbit · Post by **Dann Corbit** » Tue Jun 30, 2020 10:19 pm

Using: https://www.3dkingdoms.com/chess/elo.htm
The Elo advantage of an engine that wins 1000 games and loses zero games is +infinity:

The Elo advantage of an engine that wins 1000 games, loses zero games and draws a mere 1000000000000000000000000000 times is zero.

Yet the LOS calculation says that an engine that has an infinite Elo advantage has exactly the same likelihood of superiority as an engine that has an Elo advantage of zero.

What is wrong with this picture?
Everything.

hgm · Post by **hgm** » Tue Jun 30, 2020 10:25 pm

Only you. That 10^100 is larger than 1 is exactly as certain as that 1.000000000000000000000000000000000001 is larger than 1. Both are 100% certain.

Dann Corbit · Post by **Dann Corbit** » Tue Jun 30, 2020 10:36 pm

When you measure something, there is never absolute certainty.

Summing a column of numbers in a computer gives two different answers, depending on the direction of summation (numerical calculation error).

The best engine does not always win (randomness in real life).

The design of the experiment can be imperfect (e.g. one machine is ever so slightly stronger than the other so we are not measuring the strength of the software difference but of the hardware difference)

The one at the end of your long string of zeros is without a doubt, dirt (if it came from empirical experiments and measurements).

Dann Corbit · Post by **Dann Corbit** » Tue Jun 30, 2020 11:02 pm

```c
/*
Numerical error, when the numbers get big:
adding 1.0 ten million times to a huge double changes nothing,
because 1.0 is far below the spacing between adjacent doubles there.
*/
#include <float.h>
#include <stdio.h>

int main(void)
{
    int i;
    double starting_sum = DBL_MAX / 10.0;
    double ending_sum = starting_sum;

    for (i = 0; i < 10000000; i++)
    {
        ending_sum += 1.0;
    }
    printf("Ending sum is now %g bigger than starting sum.\n", ending_sum - starting_sum);
    return 0;
}
```

syzygy · Post by **syzygy** » Tue Jun 30, 2020 11:26 pm

Dann Corbit wrote: ↑Tue Jun 30, 2020 10:19 pm Using: https://www.3dkingdoms.com/chess/elo.htm
The Elo advantage of an engine that wins 1000 games and loses zero games is +infinity:

The Elo advantage of an engine that wins 1000 games, loses zero games and draws a mere 1000000000000000000000000000 times is zero.

Yet the LOS calculation says that an engine that has an infinite Elo advantage has exactly the same likelihood of superiority as an engine that has an Elo advantage of zero.

What is wrong with this picture?
Everything.

There is nothing wrong with the picture. You simply need to understand what you are talking about.

Engine A being superior to engine B with 100% certainty says nothing about the Elo difference (except that it is > 0).

Do you understand the Tic Tac Toe example of two equal "perfect" engines A and B with the only difference that engine B once in a blue moon messes up due to cosmic radiation, while A has been properly shielded? The Elo difference is perhaps 0.001 Elo and the LOS is 100%.

Do you understand the other example that was given of an engine B that is a copy of engine A but modified to forfeit on time once in every million games? Engine B's Elo will only be slightly lower than engine A's Elo, but there is absolutely no question that A is superior to B.

Did you even READ the explanations you have been given?

That LOS is independent of the number of draws can be shown mathematically. You say you accept the mathematics. Yet you don't accept the consequence. How much sense are you making?

Dann Corbit · Post by **Dann Corbit** » Tue Jun 30, 2020 11:34 pm

Yes, of course I read them.

But if absolutely identical engines play each other thousands of times, the number of wins and losses will not be the same.

You know this: play Stockfish against itself for one hundred games.

Hence, a small difference in wins and losses does not tell us which engine is stronger. In order to know if it *might* be stronger, it must be outside of the error bands.

Your accidental forfeit once in a million games will not show up in a contest of equal engines, because the noise of randomness will far exceed the once in a million loss.

And an engine with an infinite Elo advantage should not have the same LOS as an engine with a zero Elo advantage. It could still be stronger, but the probability should be different.

Maybe we do have difficulty understanding each other. I guess my problem is that when something makes no sense to me, I don't believe it.

I could, of course, be wrong. I am wrong a lot. But when the outcome of a model says something stupid, I think the model is wrong.

I think that model is based only on math and not on probability. Otherwise it would not make absurd predictions.
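
One rough way to put numbers on the noise argument in this post, offered as a back-of-the-envelope sketch only. It assumes two otherwise identical engines, independent games, a fraction `decisive_fraction` of games that end decisively, and an extra loss rate `forfeit_rate` for one side. The win/loss difference then has a standard deviation of about sqrt(decisive_fraction * N) after N games, against an expected surplus of forfeit_rate * N. All names and the model itself are assumptions made for illustration.

```c
#include <assert.h>

/* Number of games after which an extra loss rate stands z standard
   deviations above the random spread of (wins - losses).
   Solves forfeit_rate * N = z * sqrt(decisive_fraction * N) for N.
   Simplified model: independent games, otherwise equal engines. */
double games_needed(double forfeit_rate, double decisive_fraction, double z)
{
    assert(forfeit_rate > 0.0 && decisive_fraction > 0.0 && z > 0.0);
    double ratio = z / forfeit_rate;
    return decisive_fraction * ratio * ratio;
}
```

For a one-in-a-million forfeit, 30% decisive games and z = 2, `games_needed(1e-6, 0.3, 2.0)` is about 1.2e12 games. In a hundred-game match the effect is indeed buried in noise; in a long enough match it is not.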

hgm · Post by **hgm** » Tue Jun 30, 2020 11:40 pm

Dann Corbit wrote: ↑Tue Jun 30, 2020 10:36 pm Summing a column of numbers in a computer gives two different answers, depending on the direction of summation (numerical calculation error).

Not if they are integers.
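
A small sketch of that distinction, with made-up sample data: integer addition is exact as long as the total does not overflow, so the order of summation cannot matter, whereas floating-point addition rounds after every step. The helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Sum n integers front to back (forward != 0) or back to front. */
long long sum_ints(const long long *v, size_t n, int forward)
{
    assert(v != NULL);
    long long s = 0;
    for (size_t i = 0; i < n; i++)
        s += forward ? v[i] : v[n - 1 - i];
    return s;
}

/* Same loop for doubles; every addition may round. */
double sum_doubles(const double *v, size_t n, int forward)
{
    assert(v != NULL);
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += forward ? v[i] : v[n - 1 - i];
    return s;
}

/* Forward sum minus backward sum for illustrative data:
   one huge value followed by two small ones. */
long long int_order_gap(void)
{
    const long long v[] = { 10000000000000000LL, 1, 1 };
    return sum_ints(v, 3, 1) - sum_ints(v, 3, 0);
}

double double_order_gap(void)
{
    const double v[] = { 1e16, 1.0, 1.0 };
    return sum_doubles(v, 3, 1) - sum_doubles(v, 3, 0);
}
```

With ordinary IEEE double arithmetic, `int_order_gap()` is 0 while `double_order_gap()` is not. Win, loss and draw counts are integers, so they fall in the first category.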

The design of the experiment can be imperfect (e.g. one machine is ever so slightly stronger than the other so we are not measuring the strength of the software difference but of the hardware difference)

Flawed experiments can tell you nothing. Whether they have high draw rate or not. You will always measure the sum of the flaws and the engine strength.

Note, however, that the experiment must be very flawed in order to produce an 8-0 result. This would only be possible in the most careless design. Even something as simple as making sure that each program uses the same machine the same number of times would prevent it.

The one at the end of your long string of zeros is without a doubt, dirt (if it came from empirical experiments and measurements).

Not if the experiment involved 10^100 games. You can really measure things very precisely with such a large number of games. Counting with integers involves no loss of precision, and a counter that can count to 10^100 is actually quite a small machine. You could afford hundreds of them, and cross-check those to see if one is in error or not.

But the 10^100 was just proverbial, so nitpicking over it makes no sense. In reality you could never play 10^100 games, not even if you turned all matter in the Universe into PCs and set them playing, before the Universe collapsed into a black hole or all protons in it decayed to positrons. For the argument you are trying to make, having 8 wins + a billion draws (and no losses) would be just as effective. If a billion draws have zero impact on the LOS, 10^100 will have zero impact too, right? 10^100 times 0 is still zero.

Dann Corbit · Post by **Dann Corbit** » Tue Jun 30, 2020 11:41 pm

```c
/*
Numerical error, when the numbers get small:
the same terms summed in a different order can give different float results.
*/
#include <float.h>
#include <stdio.h>

int main(void)
{
    int i;
    float sum_small_to_big = 0.0f;
    float sum_big_to_small = 1.1f;

    /* Small terms first, the big term last. */
    for (i = 0; i < 1000000000; i++)
    {
        sum_small_to_big += FLT_EPSILON;
    }
    sum_small_to_big += 1.1f;

    /* Big term first, then the same small terms. */
    for (i = 0; i < 1000000000; i++)
    {
        sum_big_to_small += FLT_EPSILON;
    }

    printf("sum_small_to_big is %g, sum_big_to_small is %g, and difference is %g\n",
        sum_small_to_big,
        sum_big_to_small,
        sum_small_to_big - sum_big_to_small);

    return 0;
}
```

Dann Corbit · Post by **Dann Corbit** » Tue Jun 30, 2020 11:43 pm

I can run a billion, quadrillion, or googol or even googolplex trials in a gedankenexperiment. That's what is so nice about them.
And a googol draws tells us that the engines are equal.
A few wins for either side tells us exactly nothing about superiority in that case.
