Throwing out draws to calculate Elo

Dann Corbit · Post by **Dann Corbit** » Wed Jul 01, 2020 5:15 pm

I talked to a stone, but the stone did not listen. Maybe because it was unimaginative. I'm not sure. I am not sure how smart it is to talk to stones. People will start to wonder about your sanity.

hgm · Post by **hgm** » Wed Jul 01, 2020 7:12 pm

Start?

Pio · Post by **Pio** » Wed Jul 01, 2020 7:25 pm

Dann Corbit wrote: ↑Wed Jul 01, 2020 5:15 pm I talked to a stone, but the stone did not listen. Maybe because it was unimaginative. I'm not sure. I am not sure how smart it is to talk to stones. People will start to wonder about your sanity.

HGM:s sanity has increased a lot after this discussion. I was a little bit afraid after his Corona mask

Einstein was also a stone (I guess not all will get it).

HGM is correct. If you do not like the LOS measure you do not have to use it.

Actually all draws in your example Increases the likelihood that the one with 8 wins is superior compared to the case with no draws. The reason is that if you assume that opponent actually was slightly better it would be extremely unlikely that he would draw all 10^100 - 8 games and then lose 8 games compared to the likelihood he would lose 8 out of 8 games.

And yes I am also a physicist/mathematician like HGM (even not remotely as smart) so I am probably also wrong. Actually we have a pattern here that says that physicists are mostly wrong but I guess throwing in 10^100 non physicists would make us much more right

syzygy · Post by **syzygy** » Wed Jul 01, 2020 11:27 pm

Dann Corbit wrote: ↑Tue Jun 30, 2020 11:34 pm Yes, of course I read them.

But if absolultely identically equal engines play each other thousands of times, the number of wins and losses will not be the same.

And what is the probability that a match between absolutely identical engines ends up in 8-0 after discarding draws?

And does this probability depend on the number of draws? [Of course not! And this is the crux!]

syzygy · Post by **syzygy** » Wed Jul 01, 2020 11:30 pm

Dann Corbit wrote: ↑Tue Jun 30, 2020 11:50 pm Let me be perfectly clear:
If there were no such thing as randomness in engine game outcomes, then my argument would be completely wrong.

If there is randomness, you will not get 8 wins, 0 losses and 10^100-8 draws to begin with.

Your whole premise implies that there is (almost) no randomness in the experiment you are conducting.

----

It seems you are reasoning like this:
If the outcome of a match between engines A and B is 8-0 with 10000 draws, the LOS formula suggests that engine A is almost certainly superior to engine B.

But this is ridiculous, you say, because in practise there will be lots of random wins and losses and the outcome would be more like 2004-1996 with 6000 draws. And this does not say anything about whether A is better than B!!!

Sure, but YOUR PREMISE is that the outcome was 8-0 with 10000 draws. And in that case, the LOS formula correctly shows that A is almost certainly superior.

On the other hand, if the outcome is 2004-1996 with 6000 draws, then that same LOS formula does not at all suggest that A is almost certainly superior.

syzygy · Post by **syzygy** » Wed Jul 01, 2020 11:42 pm

Dann Corbit wrote: ↑Wed Jul 01, 2020 2:08 am Nice discussion Ovyron, but I don't think anyone understands what I am saying (probably because I am not communicating very effectively). Lots of intelligent people do not understand what I am saying, which means I am not doing a good job explaining.

No, you are simply making the mistake to think that higher LOS means higher difference in strength and being rather stubborn.

Dann Corbit · Post by **Dann Corbit** » Thu Jul 02, 2020 12:09 am

Here is some data from my simulator, with 100K games in 100K experiments. Recall that the engines are exactly the same strength.
Notice that the Losses and wins, even at the tail are only off by about 1%. The LOS is almost 1. That is interesting, because this is the same engine playing against itself. Therefore, the real strength difference is zero and the real LOS is 0.5. Note that at the very most extreme element of the list, the Elo difference is very small. Indicating that the players seem to be about the same strength. In the data that follows, I sorted it but did not remove any duplicates.

Code: Select all

Losses,Wins,Ties,LOS_Prob,Elo_Diff
49373,50627,0,0.999963,4.3569
49375,50625,0,0.999961,4.343
49376,50624,0,0.99996,4.33605
49393,50607,0,0.999938,4.21791
49400,50600,0,0.999926,4.16927
49403,50597,0,0.99992,4.14842
49411,50589,0,0.999902,4.09282
49415,50585,0,0.999892,4.06503
49416,50584,0,0.999889,4.05808
49421,50579,0,0.999875,4.02333
49423,50577,0,0.999869,4.00943
49430,50570,0,0.999844,3.96079
49438,50562,0,0.999811,3.90519
49438,50562,0,0.999811,3.90519
49438,50562,0,0.999811,3.90519
49439,50561,0,0.999806,3.89824
49441,50559,0,0.999796,3.88434
49443,50557,0,0.999786,3.87044
49445,50555,0,0.999776,3.85655
49449,50551,0,0.999754,3.82875
49450,50550,0,0.999748,3.8218

You might object, that this is just the tail.
But this is what the data looks like 20,000 lines into the file:

Code: Select all

49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145

Here, we see that there is less than one Elo difference between the players, and we see that LOS is insiting that it is very likely that the first player is stronger than the second. Yet we know that they are exactly equal.

Here is data from 40,000 lines into the file:

Code: Select all

49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49960,50040,0,0.599859,0.277938
49960,50040,0,0.599859,0.277938

Even this far along, the LOS calculation gives a value of .6 indicating that A is probably stronger than B.
In fact, nearly all of the data in the file other than the exact matches will show that either A is stronger than B or B is stronger than A with a fairly large probability. Yet we know that A and B have exactly the same strength.

If you were using LOS to choose to modify your chess engine, It would almost always give you a false answer. About half the time, it will say the change is better if it does nothing, and half of the time it will say the change is worse if it does nothing. And a good portion of the time, it will say it is dominatingly better when the change has no effect at all.

So what useful thing does LOS tell you?

Code: Select all

#include <random>
#include <iostream>
#include "elo.hpp"
using namespace std;
class results {
public:
    int lt_h;
    int gt_h;
    int ties;
    results()
    {
        lt_h = 0;
        gt_h = 0;
        ties = 0;
    }
};

#include <cstdio>
#include <cstdlib>
#include <cmath>

// http://talkchess.com/forum3/viewtopic.php?t=51003#p554193
double los(int wins, int losses, int draws)
{
    double games = wins + losses + draws;
#ifdef _DEBUG
    std::printf("Number of games: %g\n", games);
#endif
    double winning_fraction = (wins + 0.5*draws) / games;
#ifdef _DEBUG
    std::printf("Winning fraction: %g\n", winning_fraction);
#endif
    double elo_difference = -std::log(1.0 / winning_fraction - 1.0)*400.0 / std::log(10.0);
#ifdef _DEBUG
    std::printf("Elo difference: %+g\n", elo_difference);
#endif
    return 0.5 + 0.5 * std::erf((wins - losses) / std::sqrt(2.0*(wins + losses)));

}

size_t experiment_size = 1000;
int main(int argc, char **argv)
{
    if (argc > 1)
    {
        experiment_size = atoll(argv[1]);
        if (experiment_size < 10) experiment_size = 10;
    }
    std::mt19937 generator (17);
    std::uniform_real_distribution<double> urd(0.0, 1.0);
    results * contests = new results[experiment_size];
    for (int contest = 0; contest < experiment_size; contest++)
    {
        for (int result = 0; result < experiment_size; result++)
        {
            double value = urd(generator);
            if (value < 0.5) contests[contest].lt_h++;
            else if (value > 0.5) contests[contest].gt_h++;
            else  contests[contest].ties++;
        }
    }
    for (int contest = 0; contest < experiment_size; contest++)
    {
		Elo::IntervalEstimate est = Elo::estimate_rating_difference(contests[contest].gt_h, contests[contest].ties, contests[contest].lt_h);

        std::cout << "losses: " << contests[contest].lt_h  << " wins: " <<  contests[contest].gt_h  << " ties: " <<  contests[contest].ties << " LOS: " <<los(contests[contest].gt_h, contests[contest].lt_h, contests[contest].ties) << " Elo diff: " << est.estimate << std::endl;
    }
    delete [] contests;
    return 0;
}

Header file:

Code: Select all

/*
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
*/
#pragma once

#include <cmath>
#include <functional>
#include <stdexcept>
#include <utility>
#include <vector>

namespace Elo {

const double WIN = 1;
const double DRAW = 0.5;
const double LOSS = 0;
const double LOGISTIC_DIVISOR = 400 / std::log(10);

double round_places(double x, double places) {
    if (x == 0) {
        return 0;
    }

    double scale = std::pow(10.0, places);
    return std::round(x * scale) / scale;
}

struct Distribution {
public:
    virtual double cdf(double x, double mean) const {
        return 0;
    };
    virtual ~Distribution() {};
};

struct LogisticDistribution : public Distribution {
    double base;
    double scale;

public:
    LogisticDistribution(double initial_base, double initial_scale):
        base(initial_base), scale(initial_scale) {};

    virtual double cdf(double x, double mean) const override {
        return 1.0 / (1.0 + std::pow(base, -((x - mean) / scale)));
    }
};

struct NormalDistribution : public Distribution {
    double stdev;

public:
    NormalDistribution(double initial_stdev):
        stdev(initial_stdev) {};

    virtual double cdf(double x, double mean) const override {
        return (1 + std::erf((x - mean) / (stdev * std::sqrt(2)))) / 2;
    }
};

LogisticDistribution default_distribution(10, 400);

class Player;
struct Match;

struct Configuration {
    std::function<double(Player&)> calculate_k;
    Distribution& dist;

    Configuration(double initial_k, Distribution& initial_distribution = default_distribution):
        calculate_k([initial_k](Player& p) {
        return initial_k;
    }), dist(initial_distribution) {};

    Configuration(std::function<double(Player&)> initial_calculate_k, Distribution& initial_distribution = default_distribution):
        calculate_k(initial_calculate_k), dist(initial_distribution) {};
};

Configuration default_configuration(32, default_distribution);

class Player {
    std::vector<Match> matches;

public:
    double rating;
    Configuration config;

    Player(double initial_rating, Configuration initial_configuration = default_configuration):
        rating(initial_rating), config(initial_configuration) {};

    double round_rating(double places) {
        return round_places(rating, places);
    }

    std::vector<Match> get_matches() {
        return matches;
    }

    // Add match WITHOUT changing rating.
    void add_match(Match& m) {
        matches.push_back(m);
    }
};

struct Match {
    Player& player_a;
    Player& player_b;
    // This is the result for player_a.
    double result;

    Match(Player& initial_player_a, Player& initial_player_b, double initial_result, bool apply_now = false):
        player_a(initial_player_a), player_b(initial_player_b), result(initial_result) {
        if (apply_now) {
            apply();
        }
    };

    bool apply() {
        if (applied) {
            return false;
        }

        double player_a_delta = player_a.config.calculate_k(player_a) * (result - player_a.config.dist.cdf(player_b.rating, player_a.rating));
        double player_b_delta = player_b.config.calculate_k(player_b) * ((1 - result) - player_b.config.dist.cdf(player_a.rating, player_b.rating));
        player_a.rating += player_a_delta;
        player_b.rating += player_b_delta;
        player_a.add_match(*this);
        player_b.add_match(*this);

        applied = true;
        return true;
    }

private:
    bool applied = false;
};

struct IntervalEstimate {
    double lower;
    double estimate;
    double upper;
    double p;
    bool lower_infinity = false;
    bool estimate_infinity = false;
    bool upper_infinity = false;
};

/* Source: Abramowitz, M. & Stegun, I. (1964).
Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables.
Section 26.2.23. */
double homf_tail(double p) {
    if (p <= 0 || p >= 1) {
        throw std::invalid_argument("p must be in [0,1].");
    }

    double c0 = 2.515517;
    double c1 = 0.802853;
    double c2 = 0.010328;
    double d1 = 1.432788;
    double d2 = 0.189269;
    double d3 = 0.001308;

    double t = std::sqrt(std::log(1 / (p * p)));
    return t - (c0 + c1 * t + c2 * t * t) / (1 + d1 * t + d2 * t * t + d3 * t * t * t);
}

double quantile(double p) {
    if (p < 0 || p > 1) {
        throw std::invalid_argument("p must be in [0,1].");
    }

    if (p < 0.5) {
        return -homf_tail(p);
    } else {
        return homf_tail(1 - p);
    }
}

// x successes in n trials using the Wilson score interval (see also https://arxiv.org/pdf/1303.1288.pdf).
IntervalEstimate binomial_estimate(double x, double n, double p = 0.95) {
    if (x < 0) {
        throw std::invalid_argument("x must be nonnegative.");
    }

    if (n <= 0) {
        throw std::invalid_argument("n must be positive.");
    }

    if (x > n) {
        throw std::invalid_argument("x must not be greater than n.");
    }

    IntervalEstimate est;
    double z = quantile(1 - (1 - p) / 2);
    double mlp = x / n;
    double mlq = 1 - mlp;
    double mid = (x + (z * z) / 2) / (n + z * z);
    double delta = (z / (n + z * z)) * std::sqrt(mlp * mlq * n + (z * z) / 4);

    est.estimate = mid;
    est.lower = mid - delta;
    est.upper = mid + delta;
    est.p = p;
    return est;
}

// Default parameters with the default distribution.
double logistic_inverse_cdf(double x, double mean = 0, double scale = LOGISTIC_DIVISOR) {
    if (x <= 0 || x >= 1) {
        throw std::invalid_argument("x must be in (0,1).");
    }

    return (mean + scale * std::log(x / (1 - x)));
}


// Rating difference assuming the default Logistic distribution.
IntervalEstimate estimate_rating_difference(int wins, int draws, int losses, double p = 0.95) {
    if (wins < 0 || draws < 0 || losses < 0) {
        throw std::invalid_argument("wins, draws, and losses must be nonnegative.");
    }

    // Naive solution: 1 draw = 1 win and 1 loss. This is so we can calculate based on a binomial rather than trinomial distribution.
    wins += draws;
    losses += draws;
    double n = wins + losses;

    if (n <= 0) {
        throw std::invalid_argument("The total number of games must be positive.");
    }

    IntervalEstimate est;

    if (wins / n == 0 || wins / n == 1) {
        est.estimate_infinity = true;
        return est;
    }

    est = binomial_estimate(wins, n, p);

    if (est.estimate <= 0 || est.estimate >= 1) {
        est.estimate_infinity = true;
    } else {
        est.estimate = logistic_inverse_cdf(est.estimate);
    }

    if (est.lower <= 0) {
        est.lower_infinity = true;
    } else {
        est.lower = logistic_inverse_cdf(est.lower);
    }

    if (est.upper >= 1) {
        est.upper_infinity = true;
    } else {
        est.upper = logistic_inverse_cdf(est.upper);
    }

    return est;
}
}

Dann Corbit · Post by **Dann Corbit** » Thu Jul 02, 2020 12:15 am

syzygy wrote: ↑Wed Jul 01, 2020 11:30 pm
Dann Corbit wrote: ↑Tue Jun 30, 2020 11:50 pm Let me be perfectly clear:
If there were no such thing as randomness in engine game outcomes, then my argument would be completely wrong.
If there is randomness, you will not get 8 wins, 0 losses and 10^100-8 draws to begin with.

Random events do not have to happen frequently.
Astroids hitting the earth and 9.0 earthquakes happen at random time intervals and are very infrequent.
I am not insisting that the draws are random, because that would be pretty unlikely. But the wins and losses are quitly likely random, just very rare events.
The fact that a win or loss only happens once in 8/10^100 trials only shows that they are very rare. Not that they are not random.

Dann Corbit · Post by **Dann Corbit** » Thu Jul 02, 2020 1:42 am

syzygy wrote: ↑Wed Jul 01, 2020 11:42 pm
Dann Corbit wrote: ↑Wed Jul 01, 2020 2:08 am Nice discussion Ovyron, but I don't think anyone understands what I am saying (probably because I am not communicating very effectively). Lots of intelligent people do not understand what I am saying, which means I am not doing a good job explaining.
No, you are simply making the mistake to think that higher LOS means higher difference in strength and being rather stubborn.

No, I think it means that it is supposed to be more likely that the engine with the bigger LOS is superior.

A LOS of 1 means it is absolutely certain to be superior.
A LOS of .999 means it almost certainly superior
A LOS of 0.5 means that it is a coin toss if it is superior or not

syzygy · Post by **syzygy** » Thu Jul 02, 2020 2:20 am

Dann Corbit wrote: ↑Thu Jul 02, 2020 12:15 am
syzygy wrote: ↑Wed Jul 01, 2020 11:30 pm
Dann Corbit wrote: ↑Tue Jun 30, 2020 11:50 pm Let me be perfectly clear:
If there were no such thing as randomness in engine game outcomes, then my argument would be completely wrong.
If there is randomness, you will not get 8 wins, 0 losses and 10^100-8 draws to begin with.
Random events do not have to happen frequently.
Astroids hitting the earth and 9.0 earthquakes happen at random time intervals and are very infrequent.
I am not insisting that the draws are random, because that would be pretty unlikely. But the wins and losses are quitly likely random, just very rare events.

What you are saying is that even with identical engines 8 wins, 0 losses and 10^100-8 draws may happen (and in fact will happen) if you run enough matches.

This is true. And, obviously, the LOS calculated on the basis of that extreme outlier is meaningless.

What is your point?

There is nothing in statistics that can protect you against the extremely unlikely, in particular if you are purposively selecting the extremely unlikely occurrences and basing your calculations on those while tossing out the rest...

The fact that a win or loss only happens once in 8/10^100 trials only shows that they are very rare. Not that they are not random.

If it is always the same engine that is winning, then that is not random but a sound statistical basis for drawing the conclusion that that engine is superior (even if almost equal in strength). (Unless you are selecting the games under consideration out of many other games, in which case all bets are off.) It is does not matter how many draws occur in between. This is a mathematical fact and also not so difficult to grasp intuitively if you take some time to think about it.

Why are you denying mathematics? It makes more sense to argue that the earth is flat...

Throwing out draws to calculate Elo

Re: Throwing out draws to calculate Elo

Re: Throwing out draws to calculate Elo

Re: Throwing out draws to calculate Elo

Re: Throwing out draws to calculate Elo

Re: Throwing out draws to calculate Elo

Re: Throwing out draws to calculate Elo

Re: Throwing out draws to calculate Elo

Re: Throwing out draws to calculate Elo

Re: Throwing out draws to calculate Elo

Re: Throwing out draws to calculate Elo