Here is some data from my simulator: 100K experiments of 100K games each. Recall that the engines are exactly the same strength.
Notice that even at the tail, the losses and wins differ by only about 1%, yet the LOS is almost 1. That is interesting, because this is the same engine playing against itself: the real strength difference is zero, and the real LOS is 0.5. Note also that even for the most extreme entry in the list, the Elo difference is very small, indicating that the players appear to be about the same strength. The data that follows is sorted, but I did not remove any duplicates.
Code: Select all
Losses,Wins,Ties,LOS_Prob,Elo_Diff
49373,50627,0,0.999963,4.3569
49375,50625,0,0.999961,4.343
49376,50624,0,0.99996,4.33605
49393,50607,0,0.999938,4.21791
49400,50600,0,0.999926,4.16927
49403,50597,0,0.99992,4.14842
49411,50589,0,0.999902,4.09282
49415,50585,0,0.999892,4.06503
49416,50584,0,0.999889,4.05808
49421,50579,0,0.999875,4.02333
49423,50577,0,0.999869,4.00943
49430,50570,0,0.999844,3.96079
49438,50562,0,0.999811,3.90519
49438,50562,0,0.999811,3.90519
49438,50562,0,0.999811,3.90519
49439,50561,0,0.999806,3.89824
49441,50559,0,0.999796,3.88434
49443,50557,0,0.999786,3.87044
49445,50555,0,0.999776,3.85655
49449,50551,0,0.999754,3.82875
49450,50550,0,0.999748,3.8218
You might object that this is just the tail. But this is what the data looks like 20,000 lines into the file:
Code: Select all
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
Here we see that there is less than one Elo point of difference between the players, yet the LOS insists that it is very likely (about 0.8) that the first player is stronger than the second. Yet we know that they are exactly equal.
Here is data from 40,000 lines into the file:
Code: Select all
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49960,50040,0,0.599859,0.277938
49960,50040,0,0.599859,0.277938
Even this far into the file, the LOS calculation gives a value of about 0.6, indicating that A is probably stronger than B.
In fact, nearly every line in the file other than the exact ties (rows where wins equal losses) will show either that A is stronger than B or that B is stronger than A with a fairly large probability. Yet we know that A and B have exactly the same strength.
If you were using LOS to decide whether to keep a modification to your chess engine, it would almost always give you a false answer: about half the time it will say the change is better when it actually does nothing, half the time it will say the change is worse when it does nothing, and a good portion of the time it will say the change is dominatingly better when it has no effect at all.
So what useful thing does LOS tell you?
Code: Select all
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <random>
#include "elo.hpp"

class results {
public:
    int lt_h;
    int gt_h;
    int ties;
    results() : lt_h(0), gt_h(0), ties(0) {}
};

// http://talkchess.com/forum3/viewtopic.php?t=51003#p554193
double los(int wins, int losses, int draws)
{
    double games = wins + losses + draws;
#ifdef _DEBUG
    std::printf("Number of games: %g\n", games);
#endif
    double winning_fraction = (wins + 0.5 * draws) / games;
#ifdef _DEBUG
    std::printf("Winning fraction: %g\n", winning_fraction);
    double elo_difference = -std::log(1.0 / winning_fraction - 1.0) * 400.0 / std::log(10.0);
    std::printf("Elo difference: %+g\n", elo_difference);
#endif
    return 0.5 + 0.5 * std::erf((wins - losses) / std::sqrt(2.0 * (wins + losses)));
}

size_t experiment_size = 1000;

int main(int argc, char **argv)
{
    if (argc > 1)
    {
        experiment_size = atoll(argv[1]);
        if (experiment_size < 10) experiment_size = 10;
    }

    std::mt19937 generator(17);
    std::uniform_real_distribution<double> urd(0.0, 1.0);
    results *contests = new results[experiment_size];

    // Each experiment plays experiment_size games between two equal "engines".
    for (size_t contest = 0; contest < experiment_size; contest++)
    {
        for (size_t game = 0; game < experiment_size; game++)
        {
            double value = urd(generator);
            if (value < 0.5) contests[contest].lt_h++;
            else if (value > 0.5) contests[contest].gt_h++;
            else contests[contest].ties++;
        }
    }

    for (size_t contest = 0; contest < experiment_size; contest++)
    {
        Elo::IntervalEstimate est = Elo::estimate_rating_difference(
            contests[contest].gt_h, contests[contest].ties, contests[contest].lt_h);
        std::cout << "losses: " << contests[contest].lt_h
                  << " wins: " << contests[contest].gt_h
                  << " ties: " << contests[contest].ties
                  << " LOS: " << los(contests[contest].gt_h, contests[contest].lt_h, contests[contest].ties)
                  << " Elo diff: " << est.estimate << std::endl;
    }

    delete[] contests;
    return 0;
}
Header file:
Code: Select all
/*
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
*/
#pragma once

#include <cmath>
#include <functional>
#include <stdexcept>
#include <utility>
#include <vector>

namespace Elo {

const double WIN = 1;
const double DRAW = 0.5;
const double LOSS = 0;
const double LOGISTIC_DIVISOR = 400 / std::log(10);

double round_places(double x, double places) {
    if (x == 0) {
        return 0;
    }
    double scale = std::pow(10.0, places);
    return std::round(x * scale) / scale;
}

struct Distribution {
public:
    virtual double cdf(double x, double mean) const {
        return 0;
    }
    virtual ~Distribution() {}
};

struct LogisticDistribution : public Distribution {
    double base;
    double scale;

public:
    LogisticDistribution(double initial_base, double initial_scale):
        base(initial_base), scale(initial_scale) {};

    virtual double cdf(double x, double mean) const override {
        return 1.0 / (1.0 + std::pow(base, -((x - mean) / scale)));
    }
};

struct NormalDistribution : public Distribution {
    double stdev;

public:
    NormalDistribution(double initial_stdev):
        stdev(initial_stdev) {};

    virtual double cdf(double x, double mean) const override {
        return (1 + std::erf((x - mean) / (stdev * std::sqrt(2)))) / 2;
    }
};

LogisticDistribution default_distribution(10, 400);

class Player;
struct Match;

struct Configuration {
    std::function<double(Player&)> calculate_k;
    Distribution& dist;

    Configuration(double initial_k, Distribution& initial_distribution = default_distribution):
        calculate_k([initial_k](Player& p) {
            return initial_k;
        }), dist(initial_distribution) {};

    Configuration(std::function<double(Player&)> initial_calculate_k, Distribution& initial_distribution = default_distribution):
        calculate_k(initial_calculate_k), dist(initial_distribution) {};
};

Configuration default_configuration(32, default_distribution);
class Player {
    std::vector<Match> matches;

public:
    double rating;
    Configuration config;

    Player(double initial_rating, Configuration initial_configuration = default_configuration):
        rating(initial_rating), config(initial_configuration) {};

    double round_rating(double places) {
        return round_places(rating, places);
    }

    std::vector<Match> get_matches() {
        return matches;
    }

    // Add match WITHOUT changing rating.
    void add_match(Match& m) {
        matches.push_back(m);
    }
};

struct Match {
    Player& player_a;
    Player& player_b;
    // This is the result for player_a.
    double result;

    Match(Player& initial_player_a, Player& initial_player_b, double initial_result, bool apply_now = false):
        player_a(initial_player_a), player_b(initial_player_b), result(initial_result) {
        if (apply_now) {
            apply();
        }
    };

    bool apply() {
        if (applied) {
            return false;
        }
        double player_a_delta = player_a.config.calculate_k(player_a) * (result - player_a.config.dist.cdf(player_b.rating, player_a.rating));
        double player_b_delta = player_b.config.calculate_k(player_b) * ((1 - result) - player_b.config.dist.cdf(player_a.rating, player_b.rating));
        player_a.rating += player_a_delta;
        player_b.rating += player_b_delta;
        player_a.add_match(*this);
        player_b.add_match(*this);
        applied = true;
        return true;
    }

private:
    bool applied = false;
};
struct IntervalEstimate {
    double lower = 0;
    double estimate = 0;
    double upper = 0;
    double p = 0;
    bool lower_infinity = false;
    bool estimate_infinity = false;
    bool upper_infinity = false;
};

/* Source: Abramowitz, M. & Stegun, I. (1964).
   Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables.
   Section 26.2.23. */
double homf_tail(double p) {
    if (p <= 0 || p >= 1) {
        throw std::invalid_argument("p must be in (0,1).");
    }
    double c0 = 2.515517;
    double c1 = 0.802853;
    double c2 = 0.010328;
    double d1 = 1.432788;
    double d2 = 0.189269;
    double d3 = 0.001308;
    double t = std::sqrt(std::log(1 / (p * p)));
    return t - (c0 + c1 * t + c2 * t * t) / (1 + d1 * t + d2 * t * t + d3 * t * t * t);
}

double quantile(double p) {
    if (p <= 0 || p >= 1) {
        throw std::invalid_argument("p must be in (0,1).");
    }
    if (p < 0.5) {
        return -homf_tail(p);
    } else {
        return homf_tail(1 - p);
    }
}
// x successes in n trials using the Wilson score interval (see also https://arxiv.org/pdf/1303.1288.pdf).
IntervalEstimate binomial_estimate(double x, double n, double p = 0.95) {
    if (x < 0) {
        throw std::invalid_argument("x must be nonnegative.");
    }
    if (n <= 0) {
        throw std::invalid_argument("n must be positive.");
    }
    if (x > n) {
        throw std::invalid_argument("x must not be greater than n.");
    }
    IntervalEstimate est;
    double z = quantile(1 - (1 - p) / 2);
    double mlp = x / n;
    double mlq = 1 - mlp;
    double mid = (x + (z * z) / 2) / (n + z * z);
    double delta = (z / (n + z * z)) * std::sqrt(mlp * mlq * n + (z * z) / 4);
    est.estimate = mid;
    est.lower = mid - delta;
    est.upper = mid + delta;
    est.p = p;
    return est;
}

// Default parameters with the default distribution.
double logistic_inverse_cdf(double x, double mean = 0, double scale = LOGISTIC_DIVISOR) {
    if (x <= 0 || x >= 1) {
        throw std::invalid_argument("x must be in (0,1).");
    }
    return (mean + scale * std::log(x / (1 - x)));
}

// Rating difference assuming the default Logistic distribution.
IntervalEstimate estimate_rating_difference(int wins, int draws, int losses, double p = 0.95) {
    if (wins < 0 || draws < 0 || losses < 0) {
        throw std::invalid_argument("wins, draws, and losses must be nonnegative.");
    }
    // Naive solution: 1 draw = 1 win and 1 loss. This is so we can calculate
    // based on a binomial rather than a trinomial distribution.
    wins += draws;
    losses += draws;
    double n = wins + losses;
    if (n <= 0) {
        throw std::invalid_argument("The total number of games must be positive.");
    }
    IntervalEstimate est;
    if (wins / n == 0 || wins / n == 1) {
        est.estimate_infinity = true;
        return est;
    }
    est = binomial_estimate(wins, n, p);
    if (est.estimate <= 0 || est.estimate >= 1) {
        est.estimate_infinity = true;
    } else {
        est.estimate = logistic_inverse_cdf(est.estimate);
    }
    if (est.lower <= 0) {
        est.lower_infinity = true;
    } else {
        est.lower = logistic_inverse_cdf(est.lower);
    }
    if (est.upper >= 1) {
        est.upper_infinity = true;
    } else {
        est.upper = logistic_inverse_cdf(est.upper);
    }
    return est;
}

} // namespace Elo