Throwing out draws to calculate Elo
Moderators: Harvey Williamson, Dann Corbit, hgm
Forum rules
This textbox is used to restore diagrams posted with the [d] tag before the upgrade.
This textbox is used to restore diagrams posted with the [d] tag before the upgrade.

 Posts: 11705
 Joined: Wed Mar 08, 2006 7:57 pm
 Location: Redmond, WA USA
 Contact:
Re: Throwing out draws to calculate Elo
I talked to a stone, but the stone did not listen. Maybe because it was unimaginative. I'm not sure. I am not sure how smart it is to talk to stones. People will start to wonder about your sanity.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
Re: Throwing out draws to calculate Elo
HGM:s sanity has increased a lot after this discussion. I was a little bit afraid after his Corona maskDann Corbit wrote: ↑Wed Jul 01, 2020 3:15 pmI talked to a stone, but the stone did not listen. Maybe because it was unimaginative. I'm not sure. I am not sure how smart it is to talk to stones. People will start to wonder about your sanity.
Einstein was also a stone (I guess not all will get it).
HGM is correct. If you do not like the LOS measure you do not have to use it.
Actually all draws in your example Increases the likelihood that the one with 8 wins is superior compared to the case with no draws. The reason is that if you assume that opponent actually was slightly better it would be extremely unlikely that he would draw all 10^100  8 games and then lose 8 games compared to the likelihood he would lose 8 out of 8 games.
And yes I am also a physicist/mathematician like HGM (even not remotely as smart) so I am probably also wrong. Actually we have a pattern here that says that physicists are mostly wrong but I guess throwing in 10^100 non physicists would make us much more right
Re: Throwing out draws to calculate Elo
And what is the probability that a match between absolutely identical engines ends up in 80 after discarding draws?Dann Corbit wrote: ↑Tue Jun 30, 2020 9:34 pmYes, of course I read them.
But if absolultely identically equal engines play each other thousands of times, the number of wins and losses will not be the same.
And does this probability depend on the number of draws? [Of course not! And this is the crux!]
Re: Throwing out draws to calculate Elo
If there is randomness, you will not get 8 wins, 0 losses and 10^1008 draws to begin with.Dann Corbit wrote: ↑Tue Jun 30, 2020 9:50 pmLet me be perfectly clear:
If there were no such thing as randomness in engine game outcomes, then my argument would be completely wrong.
Your whole premise implies that there is (almost) no randomness in the experiment you are conducting.

It seems you are reasoning like this:
If the outcome of a match between engines A and B is 80 with 10000 draws, the LOS formula suggests that engine A is almost certainly superior to engine B.
But this is ridiculous, you say, because in practise there will be lots of random wins and losses and the outcome would be more like 20041996 with 6000 draws. And this does not say anything about whether A is better than B!!!
Sure, but YOUR PREMISE is that the outcome was 80 with 10000 draws. And in that case, the LOS formula correctly shows that A is almost certainly superior.
On the other hand, if the outcome is 20041996 with 6000 draws, then that same LOS formula does not at all suggest that A is almost certainly superior.
Re: Throwing out draws to calculate Elo
No, you are simply making the mistake to think that higher LOS means higher difference in strength and being rather stubborn.Dann Corbit wrote: ↑Wed Jul 01, 2020 12:08 amNice discussion Ovyron, but I don't think anyone understands what I am saying (probably because I am not communicating very effectively). Lots of intelligent people do not understand what I am saying, which means I am not doing a good job explaining.

 Posts: 11705
 Joined: Wed Mar 08, 2006 7:57 pm
 Location: Redmond, WA USA
 Contact:
Re: Throwing out draws to calculate Elo
Here is some data from my simulator, with 100K games in 100K experiments. Recall that the engines are exactly the same strength.
Notice that the Losses and wins, even at the tail are only off by about 1%. The LOS is almost 1. That is interesting, because this is the same engine playing against itself. Therefore, the real strength difference is zero and the real LOS is 0.5. Note that at the very most extreme element of the list, the Elo difference is very small. Indicating that the players seem to be about the same strength. In the data that follows, I sorted it but did not remove any duplicates.
You might object, that this is just the tail.
But this is what the data looks like 20,000 lines into the file:
Here, we see that there is less than one Elo difference between the players, and we see that LOS is insiting that it is very likely that the first player is stronger than the second. Yet we know that they are exactly equal.
Here is data from 40,000 lines into the file:
Even this far along, the LOS calculation gives a value of .6 indicating that A is probably stronger than B.
In fact, nearly all of the data in the file other than the exact matches will show that either A is stronger than B or B is stronger than A with a fairly large probability. Yet we know that A and B have exactly the same strength.
If you were using LOS to choose to modify your chess engine, It would almost always give you a false answer. About half the time, it will say the change is better if it does nothing, and half of the time it will say the change is worse if it does nothing. And a good portion of the time, it will say it is dominatingly better when the change has no effect at all.
So what useful thing does LOS tell you?
Header file:
Notice that the Losses and wins, even at the tail are only off by about 1%. The LOS is almost 1. That is interesting, because this is the same engine playing against itself. Therefore, the real strength difference is zero and the real LOS is 0.5. Note that at the very most extreme element of the list, the Elo difference is very small. Indicating that the players seem to be about the same strength. In the data that follows, I sorted it but did not remove any duplicates.
Code: Select all
Losses,Wins,Ties,LOS_Prob,Elo_Diff
49373,50627,0,0.999963,4.3569
49375,50625,0,0.999961,4.343
49376,50624,0,0.99996,4.33605
49393,50607,0,0.999938,4.21791
49400,50600,0,0.999926,4.16927
49403,50597,0,0.99992,4.14842
49411,50589,0,0.999902,4.09282
49415,50585,0,0.999892,4.06503
49416,50584,0,0.999889,4.05808
49421,50579,0,0.999875,4.02333
49423,50577,0,0.999869,4.00943
49430,50570,0,0.999844,3.96079
49438,50562,0,0.999811,3.90519
49438,50562,0,0.999811,3.90519
49438,50562,0,0.999811,3.90519
49439,50561,0,0.999806,3.89824
49441,50559,0,0.999796,3.88434
49443,50557,0,0.999786,3.87044
49445,50555,0,0.999776,3.85655
49449,50551,0,0.999754,3.82875
49450,50550,0,0.999748,3.8218
But this is what the data looks like 20,000 lines into the file:
Code: Select all
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
49867,50133,0,0.799872,0.924145
Here is data from 40,000 lines into the file:
Code: Select all
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49959,50041,0,0.602301,0.284886
49960,50040,0,0.599859,0.277938
49960,50040,0,0.599859,0.277938
In fact, nearly all of the data in the file other than the exact matches will show that either A is stronger than B or B is stronger than A with a fairly large probability. Yet we know that A and B have exactly the same strength.
If you were using LOS to choose to modify your chess engine, It would almost always give you a false answer. About half the time, it will say the change is better if it does nothing, and half of the time it will say the change is worse if it does nothing. And a good portion of the time, it will say it is dominatingly better when the change has no effect at all.
So what useful thing does LOS tell you?
Code: Select all
#include <random>
#include <iostream>
#include "elo.hpp"
using namespace std;
class results {
public:
int lt_h;
int gt_h;
int ties;
results()
{
lt_h = 0;
gt_h = 0;
ties = 0;
}
};
#include <cstdio>
#include <cstdlib>
#include <cmath>
// http://talkchess.com/forum3/viewtopic.php?t=51003#p554193
double los(int wins, int losses, int draws)
{
double games = wins + losses + draws;
#ifdef _DEBUG
std::printf("Number of games: %g\n", games);
#endif
double winning_fraction = (wins + 0.5*draws) / games;
#ifdef _DEBUG
std::printf("Winning fraction: %g\n", winning_fraction);
#endif
double elo_difference = std::log(1.0 / winning_fraction  1.0)*400.0 / std::log(10.0);
#ifdef _DEBUG
std::printf("Elo difference: %+g\n", elo_difference);
#endif
return 0.5 + 0.5 * std::erf((wins  losses) / std::sqrt(2.0*(wins + losses)));
}
size_t experiment_size = 1000;
int main(int argc, char **argv)
{
if (argc > 1)
{
experiment_size = atoll(argv[1]);
if (experiment_size < 10) experiment_size = 10;
}
std::mt19937 generator (17);
std::uniform_real_distribution<double> urd(0.0, 1.0);
results * contests = new results[experiment_size];
for (int contest = 0; contest < experiment_size; contest++)
{
for (int result = 0; result < experiment_size; result++)
{
double value = urd(generator);
if (value < 0.5) contests[contest].lt_h++;
else if (value > 0.5) contests[contest].gt_h++;
else contests[contest].ties++;
}
}
for (int contest = 0; contest < experiment_size; contest++)
{
Elo::IntervalEstimate est = Elo::estimate_rating_difference(contests[contest].gt_h, contests[contest].ties, contests[contest].lt_h);
std::cout << "losses: " << contests[contest].lt_h << " wins: " << contests[contest].gt_h << " ties: " << contests[contest].ties << " LOS: " <<los(contests[contest].gt_h, contests[contest].lt_h, contests[contest].ties) << " Elo diff: " << est.estimate << std::endl;
}
delete [] contests;
return 0;
}
Code: Select all
/*
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
THE SOFTWARE.
*/
#pragma once
#include <cmath>
#include <functional>
#include <stdexcept>
#include <utility>
#include <vector>
namespace Elo {
const double WIN = 1;
const double DRAW = 0.5;
const double LOSS = 0;
const double LOGISTIC_DIVISOR = 400 / std::log(10);
double round_places(double x, double places) {
if (x == 0) {
return 0;
}
double scale = std::pow(10.0, places);
return std::round(x * scale) / scale;
}
struct Distribution {
public:
virtual double cdf(double x, double mean) const {
return 0;
};
virtual ~Distribution() {};
};
struct LogisticDistribution : public Distribution {
double base;
double scale;
public:
LogisticDistribution(double initial_base, double initial_scale):
base(initial_base), scale(initial_scale) {};
virtual double cdf(double x, double mean) const override {
return 1.0 / (1.0 + std::pow(base, ((x  mean) / scale)));
}
};
struct NormalDistribution : public Distribution {
double stdev;
public:
NormalDistribution(double initial_stdev):
stdev(initial_stdev) {};
virtual double cdf(double x, double mean) const override {
return (1 + std::erf((x  mean) / (stdev * std::sqrt(2)))) / 2;
}
};
LogisticDistribution default_distribution(10, 400);
class Player;
struct Match;
struct Configuration {
std::function<double(Player&)> calculate_k;
Distribution& dist;
Configuration(double initial_k, Distribution& initial_distribution = default_distribution):
calculate_k([initial_k](Player& p) {
return initial_k;
}), dist(initial_distribution) {};
Configuration(std::function<double(Player&)> initial_calculate_k, Distribution& initial_distribution = default_distribution):
calculate_k(initial_calculate_k), dist(initial_distribution) {};
};
Configuration default_configuration(32, default_distribution);
class Player {
std::vector<Match> matches;
public:
double rating;
Configuration config;
Player(double initial_rating, Configuration initial_configuration = default_configuration):
rating(initial_rating), config(initial_configuration) {};
double round_rating(double places) {
return round_places(rating, places);
}
std::vector<Match> get_matches() {
return matches;
}
// Add match WITHOUT changing rating.
void add_match(Match& m) {
matches.push_back(m);
}
};
struct Match {
Player& player_a;
Player& player_b;
// This is the result for player_a.
double result;
Match(Player& initial_player_a, Player& initial_player_b, double initial_result, bool apply_now = false):
player_a(initial_player_a), player_b(initial_player_b), result(initial_result) {
if (apply_now) {
apply();
}
};
bool apply() {
if (applied) {
return false;
}
double player_a_delta = player_a.config.calculate_k(player_a) * (result  player_a.config.dist.cdf(player_b.rating, player_a.rating));
double player_b_delta = player_b.config.calculate_k(player_b) * ((1  result)  player_b.config.dist.cdf(player_a.rating, player_b.rating));
player_a.rating += player_a_delta;
player_b.rating += player_b_delta;
player_a.add_match(*this);
player_b.add_match(*this);
applied = true;
return true;
}
private:
bool applied = false;
};
struct IntervalEstimate {
double lower;
double estimate;
double upper;
double p;
bool lower_infinity = false;
bool estimate_infinity = false;
bool upper_infinity = false;
};
/* Source: Abramowitz, M. & Stegun, I. (1964).
Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables.
Section 26.2.23. */
double homf_tail(double p) {
if (p <= 0  p >= 1) {
throw std::invalid_argument("p must be in [0,1].");
}
double c0 = 2.515517;
double c1 = 0.802853;
double c2 = 0.010328;
double d1 = 1.432788;
double d2 = 0.189269;
double d3 = 0.001308;
double t = std::sqrt(std::log(1 / (p * p)));
return t  (c0 + c1 * t + c2 * t * t) / (1 + d1 * t + d2 * t * t + d3 * t * t * t);
}
double quantile(double p) {
if (p < 0  p > 1) {
throw std::invalid_argument("p must be in [0,1].");
}
if (p < 0.5) {
return homf_tail(p);
} else {
return homf_tail(1  p);
}
}
// x successes in n trials using the Wilson score interval (see also https://arxiv.org/pdf/1303.1288.pdf).
IntervalEstimate binomial_estimate(double x, double n, double p = 0.95) {
if (x < 0) {
throw std::invalid_argument("x must be nonnegative.");
}
if (n <= 0) {
throw std::invalid_argument("n must be positive.");
}
if (x > n) {
throw std::invalid_argument("x must not be greater than n.");
}
IntervalEstimate est;
double z = quantile(1  (1  p) / 2);
double mlp = x / n;
double mlq = 1  mlp;
double mid = (x + (z * z) / 2) / (n + z * z);
double delta = (z / (n + z * z)) * std::sqrt(mlp * mlq * n + (z * z) / 4);
est.estimate = mid;
est.lower = mid  delta;
est.upper = mid + delta;
est.p = p;
return est;
}
// Default parameters with the default distribution.
double logistic_inverse_cdf(double x, double mean = 0, double scale = LOGISTIC_DIVISOR) {
if (x <= 0  x >= 1) {
throw std::invalid_argument("x must be in (0,1).");
}
return (mean + scale * std::log(x / (1  x)));
}
// Rating difference assuming the default Logistic distribution.
IntervalEstimate estimate_rating_difference(int wins, int draws, int losses, double p = 0.95) {
if (wins < 0  draws < 0  losses < 0) {
throw std::invalid_argument("wins, draws, and losses must be nonnegative.");
}
// Naive solution: 1 draw = 1 win and 1 loss. This is so we can calculate based on a binomial rather than trinomial distribution.
wins += draws;
losses += draws;
double n = wins + losses;
if (n <= 0) {
throw std::invalid_argument("The total number of games must be positive.");
}
IntervalEstimate est;
if (wins / n == 0  wins / n == 1) {
est.estimate_infinity = true;
return est;
}
est = binomial_estimate(wins, n, p);
if (est.estimate <= 0  est.estimate >= 1) {
est.estimate_infinity = true;
} else {
est.estimate = logistic_inverse_cdf(est.estimate);
}
if (est.lower <= 0) {
est.lower_infinity = true;
} else {
est.lower = logistic_inverse_cdf(est.lower);
}
if (est.upper >= 1) {
est.upper_infinity = true;
} else {
est.upper = logistic_inverse_cdf(est.upper);
}
return est;
}
}
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.

 Posts: 11705
 Joined: Wed Mar 08, 2006 7:57 pm
 Location: Redmond, WA USA
 Contact:
Re: Throwing out draws to calculate Elo
Random events do not have to happen frequently.syzygy wrote: ↑Wed Jul 01, 2020 9:30 pmIf there is randomness, you will not get 8 wins, 0 losses and 10^1008 draws to begin with.Dann Corbit wrote: ↑Tue Jun 30, 2020 9:50 pmLet me be perfectly clear:
If there were no such thing as randomness in engine game outcomes, then my argument would be completely wrong.
Astroids hitting the earth and 9.0 earthquakes happen at random time intervals and are very infrequent.
I am not insisting that the draws are random, because that would be pretty unlikely. But the wins and losses are quitly likely random, just very rare events.
The fact that a win or loss only happens once in 8/10^100 trials only shows that they are very rare. Not that they are not random.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.

 Posts: 11705
 Joined: Wed Mar 08, 2006 7:57 pm
 Location: Redmond, WA USA
 Contact:
Re: Throwing out draws to calculate Elo
No, I think it means that it is supposed to be more likely that the engine with the bigger LOS is superior.syzygy wrote: ↑Wed Jul 01, 2020 9:42 pmNo, you are simply making the mistake to think that higher LOS means higher difference in strength and being rather stubborn.Dann Corbit wrote: ↑Wed Jul 01, 2020 12:08 amNice discussion Ovyron, but I don't think anyone understands what I am saying (probably because I am not communicating very effectively). Lots of intelligent people do not understand what I am saying, which means I am not doing a good job explaining.
A LOS of 1 means it is absolutely certain to be superior.
A LOS of .999 means it almost certainly superior
A LOS of 0.5 means that it is a coin toss if it is superior or not
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
Re: Throwing out draws to calculate Elo
What you are saying is that even with identical engines 8 wins, 0 losses and 10^1008 draws may happen (and in fact will happen) if you run enough matches.Dann Corbit wrote: ↑Wed Jul 01, 2020 10:15 pmRandom events do not have to happen frequently.syzygy wrote: ↑Wed Jul 01, 2020 9:30 pmIf there is randomness, you will not get 8 wins, 0 losses and 10^1008 draws to begin with.Dann Corbit wrote: ↑Tue Jun 30, 2020 9:50 pmLet me be perfectly clear:
If there were no such thing as randomness in engine game outcomes, then my argument would be completely wrong.
Astroids hitting the earth and 9.0 earthquakes happen at random time intervals and are very infrequent.
I am not insisting that the draws are random, because that would be pretty unlikely. But the wins and losses are quitly likely random, just very rare events.
This is true. And, obviously, the LOS calculated on the basis of that extreme outlier is meaningless.
What is your point?
There is nothing in statistics that can protect you against the extremely unlikely, in particular if you are purposively selecting the extremely unlikely occurrences and basing your calculations on those while tossing out the rest...
If it is always the same engine that is winning, then that is not random but a sound statistical basis for drawing the conclusion that that engine is superior (even if almost equal in strength). (Unless you are selecting the games under consideration out of many other games, in which case all bets are off.) It is does not matter how many draws occur in between. This is a mathematical fact and also not so difficult to grasp intuitively if you take some time to think about it.The fact that a win or loss only happens once in 8/10^100 trials only shows that they are very rare. Not that they are not random.
Why are you denying mathematics? It makes more sense to argue that the earth is flat...