Noise in ELO estimators: a quantitative approach

mcostalba · Post by **mcostalba** » Sun Jan 06, 2013 6:46 pm

It is anecdotal known that noise level influences the convergence speed of the ELO estimators: it is not the same to play 1000 games in a controlled environment in single thread mode or the same 1000 games in a very noisy environment, say 8 threads per engine on a dual core with pondering on end eventually with some heavy activity in the background (like watching a movie).

It is anecdotal known that the reliability of the estimation that is possible to infer from the 2 game series is different.

Here I would like to introduce a practical way to "measure" this reliability.

Say we have a series S0 of 2000 games between 2 engines. For each game we store the result as Win /Lost/ Draw. We feed this series to an ELO estimator (Bayes or something else, does not matter as long as converges). ELO estimator will output an estimated ELO difference between the 2 engines, call it Elo0

Now we split the series in 2: the first 1000 games and the second 1000 games. We feed these 2 sub-series (S1_0, S1_1) to the same ELO estimator and we get the corresponding 2 ELO estimated differences: Elo1_0, Elo1_1

We repeat the previous step splitting in 2 series of 500 games each previous S1 series, getting a total of 4 sub-series: S2_0, S2_1, S2_3, S2_4. As in the previous step we ran the ELO estimators on them getting the estimated 4 ELO differences: Elo2_0, Elo2_1, Elo2_3, Elo2_4

We could further divide, but for this example just stop here. Now we consider the following quantity:

Code: Select all

(Elo0 - Elo0)/1  +  (Elo1_0 - Elo0)/2 + (Elo1_1 - Elo0)/2 +  (Elo2_0 - Elo0)/4 + (Elo2_1 - Elo0)/4 +  (Elo2_3 - Elo0)/4 + (Elo2_4 - Elo0)/4

The above represents the sum of the errors intended as the difference between the best ELO difference estimation (Elo0) and each elo estimation calculated from a sub-series, divided (weighted) by the number of same type sub-series.

Because we are interested in a measure of the power of the noise, we consider the square of each term getting (first term is zero):

Code: Select all

Noise2 = (Elo1_0 - Elo0)^2/4 + (Elo1_1 - Elo0)^2/4 +  (Elo2_0 - Elo0)^2/16 + (Elo2_1 - Elo0)^2/16 +  (Elo2_3 - Elo0)^2/16 + (Elo2_4 - Elo0)^2/16

Where Noise2 is the power of noise calculated with 2 following sub-splits. We can also calculate Noise3 and in general NoiseN.

My guess is that this NoiseX quantity:

1) Is independent from the 2 engines under test, in particular is independent by their ELO difference.

2) Converges to zero when the number of games go to infinity (this is a consequence of ELO estimator convergences on the sub-series)

3) Is non negative and is higher when the single sub-series have very wide results. It is smaller in case single sub-series better approach full series.

4) At the moment does not converge with increasing number of splits due to the series of n term: a1/n^2 + a2/n^2 + ... + an/n^2 is asymptotically similar to n*ax/n^2 -> ax/n that is known to do not converge, but perhaps with a steeper denominator we can prove NoiseN is convergent for N -> infinite. This would be a very important result because we could extrapolate a NoiseINFINITY level independent by the games played and only due to the quality of testing framework !

In practical terms it would be very interesting to add some code to, say cutechess, to calculate the Noise2 (or NoiseX with x input parameter) on the fly while games are in progression (like it does with ELO) so to verify that:

a) For a given 'order' of Noise, say 2, then Noise2 decreases with the increase of number of played games (limit is zero at infinity)

b) For a fixed number of games, say 5000, a very noisy environment has Noise2 figure higher then a quality test framework

c) Given the a) and b) to find an equivalent reliability between two different testing framework and different number of games. IOW be able to say that for testing framework A (that is very noisy) I need to play 10000 games to reach the same reliability of testing framework B (that is good) is able to give with just 2000 games: This happens when the Noise2 of of the pair (framework, number of games) is the same.

hgm · Post by **hgm** » Mon Jan 07, 2013 2:26 pm

You can determine the empirical variance in a stretch of games empirically, but if the result is significantly different from what the simple sqrt(score*(1-score)-draw/4)/sqrt(N) would predict, there must be something very wrong in your test setup.

For one, if the stretches of games you compare the scores of would start from different, randomly chosen initial positions, you should never see any deviant variance. No matter how 'noisy' the test environment is.

When you start the compared game stretches from the same set of initial positions, in 'non-noisy' test environments, where the engines play highly deterministic, you could get lower variance than the formula predicts. Even upto the point where the variance would be zero, every run giving exactly the same result game by game, because the engines play fully deterministic. Then your test results would be statistically significant with very few games. They could be so even with only one game: if version A would win it, and version B would lose it, version A is obviously better, and no matter how often you repeat the test, you would always find that A is better. There would be no statistical uncertainty at all, and the LOS would be 100%.

The problem, of course, is that you only proved beyond statistical doubt that A is better than B in winning this particular position at this particular TC. Which is not something that would be of much interest to anyone. What is of interest is whether A would on average do better than B from a randomly chosen initial position.

Now randomly selecting starting positions is an obvious way to make the different stretches completely independent; there is nothing that could cause them to reproduce an earlier result as they are already in a different position from move 1 on. But that makes that there also is absolutely nothing for the test environment to spoil, no matter how 'noisy' it is.

So there is nothing to gain by your scheme. If your test setup is vulnerable to dependence of the games, by playing the same initial position more than once, it would be wise to perform the test you describe to ensure that you won't lose anything compared to the theoretical prediction. If your test environment is sufficiently noisy, you can get away with reuse of initial positions, because the games will diverge so rapidly that they can be considered independent before they are decided. So the noise saves the day in that case. Without the noise the games from the same position (against the same opponent) would very likely have the same result, so playing them would be mostly wasted effort, and the effective number of games in your test would have been less than you think. (With, as a consequence, a larger variance than the formula for independent games predicts.)

So in summary: 'noise' in the test environment is good when you reuse games against the same opponent. But selection of initial positions can provide enough noise by itself to not have to worry at all if your test environment is noisy enough. As it is easy enough to attend to the latter, there is never any need to intentionally drive up the noisiness of the test environment, or play extra games if you don't succeed in that.

mcostalba · Post by **mcostalba** » Mon Jan 07, 2013 2:36 pm

Possibly you are right and there is nothing more to deduce. But because I am not an expert of statistic and because I often see in games that testing framework quality does count, I prefer a practical approach and to verify it with some experiment.

To better clarify my proposal I have written some code:

Code: Select all

int elo_estimator(int w, int d, int l); // External standard estimator

void count_results(char* results, int size, int& w, int& d, int& l) {

  w = d = l = 0;
  for (int i = 0; i < size; i++)
           if (results[i] == 'W') w++;
      else if (results[i] == 'D') d++;
      else if (results[i] == 'L') l++;
}

int noise_estimator(char* results, int size, int max_level) {

  int w, d, l, noise = 0;

  count_results(results, size, w, d, l);
  int bestElo = elo_estimator(w, d, l);

  for (int level = 1; level <= max_level; level++)
  {
      int step = pow(2, level);
      int sub_size = size / step;

      for (int i = 0; i < size; i += sub_size)
      {
          count_results(results + i, sub_size, w, d, l);

          int eloDiff = bestElo - elo_estimator(w, d, l);

          noise += (eloDiff * eloDiff) / (step * step);
      }
  }

  return noise;
}

Kempelen · Post by **Kempelen** » Tue Jan 08, 2013 10:17 am

I dont know if related, but I would like to share something I have observed with my tests which is about error bar, and maybe noise. I have a Core Duo so I only play two different tournaments at a time, which I mix in a final pgn to calculate final score and error bar. What I have observed is that those two separated tournaments give sometimes scores and error bars that are both "unrelated" and outside of error windows one from the other. i.e. Tourney 1: 43,5% score 2390 elo +- 23, Tourney 2: 37% 2335 elo +-. 23. This usually happens before 1000 games has been played, but it shows that error margin guess are bad for so few games, as otherway one should finish the test, and if I let it continue it will usually converge and be a valid test. Even somethime it happens with more games. Dont know if maybe something has seen the same....

hgm · Post by **hgm** » Tue Jan 08, 2013 12:00 pm

Depending on what 'sometimes' means, everyone will see the same. The error bars calculated by Elo programs are not absolute error bars, but for a certain confidence like 95%. Which means that in 5% of the cases, the true value will lie outside the error bars.

So if 'sometimes' means once every 20 times, this would be perfectly normal. It would be in fact very worrysome if you never saw an error larger than the 95% error bars.

Michel · Post by **Michel** » Tue Jan 08, 2013 12:36 pm

It would be in fact very worrysome if you never saw an error larger than the 95% error bars.

There is an example of this

Hasse's theorem is a certain estimate

http://en.wikipedia.org/wiki/Hasse%27s_ ... tic_curves

one would heuristically expect to be valid 95% of the time. The fact that this estimate is _always_ true represent a deeply hidden regularity in the solutions of certain equations (which are very important for cryptography BTW).

yolin · Post by **yolin** » Tue Jan 08, 2013 10:43 pm

This is a very interesting topic. An alternative measure to quantify the noise (or the reliability of the experiment), is to give a probability that the observed W/L/D series are generated following the W/L/D percentages of the whole series (so without noise). Of course as such this probability is ill-defined so some choices have to be made.

However a series like WWWWWWWLLLLLLL should be less likely than a series like WWLWLLLWWLW..

One way to obtain a probability is the following (I hope). Suppose we have played N x N games (N submatches of N games):
1. Calculated for each group i of N games the total score Xi=W*1+D*0.5+L*0

2 .We then have N submatch outcomes X1..XN. Since the game scores are supposed to be independent, but equally distributed, we can employ the central limit theorem which states that X follow a normal distribution.

3. If there is something wrong with our testing environment, this is no longer guaranteed (e.g. we can have an excess of outliers). Now to calculate the aforementioned probability (that the observed W/L/D series are generated following the W/L/D percentages of the whole series) we can use standard procedures, for instance http://www.mathworks.nl/help/stats/chi2gof.html, to calculate "the probability, under assumption of the null hypothesis, of observing the given statistic or one more extreme."

4. Doing this gives a probability after each N^2 games, but of course N1*N2 would work equally well, but then an arbitrary choice has to be made for the length of the 'submatches'.

Shouldn't be too hard to implement something like this in cutechess/bayeselo.

Eelco de Groot · Post by **Eelco de Groot** » Tue Jan 08, 2013 11:57 pm

I thought of the Chi-Square(d) test too. Don't remember much about the theory (long time ago that I had to study some of it), but I remember it involved diagrams like this

to describe the dependancies to be investigated between two different parameters. There is much theory material available on the Chi-Squared test. Just looked up a few that should anyone interested get started a bit:

http://en.wikipedia.org/wiki/Pearson%27 ... uared_test

http://www.ndsu.edu/pubweb/~mcclean/pls ... endel4.htm
Short description of Chi-squared test in Mendelian genetics.

http://www2.units.it/ipl/students_area/ ... ecipes.pdf
I don't know if this is exactly legal but I think this is the complete Numerical Recipes, on the official site you can not access all pages (at least not as one visitor). The relevant paragraph is I believe §14.4 p.628

14.4 Contingency Table Analysis of Two
Distributions
In this section, and the next two sections, we deal with measures of association
for two distributions. The situation is this: Each data point has two or more different
quantities associated with it, and wewant to knowwhether knowledge of one quantity
gives us any demonstrable advantage in predicting the value of another quantity. In
many cases, one variable will be an “independent” or “control” variable, and another
will be a “dependent” or “measured” variable. Then, we want to know if the latter
variable is in fact dependent on or associated with the former variable. If it is, we
want to have some quantitative measure of the strength of the association. One often
hears this loosely stated as the question of whether two variables are correlated or
uncorrelated, but we will reserve those terms for a particular kind of association
(linear, or at least monotonic), as discussed in §14.5 and §14.6.

For C programmers it is of course nice that this book is directly related to practical coding and giving examples. I have the Fortran/Pascal version. At the time C was not so important

Still an invaluable book to have I think even if it has incurred criticisms in places.

Eelco

mcostalba · Post by **mcostalba** » Fri Jan 18, 2013 7:53 pm

I have written a small python script to calculate the noise figure as explained before. Actually the script does more than this: it recalculates the noise figure every time increasing the sample base, until all the games are considered. As expected the noise level decreases with increasing number of games. The point is to verify if with very noisy test environment the decrease of noise with played games is less evident.

Script takes as input a pgn file of played games between 2 engines, like the ones created by cute-chess with option --pgnout enabled.

Code: Select all

#! /usr/bin/env python
#
# ELO Noise Estimator
#

import math

# Simple ELO estimator
def elo_estimator(count):
    wins = count[0]
    losses = count[1]
    draws = count[2]
    score = wins + draws / 2
    total = wins + losses + draws
    percent = 1000 * score / total
    elo = -400 * math.log10(1000 / float(percent) - 1)
    # error = 700 * math.sqrt(wins + losses) / (total) # 95% error bar
    return elo


# Parse pgn file into games[]
def parse_pgn(name):
    games = []
    white = ""
    reverse = 0
    fyle = open(name)
    for lyne in fyle:

        # Init first player name
        if white == "" and '[White' in lyne:
            white = lyne.split('"')[1]

        # Who is the white player of current game?
        if '[White' in lyne:
            reverse = (white != lyne.split('"')[1])

        if '[Result ' in lyne:
            if reverse:
                if '1-0' in lyne:
                    games.append('L')
                elif '1/2-1/2' in lyne:
                    games.append('D')
                elif '0-1' in lyne:
                    games.append('W')
            else:
                if '1-0' in lyne:
                    games.append('W')
                elif '1/2-1/2' in lyne:
                    games.append('D')
                elif '0-1' in lyne:
                    games.append('L')
    fyle.close()
    return games


# Count each result type (W, D, L)
def count_results(games):
    count = [0,0,0]
    for r in games :
        if 'W' in r:
            count[0] = count[0] + 1
        elif 'L' in r:
            count[1] = count[1] + 1
        elif 'D' in r:
            count[2] = count[2] + 1
    return count


# Our Noise Estimator
def noise_estimator(games, max_level):

    noise = level = 0
    bestElo = elo_estimator(count_results(games))

    while level < max_level:
        step = 2 << level
        level = level + 1
        sub_size = len(games) / step
        i = 0
        while i + sub_size <= len(games):
            sub_series = games[i:i + sub_size]
            i = i + sub_size
            eloDiff = bestElo - elo_estimator(count_results(sub_series));
            noise = noise + (eloDiff * eloDiff) / (step * step);
    return noise


# Plot how noise changes with number of games
def build_graph(games, max_level):

    graph = []
    i = step = 128

    while i <= len(games):
        graph.append(round(noise_estimator(games[0:i], max_level), 2))
        i = i + step
    return graph



###################################################################

import sys

pgnfile = "log.pgn"
max_level = 2

if len(sys.argv) > 1: pgnfile = sys.argv[1]
if len(sys.argv) > 2: max_level = sys.argv[2]

games = parse_pgn(pgnfile)
print "\nGames:", len(games), ", result:", count_results(games)
print "Estimated ELO:", elo_estimator(count_results(games))
print "Noise as function of number of games:"
print build_graph(games, max_level)

Ajedrecista · Post by **Ajedrecista** » Fri Jan 18, 2013 8:37 pm

Hello Marco:

mcostalba wrote:I have written a small python script to calculate the noise figure as explained before. Actually the script does more than this: it recalculates the noise figure every time increasing the sample base, until all the games are considered. As expected the noise level decreases with increasing number of games. The point is to verify if with very noisy test environment the decrease of noise with played games is less evident.

Script takes as input a pgn file of played games between 2 engines, like the ones created by cute-chess with option --pgnout enabled.

Code: Select all

#! /usr/bin/env python
#
# ELO Noise Estimator
#

import math

# Simple ELO estimator
def elo_estimator(count):
    wins = count[0]
    losses = count[1]
    draws = count[2]
    score = wins + draws / 2
    total = wins + losses + draws
    percent = 1000 * score / total
    elo = -400 * math.log10(1000 / float(percent) - 1)
    # error = 700 * math.sqrt(wins + losses) / (total) # 95% error bar
    return elo


# Parse pgn file into games[]
def parse_pgn(name):
    games = []
    white = ""
    reverse = 0
    fyle = open(name)
    for lyne in fyle:

        # Init first player name
        if white == "" and '[White' in lyne:
            white = lyne.split('"')[1]

        # Who is the white player of current game?
        if '[White' in lyne:
            reverse = (white != lyne.split('"')[1])

        if '[Result ' in lyne:
            if reverse:
                if '1-0' in lyne:
                    games.append('L')
                elif '1/2-1/2' in lyne:
                    games.append('D')
                elif '0-1' in lyne:
                    games.append('W')
            else:
                if '1-0' in lyne:
                    games.append('W')
                elif '1/2-1/2' in lyne:
                    games.append('D')
                elif '0-1' in lyne:
                    games.append('L')
    fyle.close()
    return games


# Count each result type (W, D, L)
def count_results(games):
    count = [0,0,0]
    for r in games :
        if 'W' in r:
            count[0] = count[0] + 1
        elif 'L' in r:
            count[1] = count[1] + 1
        elif 'D' in r:
            count[2] = count[2] + 1
    return count


# Our Noise Estimator
def noise_estimator(games, max_level):

    noise = level = 0
    bestElo = elo_estimator(count_results(games))

    while level < max_level:
        step = 2 << level
        level = level + 1
        sub_size = len(games) / step
        i = 0
        while i + sub_size <= len(games):
            sub_series = games[i:i + sub_size]
            i = i + sub_size
            eloDiff = bestElo - elo_estimator(count_results(sub_series));
            noise = noise + (eloDiff * eloDiff) / (step * step);
    return noise


# Plot how noise changes with number of games
def build_graph(games, max_level):

    graph = []
    i = step = 128

    while i <= len(games):
        graph.append(round(noise_estimator(games[0:i], max_level), 2))
        i = i + step
    return graph



###################################################################

import sys

pgnfile = "log.pgn"
max_level = 2

if len(sys.argv) > 1: pgnfile = sys.argv[1]
if len(sys.argv) > 2: max_level = sys.argv[2]

games = parse_pgn(pgnfile)
print "\nGames:", len(games), ", result:", count_results(games)
print "Estimated ELO:", elo_estimator(count_results(games))
print "Noise as function of number of games:"
print build_graph(games, max_level)

Please note that I do not know anything about Python but I have a question:

Code: Select all

percent = 1000 * score / total
elo = -400 * math.log10(1000 / float(percent) - 1)

Why multiplying by 1000 instead of 100 if it is a percent? At least I see that you are coherent in the next line and maintain 1000 (instead of 100) inside the logarithm to compute elo correctly.

------------

Code: Select all

# error = 700 * math.sqrt(wins + losses) / (total) # 95% error bar

Just a comment: in a more less balanced match with a big number of games, I think that this number 700 can be replaced by 1600/ln(10) ~ 694.87... just a minor thing (not sure about 95% confidence or 2-sigma confidence ~ 95.45% confidence, but it is an even minor thing!):

http://www.talkchess.com/forum/viewtopi ... 65&t=44100

I am not smart enough to opinate about the rest of the topic. Good luck with SF development!

Regards from Spain.

Ajedrecista.

Noise in ELO estimators: a quantitative approach

Noise in ELO estimators: a quantitative approach

Re: Noise in ELO estimators: a quantitative approach

Re: Noise in ELO estimators: a quantitative approach

Re: Noise in ELO estimators: a quantitative approach

Re: Noise in ELO estimators: a quantitative approach

Re: Noise in ELO estimators: a quantitative approach

Re: Noise in ELO estimators: a quantitative approach

Re: Noise in ELO estimators: a quantitative approach

Re: Noise in ELO estimators: a quantitative approach

Re: Noise in ELO estimators: a quantitative approach.