margin of error

Discussion of chess software programming and technical issues.

Moderator: Ras

Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: margin of error

Post by Don »

Daniel Shawul wrote:
Logically if I play A vs B and use 20,000 games, both players have played 20,000 games. If I play A vs C, and then play a second match B vs C, I have to play at least twice as many games: 20,000 for the first match and 20,000 for the second match. In other words I have wasted a lot of testing resources by involving a 3rd party. So I think it's pretty obvious that the answer is at least 2x.
Yes, but you too are forgetting covariance.
Don't forget the simulation I ran, which makes this completely clear. So you are thinking of this wrong. I was able to predict which program was superior about 87.3 percent of the time when I played simulated head-to-head matches of 20,000 games (the superiority was 2 ELO).

In order to duplicate that 87% when each played a foreign program, I had to run each of the 2 foreign matches to 40,000 games, in other words 4x the effort.

Here is the source code to this simulation in C. You will have to provide a function that returns a random value between 0 and 1 (I'm using the MT RNG.) I also simulate 50% draws since that is about what we get when we test Komodo vs Komodo.

Code: Select all

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <time.h>
#include <stdint.h>

#include "mt.h"   // random number generator


#define DRAW 0.50

int   good = 0;
int   bad = 0;

// expected score of a player rated "me" against one rated "you"
// (the standard logistic Elo model)
double expectation( double me, double you )
{
  double x = (you - me) / 400.0;
  double d = 1.0 + pow(10.0, x);
  return 1.0 / d;
}


// --------------------------------------------------------------
// play one simulated game and return its score (1.0, 0.5 or 0.0)
// wELO and bELO are unused; the expectation ex is passed in
// --------------------------------------------------------------
double  genGame( double wELO, double bELO, double ex )
{
  // rescale the expectation so that, once the fixed draw fraction
  // is taken out, the overall expected score still equals ex
  ex = ex - 0.5;
  ex = ex * (1.0 / (1.0 - DRAW));
  ex += 0.5;

  /* genrand_res53() generates a random number on [0,1) with 53-bit resolution */
  if (genrand_res53() < DRAW) return 0.5;
  if (genrand_res53() < ex) return 1.0;

  return 0.0;
}

// ----------------------------------------------------
// evaluate player a with match, return result of match
// ----------------------------------------------------
double eval(double a, double b, int matchsize)
{
  int  i;
  double sc = 0.0;
  double ex = expectation(a, b);

  for (i=0; i<matchsize; i++) {
    sc += genGame( a, b, ex );
  }
  sc = sc / (double) matchsize;

  return sc;
}


// repeat 100,000 'matchsize'-game head-to-head matches and count
// how often the (actually stronger) program A comes out ahead
int simHeadToHead(int matchsize)
{
  double a = 1502.0;
  double b = 1500.0;
  int    i;

  good = 0;
  bad = 0;

  for (i=0; i<100000; i++) {
    double e = eval(a, b, matchsize);
    if (e > 0.5) good++; else bad++;
  }
  
  printf("    (Heads up)  correct: %9d    incorrect: %9d   %10.4f\n",  good, bad, 100.0 * good / (double)(good + bad));
  return 0;
}

// same question, but A and B each play a separate 'matchsize'-game
// match against a foreign program F of intermediate strength
int simHeadForeign(int matchsize)
{
  double a = 1502.0;
  double b = 1500.0;
  double f = 1501.0;
  int    i;

  good = 0;
  bad = 0;
  for (i=0; i<100000; i++) {
    double e0 = eval(a, f, matchsize);
    double f0 = eval(b, f, matchsize);
    if (e0 > f0) good++; else bad++;
  }

  printf("(Foreign test)  correct: %9d    incorrect: %9d   %10.4f\n",  good, bad, 100.0 * good / (double)(good + bad));
  return 0;
}




int main(int argc, char **argv)
{
  init_genrand(time(NULL));

  simHeadToHead(20000);
  simHeadForeign(20000);

  return 0;
}

Remember, Remi warned that my formula could be wrong since there is usually covariance. Please look at my calculation and see how the covariance affects A-B significantly. It is equal in magnitude to the variance.

Code: Select all

For A vs B:
var(A) = var(B)
cov(A,B) = -sqrt(var(A)var(B)) = -var(A)
var(A-B) = var(A) + var(B) - 2cov(A,B)
         = var(A) + var(A) + 2var(A)
         = 4var(A)
So when you match A with B, you have 4 times as big a variance. Look at my example and tell me where I made a mistake.
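
Here is a quick numerical check of that algebra. This is just a minimal sketch: I use plain rand() instead of the MT generator to keep it self-contained, 0.5029 is roughly the expected score for a +2 ELO edge, and the draw rescaling is the same one genGame() uses above.

Code: Select all

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N    1000000
#define DRAW 0.50
#define EX   0.5029   /* roughly the expected score for +2 ELO */

/* plain rand() instead of the MT generator, just to be self-contained */
static double uniform(void) { return rand() / (RAND_MAX + 1.0); }

/* one game: returns A's score; B's score is forced to be 1 minus it */
static double game(void)
{
  double ex = (EX - 0.5) / (1.0 - DRAW) + 0.5;   /* same rescaling as genGame() */
  if (uniform() < DRAW) return 0.5;
  return (uniform() < ex) ? 1.0 : 0.0;
}

int main(void)
{
  double sa = 0, sb = 0, saa = 0, sbb = 0, sab = 0;
  int i;

  srand((unsigned) time(NULL));
  for (i = 0; i < N; i++) {
    double a = game(), b = 1.0 - a;
    sa += a; sb += b; saa += a * a; sbb += b * b; sab += a * b;
  }
  {
    double ma = sa / N, mb = sb / N;
    double va  = saa / N - ma * ma;
    double vb  = sbb / N - mb * mb;
    double cab = sab / N - ma * mb;
    printf("var(A) %f  var(B) %f  cov(A,B) %f\n", va, vb, cab);
    printf("var(A-B) %f  vs  4*var(A) %f\n", va + vb - 2.0 * cab, 4.0 * va);
  }
  return 0;
}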
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
Michel
Posts: 2292
Joined: Mon Sep 29, 2008 1:50 am

Re: margin of error

Post by Michel »

Don't forget the simulation I ran, which makes this completely clear. So you are thinking of this wrong. I was able to predict which program was superior about 87.3 percent of the time when I played simulated head-to-head matches of 20,000 games (the superiority was 2 ELO).
I am a bit confused about this. But isn't the reason for this discrepancy that you assume you have no information whatsoever on the elo of A and C?

Normally you would have good information on the elo of A and C (from earlier tests). And then you would test B against them.

It seems that testing B against A or C should be equivalent under these conditions.
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: margin of error

Post by Don »

lucasart wrote:
The only problem is that the stopping rule must be coded in the tournament manager. You could use cutechess-cli to run one experiment (i.e. a game vs each opponent) and a python script that runs it and tests the stopping rule.
My tournament manager is a java program which plays multiple games (which can be configured) simultaneously. The games are each played in a separate java thread.

When a game completes it is not recorded unless it is the next game in sequence, although a new game is immediately started. I do that because the PGN output also serves as persistent state in case I want to stop and restart the tester.

So when a game's result is written to the PGN I could apply the Wald calculation and make the decision to stop the test if needed. The tester doesn't track the results but that would be trivial to add. I think I would like to try this. Maybe I can convert the python code to java.
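
As a rough illustration of that in-order recording idea, here is a hypothetical sketch (the real tester is Java; I am writing it in C to match the simulation code above, and every name in it is made up):

Code: Select all

#include <stdio.h>
#include <string.h>

#define MAXGAMES 1000

/* Games finish out of order because they run concurrently, but
   results are appended to the PGN strictly in sequence, so the
   PGN file can double as restart state for the tester.          */

static char results[MAXGAMES][8];   /* finished but unwritten results */
static int  finished[MAXGAMES];     /* which game numbers are done    */
static int  next_to_record = 0;     /* next game number to append     */

/* called (under a lock, in the threaded version) when a game ends */
void gameFinished(int game_no, const char *result)
{
  snprintf(results[game_no], sizeof results[game_no], "%s", result);
  finished[game_no] = 1;

  /* flush every result that is now next in sequence */
  while (next_to_record < MAXGAMES && finished[next_to_record]) {
    printf("game %d: %s\n", next_to_record, results[next_to_record]);
    next_to_record++;
  }
}

int main(void)
{
  /* games 1 and 2 finish before game 0; nothing is written until
     game 0 arrives, then all three are flushed in order          */
  gameFinished(1, "1-0");
  gameFinished(2, "1/2");
  gameFinished(0, "0-1");
  return 0;
}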
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: margin of error

Post by Don »

Michel wrote:
Don't forget the simulation I ran, which makes this completely clear. So you are thinking of this wrong. I was able to predict which program was superior about 87.3 percent of the time when I played simulated head-to-head matches of 20,000 games (the superiority was 2 ELO).
I am a bit confused about this. But isn't the reason for this discrepancy that you assume you have no information whatsoever on the elo of A and C?

Normally you would have good information on the elo of A and C (from earlier tests). And then you would test B against them.

It seems that testing B against A or C should be equivalent under these conditions.
A, B, and C simulate "real" programs and my simulation does not know anything about the ratings of these programs, but when they play each other they will win, lose or draw according to their "actual" strength.

I am testing A vs C, then B vs C, and it's assumed that C is a foreign program that is roughly equivalent in strength. The actual strength of C does have a small impact on how many games are required, and the impact can be much larger if the discrepancy is large, but in practice we never test against foreign programs that are wildly weaker or stronger than the program we are trying to measure.

It seems that you are assuming a different test setup? I don't understand why you think that the order I test A vs C and B vs C matters. There is nothing learned from A vs C that will shorten the effort of B vs C.

I get the feeling that you are talking about one thing and the rest of us are talking about something else - so maybe this is where the confusion is. The source code defines how I am thinking about this.
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
Michel
Posts: 2292
Joined: Mon Sep 29, 2008 1:50 am

Re: margin of error

Post by Michel »

I get the feeling that you are talking about one thing and the rest of us are talking about something else - so maybe this is where the confusion is. The source code defines how I am thinking about this.
I guess my point is that A and C will have played (probably many) games against each other in the past (assuming A is an old version of the engine and C is a foreign test engine). So you would know their relative elo somewhat accurately.

You are discarding that information in your test setup.

EDIT: To use that information you would just add the old games between A and C to the pgn and let BayesElo compute the LOS.
Last edited by Michel on Mon Sep 24, 2012 3:13 pm, edited 1 time in total.
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: margin of error

Post by Don »

Michel wrote:
Michel wrote:
If you test against a set of engines with known elo you can use the likelihood ratio test (also known as Wald test). This test is easy to use.


I'll look for that - do you have any references?

We don't really test head to head, but each version of Komodo tests against a set of opponents (not Komodo versions.)
The Wald test also works for non-head-to-head tests. The key point is that the elo of the opponents needs to be known (which is the case for a head-to-head test, since you may assume that the elo of the single opponent is zero).

The Wald test is for testing a single parameter (e.g. elo). It is implemented as a random walk where each outcome (win/draw/loss) changes a certain number (the likelihood ratio) by a precomputed amount. Once the likelihood ratio falls outside a precomputed interval you accept either H0 or H1.

In practice you also have a truncation time and a threshold which determines whether you accept H0 or H1 at truncation.
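
To make the bookkeeping concrete, here is a minimal sketch of the precomputation, assuming a trinomial model with a fixed draw ratio and the logistic Elo curve (the alpha/beta/epsilon names match the options of the utility below; the rest is illustrative):

Code: Select all

#include <stdio.h>
#include <math.h>

#define ALPHA   0.05    /* type I error probability */
#define BETA    0.05    /* type II error probability */
#define EPSILON 5.0     /* elo difference under H1 */
#define DRAWS   0.50    /* assumed fixed draw ratio */

/* logistic Elo model, as in the simulation earlier in the thread */
static double score(double elo) { return 1.0 / (1.0 + pow(10.0, -elo / 400.0)); }

int main(void)
{
  /* win/loss probabilities under H0 (elo diff 0) and H1 (diff EPSILON) */
  double w0 = score(0.0)     - DRAWS / 2.0, l0 = 1.0 - DRAWS - w0;
  double w1 = score(EPSILON) - DRAWS / 2.0, l1 = 1.0 - DRAWS - w1;

  /* precomputed increments: what each outcome adds to the walk */
  double incWin  = log(w1 / w0);
  double incDraw = 0.0;            /* draw ratio equal under H0 and H1 */
  double incLoss = log(l1 / l0);

  /* stopping bounds of the untruncated test */
  double lower = log(BETA / (1.0 - ALPHA));   /* accept H0 below this */
  double upper = log((1.0 - BETA) / ALPHA);   /* accept H1 above this */

  printf("increments  W %+f  D %+f  L %+f\n", incWin, incDraw, incLoss);
  printf("accept H0 if the walk drops below %f, H1 if it rises above %f\n",
         lower, upper);
  return 0;
}

During the actual test you keep a running sum, add the increment for each game outcome, and stop as soon as the sum leaves the interval.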

I wrote a python utility that computes the parameters (alas for now only for a head to head test, I could extend it for more opponents).

http://hardy.uhasselt.be/Toga/wald

Code: Select all

Usage: wald [OPTION]...
Generate parameters for a truncated Wald test and optionally
produce an informative graph.

  --alpha       type I error probability when there is no elo difference
  --beta        type II error probability with elo difference equal to epsilon
  --epsilon     see --beta
  --truncation  truncate test after this many observations
  --graph       produce graph
  --draw_ratio  draw ratio when there is no elo difference
  --help        print this message
It will print the parameters of the random walk (i.e. what to add in case of W/D/L) as well as the stopping conditions.

It will also produce a graph which looks as follows

http://hardy.uhasselt.be/Toga/graph3.png
I'm not much of an expert on python, so the script is not running on my machine. How do I get the packages I need?


drd@greencheeks ~/tmp $ ./wald.py
Traceback (most recent call last):
  File "./wald.py", line 20, in <module>
    from scipy.stats import norm
ImportError: No module named scipy.stats
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
Michel
Posts: 2292
Joined: Mon Sep 29, 2008 1:50 am

Re: margin of error

Post by Michel »

drd@greencheeks ~/tmp $ ./wald.py
Traceback (most recent call last):
  File "./wald.py", line 20, in <module>
    from scipy.stats import norm
ImportError: No module named scipy.stats
Are you using Ubuntu? Then fire up synaptic and search for scipy.
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: margin of error

Post by Don »

Michel wrote:
I get the feeling that you are talking about one thing and the rest of us are talking about something else - so maybe this is where the confusion is. The source code defines how I am thinking about this.
I guess my point is that A and C will have played (probably many) games against each other in the past (assuming A is an old version of the engine and C is a foreign test engine). So you would know their relative elo somewhat accurately.

You are discarding that information in your test setup.

EDIT: To use that information you would just add the old games between A and C to the pgn and let BayesElo compute the LOS.
I'm not even using BayesElo in the simulation, I am simply using the simulation to determine what my ratio of correct vs incorrect results would be given various test setups. So I now know that if we make a 2 ELO improvement, 20,000 head to head games will give me the wrong answer almost 13% of the time. So we would end up throwing away an improvement (or, by symmetry, keeping a regression) in these cases.

It's actually a lot worse than this since you do not know in advance whether a change is good or bad. Just as an example, if only 1 out of 10 changes actually improves the program and the others are regressions, your test has to have incredibly high confidence - otherwise you end up keeping too many regressions.
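
A quick Bayes calculation makes the point concrete (a sketch, assuming the 87.3% head-to-head accuracy from the simulation and a symmetric error rate):

Code: Select all

#include <stdio.h>

int main(void)
{
  double prior   = 0.10;   /* fraction of attempted changes that are good */
  double correct = 0.873;  /* P(test prefers the change | it is good)     */
  /* assumed symmetric: P(test prefers the change | regression) = 1 - correct */
  double keptGood = prior * correct;
  double keptBad  = (1.0 - prior) * (1.0 - correct);

  printf("P(good | kept) = %f\n", keptGood / (keptGood + keptBad));
  return 0;
}

With those numbers only about 43% of the changes you keep would be real improvements.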
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
Michel
Posts: 2292
Joined: Mon Sep 29, 2008 1:50 am

Re: margin of error

Post by Michel »

I'm not even using BayesElo in the simulation, I am simply using the simulation to determine what my ratio of correct vs incorrect results would be given various test setups. So I now know that if we make a 2 ELO improvement, 20,000 head to head games will give me the wrong answer almost 13% of the time. So we would end up throwing away an improvement (or, by symmetry, keeping a regression) in these cases.
If you are doing head to head there is no point in using BayesElo.

But if you test against a pool of foreign opponents, I think you are discarding a lot of information by not using information from old games (which you seem to suggest you are doing).
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: margin of error

Post by Don »

Michel wrote:
drd@greencheeks ~/tmp $ ./wald.py
Traceback (most recent call last):
  File "./wald.py", line 20, in <module>
    from scipy.stats import norm
ImportError: No module named scipy.stats
Are you using Ubuntu? Then fire up synaptic and search for scipy.
Yes, I am using Mint which is basically Ubuntu.

It works now, that did the trick. Now let me see if I understand it.

Don
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.