Hello Ilari,
Thanks again for cutechess-cli. I really like it!
Here's a feature suggestion that could be nice: output some basic statistics. Foe example when you show a result like "44-54-31 [0.46]", we (engine testers) typically want to know if this score is statistically significant to decide whether A is better than B. Or (even better) at what confidence level it would be significant.
Let's say we have
p wins
q draws
N-p-q losses
let's note
P = p/N
Q = q/N
then asymptotically the distribution of the result is a gaussian with
* unbiaised estimator of the mean: mu = P+Q/2
* unbiaised estimator of the variance:sigma^2 = [P(1-mu)^2 + Q(.5-mu)^2]/(|N-1)
then if I chose a confidence level, say alpha=95%, a confidence interval is
[mu (+/-) Phi^{-1}(1+apha/2)*sigma]
where Phi is the cumulated distribution function of the Normal(0,1), and Phi^{-1{ it's inverse (ie. quantile function)
You could either hardcode a confidence level alpha=95% for example. Or make it a parameter, meaning you'd have to code a function Phi^{-1}.
If you're interested I'll send you a C function to compute gaussian quantiles. The code would be very simple, I just need to plough through my books and find it.
Also in the same idea, once you have an asymptotic distribution of the result you can produce a likelyhood of superiority as well.
PS: In fact, I'll send you the whole C code, as a function of (p,q,N) if you're interested. I'll go to my math books, find the formulas and test it.
Lucas
cutechess-cli suggestion (to Ilari)
Moderators: hgm, Rebel, chrisw
-
- Posts: 3232
- Joined: Mon May 31, 2010 1:29 pm
- Full name: lucasart
-
- Posts: 3232
- Joined: Mon May 31, 2010 1:29 pm
- Full name: lucasart
Re: cutechess-cli suggestion (to Ilari)
also another feature, which derives from the previous one quite simply is the following:
rather than playing 200 games, let's say.
maybe I would like to play an infinite number of games, and stop when both conditions are met:
* N >= 20, let's say that's enough to justify the asymptotic assumption on the distribution
* LOS (likelyhood of superiority) is above a given threshold, like 95% or sth.
the good thing about is that, sometimes when I test, either
* I end up doing too many games, because after only 50 the score is already so much in favor of engine A, that it is already significant to say A beats B
* or, I do 200 games, and even then the result is not significant.
this would be a unique feature, because in terms of engine automatic testing, there's really only one program that is good (cutechess-cli). And programs doing statistics only (bayeselo elostat or whatever) don't do engine matches. So combining those in such a way is currently not possible.
Let me know what you think.
Lucas
rather than playing 200 games, let's say.
maybe I would like to play an infinite number of games, and stop when both conditions are met:
* N >= 20, let's say that's enough to justify the asymptotic assumption on the distribution
* LOS (likelyhood of superiority) is above a given threshold, like 95% or sth.
the good thing about is that, sometimes when I test, either
* I end up doing too many games, because after only 50 the score is already so much in favor of engine A, that it is already significant to say A beats B
* or, I do 200 games, and even then the result is not significant.
this would be a unique feature, because in terms of engine automatic testing, there's really only one program that is good (cutechess-cli). And programs doing statistics only (bayeselo elostat or whatever) don't do engine matches. So combining those in such a way is currently not possible.
Let me know what you think.
Lucas
-
- Posts: 3232
- Joined: Mon May 31, 2010 1:29 pm
- Full name: lucasart
Re: cutechess-cli suggestion (to Ilari)
Here's all the code required (written in plain C)to do all this. Hope it helps!
Would be really nice to implement the stop condition based on LOS rather than number of games. That would bring automated engine testing a step further
Lucas
Code: Select all
static const double a = 8*(PI-3)/(3*PI*(4-PI));
double erf(double x)
// error function: good approximation especially close 0 and +/-infinity
{
return (x>0 ? 1 : -1) * sqrt(1-exp(-x*x * (4/PI + a*x*x) / (1 + a*x*x)));
}
double erf_inverse(double x)
// inverted error function
{
double y = log(1-x*x), z = 2/(PI*a) + y/2;
return (x>0 ? 1 : -1) * sqrt(sqrt(z*z - y/a) - z);
}
double pnorm(double x)
// cumulated distribution function of the normal distribution N(0,1)
{
return .5 * (1 + erf(x / sqrt(2)));
}
double qnorm(double q)
// quantile function of the normal distribution N(0,1)
{
return sqrt(2) * erf_inverse(2*q - 1);
}
void calc_stats(int wins, int draws, int N, double alpha, double &confidence, double &LOS)
/*
* A plays against B for N games. Take the number of wins and draws for A, and a confidence level
* alpha (eg. alpha=0.95 to allow 5% error). We compute the confidence interval and the probability
* that A is better than B (Likelyhood Of Superiority)
*
* Example: 50 wins, 30 draws, 120 games, alpha = 95%
* mu = 54.17% score of A)
* sigma = 3.95% stdev of the random variable mu)
* confidence = qnorm(97.5%)*sigma = 7.74%: with proba 95% the true score of A is in the interval
* mu +/- confidence
* LOS = pmorm((mu-.5)/sigma) = 85.42%: probability that A is stronger than B (which is below alpha in
* this case, so it's not significant)
* */
{
double w = wins/N, d = draws/N, l = 1-w-d;
double mu = w+d/2;
double sigma = sqrt((w*(1-mu)*(1-mu) + d*(.5-mu)*(.5-mu) + l*(0-mu)*(0-mu))/(N-1));
*confidence = sigma*qnorm(.5*(1+alpha));
*LOS = pnorm((mu-.5)/sigma);
}
Lucas
-
- Posts: 3232
- Joined: Mon May 31, 2010 1:29 pm
- Full name: lucasart
Re: cutechess-cli suggestion (to Ilari)
I forgot an important detail: parameter validity
* wins and draws must be positive (obviously)
* N must be >= wins+draws (otherwise there is a negative number of losses)
* N must be >= 2, otherwise the calculation of sigma will crash. In fact, these stats are meaningless until N is big enough for the asymptotic assumption to be precise enough. Let's say N >= 20
* 0.5 <= alpha < 1. alpha = 1 will obviously crash, since qnorm(1) = +infinity. typically statisticians chose alpha = 95%
also, as a test, the qnorm and pnorm functions should give the same results than the functions normsinv and normsdist defined in Libre Office (or MS Excel)
* wins and draws must be positive (obviously)
* N must be >= wins+draws (otherwise there is a negative number of losses)
* N must be >= 2, otherwise the calculation of sigma will crash. In fact, these stats are meaningless until N is big enough for the asymptotic assumption to be precise enough. Let's say N >= 20
* 0.5 <= alpha < 1. alpha = 1 will obviously crash, since qnorm(1) = +infinity. typically statisticians chose alpha = 95%
also, as a test, the qnorm and pnorm functions should give the same results than the functions normsinv and normsdist defined in Libre Office (or MS Excel)
-
- Posts: 62
- Joined: Mon Oct 03, 2011 9:40 pm
Re: cutechess-cli suggestion (to Ilari)
Wouldn't this result in a too optimistic LOS?lucasart wrote:maybe I would like to play an infinite number of games, and stop when both conditions are met:
* N >= 20, let's say that's enough to justify the asymptotic assumption on the distribution
* LOS (likelyhood of superiority) is above a given threshold, like 95% or sth.
Observing the process and cutting off when you reach some favorable ratio would skew results right?
-
- Posts: 348
- Joined: Sat Feb 27, 2010 12:21 am
Re: cutechess-cli suggestion (to Ilari)
Indeed, doing that is a common pitfall in the sense that it decreases the confidence level. But mind that 86.165% of all statistics are somehow flawed anyway.Zlaire wrote:Wouldn't this result in a too optimistic LOS?lucasart wrote:maybe I would like to play an infinite number of games, and stop when both conditions are met:
* N >= 20, let's say that's enough to justify the asymptotic assumption on the distribution
* LOS (likelyhood of superiority) is above a given threshold, like 95% or sth.
Observing the process and cutting off when you reach some favorable ratio would skew results right?
-
- Posts: 2272
- Joined: Mon Sep 29, 2008 1:50 am
Re: cutechess-cli suggestion (to Ilari)
Is there a good reason for cutechess-cli to duplicate the functionality of bayeselo?
-
- Posts: 3232
- Joined: Mon May 31, 2010 1:29 pm
- Full name: lucasart
Re: cutechess-cli suggestion (to Ilari)
Do you know where I can find a mathematical proof of that ? I can understand that one would want to believe this intuitively, but it doesn't seem trivial to me.marcelk wrote:Indeed, doing that is a common pitfall in the sense that it decreases the confidence level. But mind that 86.165% of all statistics are somehow flawed anyway.Zlaire wrote:Wouldn't this result in a too optimistic LOS?lucasart wrote:maybe I would like to play an infinite number of games, and stop when both conditions are met:
* N >= 20, let's say that's enough to justify the asymptotic assumption on the distribution
* LOS (likelyhood of superiority) is above a given threshold, like 95% or sth.
Observing the process and cutting off when you reach some favorable ratio would skew results right?
Perhaps Remi Coulom would be able to answer that ?
-
- Posts: 438
- Joined: Mon Apr 24, 2006 8:06 pm
Re: cutechess-cli suggestion (to Ilari)
A lecture by a poker programmer on this topic:lucasart wrote:Do you know where I can find a mathematical proof of that ? I can understand that one would want to believe this intuitively, but it doesn't seem trivial to me.marcelk wrote:Indeed, doing that is a common pitfall in the sense that it decreases the confidence level. But mind that 86.165% of all statistics are somehow flawed anyway.Zlaire wrote:Wouldn't this result in a too optimistic LOS?lucasart wrote:maybe I would like to play an infinite number of games, and stop when both conditions are met:
* N >= 20, let's say that's enough to justify the asymptotic assumption on the distribution
* LOS (likelyhood of superiority) is above a given threshold, like 95% or sth.
Observing the process and cutting off when you reach some favorable ratio would skew results right?
Perhaps Remi Coulom would be able to answer that ?
http://videolectures.net/icml08_mnih_ebs/
http://icml2008.cs.helsinki.fi/papers/523.pdf
That one is interesting too:
http://videolectures.net/icml09_igel_hbrs/
Rémi
-
- Posts: 3232
- Joined: Mon May 31, 2010 1:29 pm
- Full name: lucasart
Re: cutechess-cli suggestion (to Ilari)
Thank you Remi. That is exactly what I was looking for!Rémi Coulom wrote:A lecture by a poker programmer on this topic:lucasart wrote:Do you know where I can find a mathematical proof of that ? I can understand that one would want to believe this intuitively, but it doesn't seem trivial to me.marcelk wrote:Indeed, doing that is a common pitfall in the sense that it decreases the confidence level. But mind that 86.165% of all statistics are somehow flawed anyway.Zlaire wrote:Wouldn't this result in a too optimistic LOS?lucasart wrote:maybe I would like to play an infinite number of games, and stop when both conditions are met:
* N >= 20, let's say that's enough to justify the asymptotic assumption on the distribution
* LOS (likelyhood of superiority) is above a given threshold, like 95% or sth.
Observing the process and cutting off when you reach some favorable ratio would skew results right?
Perhaps Remi Coulom would be able to answer that ?
http://videolectures.net/icml08_mnih_ebs/
http://icml2008.cs.helsinki.fi/papers/523.pdf
That one is interesting too:
http://videolectures.net/icml09_igel_hbrs/
Rémi
What I really want is:
* output an asymptitic estimate of the LOS and the confidence interval, since it takes very little programming (it's basically my sample code). that's not really crucial, but I thought it would be nice to have.
* determine a stopping strategy. This really has to be implemented in cutechess-cli, and is not duplicating BayesElo as someone suggested. BayesElo takes a PGN and produces statistics. I want to do a test of A vs B, and stop whenever enough games have been played to be able to draw a statistically significant conclusion.
As I understand this is what the Empirical Bernstein Stopping algorithm does, right ? I'll print out the PDF and read it in details tomorrow. But basically I would like to implement this in cutechess-cli, to automate and rationalise my testing. And I *will*
I may have some questions to ask you along the way, if there are things I don't understand. I will however try to figure things out by myself before asking silly questions.
First question: R = 3 in this case right ? Just want to make sure I'm reading this correctly.
Lucas