cutechess-cli suggestion (to Ilari)

lucasart · Post by **lucasart** » Sun Nov 06, 2011 5:23 pm

Hello Ilari,

Thanks again for cutechess-cli. I really like it!

Here's a feature suggestion that could be nice: output some basic statistics. Foe example when you show a result like "44-54-31 [0.46]", we (engine testers) typically want to know if this score is statistically significant to decide whether A is better than B. Or (even better) at what confidence level it would be significant.

Let's say we have
p wins
q draws
N-p-q losses

let's note
P = p/N
Q = q/N

then asymptotically the distribution of the result is a gaussian with
* unbiaised estimator of the mean: mu = P+Q/2
* unbiaised estimator of the variance:sigma^2 = [P(1-mu)^2 + Q(.5-mu)^2]/(|N-1)

then if I chose a confidence level, say alpha=95%, a confidence interval is
[mu (+/-) Phi^{-1}(1+apha/2)*sigma]
where Phi is the cumulated distribution function of the Normal(0,1), and Phi^{-1{ it's inverse (ie. quantile function)

You could either hardcode a confidence level alpha=95% for example. Or make it a parameter, meaning you'd have to code a function Phi^{-1}.

If you're interested I'll send you a C function to compute gaussian quantiles. The code would be very simple, I just need to plough through my books and find it.

Also in the same idea, once you have an asymptotic distribution of the result you can produce a likelyhood of superiority as well.

PS: In fact, I'll send you the whole C code, as a function of (p,q,N) if you're interested. I'll go to my math books, find the formulas and test it.

Lucas

lucasart · Post by **lucasart** » Sun Nov 06, 2011 5:45 pm

also another feature, which derives from the previous one quite simply is the following:

rather than playing 200 games, let's say.

maybe I would like to play an infinite number of games, and stop when both conditions are met:
* N >= 20, let's say that's enough to justify the asymptotic assumption on the distribution
* LOS (likelyhood of superiority) is above a given threshold, like 95% or sth.

the good thing about is that, sometimes when I test, either
* I end up doing too many games, because after only 50 the score is already so much in favor of engine A, that it is already significant to say A beats B
* or, I do 200 games, and even then the result is not significant.

this would be a unique feature, because in terms of engine automatic testing, there's really only one program that is good (cutechess-cli). And programs doing statistics only (bayeselo elostat or whatever) don't do engine matches. So combining those in such a way is currently not possible.

Let me know what you think.

Lucas

lucasart · Post by **lucasart** » Sun Nov 06, 2011 7:42 pm

Here's all the code required (written in plain C)to do all this. Hope it helps!

Code: Select all

static const double a = 8*&#40;PI-3&#41;/&#40;3*PI*&#40;4-PI&#41;);

double erf&#40;double x&#41;
// error function&#58; good approximation especially close 0 and +/-infinity
&#123;
	return &#40;x>0 ? 1 &#58; -1&#41; * sqrt&#40;1-exp&#40;-x*x * &#40;4/PI + a*x*x&#41; / &#40;1 + a*x*x&#41;));
&#125;

double erf_inverse&#40;double x&#41;
// inverted error function
&#123;
	double y = log&#40;1-x*x&#41;, z = 2/&#40;PI*a&#41; + y/2;
	return &#40;x>0 ? 1 &#58; -1&#41; * sqrt&#40;sqrt&#40;z*z - y/a&#41; - z&#41;;
&#125;

double pnorm&#40;double x&#41;
// cumulated distribution function of the normal distribution N&#40;0,1&#41;
&#123;
	return .5 * &#40;1 + erf&#40;x / sqrt&#40;2&#41;));
&#125;

double qnorm&#40;double q&#41;
// quantile function of the normal distribution N&#40;0,1&#41;
&#123;
	return sqrt&#40;2&#41; * erf_inverse&#40;2*q - 1&#41;;
&#125;

void calc_stats&#40;int wins, int draws, int N, double alpha, double &confidence, double &LOS&#41;
/*
 * A plays against B for N games. Take the number of wins and draws for A, and a confidence level
 * alpha &#40;eg. alpha=0.95 to allow 5% error&#41;. We compute the confidence interval and the probability
 * that A is better than B &#40;Likelyhood Of Superiority&#41;
 * 
 * Example&#58; 50 wins, 30 draws, 120 games, alpha = 95%
 * mu = 54.17%		score of A&#41;
 * sigma = 3.95%	stdev of the random variable mu&#41;
 * confidence = qnorm&#40;97.5%)*sigma = 7.74%&#58;	with proba 95% the true score of A is in the interval
 * mu +/- confidence
 * LOS = pmorm&#40;&#40;mu-.5&#41;/sigma&#41; = 85.42%&#58; probability that A is stronger than B &#40;which is below alpha in
 * this case, so it's not significant&#41;
 * */
&#123;
	double w = wins/N, d = draws/N, l = 1-w-d;
	double mu = w+d/2;
	double sigma = sqrt&#40;&#40;w*&#40;1-mu&#41;*&#40;1-mu&#41; + d*(.5-mu&#41;*(.5-mu&#41; + l*&#40;0-mu&#41;*&#40;0-mu&#41;)/&#40;N-1&#41;);
	
	*confidence = sigma*qnorm&#40;.5*&#40;1+alpha&#41;);
	*LOS = pnorm&#40;&#40;mu-.5&#41;/sigma&#41;;
&#125;

Would be really nice to implement the stop condition based on LOS rather than number of games. That would bring automated engine testing a step further

Lucas

lucasart · Post by **lucasart** » Sun Nov 06, 2011 8:32 pm

I forgot an important detail: parameter validity
* wins and draws must be positive (obviously)
* N must be >= wins+draws (otherwise there is a negative number of losses)
* N must be >= 2, otherwise the calculation of sigma will crash. In fact, these stats are meaningless until N is big enough for the asymptotic assumption to be precise enough. Let's say N >= 20
* 0.5 <= alpha < 1. alpha = 1 will obviously crash, since qnorm(1) = +infinity. typically statisticians chose alpha = 95%

also, as a test, the qnorm and pnorm functions should give the same results than the functions normsinv and normsdist defined in Libre Office (or MS Excel)

Zlaire · Post by **Zlaire** » Sun Nov 06, 2011 8:54 pm

lucasart wrote:maybe I would like to play an infinite number of games, and stop when both conditions are met:
* N >= 20, let's say that's enough to justify the asymptotic assumption on the distribution
* LOS (likelyhood of superiority) is above a given threshold, like 95% or sth.

Wouldn't this result in a too optimistic LOS?

Observing the process and cutting off when you reach some favorable ratio would skew results right?

marcelk · Post by **marcelk** » Sun Nov 06, 2011 9:32 pm

Zlaire wrote:
lucasart wrote:maybe I would like to play an infinite number of games, and stop when both conditions are met:
* N >= 20, let's say that's enough to justify the asymptotic assumption on the distribution
* LOS (likelyhood of superiority) is above a given threshold, like 95% or sth.
Wouldn't this result in a too optimistic LOS?

Observing the process and cutting off when you reach some favorable ratio would skew results right?

Indeed, doing that is a common pitfall in the sense that it decreases the confidence level. But mind that 86.165% of all statistics are somehow flawed anyway.

Michel · Post by **Michel** » Sun Nov 06, 2011 9:38 pm

Is there a good reason for cutechess-cli to duplicate the functionality of bayeselo?

lucasart · Post by **lucasart** » Sun Nov 06, 2011 9:44 pm

marcelk wrote:
Zlaire wrote:
lucasart wrote:maybe I would like to play an infinite number of games, and stop when both conditions are met:
* N >= 20, let's say that's enough to justify the asymptotic assumption on the distribution
* LOS (likelyhood of superiority) is above a given threshold, like 95% or sth.
Wouldn't this result in a too optimistic LOS?

Observing the process and cutting off when you reach some favorable ratio would skew results right?
Indeed, doing that is a common pitfall in the sense that it decreases the confidence level. But mind that 86.165% of all statistics are somehow flawed anyway.

Do you know where I can find a mathematical proof of that ? I can understand that one would want to believe this intuitively, but it doesn't seem trivial to me.
Perhaps Remi Coulom would be able to answer that ?

Rémi Coulom · Post by **Rémi Coulom** » Sun Nov 06, 2011 9:51 pm

lucasart wrote:
marcelk wrote:
Zlaire wrote:
lucasart wrote:maybe I would like to play an infinite number of games, and stop when both conditions are met:
* N >= 20, let's say that's enough to justify the asymptotic assumption on the distribution
* LOS (likelyhood of superiority) is above a given threshold, like 95% or sth.
Wouldn't this result in a too optimistic LOS?

Observing the process and cutting off when you reach some favorable ratio would skew results right?
Indeed, doing that is a common pitfall in the sense that it decreases the confidence level. But mind that 86.165% of all statistics are somehow flawed anyway.
Do you know where I can find a mathematical proof of that ? I can understand that one would want to believe this intuitively, but it doesn't seem trivial to me.
Perhaps Remi Coulom would be able to answer that ?

A lecture by a poker programmer on this topic:
http://videolectures.net/icml08_mnih_ebs/
http://icml2008.cs.helsinki.fi/papers/523.pdf

That one is interesting too:
http://videolectures.net/icml09_igel_hbrs/

Rémi

lucasart · Post by **lucasart** » Sun Nov 06, 2011 11:24 pm

Rémi Coulom wrote:
lucasart wrote:
marcelk wrote:
Zlaire wrote:
lucasart wrote:maybe I would like to play an infinite number of games, and stop when both conditions are met:
* N >= 20, let's say that's enough to justify the asymptotic assumption on the distribution
* LOS (likelyhood of superiority) is above a given threshold, like 95% or sth.
Wouldn't this result in a too optimistic LOS?

Observing the process and cutting off when you reach some favorable ratio would skew results right?
Indeed, doing that is a common pitfall in the sense that it decreases the confidence level. But mind that 86.165% of all statistics are somehow flawed anyway.
Do you know where I can find a mathematical proof of that ? I can understand that one would want to believe this intuitively, but it doesn't seem trivial to me.
Perhaps Remi Coulom would be able to answer that ?
A lecture by a poker programmer on this topic:
http://videolectures.net/icml08_mnih_ebs/
http://icml2008.cs.helsinki.fi/papers/523.pdf

That one is interesting too:
http://videolectures.net/icml09_igel_hbrs/

Rémi

Thank you Remi. That is exactly what I was looking for!

What I really want is:
* output an asymptitic estimate of the LOS and the confidence interval, since it takes very little programming (it's basically my sample code). that's not really crucial, but I thought it would be nice to have.
* determine a stopping strategy. This really has to be implemented in cutechess-cli, and is not duplicating BayesElo as someone suggested. BayesElo takes a PGN and produces statistics. I want to do a test of A vs B, and stop whenever enough games have been played to be able to draw a statistically significant conclusion.

As I understand this is what the Empirical Bernstein Stopping algorithm does, right ? I'll print out the PDF and read it in details tomorrow. But basically I would like to implement this in cutechess-cli, to automate and rationalise my testing. And I *will*

I may have some questions to ask you along the way, if there are things I don't understand. I will however try to figure things out by myself before asking silly questions.

First question: R = 3 in this case right ? Just want to make sure I'm reading this correctly.

Lucas

cutechess-cli suggestion (to Ilari)

cutechess-cli suggestion (to Ilari)

Re: cutechess-cli suggestion (to Ilari)

Re: cutechess-cli suggestion (to Ilari)

Re: cutechess-cli suggestion (to Ilari)

Re: cutechess-cli suggestion (to Ilari)

Re: cutechess-cli suggestion (to Ilari)

Re: cutechess-cli suggestion (to Ilari)

Re: cutechess-cli suggestion (to Ilari)

Re: cutechess-cli suggestion (to Ilari)

Re: cutechess-cli suggestion (to Ilari)