how to do a proper statistical test
Moderators: hgm, Harvey Williamson, bob
Forum rules
This textbox is used to restore diagrams posted with the [d] tag before the upgrade.

 Posts: 685
 Joined: Tue May 22, 2007 9:13 am
how to do a proper statistical test
The discussion in the various threads on testing engine improvements so far has been rather ad hoc as far as the statistical methodology is concerned.
H.G. Muller seems to think that the variance determines the confidence intervals around a match outcome. That's not true. What matters is the standard error on the mean, which is equal to the square root of the variance divided by the square root of the number of games. So more games in a match let you test more precisely, just as Robert Hyatt is claiming.
E.g. (building on an example that was given earlier), if you play a 100-game match with 51 wins and 49 losses, then the mean result is that you score 51 points, with a sum of squared deviations of 24.99 (= 51*(1-0.51)^2 + 49*(0-0.51)^2). The average result per game is 0.51, with a per-game variance of 0.2499 (= 24.99/100) and hence a standard error on the mean of 0.04999 = sqrt(0.2499/100). If we assume that games are statistically independent, then the outcome of a 100-game match between two opponents that score 51-49 against each other is normally distributed around a mean of 51% with a standard deviation of 0.04999.
If you want to test whether the 51-49 match was a significant improvement over a 50-50 match (no draws again), then you have to calculate the 95% confidence interval (one-sided) under the null hypothesis of a 50-50 result. That turns out to be a score of 58.2% or more *in a 100-game match*. Hence, a 51-49 result in a 100-game match is not a significant improvement.
How many games do you have to play in order for a 51% score to be a significant improvement over 50-50? From the scaling behaviour of the standard error on the mean it turns out that you would need to score 51% in a roughly 6,700-game match in order to show a significant improvement!
Now, that was only the significance, i.e. the chance of avoiding Type I errors (a 95% chance that an accepted improvement was genuine). But what do you do when the result of a 100-game match is not significant? You can't conclude that there is no improvement either, since you might commit a Type II error (falsely rejecting a genuine improvement).
So besides significance, you also want to have a test of high statistical power, so that you give yourself a shot at finding, let's say, 95% of all genuine improvements. If you want to simultaneously accept 95% of the genuine improvements and also have 95% of the accepted improvements be genuine, then you need even more games. For the example at hand it would mean that only after 27,000 games would a 51-49 outcome be sufficient to conclude that you would find a genuine improvement with 95% probability.
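The numbers in this post can be reproduced with a short script. This is a sketch in Python under the normal approximation used above; 1.645 is the standard one-sided 95% critical value of the normal distribution, and the per-game standard deviation is 0.5 under the 50-50 null hypothesis:

```python
import math

Z_95 = 1.645  # one-sided 95% critical value of the standard normal

# 100-game match, 51 wins / 49 losses (no draws): score per game is 0 or 1.
p = 0.51
se_100 = math.sqrt(p * (1 - p)) / math.sqrt(100)    # standard error of the mean

# One-sided 95% significance threshold under the 50-50 null hypothesis:
threshold_100 = 0.50 + Z_95 * 0.5 / math.sqrt(100)

# Games needed before a 51% score clears the threshold:
# solve 0.01 = Z_95 * 0.5 / sqrt(n) for n.
n_significant = (Z_95 * 0.5 / 0.01) ** 2

# Games needed for 95% significance AND 95% power against a true 51% engine:
n_power = ((Z_95 + Z_95) * 0.5 / 0.01) ** 2

print(round(se_100, 5), round(threshold_100, 3),
      round(n_significant), round(n_power))
# prints: 0.04999 0.582 6765 27060
```

The last two numbers match the "roughly 6,700" and "27,000" figures quoted in the post.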
Re: how to do a proper statistical test
Well, that's what I've been saying all along. And in particular, I've been saying that I am not interested in outcomes of 51%. Rather than playing more games I will just go back to the code until I find significance earlier. Robert is saying that he is interested in those 51% findings, so he needs more games. But what he's also saying is that others should also be interested in those 51% findings, and thus need more games. And that's what I don't agree with.
Rein Halbersma wrote: How many games do you have to play in order for a 51% score to be a significant improvement over 50-50? From the scaling behaviour of the standard error on the mean it turns out that you would need to score 51% in a roughly 6,700-game match in order to show a significant improvement!
Now, that was only the significance, i.e. the chance of avoiding Type I errors (a 95% chance that an accepted improvement was genuine). But what do you do when the result of a 100-game match is not significant? You can't conclude that there is no improvement either, since you might commit a Type II error (falsely rejecting a genuine improvement).
So besides significance, you also want to have a test of high statistical power, so that you give yourself a shot at finding, let's say, 95% of all genuine improvements. If you want to simultaneously accept 95% of the genuine improvements and also have 95% of the accepted improvements be genuine, then you need even more games. For the example at hand it would mean that only after 27,000 games would a 51-49 outcome be sufficient to conclude that you would find a genuine improvement with 95% probability.
Bob also seems to have a problem with 95 % confidence, saying "1 in 20 will be wrong". What is your answer to that?
Re: how to do a proper statistical test
Wouldn't you have to repeat the 27,000-game test to be sure that the 51-49 result is repeated? It is possible that after repeating the test you do not get this result. Perhaps your computer behaves differently after running for hours. Perhaps if you had chosen other opponents instead, the result would be different…
Rein Halbersma wrote: For the example at hand it would mean that only after 27,000 games would a 51-49 outcome be sufficient to conclude that you would find a genuine improvement with 95% probability.
Last edited by pedrox on Wed Sep 19, 2007 9:05 am, edited 1 time in total.

Re: how to do a proper statistical test
You can choose your own threshold for the effect size that you want to be able to measure. The 51% score is about 7 Elo points. If you want to test only for larger improvements, that's up to you. E.g., if you want to be 95% sure that a match result of 55% (a 35 Elo point improvement) is significant, then you only need 270 games. If you want to find 95% of all 35 Elo point improvements with 95% significance, you need four times as many games, so more than 1,000 games.
nczempin wrote: Well, that's what I've been saying all along. And in particular, I've been saying that I am not interested in outcomes of 51%. Rather than playing more games I will just go back to the code until I find significance earlier. Robert is saying that he is interested in those 51% findings, so he needs more games. But what he's also saying is that others should also be interested in those 51% findings, and thus need more games. And that's what I don't agree with.
Bob's right: 5% of the time you make a Type I error and accept a program change that wasn't an actual improvement. In 95% of repeated tests it wouldn't show as an improvement; you were just unlucky that it did in your particular match. The only way to avoid this is to tighten the significance level to, say, 1%. That would mean that you need roughly twice as many games.
nczempin wrote: Bob also seems to have a problem with 95% confidence, saying "1 in 20 will be wrong". What is your answer to that?
But both the effect size and the significance level are up to the experimenter and to some degree are a matter of taste and available resources. Just make sure that you understand what you are trying to measure and with which accuracy your measurements are made.
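The trade-off between effect size, significance level, and the number of games can be sketched in Python (same normal approximation as before, with the per-game standard deviation taken as 0.5; the function names are my own, not from the post):

```python
import math

def games_for_significance(score, z_alpha=1.645):
    """Games needed for a match score (a fraction, e.g. 0.55) to beat a
    one-sided significance test against the 50% null hypothesis."""
    delta = score - 0.5                      # effect size per game
    return math.ceil((z_alpha * 0.5 / delta) ** 2)

def games_for_power(score, z_alpha=1.645, z_beta=1.645):
    """Games needed to ALSO detect a true improvement of this size with
    probability 1 - beta (statistical power)."""
    delta = score - 0.5
    return math.ceil(((z_alpha + z_beta) * 0.5 / delta) ** 2)

print(games_for_significance(0.55))               # 271  (the "270 games")
print(games_for_power(0.55))                      # 1083 (the "more than 1,000")
print(games_for_significance(0.55, z_alpha=2.326))  # 542: 1% level, roughly 2x
```

The last line illustrates the earlier remark that tightening the significance level from 5% to 1% roughly doubles the number of games.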
Re: how to do a proper statistical test
H.G. Muller never claimed that it is not correct that more games are better.
Rein Halbersma wrote:
H.G. Muller seems to think that the variance determines the confidence intervals around a match outcome. That's not true. What matters is the standard error on the mean, which is equal to the square root of the variance divided by the square root of the number of games. So more games in a match let you test more precisely, just as Robert Hyatt is claiming.
You simply distort his words.
The standard error is the square root of the variance, so knowing one of them is clearly enough.
He never claimed that the confidence interval is proportional to the variance.
Uri
Last edited by Uri Blass on Wed Sep 19, 2007 9:10 am, edited 1 time in total.

Re: how to do a proper statistical test
No, by construction, after 27,000 games you can have 95% confidence that the 51-49 result would be repeated if you did a second 27,000-game match (only a 5% chance of a Type I error in the first match). Moreover, you can also be 95% confident that if the first 27,000-game match gave 50-50, then a second 27,000-game match will not result in 51-49 (only a 5% chance of a Type II error in the first match).
pedrox wrote: Wouldn't you have to repeat the 27,000-game test to be sure that the 51-49 result is repeated? It is possible that after repeating the test you do not get this result. Perhaps your computer behaves differently after running for hours. Perhaps if you had chosen other opponents instead, the result would be different…
Rein Halbersma wrote: For the example at hand it would mean that only after 27,000 games would a 51-49 outcome be sufficient to conclude that you would find a genuine improvement with 95% probability.
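The "1 in 20 will be wrong" interpretation of the 5% Type I error rate can be checked by simulation. This is a small Monte Carlo sketch (my own construction, not from the post): play many 27,000-game matches between two genuinely equal engines and count how often pure luck pushes the score past the significance threshold.

```python
import math
import random

random.seed(1)  # fixed seed so the experiment is reproducible

def match_score(n_games, p_win):
    """Simulate a win/loss match (no draws); return the fraction of games won."""
    return sum(random.random() < p_win for _ in range(n_games)) / n_games

# One-sided 95% significance threshold for a 27,000-game match,
# under the null hypothesis of two equal engines (50-50):
n = 27000
threshold = 0.5 + 1.645 * 0.5 / math.sqrt(n)

# Fraction of equal-engine matches that falsely look like an improvement:
trials = 200
false_positives = sum(match_score(n, 0.50) > threshold
                      for _ in range(trials)) / trials
print(false_positives)  # close to 0.05: the 1-in-20 Type I error rate
```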
Re: how to do a proper statistical test
Rein, you have explained it very well.
When a tester receives a new version of an engine, he expects an increase of at least 30 Elo points.
You indicate to us that about 270 games are needed for an Elo increase of about 35 points, which is reasonable for people who have a single computer at home, a computer that is also used to develop the program.

Re: how to do a proper statistical test
If after 270 games you find a 35 Elo improvement (a 55% score), then you can conclude it is a significant improvement. If you don't find one after the first 270 games, you can't conclude there is no improvement. Only if you don't find a 35 Elo improvement after more than 1,000 games (four times as many) can you be sure (well, 95% confident) that the change isn't an improvement.
pedrox wrote: Rein, you have explained it very well. When a tester receives a new version of an engine, he expects an increase of at least 30 Elo points. You indicate to us that about 270 games are needed for an Elo increase of about 35 points, which is reasonable for people who have a single computer at home, a computer that is also used to develop the program.
That's the difference between significance and power. You need more games to have a high-powered test than to have a highly significant test.
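The significance/power asymmetry can also be demonstrated by simulation (again my own Monte Carlo sketch, not from the post): a genuinely 55% engine tested in 270-game matches passes the one-sided 95% significance test only about half the time, while at roughly 1,083 games it passes about 95% of the time.

```python
import math
import random

random.seed(2)  # fixed seed for reproducibility

def match_score(n_games, p_win):
    """Simulate a win/loss match (no draws); return the fraction of games won."""
    return sum(random.random() < p_win for _ in range(n_games)) / n_games

def detection_rate(n_games, p_win, trials=2000):
    """How often an engine with true score p_win passes the one-sided
    95% significance test against the 50-50 null in an n_games match."""
    threshold = 0.5 + 1.645 * 0.5 / math.sqrt(n_games)
    return sum(match_score(n_games, p_win) > threshold
               for _ in range(trials)) / trials

# A true 55% engine (about +35 Elo):
print(detection_rate(270, 0.55))   # around 0.5: half of genuine improvements slip through
print(detection_rate(1083, 0.55))  # around 0.95: the high-powered test catches them
```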

Re: how to do a proper statistical test
You are right, I just read the very first post of the thread; I had started at the end and got confused by all the heated arguments lately. No distortions intended! He indeed calculates things correctly; the difference is that I standardize everything to a single game to look at the scaling behaviour for multiple games.
Uri Blass wrote: H.G. Muller never claimed that it is not correct that more games are better.
You simply distort his words
The standard error is the square root of the variance, so knowing one of them is clearly enough.
He never claimed that the confidence interval is proportional to the variance.
Uri
My point is though, that one needs to be careful to distinguish between significance (type I errors) and power (type II errors). More games help with both.
Re: how to do a proper statistical test
Well, of course he's right; it is exactly what 95% confidence means: you will be wrong one time in 20. But his conclusion is that this is not acceptable.
Rein Halbersma wrote: Bob's right: 5% of the time you make a Type I error and accept a program change that wasn't an actual improvement. In 95% of repeated tests it wouldn't show as an improvement; you were just unlucky that it did in your particular match. The only way to avoid this is to tighten the significance level to, say, 1%. That would mean that you need roughly twice as many games.
nczempin wrote: Bob also seems to have a problem with 95% confidence, saying "1 in 20 will be wrong". What is your answer to that?
Yes, and what Bob is saying is that his required confidence level is higher and his results have a lower significance. Therefore his situation requires a lot more games. And I have no clue why he doesn't acknowledge that the reverse is also true: if your significance is higher and your required confidence lower, you will need fewer games than he does.
Rein Halbersma wrote: But both the effect size and the significance level are up to the experimenter and to some degree are a matter of taste and available resources. Just make sure that you understand what you are trying to measure and with which accuracy your measurements are made.
For me with Eden, even 95% confidence seems overkill. After all, as long as the new version is not significantly _weaker_, no one would complain (actually, no one will complain except a certain L., and I will be disappointed that my goal for each new Eden version has been violated. But no one will nail me to a tree for that).