how to do a proper statistical test

Rein Halbersma · Post by **Rein Halbersma** » Wed Sep 19, 2007 9:09 am

The discussion in the various threads on testing engine improvements so far has been rather ad hoc as far as the statistical methodology is concerned.

H.G. Muller seems to think that the variance determines the confidence intervals around a match outcome. That's not true. What matters is the standard error on the mean, which is equal to the square root of the variance divided by the square root of the number of games. So more games in a match let you test more precisely, just as Robert Hyatt is claiming.

E.g. (building on an example that was given earlier), if you play a 100-game match with 51 wins and 49 losses, then the mean result is that you score 51 points, with a variance of 24.99 (= 51*(1-.51)^2 + 49*(0-.51)^2). The average result per game is 0.51, with a standard error on the mean of 0.04999 = sqrt(24.99)/100. If we assume that games are statistically independent, then the outcome of a 100-game match between two opponents that score 51-49 against each other is approximately normally distributed around a mean of 51% with a standard deviation of 0.04999.
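For anyone who wants to check the arithmetic, here is a minimal Python sketch of the 51-49 example. The function name is made up for illustration, and draws are ignored, as in the example itself.

```python
import math

def match_statistics(wins, losses):
    """Mean score per game, variance of the match total, and standard
    error on the mean for a match without draws (win = 1, loss = 0)."""
    games = wins + losses
    mean = wins / games
    # Sum of squared deviations from the mean, as written out in the post.
    match_variance = wins * (1 - mean) ** 2 + losses * (0 - mean) ** 2
    # Standard deviation of the match total, divided by the number of games.
    standard_error = math.sqrt(match_variance) / games
    return mean, match_variance, standard_error

mean, match_variance, standard_error = match_statistics(51, 49)
print(f"{mean:.2f} {match_variance:.2f} {standard_error:.5f}")  # 0.51 24.99 0.04999
```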

If you want to test whether the 51-49 match was a significant improvement over a 50-50 match (no draws again), then you have to calculate the 95% confidence interval (one-sided) under the null hypothesis of a 50-50 result. That turns out to be a score of 58.2% or more *in a 100-game match*. Hence, a 51-49 result in a 100-game match is not a significant improvement.
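The 58.2% threshold can be reproduced with a short sketch, assuming the normal approximation and a one-sided test against a 50-50 null hypothesis. The function and parameter names are only illustrative.

```python
import math
from statistics import NormalDist

def significance_threshold(games, alpha=0.05, null_score=0.5):
    """Smallest score fraction that is significant at level alpha
    (one-sided, normal approximation, no draws)."""
    z = NormalDist().inv_cdf(1 - alpha)  # about 1.645 for alpha = 0.05
    null_standard_error = math.sqrt(null_score * (1 - null_score) / games)
    return null_score + z * null_standard_error

threshold = significance_threshold(100)
print(f"{threshold:.3f}")  # 0.582
```

A 51% score is well below this threshold, which is why the 51-49 result in 100 games is not significant.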

How many games do you have to play in order for a 51-49 result to be a significant improvement over 50-50? From the scaling behaviour of the standard error on the mean, it turns out that you would need to score 51% in a 6700-game match in order to show a significant improvement!
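Inverting the threshold formula gives the number of games directly. This is only a sketch under the same assumptions (normal approximation, one-sided 95%, no draws), with a hypothetical function name; it lands close to the figure of roughly 6700 games.

```python
import math
from statistics import NormalDist

def games_for_significance(score, alpha=0.05, null_score=0.5):
    """Number of games after which `score` is just significant at level
    alpha (one-sided, normal approximation, no draws)."""
    z = NormalDist().inv_cdf(1 - alpha)
    per_game_sd = math.sqrt(null_score * (1 - null_score))
    # Solve null_score + z * per_game_sd / sqrt(n) = score for n.
    return math.ceil((z * per_game_sd / (score - null_score)) ** 2)

games_needed = games_for_significance(0.51)
print(games_needed)  # 6764
```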

Now, that was only the significance, i.e. the chance of having no Type I errors (95% chance that an accepted improvement was genuine). But what do you do when the result of a 100-game match is not significant? You can't conclude that there is no improvement either, since you might commit a Type II error (falsely rejecting a genuine improvement).

So besides significance, you also want to have a test of high statistical power so that you give yourself a shot of finding, let's say, 95% of all genuine improvements. If you want to simultaneously accept 95% of the genuine improvements and also have 95% of the accepted improvements to be genuine, then you need even more games. For the example in hand it would mean that only after 27,000 games would a 51-49 outcome be sufficient to conclude that you would find a genuine improvement with 95% probability.
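A rough sketch of where the 27,000 comes from: with 95% significance and 95% power, the two normal quantiles add up before squaring. The code below assumes the same per-game variance of 0.25 under both hypotheses (a simplification), and the function name is hypothetical.

```python
import math
from statistics import NormalDist

def games_for_power(score, alpha=0.05, power=0.95, null_score=0.5):
    """Games needed so that a true `score` is detected with probability
    `power` at one-sided significance level alpha (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # guards against Type I errors
    z_beta = NormalDist().inv_cdf(power)       # guards against Type II errors
    per_game_sd = math.sqrt(null_score * (1 - null_score))
    return math.ceil(((z_alpha + z_beta) * per_game_sd / (score - null_score)) ** 2)

games_needed = games_for_power(0.51)
print(games_needed)
```

With both quantiles equal to about 1.645, the total is four times the significance-only figure.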

nczempin · Post by **nczempin** » Wed Sep 19, 2007 10:45 am

Rein Halbersma wrote: How many games do you have to play in order for a 51-49 result to be a significant improvement over 50-50? From the scaling behaviour of the standard error on the mean, it turns out that you would need to score 51% in a 6700-game match in order to show a significant improvement!

Now, that was only the significance, i.e. the chance of having no Type I errors (95% chance that an accepted improvement was genuine). But what do you do when the result of a 100-game match is not significant? You can't conclude that there is no improvement either, since you might commit a Type II error (falsely rejecting a genuine improvement).

So besides significance, you also want to have a test of high statistical power so that you give yourself a shot of finding, let's say, 95% of all genuine improvements. If you want to simultaneously accept 95% of the genuine improvements and also have 95% of the accepted improvements to be genuine, then you need even more games. For the example in hand it would mean that only after 27,000 games would a 51-49 outcome be sufficient to conclude that you would find a genuine improvement with 95% probability.

Well, that's what I've been saying all along. And in particular, I've been saying that I am not interested in outcomes of 51 %. Rather than playing more games I will just go back to the code until I find significance earlier. What Robert is saying is that he is interested in those 51 % findings, so he needs more games. But what he's also saying is that others should also be interested in those 51 % findings, and thus need more games. And that's what I don't agree with.

Bob also seems to have a problem with 95 % confidence, saying "1 in 20 will be wrong". What is your answer to that?

pedrox · Post by **pedrox** » Wed Sep 19, 2007 10:52 am

Rein Halbersma wrote:For the example in hand it would mean that only after 27,000 games would a 51-49 outcome be sufficient to conclude that you would find a genuine improvement with 95% probability.

You would have to repeat the 27,000-game test to be sure that the 51-49 result is repeated, because it is possible that after repeating the test you do not get this result. Perhaps your computer behaves differently after running for hours. Perhaps if you had chosen other opponents instead of those, the result would be different…

Rein Halbersma · Post by **Rein Halbersma** » Wed Sep 19, 2007 11:05 am

nczempin wrote: Well, that's what I've been saying all along. And in particular, I've been saying that I am not interested in outcomes of 51 %. Rather than playing more games I will just go back to the code until I find significance earlier. What Robert is saying is that he is interested in those 51 % findings, so he needs more games. But what he's also saying is that others should also be interested in those 51 % findings, and thus need more games. And that's what I don't agree with.

You can choose your own threshold for the effect size that you want to be able to measure. The 51% score is about 7 elo points. If you want to test only for larger improvements, that's up to you. E.g., if you want to be 95% sure that a match result of 55% (35 elo points improvement) is significant, then you only need 270 games to test. If you want to find 95% of all 35 elo point improvements with 95% significance, you need 4 times as many games, so more than 1,000 games.
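The score-to-elo conversion can be sketched as follows. This assumes the usual logistic Elo model with a 400-point scale, which is not spelled out in the thread but does reproduce the 7 and 35 point figures; the function name is made up.

```python
import math

def score_to_elo(score):
    """Elo difference implied by an expected score fraction, assuming the
    standard logistic Elo model (an assumption, not stated in the thread)."""
    return 400 * math.log10(score / (1 - score))

for score in (0.51, 0.55):
    print(f"{score:.0%} -> {score_to_elo(score):.1f} elo")
```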

nczempin wrote: Bob also seems to have a problem with 95 % confidence, saying "1 in 20 will be wrong". What is your answer to that?

Bob's right: 5% of the time you make a Type I error and accept a program change that wasn't an actual improvement. And in 95% of the repeated tests it wouldn't show as an improvement; you were just unlucky that it did in your particular match. The only way to avoid this is to tighten the significance level to, say, 1%. That would mean that you need twice as many games.
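The factor of two follows from the ratio of the squared normal quantiles. A small sketch, assuming a one-sided test and the normal approximation (names are illustrative):

```python
from statistics import NormalDist

def games_ratio(alpha_new, alpha_old):
    """Factor by which the number of games grows when the one-sided
    significance level is tightened from alpha_old to alpha_new."""
    z_new = NormalDist().inv_cdf(1 - alpha_new)
    z_old = NormalDist().inv_cdf(1 - alpha_old)
    # Required games scale with the square of the quantile.
    return (z_new / z_old) ** 2

ratio = games_ratio(0.01, 0.05)
print(f"{ratio:.2f}")  # 2.00
```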

But both the effect size and the significance level are up to the experimenter and to some degree are a matter of taste and available resources. Just make sure that you understand what you are trying to measure and with which accuracy your measurements are made.

Uri Blass · Post by **Uri Blass** » Wed Sep 19, 2007 11:07 am

Rein Halbersma wrote:

H.G. Muller seems to think that the variance determines the confidence intervals around a match outcome. That's not true. What matters is the standard error on the mean, which is equal to the square root of the variance divided by the square root of the number of games. So more games in a match let you test more precisely, just as Robert Hyatt is claiming.

H.G. Muller never claimed that it is not correct that more games are better.

You simply distort his words.
The standard error is the square root of the variance,
so knowing one of them is clearly enough.

He never claimed that the confidence interval is proportional to the variance.

Uri

Rein Halbersma · Post by **Rein Halbersma** » Wed Sep 19, 2007 11:08 am

pedrox wrote:
Rein Halbersma wrote:For the example in hand it would mean that only after 27,000 games would a 51-49 outcome be sufficient to conclude that you would find a genuine improvement with 95% probability.
You would have to repeat the 27,000-game test to be sure that the 51-49 result is repeated, because it is possible that after repeating the test you do not get this result. Perhaps your computer behaves differently after running for hours. Perhaps if you had chosen other opponents instead of those, the result would be different…

No, by construction, after 27,000 games you can have 95% confidence that the 51-49 result would be repeated if you did a second 27,000-game match (only 5% chance of Type I error in the first match). Moreover, you can also be 95% confident that if the first 27,000-game match gave 50-50, then a second 27,000-game match will not result in a 51-49 match (only 5% chance of Type II error in the first match).

pedrox · Post by **pedrox** » Wed Sep 19, 2007 11:33 am

Rein, you have explained it very well.

When a tester receives a new version of an engine, he expects an ELO increase of at least 30 points.

You indicate that about 270 games are needed for an ELO increase of about 35 points, which is reasonable for people who have a single computer at home that is also used to develop the program.

Rein Halbersma · Post by **Rein Halbersma** » Wed Sep 19, 2007 11:40 am

pedrox wrote: Rein, you have explained it very well.

When a tester receives a new version of an engine, he expects an ELO increase of at least 30 points.

You indicate that about 270 games are needed for an ELO increase of about 35 points, which is reasonable for people who have a single computer at home that is also used to develop the program.

If after 270 games you find a 35 ELO improvement (55% score), then you can conclude it is a significant improvement. If you don't find something after the first 270 games, you can't conclude there is no improvement. Only if you don't find a 35 ELO improvement after >1,000 games (4 times as many) can you be sure (well, 95% confident) that the change isn't an improvement.

That's the difference between significance and power. You need more games to have a high-powered test than to have a highly significant test.
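To make the distinction concrete, here is a hedged sketch that estimates the power of the one-sided 95% test for a genuine 55% engine at two match lengths. It uses the normal approximation without draws, and the function name is hypothetical.

```python
import math
from statistics import NormalDist

def detection_power(games, true_score, alpha=0.05, null_score=0.5):
    """Probability that a match of `games` games clears the one-sided
    significance threshold when the real expected score is `true_score`."""
    std_normal = NormalDist()
    null_se = math.sqrt(null_score * (1 - null_score) / games)
    threshold = null_score + std_normal.inv_cdf(1 - alpha) * null_se
    true_se = math.sqrt(true_score * (1 - true_score) / games)
    return 1 - std_normal.cdf((threshold - true_score) / true_se)

for games in (270, 1082):
    print(games, f"{detection_power(games, 0.55):.2f}")
```

Under these assumptions, a 270-game match catches a genuine 35 ELO improvement only about half the time, while roughly four times as many games brings the detection rate to about 95%.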

Rein Halbersma · Post by **Rein Halbersma** » Wed Sep 19, 2007 11:55 am

Uri Blass wrote: H.G. Muller never claimed that it is not correct that more games are better.

You simply distort his words.
The standard error is the square root of the variance,
so knowing one of them is clearly enough.

He never claimed that the confidence interval is proportional to the variance.

Uri

You are right. I just read the very first post of the thread; I had started at the end and got confused by all the heated arguments lately.

No distortions intended! He indeed calculates things correctly, the difference is that I standardize everything to a single game to look at the scaling behaviour for multiple games.

My point is though, that one needs to be careful to distinguish between significance (type I errors) and power (type II errors). More games help with both.

nczempin · Post by **nczempin** » Wed Sep 19, 2007 12:50 pm

Rein Halbersma wrote:
nczempin wrote: Bob also seems to have a problem with 95 % confidence, saying "1 in 20 will be wrong". What is your answer to that?
Bob's right: 5% of the time you make a Type I error and accept a program change that wasn't an actual improvement. And in 95% of the repeated tests it wouldn't show as an improvement; you were just unlucky that it did in your particular match. The only way to avoid this is to tighten the significance level to, say, 1%. That would mean that you need twice as many games.

Well, of course he's right; it is exactly what 95 % confidence means: that you will be wrong 1 time in 20. But his conclusion is that this is not acceptable.

Rein Halbersma wrote: But both the effect size and the significance level are up to the experimenter and to some degree are a matter of taste and available resources. Just make sure that you understand what you are trying to measure and with which accuracy your measurements are made.

Yes, and what Bob is saying is that both his required confidence level would be higher, and his results have a lower significance. Therefore, for his situation a lot more games are required. And I have no clue why he doesn't acknowledge that the reverse direction is also true: if your significance is higher and your required confidence lower, you will need fewer games than he does.

For me with Eden, even 95 % confidence seems overkill. After all, as long as the new version is not significantly _weaker_, no-one would complain (actually, no-one except a certain L. will complain, and I will be disappointed that my goal for each new Eden version has been violated. But no-one will nail me to a tree for that.).
