
### Engine Testing - Statistics

Posted: Thu Jan 14, 2010 7:39 am
Whenever I do engine testing I usually refer to the following table computed by Joseph Ciarrochi:
http://www.husvankempen.de/nunn/rating/tablejoseph.htm

It is a great resource to me, but what surprises me is one of the sidenotes by the author:
> It is critical that you choose your sample size ahead of time, and do not make any conclusions until you have run the full tournament. It is incorrect, statistically, to watch the running of the tournament, wait until an engine reaches a cut-off, and then stop the tournament.
Does this make sense? Why would it make a difference if I stopped the tournament in the middle (e.g. to save time) when one player is obviously stronger? After all, all I want to know is whether A > B, not an exact rating.

### Re: Engine Testing - Statistics

Posted: Thu Jan 14, 2010 9:17 am
Because the statistics (confidence margins) are calculated on the basis that you do not do that.

If you do, the actual confidence will be significantly less than what is in the tables.

### Re: Engine Testing - Statistics

Posted: Thu Jan 14, 2010 9:56 am
Stopping early because you're out of time should be the same as a shorter tournament.

Stopping early because of some sort of rating or result cutoff would screw up the confidence margins.

Take coin flipping.
Setting out for 1000 flips and stopping after 500 because you're bored is fine.
Setting out for 1000 flips and stopping once you have 20 more heads than tails is bad.
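To make the coin-flipping point concrete, here is a small simulation (my own illustration, not from the tables): it estimates how often a fair coin gets flagged as biased under the two stopping rules. The 20-flip lead and the 1000-flip budget match the example above; the trial count and seed are arbitrary.

```python
import random

def false_positive_rate(n_flips=1000, lead=20, trials=2000, peek=True, seed=1):
    """Fraction of fair-coin experiments that reach a `lead`-head surplus.

    peek=True:  stop as soon as heads leads tails by `lead` at ANY point.
    peek=False: only check the lead once, after all n_flips are done.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        diff = 0  # heads minus tails
        triggered = False
        for _ in range(n_flips):
            diff += 1 if rng.random() < 0.5 else -1
            if peek and diff >= lead:
                triggered = True
                break
        if triggered or (not peek and diff >= lead):
            hits += 1
    return hits / trials
```

With these defaults the peeking rule flags a fair coin far more often than the end-only check, even though both use the same 20-flip threshold. That gap is exactly the loss of confidence the tables cannot account for.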

### Re: Engine Testing - Statistics

Posted: Thu Jan 14, 2010 10:55 am
Thanks for the kind answers. This makes sense.
So there's no way to trick the statistics then ...

### Re: Engine Testing - Statistics

Posted: Thu Jan 14, 2010 11:36 am
Comparing continuously till a cutoff is called 'sequential analysis'. It requires a different table that takes into account the larger likelihood of error, because you make many comparisons.

It was developed during WW2 and deemed so important that it was classified till '45.
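For reference, Wald's sequential probability ratio test (SPRT) is the standard tool here. A minimal sketch (my own illustration; the win-probability hypotheses p0/p1 and the win/loss-only model are simplifying assumptions, and draws are ignored):

```python
import math

def sprt_bounds(alpha=0.05, beta=0.05):
    """Wald's stopping bounds on the log-likelihood ratio (LLR).

    alpha: allowed false-positive rate, beta: allowed false-negative rate.
    """
    lower = math.log(beta / (1 - alpha))  # accept H0 at or below this
    upper = math.log((1 - beta) / alpha)  # accept H1 at or above this
    return lower, upper

def llr(wins, losses, p0=0.50, p1=0.55):
    """LLR of H1 (win prob p1) vs H0 (win prob p0); draws ignored."""
    return (wins * math.log(p1 / p0)
            + losses * math.log((1 - p1) / (1 - p0)))
```

After each game you recompute the LLR; you keep playing while it stays strictly between the bounds and stop with a decision as soon as it crosses one. Because the bounds already account for checking after every game, this is the "different table" that plain fixed-sample confidence margins cannot provide.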

### Re: Engine Testing - Statistics

Posted: Thu Jan 14, 2010 12:02 pm
John Major wrote:Comparing continuously till a cutoff is called 'sequential analysis'. It requires a different table that takes into account the larger likelihood of error, because you make many comparisons.

It was developed during WW2 and deemed so important that it was classified till '45.
That's interesting ...
How could this idea be ported to chess engine testing? I think a lot of effort (time, and computer resources that could be used for different tests) could be saved if we didn't have to run through the whole tournament in case of obvious differences.

Is there a way to determine the cutoff win % after n games played if we want an error bar of, say, 1%?

### Re: Engine Testing - Statistics

Posted: Thu Jan 14, 2010 11:01 pm
Edmund wrote:
John Major wrote:Comparing continuously till a cutoff is called 'sequential analysis'. It requires a different table that takes into account the larger likelihood of error, because you make many comparisons.

It was developed during WW2 and deemed so important that it was classified till '45.
That's interesting ...
How could this idea be ported to chess engine testing? I think a lot of effort (time, and computer resources that could be used for different tests) could be saved if we didn't have to run through the whole tournament in case of obvious differences.

Is there a way to determine the cutoff win % after n games played if we want an error bar of, say, 1%?
One thing you could do is play until one side is ahead by N points and this will automatically adjust resource usage based on how much confidence you need. But I don't know how to make the calculation to determine what N should be for a given confidence.
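Under some simplifying assumptions there is a closed form for N. If you score wins and losses only (draws ignored) and test a win probability of 0.5 - eps against 0.5 + eps, the "lead by N points" rule is exactly a symmetric Wald SPRT, because each game moves the log-likelihood ratio by a fixed step. A sketch (my own, not from the post; eps and alpha are illustrative values):

```python
import math

def lead_needed(eps=0.05, alpha=0.05):
    """Lead N such that stopping at a +/-N point lead keeps the error
    rate near alpha, testing H0: p = 0.5 - eps vs H1: p = 0.5 + eps
    (win/loss only; each win adds `step` to the LLR, each loss subtracts it)."""
    step = math.log((0.5 + eps) / (0.5 - eps))  # LLR change per win
    bound = math.log((1 - alpha) / alpha)       # symmetric Wald bound
    return math.ceil(bound / step)
```

With eps = 0.05 and alpha = 5% this gives a required lead of 15 points; the smaller the strength difference you want to resolve, the larger the lead you must demand before stopping.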

### Re: Engine Testing - Statistics

Posted: Fri Jan 15, 2010 6:46 am
I think this has exactly the same error as the original proposition.

### Re: Engine Testing - Statistics

Posted: Fri Jan 15, 2010 8:51 am
Don wrote:
Edmund wrote:
John Major wrote:Comparing continuously till a cutoff is called 'sequential analysis'. It requires a different table that takes into account the larger likelihood of error, because you make many comparisons.

It was developed during WW2 and deemed so important that it was classified till '45.
That's interesting ...
How could this idea be ported to chess engine testing? I think a lot of effort (time, and computer resources that could be used for different tests) could be saved if we didn't have to run through the whole tournament in case of obvious differences.

Is there a way to determine the cutoff win % after n games played if we want an error bar of, say, 1%?
One thing you could do is play until one side is ahead by N points and this will automatically adjust resource usage based on how much confidence you need. But I don't know how to make the calculation to determine what N should be for a given confidence.
I would imagine your N also depends on the number of games played so far.

### Re: Engine Testing - Statistics

Posted: Fri Jan 15, 2010 9:02 am
Gian-Carlo Pascutto wrote:I think this has exactly the same error as the original proposition.
So the margins have to be doubled then? I.e., use an alpha of 0.5% instead of 1%?