Engine Testing - Statistics

Edmund · Post by **Edmund** » Thu Jan 14, 2010 8:39 am

Whenever I do engine testing I usually refer to the following table computed by Joseph Ciarrochi:
http://www.husvankempen.de/nunn/rating/tablejoseph.htm

It is a great resource to me, but what surprises me is one of the sidenotes by the author:

"It is critical that you choose your sample size ahead of time, and do not make any conclusions until you have run the full tournament. It is incorrect, statistically, to watch the running of the tournament, wait until an engine reaches a cut-off, and then stop the tournament."

Does this make sense? Why would it make a difference if I stopped the tournament in the middle (e.g. to save time) when one player is obviously stronger? After all, I only want to tell whether A > B, not an exact rating.

Gian-Carlo Pascutto · Post by **Gian-Carlo Pascutto** » Thu Jan 14, 2010 10:17 am

Because the statistics (confidence margins) are calculated on the basis that you do not do that.

If you do, the actual confidence will be significantly less than what is in the tables.

MattieShoes · Post by **MattieShoes** » Thu Jan 14, 2010 10:56 am

Stopping early because you're out of time should be the same as a shorter tournament.

Stopping early because of some sort of rating or result cutoff would screw up the confidence margins.

Take coin flipping.
Setting out for 1000 flips and stopping after 500 because you're bored is fine.
Setting out for 1000 flips and stopping once you have 20 more heads than tails is bad.
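
The coin example can be checked with a quick simulation. This is only a sketch: the function name `simulate`, the 2000 trials and the seed are arbitrary choices for illustration, not anything from the table linked above. It flips a fair coin 1000 times per run and counts how often heads is ever 20 ahead, compared with how often heads is 20 ahead when all flips are done.

```python
import random

def simulate(flips=1000, lead=20, trials=2000, seed=1):
    """Fair coin: compare 'stop at the cutoff' with 'look only at the end'."""
    rng = random.Random(seed)
    reached_cutoff = 0  # runs where heads led by `lead` at some point
    ahead_at_end = 0    # runs where heads led by `lead` after all flips
    for _ in range(trials):
        diff = 0        # heads minus tails so far
        ever = False
        for _ in range(flips):
            diff += 1 if rng.random() < 0.5 else -1
            if diff >= lead:
                ever = True  # a watcher would have stopped here
        reached_cutoff += ever
        ahead_at_end += diff >= lead
    return reached_cutoff / trials, ahead_at_end / trials

peek_rate, fixed_rate = simulate()
print(f"cutoff reached at some point:   {peek_rate:.3f}")
print(f"ahead by the cutoff at the end: {fixed_rate:.3f}")
```

The coin is fair in every run, so each time the cutoff is reached it is a false alarm. The first rate can never be lower than the second, because every run that ends 20 ahead also passed through the cutoff.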

Edmund · Post by **Edmund** » Thu Jan 14, 2010 11:55 am

Thanks for the kind answers. This makes sense.
So no way to trick statistics then ..

John Major · Post by **John Major** » Thu Jan 14, 2010 12:36 pm

Comparing continuously till a cutoff is called 'sequential analysis'. It requires a different table that takes into account the larger likelihood of error, because you make many comparisons.

It was developed during WW2 and deemed so important that it was classified till '45.

Edmund · Post by **Edmund** » Thu Jan 14, 2010 1:02 pm

John Major wrote:Comparing continuously till a cutoff is called 'sequential analysis'. It requires a different table that takes into account the larger likelihood of error, because you make many comparisons.

It was developed during WW2 and deemed so important that it was classified till '45.

That's interesting ...
How could this idea be ported to chess engine testing? I think a lot of effort (time, computer resources that could be used for different tests) could be saved if we didn't have to run through the whole tournament in case of obvious differences.

Is there a way to determine the cutoff win percentage after n games if we want an error bar of, say, 1%?
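
One standard tool from sequential analysis is Wald's sequential probability ratio test (SPRT). Below is a minimal sketch of how it might look for engine matches, under assumptions that are not from this thread: draws are dropped, each decisive game is an independent win or loss for engine A, and the two hypotheses are "A wins a decisive game with probability `p0`" versus "with probability `p1`". The names `sprt_bounds` and `sprt` and the default values are illustrative only.

```python
import math

def sprt_bounds(alpha, beta):
    """Wald's approximate stopping bounds for the log-likelihood ratio."""
    lower = math.log(beta / (1 - alpha))
    upper = math.log((1 - beta) / alpha)
    return lower, upper

def sprt(results, p0=0.5, p1=0.55, alpha=0.05, beta=0.05):
    """Run the test over decisive games: 1 = win for A, 0 = loss for A.

    Returns (decision, games_used). Draws must be filtered out by the caller
    (a simplifying assumption of this sketch).
    """
    lower, upper = sprt_bounds(alpha, beta)
    llr = 0.0  # running log-likelihood ratio of H1 against H0
    n = 0
    for n, won in enumerate(results, start=1):
        if won:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept H1", n
        if llr <= lower:
            return "accept H0", n
    return "undecided", n

print(sprt([1, 1, 0, 1, 1, 1, 0, 1] * 20))
```

The cutoff here is expressed as a threshold on the accumulated log-likelihood ratio rather than on the win percentage, which is what lets the test be checked after every game without inflating the error rates.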

Don · Post by **Don** » Fri Jan 15, 2010 12:01 am

Edmund wrote:
John Major wrote:Comparing continuously till a cutoff is called 'sequential analysis'. It requires a different table that takes into account the larger likelihood of error, because you make many comparisons.

It was developed during WW2 and deemed so important that it was classified till '45.
That's interesting ...
How could this idea be ported to chess engine testing? I think a lot of effort (time, computer resources that could be used for different tests) could be saved if we didn't have to run through the whole tournament in case of obvious differences.

Is there a way to determine the cutoff win percentage after n games if we want an error bar of, say, 1%?

One thing you could do is play until one side is ahead by N points and this will automatically adjust resource usage based on how much confidence you need. But I don't know how to make the calculation to determine what N should be for a given confidence.
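
For what it's worth, the "ahead by N" rule has a closed form if you are willing to assume a model. The sketch below uses the classic gambler's ruin result and assumes (my assumption, not something established in this thread) that the stronger engine wins each decisive game independently with probability `p1`, and that draws leave the lead unchanged. The function names are hypothetical.

```python
import math

def wrong_winner_probability(p1, n):
    """Chance that the weaker engine is the first to get n points ahead.

    Gambler's ruin with symmetric barriers at +n and -n: 1 / (1 + (p1/q1)**n).
    """
    ratio = p1 / (1 - p1)
    return 1 / (1 + ratio ** n)

def lead_needed(p1, alpha):
    """Smallest lead n whose wrong-winner probability is at most alpha."""
    ratio = p1 / (1 - p1)
    return math.ceil(math.log((1 - alpha) / alpha) / math.log(ratio))

n = lead_needed(0.55, 0.05)
print(n, wrong_winner_probability(0.55, n))
```

Note what this does not cover: if the two engines are actually equal, the rule still stops eventually and names a winner, so N only controls the risk of picking the wrong side of a real difference of the assumed size.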

Gian-Carlo Pascutto · Post by **Gian-Carlo Pascutto** » Fri Jan 15, 2010 7:46 am

I think this has exactly the same error as the original proposition.

Edmund · Post by **Edmund** » Fri Jan 15, 2010 9:51 am

Don wrote:
Edmund wrote:
John Major wrote:Comparing continuously till a cutoff is called 'sequential analysis'. It requires a different table that takes into account the larger likelihood of error, because you make many comparisons.

It was developed during WW2 and deemed so important that it was classified till '45.
That's interesting ...
How could this idea be ported to chess engine testing? I think a lot of effort (time, computer resources that could be used for different tests) could be saved if we didn't have to run through the whole tournament in case of obvious differences.

Is there a way to determine the cutoff win percentage after n games if we want an error bar of, say, 1%?
One thing you could do is play until one side is ahead by N points and this will automatically adjust resource usage based on how much confidence you need. But I don't know how to make the calculation to determine what N should be for a given confidence.

I would imagine your N also depends on the number of games played so far.

Edmund · Post by **Edmund** » Fri Jan 15, 2010 10:02 am

Gian-Carlo Pascutto wrote:I think this has exactly the same error as the original proposition.

So the margins have to be doubled then? I.e. use an alpha of 0.5% instead of 1%?
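
As a point of reference, halving alpha is what a Bonferroni correction gives for exactly two looks at the results; with k looks the per-look level becomes alpha / k. A small sketch, assuming a two-sided normal approximation (the function name `per_look_threshold` is made up for illustration):

```python
from statistics import NormalDist

def per_look_threshold(alpha, looks):
    """Two-sided z threshold per look under a Bonferroni correction."""
    per_look_alpha = alpha / looks
    return NormalDist().inv_cdf(1 - per_look_alpha / 2)

for looks in (1, 2, 10, 100):
    print(looks, round(per_look_threshold(0.01, looks), 3))
```

Bonferroni is conservative here because successive looks at the same tournament are strongly correlated, so it is best read as a safe upper bound on the correction rather than the exact margin a sequential table would give.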
