Engine Testing - Statistics

Edmund
Posts: 668
Joined: Mon Dec 03, 2007 2:01 pm
Location: Barcelona, Spain

Engine Testing - Statistics

Post by Edmund » Thu Jan 14, 2010 7:39 am

Whenever I do engine testing I usually refer to the following table computed by Joseph Ciarrochi:
http://www.husvankempen.de/nunn/rating/tablejoseph.htm

It is a great resource for me, but one of the author's side notes surprises me:
# It is critical that you choose your sample size ahead of time, and do not make any conclusions until you have run the full tournament. It is incorrect, statistically, to watch the running of the tournament, wait until an engine reaches a cut-off, and then stop the tournament.
Does this make sense? Why would it make a difference if I stopped the tournament in the middle (e.g. to save time) when one player is obviously stronger? After all, I only want to tell whether A > B, not to get an exact rating.

Gian-Carlo Pascutto
Posts: 1189
Joined: Sat Dec 13, 2008 6:00 pm

Re: Engine Testing - Statistics

Post by Gian-Carlo Pascutto » Thu Jan 14, 2010 9:17 am

Because the statistics (confidence margins) are calculated on the basis that you do not do that.

If you do, the actual confidence will be significantly less than what is in the tables.

MattieShoes
Posts: 718
Joined: Fri Mar 20, 2009 7:59 pm

Re: Engine Testing - Statistics

Post by MattieShoes » Thu Jan 14, 2010 9:56 am

Stopping early because you're out of time should be the same as a shorter tournament.

Stopping early because of some sort of rating or result cutoff would screw up the confidence margins.

Take coin flipping.
Setting out for 1000 flips and stopping after 500 because you're bored is fine.
Setting out for 1000 flips and stopping once you have 20 more heads than tails is bad.
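
To make the coin example concrete, here is a small simulation sketch. The trial count, the seed, and the function name `simulate` are illustrative assumptions, not anything from the table linked above. It flips a fair coin 1000 times per run and counts how often heads lead by at least 20 at the end, versus how often they lead by at least 20 at some point along the way, which is what a "stop at the cutoff" rule actually looks at.

```python
import random

def simulate(trials=2000, flips=1000, lead=20, seed=1):
    """Flip a fair coin and compare two ways of looking at the same runs.

    Returns (fixed_rate, peek_rate):
      fixed_rate - fraction of runs where heads lead by >= `lead` after all flips
      peek_rate  - fraction of runs where heads lead by >= `lead` at any point
    """
    rng = random.Random(seed)
    fixed_hits = 0
    peek_hits = 0
    for _ in range(trials):
        diff = 0          # heads minus tails so far
        reached = False   # has the lead ever hit the cutoff?
        for _ in range(flips):
            diff += 1 if rng.random() < 0.5 else -1
            if diff >= lead:
                reached = True
        fixed_hits += diff >= lead
        peek_hits += reached
    return fixed_hits / trials, peek_hits / trials

if __name__ == "__main__":
    fixed_rate, peek_rate = simulate()
    print(f"lead >= 20 at the end of 1000 flips: {fixed_rate:.3f}")
    print(f"lead >= 20 at some point on the way: {peek_rate:.3f}")
```

The coin is fair in every run, so any "heads is better" verdict is a false alarm. The second rate should come out roughly twice the first, which is why a table computed for a fixed number of games cannot be reused with a stop-at-cutoff rule.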

Edmund
Posts: 668
Joined: Mon Dec 03, 2007 2:01 pm
Location: Barcelona, Spain

Re: Engine Testing - Statistics

Post by Edmund » Thu Jan 14, 2010 10:55 am

Thanks for the kind answers. This makes sense.
So no way to trick statistics then .. :)

John Major
Posts: 27
Joined: Fri Dec 11, 2009 9:23 pm

Re: Engine Testing - Statistics

Post by John Major » Thu Jan 14, 2010 11:36 am

Comparing continuously till a cutoff is called 'sequential analysis'. It requires a different table that takes into account the larger likelihood of error, because you make many comparisons.

It was developed during WW2 and deemed so important that it was classified till '45.
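
For reference, the classic procedure from that work is Wald's sequential probability ratio test (SPRT). Below is a minimal sketch of it, not a tuned tool for engine testing. It assumes draws are thrown away and only decisive games are fed in, and the hypotheses `p0`/`p1` and error rates `alpha`/`beta` are illustrative values chosen for the example.

```python
import math

def sprt(results, p0=0.50, p1=0.55, alpha=0.05, beta=0.05):
    """Wald's sequential probability ratio test on a stream of decisive games.

    results: iterable of 1 (engine A won) or 0 (engine A lost); draws are
             assumed to have been dropped beforehand (a simplification).
    H0: A wins a decisive game with probability p0.
    H1: A wins a decisive game with probability p1.
    Returns (decision, games_used), decision in {"H0", "H1", "undecided"}.
    """
    lower = math.log(beta / (1 - alpha))   # accept H0 at or below this
    upper = math.log((1 - beta) / alpha)   # accept H1 at or above this
    win_step = math.log(p1 / p0)
    loss_step = math.log((1 - p1) / (1 - p0))
    llr = 0.0   # running log-likelihood ratio of H1 against H0
    n = 0
    for r in results:
        n += 1
        llr += win_step if r else loss_step
        if llr >= upper:
            return "H1", n
        if llr <= lower:
            return "H0", n
    return "undecided", n
```

The stopping bounds are fixed before the first game and already account for checking after every game, so this is the "different table" in procedural form. With these example settings, an unbroken winning streak is accepted as H1 after 31 games, and an unbroken losing streak as H0 after 28.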

Edmund
Posts: 668
Joined: Mon Dec 03, 2007 2:01 pm
Location: Barcelona, Spain

Re: Engine Testing - Statistics

Post by Edmund » Thu Jan 14, 2010 12:02 pm

John Major wrote:Comparing continuously till a cutoff is called 'sequential analysis'. It requires a different table that takes into account the larger likelihood of error, because you make many comparisons.

It was developed during WW2 and deemed so important that it was classified till '45.
That's interesting ...
How could this idea be ported to chess engine testing? I think a lot of effort (time, and computer resources that could be used for different tests) could be saved if we didn't have to run through the whole tournament in case of obvious differences.

Is there a way to determine the cutoff win percentage after n games if we want an error bar of, say, 1%?

Don
Posts: 5106
Joined: Tue Apr 29, 2008 2:27 pm

Re: Engine Testing - Statistics

Post by Don » Thu Jan 14, 2010 11:01 pm

Edmund wrote:That's interesting ...
How could this idea be ported to chess engine testing? I think a lot of effort (time, and computer resources that could be used for different tests) could be saved if we didn't have to run through the whole tournament in case of obvious differences.

Is there a way to determine the cutoff win percentage after n games if we want an error bar of, say, 1%?
One thing you could do is play until one side is ahead by N points and this will automatically adjust resource usage based on how much confidence you need. But I don't know how to make the calculation to determine what N should be for a given confidence.
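
One way to put a number on N, under simplifying assumptions: treat every decisive game as an independent trial that the stronger engine wins with a fixed probability p > 0.5, and ignore draws. The "first to lead by N" rule is then the textbook gambler's ruin problem, and the chance that the weaker engine gets there first is 1 / (1 + (p / (1 - p))^N). The helper names below are made up for this sketch, and p has to be assumed in advance, which is the weak point of the approach.

```python
import math

def wrong_winner_prob(n_lead, p):
    """Chance the weaker side is first to lead by n_lead (gambler's ruin).

    p is the assumed probability that the stronger engine wins a decisive game.
    """
    if not 0.5 < p < 1:
        raise ValueError("p must be strictly between 0.5 and 1")
    ratio = p / (1 - p)
    return 1 / (1 + ratio ** n_lead)

def lead_needed(p, alpha):
    """Smallest lead N for which wrong_winner_prob(N, p) <= alpha."""
    if not 0.5 < p < 1:
        raise ValueError("p must be strictly between 0.5 and 1")
    ratio = p / (1 - p)
    return math.ceil(math.log((1 - alpha) / alpha) / math.log(ratio))

if __name__ == "__main__":
    # Example: stronger engine wins 55% of decisive games, 5% error allowed.
    n = lead_needed(0.55, 0.05)
    print(n, wrong_winner_prob(n, 0.55))
```

For p = 0.55 and a 5% error this gives N = 15. Note that if the two engines are in fact equal, one of them will still reach any finite lead sooner or later, each with probability 1/2, so the rule only controls the error relative to the assumed p.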

Gian-Carlo Pascutto
Posts: 1189
Joined: Sat Dec 13, 2008 6:00 pm

Re: Engine Testing - Statistics

Post by Gian-Carlo Pascutto » Fri Jan 15, 2010 6:46 am

I think this has exactly the same error as the original proposition.

Edmund
Posts: 668
Joined: Mon Dec 03, 2007 2:01 pm
Location: Barcelona, Spain

Re: Engine Testing - Statistics

Post by Edmund » Fri Jan 15, 2010 8:51 am

Don wrote:One thing you could do is play until one side is ahead by N points and this will automatically adjust resource usage based on how much confidence you need. But I don't know how to make the calculation to determine what N should be for a given confidence.
I would imagine your N also depends on the number of games played so far.

Edmund
Posts: 668
Joined: Mon Dec 03, 2007 2:01 pm
Location: Barcelona, Spain

Re: Engine Testing - Statistics

Post by Edmund » Fri Jan 15, 2010 9:02 am

Gian-Carlo Pascutto wrote:I think this has exactly the same error as the original proposition.
So margins have to be doubled then? I.e., use an alpha of 0.5% instead of 1%?
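
A quick way to check whether halving alpha is enough is to simulate it. The sketch below assumes two equal engines, decisive games only, and a two-sided z-test applied after every 10 games up to 1000. All of these settings and the function name are illustrative assumptions. It reports how often a winner is wrongly declared at one or more of the looks.

```python
import random
from statistics import NormalDist

def overall_false_positive_rate(per_look_alpha, games=1000, step=10,
                                trials=2000, seed=1):
    """Equal engines, decisive games only; test after every `step` games.

    Returns the fraction of simulated matches in which a two-sided z-test
    at `per_look_alpha` wrongly declared a winner at one or more looks.
    """
    z_crit = NormalDist().inv_cdf(1 - per_look_alpha / 2)
    rng = random.Random(seed)
    false_alarms = 0
    for _ in range(trials):
        diff = 0        # wins minus losses for engine A
        alarm = False   # did any look cross the threshold?
        for n in range(1, games + 1):
            diff += 1 if rng.random() < 0.5 else -1
            # Under the null, diff has mean 0 and standard deviation sqrt(n).
            if n % step == 0 and abs(diff) > z_crit * n ** 0.5:
                alarm = True
        false_alarms += alarm
    return false_alarms / trials

if __name__ == "__main__":
    for a in (0.01, 0.005):
        rate = overall_false_positive_rate(a)
        print(f"per-look alpha {a:.3f} -> overall rate {rate:.3f}")
```

With 100 looks per match, the overall rate for a per-look alpha of 0.5% should still come out well above 1%, so simply halving alpha does not compensate. The correction has to grow with the number of looks, which is what sequential designs take care of.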
