
Re: Engine Testing - Statistics

Posted: Thu Jan 14, 2010 12:36 pm
by John Major
Comparing continuously till a cutoff is called 'sequential analysis'. It requires a different table that takes into account the larger likelihood of error, because you make many comparisons.

It was developed during WW2 and deemed so important that it was classified till '45.
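
For illustration, the best-known procedure of this kind is Wald's sequential probability ratio test (SPRT). Below is a minimal sketch, assuming draws are ignored, games are independent, and we are testing H0 "win probability is p0" against H1 "win probability is p1". The function name `sprt` and the default values are hypothetical choices for this example, not anything prescribed in this thread.

```python
import math

def sprt(results, p0=0.5, p1=0.55, alpha=0.05, beta=0.05):
    """Wald's SPRT on a stream of decisive games (1 = win, 0 = loss).

    Assumptions (illustrative only): draws are ignored, games are
    independent, H0: P(win) = p0, H1: P(win) = p1.
    Returns (decision, number_of_games_used).
    """
    # Wald's approximate stopping bounds on the log-likelihood ratio.
    lower = math.log(beta / (1 - alpha))
    upper = math.log((1 - beta) / alpha)
    # Contribution of a single win or loss to the log-likelihood ratio.
    win_step = math.log(p1 / p0)
    loss_step = math.log((1 - p1) / (1 - p0))

    llr = 0.0
    n = 0
    for n, result in enumerate(results, start=1):
        llr += win_step if result else loss_step
        if llr >= upper:
            return "accept H1", n
        if llr <= lower:
            return "accept H0", n
    return "undecided", n
```

The point of the "different table" is visible in the bounds: the test is allowed to look after every game, but only stops when the evidence crosses thresholds that were fixed in advance from alpha and beta.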

Re: Engine Testing - Statistics

Posted: Thu Jan 14, 2010 1:02 pm
by Edmund
John Major wrote: Comparing continuously till a cutoff is called 'sequential analysis'. It requires a different table that takes into account the larger likelihood of error, because you make many comparisons.

It was developed during WW2 and deemed so important that it was classified till '45.
That's interesting ...
How could this idea be ported to chess engine testing? I think a lot of effort (time, computer resources that could be used for different tests) could be saved if we didn't have to run through the whole tournament in case of obvious differences.

Is there a way to determine the cutoff win-% after n played games in case we want to have an error bar of, say, 1%?

Re: Engine Testing - Statistics

Posted: Fri Jan 15, 2010 12:01 am
by Don
Edmund wrote:
John Major wrote: Comparing continuously till a cutoff is called 'sequential analysis'. It requires a different table that takes into account the larger likelihood of error, because you make many comparisons.

It was developed during WW2 and deemed so important that it was classified till '45.
That's interesting ...
How could this idea be ported to chess engine testing? I think a lot of effort (time, computer resources that could be used for different tests) could be saved if we didn't have to run through the whole tournament in case of obvious differences.

Is there a way to determine the cutoff win-% after n played games in case we want to have an error bar of, say, 1%?
One thing you could do is play until one side is ahead by N points and this will automatically adjust resource usage based on how much confidence you need. But I don't know how to make the calculation to determine what N should be for a given confidence.
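
One way to approach that calculation is the classic gambler's-ruin formula. This is a rough sketch, assuming draws are ignored, games are independent, and the stronger side wins each decisive game with a fixed probability p. The function names `wrong_winner_prob` and `lead_needed` are made up for this example.

```python
import math

def wrong_winner_prob(p, n_lead):
    """Chance that the weaker side is first to get n_lead wins ahead.

    p is the assumed per-game win probability of the stronger side
    (draws ignored). Gambler's ruin with symmetric barriers at +/- n_lead.
    """
    r = (1 - p) / p
    return r ** n_lead / (1 + r ** n_lead)

def lead_needed(p, max_error):
    """Smallest lead N such that wrong_winner_prob(p, N) <= max_error."""
    if not 0.5 < p < 1:
        raise ValueError("p must be strictly between 0.5 and 1")
    if not 0 < max_error < 0.5:
        raise ValueError("max_error must be strictly between 0 and 0.5")
    odds = p / (1 - p)
    # Solve 1 / (1 + odds**N) <= max_error for N.
    return math.ceil(math.log((1 - max_error) / max_error) / math.log(odds))
```

Note that N depends on the true strength difference p, which is exactly what the test is trying to find out. For two equal engines (p = 0.5) the formula gives 0.5 for every N: the rule still declares a winner, and it is a coin toss which one.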

Re: Engine Testing - Statistics

Posted: Fri Jan 15, 2010 7:46 am
by Gian-Carlo Pascutto
I think this has exactly the same error as the original proposition.

Re: Engine Testing - Statistics

Posted: Fri Jan 15, 2010 9:51 am
by Edmund
Don wrote:
Edmund wrote:
John Major wrote: Comparing continuously till a cutoff is called 'sequential analysis'. It requires a different table that takes into account the larger likelihood of error, because you make many comparisons.

It was developed during WW2 and deemed so important that it was classified till '45.
That's interesting ...
How could this idea be ported to chess engine testing? I think a lot of effort (time, computer resources that could be used for different tests) could be saved if we didn't have to run through the whole tournament in case of obvious differences.

Is there a way to determine the cutoff win-% after n played games in case we want to have an error bar of, say, 1%?
One thing you could do is play until one side is ahead by N points and this will automatically adjust resource usage based on how much confidence you need. But I don't know how to make the calculation to determine what N should be for a given confidence.
I would imagine your N also depends on the number of games played so far.

Re: Engine Testing - Statistics

Posted: Fri Jan 15, 2010 10:02 am
by Edmund
Gian-Carlo Pascutto wrote: I think this has exactly the same error as the original proposition.
So margins have to be doubled then? I.e. use an alpha of 0.5% instead of 1%?

Re: Engine Testing - Statistics

Posted: Fri Jan 15, 2010 10:21 am
by Gian-Carlo Pascutto
Edmund wrote:
Gian-Carlo Pascutto wrote: I think this has exactly the same error as the original proposition.
So margins have to be doubled then? I.e. use an alpha of 0.5% instead of 1%?
Someone would need to do the math :P I wouldn't be surprised if the proposal to play until one side has N more wins turns out to have zero confidence, because a lead of N wins always occurs eventually as n -> inf.

Where are the maths PhDs?

Re: Engine Testing - Statistics

Posted: Fri Jan 15, 2010 2:10 pm
by Don
Gian-Carlo Pascutto wrote: I think this has exactly the same error as the original proposition.
Doesn't it just have to be calculated differently?

Re: Engine Testing - Statistics

Posted: Fri Jan 15, 2010 3:05 pm
by bob
MattieShoes wrote: Stopping early because you're out of time should be the same as a shorter tournament.

Stopping early because of some sort of rating or result cutoff would screw up the confidence margins.

Take coin flipping.
Setting out for 1000 flips and stopping after 500 because you're bored is fine.
Setting out for 1000 flips and stopping once you have 20 more heads than tails is bad.
This is the same nonsensical stuff I hear on the blackjack forums, where people claim that there are ways to "beat the game" without counting cards or shuffle-tracking. One common theme: play until you get ahead and then quit. And you do end up with more "winning sessions". But what happens when you hit a deep slump, which does happen? You play until you get ahead, or you play until you run out of money. The latter is much more likely.

So stopping a test at some arbitrary point chosen _before_ the match starts eliminates that kind of logic. If you watch and wait, between two fairly equal programs, at some point either one will be ahead. And if you stop at that point, you stop before "the truth".
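
The coin-flip example quoted above can be checked with a small simulation. This is only a sketch under stated assumptions: a perfectly fair coin, a planned run of 1000 flips, and the rule "stop as soon as heads leads tails by 20". The function name `early_stop_rate` and its parameters are invented for this illustration.

```python
import random

def early_stop_rate(flips=1000, lead=20, trials=2000, seed=1):
    """Fraction of fair-coin runs in which heads gets `lead` ahead of tails
    at some point within `flips` flips (so the 'stop when ahead' rule fires).
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    stopped = 0
    for _ in range(trials):
        diff = 0  # heads minus tails so far
        for _ in range(flips):
            diff += 1 if rng.random() < 0.5 else -1
            if diff >= lead:
                stopped += 1
                break
    return stopped / trials

if __name__ == "__main__":
    print(f"runs stopped early with heads 'ahead': {early_stop_rate():.1%}")
```

With these settings the rule should fire in roughly half of all runs, even though the coin has no bias at all. Every one of those runs would be reported as "heads is 20 ahead", which is why the usual fixed-length confidence margins no longer apply.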

Re: Engine Testing - Statistics

Posted: Fri Jan 15, 2010 4:51 pm
by krazyken
Edmund wrote: Thanks for the kind answers. This makes sense.
So no way to trick statistics then .. :)
But there are plenty of ways to trick the experimenter. Most of the statistical tools in use rely on an assumption of good sampling. Random sampling is one of the better methods to use, but you need to know what it is you are sampling. If you are trying to determine the strength of your engine as it compares to the other engines out there, then you want to be taking random samples from the types of tournaments it's likely to be playing in. If what matters to you is CCRL rating, then you'll want a random sampling of CCRL opponents, at CCRL time controls, using CCRL opening books. If you are looking for the best WCCC performance, you should adjust your parameters to satisfy the WCCC conditions. If you want to speed up the testing by using very fast time controls, you should do some work to make sure that your engine's (and opponents') results correlate across different time controls. If your testing is based on opening positions, and those positions aren't a representative sample of what your engine might play, you have picked the wrong positions.

Messing up the sampling process is the most common way to bias a statistical analysis.