Engine Testing - Statistics

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

John Major
Posts: 27
Joined: Fri Dec 11, 2009 10:23 pm

Re: Engine Testing - Statistics

Post by John Major »

Comparing continuously till a cutoff is called 'sequential analysis'. It requires a different table that takes into account the larger likelihood of error, because you make many comparisons.

It was developed during WW2 and deemed so important that it was classified till '45.
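
For win/loss results, the classic form of sequential analysis is Wald's sequential probability ratio test (SPRT). Below is a minimal sketch, assuming draws are ignored and the two hypotheses are fixed win probabilities p0 and p1. The function names and default values are illustrative only and do not come from any engine-testing tool.

```python
import math

def sprt_bounds(alpha, beta):
    """Wald's approximate stopping bounds on the log-likelihood ratio."""
    lower = math.log(beta / (1.0 - alpha))   # at or below this: accept H0
    upper = math.log((1.0 - beta) / alpha)   # at or above this: accept H1
    return lower, upper

def sprt_llr(wins, losses, p0, p1):
    """Log-likelihood ratio of H1 (win prob p1) against H0 (win prob p0)."""
    return (wins * math.log(p1 / p0)
            + losses * math.log((1.0 - p1) / (1.0 - p0)))

def sprt_decision(wins, losses, p0=0.5, p1=0.55, alpha=0.05, beta=0.05):
    """Return 'accept H1', 'accept H0' or 'continue' for the current score."""
    lower, upper = sprt_bounds(alpha, beta)
    llr = sprt_llr(wins, losses, p0, p1)
    if llr >= upper:
        return "accept H1"
    if llr <= lower:
        return "accept H0"
    return "continue"
```

You would call sprt_decision after every game and stop as soon as it returns something other than 'continue'. The bounds depend only on the chosen error rates, which is what lets the test be checked continuously without inflating the error.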
Edmund
Posts: 670
Joined: Mon Dec 03, 2007 3:01 pm
Location: Barcelona, Spain

Re: Engine Testing - Statistics

Post by Edmund »

John Major wrote:Comparing continuously till a cutoff is called 'sequential analysis'. It requires a different table that takes into account the larger likelihood of error, because you make many comparisons.

It was developed during WW2 and deemed so important that it was classified till '45.
That's interesting ...
How could this idea be ported to chess engine testing? I think a lot of effort (time, and computer resources that could be used for different tests) could be saved if we didn't have to run through the whole tournament in case of obvious differences.

Is there a way to determine the cutoff win percentage after n played games if we want an error bar of, say, 1%?
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Engine Testing - Statistics

Post by Don »

Edmund wrote:
John Major wrote:Comparing continuously till a cutoff is called 'sequential analysis'. It requires a different table that takes into account the larger likelihood of error, because you make many comparisons.

It was developed during WW2 and deemed so important that it was classified till '45.
That's interesting ...
How could this idea be ported to chess engine testing? I think a lot of effort (time, and computer resources that could be used for different tests) could be saved if we didn't have to run through the whole tournament in case of obvious differences.

Is there a way to determine the cutoff win percentage after n played games if we want an error bar of, say, 1%?
One thing you could do is play until one side is ahead by N points and this will automatically adjust resource usage based on how much confidence you need. But I don't know how to make the calculation to determine what N should be for a given confidence.
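
For reference, the calculation for N is the standard gambler's ruin result, under strong simplifying assumptions: draws are ignored, games are independent, and the stronger side wins each decisive game with a known probability p > 0.5. The chance that the weaker side reaches a lead of N first is then 1 / (1 + (p/q)^N) with q = 1 - p. The sketch below uses hypothetical function names; note that N depends on the strength difference you assume, which is exactly what is unknown before testing.

```python
import math

def wrong_winner_probability(p, n_lead):
    """Chance the weaker side is first to lead by n_lead decisive games.

    Assumes independent games, no draws, and a stronger-side win
    probability p with 0.5 <= p < 1.
    """
    q = 1.0 - p
    return 1.0 / (1.0 + (p / q) ** n_lead)

def lead_needed(p, alpha):
    """Smallest lead N keeping the wrong-winner chance at or below alpha."""
    if not 0.5 < p < 1.0:
        raise ValueError("p must be strictly between 0.5 and 1")
    q = 1.0 - p
    return math.ceil(math.log((1.0 - alpha) / alpha) / math.log(p / q))
```

With two equal engines (p = 0.5) the first function returns 0.5 for any N: the match always ends with a "winner", chosen by a coin toss. So a lead of N only gives confidence relative to an assumed strength difference.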
Gian-Carlo Pascutto
Posts: 1243
Joined: Sat Dec 13, 2008 7:00 pm

Re: Engine Testing - Statistics

Post by Gian-Carlo Pascutto »

I think this has exactly the same error as the original proposition.
Edmund
Posts: 670
Joined: Mon Dec 03, 2007 3:01 pm
Location: Barcelona, Spain

Re: Engine Testing - Statistics

Post by Edmund »

Don wrote:
Edmund wrote:
John Major wrote:Comparing continuously till a cutoff is called 'sequential analysis'. It requires a different table that takes into account the larger likelihood of error, because you make many comparisons.

It was developed during WW2 and deemed so important that it was classified till '45.
That's interesting ...
How could this idea be ported to chess engine testing? I think a lot of effort (time, and computer resources that could be used for different tests) could be saved if we didn't have to run through the whole tournament in case of obvious differences.

Is there a way to determine the cutoff win percentage after n played games if we want an error bar of, say, 1%?
One thing you could do is play until one side is ahead by N points and this will automatically adjust resource usage based on how much confidence you need. But I don't know how to make the calculation to determine what N should be for a given confidence.
I would imagine your N also depends on the number of games played so far.
Edmund
Posts: 670
Joined: Mon Dec 03, 2007 3:01 pm
Location: Barcelona, Spain

Re: Engine Testing - Statistics

Post by Edmund »

Gian-Carlo Pascutto wrote:I think this has exactly the same error as the original proposition.
So margins have to be doubled then? I.e., use an alpha of 0.5% instead of 1%?
Gian-Carlo Pascutto
Posts: 1243
Joined: Sat Dec 13, 2008 7:00 pm

Re: Engine Testing - Statistics

Post by Gian-Carlo Pascutto »

Edmund wrote:
Gian-Carlo Pascutto wrote:I think this has exactly the same error as the original proposition.
So margins have to be doubled then? I.e., use an alpha of 0.5% instead of 1%?
Someone would need to do the math :P I wouldn't be surprised if the proposal to play until one side has N more wins turns out to have zero confidence, because a lead of N wins always happens eventually as n->inf.

Where are the maths PhDs?
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Engine Testing - Statistics

Post by Don »

Gian-Carlo Pascutto wrote:I think this has exactly the same error as the original proposition.
Doesn't it just have to be calculated differently?
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Engine Testing - Statistics

Post by bob »

MattieShoes wrote:Stopping early because you're out of time should be the same as a shorter tournament.

Stopping early because of some sort of rating or result cutoff would screw up the confidence margins.

Take coin flipping.
Setting out for 1000 flips and stopping after 500 because you're bored is fine.
Setting out for 1000 flips and stopping once you have 20 more heads than tails is bad.
This is the same nonsensical stuff I hear on the blackjack forums where people claim that there are ways to "beat the game" without counting cards or shuffle-tracking. One common theme: play until you get ahead and then quit. And you do end up with more "winning sessions". But what happens when you hit a deep slump, which does happen? You play until you get ahead. Or you play until you run out of money. The latter is much more likely.

So stopping a test at some arbitrary point chosen _before_ the match starts eliminates that kind of logic. If you watch and wait, between two fairly equal programs, at some point either one will be ahead. And if you stop at that point, you stop before "the truth".
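
The coin-flip example is easy to check with a small simulation. This is only a sketch under assumed settings: a perfectly fair coin, a planned 1000 flips, and an early stop as soon as heads leads by 20. The function names, trial count and seed are arbitrary choices for illustration.

```python
import random

def hits_lead(flips, lead, rng):
    """True if heads gets `lead` ahead of tails at any point in the run."""
    diff = 0
    for _ in range(flips):
        diff += 1 if rng.random() < 0.5 else -1  # fair coin: +1 heads, -1 tails
        if diff >= lead:
            return True  # the "stop while ahead" rule would fire here
    return False

def fraction_stopped_early(trials=2000, flips=1000, lead=20, seed=1):
    """Share of fair-coin runs that the early-stop rule would cut short."""
    rng = random.Random(seed)
    stopped = sum(hits_lead(flips, lead, rng) for _ in range(trials))
    return stopped / trials

if __name__ == "__main__":
    print(fraction_stopped_early())
```

Even though the coin is fair, a large share of the runs (roughly half, by the reflection principle) reach a 20-flip lead for heads somewhere along the way, so stopping at that moment would report a bias that does not exist.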
krazyken

Re: Engine Testing - Statistics

Post by krazyken »

Edmund wrote:Thanks for the kind answers. This makes sense.
So no way to trick statistics then .. :)
But there are plenty of ways to trick the experimenter. Most of the statistical tools in use rely on an assumption of good sampling. Random sampling is one of the better methods to use, but you need to know what it is you are sampling. If you are trying to determine the strength of your engine as it compares to the other engines out there, then you want to be taking random samples from the types of tournaments it's likely to be playing in. If what matters to you is CCRL rating, then you'll want a random sampling of CCRL opponents, at CCRL time controls, using CCRL opening books. If you are looking for the best WCCC performance, you should adjust your parameters to satisfy the WCCC conditions. If you want to speed up the testing by using very fast time controls, you should do some work to make sure that your engine's (and opponents') results correlate across different time controls. If your testing is based on opening positions, and those positions aren't a representative sample of what your engine might play, you have picked the wrong positions.

Messing up the sampling process is the most common way to bias a statistical analysis.