Engine Testing - Statistics

Gian-Carlo Pascutto · Fri Jan 15, 2010 10:21 am

Edmund wrote:
Gian-Carlo Pascutto wrote: I think this has exactly the same error as the original proposition.
So margins have to be doubled then? I.e., use an alpha of 0.5% instead of 1%?

Someone would need to do the math

I wouldn't be surprised if the proposition to play until N wins more turns out to have zero confidence, because having N more wins always happens for n->inf.

Where are the maths PhDs?
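The claim about a fixed lead N can be checked by simulation before anyone does the math. The sketch below is only an illustration: it assumes two exactly equal engines and ignores draws (each game is a fair coin flip), and the function name `lead_hit_fractions` and its parameters are invented for this example. It estimates how often one chosen side gets N wins ahead within a given number of games.

```python
import random

def lead_hit_fractions(lead=10, horizons=(100, 1000, 10000), trials=300, seed=1):
    """Fraction of simulated matches between equal engines in which
    'wins minus losses' for one side reaches `lead` within each horizon.
    Assumption: no draws, every game is a fair coin flip."""
    rng = random.Random(seed)
    longest = max(horizons)
    hit_times = []  # game number at which the lead was first reached
    for _ in range(trials):
        diff = 0
        for n in range(1, longest + 1):
            diff += 1 if rng.random() < 0.5 else -1
            if diff == lead:
                hit_times.append(n)
                break
    # A match counts for a horizon if the lead was reached at or before it.
    return {h: sum(t <= h for t in hit_times) / trials for h in horizons}

if __name__ == "__main__":
    for horizon, fraction in lead_hit_fractions().items():
        print(f"within {horizon:>6} games: {fraction:.2f}")
```

With equal engines the fraction keeps climbing toward 1 as the horizon grows, so reaching the lead eventually says very little about relative strength.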

Don · Fri Jan 15, 2010 2:10 pm

Gian-Carlo Pascutto wrote: I think this has exactly the same error as the original proposition.

Doesn't it just have to be calculated differently?

bob · Fri Jan 15, 2010 3:05 pm

MattieShoes wrote: Stopping early because you're out of time should be the same as a shorter tournament.

Stopping early because of some sort of rating or result cutoff would screw up the confidence margins.

Take coin flipping.
Setting out for 1000 flips and stopping after 500 because you're bored is fine.
Setting out for 1000 flips and stopping once you have 20 more heads than tails is bad.

This is the same nonsensical stuff I hear on the blackjack forums where people claim that there are ways to "beat the game" without counting cards or shuffle-tracking. One common theme. Play until you get ahead and then quit. And you do end up with more "winning sessions". But what happens when you hit a deep slump, which happens? You play until you get ahead. Or you play until you run out of money. The latter is much more likely.

So stopping a test at some arbitrary point chosen _before_ the match starts eliminates using that kind of logic. If you watch and wait, between two fairly equal programs, at some point either will be ahead. And if you stop at that point, you stop before "the truth".
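The effect of "watch and wait" on the confidence margins can be estimated with a small simulation. This is a sketch under stated assumptions, not anyone's actual test setup: the coin is fair (so every "significant" result is a false positive), draws do not exist, the significance check is a plain two-sided z-test at a nominal 5%, and the name `false_positive_rates` and its parameters are made up here.

```python
import math
import random

def false_positive_rates(trials=2000, max_flips=1000, z_crit=1.96,
                         min_flips=10, seed=1):
    """Compare two stopping rules on a fair coin.
    peeking: stop as soon as |heads - tails| exceeds z_crit * sqrt(n),
             checked after every flip from min_flips onward.
    fixed:   apply the same check once, after max_flips flips.
    Returns (peeking_rate, fixed_rate)."""
    rng = random.Random(seed)
    thresholds = [z_crit * math.sqrt(n) for n in range(max_flips + 1)]
    peek_hits = 0
    fixed_hits = 0
    for _ in range(trials):
        lead = 0  # heads minus tails; its variance after n flips is n
        crossed = False
        for n in range(1, max_flips + 1):
            lead += 1 if rng.random() < 0.5 else -1
            if not crossed and n >= min_flips and abs(lead) > thresholds[n]:
                crossed = True  # a peeking tester would have stopped here
        peek_hits += crossed
        fixed_hits += abs(lead) > thresholds[max_flips]
    return peek_hits / trials, fixed_hits / trials

if __name__ == "__main__":
    peeking, fixed = false_positive_rates()
    print(f"peeking after every flip: {peeking:.3f}")
    print(f"single test at the end:   {fixed:.3f}")
```

The single test at the end stays near its nominal 5%, while checking after every flip declares a "winner" far more often, even though both sides are identical.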

krazyken · Fri Jan 15, 2010 4:51 pm

Edmund wrote: Thanks for the kind answers. This makes sense.
So no way to trick statistics then...

But there are plenty of ways to trick the experimenter. Most of the statistical tools in use rely on an assumption of good sampling. Random sampling is one of the better methods to use, but you need to know what it is you are sampling. If you are trying to determine the strength of your engine as it compares to the other engines out there, then you want to be taking random samples from the types of tournaments it's likely to be playing in. If what matters to you is CCRL rating, then you'll want a random sampling of CCRL opponents, at CCRL time controls, using CCRL opening books. If you are looking for the best WCCC performance, you should adjust your parameters to satisfy the WCCC conditions. If you want to speed up the testing by using very fast time controls, you should do some work to make sure that your engine's (and opponents') results correlate across different time controls. If your testing is based on opening positions, and those positions aren't a representative sample of what your engine might play, you have picked the wrong positions.

Messing up the sampling process is the most common way to bias a statistical analysis.

BubbaTough · Fri Jan 15, 2010 5:35 pm

Well, there sure are a lot of purists out here.

I stop tests early, and theory be damned! For example, if there is a certain score I need to get in my testbed to use a new feature, and it becomes impossible to reach that score even if I win the rest of my games, why not stop the test? Heck, even if it is possible, but really really unlikely (like winning 180 / 200 against opponents that I have been scoring 50% against) I might stop the test if I happen to be watching.

Not stopping tests is a luxury for people who have a lot of computing power or time on their hands. If you don't, then you just have to take a little care not to stop too early (like if you win 1 out of 5 of your games with a target score of 50%, and are planning to run 1000 games, it's a bit premature to stop...but if you win 100 out of 500...it probably is fine to stop).

It would be nice to formalize when it's OK to stop, and I am sure it is possible despite what others are likely to say, but so far I haven't bothered.

-Sam
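The two informal rules in this post (stop when the target is out of reach, or when reaching it is very unlikely) can be written down as a futility check. The sketch below is one possible formalization, not Sam's actual procedure: it assumes every remaining game is an independent win or loss with a fixed win probability, ignores draws, and the names `can_still_reach` and `prob_reach` are hypothetical.

```python
import math

def can_still_reach(target_points, points, played, total_games):
    """True if winning every remaining game would still reach the target."""
    return points + (total_games - played) >= target_points

def prob_reach(target_points, points, played, total_games, p_win=0.5):
    """Probability of reaching target_points by the end of the match,
    assuming each remaining game is won with probability p_win (0 < p_win < 1)
    and there are no draws. This is a binomial upper-tail sum."""
    remaining = total_games - played
    need = math.ceil(target_points - points)
    if need <= 0:
        return 1.0
    if need > remaining:
        return 0.0
    log_p, log_q = math.log(p_win), math.log(1.0 - p_win)
    total = 0.0
    for k in range(need, remaining + 1):
        # log of C(remaining, k) * p^k * q^(remaining - k), via lgamma
        log_pmf = (math.lgamma(remaining + 1) - math.lgamma(k + 1)
                   - math.lgamma(remaining - k + 1)
                   + k * log_p + (remaining - k) * log_q)
        total += math.exp(log_pmf)
    return min(total, 1.0)

if __name__ == "__main__":
    # Target 50% of 1000 games, i.e. 500 points.
    print(prob_reach(500, 1, 5, 1000))      # 1 win in 5: far too early to stop
    print(prob_reach(500, 100, 500, 1000))  # 100 wins in 500: effectively hopeless
    print(prob_reach(180, 0, 0, 200))       # 180 of 200 against a 50% opponent
```

Note that this only covers stopping a test that is failing. Stopping early because a test looks like it is succeeding is the case the rest of the thread warns about.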

michiguel · Fri Jan 15, 2010 5:45 pm

BubbaTough wrote: Well, there sure are a lot of purists out here.

I stop tests early, and theory be damned! For example, if there is a certain score I need to get in my testbed to use a new feature, and it becomes impossible to reach that score even if I win the rest of my games, why not stop the test? Heck, even if it is possible, but really really unlikely (like winning 180 / 200 against opponents that I have been scoring 50% against) I might stop the test if I happen to be watching.

Not stopping tests is a luxury for people who have a lot of computing power or time on their hands. If you don't, then you just have to take a little care not to stop too early (like if you win 1 out of 5 of your games with a target score of 50%, and are planning to run 1000 games, it's a bit premature to stop...but if you win 100 out of 500...it probably is fine to stop).

It would be nice to formalize when it's OK to stop, and I am sure it is possible despite what others are likely to say, but so far I haven't bothered.

-Sam

It is OK to stop whenever the heck you want. Not doing so is OCD. There are many factors to take into account, and efficiency is sometimes more important than accuracy, particularly in intermediate stages of development. Resources and time are limited, not to mention patience. Not every experiment in science should be subjected to statistics.

Miguel

krazyken · Fri Jan 15, 2010 5:52 pm

Nothing wrong with stopping a test early if you are going to throw the results out anyway. If the results are exceeding your confidence margin by a significant factor, there is no reason to continue. Most testers seem to be intent on splitting hairs looking for very small improvements, so they must continue on to a ridiculous number of games to get the confidence they want.

Yes, it is possible to build the formal theory for stopping early, but there you are using different assumptions, and many of the common tools do not apply to this kind of test.

zamar · Fri Jan 15, 2010 7:51 pm

bob wrote:
MattieShoes wrote: Stopping early because you're out of time should be the same as a shorter tournament.

Stopping early because of some sort of rating or result cutoff would screw up the confidence margins.

Take coin flipping.
Setting out for 1000 flips and stopping after 500 because you're bored is fine.
Setting out for 1000 flips and stopping once you have 20 more heads than tails is bad.
This is the same nonsensical stuff

What here is nonsense? To me it makes perfect sense. (The only possible loophole is that there might be a connection between being bored and the result, but no example is perfect.)

I hear on the blackjack forums where people claim that there are ways to "beat the game" without counting cards or shuffle-tracking. One common theme. Play until you get ahead and then quit.

I invented this technique when I was ten, used it successfully for gambling for six months, then thought more about it, realized it doesn't work, and stopped.

Common sense can be badly misleading with statistics.

So stopping a test at some arbitrary point chosen _before_ the match starts eliminates using that kind of logic.

Matt said exactly this, didn't he?

UncombedCoconut · Sat Jan 16, 2010 9:21 am

Edmund wrote:
John Major wrote: Comparing continuously till a cutoff is called 'sequential analysis'. It requires a different table that takes into account the larger likelihood of error, because you make many comparisons.

It was developed during WW2 and deemed so important that it was classified till '45.
That's interesting...
How could this idea be ported to chess engine testing? I think a lot of effort (time and computer resources that could be used for different tests) could be saved if we didn't have to run through the whole tournament in case of obvious differences.

Is there a way to determine the cutoff win-% after n played games in case we want to have an error bar of, say, 1%?

It's quite doable. As GCP mentioned "the proposition to play until N wins more turns out to have zero confidence", but you can formulate tests where this N varies with the number of games. As a very first step, you can try this C++ program. It tells you which cutoff scores are OK to use after each game, and lets you choose where to draw the line. (Note: I wrote the program, and bugs may exist.)

There's a tradeoff whenever you add a way for the test to end early: since you risk early false positives, you have to lower your risk of later false positives. You have to pay for an inconclusive early test with more stringent later tests. As an extreme case, the program can optimize for earliest cutoffs when the strength difference is huge. If you do that, you'll see insane cutoffs for the score after large numbers of games.
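A simple way to see this tradeoff in numbers is a Bonferroni split of the type I error over a fixed list of interim looks. To be clear, this is a swapped-in textbook technique and not the method of the program linked above; it is deliberately conservative. The sketch assumes equal engines under the null hypothesis, a normal approximation to the score, and a user-supplied draw rate; `bonferroni_cutoffs` is a hypothetical name.

```python
from statistics import NormalDist

def bonferroni_cutoffs(looks, alpha=0.05, draw_rate=0.0):
    """Two-sided score-fraction cutoffs for a test that may stop at any of
    the game counts in `looks`. The total type I error alpha is split
    equally over the looks (Bonferroni), so more looks mean stricter cutoffs.
    Returns {games: upper cutoff}; the lower cutoff is 1 minus the upper."""
    z = NormalDist().inv_cdf(1 - alpha / (2 * len(looks)))
    # Per-game score is 0, 0.5 or 1; with equal engines and draw rate d
    # its variance is (1 - d) / 4.
    per_game_sd = (0.25 * (1 - draw_rate)) ** 0.5
    return {n: 0.5 + z * per_game_sd / n ** 0.5 for n in looks}

if __name__ == "__main__":
    print(bonferroni_cutoffs([100]))               # one look at the end
    print(bonferroni_cutoffs([25, 50, 75, 100]))   # three extra early looks
```

Comparing the two calls shows the price of the early exits: the cutoff at game 100 is higher when three earlier looks are allowed than when the test is evaluated only once.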

If you're still interested in the idea I'll develop it further. I think one would want to pick a threshold Elo difference to detect, and optimize the expected # of games needed to see it. I'd bet the resulting rules would be related to a Sequential Probability Ratio Test (with an extra rule about when to call the programs equal).
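For readers who have not met the Sequential Probability Ratio Test, here is a minimal sketch of Wald's basic version for win/loss results. It is not the rule set proposed above (there is no extra "call them equal" rule and draws are ignored), and the hypotheses `p0` and `p1`, the error rates, and the function name `sprt` are all assumptions chosen for illustration.

```python
import math

def sprt(results, p0=0.50, p1=0.55, alpha=0.05, beta=0.05):
    """Wald's SPRT on a sequence of booleans (True = win, False = loss).
    H0: win probability is p0.  H1: win probability is p1.
    alpha and beta are the target type I and type II error rates.
    Returns (decision, games_used), decision in {'H0', 'H1', 'continue'}."""
    lower = math.log(beta / (1 - alpha))   # accept H0 at or below this
    upper = math.log((1 - beta) / alpha)   # accept H1 at or above this
    win_step = math.log(p1 / p0)
    loss_step = math.log((1 - p1) / (1 - p0))
    llr = 0.0  # running log-likelihood ratio of H1 against H0
    n = 0
    for n, won in enumerate(results, start=1):
        llr += win_step if won else loss_step
        if llr >= upper:
            return "H1", n
        if llr <= lower:
            return "H0", n
    return "continue", n

if __name__ == "__main__":
    print(sprt([True] * 100))          # a clean sweep stops early
    print(sprt([True, False] * 10))    # an even score keeps the test running
```

The stopping bounds depend only on alpha and beta, so the number of games is not fixed in advance: lopsided results end the test quickly and close results keep it running.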

Edmund · Sat Jan 16, 2010 12:05 pm

UncombedCoconut wrote:
Edmund wrote: Is there a way to determine the cutoff win-% after n played games in case we want to have an error bar of, say, 1%?
It's quite doable. As GCP mentioned "the proposition to play until N wins more turns out to have zero confidence", but you can formulate tests where this N varies with the number of games. As a very first step, you can try this C++ program. It tells you which cutoff scores are OK to use after each game, and lets you choose where to draw the line. (Note: I wrote the program, and bugs may exist.)

There's a tradeoff whenever you add a way for the test to end early: since you risk early false positives, you have to lower your risk of later false positives. You have to pay for an inconclusive early test with more stringent later tests. As an extreme case, the program can optimize for earliest cutoffs when the strength difference is huge. If you do that, you'll see insane cutoffs for the score after large numbers of games.

If you're still interested in the idea I'll develop it further. I think one would want to pick a threshold Elo difference to detect, and optimize the expected # of games needed to see it. I'd bet the resulting rules would be related to a Sequential Probability Ratio Test (with an extra rule about when to call the programs equal).

Thank you! I haven't had time to think it through fully yet, but I quickly tested your code and it compiled all right. My first test was to compare the result to the one from the table I posted in my first post in this thread. I wasn't able to match the data...

Here is a graph of the table with the experimental data:
the three lines indicate the alpha values 0.05, 0.01 and 0.001; the x-axis (logarithmic) shows the predefined number of games, and the y-axis shows the cutoff win percentage.

So I compared the value alpha = 0.05; n=100 which was 57.5 in the table to the output from your program.

Here is what I get:

Allowable type I error (for two-sided test): 0.05
Maximum number of games: 100
Assumed draw probability (guess low): .32
Customize the cutoff values? (1: Yes; 0: Optimize for very different strength.)
0
After     1 games, continue if   0.0 <= score <=   1.0 (  0.0% - 100.0%). Cumulative type-I error 0.0000
After     2 games, continue if   0.0 <= score <=   2.0 (  0.0% - 100.0%). Cumulative type-I error 0.0000
After     3 games, continue if   0.0 <= score <=   3.0 (  0.0% - 100.0%). Cumulative type-I error 0.0000
After     4 games, continue if   0.5 <= score <=   3.5 ( 12.5% -  87.5%). Cumulative type-I error 0.0134
After     5 games, continue if   1.0 <= score <=   4.0 ( 20.0% -  80.0%). Cumulative type-I error 0.0347
After     6 games, continue if   1.0 <= score <=   5.0 ( 16.7% -  83.3%). Cumulative type-I error 0.0347
After     7 games, continue if   1.5 <= score <=   5.5 ( 21.4% -  78.6%). Cumulative type-I error 0.0434
After     8 games, continue if   1.5 <= score <=   6.5 ( 18.8% -  81.3%). Cumulative type-I error 0.0434
After     9 games, continue if   2.0 <= score <=   7.0 ( 22.2% -  77.8%). Cumulative type-I error 0.0474
After    10 games, continue if   2.0 <= score <=   8.0 ( 20.0% -  80.0%). Cumulative type-I error 0.0474
After    11 games, continue if   2.0 <= score <=   9.0 ( 18.2% -  81.8%). Cumulative type-I error 0.0474
After    12 games, continue if   2.5 <= score <=   9.5 ( 20.8% -  79.2%). Cumulative type-I error 0.0481
After    13 games, continue if   3.0 <= score <=  10.0 ( 23.1% -  76.9%). Cumulative type-I error 0.0495
After    14 games, continue if   3.0 <= score <=  11.0 ( 21.4% -  78.6%). Cumulative type-I error 0.0495
After    15 games, continue if   3.0 <= score <=  12.0 ( 20.0% -  80.0%). Cumulative type-I error 0.0495
After    16 games, continue if   3.5 <= score <=  12.5 ( 21.9% -  78.1%). Cumulative type-I error 0.0497
After    17 games, continue if   3.5 <= score <=  13.5 ( 20.6% -  79.4%). Cumulative type-I error 0.0497
After    18 games, continue if   4.0 <= score <=  14.0 ( 22.2% -  77.8%). Cumulative type-I error 0.0498
After    19 games, continue if   4.0 <= score <=  15.0 ( 21.1% -  78.9%). Cumulative type-I error 0.0498
After    20 games, continue if   4.5 <= score <=  15.5 ( 22.5% -  77.5%). Cumulative type-I error 0.0499
After    21 games, continue if   4.5 <= score <=  16.5 ( 21.4% -  78.6%). Cumulative type-I error 0.0499
After    22 games, continue if   5.0 <= score <=  17.0 ( 22.7% -  77.3%). Cumulative type-I error 0.0500
After    23 games, continue if   5.0 <= score <=  18.0 ( 21.7% -  78.3%). Cumulative type-I error 0.0500
After    24 games, continue if   5.0 <= score <=  19.0 ( 20.8% -  79.2%). Cumulative type-I error 0.0500
After    25 games, continue if   5.5 <= score <=  19.5 ( 22.0% -  78.0%). Cumulative type-I error 0.0500
After    26 games, continue if   5.5 <= score <=  20.5 ( 21.2% -  78.8%). Cumulative type-I error 0.0500
After    27 games, continue if   6.0 <= score <=  21.0 ( 22.2% -  77.8%). Cumulative type-I error 0.0500
After    28 games, continue if   6.0 <= score <=  22.0 ( 21.4% -  78.6%). Cumulative type-I error 0.0500
After    29 games, continue if   6.0 <= score <=  23.0 ( 20.7% -  79.3%). Cumulative type-I error 0.0500
After    30 games, continue if   6.5 <= score <=  23.5 ( 21.7% -  78.3%). Cumulative type-I error 0.0500
After    31 games, continue if   6.5 <= score <=  24.5 ( 21.0% -  79.0%). Cumulative type-I error 0.0500
After    32 games, continue if   6.5 <= score <=  25.5 ( 20.3% -  79.7%). Cumulative type-I error 0.0500
After    33 games, continue if   6.5 <= score <=  26.5 ( 19.7% -  80.3%). Cumulative type-I error 0.0500
After    34 games, continue if   7.0 <= score <=  27.0 ( 20.6% -  79.4%). Cumulative type-I error 0.0500
After    35 games, continue if   7.0 <= score <=  28.0 ( 20.0% -  80.0%). Cumulative type-I error 0.0500
After    36 games, continue if   7.5 <= score <=  28.5 ( 20.8% -  79.2%). Cumulative type-I error 0.0500
After    37 games, continue if   7.5 <= score <=  29.5 ( 20.3% -  79.7%). Cumulative type-I error 0.0500
After    38 games, continue if   8.0 <= score <=  30.0 ( 21.1% -  78.9%). Cumulative type-I error 0.0500
After    39 games, continue if   8.0 <= score <=  31.0 ( 20.5% -  79.5%). Cumulative type-I error 0.0500
After    40 games, continue if   8.0 <= score <=  32.0 ( 20.0% -  80.0%). Cumulative type-I error 0.0500
After    41 games, continue if   8.5 <= score <=  32.5 ( 20.7% -  79.3%). Cumulative type-I error 0.0500
After    42 games, continue if   8.5 <= score <=  33.5 ( 20.2% -  79.8%). Cumulative type-I error 0.0500
After    43 games, continue if   9.0 <= score <=  34.0 ( 20.9% -  79.1%). Cumulative type-I error 0.0500
After    44 games, continue if   9.0 <= score <=  35.0 ( 20.5% -  79.5%). Cumulative type-I error 0.0500
After    45 games, continue if   9.5 <= score <=  35.5 ( 21.1% -  78.9%). Cumulative type-I error 0.0500
After    46 games, continue if   9.5 <= score <=  36.5 ( 20.7% -  79.3%). Cumulative type-I error 0.0500
After    47 games, continue if   9.5 <= score <=  37.5 ( 20.2% -  79.8%). Cumulative type-I error 0.0500
After    48 games, continue if   9.5 <= score <=  38.5 ( 19.8% -  80.2%). Cumulative type-I error 0.0500
After    49 games, continue if  10.0 <= score <=  39.0 ( 20.4% -  79.6%). Cumulative type-I error 0.0500
After    50 games, continue if  10.0 <= score <=  40.0 ( 20.0% -  80.0%). Cumulative type-I error 0.0500
After    51 games, continue if  10.5 <= score <=  40.5 ( 20.6% -  79.4%). Cumulative type-I error 0.0500
After    52 games, continue if  10.5 <= score <=  41.5 ( 20.2% -  79.8%). Cumulative type-I error 0.0500
After    53 games, continue if  11.0 <= score <=  42.0 ( 20.8% -  79.2%). Cumulative type-I error 0.0500
After    54 games, continue if  11.0 <= score <=  43.0 ( 20.4% -  79.6%). Cumulative type-I error 0.0500
After    55 games, continue if  11.0 <= score <=  44.0 ( 20.0% -  80.0%). Cumulative type-I error 0.0500
After    56 games, continue if  11.5 <= score <=  44.5 ( 20.5% -  79.5%). Cumulative type-I error 0.0500
After    57 games, continue if  11.5 <= score <=  45.5 ( 20.2% -  79.8%). Cumulative type-I error 0.0500
After    58 games, continue if  12.0 <= score <=  46.0 ( 20.7% -  79.3%). Cumulative type-I error 0.0500
After    59 games, continue if  12.0 <= score <=  47.0 ( 20.3% -  79.7%). Cumulative type-I error 0.0500
After    60 games, continue if  12.0 <= score <=  48.0 ( 20.0% -  80.0%). Cumulative type-I error 0.0500
After    61 games, continue if  12.5 <= score <=  48.5 ( 20.5% -  79.5%). Cumulative type-I error 0.0500
After    62 games, continue if  12.5 <= score <=  49.5 ( 20.2% -  79.8%). Cumulative type-I error 0.0500
After    63 games, continue if  13.0 <= score <=  50.0 ( 20.6% -  79.4%). Cumulative type-I error 0.0500
After    64 games, continue if  13.0 <= score <=  51.0 ( 20.3% -  79.7%). Cumulative type-I error 0.0500
After    65 games, continue if  13.5 <= score <=  51.5 ( 20.8% -  79.2%). Cumulative type-I error 0.0500
After    66 games, continue if  13.5 <= score <=  52.5 ( 20.5% -  79.5%). Cumulative type-I error 0.0500
After    67 games, continue if  13.5 <= score <=  53.5 ( 20.1% -  79.9%). Cumulative type-I error 0.0500
After    68 games, continue if  14.0 <= score <=  54.0 ( 20.6% -  79.4%). Cumulative type-I error 0.0500
After    69 games, continue if  14.0 <= score <=  55.0 ( 20.3% -  79.7%). Cumulative type-I error 0.0500
After    70 games, continue if  14.5 <= score <=  55.5 ( 20.7% -  79.3%). Cumulative type-I error 0.0500
After    71 games, continue if  14.5 <= score <=  56.5 ( 20.4% -  79.6%). Cumulative type-I error 0.0500
After    72 games, continue if  15.0 <= score <=  57.0 ( 20.8% -  79.2%). Cumulative type-I error 0.0500
After    73 games, continue if  15.0 <= score <=  58.0 ( 20.5% -  79.5%). Cumulative type-I error 0.0500
After    74 games, continue if  15.0 <= score <=  59.0 ( 20.3% -  79.7%). Cumulative type-I error 0.0500
After    75 games, continue if  15.5 <= score <=  59.5 ( 20.7% -  79.3%). Cumulative type-I error 0.0500
After    76 games, continue if  15.5 <= score <=  60.5 ( 20.4% -  79.6%). Cumulative type-I error 0.0500
After    77 games, continue if  16.0 <= score <=  61.0 ( 20.8% -  79.2%). Cumulative type-I error 0.0500
After    78 games, continue if  16.0 <= score <=  62.0 ( 20.5% -  79.5%). Cumulative type-I error 0.0500
After    79 games, continue if  16.5 <= score <=  62.5 ( 20.9% -  79.1%). Cumulative type-I error 0.0500
After    80 games, continue if  16.5 <= score <=  63.5 ( 20.6% -  79.4%). Cumulative type-I error 0.0500
After    81 games, continue if  17.0 <= score <=  64.0 ( 21.0% -  79.0%). Cumulative type-I error 0.0500
After    82 games, continue if  17.0 <= score <=  65.0 ( 20.7% -  79.3%). Cumulative type-I error 0.0500
After    83 games, continue if  17.5 <= score <=  65.5 ( 21.1% -  78.9%). Cumulative type-I error 0.0500
After    84 games, continue if  17.5 <= score <=  66.5 ( 20.8% -  79.2%). Cumulative type-I error 0.0500
After    85 games, continue if  18.0 <= score <=  67.0 ( 21.2% -  78.8%). Cumulative type-I error 0.0500
After    86 games, continue if  18.0 <= score <=  68.0 ( 20.9% -  79.1%). Cumulative type-I error 0.0500
After    87 games, continue if  18.0 <= score <=  69.0 ( 20.7% -  79.3%). Cumulative type-I error 0.0500
After    88 games, continue if  18.0 <= score <=  70.0 ( 20.5% -  79.5%). Cumulative type-I error 0.0500
After    89 games, continue if  18.5 <= score <=  70.5 ( 20.8% -  79.2%). Cumulative type-I error 0.0500
After    90 games, continue if  18.5 <= score <=  71.5 ( 20.6% -  79.4%). Cumulative type-I error 0.0500
After    91 games, continue if  19.0 <= score <=  72.0 ( 20.9% -  79.1%). Cumulative type-I error 0.0500
After    92 games, continue if  19.0 <= score <=  73.0 ( 20.7% -  79.3%). Cumulative type-I error 0.0500
After    93 games, continue if  19.0 <= score <=  74.0 ( 20.4% -  79.6%). Cumulative type-I error 0.0500
After    94 games, continue if  19.5 <= score <=  74.5 ( 20.7% -  79.3%). Cumulative type-I error 0.0500
After    95 games, continue if  19.5 <= score <=  75.5 ( 20.5% -  79.5%). Cumulative type-I error 0.0500
After    96 games, continue if  20.0 <= score <=  76.0 ( 20.8% -  79.2%). Cumulative type-I error 0.0500
After    97 games, continue if  20.0 <= score <=  77.0 ( 20.6% -  79.4%). Cumulative type-I error 0.0500
After    98 games, continue if  20.5 <= score <=  77.5 ( 20.9% -  79.1%). Cumulative type-I error 0.0500
After    99 games, continue if  20.5 <= score <=  78.5 ( 20.7% -  79.3%). Cumulative type-I error 0.0500
After   100 games, continue if  20.5 <= score <=  79.5 ( 20.5% -  79.5%). Cumulative type-I error 0.0500

It is a much stricter margin indeed, but I would have imagined that the closer you approach the final value, the smaller the difference should become. More like in the following sketch:
