Someone would need to do the math
Edmund wrote: So margins have to be doubled then? ie use alpha of 0.5% instead of 1%?
Gian-Carlo Pascutto wrote: I think this has exactly the same error as the original proposition.

Where are the maths PhDs?
Moderator: Ras
Doesn't it just have to be calculated differently?
Gian-Carlo Pascutto wrote: I think this has exactly the same error as the original proposition.
This is the same nonsensical stuff I hear on the blackjack forums, where people claim that there are ways to "beat the game" without counting cards or shuffle-tracking. One common theme: play until you get ahead and then quit. And you do end up with more "winning sessions". But what happens when you hit a deep slump, as inevitably happens? You play until you get ahead, or you play until you run out of money. The latter is much more likely.
MattieShoes wrote: Stopping early because you're out of time should be the same as a shorter tournament.
Stopping early because of some sort of rating or result cutoff would screw up the confidence margins.
Take coin flipping.
Setting out for 1000 flips and stopping after 500 because you're bored is fine.
Setting out for 1000 flips and stopping once you have 20 more heads than tails is bad.
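To make the coin-flipping point concrete, here is a minimal simulation sketch in Python. It assumes a fair coin and estimates how often the "stop once heads lead by 20" rule fires within 1000 flips even though there is no real bias. The function names, the trial count and the seed are illustrative choices, not anything from the thread.

```python
import random

def hits_lead(max_flips, lead, rng):
    """Return True if heads ever lead tails by `lead` within max_flips fair flips."""
    diff = 0
    for _ in range(max_flips):
        diff += 1 if rng.random() < 0.5 else -1
        if diff >= lead:
            return True  # the experimenter would stop here and declare "heads is better"
    return False

def early_stop_rate(trials=2000, max_flips=1000, lead=20, seed=1):
    """Fraction of fair-coin runs in which the lead-based stopping rule triggers."""
    rng = random.Random(seed)  # fixed seed (an assumption) so runs are repeatable
    return sum(hits_lead(max_flips, lead, rng) for _ in range(trials)) / trials

if __name__ == "__main__":
    print(f"stop rule fired in {early_stop_rate():.1%} of fair-coin runs")
```

With these assumed settings the rule fires in roughly half of the runs, so "heads led by 20 when I stopped" says very little about the coin. Stopping at a flip count fixed in advance has no such problem.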
But there are plenty of ways to trick the experimenter. Most of the statistical tools in use rely on an assumption of good sampling. Random sampling is one of the better methods to use, but you need to know what it is you are sampling. If you are trying to determine the strength of your engine as it compares to the other engines out there, then you want to be taking random samples from the types of tournaments it's likely to be playing in. If what matters to you is CCRL rating, then you'll want a random sampling of CCRL opponents, at CCRL time controls, using CCRL opening books. If you are looking for the best WCCC performance, you should adjust your parameters to satisfy the WCCC conditions. If you want to speed up the testing by using very fast time controls, you should do some work to make sure that your engine's (and its opponents') results correlate across different time controls. If your testing is based on opening positions, and those positions aren't a representative sample of what your engine might play, you have picked the wrong positions.
Edmund wrote: Thanks for the kind answers. This makes sense.
So no way to trick statistics then ..
It is OK to stop whenever the heck you want. Not doing so is OCD. There are many factors to take into account, and efficiency is sometimes more important than accuracy, particularly in intermediate stages of development. Resources and time are limited, not to mention patience. Not every experiment in science should be subjected to statistics.
BubbaTough wrote: Well, there sure are a lot of purists out here.
I stop tests early, and theory be damned! For example, if there is a certain score I need to get in my testbed to use a new feature, and it becomes impossible to reach that score even if I win the rest of my games, why not stop the test? Heck, even if it is possible, but really really unlikely (like winning 180 / 200 against opponents that I have been scoring 50% against) I might stop the test if I happen to be watching.
Not stopping tests is a luxury for people who have a lot of computing power or time on their hands. If you don't, then you just have to take a little care not to stop too early (like if you win 1 out of 5 of your games with a target score of 50%, and are planning to run 1000 games, it's a bit premature to stop...but if you win 100 out of 500...it probably is fine to stop).
It would be nice to formalize when it's OK to stop, and I am sure it is possible despite what others are likely to say, but so far I haven't bothered.
-Sam
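The two informal stopping reasons described in the post above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: draws are ignored (every game is a win or a loss), the win probability is taken as known, and the helper names `target_unreachable` and `prob_at_least` are made up for this example.

```python
from math import comb

def target_unreachable(score, games_played, total_games, target_score):
    """True if the target cannot be reached even by winning every remaining game."""
    remaining = total_games - games_played
    return score + remaining < target_score

def prob_at_least(wins_needed, games, p_win):
    """Exact binomial probability of at least `wins_needed` wins in `games` games.

    Assumes independent games with a fixed win probability and no draws.
    """
    start = max(wins_needed, 0)
    return sum(
        comb(games, k) * p_win**k * (1 - p_win) ** (games - k)
        for k in range(start, games + 1)
    )

if __name__ == "__main__":
    # Target of 500/1000 with only 100 points after 900 games: hopeless.
    print(target_unreachable(100, 900, 1000, 500))
    # Chance of winning at least 180 of 200 games at a 50% win rate.
    print(prob_at_least(180, 200, 0.5))
```

The first check is always safe, because it never changes the final pass/fail outcome. The second is a judgment call: the probability is tiny but not zero, so stopping on it trades a very small chance of a wrong decision for saved games.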
What here is nonsense? To me it makes perfect sense. (The only possible loophole is that there might be a connection between being bored and the result, but no example is perfect.)
bob wrote: This is the same nonsensical stuff
MattieShoes wrote: Stopping early because you're out of time should be the same as a shorter tournament.
Stopping early because of some sort of rating or result cutoff would screw up the confidence margins.
Take coin flipping.
Setting out for 1000 flips and stopping after 500 because you're bored is fine.
Setting out for 1000 flips and stopping once you have 20 more heads than tails is bad.
I invented this technique when I was ten, used it successfully for gambling for six months, then thought more about it, realized it doesn't work, and stopped.
bob wrote: I hear on the blackjack forums where people claim that there are ways to "beat the game" without counting cards or shuffle-tracking. One common theme. Play until you get ahead and then quit.
Matt said exactly this, didn't he?
SO stopping a test at some arbitrary point chosen _before_ the match starts eliminates using that kind of logic.
It's quite doable. As GCP mentioned, "the proposition to play until N wins more turns out to have zero confidence", but you can formulate tests where this N varies with the number of games. As a very first step, you can try this C++ program. It tells you which cutoff scores are OK to use after each game, and lets you choose where to draw the line. (Note: I wrote the program, and bugs may exist.)
Edmund wrote: That's interesting ...
John Major wrote: Comparing continuously till a cutoff is called 'sequential analysis'. It requires a different table that takes into account the larger likelihood of error, because you make many comparisons.
It was developed during WW2 and deemed so important that it was classified till '45.
How could this idea be ported to chess engine testing? I think a lot of effort (time, computer resources that could be used for different tests) could be saved if we didn't have to run through the whole tournament in case of obvious differences.
Is there a way to determine the cutoff win-% after n played games in case we want to have an error bar of, say, 1%?
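One classical form of the sequential analysis mentioned above is Wald's sequential probability ratio test (SPRT). The following Python sketch shows the basic mechanics under strong simplifying assumptions: draws are ignored, each game is an independent win (1) or loss (0), and the two hypotheses are fixed win probabilities `p0` and `p1` chosen by the tester. The function names and default error rates are illustrative; this is not the C++ program discussed in this thread.

```python
from math import log

def sprt_bounds(alpha, beta):
    """Wald's approximate lower/upper thresholds for the log-likelihood ratio."""
    return log(beta / (1 - alpha)), log((1 - beta) / alpha)

def sprt(results, p0, p1, alpha=0.05, beta=0.05):
    """Run an SPRT over a sequence of results (1 = win, 0 = loss).

    H0: win probability is p0; H1: win probability is p1 (with p1 > p0).
    Returns (decision, games_used), where decision is 'accept H1',
    'accept H0', or 'continue' if the data ran out before a boundary was hit.
    """
    lower, upper = sprt_bounds(alpha, beta)
    llr = 0.0  # running log-likelihood ratio of H1 versus H0
    n = 0
    for n, won in enumerate(results, start=1):
        llr += log(p1 / p0) if won else log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept H1", n
        if llr <= lower:
            return "accept H0", n
    return "continue", n

if __name__ == "__main__":
    print(sprt([1] * 20, 0.5, 0.6))      # a straight run of wins
    print(sprt([0] * 20, 0.5, 0.6))      # a straight run of losses
    print(sprt([1, 0] * 10, 0.5, 0.6))   # alternating results
```

With these inputs the run of wins is accepted as H1 after 17 games, the run of losses is accepted as H0 after 14 games, and the alternating sequence stays undecided after all 20 games. The thresholds depend only on the chosen error rates, which is what lets the test be checked after every game without inflating the error.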
Thank you! I haven't had time to think it through fully yet, but I quickly tested your code and it compiled all right. My first test was then to compare the result to the one from the table I posted in my first post in this thread. I wasn't able to match the data ...
UncombedCoconut wrote: It's quite doable. As GCP mentioned "the proposition to play until N wins more turns out to have zero confidence", but you can formulate tests where this N varies with the number of games. As a very first step, you can try this C++ program. It tells you which cutoff scores are OK to use after each game, and lets you choose where to draw the line. (Note: I wrote the program, and bugs may exist.)
Edmund wrote: Is there a way to determine the cutoff win-% after n played games in case we want to have an errorbar of say 1% ?
There's a tradeoff whenever you add a way for the test to end early: since you risk early false positives, you have to lower your risk of later false positives. You have to pay for an inconclusive early test with more stringent later tests. As an extreme case, the program can optimize for earliest cutoffs when the strength difference is huge. If you do that, you'll see insane cutoffs for the score after large numbers of games.
If you're still interested in the idea I'll develop it further. I think one would want to pick a threshold Elo difference to detect, and optimize the expected number of games needed to see it. I'd bet the resulting rules would be related to a Sequential Probability Ratio Test (with an extra rule about when to call the programs equal).
Code:
Allowable type I error (for two-sided test): 0.05
Maximum number of games: 100
Assumed draw probability (guess low): .32
Customize the cutoff values? (1: Yes; 0: Optimize for very different strength.)
0
After 1 games, continue if 0.0 <= score <= 1.0 ( 0.0% - 100.0%). Cumulative type-I error 0.0000
After 2 games, continue if 0.0 <= score <= 2.0 ( 0.0% - 100.0%). Cumulative type-I error 0.0000
After 3 games, continue if 0.0 <= score <= 3.0 ( 0.0% - 100.0%). Cumulative type-I error 0.0000
After 4 games, continue if 0.5 <= score <= 3.5 ( 12.5% - 87.5%). Cumulative type-I error 0.0134
After 5 games, continue if 1.0 <= score <= 4.0 ( 20.0% - 80.0%). Cumulative type-I error 0.0347
After 6 games, continue if 1.0 <= score <= 5.0 ( 16.7% - 83.3%). Cumulative type-I error 0.0347
After 7 games, continue if 1.5 <= score <= 5.5 ( 21.4% - 78.6%). Cumulative type-I error 0.0434
After 8 games, continue if 1.5 <= score <= 6.5 ( 18.8% - 81.3%). Cumulative type-I error 0.0434
After 9 games, continue if 2.0 <= score <= 7.0 ( 22.2% - 77.8%). Cumulative type-I error 0.0474
After 10 games, continue if 2.0 <= score <= 8.0 ( 20.0% - 80.0%). Cumulative type-I error 0.0474
After 11 games, continue if 2.0 <= score <= 9.0 ( 18.2% - 81.8%). Cumulative type-I error 0.0474
After 12 games, continue if 2.5 <= score <= 9.5 ( 20.8% - 79.2%). Cumulative type-I error 0.0481
After 13 games, continue if 3.0 <= score <= 10.0 ( 23.1% - 76.9%). Cumulative type-I error 0.0495
After 14 games, continue if 3.0 <= score <= 11.0 ( 21.4% - 78.6%). Cumulative type-I error 0.0495
After 15 games, continue if 3.0 <= score <= 12.0 ( 20.0% - 80.0%). Cumulative type-I error 0.0495
After 16 games, continue if 3.5 <= score <= 12.5 ( 21.9% - 78.1%). Cumulative type-I error 0.0497
After 17 games, continue if 3.5 <= score <= 13.5 ( 20.6% - 79.4%). Cumulative type-I error 0.0497
After 18 games, continue if 4.0 <= score <= 14.0 ( 22.2% - 77.8%). Cumulative type-I error 0.0498
After 19 games, continue if 4.0 <= score <= 15.0 ( 21.1% - 78.9%). Cumulative type-I error 0.0498
After 20 games, continue if 4.5 <= score <= 15.5 ( 22.5% - 77.5%). Cumulative type-I error 0.0499
After 21 games, continue if 4.5 <= score <= 16.5 ( 21.4% - 78.6%). Cumulative type-I error 0.0499
After 22 games, continue if 5.0 <= score <= 17.0 ( 22.7% - 77.3%). Cumulative type-I error 0.0500
After 23 games, continue if 5.0 <= score <= 18.0 ( 21.7% - 78.3%). Cumulative type-I error 0.0500
After 24 games, continue if 5.0 <= score <= 19.0 ( 20.8% - 79.2%). Cumulative type-I error 0.0500
After 25 games, continue if 5.5 <= score <= 19.5 ( 22.0% - 78.0%). Cumulative type-I error 0.0500
After 26 games, continue if 5.5 <= score <= 20.5 ( 21.2% - 78.8%). Cumulative type-I error 0.0500
After 27 games, continue if 6.0 <= score <= 21.0 ( 22.2% - 77.8%). Cumulative type-I error 0.0500
After 28 games, continue if 6.0 <= score <= 22.0 ( 21.4% - 78.6%). Cumulative type-I error 0.0500
After 29 games, continue if 6.0 <= score <= 23.0 ( 20.7% - 79.3%). Cumulative type-I error 0.0500
After 30 games, continue if 6.5 <= score <= 23.5 ( 21.7% - 78.3%). Cumulative type-I error 0.0500
After 31 games, continue if 6.5 <= score <= 24.5 ( 21.0% - 79.0%). Cumulative type-I error 0.0500
After 32 games, continue if 6.5 <= score <= 25.5 ( 20.3% - 79.7%). Cumulative type-I error 0.0500
After 33 games, continue if 6.5 <= score <= 26.5 ( 19.7% - 80.3%). Cumulative type-I error 0.0500
After 34 games, continue if 7.0 <= score <= 27.0 ( 20.6% - 79.4%). Cumulative type-I error 0.0500
After 35 games, continue if 7.0 <= score <= 28.0 ( 20.0% - 80.0%). Cumulative type-I error 0.0500
After 36 games, continue if 7.5 <= score <= 28.5 ( 20.8% - 79.2%). Cumulative type-I error 0.0500
After 37 games, continue if 7.5 <= score <= 29.5 ( 20.3% - 79.7%). Cumulative type-I error 0.0500
After 38 games, continue if 8.0 <= score <= 30.0 ( 21.1% - 78.9%). Cumulative type-I error 0.0500
After 39 games, continue if 8.0 <= score <= 31.0 ( 20.5% - 79.5%). Cumulative type-I error 0.0500
After 40 games, continue if 8.0 <= score <= 32.0 ( 20.0% - 80.0%). Cumulative type-I error 0.0500
After 41 games, continue if 8.5 <= score <= 32.5 ( 20.7% - 79.3%). Cumulative type-I error 0.0500
After 42 games, continue if 8.5 <= score <= 33.5 ( 20.2% - 79.8%). Cumulative type-I error 0.0500
After 43 games, continue if 9.0 <= score <= 34.0 ( 20.9% - 79.1%). Cumulative type-I error 0.0500
After 44 games, continue if 9.0 <= score <= 35.0 ( 20.5% - 79.5%). Cumulative type-I error 0.0500
After 45 games, continue if 9.5 <= score <= 35.5 ( 21.1% - 78.9%). Cumulative type-I error 0.0500
After 46 games, continue if 9.5 <= score <= 36.5 ( 20.7% - 79.3%). Cumulative type-I error 0.0500
After 47 games, continue if 9.5 <= score <= 37.5 ( 20.2% - 79.8%). Cumulative type-I error 0.0500
After 48 games, continue if 9.5 <= score <= 38.5 ( 19.8% - 80.2%). Cumulative type-I error 0.0500
After 49 games, continue if 10.0 <= score <= 39.0 ( 20.4% - 79.6%). Cumulative type-I error 0.0500
After 50 games, continue if 10.0 <= score <= 40.0 ( 20.0% - 80.0%). Cumulative type-I error 0.0500
After 51 games, continue if 10.5 <= score <= 40.5 ( 20.6% - 79.4%). Cumulative type-I error 0.0500
After 52 games, continue if 10.5 <= score <= 41.5 ( 20.2% - 79.8%). Cumulative type-I error 0.0500
After 53 games, continue if 11.0 <= score <= 42.0 ( 20.8% - 79.2%). Cumulative type-I error 0.0500
After 54 games, continue if 11.0 <= score <= 43.0 ( 20.4% - 79.6%). Cumulative type-I error 0.0500
After 55 games, continue if 11.0 <= score <= 44.0 ( 20.0% - 80.0%). Cumulative type-I error 0.0500
After 56 games, continue if 11.5 <= score <= 44.5 ( 20.5% - 79.5%). Cumulative type-I error 0.0500
After 57 games, continue if 11.5 <= score <= 45.5 ( 20.2% - 79.8%). Cumulative type-I error 0.0500
After 58 games, continue if 12.0 <= score <= 46.0 ( 20.7% - 79.3%). Cumulative type-I error 0.0500
After 59 games, continue if 12.0 <= score <= 47.0 ( 20.3% - 79.7%). Cumulative type-I error 0.0500
After 60 games, continue if 12.0 <= score <= 48.0 ( 20.0% - 80.0%). Cumulative type-I error 0.0500
After 61 games, continue if 12.5 <= score <= 48.5 ( 20.5% - 79.5%). Cumulative type-I error 0.0500
After 62 games, continue if 12.5 <= score <= 49.5 ( 20.2% - 79.8%). Cumulative type-I error 0.0500
After 63 games, continue if 13.0 <= score <= 50.0 ( 20.6% - 79.4%). Cumulative type-I error 0.0500
After 64 games, continue if 13.0 <= score <= 51.0 ( 20.3% - 79.7%). Cumulative type-I error 0.0500
After 65 games, continue if 13.5 <= score <= 51.5 ( 20.8% - 79.2%). Cumulative type-I error 0.0500
After 66 games, continue if 13.5 <= score <= 52.5 ( 20.5% - 79.5%). Cumulative type-I error 0.0500
After 67 games, continue if 13.5 <= score <= 53.5 ( 20.1% - 79.9%). Cumulative type-I error 0.0500
After 68 games, continue if 14.0 <= score <= 54.0 ( 20.6% - 79.4%). Cumulative type-I error 0.0500
After 69 games, continue if 14.0 <= score <= 55.0 ( 20.3% - 79.7%). Cumulative type-I error 0.0500
After 70 games, continue if 14.5 <= score <= 55.5 ( 20.7% - 79.3%). Cumulative type-I error 0.0500
After 71 games, continue if 14.5 <= score <= 56.5 ( 20.4% - 79.6%). Cumulative type-I error 0.0500
After 72 games, continue if 15.0 <= score <= 57.0 ( 20.8% - 79.2%). Cumulative type-I error 0.0500
After 73 games, continue if 15.0 <= score <= 58.0 ( 20.5% - 79.5%). Cumulative type-I error 0.0500
After 74 games, continue if 15.0 <= score <= 59.0 ( 20.3% - 79.7%). Cumulative type-I error 0.0500
After 75 games, continue if 15.5 <= score <= 59.5 ( 20.7% - 79.3%). Cumulative type-I error 0.0500
After 76 games, continue if 15.5 <= score <= 60.5 ( 20.4% - 79.6%). Cumulative type-I error 0.0500
After 77 games, continue if 16.0 <= score <= 61.0 ( 20.8% - 79.2%). Cumulative type-I error 0.0500
After 78 games, continue if 16.0 <= score <= 62.0 ( 20.5% - 79.5%). Cumulative type-I error 0.0500
After 79 games, continue if 16.5 <= score <= 62.5 ( 20.9% - 79.1%). Cumulative type-I error 0.0500
After 80 games, continue if 16.5 <= score <= 63.5 ( 20.6% - 79.4%). Cumulative type-I error 0.0500
After 81 games, continue if 17.0 <= score <= 64.0 ( 21.0% - 79.0%). Cumulative type-I error 0.0500
After 82 games, continue if 17.0 <= score <= 65.0 ( 20.7% - 79.3%). Cumulative type-I error 0.0500
After 83 games, continue if 17.5 <= score <= 65.5 ( 21.1% - 78.9%). Cumulative type-I error 0.0500
After 84 games, continue if 17.5 <= score <= 66.5 ( 20.8% - 79.2%). Cumulative type-I error 0.0500
After 85 games, continue if 18.0 <= score <= 67.0 ( 21.2% - 78.8%). Cumulative type-I error 0.0500
After 86 games, continue if 18.0 <= score <= 68.0 ( 20.9% - 79.1%). Cumulative type-I error 0.0500
After 87 games, continue if 18.0 <= score <= 69.0 ( 20.7% - 79.3%). Cumulative type-I error 0.0500
After 88 games, continue if 18.0 <= score <= 70.0 ( 20.5% - 79.5%). Cumulative type-I error 0.0500
After 89 games, continue if 18.5 <= score <= 70.5 ( 20.8% - 79.2%). Cumulative type-I error 0.0500
After 90 games, continue if 18.5 <= score <= 71.5 ( 20.6% - 79.4%). Cumulative type-I error 0.0500
After 91 games, continue if 19.0 <= score <= 72.0 ( 20.9% - 79.1%). Cumulative type-I error 0.0500
After 92 games, continue if 19.0 <= score <= 73.0 ( 20.7% - 79.3%). Cumulative type-I error 0.0500
After 93 games, continue if 19.0 <= score <= 74.0 ( 20.4% - 79.6%). Cumulative type-I error 0.0500
After 94 games, continue if 19.5 <= score <= 74.5 ( 20.7% - 79.3%). Cumulative type-I error 0.0500
After 95 games, continue if 19.5 <= score <= 75.5 ( 20.5% - 79.5%). Cumulative type-I error 0.0500
After 96 games, continue if 20.0 <= score <= 76.0 ( 20.8% - 79.2%). Cumulative type-I error 0.0500
After 97 games, continue if 20.0 <= score <= 77.0 ( 20.6% - 79.4%). Cumulative type-I error 0.0500
After 98 games, continue if 20.5 <= score <= 77.5 ( 20.9% - 79.1%). Cumulative type-I error 0.0500
After 99 games, continue if 20.5 <= score <= 78.5 ( 20.7% - 79.3%). Cumulative type-I error 0.0500
After 100 games, continue if 20.5 <= score <= 79.5 ( 20.5% - 79.5%). Cumulative type-I error 0.0500