Engine Testing - Statistics

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Uri Blass
Posts: 10267
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Engine Testing - Statistics

Post by Uri Blass »

zamar wrote:
bob wrote: That said, one can compute how bad things might get, worse case, or best case, and choose to stop if the results drop below this threshold, or if they go above. But you have to be _sure_ you are willing to accept the resulting inaccuracy. Simply makes more sense to have a fixed stop point set before you start. Anything else is based on a fallacy that is well-known but not well-understood. Sudden up-swings or down-swings are expected. And they can't be what you use to measure overeall results.
I think that most of us in here agree what you are saying, but I want to express things from a different point of view:

* As an inexperienced programmer I _often_ write patches which are extremely bad. In Stockfish development we have a rule that each patch which gets accepted must be tested in a 1000-game match versus the original.

* Now if after 150 games the result is Mod - Orig: 50 - 100, there is no point in continuing any more; the patch is complete garbage. But I'd like a scientifically valid condition for when to stop.

* Another example: if after 1000 games the result is +30 Elo (which is rare, but does sometimes happen!), there is no need for further games. But if it is only, say, +6 Elo, there might be a need for another 1000-game match. I'd really like to know what happens to the error bars in this case. Because it is now a question of conditional probability, I cannot use the usual formula for the error bars.

In short, this comes down to a question: we want to commit patches which are improvements and discard the rest. We want the success rate to be X% (for example 90%). What is the optimal testing strategy if we want to minimize the number of games?
I think that the target is not correct.

The target is to improve the engine, and I see no reason to decide on a constant X% success rate.

If after many games you do not get a significant result, you may still want to accept the version that scored better, even with less than X% success rate: even if 6 of these changes gain 1 Elo and 4 of them lose 1 Elo, you still have a 2 Elo improvement.

I think that a good testing strategy may be something like this (you can replace 10,000 and 200 with different numbers):

1) For every change, play 10,000 games or fewer.
2) If the new program leads by 200 points (relative to the old program), then accept the change (that is, if you see a result like 2600:2400, you accept the change).
3) If the old program leads by 200 points, reject the change.
4) In other cases, choose the winner (you do not know whether the change is good or bad, but in most cases the program that scored better is better, and you do not care if the probability that the change is positive is only 60%).

Note that this strategy is also possible for people who do not like testing an engine against its previous version: they can simply compare the result of version X against some opponents after N games with the result of version X+1 against the same opponents after N games.
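As an illustration, here is a minimal sketch of the rule described above in Python; the 10,000-game cap and the 200-point margin are just the placeholder numbers from the post, and a tie at the cap is resolved in favour of the old version.

```python
def decide(new_points, old_points, games_played, max_games=10_000, margin=200):
    """Sketch of the rule above: accept or reject as soon as one side leads
    by `margin` points, otherwise keep playing until `max_games` and then
    simply pick whichever version scored more (old version kept on a tie)."""
    lead = new_points - old_points            # draws cancel out of the lead
    if lead >= margin:
        return "accept"                       # e.g. 2600:2400 for the new version
    if lead <= -margin:
        return "reject"
    if games_played >= max_games:
        return "accept" if lead > 0 else "reject"
    return "keep playing"

print(decide(2600, 2400, 5000))   # -> "accept"
```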

Uri
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: Engine Testing - Statistics

Post by mcostalba »

Uri Blass wrote: 4) In other cases, choose the winner
So after 6 months you have grown the engine into a total mess :-(

It is important to know whether a change works or not, because many changes are correlated and you come up with ideas B and C only after you have found that idea A is good.

Another problem is that your engine becomes feature-inflated and you end up throwing in garbage together with the good stuff.

I prefer to add 1 proven good idea rather than 5 good and 5 bad ones, even if the result is less in terms of Elo; for program maintainability and as a reference for future research/studies, I want to know that what is in the engine is useful and does work.
Edmund
Posts: 670
Joined: Mon Dec 03, 2007 3:01 pm
Location: Barcelona, Spain

Re: Engine Testing - Statistics

Post by Edmund »

UncombedCoconut wrote:
Edmund wrote: Great idea! I didn't think of that. The problem I see, though, is that you may worsen rounding errors that way. Do you have any ideas on how to traverse the permutations most efficiently, given that win is changed last (to generate the output), and how to minimize the rounding errors (so maybe go from biggest to smallest)?
Sure, rounding errors could happen. I wouldn't worry much, because none of the operations cancel significant digits. Furthermore, you do the same number of ops regardless of the order of recursion (since you increment w, d, and l the same number of times). So I'd make an arbitrary choice (and also memoize results if my goal was to output a table), then compare answers after changing doubles to floats in the program to gauge the roundoff errors.
As I said... great idea. I was just able to generate a LOS table for >10000 games in a few seconds.
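(For readers who want to reproduce such a table without the recursion discussed above: one common definition of LOS, with draws treated as uninformative and a flat prior on the win probability, reduces to a regularized incomplete beta function, which scipy provides directly. This is a sketch of that definition, not necessarily the exact computation Edmund used, so the figures may differ slightly from his table.)

```python
from scipy.stats import beta

def los(wins, losses):
    """Likelihood of superiority: P(true win probability > 0.5 | wins, losses),
    assuming a flat prior and ignoring draws as uninformative."""
    return beta.sf(0.5, wins + 1, losses + 1)

print(los(60, 40))   # roughly 0.98 for a 20-point lead over 100 decisive games
```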
zamar
Posts: 613
Joined: Sun Jan 18, 2009 7:03 am

Re: Engine Testing - Statistics

Post by zamar »

BTW sequential probability ratio testing ( http://en.wikipedia.org/wiki/Sequential ... ratio_test ) is not easy, and you don't get it by reading some introductory statistics book.
No idea. But you do get it by visiting any of the online blackjack forums and starting a discussion of "progressive betting strategies". There are world-class statisticians there who will be quite happy to explain why the idea is broken. It happens every day. :)
Sorry Bob, but I don't get what you are trying to say.

I understand full well why "progressive betting strategies" won't work, because I could prove it when I was a kid, but I can't see the connection between "progressive betting strategies" and "determining chess engine relative strength through a sequential probability ratio test".

I'll explain my view once more: as the match between two engines goes on, every new result provides us with information. And whenever we get information, probabilities change. Now the question is how we can use that information to adjust the match length so that we reach the wanted confidence level. You say that there is no way, and that's false, and very easy to prove. Here is a counter-example: suppose we want to play a 1000-game match and accept the confidence level we get from it. After 700 games the match A-B stands at 700-0. You claim "because progressive betting strategies won't work, you have to play the full 1000 games before we can know anything at the wanted confidence level", although it can clearly be seen that the match could at worst end 700-300, so A would be the clear winner at the wanted confidence level and we can immediately stop the match.

Of course 700-0 is just an extreme, but then you start softening it step by step, 699-1, 698-2, ...; and when can we stop? That's the question we are looking to answer...
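The extreme case is also the one kind of early stop nobody can object to: if even losing every remaining game would still leave the final score above the threshold the predefined fixed-length test would have required, stopping changes nothing. A minimal sketch of that check (the 44-point threshold in the example is invented for illustration):

```python
def can_stop_early(points_a, points_b, games_played, total_games, final_margin):
    """True if A's lead is already decided: even if A loses every remaining
    game, the final margin still meets `final_margin`, i.e. the cutoff the
    predefined fixed-length test would apply at the end."""
    remaining = total_games - games_played
    worst_case_lead = (points_a - points_b) - remaining   # A loses all remaining games
    return worst_case_lead >= final_margin

# Joona's example: 700-0 after 700 of 1000 games; suppose the fixed-length
# test would have required a 44-point lead at the end (an invented threshold).
print(can_stop_early(700, 0, 700, 1000, 44))   # -> True: the worst case is still 700-300
```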
Joona Kiiski
Edmund
Posts: 670
Joined: Mon Dec 03, 2007 3:01 pm
Location: Barcelona, Spain

Re: Engine Testing - Statistics

Post by Edmund »

UncombedCoconut wrote:
Edmund wrote: Back to the original problem; any ideas on how we could use these new findings to answer the question of when to stop, without the need for a simulation?
Well, now we have (assuming my code doesn't have more nasty bugs) a huge family of statistically sound tests, some more aggressive than others about stopping early. To choose between them you'd need to guess how probable each Elo difference is, since that determines the average number of games these tests require. I haven't done this yet, but I'd start with a very crude guess and see whether you can even get a significant improvement in average testing time vs. the standard method of running a fixed number of games.
Once again I thought about this "stopping early" topic and I think finally I have found the answer.

I'll demonstrate it again with the figures from my LOS table: alpha = 5%; tournament size 200 games (predefined); the question is whether we can stop after 100 games:

According to the table you need to be 20 games ahead after 200 games to be superior at the 5% margin.

If we still have 100 games to go, and we want to know with 95% certainty that the engine is better, we have to make sure that with 95% certainty it will still finish 20 games ahead of the previous version.

This value can be deduced from the table too: with >95% certainty you will lose less than 14 points over 100 games if you are stronger.

So we have to add 14+20 to find that we have to be 34 points ahead.


Now finally, also in % scores:
to be significantly stronger after 200 games one needs to score 55%; one may stop the tournament after 100 games with no loss of significance at a score of 67%.
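For anyone who wants to play with these numbers, here is a rough sketch of the same construction using a normal approximation instead of the exact table. The 30% draw rate is an assumption chosen only so the approximation lands near the figures above; with a different draw model the thresholds shift, and the exact table remains the reference.

```python
from math import sqrt
from statistics import NormalDist

def thresholds(total_games=200, checkpoint=100, alpha=0.05, draw_rate=0.30):
    """Normal-approximation version of the construction above.
    Margin = points(new) - points(old); between equal engines each game moves
    the margin by +1 or -1 with probability (1 - draw_rate)/2 each, and by 0
    on a draw, so Var(margin over m games) = m * (1 - draw_rate)."""
    z = NormalDist().inv_cdf(1 - alpha)                  # one-sided quantile, ~1.645 for 5%
    final_margin = z * sqrt(total_games * (1 - draw_rate))               # lead needed at the end
    worst_drop = z * sqrt((total_games - checkpoint) * (1 - draw_rate))  # 95% bound on later losses
    early_margin = final_margin + worst_drop

    def score(margin, n):            # convert a points lead into a score fraction
        return 0.5 + margin / (2 * n)

    return (round(final_margin), score(round(final_margin), total_games),
            round(early_margin), score(round(early_margin), checkpoint))

print(thresholds())   # roughly (19, 0.55, 33, 0.67): close to the 20 / 55% and 34 / 67% figures above
```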


regards,
Edmund
UncombedCoconut
Posts: 319
Joined: Fri Dec 18, 2009 11:40 am
Location: Naperville, IL

Re: Engine Testing - Statistics

Post by UncombedCoconut »

bob wrote: You are doing the same thing with the Elo information. It goes up and down, as the data I posted clearly shows. If you stop when it looks good, the test is meaningless. Yes, if you totally bust the code it will plummet like a rock and you could stop quickly when the results look horrible. But to make the test statistically valid, the "stop point" has to be something set _before_ the test starts, not something that is based on the test results themselves while the match is "in progress".

That said, one can compute how bad things might get, worst case, or best case, and choose to stop if the results drop below this threshold, or if they go above. But you have to be _sure_ you are willing to accept the resulting inaccuracy. It simply makes more sense to have a fixed stop point set before you start. Anything else is based on a fallacy that is well-known but not well-understood. Sudden up-swings or down-swings are expected. And they can't be what you use to measure overall results.
I apologize if you understand this point, but: You can eliminate the resulting inaccuracy [*] if you adjust the thresholds. It is not the structure of a test (i.e., whether or not it allows a conditional stop in the middle) which makes it valid or invalid. Rather, it is the test's probabilities of failure -- of passing off an equal engine as improved, and of failing to notice an improvement. I'd like to illustrate with a simplified example (to which the OP's table is relevant).

Suppose you're testing whether your engine is stronger after some very promising change, and you want to get 95% confidence or else throw out the change.

Basic approach: Run 1000 games. If you have at least 52.2% wins, keep the change.

Gambler's approach (which you correctly criticize): Run 10 games. If you have at least 75% wins, smile and feel lucky. Else run another 10. If you have at least 67.7% wins, smile and feel lucky. Else play the remaining 980 games and apply the above test.

Why is the gambler wrong? He wanted a 5% chance of an equal engine passing the test for superiority. Instead he got a 5% chance of it doing so after 10 games, a 95% * 5% chance of it doing so after 20 games, and a 95% * 95% * 5% chance of it doing so after 1000. Whoops, that's a 14% chance! So disaster struck because he changed the structure of the test without changing the parameters.

Statistician's approach: Since the change is promising we'd like to be able to vet it after fewer games, like the gambler, but we need the same level of confidence. We could use the winning percentages from the 1% column. Now your probability of a false positive is 1% + 99% * 1% + 99% * 99% * 1% = 2.9%. This is acceptable! (And if we calculate a threshold that gives 97% confidence after 1000 games, instead of using the table, we get a false positive rate of 1% + 99% * 1% + 99% * 99% * 3% = 5%.)
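The bookkeeping in both calculations is the same: the overall false-positive rate over a sequence of looks is one minus the product of the per-look probabilities of not triggering a false positive, treating the successive looks the way the example does. A tiny sketch:

```python
def compound_alpha(per_look_alphas):
    """Overall chance that an equal engine passes at *some* look,
    treating successive looks as independent, as in the example above."""
    survive = 1.0
    for a in per_look_alphas:
        survive *= (1.0 - a)
    return 1.0 - survive

print(compound_alpha([0.05, 0.05, 0.05]))   # gambler: ~0.143 ("whoops, 14%")
print(compound_alpha([0.01, 0.01, 0.01]))   # statistician, 1% column: ~0.030
print(compound_alpha([0.01, 0.01, 0.03]))   # statistician, tuned final look: ~0.049 (~5%)
```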

Sometimes this new test and the original 1,000-game test will give different results. But if in fact the engines are (nearly) equal, it's a 50-50 chance which test is right. Either test can give a false positive when the other wouldn't. If the modified engine is much stronger, the new test will tell you of its superiority sooner.

With all this said, there's no such thing as a free lunch, so here's the asterisk. You can expect the old test to reject a far weaker new version more often than the new test will. In this sense the new test will be less accurate. It will, however, pass off a weaker program as stronger <5% of the time, so it still does just what you asked for!
UncombedCoconut
Posts: 319
Joined: Fri Dec 18, 2009 11:40 am
Location: Naperville, IL

Re: Engine Testing - Statistics

Post by UncombedCoconut »

Edmund wrote: I'll demonstrate it again with the figures from my LOS table: alpha = 5%; tournament size 200 games (predefined); the question is whether we can stop after 100 games:

According to the table you need to be 20 games ahead after 200 games to be superior at the 5% margin.

If we still have 100 games to go, and we want to know with 95% certainty that the engine is better, we have to make sure that with 95% certainty it will still finish 20 games ahead of the previous version.

This value can be deduced from the table too: with >95% certainty you will lose less than 14 points over 100 games if you are stronger.

So we have to add 14+20 to find that we have to be 34 points ahead.
I think that in general this increases the probability of false positives unless you also require more victories after the 200 games if a cutoff is not made.
(In this case the integer problem dwarfs the impact: the original test has a false positive rate of 0.047211; the new test has a false positive rate of 0.047212. I think the effect is so much smaller than the integer problem's because you picked a huge early-cutoff threshold compared to what is possible. My tests show that even 21 ahead is OK: less accurate than the plain stop-after-200 test, but still below the 5% bound.)

I think the moral is "always compute the type-I error as your final test".
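In that spirit, here is a sketch of how the type-I error of a two-stage rule can be computed exactly by enumeration rather than simulation. The draw rate is an assumed parameter, so the output will only reproduce figures like 0.047211 if it matches whatever model the LOS table was built on.

```python
def margin_dist(n, draw_rate):
    """Distribution of the points lead after n games between exactly equal
    engines: each game adds +1 or -1 with probability (1 - draw_rate)/2 each,
    and 0 on a draw. Returned as {lead: probability}."""
    p = (1 - draw_rate) / 2
    dist = {0: 1.0}
    for _ in range(n):
        nxt = {}
        for lead, pr in dist.items():
            nxt[lead + 1] = nxt.get(lead + 1, 0.0) + pr * p
            nxt[lead - 1] = nxt.get(lead - 1, 0.0) + pr * p
            nxt[lead] = nxt.get(lead, 0.0) + pr * draw_rate
        dist = nxt
    return dist

def type1_two_stage(n1=100, n2=100, early_margin=34, final_margin=20, draw_rate=0.30):
    """Exact type-I error of the rule: accept early at n1 games if the lead is
    >= early_margin, otherwise play n2 more games and accept if the total lead
    is >= final_margin. Both engines are assumed exactly equal."""
    first, second = margin_dist(n1, draw_rate), margin_dist(n2, draw_rate)
    alpha = 0.0
    for lead1, p1 in first.items():
        if lead1 >= early_margin:
            alpha += p1                                          # false accept at the early look
        else:
            alpha += p1 * sum(p2 for lead2, p2 in second.items()
                              if lead1 + lead2 >= final_margin)  # false accept at the end
    return alpha

print(type1_two_stage())                        # two-stage rule
print(type1_two_stage(early_margin=10**9))      # plain stop-after-200 test, for comparison
```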
UncombedCoconut
Posts: 319
Joined: Fri Dec 18, 2009 11:40 am
Location: Naperville, IL

Re: Engine Testing - Statistics

Post by UncombedCoconut »

zamar wrote: You say that there is no way, and that's false, and very easy to prove. Here is a counter-example:
Your simpler example is nice. :D

BTW, a literal sequential probability ratio test is unsuitable for testing "engine A is stronger than engine B" vs "engine A is not stronger than engine B". It requires a gap between hypotheses: e.g., "engine A is >= 50 Elo stronger than engine B" vs "engine A is not stronger than engine B". If the truth is somewhere between the two hypotheses, the test result will have little meaning. This probably makes the SPRT ill-suited to computer chess purposes, so I'd rather stick to the broader concept of "sequential analysis".
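For concreteness, here is a minimal Wald SPRT sketch on decisive games only (draws are simply skipped), with the gap expressed as two hypothesised Elo values; the 50 Elo figure is just the example gap from the paragraph above, and the win/loss-only model is a simplification rather than what any particular testing framework uses.

```python
from math import log

def elo_to_score(elo):
    """Expected score for a given Elo advantage under the usual logistic model."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

def sprt(results, elo0=0.0, elo1=50.0, alpha=0.05, beta=0.05):
    """Wald SPRT on decisive games only (draws skipped), testing
    H0: expected score = elo_to_score(elo0) against
    H1: expected score = elo_to_score(elo1).
    `results` is an iterable of 'w' / 'l' / 'd' from the candidate's side."""
    p0, p1 = elo_to_score(elo0), elo_to_score(elo1)
    lower = log(beta / (1 - alpha))          # cross this: accept H0, stop
    upper = log((1 - beta) / alpha)          # cross this: accept H1, stop
    llr = 0.0
    for r in results:
        if r == 'w':
            llr += log(p1 / p0)
        elif r == 'l':
            llr += log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept H1 (at least elo1 stronger)"
        if llr <= lower:
            return "accept H0 (no such gap)"
    return "undecided, keep playing"

print(sprt(['w'] * 30 + ['l'] * 10))   # a streak this lopsided crosses the upper bound well before game 30
```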
Aaron Becker
Posts: 292
Joined: Tue Jul 07, 2009 4:56 am

Re: Engine Testing - Statistics

Post by Aaron Becker »

UncombedCoconut wrote: BTW, a literal sequential probability ratio test is unsuitable for testing "engine A is stronger than engine B" vs "engine A is not stronger than engine B". It requires a gap between hypotheses: e.g., "engine A is >= 50 Elo stronger than engine B" vs "engine A is not stronger than engine B". If the truth is somewhere between the two hypotheses, the test result will have little meaning. This probably makes the SPRT ill-suited to computer chess purposes, so I'd rather stick to the broader concept of "sequential analysis".
I don't think this is that much of a problem. You always have to design your experiments to find improvements of a certain size: a 1000-game test will never resolve 1 Elo changes, regardless of the statistical method used. If you pick a hypothesis that's appropriately scaled to the size of the test you're willing to run, you can still use an SPRT.
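A back-of-the-envelope sketch of why 1000 games cannot resolve a 1 Elo change: the standard error of the score shrinks only with the square root of the number of games, and near a 50% score one percentage point of score is worth roughly 7 Elo. The per-game standard deviation of 0.4 used here is an assumed typical value (it depends on the draw rate), not a universal constant.

```python
from math import sqrt, log

def elo_error_bar(games, per_game_sd=0.4, z=1.96):
    """Approximate two-sided 95% Elo error bar for a match of `games` games.
    per_game_sd ~ 0.4 is an assumed typical per-game score standard deviation."""
    score_se = per_game_sd / sqrt(games)
    elo_per_score = 400.0 / log(10.0) * 4.0   # d(Elo)/d(score) at a 50% score, ~695
    return z * score_se * elo_per_score

print(round(elo_error_bar(1000)))   # about 17 Elo: far too coarse to detect a 1 Elo change
```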
Edmund
Posts: 670
Joined: Mon Dec 03, 2007 3:01 pm
Location: Barcelona, Spain

Re: Engine Testing - Statistics

Post by Edmund »

UncombedCoconut wrote:
Edmund wrote: I'll demonstrate it again with the figures from my LOS table: alpha = 5%; tournament size 200 games (predefined); the question is whether we can stop after 100 games:

According to the table you need to be 20 games ahead after 200 games to be superior at the 5% margin.

If we still have 100 games to go, and we want to know with 95% certainty that the engine is better, we have to make sure that with 95% certainty it will still finish 20 games ahead of the previous version.

This value can be deduced from the table too: with >95% certainty you will lose less than 14 points over 100 games if you are stronger.

So we have to add 14+20 to find that we have to be 34 points ahead.
I think that in general this increases the probability of false positives unless you also require more victories after the 200 games if a cutoff is not made.
(In this case the integer problem dwarfs the impact: the original test has a false positive rate of 0.047211; the new test has a false positive rate of 0.047212. I think the effect is so much smaller than the integer problem's because you picked a huge early-cutoff threshold compared to what is possible. My tests show that even 21 ahead is OK: less accurate than the plain stop-after-200 test, but still below the 5% bound.)

I think the moral is "always compute the type-I error as your final test".
I was taking conservative values: e.g. the LOS when 20 points ahead after 200 games is 95.67%, and 95.50% when 14 points ahead after 100 games. So maybe if you combine the two you don't actually need to be 34 points ahead; 33 would do. But that is not a major problem, I think.

Or what other false positives do you see?