GUI idea: Testing until certainty

Albert Silver · Post by **Albert Silver** » Tue Dec 07, 2010 3:52 pm

Here is an idea for the GUI designers here:

In backgammon, rollouts (Monte Carlo simulations with variance reduction) are done to find the truth of a position. The number of trials required to reach certainty varies from position to position (due to volatility), so a common setting is to have the rollout continue until one move has reached mathematical certainty as the best.

It occurred to me this might be an interesting feature (with a twist) for engine testers working on settings and features. Suppose you have a feature or setting that you believe might be an improvement, but you are not sure by how much. You could set the GUI to play ultrafast games until the result was certain one way or the other, and it would only stop when a conclusion was reached. There would be added options, such as extending the match if there was an improvement, to try to figure out by how much, etc. The GUI might also offer to test until a specific Elo range of certainty is reached. Ex: it would test until it knew the strength within 5, 10 or 20 Elo. Of course the latter corresponds to a known, fixed number of games, so when choosing this option, the GUI would advise the user how many games to expect, and even give a rough estimate of how long the testing would take (based on the average game length, of course).
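As a rough illustration of the "how many games to expect" estimate, here is a minimal sketch. It assumes near-equal engines (score close to 50%), a per-game score standard deviation of about 0.4, and the usual logistic Elo curve; the function names and the 20 seconds per game are hypothetical, chosen only for the example.

```python
import math

def games_needed(elo_margin, z=1.96, per_game_sd=0.4):
    """Approximate games needed so the z-sigma error bar is +/- elo_margin Elo.

    Assumptions: score close to 50% and a per-game score standard
    deviation of about 0.4 (this depends on the draw rate).
    """
    # Slope of the logistic Elo curve at a 50% score:
    # Elo points per unit of score fraction.
    elo_per_score = 400.0 / (math.log(10) * 0.25)
    # Allowed error expressed as a score fraction.
    score_margin = elo_margin / elo_per_score
    # Solve z * per_game_sd / sqrt(n) = score_margin for n.
    return math.ceil((z * per_game_sd / score_margin) ** 2)

def estimated_hours(games, seconds_per_game):
    """Rough duration estimate from an average game length."""
    return games * seconds_per_game / 3600.0

for margin in (5, 10, 20):
    n = games_needed(margin)
    hours = estimated_hours(n, seconds_per_game=20)
    print(f"+/-{margin} Elo: about {n} games, roughly {hours:.1f} h at 20 s/game")
```

Halving the Elo margin roughly quadruples the number of games, which is why narrow ranges get expensive quickly.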

hgm · Post by **hgm** » Tue Dec 07, 2010 5:19 pm

The ability to terminate a match early is only useful when the GUI has something to do next. If it just means the computer (or core) will become idle, it is a bit pointless.

That being said, what you propose would be extremely easy to implement in WinBoard. Just add a test when a match game ends to see if the set confidence is reached, and only start a new game when it is not.

Albert Silver · Post by **Albert Silver** » Tue Dec 07, 2010 6:09 pm

hgm wrote: The ability to terminate a match early is only useful when the GUI has something to do next. If it just means the computer (or core) will become idle, it is a bit pointless.

That being said, what you propose would be extremely easy to implement in WinBoard. Just add a test when a match game ends to see if the set confidence is reached, and only start a new game when it is not.

In backgammon, it is not unusual to have more than one rollout queued, so stopping early lets the next one start without wasting time. The same could be done here by scheduling tests with more than one set of parameters. In any case, considering how widely confidence levels are misunderstood, it might also be interesting to report the confidence level of any given result.
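One way a GUI could report such a confidence level is the "likelihood of superiority" computed from the win and loss counts. The sketch below uses the common normal approximation; the function name is hypothetical, and the figure is only valid for a match of fixed length decided in advance, not for a match that was stopped early.

```python
import math

def likelihood_of_superiority(wins, losses):
    """Normal-approximation probability that engine A is stronger than B.

    Draws say nothing about which side is stronger, so only the
    decisive games enter the formula.
    """
    decisive = wins + losses
    if decisive == 0:
        return 0.5  # no information either way
    # Under equal strength, (wins - losses) has variance of about `decisive`.
    z = (wins - losses) / math.sqrt(decisive)
    # Standard normal cumulative distribution at z.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

print(f"{likelihood_of_superiority(60, 40):.3f}")
```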

UncombedCoconut · Post by **UncombedCoconut** » Tue Dec 07, 2010 9:19 pm

hgm wrote: The ability to terminate a match early is only useful when the GUI has something to do next. If it just means the computer (or core) will become idle, it is a bit pointless.

That being said, what you propose would be extremely easy to implement in WinBoard. Just add a test when a match game ends to see if the set confidence is reached, and only start a new game when it is not.

Maybe not extremely easy -- you have to be careful to redesign the usual statistical tests with early termination in mind. That change in test structure also changes the relevant error formulas. There was a thread about such schemes a while back, but I don't think anybody settled on a "best" way to run variable-length tests. SPRT seemed like a good method for cases where there's a minimum Elo gain you're willing to accept.
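For readers unfamiliar with SPRT (Wald's sequential probability ratio test), here is a deliberately simplified sketch. It ignores draws and treats each decisive game as a coin flip whose win probability follows the logistic Elo curve, which is an assumption made for brevity; practical testers use models that account for draws. The names `sprt`, `elo0` and `elo1` are hypothetical.

```python
import math

def expected_score(elo_diff):
    """Logistic Elo expectation for a given rating difference."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def sprt(results, elo0=0.0, elo1=10.0, alpha=0.05, beta=0.05):
    """Wald SPRT over decisive results (1 = win for A, 0 = loss).

    H0: the true difference is elo0; H1: it is elo1.
    Returns (decision, games_used), where decision is "H1", "H0",
    or "continue" if neither bound was reached.
    """
    p0, p1 = expected_score(elo0), expected_score(elo1)
    lower = math.log(beta / (1.0 - alpha))   # accept H0 at or below this
    upper = math.log((1.0 - beta) / alpha)   # accept H1 at or above this
    llr = 0.0
    used = 0
    for used, result in enumerate(results, start=1):
        # Add the log-likelihood ratio of this single outcome.
        if result == 1:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1.0 - p1) / (1.0 - p0))
        if llr >= upper:
            return "H1", used
        if llr <= lower:
            return "H0", used
    return "continue", used

print(sprt([1, 0, 1, 1, 0, 1]))
```

The point of the scheme is that the stopping rule is part of the test itself, so the error rates `alpha` and `beta` already account for checking after every game.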

hgm · Post by **hgm** » Tue Dec 07, 2010 10:26 pm

Sure, the statistics will be different. But I assume they can be cast into a formula. Just a different formula than usual.

hgm · Post by **hgm** » Wed Dec 08, 2010 9:56 am

I did some math, and a simple way to do it would be with a fixed threshold: terminate the match when the score is more than a certain number of points above 50%.

E.g. if you set a (maximum) number of games of 1600, the standard deviation in the score would be 40%/sqrt(1600) = 1%, or 16 points. So normally a deviation of 2 sigma = 32 points (i.e. scores above 832 or below 768) would give you 95% confidence that the engines are not equally strong. And in a one-sided test, scores above 832 in favor of A would give you 97.5% confidence that A is stronger.
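The arithmetic in this example can be checked with a few lines. This is only a restatement of the numbers above, using the 40% per-game standard deviation from the post and the normal approximation for the tail probability.

```python
import math

games = 1600
per_game_sd = 0.40  # standard deviation of a single game's score, as in the post

sigma_fraction = per_game_sd / math.sqrt(games)  # sigma as a fraction of the games
sigma_points = sigma_fraction * games            # sigma in match points

upper = games / 2 + 2 * sigma_points  # 2 sigma above an even score
lower = games / 2 - 2 * sigma_points  # 2 sigma below an even score

# Probability that a normal variable lands more than 2 sigma above its mean.
one_sided_tail = 0.5 * math.erfc(2 / math.sqrt(2))

print(f"sigma = {sigma_fraction:.1%} = {sigma_points:.0f} points")
print(f"2-sigma band: {lower:.0f} to {upper:.0f} points")
print(f"one-sided tail: {one_sided_tail:.2%}, two-sided: {2 * one_sided_tail:.2%}")
```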

Now if you abort the match as soon as the score exceeds 50% by 32 points in favor of A, this would double the probability that equal engines give a false positive, compared to only taking the score at the end. So it would give you 95% confidence in the one-sided test.
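The doubling can be checked empirically with a small Monte Carlo sketch. It assumes two exactly equal engines and a 36% draw rate, a value chosen here only because it makes the per-game standard deviation equal to the 40% used above; the function name, trial count and seed are hypothetical.

```python
import random

def simulate(trials=2000, games=1600, threshold=32.0, draw_rate=0.36, seed=1):
    """Compare false-positive rates for equal engines.

    Returns (final_rate, early_rate): the fraction of matches whose final
    score is more than `threshold` points above 50%, and the fraction
    that crossed that threshold at any moment during the match.
    """
    rng = random.Random(seed)
    win_rate = (1.0 - draw_rate) / 2.0  # equal engines: wins and losses balance
    final_hits = 0
    early_hits = 0
    for _ in range(trials):
        excess = 0.0      # A's points minus half the games played so far
        crossed = False
        for _ in range(games):
            u = rng.random()
            if u < win_rate:
                excess += 0.5   # win for A
            elif u < 2.0 * win_rate:
                excess -= 0.5   # loss for A
            # otherwise a draw, which leaves the excess unchanged
            if excess > threshold:
                crossed = True  # the early-stop rule would have fired here
        final_hits += excess > threshold
        early_hits += crossed
    return final_hits / trials, early_hits / trials

final_rate, early_rate = simulate()
print(f"false positives at the end only: {final_rate:.2%}")
print(f"false positives with early stop: {early_rate:.2%}")
```

With a few thousand trials the two rates are noisy, so the ratio will only be approximately two; more trials tighten the estimate at the cost of run time.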
