Maximum ELO gain per test game played?

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Zenmastur
Posts: 919
Joined: Sat May 31, 2014 8:28 am

Re: Maximum ELO gain per test game played?

Post by Zenmastur »

Laskos wrote: If you think this thing is worth something, and are not averse to typing in LaTeX (I am), then that's good, your contribution is anyway crucial in posing the problem and having the idea that Type II error might behave differently. I did this for fun and hastily, in fact I would need to re-check everything.
Any application in which testing changes for improvement is a major fraction of the work done AND the acceptance rate of changes is highly skewed would benefit. Obviously, other games that use minimax searches would be a prime example. I'm sure there would be other applications as well. I don't know anything about LaTeX, but I suppose it isn't the only tool that can be used to write a paper. Even if the only place the paper found a home was the Chess Programming Wiki, it would be worth writing. It would probably need an extensive introduction section that details the testing framework as well as the mechanics of implementation. Anyway, it was just an idea.
Laskos wrote: Upon waking up this morning, I plotted several graphs, for fun again (I managed to numerically invert the Type I and II errors at the optima).

Optima of Type I and Type II errors as a function of Pgood.

Image
I actually thought about this exact graph yesterday before going to bed, but I was too tired to post anything else. Again... nice job!
Laskos wrote: Efficiency at the Optima and efficiency of the Stockfish Framework. We can see that in the relevant region of Pgood, 10%-30%, the efficiency of the Framework can be improved by 10%-30%. Also, while the efficiency at the Optima goes as Pgood^2, that of the Stockfish Framework is linear in Pgood.

Image
If Pgood drops below 10% with the current bounds, it looks as if efficiency could easily drop by an order of magnitude compared to the optimal interval, if I'm reading the graph correctly.
Laskos wrote: Efficiency per Pgood at fixed workload for the Optima and the Stockfish Framework. We see that if one has some "reasonable quality" submits, it's important not to pollute them with too many suspect submits.

Image

The SF Framework can suffer a lot if Pgood goes below 15%.
If you could predict/calculate the probability of a patch being accepted or rejected, then the CI could be adjusted on a per-patch basis. This could lead to huge reductions in the number of test games required and a huge increase in the ELO gain per test game played. So it's probably worth looking into how this might be accomplished.

I have several other ideas, but it's almost 5:00 am ... well past my bedtime, and I'm not likely to be thinking very clearly... so maybe after I get some sleep.

Regards,

Zen
Only 2 defining forces have ever offered to die for you.....Jesus Christ and the American Soldier. One died for your soul, the other for your freedom.
jwes
Posts: 778
Joined: Sat Jul 01, 2006 7:11 am

Re: Maximum ELO gain per test game played?

Post by jwes »

What is the right way to run tests with different Type I and Type II error rates?
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Maximum ELO gain per test game played?

Post by Laskos »

jwes wrote:What is the right way to run tests with different Type I and Type II error rates?
The goal discussed here is maximum efficiency when the queue of patches is larger than the testing capacity, and the problem to solve is the largest ELO gain at a given maximum workload (maximum computing resources). It is exemplified by the Stockfish testing framework. If you find yourself in this environment, and I think many developers need to optimize ELO gain with respect to their resources, then this plot shows which Type I and II errors are optimal for different probabilities of the patch passing the test. If you can estimate, even roughly, the probability that the patch will be accepted, i.e. pass the SPRT test successfully, say 30%, then from the plot you can see that the Type I error (alpha in the SPRT of cutechess-cli) is best set to 7.5% or 0.075, and the Type II error (beta in the SPRT of cutechess-cli) is best set to 20% or 0.20.

Image
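For reference, here is how those error rates map to the LLR stopping bounds reported during a test. This is a minimal sketch assuming Wald's classical SPRT approximation (the same one behind the familiar -2.94/2.94 bounds):

Code: Select all

(* Wald's approximate SPRT stopping thresholds from the chosen error rates *)
lowerLLR[alpha_, beta_] := Log[beta/(1 - alpha)]
upperLLR[alpha_, beta_] := Log[(1 - beta)/alpha]

{lowerLLR[0.05, 0.05], upperLLR[0.05, 0.05]}    (* {-2.94, 2.94}, the usual SF bounds     *)
{lowerLLR[0.075, 0.20], upperLLR[0.075, 0.20]}  (* {-1.53, 2.37}, the Pgood = 30% optimum *)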
Zenmastur
Posts: 919
Joined: Sat May 31, 2014 8:28 am

Re: Maximum ELO gain per test game played?

Post by Zenmastur »

Laskos wrote:
jwes wrote:What is the right way to run tests with different Type I and Type II error rates?
The goal discussed here is maximum efficiency when the queue of patches is larger than the testing capacity, and the problem to solve is the largest ELO gain at a given maximum workload (maximum computing resources). It is exemplified by the Stockfish testing framework. If you find yourself in this environment, and I think many developers need to optimize ELO gain with respect to their resources, then this plot shows which Type I and II errors are optimal for different probabilities of the patch passing the test. If you can estimate, even roughly, the probability that the patch will be accepted, i.e. pass the SPRT test successfully, say 30%, then from the plot you can see that the Type I error (alpha in the SPRT of cutechess-cli) is best set to 7.5% or 0.075, and the Type II error (beta in the SPRT of cutechess-cli) is best set to 20% or 0.20.

Image
To extend Kai's comments: instead of seeing something like this for a rejected phase-one test:

Code: Select all

LLR: -2.97 (-2.94,2.94) [-1.50,4.50]
Total: 7945 W: 1475 L: 1560 D: 4910
You would instead see something like this for a phase-one test if the acceptance ratio is 30% and the patch is rejected:

Code: Select all

LLR: -1.54 (-1.53,2.38) [-1.50,4.50]
The upper and lower bounds changed because the error rates (the CI) changed; the terminal LLR changed to a value just small enough to cross the new lower bound and cause rejection.

A table of approximate bounds calculated from the optimal CIs, based on Kai's calculations:

Code: Select all

Acceptance Ratio   Lower Bound   Upper Bound   Optimal Alpha   Optimal Beta
0.05               -2.28          5.01         0.006           0.102
0.10               -1.95          4.14         0.0136225       0.1407400
0.15               -1.84          3.57         0.0237276       0.1554220
0.20               -1.74          3.12         0.0365275       0.1695050
0.25               -1.64          2.74         0.0528825       0.1841800
0.30               -1.53          2.38         0.0742149       0.2005280
0.35               -1.41          2.03         0.102           0.220
0.40               -1.25          1.65         0.1444320       0.2457490

After considering this whole subject further, I have my doubts that the SF framework is anywhere near optimal. Example: SPRT stands for Sequential Probability Ratio Test. One of the main ideas is that the testing can be interrupted and resumed at any point. This means that you can use the phase-one results to begin phase-two testing by simply recalculating the LLR against the phase-two ELO bounds. There's no need to discard all the game results of a phase-one test when phase two is started. This practice wastes a HUGE amount of work without any good reason that I can see.
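As a sketch of what I mean, the quoted phase-one games can simply be re-scored against the phase-two bounds. The LLR calculation below follows the BayesElo-based trinomial model that fishtest uses, as far as I understand it (the drawelo is estimated from the sample itself); treat it as an illustration rather than the framework's exact code:

Code: Select all

(* Recompute the SPRT LLR of an existing W/D/L record against different ELO bounds *)
(* (bounds in BayesElo units, drawelo estimated from the observed results).        *)
bayesEloToProba[elo_, drawelo_] := Module[{pw, pl},
  pw = 1/(1 + 10^((-elo + drawelo)/400));
  pl = 1/(1 + 10^((elo + drawelo)/400));
  {pw, 1 - pw - pl, pl}]                                  (* {win, draw, loss} *)

probaToDrawelo[pw_, pl_] := 200 Log10[(1 - pl)/pl (1 - pw)/pw]

llr[w_, d_, l_, elo0_, elo1_] := Module[{n, drawelo, p0, p1},
  n = w + d + l;
  drawelo = probaToDrawelo[N[w/n], N[l/n]];
  p0 = bayesEloToProba[elo0, drawelo];
  p1 = bayesEloToProba[elo1, drawelo];
  w Log[p1[[1]]/p0[[1]]] + d Log[p1[[2]]/p0[[2]]] + l Log[p1[[3]]/p0[[3]]]]

llr[1475, 4910, 1560, -1.5, 4.5]   (* ≈ -2.97, the quoted phase-one result           *)
llr[1475, 4910, 1560, 0.0, 6.0]    (* ≈ -3.56, the same 7945 games re-scored against *)
                                   (* the phase-two bounds                           *)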

In addition, I have my doubts that the ELO bounds selected for the phase-one and phase-two tests in the SF framework are in any way optimal. I'm therefore very curious to know how these bounds were selected and what tests were performed to determine that they are optimal, considering the two-phase testing approach used.

Regards,

Zen

P.S. Getting the table columns to line up in the forum is awkward; maybe I'll turn it into a graphic later.
Only 2 defining forces have ever offered to die for you.....Jesus Christ and the American Soldier. One died for your soul, the other for your freedom.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Maximum ELO gain per test game played?

Post by Laskos »

Zenmastur wrote:
Laskos wrote: If you think this thing is worth something, and are not averse to typing in LaTeX (I am), then that's good, your contribution is anyway crucial in posing the problem and having the idea that Type II error might behave differently. I did this for fun and hastily, in fact I would need to re-check everything.
Any application in which testing changes for improvement is a major fraction of the work done AND the acceptance rate of changes is highly skewed would benefit. Obviously, other games that use minimax searches would be a prime example. I'm sure there would be other applications as well. I don't know anything about LaTeX, but I suppose it isn't the only tool that can be used to write a paper. Even if the only place the paper found a home was the Chess Programming Wiki, it would be worth writing. It would probably need an extensive introduction section that details the testing framework as well as the mechanics of implementation. Anyway, it was just an idea.
If you feel up to writing it in whatever format you like, you are welcome to; I will send you all my results.

The results up to now have some generality. However, many developers don't even use SPRT. Here and there one hears "it's outside the error margins" (usually 2 Standard Deviations) and the case is "closed". To see how that kind of testing performs, I computed the efficiency of using error margins (or LOS or p-value), under some specific assumptions.

I will use 1, 2, and 3 Standard Deviation bounds, with the well-known CI, LOS = (1+CI)/2, and p-value = 1-CI.

Code: Select all

1 SD    CI = 68.3%    LOS = 84.1%     p-value = 0.317
2 SD    CI = 95.45%   LOS = 97.72%    p-value = 0.0455
3 SD    CI = 99.73%   LOS = 99.87%    p-value = 0.0027
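(These values follow from the normal distribution; a one-liner to reproduce the table:)

Code: Select all

(* two-sided CI, LOS and p-value for n standard deviations *)
ci[n_] := Erf[n/Sqrt[2]];
Table[{n, ci[n], (1 + ci[n])/2, 1 - ci[n]}, {n, 1, 3}] // N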
Now I will be specific about setting up the testing framework, as the Type II error is not defined yet. This art of deriving results from error margins needs the introduction of some specific assumptions:

1/ The tester plans the number of games in advance and doesn't stop early.
2/ The tester has some idea of the ELO size of the submit.
3/ I took the typical good submit to be a 4 ELO point improvement.
4/ The length of matches: 1 SD 3000 games, 2 SD 3000*2^2=12000 games, 3 SD 3000*3^2=27000 games. Thus, I have the same Type II error in each case (with different Type I errors given by the different CIs). I am being kind here to the testers who use error margins, as these numbers of games fit what is required to measure within those error bounds.
5/ I performed simulations to compute the Type II error (a rough sketch of such a simulation follows the list).
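Something along these lines; a minimal sketch only, where the 60% draw ratio and the simple trinomial game model are my assumptions here, not part of the setup above:

Code: Select all

(* Chance that a genuinely +4 ELO patch fails to clear an n-SD bar in a fixed-length match *)
eloToScore[elo_] := 1/(1 + 10^(-elo/400));

typeIIerror[elo_, nGames_, nSD_, drawRatio_, trials_] :=
 Module[{s, pw, pl},
  s = eloToScore[elo];
  pw = s - drawRatio/2; pl = 1 - pw - drawRatio;
  N@Mean@Table[
     With[{res = RandomChoice[{pw, drawRatio, pl} -> {1., 0.5, 0.}, nGames]},
      (* count a Type II error if the score is not nSD standard errors above 0.5 *)
      If[Mean[res] - 0.5 < nSD StandardDeviation[res]/Sqrt[nGames], 1, 0]],
     {trials}]]

typeIIerror[4,  3000, 1, 0.6, 2000]   (* 1 SD,  3000 games *)
typeIIerror[4, 12000, 2, 0.6, 2000]   (* 2 SD, 12000 games *)
typeIIerror[4, 27000, 3, 0.6, 2000]   (* 3 SD, 27000 games *)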

With that in hand, I computed the efficiency of a testing framework which uses error margins (or LOS or p-value), compared to the Stockfish SPRT and the Optimal SPRT:

Image

In the relevant range of Pgood, 8%-33%, 2 Standard Deviations seem to perform the best, which is probably why this choice is so popular in computer chess (and with such tools as BayesElo). On the entire interval this 2 SD choice is approximately 2 times less efficient than the Stockfish SPRT (and 2.3-4 times less efficient than the Optimal SPRT).

For very small Pgood (below 8%), 3 Standard Deviations seem to have higher efficiency than 1 SD or 2 SD, but are still much worse than the Optimal SPRT.
For very large Pgood (above 33%), 1 Standard Deviation seems to be more efficient than 2 Standard Deviations, although still worse than SPRT.
Uri Blass
Posts: 10299
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Maximum ELO gain per test game played?

Post by Uri Blass »

Laskos wrote:
Zenmastur wrote:But I'm still curious about using an asymmetrical CI. If it's not too complicated, would it be possible for you to try to optimize the efficiency while allowing each tail to assume a value independent of the other tail? E.g., for a 95% CI both tails are usually 0.025 wide; instead, I'm thinking that with a highly skewed acceptance ratio, like 10%, accepting 1% of bad patches while rejecting 4% of good patches would produce greater efficiency. I'm sure 1% and 4% aren't the right numbers, but it would be nice to know how much more efficient an optimized pair would be.

Regards,

Zen
You are on to something, Zen. I found that the Type II error (false negatives, fn here) can be much laxer than the Type I error (false positives, fp here), in a proportion similar to your guess. The result could spare the Stockfish framework 20% of its workload if the proportion of good patches (Pgood here) is above 10%-15%. I estimate that the proportion of good patches has been about 20% in SF tests during the last several weeks.

Similarly to the symmetric CI case, the efficiency is proportional to:

Code: Select all

(fp (-1 + Pgood) + Pgood - fn Pgood)/(-Log[fn/(1 - fp)] + Log[(1 - fn)/fp])
For Pgood=20%, the shape of efficiency as a function of Type I error and Type II error is:
Image

It's already visible that the maximum efficiency lies at a small Type I error but a fairly large Type II error.

To pinpoint the optima, I took the first derivatives with respect to both fp and fn, and luckily it wasn't necessary to compute the second derivatives and the Hessian.

Code: Select all

(* Pgood must be given a value first, e.g. Pgood = 0.20; the fp/fn plot ranges below are illustrative, they were not part of the original snippet *)
ContourPlot[{(fp + (-1 + fn - fp) Pgood + (-1 + fp) fp (-1 + Pgood) (-Log[fn/(1 - fp)] + Log[(1 - fn)/fp]))/((-1 + fp) fp (Log[fn/(1 - fp)] - Log[(1 - fn)/fp])^2) == 0, (fp + (-1 + fn - fp) Pgood + (-1 + fn) fn Pgood (Log[fn/(1 - fp)] - Log[(1 - fn)/fp]))/((-1 + fn) fn (Log[fn/(1 - fp)] - Log[(1 - fn)/fp])^2) == 0}, {fp, 0.001, 0.3}, {fn, 0.001, 0.5}]
The 3D contour plot is useless, as one cannot see anything clearly there, so I just plotted 2D contour plots for several reasonable values of Pgood (the proportion of good patches); the maximum efficiency is achieved at the intersection of the two curves. We see that the Type II error can safely be taken as 15% or higher if the proportion of good patches is above 10%, while the Type I error remains about as strict as in the symmetric-CI case (much smaller than the Type II error).

The intersection is the optimum for efficiency (clickable thumbnails).

Pgood=10%
Image

Pgood=15%
Image

Pgood=20%
Image

Pgood=25%
Image

Pgood=30%
Image

Pgood=40%
Image

Absurd case Pgood=51% (just submit them all instantly), no solution
Image
The absurd case arises because you seem to assume that a good patch gains as much ELO as a bad one loses.

In practice this is not the case, even if you are in the early stages of your engine and you are a good programmer, so that in most cases there are no bugs and Pgood > 50%.

You can have a big bug in your implementation and lose more than 800 ELO with one patch, but there is no chance of gaining even 200 ELO with one patch.

Uri
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Maximum ELO gain per test game played?

Post by Laskos »

Uri Blass wrote:
The absurd case arises because you seem to assume that a good patch gains as much ELO as a bad one loses.

In practice this is not the case, even if you are in the early stages of your engine and you are a good programmer, so that in most cases there are no bugs and Pgood > 50%.

You can have a big bug in your implementation and lose more than 800 ELO with one patch, but there is no chance of gaining even 200 ELO with one patch.

Uri
Probably some pretest consuming, say, 5% of resources is needed to filter out this sort of patch. I do not have a distribution for the values of good/bad patches; they are uniform here. I can easily add a K_bad factor to account for bad patches having, on average, a different ELO value than good patches.

For K_bad = 2, the plots have a solution up to Pgood = 66.66% instead of 50%. I plotted here the efficiency of different testing schemes up to Pgood = 65% for K_bad = 2, meaning that bad submits cost on average twice the ELO that good submits gain.

Image

Compare that with my previous graph, with the default K_bad = 1.
If you are correct, and one has to introduce K_bad > 1, then the Stockfish Testing Framework is even farther from the optimum at Pgood = 20% (a reasonable value), and SPRT 5% 5% breaks down below a Pgood of 10%. We might expect a 10%-15% rate of good submits fairly soon in the SF framework, as the engine is already well rounded.
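As an aside, the 50% and 66.66% cut-offs are just the break-even point for submitting everything untested, under the uniform-value assumption above; a one-line check:

Code: Select all

(* net gain of submitting everything: pg - (1 - pg) kbad, zero at the break-even point *)
Solve[pg - (1 - pg) kbad == 0, pg]    (* {{pg -> kbad/(1 + kbad)}} *)
(* kbad = 1 gives Pgood = 1/2, kbad = 2 gives Pgood = 2/3 = 66.66% *)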
Zenmastur
Posts: 919
Joined: Sat May 31, 2014 8:28 am

Re: Maximum ELO gain per test game played?

Post by Zenmastur »

Laskos wrote:
Uri Blass wrote:
The absurd case arises because you seem to assume that a good patch gains as much ELO as a bad one loses.

In practice this is not the case, even if you are in the early stages of your engine and you are a good programmer, so that in most cases there are no bugs and Pgood > 50%.

You can have a big bug in your implementation and lose more than 800 ELO with one patch, but there is no chance of gaining even 200 ELO with one patch.

Uri
Probably some pretest consuming, say, 5% of resources is needed to filter out this sort of patch. I do not have a distribution for the values of good/bad patches; they are uniform here. I can easily add a K_bad factor to account for bad patches having, on average, a different ELO value than good patches.

For K_bad = 2, the plots have a solution up to Pgood = 66.66% instead of 50%. I plotted here the efficiency of different testing schemes up to Pgood = 65% for K_bad = 2, meaning that bad submits cost on average twice the ELO that good submits gain.

Image

Compare that with my previous graph, with the default K_bad = 1.
If you are correct, and one has to introduce K_bad > 1, then the Stockfish Testing Framework is even farther from the optimum at Pgood = 20% (a reasonable value), and SPRT 5% 5% breaks down below a Pgood of 10%. We might expect a 10%-15% rate of good submits fairly soon in the SF framework, as the engine is already well rounded.
I think someone already had the idea of a pre-test and implemented it in the SF framework. This is the only reason I can think of that justifies testing with two different bounds, [-1.5, 4.5] as a pretest and [0.0, 6.0] as the final test. So I'm going to assume that this is the purpose of the phase-one tests having ELO bounds of -1.5 and 4.5 while the phase-two tests have bounds of 0.0 and 6.0.

There doesn't seem to be any advantage to doing this. If a patch is worth -800 ELO it will never pass any test. Instead, it'll be rejected with very few games being played, near the minimum required to reject a patch. I looked around on the internet to see what other uses SPRT is put to. It seems that one of its major uses is safety monitoring of newly approved drugs and vaccines, or those undergoing clinical trials. In those testing regimes, many require that a minimum number of observations be accumulated before any comparison is made against the LLR bounds. The reason for this is to limit the number of false alarms / false rejections. I don't know if the SF framework does something similar. If it does, then a patch with a true ELO difference of -800 or some other absurd number would require only that minimum number of games to be rejected. If it doesn't, then it'll be rejected within a very, very small number of games. So there is no advantage to running a pre-test.
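A rough sanity check of that claim; a sketch only, where the logistic ELO model and the 30% draw ratio are my assumptions, so fishtest's actual BayesElo-based numbers will differ somewhat:

Code: Select all

(* How many games does an SPRT with bounds [elo0, elo1] need before stopping,  *)
(* when the patch's true strength is trueElo?  Simplified trinomial model.     *)
eloToScore[e_] := 1/(1 + 10^(-e/400));
triProbs[e_, draw_] := With[{pw = Max[eloToScore[e] - draw/2, 0]},
  {pw, draw, 1 - pw - draw}]                                   (* {win, draw, loss} *)

gamesToStop[trueElo_, elo0_, elo1_, alpha_, beta_, draw_] :=
 Module[{pTrue, p0, p1, lower, upper, llr = 0., n = 0, r},
  pTrue = triProbs[trueElo, draw];
  p0 = triProbs[elo0, draw]; p1 = triProbs[elo1, draw];
  lower = Log[beta/(1 - alpha)]; upper = Log[(1 - beta)/alpha];
  While[lower < llr < upper,
   r = RandomChoice[pTrue -> {1, 2, 3}]; n++;
   llr += Log[p1[[r]]/p0[[r]]]];
  {n, If[llr >= upper, "accepted", "rejected"]}]

gamesToStop[-800, 0, 6, 0.05, 0.05, 0.3]   (* rejected after roughly 170 games under these assumptions *)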

In fact, there is a significant disadvantage to doing this, because it wastes computational resources. Testing with a lower set of ELO bounds in phase one allows many tests to continue, be completed, and be accepted at phase one that would have been rejected if the higher set of bounds had been used. They are then allowed to consume further resources by being tested again in phase two. To my eye, this double testing serves no purpose other than to waste a scarce resource. All it does is add a large number of games to the testing process that contribute nothing to producing an ELO gain for the engine.

If you include all these extra games in the SF testing framework in the graph you posted, it would cut the efficiency of the line labeled "Type I = Type II = 5%" by a third, and it might be as much as a half. So, in effect, their "High Tech" advantage has been largely squandered by poor framework design.

I can understand the idea of wanting a stopping point in the testing so that a supervisor can check that no errors in the code or test parameters have been inadvertently introduced before a large amount of resources is committed to completing it. If this is the purpose of the two-phase approach used, then there is a less wasteful way to accomplish it.

As far as writing a paper on this subject goes, I think it would be useful. But I don't, as yet, fully understand the SF testing framework, which seems to be a key anchor point of this discussion. Conceptually I think I understand what is happening, but I have too many questions about why things are the way they are. I don't feel like I've been around the subject long enough to have "robust" knowledge of it. Until I have an intuitive sense of all aspects of the testing method and can immediately see the implications of different approaches, I don't think I'd be comfortable writing a paper.

Regards,

Zen
Only 2 defining forces have ever offered to die for you.....Jesus Christ and the American Soldier. One died for your soul, the other for your freedom.
Zenmastur
Posts: 919
Joined: Sat May 31, 2014 8:28 am

Re: Maximum ELO gain per test game played?

Post by Zenmastur »

Is it possible to change the parameters in such a way that the net effect of all Type I errors is ELO neutral?

If so, what other unintended consequences would this have?

Example: Suppose we raised the lower ELO bound enough that the sum of all false positives was zero ELO.

How would this affect the CI used and the overall efficiency?

If this could be done without having to change the CI, then there may be an opportunity to increase efficiency: since false positives would asymptotically no longer have an effect on the total ELO gained, the optimum value used for the lower bound of the CI would likely change. If this were the case, couldn't we then set the lower interval to match the upper interval? Would such a change increase overall efficiency?

Regards,

Zen
Only 2 defining forces have ever offered to die for you.....Jesus Christ and the American Soldier. One died for your soul, the other for your freedom.
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: Maximum ELO gain per test game played?

Post by Michel »

Zenmastur wrote: So, in effect, their "High Tech" advantage has been largely squandered by poor framework design.
Ah ok. Good to know this.
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.