Maximum ELO gain per test game played?



Re: Maximum ELO gain per test game played?

Post by Laskos »

Zenmastur wrote:
Is there any significance to the funny-looking kink at the top left of the second graph?
No, I guess it hit some numerical limit, 0/0 or similar. I increased the precision, and it looks normal:

Image
Zenmastur wrote:
From looking at the graphs it appears that if the ratio is 9 to 1 in favor of a patch being rejected, then the CI should be something like 98% (or higher; it's hard to tell the exact value from the graph), not 95%. But I'm still curious about using an asymmetrical CI. If it's not too complicated, would it be possible for you to try to optimize the efficiency while allowing each tail to assume a value independent of the other tail? E.g., for a 95% CI both tails are usually 0.025 wide; instead, I'm thinking that with a highly skewed acceptance ratio, like 10%, accepting 1% of bad patches while rejecting 4% of good patches would produce greater efficiency. I'm sure 1% and 4% aren't the right numbers, but it would be nice to know how much more efficient an optimized pair would be.

Regards,

Zen
I can probably do that, maybe tomorrow. I will need 3D contour plots, and I have to see whether Mathematica can solve what is needed numerically.

Re: Maximum ELO gain per test game played?

Post by bob »

You can run 5x as many tests, yes. But one of every 4 will be a bogus result. Doesn't sound like a good trade to me. There is no way to cheat here. If you make a change that is +100 Elo, you might get by with a few hundred or a couple of thousand games. But for the more usual +4 or so Elo changes, 30K games is a minimum. Painful but unavoidable. If you short-cut, you get chaos. A bad change passes, then the next test looks bad because that bad change more than offsets the good change you just added, and the errors compound.
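To put rough numbers on why the game counts explode as the Elo gain shrinks: the score edge scales roughly linearly with the Elo difference, while the noise shrinks only as 1/sqrt(games), so the games needed grow roughly as 1/Elo^2. A back-of-envelope sketch (my own simplification, assuming a per-game score standard deviation of 0.4 and a 2.5-sigma resolution requirement; a real testing framework computes this differently):

Code: Select all

(* games needed to resolve an Elo gain d at z sigmas, with per-game score s.d. sigma *)
elo2score[d_] := 1/(1 + 10^(-d/400)) - 0.5          (* expected score above 50% *)
games[d_, z_:2.5, sigma_:0.4] := Ceiling[(z sigma/elo2score[d])^2]

games[100]  (* ~50 games: a big change shows up almost immediately *)
games[4]    (* ~30,000 games for a +4 Elo change *)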

Re: Maximum ELO gain per test game played?

Post by Desperado »

bob wrote: [...] If you short-cut, you get chaos. A bad change passes, then the next test looks bad because that bad change more than offsets the good change you just added, and the errors compound.
Hello,

just a very dumb question, without any formulas: doesn't this logic apply in both directions? E.g. if you measure +5 Elo, you might really have +1, but you might also have +9. Isn't there a balance in the long run? By "long run" I mean measuring many different features and always keeping the positive results.

I remember a private discussion on this topic with Sven a long time ago, but of course I am not a mathematician. I have the basic idea of simulated annealing in mind, which accepts any improvement and also allows some regressions with a certain probability; a toy version of that acceptance rule is sketched below. In the long run this makes progress.

Well, this is the picture I have always had in mind, without any fundamental reasoning behind it.
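(To make the analogy concrete, here is a toy Metropolis-style acceptance rule; the function name and the Elo-scaled "temperature" T are mine and purely illustrative.)

Code: Select all

(* keep every measured improvement; keep a measured regression d < 0 with probability Exp[d/T] *)
acceptQ[d_, T_] := d >= 0 || RandomReal[] < Exp[d/T]

acceptQ[-2.0, 4.0]  (* a -2 Elo measurement survives with probability Exp[-0.5], about 61% *)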

Re: Maximum ELO gain per test game played?

Post by Laskos »

Zenmastur wrote: But I'm still curious about using an asymmetrical CI. If it's not too complicated, would it be possible for you to try to optimize the efficiency while allowing each tail to assume a value independent of the other tail? E.g., for a 95% CI both tails are usually 0.025 wide; instead, I'm thinking that with a highly skewed acceptance ratio, like 10%, accepting 1% of bad patches while rejecting 4% of good patches would produce greater efficiency. I'm sure 1% and 4% aren't the right numbers, but it would be nice to know how much more efficient an optimized pair would be.

Regards,

Zen
You are onto something, Zen: I found that the Type II error (false negatives, fn here) can be much laxer than the Type I error (false positives, fp here), in a proportion similar to your guess. The result could spare the Stockfish framework about 20% of its workload if the proportion of good patches (Pgood here) is above 10%-15%. I estimate that the proportion of good patches in SF tests over the last several weeks has been about 20%.

As with a symmetric CI, the efficiency is proportional to:

Code: Select all

(fp (-1 + Pgood) + Pgood - fn Pgood)/(-Log[fn/(1 - fp)] + Log[(1 - fn)/fp])
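In function form (my notation; under my reading, the numerator is the net rate of real improvements kept, and the denominator is the width of the Wald SPRT corridor, which the expected test length scales with):

Code: Select all

(* efficiency, up to a constant factor: net good patches kept per unit of testing effort *)
eff[fp_, fn_, p_] :=
 (p (1 - fn) - fp (1 - p))/(Log[(1 - fn)/fp] - Log[fn/(1 - fp)])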
For Pgood=20%, the shape of efficiency as a function of Type I error and Type II error is:
Image

It is already visible that the maximum efficiency lies at a small Type I error but a fairly large Type II error.

To pinpoint the optimum, I took the first derivatives with respect to both fp and fn; luckily it wasn't necessary to compute the second derivatives and the Hessian.

Code: Select all

(* set Pgood to a number first, e.g. Pgood = 0.20; plot ranges added so the call runs as-is *)
ContourPlot[{
   (fp + (-1 + fn - fp) Pgood + (-1 + fp) fp (-1 + Pgood) (-Log[fn/(1 - fp)] + Log[(1 - fn)/fp]))/
     ((-1 + fp) fp (Log[fn/(1 - fp)] - Log[(1 - fn)/fp])^2) == 0,
   (fp + (-1 + fn - fp) Pgood + (-1 + fn) fn Pgood (Log[fn/(1 - fp)] - Log[(1 - fn)/fp]))/
     ((-1 + fn) fn (Log[fn/(1 - fp)] - Log[(1 - fn)/fp])^2) == 0},
  {fp, 0.001, 0.2}, {fn, 0.001, 0.4}]
The 3D contour plot is useless, one cannot see anything clearly there, so I just plotted 2D contour plots for several reasonable values of Pgood (the proportion of good patches); the maximum efficiency is achieved at the intersection of the two curves. We see that the Type II error can safely be taken as 15% or higher if the proportion of good patches is above 10%, while the Type I error stays about as strict as with a symmetric CI (much smaller than the Type II error).
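If you prefer numbers to reading intersections off the plots, the same optimum can be found directly (a sketch using the eff[] helper above; the box constraints just keep the search away from the singular line fp + fn = 1):

Code: Select all

NMaximize[{eff[fp, fn, 0.20], 0.001 < fp < 0.49, 0.001 < fn < 0.49}, {fp, fn}]
(* for Pgood = 20% this lands near fp -> 0.037, fn -> 0.17 *)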

The intersection is the efficiency optimum; clickable thumbnails below.

Pgood=10%
Image

Pgood=15%
Image

Pgood=20%
Image

Pgood=25%
Image

Pgood=30%
Image

Pgood=40%
Image

Absurd case Pgood=51% (just submit them all instantly): no solution.
Image

Re: Maximum ELO gain per test game played?

Post by Zenmastur »

Laskos wrote: [...] We see that the Type II error can safely be taken as 15% or higher if the proportion of good patches is above 10%, while the Type I error stays about as strict as with a symmetric CI (much smaller than the Type II error). [...]
WOW!!!

Very nice work!
Thank you!

I do have a couple of questions just so I'm sure that what I'm seeing on the smaller graphs really means what I think it means.

Example 1: on the Pgood=10% graph, it looks as if the optimal bounds would be 0.01 and 0.86, which equates to a ~85% CI.

Example 2: on the Pgood=20% graph, it looks as if the optimal bounds would be 0.02 and 0.83, which equates to a ~81% CI.

Am I reading and interpreting these graphs correctly?

Could you see fit to post the exact intersection points so no guesswork is involved?

I'm sure the SF team will enjoy a 20% boost in testing rates. But the people who will benefit the most will be those with the greatest disparity between the number of tests they wish to run and their available CPU resources, which seems to include most of the one- or two-man teams! :D

Again... Thanks for doing all the heavy lifting!!!

Regards,

Zen

Re: Maximum ELO gain per test game played?

Post by Zenmastur »

After thinking about it a little more...

Maybe you should write a short white paper on it, complete with graphs and associated data.

Regards,

Zen

Re: Maximum ELO gain per test game played?

Post by Laskos »

Zenmastur wrote: [...] Am I reading and interpreting these graphs correctly?

Could you see fit to post the exact intersection points so no guesswork is involved?
Yes, you interpreted them correctly. The Type I and Type II errors (fp and fn in my notation, from "false positive" and "false negative") at those intersections are:

Code: Select all

Pgood=10%
{fp -> 0.0136225, fn -> 0.14074}

Pgood=15%
{fp -> 0.0237276, fn -> 0.155422}

Pgood=20%
{fp -> 0.0365275, fn -> 0.169505}

Pgood=25%
{fp -> 0.0528825, fn -> 0.18418}

Pgood=30%
{fp -> 0.0742149, fn -> 0.200528}

Pgood=40%
{fp -> 0.144432, fn -> 0.245749}
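(These come from the numerical solver; a one-liner that should reproduce the whole table, using the eff[] helper from my earlier post:)

Code: Select all

Table[{p, Last@NMaximize[{eff[fp, fn, p], 0.001 < fp < 0.49, 0.001 < fn < 0.49}, {fp, fn}]},
 {p, {0.10, 0.15, 0.20, 0.25, 0.30, 0.40}}]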

Re: Maximum ELO gain per test game played?

Post by Laskos »

Zenmastur wrote: [...] Maybe you should write a short white paper on it, complete with graphs and associated data.
If you think this is worth something, and you are not averse to typing in LaTeX (I am), then that's good; your contribution was in any case crucial in posing the problem and in having the idea that the Type II error might behave differently. I did this for fun and hastily; in fact I would need to re-check everything.

Upon waking up this morning, I plotted several graphs, again for fun (I managed to numerically invert the Type I and Type II errors at the optima).

Optima of the Type I and Type II errors as a function of Pgood.

Image


Efficiency at the optima and efficiency of the Stockfish framework. We can see that in the relevant region of Pgood = 10%-30%, the efficiency of the framework can be improved by 10%-30%. Also, while the efficiency at the optima goes as Pgood^2, that of the Stockfish framework is linear in Pgood. (A sketch for reproducing this comparison follows the plot.)

Image
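(A sketch of the comparison, using the eff[] helper from my earlier post and assuming, purely as an illustration, that the framework's symmetric test corresponds to fp = fn = 0.05:)

Code: Select all

(* optimal asymmetric errors vs. a fixed symmetric fp = fn = 0.05 scheme *)
Table[{p,
   First@NMaximize[{eff[fp, fn, p], 0.001 < fp < 0.49, 0.001 < fn < 0.49}, {fp, fn}],
   eff[0.05, 0.05, p]},
  {p, 0.10, 0.40, 0.05}]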


Efficiency per Pgood at fixed workload, for the optima and for the Stockfish framework. We see that if one has some reasonable-quality submissions, it is important not to pollute them with too many suspect ones.

Image

The SF framework can suffer a lot if Pgood drops below 15%.

Re: Maximum ELO gain per test game played?

Post by bob »

Desperado wrote: [...] Doesn't this logic apply in both directions? E.g. if you measure +5 Elo, you might really have +1, but you might also have +9. Isn't there a balance in the long run? [...]
It goes in both directions. Unfortunately, there are FAR more bad changes than good ones, which means you are taking a significant chance of accepting a bad change without knowing it is bad. That totally disrupts development when you step backward and don't know it.

There isn't any way to cheat the system here, unfortunately. You can play fewer games if you use self-play, but I don't consider that much of an option either, for reasons already discussed.

Re: Maximum ELO gain per test game played?

Post by bob »

The key point is that in general, unless an engine is very young, the majority of patches will be bad. I suppose one could argue about which is worse: accepting a bad patch as good, or rejecting a good patch as bad. The latter has less harmful effects, IMHO, but both are to be avoided if possible.