Maximum ELO gain per test game played?


Zenmastur
Posts: 919
Joined: Sat May 31, 2014 8:28 am

Maximum ELO gain per test game played?

Post by Zenmastur »

I'm curious...

With limited computational resources, what confidence bounds should be used to maximize the ELO gain per test game played? I guess that to be technically accurate we should maximize the ELO gain per unit of computational resources, but to simplify things we will assume that all computers used are identical and that all games require the same amount of time to be played.

I'm curious if anyone has seriously looked at this. I know that during the Cold War the US sponsored a lot of math research into this type of problem. For example, it's very expensive to set up, maintain, and secure an ICBM test range. It's even more expensive to launch 20 or so ICBMs at specified targets. Reducing the number of missiles required would obviously save a lot of time and money. The idea was to minimize the number fired while still being able to determine whether the design criteria had been met, and to periodically re-certify the fleet as it ages.

Anyway, all I really want to know is what confidence interval maximizes ELO gain per unit of computation. I was thinking that a 95% CI was way too high for those of us who have limited resources available.

Regards,

Zen
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Maximum ELO gain per test game played?

Post by bob »

Zenmastur wrote:I'm curious...

With limited computational resources, what confidence bounds should be used to maximize the ELO gain per test game played? I guess that to be technically accurate we should maximize the ELO gain per unit of computational resources, but to simplify things we will assume that all computers used are identical and that all games require the same amount of time to be played.

I'm curious if anyone has seriously looked at this. I know that during the Cold War the US sponsored a lot of math research into this type of problem. For example, it's very expensive to set up, maintain, and secure an ICBM test range. It's even more expensive to launch 20 or so ICBMs at specified targets. Reducing the number of missiles required would obviously save a lot of time and money. The idea was to minimize the number fired while still being able to determine whether the design criteria had been met, and to periodically re-certify the fleet as it ages.

Anyway, all I really want to know is what confidence interval maximizes ELO gain per unit of computation. I was thinking that a 95% CI was way too high for those of us who have limited resources available.

Regards,

Zen
A CI of 95% means one of every 20 runs will produce a wrong answer. If you drop that to 75% do you REALLY want to accept one bad change for every three good ones? Or reject one good change for every three bad ones? That is getting dangerously close to constant regression.
Uri Blass
Posts: 10268
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Maximum ELO gain per test game played?

Post by Uri Blass »

I think that it depends on the percentage of failures.

If 99% of your tests suggest no improvement, then even 95% may not be enough to get an improvement, because you can expect some lucky regressions to pass as well (it could still be an improvement overall if the lucky regressions that pass are very small while the improvements are larger, so it is not clear).

The question is also how many games you plan to play in the future to make improvements.

If the target is to maximize elo when you are allowed to play only 100,000 games, then I think the confidence should be lower than in the case where the target is to maximize elo when you are allowed to play 10,000,000 games.

The main point is that if you have many games available, you want to know exactly which patches help and which do not, even if you pay for it with games, because it may give you better ideas in the future.

My opinion is that every patch that passes should be tested again later with a fixed number of games to get an estimate of its elo improvement.

I believe that the Stockfish framework could do better at improving Stockfish by playing 50,000 games at short time control and at long time control for every patch that passes both stages, because the value of the knowledge you gain is more important than the time that you lose, and knowing which factors actually improve Stockfish is important for having better ideas in the future about what to change in Stockfish.
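
As an illustration of the kind of fixed-games estimate suggested above, here is a small Python sketch (only a sketch: the game counts in the example are made up, draws are scored as half a point, and a simple normal approximation is used for the error bar):

Code:

import math

def elo_estimate(wins, losses, draws):
    """Point estimate and rough 95% error bar for the Elo difference measured
    by a fixed-length match, using the logistic model; assumes the score is
    comfortably inside (0, 1)."""
    n = wins + losses + draws
    score = (wins + 0.5 * draws) / n
    elo = -400.0 * math.log10(1.0 / score - 1.0)
    # per-game score variance, then standard error of the mean score
    var = (wins * (1.0 - score) ** 2 + losses * score ** 2 + draws * (0.5 - score) ** 2) / n
    se = math.sqrt(var / n)
    lo = -400.0 * math.log10(1.0 / (score - 1.96 * se) - 1.0)
    hi = -400.0 * math.log10(1.0 / (score + 1.96 * se) - 1.0)
    return elo, lo, hi

print(elo_estimate(wins=13000, losses=12500, draws=24500))  # a made-up 50,000-game match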
Zenmastur
Posts: 919
Joined: Sat May 31, 2014 8:28 am

Re: Maximum ELO gain per test game played?

Post by Zenmastur »

bob wrote:
Zenmastur wrote:I'm curious...

With limited computational resources, what confidence bounds should be used to maximize the ELO gain per test game played? I guess that to be technically accurate we should maximize the ELO gain per unit of computational resources, but to simplify things we will assume that all computers used are identical and that all games require the same amount of time to be played.

I'm curious if anyone has seriously looked at this. I know that during the Cold War the US sponsored a lot of math research into this type of problem. For example, it's very expensive to set up, maintain, and secure an ICBM test range. It's even more expensive to launch 20 or so ICBMs at specified targets. Reducing the number of missiles required would obviously save a lot of time and money. The idea was to minimize the number fired while still being able to determine whether the design criteria had been met, and to periodically re-certify the fleet as it ages.

Anyway, all I really want to know is what confidence interval maximizes ELO gain per unit of computation. I was thinking that a 95% CI was way too high for those of us who have limited resources available.

Regards,

Zen
A CI of 95% means one of every 20 runs will produce a wrong answer. If you drop that to 75% do you REALLY want to accept one bad change for every three good ones? Or reject one good change for every three bad ones? That is getting dangerously close to constant regression.
Well... they (meaning the SF team) take the center 95%. Isn't one end of the distribution favorable to an ELO gain? If so, wouldn't you want to use all of that end? Or am I confusing this with a "normal" distribution?


If I have many more tests to run than I have resources with which to run them (I assume this describes most people on this board), then if it requires 100 units of computer resources to run a single test at a CI of 95% and only 20 units (this isn't the "real" number, I just picked one at random) to run the same test at a CI of 75%, you can run 5 times as many tests with the same resources. Example:

19 * "X" elo - 1 * "X" elo = + 18X ELO using 95% CI

Versus

75 * "X" elo - 25 * "X" elo = 50X elo using 75% CI

For the same resources we have a 2.78 X increase in ELO gain per game played/unit time/unit resource etc.

Now, I know that the difference in the number of games that I used in this example isn't accurate. But I also know that the difference in the number of games required isn't linear in the CI, i.e. the number of games required to produce a CI of 95% isn't 0.95/0.75 = 1.267 times the number needed to produce a CI of 75%. I'm sure it's much greater. Therefore, I have high expectations that a 95% CI is MASSIVE overkill for this application "IF" the end user has insufficient CPU resources for the number of tests they need to run in a given time.

So the short answer is: YES, I believe that a 95% CI is way too high. What I want to know is: what's the optimum CI to maximize ELO gain per unit game/time/resource, given that the changes being tested are identical?

Regards

Zen
Zenmastur
Posts: 919
Joined: Sat May 31, 2014 8:28 am

Re: Maximum ELO gain per test game played?

Post by Zenmastur »

bob wrote:A CI of 95% means one of every 20 runs will produce a wrong answer. If you drop that to 75% do you REALLY want to accept one bad change for every three good ones? Or reject one good change for every three bad ones? That is getting dangerously close to constant regression.
OK... after reading this again, I see I didn't get the full implication of what you said to start with, and I wasn't thinking clearly about what each end of the CI meant. Nor did I consider the effect of the ratio of changes that would be accepted versus those that would be rejected. So let's try this a second time.

If 90% of all tests would fail if we knew their "true" ELO change with 100% confidence, and I can run 400 tests at a CI of 95% in the allotted time, then we have this:

Of the 360 tests that should fail, 2.5%, or 9, actually pass. Of the 40 that should pass, 2.5%, or 1, is actually rejected.

39 * X - 9 * X = + 30X ELO

If we assume that I can test 5 times as fast at a 75% CI:
Of the 1,800 tests that should fail, 25%, or 450, actually pass. Of the 200 that should pass, only 75%, or 150, actually do.

So we have:

150 * X - 450 * X = -300X ELO


A potential loss of 300X ELO if only 10% of the submitted tests are actually ELO gainers.

Assuming this analysis is a little closer to the truth of the matter, this scenario isn't really that appealing.

If we change the ratio of good ideas to bad ideas, this has a significant effect on the outcome if we leave the CI unchanged. It appears as if the CI used by the SF team (0.025 and 0.975) is symmetrical, and that isn't optimal for a highly asymmetrical acceptance ratio like 9 rejected to 1 accepted.

Of 400 tests at a 95% CI, 200 tests should fail, but 2.5%, or 5, actually pass. Of the 200 that should pass, 2.5%, or 5, actually fail, so we pass only 195. So:

195 * X - 5 * X = 190X ELO gain.

If I can test 5 times as fast with a 75% CI, I can run 2,000 tests in the same time. So, of 2,000 tests at a 75% CI, 1,000 should fail, but 25%, or 250, actually pass. Of the 1,000 that should pass, 25%, or 250, actually fail, so we pass only 750. So:

750 * X - 250 * X = 500X ELO gain

For a 2.63 X increase in ELO per unit time.
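
To make the bookkeeping explicit, the same toy model can be written as a few lines of Python (only a sketch of the back-of-the-envelope reasoning above: it assumes every good patch is worth +X, every bad patch that slips through costs -X, symmetric error rates at both ends of the interval, and the made-up 5x speed-up at a 75% CI):

Code:

def expected_gain(n_tests, p_good, false_accept, false_reject):
    """Expected net gain, in units of X, from a batch of tests: each accepted
    good patch is worth +X, each accepted bad patch costs -X."""
    n_good = n_tests * p_good
    n_bad = n_tests * (1.0 - p_good)
    accepted_good = n_good * (1.0 - false_reject)  # good patches that pass
    accepted_bad = n_bad * false_accept            # bad patches that slip through
    return accepted_good - accepted_bad

# 10% good ideas: 400 tests at a ~95% CI vs. 5 times as many tests at a ~75% CI
print(expected_gain(400, 0.10, 0.025, 0.025))   #  +30 X
print(expected_gain(2000, 0.10, 0.25, 0.25))    # -300 X

# 50% good ideas: the cheaper, sloppier test wins
print(expected_gain(400, 0.50, 0.025, 0.025))   # +190 X
print(expected_gain(2000, 0.50, 0.25, 0.25))    # +500 X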

So, it seems as if the ratio of "good" patches to "bad" patches plays an important role in CI selection. If most ideas are ELO losers, then why would you leave the interval at 0.025 and 0.975? It occurs to me that the number of games used in the tests could be held constant, with better ELO gains, if the bound at each end of the interval were adjusted to a level more appropriate for the expected ratio of good ideas to bad ones.

The problem for me is that I don't know what the governing equations are, and I'm sure that they are highly non-linear at either end of the distribution. So you couldn't do something like make the interval 0.0125 to 0.9625 and expect the number of games needed for the test to remain constant.

It also seems that if you have a high ratio of patches that gain ELO versus those that lose ELO, it would strongly favor reducing the CI as well as using an asymmetrical CI.

So I guess that's why you ask questions. It would be helpful if I could calculate the number of games required to accept or reject a patch with a specific ELO difference. I actually went to the trouble of finding this information, played with it a little, and then put it some place where I thought I could find it. Now that I actually have a use for it, I have no clue where I put it or where I got it from in the first place.

I looked at SF tests to get an idea of what the ratio of good tests to bad tests was, but there are so many repeated tests and modifications before a test is accepted or rejected that it would take a while for me to determine the actual ratio of accepted to rejected patches.

Regards,

Zen
Uri Blass
Posts: 10268
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Maximum ELO gain per test game played?

Post by Uri Blass »

Zenmastur wrote:
bob wrote:
Zenmastur wrote:I'm curious...

With limited computational resources, what confidence bounds should be used to maximize the ELO gain per test game played? I guess that to be technically accurate we should maximize the ELO gain per unit of computational resources, but to simplify things we will assume that all computers used are identical and that all games require the same amount of time to be played.

I'm curious if anyone has seriously looked at this. I know that during the Cold War the US sponsored a lot of math research into this type of problem. For example, it's very expensive to set up, maintain, and secure an ICBM test range. It's even more expensive to launch 20 or so ICBMs at specified targets. Reducing the number of missiles required would obviously save a lot of time and money. The idea was to minimize the number fired while still being able to determine whether the design criteria had been met, and to periodically re-certify the fleet as it ages.

Anyway, all I really want to know is what confidence interval maximizes ELO gain per unit of computation. I was thinking that a 95% CI was way too high for those of us who have limited resources available.

Regards,

Zen
A CI of 95% means one of every 20 runs will produce a wrong answer. If you drop that to 75% do you REALLY want to accept one bad change for every three good ones? Or reject one good change for every three bad ones? That is getting dangerously close to constant regression.
Well... they (meaning the SF team) take the center 95%. Isn't one end of the distribution favorable to an ELO gain? If so, wouldn't you want to use all of that end? Or am I confusing this with a "normal" distribution?


If I have many more tests to run than I have resources with which to run them (I assume this describes most people on this board), then if it requires 100 units of computer resources to run a single test at a CI of 95% and only 20 units (this isn't the "real" number, I just picked one at random) to run the same test at a CI of 75%, you can run 5 times as many tests with the same resources. Example:

19 * "X" elo - 1 * "X" elo = + 18X ELO using 95% CI

Versus

75 * "X" elo - 25 * "X" elo = 50X elo using 75% CI

For the same resources we have a 2.78 X increase in ELO gain per game played/unit time/unit resource etc.

Now, I know that the difference in the number of games that I used in this example isn't accurate. But I also know that the difference in the number of games required isn't linear in the CI, i.e. the number of games required to produce a CI of 95% isn't 0.95/0.75 = 1.267 times the number needed to produce a CI of 75%. I'm sure it's much greater. Therefore, I have high expectations that a 95% CI is MASSIVE overkill for this application "IF" the end user has insufficient CPU resources for the number of tests they need to run in a given time.

So the short answer is: YES, I believe that a 95% CI is way too high. What I want to know is: what's the optimum CI to maximize ELO gain per unit game/time/resource, given that the changes being tested are identical?

Regards

Zen
95% confidence does not mean that 95% of the patches that pass are improvements; the real fraction may be clearly less than that.

If almost every patch that you test is good, then 75% is clearly better than 95% for getting a fast improvement, but in this case you may get an improvement even without testing.

If 50% of the patches that you test are good then you certainly need testing and I guess that 75% may be better for fast improvement.

If only 5% of the patches that you test are good then I guess that 95% is practically better than 75% and 75% may lead to a regression but I am not sure about it.

Without having some opinion about the distribution of the value of your patches, it is impossible to have a basis for suggesting the optimal decision for improving your engine.

Even the percentage of good patches is not enough, because the question is also what the distribution of the bad patches looks like.

If the bad patches include big regressions, then the probability for them to pass is close to 0 and there is no risk that you accept a big regression.

The main problem is if the bad patches that you test include many small regressions of 1 elo or less.

In the last case (assuming a big majority of your tests show no improvement), 75% confidence may lead you to accept too many lucky bad patches that lose 1 elo, and you may not earn enough from good patches to compensate for it.
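
To see the first point numerically, here is a minimal Bayes' rule sketch in Python (the error rates are illustrative: symmetric 2.5% and 25% rates standing in for rough 95% and 75% tests):

Code:

def fraction_passed_that_are_good(p_good, alpha, beta):
    """P(patch is good | patch passed): alpha is the chance a bad patch passes,
    beta is the chance a good patch gets rejected."""
    passed_good = p_good * (1.0 - beta)
    passed_bad = (1.0 - p_good) * alpha
    return passed_good / (passed_good + passed_bad)

# With only 10% good patches, even the stricter test passes a fair amount of junk:
print(fraction_passed_that_are_good(0.10, 0.025, 0.025))  # ~0.81
print(fraction_passed_that_are_good(0.10, 0.25, 0.25))    # ~0.25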
Zenmastur
Posts: 919
Joined: Sat May 31, 2014 8:28 am

Re: Maximum ELO gain per test game played?

Post by Zenmastur »

Uri Blass wrote: 95% confidence does not mean that 95% of the patches that pass are improvements; the real fraction may be clearly less than that.

If almost every patch that you test is good, then 75% is clearly better than 95% for getting a fast improvement, but in this case you may get an improvement even without testing.
After I made that post, I started playing with some numbers and noticed that the ratio of good to bad patches is important. Thus my second post since I could no longer modify my first.
Uri Blass wrote:If 50% of the patches that you test are good then you certainly need testing and I guess that 75% may be better for fast improvement.


I agree in principle, but what I really want is to know exactly what the interval should be for any particular ratio. Right now I think it should be asymmetrical if the ratio of good to bad patches is asymmetrical.
Uri Blass wrote:If only 5% of the patches that you test are good then I guess that 95% is practically better than 75% and 75% may lead to a regression but I am not sure about it.
Again, after playing with some numbers, I agree. But without being able to determine the exact game count required for a particular ELO change with a particular CI, it's difficult to say what the correct CI should be.
Uri Blass wrote:Without having some opinion about the distribution of the value of your patches, it is impossible to have a basis for suggesting the optimal decision for improving your engine.

Even the percentage of good patches is not enough, because the question is also what the distribution of the bad patches looks like.
It seems that this could be partially mitigated if an asymmetrical interval is used.
Uri Blass wrote:If the bad patches include big regressions, then the probability for them to pass is close to 0 and there is no risk that you accept a big regression.

The main problem is if the bad patches that you test include many small regressions of 1 elo or less.

In the last case (assuming a big majority of your tests show no improvement), 75% confidence may lead you to accept too many lucky bad patches that lose 1 elo, and you may not earn enough from good patches to compensate for it.
Well, to start with, I don't think I'll be dealing with too many patches that are +/- 1 Elo, as they take too many games. I was thinking about early termination and temporary rejection if game counts go past some fixed value; you could just increase the required ELO gain to match the error band plus some small amount. This would make you reject very small improvements too, but with a new engine these probably aren't the ones you want to spend time on anyway, at least until the program's rating is up in the high 2000s. Then, when more CPU resources are available, I can go back and retest at a later date with more "normal" requirements.


Regards,

Zen
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Maximum ELO gain per test game played?

Post by Laskos »

I guess I can concoct something for the Stockfish framework using SPRT, but I need another quantifier, the "actual probability of a patch being good", Pgood. Without introducing that, it seems the problem is not well posed.

In SPRT the length of the match (inversely proportional to efficiency) is linear in Log[Confidence/(1-Confidence)]. Then the efficiency is given by

Code:

(Pgood + Confidence - 1)/Log[Confidence/(1 - Confidence)]
I plotted here the efficiency for Pgood=10%,20%,30%,40%,50%

[Plot: efficiency vs. confidence for Pgood = 10%, 20%, 30%, 40%, 50%]

Funny thing, if the probability of a good submit is 50%, then for best efficiency one doesn't need to play games at all :) (asymptotically).

Finding the optimum requires taking the derivative with respect to Confidence and solving an equation; the procedure is not very illuminating, so I will give only the Mathematica code:

Code:

ContourPlot[((-1 + Confid + Pgood)/((-1 + Confid) Confid) + 
   Log[Confid/(1 - Confid)])/Log[Confid/(1 - Confid)]^2 == 0, {Pgood, 
  0.0, 0.5}, {Confid, 0.5, 1}]
And the final plot shows the optimum confidence for best efficiency of the Stockfish framework as a function of the estimated probability of good patches, Pgood:

[Plot: optimum confidence for best efficiency as a function of Pgood]

This was a 15-minute detour, I hope I didn't mess something up.
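
For those without Mathematica, the same optimum can also be found with a crude numerical grid search over the confidence, taking the efficiency expression above as given (a Python sketch, not a derivation):

Code:

import numpy as np

def efficiency(confidence, p_good):
    """The efficiency proxy above: (Pgood + Confidence - 1) / Log[Confidence / (1 - Confidence)]."""
    return (p_good + confidence - 1.0) / np.log(confidence / (1.0 - confidence))

def optimal_confidence(p_good):
    """Confidence level that maximizes the efficiency proxy, found by grid search."""
    grid = np.linspace(0.501, 0.999, 4981)   # step of 0.0001
    return grid[np.argmax(efficiency(grid, p_good))]

for p_good in (0.05, 0.10, 0.20, 0.30, 0.50):
    print(f"Pgood = {p_good:.2f} -> optimal confidence ~ {optimal_confidence(p_good):.3f}")

For Pgood around 10% this lands near 0.98, and as Pgood approaches 50% the optimum slides toward 0.5, matching the remark above about not needing to play games at all.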
Zenmastur
Posts: 919
Joined: Sat May 31, 2014 8:28 am

Re: Maximum ELO gain per test game played?

Post by Zenmastur »

Laskos wrote:I guess I can concoct something for the Stockfish framework using SPRT, but I need another quantifier, the "actual probability of a patch being good", Pgood. Without introducing that, it seems the problem is not well posed.
It would also be interesting to go back through patches tried on SF and see what the current ratio is versus what it was when they began keeping records, and then calculate what could have been gained using a different CI.

I wonder if there is a way to estimate the probability that a patch will pass based on the early, less expensive tests. If there is, then adjusting the CI to maximize the efficiency at that probability would be possible.
Laskos wrote:In SPRT the length of the match (inversely proportional to efficiency) is linear in Log[Confidence/(1-Confidence)]. Then the efficiency is given by

Code:

(Pgood + Confidence - 1)/Log[Confidence/(1 - Confidence)]
I plotted here the efficiency for Pgood=10%,20%,30%,40%,50%

[Plot: efficiency vs. confidence for Pgood = 10%, 20%, 30%, 40%, 50%]

Funny thing, if the probability of a good submit is 50%, then for best efficiency one doesn't need to play games at all :) (asymptotically).

Finding the optimum requires taking the derivative with respect to Confidence and solving an equation; the procedure is not very illuminating, so I will give only the Mathematica code:

Code:

ContourPlot[((-1 + Confid + Pgood)/((-1 + Confid) Confid) + 
   Log[Confid/(1 - Confid)])/Log[Confid/(1 - Confid)]^2 == 0, {Pgood, 
  0.0, 0.5}, {Confid, 0.5, 1}]
And the final plot shows the optimum confidence for best efficiency of the Stockfish framework as a function of the estimated probability of good patches, Pgood:

[Plot: optimum confidence for best efficiency as a function of Pgood]

This was a 15-minute detour, I hope I didn't mess something up.
Well, those are very nice graphs. So early in the development cycle, when a large percentage of patches will produce substantial ELO gains, you can get away with very few test games, and these should be done with reduced CIs. As the engine gains strength the CI should be tightened, and many more test games will be needed, since the "average" ELO gain per patch is likely to be smaller as the program matures.

Is there any significance to the funny looking kink at the top left of the second graph?

From looking at the graphs it appears that if the ratio is 9 to 1 in favor of a patch being rejected, then the CI should be something like 98% (or higher; it's hard to tell the exact value from the graph), not 95%. But I'm still curious about using an asymmetrical CI. If it's not too complicated, would it be possible for you to try to optimize the efficiency while allowing each tail to assume a value independent of the other tail? E.g. for a 95% CI both tails are usually 0.025 wide; instead I'm thinking that with a highly skewed acceptance ratio, like 10%, accepting 1% of bad patches while rejecting 4% of good patches would produce greater efficiency. I'm sure 1% and 4% aren't the right numbers, but it would be nice to know how much more efficient an optimized pair would be.

Regards,

Zen
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Maximum ELO gain per test game played?

Post by Laskos »

I see now that introducing Pgood was tackled in an earlier post by Forrest, and it's correct; otherwise the problem is not defined.

I think that the efficiency of the Stockfish framework can be summed up by the following:

1/ Patches are better to be large Elo improvements or large Elo losses, and to lie outside (H0,H1). In SPRT, the number of games is inversely proportional to (scale_factor)^2 outside the interval (H0,H1). So (H1=5, H0=0, real difference 15 Elo points) needs 3^2 = 9 times as many games as (H1=15, H0=0, real difference 45 Elo points); 3 is the scale factor here (see the simulation sketch after this list).

2/ It's important for better efficiency to keep the proportion of good patches high. For optimal confidences not far from 1, the required confidence is (from the plot) ~ 1 - Pgood^2, which relates to the number of required games and efficiency.

3/ It's important to choose H0 and H1 such that the real Elo is not somewhere in between them.

4/ Using the p-value as a stopping rule can be misleading. I have a plot which shows how counter-intuitive it can be (playing games at Pgood > 50%).
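
To illustrate point 1/, here is a rough Monte Carlo sketch in Python of a plain Wald SPRT on a simplified win/loss model (no draws, alpha = beta = 0.05). The absolute game counts are not those of the real framework, which also handles draws, but the roughly 9x scale factor shows up:

Code:

import math
import random

def expected_score(elo):
    """Logistic expected score for a given Elo difference."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

def sprt_length(elo0, elo1, elo_true, alpha=0.05, beta=0.05, rng=random):
    """Run one SPRT of H0: elo = elo0 vs. H1: elo = elo1 on win/loss games
    played at a true strength of elo_true; return the number of games used."""
    p0, p1, pt = expected_score(elo0), expected_score(elo1), expected_score(elo_true)
    llr_win = math.log(p1 / p0)
    llr_loss = math.log((1.0 - p1) / (1.0 - p0))
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    llr, games = 0.0, 0
    while lower < llr < upper:
        games += 1
        llr += llr_win if rng.random() < pt else llr_loss
    return games

def mean_length(elo0, elo1, elo_true, trials=500, seed=1):
    rng = random.Random(seed)
    return sum(sprt_length(elo0, elo1, elo_true, rng=rng) for _ in range(trials)) / trials

n_narrow = mean_length(0, 5, 15)   # bounds (0, 5), true gain +15 Elo
n_wide = mean_length(0, 15, 45)    # everything scaled up by a factor of 3
print(n_narrow, n_wide, n_narrow / n_wide)   # the ratio comes out roughly 9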