After looking at this graph you posted:
I got to thinking: wouldn't it be nice if we could simultaneously calculate multiple statistics and then stop when any one of them reaches a valid bound? E.g., calculate the statistic(s) for the Elo bounds (0, 4), (3, 8), and (7, 16) after each game played. When any one of these is accepted, you're done. If the (7, 16) bound is rejected, you would continue, because there is still a chance that either the (3, 8) bound or the (0, 4) bound could be accepted with more games. I think a change like this would cut the number of games required by a substantial amount.
Furthermore, we could also introduce bounds like (-2, 2), (-4, -1), and (-8, -3). The point of these calculations would be that if one of them accepts H0, we can be pretty sure that the patch is an Elo loser and there is no need to continue the test.
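The scheme above can be sketched in code. To be clear, this is a minimal illustration and not cutechess-cli's implementation: it assumes a simple trinomial model with a fixed draw ratio (cutechess-cli derives its probabilities from a BayesElo-based model), and all function names here are my own. One LLR statistic is kept per bound; accepting H1 on any bound stops the test, while a bound whose LLR hits the lower threshold is simply dropped.

```python
import math

def expected_score(elo):
    """Logistic expected score for an Elo advantage."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

def trinomial_probs(elo, draw_ratio=0.3):
    """Win/draw/loss probabilities for an Elo advantage, assuming a
    fixed draw ratio (a simplification; cutechess-cli uses a
    BayesElo-based model instead)."""
    win = expected_score(elo) - draw_ratio / 2.0
    return win, draw_ratio, 1.0 - win - draw_ratio

def llr(wins, draws, losses, elo0, elo1):
    """Log-likelihood ratio of H1 (Elo = elo1) vs H0 (Elo = elo0)."""
    w0, d0, l0 = trinomial_probs(elo0)
    w1, d1, l1 = trinomial_probs(elo1)
    # With a fixed draw ratio, draws contribute log(d1/d0) = 0.
    return (wins * math.log(w1 / w0) +
            draws * math.log(d1 / d0) +
            losses * math.log(l1 / l0))

def multi_sprt(results, bounds, alpha=0.05, beta=0.05):
    """Run one SPRT per (elo0, elo1) bound over a single game stream.
    Stop as soon as ANY bound accepts H1; a bound whose LLR falls to
    the lower threshold merely drops out, and the test continues with
    the remaining bounds.  results: +1 win, 0 draw, -1 loss."""
    upper = math.log((1.0 - beta) / alpha)   # accept H1
    lower = math.log(beta / (1.0 - alpha))   # accept H0 (drop bound)
    active = list(bounds)
    w = d = l = 0
    for game, r in enumerate(results, start=1):
        if r > 0:
            w += 1
        elif r < 0:
            l += 1
        else:
            d += 1
        for b in list(active):
            stat = llr(w, d, l, b[0], b[1])
            if stat >= upper:
                return 'H1', b, game     # some bound accepted H1: stop
            if stat <= lower:
                active.remove(b)         # this bound accepted H0: drop it
        if not active:
            return 'H0', None, game      # every bound rejected: Elo loser
    return 'undecided', None, w + d + l
```

Feeding in a long winning streak, the widest bound reaches its upper threshold first and stops the whole test early, which is exactly the early-exit behavior described above.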
In a different thread
Laskos wrote: I played a bit with SPRT of cutechess-cli and a true value of 8 Elo points difference. It seems a bit of an art.
Code:
H0   H1   accepted   games   LOS
 0    2      H1      15700   99.99%
 0    3      H1      12000   99.95%
 0    5      H1       6800   99.70%
 0   10      H1       5800   99.28%
 0   15      H1       2300   99.25%
 0   20      H0       8000   99.61%
 0   25      H0       2900   95.60%
H0 is always 0 points here, alpha=beta=0.05, and the true difference is 8 Elo points. For these alpha, beta: if one sets H1 too low, one will play more games before stopping on H1 than it would take to reach 99.9% LOS. If one sets H1 reasonably, one saves ~20% time compared to stopping at LOS 99.9%. If one sets H1 too large, H0 is picked for stopping as a false negative, even while the LOS of the stronger engine can be 99.6%. This is frustrating, as a few more games would give a true positive, but one has to reset H1 to a lower value and redo the test.
I am not claiming that LOS is in fact a stopping rule, but in the range of 50-50,000 games (including draws), where the Type I error is smaller than 5%, stopping at an LOS of 99.9% is only ~20% slower in stopping on a true positive or true negative, without this art of guessing which hypothesis H1 to choose. Sure, SPRT is robust in handling alpha and beta errors and in setting the epsilon interval, and should be used as such, but it will not prevent me from stopping whenever I see a 3.1 SD advantage for one engine in a 50-50,000 games match, if not using SPRT.
After reading this post, I decided that the optimal bound for reducing the number of games required is the upper bound set to twice the expected Elo of the patch, assuming that the lower bound is zero, and that H0 is highly likely to be accepted if the bound is made higher than this. This seems to fit the data you gave in the post. So I'm not so sure it's an art to pick the bounds, as you put it. I think that for the lowest number of games, multiple bounds should be tested simultaneously. If asymmetrical CIs are used, then these tests would seem sufficient to increase efficiency to near-maximum values. My understanding is that calculating the current LLR for a single set of bounds doesn't take much time compared to playing a single game, so I assume that several bounds could be calculated without adding significantly to the time used. Is there any reason you can think of that doing this would produce invalid results? I can't think of any reason that this couldn't be done.
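One way to test the "set H1 to twice the expected Elo" rule of thumb, rather than inferring it from a single run per setting, would be to simulate SPRT stopping times and average them over many runs. A rough sketch, again under an assumed fixed-draw-ratio trinomial model rather than cutechess-cli's actual model, with H0 fixed at 0 Elo as in Laskos's table:

```python
import math
import random

def sprt_games(true_elo, elo1, alpha=0.05, beta=0.05, draw_ratio=0.3,
               max_games=100000, rng=None):
    """Simulate a single SPRT run of H0: Elo = 0 vs H1: Elo = elo1,
    with games drawn from a trinomial model at the true Elo.
    Returns (verdict, games played)."""
    rng = rng or random.Random()

    def win_loss(elo):
        score = 1.0 / (1.0 + 10.0 ** (-elo / 400.0))
        return score - draw_ratio / 2.0, 1.0 - score - draw_ratio / 2.0

    w_true, l_true = win_loss(true_elo)
    w0, l0 = win_loss(0.0)
    w1, l1 = win_loss(elo1)
    upper = math.log((1.0 - beta) / alpha)   # accept H1
    lower = math.log(beta / (1.0 - alpha))   # accept H0
    llr = 0.0
    for n in range(1, max_games + 1):
        u = rng.random()
        if u < w_true:
            llr += math.log(w1 / w0)         # win
        elif u < w_true + l_true:
            llr += math.log(l1 / l0)         # loss
        # draws leave the LLR unchanged under a fixed draw ratio
        if llr >= upper:
            return 'H1', n
        if llr <= lower:
            return 'H0', n
    return 'undecided', max_games
```

Averaging the returned game counts over a few hundred runs for each candidate elo1 (say 4, 8, 16, and 25, with true_elo=8) would show both the mean stopping time and the false-negative rate for each choice, which is the data needed to judge where the optimum actually sits.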
It also seems that, given some idea of the Elo bounds of the individual patches taken as a group, the percentages of patches that fall into each of the different Elo groups could be used to tune the CIs used for each different set of Elo bounds. A kind of self-tuning.
Another idea, for those reluctant to accept asymmetrical CIs:
If a symmetrical CI is used (I'm not sure why anyone would want to do this), then it might be best to also calculate the LOS after each game, at least until the accumulation of Type I errors exceeds the given CI. If the LOS reaches 99.9% (3.1 SD), the test has reached a valid stopping point. The LOS should be calculated both ways: that program 1 (P1), the program being tested, is superior to program 2 (P2), the reference program, at an LOS of 99.9%, AND that P2 is superior to P1 at an LOS of 99.9%. The first test (P1 superior to P2) is a valid early stopping point for accepting H1, and the second test (P2 superior to P1) is a valid early stopping point for accepting H0. This would, in effect, combine the best parts of both tests, so that in the graph the efficiency line for this compound test would follow the 3 SD line until it roughly intercepted the +/- 5% line of SPRT.
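A sketch of this two-sided check, using the LOS formula from Laskos's post quoted further down. Since erf is an odd function, LOS(P2 > P1) = 1 - LOS(P1 > P2), so both directions fall out of a single computation:

```python
import math

def los(wins, losses):
    """Likelihood of superiority of P1 over P2 (draws drop out)."""
    if wins + losses == 0:
        return 0.5
    return 0.5 * (1.0 + math.erf((wins - losses) /
                                 math.sqrt(2.0 * (wins + losses))))

def los_stop(wins, losses, threshold=0.999):
    """Two-sided early stop.  Accept H1 when LOS(P1 > P2) >= threshold;
    accept H0 when LOS(P2 > P1) >= threshold, which is the same as
    LOS(P1 > P2) <= 1 - threshold.  Otherwise keep playing."""
    p = los(wins, losses)
    if p >= threshold:
        return 'accept H1'   # P1 superior at 99.9% LOS
    if p <= 1.0 - threshold:
        return 'accept H0'   # P2 superior at 99.9% LOS
    return 'continue'
```

The symmetry means only one erf evaluation per game is needed, so the per-game cost is negligible next to playing the game itself.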
So the question, for those who insist on using symmetrical CIs (I assume there will be a few holdouts): is there a simple way to calculate LOS's accumulated Type I errors, as shown in your post? By simple, I mean simple enough that it could easily be incorporated into a program using the standard math functions available in most math libraries.
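I don't know of a closed form, but the accumulated Type I error is straightforward to estimate by simulation using only standard math functions: simulate equal-strength engines (H0 true, so each decisive game is a coin flip) and count the runs that ever cross the LOS threshold within N games. A sketch along these lines; the function names and trial count are my own:

```python
import math
import random

def type1_rate(n_games, threshold=0.999, trials=2000, rng=None):
    """Estimate the accumulated Type I error of the rule 'stop when
    LOS >= threshold' over n_games decisive games, by simulating
    equal-strength engines and counting runs that ever cross it."""
    rng = rng or random.Random()

    # Invert the LOS formula once:  LOS >= threshold  is equivalent to
    # (wins - losses) >= z * sqrt(wins + losses),  z = sqrt(2)*erfinv(2t-1).
    # Bisect erf so only the standard library is needed (no erfinv).
    def erfinv(y, lo=0.0, hi=6.0):
        for _ in range(60):
            mid = (lo + hi) / 2.0
            if math.erf(mid) < y:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2.0

    z = math.sqrt(2.0) * erfinv(2.0 * threshold - 1.0)  # ~3.09 for 99.9%
    false_positives = 0
    for _ in range(trials):
        diff = 0  # wins - losses under H0: a fair +/-1 random walk
        for n in range(1, n_games + 1):
            diff += 1 if rng.random() < 0.5 else -1
            if diff >= z * math.sqrt(n):
                false_positives += 1
                break
    return false_positives / trials
```

Running this for the N values in the quoted table should reproduce numbers of the same order (e.g. roughly 1% at N=100 for LOS=0.999), with Monte Carlo noise shrinking as the trial count grows.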
Laskos wrote: If one is stopping early a match of planned N games (not shorter than 50 games) as soon as the Likelihood Of Superiority (LOS) reaches a certain value, and is not using SPRT as a stopping rule, he should be aware that a fixed LOS steadily accumulates Type I errors with the number N of planned games. Here is a table of Type I errors with the number of games (in fact wins+losses, as LOS is independent of draws).
Code:
TYPE I ERROR
N Games   LOS=0.95   LOS=0.99   LOS=0.999
    100      27%        7.1%       1.2%
    200      38%       11.5%       1.7%
    400      49%       14.6%       2.3%
    800      56%       17.8%       2.7%
   1500      62%       21.9%       2.9%
   3000      68%       24.0%       3.6%
   5000      72%       25.8%       4.4%
  10000      76%       28.9%       4.8%
  30000      82%       33.9%       5.4%
LOS of 95% is totally useless as an early stopping rule. If one wants to have a Type I error of less than 5% for N up to ~30,000 (wins+losses) games, an LOS of 99.9% could be used as an early stopping rule. One can stop the match as soon as LOS gets to 99.9%, if the match is shorter than 50,000 games and longer than 50 games. LOS is easy to calculate as (1 + Erf[(wins - losses)/Sqrt{2*(wins + losses)}])/2. Or just use SPRT of cutechess-cli.
Anyway, I will be traveling for several days starting Thursday morning. I'm not sure how much internet access I'll have, and I will be busy most of the time, so I thought I would give you a few things to think about and test when the new model is ready. If it turns out that several statistics can be calculated and used, then it would be interesting to know what the best set of such bounds would be, given a limited number of them (i.e., 4 to 8).
Regards,
Zen
Only 2 defining forces have ever offered to die for you.....Jesus Christ and the American Soldier. One died for your soul, the other for your freedom.