Maximum ELO gain per test game played?

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Zenmastur
Posts: 919
Joined: Sat May 31, 2014 8:28 am

Re: Maximum ELO gain per test game played?

Post by Zenmastur »

Michel wrote:
Zenmastur wrote: So, in effect, their "High Tech" advantage has been largely squandered by poor framework design.
Ah ok. Good to know this.
Why is this good to know?

If it turns out that there isn't a good reason to do this from an efficiency standpoint, I would think it a TERRIBLE thing to know. It means they have probably cut their efficiency roughly in half. Besides, there's a VERY easy fix for this that would probably take all of 10 seconds to implement.

Regards,

Zen
Only 2 defining forces have ever offered to die for you.....Jesus Christ and the American Soldier. One died for your soul, the other for your freedom.
Uri Blass
Posts: 10267
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Maximum ELO gain per test game played?

Post by Uri Blass »

The reason for (-1.5,4.5) at stage 1 is to increase the probability that patches which help mainly at long time control pass.

If you use (0,6) at stage 1, such patches will probably not pass at short time control, so you reject positive patches.

If the target is to optimize stockfish for 1 minute per game and not for long time control then my opinion is that it is more logical to use (-1.5,4.5) also for stage 2.

Note that for tuning weights the Stockfish framework uses SPRT(0,4) twice, which I find inconsistent, so I think it is better to use SPRT(-1,3) at stage 1 and SPRT(0,4) at stage 2 for tuning.
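The effect of the bounds can be sketched numerically. The following is a rough, self-contained illustration, not fishtest's actual implementation: it uses a trinomial game model with an assumed fixed 35% draw rate and a Wald/Brownian approximation to the SPRT pass probability, so treat the numbers as qualitative only.

```python
import math

def wdl(elo, draw=0.35):
    """Win/draw/loss probabilities from a logistic Elo model with an
    assumed fixed draw rate (an illustrative simplification)."""
    s = 1.0 / (1.0 + 10.0 ** (-elo / 400.0))
    return (s - draw / 2, draw, 1.0 - s - draw / 2)

def pass_prob(true_elo, elo0, elo1, alpha=0.05, beta=0.05):
    """Wald/Brownian approximation to the probability that
    SPRT(elo0, elo1) accepts a patch whose true strength is true_elo."""
    lo = math.log(beta / (1 - alpha))      # LLR bound: accept H0
    hi = math.log((1 - beta) / alpha)      # LLR bound: accept H1
    pt, p0, p1 = wdl(true_elo), wdl(elo0), wdl(elo1)
    inc = [math.log(a / b) for a, b in zip(p1, p0)]   # per-result LLR step
    mu = sum(p * x for p, x in zip(pt, inc))          # LLR drift per game
    var = sum(p * x * x for p, x in zip(pt, inc)) - mu * mu
    g = 2 * mu / var
    if abs(g) < 1e-9:                      # zero-drift limiting case
        return -lo / (hi - lo)
    return (1 - math.exp(-g * lo)) / (math.exp(-g * hi) - math.exp(-g * lo))

# A patch worth +1 Elo at short TC: (-1.5,4.5) passes it far more often
# than (0,6) would, which is the point of the looser stage-1 bounds.
for bounds in ((-1.5, 4.5), (0, 6)):
    print(f"SPRT{bounds}: pass probability ~ {pass_prob(1.0, *bounds):.2f}")
```

Under these assumptions a small positive patch is noticeably more likely to survive stage 1 with (-1.5,4.5) than with (0,6), at the cost of also letting through more regressions.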
Uri Blass
Posts: 10267
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Maximum ELO gain per test game played?

Post by Uri Blass »

I can add that I disagree with the target of getting an improvement as fast as possible. I think that gaining knowledge is more interesting, and may be more productive if you think some years forward and not only one or two months forward. So my opinion is that it is better for the Stockfish team to play 100,000 games at both short and long time control for every patch they accept, to have a good evaluation of the patch's value in Elo points at both time controls.

I think that this knowledge can help to suggest better patches in the future.
Zenmastur
Posts: 919
Joined: Sat May 31, 2014 8:28 am

Re: Maximum ELO gain per test game played?

Post by Zenmastur »

Uri Blass wrote:The reason for (-1.5,4.5) at stage 1 is to increase the probability that patches which help mainly at long time control pass.

If you use (0,6) at stage 1, such patches will probably not pass at short time control, so you reject positive patches.

If the target is to optimize stockfish for 1 minute per game and not for long time control then my opinion is that it is more logical to use (-1.5,4.5) also for stage 2.

Note that for tuning weights the Stockfish framework uses SPRT(0,4) twice, which I find inconsistent, so I think it is better to use SPRT(-1,3) at stage 1 and SPRT(0,4) at stage 2 for tuning.
How do lower stage-one bounds help a patch pass the stage-two test?

Since it still must pass a stage-two test, you end up doing the test twice with no additional benefit.

Regards,

Zen
Zenmastur
Posts: 919
Joined: Sat May 31, 2014 8:28 am

Re: Maximum ELO gain per test game played?

Post by Zenmastur »

Uri Blass wrote:I can add that I disagree with the target of getting an improvement as fast as possible. I think that gaining knowledge is more interesting, and may be more productive if you think some years forward and not only one or two months forward. So my opinion is that it is better for the Stockfish team to play 100,000 games at both short and long time control for every patch they accept, to have a good evaluation of the patch's value in Elo points at both time controls.

I think that this knowledge can help to suggest better patches in the future.
I think it was Marco who pointed out that no top engine uses long time controls for testing, the reason being that only a very limited number of tests could be run with a given set of resources. Also, I think it would be very hard to convince people to lend you their computers for testing if the results were glacially slow in coming. If your opponents can test 100 times as many patches as you, then after a very short period of time they will have an advantage that can't be overcome in any reasonable-length match, i.e. one longer than 10 or so games.

So, I think getting the most from available resources is a must unless you have an unlimited budget. Most people don't, and they don't want to spend the rest of their natural lives waiting for a set of tests to complete.

Regards,

Zen
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Maximum ELO gain per test game played?

Post by Laskos »

Zenmastur wrote:
Michel wrote:
Zenmastur wrote: So, in effect, their "High Tech" advantage has been largely squandered by poor framework design.
Ah ok. Good to know this.
Why is this good to know?

If it turns out that there isn't a good reason to do this from an efficiency standpoint, I would think it a TERRIBLE thing to know. It means they have probably cut their efficiency roughly in half. Besides, there's a VERY easy fix for this that would probably take all of 10 seconds to implement.

Regards,

Zen
Even at half the efficiency, SPRT 5% 5% is still miles ahead of the usual "outside error margins" (usually 2 SD) argument. SPRT 5% 5% brings not only fairly close to optimum efficiency, but some discipline too. What do even experienced testers/developers do? They rarely use SPRT; instead they run scripts for rating tools showing Elo and error margins (or LOS or p-value). And the main problem is not even that. They often do it negligently.

Say you plan a match of 12000 games. From time to time you glance at the intermediate result, and after 7000 games the long-expected outcome is apparent: it's "outside error margins", so you stop. Inadvertently, this sort of negligence accumulates the dangerous Type I error at a fast pace.

I made simulations for a match of 12000 games using 2 SD (CI = 95.4%): if the tester glances at the Elo and error margins after every 1000 games (12 glimpses maximum) in the hope of a clear outcome, then the "bad" Type I error climbs from the theoretical 2.3% to 9.1%. I plotted here the efficiency of the many developers/testers who use "Sloppy 2 SD".

Image

SPRT 5% 5% is 3-5 times faster in the relevant range, with "Sloppy 2 SD" breaking down at 16% Pgood. Even at half the efficiency, SPRT is a big step forward from the artful use of error margins (or LOS or p-value). In fact, for the artists I would recommend using 3 SD instead of 2 SD, as the Type I error due to indiscipline is much tamer.
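The inflation from peeking is easy to reproduce. Below is a hypothetical Monte Carlo sketch of the "Sloppy 2 SD" tester: two equal engines, a look after every 1000 games, stopping as soon as the score is 2 SD above 50%. Each 1000-game chunk is drawn from a normal (CLT) approximation, and the per-game score variance assumes a 35% draw rate, so exact numbers will differ slightly from the simulation described above.

```python
import math
import random

def sloppy_two_sd(trials=20000, looks=12, chunk=1000,
                  var_per_game=0.1625, seed=1):
    """Fraction of equal-strength matches that a peeking tester calls a
    'significant win' because some intermediate look lands 2 SD above 50%.
    var_per_game = 0.1625 corresponds to a 35% draw rate (an assumption)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        total, n = 0.0, 0
        for _ in range(looks):
            # CLT approximation: chunk score sum ~ N(0.5*chunk, var*chunk)
            total += rng.gauss(0.5 * chunk, math.sqrt(var_per_game * chunk))
            n += chunk
            if total / n > 0.5 + 2 * math.sqrt(var_per_game / n):
                hits += 1          # tester stops early and declares a "win"
                break
    return hits / trials

print(f"one look : {sloppy_two_sd(looks=1):.1%}")   # ~ the nominal 2.3%
print(f"12 looks : {sloppy_two_sd():.1%}")          # inflated Type I error
```

With 12 looks the one-sided false-positive rate comes out around 9-10% in this toy model, in line with the 2.3% to 9.1% jump quoted above.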
Zenmastur
Posts: 919
Joined: Sat May 31, 2014 8:28 am

Re: Maximum ELO gain per test game played?

Post by Zenmastur »

Laskos wrote:
Zenmastur wrote:
Michel wrote:
Zenmastur wrote: So, in effect, their "High Tech" advantage has been largely squandered by poor framework design.
Ah ok. Good to know this.
Why is this good to know?

If it turns out that there isn't a good reason to do this from an efficiency standpoint, I would think it a TERRIBLE thing to know. It means they have probably cut their efficiency roughly in half. Besides, there's a VERY easy fix for this that would probably take all of 10 seconds to implement.

Regards,

Zen
Even at half the efficiency, SPRT 5% 5% is still miles ahead of the usual "outside error margins" (usually 2 SD) argument. SPRT 5% 5% brings not only fairly close to optimum efficiency, but some discipline too. What do even experienced testers/developers do? They rarely use SPRT; instead they run scripts for rating tools showing Elo and error margins (or LOS or p-value). And the main problem is not even that. They often do it negligently.

Say you plan a match of 12000 games. From time to time you glance at the intermediate result, and after 7000 games the long-expected outcome is apparent: it's "outside error margins", so you stop. Inadvertently, this sort of negligence accumulates the dangerous Type I error at a fast pace.

I made simulations for a match of 12000 games using 2 SD (CI = 95.4%): if the tester glances at the Elo and error margins after every 1000 games (12 glimpses maximum) in the hope of a clear outcome, then the "bad" Type I error climbs from the theoretical 2.3% to 9.1%. I plotted here the efficiency of the many developers/testers who use "Sloppy 2 SD".

Image

SPRT 5% 5% is 3-5 times faster in the relevant range, with "Sloppy 2 SD" breaking down at 16% Pgood. Even at half the efficiency, SPRT is a big step forward from the artful use of error margins (or LOS or p-value).
I don't doubt that what you say is true. I'm sure many developers ruin their own results by not sticking to their chosen regime. But their mistakes don't justify making mistakes yourself.

So pray tell: is there a way, in your opinion, to make the net result of all Type I errors Elo-neutral? If so, how difficult would it be to back that up with some math?

Regards,

Zen
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Maximum ELO gain per test game played?

Post by Laskos »

Zenmastur wrote:
So pray tell: is there a way, in your opinion, to make the net result of all Type I errors Elo-neutral? If so, how difficult would it be to back that up with some math?

Regards,

Zen
I didn't quite understand the problem. If false positives are to accumulate to 0 Elo, then some pretty absurd things seem to happen. First, in my model, the dependence of efficiency on the Type I error disappears. Well, I can make it disappear asymptotically, keeping the Elo impact of false positives low. Then, setting higher Elo bounds for "good submits" in order for false positives not to accumulate Elo must be compensated by a lower Pgood. Under these conditions, I can plot the optimal errors for low Elo impact of false positives and low P_very_good. I am not sure I understood the problem.

Image
Zenmastur
Posts: 919
Joined: Sat May 31, 2014 8:28 am

Re: Maximum ELO gain per test game played?

Post by Zenmastur »

Laskos wrote: I didn't quite understand the problem. If false positives accumulate to 0, then some pretty absurd things seem to happen. First, in my model, the dependence of Efficiency on Type I error disappears.
I think you got the gist of what I wanted. I figured that if we could get rid of the Elo problems caused by false positives, the dependency on Type I errors would disappear, which is exactly what I wanted to happen!

Is there more than one way to do this? E.g. you could raise the upper Elo bound, but this might have some unwanted/bad consequences. Maybe, though, you have enough indirect control by raising only the lower Elo bound. This would cause more bad submits to drop out earlier, and statistically the ones with the worst Elo would drop out first. Thus the average Elo of a false positive should rise. If it can be raised enough that some become positive, then the net effect of false positives will drop substantially. Even if it doesn't actually become zero, it would still have an impact on the effect of false positives on the optimal lower CI, which would have a good effect on efficiency. If insufficient control is available by raising only the lower Elo bound, then a combination of raising both could be used. The idea is to not raise the upper bound by more than is absolutely required.
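One way to check this intuition is a small numerical sketch. The following is a self-contained toy model, not the real framework: win/draw/loss probabilities from a logistic Elo curve with an assumed 35% draw rate, a Wald/Brownian approximation for the SPRT acceptance probability, and a hypothetical pool of bad submits spread uniformly between -4 and 0 Elo. It asks how the average true Elo of the accepted bad submits (the false positives) moves when only the lower bound is raised.

```python
import math

def wdl(elo, draw=0.35):
    # Logistic Elo model with a fixed draw rate (illustrative assumption)
    s = 1.0 / (1.0 + 10.0 ** (-elo / 400.0))
    return (s - draw / 2, draw, 1.0 - s - draw / 2)

def accept_prob(true_elo, elo0, elo1, alpha=0.05, beta=0.05):
    # Wald/Brownian approximation of SPRT(elo0, elo1) acceptance probability
    lo = math.log(beta / (1 - alpha))
    hi = math.log((1 - beta) / alpha)
    pt, p0, p1 = wdl(true_elo), wdl(elo0), wdl(elo1)
    inc = [math.log(a / b) for a, b in zip(p1, p0)]
    mu = sum(p * x for p, x in zip(pt, inc))
    var = sum(p * x * x for p, x in zip(pt, inc)) - mu * mu
    g = 2 * mu / var
    if abs(g) < 1e-9:
        return -lo / (hi - lo)
    return (1 - math.exp(-g * lo)) / (math.exp(-g * hi) - math.exp(-g * lo))

# Hypothetical pool of bad submits, uniform between -4.0 and -0.1 Elo.
bad = [x / 10 for x in range(-40, 0)]
for elo0 in (0.0, 1.5):
    p = [accept_prob(e, elo0, 5.0) for e in bad]
    mean = sum(e * q for e, q in zip(bad, p)) / sum(p)
    rate = sum(p) / len(p)
    print(f"elo0={elo0:+.1f}: false-positive rate {rate:.4f}, "
          f"mean Elo of false positives {mean:+.2f}")
```

In this toy model, raising the lower bound both thins out the false positives and makes the surviving ones less bad on average, the direction argued above, though at the price of rejecting more genuinely good patches as well.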
Laskos wrote:Well, I can make it disappear asymptotically, keeping low values of its ELO impact. Then, setting higher ELO bounds for "good submits" in order for false positives to not accumulate ELO must be compensated by lower Pgood.


I'm not sure I understand the last statement.
Laskos wrote:In these conditions, I can plot the optimal errors for low ELO impact of false positives and low P_very_good. I am not sure I understood the problem.
Image

Regards,

Zen
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Maximum ELO gain per test game played?

Post by Laskos »

Zenmastur wrote:
Laskos wrote: I didn't quite understand the problem. If false positives accumulate to 0, then some pretty absurd things seem to happen. First, in my model, the dependence of Efficiency on Type I error disappears.
I think you got the gist of what I wanted. I figured that if we could get rid of the Elo problems caused by false positives, the dependency on Type I errors would disappear, which is exactly what I wanted to happen!

Is there more than one way to do this? E.g. you could raise the upper Elo bound, but this might have some unwanted/bad consequences. Maybe, though, you have enough indirect control by raising only the lower Elo bound. This would cause more bad submits to drop out earlier, and statistically the ones with the worst Elo would drop out first. Thus the average Elo of a false positive should rise. If it can be raised enough that some become positive, then the net effect of false positives will drop substantially. Even if it doesn't actually become zero, it would still have an impact on the effect of false positives on the optimal lower CI, which would have a good effect on efficiency. If insufficient control is available by raising only the lower Elo bound, then a combination of raising both could be used. The idea is to not raise the upper bound by more than is absolutely required.
Laskos wrote:Well, I can make it disappear asymptotically, keeping low values of its ELO impact. Then, setting higher ELO bounds for "good submits" in order for false positives to not accumulate ELO must be compensated by lower Pgood.


I'm not sure I understand the last statement.
My model is not adapted to sequentially drop very bad submits faster, or to separate them by Elo. The Elo dependence of efficiency could be implemented, but then we would get into some nasty details, such as guessing the bounds. In SPRT it goes roughly like this: the efficiency goes as 1/scale^2, where scale is the Elo scale of the problem. Separately, it goes roughly as 1/(difference between the bounds) and 1/(real Elo difference between the submit and the bounds). That's all outside the region between H0 and H1, where things are complicated.
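These scaling claims can be illustrated with the Wald approximation for the expected SPRT length. The sketch below is a self-contained toy calculation (a fixed 35% draw rate is assumed, a Brownian approximation is used, and overshoot is ignored): halving the distance between the bounds roughly quadruples the games needed near the decision boundary, which is the 1/scale^2 behaviour mentioned above.

```python
import math

def wdl(elo, draw=0.35):
    # Logistic Elo model with a fixed draw rate (illustrative assumption)
    s = 1.0 / (1.0 + 10.0 ** (-elo / 400.0))
    return (s - draw / 2, draw, 1.0 - s - draw / 2)

def expected_games(true_elo, elo0, elo1, alpha=0.05, beta=0.05):
    """Wald/Brownian approximation of the expected SPRT length E[N]."""
    lo = math.log(beta / (1 - alpha))
    hi = math.log((1 - beta) / alpha)
    pt, p0, p1 = wdl(true_elo), wdl(elo0), wdl(elo1)
    inc = [math.log(a / b) for a, b in zip(p1, p0)]
    mu = sum(p * x for p, x in zip(pt, inc))          # LLR drift per game
    var = sum(p * x * x for p, x in zip(pt, inc)) - mu * mu
    if abs(mu) < 1e-12:                # zero-drift case: E[N] = -lo*hi/var
        return -lo * hi / var
    g = 2 * mu / var
    p_hi = (1 - math.exp(-g * lo)) / (math.exp(-g * hi) - math.exp(-g * lo))
    return (p_hi * hi + (1 - p_hi) * lo) / mu         # Wald's identity

# Same relative position (the midpoint, where tests run longest),
# but the bound gap halved: expect roughly 4x as many games.
for elo0, elo1 in ((0, 6), (0, 3)):
    mid = (elo0 + elo1) / 2
    print(f"SPRT({elo0},{elo1}) at {mid} Elo: "
          f"~{expected_games(mid, elo0, elo1):,.0f} games")
```

The 1/(distance to the bounds) behaviour can be seen the same way by moving true_elo away from the midpoint while keeping the bounds fixed.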

As for my model being valid only for Pgood decreasing as the Elo of bad submits decreases: that's because I consider all bad submits to have the same Elo value, so false positives are drawn from a pool of identical bad submits. The problem is that if Pgood is high enough and the bad submits are only mildly negative in Elo, there are no optima for sufficiently large Pgood. It's better to accept all submits immediately, as the gain is instant. Type I and II errors in my model then have no optima, and the efficiency diverges at larger Pgood.

I would need to separate submits into 3 pools: good submits; mildly bad submits, which give false positives that add up to 0 Elo; and very bad submits, which don't produce false positives and drop out quickly. Then I would need to do this sequentially. I am not sure this complication is really worth the effort.