hgm wrote:Laskos wrote:Also, the error bars in Elo points would become large for very skewed 99:1 matches. A round robin with different engines would take too much time, I guess, so do it just with Crafty.
Note that I did _not_ suggest to play 1*t vs 16*t; only upto 1*t vs 8*t (N ply vs N+3 ply, assuming an EBF of 2). With the usual assumption of 70 Elo gain per doubling a time odds factor 8 only produces 210 rating point. That should produce a 75% result.
True, a 75% result will not deprive us of so much Elo precision. But, I have a problem even with 2,4,8, as I do not know if the results are linearly additive, i.e. that 2*2 gives equal results to 1*4 in Elo points. I suspect that we will be even more confused by the issue if we do 2,4,8 multiplicants of "t", compared to simple doubling at each step. Elo curve pretends to have this additivity, but at least Sonas results with human players are disagreeing.
The increased steepness of the Elo vs score curve when you move away from 50% score is partly offset by the fact that the statistical noise is sqrt(N*s*(1-s)) (forgetting about draws for simplicity), which decreases if the score s approaches 0 or 1.
Yes, but it is a square root damping, I do not know if it could catch with the steepness of the Elo curve around 0.99. But it surely is ok at 0.75, the value you propose at N+3. If only 8*t is proposed, I see no problems with error bars. I was worried about crazy 512:1 results and such.
The problem with only playing N vs N+1 is that a single fluke affects all subsequent Elo estimates, and thus has a huge effect on the fit.
I do not understand. The results are not cumulative, each doubling in time is independent from another, and we are not interested in the final Elo, only in Elo difference from doubling at different time controls. A fluke certainly will affect the fit, but a fluke at N+3 will affect the same, or we will make some corrections to be in line with N+1 and N+2?
As a last remark: Playing different engines does not take anymore time than playing just Crafty. In stead of 30,000 games Crafty vs Crafty you would play 2000 games of each engine against each other (with 6). The aim might be to see how much strength Crafty gains on doubling the time, but te Crafty strength is measured much more accurately by playing it against a mix of other engines than in self play.
Yes, I remember one your post about that, but I forgot your argument

. Crafty will play a total of 10,000 games. Isn't it easier just to play a gauntlet with Crafty playing all 30,000 games (with 5 different engines), as we do not need the "absolute" rating, only that relative to doubling in "t" for Crafty only? Also, the engines should be of similar strength, as we don't like 0.90+ results for Elo error bars concerns.
Kai