TalkChess.com

Posted: **Thu Nov 29, 2012 7:00 pm**

hgm wrote:Cutting short a match erodes your confidence. For example, when you test until the the score exceeds the the 1.96 STD interval around 50%, and abort the test immediately when it does this, you can only have 90% confidence that they were indeed of different strength. While normally 1.96 STD corresponds to a confidence of 95%, when it happens over a predetermined number of games (without paying attention to intermediate results). The possibility to stop early exactly doubles the probability that you will accept a fluke (from 5% to 10%), because you don't give it the opportunity to correct itself by an average result in the remainder of the test.

H.G.,

You use the terminology, "doubles the probability that you will accept a fluke."
Is there a simple way to calculate or approximate a modified error margin?

I am mathematically challenged but this would be useful to know. Here is an example that may help:

For example if the error margin of +/- 8 ELO is reported but you are exactly half way through the test you intent to run, could you pretend the error margin was 16 or perhaps 11.3137 or some other value? The 11.3137 value is the error margin multiplied by the square root of 1 / (1/2), which seems like a good guess to me if this is valid at all.

Don

Posted: **Thu Nov 29, 2012 7:28 pm**

Well, it is a bit tricky, because there is not a single answer. Because it does not only depend on the actual result, but also on how you intended to use it!

If from the very outset you intended only to play half that number of games, then the error margin would indeed be ~11 Elo (sqrt(2) as large). But if you intended to continue all the way to the full number of games in case you did not like the half-way result, you have built extra opportunities to reach a good (or bad) result in your test strategy, and this makes the fact that you reach that result less convincing (i.e., it results in a lower confidence for the same error margin, or a larger error margin to get the same confidence).

So you always have to consider your entire testing strategy. 'confidence' is basically "what are the chances the result is NOT a fluke". If you have a testing strategy that selects flukes by stopping half-way if you happen to encounter one, the chances that you end up with a fluke one way or the other obviously will increase.

The interesting thing is that when you decide to stop on a fluke of a previously-determined number of standard errors (e.g. when you exceed 8 Elo after the ful nr of games, 11 Elo after half, 16 after a quarter, 22 after 1/8, 32 after 1/16 etc.), the chances that the result is a fluke exactly double compared to when you only would have accepted +8 Elo after the full number of games, and would have continued no matter how unbalanced any intermediate result was.

Posted: **Thu Nov 29, 2012 7:48 pm**

hgm wrote:Well, it is a bit tricky, because there is not a single answer. Because it does not only depend on the actual result, but also on how you intended to use it!

If from the very outset you intended only to play half that number of games, then the error margin would indeed be ~11 Elo (sqrt(2) as large). But if you intended to continue all the way to the full number of games in case you did not like the half-way result, you have built extra opportunities to reach a good (or bad) result in your test strategy, and this makes the fact that you reach that result less convincing (i.e., it results in a lower confidence for the same error margin, or a larger error margin to get the same confidence).

So you always have to consider your entire testing strategy. 'confidence' is basically "what are the chances the result is NOT a fluke". If you have a testing strategy that selects flukes by stopping half-way if you happen to encounter one, the chances that you end up with a fluke one way or the other obviously will increase.

The interesting thing is that when you decide to stop on a fluke of a previously-determined number of standard errors (e.g. when you exceed 8 Elo after the ful nr of games, 11 Elo after half, 16 after a quarter, 22 after 1/8, 32 after 1/16 etc.), the chances that the result is a fluke exactly double compared to when you only would have accepted +8 Elo after the full number of games, and would have continued no matter how unbalanced any intermediate result was.

I guess what I am looking for is a way to standardize our testing such that we can have strict stopping rules that make sense statistically. I don't want it to be based on random observation or our own wishful thinking (stop the test when you "like" the results) because that obviously introduces all sorts of bias.

Also, I would like to specify before each test how interested we are in the results - there are clearly some changes that we are more willing to stop early than others assuming they are bad. The worst that can happen is that you fail to accept a good change, but by far the biggest change is accepting regressions so the bar has to be much higher for what changes we accept.

Posted: **Thu Nov 29, 2012 7:57 pm**

Hello John:

John_F wrote:Does anyone know of a spreadsheet formula to calculate LOS? (Like for example, to determine the LOS of the winner in a match between two programs)

You can try BayesElo by Rémi Coulom, which has a command called los that computes a LOS matrix with several engines.

I wrote a more modest tool to compute LOS only in matches between two engines a while ago. My programme does not accept PGN files. You should take a look to this thread and download the compressed file, which is still active. You must double click on LOS_and_Elo_uncertainties_calculator.exe and wait few seconds: the programme will ask for the number of wins, loses and draws, which is enough for the model I use to compute LOS. The programme also asks for a confidence level (just for compute error bars) and the clock rate of your CPU for timing the job; this programme only works in Windows. Here is an example:

Code: Select all

LOS_and_Elo_uncertainties_calculator, ® 2012.

----------------------------------------------------------------
Calculation of Elo uncertainties in a match between two engines&#58;
----------------------------------------------------------------

&#40;The input and output data is referred to the first engine&#41;.

Please write down non-negative integers.

Maximum number of games supported&#58; 2147483647.

Write down the number of wins &#40;up to 1825361100&#41;&#58;

323

Write down the number of loses &#40;up to 1825361100&#41;&#58;

288

Write down the number of draws &#40;up to 2147483036&#41;&#58;

389

 Write down the confidence level &#40;in percentage&#41; between 65% and 99.9% &#40;it will be rounded up to 0.01%)&#58;

95

Write down the clock rate of the CPU &#40;in GHz&#41;, only for timing the elapsed time of the calculations&#58;

3

---------------------------------------
Elo interval for 95.00 % confidence&#58;

Elo rating difference&#58;     12.17 Elo

Lower rating difference&#58;   -4.66 Elo
Upper rating difference&#58;   29.04 Elo

Lower bound uncertainty&#58;  -16.82 Elo
Upper bound uncertainty&#58;   16.88 Elo
Average error&#58;        +/-  16.85 Elo

K = &#40;average error&#41;*&#91;sqrt&#40;n&#41;&#93; =  532.82

Elo interval&#58; &#93;  -4.66,   29.04&#91;
---------------------------------------

Number of games of the match&#58;      1000
Score&#58; 51.75 %
Elo rating difference&#58;   12.17 Elo
Draw ratio&#58; 38.90 %

*********************************************************
Standard deviation&#58;  2.4199 % of the points of the match.
*********************************************************

 Error bars were calculated with two-sided tests; values are rounded up to 0.01 Elo, or 0.01 in the case of K.

-------------------------------------------------------------------
Calculation of likelihood of superiority &#40;LOS&#41; in a one-sided test&#58;
-------------------------------------------------------------------

LOS &#40;taking into account draws&#41; is always calculated, if possible.

LOS &#40;not taking into account draws&#41; is only calculated if wins + loses < 16001.

LOS &#40;average value&#41; is calculated only when LOS &#40;not taking into account draws&#41;
is calculated.
______________________________________________

LOS&#58;  92.18 % &#40;taking into account draws&#41;.
LOS&#58;  92.15 % &#40;not taking into account draws&#41;.
LOS&#58;  92.16 % &#40;average value&#41;.
______________________________________________

These values of LOS are rounded up to 0.01%

End of the calculations. Approximated elapsed time&#58;   79 ms.

Thanks for using LOS_and_Elo_uncertainties_calculator. Press Enter to exit.

In this case, a match of 1000 games with a final result of +323 -288 =389 gives a LOS value of almost 92.2% for the winner (and of course the loser has a LOS of around 7.8%). The programme spent only 79 ms in my computer this time, so it is relatively fast. It is open source (with Fortran 95 sources included) so you can take a look on how it works. I hope it will be helpful for you!

Regards from Spain.

Ajedrecista.

Posted: **Thu Nov 29, 2012 8:31 pm**

hgm wrote:Well, it is a bit tricky, because there is not a single answer. Because it does not only depend on the actual result, but also on how you intended to use it!

If from the very outset you intended only to play half that number of games, then the error margin would indeed be ~11 Elo (sqrt(2) as large). But if you intended to continue all the way to the full number of games in case you did not like the half-way result, you have built extra opportunities to reach a good (or bad) result in your test strategy, and this makes the fact that you reach that result less convincing (i.e., it results in a lower confidence for the same error margin, or a larger error margin to get the same confidence).

So you always have to consider your entire testing strategy. 'confidence' is basically "what are the chances the result is NOT a fluke". If you have a testing strategy that selects flukes by stopping half-way if you happen to encounter one, the chances that you end up with a fluke one way or the other obviously will increase.

The interesting thing is that when you decide to stop on a fluke of a previously-determined number of standard errors (e.g. when you exceed 8 Elo after the ful nr of games, 11 Elo after half, 16 after a quarter, 22 after 1/8, 32 after 1/16 etc.), the chances that the result is a fluke exactly double compared to when you only would have accepted +8 Elo after the full number of games, and would have continued no matter how unbalanced any intermediate result was.

I did develop a simple stop early test strategy based on a large monte carlo simulation designed to test different stopping strategies. I don't understand enough of the math behind it to compute it so I simulated it. The basic idea it to SET various thresholds:

1. maximum number of games.
2. reject strategy if results falls N or more games behind
3. accept strategy if results ahead by N or more games.
4. convergence strategy

So you can try all sorts of different parameters and then test them with a simulation - basically looking for a very low accept rate for regressions.

The strategy is tested after each game is played. If the reject or accept strategy does not kick in before the set completes, the results are automatically rejected even if the score is positive. I based the simulation on the expected number of draws and generating players that were 1 ELO weaker or 1 ELO stronger than the comparison. So the answer did not tell me how likely it was that I would accepts a regression but it did tell me how likely it was that I would accept a 1 ELO regression. I also simulated a zero ELO change as it is interesting to see how many of those are accepted or rejected as that can help you understand how many small regressions you will accept.

It was then improved significantly with a strategy to change the threshold numbers as the match progressed (convergence strategy) so that you are more likely to accept or reject a change as the match proceeds. The best values for doing that turned out to be modest and does not converge to zero at the end or even close - but with the right balance you can do much better. The name of the game is to get the best possible results with the least number of games. And this only works with head to head matches. I'm quite sure that my linear convergence strategy combined with the win by N strategy simply approximated a much better formula that I could have used instead.

This was all inspired by the thread where the "wald" procedure was being discussed as I think it is "wald-like" in some ways but much easier for me to grok. The math behind the wald stuff is hairy (for me) and I like to understand what it is I'm actually doing.

Posted: **Fri Nov 30, 2012 12:16 am**

Don wrote:
hgm wrote:Well, it is a bit tricky, because there is not a single answer. Because it does not only depend on the actual result, but also on how you intended to use it!

If from the very outset you intended only to play half that number of games, then the error margin would indeed be ~11 Elo (sqrt(2) as large). But if you intended to continue all the way to the full number of games in case you did not like the half-way result, you have built extra opportunities to reach a good (or bad) result in your test strategy, and this makes the fact that you reach that result less convincing (i.e., it results in a lower confidence for the same error margin, or a larger error margin to get the same confidence).

So you always have to consider your entire testing strategy. 'confidence' is basically "what are the chances the result is NOT a fluke". If you have a testing strategy that selects flukes by stopping half-way if you happen to encounter one, the chances that you end up with a fluke one way or the other obviously will increase.

The interesting thing is that when you decide to stop on a fluke of a previously-determined number of standard errors (e.g. when you exceed 8 Elo after the ful nr of games, 11 Elo after half, 16 after a quarter, 22 after 1/8, 32 after 1/16 etc.), the chances that the result is a fluke exactly double compared to when you only would have accepted +8 Elo after the full number of games, and would have continued no matter how unbalanced any intermediate result was.
I did develop a simple stop early test strategy based on a large monte carlo simulation designed to test different stopping strategies. I don't understand enough of the math behind it to compute it so I simulated it. The basic idea it to SET various thresholds:

1. maximum number of games.
2. reject strategy if results falls N or more games behind
3. accept strategy if results ahead by N or more games.
4. convergence strategy

So you can try all sorts of different parameters and then test them with a simulation - basically looking for a very low accept rate for regressions.

The strategy is tested after each game is played. If the reject or accept strategy does not kick in before the set completes, the results are automatically rejected even if the score is positive. I based the simulation on the expected number of draws and generating players that were 1 ELO weaker or 1 ELO stronger than the comparison. So the answer did not tell me how likely it was that I would accepts a regression but it did tell me how likely it was that I would accept a 1 ELO regression. I also simulated a zero ELO change as it is interesting to see how many of those are accepted or rejected as that can help you understand how many small regressions you will accept.

It was then improved significantly with a strategy to change the threshold numbers as the match progressed (convergence strategy) so that you are more likely to accept or reject a change as the match proceeds. The best values for doing that turned out to be modest and does not converge to zero at the end or even close - but with the right balance you can do much better. The name of the game is to get the best possible results with the least number of games. And this only works with head to head matches. I'm quite sure that my linear convergence strategy combined with the win by N strategy simply approximated a much better formula that I could have used instead.

This was all inspired by the thread where the "wald" procedure was being discussed as I think it is "wald-like" in some ways but much easier for me to grok. The math behind the wald stuff is hairy (for me) and I like to understand what it is I'm actually doing.

Have you tries the sequential probability ratio test, which is now available in cutechess-cli (if you compile from source) ?
It works very well. The only important thing I see to improve is to calculate dynamically the value of drawelo, instead of using the bayeselo value as a hardcoded constant. I need to experiment with it, then I can propose a fix to Ilari.

Posted: **Fri Nov 30, 2012 2:22 pm**

lucasart wrote:
Don wrote:
hgm wrote:Well, it is a bit tricky, because there is not a single answer. Because it does not only depend on the actual result, but also on how you intended to use it!

If from the very outset you intended only to play half that number of games, then the error margin would indeed be ~11 Elo (sqrt(2) as large). But if you intended to continue all the way to the full number of games in case you did not like the half-way result, you have built extra opportunities to reach a good (or bad) result in your test strategy, and this makes the fact that you reach that result less convincing (i.e., it results in a lower confidence for the same error margin, or a larger error margin to get the same confidence).

So you always have to consider your entire testing strategy. 'confidence' is basically "what are the chances the result is NOT a fluke". If you have a testing strategy that selects flukes by stopping half-way if you happen to encounter one, the chances that you end up with a fluke one way or the other obviously will increase.

The interesting thing is that when you decide to stop on a fluke of a previously-determined number of standard errors (e.g. when you exceed 8 Elo after the ful nr of games, 11 Elo after half, 16 after a quarter, 22 after 1/8, 32 after 1/16 etc.), the chances that the result is a fluke exactly double compared to when you only would have accepted +8 Elo after the full number of games, and would have continued no matter how unbalanced any intermediate result was.
I did develop a simple stop early test strategy based on a large monte carlo simulation designed to test different stopping strategies. I don't understand enough of the math behind it to compute it so I simulated it. The basic idea it to SET various thresholds:

1. maximum number of games.
2. reject strategy if results falls N or more games behind
3. accept strategy if results ahead by N or more games.
4. convergence strategy

So you can try all sorts of different parameters and then test them with a simulation - basically looking for a very low accept rate for regressions.

The strategy is tested after each game is played. If the reject or accept strategy does not kick in before the set completes, the results are automatically rejected even if the score is positive. I based the simulation on the expected number of draws and generating players that were 1 ELO weaker or 1 ELO stronger than the comparison. So the answer did not tell me how likely it was that I would accepts a regression but it did tell me how likely it was that I would accept a 1 ELO regression. I also simulated a zero ELO change as it is interesting to see how many of those are accepted or rejected as that can help you understand how many small regressions you will accept.

It was then improved significantly with a strategy to change the threshold numbers as the match progressed (convergence strategy) so that you are more likely to accept or reject a change as the match proceeds. The best values for doing that turned out to be modest and does not converge to zero at the end or even close - but with the right balance you can do much better. The name of the game is to get the best possible results with the least number of games. And this only works with head to head matches. I'm quite sure that my linear convergence strategy combined with the win by N strategy simply approximated a much better formula that I could have used instead.

This was all inspired by the thread where the "wald" procedure was being discussed as I think it is "wald-like" in some ways but much easier for me to grok. The math behind the wald stuff is hairy (for me) and I like to understand what it is I'm actually doing.
Have you tries the sequential probability ratio test, which is now available in cutechess-cli (if you compile from source) ?
It works very well. The only important thing I see to improve is to calculate dynamically the value of drawelo, instead of using the bayeselo value as a hardcoded constant. I need to experiment with it, then I can propose a fix to Ilari.

I have not given cutechess-cli a try yet since I had long ago developed my own autotesting system. But I should try it - I understand it works quite well.

I assume it will work from Linux too, right?

Don

Posted: **Fri Nov 30, 2012 2:38 pm**

Don wrote: I have not given cutechess-cli a try yet since I had long ago developed my own autotesting system. But I should try it - I understand it works quite well.

I assume it will work from Linux too, right?

Don

Yes, it's avilable on github. You need to install Qt to compile it though
https://github.com/cutechess/cutechess

It has basically 3 projects:
- a library with all common functions
- a cli tournament manager. very powerful and feature rich.
- a GUI. very nice, but in the relatively early stages of developpement. what is there is really neat, but there's a lot missing in terms of features.

You can also program the SPRT algorithm yourself, it's not complicated. Have a look at sprt.h and sprt.cpp in cutechess (library). The only major thing that needs improving on it is the dynamic calibration of the drawelo value. This is important, especially in self testing, where the drawelo value is typically much higher, than the 97.3 default value (which came from BayesElo and was calibrated a long time ago).

Posted: **Fri Nov 30, 2012 2:46 pm**

lucasart wrote:
Don wrote: I have not given cutechess-cli a try yet since I had long ago developed my own autotesting system. But I should try it - I understand it works quite well.

I assume it will work from Linux too, right?

Don
Yes, it's avilable on github. You need to install Qt to compile it though
https://github.com/cutechess/cutechess

It has basically 3 projects:
- a library with all common functions
- a cli tournament manager. very powerful and feature rich.
- a GUI. very nice, but in the relatively early stages of developpement. what is there is really neat, but there's a lot missing in terms of features.

You can also program the SPRT algorithm yourself, it's not complicated. Have a look at sprt.h and sprt.cpp in cutechess (library). The only major thing that needs improving on it is the dynamic calibration of the drawelo value. This is important, especially in self testing, where the drawelo value is typically much higher, than the 97.3 default value (which came from BayesElo and was calibrated a long time ago).

The system I created with simulation also simulates the draw percentages we experiment in our self testing. Can the drawelo be calculated from that somehow?

Don

Posted: **Fri Nov 30, 2012 3:04 pm**

Don wrote:
lucasart wrote:
Don wrote: I have not given cutechess-cli a try yet since I had long ago developed my own autotesting system. But I should try it - I understand it works quite well.

I assume it will work from Linux too, right?

Don
Yes, it's avilable on github. You need to install Qt to compile it though
https://github.com/cutechess/cutechess

It has basically 3 projects:
- a library with all common functions
- a cli tournament manager. very powerful and feature rich.
- a GUI. very nice, but in the relatively early stages of developpement. what is there is really neat, but there's a lot missing in terms of features.

You can also program the SPRT algorithm yourself, it's not complicated. Have a look at sprt.h and sprt.cpp in cutechess (library). The only major thing that needs improving on it is the dynamic calibration of the drawelo value. This is important, especially in self testing, where the drawelo value is typically much higher, than the 97.3 default value (which came from BayesElo and was calibrated a long time ago).
The system I created with simulation also simulates the draw percentages we experiment in our self testing. Can the drawelo be calculated from that somehow?

Don

Yes, although one has to be careful not to draw early conclusion from a drawelo calculated over a small sample.
Perhaps a weighted average between th initial value (analytically derived from drawelo = 100 and elodiff=0) and the estimated value (analytically derived from the results of the sample). A reasonable weighting could be something like this
- x = 1/log2(nb_draws): weight for the initial value
- 1-x for the estimated value, so as log2(nb_draw) grows we trust the sample more and more

Let's look at a practical example. You play K0 (Komodo0) against K1 (Komnodo1), and get the following results: 520-500-1600 [N=2620 games, 520 wins, 500 losses, 1600 draws]

Your sample estimated values are:
p(win) = 520/2620 = 19.85%
p(loss) = 500/2620 = 19.08%
p(draw) = 1600/2620 = 61.07%

Now here's the "drawelo equation" (in its simplified form as we don't care about color biais, as I suppose you play alternatively W/B)
- let L(x) = 1/(1+10^(x/400))
- p(win) = L(-elo+drawelo)
- p(loss) = L(+elo+drawelo)

Now you can solve this equation to get (elo,drawelo) from the sample. elo represents the estimation of elo(K1)-elo(K0).

elo = 2.65
drawelo = 247

and as the sample grows you can trust this value more and more rather then the initial value 100.

anyway, to be verified & tested, but that's the idea

TalkChess.com

LOS (again)

Re: LOS (again)

Re: LOS (again)

Re: LOS (again)

Re: LOS again.

Re: LOS (again)

Re: LOS (again)

Re: LOS (again)

Re: LOS (again)

Re: LOS (again)

Re: LOS (again)