Beta for Stockfish distributed testing

Discussion of anything and everything relating to chess playing software and machines.

Moderator: Ras

Ajedrecista
Posts: 2230
Joined: Wed Jul 13, 2011 9:04 pm
Location: Madrid, Spain.

Re: Beta for Stockfish distributed testing.

Post by Ajedrecista »

Hello Gary:
gladius wrote:Since the beta announcement, we have had up to six computers testing at once, and have played over 400,000 games!
Am I wrong, or have almost 1.6 million games been played at this moment? It is quite a lot.

Now:
Stockfish Testing Framework wrote:Retest regression with new hash size of 128MB, previous result with 32MB hash size was +16.62 ELO
20000 @ 60+0.05 th 1 (3698d9aa5573ca Vs. sf_2.3.1_base):

Code:

ELO: 16.12 +-7.3 (95%) LOS: 100.0%
Total: 8756 W: 1748 L: 1342 D: 5666
If I include the draw ratio of roughly 64.71%, I obtain an error bar of roughly ±4.32 Elo with 95% confidence (and, of course, LOS ~ 100%).

Given the fact that self-tests tend to exaggerate Elo improvements, can it be said that the real improvement (for example on the IPON rating list) is around 10 Elo, plus or minus error bars?

I guess it is too soon for a 2.3.2 or 2.4 release... Thanks in advance for your attention.

Regards from Spain.

Ajedrecista.
gladius
Posts: 568
Joined: Tue Dec 12, 2006 10:10 am
Full name: Gary Linscott

Re: Beta for Stockfish distributed testing.

Post by gladius »

Ajedrecista wrote:Am I wrong, or have almost 1.6 million games been played at this moment? It is quite a lot.
Yes, the 400k games figure came from me just glancing at the results page; we are actually at 1,589,926 total games right now! Thanks to everyone for contributing their CPU time :).
Ralph Stoesser
Posts: 408
Joined: Sat Mar 06, 2010 9:28 am

Re: Beta for Stockfish distributed testing.

Post by Ralph Stoesser »

A chat window would be nice!
Michel
Posts: 2292
Joined: Mon Sep 29, 2008 1:50 am

Re: Beta for Stockfish distributed testing.

Post by Michel »

Yes that looks like a good idea!
lucasart
Posts: 3243
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: Beta for Stockfish distributed testing.

Post by lucasart »

Gary,

Jesus is right. There is something wrong in your formula for the +/- Elo points at 95%. You need the 2.5% and 97.5% quantiles so that the probability of being in the interval is 95%. Here are the formulas you should use:

=> notations
(W,L,D) = (#win,#loss,#draw)
N = W+L+D

=> score and its stdev (wins score 1, draws 1/2, losses 0)
x = (W+D/2)/N
sigma = sqrt(W.(1-x)^2 + D.(1/2-x)^2 + L.(0-x)^2) / N

=> ELO and LOS
elo = -400.log10(1/x-1)
LOS = Phi((x-1/2)/sigma)

=> quantiles of x as follows
quantile(x,alpha) = x + sigma.Phi^{-1}(alpha)

From the quantiles it's easy to calculate the bounds in score and transform them to elo points:

xmin = quantile(x,2.5%)
xmax = quantile(x,97.5%)

elomin = -400.log10(1/xmin-1)
elomax = -400.log10(1/xmax-1)

As for the "+/-", it can actually be an asymmetric interval
"+" = elomax - elo
"-" = elo - elomin

I hope it's all self-explanatory and I didn't make any typos, but I get the same numbers as Jesus in the end. If you want to play with the formulas and test them, I suggest you use a spreadsheet program (like Calc or Gnumeric), where the distribution and quantile functions of the standard Gaussian law are:
Phi = NORMSDIST
Phi^{-1} = NORMSINV

PS: All of the above is obviously asymptotic! Don't be surprised to get funny values for N=2, for instance.

PPS: Phi(x) = (1/2).[1+erf(x/sqrt(2))]
For approximations of erf(), see the wikipedia page
http://en.wikipedia.org/wiki/Error_function
You don't actually need Phi^{-1}: just hardcode the values Phi^{-1}(2.5%) and Phi^{-1}(97.5%). Otherwise, there are also erf^{-1} approximations on the same Wikipedia page if you prefer.
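For anyone who would rather check these formulas in code than in a spreadsheet, here is a small Python sketch of the steps above, using math.erf for Phi and the hardcoded value Phi^{-1}(97.5%) ≈ 1.959964. The function names are just illustrative; the input is the W/L/D count from the test quoted earlier in the thread.

```python
import math

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def elo_from_score(x):
    """Logistic Elo from a score x in (0, 1)."""
    return -400.0 * math.log10(1.0 / x - 1.0)

def stats(W, L, D, z95=1.959964):
    """Elo, LOS and a 95% Elo interval from (W, L, D) counts."""
    N = W + L + D
    x = (W + D / 2.0) / N  # score: wins count 1, draws 1/2, losses 0
    # stdev of the mean score under the trinomial model
    sigma = math.sqrt(W * (1 - x) ** 2 + D * (0.5 - x) ** 2 + L * x ** 2) / N
    los = phi((x - 0.5) / sigma)
    elo = elo_from_score(x)
    elo_min = elo_from_score(x - z95 * sigma)  # 2.5% quantile
    elo_max = elo_from_score(x + z95 * sigma)  # 97.5% quantile
    return elo, los, elo_min, elo_max

# W/L/D figures from the 8756-game test quoted earlier in the thread
elo, los, lo, hi = stats(W=1748, L=1342, D=5666)
print(f"elo {elo:.2f}  LOS {los:.4f}  95% interval [{lo:.2f}, {hi:.2f}]")
```

This reproduces the 16.12 Elo from the framework, and the two half-widths of the asymmetric interval average out to roughly the ±4.32 Elo quoted earlier in the thread.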
Last edited by lucasart on Sun Mar 24, 2013 3:05 am, edited 2 times in total.
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
gladius
Posts: 568
Joined: Tue Dec 12, 2006 10:10 am
Full name: Gary Linscott

Re: Beta for Stockfish distributed testing.

Post by gladius »

Ralph Stoesser wrote:A chat window would be nice!
On the fishtest results page? Might be a bit tricky, as right now seeing updated results requires a refresh. But definitely an interesting idea!
Ajedrecista
Posts: 2230
Joined: Wed Jul 13, 2011 9:04 pm
Location: Madrid, Spain.

Re: Beta for Stockfish distributed testing.

Post by Ajedrecista »

Hello Lucas:
lucasart wrote:Gary,

Jesus is right. There is something wrong in your formula for the +/- Elo points at 95%. You need the 2.5% and 97.5% quantiles so that the probability of being in the interval is 95%. Here are the formulas you should use:

=> notations
(W,L,D) = (#win,#loss,#draw)
N = W+L+D

=> score and its stdev (wins score 1, draws 1/2, losses 0)
x = (W+D/2)/N
sigma = sqrt(W.(1-x)^2 + D.(1/2-x)^2 + L.(0-x)^2) / N

=> ELO and LOS
elo = -400.log10(1/x-1)
LOS = Phi((x-1/2)/sigma)

=> quantiles of x as follows
quantile(x,alpha) = x + sigma.Phi^{-1}(alpha)

From the quantiles it's easy to calculate the bounds in score and transform them to elo points:

xmin = quantile(x,2.5%)
xmax = quantile(x,97.5%)

elomin = -400.log10(1/xmin-1)
elomax = -400.log10(1/xmax-1)

As for the "+/-", it can actually be an asymmetric interval
"+" = elomax - elo
"-" = elo - elomin

I hope it's all self-explanatory and I didn't make any typos, but I get the same numbers as Jesus in the end. If you want to play with the formulas and test them, I suggest you use a spreadsheet program (like Calc or Gnumeric), where the distribution and quantile functions of the standard Gaussian law are:
Phi = NORMSDIST
Phi^{-1} = NORMSINV

PS: All of the above is obviously asymptotic! Don't be surprised to get funny values for N=2, for instance.

PPS: Phi(x) = (1/2).[1+erf(x/sqrt(2))]
For approximations of erf(), see the wikipedia page
http://en.wikipedia.org/wiki/Error_function
You don't actually need Phi^{-1}: just hardcode the values Phi^{-1}(2.5%) and Phi^{-1}(97.5%). Otherwise, there are also erf^{-1} approximations on the same Wikipedia page if you prefer.
Gary and I calculate different error bars because we use different sample standard deviations:

Code:

s: sample standard deviation.
mu: score.
D: draw ratio.
n: number of games.

Gary's:
s = sqrt{[mu*(1 - mu)]/(n - 1)}

Mine (I guess yours too):
s = sqrt{[mu*(1 - mu) - D/4]/(n - 1)}
My sample standard deviation is smaller than Gary's, except when the draw ratio D = 0; hence my error bars are smaller than his at a given confidence level. Please note that if mu = 0.5 or near it, we can relate both error bars with the following approximation:

Code:

(Gary's error bar) ~ (my error bar)/sqrt(1 - D)
Gary may download my Fortran tools (link in my signature) and try LOS_and_Elo_uncertainties_calculator for comparison purposes. Anyway, all the differences come from the additional -D/4 term in the sample standard deviation. I think the main difference is the assumption of a binomial model (wins and losses; draws do not count) in Gary's formula, versus a trinomial model (wins, draws and losses) in the formula I use.
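As a quick illustration (a few lines of Python rather than the Fortran tool), the binomial and trinomial sample standard deviations and the 1/sqrt(1 - D) approximation can be compared directly on the counts from the quoted test:

```python
import math

# Counts from the 8756-game test quoted earlier in the thread
W, L, D = 1748, 1342, 5666
n = W + L + D
mu = (W + D / 2.0) / n   # score
d = D / n                # draw ratio

# Binomial model (draws ignored) vs trinomial model (wins/draws/losses)
s_binomial = math.sqrt(mu * (1.0 - mu) / (n - 1))
s_trinomial = math.sqrt((mu * (1.0 - mu) - d / 4.0) / (n - 1))

ratio = s_binomial / s_trinomial
approx = 1.0 / math.sqrt(1.0 - d)  # valid when mu is close to 0.5
print(f"binomial s = {s_binomial:.6f}, trinomial s = {s_trinomial:.6f}")
print(f"ratio = {ratio:.4f}, 1/sqrt(1 - D) = {approx:.4f}")
```

With a draw ratio near 65% the binomial error bar comes out noticeably wider, and the ratio of the two agrees with the 1/sqrt(1 - D) approximation to a few parts in a thousand here, since mu is close to 0.5.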

Thank you very much again for this distributed testing framework, Gary! Good luck to SF and DiscoCheck. ;-)

Regards from Spain.

Ajedrecista.
lucasart
Posts: 3243
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: Beta for Stockfish distributed testing.

Post by lucasart »

Jesus,

Drop FORTRAN, and learn something modern like Python. Seriously...

Same for COBOL :lol:
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: Beta for Stockfish distributed testing.

Post by mcostalba »

gladius wrote:
Ralph Stoesser wrote:A chat window would be nice!
On the fishtest results page? Might be a bit tricky, as right now seeing updated results requires a refresh. But definitely an interesting idea!
I have just setup a google group:

https://groups.google.com/forum/?fromgr ... ishcooking

Hopefully it will be useful. It is more persistent than a chat, but at the same time almost as responsive.

Suggestions are welcome!
lucasart
Posts: 3243
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: Beta for Stockfish distributed testing.

Post by lucasart »

It's nice to see Joona Kiiski is back and pushing patches :-)
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.