Beta for Stockfish distributed testing

Discussion of anything and everything relating to chess playing software and machines.

Moderator: Ras

Ajedrecista
Posts: 2230
Joined: Wed Jul 13, 2011 9:04 pm
Location: Madrid, Spain.

Re: Beta for Stockfish distributed testing.

Post by Ajedrecista »

Hello Gary:
gladius wrote:Since the beta announcement, we have had up to six computers testing at once, and have played over 400,000 games!
Am I wrong, or have almost 1.6 million games been played at this moment? It is quite a lot.

Now:
Stockfish Testing Framework wrote:Retest regression with new hash size of 128MB, previous result with 32MB hash size was +16.62 ELO
20000 @ 60+0.05 th 1 (3698d9aa5573ca Vs. sf_2.3.1_base):

Code:

ELO: 16.12 +-7.3 (95%) LOS: 100.0%
Total: 8756 W: 1748 L: 1342 D: 5666
If I include the draw ratio of roughly 64.71%, I obtain an error bar of roughly ±4.32 Elo with 95% confidence (and, of course, LOS ~ 100%).

Given the fact that self-tests tend to exaggerate Elo improvements, can it be said that the real improvement (for example on the IPON rating list) is around 10 Elo, plus or minus error bars?

I guess it is too soon for a 2.3.2 or 2.4 release... Thanks in advance for your attention.

Regards from Spain.

Ajedrecista.
gladius
Posts: 568
Joined: Tue Dec 12, 2006 10:10 am
Full name: Gary Linscott

Re: Beta for Stockfish distributed testing.

Post by gladius »

Ajedrecista wrote:Am I wrong, or have almost 1.6 million games been played at this moment? It is quite a lot.
Yes, the 400k games figure came from me just glancing at the results page; we are actually at 1,589,926 total games right now! Thanks to everyone for contributing their CPU time :).
Ralph Stoesser
Posts: 408
Joined: Sat Mar 06, 2010 9:28 am

Re: Beta for Stockfish distributed testing.

Post by Ralph Stoesser »

A chat window would be nice!
Michel
Posts: 2292
Joined: Mon Sep 29, 2008 1:50 am

Re: Beta for Stockfish distributed testing.

Post by Michel »

Yes that looks like a good idea!
lucasart
Posts: 3243
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: Beta for Stockfish distributed testing.

Post by lucasart »

Gary,

Jesus is right. There is something wrong in your formula for the +/- Elo points at 95%. You need the 2.5% and 97.5% quantiles so that the probability of being in the interval is 95%. Here are the formulas you should use:

=> notations
(W,L,D) = (#win,#loss,#draw)
N = W+L+D

=> score and its stdev (wins score 1, draws 1/2, losses 0)
x = (W+D/2)/N
sigma = sqrt(W.(1-x)^2 + D.(1/2-x)^2 + L.(0-x)^2) / N

=> ELO and LOS
elo = -400.log10(1/x-1)
LOS = Phi((x-1/2)/sigma)

=> quantiles of x as follows
quantile(x,alpha) = x + sigma.Phi^{-1}(alpha)

From the quantiles it's easy to calculate the bounds in score and transform them to elo points:

xmin = quantile(x,2.5%)
xmax = quantile(x,97.5%)

elomin = -400.log10(1/xmin-1)
elomax = -400.log10(1/xmax-1)

As for the "+/-", it can actually be an asymmetric interval
"+" = elomax - elo
"-" = elo - elomin

I hope it's all self-explanatory and I didn't make any typos, but I get the same numbers as Jesus in the end. If you want to play with the formulas and test them, I suggest you use a spreadsheet program (like Calc or Gnumeric), where the distribution and quantile functions of the standard Gaussian law are:
Phi = NORMSDIST
Phi^{-1} = NORMSINV

PS: All of the above is obviously asymptotic! Don't be surprised to get funny values for N=2, for instance.

PPS: Phi(x) = (1/2).[1+erf(x/sqrt(2))]
For approximations of erf(), see the wikipedia page
http://en.wikipedia.org/wiki/Error_function
You don't actually need Phi^{-1}: just hardcode the values Phi^{-1}(2.5%) and Phi^{-1}(97.5%). Otherwise, there are also erf^{-1} approximations on the same Wikipedia page if you prefer.
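For anyone who would rather check these formulas in code than in a spreadsheet, here is a small Python sketch of the steps above, using math.erf for Phi and the hardcoded value Phi^{-1}(97.5%) ≈ 1.959964. The function names are just illustrative; the input is the W/L/D count from the test quoted earlier in the thread.

```python
import math

def phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def elo_from_score(x):
    """Logistic Elo from a score x in (0, 1)."""
    return -400.0 * math.log10(1.0 / x - 1.0)

def stats(W, L, D, z95=1.959964):
    """Elo, LOS and a 95% Elo interval from (W, L, D) counts."""
    N = W + L + D
    x = (W + D / 2.0) / N  # score: wins count 1, draws 1/2, losses 0
    # stdev of the mean score under the trinomial model
    sigma = math.sqrt(W * (1 - x) ** 2 + D * (0.5 - x) ** 2 + L * x ** 2) / N
    los = phi((x - 0.5) / sigma)
    elo = elo_from_score(x)
    elo_min = elo_from_score(x - z95 * sigma)  # 2.5% quantile
    elo_max = elo_from_score(x + z95 * sigma)  # 97.5% quantile
    return elo, los, elo_min, elo_max

# W/L/D figures from the 8756-game test quoted earlier in the thread
elo, los, lo, hi = stats(W=1748, L=1342, D=5666)
print(f"elo {elo:.2f}  LOS {los:.4f}  95% interval [{lo:.2f}, {hi:.2f}]")
```

This reproduces the 16.12 Elo from the framework, and the two half-widths of the asymmetric interval average out to roughly the ±4.32 Elo quoted earlier in the thread.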
Last edited by lucasart on Sun Mar 24, 2013 3:05 am, edited 2 times in total.
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
gladius
Posts: 568
Joined: Tue Dec 12, 2006 10:10 am
Full name: Gary Linscott

Re: Beta for Stockfish distributed testing.

Post by gladius »

Ralph Stoesser wrote:A chat window would be nice!
On the fishtest results page? Might be a bit tricky, as right now seeing updated results requires a refresh. But definitely an interesting idea!
Ajedrecista
Posts: 2230
Joined: Wed Jul 13, 2011 9:04 pm
Location: Madrid, Spain.

Re: Beta for Stockfish distributed testing.

Post by Ajedrecista »

Hello Lucas:
lucasart wrote:Gary,

Jesus is right. There is something wrong in your formula for the +/- Elo points at 95%. You need the 2.5% and 97.5% quantiles so that the probability of being in the interval is 95%. Here are the formulas you should use:

=> notations
(W,L,D) = (#win,#loss,#draw)
N = W+L+D

=> score and its stdev (wins score 1, draws 1/2, losses 0)
x = (W+D/2)/N
sigma = sqrt(W.(1-x)^2 + D.(1/2-x)^2 + L.(0-x)^2) / N

=> ELO and LOS
elo = -400.log10(1/x-1)
LOS = Phi((x-1/2)/sigma)

=> quantiles of x as follows
quantile(x,alpha) = x + sigma.Phi^{-1}(alpha)

From the quantiles it's easy to calculate the bounds in score and transform them to elo points:

xmin = quantile(x,2.5%)
xmax = quantile(x,97.5%)

elomin = -400.log10(1/xmin-1)
elomax = -400.log10(1/xmax-1)

As for the "+/-", it can actually be an asymmetric interval
"+" = elomax - elo
"-" = elo - elomin

I hope it's all self-explanatory and I didn't make any typos, but I get the same numbers as Jesus in the end. If you want to play with the formulas and test them, I suggest you use a spreadsheet program (like Calc or Gnumeric), where the distribution and quantile functions of the standard Gaussian law are:
Phi = NORMSDIST
Phi^{-1} = NORMSINV

PS: All of the above is obviously asymptotic! Don't be surprised to get funny values for N=2, for instance.

PPS: Phi(x) = (1/2).[1+erf(x/sqrt(2))]
For approximations of erf(), see the wikipedia page
http://en.wikipedia.org/wiki/Error_function
You don't actually need Phi^{-1}: just hardcode the values Phi^{-1}(2.5%) and Phi^{-1}(97.5%). Otherwise, there are also erf^{-1} approximations on the same Wikipedia page if you prefer.
Gary and I calculate different error bars because we use different sample standard deviations:

Code:

s: sample standard deviation.
mu: score.
D: draw ratio.
n: number of games.

Gary's:
s = sqrt{[mu*(1 - mu)]/(n - 1)}

Mine (I guess yours too):
s = sqrt{[mu*(1 - mu) - D/4]/(n - 1)}
My sample standard deviation is smaller than Gary's, except when the draw ratio D = 0; hence my error bars are smaller than his at a given confidence level. Please note that if mu = 0.5 or near it, we can relate both error bars with the following approximation:

Code:

(Gary's error bar) ~ (my error bar)/sqrt(1 - D)
Gary may download my Fortran tools (link in my signature) and try LOS_and_Elo_uncertainties_calculator for comparison purposes. Anyway, all the differences come from the additional -D/4 term in the sample standard deviation. I think the main difference is the assumption of a binomial model (wins and losses; draws do not count) in Gary's formula, versus a trinomial model (wins, draws and losses) in the formula I use.
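As a quick illustration (a few lines of Python rather than the Fortran tool), the binomial and trinomial sample standard deviations and the 1/sqrt(1 - D) approximation can be compared directly on the counts from the quoted test:

```python
import math

# Counts from the 8756-game test quoted earlier in the thread
W, L, D = 1748, 1342, 5666
n = W + L + D
mu = (W + D / 2.0) / n   # score
d = D / n                # draw ratio

# Binomial model (draws ignored) vs trinomial model (wins/draws/losses)
s_binomial = math.sqrt(mu * (1.0 - mu) / (n - 1))
s_trinomial = math.sqrt((mu * (1.0 - mu) - d / 4.0) / (n - 1))

ratio = s_binomial / s_trinomial
approx = 1.0 / math.sqrt(1.0 - d)  # valid when mu is close to 0.5
print(f"binomial s = {s_binomial:.6f}, trinomial s = {s_trinomial:.6f}")
print(f"ratio = {ratio:.4f}, 1/sqrt(1 - D) = {approx:.4f}")
```

With a draw ratio near 65% the binomial error bar comes out noticeably wider, and the ratio of the two agrees with the 1/sqrt(1 - D) approximation to a few parts in a thousand here, since mu is close to 0.5.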

Thank you very much again for this distributed testing framework, Gary! Good luck to SF and DiscoCheck. ;-)

Regards from Spain.

Ajedrecista.
lucasart
Posts: 3243
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: Beta for Stockfish distributed testing.

Post by lucasart »

Jesus,

Drop FORTRAN, and learn something modern like Python. Seriously...

Same for COBOL :lol:
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: Beta for Stockfish distributed testing.

Post by mcostalba »

gladius wrote:
Ralph Stoesser wrote:A chat window would be nice!
On the fishtest results page? Might be a bit tricky, as right now seeing updated results requires a refresh. But definitely an interesting idea!
I have just setup a google group:

https://groups.google.com/forum/?fromgr ... ishcooking

Hopefully it will be useful. It is more persistent than a chat, but at the same time almost as responsive.

Suggestions are welcome!
lucasart
Posts: 3243
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: Beta for Stockfish distributed testing.

Post by lucasart »

It's nice to see Joona Kiiski is back and pushing patches :-)
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.