Discussion of chess software programming and technical issues.
Moderators: bob, hgm, Harvey Williamson
Forum rules
This textbox is used to restore diagrams posted with the [d] tag before the upgrade.

Don
 Posts: 5106
 Joined: Tue Apr 29, 2008 2:27 pm
Post
by Don » Mon Aug 19, 2013 9:53 pm
Laskos wrote:Don wrote:AlvaroBegue wrote:You are absolutely right that the optimal stopping rule depends on the distribution of ELO increments of the ideas being tested.
Could the people with lots of testing experience give us an idea for what the distribution looks like? A mean and a standard deviation would probably suffice.
I have thought of a procedure to compute the optimal stopping rule given the distribution of ELO increments, and I know that at least Don expressed interest in it in the past. The quantity to maximize is expected ELO gain per game played. Unfortunately, I haven't found the time to actually implement it: I have three kids under 3...
I want to integrate SPRT into our own tester but I am unsure of the math. However it is not difficult to run simulations using the criteria of 2 programs which are equal in strength (simulated programs) while duplicating the draw percentage that you are actually getting. But that is still not SPRT.
So the idea is to find a stopping rule you can live with. You don't want to accept too many regressions. A trivial stopping rule is to stop after your target program is ahead by W games or behind by L games. Using a simulation to avoid the math will tell you quickly how many "regressions" you will accept. You could consider a zero ELO improvement a regression.
But it's also important that the regression you accept are small ones. It's a given that you will accept some regressions but you want those to be very small ones.
So I am going to have to either wade through the math or find the calculation online already code up.
SPRT is easy to implement, summing up sequentially some logarithms of some ratios, even the links in this thread are enough to follow it. SPRT behaves usually very well, and is the best recommendation, but if you don't want to go through all the logarithms, you can use LOS of 99.9% in the 5050,000 games range for a Type I error smaller than 5%. It's usually slower than SPRT, but if you have Erf function, it's easy to calculate LOS whenever you want (not necessarily sequentially) as (1 + Erf[(wins  losses)/Sqrt{2*(wins + losses)}])/2. If you want to control Type I and II errors, then SPRT is the way to go.
I'm looking at the wikipedia and it does not seem quite as daunting as I had imagined.
What is the difference between a type 1 and type 2 error?
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.

Don
 Posts: 5106
 Joined: Tue Apr 29, 2008 2:27 pm
Post
by Don » Mon Aug 19, 2013 9:58 pm
Don wrote:
I'm looking at the wikipedia and it does not seem quite as daunting as I had imagined.
What is the difference between a type 1 and type 2 error?
Ignore that last question, I knew the answer already I just assumed it meaning something else in this context.
Don
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.

Ajedrecista
 Posts: 1399
 Joined: Wed Jul 13, 2011 7:04 pm
 Location: Madrid, Spain.

Contact:
Post
by Ajedrecista » Tue Aug 20, 2013 9:40 am
Hello Don:
Don wrote:AlvaroBegue wrote:You are absolutely right that the optimal stopping rule depends on the distribution of ELO increments of the ideas being tested.
Could the people with lots of testing experience give us an idea for what the distribution looks like? A mean and a standard deviation would probably suffice.
I have thought of a procedure to compute the optimal stopping rule given the distribution of ELO increments, and I know that at least Don expressed interest in it in the past. The quantity to maximize is expected ELO gain per game played. Unfortunately, I haven't found the time to actually implement it: I have three kids under 3...
I want to integrate SPRT into our own tester but I am unsure of the math. However it is not difficult to run simulations using the criteria of 2 programs which are equal in strength (simulated programs) while duplicating the draw percentage that you are actually getting. But that is still not SPRT.
So the idea is to find a stopping rule you can live with. You don't want to accept too many regressions. A trivial stopping rule is to stop after your target program is ahead by W games or behind by L games. Using a simulation to avoid the math will tell you quickly how many "regressions" you will accept. You could consider a zero ELO improvement a regression.
But it's also important that the regression you accept are small ones. It's a given that you will accept some regressions but you want those to be very small ones.
So I am going to have to either wade through the math or find the calculation online already code up.
You can follow the FishTest approach in the case of a match between two engines:
https://github.com/glinscott/fishtest/b ... at_util.py
Currently, lines from 54 to 121 (more less) do the SPRT job if I am not wrong. A more compact pseudocode without comments is here (
please do not pay much attention on the syntesis: I do not know C/C++):
Code: Select all
// alpha, beta, bayeselo_0 and bayeselo_1 are known parameters.
// The number of wins, draws and loses are monitored and therefore known.
lower_bound = log(beta/(1.0  alpha));
upper_bound = log((1.0  beta)/alpha);
games = wins + draws + loses;
if ((wins > 0) .and. (draws > 0) .and. (loses > 0)) {
W = wins/games;
D = draws/games;
L = loses/games;
bayeselo = 200.0*log10(W*(1.0  L)/(L*(1.0  W)));
drawelo = 200.0*log10((1.0  L)*(1.0  W)/(L*W));
P0_W = 1.0/(1.0 + 10.0**((drawelo  bayeselo_0)/400.0));
P0_L = 1.0/(1.0 + 10.0**((drawelo + bayeselo_0)/400.0));
P0_D = 1.0  P0_W  P0_L;
P1_W = 1.0/(1.0 + 10.0**((drawelo  bayeselo_1)/400.0));
P1_L = 1.0/(1.0 + 10.0**((drawelo + bayeselo_1)/400.0));
P1_D = 1.0  P1_W  P1_L;
LLR_wins = wins*log(P1_W/P0_W);
LLR_draws = draws*log(P1_D/P0_D);
LLR_loses = loses*log(P1_L/P0_L);
LLR = LLR_wins + LLR_draws + LLR_loses;
if (LLR < lower_bound) {
// The patch fails.
else if (LLR > upper_bound) {
// The patch succeeds.
}
}
Good luck with your SPRT implementation!
Regards from Spain.
Ajedrecista.