Type I error for p-value stopping: Balanced and Unbalanced


Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Type I error for p-value stopping: Balanced and Unbalanced

Post by Laskos »

Several months ago I looked at the significance, i.e. the t-value t = (W-L)/(W+L)^(1/2), for unbalanced opening positions compared to balanced ones. The results were inconclusive, with unbalanced opening positions being at least on par with balanced ones in revealing Elo differences. What we really need is the goodness of the stopping rule based on the t-value. The quantity needed is the Type I error (false positive rate) when one stops at a result of several standard deviations, say 2 or 3. For the same t-value the Type I error differs somewhat between unbalanced and balanced opening positions. Games from unbalanced (and balanced) positions are played in pairs with colors reversed, so for identical adversaries (in our case two recent Stockfishes) we have the needed empirical data:

60s+0.1s games between two identical Stockfishes

Balanced (-30cp, 30cp) (white advantage below 20 Elo): White wins 17%, Black wins 16%, draws 67%
Unbalanced (70cp, 100cp) (white advantage above 120 Elo): White wins 42.5%, Black wins 4%, draws 53.5%

With this empirical data I computed, in 100,000 simulations, the Type I error as a function of the t-value (number of standard deviations) used as the stopping rule: whenever one sees a difference of say 2.5 or 3.0 standard deviations, one stops and declares the winner stronger. This stopping rule has no upper bound on the Type I error, but it is still controllable up to some reasonable number of games; very few people test beyond, say, 100,000 games. The Type I error for unbalanced and balanced openings is shown in this table:

Code: Select all

Number           Type I error        Type I error
of Games         Balanced            Unbalanced
           rule: 2.5sd   3.0sd       2.5sd   3.0sd
=======================================================           
   100            5.6%   0.95%       1.6%    0.21%
  1000           13.6%   3.4%        3.9%    0.66%
 10000           21.7%   6.1%        6.5%    0.9%
100000           30.6%   8.6%        9.2%    1.2%
We see that the Type I error for the same t-value (2.5 and 3 here) is significantly smaller in the case of unbalanced opening positions. In fact, the t=2.5 stopping rule for unbalanced openings is pretty much equivalent to the t=3.0 stopping rule for balanced ones. This amounts to about 40% fewer games needed to stop with unbalanced openings at a given Type I error. The rule of thumb would be: the Type I error is about 5% when stopping at 3 standard deviations from balanced opening positions, or at 2.5 standard deviations from unbalanced ones.
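
The simulation code itself is not posted here; a minimal Python sketch of the procedure just described (the function name and defaults are illustrative, not the actual code used) would be:

Code: Select all

import math
import random

def type1_error(p_white_win, p_black_win, max_games, t_stop, n_sims=100_000):
    # Both engines are identical, so every stop is a false positive.
    # Colors alternate each game, so the reference engine's win/loss
    # probabilities swap on odd games.
    false_positives = 0
    for _ in range(n_sims):
        wins = losses = 0
        for g in range(max_games):
            pw, pb = (p_white_win, p_black_win) if g % 2 == 0 \
                     else (p_black_win, p_white_win)
            v = random.random()
            if v < pw:
                wins += 1
            elif v > 1.0 - pb:
                losses += 1
            if wins + losses > 0:
                t = (wins - losses) / math.sqrt(wins + losses)
                if abs(t) > t_stop:
                    false_positives += 1
                    break
    return false_positives / n_sims

# Balanced openings, 2.5sd rule, runs capped at 1000 games;
# the table above suggests a result near 13-14%.
print(type1_error(0.17, 0.16, 1000, 2.5))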
Ajedrecista
Posts: 1966
Joined: Wed Jul 13, 2011 9:04 pm
Location: Madrid, Spain.

Type I error for p-value stopping: balanced and unbalanced.

Post by Ajedrecista »

Hello Kai:

It is very interesting although I do not fully understand the problem. You introduce the t statistic:

Code: Select all

t = (wins - loses)/sqrt(wins + loses)

t = games*(2*score - 1)/sqrt(games - draws)
t = games*(2*score - 1)/sqrt[games*(1 - draw_ratio)]
t = (2*score - 1)*sqrt[games/(1 - draw_ratio)]

t/sqrt(games) = (2*score - 1)/sqrt(1 - draw_ratio)

Am I right?
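
A quick numerical check of these identities, with made-up counts:

Code: Select all

import math

# Arbitrary example counts, just to verify the algebra above.
wins, loses, draws = 30, 20, 50
games = wins + loses + draws            # 100
score = (wins + 0.5 * draws) / games    # 0.55
draw_ratio = draws / games              # 0.50

t1 = (wins - loses) / math.sqrt(wins + loses)
t2 = games * (2 * score - 1) / math.sqrt(games - draws)
t3 = (2 * score - 1) * math.sqrt(games / (1 - draw_ratio))
print(t1, t2, t3)                       # all three: 1.41421356...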
You have played a certain number of games between two identical recent SF versions with some sets of balanced and unbalanced positions, right?
Laskos wrote:60s+0.1s games between two identical Stockfishes

Balanced (-30cp, 30cp) (white advantage below 20 Elo): White wins 17%, Black wins 16%, draws 67%
Unbalanced (70cp, 100cp) (white advantage above 120 Elo): White wins 42.5%, Black wins 4%, draws 53.5%
I guess that your games were played in this way: SF_A vs. SF_B (opening 1), SF_B vs. SF_A (opening 1); SF_A vs. SF_B (opening 2), SF_B vs. SF_A (opening 2); and so on.

But my doubt is about the simulations. Taking balanced openings as an example, I suppose that you use the info W = 0.17, B = 0.16, D = 0.67 and 'play' virtual games by drawing PRNG numbers. For example:

Code: Select all

// Pseudocode.

v = random_number

if (v < 0.17) then
  white wins
else if (v > 1 - 0.16) then
  black wins
else
  draw
end if

// And you take care of changing sides of the opponents. For example:

// Odd games (1, 3, 5...):
if (v < 0.17) then
  A wins
else if (v > 1 - 0.16) then
  B wins
else
  draw
end if

// Even games (2, 4, 6...):
if (v < 0.17) then
  B wins
else if (v > 1 - 0.16) then
  A wins
else
  draw
end if
You compute t after each game (except while wins = loses = 0, of course) and you finish each simulation when |t| > 2.5 (or 3) is reached for the first time, right?

I am lost from here. First of all, I guess that the results of your table:
Laskos wrote:

Code: Select all

Number           Type I error        Type I error 
of Games         Balanced            Unbalanced 
           rule: 2.5sd   3.0sd       2.5sd   3.0sd
=======================================================            
   100            5.6%   0.95%       1.6%    0.21% 
  1000           13.6%   3.4%        3.9%    0.66% 
 10000           21.7%   6.1%        6.5%    0.9% 
100000           30.6%   8.6%        9.2%    1.2%
In the number of games column, with 1000 do you mean from game 101 to 1000, and so on?

Secondly, I do not understand the number of false positives or Type I errors. To be honest, I do not know what a 'false positive' is in this context. Could you please explain it a little more in a 'Stats for dummies' fashion? Thanks in advance. I also have doubts whether 'wins - loses' means 'wins(opponent A) - loses(opponent A)' or 'wins(white) - loses(white)'... I think it is the first option.

Who knows if I would be able to come here (once I understand this) with my own numbers just to check your results.

And your conclusion is that unbalanced openings need on average fewer games in a match with this stopping rule. That is good to know.

Regards from Spain.

Ajedrecista.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Type I error for p-value stopping: balanced and unbalanced

Post by Laskos »

Ajedrecista wrote:Hello Kai:

It is very interesting although I do not fully understand the problem. You introduce the t statistic:

Code: Select all

t = (wins - loses)/sqrt(wins + loses)

t = games*(2*score - 1)/sqrt(games - draws)
t = games*(2*score - 1)/sqrt[games*(1 - draw_ratio)]
t = (2*score - 1)*sqrt[games/(1 - draw_ratio)]

t/sqrt(games) = (2*score - 1)/sqrt(1 - draw_ratio)

Am I right?
You have played a certain number of games between two identical recent SF versions with some sets of balanced and unbalanced positions, right?
Laskos wrote:60s+0.1s games between two identical Stockfishes

Balanced (-30cp, 30cp) (white advantage below 20 Elo): White wins 17%, Black wins 16%, draws 67%
Unbalanced (70cp, 100cp) (white advantage above 120 Elo): White wins 42.5%, Black wins 4%, draws 53.5%
I guess that your games were played in this way: SF_A vs. SF_B (opening 1), SF_B vs. SF_A (opening 1); SF_A vs. SF_B (opening 2), SF_B vs. SF_A (opening 2); and so on.

But my doubt is about the simulations. Taking balanced openings as an example, I suppose that you use the info W = 0.17, B = 0.16, D = 0.67 and 'play' virtual games by drawing PRNG numbers. For example:

Code: Select all

// Pseudocode.

v = random_number

if (v < 0.17) then
  white wins
else if (v > 1 - 0.16) then
  black wins
else
  draw
end if

// And you take care of changing sides of the opponents. For example:

// Odd games (1, 3, 5...):
if (v < 0.17) then
  A wins
else if (v > 1 - 0.16) then
  B wins
else
  draw
end if

// Even games (2, 4, 6...):
if (v < 0.17) then
  B wins
else if (v > 1 - 0.16) then
  A wins
else
  draw
end if
You compute t after each game (except while wins = loses = 0, of course) and you finish each simulation when |t| > 2.5 (or 3) is reached for the first time, right?
Up to this point you seem to do it correctly.
I am lost from here. First of all, I guess that the results of your table:
Laskos wrote:

Code: Select all

Number           Type I error        Type I error 
of Games         Balanced            Unbalanced 
           rule: 2.5sd   3.0sd       2.5sd   3.0sd
=======================================================            
   100            5.6%   0.95%       1.6%    0.21% 
  1000           13.6%   3.4%        3.9%    0.66% 
 10000           21.7%   6.1%        6.5%    0.9% 
100000           30.6%   8.6%        9.2%    1.2%
In the number of games column, with 1000 do you mean from game 101 to 1000, and so on?
No, each run starts from game 1: 1 to 100, 1 to 1000, 1 to 10000 and so on. The rule of thumb for the Type I error, as I computed it several months ago, is that it grows as log(N_games), so the error added from 100 to 1000 games is roughly equal to that added from 1000 to 10000, and so on. Again, the error here is the total accumulated one, starting from game 1: the first time I meet t > 2.5 (or 3) I stop and count the result as the finding of a (false) positive.
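The near-log(N_games) growth can be read off the table directly: each tenfold increase in games adds a roughly constant amount of error. For the balanced 2.5sd column:

Code: Select all

# Balanced openings, 2.5sd column of the table above (in %).
err = [5.6, 13.6, 21.7, 30.6]    # at 100, 1000, 10000, 100000 games
print([round(b - a, 1) for a, b in zip(err, err[1:])])   # [8.0, 8.1, 8.9]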
Secondly, I do not understand the number of false positives or Type I errors. To be honest, I do not know what a 'false positive' is in this context. Could you please explain it a little more in a 'Stats for dummies' fashion? Thanks in advance. I also have doubts whether 'wins - loses' means 'wins(opponent A) - loses(opponent A)' or 'wins(white) - loses(white)'... I think it is the first option.
A Type I error is the wrongful rejection of a true null hypothesis; in our case the null hypothesis is that the engines are equal in strength (and here we know they are). At t=2.5 (or 3) we stop and accept the hypothesis that the strengths are not equal: we found a "positive" which seems to disprove the null hypothesis, but in fact it is a false positive. I am no better at explaining this than Wikipedia:
https://en.wikipedia.org/wiki/Type_I_and_type_II_errors

Yes, it is wins(A) - losses(A) (= losses(B) - wins(B)) that we compute, and each opening is played twice, with colors reversed, in independent games, so I generate a random number for each game (2 random numbers per opening).

Who knows if I would be able to come here (once I understand this) with my own numbers just to check your results.

And your conclusion is that unbalanced openings need on average fewer games in a match with this stopping rule. That is good to know.

Regards from Spain.

Ajedrecista.
I hope you will re-do the simulations, as I did them quickly and the chance of some error on my part (a false positive :) ) is high.
Ajedrecista
Posts: 1966
Joined: Wed Jul 13, 2011 9:04 pm
Location: Madrid, Spain.

Type I error for p-value stopping: balanced and unbalanced.

Post by Ajedrecista »

Hello again:

I understood it this time! Thank you very much for explaining it. I wrote the following Fortran 95 code:

Code: Select all

program Type_I

implicit none

integer, parameter :: simulations = 1e4, max_rounds = 5e1
integer :: s, r, wins, loses, false_positives
real(KIND=2) :: v1, v2, t, t_criterium, W, B, start, finish

start = cpu_clock@()

t_criterium = 2.5d0
W = 1.7d-1; B = 1.6d-1
false_positives = 0

do s = 1, simulations
  wins = 0; loses = 0
  
  do r = 1, max_rounds
    
    ! First game of the round: the reference engine plays White.
    v1 = random@()
    if (v1 < W) then
      wins = wins + 1
    else if (v1 > 1d0 - B) then
      loses = loses + 1
    end if
    
    ! t is defined as soon as there is at least one decisive game.
    if (wins + loses > 0) then
      t = 1d0*(wins - loses)/sqrt(wins + loses + 0d0)
      if (abs(t) > t_criterium) then
        false_positives = false_positives + 1
        exit
      end if
    end if
    
    ! Second game of the round: colors reversed, so W and B swap roles.
    v2 = random@()
    if (v2 < W) then
      loses = loses + 1
    else if (v2 > 1d0 - B) then
      wins = wins + 1
    end if
    if (wins + loses > 0) then
      t = 1d0*(wins - loses)/sqrt(wins + loses + 0d0)
      if (abs(t) > t_criterium) then
        false_positives = false_positives + 1
        exit
      end if
    end if
    
  end do
  
end do

write(*,'(A,I5)') 'Simulations: ', simulations
write(*,'(A,I6)') 'Games: ', max_rounds + max_rounds
write(*,'(A,F3.1)') 't rule: abs(t) > ', t_criterium
write(*,'(A,I6,A,F6.2,A)') 'False positives: ', false_positives, ' (', 1d2*false_positives/simulations, ' %).'
write(*,*)

finish = cpu_clock@()

write(*,'(A,F6.1,A)') 'Elapsed time: ', (finish - start)/3d9, ' seconds.'  ! 3d9 = 3 GHz of my CPU.

end program Type_I
I change the values of W, B, t_criterium and max_rounds = games/2 for each case. I only ran 10k simulations for each example because my do loops are not optimized. Here are my results:

Code: Select all

10000 simulations each time.

 Number           Type I error          Type I error
of games            Balanced             Unbalanced
                2.5 SD    3.0 SD      2.5 SD    3.0 SD
------------------------------------------------------
    100          5.47%     0.98%       1.95%     0.16%
   1000         13.49%     3.08%       4.21%     0.49%
  10000         20.65%     5.74%       6.52%     0.74%
 100000         28.44%     8.63%       8.95%     1.22%
Comparing with your results:
Laskos wrote:

Code: Select all

Number           Type I error        Type I error
of Games         Balanced            Unbalanced
           rule: 2.5sd   3.0sd       2.5sd   3.0sd
=======================================================
   100            5.6%   0.95%       1.6%    0.21%
  1000           13.6%   3.4%        3.9%    0.66%
 10000           21.7%   6.1%        6.5%    0.9%
100000           30.6%   8.6%        9.2%    1.2%
They are quite similar, so I can say that we agree. Of course I need more simulations; 100k of them for each case would take a little more than an hour in total on my PC with my current code.

Regards from Spain.

Ajedrecista.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Type I error for p-value stopping: balanced and unbalanced

Post by Laskos »

Ajedrecista wrote:I only ran 10k simulations for each example because my do loops are not optimized. Here are my results:

Code: Select all

10000 simulations each time.

 Number           Type I error          Type I error
of games            Balanced             Unbalanced
                2.5 SD    3.0 SD      2.5 SD    3.0 SD
------------------------------------------------------
    100          5.47%     0.98%       1.95%     0.16%
   1000         13.49%     3.08%       4.21%     0.49%
  10000         20.65%     5.74%       6.52%     0.74%
 100000         28.44%     8.63%       8.95%     1.22%
Comparing with your results:
Laskos wrote:

Code: Select all

Number           Type I error        Type I error
of Games         Balanced            Unbalanced
           rule: 2.5sd   3.0sd       2.5sd   3.0sd
=======================================================
   100            5.6%   0.95%       1.6%    0.21%
  1000           13.6%   3.4%        3.9%    0.66%
 10000           21.7%   6.1%        6.5%    0.9%
100000           30.6%   8.6%        9.2%    1.2%
They are quite similar, so I can say that we agree. Of course I need more simulations; 100k of them for each case would take a little more than an hour in total on my PC with my current code.

Regards from Spain.

Ajedrecista.
I am glad we agree almost completely; now I trust my results more :). I did it quickly, and in fact IIRC some of my results are from 30,000 simulations instead of 100,000, because it was too time consuming.
BeyondCritics
Posts: 396
Joined: Sat May 05, 2012 2:48 pm
Full name: Oliver Roese

Re: Type I error for p-value stopping: Balanced and Unbalanced

Post by BeyondCritics »

Laskos wrote: This stopping rule has no upper bound for Type I error, but it is still controllable to some reasonable number of games. Very few are testing beyond say 100,000 games.
I don't get your point here. What is it good for, doing it completely wrong in the first place? You just confuse matters. The reasonable approach is SPRT (https://en.wikipedia.org/wiki/Sequentia ... ratio_test), as used by the Stockfish team.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Type I error for p-value stopping: Balanced and Unbalanced

Post by Laskos »

BeyondCritics wrote:
Laskos wrote: This stopping rule has no upper bound for Type I error, but it is still controllable to some reasonable number of games. Very few are testing beyond say 100,000 games.
I don't get your point here. What is it good for, doing it completely wrong in the first place? You just confuse matters. The reasonable approach is SPRT (https://en.wikipedia.org/wiki/Sequentia ... ratio_test), as used by the Stockfish team.
Right, SPRT is the way to go for computer chess matches. But first, most people, not only in computer chess, use p-value stopping, and often sloppily. Second, the smaller number of games to an LLR stop with unbalanced positions compared to balanced ones is a valid conclusion for the SPRT stop as used by the Stockfish team too.
BeyondCritics
Posts: 396
Joined: Sat May 05, 2012 2:48 pm
Full name: Oliver Roese

Re: Type I error for p-value stopping: Balanced and Unbalanced

Post by BeyondCritics »

Why not tell others how to test correctly?
I agree that SPRT is likely the way to go in chess testing, since it just works and is accessible without a degree. Using it would be beneficial at least for Stockfish testers.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Type I error for p-value stopping: balanced and unbalanced

Post by Laskos »

Ajedrecista wrote:

Code: Select all

10000 simulations each time.

 Number           Type I error          Type I error
of games            Balanced             Unbalanced
                2.5 SD    3.0 SD      2.5 SD    3.0 SD
------------------------------------------------------
    100          5.47%     0.98%       1.95%     0.16%
   1000         13.49%     3.08%       4.21%     0.49%
  10000         20.65%     5.74%       6.52%     0.74%
 100000         28.44%     8.63%       8.95%     1.22%
Comparing with your results:
Laskos wrote:

Code: Select all

Number           Type I error        Type I error
of Games         Balanced            Unbalanced
           rule: 2.5sd   3.0sd       2.5sd   3.0sd
=======================================================
   100            5.6%   0.95%       1.6%    0.21%
  1000           13.6%   3.4%        3.9%    0.66%
 10000           21.7%   6.1%        6.5%    0.9%
100000           30.6%   8.6%        9.2%    1.2%
They are quite similar, so I can say that we agree. Of course I need more simulations; 100k of them for each case would take a little more than an hour in total on my PC with my current code.

Regards from Spain.

Ajedrecista.
Hello Jesus, I think you might do something interesting for me. IIRC you have an SPRT simulator, correct me if I am wrong. I also found that the Type II error is smaller for unbalanced openings, and I would like to check whether SPRT with the same H0, H1, alpha, beta will stop with a smaller number of games on average with unbalanced openings compared to balanced ones.

I took two Stockfishes, a recent SF_strong and an older SF_weak, the difference between them being of the order of 50 Elo points. I let them play 2,000 games at 30''+0.3'' in each match (balanced openings/unbalanced openings) and got the following performances:
  • 1/ Balanced openings

    SF_strong as White:
    +30% =58% -12%

    SF_strong as Black:
    +27% =58% -15%

    And the opposite:

    SF_weak as White:
    +15% =58% -27%

    SF_weak as Black:
    +12% =58% -30%


    2/ Unbalanced openings:

    SF_strong as White:
    +59% =39% -2%

    SF_strong as Black:
    +5% =65% -30%

    And the opposite:

    SF_weak as White:
    +30% =65% -5%

    SF_weak as Black:
    +2% =39% -59%
Can you input these outcome probabilities into 2 separate SPRT simulations (balanced/unbalanced) to get the average (over many runs) number of games needed to stop in each case? With, say, H0=0, H1=30, alpha = beta = 0.05. In terms of significance the matches were very similar, but in terms of Type I and II errors I guess the unbalanced openings will favor a faster stop (a smaller number of games for SPRT to stop, picking hypothesis H1).
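
Neither simulator is posted in this thread; as a rough self-contained illustration, such an SPRT run on these measured probabilities could be sketched in Python as follows, using the trinomial BayesElo model with a fixed drawelo (H0 and H1 taken in Bayeselo units, drawelo values as derived by Ajedrecista below; all helper names are mine):

Code: Select all

import math
import random

def bayeselo_probs(be, de):
    # BayesElo model: win/loss/draw probabilities at a strength
    # difference of be Bayeselo and drawelo de.
    win = 1.0 / (1.0 + 10.0 ** ((de - be) / 400.0))
    loss = 1.0 / (1.0 + 10.0 ** ((de + be) / 400.0))
    return win, loss, 1.0 - win - loss

def sprt_stop_length(p_pairs, elo0, elo1, drawelo,
                     alpha=0.05, beta=0.05, max_games=100_000):
    # Draw virtual games from the measured probabilities and accumulate
    # the log-likelihood ratio of H1 vs H0 until a Wald bound is hit.
    # Returns (H1_accepted, games_played).
    upper = math.log((1.0 - beta) / alpha)    # cross it -> accept H1
    lower = math.log(beta / (1.0 - alpha))    # cross it -> accept H0
    w0, l0, d0 = bayeselo_probs(elo0, drawelo)
    w1, l1, d1 = bayeselo_probs(elo1, drawelo)
    llr = 0.0
    for g in range(max_games):
        p_win, p_loss = p_pairs[g % 2]        # colors alternate each game
        v = random.random()
        if v < p_win:
            llr += math.log(w1 / w0)
        elif v > 1.0 - p_loss:
            llr += math.log(l1 / l0)
        else:
            llr += math.log(d1 / d0)
        if llr >= upper:
            return True, g + 1
        if llr <= lower:
            return False, g + 1
    return None, max_games

# Measured (win, loss) probabilities of SF_strong with White and Black:
balanced = [(0.30, 0.12), (0.27, 0.15)]
unbalanced = [(0.59, 0.02), (0.05, 0.30)]

for name, pairs, de in (("balanced", balanced, 241.23),
                        ("unbalanced", unbalanced, 209.50)):
    runs = [sprt_stop_length(pairs, 0.0, 30.0, de) for _ in range(10_000)]
    avg = sum(n for _, n in runs) / len(runs)
    print(name, "average games to stop:", round(avg, 1))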
Ajedrecista
Posts: 1966
Joined: Wed Jul 13, 2011 9:04 pm
Location: Madrid, Spain.

Type I error for p-value stopping: balanced and unbalanced.

Post by Ajedrecista »

Hello Kai:
Laskos wrote:Hello Jesus, I think you might do something interesting for me. IIRC you have an SPRT simulator, correct me if I am wrong. I also found that the Type II error is smaller for unbalanced openings, and I would like to check whether SPRT with the same H0, H1, alpha, beta will stop with a smaller number of games on average with unbalanced openings compared to balanced ones.

I took two Stockfishes, a recent SF_strong and an older SF_weak, the difference between them being of the order of 50 Elo points. I let them play 2,000 games at 30''+0.3'' in each match (balanced openings/unbalanced openings) and got the following performances:
  • 1/ Balanced openings

    SF_strong as White:
    +30% =58% -12%

    SF_strong as Black:
    +27% =58% -15%

    And the opposite:

    SF_weak as White:
    +15% =58% -27%

    SF_weak as Black:
    +12% =58% -30%


    2/ Unbalanced openings:

    SF_strong as White:
    +59% =39% -2%

    SF_strong as Black:
    +5% =65% -30%

    And the opposite:

    SF_weak as White:
    +30% =65% -5%

    SF_weak as Black:
    +2% =39% -59%
Can you input these outcome probabilities into 2 separate SPRT simulations (balanced/unbalanced) to get the average (over many runs) number of games needed to stop in each case? With, say, H0=0, H1=30, alpha = beta = 0.05. In terms of significance the matches were very similar, but in terms of Type I and II errors I guess the unbalanced openings will favor a faster stop (a smaller number of games for SPRT to stop, picking hypothesis H1).
I programmed an SPRT simulator almost three years ago, and I recently added parameter estimation to fit the cumulative distribution function of the length of simulations (number of games) to a log-normal distribution, which fits reasonably well.

My simulator works in a slightly different way than the one you propose: I input alpha, beta, the lower and upper bounds of the SPRT (in Bayeselo units), and two parameters: the expected Elo gain (in Bayeselo units, but there is a known conversion relationship) and drawelo (which is related to the draw ratio).

So the overall input is summarized by prob.(A wins), prob.(B wins) and prob.(draw) = 1 - prob.(A wins) - prob.(B wins), instead of prob.(A wins with white), prob.(A wins with black), prob.(B wins with white), prob.(B wins with black), prob.(draw of A-B) = 1 - prob.(A wins with white) - prob.(B wins with black) and prob.(draw of B-A) = 1 - prob.(A wins with black) - prob.(B wins with white).

I see that your scores with balanced and unbalanced openings are not the same (µ_SF_strong = 57.5% and 58% respectively), although that is expected due to error bars. OTOH, draw_ratio(balanced) = 58% and draw_ratio(unbalanced) = 52%. More draws usually translate into a larger average number of games.

I need to make changes because my usual way of proceeding is game after game with the same set of probabilities, not round after round (A-B, B-A and repeat) with two sets of probabilities.

First of all, averaging white and black probabilities (I know that you do not want this, but it is to get a rough idea):

Code: Select all

bayeselo = 200*log10{W*(1 - L)/[L*(1 - W)]}
drawelo = 200*log10[(1 - L)*(1 - W)/(L*W)]  // Estimated from the sample of games.

From SF_strong POV:

Balanced:
W = (0.30 + 0.27)/2 = 0.285
L = (0.12 + 0.15)/2 = 0.135
Elo      ~  52.5116
bayeselo ~  81.4442
drawelo  ~ 241.2287

Unbalanced:
W = (0.59 + 0.05)/2 = 0.32
L = (0.02 + 0.30)/2 = 0.16
Elo      ~  56.0715
bayeselo ~  78.5601
drawelo  ~ 209.5036
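For reference, these numbers can be reproduced in a few lines of Python from the formulas above:

Code: Select all

import math

def stats(W, L):
    mu = W + (1.0 - W - L) / 2.0                # score of SF_strong
    elo = -400.0 * math.log10(1.0 / mu - 1.0)   # logistic Elo
    bayeselo = 200.0 * math.log10(W * (1 - L) / (L * (1 - W)))
    drawelo = 200.0 * math.log10((1 - L) * (1 - W) / (L * W))
    return elo, bayeselo, drawelo

print(stats(0.285, 0.135))   # ~ (52.5116, 81.4442, 241.2287)  balanced
print(stats(0.32, 0.16))     # ~ (56.0715, 78.5601, 209.5036)  unbalanced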
I input these values into an SPRT tool by Michel van den Bergh that gives theoretical results (not simulations). I have the following doubt: I suppose that H0 and H1 have the meaning of an SPRT(H0, H1) test, but are H0 and H1 expressed in Bayeselo or in logistic Elo? There is a conversion formula between Bayeselo and logistic Elo that works for small values (let us say |value| < 10 Bayeselo, for example), but I am not so sure about larger values. Anyway:

Code: Select all

x = 10^(-drawelo/400)
bayeselo_to_Elo_scale = 4*x/(1 + x)²  // Elo = (bayeselo_to_Elo_scale)*bayeselo.
H0 = 0 Bayeselo = 0 Elo
H1?
H1 = 30 Bayeselo
or
H1(30 Elo, drawelo ~ 241.2287) ~ 46.9406 Bayeselo
or
H1(30 Elo, drawelo ~ 209.5036) ~ 42.2962 Bayeselo

------------------------

C:\[...]\sprta>sprt_w32
Usage: sprta.py elo0 elo1 draw_elo elo
elo0,elo1 are expressed in BayesElo
elo is expressed in LogisticElo

Balanced:
C:\[...]\sprta>sprt_w32 0 30 241.2287 52.5116
elo0     =     0.00
elo1     =    30.00
draw_elo =   241.23
elo      =    52.51
pass probability:      100.03%
avg running time:        172

Unbalanced:
C:\[...]\sprta>sprt_w32 0 30 209.5036 56.0715
elo0     =     0.00
elo1     =    30.00
draw_elo =   209.50
elo      =    56.07
pass probability:      100.00%
avg running time:        170

************************

Balanced:
C:\[...]\sprta>sprt_w32 0 46.9406 241.2287 52.5116
elo0     =     0.00
elo1     =    46.94
draw_elo =   241.23
elo      =    52.51
pass probability:      99.94%
avg running time:        125

Unbalanced:
C:\[...]\sprta>sprt_w32 0 42.2962 209.5036 56.0715
elo0     =     0.00
elo1     =    42.30
draw_elo =   209.50
elo      =    56.07
pass probability:      99.97%
avg running time:        133
'avg running time' is the average number of games. It is funny to see 'pass probability: 100.03%' in one of the outputs.

Please realise that avg_games(balanced) > avg_games(unbalanced) with H1 = 30 Bayeselo, but the opposite with H1 = 30 Elo. This could happen due to the changing ratio (elo - H0)/(H1 - H0) (all written in the same units).
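
A quick Python check of the conversion and of that ratio, using the numbers above:

Code: Select all

import math

def bayeselo_to_elo_scale(drawelo):
    # Scale factor such that Elo = scale * Bayeselo.
    x = 10.0 ** (-drawelo / 400.0)
    return 4.0 * x / (1.0 + x) ** 2

for name, drawelo, elo_bayes in (("balanced", 241.2287, 81.4442),
                                 ("unbalanced", 209.5036, 78.5601)):
    scale = bayeselo_to_elo_scale(drawelo)
    h1_bayes = 30.0 / scale   # H1 = 30 Elo converted to Bayeselo
    print(name, "H1(30 Elo) =", round(h1_bayes, 4), "Bayeselo;",
          "ratio:", round(elo_bayes / 30.0, 3), "(H1 = 30 Bayeselo) vs",
          round(elo_bayes / h1_bayes, 3), "(H1 = 30 Elo)")

This prints 46.9406 and 42.2962 Bayeselo, matching the values above, and the ratio ordering indeed flips between the two unit choices (2.715 vs. 2.619, then 1.735 vs. 1.857), just as the average running times do.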

Last but not least, please remember that there is a chess SPRT online calculator with theoretical results (not simulations), based on Michel's tool if I am not wrong.

------------------------

I am not running simulations in your proposed way right now for two reasons:

a) I would like to know if you want H1 = 30 Elo (logistic Elo) or H1 = 30 Bayeselo.
b) I have not made the changes yet and I am not sure that I will have enough time and skills to make it work properly. Sorry.

So, reading b), it is unlikely that I can do what you requested, though the odds may change (like the great comeback in the 1999 Champions League Final).

Regards from Spain.

Ajedrecista.