Type I error for p-value stopping: Balanced and Unbalanced

Laskos
Posts: 9441
Joined: Wed Jul 26, 2006 8:21 pm
Full name: Kai Laskos

Re: Type I error for p-value stopping: balanced and unbalanc

Post by Laskos » Tue Jun 21, 2016 11:52 am

Ajedrecista wrote:
I programmed a SPRT simulator almost three years ago and I recently added parameter estimation to fit the cumulative distribution function of the length of simulations (number of games) to a log-normal distribution, which fits reasonably well.

My simulator works in a slightly different way than the one you propose: I input alpha, beta, the lower and upper bounds of the SPRT (in Bayeselo units) and two parameters: the expected Elo gain (in Bayeselo units, but there is a known conversion) and drawelo (which is related to the draw ratio).

So, the overall input is summarized in prob.(A wins), prob.(B wins) and prob.(draw) = 1 - prob.(A wins) - prob.(B wins); instead of prob.(A wins with white), prob.(A wins with black), prob.(B wins with white), prob.(B wins with black), prob.(draw of A-B) = 1 - prob.(A wins with white) - prob.(B wins with black), prob.(draw of B-A) = 1 - prob.(A wins with black) - prob.(B wins with white).

I see that your scores with balanced and unbalanced openings are not the same (µ_SF_strong = 57.5% and 58%, respectively), although that is expected given the error bars. OTOH, draw_ratio(balanced) = 58% and draw_ratio(unbalanced) = 52%; more draws usually translate into a larger average number of games.

I need to make changes, because my usual procedure is game after game with the same set of probabilities, not round after round (A-B, B-A and repeat) with two sets of probabilities.

First of all, averaging white and black probabilities (I know that you do not want this, but it is to get a rough idea):

Code: Select all

bayeselo = 200*log10{W*(1 - L)/[L*(1 - W)]}
drawelo = 200*log10[(1 - L)*(1 - W)/(L*W)]  // Estimated from the sample of games.

From SF_strong POV:

Balanced:
W = (0.30 + 0.27)/2 = 0.285
L = (0.12 + 0.15)/2 = 0.135
Elo      ~  52.5116
bayeselo ~  81.4442
drawelo  ~ 241.2287

Unbalanced:
W = (0.59 + 0.05)/2 = 0.32
L = (0.02 + 0.30)/2 = 0.16
Elo      ~  56.0715
bayeselo ~  78.5601
drawelo  ~ 209.5036
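These conversions are easy to cross-check. Here is a short Python sketch (the function name is mine) applying the formulas above to the averaged win/loss probabilities:

```python
from math import log10

def elo_numbers(W, L):
    """Logistic Elo, bayeselo and drawelo from averaged win/loss probabilities."""
    score = W + (1.0 - W - L) / 2.0                            # draws count half a point
    elo = 400.0 * log10(score / (1.0 - score))                 # logistic Elo of that score
    bayeselo = 200.0 * log10(W * (1.0 - L) / (L * (1.0 - W)))
    drawelo = 200.0 * log10((1.0 - L) * (1.0 - W) / (L * W))
    return elo, bayeselo, drawelo

print(elo_numbers(0.285, 0.135))  # balanced:   ~(52.51, 81.44, 241.23)
print(elo_numbers(0.32, 0.16))    # unbalanced: ~(56.07, 78.56, 209.50)
```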
I input these values into a SPRT tool by Michel van den Bergh that gives theoretical results (not simulations). I have the following doubt: I suppose that H0 and H1 have the meaning of a SPRT(H0, H1) test, but are H0 and H1 in Bayeselo or logistic Elo? There is a conversion formula between Bayeselo and logistic Elo that works for small values (let us say |value| < 10 Bayeselo, for example), but I am not so sure about larger values. Anyway:

Code: Select all

x = 10^(-drawelo/400)
bayeselo_to_Elo_scale = 4*x/(1 + x)²  // Elo = (bayeselo_to_Elo_scale)*bayeselo.
H0 = 0 Bayeselo = 0 Elo
H1?
H1 = 30 Bayeselo
or
H1(30 Elo, drawelo ~ 241.2287) ~ 46.9406 Bayeselo
or
H1(30 Elo, drawelo ~ 209.5036) ~ 42.2962 Bayeselo
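The conversion factor itself can be checked numerically; a sketch of the scale formula above (the function name is mine):

```python
def bayeselo_to_elo_scale(drawelo):
    """Slope of logistic Elo per bayeselo (valid near elo = 0), given drawelo."""
    x = 10.0 ** (-drawelo / 400.0)
    return 4.0 * x / (1.0 + x) ** 2

# H1 = 30 logistic Elo expressed in bayeselo units:
print(30.0 / bayeselo_to_elo_scale(241.2287))  # ~46.9406 bayeselo
print(30.0 / bayeselo_to_elo_scale(209.5036))  # ~42.2962 bayeselo
```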

------------------------

C:\[...]\sprta>sprt_w32
Usage: sprta.py elo0 elo1 draw_elo elo
elo0,elo1 are expressed in BayesElo
elo is expressed in LogisticElo

Balanced:
C:\[...]\sprta>sprt_w32 0 30 241.2287 52.5116
elo0     =     0.00
elo1     =    30.00
draw_elo =   241.23
elo      =    52.51
pass probability:      100.03%
avg running time:        172

Unbalanced:
C:\[...]\sprta>sprt_w32 0 30 209.5036 56.0715
elo0     =     0.00
elo1     =    30.00
draw_elo =   209.50
elo      =    56.07
pass probability:      100.00%
avg running time:        170

************************

Balanced:
C:\[...]\sprta>sprt_w32 0 46.9406 241.2287 52.5116
elo0     =     0.00
elo1     =    46.94
draw_elo =   241.23
elo      =    52.51
pass probability:      99.94%
avg running time:        125

Unbalanced:
C:\[...]\sprta>sprt_w32 0 42.2962 209.5036 56.0715
elo0     =     0.00
elo1     =    42.30
draw_elo =   209.50
elo      =    56.07
pass probability:      99.97%
avg running time:        133
'avg running time' is the average number of games. It is funny to see 'pass probability: 100.03%' in one of the outputs.

Please realise that avg_games(balanced) > avg_games(unbalanced) with H1 = 30 Bayeselo, but the opposite with H1 = 30 Elo. This may be due to the changing ratio (elo - H0)/(H1 - H0) (all written in the same units).
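For what it is worth, Wald's classical approximation reproduces these running times quite well: since the pass probability is ~100%, the expected number of games is roughly log((1 - beta)/alpha) divided by the expected per-game LLR drift at the true strength. A Python sketch under the BayesElo trinomial model (all names are mine):

```python
from math import log

def bayeselo_probs(elo, drawelo):
    """Win/draw/loss probabilities in the BayesElo model."""
    w = 1.0 / (1.0 + 10.0 ** ((drawelo - elo) / 400.0))
    l = 1.0 / (1.0 + 10.0 ** ((drawelo + elo) / 400.0))
    return w, 1.0 - w - l, l

def expected_games(true_wdl, elo0, elo1, drawelo, alpha=0.05, beta=0.05):
    p0 = bayeselo_probs(elo0, drawelo)
    p1 = bayeselo_probs(elo1, drawelo)
    drift = sum(t * log(b / a) for t, a, b in zip(true_wdl, p0, p1))
    return log((1.0 - beta) / alpha) / drift  # valid when H1 is (almost) surely accepted

# Balanced case: true W/D/L = 0.285/0.58/0.135, drawelo ~ 241.23, SPRT(0, 30):
print(expected_games((0.285, 0.58, 0.135), 0.0, 30.0, 241.2287))  # ~174 games
```

This lands close to the 172 games reported by the tool for the balanced case.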

Last but not least, please remember that there is a chess SPRT online calculator with theoretical results (not simulations), based on Michel's tool if I am not wrong.

------------------------

I am not running simulations in your proposed way right now, for two reasons:

a) I would like to know if you want H1 = 30 Elo (logistic Elo) or H1 = 30 Bayeselo.
b) I have not made the changes yet, and I am not sure that I will have enough time and skill to make it work properly. Sorry.

So, given b), it is unlikely that I can do what you requested, though the odds may change (like the great comeback in the 1999 Champions League final).

Regards from Spain.

Ajedrecista.
Hello Jesus, thanks for your time. The Elo difference is not exactly equal in these matches (it is not a rounding error); what is pretty much equal in these matches is the significance (W-L)/sigma. The difference is a little larger in the unbalanced case, but so is the error margin, so the significance (W-L)/sigma is equal in the two cases.

I forgot about this mess with logistic Elo and BayesElo; take H1 as 30 BayesElo. By averaging White and Black scores, it is expected that the number of games needed to stop would be almost equal, as it is given mostly by the significance. The effect of interest is very different White/Black results versus very similar White/Black results, and I think that keeping games sequentially pairwise color-reversed is important. I am not sure how it is done in the simulator, but if you manage to keep these pairwise separated White/Black outcomes, it would be great. I do not expect a small effect of, say, 1% fewer games; I would expect upwards of 20% fewer games for unbalanced.
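To illustrate numerically: with the color-averaged probabilities of the two matches, the per-game significance (W-L)/sigma is indeed nearly identical. A quick Python sketch (here sigma is the standard deviation of the per-game win-minus-loss score; the function name is mine):

```python
from math import sqrt

def per_game_significance(w, l):
    """(W - L)/sigma for a single game, treating win-minus-loss as +1/0/-1."""
    return (w - l) / sqrt(w + l - (w - l) ** 2)

print(per_game_significance(0.285, 0.135))  # balanced:   ~0.238
print(per_game_significance(0.32, 0.16))    # unbalanced: ~0.237
```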

Thanks and hoping for your Great Comeback :).

Ajedrecista
Posts: 1398
Joined: Wed Jul 13, 2011 7:04 pm
Location: Madrid, Spain.
Contact:

Type I error for p-value stopping: balanced and unbalanced.

Post by Ajedrecista » Tue Jun 21, 2016 5:31 pm

Hello again:
Laskos wrote:Hello Jesus, thanks for your time. The Elo difference is not exactly equal in these matches (it is not a rounding error); what is pretty much equal in these matches is the significance (W-L)/sigma. The difference is a little larger in the unbalanced case, but so is the error margin, so the significance (W-L)/sigma is equal in the two cases.

I forgot about this mess with logistic Elo and BayesElo; take H1 as 30 BayesElo. By averaging White and Black scores, it is expected that the number of games needed to stop would be almost equal, as it is given mostly by the significance. The effect of interest is very different White/Black results versus very similar White/Black results, and I think that keeping games sequentially pairwise color-reversed is important. I am not sure how it is done in the simulator, but if you manage to keep these pairwise separated White/Black outcomes, it would be great. I do not expect a small effect of, say, 1% fewer games; I would expect upwards of 20% fewer games for unbalanced.

Thanks and hoping for your Great Comeback :).
I managed to get some working code, just duplicating the LLR check condition inside the do loop. My usual SPRT program is more complete, but I removed many lines of code that were not needed for this experiment. Here is my code:

Code: Select all

program SPRT

implicit none 

character(len=*) :: F0, F1
parameter (F0 = '(I6,A,I6,A,I6,A)'); parameter (F1 = '(I6,A,I7)')
integer, parameter :: simulations = 1e5, max_rounds = 5e6
integer :: i, n, games, wins, draws, loses, number_of_games(1:simulations)
integer :: ongoing_number_of_passes(0:simulations), ongoing_number_of_fails(0:simulations)
real(KIND=2) :: P0_W, P0_D, P0_L, P1_W, P1_D, P1_L
real(KIND=2) :: LLR, bayeselo_0, bayeselo_1, drawelo, alpha, beta, lower_bound, upper_bound
real(KIND=2) :: avg_games(0:simulations), v1, v2, WStrong, BStrong, WWeak, BWeak, start, finish

start = cpu_clock@()

alpha = 5d-2; beta = 5d-2
lower_bound = log(beta/(1d0 - alpha))  ! By definition.
upper_bound = log((1d0 - beta)/alpha)  ! By definition.

bayeselo_0 = 0d0; bayeselo_1 = 3d1

WStrong = 3d-1; BStrong = 2.7d-1
WWeak = 1.5d-1; BWeak = 1.2d-1

avg_games(0) = 0d0  ! Initialization of this value.
ongoing_number_of_passes(0) = 0; ongoing_number_of_fails(0) = 0
do i = 1, simulations
  wins = 0; draws = 0; loses = 0
  do n = 1, max_rounds  ! From Strong POV.

    ! 'Play' games without computing the LLR until (wins > 0) & (draws > 0) & (loses > 0):
    do while (wins*draws*loses < 1)
      v1 = random@()  ! Strong (white) versus Weak (black).
      if ((v1 > BWeak) .and. (v1 < 1d0 - WStrong)) then
        draws = draws + 1
      else if (v1 <= BWeak) then
        loses = loses + 1
      else
        wins = wins + 1
      end if

      v2 = random@()  ! Weak (white) versus Strong (black).
      if ((v2 > WWeak) .and. (v2 < 1d0 - BStrong)) then
        draws = draws + 1
      else if (v2 <= WWeak) then
        loses = loses + 1
      else
        wins = wins + 1
      end if
    end do

    ! Compute LLR once (wins > 0) & (draws > 0) & (loses > 0) and compare it with the bounds:
    games = wins + draws + loses
    ! drawelo = 200*log[(1 - L)*(1 - W)/(L*W)]; this parameter is estimated in the next line of code.
    drawelo = 2d2*log10((1d0*games/loses - 1d0)*(1d0*games/wins - 1d0))  ! Estimate of drawelo from the simulation.
    P0_W = 1d0/(1d0 + 1d1**(2.5d-3*(drawelo - bayeselo_0)))
    P0_L = 1d0/(1d0 + 1d1**(2.5d-3*(drawelo + bayeselo_0)))
    P0_D = 1d0 - P0_W - P0_L
    P1_W = 1d0/(1d0 + 1d1**(2.5d-3*(drawelo - bayeselo_1)))
    P1_L = 1d0/(1d0 + 1d1**(2.5d-3*(drawelo + bayeselo_1)))
    P1_D = 1d0 - P1_W - P1_L
    LLR = wins*log(P1_W/P0_W) + draws*log(P1_D/P0_D) + loses*log(P1_L/P0_L)
    if (LLR < lower_bound) then
      ongoing_number_of_passes(i) = ongoing_number_of_passes(i-1)
      ongoing_number_of_fails(i) = ongoing_number_of_fails(i-1) + 1
      number_of_games(i) = games
      avg_games(i) = ((ongoing_number_of_passes(i) + ongoing_number_of_fails(i) - 1d0)*avg_games(i-1) + number_of_games(i)) &
      & /(ongoing_number_of_passes(i) + ongoing_number_of_fails(i) + 0d0)
      write(*,F0,advance='no') i, '/', simulations, '    Passes: ', ongoing_number_of_passes(i), '    Fails: '
      write(*,F1)  ongoing_number_of_fails(i), '    <Games>/simulation: ', nint(avg_games(i))
      exit
    else if (LLR > upper_bound) then
      ongoing_number_of_passes(i) = ongoing_number_of_passes(i-1) + 1
      ongoing_number_of_fails(i) = ongoing_number_of_fails(i-1)
      number_of_games(i) = games
      avg_games(i) = ((ongoing_number_of_passes(i) + ongoing_number_of_fails(i) - 1d0)*avg_games(i-1) + number_of_games(i)) &
      & /(ongoing_number_of_passes(i) + ongoing_number_of_fails(i) + 0d0)
      write(*,F0,advance='no') i, '/', simulations, '    Passes: ', ongoing_number_of_passes(i), '    Fails: '
      write(*,F1)  ongoing_number_of_fails(i), '    <Games>/simulation: ', nint(avg_games(i))
      exit
    end if

    ! 'Play' more games if (lower_bound < LLR < upper_bound):
    v1 = random@()  ! Strong (white) versus Weak (black).
    if ((v1 > BWeak) .and. (v1 < 1d0 - WStrong)) then
      draws = draws + 1
    else if (v1 <= BWeak) then
      loses = loses + 1
    else
      wins = wins + 1
    end if
    ! Compute LLR once (wins > 0) & (draws > 0) & (loses > 0) and compare it with the bounds:
    games = wins + draws + loses
    ! drawelo = 200*log[(1 - L)*(1 - W)/(L*W)]; this parameter is estimated in the next line of code.
    drawelo = 2d2*log10((1d0*games/loses - 1d0)*(1d0*games/wins - 1d0))  ! Estimate of drawelo from the simulation.
    P0_W = 1d0/(1d0 + 1d1**(2.5d-3*(drawelo - bayeselo_0)))
    P0_L = 1d0/(1d0 + 1d1**(2.5d-3*(drawelo + bayeselo_0)))
    P0_D = 1d0 - P0_W - P0_L
    P1_W = 1d0/(1d0 + 1d1**(2.5d-3*(drawelo - bayeselo_1)))
    P1_L = 1d0/(1d0 + 1d1**(2.5d-3*(drawelo + bayeselo_1)))
    P1_D = 1d0 - P1_W - P1_L
    LLR = wins*log(P1_W/P0_W) + draws*log(P1_D/P0_D) + loses*log(P1_L/P0_L)
    if (LLR < lower_bound) then
      ongoing_number_of_passes(i) = ongoing_number_of_passes(i-1)
      ongoing_number_of_fails(i) = ongoing_number_of_fails(i-1) + 1
      number_of_games(i) = games
      avg_games(i) = ((ongoing_number_of_passes(i) + ongoing_number_of_fails(i) - 1d0)*avg_games(i-1) + number_of_games(i)) &
      & /(ongoing_number_of_passes(i) + ongoing_number_of_fails(i) + 0d0)
      write(*,F0,advance='no') i, '/', simulations, '    Passes: ', ongoing_number_of_passes(i), '    Fails: '
      write(*,F1)  ongoing_number_of_fails(i), '    <Games>/simulation: ', nint(avg_games(i))
      exit
    else if (LLR > upper_bound) then
      ongoing_number_of_passes(i) = ongoing_number_of_passes(i-1) + 1
      ongoing_number_of_fails(i) = ongoing_number_of_fails(i-1)
      number_of_games(i) = games
      avg_games(i) = ((ongoing_number_of_passes(i) + ongoing_number_of_fails(i) - 1d0)*avg_games(i-1) + number_of_games(i)) &
      & /(ongoing_number_of_passes(i) + ongoing_number_of_fails(i) + 0d0)
      write(*,F0,advance='no') i, '/', simulations, '    Passes: ', ongoing_number_of_passes(i), '    Fails: '
      write(*,F1)  ongoing_number_of_fails(i), '    <Games>/simulation: ', nint(avg_games(i))
      exit
    end if

    ! 'Play' more games if (lower_bound < LLR < upper_bound):
    v2 = random@()  ! Weak (white) versus Strong (black).
    if ((v2 > WWeak) .and. (v2 < 1d0 - BStrong)) then
      draws = draws + 1
    else if (v2 <= WWeak) then
      loses = loses + 1
    else
      wins = wins + 1
    end if

  end do
end do

write(*,*) 

finish = cpu_clock@() 

write(*,'(A,F6.1,A)') 'Elapsed time: ', (finish - start)/3d9, ' seconds.'  ! 3d9 = 3 GHz of my CPU.

end program SPRT
This example is for balanced openings; the WStrong, BStrong, WWeak, BWeak variables must be changed for the unbalanced case. I got the following:

Code: Select all

With Kai's data.
SPRT(H0, H1) = SPRT(0, 30) (Bayeselo units).

Balanced openings:
100000/100000    Passes: 100000    Fails:      0    <Games>/simulation:     176
(Average running time was 172 when averaging white and black with Michel's tool).

Unbalanced openings:
100000/100000    Passes: 100000    Fails:      0    <Games>/simulation:     171
(Average running time was 170 when averaging white and black with Michel's tool).
I do not see big differences, assuming that my code is bug-free.
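For anyone who wants to sanity-check the LLR computation outside Fortran, here is the same trinomial update sketched in Python (mirroring the program above, with drawelo estimated from the counts exactly as in the code):

```python
from math import log, log10

def llr(wins, draws, loses, elo0, elo1):
    """LLR of SPRT(elo0, elo1) in bayeselo units; drawelo estimated from the counts."""
    games = wins + draws + loses
    drawelo = 200.0 * log10((games / loses - 1.0) * (games / wins - 1.0))
    def probs(elo):
        w = 1.0 / (1.0 + 10.0 ** ((drawelo - elo) / 400.0))
        l = 1.0 / (1.0 + 10.0 ** ((drawelo + elo) / 400.0))
        return w, 1.0 - w - l, l
    p0, p1 = probs(elo0), probs(elo1)
    return (wins * log(p1[0] / p0[0]) + draws * log(p1[1] / p0[1])
            + loses * log(p1[2] / p0[2]))

# A win-heavy sample drives the LLR up (toward accepting H1 = 30):
print(llr(57, 116, 27, 0.0, 30.0) > 0.0)   # True
# A 50% score drives it down (toward accepting H0 = 0):
print(llr(10, 10, 10, 0.0, 30.0) < 0.0)    # True
```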

Let me do an experiment between similar opponents and balanced/unbalanced openings. I made up the values of WStrong, BStrong, WWeak, BWeak:

Code: Select all

Balanced openings (50.5% for Strong):

Strong - Weak
+19% =64% -17%
(51% for Strong).

Weak - Strong
+17% =66% -17%
(50% for Strong).

--------------

Unbalanced openings (50.5% for Strong):

Strong - Weak
+51% =40% -9%
(71% for Strong).

Weak - Strong
+50% =40% -10%
(30% for Strong).
I ran SPRT(-1.5, 4.5) (Bayeselo units). Here are my results:

Code: Select all

Balanced openings:
10000/ 10000    Passes:   9890    Fails:    110    <Games>/simulation:   13861

Unbalanced openings:
10000/ 10000    Passes:   9726    Fails:    274    <Games>/simulation:   18360
I think that I have not messed up the initialization of the WStrong, BStrong, WWeak, BWeak variables, and that the piece of code where results are assigned on the basis of pseudorandom numbers is correct.

I get exactly the opposite result: more games for unbalanced openings, although the ratio avg_games(balanced)/avg_games(unbalanced) is, as you expected, not close to 1. There is also an appreciable difference in the pass/fail ratios. I do not know what to say.

Regards from Spain.

Ajedrecista.


Re: Type I error for p-value stopping: balanced and unbalanc

Post by Laskos » Wed Jun 22, 2016 4:46 am

Ajedrecista wrote:
[the full previous post, quoted: the pairwise simulator code and its results — about 176 vs. 171 games with Kai's data, and 13861 vs. 18360 games with the made-up equal-Elo values.]
Thank you very much for your time. So, if everything is OK, you get only a 3% reduction in the number of games for my case? About your made-up values: you equalized the Elo difference, putting both the balanced and unbalanced cases at 0.505 performance, but sigma is not the same in the two cases, because the draw rate is very different. In fact sigma is much larger in the unbalanced case (smaller draw ratio), so for equal (W-L) the significance is much smaller in the unbalanced case of your example; hence the larger number of games. What I am more curious about is equal significance level (W-L)/(W+L)^0.5, not equal Elo difference (W-L).
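Putting numbers on the sigma argument with the made-up values (color-averaged from the Strong POV; a rough sketch using the (W-L)/(W+L)^0.5 measure):

```python
from math import sqrt

# Color-averaged win/loss probabilities of the made-up experiment, Strong POV:
w_bal, l_bal = (0.19 + 0.17) / 2, (0.17 + 0.17) / 2   # balanced:   0.18 / 0.17
w_unb, l_unb = (0.51 + 0.10) / 2, (0.09 + 0.50) / 2   # unbalanced: 0.305 / 0.295

sig_bal = (w_bal - l_bal) / sqrt(w_bal + l_bal)   # ~0.0169 per game
sig_unb = (w_unb - l_unb) / sqrt(w_unb + l_unb)   # ~0.0129 per game
print(sig_bal, sig_unb)  # same Elo difference, but lower significance for unbalanced
```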

If this mere 3% decrease holds, I suspect I have a misunderstanding of SPRT as applied here. I understood it as controlling the Type I and II errors, stopping when they fall below the thresholds outside the interval (H0, H1). If in fact it stops more or less at a significance level, then why do it sequentially rather than as an instantaneous verdict at the end of the match?

To get a better feeling for it, can you compute with the initial simulator (not pairwise White/Black, just the simple one) the number of games for two sets of probabilities I concocted, which have the same significance level for the same number of games:

H0 = 0
H1 = 6 BayesElo
alpha, beta = 0.05

1/ W=0.1113 L=0.0887
2/ W=0.4226 L=0.3774

They have the same significance level for the same number of games (note: not the same Elo difference), but quite different error rates for accepting the hypotheses. If SPRT, as applied here, stops at a very similar number of games in both cases, then I guess the significance level is the most important factor for it.
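A quick check confirms the two pairs are built to have identical (W-L)/(W+L)^0.5 (the function name is mine):

```python
from math import sqrt

def significance(w, l):
    """Per-game significance measure (W - L)/(W + L)**0.5."""
    return (w - l) / sqrt(w + l)

print(significance(0.1113, 0.0887))  # case 1: ~0.0505
print(significance(0.4226, 0.3774))  # case 2: ~0.0505, identical by construction
```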
Thank you.


Re: Type I error for p-value stopping: balanced and unbalanc

Post by Laskos » Wed Jun 22, 2016 7:15 am

Laskos wrote:
[his previous post, quoted in full above.]
Also, with your new program (pairwise White/Black), for accepting the H0 = 0 hypothesis with two equal engines S1 and S2:

H0: 0
H1: 10 BayesElo
alpha, beta: 0.05

3/ Balanced:
WS1 = 0.20; BS1 = 0.16
WS2 = 0.20; BS2 = 0.16

4/ Unbalanced:
WS1 = 0.46; BS1 = 0.02
WS2 = 0.46; BS2 = 0.02

The unbalanced case has a much smaller Type I error, so I would expect:
a/ Fewer games for the unbalanced case to pass H0
b/ Fewer fails for the unbalanced case

Ajedrecista
Posts: 1398
Joined: Wed Jul 13, 2011 7:04 pm
Location: Madrid, Spain.

Type I error for p-value stopping: balanced and unbalanced.

Post by Ajedrecista » Wed Jun 22, 2016 9:37 am

Hello:
Laskos wrote:Thank you very much for your time. So, if everything is OK, you get only a 3% reduction in the number of games for my case? About your made-up values: you equalized the Elo difference, setting both the balanced and unbalanced cases at a 0.505 performance, but sigma is not the same in the two cases, because the draw rate is very different. In fact sigma is much larger for the unbalanced case (smaller draw ratio), so for an equal (Win - Loss) difference the significance is much smaller for the unbalanced case in your example, hence the larger number of games. What I am more curious about is equal significance level (W - L)/(W + L)^0.5, not equal Elo difference (W - L).

If this decrease of only 3% holds, I suspect I have a misunderstanding of SPRT as applied here. I understood it as controlling the Type I and II errors, stopping when they fall below the thresholds outside the interval (H0, H1). If in fact it stops more or less at a given significance level, then why do it sequentially and not as an instantaneous verdict at the end of the match?

To get a better feel for it, could you compute with the initial simulator (not pairwise White/Black, just the simple one) the number of games for probabilities I concocted, which have the same significance level for the same number of games:

H0 = 0
H1 = 6 BayesElo
alpha, beta = 0.05

1/ W=0.1113 L=0.0887
2/ W=0.4226 L=0.3774

They have the same significance level for the same number of games (the same significance level, mind you, not the same Elo difference), but quite different errors for accepting the hypotheses. If SPRT, as applied here, says that they stop at very similar numbers of games, then I guess the significance level is the most important factor for it.
Thank you.
Indeed, [(W - L)/sqrt(W + L)]_1 = [(W - L)/sqrt(W + L)]_2. I use Michel's script, so I must convert {W, L} into {Elo, drawelo}:

Code: Select all

Elo     = 400*log10[(1 + W - L)/(1 - W + L)]
drawelo = 200*log10[(1 - L)*(1 - W)/(L*W)]

1) W = 0.1113, L = 0.0887
Elo      ~   7.8534
drawelo  ~ 382.7996

2) W = 0.4226, L = 0.3774
Elo      ~ 15.7148
drawelo  ~ 70.5909
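For reference, these conversions are easy to reproduce; a minimal sketch using the two formulas above (per the sprta.py usage shown below, the resulting elo is the logistic one and drawelo is the BayesElo draw parameter):

```python
import math

def wl_to_elo_drawelo(w, l):
    """Convert win/loss rates {W, L} into {Elo, drawelo}.

    Elo is logistic Elo computed from the score (1 + W - L)/2;
    drawelo is the BayesElo draw parameter.
    """
    elo = 400.0 * math.log10((1.0 + w - l) / (1.0 - w + l))
    drawelo = 200.0 * math.log10((1.0 - l) * (1.0 - w) / (l * w))
    return elo, drawelo

case1 = wl_to_elo_drawelo(0.1113, 0.0887)  # ~ (7.8534, 382.7996)
case2 = wl_to_elo_drawelo(0.4226, 0.3774)  # ~ (15.7148, 70.5909)
```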

Code: Select all

Usage: sprta.py elo0 elo1 draw_elo elo
elo0,elo1 are expressed in BayesElo
elo is expressed in LogisticElo

1) SPRT(0, 6):

elo0     =     0.00
elo1     =     6.00
draw_elo =   382.80
elo      =     7.85
pass probability:      100.00%
avg running time:       4844

2) SPRT(0, 6):

elo0     =     0.00
elo1     =     6.00
draw_elo =    70.59
elo      =    15.71
pass probability:      100.0%
avg running time:       3847
I hope there are no typos. This time: (3847/4844 - 1)*100 ~ -20.58%.

------------------------
Laskos wrote:Also, with your new program (pairwise White/Black), for accepting the H0 = 0 hypothesis with two equal engines S1 and S2:

H0: 0
H1: 10 BayesElo
alpha, beta: 0.05

3/ Balanced:
WS1 = 0.20; BS1 = 0.16
WS2 = 0.20; BS2 = 0.16

4/ Unbalanced:
WS1 = 0.46; BS1 = 0.02
WS2 = 0.46; BS2 = 0.02

The unbalanced case has a much smaller Type I error, so I would expect:
a/ Fewer games for the unbalanced case to pass H0
b/ Fewer fails for the unbalanced case

Code: Select all

Balanced openings:
100000/100000    Passes:   4672    Fails:  95328    <Games>/simulation:    6970

Unbalanced openings:
100000/100000    Passes:    640    Fails:  99360    <Games>/simulation:    6599
I obtain more games and more passes for balanced openings. a) seems correct but not b).
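For readers who want to reproduce numbers like these, a non-pairwise trinomial SPRT run can be simulated in a few lines. This is my own illustrative sketch, not Ajedrecista's program: it assumes the standard BayesElo trinomial model and the usual Wald stopping bounds.

```python
import math
import random

def bayeselo_probs(x, drawelo):
    """Win/draw/loss probabilities for an Elo difference x in the BayesElo model."""
    p_win = 1.0 / (1.0 + 10.0 ** ((drawelo - x) / 400.0))
    p_loss = 1.0 / (1.0 + 10.0 ** ((drawelo + x) / 400.0))
    return p_win, 1.0 - p_win - p_loss, p_loss

def simulate_sprt(true_elo, elo0, elo1, drawelo, alpha=0.05, beta=0.05):
    """Simulate one SPRT(elo0, elo1) run; return (passed, games_played)."""
    lower = math.log(beta / (1.0 - alpha))   # ~ -2.944 for alpha = beta = 0.05
    upper = math.log((1.0 - beta) / alpha)   # ~ +2.944
    p_true = bayeselo_probs(true_elo, drawelo)
    p0 = bayeselo_probs(elo0, drawelo)
    p1 = bayeselo_probs(elo1, drawelo)
    # Per-outcome LLR increments ln(P(outcome | elo1) / P(outcome | elo0)).
    inc = [math.log(a / b) for a, b in zip(p1, p0)]
    llr, games = 0.0, 0
    while lower < llr < upper:
        u = random.random()
        outcome = 0 if u < p_true[0] else (1 if u < p_true[0] + p_true[1] else 2)
        llr += inc[outcome]
        games += 1
    return llr >= upper, games

random.seed(1)
runs = 300
results = [simulate_sprt(0.0, 0.0, 10.0, 200.0) for _ in range(runs)]
pass_rate = sum(p for p, _ in results) / runs
avg_games = sum(g for _, g in results) / runs
# With two equal engines (true_elo = 0 = elo0), pass_rate should be near alpha
```

The drawelo value of 200 is an arbitrary choice for illustration; a pairwise (White/Black paired) simulator would draw a pair of games per round with two sets of probabilities instead.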

Just a reminder: each time the LLR is updated with a {win, loss, draw}, it changes in this way:

LLR(1 win) > 0.
LLR(1 loss) < 0.
LLR(1 draw): it depends. How?

A run is tested under SPRT(elo0, elo1), where {elo0, elo1} are the {lower, upper} SPRT bounds, written in Bayeselo units. Its central point is (elo0 + elo1)/2 and the sign of LLR(1 draw) depends on the sign(central point) = sign[(elo0 + elo1)/2] = sign(elo0 + elo1):

Sign[LLR(1 draw)] = -sign(elo0 + elo1).

LLR[(1 draw) | (elo0 + elo1 > 0)] < 0 ==> a draw is penalized.
LLR[(1 draw) | (elo0 + elo1 = 0)] = 0 ==> a draw is indifferent (higher draw ratios tend to lengthen SPRT).
LLR[(1 draw) | (elo0 + elo1 < 0)] > 0 ==> a draw is rewarded.
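The sign rule for draws can be verified numerically; a quick sketch under the same BayesElo model (drawelo = 200 is an arbitrary choice for illustration):

```python
import math

def bayeselo_probs(x, drawelo):
    """Win/draw/loss probabilities for an Elo difference x in the BayesElo model."""
    p_win = 1.0 / (1.0 + 10.0 ** ((drawelo - x) / 400.0))
    p_loss = 1.0 / (1.0 + 10.0 ** ((drawelo + x) / 400.0))
    return p_win, 1.0 - p_win - p_loss, p_loss

def llr_one_draw(elo0, elo1, drawelo=200.0):
    """LLR contribution of a single draw: ln(P(draw | elo1) / P(draw | elo0))."""
    return math.log(bayeselo_probs(elo1, drawelo)[1] /
                    bayeselo_probs(elo0, drawelo)[1])

# Sign[LLR(1 draw)] = -sign(elo0 + elo1):
llr_pos_sum = llr_one_draw(0.0, 10.0)    # elo0 + elo1 > 0  ->  negative
llr_zero_sum = llr_one_draw(-5.0, 5.0)   # elo0 + elo1 = 0  ->  zero
llr_neg_sum = llr_one_draw(-10.0, 0.0)   # elo0 + elo1 < 0  ->  positive
```

The intuition: P(draw) peaks at an Elo difference of 0 and falls off symmetrically, so a draw pulls the LLR toward whichever hypothesis is closer to 0.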

------------------------

Lucas Braesch wrote a fast SPRT simulator (not pairwise):

SPRT simulator (GitHub).

Alpha = beta = 0.05 is hardcoded, but you can change whatever you want and recompile.

------------------------

Michel van den Bergh wrote a SPRT script (not pairwise) that returns theoretical results:

sprta.py (source code)
sprta.zip (with Windows executable)

Again, alpha = beta = 0.05 is hardcoded in the source.

------------------------

Both programs are very useful. I recommend them.

Regards from Spain.

Ajedrecista.

Laskos

Re: Type I error for p-value stopping: balanced and unbalanc

Post by Laskos » Wed Jun 22, 2016 1:44 pm

Ajedrecista wrote:
Both programs are very useful. I recommend them.

Regards from Spain.

Ajedrecista.
Hello, excellent, thanks for the links. In fact both a) and b) are as expected, because I called "fails" those runs which reject H0 (the opposite of the program's output labels). The effect on the number of games is less significant than I expected in cases 3) and 4), while the Type I error is significantly smaller for the unbalanced case (you can compute it as earlier in the thread). So maybe a 3% effect with unbalanced openings (instead of 20%) is a reasonable expectation for SPRT. SPRT seems to treat the upper bounds on the Type I and II errors outside the interval loosely: these 0.05 values are just upper bounds, and the real errors outside the interval are usually much smaller and varied. It seems not very different from, say, the 3-standard-deviation rule of thumb for the Type I error in chess matches.
