I started to run some tests again and just to do something different I decided to use the 2-SPRT instead of the SPRT. Recall that the 2-SPRT optimizes the worst case behaviour of a sequential test which is when the real elo is precisely in the middle of the upper and lower elo bounds. Also unlike the SPRT the 2-SPRT has an absolute upper bound on its running time.
Here is the script I wrote some time ago to calculate the 2-SPRT parameters.
http://hardy.uhasselt.be/Toga/sprt2.py
To validate the results the script can do simulation and it can even compute the operating characteristic (OC) and average sample number (ASN) of the test exactly!!
Here are some notes on the formulas
http://hardy.uhasselt.be/Toga/support_whitehead.pdf
These are mainly notes to myself, so they are neither very rigorous nor very readable. You can see that the formulas used by the script are based on approximations but these hold to a high degree of accuracy for small elo differences.
2-SPRT
Moderators: hgm, Rebel, chrisw
-
- Posts: 1971
- Joined: Wed Jul 13, 2011 9:04 pm
- Location: Madrid, Spain.
Re: 2-SPRT.
Hello Michel:
I see that the point is to minimize the expected length of a test whose patch has an Bayeselo gain of e = (theta_0 + theta_1)/2 using the notation of the paper. I hope 2-SPRT does not harm the average run of other cases such as e = theta_0.
Could you post some examples of the saving of the number of games, please? I remember that it was posted in SF Google Group and I recall a number of 10% less games for the case of e = (theta_0 + theta_1)/2. Is it right? Here are two links that I found:
https://groups.google.com/forum/#!msg/f ... jQl8P_jAsJ
https://groups.google.com/forum/#!msg/f ... 4hXAj7t2IJ
For example, in a normal SPRT(-1.5, 4.5) with e = 1.5, d = 240, alpha = 0.05 = beta and 20000 simulations (446.06 seconds with a speed of 1305193 games/second), I get an average run of circa (29110 ± 326) games with 95% confidence, a median run of 22178 games, a shortest run of 1338 games (+211 -314 =813, fail), a longest run of 242686 games (+48992 -48427 =145267, fail), 10117 passes and 9883 fails (near to the expected 50% - 50%), and circa 99.39% of finished simulations with less than 128000 games (useful info for Fishtest). These values match with yours? If that, in this case I expect something like 26200 games on average using 2-SPRT.
Thanks in advance.
------------------------
On other side, I made a correction here. I think I am right... but I am very, very far from perfection. Any confirmation is welcome.
Regards from Spain.
Ajedrecista.
The paper is very interesting, thanks for share it. I guess that the script is also very valuable, but since I do not have Python and do not know how to run a script (my bad! :O), I can not get info from it.Michel wrote:I started to run some tests again and just to do something different I decided to use the 2-SPRT instead of the SPRT. Recall that the 2-SPRT optimizes the worst case behaviour of a sequential test which is when the real elo is precisely in the middle of the upper and lower elo bounds. Also unlike the SPRT the 2-SPRT has an absolute upper bound on its running time.
Here is the script I wrote some time ago to calculate the 2-SPRT parameters.
http://hardy.uhasselt.be/Toga/sprt2.py
To validate the results the script can do simulation and it can even compute the operating characteristic (OC) and average sample number (ASN) of the test exactly!!
Here are some notes on the formulas
http://hardy.uhasselt.be/Toga/support_whitehead.pdf
These are mainly notes to myself, so they are neither very rigorous nor very readable. You can see that the formulas used by the script are based on approximations but these hold to a high degree of accuracy for small elo differences.
I see that the point is to minimize the expected length of a test whose patch has an Bayeselo gain of e = (theta_0 + theta_1)/2 using the notation of the paper. I hope 2-SPRT does not harm the average run of other cases such as e = theta_0.
Could you post some examples of the saving of the number of games, please? I remember that it was posted in SF Google Group and I recall a number of 10% less games for the case of e = (theta_0 + theta_1)/2. Is it right? Here are two links that I found:
https://groups.google.com/forum/#!msg/f ... jQl8P_jAsJ
https://groups.google.com/forum/#!msg/f ... 4hXAj7t2IJ
For example, in a normal SPRT(-1.5, 4.5) with e = 1.5, d = 240, alpha = 0.05 = beta and 20000 simulations (446.06 seconds with a speed of 1305193 games/second), I get an average run of circa (29110 ± 326) games with 95% confidence, a median run of 22178 games, a shortest run of 1338 games (+211 -314 =813, fail), a longest run of 242686 games (+48992 -48427 =145267, fail), 10117 passes and 9883 fails (near to the expected 50% - 50%), and circa 99.39% of finished simulations with less than 128000 games (useful info for Fishtest). These values match with yours? If that, in this case I expect something like 26200 games on average using 2-SPRT.
Thanks in advance.
------------------------
On other side, I made a correction here. I think I am right... but I am very, very far from perfection. Any confirmation is welcome.
Regards from Spain.
Ajedrecista.
-
- Posts: 2272
- Joined: Mon Sep 29, 2008 1:50 am
Re: 2-SPRT.
Yes it hurts a bit at theta0 and theta1. It has to because the SPRT is precisely optimal at these values. The choice between the SPRT and 2-SPRT is mainly psychological. For me improving the worst case behaviour is important since testing basically freezes my workflow.
Here is some sample output
Here is some sample output
Code: Select all
$ python sprt2.py --expected
Test parameters:
alpha=0.05
beta=0.05
elo0=-1.50
elo1=4.50
draw_elo=240
Continuation region:
1.00277097*N-83.412576 < S < 1.00000000*N+83.412576
Worst case expected running time: 25363
Maximal running time: 60205
Approximate expected running time:
Elo SPRT SPRT2 Hoeffding
===== ===== ===== =========
-4.50 9570 12052
-3.90 10587 13078
-3.30 11815 14270
-2.70 13309 15649
-2.10 15129 17219
-1.50(H0) 17322 18953
-0.90 19891 20768
-0.30 22714 22516
0.30 25459 24000
0.90 27544 25011
1.50 28336 25363 24448
2.10 27544 25011
2.70 25458 24000
3.30 22714 22516
3.90 19890 20767
4.50(H1) 17322 18952
5.10 15128 17219
5.70 13309 15649
6.30 11814 14270
6.90 10586 13078
7.50 9570 12052
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.
Without ideas there is nothing to simplify.
-
- Posts: 2272
- Joined: Mon Sep 29, 2008 1:50 am
Re: 2-SPRT.
The best sequential test is actually the "Schwarz" test.
http://hardy.uhasselt.be/Toga/ref_schwarz.pdf
Here is an old .pdf I found
http://hardy.uhasselt.be/Toga/schwarz_results.pdf
One can see that the Schwarz test is almost equivalent to the SPRT at elo0,elo1 but the worst case behaviour is substantially better (similar to the 2-SPRT).
The results in the above .pdf were established by simulation so they took a long time to generate. Meanwhile I know how to compute the parameters of a Schwarz test theoretically and I have a script for that as well. Since I now plan to switch to the Schwarz test for my own testing, I will clean up the script and post it here.
http://hardy.uhasselt.be/Toga/ref_schwarz.pdf
Here is an old .pdf I found
http://hardy.uhasselt.be/Toga/schwarz_results.pdf
One can see that the Schwarz test is almost equivalent to the SPRT at elo0,elo1 but the worst case behaviour is substantially better (similar to the 2-SPRT).
The results in the above .pdf were established by simulation so they took a long time to generate. Meanwhile I know how to compute the parameters of a Schwarz test theoretically and I have a script for that as well. Since I now plan to switch to the Schwarz test for my own testing, I will clean up the script and post it here.
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.
Without ideas there is nothing to simplify.
-
- Posts: 1971
- Joined: Wed Jul 13, 2011 9:04 pm
- Location: Madrid, Spain.
Comparison of our SPRT results.
Hello Michel:
In my case: power(-7.5) > power(-7)... well, bad luck! I see that in general my average number of games is slightly higher than yours... I suppose that we choosed the same value for drawelo parameter. The opposite for the values of power: yours are higher in general.
The sum of my elapsed times is 2283.11 seconds = 0:38:03.11 if I am right. Of course I spent extra time in changing the Bayeselo value each time in the input notepad, but less than an hour for 160000 simulations in SPRT(-3, 3) is not bad in such an old computer IMHO. I do not know the number of simulations you ran, but I would say that we agree in our results in general terms.
Your PDF is trimmed at the right margin (the last numbers are not visible). For 0 Bayeselo, the four values are: 28431.7, 0.4996; 25349.3, 0.4993. Just as a side note: my results for average number of games are rounded to the closest integers.
Good luck with your Schwarz tests! I stay tuned.
Regards from Spain.
Ajedrecista.
Just to check my SPRT simulator, I ran it in my Intel Pentium D930 of year 2006. The programme is 32-bit and single core. I understand the SPRT power as the probability to accept the patch. Here are my results:Michel wrote:The results in the above .pdf were established by simulation so they took a long time to generate.
Code: Select all
alpha: 0.0500
beta: 0.0500
drawelo_fixed: 240.0000
bayeselo_0: -3.0000
bayeselo_1: 3.0000
Simulations: 10000 (each time)
Code: Select all
Comparison of results (M = Michel's; J = Jesús'):
Bayeselo <Games (M)> <Games (J)> Power(M) Power(J) Elapse time (J)
======== =========== =========== ======== ======== ===============
-7.5 7709.9 7906 0.0006 0.0008 79.55 sec.
-7 8276.9 8470 0.0012 0.0005 83.23 sec.
-6.5 8894.5 9101 0.0017 0.0011 88.23 sec.
-6 9604.4 9840 0.0028 0.0012 93.72 sec.
-5.5 10440.0 10619 0.0045 0.0035 102.09 sec.
-5 11426.9 11617 0.0074 0.0060 104.10 sec.
-4.5 12587.8 12787 0.0115 0.0091 112.02 sec.
-4 13926.1 14122 0.0193 0.0194 121.12 sec.
-3.5 15532.2 15755 0.0312 0.0294 132.37 sec.
-3 17409.8 17593 0.0483 0.0476 149.39 sec.
-2.5 19561.5 19713 0.0798 0.0771 162.78 sec.
-2 21918.5 22128 0.1240 0.1210 179.91 sec.
-1.5 24377.8 24903 0.1842 0.1790 196.77 sec.
-1 26456.1 26916 0.2729 0.2687 214.55 sec.
-0.5 27987.2 28501 0.3799 0.3791 228.82 sec.
0 28431.7 28899 0.4996 0.5010 234.46 sec.
The sum of my elapsed times is 2283.11 seconds = 0:38:03.11 if I am right. Of course I spent extra time in changing the Bayeselo value each time in the input notepad, but less than an hour for 160000 simulations in SPRT(-3, 3) is not bad in such an old computer IMHO. I do not know the number of simulations you ran, but I would say that we agree in our results in general terms.
Your PDF is trimmed at the right margin (the last numbers are not visible). For 0 Bayeselo, the four values are: 28431.7, 0.4996; 25349.3, 0.4993. Just as a side note: my results for average number of games are rounded to the closest integers.
Good luck with your Schwarz tests! I stay tuned.
Regards from Spain.
Ajedrecista.
-
- Posts: 2272
- Joined: Mon Sep 29, 2008 1:50 am
Re: Comparison of our SPRT results.
Well I don't remember how I generated the SPRT results... This is just an old .pdf I found from the time I was learning about sequential testing.In my case: power(-7.5) > power(-7)... well, bad luck! I see that in general my average number of games is slightly higher than yours... I suppose that we choosed the same value for drawelo parameter. The opposite for the values of power: yours are higher in general.
Sequential testing involves really beautiful math. It is in a sense a solved problem as for a given prior there is a way to generate an optimal solution (using dynamic programming) that minimizes the average duration of the test for that prior, given overall type 1 and type 2 error bounds. Asymptotically things improve and the optimal solution depends only on the support of the prior (places where it is non-zero). The SPRT, 2-SPRT and the Schwarz tests are all asymptotic solutions of this type.
Ah. It is not really trimmed, but really stops at 0 (since the table is symmetric the values >0 can be derived from those <0). But I agree the layout could be much better.Your PDF is trimmed at the right margin (the last numbers are not visible). For 0 Bayeselo, the four values are: 28431.7, 0.4996; 25349.3, 0.4993. Just as a side note: my results for average number of games are rounded to the closest integers.
I am adding a section to my document on the 2-SPRT about the Schwarz test. To simplify things I am again using Taylor series approximations to the likelihood functions. After that I will implement the formulas in a script.Good luck with your Schwarz tests! I stay tuned.
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.
Without ideas there is nothing to simplify.
-
- Posts: 2272
- Joined: Mon Sep 29, 2008 1:50 am
Re: Comparison of our SPRT results.
Addendum to avoid potential confusion: by a lucky mathematical accident the optimality properties of the SPRT are not just asymptotic. They hold always!The SPRT, 2-SPRT and the Schwarz tests are all asymptotic solutions of this type.
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.
Without ideas there is nothing to simplify.
-
- Posts: 1971
- Joined: Wed Jul 13, 2011 9:04 pm
- Location: Madrid, Spain.
Re: Comparison of our SPRT results.
Hello again:
I agree: the layout could be better just writing the value of drawelo for this example (probably 240) and fixing the issue of the right margin, as well as the values of alpha and beta (I know that they could be derived easily with power(elo0) if we suppose that alpha = beta, and it looks like alpha = 1/20 = beta this time) and the number of simulations... but if this info is quite old and you do not remember the details, there is nothing to be done. Otherwise it is perfect.
Once again, good luck with your new tests.
Regards from Spain.
Ajedrecista.
I was not talking about the lack of values for Bayeselo = {0.5, 1, ...}. Since elo0 + elo1 = 0, I already understood that you said that the values for elo_i are the same (in number of games) to the values of -elo_i, while power(elo_i) = 1 - power(-elo_i) in this special case of elo0 = -elo1. What I wanted to say is that for 0 Bayeselo results I only see {28431, 0.499; 25349, 0.499} instead of {28431.7, 0.4996; 25349.3, 0.4993}, that is, the last number of each result is missing due to the margin, but I managed to get the full results just copying and pasting the full text of the PDF into a Notepad.Michel wrote:Ah. It is not really trimmed, but really stops at 0 (since the table is symmetric the values >0 can be derived from those <0). But I agree the layout could be much better.Ajedrecista wrote:Your PDF is trimmed at the right margin (the last numbers are not visible). For 0 Bayeselo, the four values are: 28431.7, 0.4996; 25349.3, 0.4993. Just as a side note: my results for average number of games are rounded to the closest integers.
I agree: the layout could be better just writing the value of drawelo for this example (probably 240) and fixing the issue of the right margin, as well as the values of alpha and beta (I know that they could be derived easily with power(elo0) if we suppose that alpha = beta, and it looks like alpha = 1/20 = beta this time) and the number of simulations... but if this info is quite old and you do not remember the details, there is nothing to be done. Otherwise it is perfect.
Once again, good luck with your new tests.
Regards from Spain.
Ajedrecista.