Usage sprt / cutechess-cli

Desperado · Post by **Desperado** » Sun Nov 16, 2014 7:21 pm

Hello,

can someone enlighten me on how to use the sprt command?

Taken from the description:

-sprt elo0=E0 elo1=E1 alpha=α beta=β
Use a Sequential Probability Ratio Test as a termination criterion for
the match.
This option should only be used in matches between two players to test if engine P1 is stronger than engine P2.
Hypothesis H1 is that P1 is stronger than P2 by at least E0 ELO points, and H0 (the null hypothesis) is that
 P1 is not stronger than P2 by at least E1 ELO points.
The maximum probabilities for type I and type II errors outside the interval [ E0, E1 ] are α and β.

The match is stopped if either H0 or H1 is accepted or if the maximum number of games set by -rounds and / or -games is reached.

My understanding of this is for example:

H1: Elo_P1 > Elo_P2 + 10 ( at least 10 elo better )
H0: Elo_P1 < Elo_P2 + 25 ( not more than 25 elo ?! )

-sprt elo0=10 elo1=25 alpha=? beta=?

So, what values and what format must be used for alpha and beta?
Do I even understand the usage of H0 correctly?

Thank you in advance.

Ajedrecista · Post by **Ajedrecista** » Sun Nov 16, 2014 8:00 pm

Hello Michael:

Desperado wrote: Hello,

can someone enlighten me on how to use the sprt command?

Taken from the description:
Code: Select all
-sprt elo0=E0 elo1=E1 alpha=α beta=β
Use a Sequential Probability Ratio Test as a termination criterion for
the match.
This option should only be used in matches between two players to test if engine P1 is stronger than engine P2.
Hypothesis H1 is that P1 is stronger than P2 by at least E0 ELO points, and H0 (the null hypothesis) is that
 P1 is not stronger than P2 by at least E1 ELO points.
The maximum probabilities for type I and type II errors outside the interval [ E0, E1 ] are α and β.

The match is stopped if either H0 or H1 is accepted or if the maximum number of games set by -rounds and / or -games is reached.
My understanding of this is for example:

H1: Elo_P1 > Elo_P2 + 10 ( at least 10 elo better )
H0: Elo_P1 < Elo_P2 + 25 ( not more than 25 elo ?! )

-sprt elo0=10 elo1=25 alpha=? beta=?

So, what values and what format must be used for alpha and beta?
Do I even understand the usage of H0 correctly?

Thank you in advance.

I am not an expert at all. The SF testing framework uses the same value for alpha and beta (5% = 0.05). I think that alpha is the proportion of false positives and beta is the proportion of false negatives, but I am not sure (it could be the other way around).

If you use alpha = beta = 0.05 (which seems reasonable) together with elo0 = 10 and elo1 = 25, then you have at most a 5% chance of rejecting a patch worth 25 Elo or more and at most a 5% chance of accepting a patch worth 10 Elo or less. If the patch is really worth (10 + 25)/2 = 17.5 Elo, then it has roughly equal chances of being accepted or rejected.

I would use:

Code: Select all

-sprt elo0=10 elo1=25 alpha=0.05 beta=0.05

I see that cutechess-cli can stop the test once a certain number of games is reached. I wrote an SPRT simulator in the past, but it works in Bayeselo units, the same units that the SF testing framework uses for its elo0 and elo1 bounds. I assumed a drawelo parameter of 240 (typical at STC 15" + 0.05"/move in the SF testing framework). Then:

Elo = [4X/(1 + X)²]*Bayeselo, where X = 10^(drawelo/400). I got 15.5807 Bayeselo for 10 Elo, 38.9516 Bayeselo for 25 Elo and 27.2661 Bayeselo for 17.5 Elo. With 100,000 simulations each time:

Code: Select all

alpha = 0.05; beta = 0.05
I used drawelo = 240, which could be understood as a probability of draw.

------------------------

SPRT(10, 25) with a patch of 10 Elo:

100000/100000    Passes:   4927    Fails:  95073    <Games>/sim

Shortest simulation:      77 games (simulation  70421).
Longest simulation:    12258 games (simulation  22606).

Average number of games per simulation:    1165
Median of the distribution:                 928

Type I errors  (false positives):   4.93 %
Type II errors (false negatives):   0.00 %

There are  72404 simulations with score > 50% that failed SPRT.
There are   1529 simulations with score = 50% that failed SPRT.

------------------------

SPRT(10, 25) with a patch of 17.5 Elo:

100000/100000    Passes:  50217    Fails:  49783    <Games>/sim

Shortest simulation:      82 games (simulation  83272).
Longest simulation:    21767 games (simulation  38001).

Average number of games per simulation:    1965
Median of the distribution:                1449

Type I errors  (false positives):   0.00 %
Type II errors (false negatives):  49.78 %

There are  43443 simulations with score > 50% that failed SPRT.
There are    443 simulations with score = 50% that failed SPRT.

------------------------

SPRT(10, 25) with a patch of 25 Elo:

100000/100000    Passes:  95071    Fails:   4929    <Games>/sim

Shortest simulation:      83 games (simulation  83537).
Longest simulation:    10785 games (simulation  26971).

Average number of games per simulation:    1159
Median of the distribution:                 921

Type I errors  (false positives):   0.00 %
Type II errors (false negatives):   4.93 %

There are   3738 simulations with score > 50% that failed SPRT.
There are     93 simulations with score = 50% that failed SPRT.

You can see that the accept/reject probabilities are very close to the theoretical values of 5%/95%, 50%/50% and 95%/5%, respectively. If you want to cap the number of games with these parameters, 15000 or 20000 is enough IMHO (very few SPRT runs will be left unfinished).
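The Elo to Bayeselo conversion quoted before the simulation output can be checked with a few lines of Python. This is only a sketch of the formula as written in the post (Elo = [4X/(1 + X)²]*Bayeselo with X = 10^(drawelo/400)); the function names are made up for illustration and are not taken from any testing framework.

```python
import math


def elo_scale(drawelo):
    """Factor 4X/(1+X)^2 with X = 10^(drawelo/400), so that Elo = scale * Bayeselo."""
    x = 10.0 ** (drawelo / 400.0)
    return 4.0 * x / (1.0 + x) ** 2


def elo_to_bayeselo(elo, drawelo):
    """Invert Elo = scale * Bayeselo for a fixed drawelo."""
    return elo / elo_scale(drawelo)


if __name__ == "__main__":
    # drawelo = 240 is the value assumed in the post.
    for elo in (10.0, 17.5, 25.0):
        print(f"{elo:5.1f} Elo -> {elo_to_bayeselo(elo, 240.0):.4f} Bayeselo")
```

With drawelo = 240 the scale factor is about 0.64, which is where the 15.5807, 27.2661 and 38.9516 Bayeselo figures come from.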

Regards from Spain.

Ajedrecista.

Desperado · Post by **Desperado** » Sun Nov 16, 2014 8:39 pm

Thank you.

I can follow the logic that accepting a patch with 5% chance to be lower
than 10 Elo.

But if the question is: Is P1 at least better by 10 Elo than P2 ?

Why should i care if a patch is rejected with 5% chance to be more than
25 Elo ? Let us say there will be a potential 35 Elo gain rejected, that
does not hurt as long the patch has 95% chance to have at least 10 Elo gain.

So, if alpha=0.05 ( accepting 5% < 10 Elo ) why isn't beta = 1.00.
I mean accepting any patch > 10 Elo.

Back to the question, which means in detail:
1. Is P1 > P2
2. At least the amount of x ( in case 10 ).

With that in mind, shouldn't the formula look like this (for the question above)?
-sprt elo0=10 elo1=10 alpha=0.05 beta=1.00

Ajedrecista · Post by **Ajedrecista** » Sun Nov 16, 2014 9:11 pm

Hello again:

Desperado wrote:Thank you.

I can follow the logic that accepting a patch with 5% chance to be lower
than 10 Elo.

But if the question is: Is P1 at least better by 10 Elo than P2 ?

Why should i care if a patch is rejected with 5% chance to be more than
25 Elo ? Let us say there will be a potential 35 Elo gain rejected, that
does not hurt as long the patch has 95% chance to have at least 10 Elo gain.

So, if alpha=0.05 ( accepting 5% < 10 Elo ) why isn't beta = 1.00.
I mean accepting any patch > 10 Elo.

Back to the question, which means in detail:
1. Is P1 > P2
2. At least the amount of x ( in case 10 ).

With that in mind, shouldn't the formula look like this (for the question above)?
-sprt elo0=10 elo1=10 alpha=0.05 beta=1.00

SPRT does not measure Elo. I mean, each patch has an Elo gain but we do not know it. I assume an Elo gain in order to run the simulations. The only fact is that SPRT is about the probabilities of accepting/rejecting patches once elo0, elo1, alpha and beta are given. You cannot compute an Elo gain from the result of an SPRT.

0 < alpha < 1 and 0 < beta < 1. SPRT uses the LLR (Log Likelihood Ratio), which has an upper bound = ln[(1 - beta)/alpha] and a lower bound = ln[beta/(1 - alpha)]. The upper bound for LLR is undefined with beta = 1.
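As a small illustration of the two bounds just quoted, here is a Python sketch that evaluates them. The function name `sprt_bounds` is hypothetical; only the two logarithm formulas come from the text above.

```python
import math


def sprt_bounds(alpha, beta):
    """Return (lower, upper) LLR bounds: ln[beta/(1-alpha)] and ln[(1-beta)/alpha]."""
    if not (0.0 < alpha < 1.0 and 0.0 < beta < 1.0):
        # beta = 1 would require ln(0) for the upper bound, which is undefined.
        raise ValueError("alpha and beta must lie strictly between 0 and 1")
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    return lower, upper


if __name__ == "__main__":
    for a in (0.05, 0.01):
        lower, upper = sprt_bounds(a, a)
        print(f"alpha = beta = {a}: lower = {lower:.3f}, upper = {upper:.3f}")
```

For alpha = beta = 0.05 the bounds are ±ln(19), about ±2.944; for alpha = beta = 0.01 they are ±ln(99), about ±4.595. Calling `sprt_bounds(0.05, 1.0)` raises an error, which mirrors the point that beta = 1.00 is not a usable setting.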

You can read more about SPRT in the following links:

Re: Type I error in LOS based early stopping rule.

Re: Stats and bench on Stockfish development site.

Re: Changes in Andscacs 0.70.

These three posts were written by myself, so I have surely missed things.

You should care about rejecting a patch of +25 Elo because you would miss a great patch. There are unlucky runs in which a patch starts badly but turns out good in the end (and vice versa). The SPRT stop triggered by the LLR criterion can arrive before the bad luck ends. The way to accept fewer bad patches and reject fewer good patches is to decrease both alpha and beta, but you can expect longer runs on average because the absolute values of the upper and lower bounds increase. I write this from intuition, without checking it with simulations.
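The intuition about longer runs can be probed with a rough Monte Carlo sketch in Python. This is not the simulator mentioned earlier in the thread. It is a simplified Wald SPRT that assumes the drawelo is known (240) and accumulates one fixed LLR increment per game. It reuses the Bayeselo bounds 15.5807 and 38.9516 quoted above, and all function names are made up for illustration.

```python
import math
import random


def bayeselo_probs(bayeselo, drawelo):
    """(win, draw, loss) probabilities under the Bayeselo model."""
    p_win = 1.0 / (1.0 + 10.0 ** ((drawelo - bayeselo) / 400.0))
    p_loss = 1.0 / (1.0 + 10.0 ** ((drawelo + bayeselo) / 400.0))
    return p_win, 1.0 - p_win - p_loss, p_loss


def run_sprt(true_be, be0, be1, drawelo, alpha, beta, rng, max_games=200000):
    """Simulate one match; return (passed, games). passed is None if unfinished."""
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    p_win, p_draw, _ = bayeselo_probs(true_be, drawelo)
    p0 = bayeselo_probs(be0, drawelo)
    p1 = bayeselo_probs(be1, drawelo)
    # LLR increment contributed by a win, a draw and a loss, in that order.
    steps = [math.log(q1 / q0) for q0, q1 in zip(p0, p1)]
    llr = 0.0
    for games in range(1, max_games + 1):
        r = rng.random()
        outcome = 0 if r < p_win else (1 if r < p_win + p_draw else 2)
        llr += steps[outcome]
        if llr >= upper:
            return True, games
        if llr <= lower:
            return False, games
    return None, max_games


def summarize(alpha, beta, sims=200, seed=1):
    """Pass rate and mean match length for a patch worth exactly elo0."""
    rng = random.Random(seed)
    be0, be1, drawelo = 15.5807, 38.9516, 240.0  # values quoted in the thread
    results = [run_sprt(be0, be0, be1, drawelo, alpha, beta, rng) for _ in range(sims)]
    pass_rate = sum(1 for passed, _ in results if passed) / sims
    mean_games = sum(games for _, games in results) / sims
    return pass_rate, mean_games


if __name__ == "__main__":
    for a in (0.05, 0.01):
        rate, games = summarize(a, a)
        print(f"alpha = beta = {a}: pass rate {rate:.3f}, mean games {games:.0f}")
```

With only 200 simulated matches per setting the numbers are noisy, so treat the output as a plausibility check rather than a measurement. Raising `sims` gives steadier averages at the cost of run time.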

Regards from Spain.

Ajedrecista.

bob · Post by **bob** » Sun Nov 16, 2014 10:57 pm

Desperado wrote:Thank you.

I can follow the logic that accepting a patch with 5% chance to be lower
than 10 Elo.

But if the question is: Is P1 at least better by 10 Elo than P2 ?

Why should i care if a patch is rejected with 5% chance to be more than
25 Elo ? Let us say there will be a potential 35 Elo gain rejected, that
does not hurt as long the patch has 95% chance to have at least 10 Elo gain.

So, if alpha=0.05 ( accepting 5% < 10 Elo ) why isn't beta = 1.00.
I mean accepting any patch > 10 Elo.

Back to the question, which means in detail:
1. Is P1 > P2
2. At least the amount of x ( in case 10 ).

With that in mind, shouldn't the formula look like this (for the question above)?
-sprt elo0=10 elo1=10 alpha=0.05 beta=1.00

When you use 10, 25 for the elo values, you will reject if the Elo gain is below +10, which means small improvements get rejected. You accept if the Elo goes beyond +25, which is a big improvement. For Elo between 10 and 25, you continue playing. You will accept good changes 95% of the time, and reject a +25 Elo change 5% of the time. On the other end, you will reject bad changes (< 10 Elo, which is not really "bad" IMO, but that is a different topic) 95% of the time, but 5% of the time you will not. You can tighten those alpha/beta values to reduce the error, but at the cost of playing far more games.

Desperado · Post by **Desperado** » Tue Nov 18, 2014 7:24 am

Thank you. That makes it clear now. It immediately suggests the idea of staged testing, using first less severe and then more severe restrictions.

Desperado · Post by **Desperado** » Thu Sep 03, 2015 9:00 pm

Hello,

some time ago I started this thread because it was unclear to me how to use the sprt feature of cutechess-cli. Although I thought it was finally clear, I stopped thinking about it, because version 0_6 was crashing and I continued using version 0.5.1.

Now, having the new version at hand, I started a simple match and found out that I am still confused. Maybe it is because of a long day at work, or maybe I simply do not understand it. However, something is going completely wrong in my mind and perhaps someone is able to enlighten me.

So, some data:

Code: Select all

@echo off

::Engines
Set eng=%eng% -engine conf=Omen0003
Set eng=%eng% -engine conf=Omen0002

:: Common Config
Set eng=%eng% -each

::Working Folder
Set eng=%eng% dir="C:\_CHESS\CutechessEngine"

::Protocol
Set eng=%eng% proto=uci
Set eng=%eng% restart=off

::Time Control
Set eng=%eng% tc=10000.0 timemargin=200 nodes=10000


::Tournament Type
Set eng=%eng% -games 35000 -tournament round-robin

::sprt
Set eng=%eng% -sprt elo0=0 elo1=10 alpha=0.01 beta=0.01

::Opening Book cutechess-cli 0.7x
Set eng=%eng% -openings file="C:\\_CHESS\\PGN\\AH6_37643.pgn" format=pgn order=sequential

::Pgn Storage
Set eng=%eng% -pgnout "result.pgn"

::Options
Set eng=%eng% -concurrency 4
Set eng=%eng% -ratinginterval 10
Set eng=%eng% -recover
Set eng=%eng% -repeat

::Command
echo %eng%
cutechess-cli_0.7.1_win32 %eng%

SPRT(extracted)

Code: Select all

::sprt
Set eng=%eng% -sprt elo0=0 elo1=10 alpha=0.01 beta=0.01

The match stopped with the following results:

...
Score of Omen0003 vs Omen0002: 2139 - 1940 - 4316 [0.512] 8395
ELO difference: 8
SPRT: llr 4.64, lbound -4.6, ubound 4.6 - H1 was accepted
Finished match
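For readers who want to see where an llr figure like the one above can come from, here is a Python sketch that computes a trinomial log-likelihood ratio from the win/loss/draw counts under the Bayeselo model. Whether cutechess-cli computes its llr in exactly this way is an assumption on my part, and the function name `llr_from_result` is invented for the example. The drawelo is estimated from the observed result, and elo0/elo1 are converted to Bayeselo with the same 4X/(1 + X)² factor used earlier in the thread.

```python
import math


def llr_from_result(wins, losses, draws, elo0, elo1):
    """LLR of elo1 versus elo0 for a W/L/D result (Bayeselo model, assumed)."""
    n = wins + losses + draws
    p_win, p_loss = wins / n, losses / n
    # Draw parameter estimated from the observed win and loss rates.
    drawelo = 200.0 * math.log10((1.0 - p_loss) / p_loss * (1.0 - p_win) / p_win)
    x = 10.0 ** (-drawelo / 400.0)
    scale = 4.0 * x / (1.0 + x) ** 2  # Elo = scale * Bayeselo

    def probs(elo):
        """(win, loss, draw) probabilities for a given Elo difference."""
        be = elo / scale
        pw = 1.0 / (1.0 + 10.0 ** ((drawelo - be) / 400.0))
        pl = 1.0 / (1.0 + 10.0 ** ((drawelo + be) / 400.0))
        return pw, pl, 1.0 - pw - pl

    w0, l0, d0 = probs(elo0)
    w1, l1, d1 = probs(elo1)
    return (wins * math.log(w1 / w0)
            + losses * math.log(l1 / l0)
            + draws * math.log(d1 / d0))


if __name__ == "__main__":
    llr = llr_from_result(2139, 1940, 4316, elo0=0.0, elo1=10.0)
    upper = math.log((1.0 - 0.01) / 0.01)
    print(f"llr = {llr:.2f}, ubound = {upper:.2f}")
```

For the score 2139 - 1940 - 4316 with elo0=0 and elo1=10 this lands close to the reported llr of 4.64, just above ubound = ln(99), about 4.6. The match ends as soon as the llr crosses one of the two bounds; the estimated Elo difference does not have to leave the (elo0, elo1) range for that to happen.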

1. Hypothesis H1 is that P1 is stronger than P2 by at least E0 ELO points
2. H0 (the null hypothesis) is that P1 is not stronger than P2 by at least E1 ELO points

So, I can follow that the result tells me with a probability of 99% (alpha=0.01) that P1 (Omen0003) is stronger by at least elo_0=0 points.

But why did the test stop? P1 may still be stronger by elo_1=10 points.
My expectation was that the test continues while the result is in the range
(0,10). I did not calculate by hand if 10 Elo is still possible, but I guess it is.

Maybe it is too late for me today and i'm tired, but i am really impatient to understand and to use this feature.

thx for any feedback...

Desperado · Post by **Desperado** » Thu Sep 03, 2015 9:23 pm

Ok, the main reason is

"The match is stopped if either H0 or H1 is accepted"

Desperado · Post by **Desperado** » Thu Sep 03, 2015 11:04 pm

Ok, it's time to go to bed, but i guess i solved my confusion.

-sprt elo0=E0 elo1=E1 alpha=α beta=β

Code: Select all

-sprt elo0=E0 elo1=E1 alpha=α beta=β
Use a Sequential Probability Ratio Test as a termination criterion for the match.
This option should only be used in matches between two players to test if engine P1 is stronger than engine P2. Hypothesis H1 is that P1 is stronger than P2 by at least E0 ELO points, and H0 (the null hypothesis) is that P1 is not stronger than P2 by at least E1 ELO points. The maximum probabilities for type I and type II errors outside the interval [ E0, E1 ] are α and β.
The match is stopped if either H0 or H1 is accepted or if the maximum number of games set by -rounds and / or -games is reached.

I think that the order (elo0=E0 elo1=E1 in the syntax and description given above) was confusing me. If I want to check
an "interval" [x,y] where x < y, the correct usage is [E1,E0] (so just reversed relative to the given syntax).

example:

E0 = 10 ( P1 is at least E0 points better,H1)
E1 = 5 ( P1 is not better than E1 points,H0)
iv = [5=E1,10=E0]

That would mean that the test would stop:

a. if it is "certain"(including the probabilities) that P1 is not stronger than P2 by at least E1 ELO points, in case "5" [E1,E0]

or

b. if it is "certain"(according to the probabilities) that P1 is stronger than P2 by at least E0 ELO points, in case "10" [E1,E0]

Now, if an "early stop" occurs in case "a", the patch will of course be rejected.
In case "b", although the test stopped early, the patch can be accepted with a clear conscience.
Otherwise the test will continue as long as the result may lie within the defined range, (5,10) in the given example.

So, I simply need to use it this way (in terms of an interval [5,10]):
-sprt elo0=10 elo1=5 alpha=α beta=β

Crazy stuff

Now, if this is all nonsense, i do not care about today anymore. Good night everybody. Enjoy Talkchess...

Desperado · Post by **Desperado** » Fri Sep 04, 2015 7:53 pm

Hello again,

a final attempt to get an explanation, but IMHO there is something going wrong!

Setup1:
======

games: 35000 (max)
formula: -sprt elo0=20 elo1=0 alpha=0.01 beta=0.01

Code: Select all

Finished game 5773 (Omen0003 vs Omen0002): * {No result}
Score of Omen0003 vs Omen0002: 1477 - 1351 - 2942  [0.511] 5770
ELO difference: 8
SPRT: llr 4.71, lbound -4.6, ubound 4.6 - H1 was accepted
Finished match

The simple answer is that P1 is not stronger by at least 20 Elo,
so the test needs to continue!

Setup2:
=====

games: 35000 (max)
Set eng=%eng% -sprt elo0=0 elo1=10 alpha=0.01 beta=0.01

Code: Select all

Score of Omen0003 vs Omen0002: 1477 - 1351 - 2942  [0.511] 5770
ELO difference: 8
SPRT: llr -4.71, lbound -4.6, ubound 4.6 - H0 was accepted
Finished match

Again, the test should continue! Being stronger by at least 10 Elo (>= 10) is still possible.

Summary:
=======

Maybe there is something mixed up/incorrect in the description.
But even without understanding the maths, I do understand "is at least stronger than" + "not stronger than at least by"
(with respect to the given uncertainties), and further that these are the requirements to stop the test!!!

So, this is simply wrong.

Come on, please tell me I am missing something essential, and please do not tell me that everybody is just happy about a "randomly" shortened test.
