Getting SPRT right

brtzsnr · Post by **brtzsnr** » Thu Apr 23, 2015 12:02 am

I was working to add SPRT to my evaluation framework and noticed a strange difference between how LLR is computed in cutechess-cli and fishtest. I'll paste the relevant piece of code:

In cutechess-cli: https://github.com/cutechess/cutechess/ ... c/sprt.cpp

Code: Select all

	// Probability laws under H0 and H1
	const double s = b.scale();
	const BayesElo b0(m_elo0 / s, b.drawElo());
	const BayesElo b1(m_elo1 / s, b.drawElo());
	const SprtProbability p0(b0), p1(b1);

In fishtest https://raw.githubusercontent.com/glins ... at_util.py

Code: Select all

  # Probability laws under H0 and H1
  P0 = bayeselo_to_proba(elo0, drawelo)
  P1 = bayeselo_to_proba(elo1, drawelo)

So fishtest omits the scaling step, i.e. it uses a fixed draw elo. Indeed the results are different. For

print SPRT({'wins': 716, 'losses': 591, 'draws': 2163}, 0, 0.05, 6, 0.05, 200)

fishtest prints LLR as 2.9948445563125237
while cutechess prints LLR as 4.373536

Which one is correct? Does cutechess's test allow one to run fewer tests?

Laskos · Post by **Laskos** » Thu Apr 23, 2015 12:26 am

brtzsnr wrote:I was working to add SPRT to my evaluation framework and noticed a strange difference between how LLR is computed in cutechess-cli and fishtest. I'll paste the relevant piece of code:

In cutechess-cli: https://github.com/cutechess/cutechess/ ... c/sprt.cpp
Code: Select all
	// Probability laws under H0 and H1
	const double s = b.scale();
	const BayesElo b0(m_elo0 / s, b.drawElo());
	const BayesElo b1(m_elo1 / s, b.drawElo());
	const SprtProbability p0(b0), p1(b1);
In fishtest https://raw.githubusercontent.com/glins ... at_util.py
Code: Select all
  # Probability laws under H0 and H1
  P0 = bayeselo_to_proba(elo0, drawelo)
  P1 = bayeselo_to_proba(elo1, drawelo)
So fishtest omits the scaling step, i.e. it uses a fixed draw elo. Indeed the results are different. For

print SPRT({'wins': 716, 'losses': 591, 'draws': 2163}, 0, 0.05, 6, 0.05, 200)

fishtest prints LLR as 2.9948445563125237
while cutechess prints LLR as 4.373536

Which one is correct? Does cutechess's test allow one to run fewer tests?

Cutechess seems to be correct here. Are you sure the fishtest source is up to date? I checked some recent results from the SF framework, and they appear to be fine.

gladius · Post by **gladius** » Thu Apr 23, 2015 4:45 am

brtzsnr wrote: So fishtest omits the scaling step, i.e. it uses a fixed draw elo. Indeed the results are different. For

Fishtest will estimate the drawelo from the games that are played, so it doesn't really use a fixed drawelo.

Code: Select all

  # Estimate drawelo out of sample
  if (R['wins'] > 0 and R['losses'] > 0 and R['draws'] > 0):
    N = R['wins'] + R['losses'] + R['draws']
    P = {'win': float(R['wins'])/N, 'loss': float(R['losses'])/N, 'draw': float(R['draws'])/N}
    elo, drawelo = proba_to_bayeselo(P)

That being said, there could certainly still be problems.

Michel · Post by **Michel** » Thu Apr 23, 2015 6:46 am

brtzsnr wrote:I was working to add SPRT to my evaluation framework and noticed a strange difference between how LLR is computed in cutechess-cli and fishtest. I'll paste the relevant piece of code:

In cutechess-cli: https://github.com/cutechess/cutechess/ ... c/sprt.cpp
Code: Select all
	// Probability laws under H0 and H1
	const double s = b.scale();
	const BayesElo b0(m_elo0 / s, b.drawElo());
	const BayesElo b1(m_elo1 / s, b.drawElo());
	const SprtProbability p0(b0), p1(b1);
In fishtest https://raw.githubusercontent.com/glins ... at_util.py
Code: Select all
  # Probability laws under H0 and H1
  P0 = bayeselo_to_proba(elo0, drawelo)
  P1 = bayeselo_to_proba(elo1, drawelo)
So fishtest omits the scaling step, i.e. it uses a fixed draw elo. Indeed the results are different. For

print SPRT({'wins': 716, 'losses': 591, 'draws': 2163}, 0, 0.05, 6, 0.05, 200)

fishtest prints LLR as 2.9948445563125237
while cutechess prints LLR as 4.373536

Which one is correct? Does cutechess's test allow one to run fewer tests?

Most likely cutechess uses LogisticElo for the bounds whereas fishtest uses BayesElo (fishtest is certainly correct). I haven't checked the numbers though.

brtzsnr · Post by **brtzsnr** » Thu Apr 23, 2015 8:28 am

Definitely the latest sources. I noticed the difference because I wanted to test that the formulas are correct, so I went to http://tests.stockfishchess.org/ and picked an arbitrary finished test. Since the numbers did not match, I compared cutechess and fishtest and saw the disagreement.

brtzsnr · Post by **brtzsnr** » Thu Apr 23, 2015 8:30 am

gladius wrote:

Code: Select all

  # Estimate drawelo out of sample
  if (R['wins'] > 0 and R['losses'] > 0 and R['draws'] > 0):
    N = R['wins'] + R['losses'] + R['draws']
    P = {'win': float(R['wins'])/N, 'loss': float(R['losses'])/N, 'draw': float(R['draws'])/N}
    elo, drawelo = proba_to_bayeselo(P)

That being said, there could certainly still be problems.

Right. I was wrong about the draw elo. Only the scaling step is missing.

brtzsnr · Post by **brtzsnr** » Thu Apr 23, 2015 8:31 am

Michel wrote: Most likely cutechess uses LogisticElo for the bounds whereas fishtest uses BayesElo (fishtest is certainly correct). I haven't checked the numbers though.

Fishtest uses BayesElo, too. The formulas are identical, except for the missing scaling step in fishtest.

lucasart · Post by **lucasart** » Thu Apr 23, 2015 1:11 pm

brtzsnr wrote:I was working to add SPRT to my evaluation framework and noticed a strange difference between how LLR is computed in cutechess-cli and fishtest. I'll paste the relevant piece of code:

In cutechess-cli: https://github.com/cutechess/cutechess/ ... c/sprt.cpp
Code: Select all
	// Probability laws under H0 and H1
	const double s = b.scale();
	const BayesElo b0(m_elo0 / s, b.drawElo());
	const BayesElo b1(m_elo1 / s, b.drawElo());
	const SprtProbability p0(b0), p1(b1);
In fishtest https://raw.githubusercontent.com/glins ... at_util.py
Code: Select all
  # Probability laws under H0 and H1
  P0 = bayeselo_to_proba(elo0, drawelo)
  P1 = bayeselo_to_proba(elo1, drawelo)
So fishtest omits the scaling step, i.e. it uses a fixed draw elo. Indeed the results are different. For

print SPRT({'wins': 716, 'losses': 591, 'draws': 2163}, 0, 0.05, 6, 0.05, 200)

fishtest prints LLR as 2.9948445563125237
while cutechess prints LLR as 4.373536

Which one is correct? Does cutechess's test allow one to run fewer tests?

They are both correct:

in cutechess-cli, (elo0, elo1) are expressed in ELO
in fishtest, they are expressed in Bayes Elo

To be even more precise, cutechess-cli uses a first-order approximation to invert the Bayes ELO model into: bayes_elo = f(elo, draw elo). You can find a more proper inversion elo->bayes_elo (by dichotomy, i.e. bisection) in my SPRT simulator:
https://github.com/lucasart/sprt

I think the cutechess way is best for casual users, because Bayes ELO is a technical thing that should be kept internal and not exposed to the user. The user simply inputs bounds expressed in usual ELO, and cutechess figures out the rest.

On the other hand, using Bayes ELO has an advantage, which is TC scaling. Typically a patch that gains 2 elo at STC gains only 1.5 elo at LTC, or something like that. So if you use the same Bayes ELO value and increase DrawElo, you get that "average" scaling baked into the bounds for free. As a result, SPRT(0,6) at STC and at LTC does not mean the same thing when re-expressed in ELO (the bounds in ELO will be tighter at LTC, where DrawElo is higher).

I would say that Bayes Elo is most appropriate for the expert user (fishtest), and ELO is more appropriate for the public at large (cutechess), so both implementations serve their purpose perfectly.

PS: Both fishtest and cutechess estimate drawelo out of sample.
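
As a rough illustration of the dichotomy (bisection) inversion mentioned above, here is a minimal Python sketch. It is not the code from the linked repository: the function names are invented for this example, and it assumes the usual BayesElo model in which the win and loss probabilities are logistic functions of elo shifted by drawelo.

```python
import math

def bayeselo_to_proba(elo, drawelo):
    """(win, loss, draw) probabilities under the assumed BayesElo model."""
    p_win = 1.0 / (1.0 + 10.0 ** ((drawelo - elo) / 400.0))
    p_loss = 1.0 / (1.0 + 10.0 ** ((drawelo + elo) / 400.0))
    return p_win, p_loss, 1.0 - p_win - p_loss

def logistic_elo(bayeselo, drawelo):
    """Usual (logistic) Elo implied by a BayesElo value at a given drawelo."""
    p_win, _, p_draw = bayeselo_to_proba(bayeselo, drawelo)
    score = p_win + 0.5 * p_draw
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_to_bayeselo(elo, drawelo, lo=-1000.0, hi=1000.0, tol=1e-9):
    """Invert logistic_elo by bisection; it is increasing in bayeselo."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if logistic_elo(mid, drawelo) < elo:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Example: a 6 Elo bound re-expressed in BayesElo at drawelo = 200.
print(elo_to_bayeselo(6.0, 200.0))
```

For bounds as small as the ones used in SPRT testing, the bisection result should be very close to what the first-order approximation gives; the difference only becomes visible for large Elo gaps.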

Ajedrecista · Post by **Ajedrecista** » Thu Apr 23, 2015 9:01 pm

Hello Alexandru:

brtzsnr wrote:I was working to add SPRT to my evaluation framework and noticed a strange difference between how LLR is computed in cutechess-cli and fishtest.

[...]

print SPRT({'wins': 716, 'losses': 591, 'draws': 2163}, 0, 0.05, 6, 0.05, 200)

fishtests prints LLR as 2.9948445563125237
while cutechess prints LLR as 4.373536

Which one is correct? Does cutechess's test allow one to run fewer tests?

I am late to the thread, but here is my explanation:

Code: Select all

Games:       3470

Wins:         716 (20.63 %).
Loses:        591 (17.03 %).
Draws:       2163 (62.33 %).

bayeselo:    20.5207
drawelo:    254.5410

Bayeselo and drawelo are estimated from the sample of 3470 games in the following way:

Code: Select all

games = wins + draws + loses

W = wins/games
D = draws/games
L = loses/games

bayeselo = 200*log10{W*(1 - L)/[L*(1 - W)]}
drawelo  = 200*log10[(1 - L)*(1 - W)/(L*W)]
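
A minimal Python sketch of these two formulas, for anyone who wants to reproduce the numbers. The function name is only illustrative (it mirrors the fishtest helper quoted earlier in the thread but takes plain counts), and it assumes wins and losses are both nonzero so the logarithms are defined.

```python
import math

def proba_to_bayeselo(wins, losses, draws):
    """Estimate (bayeselo, drawelo) from a win/loss/draw sample."""
    games = wins + losses + draws
    W = wins / games
    L = losses / games
    bayeselo = 200.0 * math.log10(W * (1 - L) / (L * (1 - W)))
    drawelo = 200.0 * math.log10((1 - L) * (1 - W) / (L * W))
    return bayeselo, drawelo

bayeselo, drawelo = proba_to_bayeselo(716, 591, 2163)
print(f"bayeselo: {bayeselo:.4f}")
print(f"drawelo:  {drawelo:.4f}")
```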

And the conversion between logistic Elo and Bayeselo is (at least I use this one):

Code: Select all

x = 10^(drawelo/400)

K = 4x/(1 + x)²

Bayeselo = (logistic Elo)/K

Please correct me if there are typos.
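
The same conversion as a short Python sketch, under the assumption that the first-order factor K above is all that is needed (the function names are made up for this example):

```python
def elo_scale(drawelo):
    """First-order factor K between logistic Elo and Bayeselo."""
    x = 10.0 ** (drawelo / 400.0)
    return 4.0 * x / (1.0 + x) ** 2

def logistic_to_bayeselo(elo, drawelo):
    """Bayeselo = (logistic Elo) / K."""
    return elo / elo_scale(drawelo)

# K shrinks as drawelo grows, so the same Bayeselo bound
# corresponds to a smaller logistic Elo bound at higher draw rates.
for drawelo in (200.0, 254.5410, 300.0):
    print(drawelo, elo_scale(drawelo))
```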

I use alpha = 1/20 and beta = alpha in this case. Then, for SPRT(0, 6) (0 to 6 Bayeselo):

Code: Select all

LLR(wins):       20.0233
LLR(loses):     -16.6351
LLR(draws):      -0.3934

       LLR:       2.9948
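
These per-outcome terms can be reproduced with a few lines of Python. This is only a sketch: it assumes the standard BayesElo probability model, takes drawelo = 254.5410 from the estimate above, and uses the usual Wald thresholds for the stopping bounds. The function names are illustrative.

```python
import math

def bayeselo_to_proba(elo, drawelo):
    """(win, loss, draw) probabilities under the assumed BayesElo model."""
    p_win = 1.0 / (1.0 + 10.0 ** ((drawelo - elo) / 400.0))
    p_loss = 1.0 / (1.0 + 10.0 ** ((drawelo + elo) / 400.0))
    return p_win, p_loss, 1.0 - p_win - p_loss

def llr_terms(wins, losses, draws, elo0, elo1, drawelo):
    """Per-outcome log-likelihood ratio terms (wins, losses, draws)."""
    p0 = bayeselo_to_proba(elo0, drawelo)
    p1 = bayeselo_to_proba(elo1, drawelo)
    counts = (wins, losses, draws)
    return [n * math.log(q1 / q0) for n, q0, q1 in zip(counts, p0, p1)]

def wald_bounds(alpha, beta):
    """Lower and upper LLR thresholds of the SPRT."""
    return math.log(beta / (1 - alpha)), math.log((1 - beta) / alpha)

terms = llr_terms(716, 591, 2163, 0.0, 6.0, 254.5410)
print("terms:", terms)
print("LLR:  ", sum(terms))
print("bounds:", wald_bounds(0.05, 0.05))
```

With alpha = beta = 0.05 the thresholds come out at about ±2.944, so an LLR of 2.9948 is just past the upper one.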

Of course LLR = LLR(wins) + LLR(loses) + LLR(draws). So Fishtest agrees with my numbers... but cutechess-cli has not said its last word yet. As Michel and Lucas suggested, the bounds of SPRT could be written in logistic Elo. Using the parameter K that I wrote above with drawelo ~ 254.541:

Code: Select all

I keep all the digits of a Casio calculator but I only round up to 1e-4 when writing:

+716 -591 =2163 (3470 games).

bayeselo:    20.5207
drawelo:    254.5410

x ~ 4.3287
K ~ 0.6098

Bounds:
0 Elo = 0/K = 0 Bayeselo.
6 Elo = 6/K ~ 9.8395 Bayeselo.

I run my tool again. This time SPRT(0, 6) (0 to 6 logistic Elo) ~ SPRT(0, 9.8395) (0 to 9.8395 Bayeselo). Results:

Code: Select all

LLR(wins):       32.7669
LLR(loses):     -27.3355
LLR(draws):      -1.0579

       LLR:       4.3735

Which agrees with cutechess-cli output.
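
To put both conventions side by side, here is a self-contained Python sketch of the whole check. It is not the fishtest or cutechess-cli code: the function names and the `logistic_bounds` flag are made up for this example, and it assumes the BayesElo model plus the first-order factor K described earlier.

```python
import math

def bayeselo_to_proba(elo, drawelo):
    """(win, loss, draw) probabilities under the assumed BayesElo model."""
    p_win = 1.0 / (1.0 + 10.0 ** ((drawelo - elo) / 400.0))
    p_loss = 1.0 / (1.0 + 10.0 ** ((drawelo + elo) / 400.0))
    return p_win, p_loss, 1.0 - p_win - p_loss

def estimate_drawelo(wins, losses, draws):
    """drawelo estimated from the sample itself."""
    games = wins + losses + draws
    W, L = wins / games, losses / games
    return 200.0 * math.log10((1 - L) * (1 - W) / (L * W))

def llr(wins, losses, draws, elo0, elo1, logistic_bounds=False):
    """LLR of H1 vs H0; bounds are Bayeselo unless logistic_bounds is set."""
    drawelo = estimate_drawelo(wins, losses, draws)
    if logistic_bounds:
        # Rescale logistic Elo bounds to Bayeselo with the first-order factor K.
        x = 10.0 ** (drawelo / 400.0)
        k = 4.0 * x / (1.0 + x) ** 2
        elo0, elo1 = elo0 / k, elo1 / k
    p0 = bayeselo_to_proba(elo0, drawelo)
    p1 = bayeselo_to_proba(elo1, drawelo)
    counts = (wins, losses, draws)
    return sum(n * math.log(q1 / q0) for n, q0, q1 in zip(counts, p0, p1))

print("Bayeselo bounds:    ", llr(716, 591, 2163, 0.0, 6.0))
print("logistic Elo bounds:", llr(716, 591, 2163, 0.0, 6.0, logistic_bounds=True))
```

The first call corresponds to the Fishtest reading of SPRT(0, 6), the second to the cutechess-cli reading.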

Basically I agree with Michel and Lucas. I hope that my numerical check will be useful to you, Alexandru.

Summary:

· Fishtest: SPRT(Bayeselo_0, Bayeselo_1).
· cutechess-cli: SPRT(Elo_0, Elo_1).

Regards from Spain.

Ajedrecista.

brtzsnr · Post by **brtzsnr** » Fri Apr 24, 2015 3:10 pm

lucasart wrote: They are both correct:

in cutechess-cli, (elo0, elo1) are expressed in ELO

in fishtest, they are expressed in Bayes Elo

Thanks for the explanations. I was too quick to dismiss Michel's answer.
