Getting SPRT right

brtzsnr
Posts: 433
Joined: Fri Jan 16, 2015 4:02 pm

Getting SPRT right

Post by brtzsnr »

I was working to add SPRT to my evaluation framework and noticed a strange difference between how LLR is computed in cutechess-cli and fishtest. I'll paste the relevant piece of code:

In cutechess-cli: https://github.com/cutechess/cutechess/ ... c/sprt.cpp

	// Probability laws under H0 and H1
	const double s = b.scale();
	const BayesElo b0(m_elo0 / s, b.drawElo());
	const BayesElo b1(m_elo1 / s, b.drawElo());
	const SprtProbability p0(b0), p1(b1);
In fishtest https://raw.githubusercontent.com/glins ... at_util.py

  # Probability laws under H0 and H1
  P0 = bayeselo_to_proba(elo0, drawelo)
  P1 = bayeselo_to_proba(elo1, drawelo)
So fishtest omits the scaling step, i.e. it uses a fixed draw elo. Indeed the results are different. For

print SPRT({'wins': 716, 'losses': 591, 'draws': 2163}, 0, 0.05, 6, 0.05, 200)

fishtest prints LLR as 2.9948445563125237
while cutechess prints LLR as 4.373536

Which one is correct? Does cutechess's test allow one to run fewer tests?
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Getting SPRT right

Post by Laskos »

brtzsnr wrote:I was working to add SPRT to my evaluation framework and noticed a strange difference between how LLR is computed in cutechess-cli and fishtest.

[...]

Which one is correct? Does cutechess's test allow one to run fewer tests?
Cutechess seems to be correct here. Are you sure the fishtest source is up to date? I checked some recent results from the SF framework, and they are apparently fine.
gladius
Posts: 568
Joined: Tue Dec 12, 2006 10:10 am
Full name: Gary Linscott

Re: Getting SPRT right

Post by gladius »

brtzsnr wrote: So fishtest omits the scaling step, i.e. it uses a fixed draw elo. Indeed the results are different. For
Fishtest will estimate the drawelo from the games that are played, so it doesn't really use a fixed drawelo.

  # Estimate drawelo out of sample
  if (R['wins'] > 0 and R['losses'] > 0 and R['draws'] > 0):
    N = R['wins'] + R['losses'] + R['draws']
    P = {'win': float(R['wins'])/N, 'loss': float(R['losses'])/N, 'draw': float(R['draws'])/N}
    elo, drawelo = proba_to_bayeselo(P)
That being said, there could certainly be problems still :).
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: Getting SPRT right

Post by Michel »

brtzsnr wrote:I was working to add SPRT to my evaluation framework and noticed a strange difference between how LLR is computed in cutechess-cli and fishtest.

[...]

Which one is correct? Does cutechess's test allow one to run fewer tests?
Most likely cutechess uses LogisticElo for the bounds whereas fishtest uses BayesElo (fishtest is certainly correct). I haven't checked the numbers though.
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.
brtzsnr
Posts: 433
Joined: Fri Jan 16, 2015 4:02 pm

Re: Getting SPRT right

Post by brtzsnr »

Definitely the latest sources. I noticed the difference because I wanted to test that the formulas are correct, so I went to http://tests.stockfishchess.org/ and picked an arbitrary finished test. Since the numbers did not match, I compared cutechess and fishtest and saw the disagreement.
brtzsnr
Posts: 433
Joined: Fri Jan 16, 2015 4:02 pm

Re: Getting SPRT right

Post by brtzsnr »

gladius wrote:

  # Estimate drawelo out of sample
  if (R['wins'] > 0 and R['losses'] > 0 and R['draws'] > 0):
    N = R['wins'] + R['losses'] + R['draws']
    P = {'win': float(R['wins'])/N, 'loss': float(R['losses'])/N, 'draw': float(R['draws'])/N}
    elo, drawelo = proba_to_bayeselo(P)
That being said, there could certainly be problems still :).
Right. I was wrong about the draw elo. Only the scaling step is missing.
brtzsnr
Posts: 433
Joined: Fri Jan 16, 2015 4:02 pm

Re: Getting SPRT right

Post by brtzsnr »

Michel wrote: Most likely cutechess uses LogisticElo for the bounds whereas fishtest uses BayesElo (fishtest is certainly correct). I haven't checked the numbers though.
Fishtest uses BayesElo, too. The formulas are identical, except for the missing scaling step in fishtest.
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: Getting SPRT right

Post by lucasart »

brtzsnr wrote:I was working to add SPRT to my evaluation framework and noticed a strange difference between how LLR is computed in cutechess-cli and fishtest.

[...]

Which one is correct? Does cutechess's test allow one to run fewer tests?
They are both correct:
  • in cutechess-cli, (elo0, elo1) are expressed in ELO
  • in fishtest, they are expressed in Bayes Elo
To be even more precise, cutechess-cli uses a first-order approximation to invert the Bayes ELO model into: bayes_elo = f(elo, draw elo). You can find a more proper inversion elo->bayes_elo (by dichotomy, i.e. bisection) in my SPRT simulator:
https://github.com/lucasart/sprt
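
To illustrate what such an inversion might look like, here is a minimal Python sketch. It is not the code from the repository above: the function names are hypothetical, and it assumes the usual BayesElo win/draw/loss model together with the usual logistic score-to-Elo formula.

```python
import math

def bayeselo_to_logistic(bayeselo, drawelo):
    """Exact logistic Elo implied by a BayesElo difference (assumed model)."""
    win = 1 / (1 + 10 ** ((-bayeselo + drawelo) / 400))
    loss = 1 / (1 + 10 ** ((bayeselo + drawelo) / 400))
    draw = 1 - win - loss
    score = win + 0.5 * draw          # expected score per game
    return -400 * math.log10(1 / score - 1)

def logistic_to_bayeselo(elo, drawelo, lo=-1000.0, hi=1000.0, tol=1e-9):
    """Invert bayeselo_to_logistic by bisection; it is increasing in bayeselo."""
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if bayeselo_to_logistic(mid, drawelo) < elo:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Example: a 6 Elo bound with a drawelo of about 254.5
b = logistic_to_bayeselo(6, 254.541)
print(f"6 logistic Elo ~ {b:.4f} BayesElo")
```

The search interval of [-1000, 1000] BayesElo is an arbitrary choice that comfortably covers typical SPRT bounds.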

I think the cutechess way is best for casual users, because Bayes ELO is a technical thing that should be kept internal and not exposed to the user. The user simply inputs bounds expressed in usual ELO, and cutechess figures out the rest.

On the other hand, using Bayes ELO has an advantage, which is TC scaling. Typically a patch that gains 2 elo at STC gains only 1.5 elo at LTC, or something like that. So, if you use the same Bayes ELO value and increase DrawElo, you get that "average" scaling baked into the bounds for free. So SPRT(0,6) at STC and at LTC doesn't mean the same when re-expressed in ELO (bounds in ELO will be tighter at LTC, where DrawElo is higher).
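
A small Python sketch of that effect, re-expressing a fixed BayesElo bound in logistic Elo with the first-order scale factor K = 4x/(1 + x)^2, where x = 10^(drawelo/400). The function names are hypothetical, and the two drawelo values are made up for illustration, not measured STC/LTC figures.

```python
def bayeselo_scale(drawelo):
    """First-order factor K such that logistic Elo ~ K * BayesElo."""
    x = 10 ** (drawelo / 400)
    return 4 * x / (1 + x) ** 2

def bayeselo_to_logistic_approx(bayeselo, drawelo):
    return bayeselo * bayeselo_scale(drawelo)

# Same upper bound of 6 BayesElo, two hypothetical draw rates
for label, drawelo in (("lower drawelo", 200), ("higher drawelo", 300)):
    elo = bayeselo_to_logistic_approx(6, drawelo)
    print(f"{label} ({drawelo}): 6 BayesElo ~ {elo:.2f} logistic Elo")
```

The higher the drawelo, the smaller K becomes, so the same BayesElo bound corresponds to a smaller logistic Elo gain.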

I would say that Bayes Elo is most appropriate for the expert user (fishtest), and ELO is more appropriate for the public at large (cutechess), so both implementations serve their purpose perfectly.

PS: Both fishtest and cutechess estimate drawelo out of sample.
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
Ajedrecista
Posts: 1968
Joined: Wed Jul 13, 2011 9:04 pm
Location: Madrid, Spain.

Re: Getting SPRT right: numerical check.

Post by Ajedrecista »

Hello Alexandru:
brtzsnr wrote:I was working to add SPRT to my evaluation framework and noticed a strange difference between how LLR is computed in cutechess-cli and fishtest.

[...]

print SPRT({'wins': 716, 'losses': 591, 'draws': 2163}, 0, 0.05, 6, 0.05, 200)

fishtest prints LLR as 2.9948445563125237
while cutechess prints LLR as 4.373536

Which one is correct? Does cutechess's test allow one to run fewer tests?
I am late to the thread, but here is my explanation:

Games:       3470
 
Wins:         716 (20.63 %).
Loses:        591 (17.03 %).
Draws:       2163 (62.33 %).
 
bayeselo:    20.5207
drawelo:    254.5410
Bayeselo and drawelo are estimated from the sample of 3470 games in the following way:

games = wins + draws + loses

W = wins/games
D = draws/games
L = loses/games

bayeselo = 200*log10{W*(1 - L)/[L*(1 - W)]}
drawelo  = 200*log10[(1 - L)*(1 - W)/(L*W)]
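
The same estimate as a minimal Python sketch (the function name is hypothetical; it assumes wins and losses are both non-zero, otherwise the logarithms are undefined):

```python
import math

def estimate_bayeselo(wins, losses, draws):
    """Return (bayeselo, drawelo) estimated from a W/L/D sample."""
    games = wins + losses + draws
    w = wins / games
    l = losses / games
    bayeselo = 200 * math.log10(w * (1 - l) / (l * (1 - w)))
    drawelo = 200 * math.log10((1 - l) * (1 - w) / (l * w))
    return bayeselo, drawelo

bayeselo, drawelo = estimate_bayeselo(716, 591, 2163)
print(f"bayeselo: {bayeselo:.4f}")
print(f"drawelo:  {drawelo:.4f}")
```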
And the conversion between logistic Elo and Bayeselo is (at least I use this one):

x = 10^(drawelo/400)

K = 4x/(1 + x)²

Bayeselo = (logistic Elo)/K
Please correct me if there are typos.
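
A minimal Python sketch of this conversion (function names are hypothetical). As noted earlier in the thread, it is a first-order approximation rather than an exact inversion.

```python
def bayeselo_scale(drawelo):
    """Scale factor K = 4x/(1 + x)^2 with x = 10^(drawelo/400)."""
    x = 10 ** (drawelo / 400)
    return 4 * x / (1 + x) ** 2

def logistic_to_bayeselo(elo, drawelo):
    """Convert a logistic Elo value to Bayeselo (first-order approximation)."""
    return elo / bayeselo_scale(drawelo)

drawelo = 254.541  # estimated from the 3470-game sample above
print(f"K ~ {bayeselo_scale(drawelo):.4f}")
print(f"6 Elo ~ {logistic_to_bayeselo(6, drawelo):.4f} Bayeselo")
```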

I use alpha = 1/20 and beta = alpha in this case. Then, for SPRT(0, 6) (0 to 6 Bayeselo):

LLR(wins):       20.0233
LLR(loses):     -16.6351
LLR(draws):      -0.3934

       LLR:       2.9948
Of course LLR = LLR(wins) + LLR(loses) + LLR(draws). So Fishtest agrees with my numbers... but cutechess-cli has not said the last word yet. As Michel and Lucas suggested, the bounds of SPRT could be written in logistic Elo. Using the parameter K that I wrote above with drawelo ~ 254.541:

I keep all the digits of a Casio calculator but I only round up to 1e-4 when writing:

+716 -591 =2163 (3470 games).

bayeselo:    20.5207
drawelo:    254.5410

x ~ 4.3287
K ~ 0.6098

Bounds:
0 Elo = 0/K = 0 Bayeselo.
6 Elo = 6/K ~ 9.8395 Bayeselo.
I ran my tool again. This time SPRT(0, 6) (0 to 6 logistic Elo) ~ SPRT(0, 9.8395) (0 to 9.8395 Bayeselo). Results:

LLR(wins):       32.7669
LLR(loses):     -27.3355
LLR(draws):      -1.0579

       LLR:       4.3735
Which agrees with cutechess-cli output.
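
For anyone who wants to reproduce both numbers, here is a minimal Python sketch of the whole check. It is neither the fishtest nor the cutechess-cli source: the names are hypothetical, the win/loss/draw probabilities use the standard BayesElo model (which is not spelled out in this thread), and the `logistic_bounds` flag is an illustrative switch for the scaling step.

```python
import math

def bayeselo_to_proba(elo, drawelo):
    """(win, loss, draw) probabilities under the assumed BayesElo model."""
    win = 1 / (1 + 10 ** ((-elo + drawelo) / 400))
    loss = 1 / (1 + 10 ** ((elo + drawelo) / 400))
    return win, loss, 1 - win - loss

def llr(wins, losses, draws, elo0, elo1, logistic_bounds=False):
    """Log-likelihood ratio of H1 (elo1) against H0 (elo0)."""
    games = wins + losses + draws
    w, l = wins / games, losses / games
    # drawelo estimated out of sample, as in the formulas above
    drawelo = 200 * math.log10((1 - l) * (1 - w) / (l * w))
    if logistic_bounds:
        # bounds given in logistic Elo: rescale them to Bayeselo first
        x = 10 ** (drawelo / 400)
        k = 4 * x / (1 + x) ** 2
        elo0, elo1 = elo0 / k, elo1 / k
    p0 = bayeselo_to_proba(elo0, drawelo)
    p1 = bayeselo_to_proba(elo1, drawelo)
    counts = (wins, losses, draws)
    return sum(c * math.log(q1 / q0) for c, q0, q1 in zip(counts, p0, p1))

print(f"Bayeselo bounds: {llr(716, 591, 2163, 0, 6):.4f}")
print(f"Logistic bounds: {llr(716, 591, 2163, 0, 6, logistic_bounds=True):.4f}")
```

With the sample from this thread, the first call should land close to the fishtest value and the second close to the cutechess-cli value.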

Basically I agree with Michel and Lucas. I hope that my numerical check will be useful to you, Alexandru.

Summary:

· Fishtest: SPRT(Bayeselo_0, Bayeselo_1).
· cutechess-cli: SPRT(Elo_0, Elo_1).

Regards from Spain.

Ajedrecista.
brtzsnr
Posts: 433
Joined: Fri Jan 16, 2015 4:02 pm

Re: Getting SPRT right

Post by brtzsnr »

lucasart wrote: They are both correct:
  • in cutechess-cli, (elo0, elo1) are expressed in ELO
  • in fishtest, they are expressed in Bayes Elo
Thanks for the explanations. I was too quick to dismiss Michel's answer.