Opening book from a statistical point of view

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

User avatar
stegemma
Posts: 859
Joined: Mon Aug 10, 2009 10:05 pm
Location: Italy
Full name: Stefano Gemma

Opening book from a statistical point of view

Post by stegemma »

I've implemented a simple opening book in Satana. I miss some math here, from a statistical point of view, how can we compute the probability of win, giving a different score for any book entry? In my book, I store any position as an entry with the white/black score. Since some position are played often and some rarely, just computing a % like this one isn't always good:

Code: Select all

perc = white_score / (white_score + black_score)
If I compare two positions, for sample, one played 100000 times with a white % of win of 53% with another played only 10 times with a % of win for white of 70%, just comparing % would let the engine prefers the rare continuation.

I think that I should compute a more complex % but I don't know why (my math is poor, in statistical).

For sample, this is my engine output for the starting position:

Code: Select all

hash white_score black_score %
# 26B24248B0B213A7 Ng1-h3 3 1 75.00%
# A6F5BE14996CE34B Ng1-f3 141781 105371 57.30%
# A9CD43D265BB27C1 Nb1-c3 8670 8386 50.80%
# A2D57A5C5D03A1B4 Nb1-a3 ???
# A65545D9385B5A3B Ph2-h3 2 0 100.00%
# A9D5495D9934AFB0 Pg2-g3 4782 3596 57.00%
# A6CABC12D080A7CD Pf2-f3 2 0 100.00%
# A6CB76195BE75793 Pe2-e3 33 59 35.80%
# B6D55CBAA963D796 Pd2-d3 40 32 55.50%
# B6CD5B84A78F4849 Pc2-c3 32 32 50.00%
# 46D5AEC46E7F602A Pb2-b3 11438 11100 50.70%
# E6B53D23459CEBA5 Pa2-a3 ???
# 9FCA61F1E64C0CF0 Ph2-h4 ???
# 804C915C8920E580 Pg2-g4 ???
# 9DD7B0A819627060 Pf2-f4 13996 15814 46.90%
# 98A8776C1A9B3E65 Pe2-e4 1271383 1108327 53.40%
# 80DCEC6D58853A63 Pd2-d4 686201 550747 55.40%
# 9DE3F5BD01E05A54 Pc2-c4 198730 162554 55.00%
# 9B31EA80413BAD98 Pb2-b4 3965 4551 46.50%
# 9EEDB481C8A96A61 Pa2-a4 ???
The best entry would be:

Code: Select all

# A65545D9385B5A3B Ph2-h3 2 0 100.00%
but, of course, that's not good!

I know that I can filter entry with a low bound (I already do it, in fact) but I would like to know a more correct way.
Author of Drago, Raffaela, Freccia, Satana, Sabrina.
http://www.linformatica.com
Dann Corbit
Posts: 12538
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Opening book from a statistical point of view

Post by Dann Corbit »

stegemma wrote:I've implemented a simple opening book in Satana. I miss some math here, from a statistical point of view, how can we compute the probability of win, giving a different score for any book entry? In my book, I store any position as an entry with the white/black score. Since some position are played often and some rarely, just computing a % like this one isn't always good:

Code: Select all

perc = white_score / (white_score + black_score)
If I compare two positions, for sample, one played 100000 times with a white % of win of 53% with another played only 10 times with a % of win for white of 70%, just comparing % would let the engine prefers the rare continuation.

I think that I should compute a more complex % but I don't know why (my math is poor, in statistical).

For sample, this is my engine output for the starting position:

Code: Select all

hash white_score black_score %
# 26B24248B0B213A7 Ng1-h3 3 1 75.00%
# A6F5BE14996CE34B Ng1-f3 141781 105371 57.30%
# A9CD43D265BB27C1 Nb1-c3 8670 8386 50.80%
# A2D57A5C5D03A1B4 Nb1-a3 ???
# A65545D9385B5A3B Ph2-h3 2 0 100.00%
# A9D5495D9934AFB0 Pg2-g3 4782 3596 57.00%
# A6CABC12D080A7CD Pf2-f3 2 0 100.00%
# A6CB76195BE75793 Pe2-e3 33 59 35.80%
# B6D55CBAA963D796 Pd2-d3 40 32 55.50%
# B6CD5B84A78F4849 Pc2-c3 32 32 50.00%
# 46D5AEC46E7F602A Pb2-b3 11438 11100 50.70%
# E6B53D23459CEBA5 Pa2-a3 ???
# 9FCA61F1E64C0CF0 Ph2-h4 ???
# 804C915C8920E580 Pg2-g4 ???
# 9DD7B0A819627060 Pf2-f4 13996 15814 46.90%
# 98A8776C1A9B3E65 Pe2-e4 1271383 1108327 53.40%
# 80DCEC6D58853A63 Pd2-d4 686201 550747 55.40%
# 9DE3F5BD01E05A54 Pc2-c4 198730 162554 55.00%
# 9B31EA80413BAD98 Pb2-b4 3965 4551 46.50%
# 9EEDB481C8A96A61 Pa2-a4 ???
The best entry would be:

Code: Select all

# A65545D9385B5A3B Ph2-h3 2 0 100.00%
but, of course, that's not good!

I know that I can filter entry with a low bound (I already do it, in fact) but I would like to know a more correct way.
First, if you have less than 32 games, the statistical significance of wins and losses is so small you should ignore it and rely purely on the computer evaluation. If the computer evaluation is so shallow that at the current time control you can out-think it, then you are officially out of book.

I think the question you are asking is : How to know what the breakpoint is to trust the data?

Clearly, this depends on how good your data is.
For instance from human games, is the data from correspondence chess between world championship candidates? Is it from FICS games between 'beanhead' and 'gizmo' with Elo of 800 and 750?

Is the data from bullet games? Is the data from TCEC games?

If you separate the data you collect into wins/losses and draws by type, then you can gain a lot more from it.

You should also store your own engine's wins/losses/draws from the given position. Even if it is a good position, you may want to avoid it if your engine does not play it well.

So, I think that the best way to form a book is to use as much carefully focused data as possible. Separate the games into their sources and you can statistically calibrate the value of each source. Consider the time control and also the strength of the players/engines.

When you have finely divided data, perform experiments exactly as you would for chess engine evaluation parameters (try different weights and run SPRT tests).
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
User avatar
stegemma
Posts: 859
Joined: Mon Aug 10, 2009 10:05 pm
Location: Italy
Full name: Stefano Gemma

Re: Opening book from a statistical point of view

Post by stegemma »

I'm starting from games played by M and GM of all the times, extracted from PgnMentor, so the quality of the data should be good. This is just the first step, I would refine the book later.

Supposing to have a "perfect" and well refined book, I simply don't know the correct statistical approach for choosing the best move or just to sort them in a good order. Maybe there are no a predefined formula for this... I'm searching right now the net to learn more about statistic.
Author of Drago, Raffaela, Freccia, Satana, Sabrina.
http://www.linformatica.com
User avatar
Ajedrecista
Posts: 1968
Joined: Wed Jul 13, 2011 9:04 pm
Location: Madrid, Spain.

Re: Opening book from a statistical point of view.

Post by Ajedrecista »

Hello Stefano:

You may try the lower bound of the Wilson score interval (with or without continuity correction). It takes care of small samples. Wikipedia link:

Binomial proportion confidence interval

Just scroll down until you reach Wilson score interval. 95% confidence interval (z ~ 1.96) should be fine. In your example:

1e+5 games, 53% for white: lower bound (95% confidence) ~ 52.69%.
10 games, 70% for white: lower bound (95% confidence) ~ 39.68%.

(Online calculator).

Regards from Spain.

Ajedrecista.
Ferdy
Posts: 4833
Joined: Sun Aug 10, 2008 3:15 pm
Location: Philippines

Re: Opening book from a statistical point of view

Post by Ferdy »

stegemma wrote:I've implemented a simple opening book in Satana. I miss some math here, from a statistical point of view, how can we compute the probability of win, giving a different score for any book entry? In my book, I store any position as an entry with the white/black score. Since some position are played often and some rarely, just computing a % like this one isn't always good:

Code: Select all

perc = white_score / (white_score + black_score)
If I compare two positions, for sample, one played 100000 times with a white % of win of 53% with another played only 10 times with a % of win for white of 70%, just comparing % would let the engine prefers the rare continuation.

I think that I should compute a more complex % but I don't know why (my math is poor, in statistical).

For sample, this is my engine output for the starting position:

Code: Select all

hash white_score black_score %
# 26B24248B0B213A7 Ng1-h3 3 1 75.00%
# A6F5BE14996CE34B Ng1-f3 141781 105371 57.30%
# A9CD43D265BB27C1 Nb1-c3 8670 8386 50.80%
# A2D57A5C5D03A1B4 Nb1-a3 ???
# A65545D9385B5A3B Ph2-h3 2 0 100.00%
# A9D5495D9934AFB0 Pg2-g3 4782 3596 57.00%
# A6CABC12D080A7CD Pf2-f3 2 0 100.00%
# A6CB76195BE75793 Pe2-e3 33 59 35.80%
# B6D55CBAA963D796 Pd2-d3 40 32 55.50%
# B6CD5B84A78F4849 Pc2-c3 32 32 50.00%
# 46D5AEC46E7F602A Pb2-b3 11438 11100 50.70%
# E6B53D23459CEBA5 Pa2-a3 ???
# 9FCA61F1E64C0CF0 Ph2-h4 ???
# 804C915C8920E580 Pg2-g4 ???
# 9DD7B0A819627060 Pf2-f4 13996 15814 46.90%
# 98A8776C1A9B3E65 Pe2-e4 1271383 1108327 53.40%
# 80DCEC6D58853A63 Pd2-d4 686201 550747 55.40%
# 9DE3F5BD01E05A54 Pc2-c4 198730 162554 55.00%
# 9B31EA80413BAD98 Pb2-b4 3965 4551 46.50%
# 9EEDB481C8A96A61 Pa2-a4 ???
The best entry would be:

Code: Select all

# A65545D9385B5A3B Ph2-h3 2 0 100.00%
but, of course, that's not good!

I know that I can filter entry with a low bound (I already do it, in fact) but I would like to know a more correct way.
From my mini shogi engine Lima. No worries this is also poor :).

Code: Select all

5   r  b  s  g  k
4   .  .  .  .  o
3   .  .  .  .  .
2   P  .  .  .  .
1   K  G  S  B  R

    a  b  c  d  e

st 10
go
# Probing book file Lima.bok ...
# Searching for position rbsgk/4p/5/P4/KGSBR[-] w
#  No.  Move  Perf  Game   Win  Loss  PlayPct
#   1.  d1c2  0.69    13     9     4  67.54
#   2.  d1e2  0.67     3     2     1  46.00
#   3.  b1b2  0.00     3     0     3  6.00
#   4.  c1b2  0.00     1     0     1  2.00
0 0 0 0 (Book Move d1c2)
move d1c2

Code: Select all

PlayPct = (Y * Win/Game) + (Z * Game/TotalGame)
Y = 60 (win factor)
Z = 40 (game factor)
jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: Opening book from a statistical point of view

Post by jdart »

I use something like:

frequency*winloss/(total frequency)

as the relative weight. But in addition I appy a "selectivity" value. If that value is high, weights of moves besides the best move are decreased in relative score; if the value is low they are boosted. And I also decrease even further the weight of moves with very bad winloss, in an amount based on the selectivity value.

And on top of all this I have a manually tuned book that takes precedence and can override or modify weight decisions made by the code. This is necessary because you can have lines that are both popular and high-scoring but are now refuted, so they should not be highly weighted.

--Jon
User avatar
stegemma
Posts: 859
Joined: Mon Aug 10, 2009 10:05 pm
Location: Italy
Full name: Stefano Gemma

Re: Opening book from a statistical point of view.

Post by stegemma »

Thanks, I'm try something based on that hint but simpler. I compute the square root of the sum of the w/bscores and then I exclude the position with a resulting score less than 2%. At end I choose randomly from the remaining entries:

Code: Select all

clsPCollection<clsBookEntryExt> cllBookEntries;
if&#40;int nBookMoves = GetBookEntries&#40;cllBookEntries, true&#41;)
&#123;
	double squareSum = 0;

	for&#40;int e = 0; e < cllBookEntries.Count&#40;); e++)
		if&#40;clsBookEntryExt *pEntry = cllBookEntries&#91;e&#93;)
			squareSum += sqrt&#40;pEntry->myScore + pEntry->otherScore&#41;;

	for&#40;int e = cllBookEntries.Count&#40;); e >= 0; e--)
		if&#40;clsBookEntryExt *pEntry = cllBookEntries&#91;e&#93;)
			if&#40;sqrt&#40;pEntry->myScore + pEntry->otherScore&#41; / squareSum < 0.02&#41; cllBookEntries.Remove&#40;e, true&#41;;

	if&#40;clsBookEntryExt *pEntry = cllBookEntries&#91;Rand&#40;nBookMoves&#41;&#93;)
	&#123;
		FOR_MOVES
		&#123;
			if&#40;GetUserMove&#40;pLoopMove, true&#41; == pEntry->sMove&#41;
			&#123;
				UserSetBestMove&#40;pLoopMove&#41;;
				sfout.Push&#40;"# book move " + pEntry->sMove&#41;;
				return 0;
			&#125;
		&#125;
	&#125;
&#125;
PS: next step would be to choose proportionally to %, of course
Author of Drago, Raffaela, Freccia, Satana, Sabrina.
http://www.linformatica.com
Edmund
Posts: 670
Joined: Mon Dec 03, 2007 3:01 pm
Location: Barcelona, Spain

Re: Opening book from a statistical point of view

Post by Edmund »

Half a year ago I wrote some lines on this very topic:
http://talkchess.com/forum/viewtopic.php?t=59374
User avatar
stegemma
Posts: 859
Joined: Mon Aug 10, 2009 10:05 pm
Location: Italy
Full name: Stefano Gemma

Re: Opening book from a statistical point of view

Post by stegemma »

As pointed by other, it seems that this is one is the more correct solution. It is a little too complex to apply that solution now; I must study very carefully maybe in the morning, not after a whole day of work.
Author of Drago, Raffaela, Freccia, Satana, Sabrina.
http://www.linformatica.com