Making Symbolic's opening book

sje · Post by **sje** » Fri Jul 17, 2015 12:42 am

Making Symbolic's opening book

Symbolic has a makebook command which takes four parameters:

1) Name of PGN input file
2) Name of book output file
3) Maximum variation ply length
4) Minimum game count for a position/move to appear

The variation length is usually set to 64, but a smaller figure like 32 or 40 would likely be okay and might even be better.

The minimum game count is the lower bound on the number of games in which a position/move pair must appear for the pair to be included in the book. Larger values mean smaller books and better safety from noise. Smaller values mean more knowledge in the book. Sixteen is a good value here for large or poorly edited PGN files.

Symbolic chugs through the input PGN file one game at a time. A game is used for the book only if the game started normally and had a definite conclusion. A valid game is played for the maximum variation length or to its end, whichever comes first. For each position and its immediately following move, the move and the 128 bit position signature is stored in a big fat binary tree with the signature as the key value in each node. Also in each node is a WLD (Win/Lose/Draw) counter vector; this vector gets bumped in the slot corresponding to the game's outcome.

All data is normalized from a White-To-Move standpoint before tree insertion/adjustment.

When the scan of the input PGN file is complete, Symbolic then writes the big fat binary tree to the book output file in signature order from low to high. Only those nodes with a WLD total greater than or equal to the minimum game count are output. All data is written in binary format and in small endian order: 4 bytes for the move, 12 bytes for the WLD, and 16 bytes for the signature for a total of 32 bytes. This makes the book file portable across different hosts.

To probe the book, a simple binary search is performed to find a matching position signature. If one is found, then the program backs up its file read pointer 32 bytes at a time to the very first position/move record match for the position being probed. Then the program advances to the last matching record, collecting the moves and their WLD statistics. The result is then fed to whatever routine which asked for the probe. Note that appropriate Side-To-Move mapping of data in and out is required if Black is on the move.

Because the book file is not very large compared to the amount of memory available, at program start time the entire book is loaded into memory to enable very fast probing. If needed, the program can switch books or have multiple books.

Sample log excerpt:

Code: Select all

2015-07-12 05&#58;25&#58;50.995 = C'tor&#58; BookEngine
2015-07-12 05&#58;25&#58;50.995 =   Loading opening book from file&#58; Book
2015-07-12 05&#58;25&#58;50.995 =   Book file byte count&#58; 7,261,184
2015-07-12 05&#58;25&#58;50.995 =   Book file move count&#58; 226,912

Total time for loading the book is about 200 ms.

Here is a list of the book files in use. Each file was built form the same PGN input but with a different minimum game count parameter.

Code: Select all

  232 -rw-r--r--@ 1 sje  staff    116896 Sep 28  2012 mondo1024.book
  456 -rw-r--r--@ 1 sje  staff    232288 Sep 28  2012 mondo512.book
  904 -rw-r--r--@ 1 sje  staff    462784 Sep 28  2012 mondo256.book
 1776 -rw-r--r--@ 1 sje  staff    907520 Sep 28  2012 mondo128.book
 3496 -rw-r--r--@ 1 sje  staff   1787360 Sep 28  2012 mondo64.book
 6976 -rw-r--r--@ 1 sje  staff   3568032 Sep 28  2012 mondo32.book
14192 -rw-r--r--@ 1 sje  staff   7261184 Jul  1 17&#58;49 mondo16.book
30104 -rw-r--r--@ 1 sje  staff  15410656 Sep 28  2012 mondo8.book
72656 -rw-r--r--@ 1 sje  staff  37198208 Sep 28  2012 mondo4.book

Notes:

1) Building the book can eat a lot of memory for storing the tree.
2) It would probably be safe to use 64 bit signatures instead of 128 bit signatures.
3) Stored moves could be packed into 16 bits instead of 32, but as with signatures, I prefer that the internal and external data formats match.
4) I've tried many alternatives over the years, and this is the fastest and simplest.

sje · Post by **sje** » Fri Jul 17, 2015 1:13 am

Symbolic has the dmob (display moves from opening book) command which will list the results of a book probe. Here's what a probe looks like using the 226,912 move mondo16 book at game start:

Code: Select all

&#91;&#93; dmob
 e4 &#91;446,009/355,049/313,166/1,114,224&#93; 0.540818
 d4 &#91;291,726/211,519/231,027/734,272&#93; 0.554617
Nf3 &#91;66,111/47,277/62,325/175,713&#93; 0.553593
 c4 &#91;65,668/48,149/55,078/168,895&#93; 0.551864
 g3 &#91;6,399/4,881/5,670/16,950&#93; 0.544779
 f4 &#91;5,592/6,036/3,541/15,169&#93; 0.485365
 b4 &#91;5,807/5,215/2,683/13,705&#93; 0.521598
 b3 &#91;3,333/3,134/2,334/8,801&#93; 0.511306
Nc3 &#91;2,759/2,296/1,665/6,720&#93; 0.534449
 e3 &#91;810/1,036/378/2,224&#93; 0.449191
 g4 &#91;889/797/258/1,944&#93; 0.523663
 d3 &#91;437/510/266/1,213&#93; 0.469909
 a3 &#91;394/382/198/974&#93; 0.50616
 c3 &#91;231/241/142/614&#93; 0.491857
 h3 &#91;86/100/43/229&#93; 0.469432
 h4 &#91;91/102/28/221&#93; 0.475113
 a4 &#91;87/77/23/187&#93; 0.526738
 f3 &#91;83/87/14/184&#93; 0.48913
Nh3 &#91;53/82/21/156&#93; 0.407051
Na3 &#91;49/39/10/98&#93; 0.55102

The found moves are listed in descending order by frequency of occurrence. The numbers in the brackets are the win, lose, draw, and total counts for the move in the given position. The floating point value is the scoring expectation for the move based on all games which had the position/move.

Although not seen in the above example, all moves in the book have proper notation applied for check, checkmate, and disambiguation.

jdart · Post by **jdart** » Fri Jul 17, 2015 3:18 am

I am doing something quite similar but in addition I have a lot of manual tuning, which takes precedence over WLD statistics. One problem with WLD is that when an opening move is met with a strong refutation it will start to be played less often. But if the bulk of the PGN data is before the refutation, then a lot of games in the dataset will play into what is now a bad position, and the WLD will look reasonable.

--Jon

sje · Post by **sje** » Fri Jul 17, 2015 9:04 am

Symbolic's book layout design deliberately avoids setting move selection policy. Instead, there are various tuning parameters in the code which do the work. The function and its parameters are chosen such that the moves picked look natural for some given value of "natural".

My experience here is that the relative frequency of the move in the book is more important than the move's WLD expectation.

Zenmastur · Post by **Zenmastur** » Fri Jul 17, 2015 9:15 pm

How many leaf nodes are in each book? Or do you even keep track of such things.

sje wrote:Symbolic's book layout design deliberately avoids setting move selection policy. Instead, there are various tuning parameters in the code which do the work. The function and its parameters are chosen such that the moves picked look natural for some given value of "natural".

My experience here is that the relative frequency of the move in the book is more important than the move's WLD expectation.

Is it a learning book or static?

Regards,

Zen

sje · Post by **sje** » Fri Jul 17, 2015 9:25 pm

Zenmastur wrote:How many leaf nodes are in each book? Or do you even keep track of such things.

Is it a learning book or static?

There is no distinction made for book entries based on on ply. Each record is just the three fields: move, position signature, and WLD.

The book file is static. However, the book construction process can be automated iteratively based on game history.

Zenmastur · Post by **Zenmastur** » Sat Jul 18, 2015 1:55 am

sje wrote:
Zenmastur wrote:How many leaf nodes are in each book? Or do you even keep track of such things.

Is it a learning book or static?
There is no distinction made for book entries based on on ply. Each record is just the three fields: move, position signature, and WLD.

The book file is static. However, the book construction process can be automated iteratively based on game history.

The reason for knowing your leaf node count, even if it isn't stored in the book is it determines the total number of lines in play in the book. In most static books you're either going to exit at a leaf node or your opponent can play a non-book move.

Even in relative large books there aren't too many terminal positions. E.G. I created a book from 578,574 high level human games with all position that occurred at least twice included. This is a very loose book because each position is only required to be seen twice. So it likely has a lot of errors in it. When I truncated the book to 15 plies it had 206,677 positions in it. Of these only 40,797 positions were leaf nodes. It's very easy to analyze 40K positions for gross errors. It can be done in a very short time. On a 4-core machine it takes less than three hours. Most positions will pass this test. Any position that is found to be bad can be back tracked to the root position causing the problem easily because there will likely be many leaf nodes that have the problem node as a parent. This procedure will NOT catch all errors but it will remove the vast majority of them. Of course, this procedure can be done with longer analysis times if better vetting is wanted.

In books with a stricter requirement for position inclusion the number of leaf nodes drops. E.G. if I require a position to be seen 30 times to be included the number of leaf nodes dropped to a mere 2,455. So, there is probably an optimum for the number of times seen to be included because as the leaf node count drops the analysis time can be increased so the vetting process is better in both cases.

So there is benefit to knowing your leaf nodes.

Regards,

Zen

jdart · Post by **jdart** » Sun Jul 19, 2015 3:07 pm

I use both frequency and WLD in computing the move frequency by default.

But the same problem I mentioned with refutations applies to frequency too. A move can be very frequent in a PGN collection because it used to be popular but has recently been dropped because a refutation was found. You could try to detect this by examining date played too. But this will be inexact because some players, especially weaker ones, don't "get the message" and continue playing the line.

--Jon

Making Symbolic's opening book

Making Symbolic's opening book

At game start, the results of a book probe

Re: Making Symbolic's opening book

Symbolic's book move picker

Re: Symbolic's book move picker

Re: Symbolic's book move picker

Re: Symbolic's book move picker

Re: Symbolic's book move picker