Database storage methods

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Dave_N
Posts: 153
Joined: Fri Sep 30, 2011 7:48 am

Database storage methods

Post by Dave_N »

I am thinking about Storing a compressed move list as a Tree in a hash table ...

The hash entry obviously has the following

int64 hash
int addrNext; // if needing a mainline
int addrVariation;


then for each move the game is merged by putting all variations found in a linked list of hash entries, so for a game list stored in sparse format

Code: Select all

1.e4 e5 2.Nf3 Nc6 3.Bc4  // game 1
                          3.Bb5  // game 2
                          3.Nc3  // game 3
the main line is written first, then the variable addrVariation for Bc4 holds the array index of the hash bucket for Bb5, and the Bb5 bucket stores the hash bucket index for Nc3 etc ...

I don't know what to do about terminal nodes or nodes with only 1 game passing through - perhaps the index of the game can be written and merged iff another game scores a hit ?

I also tried to create a DAWG (direct acyclic word graph) for the FEN's to try to create perfect hashing ... the fen can already be compressed to 5 bit chars, then I got a 15 mb to 10 mb compression for the Fen list to DAWG file sizes.

If anyone has tried this disk tree concept I'd be interested to hear about how to compress the tree and whether its worth storing sparse move lists, or whether to just use SQL and a compressed move list, with maybe a gui button to call CQL for position searches.
User avatar
Rebel
Posts: 6991
Joined: Thu Aug 18, 2011 12:04 pm

Re: Database storage methods

Post by Rebel »

One compression idea is to store a move (usually 2 bytes, or 2x6=12 bits) as an 8-bit number representing the number in the pseudo-legal move generation list. Takes a lot of extra processor time but saves disk space.
User avatar
JuLieN
Posts: 2949
Joined: Mon May 05, 2008 12:16 pm
Location: Bordeaux (France)
Full name: Julien Marcel

Re: Database storage methods

Post by JuLieN »

Rebel wrote:One compression idea is to store a move (usually 2 bytes, or 2x6=12 bits) as an 8-bit number representing the number in the pseudo-legal move generation list. Takes a lot of extra processor time but saves disk space.
Right. In ICCA's January 1983 Newsletter, Kathe Spracklen (co-author of Sargon) writes a "Tutorial: Representation of an Opening Book Tree".

She discusses various methods and the one she finally chose was described to her by Ken Thompson (author of Belle).

As Ed said, each move in the tree is coded in a single byte: 6 bits are the index in the moves generation list, and the remaining two bits are coded "(" and ")" and help describe the tree.

To quote Kathe:
One bit, designated '(', indicates that there is another move choice available in this position. The other bit, designated ')', indicates that we have reached the end of that line.
(Actually, when you think of it, you only need 7 bits: six for displacement and one for the '(/)': 0='(', 1=')'.)

For instance, the following tree:

Code: Select all

e4 -- e5 -- Nf3 -- Nc6 -- Bb5 -- a6
 |       |        |       |        |
 |       |       f4     Nf6     Bc4
 |      c5 -- Nf3 -- d6 -- d4
 |       |               |
 |      e6            Nc6
d4 -- Nf6 -- c4 -- g6 -- Nc3
 |       |                |
 |       |               e6 -- Nc3 -- Bb4
 |       |                        |
 |       |                      Nf3
 |      d5 -- c4 -- c6 -- Nf3
 |       |       |      |
etc.   f5     Nf3   e6 -- Nc3
 |                      |
 |                     dxc4
Would translate as:

Code: Select all

(e4  (e5  (Nf3  (Nc6  (Bb5  a6)
                              Bc4)
                       Nf6)
               f4)
       (c5  Nf3  (d6  d4)
                     Nc6)
        e6)
(d4  (Nf6  c4  (g6  Nc3)
                     e6  (Nc3  Bb4)
                           Nf3)
       (d5  (c4  (c6  Nf3)
                    (e6  Nc3)
                     dxc4)
              Nf3)
        f5)
etc.
Or, when stored:

Code: Select all

(e4(e5(Nf3(Nc6(Bb5 a6)Bc4)Nf6)f4)(c5 Nf3)(d6 d4)Nc6)e6)(d4(Nf6 c4(g6 Nc3)e6(Nc3 Bb4)Nf3)(d5(c4(c6 Nf3)(e6 Nc3)dxc4)Nf3)f5) etc.
Hope that helps. :)
"The only good bug is a dead bug." (Don Dailey)
[Blog: http://tinyurl.com/predateur ] [Facebook: http://tinyurl.com/fbpredateur ] [MacEngines: http://tinyurl.com/macengines ]
Dave_N
Posts: 153
Joined: Fri Sep 30, 2011 7:48 am

Re: Database storage methods

Post by Dave_N »

I have implemented the byte compression form, with my own legal move generator. A big difference with other legal move generators is the fact that for promotions I stored for a pawn promotion on the a-file, the move a8 and a8=Q, a8=N, etc.

I got the idea from an earlier CCC post about 10 million games, then later found the pgc format however I had already created a legal move generator.

The problem is really with compressing the hash, I guess some disk space is save from 1st creating the Trie, then finding the games that can be sparse move lists, setting an index to the root game, then going through another stage by generating hash entries ...

My thought about the hash-graph concept I sketched above is to have hash table compression with empty buckets for large empty spaces (since the hash index would be [hash_number % array_size] ) ... so for example a span of memory with 8000 empty buckets would store 1 bucket with a flag to say empty and the number 8000.

Then a decompression phase would create a file in memory with the full size of the table and hash buckets would be written to the correct part of the file.

I think even with the larger table format, the memory gaps could be useful somehow.

I am not sure about the number of unique positions in a large DB, I think a fair upper bound might be number_of_games * 75.
Last edited by Dave_N on Thu Mar 08, 2012 1:45 pm, edited 1 time in total.
Dave_N
Posts: 153
Joined: Fri Sep 30, 2011 7:48 am

Re: Database storage methods

Post by Dave_N »

JuLieN wrote:
Rebel wrote:One compression idea is to store a move (usually 2 bytes, or 2x6=12 bits) as an 8-bit number representing the number in the pseudo-legal move generation list. Takes a lot of extra processor time but saves disk space.
Right. In ICCA's January 1983 Newsletter, Kathe Spracklen (co-author of Sargon) writes a "Tutorial: Representation of an Opening Book Tree".

She discusses various methods and the one she finally chose was described to her by Ken Thompson (author of Belle).

As Ed said, each move in the tree is coded in a single byte: 6 bits are the index in the moves generation list, and the remaining two bits are coded "(" and ")" and help describe the tree.

To quote Kathe:
One bit, designated '(', indicates that there is another move choice available in this position. The other bit, designated ')', indicates that we have reached the end of that line.
(Actually, when you think of it, you only need 7 bits: six for displacement and one for the '(/)': 0='(', 1=')'.)

For instance, the following tree:

Code: Select all

e4 -- e5 -- Nf3 -- Nc6 -- Bb5 -- a6
 |       |        |       |        |
 |       |       f4     Nf6     Bc4
 |      c5 -- Nf3 -- d6 -- d4
 |       |               |
 |      e6            Nc6
d4 -- Nf6 -- c4 -- g6 -- Nc3
 |       |                |
 |       |               e6 -- Nc3 -- Bb4
 |       |                        |
 |       |                      Nf3
 |      d5 -- c4 -- c6 -- Nf3
 |       |       |      |
etc.   f5     Nf3   e6 -- Nc3
 |                      |
 |                     dxc4
Would translate as:

Code: Select all

(e4  (e5  (Nf3  (Nc6  (Bb5  a6)
                              Bc4)
                       Nf6)
               f4)
       (c5  Nf3  (d6  d4)
                     Nc6)
        e6)
(d4  (Nf6  c4  (g6  Nc3)
                     e6  (Nc3  Bb4)
                           Nf3)
       (d5  (c4  (c6  Nf3)
                    (e6  Nc3)
                     dxc4)
              Nf3)
        f5)
etc.
Or, when stored:

Code: Select all

(e4(e5(Nf3(Nc6(Bb5 a6)Bc4)Nf6)f4)(c5 Nf3)(d6 d4)Nc6)e6)(d4(Nf6 c4(g6 Nc3)e6(Nc3 Bb4)Nf3)(d5(c4(c6 Nf3)(e6 Nc3)dxc4)Nf3)f5) etc.
Hope that helps. :)
Yeah thats very helpful, solves the problem of storing the bracket in the same byte.
mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: Database storage methods

Post by mar »

Good idea Julien except that I think this can't handle transpositions :wink:
User avatar
JuLieN
Posts: 2949
Joined: Mon May 05, 2008 12:16 pm
Location: Bordeaux (France)
Full name: Julien Marcel

Re: Database storage methods

Post by JuLieN »

mar wrote:Good idea Julien except that I think this can't handle transpositions :wink:
Yep, she mentions the problem of course. But she wrote that in a time where storing 100k moves in 100k bytes was mandatory, considering the small available space on... those big 5"25 floppy disks. ;)

The solution to the transposition problem Ken told her was to have special codes for special cases. She gives actually two possible solutions to the transposition problem, and I quote her again:
For example, bit code "11000011" might be used to indicate that the computer should recognize, but not play, the following move*. While code "11000100" might mean that the position already reached is a transposition of moves elsewhere in the book. In that case the next byte or two bytes might contain a pointer to another location of the book. Transposition can also be handled by "walking" the tree to see if the given position is present somewhere in the book. With a minimum of optimization to avoid examining lines that cannot possibly lead to the given position, the time cost for walking the tree is quite reasonable.**
* this is to name openings but don't play them if they are losing. Or, precisely, to avoid opening traps.
** And this was written for 1983 hardware ;)
"The only good bug is a dead bug." (Don Dailey)
[Blog: http://tinyurl.com/predateur ] [Facebook: http://tinyurl.com/fbpredateur ] [MacEngines: http://tinyurl.com/macengines ]
Dave_N
Posts: 153
Joined: Fri Sep 30, 2011 7:48 am

Re: Database storage methods

Post by Dave_N »

I thought about this - the problem of searching for tree lines that can transpose must occur in Hippo style positions, with many transposable lines.
without captures or moving the same piece twice there must be [ply ! ] number of lines to search, this becomes very time consuming I guess.

Maybe there is a good way of removing the hash key as a way of finding transpositions in the build phase, however so far I think the hash key is the best way.

The problem of hash collision can at least be solved by walking the tree of the game that the original collided hash was from ...

After transpositions are merged, if the table is walked in variation first form, (breadth first), to find the terminal nodes for the indexes to the game list, then there could be a problem with double transpositions.

If double transpositions are ignored, then the tree (graph) can be walked to the ends to store data ranges of the list of game indexes. Then the re-indexed game list can have a sortable column for the data ranged selection. It might be a bit time consuming to work out when lines transpose twice every time, however when a variation turns into the mainline the lists would occur twice ... I suppose it would have to simply write into a set data structure.

I don't know how much space this fantasy structure takes ...
mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: Database storage methods

Post by mar »

JuLieN wrote:
mar wrote:Good idea Julien except that I think this can't handle transpositions :wink:
Yep, she mentions the problem of course. But she wrote that in a time where storing 100k moves in 100k bytes was mandatory, considering the small available space on... those big 5"25 floppy disks. ;)

The solution to the transposition problem Ken told her was to have special codes for special cases. She gives actually two possible solutions to the transposition problem, and I quote her again:
For example, bit code "11000011" might be used to indicate that the computer should recognize, but not play, the following move*. While code "11000100" might mean that the position already reached is a transposition of moves elsewhere in the book. In that case the next byte or two bytes might contain a pointer to another location of the book. Transposition can also be handled by "walking" the tree to see if the given position is present somewhere in the book. With a minimum of optimization to avoid examining lines that cannot possibly lead to the given position, the time cost for walking the tree is quite reasonable.**
* this is to name openings but don't play them if they are losing. Or, precisely, to avoid opening traps.
** And this was written for 1983 hardware ;)
Yes I'm aware of that and I really like Sargon and know when it was written. Dan and Kathe are legends.
The idea with using a code that a move leads elsewhere in the book can handle some but not all transpositions. You would have to generate all lines which lead to the same position in the tree which is impossible of course. So I would go for hashing for sure :wink:

What I am missing however is an open opening book (tree) format what would

- support different hash sizes (64-bit or 128, 192 and 256-bit for large books)
- have exact formula and rules to calculate hashes (i.e. no tables)
- handle >4G files
- be platform independent and thread safe (in case it's needed)
- handle more than orthodox chess (variants etc.)
- be compact yet having small memory footprint and very good access time
- allow for dynamic updates
etc.

Not an easy task to do, what do you think? Would be very nice though. I can imagine open source lib under MIT license... sure polyglot is good and lightweight but...
Does something like that already exist?
User avatar
JuLieN
Posts: 2949
Joined: Mon May 05, 2008 12:16 pm
Location: Bordeaux (France)
Full name: Julien Marcel

Re: Database storage methods

Post by JuLieN »

Hmm... problem is that increasing the performance on one of those criteria would decrease performances on other ones. For instance, as H.G. noticed in the SCIDvsPC/Mac thread, adding handling of variants to SCID's db format would double its size.

So one always has to compromise on something: there's no perfect solution. :?
"The only good bug is a dead bug." (Don Dailey)
[Blog: http://tinyurl.com/predateur ] [Facebook: http://tinyurl.com/fbpredateur ] [MacEngines: http://tinyurl.com/macengines ]