tablebase caching / mmap() / page cache

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

tablebase caching / mmap() / page cache

Post by syzygy »

I'll start with my question.

If the OS needs to free a page of memory and it has the option of freeing a non-dirty page in the OS page cache (i.e. a memory page that can simply be released without anything having to be written to disk, as its contents are already on disk), does it make any difference to the probability that the OS frees exactly that page whether the page has been memory-mapped into the address space of a user-mode process?

The answer might of course depend on the OS being used.

I'll now try to explain the relevance of this question.

The probing code of most tablebase implementations seems to implement a cache for blocks of (usually decompressed) data. When the engine probes a position, the probing code first checks whether the position is in this cache. If it is not, a block of data is retrieved from the appropriate tablebase file using seek() and read(), decompressed and stored in the cache. I'll call this the classical approach.
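
In rough C++ terms the classical probe path looks something like this (a hypothetical sketch, not any particular engine's code: the block size, the names and the missing eviction logic are all made up, and pread() stands in for the seek()+read() pair):

Code:

#include <cstdint>
#include <unordered_map>
#include <vector>
#include <unistd.h>   // pread() = seek() + read() in one call

constexpr size_t BLOCK_SIZE = 4096;   // hypothetical compressed block size

// Stand-in for the real decompression routine.
std::vector<uint8_t> decompress(const uint8_t *src, size_t len)
{
    return std::vector<uint8_t>(src, src + len);   // placeholder: just copy the bytes
}

// Cache of decompressed blocks, keyed by block number.
// Eviction and locking are omitted; real engines bound this and protect it.
std::unordered_map<uint64_t, std::vector<uint8_t>> cache;

int probe(int fd, uint64_t block_no, size_t offset_in_block)
{
    auto it = cache.find(block_no);
    if (it == cache.end()) {                       // cache miss: read from file, decompress, store
        uint8_t raw[BLOCK_SIZE];
        if (pread(fd, raw, BLOCK_SIZE, (off_t)(block_no * BLOCK_SIZE)) != (ssize_t)BLOCK_SIZE)
            return -1;
        it = cache.emplace(block_no, decompress(raw, BLOCK_SIZE)).first;
    }
    return it->second[offset_in_block];            // value of the probed position
}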

My tablebase probing code takes a different approach. Data is compressed in small blocks of 32 or 64 bytes, which means that decompression of a single position is cheap enough that there is no need to cache decompressed blocks. (In fact, the decompression code does not really decompress but only decodes the 32- or 64-byte block without storing anything until the position is reached that is being probed and only retrieves the value of that position.)
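
To illustrate the "decode without storing anything" idea with a toy run-length format (this is NOT my actual encoding, just the principle of stopping as soon as the probed position is reached):

Code:

#include <cstdint>
#include <cstddef>

// Toy format: each byte holds a 2-bit W/D/L-style value and a 6-bit run length.
int probe_in_block(const uint8_t *block, size_t block_len, unsigned target_idx)
{
    unsigned pos = 0;
    for (size_t i = 0; i < block_len; i++) {
        int value    = block[i] & 3;           // value of this run
        unsigned run = (block[i] >> 2) + 1;    // run of 1..64 positions
        if (target_idx < pos + run)
            return value;                      // reached the probed position: stop here
        pos += run;
    }
    return -1;                                 // index past the end of the block
}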

Since there is no need to cache decompressed blocks, it is OK to cache only compressed blocks. This is good, since the same amount of RAM can store many more positions in compressed form than in decompressed form.

But my probing code does not maintain its own cache. Instead, it fully relies on the OS. When a program reads from a file using seek() and read(), the kernel caches part of the file into its "page cache". The kernel then copies the data being read from the page cache into the program's memory.

My probing code does not use seek() and read(), but simply mmap()s the whole file into its address space. Effectively, this allows the program to access the data in the page cache directly. There is no need to copy this data from the kernel's address space into the address space of the program.
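
On POSIX systems this amounts to little more than the following (a sketch with error handling trimmed; on Windows the equivalent would be CreateFileMapping()/MapViewOfFile()):

Code:

#include <cstdint>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Map a tablebase file read-only; the returned pointer can be indexed like an
// ordinary array, and pages are faulted in from the page cache on demand.
const uint8_t *map_table(const char *path, size_t *size_out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return nullptr;
    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return nullptr; }
    void *p = mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                               // the mapping stays valid after close()
    if (p == MAP_FAILED)
        return nullptr;
    *size_out = (size_t)st.st_size;
    return (const uint8_t *)p;
}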

If we compare with the classical approach of using seek() and read(), then:
- in both approaches parts of the tablebase files being read are cached in the OS page cache;
- with the classical approach data is copied from the page cache into the probing code's cache, so the cached data exists both in the page cache and in the probing code's cache;
- with the mmap() approach the data is accessed directly, which avoids a data copy operation and avoids storing the same data twice.

So, apart from being very simple to implement, the use of mmap() also has very significant advantages. As a bonus one gets efficient multithreaded access for free, since the probing code does not need any locks (I only take one when mmap()ing a file for the first time). The required thread synchronisation is already provided by the OS kernel.
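
The "lock only on first use" part can be as simple as this (a sketch, not my actual code; map_table() is the helper sketched above):

Code:

#include <atomic>
#include <cstdint>
#include <mutex>

const uint8_t *map_table(const char *path, size_t *size_out);   // as sketched above

struct Table {
    std::atomic<const uint8_t *> data{nullptr};   // null until the file is mapped
    std::mutex init_lock;                         // taken only for the first mapping
    size_t size = 0;
};

// Probing threads call this freely; after the first call for a table it is a
// single lock-free atomic load.
const uint8_t *table_data(Table &t, const char *path)
{
    const uint8_t *p = t.data.load(std::memory_order_acquire);
    if (p)
        return p;
    std::lock_guard<std::mutex> guard(t.init_lock);
    p = t.data.load(std::memory_order_relaxed);
    if (!p) {                                     // first thread to get here maps the file
        p = map_table(path, &t.size);
        t.data.store(p, std::memory_order_release);
    }
    return p;
}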

I found this blog post which explains the page cache and mmap() in more detail.

So does the mmap() approach have any disadvantages?

One thing that quickly becomes apparent is that the engine is shown in the task manager as using large amounts of memory. This is because all the tablebase data in the page cache is counted as (shared) memory used by the engine. With the classical approach the tablebase data in the page cache is not counted as memory used by the engine. However, even with the classical approach the data will still be in the page cache, so the difference in reported memory usage is for the most part just cosmetic.

What I am less sure about is what happens once almost all of the page cache is filled with tablebase data and a free memory page must be found for another page of tablebase data to be fetched from disk.

With the classical approach, it seems almost certain that the free memory page is found by releasing an existing page from the page cache. This means that other processes (e.g. other engines) are not in real danger of being (partially) swapped out to disk.

With the mmap() approach, it would make sense if exactly the same thing happened. But I do not know whether page cache data being mmap()d into the address space of a process makes it less likely that the data is released from the page cache. If it does, the mmap() approach might affect other processes/engines more than the classical approach, because the OS might choose to swap out memory pages of other active processes/engines instead.

So this is why I posed the above question. Maybe someone here knows?

So to sum up: I believe there is a good chance that the mmap() approach used by my probing code has only advantages in terms of memory consumption, speed, simplicity and efficiency of multithreaded synchronisation, and in particular (*) does not negatively affect the performance of other concurrently running processes more than the classical approach would. (The classical approach also uses almost unlimited amounts of page cache memory, but hides this.) But I am not completely sure about (*). This may also depend on the particular OS being used.

(For completeness I'll note that it is possible to use the classical approach while skipping the page cache, namely using O_DIRECT in Linux and NO_BUFFERING in Windows, see also the blog post I linked to, but I don't think current tablebase implementations do this and it would most likely severely degrade their performance. I'll also note that the classical approach allows tricks such as Gaviota's soft probes, which are non-trivial to implement with my mmap() approach. It might further be worth mentioning that the mmap() approach does not allow a 32-bit engine to use the 68 GB of 6-piece tables, as the 4 GB virtual address space is simply too small. So 32-bit engines have to live with probing 5-piece tables only.)
Houdini
Posts: 1471
Joined: Tue Mar 16, 2010 12:00 am

Re: tablebase caching / mmap() / page cache

Post by Houdini »

I don't have an answer to your question, but I'd say that it doesn't really matter.

Having run some pretty thorough testing of the Syzygy bases, I can say that even under stress situations the algorithm behaves very well.
After lengthy usage - say 500 games without unloading the engine - I've noticed that the private memory usage of the engine process can reach 2 GB when using the 6-men Syzygy.
But this 2 GB private memory is not counted by Task Manager in the total system memory load. When 23 similar engine processes are running on a server with 32 GB RAM, and *each* is using private memory of 1.5 GB to 2 GB, the Task Manager reports a total memory usage of only about 10 GB (and not as one might expect 23 x 2GB = 46 GB). Even in this situation the table base performance remains good.

Overall you've done an excellent job; this is the first public EGTB solution that keeps its probing performance in multi-threaded situations. Houdini can now run multiple threads performing probes at low depth, while maintaining about 90% of the normal node speed (instead of the typical 10% to 40% node speed obtained with the other EGTB solutions I've tested).

Thank you!

Robert
Rein Halbersma
Posts: 741
Joined: Tue May 22, 2007 11:13 am

Re: tablebase caching / mmap() / page cache

Post by Rein Halbersma »

syzygy wrote:
The probing code of most tablebase implementations seems to implement a cache for caching blocks of (usually decompressed) data. When the engine probes a position, the probing code first looks whether the position is in this cache. If it is not, a block of data is retrieved from the appropriate tablebase file using seek() and read(), decompressed and stored in the cache. I'll call this the classical approach.
Thanks for this detailed explanation! Modern 10x10 draughts programs (such as Kingsrow) use the approach pioneered by the 8x8 checkers program Chinook. See their website and Schaeffer's book for all the gory details. It combines your compressed memory storage with a hand-written multi-level caching algorithm.

Let's do some numbers to explain that. Suppose a 512 GB bitbase (WLD) on disk, and 24 GB of RAM available for raw caching. On disk, there is also an index file that stores the starting position index of each 4K block. With 128M blocks x 8 bytes per position index, that gives 1 GB of indexing data that is in memory at all times.

In memory, 4K blocks of uncompressed data are being cached. So 6M cache blocks fit into memory. The cache is managed by a hash table of 6M entries and a 6M-element doubly linked list maintaining the LRU order. On top of that, for each cached 4K block, 8 bytes per 64-byte segment (the starting position index of that segment within the block) are also cached. Effectively, this increases the size of each 4K block by 1/8th to 4.5K, bringing the total memory of the entire cache to about 28 GB.

Probing a position is done by doing a binary search over the index file (27 steps for 128M blocks on disk) to find the 4K block, a hash probe to see whether that block is in RAM, and if so, a binary search over the starting indices of its 64-byte segments (6 steps for 64 segments per 4K block), and then decompressing the 64 run-length encoded bytes of that segment to retrieve the value. If the block in question was not in RAM (which the hash probe would indicate), it is loaded from disk and the LRU list is updated (typically under a lock in multithreaded code).
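
In C++ terms, the hash table plus doubly linked list could look roughly like this (a simplified sketch with made-up names, not Kingsrow's actual code; locking and the disk load on a miss are left to the caller):

Code:

#include <cstdint>
#include <list>
#include <unordered_map>
#include <utility>
#include <vector>

struct LruBlockCache {
    using Block = std::vector<uint8_t>;
    using List  = std::list<std::pair<uint64_t, Block>>;   // front = most recently used

    size_t capacity;                          // e.g. ~6M cached blocks
    List lru;
    std::unordered_map<uint64_t, List::iterator> map;

    // Look up a block; on a hit, move it to the front of the LRU list.
    Block *get(uint64_t block_no) {
        auto it = map.find(block_no);
        if (it == map.end())
            return nullptr;                   // miss: caller reads the block from disk
        lru.splice(lru.begin(), lru, it->second);
        return &it->second->second;
    }

    // Insert a freshly loaded block, evicting the least recently used one if full.
    void put(uint64_t block_no, Block blk) {
        if (lru.size() >= capacity) {
            map.erase(lru.back().first);
            lru.pop_back();
        }
        lru.emplace_front(block_no, std::move(blk));
        map[block_no] = lru.begin();
    }
};

On a probe, get() is tried first; on a miss the block is read from disk and handed to put(), typically under a lock in multithreaded code.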

This multi-level approach keeps as much as possible of the disk-based bitbases in RAM, while still achieving fast lookups. However, in late middle game positions the NPS can drop to about 1/10th of the speed without bitbases, although this figure depends heavily on whether probing is done unconditionally in all nodes or whether some priority criterion is employed.

What kind of NPS drop does your mmap() approach lead to? Is the speed relatively unaffected by the probing? Note that in draughts, the drop in speed is more than compensated by the enormously valuable information from the bitbases, since 9-piece databases are now available that will be hit frequently with as many as 20 pieces on the board.

Your mmap() approach sounds extremely interesting, and it's something that's high on my list to study. As you said, it saves a lot of boilerplate compared to a sophisticated hand-written multi-level cache that the OS has already implemented for you (and is also running behind the scenes for you!).
syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

Re: tablebase caching / mmap() / page cache

Post by syzygy »

Houdini wrote:I don't have an answer to your question, but I'd say that it doesn't really matter.

Having run some pretty thorough testing of the Syzygy bases, I can say that even under stress situations the algorithm behaves very well.
After lengthy usage - say 500 games without unloading the engine - I've noticed that the private memory usage of the engine process can reach 2 GB when using the 6-men Syzygy.
But this 2 GB private memory is not counted by Task Manager in the total system memory load. When 23 similar engine processes are running on a server with 32 GB RAM, and *each* is using private memory of 1.5 GB to 2 GB, the Task Manager reports a total memory usage of only about 10 GB (and not as one might expect 23 x 2GB = 46 GB). Even in this situation the table base performance remains good.
I am happy to hear that performance is good on Windows even under such conditions, because I have almost exclusively tested on Linux.

I can't explain your observations on private memory usage. On the one hand I would expect it to be lower; on the other hand I would expect 23 engines each using 2 GB of private memory on a 32 GB system to lead to a lot of swapping. Maybe private memory usage as reported by Windows is not entirely accurate for some reason...
Over-all you've done an excellent job, this is the first public EGTB solution that keeps its probing performance in multi-threaded situations. Houdini can now run multiple threads performing probes at low depth, while maintaining about 90% of the normal node speed (instead of the typical 10% to 40% node speed obtained with the other EGTB solutions I've tested).

Thank you!
You're welcome!
Houdini
Posts: 1471
Joined: Tue Mar 16, 2010 12:00 am

Re: tablebase caching / mmap() / page cache

Post by Houdini »

syzygy wrote:I can't explain your observations on private memory usage. One the one hand I would expect it to be lower, on the other hand I would expect 23 engines each using 2 GB of private memory on a 32 GB system to lead to a lot of swapping. Maybe private memory usage as reported by Windows is not entirely accurate for some reason...
Disk activity remained low throughout the multiple hour test run - just the sporadic short read operations one expects from table base activity.
At no point in time was there any sign of Windows swapping. Windows does a pretty good job with memory-mapped files, apparently :).
syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

Re: tablebase caching / mmap() / page cache

Post by syzygy »

Rein Halbersma wrote:Thanks for this detailed explanation! Modern 10x10 draughts programs (such as Kingsrow) use the approach pioneered by the 8x8 checkers program Chinook. See their website and Schaeffer's book for all the gory details. It combines your compressed memory storage with a hand-written multi-level caching algorithm.
I've certainly taken inspiration from the Chinook papers.
Let's do some numbers to explain that. Suppose a 512 GB bitbase (WLD) on disk, and 24 GB of RAM available for raw caching. On disk, there is also an index file that stores the starting position index of each 4K block. With 128M blocks x 8 bytes per position index, that gives 1 GB of indexing data that is in memory at all times.
I treat the indexing data the same as the actual tablebase data: everything is mmap()'d into memory and paged in by the kernel when needed.

I have briefly described the structure of my indexing data in this post. For tables with 64-byte compressed blocks, the indexing overhead is a bit more than 2/64 = 3.125%. For tables with 32-byte compressed blocks (which means they compress very well... blocks contain at most 65536 positions) the overhead is a bit more than 2/32 = 6.25%.
In memory, 4K blocks of uncompressed data are being cached. So 6M cache blocks fit into memory.
I think you meant 4K blocks of compressed data.
The cache is managed by a hash table of 6M entries and a 6M-element doubly linked list maintaining the LRU order. On top of that, for each cached 4K block, 8 bytes per 64-byte segment (the starting position index of that segment within the block) are also cached. Effectively, this increases the size of each 4K block by 1/8th to 4.5K, bringing the total memory of the entire cache to about 28 GB.
8 bytes per 64-byte segment is a lot (but it seems the scheme would also have worked with say 6 bytes). Instead of position indices I store the number of positions per block, which I limit to 65536 so that 2 bytes suffice. Using the main index I get reasonably close to the block I need, then add or subtract consecutive block sizes (which are consecutive in memory) until I know which block contains the position I need.
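
Roughly like this (a simplified sketch: it rounds the main index down instead of to the nearest entry, and the actual on-disk layout of the index and of the block-size table is different):

Code:

#include <cstdint>

// Assumed layout: one index entry every 'index_step' positions, giving the
// block containing that position and the position's offset within the block.
struct IndexEntry { uint32_t block; uint32_t offset; };

// size_table[b] = number of positions in block b (stored here as the exact
// count for simplicity; the real format caps it at 65536 and uses 2 bytes).
uint32_t find_block(const IndexEntry *index, const uint32_t *size_table,
                    uint64_t pos_index, uint64_t index_step, uint32_t *offset_out)
{
    const IndexEntry &e = index[pos_index / index_step];      // nearest indexed position below
    uint32_t block = e.block;
    uint64_t remaining = e.offset + pos_index % index_step;   // distance past the indexed position
    while (remaining >= size_table[block])                    // skip whole blocks
        remaining -= size_table[block++];
    *offset_out = (uint32_t)remaining;                        // offset of the position in its block
    return block;
}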

If the limit of 65536 positions results in too many half-empty blocks, I reduce the size of each block to 32 bytes. (For DTZ I used 1024 bytes per block, or a smaller power of 2 if compression is "too good".) My compression is such that 1 compressed bit corresponds to at most 256 uncompressed bytes, so that 32-byte blocks will always be full.
Probing a position is done by doing a binary search over the index file (27 steps for 128M blocks on disk) to find the 4K block, a hash probe to see whether that block is in RAM, and if so, a binary search over the starting indices of its 64-byte segments (6 steps for 64 segments per 4K block), and then decompressing the 64 run-length encoded bytes of that segment to retrieve the value. If the block in question was not in RAM (which the hash probe would indicate), it is loaded from disk and the LRU list is updated (typically under a lock in multithreaded code).
My approach seems to require considerably fewer steps. It does need more parameterisation per table because of varying compression ratios, but that is not a very big deal.
This multi-level approach keeps as much as possible of the disk-based bitbases in RAM, while still achieving fast lookups. However, in late middle game positions the NPS can drop to about 1/10th of the speed without bitbases, although this figure depends heavily on whether probing is done unconditionally in all nodes or whether some priority criterion is employed.
Yes, we can't easily compare checkers and chess here.
What kind of NPS drop does your mmap() approach lead to? Is the speed relatively unaffected by the probing? Note that in draughts, the drop in speed is more than compensated by the enormously valuable information from the bitbases, since 9-piece databases are now available that will be hit frequently with as many as 20 pieces on the board.
With the tables on SSD (or once most of the relevant parts have been paged into memory) speed is relatively unaffected. When my engine plays on FICS, I see nps increase as more and more pieces are captured and this does not change when it starts to heavily hit the tablebases.

To my (admittedly untrained) eye, it improves endgame play considerably. It does not make the same difference as in checkers though (which I assume can be compared in this respect to suicide chess, where tablebases are extremely useful).
Your mmap() approach sounds extremely interesting, and it's something that's high on my list to study. As you said, it saves a lot of boilerplate compared to a sophisticated hand-written multi-level cache that the OS has already implemented for you (and is also running behind the scenes for you!).
It is the lazy way out, but might be hard to beat even with sophisticated dedicated caches.

Another advantage is that there is almost no startup time, and that the cached data survives between engine restarts.

It should be relatively easy to replace the LRU cache by an mmap() implementation. It also seems possible to use my indexing scheme either in combination with the run-length encoding currently in use or in combination with my compression scheme, but that would require regenerating (or reformatting) the data. My compression scheme will be somewhat slower to decompress than run-length encoding schemes. I don't know how the compression ratios will compare.
Rein Halbersma
Posts: 741
Joined: Tue May 22, 2007 11:13 am

Re: tablebase caching / mmap() / page cache

Post by Rein Halbersma »

syzygy wrote:
Rein Halbersma wrote:
In memory, 4K blocks of uncompressed data are being cached. So 6M cache blocks fit into memory.
I think you meant 4K blocks of compressed data.
Yes, that was a typo!
8 bytes per 64-byte segment is a lot (but it seems the scheme would also have worked with say 6 bytes). Instead of position indices I store the number of positions per block, which I limit to 65536 so that 2 bytes suffice. Using the main index I get reasonably close to the block I need, then add or subtract consecutive block sizes (which are consecutive in memory) until I know which block contains the position I need.

If the limit of 65536 positions results in too many half-empty blocks, I reduce the size of each block to 32 bytes. (For DTZ I used 1024 bytes per block, or a smaller power of 2 if compression is "too good".) My compression is such that 1 compressed bit corresponds to at most 256 uncompressed bytes, so that 32-byte blocks will always be full.
I wrote 8 bytes per position so that 64-bit pointers could be used, but you are right that this is overkill. IIRC, the Kingsrow author uses 32-bit indices, which cuts this overhead in half. Still, your 2 bytes is even more attractive.
To my (admittedly untrained) eye, it improves endgame play considerably. It does not make the same difference as in checkers though (which I assume can be compared in this respect to suicide chess, where tablebases are extremely useful).
Yes, I think that in regular chess it just makes you win quicker and more elegantly (which gives the appearance of increased strength) whereas in checkers a tablebase probe is often enough to determine the difference between a draw and a win/loss.
Your mmap() approach sounds extremely interesting, and it's something that's high on my list to study. As you said, it saves a lot of boilerplate compared to a sophisticated hand-written multi-level cache that the OS has already implemented for you (and is also running behind the scenes for you!).
It is the lazy way out, but might be hard to beat even with sophisticated dedicated caches.

Another advantage is that there is almost no startup time, and that the cached data survives between engine restarts.

It should be relatively easy to replace the LRU cache by an mmap() implementation. It also seems possible to use my indexing scheme either in combination with the run-length encoding currently in use or in combination with my compression scheme, but that would require regenerating (or reformatting) the data. My compression scheme will be somewhat slower to decompress than run-length encoding schemes. I don't know how the compression ratios will compare.
The Boost.Iostreams C++ library has a platform-independent interface for memory-mapped files, which also lets you treat such files as STL sequences (iterator and insertion support). It also has built-in support for streaming to/from zipped (.zip, .gz, .bz2) files, although I haven't looked into the details yet. This looks like an attractive solution for C++ programs.
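
A minimal example (the file name is chosen just for illustration):

Code:

#include <boost/iostreams/device/mapped_file.hpp>
#include <cstdint>
#include <iostream>

int main()
{
    // Map a tablebase file read-only; it is unmapped automatically when
    // 'file' goes out of scope.
    boost::iostreams::mapped_file_source file("KRvK.rtbw");
    const uint8_t *data = reinterpret_cast<const uint8_t *>(file.data());
    std::cout << "mapped " << file.size() << " bytes, first byte = "
              << unsigned(data[0]) << "\n";
}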

The Chinook-style compression uses a table lookup that encodes in each byte either a 4-tuple of WDL values (essentially a balanced 3-ary tree of height 4, so 3^4 = 81 entries) or a run length of a single value (the remaining 256 - 81 = 175 entries). This can be improved quite a bit by using a combination of Tunstall codes (an unbalanced 3-ary tree with the most frequently occurring sequences as the nodes) and run-length encoding. The nice thing is that it is a variable-to-fixed compression, so that a hardware bit error in one byte does not propagate over the entire database. Does your compression also have the property that each 64-byte block is independent of the other blocks?
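
For illustration, a toy decoder for such a byte code (how the 175 run codes are split over values and run lengths is made up here, it is not the actual Chinook table):

Code:

#include <cstdint>
#include <vector>

// Byte values 0..80 encode a 4-tuple of W/D/L values (3^4 = 81 combinations);
// values 81..255 encode a run of a single value.
void decode_byte(uint8_t b, std::vector<uint8_t> &out)
{
    if (b < 81) {                      // 4-tuple: the base-3 digits of b
        uint8_t t = b;
        for (int i = 0; i < 4; i++) {
            out.push_back(t % 3);      // 0 = win, 1 = draw, 2 = loss (say)
            t /= 3;
        }
    } else {
        unsigned code  = b - 81;       // 0..174
        unsigned value = code % 3;
        unsigned run   = code / 3 + 5; // assumed: runs of 5..63 positions
        out.insert(out.end(), run, (uint8_t)value);
    }
}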
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: tablebase caching / mmap() / page cache

Post by hgm »

Running two programs on the same machine is never completely safe. There are many ways of 'electronic warfare' that a maliciously written program could engage in, trying to swamp resources, even when staying within explicit restrictions on use of the cores or memory.

For example on i7 architectures, as well as on the older Core 2 Duo, some cache levels are shared between processes. In a ponder-on game it might be more profitable to spend your time thrashing the opponent's cache to slow him down while he is thinking than to ponder yourself. (And with clever thrashing code that focuses on cache lines that are heavily used by the opponent, and not by you, you might even be able to do both.) Magic-bitboard programs usually need more cache memory than the 256KB provided by the i7 L2, and denying them the use of L3 will hurt them badly. If you can slow them down by a factor >1.5 it is already better than pondering yourself.

Trying to conquer physical memory by flooding the file cache would just be another method. The situation you describe might be a nice way to cheat on hash-table usage: put the hash table in a file large enough to fill the memory you are not supposed to use, and make sure it remains cached by having a low-CPU-consumption thread make dummy accesses to each page of the file. Good thing this would show up in the task manager. Would testers accept the excuse that this doesn't reflect your own memory usage, but just mapped file-caching memory from the OS?
syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

Re: tablebase caching / mmap() / page cache

Post by syzygy »

Rein Halbersma wrote:
To my (admittedly untrained) eye, it improves endgame play considerably. It does not make the same difference as in checkers though (which I assume can be compared in this respect to suicide chess, where tablebases are extremely useful).
Yes, I think that in regular chess it just makes you win quicker and more elegantly (which gives the appearance of increased strength) whereas in checkers a tablebase probe is often enough to determine the difference between a draw and a win/loss.
I am pretty sure using 6-piece tables measurably increases strength, but not as drastically as with checkers or suicide.
The Boost.Iostreams C++ library has a platform-independent interface for memory-mapped files, which also lets you treat such files as STL sequences (iterator and insertion support). It also has built-in support for streaming to/from zipped (.zip, .gz, .bz2) files, although I haven't looked into the details yet. This looks like an attractive solution for C++ programs.
I don't think you need the streaming functionality. But platform-independent memory mapping saves a bit of work, yes.
The Chinook-style compression uses a table lookup that encodes in each byte either a 4-tuple of WDL values (essentially a balanced 3-ary tree of height 4, so 3^4 = 81 entries) or a run length of a single value (the remaining 256 - 81 = 175 entries). This can be improved quite a bit by using a combination of Tunstall codes (an unbalanced 3-ary tree with the most frequently occurring sequences as the nodes) and run-length encoding.
I might have a look into this. At some point I was considering adding a run-length encoding option, but my first experiments weren't very promising.

A complication is that my WDL tables can have up to 5 values (because I distinguish between real wins/losses and wins/losses that are drawn by the 50-move rule). But of course not all tables are affected by the 50-move rule.
The nice thing is that it is a variable-to-fixed compression, so that a hardware bit error in one byte does not propagate over the entire database. Does your compression also have the property that each 64-byte block is independent of the other blocks?
Yes, each 64-byte block can be decompressed independently of any other block. This is essential for making random access possible. Each table has a Huffman code for encoding up to 4095 symbols. Each symbol corresponds either to a possible position value (e.g. W/D/L) or to the concatenation of two other symbols. The symbols are created by repeatedly looping through the table and replacing consecutive pairs of symbols with high frequency by new symbols. When this is done, I create a Huffman code for the resulting alphabet and basically fill up the blocks by doing a final loop through the table and encoding each symbol.
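
The symbol-building step is essentially this (a rough, unoptimised sketch; the real generator is more involved and also has to deal with the 5-value case and with filling the blocks afterwards):

Code:

#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Repeatedly replace the most frequent adjacent pair of symbols with a new
// symbol, up to a maximum alphabet size.  Symbols 0..2 are the base W/D/L
// values; pairs[k] records what symbol 3+k expands to.  A Huffman code is
// then built over the resulting alphabet (not shown).
void build_symbols(std::vector<uint16_t> &seq,
                   std::vector<std::pair<uint16_t, uint16_t>> &pairs,
                   unsigned max_symbols)
{
    while (3 + pairs.size() < max_symbols) {
        std::map<std::pair<uint16_t, uint16_t>, uint64_t> freq;
        for (size_t i = 0; i + 1 < seq.size(); i++)
            freq[{seq[i], seq[i + 1]}]++;
        auto best = freq.begin();
        for (auto it = freq.begin(); it != freq.end(); ++it)
            if (it->second > best->second)
                best = it;
        if (best == freq.end() || best->second < 2)
            break;                                 // nothing left worth replacing
        uint16_t new_sym = (uint16_t)(3 + pairs.size());
        pairs.push_back(best->first);
        std::vector<uint16_t> out;                 // rewrite the sequence, left to right
        for (size_t i = 0; i < seq.size(); ) {
            if (i + 1 < seq.size() && seq[i] == best->first.first
                                   && seq[i + 1] == best->first.second) {
                out.push_back(new_sym);
                i += 2;
            } else {
                out.push_back(seq[i]);
                i += 1;
            }
        }
        seq.swap(out);
    }
}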
Rein Halbersma
Posts: 741
Joined: Tue May 22, 2007 11:13 am

Re: tablebase caching / mmap() / page cache

Post by Rein Halbersma »

syzygy wrote: I might have a look into this. At some point I was considering adding a run-length encoding option, but my first experiments weren't very promising.

A complication is that my WDL tables can have up to 5 values (because I distinguish between real wins/losses and wins/losses that are drawn by the 50-move rule). But of course not all tables are affected by the 50-move rule.
The Chinook project also pioneered partial information databases (at least a draw, at most a draw), that also have this type of values. There is a paper on their website about this. IIRC, compression suffers a bit, but it's the same principle.
The nice thing is that it is a variable-to-fixed compression, so that a hardware bit error in one byte does not propagate over the entire database. Does your compression also have the property that each 64-byte block is independent of the other blocks?
Yes, each 64-byte block can be decompressed independently of any other block. This is essential for making random access possible. Each table has a Huffman code for encoding up to 4095 symbols. Each symbol corresponds either to a possible position value (e.g. W/D/L) or to the concatenation of two other symbols. The symbols are created by repeatedly looping through the table and replacing consecutive pairs of symbols with high frequency by new symbols. When this is done, I create a Huffman code for the resulting alphabet and basically fill up the blocks by doing a final loop through the table and encoding each symbol.
Maybe I should read your code in more detail, but can you explain how you use Huffman coding to get a fixed-length 64-byte code block? I thought Huffman was a form of fixed-to-variable encoding (each fixed-size input character mapped to a variable-size bitstring)?