tablebase caching / mmap() / page cache
Posted: Mon Oct 14, 2013 1:26 am
I'll start with my question.
If the OS needs to free a page of memory and it has the option of freeing a non-dirty page in the OS page cache (i.e. a page whose contents are already on disk, so it can simply be released without writing anything back), does the fact that this page has been memory mapped into the address space of a user mode process change the probability that the OS frees exactly that page?
The answer might of course depend on the OS being used.
I'll now try to explain the relevance of this question.
The probing code of most tablebase implementations seems to implement a cache for caching blocks of (usually decompressed) data. When the engine probes a position, the probing code first looks whether the position is in this cache. If it is not, a block of data is retrieved from the appropriate tablebase file using seek() and read(), decompressed and stored in the cache. I'll call this the classical approach.
My tablebase probing code takes a different approach. Data is compressed in small blocks of 32 or 64 bytes, which means that decompression of a single position is cheap enough that there is no need to cache decompressed blocks. (In fact, the decompression code does not really decompress but only decodes the 32- or 64-byte block without storing anything until the position is reached that is being probed and only retrieves the value of that position.)
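To illustrate what I mean by decoding without storing anything: the real encoding is of course not a run-length code, but this toy sketch shows the shape of it. The block is walked run by run and the walk stops as soon as the probed index is reached; no decompression buffer is ever written:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy encoding: the small block is a sequence of
 * (count, value) byte pairs.  To probe position `idx` we walk the
 * runs and return as soon as idx falls inside one. */
static int probe_block(const uint8_t *block, size_t block_len, size_t idx)
{
    size_t pos = 0;
    for (size_t i = 0; i + 1 < block_len; i += 2) {
        size_t count = block[i];
        if (idx < pos + count)
            return block[i + 1];   /* value of the probed position */
        pos += count;
    }
    return -1;                     /* idx lies beyond this block */
}
```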
Since there is no need to cache decompressed blocks, it is OK to cache only compressed blocks. This is good, since the same amount of RAM can store many more positions in compressed form than in decompressed form.
But my probing code does not maintain its own cache. Instead, it fully relies on the OS. When a program reads from a file using seek() and read(), the kernel caches part of the file into its "page cache". The kernel then copies the data being read from the page cache into the program's memory.
My probing code does not use seek() and read(), but simply mmap()s the whole file into its address space. Effectively, this allows the program to access the data in the page cache directly. There is no need to copy this data from the kernel's address space into the address space of the program.
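On POSIX the whole-file mapping amounts to very little code. A sketch (the function name is mine; error handling is reduced to returning `MAP_FAILED`):

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map an entire tablebase file read-only and shared, so that probes
 * read straight out of the kernel's page cache with no copy. */
static void *map_table(const char *path, size_t *len_out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return MAP_FAILED;
    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return MAP_FAILED;
    }
    void *base = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                 /* the mapping keeps the file alive */
    *len_out = (size_t)st.st_size;
    return base;
}
```

After this call the probing code simply indexes into `base` as if the file were an in-memory array.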
If we compare with the classical approach of using seek() and read(), then:
- in both approaches parts of the tablebase files being read are cached in the OS page cache;
- with the classical approach data is being copied from the page cache into the probing code's cache and the cached data exists both in the page cache and in the probing code's cache;
- with the mmap() approach the data is accessed directly, which avoids a data copy operation and avoids storing the same data twice.
So, apart from being very simple to implement, the use of mmap() also has very significant advantages. As a bonus one gets efficient multithreaded access for free, since the probing code does not need any locks (the only lock I have is taken when mmap()ing a file for the first time). The required thread synchronisation is already provided by the OS kernel.
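That single first-time lock can look like this sketch (names and the C11-atomic fast path are my own; the point is that once the mapping exists, probing threads read it with no locking at all):

```c
#include <assert.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical per-file descriptor.  The mutex is only contended the
 * first time a file is probed; afterwards the acquire-load fast path
 * is lock-free and the kernel handles all further synchronisation. */
struct table {
    const char      *path;
    _Atomic(void *)  base;     /* NULL until first mapped */
    size_t           len;
    pthread_mutex_t  init_lock;
};

static const unsigned char *table_data(struct table *t)
{
    void *p = atomic_load_explicit(&t->base, memory_order_acquire);
    if (p != NULL)
        return p;                          /* common case: no lock */
    pthread_mutex_lock(&t->init_lock);
    p = atomic_load_explicit(&t->base, memory_order_relaxed);
    if (p == NULL) {                       /* we are the first prober */
        int fd = open(t->path, O_RDONLY);
        struct stat st;
        if (fd >= 0 && fstat(fd, &st) == 0) {
            t->len = (size_t)st.st_size;
            p = mmap(NULL, t->len, PROT_READ, MAP_SHARED, fd, 0);
            if (p != MAP_FAILED)
                atomic_store_explicit(&t->base, p, memory_order_release);
            else
                p = NULL;
        }
        if (fd >= 0)
            close(fd);
    }
    pthread_mutex_unlock(&t->init_lock);
    return p;
}
```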
I found this blog post which explains the page cache and mmap() in more detail.
So does the mmap() approach have any disadvantages?
One thing that quickly becomes apparent is that the engine is shown in the task manager as using large amounts of memory. This is because all the tablebase data in the page cache is counted as (shared) memory used by the engine. With the classical approach the tablebase data in the page cache is not counted as memory used by the engine. However, even with the classical approach the data will still be in the page cache, so the difference in reported memory usage is for the most part just cosmetic.
What I am less sure about is what happens once almost all of the page cache is filled with tablebase data and a free memory page must be found for another page of tablebase data to be fetched from disk.
With the classical approach, it seems almost certain that the free memory page is found by releasing an existing page from the page cache. This means that other processes (e.g. other engines) are not in real danger of being (partially) swapped out to disk.
With the mmap() approach, it would make sense if exactly the same thing happened. But I do not know whether page cache data that is mmap()ed into the address space of a process is less likely to be released from the page cache. If it is, the mmap() approach might affect other processes/engines more than the classical approach does, because the OS could choose to swap out memory pages of other active processes/engines instead.
So this is why I posed the above question. Maybe someone here knows?
So to sum up: I believe there is a good chance that the mmap() approach used by my probing code has only advantages in terms of memory consumption, speed, simplicity and efficiency of multithreaded synchronisation and in particular (*) does not negatively affect the performance of other concurrently running processes more than the classical approach would. (The classical approach also uses almost unlimited amounts of page cache memory, but hides this.) But I am not completely sure about (*). This may also be dependent on the particular OS being used.
(For completeness I'll note that it is possible to use the classical approach while skipping the page cache, namely using O_DIRECT in Linux and NO_BUFFERING in Windows, see also the blog post I linked to, but I don't think current tablebase implementations do this and it would most likely severely degrade their performance. I'll also note that the classical approach allows tricks such as Gaviota's soft probes, which are non-trivial to implement with my mmap() approach. It might further be worth mentioning that the mmap() approach does not allow using 68 GB of 6-piece tables in a 32-bit engine, as the 4 GB virtual address space is simply too small. So 32-bit engines have to live with probing 5-piece tables only.)