bob wrote: ↑Mon Jul 16, 2018 5:41 am
I don't quite follow that "threads are more resource heavy under Linux." In fact, it is the exact opposite. Threads intentionally share everything, whereas processes (via fork()) share very little (at least they share the executable code and anything that is never modified, an artifact of the "copy on write" approach fork() uses). Given the choice, threads are the least resource-intensive way to do multiprocessing.
This is wrong. What you describe is almost true under Windows. However, threads under Linux are essentially fully-loaded "processes": they share nothing more than what you would get from fork(), and in the recent past they were actually implemented via fork().
Threads must share their address space for global data, which separate processes certainly would not.
But I think Bob has a point: in Unix/Linux, after fork(), you share basically everything. Even when you fork off processes, they share data that they are not supposed to share, and as long as that data is not modified they will continue sharing it. The data is then kept in memory in write-protected pages, although it formally should be writable, to catch any attempt by the sharing processes to write there. When that happens, the resulting 'segfault' is handled by the OS, which duplicates the page, gives the writing process its own (now writable) copy in its page table, and redoes the write. So the only things fork() does are create a new entry in the process table and mark all pages in the paging unit write-protected.
Because this is just an efficiency trick, supposed to be transparent to the user, I can imagine that the task manager would report each process as having its own private memory, even while it is still shared, because formally it is not shared at all. But eventually all data that is supposed to be private and writable will be duplicated. This is a lot more for processes than for threads, though.
The 8192K is easily explained: that is the default stack size for a thread under Linux, so launching a thread immediately needs 8 MB just for running the thread. However, it's configurable, and I don't think SF really needs that much stack.
These 65404K blocks seem to be the per-thread heap: https://stackoverflow.com/questions/475 ... g-a-thread - their idea is to avoid malloc congestion if several threads malloc and free, so each of them gets its own heap.
If you add up 64M and 8M, that's pretty much the 75M per thread that was observed. It's almost entirely Linux sucking up RAM like crazy. But I guess most of that is paged away somehow so that it won't actually matter unless these 64M heaps are really used, which they aren't.
Yes, they are probably harmless, unless it is over-committing total memory and swapping out other pages.
My point is that unless engines intentionally do these things, it is tricky to have total control over how they behave under various circumstances; tweaks are inevitable.
I wonder how much of this reported memory use is actually real. I think the map is for the address space, not for physical memory. Paging should normally be such that only pages that are actually used get physical memory assigned to them. Otherwise they just live in the swap space on disk, and never occupy memory.
E.g. if I allocate 8GB of TT, initially it will all be filled with zeros, and it will in fact just occupy 4KB of physical memory (1 page). Which then probably is shared with other processes that also have memory areas filled with zeros. Only when I start writing the TT physical memory will get assigned and used. But the 8GB would appear in the memory map of the (virtual) address space.
I think unallocated memory on the heap is also zero-filled, i.e. it does not use any physical memory until it gets allocated and then written. The page table would just refer to this zero-filled 4K page that the OS keeps anyway. Only the page table itself would need some extra space to describe this virtual memory, but that is only 1/512 of the size of the memory it describes (I think, in the x64 architecture). And that then probably gets swapped to disk when physical memory gets scarce, and the corresponding pages are not used.
So if Ras' explanation is correct, it seems the 75MB/thread is almost entirely virtual, not making any demand on physical memory at all, other than what is needed for the system stack.
I guess the moral lesson is that the OS (at least in the case of Linux) is pretty smart, and only wastes virtual resources, which are absolutely free, to prepare for some worst-case scenario. For physical memory it only uses what is actually needed. Which in the case of Stockfish would only be a very small fraction of this reserved address space.
hgm wrote: ↑Mon Jul 16, 2018 9:03 pm
Only when I start writing the TT physical memory will get assigned and used. But the 8GB would appear in the memory map of the (virtual) address space.
Which is why, after allocating the hash memory, I fill it up with dummy data, then __sync_synchronize(), then zero it again, and __sync_synchronize() once more. This forces the OS to actually blend in the pages. Especially with super-fast time controls like 1 second per game, this is a notable time gain if the "setoption hash" command is implemented so that it blocks the answer to "isready" until the hash table allocation is finished.
Under Windows, you can see the difference neatly. If you don't do it like I do and just allocate the hash tables, the memory usage in the Task Manager won't jump up right away - but it increases while the engine calculates, because more pages of the hash tables are blended in through usage.
I think unallocated memory on the heap is also zero-filled
Yes and no. Yes if it is "fresh" memory that the process has never allocated before. Otherwise, the process might see data from other processes which would be a security issue. No if the process allocates memory that it had malloc'ed and free'd before, in which case the process may see its own garbage left in memory. On the other hand, if the process uses calloc instead of malloc, it will always get zero initialised memory (and the potential multiplication overflow of the malloc call is eliminated).
The page table would just refer to this zero-filled 4K page that the OS keeps anyway.
bob wrote: ↑Mon Jul 16, 2018 5:41 am
I don't quite follow that "threads are more resource heavy under Linux." In fact, it is the exact opposite. Threads intentionally share everything, whereas processes (via fork()) share very little (at least they share the executable code and anything that is never modified, an artifact of the "copy on write" approach fork() uses). Given the choice, threads are the least resource-intensive way to do multiprocessing.
This is wrong. What you describe is almost true under Windows. However, threads under Linux are essentially fully-loaded "processes": they share nothing more than what you would get from fork(), and in the recent past they were actually implemented via fork().
Sorry, but THIS stuff I know inside and out. Threads are FAR more lightweight than fork() processes. That is the entire point for having threads and the clone() system call. Threads share their entire virtual address space, unlike processes spawned by fork(). Using fork() requires some effort to share things like hash (using mmap() or shmget()/shmat() or something). Using threads you do nothing, they are already visible to all threads.
Last time I looked, Linux still used the clone() system call. But I have not looked at it carefully in maybe a year.
So the bottom line seems to be that this 38GB is not real, but virtual memory, which will remain completely untouched by Stockfish. Thus Stockfish should have no trouble running on a (say) 20GB machine with 16GB hash and 512 threads.
I guess you could always drive up the demand for real (physical) memory by increasing the number of threads, as each thread does need some space in real memory for its stack. Recursion can of course go quite deep in Stockfish. If you would request a million threads, even a very modest amount of stack memory per thread could dwarf the Hash table.
But I think the UCI specs are also clear in this respect: memory for stack or code should not be included in the Hash setting.
Still, there is some room for controversy here: what if an engine needs an extraordinary amount of read-only data, such as neural-network parameters? Would that count as 'program code'? This is not purely hypothetical; top Shogi engines typically use some 300MB of machine-learned evaluation parameters, and there is no reason why you couldn't do the same in Chess. The mini-Shogi engine GA-Sho!!!!!!!!! needs 6.5GB minimally to run, if you set the Hash to a vanishingly small value. I wonder how CCRL testers would react to such an engine, as their testing conditions specify 128MB or 256MB hash size.
VM usage has never been really accurate. It is quite difficult to determine what is duplicated and what is not, and that answer changes as the program executes and causes copy-on-write duplications. And then there is the shared-memory stuff if you use processes: each process looks huge when in reality much is shared. The implementation really works well; the statistics gathering, not so well.
Ras wrote: ↑Tue Jul 17, 2018 12:35 am
Under Windows, you can see the difference neatly. If you don't do it like I do and just allocate the hash tables, the memory usage in the Task Manager won't jump up right away - but it increases while the engine calculates, because more pages of the hash tables are blended in through usage.
Under Linux, too; just look at the RSS column instead of the virtual size column.