bob wrote: ↑Mon Jul 16, 2018 5:41 am
I don't quite follow that "threads are more resource heavy under Linux." In fact, it is the exact opposite. Threads intentionally share everything, whereas processes (via fork()) share very little (at least they share the executable code and anything that is never modified, an artifact of the "copy on write" approach fork() uses). Given the choice, threads are the least resource-intensive way to do multiprocessing.
This is wrong. What you describe is almost true under Windows. However, threads under Linux are essentially fully-loaded "processes": they share nothing more than what you would get from fork(), and in the recent past they were actually implemented via fork().
Threads must share their address space for global data, which separate processes certainly would not.
But I think Bob has a point: in Unix/Linux, after fork(), you share basically everything. Even when you fork off processes, they share data that they are not supposed to share, and as long as that data is not modified they will continue sharing it. The data is then kept in memory in write-protected pages, although it formally should be writable, to catch any attempt by the sharing processes to write there. When that happens, the resulting 'segfault' is handled by the OS, which duplicates the page, gives the writing process its own (now writable) copy in its page table, and redoes the write. So the only things fork() does are create a new entry in the process table and mark all pages in the paging unit write-protected.
Because this is just an efficiency trick, supposed to be transparent to the user, I can imagine that the task manager would report each process as having its own private memory, even while it is still shared, because formally it is not shared at all. But eventually all data that is supposed to be private and writable will be duplicated. This is a lot more for processes than for threads, though.
The 8192K is easily explained: that is the default stack size for a thread under Linux, so launching a thread immediately needs 8 MB just for running the thread. However, it's configurable, and I don't think SF really needs that much stack.
These 65404K blocks seem to be the per-thread heap: https://stackoverflow.com/questions/475 ... g-a-thread - their idea is to avoid malloc congestion if several threads malloc and free, so each of them gets its own heap.
If you add up 64M and 8M, that's pretty much the 75M per thread that was observed. It's almost entirely Linux sucking up RAM like crazy. But I guess most of that is paged away somehow so that it won't actually matter unless these 64M heaps are really used, which they aren't.
Yes, they are probably harmless, unless it is over-committing total memory and swapping out other pages.
My point is that unless engines intentionally do these things, it is tricky to have total control over how they behave under various circumstances; tweaks are inevitable.
I wonder how much of this reported memory use is actually real. I think the map is for the address space, not for physical memory. Paging should normally be such that only pages that are actually used get physical memory assigned to them. Otherwise they just live in the swap space on disk, and never occupy memory.
E.g. if I allocate 8GB of TT, initially it will all be filled with zeros, and it will in fact just occupy 4KB of physical memory (1 page). Which then probably is shared with other processes that also have memory areas filled with zeros. Only when I start writing the TT physical memory will get assigned and used. But the 8GB would appear in the memory map of the (virtual) address space.
I think unallocated memory on the heap is also zero-filled, i.e. it does not use any physical memory until it gets allocated and then written. The page table would just refer to this zero-filled 4K page that the OS keeps anyway. Only the page table itself would need some extra space to describe this virtual memory, but that is only 1/512 of the size of the memory it describes (I think, in the x64 architecture). And that then probably gets swapped to disk when physical memory gets scarce, and the corresponding pages are not used.
So if Ras' explanation is correct, it seems the 75MB/thread is almost entirely virtual, not making any demand on physical memory at all, other than what is needed for the system stack.
I guess the moral lesson is that the OS (at least in the case of Linux) is pretty smart, and only wastes virtual resources, which are absolutely free, to prepare for some worst-case scenario. For physical memory it only uses what is actually needed. Which in the case of Stockfish would only be a very small fraction of this reserved address space.
hgm wrote: ↑Mon Jul 16, 2018 9:03 pm
Only when I start writing the TT physical memory will get assigned and used. But the 8GB would appear in the memory map of the (virtual) address space.
Which is why, after allocating the hash memory, I fill it up with dummy data, then __sync_synchronize(), then zero it again, and __sync_synchronize() once more. This forces the OS to actually blend in the pages. Especially with super-fast time controls like 1 second per game, this is a notable time gain if the "setoption hash" command is implemented so that it blocks the answer to "isready" until the hash table allocation is finished.
Under Windows, you can see the difference neatly. If you don't do it like I do and just allocate the hash tables, the memory usage in the Task Manager won't jump up right away - but it increases while the engine calculates, because more pages of the hash tables are blended in through usage.
I think unallocated memory on the heap is also zero-filled
Yes and no. Yes if it is "fresh" memory that the process has never allocated before. Otherwise, the process might see data from other processes which would be a security issue. No if the process allocates memory that it had malloc'ed and free'd before, in which case the process may see its own garbage left in memory. On the other hand, if the process uses calloc instead of malloc, it will always get zero initialised memory (and the potential multiplication overflow of the malloc call is eliminated).
The page table would just refer to this zero-filled 4K page that the OS keeps anyway.
bob wrote: ↑Mon Jul 16, 2018 5:41 am
I don't quite follow that "threads are more resource heavy under Linux." In fact, it is the exact opposite. Threads intentionally share everything, whereas processes (via fork()) share very little (at least they share the executable code and anything that is never modified, an artifact of the "copy on write" approach fork() uses). Given the choice, threads are the least resource-intensive way to do multiprocessing.
This is wrong. What you describe is almost true under Windows. However, threads under Linux are essentially fully-loaded "processes": they share nothing more than what you would get from fork(), and in the recent past they were actually implemented via fork().
Sorry, but THIS stuff I know inside and out. Threads are FAR more lightweight than fork() processes. That is the entire point for having threads and the clone() system call. Threads share their entire virtual address space, unlike processes spawned by fork(). Using fork() requires some effort to share things like hash (using mmap() or shmget()/shmat() or something). Using threads you do nothing, they are already visible to all threads.
Last time I looked, Linux still used the clone() system call. But I have not looked at it carefully in maybe a year.
So the bottom line seems to be that this 38GB is not real, but virtual memory, which will remain completely untouched by Stockfish. Thus Stockfish should have no trouble running on a (say) 20GB machine with 16GB hash and 512 threads.
I guess you could always drive up the demand for real (physical) memory by increasing the number of threads, as each thread does need some space in real memory for its stack. Recursion can of course go quite deep in Stockfish. If you would request a million threads, even a very modest amount of stack memory per thread could dwarf the Hash table.
But I think the UCI specs are also clear in this respect: memory for stack or code should not be included in the Hash setting.
Still, there is some room for controversy here: what if an engine needs an extraordinary amount of read-only data, such as neural-network parameters? Would that count as 'program code'? This is not purely hypothetical; top Shogi engines typically use some 300MB of machine-learned evaluation parameters, and there is no reason why you couldn't do the same in Chess. The mini-Shogi engine GA-Sho!!!!!!!!! needs 6.5GB minimally to run, if you set the Hash to a vanishingly small value. I wonder how CCRL testers would react to such an engine, as their testing conditions specify 128MB or 256MB hash size.
VM usage has never been really accurate. It is quite difficult to determine what is duplicated and what is not, and that answer changes as the program executes and causes copy-on-write duplications. And then there is the shared-memory stuff if you use processes: each process looks huge when in reality much is shared. The implementation really works well; the statistics gathering, not so well.
Ras wrote: ↑Tue Jul 17, 2018 12:35 am
Under Windows, you can see the difference neatly. If you don't do it like I do and just allocate the hash tables, the memory usage in the Task Manager won't jump up right away - but it increases while the engine calculates, because more pages of the hash tables are blended in through usage.
Under Linux, too; just look at the RSS column instead of the virtual size column.