re-inventing the SMP wheel

Discussion of chess software programming and technical issues.

Moderator: Ras

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: re-inventing the SMP wheel

Post by bob »

hgm wrote:This is very unlikely. For one, compilers and linkers allocate variables as a contiguous set. They have no reason at all to leave large, unused holes between them. So if your total of global variables / tables does not exceed 1MB, they will all nicely map into one L2 cache way without any possibility of collisions. The system stack grows as a contiguous address range.
If that were the only issue... Of course the compiler lays things out in a linear fashion. But that doesn't say a thing about how the program is actually placed into physical memory (remember that physical memory addresses are used to map data into cache). The operating system can, for example, stick you in only even-numbered physical pages (not likely, I know, but it can). That means your address space only maps into 1/2 the cache addresses.

If you have ever been to a real chess tournament where you see people continually re-booting and running their program until they get some target NPS value, this is what they are doing. Trying, by luck, to get a favorable physical memory layout to be more cache-friendly.

So contiguous virtual addresses mean zilch to the cache system, only the O/S can control the physical address assignments.
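A minimal sketch of that indexing point (my own illustration, not from either engine, assuming the 4MB, 16-way, 64-byte-line geometry discussed later in the thread):

[code]
/* Sketch: which L2 set a physical address lands in.
   Assumed geometry: 4MB, 16-way, 64-byte lines -> 4096 sets.            */
#include <stdint.h>

#define LINE_BITS 6              /* 64-byte lines                        */
#define SET_BITS  12             /* 4096 sets                            */

static inline unsigned l2_set(uint64_t phys_addr)
{
    /* bits  5..0  : offset inside the line (not used for indexing)
       bits 11..6  : come from the page offset -> the program controls them
       bits 17..12 : come from the physical frame number -> only the O/S
                     controls them, so two contiguous virtual pages can
                     collide in the cache or not, purely by luck          */
    return (unsigned)((phys_addr >> LINE_BITS) & ((1u << SET_BITS) - 1));
}
[/code]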

Secondly, when there might be doubt, I make sure that there will not be any collisions, by laying out the memory map of variables and tables by hand (declaring them as part of a single struct). This is so far only critical for L1, as the total memory usage of the rest of my engine (code + data + stacks) is only a tiny, tiny fraction of a single L2 cache way.
You are wasting a ton of time. You are concerned with virtual addresses. That has nothing to do with cache at all, except for the concept of temporal locality / spatial locality. You should put variables close together in virtual memory, if they are used close together in time, to take advantage of cache line fetches that prefetch data close to the data that caused the miss. But as far as how your program maps into cache, you can forget about it. You are two levels too high in the abstraction process. What you want to do only suggests to the compiler what to do. What the compiler does has nothing to do with what the O/S does with the physical memory layout. Typically physical memory is allocated in what is essentially a random page order.
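A minimal sketch of that "close together in time, close together in memory" advice (the struct and field names are invented for illustration; the alignment attribute is GCC/Clang syntax):

[code]
/* Group data that is touched together in time so that one 64-byte line
   fetch brings several of them into cache at once.                      */
#include <stdint.h>

struct hot_state {                      /* hypothetical hot variables     */
    uint64_t hash_key;
    int32_t  ply;
    int32_t  side_to_move;
    int32_t  fifty_counter;
    int32_t  castle_rights;
} __attribute__((aligned(64)));         /* start on a cache-line boundary */

struct hot_state search_state;          /* one miss warms all of these    */
[/code]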

Even if you want to accept that you have several frequently used areas wandering through memory because they are stacks, there will not be enough to give more than a 3-fold collision. (System stack, move stack, globals, code.) In L1 this can still be fatal on some machines with a 2-way L1 (P-IV, AMD), despite the fact that code does not go in there. On the scale of L2 the move stack is just part of the contiguous block of globals, as it is declared as a global array. So I have only 3 contiguous blocks, each small compared to a cache way, which can never be a problem in a 4-way cache.
I'm not sure of your background with respect to compilers, operating systems, and architecture. But your comments above are not exactly what I would want to see on an exam I give. You can have a single stack that gets a 10-fold collision rate due to the physical memory problem I have already explained... Virtual has nothing to do with physical...

If I were to make many threads that share an L2, obviously I would have to keep their stacks apart after mapping into L2. But so far L2 is never shared between more than 2 cores, so this is not such a problem. I plan to go to an iterative implementation of the search, which maintains its own stack of locals as structs that are put on the move stack, to merge them into one contiguous stack and guarantee they will never collide.
It is likely that future multi-core chips will not look like the first Intel quad-core chips, which just fused two dual-core chips into one. The shared L2 cache is actually a good thing for my program, to avoid redundant data. Dual 2MB L2s have too much redundant data. I'd much prefer 4MB shared to 2x2MB shared...


Having programs that cause cache thrashing wrecks efficiency. Never do it!
If only it were in your control to limit this, but it isn't...
hgm
Posts: 28350
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: re-inventing the SMP wheel

Post by hgm »

Well, like I said, I am doing this so far only for L1, as all frequently used data except the main hash easily fits there (if it is not a P-IV). L1 is indexed by the _logical_ address, not the _physical_ address (although it is physically tagged).

Although the memory management could indeed scatter the memory map so that it only allocates pages in 1MB strides, only a malicious person would design it such that this could happen in practice. The most obvious algorithms for memory allocation are best fit or first fit, and unless the memory is already completely fragmented, this means you get contiguous physical memory. If you run under conditions where swapping occurs so that the mapping constantly changes through demand paging, efficiency is dead anyway. So you keep the memory map you started with, which was the memory you asked for when memory was still completely empty.

I have never seen anything like this happen in practice. If I run a cache test by cycling through an array the size of L2, I always get the expected L2 access time, and never see any evidence of thrashing. It just doesn't happen. And if it ever would, it would be a good reason to replace (or rewrite) my OS.

Furthermore, even the paging unit cannot redefine the lowest 12 bits of the address, so the layout _within_ a 4KB page is always completely controlled by the user. Thus frequently used data that you put inside the same 4KB page is guaranteed never to collide in cache. In a stack (e.g. move stack) that uses up to 0.5KB per ply (e.g. 128 moves of 4 bytes) the last 8 ply would not span more than 4KB, and thus use at most 2 memory pages. And even if Bill Gates's accomplices had designed it such that these memory pages were selected from the vast empty memory to lie 1MB apart and collide in L2, these top 8 ply of the stack would only use the opposite ends of the pages they straddle. So the individual cache lines would still be guaranteed to map apart. Thus a single stack is quite safe. As long as the frequently used top part doesn't span more than 4KB, you have the guarantee that there is not a single collision internal to it (let alone 10), either in L2 or L1. Despite the most malicious effort OS designers could have brought to bear!
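A throwaway check of that within-a-page guarantee (my own sketch, assuming a 4MB, 16-way L2 with 64-byte lines, i.e. 4096 sets; the frame addresses are made up):

[code]
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define LINE 64
#define SETS 4096                 /* 4MB / 16 ways / 64-byte lines        */
#define PAGE 4096

static unsigned set_of(uint64_t phys) { return (unsigned)((phys >> 6) % SETS); }

int main(void)
{
    uint64_t frameA = 0x12345000ULL;          /* two made-up page frames, */
    uint64_t frameB = frameA + (1ULL << 20);  /* deliberately 1MB apart   */
    int seen[SETS] = {0};

    /* Lines inside one page always land in 64 different sets,
       no matter which frame the O/S picked for that page.                */
    for (uint64_t off = 0; off < PAGE; off += LINE) {
        unsigned s = set_of(frameA + off);
        assert(seen[s] == 0);
        seen[s] = 1;
    }

    /* But the same offset in two different frames can collide:           */
    printf("set in frame A: %u, set in frame B: %u\n",
           set_of(frameA), set_of(frameB));   /* prints the same set      */
    return 0;
}
[/code]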
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: re-inventing the SMP wheel

Post by bob »

hgm wrote:Well, like I said, I am doing this so far only for L1, as all frequently used data except the main hash easily fits there (if it is not a P-IV). L1 is indexed by the _logical_ address, not the _physical_ address (although it is physically tagged).
Although the memory management could indeed scatter the memory map so that it only allocates pages in 1MB strides, only a malicious person would design it such that this could happen in practice. The most obvious algorithms for memory allocation are best fit or first fit, and unless the memory is already completely fragmented, this means you get contiguous physical memory
OK, we are on different assumptions. I am assuming that the program will not fit into L1. For 99% of the chess programs around, this is a valid assumption. I'm hopeful to get the majority of useful data in L2. No way I will get much of the critical stuff into L1 since it is so small (32K is barely enough for one or two of the tables I use), not counting instructions.

If you actually fit into L1, your problem won't be that significant. But read on, as the way virtual memory is handled by the O/S is far different from what you are thinking...

Sorry, but the second point above is way wrong. You will _never_ get contiguous physical memory addressing unless you are the first thing that runs after booting, which includes system tasks and the like. Initially the physical page frames are in some sort of order, but by the time the first program runs, this order has been lost because the O/S has allocated and freed large chunks of memory, thanks to all the system tasks that are started (some of which also terminate), and it does not take the time to do any sort of garbage collection. Why would it, when any physical page can be mapped to any virtual address desired? Linux uses a stack approach for this, which means that when you need a page of physical memory, the page you get is the last one someone else released. Memory is so scrambled by the time the first program runs that your physical memory is simply scattered in a random fashion. There's no time to garbage collect every time a single frame is released, when there are usually at least 1M page frames on a large memory system.

It isn't malicious at all. It is just an artifact of making the allocate/free for physical memory as cheap as possible since it happens all the time.
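A toy model of that stack-based free-list behaviour (nothing like the real Linux allocator code, just the LIFO idea):

[code]
#include <stdio.h>

/* Frees push frame numbers onto a stack; allocations pop the most
   recently freed frame, so a "contiguous" buffer inherits whatever
   scrambled order earlier processes freed their pages in.               */
static unsigned free_stack[64];
static int top;

static void     free_frame(unsigned f) { free_stack[top++] = f; }
static unsigned alloc_frame(void)      { return free_stack[--top]; }

int main(void)
{
    unsigned freed_by_daemons[] = { 5, 2, 7, 0, 3 };   /* arbitrary order */
    for (int i = 0; i < 5; i++) free_frame(freed_by_daemons[i]);

    for (int i = 0; i < 5; i++)                        /* engine starts   */
        printf("virtual page %d -> physical frame %u\n", i, alloc_frame());
    return 0;
}
[/code]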
If you run under conditions where swapping occurs so that the mapping constantly changes through demand paging, efficiency is dead anyway. So you keep the memory map you started with, which was the memory you asked for when memory was still completely empty.
Memory is _never_ "completely empty" by the time your program runs. Windows and Linux have both started hundreds of processes by that time, thoroughly scrambling physical memory. That's why many people boot and reboot trying to get the fastest NPS they can get, because even though those initial system tasks are started in the same order, they become very non-deterministic in their execution due to I/O, timer and other types of interrupts.

I have never seen anything like this happen in practice. If I run a cache test by cycling through an array the size of L2, I always get the expected L2 access time, and never see any evidence of thrashing. It just doesn't happen. And if it ever would, it would be a good reason to replace (or rewrite) my OS.

That suggests that your test is flawed. Everybody else on the planet sees _exactly_ the behavior I have described...

Furthermore, even the paging unit cannot redefine the lowest 12 bits of the address, so that the layout _within_ a 4KB page is always completely controlled by the user.
No disagreement there. I am talking about your having zero control over the _physical_ memory addresses your program is given by the O/S. There is simply nothing you can do about it. This problem is discussed at length in most good operating system books in the virtual memory section. There is a solution, but it is considered too time-consuming to tolerate when for most programs, this is not a problem, until you get to large sizes.

Thus frequently used data that you put inside the same 4KB page is guaranteed never to collide in cache. In a stack (e.g. move stack) that uses up to 0.5KB per ply (e.g. 128 moves of 4 bytes) the last 8 ply would not span more than 4KB, and thus use at most 2 memory pages. And even if Bill Gates's accomplices had designed it such that these memory pages were selected from the vast empty memory to lie 1MB apart and collide in L2, these top 8 ply of the stack would only use the opposite ends of the pages they straddle. So the individual cache lines would still be guaranteed to map apart. Thus a single stack is quite safe. As long as the frequently used top part doesn't span more than 4KB, you will have the guarantee that there is not a single collision internal to it (let alone 10), either in L2 or L1.

That makes no sense to me. We can get aliases to any 64-byte line within a single physical page when you consider that there are _other_ physical pages that also map to the same area of cache. Yes, you get no aliases within a page (assuming you only use small page sizes), but between pages aliasing is rampant and affects results.
Despite the most malicious effort OS designers could have brought to bear!
Again, it isn't malicious. It is "standard design practice" because performance in this area of the O/S is critical. You can't search through 1M pages to find the "best" one to prevent cache aliasing when you are asked to allocate / free tens of thousands of pages per second (or more) at times...
hgm
Posts: 28350
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: re-inventing the SMP wheel

Post by hgm »

The point is that when you allocate x cache-ways worth of memory as randomly scattered pages, the probability for N items to map into the same cache line has a Poisson distribution, P(N) = (x^N / N!) exp(-x). For x=1 (e.g. 1MB in a 4-way 4MB cache) the probability that more than 4 items map into a line and give thrashing is only 0.4%.

This is small enough to not really affect the calculation. To do worse, the OS designer would have to intentionally assign blocks in strides of the cache-way size. That would be malicious. If he is just stupid, he will not be able to do worse than random.

I looked it up, and actually the Core 2 Duo L2 is 16-way. With such a high associativity, the odds are rigged even more in your favor. 1MB is now 4 cache ways, so on average 4 items map into the same line. The probability that 17 or more map into a line (giving thrashing) is 1.13 in a million. Even with 2MB of 'fixed' data, collisions of 17 or more are only 0.017%. Collisions of 15 or less (leaving one spare cache way to accommodate the occasional hash probe) are 99.95%.
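The first two of those tail probabilities are easy to reproduce; a quick throwaway check (my own code, not from the post; compile with -lm):

[code]
#include <math.h>
#include <stdio.h>

/* P(N > ways) for a Poisson distribution with mean x.                   */
static double poisson_tail(double x, int ways)
{
    double term = exp(-x), cdf = term;          /* k = 0 */
    for (int k = 1; k <= ways; k++) {
        term *= x / k;
        cdf  += term;
    }
    return 1.0 - cdf;
}

int main(void)
{
    /* 1MB of data = 1 cache way in a  4-way 4MB cache -> ~0.4%          */
    printf("4-way : %.4f\n", poisson_tail(1.0, 4));
    /* 1MB of data = 4 cache ways in a 16-way 4MB cache -> ~1.1e-6       */
    printf("16-way: %.2e\n", poisson_tail(4.0, 16));
    return 0;
}
[/code]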

I would not recommend having your engine habitually use more than 2MB of fixed data (i.e. apart from hash tables). Not even in a 4MB cache if you cannot control the mapping. But more obviously, because there are still many systems around that do not have 2MB, or even 1MB, L2 caches, and your engine would then run like crap on such systems.

This whole discussion has become a distracting side track from SMP search algorithms. If you really want to know how large the probability is that the hash entry is still cached at the time of the hash store, and doubt my calculations, you should simply measure it. Even if you do 100 times worse than in my original assumption (e.g. because you do 100 L2 accesses randomly scattered by the OS paging per node next to the hash probe, and they would all have to be cache misses, as hits won't flush anything out of the cache!) it should still be 99.7%.

No amount of talking could convince me that you would have such a high miss rate, but an actual measurement would of course be the final verdict. In my engine, which avoids any use of L2 other than for hashing, it certainly isn't a problem.
hgm
Posts: 28350
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: re-inventing the SMP wheel

Post by hgm »

Another way to look at it is this:

Suppose the processing of each node causes N cache misses (hash probe included), that are randomly distributed over L2. Then in a 4MB cache, with 64K lines, you need a tree of 64K/N nodes before (on average) a probed entry is flushed from the cache before the store to it takes place. The fraction of nodes with such a big subtree can be at most N/64K. These nodes will require 1 extra cache miss to process, N+1 instead of N. So a relative increase of memory traffic of 1/N. The average increase in memory traffic will thus be 1/N * N/64K = 1/64K = 0.0016%, and is N-independent.

Thus, writing the hash immediately after probing can at most drive up memory traffic by 0.0016%. It is a total non-issue. This is why it seems a nearly free way to have the threads communicate through the hash table. They can leave messages there, and if there is no one to pick up the messages, there is virtually no overhead. If they do pick up the message that a node is already being searched, and decide to back off that node without writing it, they would get the entry not from DRAM but from the other CPU (snoop hit on a modified line in another cache), which is generally much faster. This would force a DRAM write cycle instead of the read you would have had otherwise, though, but that doesn't really take more memory bandwidth.

Only if they decide to cooperate on the node would frequent communication be needed. But the messages you post for free allow you to do that only when it really pays.
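A rough sketch of what such hash-table "messages" could look like (my own illustration in the spirit of ABDADA-style schemes, not hgm's actual code; names and layout are invented, and a real engine would need the usual lockless-hashing or atomic-update precautions):

[code]
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint64_t key;
    int16_t  score;
    uint8_t  depth;
    uint8_t  flags;                 /* bound type plus a BUSY bit          */
} HashEntry;

#define BUSY 0x80

/* Probe at node entry.  Because the probe just pulled the line into
   cache, stamping it BUSY as well costs essentially no extra traffic.    */
static bool probe_and_claim(HashEntry *e, uint64_t key, int depth,
                            int *score, bool *someone_else_here)
{
    bool hit = (e->key == key);
    *someone_else_here = hit && (e->flags & BUSY);
    if (hit && e->depth >= (uint8_t)depth) {
        *score = e->score;          /* normal hash cutoff                  */
        return true;
    }
    if (!hit) { e->key = key; e->depth = 0; e->score = 0; e->flags = 0; }
    e->flags |= BUSY;               /* leave the message: "I'm on it"      */
    return false;
}

/* Store at node exit clears the message along with the normal update.    */
static void store_and_release(HashEntry *e, uint64_t key, int depth,
                              int score, uint8_t bound)
{
    e->key   = key;
    e->depth = (uint8_t)depth;
    e->score = (int16_t)score;
    e->flags = bound;               /* BUSY bit cleared                    */
}
[/code]

A thread that finds the BUSY message set can defer the node or try another move first, which is the "only cooperate when it really pays" behaviour described above.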
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: re-inventing the SMP wheel

Post by bob »

hgm wrote:The point is that when you allocate x cache-ways worth of memory as randomly scattered pages, the probability for N items to map into the same cache line has a Poisson distribution, P(N) = (x^N / N!) exp(-x). For x=1 (e.g. 1MB in a 4-way 4MB cache) the probability that more than 4 items map into a line and give thrashing is only 0.4%.

I don't follow that math exactly, in the context of cache. With 4MB of L2, and 4GB of main memory, the probability of some aliasing issues is actually pretty significant. Particularly when within that 4 gigs, the pages are allocated randomly so that large data structures are not in contiguous physical RAM either.

In a 4-way cache, with a line size of 64 bytes, each unique 64-byte line of memory maps to exactly one set of lines (each set holding one line per "way", to use your term). So for some specifics:

A 64-byte line in a 16-way set-associative cache (16-way for the Core 2, so far). With 4MB of L2, that gives us 64K cache lines, or 4K sets. For 4K sets, there are 64M 64-byte memory lines (in 4GB of RAM) that map into them, or 16K possible memory lines for each set.

The math looks tough to me. 16K lines mapping into a set that only holds 16 lines is a 1024-to-1 mapping. The larger your program, the easier it is for the O/S to pick more physical pages from one group of 1024 than from another, which creates the aliasing problem I mentioned. Trying to analyze this statistically is not nearly as interesting as trying to analyze it practically, to see what you will get frequently, rather than what you are going to get on average.

Again, this is what leads everyone to doing the reboot-test shuffle over and over until they happen to get a good physical RAM layout, and then they stick with it.


This is small enough to not really affect the calculation. To do worse, the OS designer would have to intentionally assign blocks in strides of the cache-way size. That would be malicious. If he is just stupid, he will not be able to do worse than random.
Again, you are talking about the "average case". But there are plenty of much worse cases you have to deal with when running such a program. It doesn't help me much to know that on average, I am going to get 4.0M nodes per second, if I frequently get 3.5M or 4.5M. The 3.5M hurts.

I looked it up, and actually the Core 2 Duo L2 is 16-way. With such a high associativity, the odds are rigged even more in your favor. 1MB is now 4 cache ways, so on average 4 items map into the same line. The probability that 17 or more map into a line (giving thrashing) is 1.13 in a million. Even with 2MB of 'fixed' data, collisions of 17 or more are only 0.017%. Collisions of 15 or less (leaving one spare cache way to accommodate the occasional hash probe) are 99.95%.
I am not sure I like your math there. We are talking about the probability of a bad thing happening. This is the sum of the probabilities of any bad thing happening. It happens way more often than you think, but gets hidden when you try to average things out.
I would not recommend having your engine habitually use more than 2MB of fixed data (i.e. apart from hash tables). Not even in a 4MB cache if you cannot control the mapping. But more obviously, because there are still many systems around that do not have 2MB, or even 1MB, L2 caches, and your engine would then run like crap on such systems.

This whole discussion has become a distracting side track from SMP search algorithms. If you really want to know how large the probability is that the hash entry is still cached at the time of the hash store, and doubt my calculations, you should simply measure it. Even if you do 100 times worse than in my original assumption (e.g. because you do 100 L2 accesses randomly scattered by the OS paging per node next to the hash probe, and they would all have to be cache misses, as hits won't flush anything out of the cache!) it should still be 99.7%.
I would claim that _most_ hash probes are cache misses, based on some testing years ago when this was measurable for the first time; I think it was on the first Xeons that I tried it. I had the CPU counting cache misses, and I simply read the MSR cache miss counter before/after hash probes and found it was essentially 100% misses. This was on a 512K L2 and then a 2M L2 cache. This is pretty easy to test today. You could do a hash probe at the start of each node, then do a hash probe when you finish, and check the cache miss counter before/after the second probe to count hash misses. I suspect it will be very high for my program, since I do a lot of work in the q-search without doing any hashing at all. At the leaves, after doing the initial probe, the entry would have to survive lots of recursive search/q-search calls, along with all the static evaluation table accesses, the magic tables, and the other data structure accesses, which makes it tough to keep that old hash entry around.
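For reference, the same before/after counting can be done on a current Linux system without touching the MSRs directly; a minimal sketch using perf_event_open (HashProbe is a placeholder, and error handling is omitted):

[code]
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

/* Open a counter for last-level-cache read misses on the calling thread. */
static int open_llc_miss_counter(void)
{
    struct perf_event_attr pe;
    memset(&pe, 0, sizeof pe);
    pe.size   = sizeof pe;
    pe.type   = PERF_TYPE_HW_CACHE;
    pe.config = PERF_COUNT_HW_CACHE_LL
              | (PERF_COUNT_HW_CACHE_OP_READ     << 8)
              | (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
    pe.disabled       = 1;
    pe.exclude_kernel = 1;
    return (int)syscall(__NR_perf_event_open, &pe, 0, -1, -1, 0);
}

/* Wrap the region of interest, e.g. the second hash probe of a node.     */
static void measure_region(int fd)
{
    uint64_t misses = 0;
    ioctl(fd, PERF_EVENT_IOC_RESET,  0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    /* HashProbe(...);  <- placeholder for the probe/store being measured */
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    read(fd, &misses, sizeof misses);
    printf("LLC read misses: %llu\n", (unsigned long long)misses);
}
[/code]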
No amount of talking could convince me that you would have such a high miss rate, but an actual measurement would of course be the final verdict. In my engine, which avoids any use of L2 other than for hashing, it certainly isn't a problem.
I doubt "real programs" are going to fit in L1. No doubt an intentionally small program can, but that won't predict anything about how the real programs are going to perform, which is where we started this discussion...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: re-inventing the SMP wheel

Post by bob »

hgm wrote:Another way to look at it is this:

Suppose the processing of each node causes N cache misses (hash probe included), that are randomly distributed over L2. Then in a 4MB cache, with 64K lines, you need a tree of 64K/N nodes before (on average) a probed entry is flushed from the cache before the store to it takes place. The fraction of nodes with such a big subtree can be at most N/64K. These nodes will require 1 extra cache miss to process, N+1 instead of N. So a relative increase of memory traffic of 1/N. The average increase in memory traffic will thus be 1/N * N/64K = 1/64K = 0.0016%, and is N-independent.

Thus, writing the hash immediately after probing can at most drive up memory traffic by 0.0016%. It is a total non-issue. This is why it seems a nearly free way to have the threads communicate through the hash table. They can leave messages there, and if there is no one to pick up the messages, there is virtually no overhead. If they do pick up the message that a node is already being searched, and decide to back off that node without writing it, they would get the entry not from DRAM but from the other CPU (snoop hit on a modified line in another cache), which is generally much faster. This would force a DRAM write cycle instead of the read you would have had otherwise, though, but that doesn't really take more memory bandwidth.

Only if they decide to cooperate on the node would frequent communication be needed. But the messages you post for free allow you to do that only when it really pays.
That is again assuming too much. I probe in the basic search, I don't probe in the q-search. So between probes in the basic search, there is a _lot_ that goes on. From move generation, recursive q-search calls, evaluation, all of which dislodge stuff from L1 and L2 frequently. LRU is not going to keep that old hash entry around long with all the other stuff flowing in and out.
Oliver

Re: Nagging search.

Post by Oliver »

Thank you for the link.
Rather interesting technique indeed. And powerful too.
A must-read.

Oliver
hgm
Posts: 28350
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: re-inventing the SMP wheel

Post by hgm »

bob wrote:The math looks tough to me. 16K lines mapping into a set that only holds 16 line, is a 1024 to 1 mapping.
Not really. Because 1024 is a large number, randomly assigning the pages becomes the same as randomly deciding per page if it is used or not, with a certain probability. This leads to Poisson statistics for the number of used pages in each set of 1024.
I would claim that _most_ hash probes are cache misses, ...
No disagreement there, but this is not the issue, and it is not relevant. The issue is how long the probed data will stay in L2, and if it is still around when the hash store will want to update it.
I doubt "real programs" are going to fit in L1. No doubt an intentionally small program can, but that won't predict anything about how the real programs are going to perform, which is where we started this discussion...
I cannot comment on that without knowing your definition of "real programs". Is that programs with a rating above 2500? Above 2700? Is Rybka the only "real program"?
That is again assuming too much. I probe in the basic search, I don't probe in the q-search. So between probes in the basic search, there is a _lot_ that goes on. From move generation, recursive q-search calls, evaluation, all of which dislodge stuff from L1 and L2 frequently. LRU is not going to keep that old hash entry around long with all the other stuff flowing in and out.
The beauty of it is that it does not assume anything! The point is that to flush something (that can be anywhere) out of a cache with 64K lines, you will need about 64K replacements. It doesn't matter what these replacements are (other hash probes, table reads, code fetches), over how many nodes they are distributed, or whether they are evenly distributed over those nodes.

After search of a tree large enough to cause 64K cache misses the probed hash entry is likely to be flushed out of cache.
For smaller trees, it is likely to be still in.
So only trees that cause more than 64K cache misses will flush it out, which then requires one extra memory access to get it back in. This cannot drive up the number of memory accesses by more than one part in 64K = 0.0016%.

Do the measurement!
hgm
Posts: 28350
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: re-inventing the SMP wheel

Post by hgm »

I can add that if you focus some of those other cache misses in certain lines, because of an unfavorable virtual -> physical memory mapping, they become _less_ effective in flushing out hash probes, as they mainly flush out each other.

The only thing that would help to get a shorter lifetime for the probed hash entries is if the hash table did not map homogeneously into the cache, but was heavily concentrated in a fraction of the cache. And then the other misses would have to concentrate in that part of the cache too. But the hash table is typically so much bigger than the cache that it is inconceivable that such a mapping would ever arise accidentally. It would really require very careful malicious picking of hundreds of pages at bad (colliding) addresses, as the mapping of a 128MB hash table onto a 256KB cache way is already 512 to 1. With random assignment of physical pages this would be the average, and the spread around it would be sqrt(512) ~ 23. So to give some cache lines a significantly higher density of hits requires a 20-sigma fluctuation.