If that were the only issue... Of course the compiler lays things out in a linear fashion. But that says nothing about how the program is actually placed into physical memory (remember that physical memory addresses are used to map data into cache). The operating system can, for example, stick you in only even-numbered physical pages (not likely, I know, but it can). That means your address space maps into only half the cache sets (the sketch after the quote below makes this concrete).

hgm wrote:
This is very unlikely. For one, compilers and linkers allocate variables as a contiguous set. They have no reason at all to leave large, unused holes between them. So if your total of global variables / tables does not exceed 1MB, they will all map nicely into one L2 cache way with no possibility of collisions. The system stack grows as a contiguous address range.
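To make the even-pages scenario concrete, here is a minimal sketch in C with assumed, illustrative parameters (a 1MB cache way, 64-byte lines, 4KB pages; no particular CPU). The set index is computed from physical address bits, so a physical page the OS never hands out corresponds to a group of sets the program can never use:

[code]
#include <stdio.h>
#include <stdint.h>

#define LINE_SIZE 64        /* bytes per cache line (assumed)         */
#define SETS      16384     /* sets in a 1MB way: 1MB / 64B (assumed) */
#define PAGE_SIZE 4096      /* 4KB pages                              */

/* Which cache set a physical address maps to. */
static unsigned set_index(uint64_t phys_addr)
{
    return (unsigned)((phys_addr / LINE_SIZE) % SETS);
}

int main(void)
{
    /* If the OS hands out only even-numbered physical pages, the
     * sets covered by the odd pages are simply never used. */
    uint64_t even_page = 2 * PAGE_SIZE;   /* page 2: can be allocated */
    uint64_t odd_page  = 3 * PAGE_SIZE;   /* page 3: never allocated  */
    printf("page 2 covers sets %u..%u (usable)\n",
           set_index(even_page), set_index(even_page + PAGE_SIZE - 1));
    printf("page 3 covers sets %u..%u (wasted)\n",
           set_index(odd_page), set_index(odd_page + PAGE_SIZE - 1));
    return 0;
}
[/code]

With only even pages allocated, physical address bit 12 is always zero, and since that bit is one of the 14 set-index bits here, exactly half of the set indices become unreachable: the 1/2 figure above.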
If you have ever been to a real chess tournament and seen people continually re-booting and re-running their program until they hit some target NPS (nodes per second) value, this is what they are doing: trying, by luck, to get a physical memory layout that is more cache-friendly.
So contiguous virtual addresses mean zilch to the cache system; only the O/S controls the physical address assignments.
You are wasting a ton of time. You are concerned with virtual addresses, and those have nothing to do with cache at all, except through temporal and spatial locality. You should put variables close together in virtual memory if they are used close together in time, to take advantage of cache line fetches that pull in data adjacent to the data that caused the miss (a small example follows below). But as far as how your program maps into cache, you can forget about it: you are two levels too high in the abstraction process. What you write only suggests to the compiler what to do, and what the compiler does has nothing to do with what the O/S does with the physical memory layout. Typically, physical memory is allocated in what is essentially a random page order.
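As a small illustration of that locality advice (all names invented here; this is not from any particular engine), fields that are used together in time are declared together so they share cache lines:

[code]
#include <stdint.h>

/* Fields read together on every node of the search: declared together,
 * so one cache-line fill brings them all in at once. */
struct hot_state {
    uint64_t hash_key;
    int      ply;
    int      side_to_move;
    int      alpha, beta;
};

/* Fields touched only occasionally (statistics, output): kept apart so
 * they do not dilute the hot cache lines. */
struct cold_state {
    uint64_t nodes_searched;
    char     best_line[256];
};

struct hot_state  hot;
struct cold_state cold;
[/code]

Here a miss on hash_key also prefetches ply, side_to_move, alpha, and beta, since they all fit inside one 64-byte line.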
Secondly, when there might be doubt, I make sure that there will not be any collisions by laying out the memory map of variables and tables by hand, declaring them as part of a single struct (a sketch below). So far this is only critical for L1, as the total memory usage of the rest of my engine (code + data + stacks) is only a tiny, tiny fraction of a single L2 cache way.
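A sketch of that hand-laid-out map, with invented table names and an assumed 4KB L1 way size chosen purely for illustration:

[code]
#include <stdint.h>

/* Every global table as a member of one struct: the compiler must lay
 * the members out contiguously, so their relative offsets -- and thus
 * the cache sets they hit -- are fixed and knowable. */
struct engine_globals {
    int8_t   board[64];
    int8_t   piece_values[16];
    uint8_t  attack_table[1024];
} glob;

/* Compile-time guard (C11): as long as the whole block fits in one
 * way, no two members can ever evict each other from that way. */
_Static_assert(sizeof(struct engine_globals) <= 4096,
               "globals must fit inside one L1 way");
[/code]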
I'm not sure of your background with respect to compilers, operating systems, and architecture, but your comments above are not exactly what I would want to see on an exam I give. You can have a single stack that suffers a 10-fold collision rate due to the physical memory problem I already explained... Virtual has nothing to do with physical...
Even if you want to accept that there are several frequently used areas wandering through memory because they are stacks, there will not be enough of them to give more than a 3-fold collision (system stack, move stack, globals, code). In L1 this can still be fatal on some machines with a 2-way L1 (P-IV, AMD), despite the fact that code does not go in there. On the scale of L2, the move stack is just part of the contiguous block of globals, as it is declared as a global array. So I have only 3 contiguous blocks, each small compared to a cache way, which can never be a problem in a 4-way cache (the toy simulation below makes this concrete).
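A toy check of that associativity argument (the geometry and addresses are assumed for illustration): three contiguous blocks, each smaller than one cache way, can place at most three lines in any one set, which a 4-way cache absorbs without a single conflict eviction:

[code]
#include <stdio.h>
#include <string.h>

#define LINE 64
#define SETS 512                       /* sets per way: 32KB way (assumed) */

int main(void)
{
    /* (base, size) of three blocks; every size is below one way (32KB),
     * so each block touches any given set index at most once. */
    unsigned long base[3] = { 0x10000, 0x73000, 0xc8000 };
    unsigned long size[3] = { 0x4000,  0x2000,  0x6000  };
    int lines_in_set[SETS];
    memset(lines_in_set, 0, sizeof lines_in_set);

    for (int b = 0; b < 3; b++)
        for (unsigned long a = base[b]; a < base[b] + size[b]; a += LINE)
            lines_in_set[(a / LINE) % SETS]++;

    int worst = 0;
    for (int s = 0; s < SETS; s++)
        if (lines_in_set[s] > worst) worst = lines_in_set[s];
    printf("worst case: %d lines in one set (a 4-way cache holds 4)\n",
           worst);
    return 0;
}
[/code]

Whatever bases you pick, the worst case never exceeds 3 lines per set, so nothing is ever evicted by a conflict.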
It is likely that future multi-core chips will not look like the first Intel quad-core chips, which just fused two dual-core dies into one package. The shared L2 cache is actually a good thing for my program, because it avoids redundant data: dual 2MB L2s hold too much duplicated data. I'd much prefer 4MB shared to 2x2MB.
If I were to make many threads that share an L2, obviously I would have to keep their stacks apart after mapping into L2. But so far no L2 is shared between more than 2 cores, so this is not such a problem. I plan to move to an iterative implementation of the search, which maintains its own stack of locals as structs placed on the move stack, merging everything into one contiguous stack and guaranteeing the frames can never collide (sketched below).
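A minimal sketch of such an iterative search, with all names and the toy tree invented for illustration; the point is only that the per-ply locals live as structs on one contiguous array instead of on the C call stack:

[code]
#include <stdio.h>

#define MAX_PLY 64

/* The locals a recursive search would keep on the C stack,
 * gathered into a struct instead. */
struct frame {
    int alpha, beta;     /* carried along; unused by this toy tree */
    int move_index;      /* next move to try at this ply           */
    int best_score;
};

static struct frame stack[MAX_PLY];   /* one contiguous block */

int search_iterative(int alpha, int beta)
{
    int ply = 0;
    stack[0] = (struct frame){ alpha, beta, 0, -32000 };

    while (ply >= 0) {
        struct frame *f = &stack[ply];
        if (ply < 4 && f->move_index < 2) {   /* toy tree: 2 moves, depth 4 */
            f->move_index++;
            ply++;                            /* "recurse": push a frame    */
            stack[ply] = (struct frame){ -f->beta, -f->alpha, 0, -32000 };
        } else {
            int score = (ply == 4) ? 0 : -f->best_score;  /* toy leaf = 0  */
            ply--;                            /* "return": pop the frame    */
            if (ply >= 0 && score > stack[ply].best_score)
                stack[ply].best_score = score;
        }
    }
    return stack[0].best_score;
}
[/code]

Nothing here searches any better than a recursive version; the gain being described is that the frames and the move lists share one contiguous stack, so their cache footprint is a single block.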
If only it were in your control to limit this, but it isn't...
Writing programs that cause cache thrashing wrecks efficiency. Never do it!