Minimizing Sharing of Data between Physical Processors

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Pradu
Posts: 287
Joined: Sat Mar 11, 2006 3:19 am
Location: Atlanta, GA

Minimizing Sharing of Data between Physical Processors

Post by Pradu »

The Intel 64 and IA-32 Architectures Optimization Reference Manual, Section 8.6.2 (Shared-Memory Optimization), says that several bus transactions are needed when two threads executing on two physical processors share data. To minimize sharing, it suggests copying the data into local stack variables if it is to be accessed repeatedly over an extended period.

Does this mean that the collection of threads on each processor must have its own copy of, for example, move-bitboard databases? How would one bind memory and the respective threads to each processor in C (or is this done automatically)?
User avatar
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Minimizing Sharing of Data between Physical Processors

Post by hgm »

A lot depends on whether the processors share a cache or not, and on whether the data is read-only or is written.

If two CPUs that do not share a cache share a large memory array of read-only data, there is no penalty: They simply load the data in their own caches, and it gets the status 'shared' there. The bus transactions are only needed if one of them starts to write that data: in that case the other has to be informed that the copy it has in its cache is no longer valid. And that takes bus transactions (invalidate cycles).

If they do share a cache, there is little penalty even in sharing R/W data. On the contrary, giving each thread its own private copy of a big array would in that case force both copies to be present in the cache, unnecessarily occupying valuable cache space.

On AMD processors the situation is even more problematic, as different processors might not even share DRAM.
Pradu
Posts: 287
Joined: Sat Mar 11, 2006 3:19 am
Location: Atlanta, GA

Re: Minimizing Sharing of Data between Physical Processors

Post by Pradu »

hgm wrote:A lot depends on whether the processors share a cache or not, and on whether the data is read-only or is written.

If two CPUs that do not share a cache share a large memory array of read-only data, there is no penalty: They simply load the data in their own caches, and it gets the status 'shared' there. The bus transactions are only needed if one of them starts to write that data: in that case the other has to be informed that the copy it has in its cache is no longer valid. And that takes bus transactions (invalidate cycles).

If they do share a cache, there is little penalty even in sharing R/W data. On the contrary, giving each thread its own private copy of a big array would in that case force both copies to be present in the cache, unnecessarily occupying valuable cache space.

On AMD processors the situation is even more problematic, as different processors might not even share DRAM.
I wanted to address specifically read-only data shared between separate physical processors: move-bitboard databases, Zobrist keys, or any other frequently accessed read-only data that gets initialized at the start of the program. The optimization guide suggests that there will be overhead ("including snooping, request for ownership changes, ..."). I'm not sure whether this overhead is significant, but if it is, it would significantly change the design of an engine. Say you wanted your engine to run on two quad-cores. If it didn't matter I would use only threads; if it did, I'd use separate processes, one for each CPU, with 4 threads each. The two questions I have are:
  1. Is the read-only memory overhead among separate physical processors significant in a chess engine?
  2. If the overhead is significant, how best would one implement a fix for it in C (both Windows and UNIX/POSIX)?
User avatar
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Minimizing Sharing of Data between Physical Processors

Post by hgm »

I would say the overhead is non-existent, assuming the tables are in L2 all the time.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Minimizing Sharing of Data between Physical Processors

Post by bob »

Pradu wrote:The Intel 64 and IA-32 Architectures Optimization Reference Manual, Section 8.6.2 (Shared-Memory Optimization), says that several bus transactions are needed when two threads executing on two physical processors share data. To minimize sharing, it suggests copying the data into local stack variables if it is to be accessed repeatedly over an extended period.

Does this mean that the collection of threads on each processor must have its own copy of, for example, move-bitboard databases? How would one bind memory and the respective threads to each processor in C (or is this done automatically)?
No. Read-only data doesn't cause the problem they are talking about; it is data that is _updated_ that causes the overhead.

The problem becomes more complex because of cache lines. If any byte of a line is modified by one thread, there will be overhead for any other thread (CPU) that accesses or modifies that same cache line, even if it touches a different byte, because the cache coherency machinery works on lines, not bytes. The worst-case scenario is a set of 4 consecutive words, each modified by only one thread: no thread ever modifies any word but its own, yet this will just smoke the cache. Put each word in a separate 64-byte chunk of memory and the problem goes away...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Minimizing Sharing of Data between Physical Processors

Post by bob »

Pradu wrote:
hgm wrote:A lot depends on whether the processors share a cache or not, and on whether the data is read-only or is written.

If two CPUs that do not share a cache share a large memory array of read-only data, there is no penalty: They simply load the data in their own caches, and it gets the status 'shared' there. The bus transactions are only needed if one of them starts to write that data: in that case the other has to be informed that the copy it has in its cache is no longer valid. And that takes bus transactions (invalidate cycles).

If they do share a cache, there is little penalty even in sharing R/W data. On the contrary, giving each thread its own private copy of a big array would in that case force both copies to be present in the cache, unnecessarily occupying valuable cache space.

On AMD processors the situation is even more problematic, as different processors might not even share DRAM.
I wanted to address specifically read-only data shared between separate physical processors: move-bitboard databases, Zobrist keys, or any other frequently accessed read-only data that gets initialized at the start of the program. The optimization guide suggests that there will be overhead ("including snooping, request for ownership changes, ..."). I'm not sure whether this overhead is significant, but if it is, it would significantly change the design of an engine. Say you wanted your engine to run on two quad-cores. If it didn't matter I would use only threads; if it did, I'd use separate processes, one for each CPU, with 4 threads each. The two questions I have are:
  1. Is the read-only memory overhead among separate physical processors significant in a chess engine?
  2. If the overhead is significant, how best would one implement a fix for it in C (both Windows and UNIX/POSIX)?
Snooping and ownership only apply to modified data. A cache becomes the "owner" when a write is done to memory it is holding. It is then ultimately responsible for both (a) writing the correct value back to memory before replacing the line, and (b) snooping for other caches accessing that line and supplying it to them rather than having them read a stale copy from memory.

Doesn't apply to data that is not modified...