Intel 64 and IA-32 Architectures Optimization Reference Manual, Section 8.6.2 (Shared-Memory Optimization), suggests that several bus transactions are needed when two threads are executing on two physical processors and sharing data. It suggests that, to minimize sharing, one copy the data to local stack variables if it is to be accessed repeatedly over an extended period.
Does this mean that the collection of threads on each processor must have their own copy of, for example, move-bitboard databases? How would one implement such per-processor memory access for the threads on each processor in C (or is this handled automatically)?
Minimizing Sharing of Data between Physical Processors
Moderators: hgm, Rebel, chrisw
-
- Posts: 287
- Joined: Sat Mar 11, 2006 3:19 am
- Location: Atlanta, GA
-
- Posts: 27790
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Minimizing Sharing of Data between Physical Processors
A lot depends on whether the processors share a cache or not, and on whether the data is read-only or written.
If two CPUs that do not share a cache share a large memory array of read-only data, there is no penalty: They simply load the data in their own caches, and it gets the status 'shared' there. The bus transactions are only needed if one of them starts to write that data: in that case the other has to be informed that the copy it has in its cache is no longer valid. And that takes bus transactions (invalidate cycles).
If they do share a cache, there is little penalty even in sharing R/W data. On the contrary, giving each thread its own private copy of a big array in that case would force both arrays to be present in cache, and thus unnecessarily occupy valuable cache space.
On AMD processors the situation is even more problematic, as different processors might not even share DRAM.
-
- Posts: 287
- Joined: Sat Mar 11, 2006 3:19 am
- Location: Atlanta, GA
Re: Minimizing Sharing of Data between Physical Processors
hgm wrote:A lot depends on whether the processors share a cache or not, and on whether the data is read-only or written.
If two CPUs that do not share a cache share a large memory array of read-only data, there is no penalty: They simply load the data in their own caches, and it gets the status 'shared' there. The bus transactions are only needed if one of them starts to write that data: in that case the other has to be informed that the copy it has in its cache is no longer valid. And that takes bus transactions (invalidate cycles).
If they do share a cache, there is little penalty even in sharing R/W data. On the contrary, giving each thread its own private copy of a big array in that case would force both arrays to be present in cache, and thus unnecessarily occupy valuable cache space.
On AMD processors the situation is even more problematic, as different processors might not even share DRAM.
I wanted to address specifically read-only data for separate physical processors, such as move-bitboard databases, Zobrist keys, or any other frequently accessed read-only data that gets initialized at the start of the program. The optimization guide suggests that there will be overhead ("including snooping, request for ownership changes, ..."). I'm not sure whether this overhead is significant, but if it is, it will significantly change the design of your engine. Say you wanted your engine to run on two quad-cores: if it didn't matter, I would use only threads; if it did matter, I'd use separate processes, one for each CPU, with four threads each. The two questions I have are:
- Is the read-only memory overhead among separate physical processors significant in a chess engine?
- If the overhead is significant, how best would one implement a fix for it in C (both Windows and UNIX/POSIX)?
-
- Posts: 27790
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Minimizing Sharing of Data between Physical Processors
I would say the overhead is non-existent, assuming the tables are in L2 all the time.
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Minimizing Sharing of Data between Physical Processors
Pradu wrote:Intel 64 and IA-32 Architectures Optimization Reference Manual, Section 8.6.2 (Shared-Memory Optimization), suggests that several bus transactions are needed when two threads are executing on two physical processors and sharing data. It suggests that, to minimize sharing, one copy the data to local stack variables if it is to be accessed repeatedly over an extended period.
Does this mean that the collection of threads on each processor must have their own copy of, for example, move-bitboard databases? How would one implement such per-processor memory access for the threads on each processor in C (or is this handled automatically)?
No. Read-only data doesn't cause the problem they are talking about. It is data that is _updated_ that causes the overhead.
The problem becomes more complex because of cache lines. If any byte of a line is modified by one thread, then there will be overhead in any other thread (CPU) that accesses or modifies that same cache line (even if it is modifying or accessing a different byte), because the cache coherency hardware works on lines, not bytes. The worst-case scenario is a set of four consecutive words, each modified by only one thread. They never modify any word but their own, yet this will just smoke the cache. Put each word in a separate 64-byte chunk of memory and the problem goes away...
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Minimizing Sharing of Data between Physical Processors
Pradu wrote:I wanted to address specifically read-only data for separate physical processors, such as move-bitboard databases, Zobrist keys, or any other frequently accessed read-only data that gets initialized at the start of the program. The optimization guide suggests that there will be overhead ("including snooping, request for ownership changes, ..."). I'm not sure whether this overhead is significant, but if it is, it will significantly change the design of your engine. Say you wanted your engine to run on two quad-cores: if it didn't matter, I would use only threads; if it did matter, I'd use separate processes, one for each CPU, with four threads each. The two questions I have are:hgm wrote:A lot depends on whether the processors share a cache or not, and on whether the data is read-only or written.
If two CPUs that do not share a cache share a large memory array of read-only data, there is no penalty: They simply load the data in their own caches, and it gets the status 'shared' there. The bus transactions are only needed if one of them starts to write that data: in that case the other has to be informed that the copy it has in its cache is no longer valid. And that takes bus transactions (invalidate cycles).
If they do share a cache, there is little penalty even in sharing R/W data. On the contrary, giving each thread its own private copy of a big array in that case would force both arrays to be present in cache, and thus unnecessarily occupy valuable cache space.
On AMD processors the situation is even more problematic, as different processors might not even share DRAM.
- Is the read-only memory overhead among separate physical processors significant in a chess engine?
- If the overhead is significant, how best would one implement a fix for it in C (both Windows and UNIX/POSIX)?
Snooping and ownership only apply to modified data. A cache becomes the "owner" when a write is done to memory that it is holding. It is then ultimately responsible for (a) writing the correct value back to memory before replacing the line, and (b) snooping for other caches accessing that line and supplying it to them rather than having them read from memory.
None of that applies to data that is not modified...