Interlock clusters

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Interlock clusters

Post by bob »

hgm wrote:
bob wrote:Won't work on X86. I tried this this past week in fact. The exchange instruction is _horribly_ slow, it grabs the bus for a long time, and crushes SMP performance, even when the lock is almost never set. You should try it on an 8 core or 16 core box. One or two per node will absolutely kill performance... the lock prefix instructions just ramp up the contention for the bus...
Is this also true if you use XCHG on a variable that would produce a cache miss anyway?

I have not really tested it, but I seem to remember that modern x86 caches are not 'write through', but have an 'allocate on write' policy. This means that a STORE which misses L2 would trigger a READ memory cycle to fetch the part of the cache line that is not overwritten, while at the same time it would have to inform other CPUs (with independent caches) that this cache line is loaded in MODIFIED state (so the others that might have it in cache can INVALIDATE it). One presumes that this is implemented by having the other CPUs snoop the bus cycle that had to take place anyway, without any additional overhead. The actual WRITE memory cycle will take place later, when the cache line is flushed.

This is not much different than what needs to happen when you do an XCHG. And for hash entries you have to do the write sooner or later anyway. So if it makes little difference, you might as well prepare the entry as dirty with an XCHG, rather than with a normal load, while bringing it in cache.
The xchg instruction is a performance killer when applied to memory. It locks the bus, reads the original value (it can not use cache contents) from memory, then writes the new value (the exchanged value) back to memory. All while the bus is locked so that no other processors can access memory at all...

This is akin to using a lock prefix for other instructions such as adding to a value in memory, etc...
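
As a concrete illustration of that last point, here is a minimal C sketch (not from the original post) of getting a lock-prefixed add from the compiler; it assumes GCC or Clang on x86, where the legacy __sync builtins emit LOCK-prefixed instructions, and the variable name "counter" is made up:

Code: Select all

    /* Hedged sketch: two ways to get a lock-prefixed add on x86 from C.
       Assumes GCC/Clang; the name "counter" is illustrative only. */
    static volatile int counter;

    void atomic_increment(void)
    {
        /* legacy GCC builtin; compiles to a LOCK-prefixed add/xadd */
        __sync_fetch_and_add(&counter, 1);
    }

    void atomic_increment_asm(void)
    {
        /* the same thing spelled out as inline assembly */
        __asm__ __volatile__("lock incl %0" : "+m"(counter) : : "memory");
    }
The lock prefix is what makes the read-modify-write indivisible, and it is also what makes the instruction expensive when the line is contended.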
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Interlock clusters

Post by bob »

hgm wrote:Sure, for hash storing the XOR trick is great. But I am also interested in having locks for more critical tasks.

I am still a bit puzzled by this. On the Pentium M I measured the latency of XCHG REG,MEM. It was 20 clocks, which is pretty bad for an L1 cache hit, but not completely lethal. It does seem to cause a pipeline stall, though, as I could not really execute other instructions in parallel.

But if I try it on a new memory variable all the time, which is not cached, it seems to take about 90 clocks more than incrementing that variable. This even happens when I do the increment first, and then the XCHG, even though I would have thought that the increment would be enough to mark the cache line as MODIFIED, and the test on the cached memory variable shows that in principle a 20-cycle XCHG on such a MODIFIED variable is possible.

Perhaps it is an out-of-order problem, where the XCHG is in practice executed before the INC.

There might be a huge inefficiency in the implementation of this; in principle an XCHG on a MODIFIED L1-cached variable should not be any more difficult than, say, an INC (as L1 is never a shared cache). I read that the new processors would increase the efficiency of semaphores, which indeed suggests that this was a known inefficiency in older CPUs.
The problem is that xchg doesn't use cache. It is, by definition, an instruction that goes directly to memory, so that we can do the read of the old value, write of the new value, all in one "atomic" operation. It is horrifically slow. Modern spinlocks use a "shadow lock" to avoid completely blocking memory access for processors when one would be spinning on an exchange.

The idea is this:

Code: Select all

    while (exchange(lock, 1))
        while (lock);

The exchange is slow: it locks the bus, fetches the old value, and writes the new value back. If the old value was 1, we then spin on a normal memory load, which gets cached in L2/L1 as usual, so we spin like mad without touching the bus. When another processor clears the lock, our cache snoops the write and either invalidates the cache block and re-reads it from memory (pre-MOESI caches) or gets the new value directly from the owner (recent processors). That zero can't be trusted to last, but the "while (lock)" above now falls through, and we go back to the outer while exchange(lock,1) and atomically try to read a zero and write a 1. If we succeed, the outer while fails (exchange() returns 0) and we fall into the critical section. If several processors are spinning, only one gets the zero; the rest get a one and go back to spinning on cache contents.

If you remove that second while (lock) operation, the machine slows to an absolute crawl, because the "spinners" keep the bus locked almost 100% of the time, so that the processor that is actually trying to do something important and then exit the critical section can't get to memory for instructions or data and it crawls along...

xchg() is atomic by definition. Other instructions like inc, etc can be made atomic by using the lock prefix on the instruction(s)...
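
To make the "shadow lock" idea above concrete, here is a minimal C sketch, assuming GCC's __sync builtins (on x86, __sync_lock_test_and_set compiles to an XCHG). This is an illustration of the pattern described above, not code taken from any particular engine:

Code: Select all

    /* Test-and-test-and-set spinlock, as described above (sketch).
       The builtin below is the expensive XCHG; the inner while() spins
       on the locally cached copy and stays off the bus. */
    typedef volatile int spinlock_t;

    static void spin_lock(spinlock_t *lock)
    {
        while (__sync_lock_test_and_set(lock, 1))   /* atomic exchange */
            while (*lock)                           /* spin in cache   */
                ;
    }

    static void spin_unlock(spinlock_t *lock)
    {
        __sync_lock_release(lock);                  /* plain store of 0 */
    }
Removing the inner while (*lock) gives exactly the pathological bus saturation described above.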
User avatar
hgm
Posts: 27811
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Interlock clusters

Post by hgm »

This must be a matter of very inefficient implementation, then:

when the CPU holds the only valid copy of the variable in its cache, reading the memory (which holds obsolete information) is a pure waste of time. And at some point the CPU would have to read the cache anyway, or the instruction would not even work as described.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Interlock clusters

Post by bob »

hgm wrote:This must be a matter of very inefficient implementation, then:

when the CPU holds the only valid copy of the variable in its cache, reading the memory (which holds obsolete information) is a pure waste of time. And at some point the CPU would have to read the cache anyway, or the instruction would not even work as described.
It is by design, otherwise a threaded application would be impossible. I do a read, you do a read, I do a write, you do a write. That won't work for a lock, because if we both do the read, we both get zero, then we both write a 1 back, and since that is defined as a "race" by the designers, it doesn't matter in what order the writes are done. But we both acquired the lock, and that would break everything, from the kernel on up.

So the xchg instruction, when one operand is in memory, is a bear because it locks the bus, which means _no_ other access by other caches/processors, then it reads the old value, and writes the new value, before the bus is released. Cache hits can't happen. You can use the lock prefix to make other memory modification instructions behave the same way, such as "inc" and "dec" for counting semaphores...
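
As an aside, here is a hedged sketch of the counting-semaphore use of lock-prefixed inc/dec mentioned above, again with GCC's __sync builtins (the function names are made up for illustration):

Code: Select all

    /* Spin-style counting semaphore built on LOCK-prefixed add/sub
       (sketch only; suitable for very short waits). */
    static void sem_down(volatile int *s)
    {
        for (;;) {
            if (__sync_fetch_and_sub(s, 1) > 0)  /* old value > 0: got a unit */
                return;
            __sync_fetch_and_add(s, 1);          /* undo the decrement        */
            while (*s <= 0)                      /* spin in cache until free  */
                ;
        }
    }

    static void sem_up(volatile int *s)
    {
        __sync_fetch_and_add(s, 1);              /* release one unit */
    }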
User avatar
hgm
Posts: 27811
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Interlock clusters

Post by hgm »

bob wrote:It is by design, otherwise a threaded application would be impossible. I do a read, you do a read, I do a write, you do a write. That won't work for a lock, because if we both do the read, we both get zero, then we both write a 1 back, and since that is defined as a "race" by the designers, it doesn't matter in what order the writes are done. But we both acquired the lock, and that would break everything, from the kernel on up.
This is a highly simplified picture, as it ignores various levels of caching. No CPU I know of shares L1 caches. So the cache coherence protocol comes in. If "I do a read" on modified data in my cache, there is no way "you do a read" as well. You don't have the data; the only valid copy is in my L1 cache, to which you have no access. The worst your read can do is send me a request to flush the data to a higher level of caching, or to main memory, and to make the XCHG atomic, the only thing the CPU has to do is hold off the actual flushing until both my load and store are executed. Which does not need to cause any delay if that request does not come.

Note that an XCHG REG,MEM, under some conditions where I tested it, had a latency of less than 20 clocks (measured on Pentium M), which is so short that it is hard to believe any bus activity was involved. This was in the loop

Code: Select all

    xorl %eax, %eax
    movl %eax, data(%eax)
L1:
    xchgl %eax, data(%eax)
    decl %ebx
    jne L1
So this exchanges data at a fixed address, which will of course be brought into cache immediately. It seems a pipeline stall is involved, as adding other ALU instructions to the loop immediately adds to its duration.

I guess one problem is that in OOO machines, STORE operations have to be postponed until after the uOp responsible for them has retired. This is the only way to guarantee that they are not speculative and will not have to be retracted because of a branch misprediction. Probably they implemented this by blocking the input to the pipeline at the decoder as soon as the xchg is detected, until everything before the xchg retires (so it is sure that it is not speculative), and only then starting execution of the xchg, as a cache probe followed by modification of the cache line in the next clock (to prevent a snoop cycle from sneaking in between the two).
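
For reference, here is a hedged sketch of how such a latency measurement could be set up with the time-stamp counter, assuming GCC on x86 and the __rdtsc() intrinsic from <x86intrin.h>; the loop body mirrors the one quoted above, and the iteration count is arbitrary:

Code: Select all

    /* Rough XCHG REG,MEM latency measurement (sketch).  Loop overhead and
       the lack of serialization around RDTSC make this approximate. */
    #include <stdio.h>
    #include <x86intrin.h>

    #define ITER 1000000UL

    static volatile int data;

    int main(void)
    {
        unsigned long long t0, t1;
        int v = 0;

        t0 = __rdtsc();
        for (unsigned long i = 0; i < ITER; i++)
            __asm__ __volatile__("xchgl %0, %1"      /* XCHG REG,MEM */
                                 : "+r"(v), "+m"(data));
        t1 = __rdtsc();

        printf("about %.1f clocks per xchg\n", (double)(t1 - t0) / ITER);
        return 0;
    }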
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Interlock clusters

Post by bob »

hgm wrote:
bob wrote:It is by design, otherwise a threaded application would be impossible. I do a read, you do a read, I do a write, you do a write. That won't work for a lock, because if we both do the read, we both get zero, then we both write a 1 back, and since that is defined as a "race" by the designers, it doesn't matter in what order the writes are done. But we both acquired the lock, and that would break everything, from the kernel on up.
This is a highly simplified picture, as it ignores various levels of caching. No CPU I know of shares L1 caches. So the cache coherence protocol comes in. If "I do a read" on modified data in my cache, there is no way "you do a read" as well. You don't have the data; the only valid copy is in my L1 cache, to which you have no access. The worst your read can do is send me a request to flush the data to a higher level of caching, or to main memory, and to make the XCHG atomic, the only thing the CPU has to do is hold off the actual flushing until both my load and store are executed. Which does not need to cause any delay if that request does not come.
OK, to try again. The lock is set to zero. Not in either cache. Now what? The "xchg" semantics says to (a) read the old value, (b) write a new value.

How do you make that work? Both do a read, it goes to both caches. now both do a write. But it is too late as both got the original value of zero, and think they "acquired" the lock. Both write a 1 back, but this is no longer an "atomic" operation and is useless as a "lock".

That's why we lock the bus, and read/write memory, and then unlock the bus. Now only one can get a zero value, and then write a 1 back, and now the lock works as expected.


Note that an XCHG REG,MEM, under some conditions where I tested it, had a latency of less than 20 clocks (measured on Pentium M), which is so short that it is hard to believe any bus activity was involved. This was in the loop
Single core? It may well work differently there as there is no "lock" mechanism needed although I would be surprised if that were true. But on a dual core, I don't see how one could do it otherwise. There are potential enhancements with a single chip shared cache one might try, but this is not only used in single-chip multiple-core boxes, and it has to work for all.

Code: Select all

    xorl %eax, %eax
    movl %eax, data(%eax)
L1:
    xchgl %eax, data(%eax)
    decl %ebx
    jne L1
So this exchanges data at a fixed address, which will of course be brought into cache immediately. It seems a pipeline stall is involved, as adding other ALU instructions to the loop immediately adds to its duration.

I guess one problem is that in OOO machines, STORE operations have to be postponed until after the uOp responsible for them has retired. This is the only way to guarantee that they are not speculative and will not have to be retracted because of a branch misprediction. Probably they implemented this by blocking the input to the pipeline at the decoder as soon as the xchg is detected, until everything before the xchg retires (so it is sure that it is not speculative), and only then starting execution of the xchg, as a cache probe followed by modification of the cache line in the next clock (to prevent a snoop cycle from sneaking in between the two).
Note that this is a special case of the lock prefix that can be applied to other instructions. The CPU has a "lock" pin that it exerts when an xchg reg, mem operation is done. This prevents any other memory accesses by other processors to make this operation atomic. Once the write half of the xchg is completed, the lock signal is dropped and things go back to normal..

I do seem to recall some sort of trickery to address this in later versions (after the Pentium Pro). I think that there is some caching potential and you might have stumbled onto it. I'll see if I can find the reference to explain exactly how this works. All I have ever really cared about is that it be done absolutely atomically, which probably means that if you do a cache probe with the #LOCK signal exerted, this will cause an invalidate of this cache block in any other cache, which is actually a bit better than dealing directly with memory, but still causes a flurry of MOESI-type cache-to-cache traffic.

I'll track this down and report back...
User avatar
hgm
Posts: 27811
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Interlock clusters

Post by hgm »

bob wrote: OK, to try again. The lock is set to zero. Not in either cache.
This was _not_ the case I was discussing. Yes, I understand that bus activity is required when the data was not cached. This is also true for non-locked bus cycles...
Now what? The "xchg" semantics says to (a) read the old value, (b) write a new value.

How do you make that work? Both do a read, it goes to both caches. now both do a write. But it is too late as both got the original value of zero, and think they "acquired" the lock. Both write a 1 back, but this is no longer an "atomic" operation and is useless as a "lock".
This is not how it works. The first one that does a read gets the cache line in an EXCLUSIVE state. When the second one reads, the first one will be aware of it because it is snooping the bus, and the state is altered to SHARED. The second one acquires the cache line in a SHARED state to begin with, as it is informed by the first one, during the memory read, that it already had it.

Now they both have the lock in cache, but that does not mean they have acquired it yet. For that they must have the lock in a register. Doing the XCHG requires a write to the cache line. This write to a SHARED cache line brings it into the MODIFIED state (whether it was an XCHG or a normal unlocked STORE ), and to make that transition the CPU performs an INVALIDATE cycle on its external bus. The second CPU snoops this INVALIDATE cycle, and as a result makes the formerly SHARED cache line now an INVALID one. So when its XCHG instruction now tries to acquire the lock, it produces a cache miss, as INVALID cache lines cannot produce hits. This way it will discover that the lock has been snatched from under its nose.

The point is that all this works exactly the same for normal stores (e.g. MOV REG,MEM) as for the XCHG instruction. The cache-coherence logic solves, in hardware, the race condition for invalidating a shared cache line when both CPUs attempt to write it at the same time. It must, for otherwise you would already get errors when CPU #1 tries to write one word of the cache line (w1), and CPU #2 another word (w2) in that same line, without ever touching each other's memory location. The one who flushes its cache line last would then obliterate the write of the other. And this could not be explained as being due to the unspecified order in which instructions on different CPUs execute, as CPU #2 did not execute any instructions that could alter w1 at all. It would mean the caching is no longer transparent. And Intel _guarantees_ that it is, and that there is no corruption of memory data to which you don't write, no matter how many CPUs are sharing the memory bus.

So it is always very expensive to do a write to a SHARED cache line (locked or unlocked), as it must trigger an external INVALIDATE cycle that the competing CPUs can snoop. Writing to an EXCLUSIVE or MODIFIED cache line does not involve such extra cost. And for implementing the XCHG as atomic, the latter cases only require that INVALIDATE attempts by the competing CPUs cannot sneak in between its LOAD and STORE parts.
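
To summarize the rules described above, here is a toy C model of what a local write does to a line's MESI state; this is only an illustration of the transitions as described in this thread, not of any particular CPU's implementation:

Code: Select all

    /* Toy MESI model of a local write, following the description above.
       "bus_invalidate" reports whether an external invalidate /
       read-for-ownership cycle had to be broadcast. */
    #include <stdbool.h>

    enum mesi { INVALID, SHARED, EXCLUSIVE, MODIFIED };

    enum mesi local_write(enum mesi s, bool *bus_invalidate)
    {
        switch (s) {
        case MODIFIED:                       /* already ours and dirty */
        case EXCLUSIVE:                      /* ours, clean: go dirty  */
            *bus_invalidate = false;
            return MODIFIED;
        case SHARED:                         /* others may hold a copy */
        case INVALID:                        /* miss: fetch to own it  */
        default:
            *bus_invalidate = true;          /* must invalidate them   */
            return MODIFIED;
        }
    }
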
That's why we lock the bus, and read/write memory, and then unlock the bus. Now only one can get a zero value, and then write a 1 back, and now the lock works as expected.


Note that an XCHG REG,MEM, under some conditions where I tested it, had a latency of less than 20 clocks (measured on Pentium M), which is so short that it is hard to believe any bus activity was involved. This was in the loop
Single core? It may well work differently there as there is no "lock" mechanism needed although I would be surprised if that were true. But on a dual core, I don't see how one could do it otherwise. There are potential enhancements with a single chip shared cache one might try, but this is not only used in single-chip multiple-core boxes, and it has to work for all.
Btw, I now also timed the XCHG latency on a Core 2 Duo, (E6600), which truly has multiple cores. There it takes 25 clocks in the loop

Code: Select all

L1:
    xchgl %eax, _mem
    decl %ebx
    jne L1
I still think the most likely explanation for this is that to guarantee nothing can sneak in between the load and the store, it has to delay the load to take place at a time very close to the store, meaning it can only be done after the instruction retires. No external bus cycles are needed, as after the first iteration of the loop the _mem variable will be cached in the MODIFIED state. So only the cache access has to be locked between LOAD and STORE, and in principle this requires only a lock for the duration of a single clock cycle.

Code: Select all

    xorl %eax, %eax
    movl %eax, data(%eax)
L1:
    xchgl %eax, data(%eax)
    decl %ebx
    jne L1
So this exchanges data at a fixed address, which will of course be brought into cache immediately. It seems a pipeline stall is involved, as adding other ALU instructions to the loop immediately adds to its duration.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Interlock clusters

Post by bob »

hgm wrote:
bob wrote: OK, to try again. The lock is set to zero. Not in either cache.
This was _not_ the case I was discussing. Yes, I understand that bus activity is required when the data was not cached. This is also true for non-locked bus cycles...
Now what? The "xchg" semantics says to (a) read the old value, (b) write a new value.

How do you make that work? Both do a read, it goes to both caches. now both do a write. But it is too late as both got the original value of zero, and think they "acquired" the lock. Both write a 1 back, but this is no longer an "atomic" operation and is useless as a "lock".
This is not how it works. The first one that does a read gets the cache line in an EXCLUSIVE state. When the second one reads, the first one will be aware of it because it is snooping the bus, and the state is altered to SHARED. The second one acquires the cache line in a SHARED state to begin with, as it is informed by the first one, during the memory read, that it already had it.
This is _also_ not how it works. That second read is a killer, because both now have a value of zero. No good.

But as I mentioned later in the post, I had remembered that Intel came up with some trickery somewhere either with the P6 (pentium pro) or sometime after that.

The #LOCK signal from the CPU puts the cache into a "locked state". The lock is most likely _not_ in the cache, because other processors are modifying it as well, so if it is there, it is marked invalid and has to be fetched. When Intel went to MOESI, they eliminated the old write-invalidate approach, so what happens today, after reading everything I could find, is that the #LOCK signal from the CPU notifies the local cache that this is a LOCKed request. It issues a request for this data, which may come from another cache or from memory. If another cache has the value, it sends it and invalidates its current copy. If nobody has it, the current cache goes to memory. But it maintains the "locked state". If any other cache tries to read this address, the current cache issues a "hold" (do not read from memory, I have it and will send it to you) and maintains that "hold but inactive" state until the CPU drops the #LOCK signal to signify that the read/modify cycle is completed. Now this cache can send it to someone else that wants it.

So it is exactly the same effect as the original "lock the bus" approach but is more efficient and with just one processor, the overhead disappears completely...

Now they both have the lock in cache, but that does not mean they have acquired it yet. For that they must have the lock in a register. Doing the XCHG requires a write to the cache line.
Yes, but that's the issue. The xchg turns into two micro-ops. A load, and a store. Without the #LOCK mechanism, this falls apart as there is a window between the two that one could drive a small vehicle through and wreck things.
This write to a SHARED cache line brings it into the MODIFIED state
Actually "OWNER" state since that cache knows it was not exclusive (it was shared) , and SHARED + MODIFIED = OWNER, I'm the one that is responsible for writing it back to memory if I replace that block, or I am responsible for sending it to other caches if they request it from memory because I have the current value..
(whether it was an XCHG or a normal unlocked STORE ), and to make that transition the CPU performs an INVALIDATE cycle on its external bus. The second CPU snoops this INVALIDATE cycle, and as a result makes the formerly SHARED cache line now an INVALID one. So when its XCHG instruction now tries to acquire the lock, it produces a cache miss, as INVALID cache lines cannot produce hits. This way it will discover that the lock has been snatched from under its nose.
Without the #LOCK signal, this does not work. Both can read it at the same time. And then write it. Just as if we did this:

CPUa:
mov eax, lock
mov lock, 1

CPUb
mov eax, lock
mov lock, 2

After the above, the lock can contain either a 1 or a 2. CPUa will have either a zero (assuming lock starts at zero) or a 2. CPUb will have either a 0 or a 1. That was the reason for the older approach of bypassing cache completely when a lock prefix was used, and it is still the reason the cpu has the #LOCK signal it can exert to obtain exclusive access to the address required.
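
A hedged illustration of the difference: if the acquire is done with an atomic exchange (the #LOCK machinery), only one of the competing threads can ever observe the old value 0. The thread count and names below are invented for the example; compile with -pthread:

Code: Select all

    /* Only one thread "wins" an atomic exchange on a 0 lock (sketch). */
    #include <pthread.h>
    #include <stdio.h>

    static volatile int lock;
    static volatile int winners;

    static void *try_acquire(void *arg)
    {
        (void)arg;
        if (__sync_lock_test_and_set(&lock, 1) == 0)  /* LOCKed XCHG */
            __sync_fetch_and_add(&winners, 1);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[8];
        for (int i = 0; i < 8; i++)
            pthread_create(&t[i], NULL, try_acquire, NULL);
        for (int i = 0; i < 8; i++)
            pthread_join(t[i], NULL);
        printf("threads that acquired the lock: %d\n", winners);  /* always 1 */
        return 0;
    }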

The point is that all this works exactly the same for normal stores (e.g. MOV REG,MEM) as for the XCHG instruction. The cache-coherence logic solves, in hardware, the race condition for invalidating a shared cache line when both CPUs attempt to write it at the same time. It must, for otherwise you would already get errors when CPU #1 tries to write one word of the cache line (w1), and CPU #2 another word (w2) in that same line, without ever touching each other's memory location. The one who flushes its cache line last would then obliterate the write of the other. And this could not be explained as being due to the unspecified order in which instructions on different CPUs execute, as CPU #2 did not execute any instructions that could alter w1 at all. It would mean the caching is no longer transparent. And Intel _guarantees_ that it is, and that there is no corruption of memory data to which you don't write, no matter how many CPUs are sharing the memory bus.
That's essentially irrelevant. If two CPUs write to the _same_ word, which is what we are talking about here, there is no semantic definition as to which write occurs first. Even if one is executed nanoseconds before the other. It may well be that neither has it in cache. Races are a natural part of sharing memory among threads, and locks are a natural part of controlling that. And the #LOCK from the CPU is the key, otherwise none of this could work, and the #LOCK signal introduces some significant overhead if you actually have multiple caches because once a cache is blocked waiting on memory, it is unable to read anything else from memory either, so the penalty hurts on real multiple-chip (or non-shared-cache single-chip) processors...

So it is always very expensive to do a write to a SHARED cache line (locked or unlocked), as it must trigger an external INVALIDATE cycle that the competing CPUs can snoop .
I do not believe the "INVALIDATE" is a "snooped" issue. Snooping involves watching the bus to see what is being read/written. MOESI caches intentionally tell each other things such as "your copy is invalid" or "I am the owner, I will send it to you directly"
Writing to an EXCLUSIVE or MODIFIED cache line does not involve such extra cost. And for implementing the XCHG as atomic, the latter cases only require that INVALIDATE attempts by the competing CPUs cannot sneak in between its LOAD and STORE parts.
We agree on that point, enabled by the #LOCK signal being exerted by the CPU. But further, once a CPU discovers that the cache line is either shared or "owner" it has to do something it would normally not have to do, namely invalidate everybody's copy even though we are only doing a read at the moment, since the #LOCK demands that...
That's why we lock the bus, and read/write memory, and then unlock the bus. Now only one can get a zero value, and then write a 1 back, and now the lock works as expected.


Note that an XCHG REG,MEM, under some conditions where I tested it, had a latency of less than 20 clocks (measured on Pentium M), which is so short that it is hard to believe any bus activity was involved. This was in the loop
Single core? It may well work differently there as there is no "lock" mechanism needed although I would be surprised if that were true. But on a dual core, I don't see how one could do it otherwise. There are potential enhancements with a single chip shared cache one might try, but this is not only used in single-chip multiple-core boxes, and it has to work for all.
Btw, I now also timed the XCHG latency on a Core 2 Duo, (E6600), which truly has multiple cores. There it takes 25 clocks in the loop

Code: Select all

L1:

    xchgl %eax, _mem
    decl %ebx
    jne L1
I still think the most likely explanation for this is that to guarantee nothing can sneak in between the load and the store, it has to delay the load to take place at a time very close to the store, meaning it can only be done after the instruction retires. No external bus cycles are needed, as after the first iteration of the loop the _mem variable will be cached in the MODIFIED state. So only the cache access has to be locked between LOAD and STORE, and in principle this requires only a lock for the duration of a single clock cycle.


I suspect there is more, even there. If you wait until the xchg() retires, you have to do the read/write back to back, all the while maintaining the #LOCK signal so that this becomes an indivisible operation. And I am not sure how this will affect speculative execution. But it certainly means that for whatever period of time it takes to pull off the read and then the write, nothing else can happen with that cache block, nor any other processor wanting that cache block or anything else that they don't currently have..

Code: Select all

    xorl %eax, %eax
    movl %eax, data(%eax)
L1:
    xchgl %eax, data(%eax)
    decl %ebx
    jne L1
So this exchanges data at a fixed address, which will of course be brought into cache immediately. It seems a pipeline stall is involved, as adding other ALU instructions to the loop immediately adds to its duration.
I would agree although details are not clear and Intel's info is sketchy at best...
User avatar
hgm
Posts: 27811
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Interlock clusters

Post by hgm »

bob wrote:This is _also_ not how it works. That second read is a killer, because both now have a value of zero. No good.

But as I mentioned later in the post, I had remembered that Intel came up with some trickery somewhere either with the P6 (pentium pro) or sometime after that.

The #LOCK signal from the CPU puts the cache into a "locked state". The lock is most likely _not_ in the cache, because other processors are modifying it as well, so if it is there, it is marked invalid and has to be fetched. When Intel went to MOESI,
Well, it could be that my explanation of events was a bit Pentium-M colored, but according to my info (e.g. http://www.intel.com/technology/itj/200 ... _art02.pdf ) Intel was still using MESI, rather than MOESI, in Core Solo / Duo. (In fact I get no hits for "MOESI" at all, when I search developer.intel.com...) And even if the Pentium-M only had a single core per chip, it was still designed to work in multi-CPU systems, e.g. on motherboards with multiple sockets. And there the "Read-For-Ownership" cycle invalidated the line in other caches, as I described.
they eliminated the old write-invalidate approach, so what happens today, after reading everything I could find, is that the #LOCK signal from the CPU notifies the local cache that this is a LOCKed request. It issues a request for this data, which may come from another cache or from memory. If another cache has the value, it sends it and invalidates its current copy. If nobody has it, the current cache goes to memory. But it maintains the "locked state". If any other cache tries to read this address, the current cache issues a "hold" (do not read from memory, I have it and will send it to you) and maintains that "hold but inactive" state until the CPU drops the #LOCK signal to signify that the read/modify cycle is completed. Now this cache can send it to someone else that wants it.
You still focus too much on the case where both CPUs have valid or invalid copies of the lock, and this is not the case that interests me. I would like to design my SMP algorithm on locks that are not normally accessed by more than one CPU, and are only there "just in case", in the rare event that they are.

Normally, such a lock would be cached in the E or M state, even in MOESI protocol, and not in S or O, as no other CPU has it in cache. I suppose these CPUs avoid putting cache lines in the S or O state if they can avoid it by using E or M state, otherwise the latter need not even exist. So when I do an XCHG on a word in an E or M cache line, what exactly would have to be done that causes a slowdown, according to you? The CPU knows that the other CPUs don't have the line in cache, and are not in a position to get the lock. Especially for the M state, where they can only obtain the lock if I first give it to them, as I have the only valid copy in existence! I would say: "just not give it to them when they ask". That could delay them, and only when they ask, but it should never delay me...
So it is exactly the same effect as the original "lock the bus" approach but is more efficient and with just one processor, the overhead disappears completely...

...

Actually "OWNER" state since that cache knows it was not exclusive (it was shared) , and SHARED + MODIFIED = OWNER, I'm the one that is responsible for writing it back to memory if I replace that block, or I am responsible for sending it to other caches if they request it from memory because I have the current value..


Without the #LOCK signal, this does not work. Both can read it at the same time. And then write it. Just as if we did this:

CPUa:
mov eax, lock
mov lock, 1

CPUb
mov eax, lock
mov lock, 2

After the above, the lock can contain either a 1 or a 2. CPUa will have either a zero (assuming lock starts at zero) or a 2. CPUb will have either a 0 or a 1. That was the reason for the older approach of bypassing cache completely when a lock prefix was used, and it is still the reason the cpu has the #LOCK signal it can exert to obtain exclusive access to the address required.
Sure, you have to lock something to make sure no other read can sneak in between your own read and write. The only point I wanted to make is that "sneaking in" can mean entirely different things depending on the situation. Is the lock in a shared memory or cache, or is it in a private cache so that one has to sneak in by snooping? I am pretty sure the caches are not bypassed; the 25 clocks I measure on the Core 2 Duo for the XCHG REG,MEM is way too short to go through memory. So data from the cache must be used.
I suspect there is more, even there. If you wait until the xchg() retires, you have to do the read/write back to back, all the while maintaining the #LOCK signal so that this becomes an indivisible operation. And I am not sure how this will affect speculative execution. But it certainly means that for whatever period of time it takes to pull off the read and then the write, nothing else can happen with that cache block, nor any other processor wanting that cache block or anything else that they don't currently have..
True, but if the cache block is part of L1, this can be pretty fast. And again, it is only cache snoops that are coming from the outside that will have to wait until the write completes. The CPU itself will get the data by store forwarding from the store queue, even when it would read before the write finally delivers its data to the cache memory array. So there is no reason for the CPU that issued the XCHG to wait very long.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Interlock clusters

Post by bob »

hgm wrote:
bob wrote:This is _also_ not how it works. That second read is a killer, because both now have a value of zero. No good.

But as I mentioned later in the post, I had remembered that Intel came up with some trickery somewhere either with the P6 (pentium pro) or sometime after that.

The #LOCK signal from the CPU puts the cache into a "locked state". The lock is most likely _not_ in the cache, because other processors are modifying it as well, so if it is there, it is marked invalid and has to be fetched. When Intel went to MOESI,
Well, it could be that my explanation of events was a bit Pentium-M colored, but according to my info (e.g. http://www.intel.com/technology/itj/200 ... _art02.pdf ) Intel was still using MESI, rather than MOESI, in Core Solo / Duo. (In fact I get no hits for "MOESI" at all, when I search developer.intel.com...) And even if the Pentium-M only had a single core per chip, it was still designed to work in multi-CPU systems, e.g. on motherboards with multiple sockets. And there the "Read-For-Ownership" cycle invalidated the line in other caches, as I described.
I don't ever find a lot of details on Intel that are very specific in cases like this. I was told about the MOESI stuff a couple of years ago when Eugene Nalimov and I were involved in trying to find an NPS scaling issue that did not happen on the Opterons, but which did happen on the Xeons of the period, although I have no idea today which processor that was.

The primary difference is that MOESI does away with the invalidate/write approach. I have an exclusive copy and modify it. I don't have to do anything. I have a shared copy and modify it, I have to tell everyone else their copy is invalid. If I then snoop someone trying to read that data, I have to make them hold, while I write it back to memory so that they can read it. MOESI gets rid of that write to memory / read from memory inefficiency...

Otherwise things work the same way as far as the program running is concerned, just more efficient with MOESI than with MESI...
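
The difference can be sketched as a toy rule for what happens when another CPU reads a line we hold MODIFIED; again this is just an illustration of the behaviour described above, not of real hardware:

Code: Select all

    /* Toy model: snoop of a remote READ hitting our MODIFIED line. */
    #include <stdbool.h>

    enum state { INVALID, SHARED, EXCLUSIVE, MODIFIED, OWNED };

    enum state remote_read_hits_modified(bool moesi,
                                         bool *memory_write,
                                         bool *cache_to_cache)
    {
        if (moesi) {                  /* MOESI: supply the data directly,  */
            *memory_write   = false;  /* keep the dirty copy as OWNER      */
            *cache_to_cache = true;
            return OWNED;
        } else {                      /* MESI: write back to memory first, */
            *memory_write   = true;   /* then both end up SHARED           */
            *cache_to_cache = false;
            return SHARED;
        }
    }
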
they eliminated the old write-invalidate approach, so what happens today, after reading everything I could find, is that the #LOCK signal from the CPU notifies the local cache that this is a LOCKed request. It issues a request for this data, which may come from another cache or from memory. If another cache has the value, it sends it and invalidates its current copy. If nobody has it, the current cache goes to memory. But it maintains the "locked state". If any other cache tries to read this address, the current cache issues a "hold" (do not read from memory, I have it and will send it to you) and maintains that "hold but inactive" state until the CPU drops the #LOCK signal to signify that the read/modify cycle is completed. Now this cache can send it to someone else that wants it.
You still focus too much on the case where both CPUs have valid or invalid copies of the lock, and this is not the case that interests me. I would like to design my SMP algorithm on locks that are not normally accessed by more than one CPU, and are only there "just in case", in the rare event that they are.
This case I am focusing on is the case that is most important, the case where there _is_ contention for a resource. If it is rarely used, there are other lockless approaches that can be used with less performance loss.


Normally, such a lock would be cached in the E or M state, even in MOESI protocol, and not in S or O, as no other CPU has it in cache. I suppose these CPUs avoid putting cache lines in the S or O state if they can avoid it by using E or M state, otherwise the latter need not even exist. So when I do an XCHG on a word in an E or M cache line, what exactly would have to be done that causes a slowdown, according to you? The CPU knows that the other CPUs don't have the line in cache, and are not in a position to get the lock. Especially for the M state, where they can only obtain the lock if I first give it to them, as I have the only valid copy in existence! I would say: "just not give it to them when they ask". That could delay them, and only when they ask, but it should never delay me...
For clarity, "O" is "S" + "M" if you use MESI. That is, you are the one responsible for writing it back to memory to maintain coherency, should you choose to discard it from your cache, and you are also responsible for sending a copy to any cache that tries to read it from memory, bypassing memory completely. Otherwise, in MESI, if you are in S and modify it, you change to M but invalidate everyone else's copy, and if they ask for it, you have to write it back to memory and let them pick it up from there, which is inefficient.

The "S" state is imprecise anyway. All it means is that at some instant in time, the value was in two caches at the same time. It doesn't mean it is still there as one could discard it. But once in the S state, any write to it automatically produces the "cache invalidate" broadcast in either protocol, it is just what happens afterward that is different.


So it is exactly the same effect as the original "lock the bus" approach but is more efficient and with just one processor, the overhead disappears completely...

...

Actually "OWNER" state since that cache knows it was not exclusive (it was shared) , and SHARED + MODIFIED = OWNER, I'm the one that is responsible for writing it back to memory if I replace that block, or I am responsible for sending it to other caches if they request it from memory because I have the current value..


Without the #LOCK signal, this does not work. Both can read it at the same time. And then write it. Just as if we did this:

CPUa:
mov eax, lock
mov lock, 1

CPUb
mov eax, lock
mov lock, 2

After the above, the lock can contain either a 1 or a 2. CPUa will have either a zero (assuming lock starts at zero) or a 2. CPUb will have either a 0 or a 1. That was the reason for the older approach of bypassing cache completely when a lock prefix was used, and it is still the reason the cpu has the #LOCK signal it can exert to obtain exclusive access to the address required.
Sure, you have to lock something to make sure no other read can sneak in between your own read and write. The only point I wanted to make is that "sneaking in" can mean entirely different things depending on the situation. Is the lock in a shared memory or cache, or is it in a private cache so that one has to sneak in by snooping? I am pretty sure the caches are not bypassed; the 25 clocks I measure on the Core 2 Duo for the XCHG REG,MEM is way too short to go through memory. So data from the cache must be used.
I do not think caches are bypassed any longer either, at least on Intel. Other architectures are different.
I suspect there is more, even there. If you wait until the xchg() retires, you have to do the read/write back to back, all the while maintaining the #LOCK signal so that this becomes an indivisible operation. And I am not sure how this will affect speculative execution. But it certainly means that for whatever period of time it takes to pull off the read and then the write, nothing else can happen with that cache block, nor any other processor wanting that cache block or anything else that they don't currently have..
True, but if the cache block is part of L1, this can be pretty fast. And again, it is only cache snoops that are coming from the outside that will have to wait until the write completes. The CPU itself will get the data by store forwarding from the store queue, even when it would read before the write finally delivers its data to the cache memory array. So there is no reason for the CPU that issued the XCHG to wait very long.
I'm not worried about the CPU that issued the xchg having to wait. The problem is all the CPUs that are now trying to acquire the lock. If they blast away with xchg instructions, things crawl. That's what led to the current "shadow lock" spinlock approach most everyone uses (if using Intel/AMD hardware, of course; others are completely different). This could be an issue in the one-lock-per-hash-entry approach.
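
Finally, for reference, a hedged sketch of the "XOR trick" lockless hash store mentioned earlier in the thread: the signature is stored XORed with the data word, so an entry torn by two unsynchronized writers simply fails validation on the next probe instead of needing a per-entry lock. The entry layout and names are illustrative only:

Code: Select all

    /* Lockless hashing via the XOR trick (sketch).  The two 64-bit
       stores are not atomic together, but a torn entry will not pass
       the signature check, so no lock is needed. */
    #include <stdint.h>

    typedef struct { uint64_t sig; uint64_t data; } hash_entry;

    void hash_store(hash_entry *e, uint64_t key, uint64_t data)
    {
        e->sig  = key ^ data;     /* signature folded with the data */
        e->data = data;
    }

    int hash_probe(const hash_entry *e, uint64_t key, uint64_t *data)
    {
        uint64_t d = e->data;
        if ((e->sig ^ d) != key)  /* torn or foreign entry: treat as a miss */
            return 0;
        *data = d;
        return 1;
    }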