hgm wrote:
bob wrote:
OK, to try again. The lock is set to zero. Not in either cache.
This was _not_ the case I was discussing. Yes, I understand that bus activity is required when the data was not cached. This is also true for non-locked bus cycles...
Now what? The "xchg" semantics says to (a) read the old value, (b) write a new value.
How do you make that work? Both do a read, and it goes into both caches. Now both do a write. But it is too late, as both got the original value of zero and think they "acquired" the lock. Both write a 1 back, but this is no longer an "atomic" operation and is useless as a "lock".
This is not how it works. The first one that does a read gets the cache line in an EXCLUSIVE state. When the second one reads, the first one will be aware of it because it is snooping the bus, and the state is altered to SHARED. The second one acquires the cache line in a SHARED state to begin with, as it is informed by the first one, during the memory read, that it already had it.
This is _also_ not how it works. That second read is a killer, because both now have a value of zero. No good.
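To pin down that failure, here is a minimal C sketch of the unlocked read-then-write sequence, with two POSIX threads standing in for the two CPUs (the variable and function names are illustrative only):

Code: Select all
/* Each thread reads the lock word and then writes 1 back as two separate
 * operations.  Nothing makes the pair indivisible, so on an unlucky
 * interleaving both threads read back 0 and both believe they hold the lock. */
#include <pthread.h>
#include <stdio.h>

static volatile int lock_word = 0;   /* the shared "lock", 0 = free */
static int observed[2];              /* what each thread read */

static void *broken_acquire(void *arg)
{
    int id = *(int *)arg;
    int old = lock_word;             /* load  - first half of a would-be xchg */
    lock_word = 1;                   /* store - second half; the window is here */
    observed[id] = old;              /* old == 0 means "I think I got the lock" */
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    int id[2] = { 0, 1 };
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, broken_acquire, &id[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    printf("thread 0 read %d, thread 1 read %d\n", observed[0], observed[1]);
    return 0;
}

A run that prints two zeros is the useless outcome described above: both writers believe they "acquired" the lock.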
But as I mentioned later in the post, I had remembered that Intel came up with some trickery somewhere, either with the P6 (Pentium Pro) or sometime after that.
The #LOCK signal from the CPU puts the cache into a "locked state". The lock is most likely _not_ in the cache, because other processors are modifying it as well, so if it is there, it is marked invalid and has to be fetched. When Intel went to MOESI, they eliminated the old write-invalidate approach, so what happens today, after reading everything I could find, is that the #LOCK signal from the CPU notifies the local cache that this is a LOCKed request. It issues a request for this data, which may come from another cache or from memory. If another cache has the value, it sends it and invalidates its current copy. If nobody has it, the current cache goes to memory. But it maintains the "locked state". If any other cache tries to read this address, the current cache issues a "hold" (do not read from memory, I have it and will send it to you) and maintains that "hold but inactive" state until the CPU drops the #LOCK signal to signify that the read-modify-write cycle is completed. Now this cache can send it to someone else that wants it.
So it is exactly the same effect as the original "lock the bus" approach but is more efficient and with just one processor, the overhead disappears completely...
Now they both have the lock in cache, but that does not mean they have acquired it yet. For that they must have the lock in a register. Doing the XCHG requires a write to the cache line.
Yes, but that's the issue. The xchg turns into two micro-ops. A load, and a store. Without the #LOCK mechanism, this falls apart as there is a window between the two that one could drive a small vehicle through and wreck things.
This write to a SHARED cache line brings it into the MODIFIED state
Actually "OWNER" state since that cache knows it was not exclusive (it was shared) , and SHARED + MODIFIED = OWNER, I'm the one that is responsible for writing it back to memory if I replace that block, or I am responsible for sending it to other caches if they request it from memory because I have the current value..
(whether it was an XCHG or a normal unlocked STORE), and to make that transition the CPU performs an INVALIDATE cycle on its external bus. The second CPU snoops this INVALIDATE cycle, and as a result makes the formerly SHARED cache line now an INVALID one. So when its XCHG instruction now tries to acquire the lock, it produces a cache miss, as INVALID cache lines cannot produce hits. This way it will discover that the lock has been snatched from under its nose.
Without the #LOCK signal, this does not work. Both can read it at the same time. And then write it. Just as if we did this:
CPUa:
mov eax, lock
mov lock, 1
CPUb:
mov eax, lock
mov lock, 2
After the above, the lock can contain either a 1 or a 2. CPUa will have either a zero (assuming the lock starts at zero) or a 2. CPUb will have either a 0 or a 1. That was the reason for the older approach of bypassing the cache completely when a lock prefix was used, and it is still the reason the CPU has the #LOCK signal it can assert to obtain exclusive access to the address required.
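For contrast, here is a sketch of the same two-writer scenario done with a genuine atomic exchange, using GCC's __atomic_exchange_n (which on x86 compiles to an XCHG with its implied lock); at most one thread can ever read back the initial 0, no matter how the accesses interleave. The names are illustrative:

Code: Select all
/* Thread 0 exchanges 1 into the lock word, thread 1 exchanges 2, mirroring
 * the CPUa/CPUb example.  The two exchanges are serialized by the hardware,
 * so exactly one thread gets the old value 0 and the other gets whatever
 * the first one wrote - never both 0. */
#include <pthread.h>
#include <stdio.h>

static int lock_word = 0;
static int got[2];

static void *atomic_writer(void *arg)
{
    int id = *(int *)arg;
    got[id] = __atomic_exchange_n(&lock_word, id + 1, __ATOMIC_SEQ_CST);
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    int id[2] = { 0, 1 };
    for (int i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, atomic_writer, &id[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    printf("CPUa-like thread got %d, CPUb-like thread got %d\n", got[0], got[1]);
    return 0;
}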
The point is that all this works exactly the same on normal stores (e.g. MOV REG,MEM) as on the XCHG instruction. The cache-coherence logic solves, in hardware, the race condition for invalidating a shared cache line when both CPUs attempt to write it at the same time. It must, for otherwise you would already get errors when CPU #1 tries to write one word of the cache line (w1), and CPU #2 another word (w2) in that same line, without ever touching each other's memory location. The one who flushes its cache line last would then obliterate the write of the other. And this could not be explained away as due to the unspecified order in which instructions on different CPUs execute, as CPU #2 did not execute any instruction that could alter w1 at all. It would mean the caching is no longer transparent. And Intel _guarantees_ that it is, and that there is no corruption of memory data to which you don't write, no matter how many CPUs are sharing the memory bus.
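That transparency guarantee is easy to poke at from software. A small sketch, assuming POSIX threads and that two adjacent ints land in the same cache line: each thread writes only its own word, and neither final value can be lost, however the line bounces between the caches (it just bounces slowly, which is the classic false-sharing penalty):

Code: Select all
/* w1 and w2 are adjacent, so they will (almost certainly) share a cache line.
 * Thread A only ever writes w1, thread B only ever writes w2.  If flushing
 * the line could obliterate the other CPU's word, the final values would
 * sometimes be wrong; coherence guarantees they never are. */
#include <pthread.h>
#include <stdio.h>

static struct { volatile int w1; volatile int w2; } shared;

static void *write_w1(void *arg)
{
    for (int i = 1; i <= 1000000; i++)
        shared.w1 = i;                 /* never touches w2 */
    return NULL;
}

static void *write_w2(void *arg)
{
    for (int i = 1; i <= 1000000; i++)
        shared.w2 = i;                 /* never touches w1 */
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, write_w1, NULL);
    pthread_create(&b, NULL, write_w2, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("w1 = %d, w2 = %d (both should be 1000000)\n", shared.w1, shared.w2);
    return 0;
}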
That's essentially irrelevant. If two CPUs write to the _same_ word, which is what we are talking about here, there is no semantic definition as to which write occurs first, even if one is executed nanoseconds before the other. It may well be that neither has it in cache. Races are a natural part of sharing memory among threads, and locks are a natural part of controlling that. And the #LOCK from the CPU is the key; otherwise none of this could work. The #LOCK signal also introduces some significant overhead if you actually have multiple caches, because once a cache is blocked waiting on memory, it is unable to read anything else from memory either, so the penalty hurts on real multiple-chip processors (or on non-shared caches within a single chip)...
So it is always very expensive to do a write to a SHARED cache line (locked or unlocked), as it must trigger an external INVALIDATE cycle that the competing CPUs can snoop.
I do not believe the "INVALIDATE" is a "snooped" issue. Snooping involves watching the bus to see what is being read/written. MOESI caches intentionally tell each other things such as "your copy is invalid" or "I am the owner, I will send it to you directly".
Writing to an EXCLUSIVE or MODIFIED cache line does not involve such extra cost. And for implementing the XCHG as atomic, the latter cases only require that INVALIDATE attempts by the competing CPUs cannot sneak in between its LOAD and STORE parts.
We agree on that point, enabled by the #LOCK signal being asserted by the CPU. But further, once a CPU discovers that the cache line is either shared or "owner", it has to do something it would normally not have to do, namely invalidate everybody's copy even though we are only doing a read at the moment, since the #LOCK demands that...
That's why we lock the bus, and read/write memory, and then unlock the bus. Now only one can get a zero value, and then write a 1 back, and now the lock works as expected.
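Put together, that is just a test-and-set spinlock. A minimal sketch of the pattern with GCC builtins (the exchange becomes a lock-implied XCHG on x86; the function and variable names are mine):

Code: Select all
/* Only the caller whose exchange returns 0 owns the lock; everybody else
 * spins.  The non-atomic counter increment in the critical section is safe
 * precisely because the lock serializes access to it. */
#include <pthread.h>
#include <stdio.h>

static int lock_word;                 /* 0 = free, 1 = held */
static long counter;                  /* protected by lock_word */

static void spin_acquire(int *lock)
{
    while (__atomic_exchange_n(lock, 1, __ATOMIC_ACQUIRE) != 0)
        ;                             /* someone else holds it: spin */
}

static void spin_release(int *lock)
{
    __atomic_store_n(lock, 0, __ATOMIC_RELEASE);   /* hand it back */
}

static void *worker(void *arg)
{
    for (int i = 0; i < 1000000; i++) {
        spin_acquire(&lock_word);
        counter++;                    /* critical section */
        spin_release(&lock_word);
    }
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("counter = %ld (expect 2000000, no lost updates)\n", counter);
    return 0;
}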
Note that an XCHG REG,MEM under some conditions where I tested it had a latency of less than 20 clocks (measured on a Pentium M), which is so short that it is hard to believe any bus activity was involved. This was in the loop
Single core? It may well work differently there, as there is no "lock" mechanism needed, although I would be surprised if that were true. But on a dual core, I don't see how one could do it otherwise. There are potential enhancements with a single-chip shared cache one might try, but this is not only used in single-chip multiple-core boxes, and it has to work for all.
Btw, I now also timed the XCHG latency on a Core 2 Duo, (E6600), which truly has multiple cores. There it takes 25 clocks in the loop
Code: Select all
L1:
xchgl %eax, _mem
decl %ebx
jne L1
I still think the most likely explanation for this is that to guarantee nothing can sneak in between the load and the store, it has to delay the load to take place at a time very close to the store, meaning it can only be done after the instruction retires. No external bus cycles are needed, as after the first iteration of the loop the _mem variable will be cached in the MODIFIED state. So only the cache access has to be locked between LOAD and STORE, and in principle this requires only a lock for the duration of a single clock cycle.
I suspect there is more, even there. If you wait until the xchg() retires, you have to do the read/write back to back, all the while maintaining the #LOCK signal so that this becomes an indivisible operation. And I am not sure how this will affect speculative execution. But it certainly means that for whatever period of time it takes to pull off the read and then the write, nothing else can happen with that cache block, nor can any other processor get that cache block, or anything else it doesn't currently have...
Code: Select all
xorl %eax, %eax          # eax = 0
movl %eax, data(%eax)    # initialize data[0]; brings the line into the cache
L1:
xchgl %eax, data(%eax)   # atomic exchange with data[0]; eax stays 0
decl %ebx                # loop counter
jne L1
so exchanging data at a fixed address, which will of course be brought into the cache immediately. It seems a pipeline stall is involved, as adding other ALU instructions to the loop immediately adds to its duration.
I would agree, although the details are not clear and Intel's info is sketchy at best...
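For anyone who wants to repeat the timing, here is a rough C harness in the spirit of the loops above, using GCC's __rdtsc from x86intrin.h. The numbers are only ballpark figures: the TSC may tick at a fixed reference rate rather than the core clock, and nothing here serializes the pipeline around the reads:

Code: Select all
/* Per-iteration cost of an atomic exchange against a word that stays hot in
 * this core's cache, analogous to the XCHG loops measured above. */
#include <stdio.h>
#include <x86intrin.h>                /* __rdtsc() */

#define ITERS 1000000UL

int main(void)
{
    static int mem;                   /* the "_mem"/"data" word; cached after first use */
    unsigned long long t0, t1;

    t0 = __rdtsc();
    for (unsigned long i = 0; i < ITERS; i++)
        __atomic_exchange_n(&mem, (int)i, __ATOMIC_SEQ_CST);   /* lock-implied XCHG */
    t1 = __rdtsc();

    printf("about %.1f TSC ticks per exchange\n", (double)(t1 - t0) / ITERS);
    return 0;
}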