SMP hashing problem


Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: SMP hashing problem

Post by Sven »

Maybe the cache is less involved than I initially thought since the two MOVQ instructions mentioned are updating the memory from two 64 bit registers. But I'm not sure.

Sven
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: SMP hashing problem

Post by diep »

bob wrote: [snip]
That's all well and good. But you are one step _beyond_ what I am talking about. In this context the memory controller is moot. That's not where the problem is. The problem lies in the fact that I need to store either two 64-bit words for a hash entry or four 8-byte words for a pawn hash entry. And I can't store 'em in one cycle.

In the simplest case, I do this (AT&T format for Linux asm):

movq %r8, (%rax)
movq %r9, 8(%rax)
The above goes OK if you write it to the same cacheline.
All modern CPUs have 64- or 128-byte cachelines in L2/L3.

128 bytes for the P4 and HPC chips, 64 bytes for the rest,
32 bytes for CPUs from the 90s.

The entire cacheline gets flushed to the memory controller at once.

That is an atomic operation there: other CPUs can either store THEIR 64 bytes or you can store THIS 64 bytes. It can never mix.

All kinds of parallel mechanisms would otherwise mess up, of course, and the bandwidth would be gone.

It can go wrong ONLY on supercomputer chips, when some sort of network interrupt happens. Interrupts on PCs don't mess up the atomic nature of it.

If a flushed 64 bytes is lost because of an interrupt, it is entirely gone, OR it gets written in one atomic operation.

That's how things were explained to me by 4 independent people, 2 of them hardware guys and 2 very good low-level programmers, over the past 6 years.

One of the programmers, whom I shall keep unnamed, wondered why you bothered doing the XOR on a PC processor.

So the first thing I figured out for SGI was whether these CPUs flush the entire 128 bytes (the cacheline length in HPC is usually 128, better for bandwidth), or whether the scenario you pointed out can happen. It can go wrong there.

Not at all on the PC processors you've got over there, other than that the cacheline length may be shorter than the lengths above.
bob wrote:
That assumes that hash word 1 is in %r8, and hash word 2 is in %r9, and that %rax points to the hash table.

Two different CPUs execute the above at the same time, or at least start to execute the above. The two words for the hash entry are different on each CPU, but they hash to the same table entry, so %rax is the same.

On CPU 1, we do the first movq, and then an interrupt pops in. We context-switch to the interrupt handler and start executing there.
This scenario doesn't go wrong. The movq is still in cacheline c1 in the processor's buffer; it hasn't been flushed to RAM yet.

It would only go wrong if you do this:

you flush the first movq word to cacheline c1
you flush some other thing to some other cacheline cX

Now the processor may decide to flush the previous cacheline to RAM already, as you wrote to a new cacheline.

In Crafty this doesn't happen, as you FIRST write both quadwords to the cacheline.

Now some guy in this forum, whose hobby is to slow down software, and who sometimes also designs data structures/libraries that I classify in the turtle category belonging to ESA/NASA-standard coding (speaking of a need to buy fast hardware because they can't program), wanted to do some sort of atomic exchange.

If you execute that on some other cacheline, which we all use to lock in case of splitting or whatever parallel action, the exchange operation to a different cacheline forces the rest to get flushed.

Hence it is atomically correct, and hence you measured the huge slowdown using it :)

It cripples bandwidth :)

The above explanation is what I got around the year 2000 regarding the P3 Xeon. It still works like that.

Thanks
bob wrote:


Meanwhile, on CPU 2, we do both stores, overwriting the first word CPU 1 stored. Once CPU 1 finishes handling the interrupt, it resumes and stores word 2. Now we have word 1 from CPU 2's position and word 2 from CPU 1's position. And there is _nothing_ that can be done short of locking the entire entry, or else doing the XOR trick I currently use.

That's bad enough. But then we have the write-combine buffers in the cache, so that as words are scheduled to be written back to memory, they can be written back in a _different_ order than they were stored. There are instructions one can use to deal with this, and you can find references in the Linux kernel source about the problem. The worst case is the Alpha, but the Intel case is bad enough. And both of the above are _before_ the memory controller exerts any influence on what is going on. And both are real problems for this kind of hash usage...
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: SMP hashing problem

Post by diep »

Sven Schüle wrote:Maybe the cache is less involved than I initially thought since the two MOVQ instructions mentioned are updating the memory from two 64 bit registers. But I'm not sure.

Sven
If each quadword were written directly to cache, let alone memory, just imagine the ugly bandwidth you'd need to RAM.

Nehalem has 19.2GB/s.

It stores 96 bytes at once to RAM from its memory controllers.

If it stored 8 bytes at a time instead, the bandwidth would suddenly drop to 1.6GB/s (19.2GB/s x 8/96).

That would be VERY bad :)

These processors are far superior nowadays; they have hundreds of millions of transistors, and guess what? They are REALLY using them :)

Vincent
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: SMP hashing problem

Post by bob »

Sven Schüle wrote:If this code:

Code:

movq %r8, (%rax)
movq %r9, 8(%rax)
results in only one memory write operation, i.e. the CPU translates it into microcode that does only one write of (at least) 16 bytes instead of two writes of 8 bytes, then neither {a1,b2} nor {a2,b1} will occur.
It can't do just one write, because the two source operands are in different places. That's the point. And for the simplest case, the two instructions are separated by an interrupt.

If you don't believe this can happen, just test it. You can easily do this in Crafty, for example, by uncommenting the "illegal move from hash" error message which is caused by this problem (assuming you first remove the XOR stuff so that the problem can be seen). If you want to assume "but that is not an out-of-order store issue, it must be a hash collision," it is easy enough to store the actual board position in a lengthened hash entry instead, and you will _still_ get the illegal-move indication from time to time.

And then there is the more recent issue of a pawn hash entry that has 32 bytes, which allows even more opportunity for the stores to be done out of order with respect to other threads...



While a couple of years ago I would have said "no, impossible", today's processor architecture lets me say "ok, let's think about it" since the "real life behaviour" of a modern CPU seems to be much more intelligent than most of us can imagine.

Bob is convinced that two writes of 8 bytes each will occur, so that overlapping writes are possible in this case, while Vincent is convinced of the opposite (although he mentions the keyword "interrupt", which seems not to be involved in this SMP scenario).

The key point might be the question whether the whole 16 bytes are in the same cache line or not. If they are then I would expect that there will be only one 16-byte write per CPU, so no overlapping writes.
Simple case: I do the first store. That sets the dirty flag for this cache block, and for that specific word. I get interrupted. Another processor says "I want to read this block". My cache says "forget about memory, I will send it to you instead, since I am the only cache that currently holds a copy of this stuff and my copy is different from memory" (my cache block status for this is "M", Modified). I send the copy to the other cache and change my state to O (Owner): I have the good copy, I am responsible for writing it back to memory whenever that happens, and someone else has a copy as well, so I am not the exclusive holder.

Now that processor modifies one of the two words (the first one, in fact). He becomes the Owner and my copy is invalidated; he then modifies the second word and continues. When I recover from the interrupt, I try to access the second word in the block, get a quick "invalid" since the other processor has the updated copy, and I request it from him. I get the copy with both words overwritten, I then overwrite the second, and there you go.
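To make the ordering easier to follow, here is the same scenario compressed into a step trace (a summary of the paragraph above, nothing more; a1/a2 are my two words, b1/b2 are the other processor's):

1. I fetch the block (Exclusive) and store a1; my copy becomes Modified: [a1, old].
2. I get interrupted.
3. The other processor requests the block; I supply it (me: Owner, him: Shared): [a1, old].
4. He stores b1; he becomes the owner and my copy is invalidated: [b1, old].
5. He stores b2: [b1, b2].
6. I resume and request the block back (he stays owner, I get a Shared copy): [b1, b2].
7. I store a2; I become the owner and his copy is invalidated: [b1, a2] -- a mixed entry.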

There are other ways this can happen, because there are lots of things that cause execution to come to a screeching halt besides an interrupt. We can get a memory stall, or a pipeline stall, or anything that causes us to stop for a short period of time, meanwhile caches keep talking to each other and sharing the data.

As I mentioned, there is a good writeup of this in some of the Linux kernel documentation, where they complain about the "lax memory policy" of the x86 architecture, although they complain about the even laxer Alpha memory model much louder. :) There are lots of cute examples of this problem. There are some good guarantees: if, in the same core, I store to the same address twice, the stores will _never_ be done out of order. But across cores, there is no such guarantee to violate at all.

Just imagine the case where I store a[0] now, and then store a[1] 10 minutes later. If another thread decides to store a[0] and a[1] in between, how can you possibly prevent the a1b2 or b1a2 problem I mentioned?

Everyone is using a microscope to look at a problem that is much bigger and simpler than that. It is a simple timing problem. If you store a1 and a2 widely separated by time, you would expect to see this happen every time. As you close the gap between when these two are done, the probability drops. But there is no way to guarantee that they are both done in successive cycles, so the probability never goes to zero.

The x86 has fence instructions to make sure all stores are completed before subsequent loads/stores are done...
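For concreteness, here is a rough sketch of using such a fence from C via the SSE2 intrinsic (illustrative only, not Crafty code); note that the fence only orders this core's own accesses, it does not make the two stores atomic as a pair, which is exactly the point above:

Code:

#include <emmintrin.h>   /* SSE2: _mm_mfence */
#include <stdint.h>

/* Write both quadwords, then fence.  MFENCE guarantees both stores are
   globally visible before any later load/store from THIS core, but another
   core can still slip its own pair of stores in between ours. */
static void store_entry_with_fence(volatile uint64_t *entry,
                                   uint64_t word1, uint64_t word2)
{
    entry[0] = word1;
    entry[1] = word2;
    _mm_mfence();
}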



However, from the "safety" viewpoint I would prefer Bob's side.

Just my 2 cents!
Sven
I'm certain of my side, because I spent several days debugging it before I figured out what it was. It was happening about once per 100 games or so, so it was _very_ rare. But that's not good enough. It has to be never...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: SMP hashing problem

Post by bob »

Sven Schüle wrote:Maybe the cache is less involved than I initially thought since the two MOVQ instructions mentioned are updating the memory from two 64 bit registers. But I'm not sure.

Sven
That's the issue. As I said, this happens well before the memory controller gets a sniff of anything. I would assume one solution would be to use an XMM register to build the entire 128-bit hash entry, then one big store would do nicely. Except that my pawn hash entry is 32 bytes long and we don't have 32-byte stores yet, so there I would still need to either lock (too slow) or XOR things to avoid the crashes... Of course I could always validity-check everything from the table, but that is a performance hit I don't want, and since the XORs had no effect on NPS, I'm happy enough with that even though it does make the code a tad messier where I load/store things.
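In outline, the XOR trick looks like this (a minimal sketch only; the field names and layout are invented for illustration, this is not the actual Crafty code):

Code:

#include <stdint.h>

typedef struct {
    uint64_t key;    /* position signature XORed with the data word */
    uint64_t data;   /* move, score, depth, bound packed into 64 bits */
} HashEntry;

/* Store: fold the data into the stored signature.  No lock needed. */
static void hash_store(HashEntry *e, uint64_t sig, uint64_t data)
{
    e->key  = sig ^ data;
    e->data = data;
}

/* Probe: if another thread interleaved its own two stores with ours, key and
   data come from different positions, the XOR check fails, and the torn
   entry is simply treated as a miss. */
static int hash_probe(const HashEntry *e, uint64_t sig, uint64_t *data_out)
{
    uint64_t key  = e->key;
    uint64_t data = e->data;
    if ((key ^ data) != sig)
        return 0;
    *data_out = data;
    return 1;
}

The cost is one XOR on store and one on probe, which is why it shows up as no measurable NPS change.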
wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: SMP hashing problem

Post by wgarvin »

On x86, you can do atomic 16-byte reads and writes (with 16-byte alignment, anyway) using SSE instructions like MOVDQA and MOVNTDQ.

It might add a little overhead if you have to shuffle around between XMM and integer registers, though.

On PowerPC you could do the same thing with Altivec instructions. On other platforms, mileage may vary.
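In C this looks roughly like the sketch below (illustrative only; the entry layout is hypothetical, GCC syntax is assumed for the alignment attribute, and whether an aligned 16-byte SSE store is truly indivisible is a property of the particular processor rather than an architectural promise):

Code:

#include <emmintrin.h>   /* SSE2: __m128i, _mm_set_epi64x, _mm_store_si128 */
#include <stdint.h>

/* A 16-byte hash entry, forced to 16-byte alignment (layout hypothetical). */
typedef struct {
    uint64_t key;
    uint64_t data;
} __attribute__((aligned(16))) HashEntry16;

static void store_entry_16(HashEntry16 *dst, uint64_t key, uint64_t data)
{
    /* build the 128-bit value: low quadword = key, high quadword = data */
    __m128i v = _mm_set_epi64x((long long)data, (long long)key);
    _mm_store_si128((__m128i *)dst, v);   /* MOVDQA: one aligned 16-byte store */
    /* _mm_stream_si128((__m128i *)dst, v) would be the MOVNTDQ variant */
}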
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: SMP hashing problem

Post by diep »

wgarvin wrote:On x86, you can do atomic 16-byte reads and writes (with 16-byte alignment, anyway) using SSE instructions like MOVDQA and MOVNTDQ.

It might add a little overhead if you have to shuffle around between XMM and integer registers, though.

On PowerPC you could do the same thing with Altivec instructions. On other platforms, mileage may vary.
Altivec is useless 32-bit junk nowadays.
Power6 also has it. 2 units :)

Bob measured wrong or something.
I want to see his code.
Last edited by diep on Tue Jan 27, 2009 10:29 pm, edited 1 time in total.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: SMP hashing problem

Post by bob »

diep wrote:
bob wrote: [snip]
That's all well and good. But you are one step _beyond_ what I am talking about. In this context the memory controller is moot. That's not where the problem is. The problem lies in the fact that I need to store either two 64-bit words for a hash entry or four 8-byte words for a pawn hash entry. And I can't store 'em in one cycle.

In the simplest case, I do this (AT&T format for Linux asm):

movq %r8, (%rax)
movq %r9, 8(%rax)
The above goes OK if you write it to the same cacheline.
All modern CPUs have 64- or 128-byte cachelines in L2/L3.

Vincent, you are thinking _way_ too narrowly. I explained this in another post, but let me do it again here. Intel is using MOESI for SMP cache coherency. Here goes:

Processors one and two both want to store at the same address: two consecutive quadwords (16 bytes total).

Assume nobody has this address in cache, for simplicity. P1 tries to access the address where a1 is going to go. The cache brings it in and flags the block as "E" (Exclusive). P1 then writes a1 with one movq instruction. The cache block immediately becomes "M" (Modified). Now we respond to an interrupt, or the processor stalls for some reason, so we are half done and "sitting".

Processor P2 tries to access the same address to store b1/b2. When it doesn't find it in cache, it asks for the data from memory, but P1's cache says "wait a minute, I have the data and my copy is more current; here it is." P1 sends it to P2 and turns its state into "O" (Owner), since P1 has the modified copy and is responsible for writing it back. P2 now has an identical copy, marked "S" (Shared). And P2 immediately overwrites a1 with b1 (same address). P2's cache block is now flagged as "O", and P1's copy is invalidated. P2 then writes b2, so that we now have b1/b2 in the 16 bytes of interest.

Now P1 resumes and tries to store a2. It's not in cache, so off to memory for the request, but P2 says "I have that; here it is." P2 is still "O", and P1 gets a copy marked "S" (Shared). P1 now stores a2 on top of b2, leaving us with the combination b1, a2. P1 is the owner, P2 is invalid, and eventually P1 has to write this back to memory. And at this point, this is the first time the memory controller has seen any of this mess. And there is nothing it can do, because the contents of the cache are broken, although this is exactly what should happen in such a case since it is "risky programming".
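This interleaving is easy to demonstrate without any assembly. Below is a rough test sketch (illustrative only, not Crafty code): two threads each write a matched pair of quadwords to the same 16-byte entry with two plain stores, and the main thread watches for an entry whose halves came from different writers. On a multi-core x86 box it should not take long to trip.

Code:

/* build with: gcc -O2 -pthread torn.c -o torn */
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

static volatile uint64_t entry[2];       /* the shared "hash entry": two quadwords */

static void *writer(void *arg)
{
    uint64_t v = (uint64_t)(uintptr_t)arg;   /* thread 1 writes the pair (1,1), thread 2 the pair (2,2) */
    for (;;) {
        entry[0] = v;                        /* two separate 8-byte stores; nothing */
        entry[1] = v;                        /* keeps them together as a 16-byte unit */
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, writer, (void *)1);
    pthread_create(&t2, NULL, writer, (void *)2);

    for (;;) {
        uint64_t a = entry[0];
        uint64_t b = entry[1];
        if (a != 0 && b != 0 && a != b) {    /* word 1 from one writer, word 2 from the other */
            printf("torn entry: %llu %llu\n",
                   (unsigned long long)a, (unsigned long long)b);
            return 1;
        }
    }
}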

128 bytes for the P4 and HPC chips, 64 bytes for the rest,
32 bytes for CPUs from the 90s.

The entire cacheline gets flushed to the memory controller at once.
Again, both correct and irrelevant. See above. That's why this is not a memory controller issue. Google "x86 memory fence" to see lots of examples of how this problem can happen.


That is an atomic operation there: other CPUs can either store THEIR 64 bytes or you can store THIS 64 bytes. It can never mix.
Again, correct _and_ irrelevant. By this time, the problem is already in the cache contents. All you are doing here is guaranteeing that the bad cache contents are written back to memory exactly as they exist, but they are _still_ bad.


All kinds of parallel mechanisms would otherwise mess up, of course, and the bandwidth would be gone.

It can go wrong ONLY on supercomputer chips, when some sort of network interrupt happens. Interrupts on PCs don't mess up the atomic nature of it.

If a flushed 64 bytes is lost because of an interrupt, it is entirely gone, OR it gets written in one atomic operation.

That's how things were explained to me by 4 independent people, 2 of them hardware guys and 2 very good low-level programmers, over the past 6 years.
Not sure what that means. My information comes _directly_ from Intel and AMD engineers. I could give you names if interested. This isn't a memory issue. It is a programming issue. Just think about a trivial case:

Processors P1 and P2 both want to store values into a[0] and a[1] which we would agree are in separate but consecutive words of memory, agreed?

P1 stores into a[0] and then goes off and does something else. Meanwhile, P2 stores into a[0] and a[1]. Later P1 comes back and stores into a[1]. We now have P1's value in a[1], but P2's value in a[0]. How can you _possibly_ prevent that from happening? (a) Before you store either word, acquire a lock, then after storing the second, release the lock. Problem gone. Performance probably gone too. (b) You can make the two values dependent on each other so that if they don't go together, you can detect this and not use them. That is the XOR trick. But _no_ hardware solution exists for this. The problem is the time gap between storing the two values, and that gap can be expanded unexpectedly by an interrupt, a pipeline stall, etc...
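For concreteness, option (a) in C would look roughly like this (a sketch only, not Crafty code; a real implementation would use one lock per group of entries rather than a single global lock):

Code:

#include <pthread.h>
#include <stdint.h>

typedef struct {
    uint64_t word1;
    uint64_t word2;
} HashEntry;

static pthread_mutex_t hash_lock = PTHREAD_MUTEX_INITIALIZER;

static void hash_store_locked(HashEntry *e, uint64_t w1, uint64_t w2)
{
    pthread_mutex_lock(&hash_lock);
    e->word1 = w1;              /* no other thread can slip in between these */
    e->word2 = w2;              /* two stores while we hold the lock         */
    pthread_mutex_unlock(&hash_lock);
}

static void hash_probe_locked(const HashEntry *e, uint64_t *w1, uint64_t *w2)
{
    pthread_mutex_lock(&hash_lock);     /* readers must take the lock too, or they */
    *w1 = e->word1;                     /* can still observe a half-written entry  */
    *w2 = e->word2;
    pthread_mutex_unlock(&hash_lock);
}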


One of the programmers, whom I shall keep unnamed, wondered why you bothered doing the XOR on a PC processor.
That programmer does _not_ know what he is talking about, in this case, so it is probably best he _does_ remain unnamed. I do the XOR because I spent several days, the week before last and last week, debugging this _exact_ problem. So feel free to tell him why I "bother". Because not "bothering" leads to crashes, since I depend on the data to be correct for the position it is used in...

So the first thing I figured out for SGI was whether these CPUs flush the entire 128 bytes (the cacheline length in HPC is usually 128, better for bandwidth), or whether the scenario you pointed out can happen. It can go wrong there.

Not at all on the PC processors you've got over there, other than that the cacheline length may be shorter than the lengths above.
I have not followed Intel carefully, but in past processors they did not write an entire cache line back to memory if just one word was modified. They had "write combine" buffers to deal with these writes and combine multiple writes into one if consecutive addresses are being written, for better performance. But this has _nothing_ to do with the current issue. The data is wrong in the cache. Not because of a hardware failure, but because of a classic "interleaved update" problem.
bob wrote:
That assumes that hash word 1 is in %r8, and hash word 2 is in %r9, and that %rax points to the hash table.

Two different CPUs execute the above at the same time, or at least start to execute the above. The two words for the hash entry are different on each CPU, but they hash to the same table entry, so %rax is the same.

On CPU 1, we do the first movq, and then an interrupt pops in. We context-switch to the interrupt handler and start executing there.
This scenario doesn't go wrong. The movq is still in cacheline c1 in the processor's buffer; it hasn't been flushed to RAM yet.
Before we go on, please look up "MOESI cache coherency protocol". The caches pass data back and forth without writing it to RAM, so that is not an issue either.

It would only go wrong if you do this:

you flush the first movq word to cacheline c1
you flush some other thing to some other cacheline cX

Now the processor may decide to flush the previous cacheline to RAM already, as you wrote to a new cacheline.

In Crafty this doesn't happen, as you FIRST write both quadwords to the cacheline.

With a possibly significant delay between the two writes, which is _the_ problem. Instructions appearing consecutively do not necessarily get executed in consecutive cycles...

Now some guy in this forum, whose hobby is to slow down software, and who sometimes also designs data structures/libraries that I classify in the turtle category belonging to ESA/NASA-standard coding (speaking of a need to buy fast hardware because they can't program), wanted to do some sort of atomic exchange.

If you execute that on some other cacheline, which we all use to lock in case of splitting or whatever parallel action, the exchange operation to a different cacheline forces the rest to get flushed.

Hence it is atomically correct, and hence you measured the huge slowdown using it :)

It cripples bandwidth :)

The above explanation is what I got around the year 2000 regarding the P3 Xeon. It still works like that.
And that is all well and good, but it simply has _nothing_ to do with the problem that I have explained...

Thanks
bob wrote:


Meanwhile, on CPU 2, we do both stores, overwriting the first word CPU 1 stored. Once CPU 1 finishes handling the interrupt, it resumes and stores word 2. Now we have word 1 from CPU 2's position and word 2 from CPU 1's position. And there is _nothing_ that can be done short of locking the entire entry, or else doing the XOR trick I currently use.

That's bad enough. But then we have the write-combine buffers in the cache, so that as words are scheduled to be written back to memory, they can be written back in a _different_ order than they were stored. There are instructions one can use to deal with this, and you can find references in the Linux kernel source about the problem. The worst case is the Alpha, but the Intel case is bad enough. And both of the above are _before_ the memory controller exerts any influence on what is going on. And both are real problems for this kind of hash usage...
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: SMP hashing problem

Post by diep »

bob wrote: [snip]
Bob, each time you miss the point.

The cacheline is 128 bytes.

Do you understand what I'm trying to tell you?

You can write 16 quadwords to those 128 bytes before the cacheline gets flagged as invalid on a P4 Xeon.

Vincent
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: SMP hashing problem

Post by bob »

wgarvin wrote:On x86, you can do atomic 16-byte reads and writes (with 16-byte alignment, anyway) using SSE instructions like MOVDQA and MOVNTDQ.

It might add a little overhead if you have to shuffle around between XMM and integer registers, though.

On PowerPC you could do the same thing with Altivec instructions. On other platforms, mileage may vary.
I'm aware of that. And for the 16-byte hash entry, that would solve the problem. But my pawn hash entry is 32 bytes, and there's no workable solution there other than a traditional lock/unlock or an XOR, as is currently done...

The problem with XMM is that I am using C and do not want to add more x86 assembly beyond the BSF/BSR code already used. And C doesn't let me work directly with the XMM registers to do this portably.
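In outline, the XOR approach for a four-quadword (32-byte) pawn hash entry could look like this (a rough sketch only; the layout is invented for illustration and this is not Crafty's actual pawn hash code): every data word is folded into the stored signature, so any torn mixture of words from two threads fails the check on probe and is treated as a miss.

Code:

#include <stdint.h>

typedef struct {
    uint64_t key;       /* pawn signature XORed with the three data words */
    uint64_t word[3];   /* pawn scores, passed-pawn info, etc. (layout hypothetical) */
} PawnHashEntry;

static void pawn_store(PawnHashEntry *e, uint64_t sig, const uint64_t w[3])
{
    e->key     = sig ^ w[0] ^ w[1] ^ w[2];
    e->word[0] = w[0];
    e->word[1] = w[1];
    e->word[2] = w[2];
}

static int pawn_probe(const PawnHashEntry *e, uint64_t sig, uint64_t out[3])
{
    uint64_t k  = e->key;
    uint64_t w0 = e->word[0], w1 = e->word[1], w2 = e->word[2];
    if ((k ^ w0 ^ w1 ^ w2) != sig)
        return 0;               /* torn entry or different pawn structure: a miss */
    out[0] = w0; out[1] = w1; out[2] = w2;
    return 1;
}

As with the two-word case, the check is probabilistic rather than absolute: a torn mixture whose words happen to XOR back to the same signature would slip through, but with 64-bit words that is vanishingly unlikely.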