SMP hashing problem

bob · Post by **bob** » Tue Jan 27, 2009 10:34 pm

diep wrote:
wgarvin wrote:On x86, you can do atomic 16-byte reads and writes (with 16-byte alignment, anyway) using SSE instructions like MOVDQA and MOVNTDQ.

It might add a little overhead if you have to shuffle around between XMM and integer registers, though.

On PowerPC you could do the same thing with Altivec instructions. On other platforms, mileage may vary.
Altivec is useless 32 bits junk nowadays.
Power6 also has it. 2 units

Bob measured wrong or something.
I want to see his code.

Just look at Evaluate(), where I store and retrieve values from the 32 bit hash entry. It is quite simple to follow. The problem, I have already explained. This is not a hardware issue...

diep · Post by **diep** » Tue Jan 27, 2009 10:37 pm

bob wrote:
diep wrote:
wgarvin wrote:On x86, you can do atomic 16-byte reads and writes (with 16-byte alignment, anyway) using SSE instructions like MOVDQA and MOVNTDQ.

It might add a little overhead if you have to shuffle around between XMM and integer registers, though.

On PowerPC you could do the same thing with Altivec instructions. On other platforms, mileage may vary.
Altivec is useless 32 bits junk nowadays.
Power6 also has it. 2 units

Bob measured wrong or something.
I want to see his code.
You fail to see which tricks past 15 years were introduced by hardware engineers to still get some sort of bandwidth to RAM...

Just look at Evaluate(), where I store and retrieve values from the 32 bit hash entry. It is quite simple to follow. The problem, I have already explained. This is not a hardware issue...

diep · Post by **diep** » Tue Jan 27, 2009 10:37 pm

If you keep failing to see tricks past 15 years were introduced by hardware engineers to still get some sort of bandwidth to RAM...

wgarvin · Post by **wgarvin** » Tue Jan 27, 2009 10:42 pm

@Vincent: The problem is not at the memory controller level -- its within each processor, executing instructions. The 16 bytes are written by two separate store instructions (each of which stores 8 bytes). Usually these will then get write-combined and so on. But if an interrupt and thread switch occurs between them, then the first instruction gets executed but the second one doesn't. After a while, the cache line gets flushed back out or RFO'd by another processor.

This is a common problem with lockless algorithms. You have to assume that a thread could "stall" between any two instructions, for an indefinite period of time. This is the reason that correct implementations of lock-free queues are so hard to find.

Unless you can always read and write entries with a single atomic instruction, you have to do things like the XOR trick Bob uses (and then be very careful about the ordering of memory effects).

[Edit: "interrupt and thread switch occurs between them" is a little bit sloppy, but hopefully you know what I meant... the first instruction has finished executing before the interrupt and gets retired, but the second one is not ready to be retired and so it gets discarded. It might be true that in all other cases, the two updates occur atomically as far as that cache line is concerned. But from the point of view of the CPU executing it, this "interrupted" case can be thought of as "write 8 bytes, then do a huge amount of other stuff that is guaranteed to flush out that cache line eventually, then someday, come back and write another 8 bytes".]

diep · Post by **diep** » Tue Jan 27, 2009 10:48 pm

wgarvin wrote:@Vincent: The problem is not at the memory controller level -- its within each processor, executing instructions. The 16 bytes are written by two separate store instructions (each of which stores 8 bytes). Usually these will then get write-combined and so on. But if an interrupt and thread switch occurs between them, then the first instruction gets executed but the second one doesn't. After a while, the cache line gets flushed back out or RFO'd by another processor.

This is a common problem with lockless algorithms. You have to assume that a thread could "stall" between any two instructions, for an indefinite period of time. This is the reason that correct implementations of lock-free queues are so hard to find. Unless you can always read and write entries with a single atomic instruction, you have to do things like the XOR trick Bob uses (and then be very careful about the ordering of memory effects).

Comeon, Bob just can't program:

pawn_hash_table = (PAWN_HASH_ENTRY *) malloc(cb_pawn_hash_table);

He's not even *checking* whether it has been aligned at 128 bytes boundary.

Vincent

diep · Post by **diep** » Tue Jan 27, 2009 11:05 pm

wgarvin wrote:@Vincent: The problem is not at the memory controller level -- its within each processor, executing instructions. The 16 bytes are written by two separate store instructions (each of which stores 8 bytes). Usually these will then get write-combined and so on. But if an interrupt and thread switch occurs between them, then the first instruction gets executed but the second one doesn't. After a while, the cache line gets flushed back out or RFO'd by another processor.

This is a common problem with lockless algorithms. You have to assume that a thread could "stall" between any two instructions, for an indefinite period of time. This is the reason that correct implementations of lock-free queues are so hard to find. Unless you can always read and write entries with a single atomic instruction, you have to do things like the XOR trick Bob uses (and then be very careful about the ordering of memory effects).

[Edit: "interrupt and thread switch occurs between them" is a little bit sloppy, but hopefully you know what I meant... the first instruction has finished executing before the interrupt and gets retired, but the second one is not ready to be retired and so it gets discarded. It might be true that in all other cases, the two updates occur atomically as far as that cache line is concerned. But from the point of view of the CPU executing it, this "interrupted" case can be thought of as "write 8 bytes, then do a huge amount of other stuff that is guaranteed to flush out that cache line eventually, then someday, come back and write another 8 bytes".]

Marvin i know everything about race conditions and deadlocks.

My parallellism has been proven 100% correct in mathematical manner
to be working like that. That's why it scales so well up to a cpu or 2048.

In case of Bob, you know he's still speaking about processors from the 70s and that will NEVER change. You can write of course entire articles explaining how something happened if in the end it is just a simple alignment bug.

If you simply FAIL to *ever* believe how most modern processors work, then that is not curable.

Most memory controllers work like this: if you write in the SAME cacheline they do not store it yet. If you write to another cacheline it gets stored.

I found a race bug in diep, without it EVER crashing on it, at 500 cpu's, simply because i wrote in the same cacheline... ...just in the wrong order.

Yet i DO align my data unlike Bob (note that in shared memory usually data already gets aligned at 2048 bytes or even bigger pages), i take care that if i allocate several datastructures in 1 block of shared RAM that the first datastructure is exactly having a length of 128 bytes:

i = (int)sizeof(NumaDataStructure);
if( (i&127) != 0 ) {
i &= (~127);
i += 128; /* dus veelvoud van 128 bytes is het. */
}

In case of Bob he has to take care that each entry falls within 1 cacheline.

Now he will ever fail to do this and deny until he is in hell (all chessplayers go to hell, not a single one ever will go to heaven - so i was told). That's how he behaves already for 35 years, so nothing new there.

Usually what happens then is some sort of student which shows up which fixes the code in Crafty for Bob. Maybe you can do it for him this time?

Vincent

jwes · Post by **jwes** » Tue Jan 27, 2009 11:08 pm

Perhaps I can explain this more clearly. Thread one writes one QWORD and is interrupted before it writes the second QWORD. Thread two reads 2 QWORDS before thread one returns from the interrupt, the first is the new one written by thread one, the second was written some time earlier and you have invalid data. Length of cache lines and cache coherency has nothing to do with it.

diep · Post by **diep** » Tue Jan 27, 2009 11:12 pm

Anyway, you see why i didn't post all this in the year 2000, some people are as movable as a sleeper.

You can say 1000 times: "align your data to a cacheline length", it doesn't get understood simply.

diep · Post by **diep** » Tue Jan 27, 2009 11:14 pm

jwes wrote:Perhaps I can explain this more clearly. Thread one writes one QWORD and is interrupted before it writes the second QWORD. Thread two reads 2 QWORDS before thread one returns from the interrupt, the first is the new one written by thread one, the second was written some time earlier and you have invalid data. Length of cache lines and cache coherency has nothing to do with it.

This doesn't go wrong at pc processors. Only at HPC processors it can go wrong.

I've spent weeks to hardware experts asking the same thing and i got from each one of them the same answer.

BANDWIDTH is the keyword.

And ALIGN YOUR HASHTABLE.

Vincent

wgarvin · Post by **wgarvin** » Tue Jan 27, 2009 11:18 pm

jwes wrote:Perhaps I can explain this more clearly. Thread one writes one QWORD and is interrupted before it writes the second QWORD. Thread two reads 2 QWORDS before thread one returns from the interrupt, the first is the new one written by thread one, the second was written some time earlier and you have invalid data. Length of cache lines and cache coherency has nothing to do with it.

Right. Even though the two write instructions come right after each other, its still possible for an interrupt and thread switch to cause many millions of other instructions to be executed in between. Even if the two write instructions were decoded and issued near-simultaneously, due to limited execution resources its still possible for one to retire and the other to be discarded (due to the interrupt) before retiring. I don't think micro-op fusion alters this situation any, either.

[Edit: it doesn't matter that the processor decodes and issues multiple instructions per cycle, executes instructions out-of-order, has dozens of them in-flight at the same time, combines the effects of multiple writes together in a buffer before pushing it to L1, fuses multiple u-ops together, or any of that stuff. Arguments about cache line size or how the SMP sharing of those cache lines works are even more irrelevant for this particular bug. The bottom line is that CPU instructions are retired in-order, and the illusion of serial execution is preserved when something like an interrupt occurs. When an interrupt is raised, some of the "not finished yet" execution steps are completed, and others are held back or discarded, with the result that it looks like the first store instruction has completed but the second one never even issued.]

SMP hashing problem

Re: SMP hashing problem

Re: SMP hashing problem

Re: SMP hashing problem

Re: SMP hashing problem

Re: SMP hashing problem

Re: SMP hashing problem

Re: SMP hashing problem

Re: SMP hashing problem

Re: SMP hashing problem

Re: SMP hashing problem