SMP hashing problem

Discussion of chess software programming and technical issues.

Moderator: Ras

jwes
Posts: 778
Joined: Sat Jul 01, 2006 7:11 am

Re: SMP hashing problem

Post by jwes »

diep wrote:
jwes wrote:Perhaps I can explain this more clearly. Thread one writes one QWORD and is interrupted before it writes the second QWORD. Thread two reads 2 QWORDs before thread one returns from the interrupt: the first is the new one written by thread one, the second was written some time earlier, and you have invalid data. Length of cache lines and cache coherency has nothing to do with it.
This doesn't go wrong on PC processors. Only on HPC processors can it go wrong.

I've spent weeks asking hardware experts the same thing, and I got the same answer from each one of them.

BANDWIDTH is the keyword.

And ALIGN YOUR HASHTABLE.

Vincent
What data do you think thread two will read in the situation I described? 16 bytes of old data, 16 bytes of data written by thread one, or something else?
wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: SMP hashing problem

Post by wgarvin »

diep wrote:And ALIGN YOUR HASHTABLE.

Vincent
Most x86 malloc implementations have 16-byte alignment anyways so that double and SSE data will be properly aligned (I know Microsoft's does).

With 32-byte pawn hash entries, the only thing he's risking is that he'll have to fetch two cache lines instead of one. (I'm not sure but x86 processors might hide that kind of latency pretty well... if it was two 16-byte reads into XMM registers they would, anyway).

But you're right that manually aligning it to a 32-byte boundary is probably a good idea (better safe than sorry!)

It wouldn't fix Bob's interrupt bug though, which has nothing to do with alignment and everything to do with atomicity.
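
(As an aside: a minimal sketch of what explicit alignment could look like in C. The type and sizes here are made up for illustration, they are not Crafty's actual code, and posix_memalign is assumed to be available.)

#include <stdlib.h>
#include <string.h>

/* hypothetical 32-byte pawn hash entry: 8-byte signature plus 24 bytes of data */
typedef struct {
    unsigned long long key;
    unsigned char data[24];
} PawnHashEntry;

PawnHashEntry *alloc_pawn_hash(size_t nentries) {
    void *p = NULL;
    /* ask for 32-byte alignment so every entry starts on a 32-byte boundary
       and therefore never straddles a 64-byte cache line */
    if (posix_memalign(&p, 32, nentries * sizeof(PawnHashEntry)) != 0)
        return NULL;
    memset(p, 0, nentries * sizeof(PawnHashEntry));
    return (PawnHashEntry *)p;
}

(On Windows, _aligned_malloc(size, 32) would play the same role.)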
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: SMP hashing problem

Post by diep »

jwes wrote:
diep wrote:
jwes wrote:Perhaps I can explain this more clearly. Thread one writes one QWORD and is interrupted before it writes the second QWORD. Thread two reads 2 QWORDs before thread one returns from the interrupt: the first is the new one written by thread one, the second was written some time earlier, and you have invalid data. Length of cache lines and cache coherency has nothing to do with it.
This doesn't go wrong on PC processors. Only on HPC processors can it go wrong.

I've spent weeks asking hardware experts the same thing, and I got the same answer from each one of them.

BANDWIDTH is the keyword.

And ALIGN YOUR HASHTABLE.

Vincent
What data do you think thread two will read in the situation I described? 16 bytes of old data, 16 bytes of data written by thread one, or something else?
What do you think needs to get stored in case of a context switch?
Right: also the memory buffers that the processor still has to store to RAM in case another line gets touched.

If you write to cache line c1 and then move on to writing into cache line c2, then it is POSSIBLE that the processor flushes cache line c1 to the memory controller. No guarantees. The guarantee you have is that c2 didn't get written yet.

Look, if an old man is forever too stubborn to even test it, he can claim forever to be right; it doesn't change the fact that it is a total lie that he tested it. He did NOT. He is not even aligning his hash table.

Here is the thing: he wrote an article regarding the number of collisions. His claim is that you can have a lot of collisions, yes even errors from collisions, without a problem. These are hash collisions (so 2 different positions hash to the same 64-bit integer). Bob reduced the key size in order to get more collisions (and managed, of course), then published an article saying that you can really allow a lot of hash collisions.

Now he's complaining here about a phenomenon which, if you do not align your hash table OR run on an HPC processor, happens really rarely. In order to fix that rare phenomenon, which happens a lot less often than collisions do, he's doing an XOR.

The consistent thing would now be to not do that XOR: if you first write an article to show that your search is so crappy that you can allow a lot of collision errors, then you should not care about this XOR either.

No, in order to 'protect' his publication, which appeared around the year 2000, he's now spreading the word: "thou shalt do XOR".

Yet already in 2000 the commercial programmers spoke with me, explaining why Bob's article was totally irrelevant because of how the memory controller works on Intel.

Now you see, based upon the huge number of postings, why he keeps ignoring things. You can of course post forever if you never TEST it yourself, while claiming you did. Yet the hash table, as I have proven, is not even ALIGNED in his source code.

So of course that's an infinite discussion with someone who just fails to believe something and will forever keep believing it by never aligning his hash table. In that case SOMETIMES such an error happens, as the processor writes the first cache line but not yet the second one.

In past years the above Robert Hyatt scenario has played out in just about every small detail he claims to be true or denies.

A good example: he claims linear scaling, namely 1 + 0.7(n-1),
so 1.7 at 2 processors, and 3.1 at his quad Xeons.

I showed statistically significant evidence that it is under 1.5 on my dual K7.
Yes, 1.4 in fact.

Bob claims the hardware is the problem, not his software.

I claim nullmove is a problem for parallel speedup, as it delays the split time. Bob denies it.

GCP tests 30 positions on his quad Xeon box and proves it is 2.8 out of 4.
And GCP tests without nullmove and shows it is 3.0 out of 4, on Bob's own hardware.

Bob then needs something like 48 hours to run 3 positions or so to "prove" it is 3.1 (in fact, rounded off, it was 3.0, not 3.1).

No statistically significant evidence. The lie continues online again.

How many times did he need to rerun them in order to get one output of 3.0?

So you realize why most people didn't bother posting about all this, and another 1000 things, these past years?

Do you realize clearly what we are speaking about? We are speaking about the odds of something going wrong here. I measured it on an HPC processor, where it can go wrong once in 200 billion.

That is a lot rarer than the collisions Bob claims not to care about in Crafty.

I, however, want 0 errors: preferably both 0 collisions and 0 errors. That's why Diep has 128-bit Zobrist keys, by the way, and not 64-bit.

Now something that every hardware engineer knows cannot go wrong, he's making a big deal about, because he wrote one article somewhere. So of course it can NEVER be admitted that he's wrong.

Like he has done on 100 other subjects.

Vincent
wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: SMP hashing problem

Post by wgarvin »

Do you agree that it's possible for an interrupt to occur between two instructions? ANY two instructions, even a pair of stores?

That's the scenario Bob described in his very first post. First store instruction executes, then interrupt happens. Second store instruction does not get executed. Half of the table entry (the key half) is written to the cache line, the other half (the data half) does not get written until a zillion years later when the thread resumes. Meanwhile (say, after about half of a zillion years), another thread tries to read the same table entry.

Yes, it's highly unlikely to happen on any particular execution of that code. When it DOES happen, the second thread reads exactly what the cache line contains: the first half of a new entry, and the second half of an OLD one from long before (because the thread suspended right before the store instruction that would have stored the second half of the new entry, and it still hasn't been executed---in Bob's case, it might NEVER have been executed because the other thread of Crafty would crash first ;))
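
(To make the timing concrete, a minimal sketch of the two-store sequence being described; the field names are hypothetical, not Crafty's actual layout.)

typedef struct {
    unsigned long long key;   /* first 8 bytes: the hash signature   */
    unsigned long long data;  /* second 8 bytes: score, move, flags  */
} HashEntry;

void store_entry(HashEntry *e, unsigned long long key, unsigned long long data) {
    e->key  = key;    /* store #1 executes and retires                     */
                      /* <-- an interrupt/context switch can land here --> */
    e->data = data;   /* store #2 may only run a "zillion" cycles later    */
}

/* In the meantime another thread that probes the same entry can observe the
   new key paired with whatever OLD data the entry held before. */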
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: SMP hashing problem

Post by diep »

wgarvin wrote:
diep wrote:And ALIGN YOUR HASHTABLE.

Vincent
Most x86 malloc implementations have 16-byte alignment anyways so that double and SSE data will be properly aligned (I know Microsoft's does).

With 32-byte pawn hash entries, the only thing he's risking is that he'll have to fetch two cache lines instead of one. (I'm not sure but x86 processors might hide that kind of latency pretty well... if it was two 16-byte reads into XMM registers they would, anyway).

But you're right that manually aligning it to a 32-byte boundary is probably a good idea (better safe than sorry!)

It wouldn't fix Bob's interrupt bug though, which has nothing to do with alignment and everything to do with atomicity.
Please contact some CPU designers and ask them about PC processors rather than console IBM processors. Maybe that will convince you that PC processors have no problem at all.

Thanks,
Vincent
jwes
Posts: 778
Joined: Sat Jul 01, 2006 7:11 am

Re: SMP hashing problem

Post by jwes »

diep wrote:
jwes wrote:
diep wrote:
jwes wrote:Perhaps I can explain this more clearly. Thread one writes one QWORD and is interrupted before it writes the second QWORD. Thread two reads 2 QWORDs before thread one returns from the interrupt: the first is the new one written by thread one, the second was written some time earlier, and you have invalid data. Length of cache lines and cache coherency has nothing to do with it.
This doesn't go wrong on PC processors. Only on HPC processors can it go wrong.

I've spent weeks asking hardware experts the same thing, and I got the same answer from each one of them.

BANDWIDTH is the keyword.

And ALIGN YOUR HASHTABLE.

Vincent
What data do you think thread two will read in the situation I described? 16 bytes of old data, 16 bytes of data written by thread one, or something else?
<rant deleted>
Vincent
I read your rant several times and did not see the answer to my question. Could you try again (preferably in a few words)?
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: SMP hashing problem

Post by bob »

diep wrote:
bob wrote:
diep wrote:
bob wrote: [snip]
That's all well and good. But you are one step _beyond_ what I am talking about. In this context the memory controller is moot. That's not where the problem is. The problem lies in the fact that I need to store either two 64-bit words for a hash entry, or four 8-byte words for a pawn hash entry. And I can't store 'em in one cycle.

In the simplest case, I do this (AT&T format for Linux asm):

movq %r8, [%rax]
movq %r9, [%rax + 8]
The above goes OK if you write it to the same cache line.
All modern CPUs have 64- or 128-byte cache lines in L2/L3.

Vincent, you are thinking _way_ too narrowly. I explained this in another post, but let me do it again here. Intel is using MOESI for SMP cache coherency. Here goes:

processor one and two both want to store at the same address, two consecutive quadwords (16 bytes total).

Assume nobody has this address in cache, for simplicity. P1 tries to access the address where a1 is going to go. The cache brings it in and flags the block as "E" (exclusive). P1 then writes a1 with one movq instruction. The cache block immediately becomes "M" (modified). Now we respond to an interrupt, or the processor stalls for some reason, so we are half done and "sitting".

Processor P2 tries to access the same address to store b1/b2. When it doesn't find it in cache, it asks for the data from memory, but P1's cache says "wait a minute, I have the data and my copy is more current. Here it is." P1 sends it to P2 and turns its state into "O" (owner), since P1 has the modified copy and is responsible for writing it back. P2 now has an identical copy, marked "S" (shared). P2 immediately overwrites a1 with b1 (same address). P2's cache block is now flagged as "O", and P1's copy is invalidated. P2 then writes b2, so that we now have b1/b2 in the 16 bytes of interest.

Now P1 resumes and tries to store a2. It is not in cache, so off to memory for the request, but P2 says "I have that, here it is." P2 is still "O" and P1 gets a copy marked "S" (shared). P1 now stores a2 on top of b2, leaving us with the combination b1, a2. P1 is the owner, P2 is invalid, and eventually P1 has to write this back to memory. And at this point, this is the first the memory controller has seen of this mess. And there is nothing it can do, because the contents of the cache are broken, although this is exactly what should happen in such a case since it is "risky programming".

128 bytes for P4 and HPC chips and 64 bytes for the rest.
32 bytes for CPUs from the 90s.

The entire cache line gets flushed at once to the memory controller.
Again, both correct and irrelevant. See above. That's why this is not a memory controller issue. google "x86 memory fence" to see lots of examples of how this problem can happen.


That is an atomic operation there: other CPUs can either store THEIR 64 bytes or you can store THESE 64 bytes. It can never mix.
Again, correct _and_ irrelevant. By this time, the problem is already in the cache contents. All you are doing here is guaranteeing that the bad cache contents are written back to memory exactly as they exist, but they are _still_ bad.


All kinds of parallel mechanisms would otherwise mess up, of course, and bandwidth would be gone.

It can go wrong ONLY on supercomputer chips, when some sort of network interrupt happens. Interrupts on PCs don't mess up the atomic nature of it.

If a flushed 64 bytes is lost because of an interrupt, it is entirely gone, OR it gets written in one atomic operation.

That's how things were explained to me by 4 independent persons, of whom 2 were hardware guys and 2 very good low-level programmers, over the past 6 years.
Not sure what that means. My information comes _directly_ from Intel and AMD engineers. I could give you names if interested. This isn't a memory issue. It is a programming issue. Just think about a trivial case:

Processors P1 and P2 both want to store values into a[0] and a[1] which we would agree are in separate but consecutive words of memory, agreed?

P1 stores into a[0] and then goes off and does something else. Meanwhile P2 stores into a[0] and a[1]. Later P1 comes back and stores into a[1]. We now have P1's value in a[1], but P2's value in a[0]. How can you _possibly_ prevent that from happening? (a) Before you store either word, acquire a lock, then after storing the second, release the lock. Problem gone. Performance probably gone too. (b) You can make the two values dependent on each other so that if they don't go together, you can detect this and not use them. That is the XOR trick. But _no_ hardware solution exists for this. The problem is the time gap between storing the two values, and that gap can be expanded unexpectedly by an interrupt, a pipeline stall, etc...
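
(For reference, a minimal sketch of that XOR idea in C. This shows the general lockless-hashing technique; the two-word entry layout is an assumption, not necessarily Crafty's exact one.)

typedef struct {
    volatile unsigned long long word1;  /* key XORed with data */
    volatile unsigned long long word2;  /* data                */
} LocklessEntry;

void lockless_store(LocklessEntry *e, unsigned long long key, unsigned long long data) {
    e->word1 = key ^ data;   /* the two stores can still be torn apart...   */
    e->word2 = data;
}

int lockless_probe(LocklessEntry *e, unsigned long long key, unsigned long long *data) {
    unsigned long long w1 = e->word1;
    unsigned long long w2 = e->word2;
    if ((w1 ^ w2) != key)    /* ...but a torn entry fails this check        */
        return 0;            /* and is simply treated as a miss             */
    *data = w2;
    return 1;
}

If one thread's word1 ends up next to another thread's (or an older) word2, the XOR of the two no longer reproduces the probing key, so the corrupted entry is ignored instead of trusted.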


One of the programmers, whom I shall keep unnamed, wondered why you bothered doing the XOR on a PC processor.
That programmer does _not_ know what he is talking about, in this case, so it is probably best he _does_ remain unnamed. I do the XOR because I spent several days, the week before last and last week, debugging this _exact_ problem. So feel free to tell him why I "bother". Because not "bothering" leads to crashes, since I depend on the data being correct for the position it is used in...

So the first thing I figured out for SGI is whether these CPUs flush an entire 128 bytes (the cache-line length in HPC is usually 128, better for bandwidth),
or whether the scenario you pointed out above can happen. It can go wrong there.

Not at all on the PC processors you've got over there, other than that the cache-line length may be shorter than the lengths above.
I have not followed Intel carefully, but in past processors they did not write an entire cache line back to memory if just one word was modified. They had "write combine" buffers to deal with these writes and combine multiple writes into one if consecutive addresses are being written, for better performance. But this has _nothing_ to do with the current issue. The data is wrong in the cache. Not because of a hardware failure, but because of a classic "interleaved update" problem.
bob wrote:
That assumes that hash word 1 is in %r8, and hash word 2 is in %r9, and that %rax points to the hash table.

Two different CPUs execute the above at the same time, or at least start to execute the above. The two words for the hash entry are different on each CPU, but they hash to the same table entry, so %rax is the same.

On CPU 1, we do the first movq, and then an interrupt pops in. We
context-switch to the interrupt handler and start executing there.
This scenario doesn't go wrong. The movq is still in the processor's buffer for cache line c1; it hasn't been flushed to RAM yet.
Before we go on, please look up "MOESI cache coherency protocol". The caches pass data back and forth without writing it to RAM, so that is not an issue either.

It would only go wrong if you do this:

you flush the first movq word to cache line c1
you flush something else to some other cache line cX

Now the processor may decide to flush the previous cache line to RAM already,
as you wrote to a new cache line.

In Crafty this doesn't happen, as you FIRST write both quadwords to the cache line.

With a possibly significant delay between the two writes, which is _the_ problem. Instructions appearing consecutively do not necessarily get executed in consecutive cycles...

Now, some guy in this forum whose hobby is to slow down software, who sometimes also designs data structures/libraries which I classify in the turtle category belonging to ESA/NASA-standard coding (speaking of a need to buy fast hardware because they can't program), wanted to do some sort of atomic exchange.

If you execute that on some other cache line, which we all use as a lock in case of splitting or whatever parallel action, the exchange operation to a different cache line forces the rest to get flushed.

Hence it is atomically correct, and hence you measured the huge slowdown using it :)

It cripples bandwidth :)

The above explanation is what I got around the year 2000 regarding the P3 Xeon. It still works like that.
And that is all well and good, but it simply has _nothing_ to do with the problem that I have explained...

Thanks
bob wrote:


Meanwhile on CPU 2, we do both stores, overwriting the first word CPU 1 stored. Once CPU 1 finishes handling the interrupt, it resumes and stores word 2. Now we have word 1 from CPU 2's position, and word 2 from CPU 1's position. And there is _nothing_ that can be done short of locking the entire entry, or else doing the XOR trick I currently use.

That's bad enough. But then we have the write-combine buffers in the cache, so that as words are scheduled to be written back to memory, they can be written back in a _different_ order than they were stored. There are instructions one can use to deal with this, and you can find references in the Linux kernel source about the problem. The worst case is the Alpha, but the Intel case is bad enough. And both of the above happen _before_ the memory controller exerts any influence on what is going on. And both are real problems for this kind of hash usage...
Bob, each time you miss the point.

The cache line is 128 bytes.

Do you understand what I'm trying to tell you?

You can write 16 quadwords to those 128 bytes before the cache line gets flagged as invalid on a P4 Xeon.

Vincent
The L1 cache line is not 128 bytes on every processor. And that is moot.

I gave you an example for the problem. If you don't understand it, I don't know how to make it clearer. This has _nothing_ to do with cache linesize... It has _nothing_ to do with how much the cache/controller and memory/controller can write in one chunk. None of that is in the least bit relevant to this discussion.

I challenge you to try this:

create two threads.

The first thread executes this:

sleep(1);
a[0] = 1;
a[1] = 2;

The second executes this:

a[0] = 3;
sleep(2);
a[1] = 4;

Now, according to your logic, we can only get 1,2 or 3,4...

I claim we can get a mix; with these sleeps, 1,4.

The first thread sleeps for one second.

The second thread stores 3. Sleeps.

The first stores 1, 2...

Then the second wakes up and stores 4. What is in the two-element array? Not 1,2 and not 3,4.

That really is not that hard to understand...
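
(Bob's challenge translates almost directly into a runnable C/pthreads sketch; the sleeps and the two-element array are his, the thread scaffolding is just filler added here to make it compile. Build with -pthread.)

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static volatile long a[2];

static void *first_thread(void *arg) {
    (void)arg;
    sleep(1);       /* let the second thread win the race to a[0]             */
    a[0] = 1;
    a[1] = 2;
    return NULL;
}

static void *second_thread(void *arg) {
    (void)arg;
    a[0] = 3;
    sleep(2);       /* a long stall between the two stores, like an interrupt */
    a[1] = 4;
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, first_thread, NULL);
    pthread_create(&t2, NULL, second_thread, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* with these sleeps the stores interleave as 3; 1,2; 4 and this prints
       "1 4": neither thread's pair of values survives intact */
    printf("%ld %ld\n", a[0], a[1]);
    return 0;
}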
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: SMP hashing problem

Post by bob »

diep wrote:
wgarvin wrote:@Vincent: The problem is not at the memory controller level -- it's within each processor, executing instructions. The 16 bytes are written by two separate store instructions (each of which stores 8 bytes). Usually these will then get write-combined and so on. But if an interrupt and thread switch occurs between them, then the first instruction gets executed but the second one doesn't. After a while, the cache line gets flushed back out or RFO'd by another processor.

This is a common problem with lockless algorithms. You have to assume that a thread could "stall" between any two instructions, for an indefinite period of time. This is the reason that correct implementations of lock-free queues are so hard to find. :P Unless you can always read and write entries with a single atomic instruction, you have to do things like the XOR trick Bob uses (and then be very careful about the ordering of memory effects).


[Edit: "interrupt and thread switch occurs between them" is a little bit sloppy, but hopefully you know what I meant... the first instruction has finished executing before the interrupt and gets retired, but the second one is not ready to be retired and so it gets discarded. It might be true that in all other cases, the two updates occur atomically as far as that cache line is concerned. But from the point of view of the CPU executing it, this "interrupted" case can be thought of as "write 8 bytes, then do a huge amount of other stuff that is guaranteed to flush out that cache line eventually, then someday, come back and write another 8 bytes".]
Marvin, I know everything about race conditions and deadlocks.
Almost fell out of my chair on reading that bit of nonsense. If _you_ don't see the problem, that's fine. That doesn't mean that others are not smart enough to understand what I have written. It happens. I have _proven_ that it happens. And even someone with a casual understanding could understand the examples that I have explained.

So saying you know "everything" about race conditions is ridiculous. This is a trivial race that is not the hardware's fault, and there is _nothing_ that the hardware can possibly do to prevent this particular problem. It is purely up to the programmer. Otherwise there would be no need for atomic locks at all.



My parallelism has been proven, in a mathematical manner, to be 100% correct and to work like that. That's why it scales so well, up to 2048 CPUs or so.

In the case of Bob, you know he's still speaking about processors from the 70s, and that will NEVER change. You can of course write entire articles explaining how something happened when in the end it is just a simple alignment bug.
This is happening on current hardware. It was first noticed on the i7 I was running on. It was then seen on the dual quad-core box I am running on all the time. It was also found on my Core 2 Duo laptop. And I repeated it several times while testing and debugging on my office dual-PIV box and the older quad PIII Xeon 700 I have. This has nothing to do with 70's hardware. It has _nothing_ to do with alignment, except that you can't seem to align your reasoning ability with reality to understand what most others get pretty quickly.

If you simply FAIL to *ever* believe how most modern processors work, then that is not curable.
Or if you don't understand the problem being described, then any amount of hand-waving just stirs the air, but provides nothing useful of any kind.

Most memory controllers work like this: if you write in the SAME cache line, they do not store it yet. If you write to another cache line, it gets stored.
Sorry, but when I write to a cache line, the memory controller is unaware of that until the cache controller finally decides to flush that modified block from cache and write it back to memory. I wish you would back off and re-read the many descriptions of the problem I have written. You are _completely_ missing out on what is happening, and waving your hands about the memory controller when it has _nothing_ to do with this specific issue... Jesus...


I found a race bug in Diep, without it EVER crashing on it, at 500 CPUs, simply because I wrote within the same cache line... just in the wrong order.

Yet I DO align my data, unlike Bob (note that shared memory usually gets aligned to 2048-byte or even bigger pages). I take care that if I allocate several data structures in one block of shared RAM, the first data structure has a length that is an exact multiple of 128 bytes:

I did not align the phash because up until a few months ago, it was using the shmget() call which _always_ returns memory on a page boundary. But that has nothing to do with this, and fixing that is one of several things I have done in 23.0...



i = (int)sizeof(NumaDataStructure);
if( (i&127) != 0 ) {
    i &= (~127);
    i += 128; /* so it is a multiple of 128 bytes. */
}

In Bob's case, he has to take care that each entry falls within one cache line.
Why do you want to make statements that are wrong? It is more _efficient_ if a single entry is in one cache line. But whether it is or is not has _nothing_ to do with this problem.

Now he will forever fail to do this and deny it until he is in hell (all chess players go to hell, not a single one will ever go to heaven - so I was told). That's how he has behaved for 35 years already, so nothing new there.
And in your case, you have "imagined" what is wrong with my program to cause this behavior. You have no idea at all what is _really_ happening. But you are so certain that you are an expert ("ex" as in has-been, "spert" as in 'drip under pressure'???) that you have locked onto some imaginary issue that nobody but you is talking about, and as a result, you've completely missed the point. And if you haven't taken care of how you store anything longer than 8 bytes for one entry, you will have the _same_ problem and simply won't know it...


Usually what happens then is that some student shows up who fixes the code in Crafty for Bob. Maybe you can do it for him this time?

Never had a student write one line of code for Crafty. Your imagination _really_ is in overdrive here. Throttle back a bit and come back to reality and perhaps learn what this discussion is all about...

Vincent
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: SMP hashing problem

Post by bob »

diep wrote:Anyway, you see why I didn't post all this in the year 2000; some people are as movable as a sleeper.

You can say "align your data to the cache-line length" 1000 times; it simply doesn't get understood.
Not when the suggestion has _nothing_ to do with the problem. Feel free to not post for another 2-3 years if this is all you have to offer. Others see the problem. If you can't, go study for a while...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: SMP hashing problem

Post by bob »

diep wrote:
bob wrote:
diep wrote:
wgarvin wrote:On x86, you can do atomic 16-byte reads and writes (with 16-byte alignment, anyway) using SSE instructions like MOVDQA and MOVNTDQ.

It might add a little overhead if you have to shuffle around between XMM and integer registers, though.

On PowerPC you could do the same thing with Altivec instructions. On other platforms, mileage may vary.
Altivec is useless 32-bit junk nowadays.
Power6 also has it. 2 units :)

Bob measured wrong or something.
I want to see his code.
You fail to see which tricks hardware engineers introduced over the past 15 years to still get some sort of bandwidth to RAM...



Just look at Evaluate(), where I store and retrieve values from the 32-byte hash entry. It is quite simple to follow. The problem, I have already explained. This is not a hardware issue...
Vincent, don't be a dumbass. I know it is hard to not "do what comes natural" but I have _already_ told you, a _dozen_ times, that this is not a "hardware issue"... You just can't do two stores to memory and assume that other stores don't overwrite either one of those independently.

So yes, it is a software problem; I, unlike you, have never said otherwise. I explained _how_ to create the problem. If you store an entry that takes 16 bytes, you have the problem. End of story. There is no other way out of this. You accept the error, you acquire a lock before the first store and release it after the second to be sure that both are done before anyone else can modify either value, or you use the XOR trick. But you _will_ do one of those three. It sounds like you do number 1 and just ignore the errors. I was doing that in the hash, in fact. But the pawn hash error crashed me and I had to fix that circumstance, which the XOR did.
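
(For completeness, a sketch of the other option wgarvin mentioned earlier: a single 16-byte SSE store/load per entry. This assumes a GCC-style alignment attribute, a 64-bit build, and a made-up 16-byte entry; note that 16-byte SSE accesses were not formally documented as atomic by the vendors at the time, so "mileage may vary", as he said.)

#include <emmintrin.h>   /* SSE2 intrinsics */

typedef struct __attribute__((aligned(16))) {
    unsigned long long key;
    unsigned long long data;
} Entry16;

static void entry_store(Entry16 *e, unsigned long long key, unsigned long long data) {
    __m128i v = _mm_set_epi64x((long long)data, (long long)key);  /* key low, data high */
    _mm_store_si128((__m128i *)e, v);        /* one 16-byte aligned store (MOVDQA)      */
}

static void entry_load(const Entry16 *e, unsigned long long *key, unsigned long long *data) {
    __m128i v = _mm_load_si128((const __m128i *)e);               /* one 16-byte load   */
    *key  = (unsigned long long)_mm_cvtsi128_si64(v);
    *data = (unsigned long long)_mm_cvtsi128_si64(_mm_unpackhi_epi64(v, v));
}

Even if the 16-byte access does turn out to be atomic on a given part, any entry wider than that (such as a 32-byte pawn hash entry) is still back to the lock / XOR / accept-the-error choice above.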