SMP hashing problem

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: SMP hashing problem

Post by diep »

bob wrote:
diep wrote:More explanation on how the memory controller works:

Vincent says: (4:26:57 AM) we write nonstop with all cores to RAM
Vincent says: (4:27:13 AM) now suppose 2 cores write at the same time to the same address in RAM
Vincent says: (4:27:34 AM) one entry is 16 bytes 
Vincent says: (4:27:50 AM) core1 writes (a1,a2)
Vincent says: (4:27:59 AM) core2 writes (b1,b2)
Vincent says: (4:28:15 AM) can we get in ram stored (a1,b2) or (b1,a2) ?
Vincent says: (4:28:22 AM) as that would mean big problem
Michael says: (4:28:51 AM) no. that's not really how it works
Michael says: (4:29:37 AM) all cores share the same two memory controllers via crossbar
Michael says: (4:30:11 AM) and that is essentially the same thing as if they only had one memory controller / channel
Michael says: (4:30:25 AM) so they cannot write "simultaneously" to memory

Anyway, I knew this already back in 2002 or so. I just asked to confirm it.

Note that on some HPC processors like the R14000 you can get a write error, as they are more complex than PC processors.

Thanks,
Vincent
Michael is not thinking of what I am talking about. If the CPU only stores A1, then pauses for any reason, from an interrupt to a hyper-threading context swap, the other CPU can store B1 and B2, and then the first one later comes back to the current process and stores A2.

But for hardware, I still believe he is wrong, because of the "write-combine" buffers Intel uses. A1 could be on the tail end of one of those buffers and get written out _before_ A2 gets written out. Other cores can then overwrite A1 and A2 before A2 gets written, and we are right back to this problem. The x86 actually has load/store fence operations you can use to help with this, if you are writing in asm. But even those do not solve the problem that two consecutive mov [mem], rax and mov [mem+8], rbx do not necessarily get executed close together, even though they appear consecutively in the program being run. And without that guarantee, we have to take evasive action ourselves...
Well Bob, suppose you were a CPU designer.

Hyperthreading doesn't matter in this context, of course.
Let's first look at a quad-core CPU.

Suppose that cores could simultaneously write to the same cache line and pollute their own L2 like that, instead of writing the ENTIRE cache line from cache to RAM.

Just SUPPOSE.

Then it would suddenly be totally impossible to sell that processor.

Because not a single mechanism would work correctly anymore with respect to RAM.

So the only case where it can happen is when there is more hardware between the memory controller and the CPU.

That happens for example in the R14000. There is a hub in between, so if an interrupt happens that hub could pollute it.

Note that the bandwidth of those CPUs to RAM is really ugly because of this.

We're speaking of something that has 1GB/s bandwidth.

Today's Nehalem is 32 GB/s on paper. In some benchmarks I saw it get 19 GB/s in practice, and there have even been claims of 21 to 22 GB/s.
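(On paper: assuming a triple-channel DDR3-1333 configuration, that is 3 channels × 1333 MT/s × 8 bytes per transfer ≈ 32 GB/s.)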

You don't have the bandwidth to implement latency-sensitive stuff in that case.

Vincent
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: SMP hashing problem

Post by diep »

Note that in Diep I'm not aligning my hashtable probes onto a cache line, and the length of an entry is 16 bytes today, maybe 20 or 24 bytes again tomorrow. So because Diep's entries potentially spread over more than one cache line of the processor, it could go wrong in theory.
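For illustration, a minimal C sketch (hypothetical names, not Diep's code) of the kind of alignment that keeps a 16-byte entry inside a single cache line; with 20- or 24-byte entries that guarantee is gone:

Code:

#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint64_t word1;
    uint64_t word2;
} HashEntry;                       /* 16 bytes, a power of two */

HashEntry *alloc_table(size_t nentries)
{
    void *base = NULL;
    /* Base aligned to a 64-byte cache line; with a power-of-two entry size
       no entry can straddle a line boundary. */
    if (posix_memalign(&base, 64, nentries * sizeof(HashEntry)) != 0)
        return NULL;
    return (HashEntry *)base;
}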

Even then, some years ago when I measured (I had 20 bytes per entry back then), I measured 0 errors.

On the supercomputer it was 1 in 200 billion stores.

The number of errors is a LOT smaller than the number of hashtable errors caused by collisions that you claim in your paper you can tolerate. So in your case just do the XOR, as that single instruction has a cost of 0, and never worry about write errors again :)

My argument, however, is that the number of errors I could measure on a CPU where such errors could occur was tinier than the number of collisions I measured there when using a 64-bit Zobrist key.

The number of errors was far smaller than I had guessed and than the math had shown me would occur. How many times a second does the evaluation misevaluate, on the other hand?

Best Regards,
Vincent
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: SMP hashing problem

Post by diep »

Bob, if you can produce an error on a Nehalem in this manner (note that Nehalem doesn't have ECC RAM), I advise you to definitely ship that version with source code to a lot of different people who can try to figure out what causes the bug.

My experience is that Diep is a much better program for exercising the worst-case paths of the processor than, say, prime95. Power consumption is also higher when Diep runs than with other software (of course don't ignore I/O; that has to stay constant).

I would not rule out that with chess programs we can sometimes unveil a bug in a processor. For now I'd assume a RAM error if you manage to measure this on a Nehalem with 8 threads.

Parallel search is not deterministic, but with so many writes to the hashtable it should definitely be possible to get another error in a similar amount of time.

I hope you realize that back then a 200-billion-node run with Diep took a long time and 200 processors.

200 billion nodes really is a LOT for one search of Diep in 2003.

It's a tad slower in nps than Crafty :)

Vincent
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: SMP hashing problem

Post by bob »

diep wrote:
bob wrote:
diep wrote:More explanation on how the memory controller works:

Vincent says: (4:26:57 AM) we write nonstop with all cores to RAM
Vincent says: (4:27:13 AM) now suppose 2 cores write at the same time to the same address in RAM
Vincent says: (4:27:34 AM) one entry is 16 bytes 
Vincent says: (4:27:50 AM) core1 writes (a1,a2)
Vincent says: (4:27:59 AM) core2 writes (b1,b2)
Vincent says: (4:28:15 AM) can we get in ram stored (a1,b2) or (b1,a2) ?
Vincent says: (4:28:22 AM) as that would mean big problem
Michael says: (4:28:51 AM) no. that's not really how it works
Michael says: (4:29:37 AM) all cores share the same two memory controllers via crossbar
Michael says: (4:30:11 AM) and that is essentially the same thing as if they only had one memory controller / channel
Michael says: (4:30:25 AM) so they cannot write "simultaneously" to memory

Anyway, I knew this already back in 2002 or so. I just asked to confirm it.

Note that on some HPC processors like the R14000 you can get a write error, as they are more complex than PC processors.

Thanks,
Vincent
Michael is not thinking of what I am talking about. If the CPU only stores A1, then pauses for any reason, from an interrupt to a hyper-threading context swap, the other CPU can store B1 and B2, and then the first one later comes back to the current process and stores A2.

But for hardware, I still believe he is wrong, because of the "write-combine" buffers Intel uses. A1 could be on the tail end of one of those buffers and get written out _before_ A2 gets written out. Other cores can then overwrite A1 and A2 before A2 gets written, and we are right back to this problem. The x86 actually has load/store fence operations you can use to help with this, if you are writing in asm. But even those do not solve the problem that two consecutive mov [mem], rax and mov [mem+8], rbx do not necessarily get executed close together, even though they appear consecutively in the program being run. And without that guarantee, we have to take evasive action ourselves...
Well Bob, suppose you were a CPU designer.

Hyperthreading doesn't matter in this context, of course.
Let's first look at a quad-core CPU.

Suppose that cores could simultaneously write to the same cache line and pollute their own L2 like that, instead of writing the ENTIRE cache line from cache to RAM.

Just SUPPOSE.

Then it would suddenly be totally impossible to sell that processor.

Because not a single mechanism would work correctly anymore with respect to RAM.

So the only case where it can happen is when there is more hardware between the memory controller and the CPU.

That happens for example in the R14000. There is a hub in between, so if an interrupt happens that hub could pollute it.

Note that the bandwidth of those CPUs to RAM is really ugly because of this.

We're speaking of something that has 1GB/s bandwidth.

Today's Nehalem is 32 GB/s on paper. In some benchmarks I saw it get 19 GB/s in practice, and there have even been claims of 21 to 22 GB/s.

You don't have the bandwidth to implement latency-sensitive stuff in that case.

Vincent
That's all well and good. But you are one step _beyond_ what I am talking about. In this context the memory controller is moot. That's not where the problem is. The problem lies in the fact that I need to store either two 64-bit words for a hash entry, or four 8-byte words for a pawn hash entry. And I can't store 'em in one cycle.

In the simplest case, I do this (AT&T format for Linux asm):

movq %r8, (%rax)
movq %r9, 8(%rax)

That assumes that hash word 1 is in %r8, and hash word 2 is in %r9, and that %rax points to the hash table.

Two different CPUs execute the above at the same time, or at least start to execute it. The two words for the hash entry are different on each CPU, but they hash to the same table entry, so %rax is the same.

On CPU 1, we do the first movq, and then an interrupt pops in. We context-switch to the interrupt handler and start executing there. Meanwhile, on CPU 2, we do both stores, overwriting the first word CPU 1 stored. Once CPU 1 finishes handling the interrupt, it resumes and stores word 2. Now we have word 1 from CPU 2's position and word 2 from CPU 1's position. And there is _nothing_ that can be done short of locking the entire entry, or else doing the XOR trick I currently use.
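To make that concrete, here is a minimal C sketch of that XOR trick (hypothetical type and field names, assuming two 64-bit words per entry): the signature is stored XORed with the data word, so a torn pair produced by interleaved stores fails the signature check on probe and is simply treated as a miss.

Code:

#include <stdint.h>

/* Lockless hashing sketch: word1 holds signature XOR data, word2 holds data. */
typedef struct {
    uint64_t word1;
    uint64_t word2;
} HashEntry;

void hash_store(HashEntry *slot, uint64_t sig, uint64_t data)
{
    slot->word1 = sig ^ data;       /* two ordinary 8-byte stores, no lock */
    slot->word2 = data;
}

int hash_probe(const HashEntry *slot, uint64_t sig, uint64_t *data)
{
    uint64_t d = slot->word2;
    if ((slot->word1 ^ d) != sig)   /* torn entry or different position: reject */
        return 0;
    *data = d;
    return 1;
}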

That's bad enough. But then we have the write-combine buffers in the cache, so that as words are scheduled to be written back to memory, they can be written back in a _different_ order than they were stored. There are instructions one can use to deal with this, and you can find references in the Linux kernel source about the problem. The worst case is the Alpha, but the Intel case is bad enough. And both of the above are _before_ the memory controller exerts any influence on what is going on. And both are real problems for this kind of hash usage...
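As a rough illustration of those fence instructions (a sketch only, not what Crafty does): a store fence between the two 8-byte stores forces them to become globally visible in program order, which addresses the ordering/write-combine concern, but it does nothing to stop another CPU's stores from landing in between, so the interleaving problem above remains.

Code:

#include <stdint.h>
#include <xmmintrin.h>   /* _mm_sfence */

void store_entry_ordered(volatile uint64_t *slot, uint64_t w1, uint64_t w2)
{
    slot[0] = w1;
    _mm_sfence();        /* w1 is made globally visible before w2 */
    slot[1] = w2;        /* the pair is still not atomic */
}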
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: SMP hashing problem

Post by bob »

diep wrote:Note that in Diep I'm not aligning my hashtable probes onto a cache line, and the length of an entry is 16 bytes today, maybe 20 or 24 bytes again tomorrow. So because Diep's entries potentially spread over more than one cache line of the processor, it could go wrong in theory.

Even then, some years ago when I measured (I had 20 bytes per entry back then), I measured 0 errors.

On the supercomputer it was 1 in 200 billion stores.
On my 8-way box, I can see several errors per _minute_. If you want an actual count, I can find a version without the XOR trick and run it and have it count the number of positions where the best move is illegal, which is either a hash collision (which is _very_ rare from past testing) or a case where the signature doesn't go with the rest of the entry because of the problem I have explained...




The number of errors is a LOT smaller than the number of hashtable errors caused by collisions that you claim in your paper you can tolerate. So in your case just do the XOR, as that single instruction has a cost of 0, and never worry about write errors again :)
Again, you are missing the point. I am not concerned with the _score_ being wrong. I already know that won't hurt the search a bit. But the other information being wrong can cause problems. An 8-bit mask that says "pawn on D file is passed" will cause my evaluation to crash, if there is no pawn on the D file. It produces an invalid subscript, which goes to an invalid array address, which gives another invalid subscript that causes a crash. So this is not about the score, either in the hash table or pawn hash table. It is about the data being wrong, and wrong data can make the program crash and burn since I don't want to waste the time to check every subscript I compute to verify it is valid...
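To illustrate (a hypothetical sketch, not Crafty's actual evaluation code): a passed-pawn bit that does not correspond to a real pawn yields a bogus square, which then becomes an invalid subscript downstream.

Code:

#include <stdint.h>

typedef struct {
    uint64_t key;
    uint8_t  passed_w;     /* bit f set => "white passed pawn on file f"    */
    int8_t   pawn_sq[8];   /* most advanced white pawn per file, -1 if none */
} PawnHashEntry;

static const int passed_bonus[8] = { 0, 5, 10, 20, 40, 80, 120, 0 };

int score_passers(const PawnHashEntry *e)
{
    int score = 0;
    for (int file = 0; file < 8; file++) {
        if (e->passed_w & (1 << file)) {
            int sq = e->pawn_sq[file];      /* garbage if the entry is torn        */
            score += passed_bonus[sq / 8];  /* garbage square => invalid subscript */
        }
    }
    return score;
}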


My argument, however, is that the number of errors I could measure on a CPU where such errors could occur was tinier than the number of collisions I measured there when using a 64-bit Zobrist key.

I do not believe you are trying to measure what I am describing. I had this _same_ problem on the Cray, from the original XMP through the T90, and I have had it on every multiple-CPU system I have used since, from the Alphas, to the SPARCs, to the Itaniums, to the current x86-64 boxes (both AMD and Intel).



The number of errors was far smaller than I had guessed and than the math had shown me would occur. How many times a second does the evaluation misevaluate, on the other hand?

Best Regards,
Vincent
If a "misevaluation" caused a crash, you could tolerate none. That's the problem here.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: SMP hashing problem

Post by bob »

diep wrote:Bob, if you can produce an error on a Nehalem in this manner (note that Nehalem doesn't have ECC RAM), I advise you to definitely ship that version with source code to a lot of different people who can try to figure out what causes the bug.
It happens on the i7 and every other box I have used. I first saw this on the i7 after I had changed the passed-pawn scoring so that bad data would crash it. It happens on our Opteron cluster, on the core-2 cluster, and on the original core-1 cluster. It happens on my dual PIV box, on my core-2 laptop, on the quad Xeon-700 box, on the quad-Xeon PII-450 box... and on an SMP Sun SPARC box we have.

It is not a hardware problem.


My experience is that Diep is a much better program for exercising the worst-case paths of the processor than, say, prime95. Power consumption is also higher when Diep runs than with other software (of course don't ignore I/O; that has to stay constant).

I would not rule out that with chess programs we can sometimes unveil a bug in a processor. For now I'd assume a RAM error if you manage to measure this on a Nehalem with 8 threads.

Parallel search is not deterministic, but with so many writes to the hashtable it should definitely be possible to get another error in a similar amount of time.

I hope you realize that back then a 200-billion-node run with Diep took a long time and 200 processors.

200 billion nodes really is a LOT for one search of Diep in 2003.

It's a tad slower in nps than Crafty :)

Vincent
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: SMP hashing problem

Post by bob »

diep wrote:
bob wrote:
diep wrote:
bob wrote:
diep wrote: Actually DDR2 ram does store more than 8 bytes at once.

It depends largely on whether we look at memory controllers from Intel or from AMD. Huge differences.

Let's start with intel.

It's a 128-bit path for Nehalem.
DDR3, so that's 16 * 2^2 = 64 bytes at once.

DDR4 is 128 bytes stored at once,
and DDR5 stores 256 bytes at once.

The DDR2 RAM you've got stores it in chunks of 32 bytes.
All of that is well and good, but practically it does _not_ work like that. I ran into this identical problem on multiple platforms, all the way up to a dual quad-core i7. I debugged and fixed it on our dual quad-core 2.33GHz Xeon (Core 2) cluster.

The pawn hash table is aligned on a 32-byte starting address. If it really stored 32 bytes at once, this problem could _never_ have happened. Yet I could make it happen at will on the dual Xeon, once I found a position that was problematic. Ditto for the dual i7 I was testing on...
Hi Bob, on the memory controller we have the luxury of asking an expert.

Actually I'm in the lucky position here of easily being able to quote experts.

Note that Nehalem, thanks to its triple memory controller, does 96 bytes at once to and from the memory controllers.

What I also asked is how this works.
"it is completely independent of the platform"
"you just go by I/O pin"
Vincent says: (3:59:01 AM) so it is also the case for amd?
Vincent says: (3:59:05 AM) or just for intel
Michael says: (3:59:08 AM) because it is done internally on the memory
Michael says: (3:59:26 AM) no matter who supplies the memory controller
Michael says: (3:59:40 AM) Intel, AMD. VIA, HP
Michael says: (3:59:44 AM) doesn't matter

Thanks to Michael Schuette (director of technology at OCZ).

He complained, by the way, about a memory test of mine. As it uses just 8 bytes, the memory controller internally has to throw away a lot of bytes, and that throwing away costs some time, making it slower to get 8 bytes than 32, 64, or 128 bytes on DDR2, DDR3, and DDR4 respectively.

Heh - try to find such experts at any uni :)

So the point is that with DDR2/DDR3 RAM a big bunch of bytes gets read and written at once, not just a few. That makes sense of course; otherwise they wouldn't get the bandwidth that they get.

The processors work differently with respect to cache-line sizes and alignment. Now I must avoid running into trouble, as not all the tests I have written in my life are public source code (understatement).

How the store buffers work differs from processor to processor.
If you use entries of 16 bytes and you are sure they are aligned, it obviously cannot happen on most PC processors. It would be interesting to figure out better how this works for Nehalem.

If you get 'weird results', please try to reproduce it; then I can take a look. There must be a bug somewhere.

Speaking of bugs, I removed a parallel search bug from Diep that had been in there since shortly before the 1999 world championship :)

Though it hardly ever occurred. Possibly it never crashed because of it, thanks to other code.

As we both know, nothing can ever be claimed bug-free in parallel search engines, that's the thing :)

Maybe you simply measured a bitflip, who knows?

Vincent
I just realized that you and I are talking about a _different_ problem. The issue of a cache controller writing multiple quadwords at once is not the issue. What I described was this:

We have two processors, A and B. A searches to reach position Pa, while B searches to reach position Pb. Pa and Pb are completely different positions, but the lower N bits of Pa and Pb are the same so they both hash to the same memory address (table entry).

Processor A needs to store two words in the hash table, both of which belong to position Pa. Processor B needs to store two words in the hash table, both of which go with position Pb. And they are going to store at the same memory address. A stores the first 8 bytes of the hash signature, which is all you can write to memory with one instruction on x86 (a mov instruction with an 8-byte register, or movq in Linux). A then gets "distracted" for any reason. An interrupt. Hyperthreading is enabled and the other thread executes an instruction, whatever. While that is happening B stores _both_ words of the Pb entry to memory. A then comes along and stores the second word of Pa. Now we have the first word of Pb, the second word of Pa, and that's broken.

This is not an issue of writing a complete 64-byte / 128-byte L2 cache line to memory in one operation. It is an issue of the CPU being unable to write more than 8 bytes at once, or for more recent processors no more than 16, which then fails on my 32-byte pawn hash entries. So it isn't about the cache-to-memory write as much as it is about the CPU instruction set's ability to write to cache/memory.

And this happens _often_. Even on the 16-byte hash table entries, and it happens _much_ more often on the 32-byte pawn hash entries. Forget about memory controllers and such and figure out how to store 16 bytes with one instruction on x86. You can't... And that is a problem for the above scenarios...
Suppose we have 2 different entries both going to the same slot in the hashtable.

{a1,a2} and {b1,b2}

What we want to know is the odds of writing {a1,b2} or {b1,a2}.

We want to know this because we use the XOR trick to avoid it.

In supercomputer processors like the R14000, an interrupt can indeed cause you to write {a1,b2} or {b1,a2}. You guessed that right. This is, however, not how modern microprocessors work. They are way more sophisticated than this. At a philosophical level you can understand why: if it worked like the above, then of course it would be impossible to keep a PC very stable, and the bandwidth to RAM would be too little.

On PC hardware as it is now this is not possible, PROVIDED that you write the entire entry within one cache line. It then gets stored to RAM at once, without interruption.

Of course the scenario where you KILL a process is not interesting to discuss, as then Crafty is dead. However, even in THAT case it should go OK according to the PC paper specs, as the flush operation is atomic: either the entire cache line gets flushed or nothing at all.

Memory bitflips CAN occur, however. If you have ECC RAM you can already take a look at how many bitflips happen in RAM, as it generates statistics about this. That is very easy to do in Linux.

So with non-ECC RAM, which the majority of machines use, you could sometimes get a bitflip, in which case you can get a 'seemingly' different entry, which gets reported to you as a write error...

Thanks,
Vincent
I have no idea what you are talking about. On CPU 1, I execute this:

movq %r8, (%rax)
movq %r9, 8(%rax)

On CPU 2, I execute the same code.

On CPU 1, let's say r8 has a1 and r9 has a2. On CPU 2, r8 has b1 and r9 has b2. I can _trivially_ get {a1, b2} or {b1, a2}, because I can't make both movq instructions on one CPU execute before they execute on the other, unless I use a lock, which is too expensive.

This is not a hardware error. It is simply a fact of life on SMP hardware. There are other issues with the write-combine buffers that _also_ add to this problem, as the order in which the CPU writes to memory does not have to be followed by the cache controller as memory gets updated...
jwes
Posts: 778
Joined: Sat Jul 01, 2006 7:11 am

Re: SMP hashing problem

Post by jwes »

If you used the 120 bit store MOVDQA instruction, would it solve the problem ?
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: SMP hashing problem

Post by bob »

jwes wrote:If you used the 120 bit store MOVDQA instruction, would it solve the problem ?
For the primary hash table, possibly. For the 32-byte pawn hash entries, no.

(I assume you mean 128 bits also).

This also adds a bit of a twist in dealing with the "XMM" registers, since the actual computation is done in the normal registers, so the two words have to be moved into an XMM register before they can be stored.
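A rough sketch of what that would look like (hypothetical names, assuming SSE2 and a 16-byte-aligned entry; note that on these processors a 128-bit SSE store is not architecturally guaranteed to be atomic, which is why it only "possibly" helps):

Code:

#include <stdint.h>
#include <emmintrin.h>   /* SSE2: _mm_set_epi64x, _mm_store_si128 */

typedef struct {
    uint64_t word1;      /* hash signature */
    uint64_t word2;      /* data           */
} __attribute__((aligned(16))) HashEntry;   /* aligned for movdqa */

void store_entry_sse(HashEntry *slot, uint64_t word1, uint64_t word2)
{
    /* Pack both 64-bit words into one XMM register (the GPR -> XMM move
       mentioned above), then issue a single 16-byte store (movdqa). */
    __m128i packed = _mm_set_epi64x((long long)word2, (long long)word1);
    _mm_store_si128((__m128i *)slot, packed);
}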
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: SMP hashing problem

Post by Sven »

If this code:

Code:

movq %r8, (%rax)
movq %r9, 8(%rax)
results in only one memory write operation, i.e. the CPU translates it into microcode that does only one write of (at least) 16 bytes instead of two writes of 8 bytes, then neither {a1,b2} nor {a2,b1} will occur.

While a couple of years ago I would have said "no, impossible", today's processor architecture lets me say "ok, let's think about it" since the "real life behaviour" of a modern CPU seems to be much more intelligent than most of us can imagine.

Bob is convinced that two writes of 8 bytes each will occur, so that overlapping writes are possible in this case, while Vincent is convinced of the opposite (although he mentions the keyword "interrupt", which does not seem to be involved in this SMP scenario).

The key point might be the question of whether the whole 16 bytes are in the same cache line or not. If they are, then I would expect there to be only one 16-byte write per CPU, and so no overlapping writes.

However, from the "safety" viewpoint I would prefer Bob's side.

Just my 2 cents!
Sven