interesting SMP bug

rvida · Post by **rvida** » Fri Sep 27, 2013 2:27 pm

rtitle wrote: (1) What needs to be per-thread, what needs to be global?

This was discussed in this thread

syzygy · Post by **syzygy** » Fri Sep 27, 2013 5:20 pm

rtitle wrote:Now, let's talk about the transposition table - per-thread or global?. That wants to be global, right? I mean, it doesn't seem like you'd want to clone the transposition table to each thread. But if it's global, how do you guys protect it so it doesn't get corrupted by multiple threads writing, and so that a read doesn't get corrupted results by another thread writing?

Most simply ignore the problem, i.e. pray for the best. If you really insist on addressing it you can use this trick, but see this paper: a few incorrect probes aren't going to hurt you. Any solution that does not have an insignificant cost will only hurt in terms of playing strength. But make sure the engine does not crash on illegal moves.

rtitle · Post by **rtitle** » Sat Sep 28, 2013 2:11 pm

Thanks for the links to those 2 papers.

The Hyatt & Cozzie paper concludes that occasional wrong answer due to hash collisions is tolerable (and does experiments to show that). That matches my own intuition. TT hash collisions can happen even in a single-threaded implementation. You can detect some or all collisions by storing more bits in each TT entry (at one extreme you can store the complete position representation in each entry so you absolutely *know* when you've got a hash collision). But I concluded intuitively (and this paper proves) that it just isn't worth it.

However, let's not conflate the thread safety problem with the hash collision problem. The statement is made in that same Hyatt & Cozzie paper "A parallel search aggravates this [the collision problem] because multiple threads can write to the hash table in various orders, leaving hash entries that are incorrect". Yes but, umm I hate to argue with a computer science professor but: In general, non-atomic writes to the same data from different threads at the same time, not protected by locking, can produce corrupted results. These corrupted results might only have the same effect as a hash collision (i.e. give you a wrong answer on a subsequent TT lookup). Or they might have worse effects, like producing a completely nonsensical TT entry that later confuses your program when it looks at it and causes it to crash. It all depends on exactly what is the structure of your TT entries and how the program uses them after lookup.

I have a similar criticism of the other paper ("A lockless transposition table..."). The scheme assumes a certain layout of TT entries (a pair of 64-bit integers) and assumes atomic 64-bit write. So it's quite limited. More general lock-free hashing schemes exist (e.g. see http://citeseerx.ist.psu.edu/viewdoc/do ... 1&type=pdf ) but they are complex to implement.

So I'm still unsure what to do in my own program. For now I am doing like you guys - no locking and hope fior the best. I might try implementing a lock-free scheme as described in the Maged Michael paper at some point.

BTW in my day job I work on implementing systems like database systems & storage systems (I used to work for a major vendor of such systems, now I am doing the same thing at a startup). In this context it is simply unacceptable to say "oh, we'll occasionally corrupt your data because we didn't bother to protect the storage with locks, but gosh look how fast our DB runs". So for me, writing multithreaded code that does unprotected reads and writes of global data is scratching my fingernails on a chalkboard. It goes against all my training.

Cheers,

Rich Title

brianr · Post by **brianr** » Sat Sep 28, 2013 2:31 pm

Chess programs are not conducting banking, merely trying to win a game.

Philosophically, think about the benefits of null move vs total accuracy--it is no contest. The other way to think about it is that we want the most correct move in any given position, where "most correct" is the best in terms of improving winning chances.

Sometimes it is worth the extra cost in time to add something like more certainty (e.g. always extend in check) or knowledge (e.g. piece square tables), other times it is not. The trade-offs in your engine are up to you.

IIRC, what the paper showed is that hash table "errors" (surprisingly even "lots" of them) have almost no effect on playing strength (of course, provided the moves are legal and the engine does not crash, as was noted).

syzygy · Post by **syzygy** » Sat Sep 28, 2013 2:41 pm

rtitle wrote:Yes but, umm I hate to argue with a computer science professor but: In general, non-atomic writes to the same data from different threads at the same time, not protected by locking, can produce corrupted results. These corrupted results might only have the same effect as a hash collision (i.e. give you a wrong answer on a subsequent TT lookup). Or they might have worse effects, like producing a completely nonsensical TT entry that later confuses your program when it looks at it and causes it to crash. It all depends on exactly what is the structure of your TT entries and how the program uses them after lookup.

The main danger is illegal moves, which can easily be checked for. Of course it is possible to devise a way of using the TT that could lead to other kinds of crashes, in which case you need to test for more than just move legality.

I have a similar criticism of the other paper ("A lockless transposition table..."). The scheme assumes a certain layout of TT entries (a pair of 64-bit integers) and assumes atomic 64-bit write. So it's quite limited.

That is not a valid criticism imo. The paper is aimed at parallel chess programs using a TT and gives a simple and efficient solution for that case. An inefficient complicated scheme is fine in theory but not interesting for a practical chess engine.

I am not 100% sure about this, but I think there is no assumption that 64-bit writes are atomic. The paper does mention this assumption when explaining the problem, but I am not convinced that the proposed lockless strategy relies on this. I suppose the paper (but not the scheme) can be criticised for not clarifying this point.

BTW in my day job I work on implementing systems like database systems & storage systems (I used to work for a major vendor of such systems, now I am doing the same thing at a startup). In this context it is simply unacceptable to say "oh, we'll occasionally corrupt your data because we didn't bother to protect the storage with locks, but gosh look how fast our DB runs".

Of course, but writing a chess engine is an art of its own. In chess it definitely is acceptable. In fact, a top engine is built out of thousands of engineering decisions where some degree of inaccuracy is traded off against more speed and (in particular) higher search depth.

michiguel · Post by **michiguel** » Sat Sep 28, 2013 3:17 pm

rtitle wrote:Thanks for the links to those 2 papers.

The Hyatt & Cozzie paper concludes that occasional wrong answer due to hash collisions is tolerable (and does experiments to show that). That matches my own intuition. TT hash collisions can happen even in a single-threaded implementation. You can detect some or all collisions by storing more bits in each TT entry (at one extreme you can store the complete position representation in each entry so you absolutely *know* when you've got a hash collision). But I concluded intuitively (and this paper proves) that it just isn't worth it.

However, let's not conflate the thread safety problem with the hash collision problem. The statement is made in that same Hyatt & Cozzie paper "A parallel search aggravates this [the collision problem] because multiple threads can write to the hash table in various orders, leaving hash entries that are incorrect". Yes but, umm I hate to argue with a computer science professor but: In general, non-atomic writes to the same data from different threads at the same time, not protected by locking, can produce corrupted results. These corrupted results might only have the same effect as a hash collision (i.e. give you a wrong answer on a subsequent TT lookup). Or they might have worse effects, like producing a completely nonsensical TT entry that later confuses your program when it looks at it and causes it to crash. It all depends on exactly what is the structure of your TT entries and how the program uses them after lookup.

I have a similar criticism of the other paper ("A lockless transposition table..."). The scheme assumes a certain layout of TT entries (a pair of 64-bit integers) and assumes atomic 64-bit write. So it's quite limited. More general lock-free hashing schemes exist (e.g. see http://citeseerx.ist.psu.edu/viewdoc/do ... 1&type=pdf ) but they are complex to implement.

So I'm still unsure what to do in my own program. For now I am doing like you guys - no locking and hope fior the best. I might try implementing a lock-free scheme as described in the Maged Michael paper at some point.

BTW in my day job I work on implementing systems like database systems & storage systems (I used to work for a major vendor of such systems, now I am doing the same thing at a startup). In this context it is simply unacceptable to say "oh, we'll occasionally corrupt your data because we didn't bother to protect the storage with locks, but gosh look how fast our DB runs". So for me, writing multithreaded code that does unprotected reads and writes of global data is scratching my fingernails on a chalkboard. It goes against all my training.

When you realize that a chess engine is a gigantic gambling entity, that is probabilistically trying to guess what is the move that most likely lead me to a win, you will see that this is not "corrupting" data, but lowering the chances to have a better score at a given point of the tree. The benefit is of course you get to analyze more nodes. The data is not perfect to start with, so the term "corruption" is not philosophically appropriate.

Miguel

Cheers,

Rich Title

bob · Post by **bob** » Sat Sep 28, 2013 5:22 pm

rtitle wrote:Thanks for the links to those 2 papers.

The Hyatt & Cozzie paper concludes that occasional wrong answer due to hash collisions is tolerable (and does experiments to show that). That matches my own intuition. TT hash collisions can happen even in a single-threaded implementation. You can detect some or all collisions by storing more bits in each TT entry (at one extreme you can store the complete position representation in each entry so you absolutely *know* when you've got a hash collision). But I concluded intuitively (and this paper proves) that it just isn't worth it.

However, let's not conflate the thread safety problem with the hash collision problem. The statement is made in that same Hyatt & Cozzie paper "A parallel search aggravates this [the collision problem] because multiple threads can write to the hash table in various orders, leaving hash entries that are incorrect". Yes but, umm I hate to argue with a computer science professor but: In general, non-atomic writes to the same data from different threads at the same time, not protected by locking, can produce corrupted results. These corrupted results might only have the same effect as a hash collision (i.e. give you a wrong answer on a subsequent TT lookup). Or they might have worse effects, like producing a completely nonsensical TT entry that later confuses your program when it looks at it and causes it to crash. It all depends on exactly what is the structure of your TT entries and how the program uses them after lookup.

I have a similar criticism of the other paper ("A lockless transposition table..."). The scheme assumes a certain layout of TT entries (a pair of 64-bit integers) and assumes atomic 64-bit write. So it's quite limited. More general lock-free hashing schemes exist (e.g. see http://citeseerx.ist.psu.edu/viewdoc/do ... 1&type=pdf ) but they are complex to implement.

So I'm still unsure what to do in my own program. For now I am doing like you guys - no locking and hope fior the best. I might try implementing a lock-free scheme as described in the Maged Michael paper at some point.

BTW in my day job I work on implementing systems like database systems & storage systems (I used to work for a major vendor of such systems, now I am doing the same thing at a startup). In this context it is simply unacceptable to say "oh, we'll occasionally corrupt your data because we didn't bother to protect the storage with locks, but gosh look how fast our DB runs". So for me, writing multithreaded code that does unprotected reads and writes of global data is scratching my fingernails on a chalkboard. It goes against all my training.

Cheers,

Rich Title

1. The paper assumes nothing about the tt format. I also use this for my pawn hashing, which has entries FAR longer than 16 bytes. Just xor them all together and it works.

2. No machine today does not do at LEAST 64 bit writes, at least not machines you would use to run a chess engine, that is not a problem. However, this lockless hashing scheme does NOT depend on any level of atomic write. BEFORE you write the 16 byte entry, you xor the two halves. To get a match on that entry, you have to read the SAME 16 bytes back, unchanged. Don't know where you picked up the assumption about 8 byte atomic writes, but it is wrong.

syzygy · Post by **syzygy** » Sat Sep 28, 2013 5:51 pm

bob wrote:Don't know where you picked up the assumption about 8 byte atomic writes, but it is wrong.

It's in the Introduction part:

For the remainder of this discussion, assume that 64 bit memory read/write operations are atomic, that is the entire 64 bit value is read/written in one cycle. This leads to the conclusion that a 128 bit read/write is broken up into two 64 bit reads/writes, and therefore this is not atomic since it becomes two distinct cpu-to-memory transactions.

As I already wrote earlier in this thread, there is no indication that this assumption is actually needed for the lockless hashing scheme to work. I tend to agree with you that it is not needed. But I can see how a reader of the paper could be left with the impression that it is required.

mcostalba · Post by **mcostalba** » Sat Sep 28, 2013 7:46 pm

bob wrote:
michiguel wrote:
jdart wrote:I have a very similar killer move scheme, except killers are per-thread, and the local killer list is cleared before a new search commences at a split point. This may be less efficient but it doesn't have the problem you mention.

--Jon
At the point of parallelization, I copy the killers array for the new thread. This is a very local information of I believe that the structure should be local.

Miguel
I do that as well. But at the actual SPLIT point, the killers are shared, as is the move list and everything else for that particular node/ply. While I agree it is local, I believe that keeping something globally available is better than starting from zero...

I might run a few test positions 10,000 times each way to see which produces the smaller tree, if either is better.

Miguel comment is very natural and I also thought the problem was simply you keep a global array of killers instead of copying them at split point. But I don't understand your answer.

When you split in a node you copy what needs to be copied (including killers) and then proceed with the search after the split. The split node remains suspended because master thread copies too all the stuff and proceeds on a copied data, even if it is the master.

Do you mean that in your implementation master thread when splitting does not make a copy of its data (but only that of other slaves), and continues searching using the pre-split data structures ?

bob · Post by **bob** » Sat Sep 28, 2013 8:58 pm

mcostalba wrote:
bob wrote:
michiguel wrote:
jdart wrote:I have a very similar killer move scheme, except killers are per-thread, and the local killer list is cleared before a new search commences at a split point. This may be less efficient but it doesn't have the problem you mention.

--Jon
At the point of parallelization, I copy the killers array for the new thread. This is a very local information of I believe that the structure should be local.

Miguel
I do that as well. But at the actual SPLIT point, the killers are shared, as is the move list and everything else for that particular node/ply. While I agree it is local, I believe that keeping something globally available is better than starting from zero...

I might run a few test positions 10,000 times each way to see which produces the smaller tree, if either is better.
Miguel comment is very natural and I also thought the problem was simply you keep a global array of killers instead of copying them at split point. But I don't understand your answer.

When you split in a node you copy what needs to be copied (including killers) and then proceed with the search after the split. The split node remains suspended because master thread copies too all the stuff and proceeds on a copied data, even if it is the master.

Do you mean that in your implementation master thread when splitting does not make a copy of its data (but only that of other slaves), and continues searching using the pre-split data structures ?

When I split at ply=N, I create m split blocks for m threads. But when searching at ply=N, ALL threads use the original split block for the move list and move generation staging information.

My "bug" was caused by searching killers for ply=N, then ply=N-2. Once I search those, I exclude them from searching after I generate all moves (I search killers without generating anything). If they change between the time I searched them, and the time I start excluding them from the current move list, I would exclude the WRONG moves.

Simple to fix, but difficult to find.

interesting SMP bug

Re: interesting SMP bug

Re: interesting SMP bug

Re: interesting SMP bug

Re: interesting SMP bug

Re: interesting SMP bug

Re: interesting SMP bug

Re: interesting SMP bug

Re: interesting SMP bug

Re: interesting SMP bug

Re: interesting SMP bug