Page 1 of 2

How to find SMP bugs ?

Posted: Wed Mar 15, 2017 1:07 pm
by lucasart
I have a working SMP implementation (lazy). It even scales very well… When it doesn't crash. Problem is: crashes are never reproducible, they just happen randomly.

How do you go about finding SMP bugs ?

So far the only method I can think of is to remove things one by one, and continue so long as the crashes continue. When crashes stop, you've probably removed the faulty piece. Very tedious, and not guaranteed to work…

Re: How to find SMP bugs ?

Posted: Wed Mar 15, 2017 1:36 pm
by mar
First thing that comes to mind would be TT move validation.
So I'd start with a simple check that tt move is legal.

The reason is always the same: a race condition somewhere.
So you you can battle it by either synchronizing or making some data local per thread.

Re: How to find SMP bugs ?

Posted: Wed Mar 15, 2017 2:09 pm
by Sven
lucasart wrote:I have a working SMP implementation (lazy). It even scales very well… When it doesn't crash. Problem is: crashes are never reproducible, they just happen randomly.

How do you go about finding SMP bugs ?

So far the only method I can think of is to remove things one by one, and continue so long as the crashes continue. When crashes stop, you've probably removed the faulty piece. Very tedious, and not guaranteed to work…
Also watch out for any "static" variables. Some of them could lead to dubious behaviour as well.

Re: How to find SMP bugs ?

Posted: Wed Mar 15, 2017 2:10 pm
by cdani
mar wrote:First thing that comes to mind would be TT move validation.
So I'd start with a simple check that tt move is legal.
To solve this in one go, I generated random moves, even with nonsense squares and states. Now should be 100% reliable in Andscacs.

Re: How to find SMP bugs ?

Posted: Wed Mar 15, 2017 2:23 pm
by mar
cdani wrote:To solve this in one go, I generated random moves, even with nonsense squares and states. Now should be 100% reliable in Andscacs.
Yes, that's a good idea - I did something similar in Cheng a long time ago - simply tried to throw all possible move representations at it - this allowed me to fix all the remaining bugs and no problems at all since then;

(however my TT move validation code is very complicated due to the fact that I have way too many move-related flags to check; this was probably a design mistake)

Re: How to find SMP bugs ?

Posted: Wed Mar 15, 2017 6:13 pm
by jdart
It is usually helpful to get a stack trace where it crashed. For example if it is an illegal memory access. That can give you a clue where the issue is.

Besides just ordinary bugs, you have to pay very careful attention to locking. All you need is a tiny time window during which one thread can write into something another one is reading, and that can cause rare crashes.

(in C++11, atomic variables can sometimes substitute for locks).

Also check out the -fsanitze=thread option in recent version of GCC:

https://gcc.gnu.org/onlinedocs/gcc/Inst ... tions.html.

--Jon

Re: How to find SMP bugs ?

Posted: Wed Mar 15, 2017 9:53 pm
by jwes
lucasart wrote:I have a working SMP implementation (lazy). It even scales very well… When it doesn't crash. Problem is: crashes are never reproducible, they just happen randomly.

How do you go about finding SMP bugs ?

So far the only method I can think of is to remove things one by one, and continue so long as the crashes continue. When crashes stop, you've probably removed the faulty piece. Very tedious, and not guaranteed to work…
Can you run it and have it crash in a debugger, or set up a JIT debugger?

Re: How to find SMP bugs ?

Posted: Sun Mar 19, 2017 2:31 am
by lucasart
I'll have a look at the TT, when I get electricity back… Simply disable it to see if crashes persist. But I don't think that will ne the answer.

My TT entries are (key,data) pairs where key and data are 64-bits. I know that nothing is guaranteed by the C standard, but in practice (reasonable machine with reasonable compiler), each 64-bit write is atomic. So you should not end up with garbage data. If 2 threads write (key1,data1) and (key2,data2) concurrently, the only scrambled results you can get are (key1,data2) or (key2,data1). Is that right ? Or can things get worse with things like (key1,key2) or (data1,key1), etc. ?

Also I don't understand how illegal TT moves can be a problem. I never play the TT move. I generate pseudo legal moves, and when scoring them I put a very high score if the move is the TT move. Maybe other engines are directly playing the TT move, in an attempt to save time generating moves if it fails bigh. I don't do that.

Re: How to find SMP bugs ?

Posted: Sun Mar 19, 2017 2:48 pm
by jdart
lucasart wrote:
My TT entries are (key,data) pairs where key and data are 64-bits. I know that nothing is guaranteed by the C standard, but in practice (reasonable machine with reasonable compiler), each 64-bit write is atomic. So you should not end up with garbage data. If 2 threads write (key1,data1) and (key2,data2) concurrently, the only scrambled results you can get are (key1,data2) or (key2,data1). Is that right ? Or can things get worse with things like (key1,key2) or (data1,key1), etc. ?
If your table is aligned then you won't get unatomic writes of the 64-bit values. But having mismatched data can still be a problem. The standard way of dealing with this is Hyatt's lockless hashing method ( (http://www.craftychess.com/hyatt/hashing.html).
Maybe other engines are directly playing the TT move, in an attempt to save time generating moves if it fails bigh. I don't do that.
Practically all modern engines do this. It is a considerable time-saver, because you skip the whole move generation step.

--Jon

Re: How to find SMP bugs ?

Posted: Sun Mar 19, 2017 3:44 pm
by mar
lucasart wrote:Also I don't understand how illegal TT moves can be a problem. I never play the TT move. I generate pseudo legal moves, and when scoring them I put a very high score if the move is the TT move. Maybe other engines are directly playing the TT move, in an attempt to save time generating moves if it fails bigh. I don't do that.
I forgot that you don't use staged move generator, in this case playing illegal moves shouldn't be the problem.
Still to rule out this possibility completely, perhaps a temporary assert that makemove gets a legal move might not be a bad idea?

I'd still bet on some invalid state due to missing synchronization somewhere.
So either checking master<->helper communication between iterations (depending on what you do) or as Sven said perhaps some global state? I doubt this is the case but still.

If it's neither of the above, then it'll be tedious to find. Maybe stack trace of all threads might help, as others have suggested.