How to find SMP bugs ?

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

User avatar
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

How to find SMP bugs ?

Post by lucasart »

I have a working SMP implementation (lazy). It even scales very well… When it doesn't crash. Problem is: crashes are never reproducible, they just happen randomly.

How do you go about finding SMP bugs ?

So far the only method I can think of is to remove things one by one, and continue so long as the crashes continue. When crashes stop, you've probably removed the faulty piece. Very tedious, and not guaranteed to work…
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: How to find SMP bugs ?

Post by mar »

First thing that comes to mind would be TT move validation.
So I'd start with a simple check that tt move is legal.

The reason is always the same: a race condition somewhere.
So you you can battle it by either synchronizing or making some data local per thread.
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: How to find SMP bugs ?

Post by Sven »

lucasart wrote:I have a working SMP implementation (lazy). It even scales very well… When it doesn't crash. Problem is: crashes are never reproducible, they just happen randomly.

How do you go about finding SMP bugs ?

So far the only method I can think of is to remove things one by one, and continue so long as the crashes continue. When crashes stop, you've probably removed the faulty piece. Very tedious, and not guaranteed to work…
Also watch out for any "static" variables. Some of them could lead to dubious behaviour as well.
User avatar
cdani
Posts: 2204
Joined: Sat Jan 18, 2014 10:24 am
Location: Andorra

Re: How to find SMP bugs ?

Post by cdani »

mar wrote:First thing that comes to mind would be TT move validation.
So I'd start with a simple check that tt move is legal.
To solve this in one go, I generated random moves, even with nonsense squares and states. Now should be 100% reliable in Andscacs.
mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: How to find SMP bugs ?

Post by mar »

cdani wrote:To solve this in one go, I generated random moves, even with nonsense squares and states. Now should be 100% reliable in Andscacs.
Yes, that's a good idea - I did something similar in Cheng a long time ago - simply tried to throw all possible move representations at it - this allowed me to fix all the remaining bugs and no problems at all since then;

(however my TT move validation code is very complicated due to the fact that I have way too many move-related flags to check; this was probably a design mistake)
jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: How to find SMP bugs ?

Post by jdart »

It is usually helpful to get a stack trace where it crashed. For example if it is an illegal memory access. That can give you a clue where the issue is.

Besides just ordinary bugs, you have to pay very careful attention to locking. All you need is a tiny time window during which one thread can write into something another one is reading, and that can cause rare crashes.

(in C++11, atomic variables can sometimes substitute for locks).

Also check out the -fsanitze=thread option in recent version of GCC:

https://gcc.gnu.org/onlinedocs/gcc/Inst ... tions.html.

--Jon
jwes
Posts: 778
Joined: Sat Jul 01, 2006 7:11 am

Re: How to find SMP bugs ?

Post by jwes »

lucasart wrote:I have a working SMP implementation (lazy). It even scales very well… When it doesn't crash. Problem is: crashes are never reproducible, they just happen randomly.

How do you go about finding SMP bugs ?

So far the only method I can think of is to remove things one by one, and continue so long as the crashes continue. When crashes stop, you've probably removed the faulty piece. Very tedious, and not guaranteed to work…
Can you run it and have it crash in a debugger, or set up a JIT debugger?
User avatar
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: How to find SMP bugs ?

Post by lucasart »

I'll have a look at the TT, when I get electricity back… Simply disable it to see if crashes persist. But I don't think that will ne the answer.

My TT entries are (key,data) pairs where key and data are 64-bits. I know that nothing is guaranteed by the C standard, but in practice (reasonable machine with reasonable compiler), each 64-bit write is atomic. So you should not end up with garbage data. If 2 threads write (key1,data1) and (key2,data2) concurrently, the only scrambled results you can get are (key1,data2) or (key2,data1). Is that right ? Or can things get worse with things like (key1,key2) or (data1,key1), etc. ?

Also I don't understand how illegal TT moves can be a problem. I never play the TT move. I generate pseudo legal moves, and when scoring them I put a very high score if the move is the TT move. Maybe other engines are directly playing the TT move, in an attempt to save time generating moves if it fails bigh. I don't do that.
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: How to find SMP bugs ?

Post by jdart »

lucasart wrote:
My TT entries are (key,data) pairs where key and data are 64-bits. I know that nothing is guaranteed by the C standard, but in practice (reasonable machine with reasonable compiler), each 64-bit write is atomic. So you should not end up with garbage data. If 2 threads write (key1,data1) and (key2,data2) concurrently, the only scrambled results you can get are (key1,data2) or (key2,data1). Is that right ? Or can things get worse with things like (key1,key2) or (data1,key1), etc. ?
If your table is aligned then you won't get unatomic writes of the 64-bit values. But having mismatched data can still be a problem. The standard way of dealing with this is Hyatt's lockless hashing method ( (http://www.craftychess.com/hyatt/hashing.html).
Maybe other engines are directly playing the TT move, in an attempt to save time generating moves if it fails bigh. I don't do that.
Practically all modern engines do this. It is a considerable time-saver, because you skip the whole move generation step.

--Jon
mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: How to find SMP bugs ?

Post by mar »

lucasart wrote:Also I don't understand how illegal TT moves can be a problem. I never play the TT move. I generate pseudo legal moves, and when scoring them I put a very high score if the move is the TT move. Maybe other engines are directly playing the TT move, in an attempt to save time generating moves if it fails bigh. I don't do that.
I forgot that you don't use staged move generator, in this case playing illegal moves shouldn't be the problem.
Still to rule out this possibility completely, perhaps a temporary assert that makemove gets a legal move might not be a bad idea?

I'd still bet on some invalid state due to missing synchronization somewhere.
So either checking master<->helper communication between iterations (depending on what you do) or as Sven said perhaps some global state? I doubt this is the case but still.

If it's neither of the above, then it'll be tedious to find. Maybe stack trace of all threads might help, as others have suggested.