I have a working SMP implementation (lazy). It even scales very well… When it doesn't crash. Problem is: crashes are never reproducible, they just happen randomly.
How do you go about finding SMP bugs ?
So far the only method I can think of is to remove things one by one, and continue so long as the crashes continue. When crashes stop, you've probably removed the faulty piece. Very tedious, and not guaranteed to work…
How to find SMP bugs ?
Moderators: hgm, Rebel, chrisw
-
- Posts: 3232
- Joined: Mon May 31, 2010 1:29 pm
- Full name: lucasart
How to find SMP bugs ?
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
-
- Posts: 2559
- Joined: Fri Nov 26, 2010 2:00 pm
- Location: Czech Republic
- Full name: Martin Sedlak
Re: How to find SMP bugs ?
First thing that comes to mind would be TT move validation.
So I'd start with a simple check that tt move is legal.
The reason is always the same: a race condition somewhere.
So you you can battle it by either synchronizing or making some data local per thread.
So I'd start with a simple check that tt move is legal.
The reason is always the same: a race condition somewhere.
So you you can battle it by either synchronizing or making some data local per thread.
-
- Posts: 4052
- Joined: Thu May 15, 2008 9:57 pm
- Location: Berlin, Germany
- Full name: Sven Schüle
Re: How to find SMP bugs ?
Also watch out for any "static" variables. Some of them could lead to dubious behaviour as well.lucasart wrote:I have a working SMP implementation (lazy). It even scales very well… When it doesn't crash. Problem is: crashes are never reproducible, they just happen randomly.
How do you go about finding SMP bugs ?
So far the only method I can think of is to remove things one by one, and continue so long as the crashes continue. When crashes stop, you've probably removed the faulty piece. Very tedious, and not guaranteed to work…
-
- Posts: 2204
- Joined: Sat Jan 18, 2014 10:24 am
- Location: Andorra
Re: How to find SMP bugs ?
To solve this in one go, I generated random moves, even with nonsense squares and states. Now should be 100% reliable in Andscacs.mar wrote:First thing that comes to mind would be TT move validation.
So I'd start with a simple check that tt move is legal.
Daniel José - http://www.andscacs.com
-
- Posts: 2559
- Joined: Fri Nov 26, 2010 2:00 pm
- Location: Czech Republic
- Full name: Martin Sedlak
Re: How to find SMP bugs ?
Yes, that's a good idea - I did something similar in Cheng a long time ago - simply tried to throw all possible move representations at it - this allowed me to fix all the remaining bugs and no problems at all since then;cdani wrote:To solve this in one go, I generated random moves, even with nonsense squares and states. Now should be 100% reliable in Andscacs.
(however my TT move validation code is very complicated due to the fact that I have way too many move-related flags to check; this was probably a design mistake)
-
- Posts: 4367
- Joined: Fri Mar 10, 2006 5:23 am
- Location: http://www.arasanchess.org
Re: How to find SMP bugs ?
It is usually helpful to get a stack trace where it crashed. For example if it is an illegal memory access. That can give you a clue where the issue is.
Besides just ordinary bugs, you have to pay very careful attention to locking. All you need is a tiny time window during which one thread can write into something another one is reading, and that can cause rare crashes.
(in C++11, atomic variables can sometimes substitute for locks).
Also check out the -fsanitze=thread option in recent version of GCC:
https://gcc.gnu.org/onlinedocs/gcc/Inst ... tions.html.
--Jon
Besides just ordinary bugs, you have to pay very careful attention to locking. All you need is a tiny time window during which one thread can write into something another one is reading, and that can cause rare crashes.
(in C++11, atomic variables can sometimes substitute for locks).
Also check out the -fsanitze=thread option in recent version of GCC:
https://gcc.gnu.org/onlinedocs/gcc/Inst ... tions.html.
--Jon
-
- Posts: 778
- Joined: Sat Jul 01, 2006 7:11 am
Re: How to find SMP bugs ?
Can you run it and have it crash in a debugger, or set up a JIT debugger?lucasart wrote:I have a working SMP implementation (lazy). It even scales very well… When it doesn't crash. Problem is: crashes are never reproducible, they just happen randomly.
How do you go about finding SMP bugs ?
So far the only method I can think of is to remove things one by one, and continue so long as the crashes continue. When crashes stop, you've probably removed the faulty piece. Very tedious, and not guaranteed to work…
-
- Posts: 3232
- Joined: Mon May 31, 2010 1:29 pm
- Full name: lucasart
Re: How to find SMP bugs ?
I'll have a look at the TT, when I get electricity back… Simply disable it to see if crashes persist. But I don't think that will ne the answer.
My TT entries are (key,data) pairs where key and data are 64-bits. I know that nothing is guaranteed by the C standard, but in practice (reasonable machine with reasonable compiler), each 64-bit write is atomic. So you should not end up with garbage data. If 2 threads write (key1,data1) and (key2,data2) concurrently, the only scrambled results you can get are (key1,data2) or (key2,data1). Is that right ? Or can things get worse with things like (key1,key2) or (data1,key1), etc. ?
Also I don't understand how illegal TT moves can be a problem. I never play the TT move. I generate pseudo legal moves, and when scoring them I put a very high score if the move is the TT move. Maybe other engines are directly playing the TT move, in an attempt to save time generating moves if it fails bigh. I don't do that.
My TT entries are (key,data) pairs where key and data are 64-bits. I know that nothing is guaranteed by the C standard, but in practice (reasonable machine with reasonable compiler), each 64-bit write is atomic. So you should not end up with garbage data. If 2 threads write (key1,data1) and (key2,data2) concurrently, the only scrambled results you can get are (key1,data2) or (key2,data1). Is that right ? Or can things get worse with things like (key1,key2) or (data1,key1), etc. ?
Also I don't understand how illegal TT moves can be a problem. I never play the TT move. I generate pseudo legal moves, and when scoring them I put a very high score if the move is the TT move. Maybe other engines are directly playing the TT move, in an attempt to save time generating moves if it fails bigh. I don't do that.
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
-
- Posts: 4367
- Joined: Fri Mar 10, 2006 5:23 am
- Location: http://www.arasanchess.org
Re: How to find SMP bugs ?
If your table is aligned then you won't get unatomic writes of the 64-bit values. But having mismatched data can still be a problem. The standard way of dealing with this is Hyatt's lockless hashing method ( (http://www.craftychess.com/hyatt/hashing.html).lucasart wrote:
My TT entries are (key,data) pairs where key and data are 64-bits. I know that nothing is guaranteed by the C standard, but in practice (reasonable machine with reasonable compiler), each 64-bit write is atomic. So you should not end up with garbage data. If 2 threads write (key1,data1) and (key2,data2) concurrently, the only scrambled results you can get are (key1,data2) or (key2,data1). Is that right ? Or can things get worse with things like (key1,key2) or (data1,key1), etc. ?
Practically all modern engines do this. It is a considerable time-saver, because you skip the whole move generation step.Maybe other engines are directly playing the TT move, in an attempt to save time generating moves if it fails bigh. I don't do that.
--Jon
-
- Posts: 2559
- Joined: Fri Nov 26, 2010 2:00 pm
- Location: Czech Republic
- Full name: Martin Sedlak
Re: How to find SMP bugs ?
I forgot that you don't use staged move generator, in this case playing illegal moves shouldn't be the problem.lucasart wrote:Also I don't understand how illegal TT moves can be a problem. I never play the TT move. I generate pseudo legal moves, and when scoring them I put a very high score if the move is the TT move. Maybe other engines are directly playing the TT move, in an attempt to save time generating moves if it fails bigh. I don't do that.
Still to rule out this possibility completely, perhaps a temporary assert that makemove gets a legal move might not be a bad idea?
I'd still bet on some invalid state due to missing synchronization somewhere.
So either checking master<->helper communication between iterations (depending on what you do) or as Sven said perhaps some global state? I doubt this is the case but still.
If it's neither of the above, then it'll be tedious to find. Maybe stack trace of all threads might help, as others have suggested.