interesting SMP bug
Posted: Tue Sep 24, 2013 8:47 pm
I just found a bug I have been searching for for years. Thought I would relay it here just in case it might affect anyone using the same sort of idea...
I have a typical killer move array, two per ply. For quite a while I have been first trying the two killers for the current ply, then the two killers from two plies earlier. That made a significant difference in the size of the tree.
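For readers who have not seen this ordering before, here is a minimal sketch of it in C. It is not the engine's actual code: the names `KillerTable`, `collect_killers`, `MAX_PLY`, and the `int` move encoding (with 0 meaning "empty slot") are all assumptions made for illustration.

```c
#include <assert.h>

#define MAX_PLY 64            /* assumed depth limit, illustration only */
#define KILLERS_PER_PLY 2     /* "two per ply", as described above */

typedef int Move;             /* assumed move encoding; 0 = empty slot */

typedef struct {
    Move killers[MAX_PLY][KILLERS_PER_PLY];
} KillerTable;

/* Gather killer candidates for `ply`: the two killers for the current
   ply first, then the two killers from two plies earlier.
   Writes at most 4 moves into `out` and returns how many were written. */
int collect_killers(const KillerTable *kt, int ply, Move out[4]) {
    assert(ply >= 0 && ply < MAX_PLY);
    int source_plies[2] = { ply, ply - 2 };
    int n = 0;
    for (int p = 0; p < 2; p++) {
        if (source_plies[p] < 0)
            continue;                     /* no ply-2 this close to the root */
        for (int k = 0; k < KILLERS_PER_PLY; k++) {
            Move m = kt->killers[source_plies[p]][k];
            if (m != 0)
                out[n++] = m;             /* skip empty slots */
        }
    }
    return n;
}
```

A real engine would also check that each candidate is legal in the current position before searching it; that step is left out here.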
After I search the hash move, then non-losing captures, and killers, I generate the rest of the moves, and for efficiency I carefully exclude any previously searched killers from that move list. I was doing a sanity test on a recent change and noticed that I had missed WAC #266, which is a simple mate, taking 11 plies and basically no time. The normal single-thread search spotted it instantly, but when I started testing, about 1 out of every 3-4 runs the parallel search would miss it.
The key line is Rxh2+ Bxh2 Ng3+ and on to mate. I was splitting at ply=1, and one thread followed the above to ply=3, where it needed to search everything. Meanwhile, back at the ranch, the other parallel thread changed the killers for ply=1 and just happened to insert the move Ng3 as a killer. The first thread was busy searching at ply=3, and when it got to the third move in the list to search, namely Ng3+, it concluded "already searched, it is a killer for ply=1, so skip searching it again." But Ng3+ had never actually been tried in the killer phase, since it was not a killer at that point, so the mating move was never searched at all.
The fix was simple: as I search killers, I save them in a local array, and when I start culling them from the full move list, I use this local array (up to 4 entries) to cull, rather than the killer list, which can be changed by the other threads at a common split point.
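The idea of the fix can be sketched roughly as follows. This is an illustration under assumptions, not the actual patch: `TriedKillers`, `remember_tried`, and `cull_tried` are hypothetical names, and moves are again encoded as plain `int` values. The point is only that the cull consults a per-thread record of what was really searched, never the shared killer table.

```c
#include <assert.h>

typedef int Move;             /* assumed move encoding */
#define MAX_TRIED 4           /* 2 killers for this ply + 2 from two plies back */

/* Per-node record of the killers this thread actually searched.
   It lives on the searching thread's stack, so no other thread can
   change it between the killer phase and the cull. */
typedef struct {
    Move moves[MAX_TRIED];
    int count;
} TriedKillers;

/* Call this each time a killer is actually searched. */
void remember_tried(TriedKillers *t, Move m) {
    assert(t->count < MAX_TRIED);
    t->moves[t->count++] = m;
}

static int was_tried(const TriedKillers *t, Move m) {
    for (int i = 0; i < t->count; i++)
        if (t->moves[i] == m)
            return 1;
    return 0;
}

/* Compact `list` in place, dropping only moves recorded in `t`.
   The shared killer table is deliberately not read here.
   Returns the new length of the list. */
int cull_tried(Move *list, int n, const TriedKillers *t) {
    int kept = 0;
    for (int i = 0; i < n; i++)
        if (!was_tried(t, list[i]))
            list[kept++] = list[i];
    return kept;
}
```

With this arrangement, a killer slot that another thread overwrites after the killer phase has no effect on which moves get culled from the remaining move list.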
ugh.
I went back and, sure enough, a couple of other odd positions I had saved no longer fail. They used to either overlook a forced repetition, or find what appeared to be a forced repetition when there was a key escaping move that was getting culled as above.
One less bug, who knows how many remain...
I will release 23.7 once I run a full test on it, as it also has a couple of fixes for the skill command to stop the time losses some were seeing with skill=1 or such.