hgm wrote:Thanks for the feedback, everyone. I gather the NPS slowdown is not normal, so I looked a bit into that. The laptop is an Intel i3, which is not supposed to have turbo boost. But I know that there are chipsets for laptops that throttle down memory access (because memory is often poorly cooled in laptops). I use no locks (and did not even bother to validate TT probes through lockless hashing).
When I repeated the 1-core search while a second, independent instance of the engine was analyzing, the NPS suffered 14-16% on the laptop! So I took the .exe to my i7 machine (which is also not supposed to have turbo boost), and I got the following results:
Code: Select all
depth    1-core   1-core   2-core   4-core
                  loaded
   6         12       12        7        6
   7         20       20       10       11
   8         34       34       18       17
   9         67       65       37       29
  10        131      132       90       62
  11        354      352      205      146
  12       1165     1156      692      427
  13       2252     2248     2042     1633
  14       5099     5174     8484     7124
  15      16136    16559    23551    18495

NPS(12-ply)   2,939   3,034   2,600
Now there is basically no difference in NPS with 1 core when I run a second instance of the engine ('loaded'), nor when I switch to 2-core SMP. Because of that the speedup looks much nicer: the 2-core searches take about 60% of the time of the 1-core searches (e.g. 692 vs. 1165 at 12 ply). This seems to agree with what I have heard from others for this simple unsynchronized approach. (At depth > 12 the PVs are completely different, so I am not sure how to compare the numbers there in a sensible way.)
Of course this unsynchronized approach scales very poorly: when I try 4 cores, the speedup compared to 2 cores is much smaller than the speedup from 1 to 2 cores. This is partly due to a 13% drop in NPS per core. I had noticed such an NPS drop before when I ran more than 2 independent games in parallel, so I suppose there is not much that can be done about it on the hardware I have. If it is due to a memory bottleneck, not probing the TT in QS might prevent it, but probing the TT in QS also reduced time-to-depth on a single thread, so dropping it has a cost. Perhaps a small in-cache private TT for QS only could be a compromise, something like the sketch below.
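Such a private QS table could be as simple as this (sizes, field layout and names are just for illustration, not what I actually have; the point is that it is per-process, small enough to mostly stay in cache, and never touches the shared TT):

Code: Select all
#include <stdint.h>
#include <stddef.h>

#define QS_TT_BITS 14                       /* 16K entries * 16 bytes = 256 KiB */
#define QS_TT_SIZE (1u << QS_TT_BITS)

typedef struct {
    uint64_t key;                           /* full key; no lockless tricks needed,
                                               the table is private to one process */
    int16_t  score;
    uint8_t  flags;                         /* bound type: lower/upper/exact       */
    uint8_t  depth;                         /* remaining QS depth (or just 0)       */
    uint32_t move;
} QSEntry;

static QSEntry qsTT[QS_TT_SIZE];            /* one per process, never shared        */

static QSEntry *qs_probe(uint64_t key)
{   /* single slot, always-replace */
    QSEntry *e = &qsTT[key & (QS_TT_SIZE - 1)];
    return (e->key == key) ? e : NULL;
}

static void qs_store(uint64_t key, int score, int flags, int depth, uint32_t move)
{
    QSEntry *e = &qsTT[key & (QS_TT_SIZE - 1)];
    e->key   = key;
    e->score = (int16_t)score;
    e->flags = (uint8_t)flags;
    e->depth = (uint8_t)depth;
    e->move  = move;
}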
I already have a lot of patches to better 'synchronize' the search processes, and these did speed up the 2-core search in earlier tests. But at that time it was all for naught, because I could not get the basics right, and the unsynchronized 2-core search slowed things down by nearly a factor of 2, which even the smartest algorithm I could think of couldn't earn back. (And I did not have more than 2 cores at the time.) So I will now start to apply those, to see how much I can improve the scaling. They basically work by labeling a hash entry as 'busy' when the search enters the node, rather than waiting with the update until the search of the node has completed. When the search encounters a 'busy' node that is not a 'must have' in the parent (i.e. an early move that is likely to be a cut move), it immediately returns with an invalid result, so that the parent can postpone the search of that move until after it has searched another move. Only when there is nothing else left to search do the remaining moves become 'must haves', and the search tries them again, to help in their search (see the sketch below).
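In pseudo-C the idea looks roughly like this (all names, the hash-entry layout and the engine interface are simplified for illustration; only the control flow is the point):

Code: Select all
#include <stdint.h>

#define INF      30000
#define INVALID  (INF + 1)     /* sentinel score: "subtree busy, retry later" */
#define BUSY     1             /* flag bit in the hash entry                  */
#define MAXMOVES 256

typedef struct { uint64_t key; int16_t score; uint8_t flags, depth; } HashEntry;
typedef struct Position Position;                  /* opaque, engine-specific */

/* hypothetical engine interface */
extern HashEntry *HashProbe(uint64_t key);
extern void HashMarkBusy(uint64_t key, int depth); /* sets BUSY on entry      */
extern void HashStore(uint64_t key, int score, int depth); /* clears BUSY     */
extern uint64_t PosKey(Position *pos);
extern int  GenerateMoves(Position *pos, int moves[]);
extern void MakeMove(Position *pos, int move);
extern void UnMake(Position *pos, int move);
extern int  QSearch(Position *pos, int alpha, int beta);

int Search(Position *pos, int alpha, int beta, int depth, int mustHave)
{
    if (depth <= 0) return QSearch(pos, alpha, beta);

    HashEntry *e = HashProbe(PosKey(pos));

    /* Another process is already searching this node; if the parent does not
       absolutely need it yet, bail out so it can pick a different move first. */
    if (e && (e->flags & BUSY) && !mustHave)
        return INVALID;

    HashMarkBusy(PosKey(pos), depth);     /* label 'busy' on entry, not on exit */

    int moves[MAXMOVES], deferred[MAXMOVES];
    int nMoves = GenerateMoves(pos, moves), nDeferred = 0, best = -INF;

    for (int i = 0; i < nMoves; i++) {               /* first pass             */
        MakeMove(pos, moves[i]);
        int s = Search(pos, -beta, -alpha, depth - 1, 0);
        UnMake(pos, moves[i]);
        if (s == INVALID) {                          /* busy elsewhere:        */
            deferred[nDeferred++] = moves[i];        /* reschedule for later   */
            continue;
        }
        s = -s;
        if (s > best) { best = s; if (s > alpha) alpha = s; }
        if (alpha >= beta) { nDeferred = 0; break; } /* cutoff: drop deferrals */
    }

    for (int j = 0; j < nDeferred; j++) {            /* second pass: nothing   */
        MakeMove(pos, deferred[j]);                  /* else left, so deferred */
        int s = -Search(pos, -beta, -alpha, depth - 1, 1); /* moves are now    */
        UnMake(pos, deferred[j]);                    /* 'must haves'           */
        if (s > best) { best = s; if (s > alpha) alpha = s; }
        if (alpha >= beta) break;
    }

    HashStore(PosKey(pos), best, depth);             /* result replaces BUSY   */
    return best;
}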
Another inconvenience is that this SMP code is Windows-specific, while in the meantime I have moved my development to Linux. Launching the slave processes is simple enough on Linux (through pipe(2) and fork(2)), but I would have to figure out how to share memory between processes on Linux to make it also work there.
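The usual recipe for forked slaves seems to be an anonymous shared mmap() made before the fork(): MAP_SHARED|MAP_ANONYMOUS pages stay shared across fork(), unlike ordinary malloc()ed memory, which becomes copy-on-write. Something like this (HashEntry layout and table size are just placeholders):

Code: Select all
#include <stdio.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

typedef struct { uint64_t key; uint64_t data; } HashEntry; /* illustrative    */
#define HASH_ENTRIES (1u << 24)                            /* 256 MB here     */

int main(void)
{
    size_t bytes = (size_t)HASH_ENTRIES * sizeof(HashEntry);

    /* shared between master and all forked slaves */
    HashEntry *tt = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (tt == MAP_FAILED) { perror("mmap"); return 1; }

    pid_t pid = fork();
    if (pid == 0) {
        /* slave: writes here are visible to the master */
        tt[0].key = 0x0123456789abcdefULL;
        _exit(0);
    }
    /* master */
    wait(NULL);
    printf("slave wrote %016llx\n", (unsigned long long)tt[0].key);
    return 0;
}

For slaves that are launched as separate executables rather than forked from the master, the equivalent would be POSIX shm_open() plus mmap() of the same name in each process.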