Crazy SMP

Stan Arts · Post by **Stan Arts** » Mon Jun 20, 2016 2:02 am

20% seems a whole lot, but as you say it might have something to do with the system itself. Or it might have to do with heavy hashing in Qsearch or evaluation, for example. But for Nemeton on my i5 quad desktop I don't seem to slow down more than 10% regardless of what the other cores are doing, and if they are doing Nemeton chess the slowdown is no more than a few percent at most.

But if you search on two cores in exactly the same way on both, you might not see much time-to-depth improvement at all, which might be what you are seeing.
This seemed to be the case for Nemeton when I implemented extremely simple lazy SMP for the Leiden tournament at the end of April. (I make a copy of the NemetonChess unit (Pascal) per core and pass the pointers to the hashtable along to each. The first core remains the main core, whose PV score and move are used in all cases, which is not very optimal. Other cores might have a better score/move/depth etc. Yet it kinda works.)
My first crude solution (which I am still using as I could not yet be bothered to look into it further) is to have the second core always search one ply deeper. This helps the main core along a bit better and picks up a lot of tactical shots fast.
On four cores I let one core search at the same depth as the main core and the remaining two cores 1 and 2 ply deeper. Far from ideal, but much better than doing the same on all cores.
At Leiden I picked up ideas such as searching the root moves in reverse or random order on the other cores, which is probably a much better idea, but I did not get around to trying it yet.
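
The depth assignment described above can be written down as a small lookup table. This is only a sketch: the names `DEPTH_OFFSETS` and `helper_depths` are made up for illustration, and only the 2-core and 4-core rows come from the post (the 1-core row is the trivial case).

```python
# Depth offsets per search process, relative to the main core's iteration depth.
# Only the 2-core and 4-core rows are taken from the post; names are hypothetical.
DEPTH_OFFSETS = {
    1: [0],             # trivial single-core case
    2: [0, 1],          # the second core searches one ply deeper
    4: [0, 0, 1, 2],    # one helper at the same depth, two at +1 and +2 ply
}


def helper_depths(main_depth, n_cores):
    """Return the depth each core searches while the main core is at main_depth."""
    return [main_depth + offset for offset in DEPTH_OFFSETS[n_cores]]


print(helper_depths(10, 2))
print(helper_depths(10, 4))
```

The first entry is always the main core, whose PV score and move are the ones reported.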

Anyway, it seems Nemeton now suddenly improved a lot, not by SMP but by finally re-searching reduced moves that turn out good anyway. I never wanted to do this, as I have no good clean way to write down the re-searches in my strange iterative search, but more so my thinking was that I'll still miss a lot regardless of the re-searches.
But now I did an ugly but correct implementation and it's a gigantic improvement. Of course it makes sense to me now that the re-searches reduce the reduction in interesting previously unsearched lines a lot and cost almost nothing, as the initial reduced search acts as some sort of IID. What's more, I can now probably reduce a lot more, which I could not do before.
Such a duality of feels. On the one hand a glorious wallop of Elo, yet disgust to give in to such stinking re-searches.

Joost Buijs · Post by **Joost Buijs** » Mon Jun 20, 2016 7:09 am

This can also be caused by the 'turbo boost' mechanism in the CPU.
It is possible that the clock-frequency with just 1 core fully loaded is (much) higher than with 2 cores fully loaded, of course this depends upon the type of CPU in your laptop.
I don't know whether you run Linux or Windows, under Windows you can monitor the clock-frequencies with utilities like for instance CPUZ.

Usually I switch off the turbo mechanism in the BIOS before doing SMP tests, or I make sure that all cores have the same turbo settings. Unfortunately most laptops don't support this.

Since you are using processes with only a shared hash-table I don't think the problem is 'false sharing', in this type of scenario I never see much slow-down using 1 or all cores, at least not on my system.

hgm · Post by **hgm** » Mon Jun 20, 2016 11:07 am

Thanks for the feedback, everyone. I gather the NPS slowdown is not normal, so I looked a bit into that. The laptop is an Intel i3, which is not supposed to have turbo boost. But I know that there are chipsets for laptops that throttle down memory access (because memory is often poorly cooled in laptops). I use no locks (and did not even bother to validate TT probes through lockless hashing).

When I repeated the 1-core search while there was a second, independent version of the engine analyzing, the NPS suffered 14-16% on the laptop! So I took the .exe to my i7 machine (which is also not supposed to have turbo boost), and I got the following results:

depth 1-core 1-core 2-core 4-core
             loaded
 6        12     12      7      6
 7        20     20     10     11
 8        34     34     18     17
 9        67     65     37     29
10       131    132     90     62
11       354    352    205    146
12      1165   1156    692    427
13      2252   2248   2042   1633
14      5099   5174   8484   7124
15     16136  16559  23551  18495

NPS(12-ply)   2,939  3,034  2,600

Now there is basically no difference between the NPS with 1 core when I run a second instance of the engine ('loaded'), and also not when I switch to 2-core SMP. Because of that the speedup seems much nicer: the 2-core searches take about 60% of the time of a 1-core search. This seems to agree with what I have heard from others for this simple unsynchronized approach. (At depth > 12 the PVs are completely different, so I am not sure how to compare the numbers there in a sensible way.)
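
The "about 60%" figure can be checked directly against the table. A minimal sketch, using only the 1-core and 2-core columns for depths 6 to 12 as posted (the table does not state its time unit, so the ratios are all that is computed); the variable names are made up for illustration:

```python
# Figures copied from the table above, depths 6..12 (units as in the table).
one_core = {6: 12, 7: 20, 8: 34, 9: 67, 10: 131, 11: 354, 12: 1165}
two_core = {6: 7, 7: 10, 8: 18, 9: 37, 10: 90, 11: 205, 12: 692}


def time_ratios(base, parallel):
    """Time of the parallel search as a fraction of the base search, per depth."""
    return {depth: parallel[depth] / base[depth] for depth in base}


ratios = time_ratios(one_core, two_core)
for depth, ratio in ratios.items():
    print(f"depth {depth:2d}: 2-core / 1-core = {ratio:.2f}")

mean_ratio = sum(ratios.values()) / len(ratios)
print(f"mean ratio: {mean_ratio:.2f}")
```

The per-depth ratios all fall between 0.50 and 0.69, which fits the rough 60% estimate. Depths 13 and up are left out because the PVs differ there.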

Of course this unsynchronized approach scales very poorly: when I try 4 cores, the speedup compared to 2 cores is much less than when going from 1 to 2 cores. This is partly due to a 13% drop in NPS (per core). I had noticed such an NPS drop before when I ran more than 2 independent games in parallel, so I suppose there is not much that can be done about it on the hardware I have. If it is due to the memory bottleneck, not probing TT in QS might prevent it, but on a single thread this reduced time-to-depth as well. Perhaps a small in-cache private TT for QS only could be a compromise.

I already have a lot of patches to better 'synchronize' the search processes, and these did speed up the 2-core search in earlier tests. But at that time it was all for naught, because I could not get the basics right, and the unsynchronized 2-core search slowed things down nearly a factor 2, which even the smartest algorithm I could think of couldn't earn back. (And I did not have more than 2 cores at the time.) So I will now start to apply those, to see how much I can improve the scaling. They basically work by labeling the hash entries as 'busy' when the search enters the node, rather than waiting with the update until the search on it has completed. And when encountering a 'busy' node which is not a 'must have' in the parent (i.e. an early move that is likely to be a cut move), the search immediately returns with an invalid result, so that the parent can reschedule the search of the move until after it has searched another move. Only when there is nothing else to search do the remaining moves become 'must haves', and the search will try them again, to help in their search.
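
To make the rescheduling idea concrete, here is a toy, single-process sketch of the move loop only. It is not the engine's code: `search_moves`, `make_toy_searcher` and the `try_search(move, must_have)` callback are hypothetical names, a set stands in for the 'busy' flags in the hash table, and alpha-beta cutoffs are left out entirely.

```python
def search_moves(moves, try_search):
    """Search every move, postponing those whose node another process holds.

    try_search(move, must_have) is a hypothetical callback: it returns a score,
    or None (the "invalid result") when the child node is marked busy and
    must_have is False.
    """
    scores = {}
    postponed = []
    for move in moves:
        score = try_search(move, must_have=False)
        if score is None:
            postponed.append(move)   # reschedule behind the other moves
        else:
            scores[move] = score
    # Nothing else left to search: the postponed moves are now must-haves.
    for move in postponed:
        scores[move] = try_search(move, must_have=True)
    return scores


def make_toy_searcher(busy_nodes, toy_scores, log):
    """Build a stand-in for the real search; busy_nodes mimics busy TT entries."""
    def try_search(move, must_have):
        if move in busy_nodes and not must_have:
            return None              # someone else is already searching this node
        log.append(move)             # record the order in which moves get searched
        return toy_scores[move]
    return try_search


log = []
searcher = make_toy_searcher({"b"}, {"a": 1, "b": 5, "c": 3}, log)
result = search_moves(["a", "b", "c"], searcher)
print(log)
```

With move `b` marked busy, the toy searches `a` and `c` first and only then returns to `b`. In a real engine the busy mark would be set in the shared hash entry on entering the node and cleared when the result is stored.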

Another inconvenience is that this SMP code is Windows-specific, while in the meantime I moved my development to Linux. Launching the slave processes is simple enough in Linux (through pipe(2) and fork(2)), but I would have to figure out how I can share memory between processes on Linux to also make it work there.

Joost Buijs · Post by **Joost Buijs** » Mon Jun 20, 2016 11:25 am

You probably know best what kind of CPU sits in your computer, but a core i7 without turbo boost is not very common.
Many core i3 chips introduced after 2013 also feature turbo boost.

hgm · Post by **hgm** » Mon Jun 20, 2016 11:38 am

OK, I will check it out, then. I thought that turbo-boost was mainly a thermal issue, and that once the chip is sufficiently well cooled (as it can be in a desktop), all cores could run at their theoretical max all the time, so that nothing can be gained anymore. But I could be wrong. I run Win7, and should have CPUZ already on it, so it should be easy to check. I recall that 'unloaded' CPUZ indicates the clock mentioned in the specs on which I bought the machine. But it could still be that with all cores active it throttles down. (Although that seems a lot like cheating.)

[Edit] OK, I checked it. The CPU is a Sandy Bridge i7-2600 @ 3.4GHz. The bus speed is 100MHz, and with a single engine analyzing the multiplier jitters between 35 and 36, occasionally hitting 37. (When the machine is idle the multiplier is at 16.) When I let 2-4 engines analyze, the multiplier is stable at 35.

So it seems I got 0.1GHz more than what the specs promised. Perhaps because it is currently pretty cold in the room where I have that desktop. The speedup with a single core in continuous use is sort of negligible. It could still be that for short bursts of activity it is much faster.

Joost Buijs · Post by **Joost Buijs** » Mon Jun 20, 2016 1:17 pm

According to Intel ARK the base frequency is 3.4 GHz. and the max. turbo freq. is 3.8 GHz.

http://ark.intel.com/products/52213/Int ... o-3_80-GHz

With K and X CPUs you are allowed to change the max. turbo frequency manually, for instance to set the turbo freq. at 3.8 even when all cores are fully loaded. When it gets too hot or when it exceeds TDP it will still throttle.
When the processor is properly cooled it will never get too hot with a chess engine (mostly integer calculations) because large parts of the processor remain idle (like the FP units and the MMX units).

Normally there is a scheme, something like 1 core 3.8 GHz., 2 cores 3.7 GHz. etc. I don't know what the scheme for your specific CPU is.
Turbo boost changed a few times and is now in its 3rd incarnation.
When you really want to have reliable results it is best to turn it off in the BIOS (if possible) while you are testing.

Joost Buijs · Post by **Joost Buijs** » Mon Jun 20, 2016 2:04 pm

It seems the turbo scheme for the i7-2600 is 4,3,2,1 (x additional bus clocks ???) this translates to 3.8,3.7,3.6,3.5 GHz.
Strange that there is already a 100 MHz. boost with all 4 cores under load, this makes it effectively a 3500 MHz. processor.
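
As a quick check of that arithmetic, here is a small sketch that turns the "4,3,2,1" scheme into clock frequencies. It assumes the 100 MHz bus clock and the 3.4 GHz base (multiplier 34) reported earlier in the thread; the names are made up for illustration.

```python
BUS_MHZ = 100          # bus clock reported in the thread
BASE_MULTIPLIER = 34   # 3.4 GHz base at a 100 MHz bus clock
# Extra multiplier steps by number of active cores, per the "4,3,2,1" scheme.
TURBO_BINS = {1: 4, 2: 3, 3: 2, 4: 1}


def turbo_mhz(active_cores):
    """Clock in MHz for the given number of loaded cores under this scheme."""
    return (BASE_MULTIPLIER + TURBO_BINS[active_cores]) * BUS_MHZ


for cores in sorted(TURBO_BINS):
    print(cores, "active core(s):", turbo_mhz(cores), "MHz")
```

This gives 3800, 3700, 3600 and 3500 MHz for 1 to 4 active cores. Whether a given machine actually reaches these clocks depends on cooling and power limits, so the numbers are upper bounds from the scheme, not measurements.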

hgm · Post by **hgm** » Mon Jun 20, 2016 3:10 pm

Indeed, this puzzled me too. But I figured it might switch to 34x when you use all 8 hyper threads.

Anyway, the speed differences are not shocking, and the drop from 3,000 to 2,600 knps when going from 2 to 4 cores must be mainly due to other factors. 4 x 2.6 = 10.4, so in principle I should still be able to get a 3x speedup with 4 cores. If I manage to do that before the Olympiad starts next Monday, I would be pretty happy.
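
The estimate above can be spelled out as follows. This is a sketch under stated assumptions: the rounded figures of 3.0 and 2.6 Mnps per core are taken from the post, a single core is assumed to run at roughly the same 3.0 Mnps, and the variable names are made up for illustration.

```python
nps_2core_per_core = 3.0   # Mnps per core, rounded as in the post
nps_4core_per_core = 2.6   # Mnps per core when 4 cores are searching

# Relative per-core slowdown when going from 2 to 4 cores.
drop = 1 - nps_4core_per_core / nps_2core_per_core

# Combined node throughput of 4 cores, and how it compares to one core
# running at about 3.0 Mnps (an assumption based on the NPS row).
total_4core = 4 * nps_4core_per_core
raw_ceiling = total_4core / nps_2core_per_core

print(f"per-core drop: {drop:.1%}")
print(f"4-core throughput: {total_4core:.1f} Mnps")
print(f"raw throughput ratio vs 1 core: {raw_ceiling:.2f}x")
```

The raw node throughput comes out a bit under 3.5 times that of a single core, so a 3x time-to-depth speedup only leaves a small margin for search overhead.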

Joost Buijs · Post by **Joost Buijs** » Mon Jun 20, 2016 4:30 pm

hgm wrote: Indeed, this puzzled me too. But I figured it might switch to 34x when you use all 8 hyper threads.

Anyway, the speed differences are not shocking, and the drop from 3,000 to 2,600 knps when going from 2 to 4 cores must be mainly due to other factors. 4 x 2.6 = 10.4, so in principle I should still be able to get a 3x speedup with 4 cores. If I manage to do that before the Olympiad starts next Monday, I would be pretty happy.

Hopefully you manage to get it working before the Olympiad.

Maybe I will drop by one day to see what is going on there in Leiden.
Since I have to travel by public transportation it takes me at least 1.5 hours to get there (when there are no delays), which holds me back a little.

bob · Post by **bob** » Mon Jun 20, 2016 4:39 pm

hgm wrote: Thanks for the feedback, everyone. I gather the NPS slowdown is not normal, so I looked a bit into that. The laptop is an Intel i3, which is not supposed to have turbo boost. But I know that there are chipsets for laptops that throttle down memory access (because memory is often poorly cooled in laptops). I use no locks (and did not even bother to validate TT probes through lockless hashing).

When I repeated the 1-core search while there was a second, independent version of the engine analyzing, the NPS suffered 14-16% on the laptop! So I took the .exe to my i7 machine (which is also not supposed to have turbo boost), and I got the following results:
depth 1-core 1-core 2-core 4-core
             loaded
 6        12     12      7      6
 7        20     20     10     11
 8        34     34     18     17
 9        67     65     37     29
10       131    132     90     62
11       354    352    205    146
12      1165   1156    692    427
13      2252   2248   2042   1633
14      5099   5174   8484   7124
15     16136  16559  23551  18495

NPS(12-ply)   2,939  3,034  2,600

Now there is basically no difference between the NPS with 1 core when I run a second instance of the engine ('loaded'), and also not when I switch to 2-core SMP. Because of that the speedup seems much nicer: the 2-core searches take about 60% of the time of a 1-core search. This seems to agree with what I have heard from others for this simple unsynchronized approach. (At depth > 12 the PVs are completely different, so I am not sure how to compare the numbers there in a sensible way.)

Of course this unsynchronized approach scales very poorly: when I try 4 cores, the speedup compared to 2 cores is much less than when going from 1 to 2 cores. This is partly due to a 13% drop in NPS (per core). I had noticed such an NPS drop before when I ran more than 2 independent games in parallel, so I suppose there is not much that can be done about it on the hardware I have. If it is due to the memory bottleneck, not probing TT in QS might prevent it, but on a single thread this reduced time-to-depth as well. Perhaps a small in-cache private TT for QS only could be a compromise.

I already have a lot of patches to better 'synchronize' the search processes, and these did speed up the 2-core search in earlier tests. But at that time it was all for naught, because I could not get the basics right, and the unsynchronized 2-core search slowed things down nearly a factor 2, which even the smartest algorithm I could think of couldn't earn back. (And I did not have more than 2 cores at the time.) So I will now start to apply those, to see how much I can improve the scaling. They basically work by labeling the hash entries as 'busy' when the search enters the node, rather than waiting with the update until the search on it has completed. And when encountering a 'busy' node which is not a 'must have' in the parent (i.e. an early move that is likely to be a cut move), the search immediately returns with an invalid result, so that the parent can reschedule the search of the move until after it has searched another move. Only when there is nothing else to search do the remaining moves become 'must haves', and the search will try them again, to help in their search.

Another inconvenience is that this SMP code is Windows-specific, while in the meantime I moved my development to Linux. Launching the slave processes is simple enough in Linux (through pipe(2) and fork(2)), but I would have to figure out how I can share memory between processes on Linux to also make it work there.

If you use fork(), nothing is shared and you have to resort to the System V stuff like shmget() (allocate a block of shared memory) and shmat() (attach a block of shared memory to the current process).

Or you can switch to threads which will share everything but local data automatically...
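
For readers who want to see memory shared across fork() in a few lines, here is a rough Unix-only sketch. Note that it swaps in a different technique from the one named above: it uses an anonymous shared mmap created before the fork, not shmget()/shmat(), because Python's standard library has no direct wrapper for the System V calls. The function name and the value written are made up for illustration; the 8-byte slot merely stands in for a hash-table entry.

```python
import mmap
import os
import struct


def share_value_across_fork():
    """Create shared memory, let a forked child write to it, read it in the parent."""
    # Anonymous mapping (fileno -1). On Unix the default flag is MAP_SHARED,
    # so a child created by fork() sees the same pages, not a private copy.
    shared = mmap.mmap(-1, mmap.PAGESIZE)
    shared[0:8] = struct.pack("q", 0)

    pid = os.fork()
    if pid == 0:
        # Child: stands in for a helper search process storing a hash entry.
        shared[0:8] = struct.pack("q", 12345)
        os._exit(0)

    os.waitpid(pid, 0)                       # wait until the child has written
    (value,) = struct.unpack("q", shared[0:8])
    shared.close()
    return value


if __name__ == "__main__":
    print(share_value_across_fork())
```

The parent reads back the value written by the child, which would not happen with ordinary memory allocated before the fork. In C the same effect is available either through mmap() with MAP_SHARED | MAP_ANONYMOUS or through the shmget()/shmat() pair.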
