Crazy SMP


hgm
Posts: 27788
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Crazy SMP

Post by hgm »

Finally I returned to the project of making my engine SMP, after my failed attempt to do this in the plane from Amsterdam to Tokyo in 2010. I guess my earlier problems were at least partly due to an error in transferring the hash size correctly between the processes. Now that I fixed this, at least I do seem to get some speedup rather than a slowdown for the simplest version. This version runs two processes with a shared hash table, one communicating with the GUI, the other (the first slave) getting the commands relayed to it from the first. When the first starts a search, it puts the second in analyze mode. So this is a totally unsynchronized approach.
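
A minimal sketch of how such a shared hash table can be set up on Windows (an illustration, not HaQiKi D's actual code: the `attach <mask> <name>` command visible in the slave log below suggests the master passes a hash mask and the name of a file mapping, here guessed to be built from the process ID; function names and error handling are made up):

Code: Select all

#include <windows.h>
#include <stdio.h>

HANDLE hMap;
void *hashTable;

/* master: create a named, pagefile-backed mapping and relay its name */
void CreateSharedHash(size_t bytes, FILE *toSlave, unsigned mask)
{
    char name[64];
    sprintf(name, "HashMem%lu", (unsigned long)GetCurrentProcessId());
    hMap = CreateFileMapping(INVALID_HANDLE_VALUE, NULL, PAGE_READWRITE,
                             0, (DWORD)bytes, name);
    hashTable = MapViewOfFile(hMap, FILE_MAP_ALL_ACCESS, 0, 0, bytes);
    fprintf(toSlave, "attach %u %s\n", mask, name); /* slave picks this up */
}

/* slave: on receiving 'attach', open the master's mapping by name */
void AttachSharedHash(const char *name, size_t bytes)
{
    hMap = OpenFileMapping(FILE_MAP_ALL_ACCESS, FALSE, name);
    hashTable = MapViewOfFile(hMap, FILE_MAP_ALL_ACCESS, 0, 0, bytes);
}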

I ran searches of the opening position using 1 and 2 cores; the times to reach the various depths (in cs, the time column of the raw output below) were:

Code: Select all

depth  1-core  2-core
    6      12      12
    7      23      20
    8      43      34
    9      90      62
   10     185     175
   11     514     370
   12    1714    1050
   13    3342    3003
   14    7704   11078
   15   23375   23536
So in most cases the 2-core results are a bit faster, although it varies how much; 14 ply is actually reached only after about 50% more time with 2 cores. In that case the PV differs between the two runs, however, as the raw results posted below show.

In this raw output I print not only the usual depth / score / time / nodes, but after that also the number of hash probes, the number of hash hits on entries written by the probing process itself, and the number of hits on entries written by the other process. (For testing purposes I stored the ID of the writing process in the hash entry.) Each process reports only its own counts.
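
A sketch of that instrumentation (the field layout and the names `owner`, `probes`, `ownHits`, `foreignHits` are illustrative guesses, not HaQiKi D's actual code):

Code: Select all

#include <stddef.h>

typedef struct {
    unsigned long long signature;  /* full key, to verify hits */
    short score;
    unsigned char depth, flags;
    unsigned char owner;           /* ID of the writing process (testing only) */
} HashEntry;

HashEntry *hashTable;              /* points into the shared memory */
unsigned long long hashMask;
unsigned long long probes, ownHits, foreignHits;   /* per-process counters */

HashEntry *ProbeHash(unsigned long long key, int myID)
{
    HashEntry *e = hashTable + (key & hashMask);
    probes++;
    if (e->signature == key) {     /* hit: classify by who wrote it */
        if (e->owner == myID) ownHits++; else foreignHits++;
        return e;
    }
    return NULL;                   /* miss */
}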

If I look at the 12-ply results, the largest depth at which both runs showed the same PV, the single-core case reached 1,997 knps. This drops to 1,635 knps with two cores, but the 1,602 knps of the slave process must of course be added to that (3,237 knps in total). Even so, the speed per process is some 20% lower. I don't see such an NPS decrease when I run two independent engine processes, so this must be overhead due to accessing shared memory.

It can also be seen that of the 16.8M nodes searched by the master process (for 12 ply) there were 10M hash probes. (The rest must have been stand-pat cutoffs, which do not get to probing the hash.) And of these, 2M were hits: 60% from the own process, 40% from the slave. The latter must be responsible for the speedup.

For the 10-ply case, where there was little speedup, there were 1.8M probes and 400k hits, about evenly distributed between the two sources. So not very different from the 12-ply result.

Are these results typical for what others are seeing? The decrease in NPS per core seems a bit disappointing. Perhaps it is caused by a memory bottleneck in my laptop, so that I had better switch to testing on a bigger machine.

Code: Select all

SINGLE CORE
tellics say     HaQiKi D 1.7d SMP
tellics say     by H.G. Muller
> memory 256
> new
> variant xiangqi
> level 40 5 0
> time 1000000
> post
> go
 1     44      0         76      36       4      0 proc0 f0e1
 2     12      0        727     569      71      0 proc0 f0e1 f9e8
 3     16      1       3078    1577     283      0 proc0 f0e1 f9e8 g3g4
 4     24      1      21175   12300    2153      0 proc0 h2h6 b9c7 f0e1 c6c5
 5     24      3      44229   26648    4773      0 proc0 f0e1 h7h3 g3g4 f9e8 c3c4
 6     16     12     195637  123979   20743      0 proc0 h2e2 b7b6 f0e1 h7e7 b2b3 f9e8
 7     20     23     404724  254746   46304      0 proc0 f0e1 f9e8 g3g4 c6c5 h0g2 b9c7 g2f4
 8     16     43     791151  490244   94660      0 proc0 f0e1 f9e8 g3g4 c6c5 h0g2 b7b3 i0i1 b9c7
 9     16     90    1696417 1028450  191668      0 proc0 f0e1 f9e8 g3g4 c6c5 h2e2 b9c7 h0g2 h7h3 g2f4
10     20    185    3592476 2172270  421711      0 proc0 f0e1 f9e8 g3g4 g9e7 h0g2 h9f8 g0e2 i9f9 i0f0 c6c5
11     20    514   10113875 6111370 1147416      0 proc0 f0e1 f9e8 g3g4 g9e7 h0g2 h9f8 g0e2 i9f9 i0f0 h7h3 c3c4
12     20   1714   34236944 20516263 3684090      0 proc0 f0e1 f9e8 g3g4 g9e7 h0g2 h9f8 g0e2 i9f9 i0f0 c6c5 g2f4 b9c7
13     20   3342   67688750 40199511 7149910      0 proc0 f0e1 f9e8 g3g4 g9e7 h0g2 h9f8 g0e2 i9f9 i0f0 c6c5 f0f6 h7h3 c3c4 c5c4 e2c4
14     20   7704  156402542 92280255 15623050      0 proc0 f0e1 f9e8 g3g4 g9e7 h0g2 h9f8 g0e2 i9f9 i0f0 c6c5 b2b6 b9c7 g2f4 h7g7
15     20  23375  472004116 276810109 42459029      0 proc0 f0e1 f9e8 g3g4 b7g7 c3c4 g6g5 g4g5 g7g0 h0i2 h7g7 b2e2 b9c7 g5f5 g0g3 i0f0
# stop iterating, time=233766, tl=116279, id=15 md=63 reps=2228 perps=1318
move f0e1
quit

2 CORES, MASTER PROCESS
tellics say     HaQiKi D 1.7d SMP
tellics say     by H.G. Muller
> cores 2
> memory 256
> new
> variant xiangqi
> level 40 5 0
> time 1000000
> post
> go
 1     44      0         54      15       0      9 proc2 f0e1
 2     12      0        327     238       6     55 proc2 f0e1
 3     16      1       2234    1026      48    248 proc2 f0e1 f9e8
 4     24      1      17640   10265     631   1674 proc2 h2h6 b9c7 f0e1 c6c5
 5     24      3      37057   22395    1500   3515 proc2 f0e1 h7h3 g3g4 f9e8 c3c4
 6     16     12     153286   97046    7835  13485 proc2 h2e2 b7b6 f0e1 h7e7 b2b3 f9e8
 7     20     20     284689  180726   16977  23395 proc2 f0e1 f9e8 g3g4 c6c5 h0g2 b9c7 g2f4
 8     16     34     512585  319958   34258  42523 proc2 f0e1 f9e8 g3g4 c6c5 h0g2 b7b3 i0i1 b9c7
 9     16     62    1017669  621496   65475  83776 proc2 f0e1 f9e8 g3g4 c6c5 h2e2 b9c7 h0g2 h7h3 g2f4
10     20    175    2926448 1799590  203108 202104 proc2 f0e1 f9e8 g3g4 g9e7 h0g2 h9f8 g0e2 i9f9 i0f0 c6c5
11     20    370    6109947 3730090  430238 397176 proc2 f0e1 f9e8 g3g4 g9e7 h0g2 h9f8 g0e2 i9f9
12     20   1050   17164202 10200040 1170140 807093 proc2 f0e1 f9e8 g3g4 g9e7 h0g2 h9f8 g0e2 i9f9 i0f0 c6c5
13     20   3003   50039843 29685875 3327902 2172989 proc2 f0e1 f9e8 g3g4 g9e7 h0g2 h9f8 g0e2 i9f9 i0f0 c6c5 f0f6 h7h3 c3c4 c5c4 e2c4
14     20  11078  185478140 109582190 12631656 5890829 proc2 f0e1 f9e8 g3g4 b7g7 b2e2 b9c7 h0g2 g7g4 g2f4 c6c5 g0i2 g4g3 i0g0 g3c3 g0g6
15     20  23536  398692002 234644820 29374260 9615268 proc2 f0e1 f9e8 g3g4 c6c5 h0g2 b7e7 g0e2 b9c7 i0f0 a9b9 b2c2 b9b5 c3c4 c5c4 e2c4
# stop iterating, time=235360, tl=116279, id=15 md=63 reps=1654 perps=559
move f0e1

2 CORES, SLAVE PROCESS
tellics say     HaQiKi D 1.7d SMP
tellics say     by H.G. Muller
> cores 1
> attach 4194303 HashMem120500296
> new
> force
> variant xiangqi
> analyze
 1    -44      0         76      36       4      0 proc1 f0e1
 2    -12      0        678     530      59     26 proc1 f0e1 f9e8
 3    -16      1       3009    1522     242     59 proc1 f0e1 f9e8 g3g4
 4    -24      1      18944   11070    1480    849 proc1 h2h6 b9c7 f0e1 c6c5
 5    -24      3      39290   23813    3056   2074 proc1 f0e1 h7h3 g3g4 f9e8
 6    -16     12     157117   99533   11485  10667 proc1 h2e2 b7b6 f0e1 h7e7 b2b3 f9e8
 7    -20     20     293153  187715   24439  19189 proc1 f0e1 f9e8 g3g4 c6c5
 8    -16     34     515031  325751   42077  35991 proc1 f0e1 f9e8 g3g4 c6c5 h0g2 b7b3 i0i1 b9c7
 9    -16     62    1012518  625025   77449  72163 proc1 f0e1 f9e8 g3g4 c6c5 h2e2 b9c7 h0g2 h7h3 g2f4
10    -20    175    2901145 1794147  219394 188840 proc1 f0e1 f9e8 g3g4 g9e7 h0g2 h9f8 g0e2 i9f9 i0f0 c6c5
11    -20    370    6011465 3685086  454099 367813 proc1 f0e1 f9e8 g3g4 g9e7 h0g2 h9f8 g0e2 i9f9 i0f0 h7h3 c3c4
12    -20   1050   16817093 10168706 1259155 808277 proc1 f0e1 f9e8 g3g4 g9e7 h0g2 h9f8 g0e2 i9f9 i0f0 c6c5 g2f4 b9c7
13    -20   3003   49335980 29463640 3565563 2014712 proc1 f0e1 f9e8 g3g4 g9e7 h0g2 h9f8 g0e2 i9f9 i0f0 c6c5 f0f6 h7h3 c3c4 c5c4 e2c4
14    -20  11078  182801534 108499592 13229123 5099258 proc1 f0e1 f9e8 g3g4 c6c5 h0g2 b7e7 g0e2 b9c7 i0f0 a9b9 b2c2 b9b5 c3c4 c5c4 e2c4
15    -20  23536  393714646 231360218 28944753 8944958 proc1 f0e1 f9e8 g3g4 c6c5 h0g2 b7e7 g0e2 b9c7 i0f0 a9b9 b2c2
> exit
# stop iterating, time=235360, tl=0, id=15 md=63 reps=2984 perps=1189
f0e1
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Crazy SMP

Post by bob »

Difficult to answer or explain. A drop in NPS per core is obviously something you don't want, but it is also something you will probably never completely eliminate.

The biggest issue is one called "false sharing", where a single 64-byte cache line holds two variables that are modified by different threads. This causes lots of cache invalidations and forwarding.
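
A generic illustration of the problem (not taken from any engine in this thread): per-thread node counters packed into one array share cache lines, so every increment forces the line to ping-pong between cores; padding each counter to its own 64-byte line avoids that.

Code: Select all

#include <stdalign.h>

/* BAD: counters of different threads end up in the same 64-byte line */
unsigned long long nodeCount[8];

/* BETTER: give each thread's counter a cache line of its own */
typedef struct { alignas(64) unsigned long long nodes; } PaddedCounter;
PaddedCounter nodeCounters[8];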

Locks are another issue and the closer to the tips of the tree the worse they hurt...

This is a lifelong quest, of course. :)
Stan Arts
Posts: 179
Joined: Fri Feb 14, 2014 10:53 pm
Location: the Netherlands

Re: Crazy SMP

Post by Stan Arts »

20% seems a whole lot, but as you say it might have something to do with the system itself, or with heavy hashing in QSearch or evaluation, for example. For Nemeton on my i5 quad desktop I don't seem to slow down more than 10% regardless of what the other cores are doing, and if they are also running Nemeton the slowdown is no more than a few percent at most.

But with two cores searching in exactly the same way on both, you might not see much time-to-depth improvement at all, which might be what you are seeing.
That seemed to be the case for Nemeton when I implemented extremely simple lazy SMP for the Leiden tournament at the end of April. (There I make a copy of the NemetonChess unit (Pascal) per core and pass each a pointer to the hash table. The first core remains the main core, whose PV, score and move are used in all cases, which is not very optimal: other cores might have a better score/move/depth etc. Yet it kinda works.)
My first crude solution (which I am still using, as I could not yet be bothered to look into it further) is to have the second core always search one ply deeper. This helps the main core along a bit and picks up a lot of tactical shots fast.
On four cores I let one core search at the same depth as the main core, and the remaining two cores 1 and 2 ply deeper (see the sketch below). Far from ideal, but much better than doing the same on all cores.
At Leiden I picked up ideas such as searching the root moves in reverse or random order on the other cores, which is probably a much better idea, but I did not get around to trying it yet.
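
A sketch of that depth-offset scheme, translated to C (Nemeton itself is Pascal; the function and array names here are made up):

Code: Select all

/* core 0 is the main core; helper cores get a fixed extra depth */
int HelperDepth(int core, int baseDepth)
{
    static const int offset[4] = { 0, 0, 1, 2 };  /* the 4-core scheme above */
    return baseDepth + offset[core];  /* in the 2-core case the helper gets +1 */
}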

Anyway, it seems Nemeton now suddenly improved a lot, not by SMP but by finally re-searching reduced moves that turn out good anyway. I never wanted to do this, as I have no good clean way to write down the re-searches in my strange iterative search, but more so my thinking was that I'll still miss a lot regardless of the re-searches.
But now I did an ugly but correct implementation and it's a gigantic improvement. Of course it makes sense to me now: the re-searches undo the reduction in interesting, previously unsearched lines, and they cost almost nothing, as the initial reduced search acts as a sort of IID. What's more, I can now probably reduce a lot more, which I could not do before.
Such a duality of feels. On the one hand a glorious wallop of Elo, yet disgust at giving in to such stinking re-searches.
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: Crazy SMP

Post by Joost Buijs »

This can also be caused by the 'turbo boost' mechanism in the CPU.
It is possible that the clock frequency with just 1 core fully loaded is (much) higher than with 2 cores fully loaded; of course this depends upon the type of CPU in your laptop.
I don't know whether you run Linux or Windows; under Windows you can monitor the clock frequencies with utilities like CPU-Z.

Usually I switch off the turbo mechanism in the BIOS before doing SMP tests, or I make sure that all cores have the same turbo settings; unfortunately most laptops don't support this.

Since you are using processes with only a shared hash table, I don't think the problem is 'false sharing'; in this type of scenario I never see much slowdown, whether I use 1 or all cores, at least not on my system.
hgm
Posts: 27788
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Crazy SMP

Post by hgm »

Thanks for the feedback, everyone. I gather the NPS slowdown is not normal, so I looked into that a bit. The laptop is an Intel i3, which is not supposed to have turbo boost. But I know there are chipsets for laptops that throttle memory access (because memory is often poorly cooled in laptops). I use no locks (and did not even bother to validate TT probes through lockless hashing).
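
For reference, the 'lockless hashing' alluded to here is the XOR trick of Hyatt and Mann; a minimal sketch (names made up): the entry stores key XOR data instead of the key, so a torn entry, where only one of two concurrent 64-bit writes has landed, simply fails validation and reads as a miss.

Code: Select all

typedef struct { unsigned long long check, data; } Entry;

void Store(Entry *e, unsigned long long key, unsigned long long data)
{
    e->check = key ^ data;   /* the two writes are not atomic together... */
    e->data  = data;
}

int Probe(Entry *e, unsigned long long key)
{
    return (e->check ^ e->data) == key;  /* ...but a torn entry fails here */
}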

When I repeated the 1-core search while a second, independent instance of the engine was analyzing, the NPS suffered 14-16% on the laptop! So I took the .exe to my i7 machine (which is also not supposed to have turbo boost), and I got the following results:

Code: Select all

depth 1-core 1-core 2-core 4-core
             loaded
 6        12     12      7      6
 7        20     20     10     11
 8        34     34     18     17
 9        67     65     37     29
10       131    132     90     62
11       354    352    205    146
12      1165   1156    692    427
13      2252   2248   2042   1633
14      5099   5174   8484   7124
15     16136  16559  23551  18495

NPS(12-ply, knps):  1-core 2,939   2-core 3,034   4-core 2,600
Now there is basically no difference between the NPS with 1 core when I run a second instance of the engine ('loaded'), nor when I switch to 2-core SMP. Because of that, the speedup looks much nicer: the 2-core searches take about 60% of the time of the 1-core searches. This seems to agree with what I have heard from others for this simple unsynchronized approach. (At depth > 12 the PVs are completely different, so I am not sure how to compare the numbers there in a sensible way.)

Of course this unsynchronized approach scales very poorly: when I try 4 cores, the speedup compared to 2 cores is much less than when going from 1 to 2 cores. This is partly due to a 13% drop in NPS per core. I had noticed such an NPS drop before when I ran more than 2 independent games in parallel, so I suppose there is not much that can be done about it on the hardware I have. If it is due to a memory bottleneck, not probing the TT in QS might prevent it, but on a single thread skipping those probes cost time-to-depth as well. Perhaps a small, in-cache private TT for QS only could be a compromise (sketched below).
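
A sketch of what such a private QS table could look like (an assumption about the idea, not existing HaQiKi D code): a small always-replace table per process, so QS probes never touch the contended shared memory and the table largely stays in L2 cache.

Code: Select all

#define QSBITS 13                            /* 8K entries * 16 bytes = 128KB */
typedef struct { unsigned long long key; short score; } QSEntry;
static QSEntry qsTable[1 << QSBITS];         /* per process, not shared */

QSEntry *ProbeQS(unsigned long long key)
{
    return &qsTable[key & ((1 << QSBITS) - 1)];
}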

I already have a lot of patches to better 'synchronize' the search processes, and these did speed up the 2-core search in earlier tests. But at the time it was all for naught, because I could not get the basics right, and the unsynchronized 2-core search slowed things down by nearly a factor of 2, which even the smartest algorithm I could think of couldn't earn back. (And I did not have more than 2 cores at the time.) So I will now start to apply those, to see how much I can improve the scaling. They basically work by labeling a hash entry as 'busy' when the search enters the node, rather than waiting with the update until the search of it has completed. When the search encounters a 'busy' node that is not a 'must have' in the parent (i.e. an early move that is likely to be a cut move), it immediately returns with an invalid result, so that the parent can postpone the search of that move until after it has searched another move. Only when there is nothing else left to search do the remaining moves become 'must haves', and the search tries them again, to help finish their search.
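
A sketch of that busy-flag scheme (reminiscent of ABDADA; the flag value and control flow are guesses, not HaQiKi D's actual implementation, AlphaBeta() is a stand-in for the real search, and HashEntry is the struct sketched earlier in the thread):

Code: Select all

#define BUSY    0x80      /* assumed flag bit in the shared hash entry */
#define INVALID 32767     /* sentinel score: 'postpone this move' */

extern int AlphaBeta(int alpha, int beta, int depth);  /* hypothetical */

int SearchMove(HashEntry *e, int mustHave, int alpha, int beta, int depth)
{
    if ((e->flags & BUSY) && !mustHave)
        return INVALID;            /* another process is on it: defer the move */
    e->flags |= BUSY;              /* claim the node on entry */
    int score = AlphaBeta(alpha, beta, depth);
    e->flags &= ~BUSY;             /* release it together with the result */
    return score;
}

/* The parent skips moves that return INVALID on the first pass; when nothing
   else is left it retries them with mustHave = 1, helping to finish them. */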

Another inconvenience is that this SMP code is Windows-specific, while in the meantime I have moved my development to Linux. Launching the slave processes is simple enough on Linux (through pipe(2) and fork(2)), but I would have to figure out how to share memory between processes on Linux to make it work there as well.
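
A minimal sketch of the Linux side (a suggestion, under the assumption that the slaves are forked from the master as described): a MAP_SHARED | MAP_ANONYMOUS mapping created before fork(2) is inherited by the children and stays shared between them; shm_open(3) would be the closer analogue of a named Windows file mapping.

Code: Select all

#include <sys/mman.h>
#include <unistd.h>
#include <stdio.h>

int main(void)
{
    size_t bytes = 256UL << 20;                /* e.g. 256MB hash */
    void *hash = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                      MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (hash == MAP_FAILED) { perror("mmap"); return 1; }
    if (fork() == 0) {
        /* slave: sees the same physical pages as the master */
    } else {
        /* master: writes here are visible to the slave and vice versa */
    }
    return 0;
}
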
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: Crazy SMP

Post by Joost Buijs »

You probably know best what kind of CPU sits in your computer, but a core i7 without turbo boost is not very common.
Many core i3 chips introduced after 2013 also feature turbo boost.
hgm
Posts: 27788
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Crazy SMP

Post by hgm »

OK, I will check it out, then. I thought that turbo boost was mainly a thermal issue, and that once the chip is sufficiently well cooled (as it can be in a desktop), all cores could run at their theoretical max all the time, so that nothing could be gained anymore. But I could be wrong. I run Win7, and should have CPU-Z on it already, so it should be easy to check. I recall that 'unloaded' CPU-Z indicates the clock mentioned in the specs on which I bought the machine. But it could still be that with all cores active it throttles down. (Although that seems a lot like cheating.)

[Edit] OK, I checked it. The CPU is a Sandy Bridge i7-2600 @ 3.4GHz. The bus speed is 100MHz, and with a single engine analyzing the multiplier jitters between 35 and 36, occasionally hitting 37. (When the machine is idle the multiplier is at 16.) When I let 2-4 engines analyze, the multiplier is stable at 35.

So it seems I get 0.1GHz more than what the specs promised. Perhaps because it is currently pretty cold in the room where I have that desktop. The speedup with a single core in continuous use is sort of negligible, though. It could still be that for short bursts of activity it is much faster.
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: Crazy SMP

Post by Joost Buijs »

According to Intel ARK the base frequency is 3.4 GHz and the max turbo frequency is 3.8 GHz.

http://ark.intel.com/products/52213/Int ... o-3_80-GHz

With K and X CPUs you are allowed to change the max turbo frequency manually, for instance setting it to 3.8 even when all cores are fully loaded; when it gets too hot or exceeds TDP it will still throttle.
When the processor is properly cooled it will never get too hot with a chess engine (mostly integer calculations), because large parts of the processor remain idle (like the FP units and the MMX units).

Normally there is a scheme, something like 1 core 3.8 GHz, 2 cores 3.7 GHz, etc. I don't know what the scheme for your specific CPU is.
Turbo boost has changed a few times and is now in its 3rd incarnation.
When you really want reliable results, it is best to turn it off in the BIOS (if possible) while you are testing.
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: Crazy SMP

Post by Joost Buijs »

It seems the turbo scheme for the i7-2600 is 4,3,2,1 (x additional bus clocks ???), which translates to 3.8, 3.7, 3.6, 3.5 GHz.
Strange that there is already a 100 MHz boost with all 4 cores under load; this makes it effectively a 3500 MHz processor.
hgm
Posts: 27788
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Crazy SMP

Post by hgm »

Indeed, this puzzled me too. But I figured it might switch to 34x when you use all 8 hyperthreads.

Anyway, the speed differences are not shocking, and the drop from 3,000 to 2,600 knps when going from 2 to 4 cores must be mainly due to other factors. 4 x 2.6 = 10.4 Mnps, so in principle I should still be able to get a 3x speedup with 4 cores. If I manage to do that before the Olympiad starts next Monday, I would be pretty happy.