parallel search speedup/overhead

bob · Post by **bob** » Mon May 06, 2013 10:12 pm

Ronald de Man and I were discussing this relative to hyper-threading. Since I had not tested Crafty in a few years, I figured it was about time to do so to get some actual data.

My test suite for this test is 500 positions. They were randomly chosen from GM games somewhere between move 10 and 40 or so. No tactical test positions at all, unless they occurred naturally inside games.

Here's the raw data, piece by piece, with discussion after each block. First data simply measures max parallel speedup potential by looking at NPS scaling. These were run on cluster nodes with 2 x E5345 Xeons, at 2.33ghz. These cpus do have a bottleneck or two since there are 8 cores fighting for memory/cache access. The parallel search speedup obviously can't exceed the raw NPS speedup under any circumstances other than the oddball super-linear speedup cases that are uncommon.

Here's the nps scaling, with raw NPS, and then speedup going from 1->2, 1->4 and 1->8 cores.

Code: Select all


raw nps = 1->2.9m     2->5.5m     4-> 11.2m     8-> 18.9m

scaling = 1->2 = 190%  1->4 = 386%  1->8 = 651%

Based on the above, 8 cores is pushing the hardware pretty hard, at least for Crafty, with a 6.5x speedup as the best possible. Note that I have run on boxes in the past where the speedup was on the order of 7.9x (older opterons with 8 cores and 8 physical cpu chips as one example). 4 cores scales pretty well.

Now for actual speedup data. In the data following, each data point is the same set of 500 positions, run with N cores, repeated 8 times. The numbers at the ends of the lines in parens is the average over the 8 runs.

Code: Select all


======================================
Parallel speedup from 1->2->4->8 cores
======================================

1cpu = 349.5m

2cpu = 193.8m 203.5m 198.1m 199.6m 205.5m 195.8m 204.9m 208.5m &#40;201.2m&#41;

4cpu = 110.2m 111.6m 110.4m 105.3m 106.9m 113.0m 106.8m 111.7m &#40;109.5&#41;

8cpu = 73.4m 73.0m 72.7m 73.7m 72.5m 73.2m 73.5m 74.4m &#40;73.3&#41;

speedup

1->2 = 1.7x

1->4 = 3.2x

1->8 = 4.8x

The 1-4 core speedup is right along the prediction my linear speedup formula estimates. The 1-8 speedup is not, but remember that 1-8 is really just 1-6.5.

Finally, the tree growth as threads are added:

Code: Select all



===========================================
size of tree increase from 1->2->4->8 cores
===========================================

1cpu = 61.1b nodes

2cpu = 64.7b 68.0b 65.8b 66.5b 68.6b 66.0b 68.4b 69.7b &#40;67.2b&#41;

4cpu = 73.6b 74.6b 73.6b 70.5b 71.7b 75.9b 71.5b 74.9b &#40;73.3b&#41;

8cpu = 82.7b 82.5b 81.5b 83.1b 81.9b 82.6b 83.4b 84.2b &#40;82.7b&#41;

overhead

1->2 = 10%

1->4 = 20%

1->8 = 35%

Overhead is lower than I expected. And lower than I had measured in the past. The 1-2-4 cpu overhead is pretty accurate. Again, the 8cpu overhead is under-stated because the search speed was not 8x faster.

Given the 1-2-4 cpu overhead, hyper-threading has a chance of helping. I no longer have a cluster with HT cpus, however, so I can't run the above test with 16 cores to see what happens. However, I MIGHT be able to run it on my macbook. It is about 2x faster than the cluster I used, The one-cpu test would take about 3 hours, and a couple of runs with 2 ought to be another 3, followed by runs with 4. Whether I can stand to watch it run 8 runs for 2 and 4 is another matter.

Before I go too far, I want to look at the positions I used, just to be sure they are spread from opening to early endgame to give a representative sample of complete games...

But at the moment, this is my current data. Speedup is about what I except thru 4 cpus. the 8 cpu numbers are off a bit, but so is the NPS scaling. I might try running this on our 12 core cluster but at the moment it is running something big for someone and it will be busy for a while...

jdart · Post by **jdart** » Tue May 07, 2013 2:23 am

I have recently done some testing on a dual Opteron 6344 box (2 CPUs, 12 cores each). I am getting near linear scaling now from 1-8 cores on typical middlegame positions. Achieving this required taking some measures to reduce global memory access, outside of the hashtable and I have also recently put in some code that dynamically adjusts the split depth based on observed split frequency (these changes are not in the recently released Arasan 15.5).

This is my canonical test position:

3r1rk1/1pp1qpb1/1P4pp/p3nbn1/N1P1p3/PQ2P2P/1B1NBPP1/3R1R1K b - - bm Nxh3; c0 "E.T.Chess - Spike Turin, G. Banks 2009";

At 12 cores things are not so good. NPS goes from 7.7M to about 9.6, so around a 25% increase for 50% more cores. Average thread usage is still high though, over 11.

At 24 cores NPS is 13.75M and average thread usage is 21.59. So doubling cores gets only 43% more NPS.

There are no changes in my codebase so far specifically to improve performance on NUMA architectures, where accessing the local CPU's memory is faster than accessing memory on the other processor.

Scaling is currently quite a bit worse in the endgame. Even at 12 cores thread usage is dropping to around 9 on average. Not sure why. AMD has some profiling tools I am running but IMO they are not as informative as the tools Intel has.

--Jon

bob · Post by **bob** » Wed May 08, 2013 9:59 pm

jdart wrote:I have recently done some testing on a dual Opteron 6344 box (2 CPUs, 12 cores each). I am getting near linear scaling now from 1-8 cores on typical middlegame positions. Achieving this required taking some measures to reduce global memory access, outside of the hashtable and I have also recently put in some code that dynamically adjusts the split depth based on observed split frequency (these changes are not in the recently released Arasan 15.5).

This is my canonical test position:

3r1rk1/1pp1qpb1/1P4pp/p3nbn1/N1P1p3/PQ2P2P/1B1NBPP1/3R1R1K b - - bm Nxh3; c0 "E.T.Chess - Spike Turin, G. Banks 2009";

At 12 cores things are not so good. NPS goes from 7.7M to about 9.6, so around a 25% increase for 50% more cores. Average thread usage is still high though, over 11.

At 24 cores NPS is 13.75M and average thread usage is 21.59. So doubling cores gets only 43% more NPS.

There are no changes in my codebase so far specifically to improve performance on NUMA architectures, where accessing the local CPU's memory is faster than accessing memory on the other processor.

Scaling is currently quite a bit worse in the endgame. Even at 12 cores thread usage is dropping to around 9 on average. Not sure why. AMD has some profiling tools I am running but IMO they are not as informative as the tools Intel has.

--Jon

I've got a note to deal with later on, namely lopping off hash probes in last couple of plies. The savings are minimal on hits, and the memory bandwidth/TLB thrashing might more than offset that.

the 8-core opteron box I last tested on exhaustively worked well. More recent multi-core chips seem to flatten out a bit. Memory has always been a big bottleneck. Adding a bunch of cores only highlights the problem...

AMD has some internal tools that are very good, but they don't distribute them, unfortunately...

Adam Hair · Post by **Adam Hair** » Wed Jul 17, 2013 4:41 am

Bob, do you happen to still have the raw data such as the logs or the time to depth for each position for each test run?

Instead of comparing the total time taken by x processors to the total time taken by one processor, I am interested in comparing the time to depth of each position for various number of processors. I have been studying ttd(1cpu)/ttd(4cpus) for multiple engines (where ttd equals time to depth). For almost every engine I have tested so far, the histogram of ttd(1cpu)/ttd(4cpus) has a similar shape to this:

I am interested in seeing if histograms of your data also look similar.

bob · Post by **bob** » Wed Jul 17, 2013 6:54 am

Adam Hair wrote:Bob, do you happen to still have the raw data such as the logs or the time to depth for each position for each test run?

Instead of comparing the total time taken by x processors to the total time taken by one processor, I am interested in comparing the time to depth of each position for various number of processors. I have been studying ttd(1cpu)/ttd(4cpus) for multiple engines (where ttd equals time to depth). For almost every engine I have tested so far, the histogram of ttd(1cpu)/ttd(4cpus) has a similar shape to this:

I am interested in seeing if histograms of your data also look similar.

I have most of the old data (first test results we discussed here a few years ago on an 8 cpu amd opteron box. I am not sure what I have for recent versions, although I do have some 8 and 12 core results stuck away. But these are all in the form of just raw Crafty log files. And if you want current stuff, I could probably queue up a few runs for the current version...

flok · Post by **flok** » Mon Jul 22, 2013 5:39 pm

Adam Hair wrote:Bob, do you happen to still have the raw data such as the logs or the time to depth for each position for each test run?

How is this ttd calculated? Time it takes to reach depth x / divided ttd-depth-1?
Because in the graph this value is < 100%. I would expect ttd(1)/ttd(1) to be 100%?

Sven · Post by **Sven** » Mon Jul 22, 2013 6:48 pm

flok wrote:
Adam Hair wrote:Bob, do you happen to still have the raw data such as the logs or the time to depth for each position for each test run?
How is this ttd calculated? Time it takes to reach depth x / divided ttd-depth-1?
Because in the graph this value is < 100%. I would expect ttd(1)/ttd(1) to be 100%?

ttd is time to a given fixed depth. It is not additionally divided by anything. The graph does not show a percentage, it shows how many test positions (y axis) showed speedup values in a given interval (x axis), e.g. more than 300 test positions have a speedup between 2.0 and 2.25. "speedup" := ttd(1 cpu) / ttd(4 cpus), as Adam wrote above.

Sven

flok · Post by **flok** » Mon Aug 05, 2013 5:18 pm

DeepBrutePos v1.7 SVN467 does currently do:

x axis is the number of logical cpus used

It looks as if it somewhat flattens but at 11 cores there's a peak way above what I would expect. 12 cores should be ignored as the system on which I run it is never 100% idle (it has 6 cores with hyper threading and 4 vms that usually use 10-50% of 1 logical cpu).

Sven · Post by **Sven** » Mon Aug 05, 2013 11:05 pm

Hi Folkert,

two questions:
1) How many different testpositions were used to obtain these data?
2) Is this a fixed-depth test so that you can calculate "time to depth"?

If the answer to 2) were "yes" then my conclusion would be (after calculating TTD, NPS speedup and TTD-based SMP speedup):

Code: Select all

nCores  nodes    NPS    TTD    NPS speedup  TTD-based SMP speedup
     1   300000  17000  17.65  1.00         1.00
     2  2100000  32000  65.63  1.88         0.27
     3  2000000  48000  41.67  2.82         0.42
     4  2700000  64000  42.19  3.76         0.42
     5  2800000  69000  40.58  4.06         0.43
     6  3500000  83000  42.17  4.88         0.42
     7  3300000  86000  38.37  5.06         0.46
     8  4000000  87000  45.98  5.12         0.38
     9  5000000  90000  55.56  5.29         0.32
    10  4700000  90000  52.22  5.29         0.34
    11  6000000  97000  61.86  5.71         0.29

a) Your NPS speedup looks almost perfect up to 4 logical cores used, and still acceptable for 5 and 6 logical cores. With >6 logical cores you get practically no further NPS speedup, which is not unexpected for me since your machine only has 6 physical cores. So better don't look too much at the values for >6.

b) Your "time to depth" is horrible. It *increases* drastically from 1 to 2 logical cores, where everyone would expect it to *decrease*, then goes down a bit for 3 cores, and basically remains constant at about 42 sec for more logical cores. That is not what it should be. The TTD-based SMP speedup for 3..6 logical cores would be around 0.42 in your case which is absolutely ineffective. At least a 2.0 for 4 cores would be o.k., 2.5 already much better, and 3.0 really professional.

So I hope your underlying test is actually not a "fixed-depth" test

Sven

flok · Post by **flok** » Mon Aug 05, 2013 11:38 pm

Sven Schüle wrote:two questions:
1) How many different testpositions were used to obtain these data?

One actually: the response to e2-e4.

2) Is this a fixed-depth test so that you can calculate "time to depth"?

Mostly: if a king is under attack it'll do an extension of 1 ply.

a) Your NPS speedup looks almost perfect up to 4 logical cores used, and still acceptable for 5 and 6 logical cores. With >6 logical cores you get practically no further NPS speedup, which is not unexpected for me since your machine only has 6 physical cores. So better don't look too much at the values for >6.

Ah that's great to hear!
Hopefully I'll be able to test more than 6 cores one of the following weeks.

Let me give all the data:
- I ran my program 12 times, with each time a extra logical cpu. the system on which it ran has also 4 virtual machines running which use +/- 1 logical cpu
- depth is 7 plies with an extension for king-under-attack
- alpha/beta is used in the work which each thread calculates, no a/b between threads as that would be a concurrency nightmare. altough maybe I should bring back a lock-shared-for-all in the form of a readwrite lock. hmmm, let me give that some thought.

The output is:
- 1st column is maximum depth as entered on the commandline
- 2nd is number of logical cpus
- 3d (8) please ignore this one
- 4th is number of nodes visited
- 5th is how long it ook to do reach depth 7. I think the reason that the 1 cpu version is faster than the others might be related to better a/b pruning
- 6th is number of nodes per second
- 7th is the move it came up with in the end

Code: Select all

7 1 8 293641 17.722 16569.292404920438 E7-E6
7 2 8 2016701 62.414 32311.6768673695 E7-E6
7 3 8 1980269 42.246 46874.710031718976 E7-E6
7 4 8 2688385 42.21 63690.71310116086 E7-E5
7 5 8 2782687 40.579 68574.55826905541 E7-E6
7 6 8 3564356 43.282 82351.92458758838 E7-E6
7 7 8 3479088 41.021 84812.36439872261 E7-E6
7 8 8 3990635 46.938 85019.28075333418 E7-E6
7 9 8 4972073 56.044 88717.31139818714 E7-E6
7 10 8 4767687 53.649 88868.14292903875 G7-G6
7 11 8 5989901 61.944 96698.64716518145 E7-E6
7 12 8 6961774 74.671 93232.63382035863 D7-D5

b) Your "time to depth" is horrible. It *increases* drastically from 1 to 2 logical cores, where everyone would expect it to *decrease*, then goes down a bit for 3 cores, and basically remains constant at about 42 sec for more logical cores.
That is not what it should be. The TTD-based SMP speedup for 3..6 logical cores would be around 0.42 in your case which is absolutely ineffective. At least a 2.0 for 4 cores would be o.k., 2.5 already much better, and 3.0 really professional.

conclusion:
- not only the move generator needs some attention, I also need to figure out a way to communicate the a/b values between threads to improve pruning
- distribution of the work over threads at itself runs pretty well (which was my main focus for the last 2 weeks)[/code]

parallel search speedup/overhead

parallel search speedup/overhead

Re: parallel search speedup/overhead

Re: parallel search speedup/overhead

Re: parallel search speedup/overhead

Re: parallel search speedup/overhead

Re: parallel search speedup/overhead

Re: parallel search speedup/overhead

Re: parallel search speedup/overhead

Re: parallel search speedup/overhead

Re: parallel search speedup/overhead