My test suite for this test is 500 positions. They were randomly chosen from GM games somewhere between move 10 and 40 or so. No tactical test positions at all, unless they occurred naturally inside games.

Here's the raw data, piece by piece, with discussion after each block. The first data simply measures the max parallel speedup potential by looking at NPS scaling. These were run on cluster nodes with 2 x E5345 Xeons at 2.33 GHz. These CPUs do have a bottleneck or two, since there are 8 cores fighting for memory/cache access. The parallel search speedup obviously can't exceed the raw NPS speedup under any circumstances, other than the uncommon oddball super-linear speedup cases.

Here's the NPS scaling, with raw NPS first, and then the speedup going from 1->2, 1->4 and 1->8 cores.

```
raw nps = 1 -> 2.9m   2 -> 5.5m   4 -> 11.2m   8 -> 18.9m
scaling = 1->2 = 190%   1->4 = 386%   1->8 = 651%
```
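
For anyone who wants to redo the arithmetic, here is a minimal Python sketch that derives the scaling percentages from the raw NPS numbers above. The names `raw_nps` and `nps_scaling` are just illustrative. Since the raw figures are rounded to one decimal, the 8-core result can land a point away from the 651% posted above.

```python
# Raw NPS in millions of nodes per second, copied from the block above.
raw_nps = {1: 2.9, 2: 5.5, 4: 11.2, 8: 18.9}


def nps_scaling(nps_by_cores):
    """Return NPS relative to the 1-core figure for each multi-core entry."""
    base = nps_by_cores[1]
    return {n: nps / base for n, nps in nps_by_cores.items() if n != 1}


scaling = nps_scaling(raw_nps)
for cores, ratio in scaling.items():
    # ratio / cores is the fraction of perfect linear scaling achieved
    print(f"1->{cores} = {ratio:.0%}  ({ratio / cores:.0%} of linear)")
```

The "of linear" column makes the 8-core bottleneck easy to see: scaling stays close to linear through 4 cores and drops off noticeably at 8.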

Now for the actual speedup data. In the data that follows, each data point is the same set of 500 positions, run with N cores, repeated 8 times. The number in parens at the end of each line is the average over the 8 runs.

```
======================================
Parallel speedup from 1->2->4->8 cores
======================================
1cpu = 349.5m
2cpu = 193.8m 203.5m 198.1m 199.6m 205.5m 195.8m 204.9m 208.5m (201.2m)
4cpu = 110.2m 111.6m 110.4m 105.3m 106.9m 113.0m 106.8m 111.7m (109.5m)
8cpu = 73.4m 73.0m 72.7m 73.7m 72.5m 73.2m 73.5m 74.4m (73.3m)
speedup
1->2 = 1.7x
1->4 = 3.2x
1->8 = 4.8x
```
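
The averages and speedups above can be reproduced with a short Python sketch. It assumes the "m" suffix means minutes of wall-clock time for the whole 500-position set, and the names (`one_cpu`, `runs`, `speedups`) are only illustrative. It also prints the best and worst single-run speedup, to give a feel for the run-to-run variation.

```python
from statistics import mean

# Total time per run (assumed to be minutes), copied from the block above.
one_cpu = 349.5
runs = {
    2: [193.8, 203.5, 198.1, 199.6, 205.5, 195.8, 204.9, 208.5],
    4: [110.2, 111.6, 110.4, 105.3, 106.9, 113.0, 106.8, 111.7],
    8: [73.4, 73.0, 72.7, 73.7, 72.5, 73.2, 73.5, 74.4],
}


def speedups(base_time, runs_by_cores):
    """Speedup = 1-core time divided by the mean N-core time."""
    return {n: base_time / mean(times) for n, times in runs_by_cores.items()}


speedup = speedups(one_cpu, runs)
for cores, times in runs.items():
    worst = one_cpu / max(times)  # slowest run gives the lowest speedup
    best = one_cpu / min(times)   # fastest run gives the highest speedup
    print(f"1->{cores} = {speedup[cores]:.1f}x  "
          f"(avg {mean(times):.1f}m, range {worst:.2f}x to {best:.2f}x)")
```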

Finally, the tree growth as threads are added:

```
===========================================
size of tree increase from 1->2->4->8 cores
===========================================
1cpu = 61.1b nodes
2cpu = 64.7b 68.0b 65.8b 66.5b 68.6b 66.0b 68.4b 69.7b (67.2b)
4cpu = 73.6b 74.6b 73.6b 70.5b 71.7b 75.9b 71.5b 74.9b (73.3b)
8cpu = 82.7b 82.5b 81.5b 83.1b 81.9b 82.6b 83.4b 84.2b (82.7b)
overhead
1->2 = 10%
1->4 = 20%
1->8 = 35%
```
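
As a sanity check on how the three data blocks fit together: search time is just nodes searched divided by NPS, so the speedup should come out close to the NPS scaling divided by the tree growth. The sketch below is my own cross-check, not part of the original test runs, and it assumes the NPS scaling figures apply to the same runs as the tree sizes. All names are illustrative.

```python
from statistics import mean

# Tree sizes in billions of nodes, copied from the block above.
one_cpu_nodes = 61.1
node_runs = {
    2: [64.7, 68.0, 65.8, 66.5, 68.6, 66.0, 68.4, 69.7],
    4: [73.6, 74.6, 73.6, 70.5, 71.7, 75.9, 71.5, 74.9],
    8: [82.7, 82.5, 81.5, 83.1, 81.9, 82.6, 83.4, 84.2],
}

# Posted NPS scaling and posted speedups, as plain ratios.
posted_nps_scaling = {2: 1.90, 4: 3.86, 8: 6.51}
posted_speedup = {2: 1.7, 4: 3.2, 8: 4.8}

# Tree growth = mean N-core tree size relative to the 1-core tree.
growth = {n: mean(sizes) / one_cpu_nodes for n, sizes in node_runs.items()}

# Since time = nodes / NPS, predicted speedup = NPS scaling / tree growth.
predicted = {n: posted_nps_scaling[n] / growth[n] for n in growth}

for n in growth:
    print(f"1->{n}: overhead {growth[n] - 1:.0%}, "
          f"predicted {predicted[n]:.2f}x, posted {posted_speedup[n]}x")
```

The predicted figures agree with the measured speedups to within rounding, which suggests the lost speedup is accounted for by the two effects already shown: sub-linear NPS scaling and extra nodes searched.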

Given the 1-2-4 cpu overhead, hyper-threading has a chance of helping. I no longer have a cluster with HT cpus, however, so I can't run the above test with 16 logical cores to see what happens. However, I MIGHT be able to run it on my MacBook. It is about 2x faster than the cluster I used. The one-cpu test would take about 3 hours, and a couple of runs with 2 ought to be another 3, followed by runs with 4. Whether I can stand to watch it do 8 runs each for 2 and 4 is another matter.

Before I go too far, I want to look at the positions I used, just to be sure they are spread from opening to early endgame to give a representative sample of complete games...

But at the moment, this is my current data. Speedup is about what I expect through 4 cpus. The 8 cpu numbers are off a bit, but so is the NPS scaling. I might try running this on our 12 core cluster, but at the moment it is running something big for someone and it will be busy for a while...