New SMP stuff (particularly Kai)

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: New SMP stuff (particularly Kai)

Post by bob »

MikeB wrote:
bob wrote:
In any case, these numbers are actually more in line with what I expected. The NPS scaling is down, partly due to a funny compiler issue I don't understand yet, and partly because I made a few changes to better measure the internal wait time, which was previously being under-reported. The strange compiler problem is one of those "random events". A couple of weeks back I reported a huge difference between the Intel and GCC executables. They always searched exactly the same number of nodes, but the Intel one was 10%+ faster. That has gone away. Just moving things around or adding data occasionally causes this. I'm going to work on running VTune to see what is up, but all of these tests were run with the GCC compiler, which is currently producing the fastest executable...
I currently have gcc and icc producing almost the same speed - I think gcc is a hair faster - but gcc used to be way faster. Now I'm trying to figure out which is better at SMP.
Right now I see very little difference. It used to be (for me) that icc was always faster.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: New SMP stuff (particularly Kai)

Post by bob »

I was looking over the data, and one interesting thing stands out, though it is not that unexpected.

1 cpu searches 210 billion nodes, total, across all 24 positions.

Each of the other tests usually has one "outlier". For example, the first 2-cpu test searched 279B nodes, the third 4-cpu test searched 289B nodes, the first 8-cpu test searched 272B nodes, and the second 16-cpu test searched 281B nodes; the 20-cpu tests apparently lucked out, coming in between 231B and 248B with no larger or smaller outliers. I suspect all of this needs to be replicated and averaged again, giving 8 runs rather than just 4.

If you look inside at the position-by-position data, there is almost always one position (not the same one each time, but usually one of a relatively small set) that blows up, but averaged over 24 positions things tend to smooth out.

There's still a ton of variability here. I suspect there always will be. There was a lot more of it when I split at the root and did not carefully keep up with move-list order (for LMR decisions). Since I fixed the program so that LMR is always the same regardless of serial or parallel search, anywhere in the tree, that much bigger variability went away.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: New SMP stuff (particularly Kai)

Post by bob »

Here is a 1-thread (25.0x) vs 8-thread (25.0) match to depth=12. The depth=10 results look similar, just faster.

Code: Select all

1 Crafty-25.0  2600 5 5 1962 50% 2600 50%
2 Crafty-25.0x 2600 5 5 1962 50% 2600 50%

It seems to eventually stabilize at 2601/2599 after more games. I'll post an update later. This is the same version that has been running the past few days, the one that produced the SMP test data.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: New SMP stuff (particularly Kai)

Post by bob »

Laskos wrote:
zullil wrote:
Joerg Oster wrote:
zullil wrote:
Joerg Oster wrote:Hi Louis,

if you are interested and have some time, you might want to repeat the test with my latest patch, which you can find here: https://github.com/joergoster/Stockfish ... join_tweak.
I'd really be interested to see if this patch improves things. :D
Just one run, but less total time and a smaller tree, so maybe a change for the better:

Code: Select all

./stockfish bench 8192 20 27 /home/louis/Documents/Chess/Testing/HyattPositions depth
===========================
Total time (ms) : 479159
Nodes searched  : 11386079093
Nodes/second    : 23762632

./stockfish bench 8192 20 27 /home/louis/Documents/Chess/Testing/HyattPositions depth
===========================
Total time (ms) : 443388
Nodes searched  : 10465813469
Nodes/second    : 23604187
Obviously, the second benchmark is for your tweaked Stockfish. Will repeat this, but with greater depth.
Looks promising, many thanks.
Looking forward to your further results.
Now it's looking bad. Clearly, one run of 24 positions is not enough to say anything.

Code: Select all

./stockfish bench 8192 20 30 /home/louis/Documents/Chess/Testing/HyattPositions depth
===========================
Total time (ms) : 1068964
Nodes searched  : 29587093943
Nodes/second    : 27678288

latejoin_tweak
./stockfish bench 8192 20 30 /home/louis/Documents/Chess/Testing/HyattPositions depth
===========================
Total time (ms) : 1214284
Nodes searched  : 34913456185
Nodes/second    : 28752298

Yes, this way of using only total times needs much more than 1 run with 24 positions. I usually do it with 150 positions, but that becomes very time consuming. A single run can be used more effectively if you keep the individual times for each position; then the easiest way is to take the geometric average of the speedups:

(product of all individual times on one core / product of all individual times on 20 cores) ^ (1/24).

The average error should be no more than 5-10%, even using only those 24 positions in 1 run.
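In code, Kai's geometric-average formula comes down to the following minimal sketch (the function and list names are illustrative, not from Crafty or Stockfish):

Code: Select all

from math import prod

def geomean_speedup(times_1cpu, times_ncpu):
    """Geometric mean of per-position speedups; algebraically the same as
    (prod(times_1cpu) / prod(times_ncpu)) ** (1 / len(times_1cpu))."""
    assert len(times_1cpu) == len(times_ncpu)
    ratios = [t1 / tn for t1, tn in zip(times_1cpu, times_ncpu)]
    return prod(ratios) ** (1.0 / len(ratios))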
I am still greatly dissatisfied with my testing approach. My next attempt, once the current normal tests are finished, will be something like this:

I am going to run the 1-2-4-8-16-20 core tests with a 2 minute target time for ALL of them. I am then going to write code that compares 1 cpu vs 2, finds the largest iteration depth completed by both for each position, and uses that to compute the 1->2 speedup. Some of the 2-core run is wasted in that comparison, but it gets reused for the 2->4 test: there I take the 2-cpu test and find the deepest common depth per position against the 4-cpu run, which basically means that in the best case the 2-cpu test will have a 2 minute search while the 4-cpu search will only be meaningful out to about 1 minute, assuming it is twice as fast. I'm going to repeat this for all numbers of processors.

Right now, with the 20-cpu test averaging 2 minutes, by the time I get down to 1 cpu the times are enormous. This new approach ought to take 48 minutes per run; with 6 runs for 1, 2, 4, 8, 16 and 20 cores we are talking 288 minutes, just over four hours total. Repeat 4 times and it is 16 hours, which is about what the 1-cpu test takes by itself right now.

I am hoping these numbers correlate well with the current approach, because it is a REAL pain in the ass to have to re-calibrate the depths if I make any search changes (such as LMR adjustments or whatever). This new approach will just target a 2:1 ratio for each comparison. The only downside is that comparing an average to an average is not exactly precise science. I'd rather compare 4 directly to 1. But here I have to compare it to 2 and extrapolate what 1->4 would look like, since I won't have comparable numbers for 1 thread vs 4 threads as I do right now. So it depends on how stable the 2-core result is and how much error there is in it, because each succeeding test adds new error on top of old error. I hope this doesn't cause error creep in the results.

More once the current test is done. Have everything but the last two 2-cpu tests done...

I want a faster way, because if I find out where part of this NPS scaling problem is happening, fixing it will result in a LOT of re-testing, even though I won't (likely) have to rerun the 1-cpu test... maybe...

This is the idea you suggested (Kai) last time in this discussion. I like the fact that it can be automated. I might even build it into Crafty as a command/module, to make it easy for anyone to run if they are curious and have plenty of CPU time. Right now just figuring out the search depths is a real pain...
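The pairing step described here could be automated along these lines. This is only a sketch under an assumed data layout (a per-position map from completed iteration depth to elapsed time); it is not Crafty's actual log format or code:

Code: Select all

from math import prod

def pairwise_speedup(slow_run, fast_run):
    """slow_run, fast_run: {position: {depth: elapsed_seconds}} for two
    adjacent core counts (e.g. 2 cpus vs 4 cpus). For each position, compare
    times at the deepest iteration depth BOTH runs completed, then return
    the geometric mean of the per-position speedups."""
    ratios = []
    for pos in slow_run:
        common = set(slow_run[pos]) & set(fast_run[pos])
        d = max(common)  # deepest common completed depth
        ratios.append(slow_run[pos][d] / fast_run[pos][d])
    return prod(ratios) ** (1.0 / len(ratios))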
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: New SMP stuff (particularly Kai)

Post by Laskos »

bob wrote:
I am still greatly dissatisfied with my testing approach. My next attempt, once the current normal tests are finished, will be something like this:

I am going to run the 1-2-4-8-16-20 core tests with a 2 minute target time for ALL of them. I am then going to write code that compares 1 cpu vs 2, finds the largest iteration depth completed by both for each position, and uses that to compute the 1->2 speedup. Some of the 2-core run is wasted in that comparison, but it gets reused for the 2->4 test: there I take the 2-cpu test and find the deepest common depth per position against the 4-cpu run, which basically means that in the best case the 2-cpu test will have a 2 minute search while the 4-cpu search will only be meaningful out to about 1 minute, assuming it is twice as fast. I'm going to repeat this for all numbers of processors.

Right now, with the 20-cpu test averaging 2 minutes, by the time I get down to 1 cpu the times are enormous. This new approach ought to take 48 minutes per run; with 6 runs for 1, 2, 4, 8, 16 and 20 cores we are talking 288 minutes, just over four hours total. Repeat 4 times and it is 16 hours, which is about what the 1-cpu test takes by itself right now.

I am hoping these numbers correlate well with the current approach, because it is a REAL pain in the ass to have to re-calibrate the depths if I make any search changes (such as LMR adjustments or whatever). This new approach will just target a 2:1 ratio for each comparison. The only downside is that comparing an average to an average is not exactly precise science. I'd rather compare 4 directly to 1. But here I have to compare it to 2 and extrapolate what 1->4 would look like, since I won't have comparable numbers for 1 thread vs 4 threads as I do right now. So it depends on how stable the 2-core result is and how much error there is in it, because each succeeding test adds new error on top of old error. I hope this doesn't cause error creep in the results.

More once the current test is done. Have everything but the last two 2-cpu tests done...

I want a faster way, because if I find out where part of this NPS scaling problem is happening, fixing it will result in a LOT of re-testing, even though I won't (likely) have to rerun the 1-cpu test... maybe...

This is the idea you suggested (Kai) last time in this discussion. I like the fact that it can be automated. I might even build it into Crafty as a command/module, to make it easy for anyone to run if they are curious and have plenty of CPU time. Right now just figuring out the search depths is a real pain...
I think this testing approach to SMP should work.

For fun, I took Crafty 18.12, the first engine I used in multicore mode on my dual Xeon back in 2001 or 2002. Its SMP performance from 1->4 cores is not particularly shining.
150 positions used:

Code: Select all

Time to depth speedup: 2.47
NPS speedup: 3.07
Overhead: 31%
But not all of that overhead is wasted; the engine widens a bit in SMP, so you already had this as early as 2001.
1000 games at fixed depth 8:

Code: Select all

Score of Crafty 4 cores depth=8 vs Crafty 1 core depth=8: 
272 - 236 - 492  [0.52] 1000
ELO difference: 13
All in all, time to depth plus widening would give something like a 2.7 effective speedup from 1->4 cores.
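As a back-of-envelope check on that figure, the 13 Elo of widening can be converted into an equivalent time factor via an assumed Elo-per-doubling rate (that rate is an assumption here; it varies by engine and time control):

Code: Select all

# effective speedup = time-to-depth speedup * 2 ** (widening_elo / elo_per_doubling)
ttd_speedup = 2.47
widening_elo = 13
for elo_per_doubling in (70, 90, 100):
    eff = ttd_speedup * 2 ** (widening_elo / elo_per_doubling)
    print(elo_per_doubling, round(eff, 2))  # 2.81, 2.73, 2.70

A rate near 90-100 Elo per doubling reproduces the ~2.7 estimate.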
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: New SMP stuff (particularly Kai)

Post by bob »

Laskos wrote:
I think this testing approach to SMP should work.

For fun, I took Crafty 18.12, the first engine I used in multicore mode on my dual Xeon back in 2001 or 2002. Its SMP performance from 1->4 cores is not particularly shining.
150 positions used:

Code: Select all

Time to depth speedup: 2.47
NPS speedup: 3.07
Overhead: 31%
But not all of that overhead is wasted; the engine widens a bit in SMP, so you already had this as early as 2001.
1000 games at fixed depth 8:

Code: Select all

Score of Crafty 4 cores depth=8 vs Crafty 1 core depth=8: 
272 - 236 - 492  [0.52] 1000
ELO difference: 13
All in all, time to depth plus widening would give something like a 2.7 effective speedup from 1->4 cores.
Here's food for thought.

Case 1: 24 positions, total times simply added together. Do this for 1, 2, 4, 8, 16 and 20 cores, then divide the 1-core total by the N-core total. If you run that 4 times you get 4 numbers that can be averaged with a normal mean, geomean or whatever. But just 4 values.

Case 2: 24 positions, but take each position independently and compute its speedup as above. Now you have 24 values. Either mean or geomean them to get a speedup over 24 samples, or over 48, or, if I run 4 times, 96 samples.

Which seems more reasonable? And no, they don't produce the same answers, which is a problem. I assume that with enough samples both ought to get close, but 24 samples producing one average leaves a lot of room for varying results.

I'm beginning to think the 96 samples would be better overall, but I am not sure...
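The two estimators differ because Case 1 implicitly weights each position by how long it takes, while Case 2 weights every position equally. A sketch of both over hypothetical per-position time lists (names illustrative):

Code: Select all

from math import prod

def case1_speedup(t_1cpu, t_ncpu):
    # Ratio of summed totals: one value per run, long positions dominate.
    return sum(t_1cpu) / sum(t_ncpu)

def case2_speedup(t_1cpu, t_ncpu, geometric=True):
    # Per-position ratios: 24 values per run, each position weighted equally.
    ratios = [a / b for a, b in zip(t_1cpu, t_ncpu)]
    if geometric:
        return prod(ratios) ** (1.0 / len(ratios))
    return sum(ratios) / len(ratios)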
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: New SMP stuff (particularly Kai)

Post by Laskos »

bob wrote:
Here's food for thought.

Case 1: 24 positions, total times simply added together. Do this for 1, 2, 4, 8, 16 and 20 cores, then divide the 1-core total by the N-core total. If you run that 4 times you get 4 numbers that can be averaged with a normal mean, geomean or whatever. But just 4 values.

Case 2: 24 positions, but take each position independently and compute its speedup as above. Now you have 24 values. Either mean or geomean them to get a speedup over 24 samples, or over 48, or, if I run 4 times, 96 samples.

Which seems more reasonable? And no, they don't produce the same answers, which is a problem. I assume that with enough samples both ought to get close, but 24 samples producing one average leaves a lot of room for varying results.

I'm beginning to think the 96 samples would be better overall, but I am not sure...
Sure, individually is better. I was doing total time just because it was too tedious to do it individually for each position. I think the geomean of individual speedups is best. Take these 5 individual speedups for 1->2 cores (5 positions): 1.6, 2.2, 2.3, 1.1, 3.2. Geomean: 1.95. Average: 2.08. The average apparently shows superlinear behavior. But why? Because of the "very large" 3.2 individual speedup; yet that is no larger a statistical fluke than the 1.1; in fact 1.1 is the larger fluke: 3.2 is off from 2 by 60%, while 2 is off from 1.1 by 82%. If you take the average of just 1.1 and 3.2 you get an unnatural superlinear 2.15; if you take their geomean you get 1.88, which is intuitive and reasonable on account of the above (the 1.1 speedup is even more extreme than the 3.2).

So, my "recipe" would be: compute all 24 individual speedups for 1->N cores, geomean them to get the 1->N speedup for one run, then do the same for 4 runs and geomean the 4 per-run speedups. At that last step it doesn't matter much whether you take the average or the geomean, because the spread of the results is not that high.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: New SMP stuff (particularly Kai)

Post by bob »

Laskos wrote:
Sure, individually is better. I was doing total time just because it was too tedious to do it individually for each position. I think the geomean of individual speedups is best. Take these 5 individual speedups for 1->2 cores (5 positions): 1.6, 2.2, 2.3, 1.1, 3.2. Geomean: 1.95. Average: 2.08. The average apparently shows superlinear behavior. But why? Because of the "very large" 3.2 individual speedup; yet that is no larger a statistical fluke than the 1.1; in fact 1.1 is the larger fluke: 3.2 is off from 2 by 60%, while 2 is off from 1.1 by 82%. If you take the average of just 1.1 and 3.2 you get an unnatural superlinear 2.15; if you take their geomean you get 1.88, which is intuitive and reasonable on account of the above (the 1.1 speedup is even more extreme than the 3.2).

So, my "recipe" would be: compute all 24 individual speedups for 1->N cores, geomean them to get the 1->N speedup for one run, then do the same for 4 runs and geomean the 4 per-run speedups. At that last step it doesn't matter much whether you take the average or the geomean, because the spread of the results is not that high.
That was sort of my thinking when I asked the question. But to clarify: if I do four runs, is collapsing each run to a geomean speedup and then averaging those four numbers better than taking the geomean of all 96 results at once (4 repeats, 24 positions, geomean of all 96 individual speedups)?

I always have a suspicion about averages of averages. This would avoid that to some extent. Or would it be even better to just use 96 different positions? I suppose I could burn some CPU time to see what the answers look like and which appears most reasonable..
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: New SMP stuff (particularly Kai)

Post by Laskos »

bob wrote:
That was sort of my thinking when I asked the question. But to clarify: if I do four runs, is collapsing each run to a geomean speedup and then averaging those four numbers better than taking the geomean of all 96 results at once (4 repeats, 24 positions, geomean of all 96 individual speedups)?

I always have a suspicion about averages of averages. This would avoid that to some extent. Or would it be even better to just use 96 different positions? I suppose I could burn some CPU time to see what the answers look like and which appears most reasonable..
The geomean of 4 runs of 24 (where each run is itself a geomean) should be equal to the geomean of all 96 taken one by one. And I think there is little difference between doing 4 runs of the same 24 positions and 96 different positions. There may be some difference if you have a "pathological" position that behaves badly on every run (one whose average speedup even after 100 runs would sit far away from the other positions), making 4 runs with the same "pathological" position an unwelcome source of amplified noise.
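That equality is easy to confirm numerically; here random speedups stand in for real measurements:

Code: Select all

import random
from math import prod

runs = [[random.uniform(1.0, 3.5) for _ in range(24)] for _ in range(4)]
per_run = [prod(r) ** (1 / 24) for r in runs]
print(prod(per_run) ** (1 / 4))                      # geomean of 4 run geomeans
print(prod(x for r in runs for x in r) ** (1 / 96))  # geomean of all 96 at once
# The two printed values agree up to floating-point rounding.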
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: New SMP stuff (particularly Kai)

Post by bob »

Laskos wrote:
The geomean of 4 runs of 24 (where each run is itself a geomean) should be equal to the geomean of all 96 taken one by one. And I think there is little difference between doing 4 runs of the same 24 positions and 96 different positions. There may be some difference if you have a "pathological" position that behaves badly on every run (one whose average speedup even after 100 runs would sit far away from the other positions), making 4 runs with the same "pathological" position an unwelcome source of amplified noise.
There are at least two pathological positions in this test set, caused by something "deep" that gets exposed during the parallel part of the search, where a super-linear speedup is quite common. I think my next order of business is going to be putting together a group of unrelated positions that exhibit a little bit of everything: a rock-solid lock onto the best move at depth=1 that never changes, some positions where several moves are close and the search keeps bouncing between them, some with lots of tactics that confuse move ordering, some with deep/narrow trees (i.e. endgames). Etc...