Werewolf wrote: I have two questions, both of which seem to be disputed:

1) When a doubling of physical cores takes place (say going from 1 to 2 cores, both at same clock speed) is this formula generally accepted as right to measure speedup?:

number of cores^0.76

To get the total speed of the machine, one would multiply this value by clockspeed (say 3.0 GHz) and processor efficiency.

Can someone explain WHY this value is used? Is it based on empirical evidence or just maths? The Rybka team favoured this formula.
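As a quick illustration of how the formula in question would be applied (the 0.76 exponent is the disputed empirical constant; the clock speed and efficiency values, and the function names, are just made-up placeholders):

```python
# Sketch of the disputed speedup formula: speedup = cores ** 0.76.
# The 0.76 exponent is the empirical constant under discussion; the
# clock speed and per-clock efficiency below are example values only.

def speedup(cores: int, exponent: float = 0.76) -> float:
    """Predicted parallel speedup for a given number of physical cores."""
    return cores ** exponent

def effective_speed(cores: int, clock_ghz: float, efficiency: float) -> float:
    """Total 'machine speed' = predicted speedup * clock * efficiency."""
    return speedup(cores) * clock_ghz * efficiency

for n in (1, 2, 4, 8):
    print(f"{n} cores: predicted speedup {speedup(n):.2f}x")
```

Note that each doubling multiplies the predicted speedup by the same factor, 2^0.76 ≈ 1.69, which is where the "1.7x per doubling" figure later in the thread comes from.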

2) Linked to the above question:

Can someone deduce the search efficiency loss when a doubling of cores occurs?

The reason I ask is because it would help greatly in determining whether Hyper-Threading is good or bad.

This is what I've heard so far:

Bob H: approx 30% loss of efficiency for a doubling. Therefore HT ON would have to raise NPS by 30% just to break even.

(Bob if you're reading this, how did you work that out?)

Based on empirical testing more than anything. I have not done this test/analysis in a few years, so it MIGHT have changed although I personally doubt it.

For Crafty, here's some current data. Not near enough, but I ran this just now on my office iMac, which is a quad-core i7 running at 3.1ghz, 8mb of L3 cache, 16gb RAM. I made 4 runs with 4 threads, 4 runs with 8 threads (which uses hyper-threading, obviously) and then 1 run with one thread. Test position was kopek #22, searched to a depth of 30 plies. Here's a small table with speedup (time(1) / time(n)) and bps increase (nps(n)/nps(1)). I would normally run 16 of each to smooth things out, but this gives the idea:


```
#cpu   time   speedup
   4   49.2     4.0
   4   58.8     3.4
   4   47.7     3.7
   4   52.7     3.7
 avg   52.1     3.8
   8   40.8     4.8
   8   60.0     3.3
   8   43.0     4.6
   8   41.6     4.7
 avg   46.4     4.3
```
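As a sanity check on the table's arithmetic: the one-thread time is not listed, but ~197 seconds is implied by the individual rows (e.g. 49.2 s * 4.0 = 196.8 s). A minimal sketch, under that assumption:

```python
# Recomputing the "avg" rows of the table above.  The one-thread time is
# not listed in the post; ~197 s is inferred from the per-run speedups
# (e.g. 49.2 s * 4.0 = 196.8 s).
t1 = 197.0

times_4 = [49.2, 58.8, 47.7, 52.7]  # 4-thread runs (seconds)
times_8 = [40.8, 60.0, 43.0, 41.6]  # 8-thread runs (seconds)

for label, times in (("4 threads", times_4), ("8 threads", times_8)):
    avg_t = sum(times) / len(times)
    # Averaging the times first and then dividing matches the table's
    # "avg" speedup rows better than averaging the per-run speedups.
    print(f"{label}: avg time {avg_t:.2f} s, speedup {t1 / avg_t:.2f}x")
```

This reproduces the 3.8x and 4.3x average speedups to within rounding.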

So on my iMac, hyper-threading actually seems to help, but I am not certain, because I don't particularly trust the process scheduler to pin 4 threads to 4 physical cores correctly.

For my estimated speedup formula: the last time I measured this carefully, each additional thread added about 30% EXTRA nodes (30% of the nodes 1 cpu searched) as search overhead. It is not perfectly linear, but a simple linear fit gives

speedup = 1 + (N - 1) * 0.7

where N = number of CPUs. That is a fairly pessimistic formula, but it works. 0.8 might be a better number today. I need to do a bunch of test runs to confirm, which I will do later this week.

For 4 cpus, that gives 3.1x as the predicted speedup. The 4^0.76 formula predicts 2.9x. Pretty close (I never saw any super-linear performance from Rybka's parallel search in the few numbers that were ever posted).
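The two estimates above can be put side by side for a few core counts (a minimal sketch; the function names are mine, the constants are the ones from the thread):

```python
# Comparing the two speedup estimates discussed above.

def linear_fit(n: int, gain: float = 0.7) -> float:
    """Pessimistic linear fit: each extra CPU adds ~0.7 of a CPU."""
    return 1 + (n - 1) * gain

def power_law(n: int, exponent: float = 0.76) -> float:
    """Rybka-style power-law estimate."""
    return n ** exponent

for n in (2, 4, 8):
    print(f"{n} cpus: linear fit {linear_fit(n):.1f}x, "
          f"power law {power_law(n):.1f}x")
```

For 4 CPUs the two agree closely (3.1x vs 2.9x); they diverge more at higher core counts, since the linear fit grows without bound while the power law flattens.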

Robert Houdart: same as Bob but 20% instead of 30% (he was applying this to Houdini)

My own working, based on the top formula with ^0.76, is that search efficiency drops by 15% with each doubling, e.g.:

2^0.76 ≈ 1.7 (i.e. for a doubling of cores, a 1.7x speedup is achieved)

1 core = 1000 NPS (say)

2 cores = 2000 NPS = 1700 effective NPS

which means you need a coefficient of 0.85 to make this happen, i.e. a loss of 15%.
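The steps above can be worked through explicitly (a minimal sketch; the 1000 NPS figure is the example value from the text, and the variable names are mine):

```python
# Per-doubling efficiency loss implied by the n ** 0.76 formula.
# The 1000 NPS base rate is the made-up example from the text.

base_nps = 1000.0          # one core, for the sake of argument
raw_nps = 2 * base_nps     # two cores double the raw node rate
speedup = 2 ** 0.76        # ~1.7x effective speedup per the formula
effective_nps = base_nps * speedup

coefficient = effective_nps / raw_nps   # ~0.85
print(f"speedup per doubling: {speedup:.2f}x")
print(f"effective NPS coefficient: {coefficient:.2f} "
      f"(~{(1 - coefficient) * 100:.0f}% efficiency loss)")
```

Since 2^0.76 / 2 ≈ 0.85 regardless of the base rate, the ~15% loss applies to every doubling under this formula, not just the first.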

Can anyone comment on this, please?

I realise there will be variations from engine to engine, but if we can deduce a principle...that would help a lot!

There is NO "principle" here. Every parallel approach, combined with a specific search implementation, will produce different speedup numbers. This is a "personalized" number that will fit one specific implementation of one program. My numbers have changed a bit because I spent some time over the past year trying to clean up the parallel search to make it faster and more efficient. And my speedup numbers have changed. A "one size fits all" approach REALLY turns into a "one size fits none" formula for this specific performance measure.

Note that I don't trust my speedup numbers from the iMac because of turbo boost, which can't be turned off. I'll run this on my cluster later this week, where I can have turbo boost and hyper-threading both disabled.