Reliable speed comparison: some math required

mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Reliable speed comparison: some math required

Post by mcostalba »

I don't have the math skills to find a correct answer to this, so I am posting here for help.

The problem is to measure whether a non-functional speed optimization (NEW) is faster than the current code (MASTER).

As everybody knows, running a single speed bench gives very noisy information, able to detect only the grossest differences in speed; when the speeds are similar (as in real cases) it is not reliable.

We can run more than one test and average the results, but how reliable is the averaged result? So the problem is:

Given N noisy measures of speed of NEW and N noisy measures of MASTER, we want to know:

1. Whether NEW is faster than MASTER with 95% reliability.

2. The maximum speed-up of NEW, defined as (Speed NEW - Speed MASTER) / Speed MASTER, that is guaranteed with 95% reliability.

The second point needs a clarification. Suppose NEW is faster than MASTER by 0.5% with a 95% guarantee; then it is also faster by at least 0.4%, and by any smaller margin, with the same guarantee. On the contrary, it will not be faster by 0.8% with 95% reliability, maybe only with 30% reliability. We want to find the maximum speed-up (in this case 0.5%) that holds with 95% reliability.

This is what I have cooked up myself:

Code: Select all

Suppose we have N measures and we sum them:

SUM A = N * SA + sum(i=1..N of na_i)   // SA is the true speed of A, na_i is the noise added to the i-th measure
SUM B = N * SB + sum(i=1..N of nb_i)

SUM A - SUM B = N * (SA - SB) + sum(i=1..N of na_i - nb_i)

Now we want to know when SUM A - SUM B > 0. Assuming the na_i and nb_i are Gaussian noises:

SUM A - SUM B > 0   =>   SA - SB > sum(i=1..N of nb_i - na_i) / N
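
In code, a minimal sketch of this test could look as follows (assuming the noise really is i.i.d. Gaussian as above, and using scipy for the Student-t quantile; the function and variable names are just illustrative):

Code: Select all

import numpy as np
from scipy import stats

def speedup_bound(new, master, confidence=0.95):
    """Lower confidence bound on (Speed NEW - Speed MASTER) / Speed MASTER.

    new, master: N paired speed measures (run i of NEW taken alongside
    run i of MASTER). Assumes i.i.d. Gaussian noise, as in the derivation.
    """
    new = np.asarray(new, dtype=float)
    master = np.asarray(master, dtype=float)
    n = len(new)
    diff = new - master                    # per-run differences: SA - SB + noise
    sem = diff.std(ddof=1) / np.sqrt(n)    # standard error of the mean difference
    t = stats.t.ppf(confidence, n - 1)     # one-sided Student-t quantile
    lower = diff.mean() - t * sem          # lower confidence bound on SA - SB
    return lower / master.mean()           # expressed as a relative speed-up

# 1. NEW is faster than MASTER with 95% reliability iff speedup_bound(...) > 0.
# 2. The returned value is the maximum speed-up guaranteed at that confidence.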
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: Reliable speed comparison: some math required

Post by lucasart »

This is already in pyshbench:
https://github.com/lucasart/pyshbench/b ... /pyshbench

But the problem is not the maths. The math is correct, but it relies on a fundamental assumption (behind the central limit theorem used): that the observations are drawn from independent and identically distributed random variables.

This is not the case in reality if you run the benches sequentially (and that is easily proven with proper statistical tests), because external noise from the OS and other processes pollutes the measurements too much. The best you can do is run the benches in parallel: for each run, do "master bench" and "test bench" concurrently, so that external noise (if not absent) impacts both master and test the same. Of course it's never good enough, but it's the best you can do in practice. This is what pyshbench does already.

So the limiting factor is never the statistical precision (which is very high); it's the practical instability of the machine, in other words its mismatch with the theoretical model.
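
For illustration, the concurrent scheme can be sketched like this (a minimal sketch, not the actual pyshbench code; the binary names and the "Nodes/second" output line are assumptions modeled on Stockfish's bench):

Code: Select all

import re
import subprocess
from concurrent.futures import ThreadPoolExecutor

NPS = re.compile(r"Nodes/second\s*:\s*(\d+)")

def bench(binary):
    # run "<binary> bench" and parse the nodes/second figure;
    # assumes the engine prints a "Nodes/second : <n>" line, like Stockfish
    out = subprocess.run([binary, "bench"], capture_output=True, text=True)
    return int(NPS.search(out.stdout + out.stderr).group(1))

def paired_run(master="./master", test="./test"):
    # start both benches at the same moment, so that external noise
    # (OS, other processes) hits master and test alike
    with ThreadPoolExecutor(max_workers=2) as pool:
        fm = pool.submit(bench, master)
        ft = pool.submit(bench, test)
        return fm.result(), ft.result()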
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: Reliable speed comparison: some math required

Post by mcostalba »

What about running a longer bench? The same total time, but allocated to fewer, longer runs?
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: Reliable speed comparison: some math required

Post by lucasart »

And of course, there's another problem: even with the most perfect and precise measurements, different machines and/or compilers really will give different speed-ups.

I know I'm stating the obvious. But this is really important. At least as important as understanding the math behind the formulas.
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: Reliable speed comparison: some math required

Post by lucasart »

mcostalba wrote:What about running a longer bench? The same total time, but allocated to fewer, longer runs?
That might help, but it needs to be confirmed with data.
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
Kotlov
Posts: 266
Joined: Fri Jul 10, 2015 9:23 pm
Location: Russia

Re: Reliable speed comparison: some math required

Post by Kotlov »

I have also noticed that the first test is faster than the next one; it probably depends on the temperature of the processor.
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: Reliable speed comparison: some math required

Post by lucasart »

Kotlov wrote:I have also noticed that the first test is faster than the next one; it probably depends on the temperature of the processor.
Quite possibly. But this is already neutralized by the concurrent testing technique deployed in pyshbench, which I explained above.
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: Reliable speed comparison: some math required

Post by mar »

I typically do something very simple: n runs for each version, pick the fastest one for each, then simply compare.
Not very scientific, but it works well for me.
syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

Re: Reliable speed comparison: some math required

Post by syzygy »

Testing in parallel is only more noisy, since the different cores will not be affected equally by OS noise and other processes.

In my view:
- disable cpu scaling if possible.
- kill as many background processes as possible.
- run a single bench at a time with SF bound to a particular core (use taskset on Linux).
- do a number of runs and take the fastest (= least disturbed by noise).
- if most of the runs are not close to the fastest, the system is too noisy and you have to find the background process that causes the trouble.
- there may be problems with cpu throttling on laptops.

If you can't disable cpu scaling, run two benches from a batch file and take the second result.
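
A minimal sketch of the recipe above on Linux (the "Nodes/second" parsing, the 0.5% closeness threshold and the helper names are my assumptions for illustration):

Code: Select all

import re
import subprocess

NPS = re.compile(r"Nodes/second\s*:\s*(\d+)")

def pinned_bench(binary, core=0):
    # one bench run bound to a single core via taskset (Linux);
    # assumes a "Nodes/second : <n>" line in the engine's output
    cmd = ["taskset", "-c", str(core), binary, "bench"]
    out = subprocess.run(cmd, capture_output=True, text=True)
    return int(NPS.search(out.stdout + out.stderr).group(1))

def fastest_of(binary, runs=10, tolerance=0.005):
    # take the fastest run (= least disturbed by noise) and complain
    # when most runs are not within 0.5% of it: the system is too noisy
    speeds = [pinned_bench(binary) for _ in range(runs)]
    best = max(speeds)
    close = sum(s >= best * (1 - tolerance) for s in speeds)
    if close <= runs // 2:
        print("system too noisy: find the background process causing it")
    return best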

The problem remains that speed differs per architecture and per compiler. Also, insignificant code changes can unpredictably affect speed in both directions for no good reason (due to cache alignment).
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: Reliable speed comparison: some math required

Post by lucasart »

mar wrote:I typically do something very simple: n runs for each version, pick the fastest one for each, then simply compare.
Not very scientific, but it works well for me.
In theory, with noisy observations, it's best to choose the median. It's a more robust statistic than the max. By the way, that's what I do in my engine (median of 5 runs).
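
For illustration, a minimal sketch contrasting the two statistics (the bench() parsing is an assumption modeled on Stockfish's output, as above):

Code: Select all

import re
import statistics
import subprocess

def bench(binary):
    # assumes a "Nodes/second : <n>" line in the output, like Stockfish's bench
    out = subprocess.run([binary, "bench"], capture_output=True, text=True)
    return int(re.search(r"Nodes/second\s*:\s*(\d+)", out.stdout + out.stderr).group(1))

def fastest_speed(binary, runs=5):
    return max(bench(binary) for _ in range(runs))                # pick the fastest

def median_speed(binary, runs=5):
    return statistics.median(bench(binary) for _ in range(runs))  # more robust to outliers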
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.