Misprediction-poor looping

hgm · Post by **hgm** » Mon Dec 09, 2019 10:36 pm

Suppose we have a loop, the body of which should only be performed for some iterations, and it is difficult to predict on which:

Code: Select all

for(i=0; i<MAX; i++) if(Condition(i)) Action(i);

We could minimize the number of branch mispredictions by first figuring out on which iterations the action should be performed:

Code: Select all

int todo[MAX];
int n=0;
for(i=0; i<MAX; i++) { todo[n] = i; n += (Condition(i) != 0); }
for(i=0; i<n; i++) Action(todo[i]);

This uses branches for the looping only, which, if the loops typically do many iterations, should be predicted as 'always taken', causing mispredicts only at the loop end. (And if the loops are unrolled, and have a well-predictable number of iterations, not even that.)

dragontamer5788 · Post by **dragontamer5788** » Mon Dec 09, 2019 11:25 pm

Comment: For some reason, I couldn't grok your code in the first pass. My brain understands the following more:

Code: Select all

for(i=0; i<MAX; i++) { todo[n] = i; n += (Condition(i) ? 1 : 0); }

Condition(i) ? 1 : 0 will probably compile into a cmov instruction, so you won't have any branches either. In fact, you probably can use the if() else {} code and rely upon compiler optimizations to use cmov instead of a branch.

EDIT: Godbolt says that GCC uses sete flag, which is still branchless.

--------

I see the theory, and I think it could be useful! Interesting food for thought.

mar · Post by **mar** » Tue Dec 10, 2019 12:29 am

dragontamer5788 wrote: ↑Mon Dec 09, 2019 11:25 pm Comment: For some reason, I couldn't grok your code in the first pass. My brain understands the following more:
Code: Select all
for(i=0; i<MAX; i++) { todo[n] = i; n += (Condition(i) ? 1 : 0); }
Condition(i) ? 1 : 0 will probably compile into a cmov instruction, so you won't have any branches either. In fact, you probably can use the if() else {} code and rely upon compiler optimizations to use cmov instead of a branch.

EDIT: Godbolt says that GCC uses sete flag, which is still branchless.

--------

I see the theory, and I think it could be useful! Interesting food for thought.

Yes, but only if you assume that Action() is trivial without side effects and the compiler can see its guts.

My experience:
- don't use likely/unlikely compiler hints (I remember I got a slightly slower code this way even if I thought the hint was appropriate, at instruction level this means extra byte (prefix) before the jump instruction); I don't know if "forward branches assumed not taken" is still a deal now that we got branch predictors (I guess not)
- for well predicted trivial branches, a branch may actually be faster than branchless magic that taxes ALU more (example: sample clipping loop in audio mixer)
- of course, profile, profile, profile. I remember I have outsmarted myself serveral times, thinking I came up with something clever that has to be faster but profiling showed otherwise

bob · Post by **bob** » Tue Dec 10, 2019 4:33 am

One other thing. Modern hardware can predict 100% on loops that are less than N iterations. Not sure what N is today, a few years ago it was 10-12 depending on the microprocessor. If the iteration count is reasonably long, you get N-1 predictions and 1 misprediction for each time you execute the loop, which is still not very bad. And the neat thing is that if you have that loop in several places, it will predict all the same way, not having to learn what is going on beyond the first loop completing.

hgm · Post by **hgm** » Tue Dec 10, 2019 9:47 am

dragontamer5788 wrote: ↑Mon Dec 09, 2019 11:25 pm Comment: For some reason, I couldn't grok your code in the first pass. My brain understands the following more:
Code: Select all
for(i=0; i<MAX; i++) { todo[n] = i; n += (Condition(i) ? 1 : 0); }

I think that (A != 0) used as an operand by definition means (A ? 1 : 0), so the compiler will consider it equivalent, and generate the same code for it.

There are several possibilities to do this without branches. You already mentioned CMOV, but there is an older instruction SETcc (in this case SETNE) which loads 1 or 0 in the AL register depending on the condition codes. So it is always as easy to branch on a condition (with BNE) as to convert it to a truth value. Another trick is to compute the condition and then add -1 (= 0xFFFF...) to it, which only does not overflow (and then leaves the carry condition code cleared) is the condition was zero. You can then add 0 with carry to n (ADC) to only increment it when the condition was non-zero.

mar wrote: ↑Tue Dec 10, 2019 12:29 amYes, but only if you assume that Action() is trivial without side effects and the compiler can see its guts.

Indeed, I used the function calls only as a means to write compact pseudo-code; they were supposed to be simple code sections. BTW, all calls to Action are done in the same order as they would be in the original code. So side effects there are not dangerous per se. But Condition and Action should not have side effects on each other's behavior.

mar · Post by **mar** » Tue Dec 10, 2019 10:47 am

hgm wrote: ↑Mon Dec 09, 2019 10:36 pm

Code: Select all

int todo[MAX];
int n=0;
for(i=0; i<MAX; i++) { todo[n] = i; n += (Condition(i) != 0); }
for(i=0; i<n; i++) Action(todo[i]);

I use something similar but with different motivation. I often extract elements/pointers based on a condition into a (thread)local array for further processing.
This is beneficial if you want to do several loops or parallelize the loop.
Or even if you want to expose this to a higher-level scripting language where iterating would mean calling C/C++ iterator code periodically.
EDIT: in my case, Action is typically non-trivial

flok · Post by **flok** » Tue Dec 10, 2019 3:40 pm

hgm wrote: ↑Tue Dec 10, 2019 9:47 am
Code: Select all
for(i=0; i<MAX; i++) { todo[n] = i; n += (Condition(i) ? 1 : 0); }

Code: Select all

for(i=0; i<MAX; i++) { todo[n] = i; n += !!Condition(i); }

dragontamer5788 · Post by **dragontamer5788** » Tue Dec 10, 2019 4:05 pm

hgm wrote: ↑Tue Dec 10, 2019 9:47 am
mar wrote: ↑Tue Dec 10, 2019 12:29 amYes, but only if you assume that Action() is trivial without side effects and the compiler can see its guts.
Indeed, I used the function calls only as a means to write compact pseudo-code; they were supposed to be simple code sections. BTW, all calls to Action are done in the same order as they would be in the original code. So side effects there are not dangerous per se. But Condition and Action should not have side effects on each other's behavior.

If this simplicity can be assured, then we open up SIMD-compute capabilities. Wait, wait wait... don't run away. I'm not talking GPUs here, just pointing out that the compiler will autovectorize the loop.

Godbolt says the following code is SIMD, compiling into XMM-registers (SSE, 128-bit assembly).

The "SIMD" part is simply:

Code: Select all


void Action(int& x){
    x = x+10;
}

...

    #pragma omp simd
    for(int i=0; i<n; i++) Action(todo[i]);

GCC still needs the #pragma omp simd to see that this loop is vectorizable for some reason, but the #pragma hint is all GCC needs to start using SSE. With "-mavx2", you'll get AVX2 (256-bit YMM registers). With "-mavx512f", you'll get 512-bit ZMM registers.

The Condition() loop is obviously stream-compaction, which requires prefix-sum as a primitive. This will be tricky, requiring #pragma omp simd capabilities (well supported by CLANG and GCC, but not by Microsoft Visual C++). But I assert that the Condition() loop can also be formed into SIMD while retaining work-efficiency, albeit with a bit of extra effort, some helper functions, and some extra memory (Maybe a second "todo" array to hold the prefix-sum results);

mar · Post by **mar** » Tue Dec 10, 2019 4:34 pm

dragontamer5788 wrote: ↑Tue Dec 10, 2019 4:05 pmGCC still needs the #pragma omp simd to see that this loop is vectorizable for some reason

why? the code you posted is auto-vectorized by gcc regardless of any pragma, clang even unrolls the loop as well as msc

dragontamer5788 · Post by **dragontamer5788** » Tue Dec 10, 2019 5:45 pm

mar wrote: ↑Tue Dec 10, 2019 4:34 pm
dragontamer5788 wrote: ↑Tue Dec 10, 2019 4:05 pmGCC still needs the #pragma omp simd to see that this loop is vectorizable for some reason
why? the code you posted is auto-vectorized by gcc regardless of any pragma, clang even unrolls the loop as well as msc

Interesting. I stand corrected. I tested it earlier, but maybe I herp-derp'd and messed up my reading of the assembly earlier.

Misprediction-poor looping

Misprediction-poor looping

Re: Misprediction-poor looping

Re: Misprediction-poor looping

Re: Misprediction-poor looping

Re: Misprediction-poor looping

Re: Misprediction-poor looping

Re: Misprediction-poor looping

Re: Misprediction-poor looping

Re: Misprediction-poor looping

Re: Misprediction-poor looping