Misprediction-poor looping

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

User avatar
hgm
Posts: 27792
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Misprediction-poor looping

Post by hgm »

Suppose we have a loop, the body of which should only be performed for some iterations, and it is difficult to predict on which:

Code: Select all

for(i=0; i<MAX; i++) if(Condition(i)) Action(i);
We could minimize the number of branch mispredictions by first figuring out on which iterations the action should be performed:

Code: Select all

int todo[MAX];
int n=0;
for(i=0; i<MAX; i++) { todo[n] = i; n += (Condition(i) != 0); }
for(i=0; i<n; i++) Action(todo[i]);
This uses branches for the looping only, which, if the loops typically do many iterations, should be predicted as 'always taken', causing mispredicts only at the loop end. (And if the loops are unrolled, and have a well-predictable number of iterations, not even that.)
dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: Misprediction-poor looping

Post by dragontamer5788 »

Comment: For some reason, I couldn't grok your code in the first pass. My brain understands the following more:

Code: Select all

for(i=0; i<MAX; i++) { todo[n] = i; n += (Condition(i) ? 1 : 0); }
Condition(i) ? 1 : 0 will probably compile into a cmov instruction, so you won't have any branches either. In fact, you probably can use the if() else {} code and rely upon compiler optimizations to use cmov instead of a branch.

EDIT: Godbolt says that GCC uses sete flag, which is still branchless.

--------

I see the theory, and I think it could be useful! Interesting food for thought.
mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: Misprediction-poor looping

Post by mar »

dragontamer5788 wrote: Mon Dec 09, 2019 11:25 pm Comment: For some reason, I couldn't grok your code in the first pass. My brain understands the following more:

Code: Select all

for(i=0; i<MAX; i++) { todo[n] = i; n += (Condition(i) ? 1 : 0); }
Condition(i) ? 1 : 0 will probably compile into a cmov instruction, so you won't have any branches either. In fact, you probably can use the if() else {} code and rely upon compiler optimizations to use cmov instead of a branch.

EDIT: Godbolt says that GCC uses sete flag, which is still branchless.

--------

I see the theory, and I think it could be useful! Interesting food for thought.
Yes, but only if you assume that Action() is trivial without side effects and the compiler can see its guts.

My experience:
- don't use likely/unlikely compiler hints (I remember I got a slightly slower code this way even if I thought the hint was appropriate, at instruction level this means extra byte (prefix) before the jump instruction); I don't know if "forward branches assumed not taken" is still a deal now that we got branch predictors (I guess not)
- for well predicted trivial branches, a branch may actually be faster than branchless magic that taxes ALU more (example: sample clipping loop in audio mixer)
- of course, profile, profile, profile. I remember I have outsmarted myself serveral times, thinking I came up with something clever that has to be faster but profiling showed otherwise
Martin Sedlak
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Misprediction-poor looping

Post by bob »

One other thing. Modern hardware can predict 100% on loops that are less than N iterations. Not sure what N is today, a few years ago it was 10-12 depending on the microprocessor. If the iteration count is reasonably long, you get N-1 predictions and 1 misprediction for each time you execute the loop, which is still not very bad. And the neat thing is that if you have that loop in several places, it will predict all the same way, not having to learn what is going on beyond the first loop completing.
User avatar
hgm
Posts: 27792
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Misprediction-poor looping

Post by hgm »

dragontamer5788 wrote: Mon Dec 09, 2019 11:25 pm Comment: For some reason, I couldn't grok your code in the first pass. My brain understands the following more:

Code: Select all

for(i=0; i<MAX; i++) { todo[n] = i; n += (Condition(i) ? 1 : 0); }
I think that (A != 0) used as an operand by definition means (A ? 1 : 0), so the compiler will consider it equivalent, and generate the same code for it.

There are several possibilities to do this without branches. You already mentioned CMOV, but there is an older instruction SETcc (in this case SETNE) which loads 1 or 0 in the AL register depending on the condition codes. So it is always as easy to branch on a condition (with BNE) as to convert it to a truth value. Another trick is to compute the condition and then add -1 (= 0xFFFF...) to it, which only does not overflow (and then leaves the carry condition code cleared) is the condition was zero. You can then add 0 with carry to n (ADC) to only increment it when the condition was non-zero.
mar wrote: Tue Dec 10, 2019 12:29 amYes, but only if you assume that Action() is trivial without side effects and the compiler can see its guts.
Indeed, I used the function calls only as a means to write compact pseudo-code; they were supposed to be simple code sections. BTW, all calls to Action are done in the same order as they would be in the original code. So side effects there are not dangerous per se. But Condition and Action should not have side effects on each other's behavior.
mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: Misprediction-poor looping

Post by mar »

hgm wrote: Mon Dec 09, 2019 10:36 pm

Code: Select all

int todo[MAX];
int n=0;
for(i=0; i<MAX; i++) { todo[n] = i; n += (Condition(i) != 0); }
for(i=0; i<n; i++) Action(todo[i]);
I use something similar but with different motivation. I often extract elements/pointers based on a condition into a (thread)local array for further processing.
This is beneficial if you want to do several loops or parallelize the loop.
Or even if you want to expose this to a higher-level scripting language where iterating would mean calling C/C++ iterator code periodically.
EDIT: in my case, Action is typically non-trivial
Martin Sedlak
User avatar
flok
Posts: 481
Joined: Tue Jul 03, 2018 10:19 am
Full name: Folkert van Heusden

Re: Misprediction-poor looping

Post by flok »

hgm wrote: Tue Dec 10, 2019 9:47 am

Code: Select all

for(i=0; i<MAX; i++) { todo[n] = i; n += (Condition(i) ? 1 : 0); }

Code: Select all

for(i=0; i<MAX; i++) { todo[n] = i; n += !!Condition(i); }
dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: Misprediction-poor looping

Post by dragontamer5788 »

hgm wrote: Tue Dec 10, 2019 9:47 am
mar wrote: Tue Dec 10, 2019 12:29 amYes, but only if you assume that Action() is trivial without side effects and the compiler can see its guts.
Indeed, I used the function calls only as a means to write compact pseudo-code; they were supposed to be simple code sections. BTW, all calls to Action are done in the same order as they would be in the original code. So side effects there are not dangerous per se. But Condition and Action should not have side effects on each other's behavior.
If this simplicity can be assured, then we open up SIMD-compute capabilities. Wait, wait wait... don't run away. I'm not talking GPUs here, just pointing out that the compiler will autovectorize the loop.

Godbolt says the following code is SIMD, compiling into XMM-registers (SSE, 128-bit assembly).

The "SIMD" part is simply:

Code: Select all


void Action(int& x){
    x = x+10;
}

...

    #pragma omp simd
    for(int i=0; i<n; i++) Action(todo[i]);
GCC still needs the #pragma omp simd to see that this loop is vectorizable for some reason, but the #pragma hint is all GCC needs to start using SSE. With "-mavx2", you'll get AVX2 (256-bit YMM registers). With "-mavx512f", you'll get 512-bit ZMM registers.

The Condition() loop is obviously stream-compaction, which requires prefix-sum as a primitive. This will be tricky, requiring #pragma omp simd capabilities (well supported by CLANG and GCC, but not by Microsoft Visual C++). But I assert that the Condition() loop can also be formed into SIMD while retaining work-efficiency, albeit with a bit of extra effort, some helper functions, and some extra memory (Maybe a second "todo" array to hold the prefix-sum results);
mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: Misprediction-poor looping

Post by mar »

dragontamer5788 wrote: Tue Dec 10, 2019 4:05 pmGCC still needs the #pragma omp simd to see that this loop is vectorizable for some reason
why? the code you posted is auto-vectorized by gcc regardless of any pragma, clang even unrolls the loop as well as msc
Martin Sedlak
dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: Misprediction-poor looping

Post by dragontamer5788 »

mar wrote: Tue Dec 10, 2019 4:34 pm
dragontamer5788 wrote: Tue Dec 10, 2019 4:05 pmGCC still needs the #pragma omp simd to see that this loop is vectorizable for some reason
why? the code you posted is auto-vectorized by gcc regardless of any pragma, clang even unrolls the loop as well as msc
Interesting. I stand corrected. I tested it earlier, but maybe I herp-derp'd and messed up my reading of the assembly earlier.