Sure! Here's an example from Expositor's network code.
The original source was:
Code: Select all
let mut s2 : [MaybeUninit<f32>; N2] = MaybeUninit::uninit_array();
for n in 0..N2 {
  let mut s = SIMD_ZERO;
  for x in 0..V1 { s += a1[x] * simd_load!(SEARCH_NETWORK.w2[n], x); }
  s2[n].write(SEARCH_NETWORK.b2[n] + horizontal_sum(s));
}
To understand this code, let's unroll the inner loop
Code: Select all
for n in 0..N2 {
  let mut s = SIMD_ZERO;
  s += a1[0] * simd_load!(SEARCH_NETWORK.w2[n], 0);
  s += a1[1] * simd_load!(SEARCH_NETWORK.w2[n], 1);
  s += a1[2] * simd_load!(SEARCH_NETWORK.w2[n], 2);
  // ...
  s += a1[63] * simd_load!(SEARCH_NETWORK.w2[n], 63);
  s2[n].write(SEARCH_NETWORK.b2[n] + horizontal_sum(s));
}
and name each intermediate result:
Code: Select all
for n in 0..N2 {
  let i0 = SIMD_ZERO;
  let i1 = i0 + a1[0] * simd_load!(SEARCH_NETWORK.w2[n], 0);
  let i2 = i1 + a1[1] * simd_load!(SEARCH_NETWORK.w2[n], 1);
  let i3 = i2 + a1[2] * simd_load!(SEARCH_NETWORK.w2[n], 2);
  // ...
  let i64 = i63 + a1[63] * simd_load!(SEARCH_NETWORK.w2[n], 63);
  let s = i64;
  s2[n].write(SEARCH_NETWORK.b2[n] + horizontal_sum(s));
}
Notice that to calculate i2, the processor has to wait for the value of i1; to calculate i3, it has to wait until i2 is finished; and so on. In other words, the processor is forced to execute this chain of additions serially.
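To put rough numbers on this, suppose each SIMD add has a latency of four cycles (a figure I'm assuming purely for illustration; the real number depends on the microarchitecture). Then the chain plays out like this:
Code: Select all
// let i1 = i0 + ...    issues at cycle 0, result ready at cycle 4
// let i2 = i1 + ...    can't issue until cycle 4, ready at cycle 8
// let i3 = i2 + ...    can't issue until cycle 8, ready at cycle 12
// ...
// let i64 = i63 + ...  ready at cycle 256
The floating-point units are pipelined and could typically accept a new add every cycle or two, so nearly all of that throughput is wasted waiting on the chain.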
After noticing my mistake, I rewrote this part of the code, and this is the source as it is today:
Code: Select all
for n in 0..N2 {
  let mut s_a = SIMD_ZERO;
  let mut s_b = SIMD_ZERO;
  let mut s_c = SIMD_ZERO;
  let mut s_d = SIMD_ZERO;
  for x in 0..V1/4 {
    s_a += a1[x*4+0] * simd_load!(SEARCH_NETWORK.w2[n], x*4+0);
    s_b += a1[x*4+1] * simd_load!(SEARCH_NETWORK.w2[n], x*4+1);
    s_c += a1[x*4+2] * simd_load!(SEARCH_NETWORK.w2[n], x*4+2);
    s_d += a1[x*4+3] * simd_load!(SEARCH_NETWORK.w2[n], x*4+3);
  }
  let s = (s_a + s_b) + (s_c + s_d);
  s2[n].write(SEARCH_NETWORK.b2[n] + horizontal_sum(s));
}
If we similarly unroll the loop and name the intermediate results, we have:
Code: Select all
for n in 0..N2 {
  let sa0 = SIMD_ZERO;
  let sb0 = SIMD_ZERO;
  let sc0 = SIMD_ZERO;
  let sd0 = SIMD_ZERO;
  let sa1 = sa0 + a1[0] * simd_load!(SEARCH_NETWORK.w2[n], 0);
  let sb1 = sb0 + a1[1] * simd_load!(SEARCH_NETWORK.w2[n], 1);
  let sc1 = sc0 + a1[2] * simd_load!(SEARCH_NETWORK.w2[n], 2);
  let sd1 = sd0 + a1[3] * simd_load!(SEARCH_NETWORK.w2[n], 3);
  let sa2 = sa1 + a1[4] * simd_load!(SEARCH_NETWORK.w2[n], 4);
  let sb2 = sb1 + a1[5] * simd_load!(SEARCH_NETWORK.w2[n], 5);
  let sc2 = sc1 + a1[6] * simd_load!(SEARCH_NETWORK.w2[n], 6);
  let sd2 = sd1 + a1[7] * simd_load!(SEARCH_NETWORK.w2[n], 7);
  // ...
  let sa16 = sa15 + a1[60] * simd_load!(SEARCH_NETWORK.w2[n], 60);
  let sb16 = sb15 + a1[61] * simd_load!(SEARCH_NETWORK.w2[n], 61);
  let sc16 = sc15 + a1[62] * simd_load!(SEARCH_NETWORK.w2[n], 62);
  let sd16 = sd15 + a1[63] * simd_load!(SEARCH_NETWORK.w2[n], 63);
  let s = (sa16 + sb16) + (sc16 + sd16);
  s2[n].write(SEARCH_NETWORK.b2[n] + horizontal_sum(s));
}
Now to calculate sa2, we only have to wait for sa1; we don't need to wait for sb1, sc1, or sd1. In fact, sa2, sb2, sc2, and sd2 can all be calculated concurrently. Modern processors are superscalar, which means that multiple instructions can be in flight simultaneously, with their dispatch, execution, and retirement overlapping or even happening in parallel. Splitting the sum into four independent chains makes better use of the processor's execution resources and keeps it busy.
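You can reproduce the effect outside of an engine. Here's a minimal, self-contained scalar sketch (the function names and sizes are made up for illustration; this isn't Expositor's code) comparing one accumulator against four. Build with --release; the compiler can't merge the four chains back into one, for the reason explained below.
Code: Select all
use std::hint::black_box;
use std::time::Instant;

// One accumulator: every += must wait for the previous one.
fn dot_serial(a: &[f32], b: &[f32]) -> f32 {
  let mut s = 0.0;
  for i in 0..a.len() {
    s += a[i] * b[i];
  }
  s
}

// Four accumulators: four independent chains of additions.
// (Assumes the length is a multiple of four.)
fn dot_four(a: &[f32], b: &[f32]) -> f32 {
  let (mut s0, mut s1, mut s2, mut s3) = (0.0f32, 0.0, 0.0, 0.0);
  let mut i = 0;
  while i + 4 <= a.len() {
    s0 += a[i]   * b[i];
    s1 += a[i+1] * b[i+1];
    s2 += a[i+2] * b[i+2];
    s3 += a[i+3] * b[i+3];
    i += 4;
  }
  (s0 + s1) + (s2 + s3)
}

fn main() {
  let a : Vec<f32> = (0..4096).map(|i| (i as f32).sin()).collect();
  let b : Vec<f32> = (0..4096).map(|i| (i as f32).cos()).collect();

  let t = Instant::now();
  let mut acc = 0.0f32;
  for _ in 0..100_000 {
    acc += dot_serial(black_box(&a), black_box(&b));
  }
  println!("serial:   {:?} (checksum {acc})", t.elapsed());

  let t = Instant::now();
  let mut acc = 0.0f32;
  for _ in 0..100_000 {
    acc += dot_four(black_box(&a), black_box(&b));
  }
  println!("four-way: {:?} (checksum {acc})", t.elapsed());
}
The exact speedup depends on the processor's add latency and how many chains it can keep in flight, but the serial version should be measurably slower.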
The overall effect of the change was a 10 to 15% increase in Expositor's performance, as measured in nodes per second.
Why didn't the compiler perform this optimization for me? Because it's not allowed to – this optimization can subtly change the result! Floating-point arithmetic is not associative; for example,
Code: Select all
a + b + c + d + e + f + g + h // this is equivalent to (((a + b) + c) + d) + ...
is not equal to
Code: Select all
(a + b) + (c + d) + (e + f) + (g + h) // this is equivalent to (((a+b) + (c+d)) + (e+f)) + (g+h)
in general. Happily, the difference between the two was empirically negligible (if anything, the new version is more accurate).
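Here's a tiny example you can run to see the two groupings disagree in f32; the values are deliberately contrived to make the rounding differences large:
Code: Select all
fn main() {
  let v : [f32; 8] = [1.0e8, 3.14, -1.0e8, 2.71, 1.0e-3, 7.0, -5.5, 0.25];

  // Left to right, as in the first grouping: (((a + b) + c) + d) + ...
  let serial = v.iter().copied().fold(0.0f32, |s, x| s + x);

  // Pairwise, as in the second grouping: ((a+b) + (c+d)) + ((e+f) + (g+h))
  let pairwise = ((v[0] + v[1]) + (v[2] + v[3]))
               + ((v[4] + v[5]) + (v[6] + v[7]));

  println!("serial   = {serial}");    // prints roughly 4.461
  println!("pairwise = {pairwise}");  // prints roughly 1.751
}
The first grouping loses 3.14 (it's rounded away when added to 1.0e8), while the second loses both 3.14 and 2.71, so the results differ. With summands of comparable magnitude, as in the network code, the groupings agree far more closely.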