Need help understanding some c++ code

Discussion of chess software programming and technical issues.

Moderator: Ras

Carbec
Posts: 162
Joined: Thu Jan 20, 2022 9:42 am
Location: France
Full name: Philippe Chevalier

Need help understanding some c++ code

Post by Carbec »

Hello,

I found a little code like this :
template <Color c> U64 Board::Knights()
{
return Bitboards[c == White ? WhiteKnight : BlackKnight];
}
I know what is a template; (well i think). But I don't understand the template here.
I see "c" more like an argument.
There are surely c++ experts that can explain it .
Thanks.
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Need help understanding some c++ code

Post by dangi12012 »

When c is passed as a template - the compiler will instantiate two functions. One for white and one for black.
The condition in the function body will be removed. Its a performance trick.

Of course it does not make too much difference here - but when shifting pawns up or down depending on the color this is actually very important.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
Carbec
Posts: 162
Joined: Thu Jan 20, 2022 9:42 am
Location: France
Full name: Philippe Chevalier

Re: Need help understanding some c++ code

Post by Carbec »

Well, I understand now. Until now, I used templates just for what they are designed.
I didn't have the performance as a goal.

So the idea is to use templates everywhere ? (functions depending on color, or piece type).
User avatar
j.t.
Posts: 263
Joined: Wed Jun 16, 2021 2:08 am
Location: Berlin
Full name: Jost Triller

Re: Need help understanding some c++ code

Post by j.t. »

Carbec wrote: Sun Nov 20, 2022 2:51 pm So the idea is to use templates everywhere ? (functions depending on color, or piece type).
That might be a good general idea, but sometimes the effect of using templates goes against intuition and even makes the program slower. In the end you should probably benchmark different versions and see which one performs best.
JohnWoe
Posts: 529
Joined: Sat Mar 02, 2013 11:31 pm

Re: Need help understanding some c++ code

Post by JohnWoe »

Carbec wrote: Sun Nov 20, 2022 2:23 pm Hello,

I found a little code like this :
template <Color c> U64 Board::Knights()
{
return Bitboards[c == White ? WhiteKnight : BlackKnight];
}
I know what is a template; (well i think). But I don't understand the template here.
I see "c" more like an argument.
There are surely c++ experts that can explain it .
Thanks.
Templates are compilation logic. Branches are eliminated before an executable is produced. Speed of a program is all about branch prediction and nothing else.
In this case the compiler will generate 2 functions in the binary for both colors.
Thus eliminating a branch. Hopefully. I spent some time looking for whether ternary or "if constexpr" are the same w/ templates in C++.
I guess they are.
Only benchmarking will tell the truth. I prefer templates only when they share a lot of code.
expositor
Posts: 60
Joined: Sat Dec 11, 2021 5:03 am
Full name: expositor

Re: Need help understanding some c++ code

Post by expositor »

Templates are compilation logic. [...] In this case the compiler will generate 2 functions in the binary for both colors.
As far as I'm aware, a standards-conforming C++ compiler is not required to perform monomorphization. Although this will likely happen, it's always a good idea to (1) check the assembly that your compiler generates to make sure it's doing what you expect and (2) measure the performance of different implementations.
Speed of a program is all about branch prediction and nothing else.
Although branch prediction is incredibly important, it's quite false to say nothing else matters – if you want optimized code, you should also consider the size of data caches and layout within the L1 cache, and you should avoid tight data dependency chains, for example.* I've seen a double-digit percentage increase in performance from fixing a cache layout problem and a similar increase in performance from breaking a tight dependency chain.
Only benchmarking will tell the truth.
+1 to this!


*Other things you might consider are instruction cache pressure, the decoding bandwidth of the microarchitecture you are optimizing for, the number and nature of execution ports, the size of the μop cache, and the throughput and latency of instructions (notably pext/pdep on Zen 2).
Carbec
Posts: 162
Joined: Thu Jan 20, 2022 9:42 am
Location: France
Full name: Philippe Chevalier

Re: Need help understanding some c++ code

Post by Carbec »

Hi,

Woot, this is far above my knowledge. Could you give some example of "cache layout problem" and "tight dependency chain" ?
For myself, I always considered optimising by finding better algorithms. I never looked what happens in the processor, as I dont know it.

Thanks for informations.
expositor
Posts: 60
Joined: Sat Dec 11, 2021 5:03 am
Full name: expositor

Re: Need help understanding some c++ code

Post by expositor »

Sure! Here's an example from Expositor's network code.

The original source was:

Code: Select all

let mut s2 : [MaybeUninit<f32>; N2] = MaybeUninit::uninit_array();
for n in 0..N2 {
  let mut s = SIMD_ZERO;
  for x in 0..V1 { s += a1[x] * simd_load!(SEARCH_NETWORK.w2[n], x); }
  s2[n].write(SEARCH_NETWORK.b2[n] + horizontal_sum(s));
}
To understand this code, let's unroll the loop

Code: Select all

for n in 0..N2 {
  let mut s = SIMD_ZERO;
  s += a1[0] * simd_load!(SEARCH_NETWORK.w2[n], 0);
  s += a1[1] * simd_load!(SEARCH_NETWORK.w2[n], 1);
  s += a1[2] * simd_load!(SEARCH_NETWORK.w2[n], 2);
  // ...
  s += a1[63] * simd_load!(SEARCH_NETWORK.w2[n], 63);
  s2[n].write(SEARCH_NETWORK.b2[n] + horizontal_sum(s));
}
and name each intermediate result:

Code: Select all

for n in 0..N2 {
  let i0 = SIMD_ZERO;
  i1 = i0 + a1[0] * simd_load!(SEARCH_NETWORK.w2[n], 0);
  i2 = i1 + a1[1] * simd_load!(SEARCH_NETWORK.w2[n], 1);
  i3 = i2 + a1[2] * simd_load!(SEARCH_NETWORK.w2[n], 2);
  // ...
  i64 = i63 + a1[63] * simd_load!(SEARCH_NETWORK.w2[n], 63);
  let s = i64;
  s2[n].write(SEARCH_NETWORK.b2[n] + horizontal_sum(s));
}
Notice that to calculate i2, the processor has to wait for the value of i1 to be calculated; to calculate i3, the processor has to wait until it's finished i2, and so on. In other words, the processor is forced to execute instructions serially.

After noticing my mistake, I rewrote this section, and this is the source as it is today:

Code: Select all

for n in 0..N2 {
  let mut s_a = SIMD_ZERO;
  let mut s_b = SIMD_ZERO;
  let mut s_c = SIMD_ZERO;
  let mut s_d = SIMD_ZERO;
  for x in 0..V1/4 {
    s_a += a1[x*4+0] * simd_load!(SEARCH_NETWORK.w2[n], x*4+0);
    s_b += a1[x*4+1] * simd_load!(SEARCH_NETWORK.w2[n], x*4+1);
    s_c += a1[x*4+2] * simd_load!(SEARCH_NETWORK.w2[n], x*4+2);
    s_d += a1[x*4+3] * simd_load!(SEARCH_NETWORK.w2[n], x*4+3);
  }
  let s = (s_a + s_b) + (s_c + s_d);
  s2[n].write(SEARCH_NETWORK.b2[n] + horizontal_sum(s));
}
If we similarly unroll the loop and name the intermediate results, we have

Code: Select all

for n in 0..N2 {
  let sa0 = SIMD_ZERO;
  let sb0 = SIMD_ZERO;
  let sc0 = SIMD_ZERO;
  let sd0 = SIMD_ZERO;
  sa1 = sa0 + a1[0] * simd_load!(SEARCH_NETWORK.w2[n], 0);
  sb1 = sb0 + a1[1] * simd_load!(SEARCH_NETWORK.w2[n], 1);
  sc1 = sc0 + a1[2] * simd_load!(SEARCH_NETWORK.w2[n], 2);
  sd1 = sd0 + a1[3] * simd_load!(SEARCH_NETWORK.w2[n], 3);
  sa2 = sa1 + a1[4] * simd_load!(SEARCH_NETWORK.w2[n], 4);
  sb2 = sb1 + a1[5] * simd_load!(SEARCH_NETWORK.w2[n], 5);
  sc2 = sc1 + a1[6] * simd_load!(SEARCH_NETWORK.w2[n], 6);
  sd2 = sd1 + a1[7] * simd_load!(SEARCH_NETWORK.w2[n], 7);
  // ...
  sa16 = sa15 + a1[60] * simd_load!(SEARCH_NETWORK.w2[n], 60);
  sb16 = sb15 + a1[61] * simd_load!(SEARCH_NETWORK.w2[n], 61);
  sc16 = sc15 + a1[62] * simd_load!(SEARCH_NETWORK.w2[n], 62);
  sd16 = sd15 + a1[63] * simd_load!(SEARCH_NETWORK.w2[n], 63);
  let s = sa16 + sb16 + sc16 + sd16;
  s2[n].write(SEARCH_NETWORK.b2[n] + horizontal_sum(s));
}
Now to calculate sa2, we only have to wait for sa1 – we don't need to wait for sb1, sc1, or sd1. In fact, sa2, sb2, sc2, and sd2 can be calculated concurrently. Modern processors are superscalar, which means that multiple instructions can be in flight simultaneously (with their dispatch, execution, and retirement overlapping or even happening in parallel). This makes better utilization of the processor's resources and keeps it busy.

The overall effect of the change was a 10 to 15% increase in Expositor's overall performance (specifically, the nodes/second metric).

Why didn't the compiler perform this optimization for me? Because it's not allowed to – this optimization can subtly change the result! Floating point arithmetic is not associative; for example,

Code: Select all

a + b + c + d + e + f + g + h           // this is equivalent to (((a + b) + c) + d) + ...
is not equal to

Code: Select all

(a + b) + (c + d) + (e + f) + (g + h)   // this is equivalent to (((a+b) + (c+d)) + (e+f)) + (g+h)
in general. Happily, the difference between the two was empirically negligible (if anything, it's more accurate now).
syzygy
Posts: 5695
Joined: Tue Feb 28, 2012 11:56 pm

Re: Need help understanding some c++ code

Post by syzygy »

Carbec wrote: Sun Nov 20, 2022 2:23 pm Hello,

I found a little code like this :
template <Color c> U64 Board::Knights()
{
return Bitboards[c == White ? WhiteKnight : BlackKnight];
}
I know what is a template; (well i think). But I don't understand the template here.
I see "c" more like an argument.
There are surely c++ experts that can explain it .
Thanks.
One way to look at this is template instantiations, but in practice this function will anyway be inlined by the compiler.

By using templates, you force "c" to be a compile-time constant (or your code will not compile). This means that the compiler will always be abe to replace "c == White ? WhiteKnight : BlackKnight" with either "WhiteKnight" or "BlackKnight".

There is no need to use templates here. If you define this functions as a regular function with "c" as a parameter, the compiler will recognise when you invoke it with a compile-time constant and it will optimise away the "c == White ? ..." expression. Using templates here gives no performance benefit and only makes the code less flexible, since you can't invoke Knights<c>() with a c that is not a compile-time constant.

In general, it seems better to be aware of the compiler's ability to optimise away expressions with compile-time constants and to leave most of the optimisation decisions to the compiler. If at some point you find out that the compiler makes suboptimal decisions, you can still guide it with "inline" directives and perhaps templates.