Compiler Optimization Question

Michael Sherwin · Post by **Michael Sherwin** » Mon Apr 13, 2020 1:11 am

A really large function has many local variables and accesses many global variables. Will modern compilers be able to optimize register usage effectively or is it better to break up the large function into a few subfunctions with fewer variables so the compiler can better optimize register usage?

Dann Corbit · Post by **Dann Corbit** » Mon Apr 13, 2020 1:45 am

I would not worry about trying to out-think the compiler.
Modern compilers can even eliminate tail recursion.
The first rule of optimization is "Don't do it."
The second rule, for experts only, is "Don't do it yet."

Of course it is kind of a gag, but the point is never to waste time optimizing until you have run the code through the profiler.
You might make a routine 100 times faster and have no measurable impact on the performance of the program.

The most important thing for writing fast, correct code is to choose good algorithms and debug them carefully.

Michael Sherwin · Post by **Michael Sherwin** » Mon Apr 13, 2020 3:09 am

Dann Corbit wrote: ↑Mon Apr 13, 2020 1:45 am I would not worry about trying to out-think the compiler.
Modern compilers can even eliminate tail recursion.
The first rule of optimization is "Don't do it."
The second rule, for experts only, is "Don't do it yet."

Of course it is kind of a gag, but the point is never to waste time optimizing until you have run the code through the profiler.
You might make a routine 100 times faster and have no measurable impact on the performance of the program.

The most important thing for writing fast, correct code is to choose good algorithms and debug them carefully.

That is good advice Dann! However, my question is not about me optimizing the code but helping the compiler to optimize better which is almost the exact same but not quite. Let me rephrase, can the compiler optimize better one really huge function or several smaller functions? Or does it not matter?

jdart · Post by **jdart** » Mon Apr 13, 2020 3:15 am

Michael Sherwin wrote: ↑Mon Apr 13, 2020 3:09 am Let me rephrase, can the compiler optimize better one really huge function or several smaller functions? Or does it not matter?

It is not likely to matter much. Usually you have a few performance-critical parts in any program. Optimizing anything else doesn't change the overall performance very much.

The one place function size matters is in relation to inlining. Compilers have rules that determine whether a function can be inlined or not. Generally small ones are inlined by the optimizer, larger ones may not be (at least with gcc, though, this is adjustable via options).

lucasart · Post by **lucasart** » Mon Apr 13, 2020 3:22 am

Michael Sherwin wrote: ↑Mon Apr 13, 2020 1:11 am A really large function has many local variables and accesses many global variables. Will modern compilers be able to optimize register usage effectively or is it better to break up the large function into a few subfunctions with fewer variables so the compiler can better optimize register usage?

I don't think you can get a good answer on talkchess. There are as many such questions as there are examples of such code. Just measure it, and answer your own question.

Michael Sherwin · Post by **Michael Sherwin** » Mon Apr 13, 2020 3:28 am

Thanks guys. I found this online.

WRITE CLEAR, SIMPLE CODE
Some of the very things that make code clear and readable to humans also make it clear and readable to compilers. Complicated expressions are harder to optimize and can cause the compiler to "fallback" to a less intense mode of optimization. I've seen one compiler that has translation-unit limits which when overrun will cause the entire module to be de-optimized by one level.

Part of the clarity is making hunks of code into functions when appropriate. The cost of a function call is extremely small on modern machines, so optimization is NOT a valid excuse for writing ten-page functions.

lauriet · Post by **lauriet** » Mon Apr 13, 2020 3:31 am

regardless of any optimising issues, its just good practice to brake things into small, easy readable/debug functions.
One function = one job. if it spans more than a screen, break it up into small logical parts. they are easier to prove correct and then
your mind can abstract them away.

Michael Sherwin · Post by **Michael Sherwin** » Mon Apr 13, 2020 3:38 am

lauriet wrote: ↑Mon Apr 13, 2020 3:31 am regardless of any optimising issues, its just good practice to brake things into small, easy readable/debug functions.
One function = one job. if it spans more than a screen, break it up into small logical parts. they are easier to prove correct and then
your mind can abstract them away.

And maybe easier for the compiler to do the same?

Michael Sherwin · Post by **Michael Sherwin** » Mon Apr 13, 2020 3:43 am

Here is the function I'm concerned about. It is almost complete, just have to add the castling code. It generates all moves one at a time and calls Qsearch(). Only legal moves are stored in the move list. The variable "t" is the thread index and it uses arrays rather than a structure.

Code: Select all

void GenAllMovesQ(s32 t, s32 alpha, s32 beta, s32 depth) {
  s32 pce, typ, target, adjtyp, score, i;
  u64 bb, pieces, wPieces, bPieces, aPieces, empty, notme;
  u32 fs, ts, sq;

  i = first[t][ply[t]];

  pieces = piece[t][wtm[t]];
  wPieces = piece[t][WHITE];
  bPieces = piece[t][BLACK];
  aPieces = wPieces | bPieces;
  empty = 0xffffffffffffffff ^ aPieces;
  notme = 0xffffffffffffffff ^ pieces;

  do {
    _BitScanForward64(&fs, pieces);
    pce = board[t][fs];
    switch (pce) {
      case OO: break;
      case WP:
        typ = fs >> 3;
        switch (typ) {
          case RANK1: break;
          case RANK2:
            _BitScanForward64(&sq, wPawnMoves[fs] & aPieces);
            bb = (wPawnMoves[fs] & below[sq]) | (wPawnCapts[fs] & bPieces);
            typ = WDOU;
            break;
          case RANK3:
            bb = (wPawnMoves[fs] & empty) | (wPawnCapts[fs] & bPieces);
            typ = WSGL;
            break;
          case RANK4:
            bb = (wPawnMoves[fs] & empty) | (wPawnCapts[fs] & bPieces);
            typ = WSGL;
            break;
          case RANK5:
            bb = (wPawnMoves[fs] & empty) | (wPawnCapts[fs] & (bPieces | epbit[t][ply[t]]));
            typ = WCEP;
            break;
          case RANK6:
            bb = (wPawnMoves[fs] & empty) | (wPawnCapts[fs] & bPieces);
            typ = WSGL;
            break;
          case RANK7:
            bb = (wPawnMoves[fs] & empty) | (wPawnCapts[fs] & bPieces);
            typ = WPRO;
            break;
        }
        break;
      case WN:
        bb = knightMoves[fs] & notme;
        typ = WNMV;
        break;
      case WB:
        bb = qss[fs][(aPieces >> (fs & 56)) & 127][0]
           & qss[fs][(aPieces >> 8) & 255][1]
           & qss[fs][(aPieces >> 16) & 255][2]
           & qss[fs][(aPieces >> 24) & 255][3]
           & qss[fs][(aPieces >> 32) & 255][4]
           & qss[fs][(aPieces >> 40) & 255][5]
           & qss[fs][(aPieces >> 48) & 255][6]
           & bob[fs]
           & notme;
        typ = WBMV;
        break;
      case WR:
        bb = qss[fs][(aPieces >> (fs & 56)) & 127][0]
           & qss[fs][(aPieces >> 8) & 255][1]
           & qss[fs][(aPieces >> 16) & 255][2]
           & qss[fs][(aPieces >> 24) & 255][3]
           & qss[fs][(aPieces >> 32) & 255][4]
           & qss[fs][(aPieces >> 40) & 255][5]
           & qss[fs][(aPieces >> 48) & 255][6]
           & rob[fs]
           & notme;
        typ = WRMV;
        break;
      case WQ:
        bb = qss[fs][(aPieces >> (fs & 56)) & 127][0]
           & qss[fs][(aPieces >> 8) & 255][1]
           & qss[fs][(aPieces >> 16) & 255][2]
           & qss[fs][(aPieces >> 24) & 255][3]
           & qss[fs][(aPieces >> 32) & 255][4]
           & qss[fs][(aPieces >> 40) & 255][5]
           & qss[fs][(aPieces >> 48) & 255][6]
           & notme;
        typ = WQMV;
        break;
      case WK:
        bb = kingMoves[fs] & notme;
        typ = WKMV;
        break;
      case BP:
        typ = fs >> 3;
        switch (typ) {
          case RANK1: break;
          case RANK2:
            bb = (bPawnMoves[fs] & empty) | (bPawnCapts[fs] & wPieces);
            typ = BPRO;
          break;
          case RANK3:
            bb = (bPawnMoves[fs] & empty) | (bPawnCapts[fs] & wPieces);
            typ = BSGL;
            break;
          case RANK4:
            bb = (bPawnMoves[fs] & empty) | (bPawnCapts[fs] & (wPieces | epbit[t][ply[t]]));
            typ = BCEP;
            break;
          case RANK5:
            bb = (bPawnMoves[fs] & empty) | (bPawnCapts[fs] & wPieces);
            BSGL;
            break;
          case RANK6:
            bb = (bPawnMoves[fs] & empty) | (bPawnCapts[fs] & wPieces);
            typ = BSGL;
            break;
          case RANK7:
            _BitScanForward64(&sq, bPawnMoves[fs] & aPieces);
            bb = (bPawnMoves[fs] & above[sq]) | (bPawnCapts[fs] & wPieces);
            typ = BPRO;
            break;
        }
        break;
      case BN:
        bb = knightMoves[fs] & notme;
        typ = BNMV;
        break;
      case BB:
        bb = qss[fs][(aPieces >> (fs & 56)) & 127][0]
           & qss[fs][(aPieces >> 8) & 255][1]
           & qss[fs][(aPieces >> 16) & 255][2]
           & qss[fs][(aPieces >> 24) & 255][3]
           & qss[fs][(aPieces >> 32) & 255][4]
           & qss[fs][(aPieces >> 40) & 255][5]
           & qss[fs][(aPieces >> 48) & 255][6]
           & bob[fs]
           & notme;
        typ = BBMV;
        break;
      case BR:
        bb = qss[fs][(aPieces >> (fs & 56)) & 127][0]
           & qss[fs][(aPieces >> 8) & 255][1]
           & qss[fs][(aPieces >> 16) & 255][2]
           & qss[fs][(aPieces >> 24) & 255][3]
           & qss[fs][(aPieces >> 32) & 255][4]
           & qss[fs][(aPieces >> 40) & 255][5]
           & qss[fs][(aPieces >> 48) & 255][6]
           & rob[fs]
           & notme;
        typ = BRMV;
        break;
      case BQ:
        bb = qss[fs][(aPieces >> (fs & 56)) & 127][0]
           & qss[fs][(aPieces >> 8) & 255][1]
           & qss[fs][(aPieces >> 16) & 255][2]
           & qss[fs][(aPieces >> 24) & 255][3]
           & qss[fs][(aPieces >> 32) & 255][4]
           & qss[fs][(aPieces >> 40) & 255][5]
           & qss[fs][(aPieces >> 48) & 255][6]
           & notme;
        typ = BQMV;
        break;
      case BK:
        bb = kingMoves[fs] & notme;
        typ = BKMV;
        break;
    }

    while (bb) {
      _BitScanForward64(&ts, bb);
      board[t][fs] = OO;
      piece[t][wtm[t]] ^= one << fs;
      piece[t][wtm[t]] ^= one << ts;
      adjtyp = adjTyp[typ];
      switch (adjtyp) {
        case WMOV:
          target = board[t][ts];
          mat[t] += pceval[target];
          piece[t][1 - wtm[t]] ^= shftv[target] << ts;
          board[t][ts] = pce;
          break;
        case WDUO:
          board[t][ts] = pce;
          epbit[t][ply[t] + 1] = (u64)(ts - fs == 16) << (fs + 8); // method from H.G.Muller
          break;
        case WEPC:
          sq = ts - ((u64)(epbit[t][ply[t]] == one << ts) << 3); // method from H.G.Muller
          target = board[t][sq];
          mat[t] += pceval[target];
          piece[t][1 - wtm[t]] ^= shftv[target] << sq;
          board[t][ts] = pce;
          break;
        case WPRM:
          target = board[t][ts];
          mat[t] += pceval[target];
          piece[t][wtm[t]] ^= shftv[target] << ts;
          board[t][ts] = WQ;
          mat[t] += 800;
          break;
        case BMOV:
          target = board[t][ts];
          mat[t] += pceval[target];
          piece[t][1 - wtm[t]] ^= shftv[target] << ts;
          board[t][ts] = pce;
          break;
        case BDUO:
          board[t][ts] = pce;
          epbit[t][ply[t] + 1] = (u64)(fs - ts == 16) << (fs - 8); // method from H.G.Muller
          break;
        case BEPC:
          sq = ts + ((u64)(epbit[t][ply[t]] == one << ts) << 3); // method from H.G.Muller
          target = board[t][sq];
          mat[t] += pceval[target];
          piece[t][1 - wtm[t]] ^= shftv[target] << sq;
          board[t][ts] = pce;
          break;
        case BPRM:
          target = board[t][ts];
          mat[t] += pceval[target];
          piece[t][wtm[t]] ^= shftv[target] << ts;
          board[t][ts] = WQ;
          mat[t] -= 800;
          break;
      }

      wtm[t] = 1 - wtm[t];
      ply[t]++;

      score = Qsearch(t, alpha, beta);

      ply[t]--;
      wtm[t] = 1 - wtm[t];

      mat[t] -= pceval[target];
      board[t][fs] = OO;
      piece[t][wtm[t]] ^= one << fs;
      piece[t][wtm[t]] ^= one << ts;

      switch (adjtyp) {
        case WMOV:
          board[t][ts] = target;
          piece[t][1 - wtm[t]] ^= (u64)shftv[target] << ts;
          break;
        case WDUO:
          board[t][ts] = target;
          piece[t][1 - wtm[t]] ^= (u64)shftv[target] << ts;
          epbit[t][ply[t] + 1] = 0;
          break;
        case WEPC:
          board[t][sq] = target;
          piece[t][1 - wtm[t]] ^= (u64)shftv[target] << sq;
          break;
        case WPRM:
          board[t][ts] = target;
          piece[t][1 - wtm[t]] ^= (u64)shftv[target] << ts;
          mat[t] -= 800;
          break;
        case BMOV:
          board[t][ts] = target;
          piece[t][1 - wtm[t]] ^= (u64)shftv[target] << ts;
          break;
        case BDUO:
          board[t][ts] = target;
          piece[t][1 - wtm[t]] ^= (u64)shftv[target] << ts;
          epbit[t][ply[t] + 1] = 0;
          break;
        case BEPC:
          board[t][sq] = target;
          piece[t][1 - wtm[t]] ^= (u64)shftv[target] << sq;
          break;
        case BPRM:
          board[t][ts] = target;
          piece[t][1 - wtm[t]] ^= (u64)shftv[target] << ts;
          mat[t] += 800;
          break;
      }
    }

    if (score != -INFINITY && (score > alpha || depth > 0)) {
      frsq[t][i] = fs;
      tosq[t][i] = ts;
      trgt[t][i] = target;
      type[t][i] = typ;
      scor[t][i] = score;
      i++;
    }

    pieces ^= (one << fs);
  } while (pieces);

  // castling code goes here

}

bob · Post by **bob** » Mon Apr 13, 2020 4:26 am

Here's the simple answer from a compiler guy. Small functions are just fine. If they SHOULD be inlined, the compiler will do it during the compilation process. If it won't help, it won't. Not inlining can improve cache efficiency if the function is executed infrequently. Better to have the natural instruction stream sequential rather than broken up into blocks with interspaced dead code.

With today's optimizers, you can do pretty well to just write readable code you can understand and work on. Let the compiler do its thing.

For example

Code: Select all

if (condition) {

do some stuff 

}

Might be compiled as is. Or if the condition is mostly false, the compiler might turn that into this:

Code: Select all

if (! condition) 
  go to xx);
  getback:
  
  
  Somewhere else in memory:
  
  xx:
    do some stuff
    go to getback;

I used to have to write code like the latter in the early FORTRAN days, the optimizers were not very good. Today the compilers can do this so that your code looks readable, and then gets reorganized when converting from source to object code, leaving the source easy to read.

Yes, there are places you might do better. In particular if you have information the compiler doesn't have. IE you know that ALL the cases in your switch statement are covered explicitly. The compiler still has to check for the rest and skip all the way to the closing "}". You don't. That can eliminate instructions.

Compiler Optimization Question

Compiler Optimization Question

Re: Compiler Optimization Question

Re: Compiler Optimization Question

Re: Compiler Optimization Question

Re: Compiler Optimization Question

Re: Compiler Optimization Question

Re: Compiler Optimization Question

Re: Compiler Optimization Question

Re: Compiler Optimization Question

Re: Compiler Optimization Question