Compiler Optimization Question

Discussion of chess software programming and technical issues.

Moderators: bob, hgm, Harvey Williamson

Forum rules
This textbox is used to restore diagrams posted with the [d] tag before the upgrade.
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 1:00 am
Location: WY, USA
Full name: Michael Sherwin

Compiler Optimization Question

Post by Michael Sherwin » Sun Apr 12, 2020 11:11 pm

A really large function has many local variables and accesses many global variables. Will modern compilers be able to optimize register usage effectively or is it better to break up the large function into a few subfunctions with fewer variables so the compiler can better optimize register usage?
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through

Dann Corbit
Posts: 11216
Joined: Wed Mar 08, 2006 7:57 pm
Location: Redmond, WA USA
Contact:

Re: Compiler Optimization Question

Post by Dann Corbit » Sun Apr 12, 2020 11:45 pm

I would not worry about trying to out-think the compiler.
Modern compilers can even eliminate tail recursion.
The first rule of optimization is "Don't do it."
The second rule, for experts only, is "Don't do it yet."

Of course it is kind of a gag, but the point is never to waste time optimizing until you have run the code through the profiler.
You might make a routine 100 times faster and have no measurable impact on the performance of the program.

The most important thing for writing fast, correct code is to choose good algorithms and debug them carefully.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.

Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 1:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Compiler Optimization Question

Post by Michael Sherwin » Mon Apr 13, 2020 1:09 am

Dann Corbit wrote:
Sun Apr 12, 2020 11:45 pm
I would not worry about trying to out-think the compiler.
Modern compilers can even eliminate tail recursion.
The first rule of optimization is "Don't do it."
The second rule, for experts only, is "Don't do it yet."

Of course it is kind of a gag, but the point is never to waste time optimizing until you have run the code through the profiler.
You might make a routine 100 times faster and have no measurable impact on the performance of the program.

The most important thing for writing fast, correct code is to choose good algorithms and debug them carefully.
That is good advice Dann! However, my question is not about me optimizing the code but helping the compiler to optimize better which is almost the exact same but not quite. Let me rephrase, can the compiler optimize better one really huge function or several smaller functions? Or does it not matter?
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through

jdart
Posts: 3951
Joined: Fri Mar 10, 2006 4:23 am
Location: http://www.arasanchess.org

Re: Compiler Optimization Question

Post by jdart » Mon Apr 13, 2020 1:15 am

Michael Sherwin wrote:
Mon Apr 13, 2020 1:09 am
Let me rephrase, can the compiler optimize better one really huge function or several smaller functions? Or does it not matter?
It is not likely to matter much. Usually you have a few performance-critical parts in any program. Optimizing anything else doesn't change the overall performance very much.

The one place function size matters is in relation to inlining. Compilers have rules that determine whether a function can be inlined or not. Generally small ones are inlined by the optimizer, larger ones may not be (at least with gcc, though, this is adjustable via options).

User avatar
lucasart
Posts: 3109
Joined: Mon May 31, 2010 11:29 am
Full name: lucasart
Contact:

Re: Compiler Optimization Question

Post by lucasart » Mon Apr 13, 2020 1:22 am

Michael Sherwin wrote:
Sun Apr 12, 2020 11:11 pm
A really large function has many local variables and accesses many global variables. Will modern compilers be able to optimize register usage effectively or is it better to break up the large function into a few subfunctions with fewer variables so the compiler can better optimize register usage?
I don't think you can get a good answer on talkchess. There are as many such questions as there are examples of such code. Just measure it, and answer your own question.
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.

Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 1:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Compiler Optimization Question

Post by Michael Sherwin » Mon Apr 13, 2020 1:28 am

Thanks guys. I found this online.
WRITE CLEAR, SIMPLE CODE
Some of the very things that make code clear and readable to humans also make it clear and readable to compilers. Complicated expressions are harder to optimize and can cause the compiler to "fallback" to a less intense mode of optimization. I've seen one compiler that has translation-unit limits which when overrun will cause the entire module to be de-optimized by one level.

Part of the clarity is making hunks of code into functions when appropriate. The cost of a function call is extremely small on modern machines, so optimization is NOT a valid excuse for writing ten-page functions.
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through

lauriet
Posts: 197
Joined: Sun Nov 03, 2013 8:32 am
Contact:

Re: Compiler Optimization Question

Post by lauriet » Mon Apr 13, 2020 1:31 am

regardless of any optimising issues, its just good practice to brake things into small, easy readable/debug functions.
One function = one job. if it spans more than a screen, break it up into small logical parts. they are easier to prove correct and then
your mind can abstract them away.

Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 1:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Compiler Optimization Question

Post by Michael Sherwin » Mon Apr 13, 2020 1:38 am

lauriet wrote:
Mon Apr 13, 2020 1:31 am
regardless of any optimising issues, its just good practice to brake things into small, easy readable/debug functions.
One function = one job. if it spans more than a screen, break it up into small logical parts. they are easier to prove correct and then
your mind can abstract them away.
And maybe easier for the compiler to do the same?
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through

Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 1:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Compiler Optimization Question

Post by Michael Sherwin » Mon Apr 13, 2020 1:43 am

Here is the function I'm concerned about. It is almost complete, just have to add the castling code. It generates all moves one at a time and calls Qsearch(). Only legal moves are stored in the move list. The variable "t" is the thread index and it uses arrays rather than a structure.

Code: Select all

void GenAllMovesQ(s32 t, s32 alpha, s32 beta, s32 depth) {
  s32 pce, typ, target, adjtyp, score, i;
  u64 bb, pieces, wPieces, bPieces, aPieces, empty, notme;
  u32 fs, ts, sq;

  i = first[t][ply[t]];

  pieces = piece[t][wtm[t]];
  wPieces = piece[t][WHITE];
  bPieces = piece[t][BLACK];
  aPieces = wPieces | bPieces;
  empty = 0xffffffffffffffff ^ aPieces;
  notme = 0xffffffffffffffff ^ pieces;

  do {
    _BitScanForward64(&fs, pieces);
    pce = board[t][fs];
    switch (pce) {
      case OO: break;
      case WP:
        typ = fs >> 3;
        switch (typ) {
          case RANK1: break;
          case RANK2:
            _BitScanForward64(&sq, wPawnMoves[fs] & aPieces);
            bb = (wPawnMoves[fs] & below[sq]) | (wPawnCapts[fs] & bPieces);
            typ = WDOU;
            break;
          case RANK3:
            bb = (wPawnMoves[fs] & empty) | (wPawnCapts[fs] & bPieces);
            typ = WSGL;
            break;
          case RANK4:
            bb = (wPawnMoves[fs] & empty) | (wPawnCapts[fs] & bPieces);
            typ = WSGL;
            break;
          case RANK5:
            bb = (wPawnMoves[fs] & empty) | (wPawnCapts[fs] & (bPieces | epbit[t][ply[t]]));
            typ = WCEP;
            break;
          case RANK6:
            bb = (wPawnMoves[fs] & empty) | (wPawnCapts[fs] & bPieces);
            typ = WSGL;
            break;
          case RANK7:
            bb = (wPawnMoves[fs] & empty) | (wPawnCapts[fs] & bPieces);
            typ = WPRO;
            break;
        }
        break;
      case WN:
        bb = knightMoves[fs] & notme;
        typ = WNMV;
        break;
      case WB:
        bb = qss[fs][(aPieces >> (fs & 56)) & 127][0]
           & qss[fs][(aPieces >> 8) & 255][1]
           & qss[fs][(aPieces >> 16) & 255][2]
           & qss[fs][(aPieces >> 24) & 255][3]
           & qss[fs][(aPieces >> 32) & 255][4]
           & qss[fs][(aPieces >> 40) & 255][5]
           & qss[fs][(aPieces >> 48) & 255][6]
           & bob[fs]
           & notme;
        typ = WBMV;
        break;
      case WR:
        bb = qss[fs][(aPieces >> (fs & 56)) & 127][0]
           & qss[fs][(aPieces >> 8) & 255][1]
           & qss[fs][(aPieces >> 16) & 255][2]
           & qss[fs][(aPieces >> 24) & 255][3]
           & qss[fs][(aPieces >> 32) & 255][4]
           & qss[fs][(aPieces >> 40) & 255][5]
           & qss[fs][(aPieces >> 48) & 255][6]
           & rob[fs]
           & notme;
        typ = WRMV;
        break;
      case WQ:
        bb = qss[fs][(aPieces >> (fs & 56)) & 127][0]
           & qss[fs][(aPieces >> 8) & 255][1]
           & qss[fs][(aPieces >> 16) & 255][2]
           & qss[fs][(aPieces >> 24) & 255][3]
           & qss[fs][(aPieces >> 32) & 255][4]
           & qss[fs][(aPieces >> 40) & 255][5]
           & qss[fs][(aPieces >> 48) & 255][6]
           & notme;
        typ = WQMV;
        break;
      case WK:
        bb = kingMoves[fs] & notme;
        typ = WKMV;
        break;
      case BP:
        typ = fs >> 3;
        switch (typ) {
          case RANK1: break;
          case RANK2:
            bb = (bPawnMoves[fs] & empty) | (bPawnCapts[fs] & wPieces);
            typ = BPRO;
          break;
          case RANK3:
            bb = (bPawnMoves[fs] & empty) | (bPawnCapts[fs] & wPieces);
            typ = BSGL;
            break;
          case RANK4:
            bb = (bPawnMoves[fs] & empty) | (bPawnCapts[fs] & (wPieces | epbit[t][ply[t]]));
            typ = BCEP;
            break;
          case RANK5:
            bb = (bPawnMoves[fs] & empty) | (bPawnCapts[fs] & wPieces);
            BSGL;
            break;
          case RANK6:
            bb = (bPawnMoves[fs] & empty) | (bPawnCapts[fs] & wPieces);
            typ = BSGL;
            break;
          case RANK7:
            _BitScanForward64(&sq, bPawnMoves[fs] & aPieces);
            bb = (bPawnMoves[fs] & above[sq]) | (bPawnCapts[fs] & wPieces);
            typ = BPRO;
            break;
        }
        break;
      case BN:
        bb = knightMoves[fs] & notme;
        typ = BNMV;
        break;
      case BB:
        bb = qss[fs][(aPieces >> (fs & 56)) & 127][0]
           & qss[fs][(aPieces >> 8) & 255][1]
           & qss[fs][(aPieces >> 16) & 255][2]
           & qss[fs][(aPieces >> 24) & 255][3]
           & qss[fs][(aPieces >> 32) & 255][4]
           & qss[fs][(aPieces >> 40) & 255][5]
           & qss[fs][(aPieces >> 48) & 255][6]
           & bob[fs]
           & notme;
        typ = BBMV;
        break;
      case BR:
        bb = qss[fs][(aPieces >> (fs & 56)) & 127][0]
           & qss[fs][(aPieces >> 8) & 255][1]
           & qss[fs][(aPieces >> 16) & 255][2]
           & qss[fs][(aPieces >> 24) & 255][3]
           & qss[fs][(aPieces >> 32) & 255][4]
           & qss[fs][(aPieces >> 40) & 255][5]
           & qss[fs][(aPieces >> 48) & 255][6]
           & rob[fs]
           & notme;
        typ = BRMV;
        break;
      case BQ:
        bb = qss[fs][(aPieces >> (fs & 56)) & 127][0]
           & qss[fs][(aPieces >> 8) & 255][1]
           & qss[fs][(aPieces >> 16) & 255][2]
           & qss[fs][(aPieces >> 24) & 255][3]
           & qss[fs][(aPieces >> 32) & 255][4]
           & qss[fs][(aPieces >> 40) & 255][5]
           & qss[fs][(aPieces >> 48) & 255][6]
           & notme;
        typ = BQMV;
        break;
      case BK:
        bb = kingMoves[fs] & notme;
        typ = BKMV;
        break;
    }

    while (bb) {
      _BitScanForward64(&ts, bb);
      board[t][fs] = OO;
      piece[t][wtm[t]] ^= one << fs;
      piece[t][wtm[t]] ^= one << ts;
      adjtyp = adjTyp[typ];
      switch (adjtyp) {
        case WMOV:
          target = board[t][ts];
          mat[t] += pceval[target];
          piece[t][1 - wtm[t]] ^= shftv[target] << ts;
          board[t][ts] = pce;
          break;
        case WDUO:
          board[t][ts] = pce;
          epbit[t][ply[t] + 1] = (u64)(ts - fs == 16) << (fs + 8); // method from H.G.Muller
          break;
        case WEPC:
          sq = ts - ((u64)(epbit[t][ply[t]] == one << ts) << 3); // method from H.G.Muller
          target = board[t][sq];
          mat[t] += pceval[target];
          piece[t][1 - wtm[t]] ^= shftv[target] << sq;
          board[t][ts] = pce;
          break;
        case WPRM:
          target = board[t][ts];
          mat[t] += pceval[target];
          piece[t][wtm[t]] ^= shftv[target] << ts;
          board[t][ts] = WQ;
          mat[t] += 800;
          break;
        case BMOV:
          target = board[t][ts];
          mat[t] += pceval[target];
          piece[t][1 - wtm[t]] ^= shftv[target] << ts;
          board[t][ts] = pce;
          break;
        case BDUO:
          board[t][ts] = pce;
          epbit[t][ply[t] + 1] = (u64)(fs - ts == 16) << (fs - 8); // method from H.G.Muller
          break;
        case BEPC:
          sq = ts + ((u64)(epbit[t][ply[t]] == one << ts) << 3); // method from H.G.Muller
          target = board[t][sq];
          mat[t] += pceval[target];
          piece[t][1 - wtm[t]] ^= shftv[target] << sq;
          board[t][ts] = pce;
          break;
        case BPRM:
          target = board[t][ts];
          mat[t] += pceval[target];
          piece[t][wtm[t]] ^= shftv[target] << ts;
          board[t][ts] = WQ;
          mat[t] -= 800;
          break;
      }

      wtm[t] = 1 - wtm[t];
      ply[t]++;

      score = Qsearch(t, alpha, beta);

      ply[t]--;
      wtm[t] = 1 - wtm[t];

      mat[t] -= pceval[target];
      board[t][fs] = OO;
      piece[t][wtm[t]] ^= one << fs;
      piece[t][wtm[t]] ^= one << ts;

      switch (adjtyp) {
        case WMOV:
          board[t][ts] = target;
          piece[t][1 - wtm[t]] ^= (u64)shftv[target] << ts;
          break;
        case WDUO:
          board[t][ts] = target;
          piece[t][1 - wtm[t]] ^= (u64)shftv[target] << ts;
          epbit[t][ply[t] + 1] = 0;
          break;
        case WEPC:
          board[t][sq] = target;
          piece[t][1 - wtm[t]] ^= (u64)shftv[target] << sq;
          break;
        case WPRM:
          board[t][ts] = target;
          piece[t][1 - wtm[t]] ^= (u64)shftv[target] << ts;
          mat[t] -= 800;
          break;
        case BMOV:
          board[t][ts] = target;
          piece[t][1 - wtm[t]] ^= (u64)shftv[target] << ts;
          break;
        case BDUO:
          board[t][ts] = target;
          piece[t][1 - wtm[t]] ^= (u64)shftv[target] << ts;
          epbit[t][ply[t] + 1] = 0;
          break;
        case BEPC:
          board[t][sq] = target;
          piece[t][1 - wtm[t]] ^= (u64)shftv[target] << sq;
          break;
        case BPRM:
          board[t][ts] = target;
          piece[t][1 - wtm[t]] ^= (u64)shftv[target] << ts;
          mat[t] += 800;
          break;
      }
    }

    if (score != -INFINITY && (score > alpha || depth > 0)) {
      frsq[t][i] = fs;
      tosq[t][i] = ts;
      trgt[t][i] = target;
      type[t][i] = typ;
      scor[t][i] = score;
      i++;
    }

    pieces ^= (one << fs);
  } while (pieces);

  // castling code goes here

}
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through

bob
Posts: 20916
Joined: Mon Feb 27, 2006 6:30 pm
Location: Birmingham, AL

Re: Compiler Optimization Question

Post by bob » Mon Apr 13, 2020 2:26 am

Here's the simple answer from a compiler guy. Small functions are just fine. If they SHOULD be inlined, the compiler will do it during the compilation process. If it won't help, it won't. Not inlining can improve cache efficiency if the function is executed infrequently. Better to have the natural instruction stream sequential rather than broken up into blocks with interspaced dead code.

With today's optimizers, you can do pretty well to just write readable code you can understand and work on. Let the compiler do its thing.

For example

Code: Select all

if (condition) {

do some stuff 

}

Might be compiled as is. Or if the condition is mostly false, the compiler might turn that into this:

Code: Select all

if (! condition) 
  go to xx);
  getback:
  
  
  Somewhere else in memory:
  
  xx:
    do some stuff
    go to getback;
    
I used to have to write code like the latter in the early FORTRAN days, the optimizers were not very good. Today the compilers can do this so that your code looks readable, and then gets reorganized when converting from source to object code, leaving the source easy to read.

Yes, there are places you might do better. In particular if you have information the compiler doesn't have. IE you know that ALL the cases in your switch statement are covered explicitly. The compiler still has to check for the rest and skip all the way to the closing "}". You don't. That can eliminate instructions.

Post Reply