The mailbox trials

Discussion of chess software programming and technical issues.

Moderators: hgm, Dann Corbit, Harvey Williamson

User avatar
hgm
Posts: 26576
Joined: Fri Mar 10, 2006 9:06 am
Location: Amsterdam
Full name: H G Muller
Contact:

Re: The mailbox trials

Post by hgm » Sun Apr 11, 2021 9:26 pm

The puzzling thing is that (on my laptop, this time) using the same, complete list of moves for the leapers on every square is faster than using a per-square adapted list that only contains moves that stay on board. I saw the same thing on my PC with one of the earlier versions, when I first implemented the neighbor table. I blamed it on cache pressure then. But now I am starting to doubt that, as this version should use less memory than that previous one.

So it must have something to do with branch prediction working better when the lists of move steps always have the same length, even if that means they are in many cases unnecessarily long.
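To make this concrete, here is a stand-alone sketch (not the engine's actual tables; step values and names are illustrative) of such a fixed, zero-terminated leaper step list in 0x88 coordinates. Because every square uses the same full list, the loop-exit branch always fires after the same number of iterations, which is exactly the predictable pattern a branch predictor likes:

```c
#include <assert.h>

/* Hypothetical knight step list in 0x88 coordinates, zero-terminated.
   Every square uses this same list, so the loop always runs 8 times,
   regardless of how many of the targets are actually on the board. */
static const int knightSteps[] = { 31, 33, 14, 18, -31, -33, -14, -18, 0 };

/* Count the iterations the generation loop would perform. */
int CountSteps(const int *steps)
{
  int dir = 0, n = 0;
  int v = steps[dir];
  do {
    n++;                       /* body would probe board[from + v] here */
  } while((v = steps[++dir]));
  return n;
}
```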


Re: The mailbox trials

Post by hgm » Mon Apr 12, 2021 9:46 am

Interesting. The critical loop in AddMoves() for leapers was

Code: Select all

  int dir = firstCapt[u->piece-16];
  int v = steps[dir];
  do {
    int sqr = u->from + v;               // target square
    int victim = board[sqr];             // occupant of that square
    if(victim + 16 & 32) {               // rejects empty squares and edge guards
      attackers[victim] -= bit;          // set/clear the bit for this attacker
    }
  } while((v = steps[++dir]));
This was using the step lists for the leapers that were not differentiated by square (firstCapt[piece] instead of firstLeap[piece][square]), and thus always had the maximum number of moves, including those that went off board. Hence the more complicated test on 'victim'. The differentiated list would only have contained moves that stayed on board, and the test would have simply been if(victim) to weed out empty squares. That would still require a test instruction on 'victim', though, and now we have + and &, the latter acting as the test for the branch. So just one extra ALU operation, and plenty of CPU idle slots to squeeze that in next to all the load instructions. As remarked in the previous posting, using the undifferentiated step lists was actually a bit faster. (From 6.83 to 6.74 sec on the laptop.)
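The filtering test itself can be checked in isolation. A sketch, with the piece coding taken from these posts (empty square = 0, edge guard = 48, real pieces 16-47):

```c
#include <assert.h>

/* Piece coding as used in the posts: empty square = 0, edge guard = 48,
   real pieces are 16..47.  Then (victim + 16) & 32 is nonzero exactly
   for real pieces: 16..47 map to 32..63 (bit 5 set), while 0 maps to 16
   and 48 maps to 64 (bit 5 clear).  One add and one mask filter both
   empty squares and edge guards. */
int IsPiece(int victim)
{
  return (victim + 16 & 32) != 0;
}
```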

To investigate this further, I increased the size reserved for the attack map from 32 to 50 ints. On copy-make it would still only copy the relevant 32 elements for the pieces, but this adds space for 2 dummy elements for empty squares and edge guards (which have codes 0 and 48). Then I removed the condition on updating the elements altogether:

Code: Select all

  int dir = firstCapt[u->piece-16];
  int v = steps[dir];
  do {
    int sqr = u->from + v;               // target square
    int victim = board[sqr];             // occupant of that square
    attackers[victim] -= bit;            // set/clear the bit for this attacker
  } while((v = steps[++dir]));
So now it always updates 8 attack-map elements (for K and N; for P only 2); if the moves go off board it 'updates' the attackers set for an edge guard, if it hits an empty square that for empty squares. Both of which always contain garbage that is never looked at. So a lot of extra loads and stores, as many as for the 'worst case' of a centralized Knight that attacks/protects something with all its moves.
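A minimal sketch of this dummy-slot idea (array name hypothetical; the sizes and codes are the ones from the post: 50 elements, pieces at indices 16-47, scratch slots at 0 and 48):

```c
#include <assert.h>

/* The attack map gets two scratch elements, at index 0 (empty squares)
   and 48 (edge guards), so the update can store unconditionally: a
   write for a non-piece just lands in a dummy slot whose content is
   never read.  Only the 32 piece entries (16..47) are copied on
   copy-make. */
enum { MAPSIZE = 50, EMPTY = 0, GUARD = 48 };
int attackMap[MAPSIZE];

void UpdateAttack(int victim, int bit)
{
  attackMap[victim] -= bit;    /* branchless: dummies absorb the write */
}
```

This trades a few extra loads and stores for the removal of a hard-to-predict branch, which is the whole point of the measurement above.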

And guess what? This is faster! Time went down to 6.35 sec. That is about 7% faster than the version using the differentiated step lists, which I profiled, and which spent 20% of its time in AddMoves(). Gaining 7% by addressing a function that only used 20% must have caused an enormous acceleration of that function! Especially when you realize that I addressed only the leaper part of the function, and that not all moves are leaper moves. If 50% of the moves were leaper moves, those originally were responsible for 10% of the execution time. And now I cut 7% off that!

So it seems by far the most costly part of the loop was branch mispredictions. Both for the branch that had to predict whether the move would hit a piece or not, and for the branch that has to guess how many moves there are.

To exploit this fact I rewrote the code to use separate MovePiece() routines for each piece type, and an array of pointers to those functions that can be indexed by piece type to know what to call:

Code: Select all

typedef void MoveProc(UndoInfo *);

MoveProc *movePiece[] = {
  MoveWP, MoveWP, MoveWP, MoveWP, MoveWP, MoveWP, MoveWP, MoveWP, MoveN, MoveN, MoveSlider, MoveSlider, MoveSlider, MoveSlider, MoveSlider, MoveK,
  MoveBP, MoveBP, MoveBP, MoveBP, MoveBP, MoveBP, MoveBP, MoveBP, MoveN, MoveN, MoveSlider, MoveSlider, MoveSlider, MoveSlider, MoveSlider, MoveK,
};

// in the make-move part of SearchCapt()
    (*movePiece[u->piece-16])(u);
For the leaper functions I then just hardcoded the various board steps in an unrolled version of the loop, so that no table has to be consulted to know where to move to. E.g. for white Pawns:

Code: Select all

void MoveWP(UndoInfo *u)
{ // generate moves for the piece from the square, and add or remove them from the attack map
  int bit = attacker2bit[u->piece-16];  // bit in attackers set that represents moving piece
  attackers[board[u->from+15]] -= bit; // left forward
  attackers[board[u->from+17]] -= bit; // right forward
  board[u->from] = 0;
  board[u->to] = u->piece;
  Evacuate(u->from);
  attackers[board[u->to+15]] += bit;
  attackers[board[u->to+17]] += bit;
}
This gave no further improvement, though. It was an attempt to avoid the branch of the loop, which would have to predict whether we have to do 2 or 8 moves. But I guess the indirect branch through the jump table that has to select the function takes its place. The advantage of eliminating the load from the steps[] array could be too small to measure; it could very well be that the execution time taken by this function is already reduced to nearly 0.

As a more general remark: initially I was a bit dismayed by the idea that this particular task would have been better performed by encoding the move lists (differentiated by square) as bitboards. Because that would also provide a means of looping over the target squares, through bit extraction (bsf, plus conversion of the bit number to a 0x88 square number by sqr = bitnr + (bitnr & 070);). And this would have the advantage that you would never generate off-board moves. And that you could weed out the moves to empty squares by & with an 'occupied' bitboard, which you could maintain incrementally.
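For reference, the quoted conversion can be verified stand-alone. With rank-major bit numbering (a1 = 0), the 0x88 square is the bit number plus an 8-wide gap per rank, and (bitnr & 070) is precisely rank*8:

```c
#include <assert.h>

/* bitnr & 070 (octal 070 = 0x38) isolates the rank part of the bit
   number, i.e. rank*8.  Adding it to bitnr yields rank*16 + file,
   which is the 0x88 square number. */
int Bit2Square(int bitnr)
{
  return bitnr + (bitnr & 070);
}
```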

But the idea that this would be better was based on the assumption that skipping unneeded iterations was faster. But that assumption proves false. It is faster to do all iterations, replacing the unneeded ones by dummies. (Like in this case diverting their store operation to an unused attack-map element.)


Re: The mailbox trials

Post by hgm » Mon Apr 12, 2021 11:04 am

I now tried a similar 'trick' with the slider part of the MovePiece() routine (and did away with the dedicated routines for each piece type, as these did not seem to help, and thus would just cause unneeded pressure on the L1 instruction cache when the program gets more complicated). That is, I no longer use lists of sliding directions differentiated per square, but just the one for each piece type that has all the moves. The sliders had deviating lists only on edge squares anyway, and could slam into the edge from a distance too. So the old code was testing whether the target specified in the neighbor table was an edge guard, and I removed this test.

This indeed caused further speedup, to 6.00 sec. That is a 12% improvement (compared to the 6.83 sec) by re-implementing a function that before took 20% of the time!

The code for incrementally updating the attack map for the attacks by the player that moves (i.e. the part only done after the much simpler updating of the moves for the other player has shown that we will indeed be visiting a node where these moves are needed for move generation or evaluation) is now:

Code: Select all

void MovePiece(UndoInfo *u)
{ // generate moves for the piece from the square, and add or remove them from the attack map
  int bit = attacker2bit[u->piece-16];  // addsub is 1 or -1, and determines we set or clear
  if(code[u->piece] & 0200) { // slider
    int dir = firstCapt[u->piece-16];
    int d = steps[dir+14];                    // direction nr (+1)
    do {
      int sqr = neighbor[u->from].d[d-1];   // the directions were offset by 1 to avoid the 0 sentinel
      int victim = board[sqr];
      attackers[victim] -= bit;           // set/clear the bit for this attacker
    } while((d = steps[++dir+14]));
    board[u->from] = 0;
    board[u->to] = u->piece;
    Evacuate(u->from);
    dir = firstCapt[u->piece-16];
    d = steps[dir+14];                        // direction nr (+1)
    do {
      int sqr = neighbor[u->to].d[d-1];     // the directions were offset by 1 to avoid the 0 sentinel
      int victim = board[sqr];
      attackers[victim] += bit;           // set/clear the bit for this attacker
    } while((d = steps[++dir+14]));
  } else { // leaper
    int dir = firstCapt[u->piece-16];
    int v = steps[dir];
    do {
      int sqr = u->from + v;               // target square
      int victim = board[sqr];             // occupant of that square
      attackers[victim] -= bit;            // set/clear the bit for this attacker
    } while((v = steps[++dir]));
    board[u->from] = 0;
    board[u->to] = u->piece;
    Evacuate(u->from);
    dir = firstCapt[u->piece-16];
    v = steps[dir];
    do {
      int sqr = u->to + v;                 // target square
      int victim = board[sqr];             // occupant of that square
      attackers[victim] += bit;            // set/clear the bit for this attacker
    } while((v = steps[++dir]));
  }
}


Re: The mailbox trials

Post by hgm » Mon Apr 12, 2021 2:23 pm

I also have a Linux VM on my laptop, and managed to move the program there, so I could profile it. The gcc I have there apparently is a 64-bit one. This makes the program run significantly faster, even though it hardly uses 64-bit data types. The version with the differentiated move-step lists now takes 6.27 sec, that with the fixed move-step lists 5.56 sec. So 11.3% faster by also doing the map updates of the attacks on the dummies for edge guard and empty square.

Profiling the old version:

Code: Select all

  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 32.43      2.00     2.00 32820332     0.00     0.00  SearchCapture
 27.41      3.69     1.69 49996194     0.00     0.00  AddMoves
 17.51      4.77     1.08 21424467     0.00     0.00  Search
  7.62      5.24     0.47 30577986     0.00     0.00  PreUpdate
  3.89      5.48     0.24 59149691     0.00     0.00  Discover
  3.73      5.71     0.23  3573624     0.00     0.00  InterceptAttacks
  2.59      5.87     0.16  3573688     0.00     0.00  Occupy
  2.59      6.03     0.16  3584257     0.00     0.00  MakeMove
  1.30      6.11     0.08   126616     0.00     0.00  GenNoncapts
  0.65      6.15     0.04  3573624     0.00     0.00  UnMake
  0.16      6.16     0.01        2     0.01     0.01  InitNeighbor
  0.16      6.17     0.01                             PrintBoard
  0.00      6.17     0.00       89     0.00     0.00  Move2text
  0.00      6.17     0.00       66     0.00     0.00  PrintRow
  0.00      6.17     0.00        2     0.00     0.01  Setup
  0.00      6.17     0.00        1     0.00     0.00  Init
  0.00      6.17     0.00        1     0.00     0.00  MapFromScratch
  0.00      6.17     0.00        1     0.00     0.00  PrintMaps
  0.00      6.17     0.00        1     0.00     6.15  SearchRoot
AddMoves() even took 27.4% here. For the new version its task is mostly taken over by MovePiece(), although it is still called from MakeMove() to update the attack map for non-captures. This makes the comparison a bit difficult, because SearchCapture() has delegated the call to Evacuate() for updating the neighbor table (which the compiler spontaneously inlines) to MovePiece(). Anyway, the total time is shorter.

Code: Select all

  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 36.30      2.01     2.01 32820332     0.00     0.00  SearchCapture
 20.41      3.14     1.13 21424467     0.00     0.00  Search
 14.54      3.95     0.81 21424457     0.00     0.00  MovePiece
  6.50      4.31     0.36 30577986     0.00     0.00  PreUpdate
  5.51      4.61     0.31 59149691     0.00     0.00  Discover
  5.33      4.91     0.30  3573624     0.00     0.00  InterceptAttacks
  2.98      5.07     0.17  7147280     0.00     0.00  AddMoves
  2.71      5.22     0.15  3573688     0.00     0.00  Occupy
  2.35      5.35     0.13  3584257     0.00     0.00  MakeMove
  1.72      5.45     0.10   126616     0.00     0.00  GenNoncapts
  1.35      5.52     0.08  3573624     0.00     0.00  UnMake
  0.09      5.53     0.01        2     0.00     0.00  InitNeighbor
  0.09      5.53     0.01                             NewAddMoves
  0.09      5.54     0.01                             OldAddMoves
  0.09      5.54     0.01                             R
  0.00      5.54     0.00       89     0.00     0.00  Move2text
  0.00      5.54     0.00       66     0.00     0.00  PrintRow
  0.00      5.54     0.00        2     0.00     0.00  Setup
  0.00      5.54     0.00        1     0.00     0.00  Init
  0.00      5.54     0.00        1     0.00     0.00  MapFromScratch
  0.00      5.54     0.00        1     0.00     0.00  PrintMaps
  0.00      5.54     0.00        1     0.00     5.52  SearchRoot
Part of the tasks of SearchCapture() have been delegated to MovePiece(), though. In particular Evacuate(), responsible for updating the neighbor table. This routine itself doesn't occur in the profile, because the compiler spontaneously inlines it, just like its unmake counterpart Reoccupy(). The only routine for maintaining the neighbor table that is in the profile list is Occupy(). This definitely does more work than Evacuate(), but it still takes 3.5% of the total time even though it is called 6 times less often than MovePiece().

[Edit] I uploaded the latest source to http://hgm.nubati.net/mailbox7.c .


Re: The mailbox trials

Post by hgm » Mon Apr 12, 2021 7:01 pm

So the next routine to scrutinize is SearchCapture(), which is the combined make-search-unmake-minimax for captures, to which the Search() it calls furthermore outsources the determination of the MVV. (So that overkill pruning can be applied before making the move if it turns out no suitable MVV is available, so that the child would have nothing to do.) Since this routine had grown a bit long, I split it up, relocating two of its tasks into separate routines, which it then calls: GetMVV(), which will be responsible for the MVV determination (which logically belongs to the task of the child node) after the attack map has been 'semi-updated' (i.e. for the attacks of the upcoming side to move), and Finish(), which does the remaining make-search-unmake-minimax once we are committed to searching the move.

The top of the profiling list then becomes:

Code: Select all

  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 22.35      1.26     1.26 21424467     0.00     0.00  Search
 16.23      2.18     0.92 21424457     0.00     0.00  MovePiece
 15.26      3.04     0.86 29939960     0.00     0.00  GetMVV
  9.40      3.57     0.53 21424457     0.00     0.00  Finish
  8.69      4.06     0.49 32820332     0.00     0.00  SearchCapture
  7.45      4.48     0.42 59149691     0.00     0.00  Discover
  6.12      4.82     0.35 30577986     0.00     0.00  PreUpdate
Note that SearchCapture(), GetMVV() and Finish() are not called equally many times: SearchCapture() can abort before calling GetMVV() (but after the semi-update of the attack map) when it discovers the move is futile or illegal. Which apparently happens in 7.5% of the cases. This is a useful statistic for deciding how much it is worth investing in catching illegal moves through a dedicated test before the semi-update of the map (PreUpdate()). We see that 2% of the calls to SearchCapture() already abort before getting to the PreUpdate(), because of futility or failure to resolve a pre-existing check. (The latter is easy to test for: the capture to search must capture the (only) checker or move the King.) The remaining 5.5% of the captures expose the King with the move, either by capturing a protected piece with the King or by moving a pinned piece off the pin ray. So to break even, a test to intercept that before the update must not be more costly than 5% of the update.

Only about 2/3 of the captures get to the Finish() stage; the remaining 1/3 is 'overkill pruned', returning a fail-high score without search, because the determined MVV isn't worth enough to change that fate.

The code of these three parts of SearchCapture is this:

Code: Select all

int GetMVV(UndoInfo *u, int victimBit)
{
    int todo, worseThreats, futMask, index;
    int oppoPresence = presence[stm];
    int gap = u->alpha - MARGIN - 10 - u->pstEval;     // futility gap child will have to bridge with a capture

    if(u->newThreats & victimBit) {                    // Yeghh!: victim could have been pin anchor, and collect attack meant for piece in discovery
      u->newThreats |= victim2bit[u->piece-16];        // displace it to the mover
    }
    u->newThreats &= vPresence[COLOR-stm];
    
    // use score gap to create a victim-set mask for non-futile victims (kludgy!)
    index = gap >> 3; index = (gap > 0 ? index : 0); index = (gap > 1023 ? 127 : index);
    futMask = nonFutile[gapOffset[index] + (gap & 3)]; // victim set of non-futiles
    
    worseThreats = u->newThreats - 1 & ~u->newThreats; // all more valuable pieces than what the move attacks
    worseThreats &= futMask;                           // remove the futile targets
    todo = u->existingThreats | u->undetectedThreats;  // set of potential victims from pre-existing attacks
    todo &= worseThreats;                              // only keep the non-futile more valuable than the newly attacked
    todo &= ~victim2bit[u->piece-16];                  // remove the moved piece; attacks on it now hit thin air
    while(todo) {
      int bit; BSF(todo);                              // get next (MVV order)
      int victim = COLOR-1 - bit;                      // calculate piece number
      int bitMask = 1 << bit;                          // bit representing this victim in victim set
      int attacks = attackers[victim-SZ];              // pre-existing attacks on it (from old attack map)
      u->undetectedThreats &= ~bitMask;                // this threat will now be revealed
      if(attacks & u->oldPresence) {                   // it was indeed a threat
        u->existingThreats |= bitMask;                 // record this fact (so sibbling nodes can use it)
        if(attacks & oppoPresence) break;              // the threat still exists => MVV identified
      }
      todo -= bitMask;                                 // this one done; remove to extract next
    }
    todo |= u->undetectedThreats;                      // unconfirmed threats must also be tried by child
    todo &= ~victim2bit[u->piece-16];                  // except the old ones on the mover (the new ones would be in newThreats)
    todo |= u->newThreats;                             // add the newly created threats
    todo &= futMask;                                   // and remove the futile victims it might have contained
    
    if(!todo && gap > 0) {                             // opponent has no non-futile move
      u-> parentAlpha = -u->alpha;                     // fail-high score (-u->alpha is parent's beta)
      attackers -= SZ; presence[stm] = u->oldPresence; // undo updates
      u->searched++; nodeCnt++; qsCnt += (u->depth <= 0);
      return 1;                                        // and abort move ('overkill pruning')
    }
    
    u->targets = todo;                                 // pass set of victims that should be captured to child
    return 0;
}

int Finish(UndoInfo *u, int victimBit);

int SearchCapture(UndoInfo *u)
{ // make / search / unmake and score minimaxing all in one
  int victimBit = victim2bit[u->victim-16];
  u->victimMask = attacker2bit[u->victim-16];               // save for unmake

  if(u->inCheck && u->inCheck & ~u->victimMask              // we were in check, and did not capture checker,
                && u->piece != COLOR + 15 - stm) return 0;  // nor moved the King => move is illegal

  // decode move
  u->to = location[u->victim];

  // detect futility at earliest opportunity
  u->pstEval = u->oldEval + PST(u->victim, u->to);
  if(u->depth <= 0 && u->pstEval < u->parentAlpha - MARGIN) return 1; // futility (child will stand pat)

  // finish decoding move and updating evaluation
  u->from = location[u->piece];
  u->pstEval = -(u->pstEval + PST(u->piece, u->to) - PST(u->piece, u->from));

  // update hash key, and possibly abort on repetition
  u->hashKey =   u->oldKey  ^ KEY(u->piece, u->to) ^ KEY(u->piece, u->from) ^ KEY(u->victim, u->to);

  // semi-update of the attack map (copy-make)
  memcpy((void*) (attackers + WHITE + SZ), (void*) (attackers + WHITE), 128);
  attackers += SZ;
  PreUpdate(u);

  if(attackers[COLOR+15-stm] & presence[stm]) { attackers -= SZ; presence[stm] = u->oldPresence; return 0; } // illegal move, abort

  if(u->depth <= 1 SPOIL) { // overkill pruning: test if child will be futile
    if(GetMVV(u, victimBit)) return 1;
  } else u->targets = vPresence[COLOR - stm]; // at higher depth we let child try all pieces that are present (as we could not prune anyway)

  // committed to move 
  return Finish(u, victimBit);
}

int Finish(UndoInfo *u, int victimBit)
{
  int score;

  u->searched++;
  {
    // finish map update and apply move to board
    location[u->piece]  = u->to;
    location[u->victim] = CAPTURED;
    vPresence[stm] -= victimBit;
    MovePiece(u);
    Discover(u->moverSet & presence[COLOR-stm], u->from);

    path[ply++] = move; captCnt[0]++;

    u->beta = -u->parentAlpha; // beta for child
    score = - Search(u);

    // unmake move
    board[u->from] = u->piece;
    board[u->to]   = u->victim;
    location[u->piece]  = u->from;
    location[u->victim] = u->to;
    presence[stm] = u->oldPresence;
    vPresence[stm] += victimBit;
    Reoccupy(u->from);
    attackers -= SZ;
    ply--;
  }

  // minimax the score
  if(score > u->parentAlpha) {
    int *p;
    u->parentAlpha = score;
    if(score + u->alpha >= 0) { // u->alpha = -parentBeta!
      return 1;
    }
    p = pvPtr; pvPtr = u->pv;   // pop old PV
    *pvPtr++ = move;            // push new PV, starting with this move
    while((*pvPtr++ = *p++)) {} // and append child PV
  }
  return 0;
}
I am a bit surprised Finish() takes that much time; it doesn't seem to do all that much. A few assignments to array elements, a few function calls. No branches except in the minimaxing part. (But moves should not score above alpha that often, should they?) The compiler seems to inline the Reoccupy() function (which restores the neighbor table), and this contains a loop. (But a very simple one, with only 4 iterations.)

The remaining (non-delegated) part of SearchCapture() also seems quite simple, and takes less time than Finish(), if you take into account it has 1.5 times more calls. Even though the memcpy call it contains (for the copy-make of the attack map) is inlined, as 16 loads and 16 stores of a 64-bit register.

GetMVV() is the most expensive part. That is good, because it is conceivable that the method here can still be improved. (Not much can be improved on minimaxing the score, or updating the hash key...) So this will be our next target.

For example, the AddMoves() / MovePiece() that record the new moves of the moved piece in the attack map can very easily keep a record of this in a victim set. They already calculate all the victims to address the attackers sets of those, and can easily set a bit 1<<(victim&31) in a victim set kept in a register, in parallel with all the loads and stores to the attack map. These will be pre-existing threats in the child. The routine for discovering slider moves already keeps such a record, currently only used when discovering enemy slider moves as newThreats (which will be targets in the child). The similarly returned set of attacked victims through discovery of friendly sliders will be existingThreats in the child too.

So always starting a node with existingThreats = 0 and undetectedThreats = everything present is not really making the most of things. The parent could pass a lot of existingThreats it practically gets for free, just like it passes the 'targets' on which captures should be tried (in the UndoInfo struct). Those could immediately be removed from undetectedThreats.

Also of help can be what the parent learned from generating captures, namely the pieces it could not capture. If the MVV in the parent was a Bishop, it knows that King, Queen and Rooks were not attacked. So these pieces can be removed from the undetectedThreats in the child, and will only be attacked there by freshly created attacks of the side that is moving. If those only attacked a Knight and a Pawn (say), the GetMVV() function in any SearchMove() called by the child can save itself the trouble of testing for attacks on K, Q or R; they are already known not to exist.

In other words, we should not initialize existingThreats and undetectedThreats at the start of Search(), but pass it to there from the calling SearchMoves(), which gets the newThreats intended for the child as a side effect of the attack-map update it does, and the undetectedThreats from the confirmedCaptures of the parent.


Re: The mailbox trials

Post by hgm » Tue Apr 13, 2021 12:13 pm

Keeping Track of Attacked Pieces

Because things were getting more complex than is healthy, the following thought occurred to me: what if we just try to keep an incrementally updated summary of attacked pieces? In the format of a 'victim set', i.e. one bit per piece to indicate if that piece is attacked, such that they extract in order of decreasing value. This would then give us the MVV by a single bit extraction. Let's store these in an array attacked[color] indexed by the color of the attacking pieces.

Each time we modify an attack-map element we would then have to test whether this ends up empty or not, and modify the corresponding bit in attacked[] accordingly. Attack-map elements are modified during the incremental update, through AddMoves() to delete or add moves of the moved piece, and in the routine that discovers slider attacks. These typically use a statement attackers[victim] += bit, and the routines would execute a whole bunch of such statements for various victims. We could accompany each such statement with a:

set |= 1 << VictimBitnr(victim);

This would shift a 1 to the position of the bit for the victim in a victim set. With clever piece encoding the conversion of victim piece number to bit number would be a very simple calculation (rather than a table lookup), like BLACK_KING - victim in the current scheme. Or if we adopt a really clever piece-numbering scheme, perhaps just victim itself. (To simplify what seems to be the performance-critical task.) The variable 'set' would be kept in an (initially zeroed) register, so this extra accounting only needs the loading of a constant 1, a shift and an OR. Which can probably be done in parallel with the load and store instructions, without taking extra time. After the loop 'set' would then contain the set of newly attacked victims, which the routine could return as its result.
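As a sketch of how such a loop would look (names and the BLACK_KING value are assumptions for illustration; the posts only specify the BLACK_KING - victim mapping):

```c
#include <assert.h>

/* Hypothetical sketch: while adding the attacks of a piece to the map,
   accumulate in a register which victims it attacks, one bit each.
   BLACK_KING - victim is the bit-number mapping suggested in the post;
   BLACK_KING = 47 assumes the 16..47 piece coding used earlier. */
enum { BLACK_KING = 47 };
int attackMap[64];

unsigned AddAttacks(const int *victims, int n, int bit)
{
  unsigned set = 0;
  for(int i = 0; i < n; i++) {
    int victim = victims[i];
    attackMap[victim] += bit;                 /* normal map update      */
    set |= 1u << (BLACK_KING - victim);       /* bookkeeping, in a reg  */
  }
  return set;                                 /* newly attacked victims */
}
```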

The above is in fact already done in the enemy-slider-discovery routine, to create the newThreats set of newly attacked victims that are MVV candidates for the child node. We could also do it in AddMoves() for the benefit of the grandchild, and OR these together with the victims of friendly slider discovery. Let's call the result 'futureThreats'.

For deleting attacks it is more tricky, as that doesn't automatically make the victim unattacked. So we would have to test if the attackers set is zero or not after subtracting 'bit' from it:

set |= ((attackers[victim] & oppoPresence) == 0) << VictimBitnr(victim);

That is, the result of the & operation (on the attackers set that would be left in a register after the ' -= bit') would be converted to a truth value 0 or 1, and we shift that instead of always a 1. The comparison with zero is actually an automatic side effect of the '-= bit' operation, and the CPU has an instruction to shuttle it as 0 or 1 to a register. So that instruction would take the place of loading the constant 1. The victim set this creates will be called 'unattackeds'.
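A stand-alone sketch of this deletion case (all names hypothetical, same assumed piece coding as before): the shifted-in value is the 0/1 outcome of the emptiness test, so a victim that still has opposing attackers contributes no bit.

```c
#include <assert.h>

/* After removing one attack, the victim only becomes unattacked if no
   opposing attacker is left in its attackers set, so we shift the 0/1
   result of that test instead of a constant 1. */
enum { BLACK_KING = 47 };
int attackMap[64];

unsigned RemoveAttack(int victim, int bit, unsigned oppoPresence)
{
  attackMap[victim] -= bit;                       /* delete the attack */
  return (unsigned)((attackMap[victim] & oppoPresence) == 0)
         << (BLACK_KING - victim);                /* 0 or the victim bit */
}
```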

So as a basically free side effect of the incremental update we would get two sets of newly attacked pieces (one for each attacker color) and two sets of unattacked pieces. We could use those sets to update attacked[] for the attacker color they apply to:

attacked[movingPlayer] &= ~unattackeds;
attacked[movingPlayer] |= futureThreats;
attacked[movingPlayer] &= presence[movingPlayer]; // remove protections that sneaked in (through slider discovery)

Because we add moves only after having deleted all moves that need deletion, this would give no conflict if a victim is contained in both sets. Or if a victim would be referred to twice during deletion or addition of attacks. For the other player we would have

attacked[opponent] |= newThreats;
attacked[opponent] &= presence[opponent]; // removes the captured piece (and protections)

We would still have to update the bit for the moved piece separately, depending on whether the piece it captured was protected.

There is one show-stopper, though: when we capture a piece, we don't remove the attacks it exerted from the attack map, but just remove the captured piece from the presence mask. So the pieces that used to be attacked by this captured piece will not automatically end up in the 'unattackeds' set, unless we start to explicitly delete the attacks of that piece from the map too (i.e. an extra call to AddMoves(), which is expensive). This means the attacked[] sets updated according to the above scheme are more like 'could-be-attacked' sets, where some of the pieces in them on closer inspection would turn out to have no valid attackers anymore, because their listed attacker has already been captured.

But even with such 'partial info', capture generation could be highly accelerated. It would extract the next-highest could-be-attacked piece, merge its attacks with those of some other piece of the same value group, and if after masking with the presence mask it would turn out there are no attacks after all, it can clear the bits for these pieces from the (could-be-)attacked set. This prevents any of its children from making the same mistake of trying to generate captures on an apparently already captured piece.

For the first victim (the MVV) this verification process would be 'outsourced' to SearchCapture(), the GetMVV() part, which would do it after the semi-update of the map. (So it can cheaply abort if no suitable MVV is found.) It would loop through the attacked[] set (perhaps in pairs, to reduce the number of branches) by bit extraction. This extraction would identify the next-highest 'could-be-attacked' piece (or pair), which would then be verified against the (already updated) presence mask for the existence of the alleged attack. If the piece/pair was attacked, GetMVV() has completed its task, and can return the 'go on' code. If not, it clears the piece/pair from attacked[], and extracts the next-highest remaining piece to repeat the process. If the attacked[] set runs empty, GetMVV() returns the 'abort' code, which would make the caller (SearchCapture()) exercise overkill pruning, aborting the move with a fail-high score.

Code: Select all

int todo = attacked[xstm]; // stm here is already set for the child; xstm is his opponent
while(todo & futMask) { // only consider non-futile victims
  int bit = BSF(todo) & ~1; // align on the even bit of the pair
  int victim = bitnr2victim[bit]; // or calculate, e.g. as BLACK_KING - bit
  int attacks = attackers[victim] | attackers[victim+1]; // test in pairs for speed
  if(attacks & presence[stm]) { attacked[xstm] = todo; return 0; } // MVV is in this pair, signal success
  todo &= ~(3 << bit); // clear the two bits for the now excluded pair
}
// no suitable MVV found
attacked[xstm] = todo; // update for excluded pairs
return 1; // ask for overkill pruning
The loop would only have to treat pieces that have a strong suspicion against them of being attacked, because one of the nodes on the path to them is known to have created attacks on them, which were not known to be resolved by moving the attacker (or the piece) away. They might have been resolved by capture of the attacker, in case it was the only attacker (or all attackers happened to be captured). And because we work in pairs, it would still not hurt if the other member of the pair was and has remained attacked. That we occasionally find an unattacked pair amongst the 'selected few', and have to retry, should be much less work than always removing all attacks of the captured piece, where for most of those this might not make any difference (because there are other attackers targeting the pair), or we might never want to know (because the search never attempts to capture those victims).

When GetMVV() has confirmed the existence of a non-futile capture in the child, making of the capture leading to it will be finished, and the child is guaranteed to find a piece worth capturing by simply extracting the lowest bit from attacked[xstm]. That should save a lot of time both in GetMVV() and Search().

[Edit] A remaining technical problem is that the loops in AddMoves() also treat empty squares and edge guards as victims. We must make sure these would not be recorded in 'set' as possible victims, as there are no bits available for those in a victim set that could be used as dummies. On the Intel architecture the shift count is interpreted modulo 32 (for 32-bit ints), so 1<<33 would be the same as 1<<1. This would cause empty squares or edge guard 'victims' to set the bit reserved for a real piece.

To prevent that we could use a 64-bit shift 1LL<<victim (and proper encoding for the pieces/guards/empty squares) to shift the 1 bit out of the lowest 32 bits for the targets to be ignored. Another method would be to make the unwanted targets collide with Kings. (E.g. wK=0, bK=1, wQ=2, ... empty=32, guard=33.) This would falsely include the Kings in the attacked sets. But these sets would in general not be used for testing whether the King is attacked, because such an attack would have to lead to another action than for other pieces (such as aborting the node with a fail high when the stm can capture one). The tests for that currently all work by directly examining the King's attackers sets. So we can simply mask the two lowest bits from the victim sets (like newThreats) away (& ~3), before using them in an extraction loop to decide which pieces we will search captures on. Yet another method would be to replace the constant 1 in "set |= 1<<victim;" by a test for whether the 'victim' is improper, like "set |= (victim < 32) << victim;". And for loops that delete attacks, which already have a test there (set |= ((attackers[victim] & oppoPresence) == 0) << victim), we could make sure the improper victims never cause any harm by setting their (dummy) attackers[] elements to 0xFFFFFFFF, so that deleting one attack from them would never make it hit 0. (Note that oppoPresence would contain at least a King, and that we would never want to remove an attack on that King because we move the piece attacking it elsewhere; if such an attack exists we would 'cash' it by declaring victory by King capture.)

Dann Corbit
Posts: 12232
Joined: Wed Mar 08, 2006 7:57 pm
Location: Redmond, WA USA
Contact:

Re: The mailbox trials

Post by Dann Corbit » Wed Apr 14, 2021 5:24 am

This stuff is really interesting.
I do have one request:
Post each incremental update of the generator after each change.
Then (being lazy, as I am) I can do diffs and consider those when I read your posts.

BTW, the most interesting sequence in years.
I do hope it somehow makes it to the chess programming WIKI
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.

User avatar
hgm
Posts: 26576
Joined: Fri Mar 10, 2006 9:06 am
Location: Amsterdam
Full name: H G Muller
Contact:

Re: The mailbox trials

Post by hgm » Wed Apr 14, 2021 6:41 pm

One problem is that the development is not linear. I try a lot of stuff that seems like it should work, but then in practice just slows things down. Then I revert those, but some of the modifications that were neutral stay in. So by the time I have found something that does work, the diff compared to the previously published version is quite large.

These modern CPUs are really very temperamental beasts. It is hard to predict what will work and what will not. Obvious reductions of the amount of work they would have to do often lead to slower execution. I guess very much depends on branch prediction, and it is very hard to fathom which branches will in practice be easily predictable, and which not.

I also discovered that the gcc I am using now (on Ubuntu 10.04) is stupid. It doesn't understand that in table[piece-16] (which by C definition is *(table + piece - 16)) the table-16 (if table is an array) is a constant expression, which should be calculated at compile time. It really emits subl $16 instructions for that. And I used this syntax a lot in my program, as my piece numbers run from 16 to 47, and it seems a waste to include 16 dummy elements below that in many tables. I now changed that to (table-16)[piece] everywhere (which is something the compiler does understand). A micro-optimization, but still good for 1-2% speedup, just by a change of code style.

Most of the time, trying to be smart to save work backfires. I had this nice code for detecting pre-existing attacks on the stm's pieces as they were needed, and not earlier. But making the decision for what should be calculated or not very easily turns out to be more expensive than the calculation itself. I had hoped that keeping track of newly attacked pieces (as bit flags in an int) during generation of these attacks for the benefit of the attack map (which has them distributed over many different map elements) would be nearly free, and could be beneficial for MVV determination. But it wasn't as free as I had hoped (can these modern CPUs perhaps do 2 loads from L1 per clock?), and couldn't earn back its overhead.

What did in fact work was to just pre-emptively detect attacks on Q and 2R as a group (first ORing their attackers sets together, before testing with the presence mask for opponent attackers), and exclude those from the undetectedThreats (branchless, with a conditional move) if no such attacks existed. Also, testing pieces in the undetectedThreats set individually (to get them out of there, and possibly promote them to existingThreat) didn't pay. I just do that in groups of 4 now (again first ORing their attackers sets together, before testing anything). I also refactored the code a bit; the detection of (soft-)pins is now done 'non-destructively', i.e. without modifying the attack map. The advantage was that the copy (for copy-make) of the attack map could be moved until after the decision for overkill pruning (and thus be executed 1.5 times less frequently). The disadvantage is that the same, rather complex loop has to run a second time for really updating the map. But this could be merged with the loop for discovery of friendly sliders for the map update, by leaving pieces of both colors in the attackers set of the evacuated square. All these slight improvements together brought the execution time down to 5 sec on my laptop.
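The grouped test with branchless exclusion could look something like the following sketch (all names and piece indices here are made up for illustration):

```c
#include <stdint.h>

/* Hypothetical indices for the major pieces in the threat bitsets. */
enum { R1 = 12, R2 = 13, Q = 14 };
#define MAJOR_BITS ((1u << R1) | (1u << R2) | (1u << Q))

/* OR the three attackers sets together first, test once against the
   opponent's presence mask, and drop all three pieces from the
   undetected-threats set when the group is clean. The ?: carries no
   extra branch; compilers typically emit a cmov for it. */
static uint32_t pruneMajors(uint32_t undetectedThreats,
                            const uint32_t *attackers,
                            uint32_t oppoPresence) {
    uint32_t group = (attackers[Q] | attackers[R1] | attackers[R2])
                   & oppoPresence;
    return group ? undetectedThreats
                 : undetectedThreats & ~MAJOR_BITS;
}
```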

User avatar
hgm
Posts: 26576
Joined: Fri Mar 10, 2006 9:06 am
Location: Amsterdam
Full name: H G Muller
Contact:

Re: The mailbox trials

Post by hgm » Wed Apr 14, 2021 10:18 pm

Just for the record, this is the latest profile:

Code: Select all

  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 21.58      1.10     1.10 24766537     0.00     0.00  Search
 17.46      1.99     0.89 21869485     0.00     0.00  MovePiece
 13.54      2.68     0.69 21869485     0.00     0.00  Finish
 13.14      3.35     0.67 32777822     0.00     0.00  SearchCapture
 11.38      3.93     0.58 29900523     0.00     0.00  GetMVV
  5.00      4.19     0.26  3573126     0.00     0.00  InterceptAttacks
  3.92      4.39     0.20 21869485     0.00     0.00  NewDiscover
  3.33      4.56     0.17  3573190     0.00     0.00  Occupy
  3.14      4.72     0.16  3574997     0.00     0.00  MakeMove
  1.77      4.81     0.09 21869485     0.00     0.00  NewPreUpdate
  1.77      4.90     0.09   117866     0.00     0.00  GenNoncapts
  1.57      4.98     0.08  3573126     0.00     0.00  UnMake
  1.18      5.04     0.06  7146284     0.00     0.00  AddMoves
  1.18      5.10     0.06  7146252     0.00     0.00  Discover
  0.00      5.10     0.00   117866     0.00     0.00  SearchNonCapts
I split off the conventional loop over the move list for the non-captures from Search() (into what is now called SearchNonCapts()), to get a better idea of where the time went. But virtually no time went there. The listed time for Search() is all for identifying the capture to search next. And even though it is helped a lot in this by its caller, SearchCapture(), which identifies the MVV, it apparently still finds this a hard task, as it still tops the list with over 20%. On an absolute scale: it uses 1.1 sec for about 25M calls, so 45ns/call.

Somehow this feels disappointing. My laptop runs at 2.2GHz, so that is 100 clocks, or up to 400 instructions. From the ratio of the number of calls of SearchCapture() and Finish() I can see that about 1/3 of the captures gets overkill-pruned, which causes it to fail high. Of course the captures that are searched can also produce a fail high. So the average number of moves searched per QS node cannot be more than 3. It seems to me that it shouldn't take 100 instructions to find a move, if the moves just have to be extracted from a bit set.

Of course I suspect mispredicted branches. So the goal should be to make branch-poor code for Search(). Currently I have many extraction loops, because the larger value groups (minors and Pawns) get the merged capture sets for a pair of them broken up in many pieces, each treated in a separate extraction loop. And each extraction loop will cause a mispredict on terminating.

I guess it is essential to make the extraction loops work on move sets that are as large as possible. No matter how many shifts and Boolean operations it takes to juggle the bits into a single word in the right order, as long as it involves no branches it would be faster. I suppose modern instructions like pdep could be very helpful here, but my laptop doesn't support those. Since it does have a 64-bit compiler, though, I could try to use uint64_t for merging the capture sets by shifts and Boolean logic. All PxP captures could be packed into a single 64-bit int by taking the merged attackers sets for the 4 pairs, and doing

uint64_t total = (set1 + (set2 << 32ull) & 0xFFFF0000FFFFull) + ((set3 << 16) + (set4 << 48ull) & 0xFFFF0000FFFF0000ull);

The other x P captures could likewise be packed. That is three <<, three + and two &, with 4 loads, which should be possible in 4 clocks = 2ns. That sounds very affordable. The captures on the minors could all be packed into a single 64-bit int. This would be a bit more involved, as we were splitting those in 3 groups (P x minor, minor x minor, and major x minor). Packing the captures on majors is very easy: just shift the attackers set of Q behind that of the interleaved one for the Rooks. So 4 extraction loops should be sufficient.
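For the majors this could be sketched as follows (hypothetical names; setR is assumed to be the already interleaved R1/R2 attackers set, setQ the Queen's), together with the single extraction loop that would drain the packed set:

```c
#include <stdint.h>

/* Concatenate the Queen's attackers behind the interleaved Rook set,
   so one extraction loop handles all captures on the majors. */
static uint64_t packMajors(uint32_t setR, uint32_t setQ) {
    return (uint64_t)setR | ((uint64_t)setQ << 32);
}

/* One branch-terminated loop over the packed 64-bit set; the low
   5 bits of the extracted bit number identify the attacker, the
   high bit would select R vs Q as the victim group. */
static int extractCaptures(uint64_t total, int *attackerOut) {
    int n = 0;
    while (total) {
        int bit = __builtin_ctzll(total); /* BSF on the 64-bit word */
        attackerOut[n++] = bit & 31;
        total &= total - 1;               /* clear extracted bit */
    }
    return n;
}
```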

User avatar
hgm
Posts: 26576
Joined: Fri Mar 10, 2006 9:06 am
Location: Amsterdam
Full name: H G Muller
Contact:

Re: The mailbox trials

Post by hgm » Thu Apr 15, 2021 9:02 am

Piece-Pair Encoding

This morning I got a new idea. By keeping the encoding of pieces such that pairs of equally valued pieces have consecutive numbers, you can load the 32-bit attackers sets of the pair of them in a single 64-bit operation. The encoding I am currently using (16=PPPPPPPPNNBBRRQKppppppppnnbbrrqk=47) already satisfies this requirement. The bit assignment in the attackers sets is 0=PpPpPpPpPpPpPpPpNnNnBbBbRrRrQqKk=31, to extract in LVA order, and leave room (after masking away the unwanted color and all captured pieces in one swoop) to interleave two sets, for combined LVA-order extraction.

Up to now I tested for attacks on a value group by ORing all their attack sets together, and testing the result with the 'presence' mask for the color. E.g. for the group of minors: (attackers[N1] | attackers[N2] | attackers[B1] | attackers[B2]) & presence[stm]. That is 5x load, 3x | and 1x &. When we do this with 64-bit arithmetic it becomes (att64(N1) | att64(B1)) & presence64[stm]: 3x load, 1x | and 1x &. (And often the presence mask will be re-used in iterations of a loop, so that it is already in a register.) Here att64 could be #defined as

#define att64(X) *(uint64_t*) (&attackers[X])

and att64(N1) would get attackers[N2] piggy-backed on it in the high-order 32 bits. We would have to maintain the presence mask as a 64-bit mask with two equal halves.

This would be just for testing for attackers (to find the MVV). To actually extract the captures you would have to interleave the two halves, which after the load are merely concatenated. This can be done (after &= presence64[stm]) by *= 0x200000001ull and >>= 32 (or *= 0x80000001 and >>= 31 for the other color).

For the minors we would have to move four attackers sets, and we have to create room for the interleaving, as it wasn't any naturally in the attackers sets. We could do this as follows:

Code: Select all

setN = (att64(N1) & presence64[stm])*0x200000001ull >> 32;
setB = (att64(B1) & presence64[stm])*0x200000001ull >> 32;
set1 = setN * 0x100010000ull & 0xFF000000FFFF0000ull; // align Pawns and majors
set2 = setB * 0x1000001ull   & 0x00FF00000000FFFFull;
total = set1 | set2;
total |= (setB & 0xFF0000) << 16 | (setN & 0xFF0000) << 24; // select and align minors
The PRQK alignment is a bit tricky; rather than shifting the Pawn and majors sections after masking to their desired bit position, and add them together, we do both shift and add before masking with a multiply, and mask afterwards. The following shows this procedure schematically. (Each letter represents a group of 4 pieces; pieces we do not want to end up in the result are written in lower case):

Code: Select all

................................KKQQRRRRBBBBNNNNPPPPPPPPPPPPPPPP B attackers
................................KKQQRRRRBBBBNNNNPPPPPPPPPPPPPPPP N attackers
........KKQQRRRRbbbbnnnnpppppppppppppppp                         (B<<24 =   B*0x1000000)
................................kkqqrrrrbbbbnnnnPPPPPPPPPPPPPPPP (B<<0  =           B*1)
KKQQRRRRbbbbnnnnpppppppppppppppp                                 (N<<32 = N*0x100000000)
................kkqqrrrrbbbbnnnnPPPPPPPPPPPPPPPP                 (N<<16 =     N*0x10000)
................xxxxxxxxxxxxxxxx................................ kept open for minors
The shifts that happen in the process of multiplying are not large enough to pull the entire sets of 32 attackers apart. So there is an actual addition going on in the overlapping range. Which is the range we mask away afterwards. But there is the risk of carry propagation into the areas of interest (the majors).

In this case the bits for the majors are quite far away from the overlapping area. But there is no limit to the distance of carry propagation; it propagates as long as it encounters 1 bits. In both cases (N and B) the intervening pieces include the minors, however. And since it is not possible for a piece to be attacked by all Bishops because of color binding, we can be sure the carry will not propagate beyond the minors into the majors fields. So the merging takes 4x *, 6x &, 4x shift and 3x |.

For the Pawn victims we first interleave the Pawns in pairs, similar to obtaining setN and setB above, except that we would then have 4 such sets. One of the sets will not have to be shifted, because the Pawn attackers in it will happen to end up in the right location (bits 32-47). The others will have to be shifted (<<16, >>16 and >>32):

Code: Select all

setP1 = (att64(P1) & presence64[stm])*0x200000001ull >> 32;
setP2 = (att64(P3) & presence64[stm])*0x200000001ull;
setP3 = (att64(P5) & presence64[stm])*0x200000001ull;
setP4 = (att64(P7) & presence64[stm])*0x200000001ull >> 16;
setPxP = (setP1 | setP2) & 0xFFFF0000FFFFull; // leave Pawn attackers
setPxP |= (setP3 << 16 | setP4) & 0xFFFF0000FFFF0000ull;
The sets can first be combined in two non-overlapping pairs, before masking out the non-Pawn attackers. The latter could be combined in a similar way. I am still of the opinion that they can simply be concatenated, and that enforcing LVA order there by interleaving is a waste of time. Whether SEE for them will be positive or not will depend on whether they are protected, which poorly correlates with the value of the attacker.

Code: Select all

setXxP = (setP1 | setP2) & 0xFFFF0000FFFF0000ull;
setXxP |= (setP3 >> 24 | setP4) & 0xFFFF0000FFFFull;
So that is 4x load, 4x *, 3x shift, 6x & and 3x | for the PxP, and then an additional 1x shift, 2x * and 3x | for the other x P. And presumably 4 reloads, as PxP captures might have been searched in the mean time, spoiling the register contents. This could be an argument for calculating setXxP immediately, even though we do not need it yet. The 6 extra operations can be done in 2 clock cycles, and then we only have to save one set for later use (setXxP) instead of four (setP1-setP4). Then two loops for extracting captures from setPxP and setXxP can follow.
