The mailbox trials

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

hgm
Posts: 27789
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: The mailbox trials

Post by hgm »

Well, perhaps it does, and this is just my ignorance. I never used any intrinsics before, so I would not know how to do it even on a compiler that supports them. My approach has always been to just hand-edit the assembler generated by the compiler. I am using gcc 3.4.4. It says (C) 2004.
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: The mailbox trials

Post by Joost Buijs »

hgm wrote: Thu Apr 08, 2021 5:31 pm Well, perhaps it does, and this is just my ignorance. I never used any intrinsics before, so I would not know how to do it even on a compiler that supports them. My approach has always been to just hand-edit the assembler generated by the compiler. I am using gcc 3.4.4. It says (C) 2004.
GCC 3.4.4 is 17 years old. I think the function you need is __builtin_ctz(x). The documentation is still here: https://gcc.gnu.org/onlinedocs/gcc-3.4. ... fclzl-2143
hgm
Posts: 27789
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: The mailbox trials

Post by hgm »

OK, thanks. I will have a look at it.

I have a version that runs in 4.40 sec now:

Code: Select all

 1  -16       372 0 e2a6 b4c3 b2c3 e6d5
 2  -16      1314 0 e2a6 b4c3 b2c3 e6d5
 3  -27      3126 0 e2a6 e6d5 g2h3 b4c3 d2c3 d5e4
 4  -27     40932 0 e2a6 e6d5 g2h3 b4c3 d2c3 d5e4
 5  -37     97493 0 e2a6 e6d5 c3d5 b6d5 e4d5 e7e5
 6  -32    277881 0 e2a6 e6d5 c3d5 b6d5 a6b7 a8b8 b7d5 f6d5 e5f7 e7f7 e4d5 g7b2 f3f7 e8f7
 7  -32    623552 0 e2a6 e6d5 c3d5 b6d5 a6b7 a8b8 b7d5 f6d5 e5f7 e7f7 e4d5 g7b2 f3f7 e8f7
 8  -32   3060559 0 e2a6 e6d5 c3d5 b6d5 a6b7 a8b8 b7d5 f6d5 e5f7 e7f7 e4d5 g7b2 f3f7 e8f7
 9 -101   7561037 0 e2a6 e6d5 c3d5 b6d5 a6b7 a8b8 e5f7 b8b7 f7h8 g7h8 g2h3 f6e4
10  -82  26016106 0 e2a6 e6d5 e5g6 f7g6 c3b5 d5e4 f3g3 e8f8 g2h3 c7c5 g3g6 h8h3
t =  4.440 sec
  26016106 nodes (5.9 Mnps)
  22704635 QS (87.3%)
   4506472 evals (19.8%)
   1583973 stand-pats (7.0%)
  10303563 leaves (45.4%)
    121425 move gens
captures: 87.9%
The score at 9 ply is different though, which means there could still be a bug. Anyway, the speed improvement is a bit disappointing. What I did was move the search for the MVV (running top-down through the piece list) basically to the parent node, so that the work can be shared by all children, and can also be used to prune those children when capturing the MVV is futile, so that the attack map doesn't have to be fully updated for such nodes.

The whole algorithm is a bit cumbersome. The parent node keeps track of the highest piece of the stm that is under attack, or, when this is not yet known, the lowest piece that is certainly not under attack. When a capture from that node is going to be searched, it first does a relatively cheap update of the opponent moves in the attack map (recaptures and pin punishment), and determines what is the highest victim of those. If the parent node can exclude that any higher victim was already under attack, that newly attacked victim would be the MVV for the child. If it cannot exclude that, it starts examining other pieces between the lowest non-attacked and the highest newly attacked, to expand the range of 'excluded victims' in the parent node. For this it has to use the old attack map, which fortunately is still around. (The advantage of copy-make!)

If it does find a pre-existing attack there, it records it in the parent node, and this will then remain such for all subsequent moves. Thus all non-attacked pieces above the highest attacked one in the parent node have to be examined only once. Which is otherwise what one of the children would have done. There is never any reason to search below the highest newly attacked piece of the move that is going to be searched, though.

But if we do find a pre-existing attack that tops the highest attack created by the current move, it could no longer be valid after that move. (E.g. the attacker could be captured, or the victim could have moved away.) So we then have to continue the search through the piece list, starting from the highest pre-existing attack downward, for a piece that has attacks in the semi-updated attack map, until we reach the highest newly created attack. This work cannot be shared between siblings, as each sibling would get a different attack map. But there still is nothing done that would not have been done in the children if they were responsible for finding the MVV on their own. It is just that the pieces that can be ruled out because they were not even attacked in the parent (and no new attacks were created on them) don't have to be traversed in each sibling. And the remaining, node-specific part of looking for a surviving attack amongst the pieces that already were attacked has been moved to before doing the most expensive part of the attack-map update, so that we can skip the latter if no suitable attacks are found at all.

Yet there isn't very much speedup. This makes me wonder whether I am 'barking up the wrong tree'; the remaining time might not be spent on the things I am trying to optimize now. Perhaps the non-captures (which have gone up to 13%) are already dominating the execution time. These are still doing the stupid from-scratch generation of the attack map. The problem here is that for a non-capture the piece lands on an empty square, for which the map holds no info. (Because it records attacks on pieces, not squares.)

But of course that problem is not insurmountable. The neighbor table also had that problem, and it could be solved by 4 ray scans of the board to find neighbors of an unoccupied square. After that a conventional IsSquareAttacked() routine, looking at the 8 neighbors through the neighbor table, and the two Knights in the piece list, could collect the attacks on that square, to add them to the map. That should be very much faster than creating a map from scratch, even when the IsSquareAttacked() is roughly twice as expensive as the generation of moves of a typical given piece.
hgm
Posts: 27789
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: The mailbox trials

Post by hgm »

Finding the MVV

The attack map makes it easy to see whether a piece is attacked, by just consulting its attackers[] element. But the capture generation still has to loop through the piece list top-down for finding pieces that are attacked. Some time could be saved by somehow skipping the test on pieces that are no longer present (e.g. by having a second presence mask from which pieces can be extracted in MVV order). But the main problem is not that there are many pieces that are not present, but that, as you get closer to the leaves of QS, most pieces are no longer attacked.
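
Such a scan over a presence mask could look roughly like this (just a sketch; presence2[] would be a hypothetical second presence mask with its bits in MVV order, the other names are those used in the code below):

Code: Select all

int todo = presence2[xstm];               // opponent pieces still on the board; lowest bit = most valuable
int mvv = 0;                              // 0 = no attacked piece found
while(todo) {
  int bitnr  = BSF(todo);                 // most valuable remaining piece
  int victim = bitnr2victim[bitnr];
  if(attackers[victim] & presence[stm]) { // attacked by one of our surviving pieces?
    mvv = victim;                         // yes: this is the MVV
    break;
  }
  todo -= 1 << bitnr;                     // no: strip it and try the next one
}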

The code I have now for finding the highest attacked piece looks absolutely awful, like this:

Code: Select all

    int gap = u->alpha - MARGIN - 10 - u->pstEval; // futility gap child will have to bridge with a capture
    if(newMVV < u->threat) {                // newly created exposures might not be the worst
      int attacks, xstm = COLOR - stm;
      int worst = xstm - 1;                 // worst threat (this value is a kludge to represent 'none')
      if(!u->threatVal) {                   // pre-existing threat not yet found
        attackers -= 32;                    // switch back to the map before the move to do the deferred threat detection
        switch(u->threat & 15) {            // skip part of unrolled search loop already done before
          case 14: // Queen
            if(gap > QVAL) break;
            attacks = attackers[xstm+14];
            if(attacks & u->oldPresence) { u->threatVal = QVAL; worst = u->threat = xstm + 14; break; } 
            u->threat = xstm + 13;    // next untested
            if(newMVV >= xstm + 12) break;
          case 13: // Rooks
            if(gap > RVAL) break;
            attacks = attackers[xstm+13] | attackers[xstm+12];
            if(attacks & u->oldPresence) { u->threatVal = RVAL; worst = u->threat = xstm + 13; break; } 
            u->threat = xstm + 11;
            if(newMVV >= xstm + 8) break;
          case 11: // minors
            if(gap > BVAL) break;
            attacks = attackers[xstm+11] | attackers[xstm+10]
                    | attackers[xstm+9]  | attackers[xstm+8];
            if(attacks & u->oldPresence) { u->threatVal = BVAL; worst = u->threat = xstm + 11; break; }
            u->threat = xstm + 7;
            if(newMVV >= xstm) break;
          case 7: // Pawns
            if(gap > PVAL) break;
            attacks = attackers[xstm+7] | attackers[xstm+6]
                    | attackers[xstm+5] | attackers[xstm+4]
                    | attackers[xstm+3] | attackers[xstm+2]
                    | attackers[xstm+1] | attackers[xstm];
            if(attacks & u->oldPresence) { u->threatVal = PVAL; worst = u->threat = xstm + 7; break; }
            // if we get here, nothing was attacked!
            u->threatVal = 1; worst = u->threat = xstm - 1;
        }
        attackers += 32;       // deferred threat detection done for now; continue with updated map
if(PATH) printf("%2d     newMVV=%d threat=%d val=%d worst=%d\n", ply, newMVV, u->threat, u->threatVal, worst), fflush(stdout);
      } else worst = u->threat; // pre-existing gravest threat already known, just use it
Basically this is two unrolled nested loops, an outer loop over value groups and an inner loop over the pieces within the value group. For that inner loop I don't even bother to test for a 'partial result', but just OR all attackers sets together and then test if there were any attacks on the group as a whole. That seemed faster than testing them one at a time (which would require an AND operation with the presence mask and a branch for each of them), although for the Pawn value group that could be debatable. But even then, the code is splattered with branches. Because there are so many conditions for exiting it: you could have reached futile victims, you could have found an attacked piece, you could have reached a value group that you already know to be attacked by a newly created capture. At least entering the loop is done through a switch, which skips the tests that were already done before (on a previous move in the same node). But even if that is implemented by an indirect jump through a jump table (which I am not sure the compiler will do), it is an almost certain branch misprediction.

How can we do this better? The newly created captures emerge as a 'victim bitmap', where each bit represents a victim (or value group); the process of discovering enemy sliders at the from-square sets these bits, and the recapture is easily added to it too. It would also be possible to record in a bitmap of similar format which pieces have a pre-existing attack on them, and in yet another bitmap which pieces do not have a pre-existing attack on them. (And for pieces not in either map we then would not know whether they are attacked or not.) If we also had a bitmap to indicate which victims were non-futile, we could just do some Boolean logic on these bitmaps, to be left with the pieces that need to be tested for attackers. And then extract the bits from those in MVV order to do the actual tests. That would be a branch-poor way of doing things.

Which victims are futile is determined by the gap between the current (estimated) eval and alpha:

gap = alpha - MARGIN - estimatedEval;

Victims with a piece value > gap are 'non-futile'. Of course we could get these from a table indexed by score, int nonFutile[gap]. The table would have to be big, though (essentially running from -INF to +INF). This would waste a lot of L1 cache space, which is also not good for performance. Of course we could reduce resolution, and take (say) nonFutile[gap>>4], effectively rounding the scores to multiples of 16 before making the decision. And large parts of the table, namely those for gap < 0 or for gap > QVAL, are not very interesting; they have all pieces non-futile, or none at all, respectively. We could test for that and then handle the cases separately, and only use a table for scores between 0 and 1024 (assuming QVAL < 1024). But then we have branches again. And 1024 ints is still 4KB.

To save space we can use a two-step approach:

Code: Select all

unsigned char offset[128];
int nonFutile[5*8]; // the actual victim sets, 8x all, 8x NBRQK, 8x RQK, 8x QK, 8x none
mask = nonFutile[offset[gap >> 3 & 127] + (gap & 7)];
So basically the nonFutile[] table is indexed by the low-order 3 bits of gap (gap & 7), but starting at a certain offset. For gaps far away from any piece value, this offset would be 0, 8, 16, 24 or 32, the start of stretches of identical victim sets. Only close to a piece value the offset would be such that within the range determined by (gap & 7) two different victim sets can be seen. E.g. for offset[62], which would be used for gap = 62*8 (= 496) to 503, we would want to see 4x RQK (for gap = 496-499) and 4 times QK (for gap = 500(=RVAL) to 503). So offset[62] = 20, which makes gap & 7 = 4 land on index 24, where the QK entries start.
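
Filling the two tables could then be done once at startup, roughly like this (a sketch; the victim-set constants SET_ALL ... SET_NONE and the piece values are assumed to be defined elsewhere, and consecutive piece values are assumed to be more than 8 apart):

Code: Select all

int nonFutile[5*8];
unsigned char offset[128];

void InitFutilityTables()
{
  int sets[5]   = { SET_ALL, SET_NBRQK, SET_RQK, SET_QK, SET_NONE }; // victim sets per value band
  int bounds[4] = { PVAL, BVAL, RVAL, QVAL };                        // thresholds separating the bands
  int i, g;
  for(g = 0; g < 5; g++)
    for(i = 0; i < 8; i++) nonFutile[8*g + i] = sets[g];
  for(i = 0; i < 128; i++) {                              // slice i covers gap = 8*i .. 8*i+7
    int group = 0;
    while(group < 4 && 8*i >= bounds[group]) group++;     // band the slice starts in
    offset[i] = 8*group;                                  // slice lies entirely within one band...
    if(group < 4 && 8*i + 7 >= bounds[group])             // ...unless it straddles the next threshold
      offset[i] = 8*(group+1) - (bounds[group] - 8*i);    // place the boundary at the right index
  }
}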

This would reduce the memory requirement to 5*8*4 = 160 bytes for nonFutile[], and 128 for offset[], 288 bytes in total, which seems acceptable. We have not dealt with the out-of-range problem, though: for gap >= 1024 or gap < 0 the & 127 in the offset[] access protects us from out-of-bound accesses, but it returns faulty values. These really should have used offset[127]=32 and offset[0]=0, respectively, rather than what (gap >> 3 & 127) calculated. But this can be achieved in a branchless way with conditional-move (CMOV) instructions, which the compiler would presumably use for:

Code: Select all

int index = gap >> 3;
index = (gap < 0 ? 0 : index);
index = (gap > 1023 ? 127 : index);
mask = nonFutile[offset[index] + (gap & 7)];
With this mask, we can then decide which pieces we would have to test:

Code: Select all

int worseThreats = newThreats - 1 & ~newThreats;
worseThreats &= mask; // not interested in futile victims
int todo = worseThreats & undetectedThreats; // potential worse threats of interest
while(todo) { // update parent-node's knowledge on pre-existing attacks
  int bitnr = BSF(todo); // nr of trailing zeros
  int victim = bitnr2victim[bitnr];
  int bitMask = 1 << bitnr;
  if((attackers-32)[victim] & oldPresence) { // test for pre-existing attack using old map (hence the -32)
    existingThreat = bitMask; // we found one!
    undetectedThreats = 0; // not interested in more
    break;
  }
  undetectedThreats -= bitMask;
  todo -= bitMask;
}
potentialThreats = existingThreat | undetectedThreats;
todo = worseThreats & potentialThreats; // mask away confirmed non-attacked pieces
while(todo) { // verify whether pre-existing attacks are still valid after this move
  int bitnr = BSF(todo); // nr of trailing zeros
  int victim = bitnr2victim[bitnr];
  int bitMask = 1 << bitnr;
  if(attackers[victim] & presence[stm]) { // test based on (partially) updated map
    break; // when attacked piece found, leave its bit in todo set
  }
  todo -= bitMask;
}
todo |= newThreats; // add the newly created attacks
if(!todo) goto overkill; // when no victims found, abort MakeMove for overkill pruning
mvv = bitnr2victim[BSF(todo)]; // otherwise pass this mvv to recursive Search() call.
The first while-loop represents the code shown earlier. Far fewer branches now.
hgm
Posts: 27789
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: The mailbox trials

Post by hgm »

I am still a bit in doubt here, on what to do when the pre-existing threat is no longer valid. We would have to keep searching for a victim amongst the remaining members of the 'worseThreats' victim set. One that is attacked after the move. Which is a different test than for finding a pre-existing threat: it must exclude attacks by the piece that we are going to capture (by using the updated presence mask) and attacks on the piece that was moved (by using the updated attack map). 'newThreats' would already contain the recapture of the latter, if this is possible. If it is attacked, we know for sure that attack must have been pre-existing. (As worseThreats doesn't contain any of the newly created attacks.) But we are really interested in knowing the opposite: if there was no pre-existing attack, we could remove it from existingThreat (which in the code above would include all pieces below the highest attacked one). But not being attacked after the move doesn't prove that, as the move could have evaded the pre-existing attack.

So perhaps it would be better to keep using the old attack map (from before the move) for finding the MVV. On a negative test we could then remove it from existingThreats, so that other moves from the same node would automatically skip it. When it tests positive, though, we still would have to redo the test in the situation after the move to verify it as MVV. So we couldn't find an MVV without doing two tests, while a single one in principle would have sufficed. This is an investment we might never earn back, as later moves might not ever require this info. (There might be no later moves, or the current move might give a cutoff, or the later moves might all have newly created retaliations against pieces higher than this one, or the higher pre-existing threat was not evaded on those...)

We could exclude the moved piece from being tested by removing it from the victim set before the loop, though. Then the amount of overhead for making the dual test could become very low, and would probably be worth it:

Code: Select all

// initialization in parent:
int undetectedThreats = 0xFFFFFFFF; // or load from an incrementally updated victimPresence[stm] bitmap
int existingThreats = 0;            // nothing found so far

// in MakeMove():
int worseThreats = newThreats - 1 & ~newThreats; // newThreats was obtained in the partial map update
worseThreats &= mask; // weed out futile victims
int todo = worseThreats & undetectedThreats; // potential worse threats of interest
todo &= ~victim2bit[u->piece];               // (victim2bit = inverse of bitnr2victim) pre-existing attacks on the moved piece will have been evaded
while(todo) {                                // update parent-node's knowledge on pre-existing attacks
  int bitnr = BSF(todo);                     // nr of trailing zeros
  int victim = bitnr2victim[bitnr];
  int bitMask = 1 << bitnr;
  int attacks = (attackers-32)[victim];       // attack map before move (hence -32)
  if(attacks & oldPresence) {                 // there was a pre-existing attack
    existingThreats += bitMask;               // mark it
    undetectedThreats -= bitMask;             // and mark it as detected
    // ascertain it was not evaded
    if(attacks & presence[stm])               // test without the captured piece as valid attacker
       break;                                 // this is our MVV; stop looking!
  }
  todo -= bitMask;
}
todo |= newThreats;            // add the newly created attacks (in case todo ended empty)
if(!todo) goto overkill;       // when no victims found, abort MakeMove for overkill pruning
mvv = bitnr2victim[BSF(todo)]; // otherwise pass this mvv to recursive Search() call.
[Edit]Since these victim sets will live in the UndoInfo struct that is passed around between Search(), MakeMove(), UnMake() and the Search() of the child node, the latter can continue providing knowledge of pre-existing attacks, when it is in the process of searching the captures. At the end of the code in the previous posting, todo & mask will contain the set of all non-futile victims that could be under attack. So the capture-generation process in the child would just have to loop over the victims indicated in that set. It could recognize pieces in 'todo' that are not in newThreats or existingThreats, and do the test for pre-existence (updating existingThreats and clearing it from undetectedThreats) before deciding whether it has to be searched now. Its siblings could then benefit from that info, by having fewer non-zero bits in their 'todo' victim sets. A sketch of such a loop is shown below.
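
Such a loop in the child might look roughly like this (a sketch; the UndoInfo field names and GenerateCapturesOn() are assumptions, with stm here being the child's side to move):

Code: Select all

int todo = u->victimSet;                 // the todo & mask set handed down by the parent
while(todo) {
  int bitnr   = BSF(todo);
  int victim  = bitnr2victim[bitnr];
  int bitMask = 1 << bitnr;
  todo -= bitMask;
  if(!(attackers[victim] & presence[stm])) continue; // not attacked here: nothing to generate
  if(!(u->newThreats & bitMask)) {                   // attacked, but not by a newly created attack...
    u->existingThreats   |= bitMask;                 // ...so the attack must be pre-existing
    u->undetectedThreats &= ~bitMask;                // siblings can then skip testing this piece
  }
  GenerateCapturesOn(victim);                        // hypothetical: generate & search captures of it
}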
hgm
Posts: 27789
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: The mailbox trials

Post by hgm »

Although I haven't been able to exploit this yet to achieve further speedup, I want to dwell on this some more. Because I have the feeling I am really on to something.

Moving the code for finding the MVV to the parent node / makemove was originally inspired by the wish to avoid the expensive attack-map update when there is no MVV, or when its capture would be futile. So I did not do it on null move, where the attack map doesn't need any updating at all. But since this method has the additional advantage of being able to share information needed to find the MVV between its children, it could be beneficial to have the null-move search contribute to MVV identification too.

To recapitulate: we pass an argument to Search() (as a member of the UndoInfo struct) that specifies which pieces it should consider capturing. This has the format of a 'victim set', where each of the 32 pieces is represented by a bit, in a way that would extract the bits in order of decreasing value. This should help speed up its capture generation: if, in the parent, we have figured out that only Pawns are attacked (and a Pawn still is a worthwhile gain), the capture generation in the child can refrain from wasting time on testing whether any of the non-Pawns is attacked. When the parent has no clue as to what is attacked, it simply passes a victim set with all bits set (or at least all bits representing pieces of the right color that are still on the board, info we assume will always be available). Then the child will try to capture all of them, like it would when we didn't have this mechanism.

To sensibly restrict the pieces the children have to target, the parent keeps track of pre-existing attacks on pieces of the side to move ('threats'), also in the format of victim sets. It distinguishes existingThreats and undetectedThreats, where the latter indicates the pieces for which we haven't yet tested whether they were attacked or not. Initially (when entering the node) the set of existingThreats is empty, and every present piece is in the set of undetectedThreats. When we are going to search captures, we will be forced to examine some of the pieces for being attacked, in order to determine the MVV for the child. When we do this, the piece will be removed from the undetectedThreats set, and, depending on whether it was indeed attacked, added to existingThreats.

But we could also have a child return the info of which pieces it succeeded in generating captures on. At the stage where we do null move, all pieces would still be in the undetectedThreats set. If we don't do null move, it will stay that way. If the null move fails high, we are done with the node. But if the null move fails low, the child will have searched some moves, and can report on that in a 'confirmedVictims' victim set. All pieces in that set can then be removed from the undetectedThreats, and added to the existingThreats of the parent. E.g. suppose that the current eval is 50cP above beta. So we try null move, but one of our Pawns was hanging, and the null move is refuted by its capture. None of our other pieces was under attack, so the child couldn't generate any captures on them, and the confirmedVictims set it returns will only contain that Pawn. So the Pawn goes into existingThreats and out of undetectedThreats. But more importantly, we know that all pieces in the value groups above the Pawns (i.e. NBRQK) are not under attack, and can be removed from undetectedThreats as well. Otherwise the child would have tried those first, and they would be in confirmedVictims. We cannot be sure of other Pawns, though; these might still have attacks on them by higher pieces, so that they came later in the LVA ordering than the capture that refuted the null move. This means none of the captures we must search next has to consider non-Pawns as MVV (unless the capture specifically created a new threat on any of those, which we would know).
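
In victim-set terms that bookkeeping after a refuted null move could look roughly like this (a sketch; BSR(), the confirmedVictims field and the sameGroup[] table, which gives the victim-set bits of a bit number's whole value group, are assumptions):

Code: Select all

unsigned int confirmed = u->confirmedVictims;          // victims the null-move child generated captures on
if(confirmed) {
  u->existingThreats   |= confirmed;                   // these were certainly under attack already
  u->undetectedThreats &= ~confirmed;
  int lowest = BSR(confirmed);                         // highest bit number = least valuable confirmed victim
  unsigned int above = (1u << lowest) - 1 & ~confirmed; // more valuable pieces that were not confirmed...
  above &= ~sameGroup[lowest];                         // ...except its own value group, about which the move ordering proves nothing
  u->undetectedThreats &= ~above;                      // the rest was provably not attacked at all
}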

After a refuted capture we should be a bit more careful how to interpret the confirmedVictims it reports. But in any case, confirmedVictims that did not have a newly created attack ('newThreats') on them must have had a pre-existing attack on them. So these can be cleared from undetectedThreats and moved to existingThreats. The pieces in the higher value groups than the lowest confirmedVictim that were not confirmedVictims themselves apparently were not attacked. But this could have been an effect of the capture that led to the child: it could have captured their attacker, or they could have been moved away. The latter is easy to remedy: just don't draw any conclusions on the moved piece. (Like you cannot draw any conclusions on pre-existing attacks for pieces in newThreats.) There is no shortcut to figure out whether a piece was attacked by the captured piece, though. So the information we can distill from this is much less than what we can get from the null move.
hgm
Posts: 27789
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: The mailbox trials

Post by hgm »

I moved the stuff to Linux, so I could do profiling to see where the time goes. (On Cygwin profiling never worked for me: it always lists all functions as using 0% of the time!) This gave the following list:

Code: Select all

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ns/call  ns/call  name    
 38.53      1.61     1.61 141169697   11.41    11.41  AddMoves
 31.47      2.93     1.32 32820332    40.08    65.34  SearchCapture
 13.16      3.48     0.55 21424467    25.68    93.67  Search
  4.31      3.66     0.18 30577986     5.89     8.97  PreUpdate
  3.83      3.82     0.16 52002443     3.08     3.08  Discover
  2.39      3.92     0.10  3573624    27.99   341.86  MapFromScratch
  1.68      3.99     0.07  3573688    19.59    19.59  Occupy
  1.56      4.05     0.07  3584257    18.14   380.64  MakeMove
  1.44      4.11     0.06   126616   474.02   474.02  GenNoncapts
  0.96      4.15     0.04  3573624    11.20    11.20  UnMake
  0.48      4.17     0.02                             PrintMaps
  0.24      4.18     0.01                             HalfMapFromScratch
  0.00      4.18     0.00       89     0.00     0.00  Move2text
  0.00      4.18     0.00       66     0.00     0.00  PrintRow
  0.00      4.18     0.00        2     0.00     0.00  InitNeighbor
AddMoves() is the routine that generates captures (including 'friendly fire') by a given piece, and records those in the attack map. It is called twice from SearchCapture(), for the incremental update of that map after a capture. (To remove the moves from the old location of the moved piece, and add those in its new location.) And it is called for every piece on the board in MapFromScratch(), which updates the map after a non-capture. The call graph shows that this takes on average 27.5 calls to AddMoves(), which sounds reasonable (as we start with 32 pieces). That still makes MapFromScratch() responsible for 70% of the calls to AddMoves(), despite the fact that there are 9.2 times as many captures as non-captures.

So it is obvious that most gains can be expected from also making the map update after a non-capture incremental, rather than generating the map from scratch. Then it would also require only 2 calls to AddMoves(), like for the captures, reducing the time spent in that by a factor of 3.

Another thing I did was put in some counters to see how often the loop for discovery of enemy slider moves has to run in the incremental update. This confirmed my intuition: on average, only 0.12 enemy slider moves are discovered per evacuated from-square. So the preliminary update of the attack map, just for enemy moves, is really fast: concluding there were no enemy sliders in the attackers set of the mover, assigning the attackers set of the victim to the mover (minus its own attack), clearing that of the victim, and removing the victim from the presence mask. The expensive part is updating the moves of the moving player; this requires calling AddMoves().
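
In code that cheap preliminary update could look roughly as follows (a sketch that leaves out the rare discovered-slider loop; it assumes the mover is stm, so that the victim's color is xstm, and that presence[] is indexed by color with the attacker2bit[] numbering):

Code: Select all

int xstm = COLOR - stm;                             // color of the captured piece
attackers[u->piece]  = attackers[u->victim]         // the mover inherits the attacks on its victim...
                     & ~attacker2bit[u->piece-16];  // ...minus its own attack on that victim
attackers[u->victim] = 0;                           // nothing attacks the captured piece anymore
presence[xstm]      &= ~attacker2bit[u->victim-16]; // and it no longer counts as a present attacker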
hgm
Posts: 27789
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: The mailbox trials

Post by hgm »

I made the map update for non-captures now also incremental. For that I had to add a new routine for collecting the attacks on a hitherto unoccupied square, and then blocking all slider attacks that it catches. (I.e. remove those from the attackers sets of the downstream pieces they were originally attacking.) This did the trick:

Code: Select all

unsigned char code[] = { // capture directions per piece
   0,  0,  0,  0,  0,  0,  0,  0,  0,  0,   0,   0,   0,   0,   0,  0,
   1,  1,  1,  1,  1,  1,  1,  1,  0,  0,0203,0203,0204,0204,0207,  7,
 020,020,020,020,020,020,020,020,  0,  0,0230,0230,0240,0240,0270,070,
};

unsigned char capt[] = { 044, 022, 044, 011, 044, 011, 044, 022 }; // capture flags per direction

int vec[8] = { 16, 17, 1, -15, -16, -17, -1, 15 }; // board steps per direction

int InterceptAttacks(int sqr)
{
  int d, i, mask = 0;
  for(d=0; d<8; d++) {                            // look in 8 directions
    int src = neighbor[sqr].d[d];                 // nearest occupied square
    int piece = board[src];                       // its occupant
    if(piece + 16 & 32) {                         // reject empty and edge guard
      int c = code[piece];                        // how does it move?
      if(c & 0200) {                              // it is a slider
        if(c & capt[d]) {                         // hits us (from any distance)
          int dest = neighbor[sqr].d[d^4];        // opposite neighbor square
          int victim = board[dest];               // what was there
          if(victim != COLOR)                     // reject edge guards
            attackers[victim] -= attacker2bit[piece-16];
          mask |= attacker2bit[piece-16];         // record this attack
        }
      } else {                                    // it is a leaper
        if(c & capt[d] && src == sqr + vec[d]) {  // hits us (only when adjacent)
          mask |= attacker2bit[piece-16];         // record this attack
        }
      }
    }
  }
  for(i=8; i<10; i++) { // Knights
    int from = location[WHITE+i];
    if(from != CAPTURED && hippo[sqr-from])
      mask |= attacker2bit[i+WHITE-16];
    from = location[BLACK+i];
    if(from != CAPTURED && hippo[sqr-from])
      mask |= attacker2bit[i+BLACK-16];
  }
  return mask;
}
The update is still not very well thought out; it is for instance not optimized for getting a quick preview of the new attacks it creates, for allowing an easy abort in case overkill pruning could be applied. But it is not done very often, so the lack of overkill pruning on such moves shouldn't hurt much. It was just that the mapping from scratch took excessively long, so even when it was only applied to an 11% minority of the moves, it still ate a lot of time.

This gave (on Linux now, so numbers perhaps not comparable to previous tests) a speedup from 4.995 sec to 4.33 sec. The new profile looks like this:

Code: Select all

  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ns/call  ns/call  name    
 36.32      1.22     1.22 32820332    37.18    66.85  SearchCapture
 19.65      1.88     0.66 49996194    13.20    13.20  AddMoves
 16.37      2.43     0.55 21424467    25.68    53.76  Search
  6.70      2.66     0.23 59149691     3.80     3.80  Discover
  6.25      2.87     0.21 30577986     6.87    10.67  PreUpdate
  4.76      3.03     0.16  3573624    44.79    44.79  InterceptAttacks
  4.02      3.16     0.14  3584257    37.68   138.16  MakeMove
  2.08      3.23     0.07   126616   553.01   553.01  GenNoncapts
  2.08      3.30     0.07  3573688    19.59    19.59  Occupy
  1.34      3.35     0.05  3573624    12.60    12.60  UnMake
  0.45      3.36     0.02                             HalfMapFromScratch
  0.00      3.36     0.00       89     0.00     0.00  Move2text
  0.00      3.36     0.00       66     0.00     0.00  PrintRow
  0.00      3.36     0.00        2     0.00     0.00  InitNeighbor
We see the contribution of AddMoves() has now dropped to 20%, and the majority of the time is used by SearchCapture(). So it is clear where the efforts have to go now!
hgm
Posts: 27789
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: The mailbox trials

Post by hgm »

Well, the fact that my PC went dead yesterday (graphics card?) will not be very beneficial for the progress of this project. But let me dwell on some theoretical aspects. We see that a significant fraction of the time (20%, in some compiles this even went up to 24%, without any clear reason) is spent in the very simple routine AddMoves(). The source code for this routine was

Code: Select all

void AddMoves(int piece, int sqr, int addsub)
{ // generate moves for the piece from the square, and add or remove them from the attack map
  int bit = attacker2bit[piece-16]*addsub;  // addsub is 1 or -1, and determines whether we set or clear
  if(code[piece] & 0200) { // piece is slider
    int dir = firstSlide[slideOffs[piece-16] + sqr];
    int d = slides[dir];                    // direction nr (+1)
    do {
      int to = neighbor[sqr].d[d-1];        // the directions were offset by 1 to avoid the 0 sentinel
      int victim = board[to];
      if(victim != COLOR) {                 // ignore edge guards
        attackers[victim] += bit;           // set/clear the bit for this attacker
      }
    } while((d = slides[++dir]));
  } else {
    int dir = firstLeap[slideOffs[piece-16] + sqr];
    int v = leaps[dir];
    do {
      int to = sqr + v;                    // target square
      int victim = board[to];              // occupant of that square
      if(victim) {                         // ignore empty squares
        attackers[victim] += bit;          // set/clear the bit for this attacker
      }
    } while((v = leaps[++dir]));
  }
}
There are several inefficiencies in this. For one, the extra parameter addsub, to determine whether the moves should be added to the map, or deleted. This is only a single multiply, but the parameter still has to be passed, which means an extra push onto the stack, and an extra load to retrieve it from there. This could be prevented by making separate routines for adding and deleting moves.

But now that the update is fully incremental, the routine is always called in pairs, for the same piece. (Well, except during position setup, but that is not used in search, and can use its own dedicated code.) These calls can be combined, to save some calling overhead. And then we would only have to figure out whether we are dealing with a slider or a leaper (which cannot be avoided, as these need different treatment) once. As the piece for which we have to add the moves in the new location is the same as the one for which we have to delete the moves from the old location. (This would not be true for promotions, but these would need dedicated code to handle, and would be so rare in the tree that the way you do this will have no impact on speed. In this simplified study we have no promotions at all.)

So the choice between slider and leaper can already be made in the caller, like:

Code: Select all

if(code[u->piece] & 0200) // slider
  MoveSlider(u);
else // leaper
  MoveLeaper(u);
In the original code the calls to AddMoves() for deleting and adding the moves were separated by a board update, to prevent the piece for which you are updating the attacks from seeing itself standing on a square it should not be on (blocking its own moves). But this board update can be absorbed in the new routines as well:

Code: Select all

void MoveLeaper(UndoInfo *u)
{ // update attacks of the moved piece in the attack map, and apply move to the board
  int bit = attacker2bit[u->piece-16];  // bit of this piece in the attackers sets
  int dir = firstLeap[slideOffs[u->piece-16] + u->from];
  int v = leaps[dir];
  do {
    int sqr = u->from + v;               // target square
    int victim = board[sqr];             // occupant of that square
    if(victim) {                         // ignore empty squares
      attackers[victim] -= bit;          // set/clear the bit for this attacker
    }
  } while((v = leaps[++dir]));
  board[u->from] = 0;   // board update
  board[u->to] = u->piece;
  dir = firstLeap[slideOffs[u->piece-16] + u->to];
  v = leaps[dir];
  do {
    int sqr = u->to + v;                 // target square
    int victim = board[sqr];             // occupant of that square
    if(victim) {                         // ignore empty squares
      attackers[victim] += bit;          // set/clear the bit for this attacker
    }
  } while((v = leaps[++dir]));
}
So the leaper loop from AddMoves() is duplicated, once subtracting, once adding 'bit' to the attackers set of the identified targets. Note that these loops use the same variables u->piece, u->from and u->to as the board update needs, so these would already be in registers, making the board update just two STORE instructions. To save on calling overhead we just pass the pointer to the struct that contains this info, so that no copies of from, to and piece have to be made on the stack.

Another refinement is that the while() loops have been replaced by do-while() loops, testing only for a 0 sentinel in the list of steps at the end. The assumption is that all pieces have moves, no matter where they are located on the board, so the test before the first iteration would be useless. (Unfortunately it turned out not to be true in the test case, because of the lack of promotion: Pawns can get to 8th rank, and will have no moves there. This is an artifact of the model, which I solved by giving last-rank Pawns the move list of a King, to make the model more representative of normal chess.)

Lookup versus Calculation

There is one problem with this routine (and in fact most of the code for this approach), which might not be too obvious: it is too much dominated by memory loads. Modern CPUs can execute up to 4 instructions per clock cycle: one memory load or store, and 3 arithmetic / logic operations. But we have hardly any of the latter, making it hard to fully exploit the 'super-scalar' capabilities of the CPU. The basic operation for leaper moves is

Code: Select all

            load                add              load            load             add           store
list index -------> board step -----> to square -------> victim -------> att. set ----> updated -----|
           leaps[]             +from            board[]       attackers[]         +bit        attackers[]
(And for sliders it is even worse, as it requires yet an extra lookup, in the neighbor table, to get from the item retrieved from the move list to the to-square.) We could have saved the addition of the from-square to the step by not listing steps in leaps[], but target squares. This would require a different list for every starting square, however. When listing steps, all squares where the piece doesn't hit the edge can use the same list. The reason I chose to save space here is that this 'addition' in board[from+step] doesn't require a separate instruction; it is just part of the address calculation of the load instruction. Intel architecture has instructions that load memory[constant + register1 + register2], so if 'from' and 'step' are in registers, no explicit adding is needed, and the work is all done in the 'Address Generation Unit' of the CPU. But that means it is basically all memory operations, waiting for each other. (And even from the L1 cache the latency of a load is 3 or 4 cycles!)

Now if we look at the entire loop, the waiting time between the loads can be filled with loads from other iterations. Because all the iterations are independent, and the out-of-order execution of the CPU would not wait for them to complete before starting with the next. The data flow looks like this:

Code: Select all

          |
          |   load       add     load         load            add          store
        index ----> step ---> to ----> victim ----> attackers ---> updated -----|
          |           |
       +1 |           | test
          |  branch   V
          X <------- T/F
          |
          |   load       add     load         load            add          store
        index ----> step ---> to ----> victim ----> attackers ---> updated -----|
          |           |
       +1 |           | test
          |  branch   V
          X <------- T/F
          |
          |   load       add     load         load            add          store
        index ----> step ---> to ----> victim ----> attackers ---> updated -----|
          |           |
       +1 |           | test
          |  branch   V
          X <------- T/F
                      |
                     exit
 
This at least allows the CPU to interleave the loads of the various iterations, to prevent the CPU from being entirely idle, and at least keep the load unit busy. There is awfully little to do besides the loads, however. The addition for adding the bit to the attackers set, and an increment of the index. And of course the test and branch to decide whether the loop will continue. And if, at a high level of optimization, the loop is unrolled, you would not even need the increment, but would just access leaps[] with various (constant) offsets.

The moral lesson of this story is that the algorithm is heavily load dominated, and that the arithmetic units of the CPU are idle most of the time. So to speed things up we really should be looking for eliminating load operations, while adding calculations (or register-to-register moves) would not hurt at all. So it is ill-advised to use small lookup tables for things that could also be calculated, even when the lookup would only take a single instruction, and the calculation would take 3 or 4. Because that single instruction would be a load, and thus add a cycle to the execution time, while those 4 calculation instructions would be done in parallel with the loads we cannot avoid, and add nothing!

For this reason lookups like bit = attacker2bit[piece-16] are undesirable. (Note that the -16 is not really calculated at run time, but at compile time, by just using attacker2bit-16 as the (fixed) base address of the array access.) The assignment of bits to attackers is non-trivial, because the white and black pieces are interleaved. So with the piece encoding we adopted, which has a WHITE bit (16) and a BLACK bit (32), we would have to juggle the '32'-bit to the lsb (>>5), and shift up the remaining bits (<<1 or *2) one place to make room for it, and then AND with 31 to mask away the original COLOR bits: bitnr = (piece >> 5 & 1 | 2*piece & 31). That is 5 operations. And then we still would have to use that to shift a 1 bit to the proper location to create a mask, 1<<bitnr. But these would probably all be 'free', taking up execution slots where the CPU otherwise would have been idle.
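
Written out, the calculation would simply be (for the current encoding, with WHITE = 16 and BLACK = 32):

Code: Select all

int bitnr = piece >> 5 & 1 | 2*piece & 31;  // black bit to the lsb, piece number shifted up one place
int bit   = 1 << bitnr;                     // replaces bit = attacker2bit[piece-16] without a load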

Piece Encoding

This brings me to another thought: the piece encoding that was adopted is in hindsight a stupid one! Life would have been much simpler if the piece encodings were already in the same order as the corresponding bits in the attack map. I.e. interleaved for white and black. That means the lsb (bit 0) should have been the BLACK bit. Then bits 1-4 could have contained the (piece-type-implying) piece number. And we could dispense with a separate WHITE bit. To have both WHITE and BLACK bits was only helpful in a move generator that would produce both captures and non-captures, because it made it possible to recognize both own pieces and edge guards (to exclude the capture of those as valid moves) in the same test (if the edge guards have both color bits set). But we don't have that anymore; our by-mover generation is limited to the non-captures, and those only have to test for the to-square being empty. Which would be equally easy no matter what code we use for them; it would always be if(board[to] == EMPTY), and having EMPTY=0 would have no special advantage.

Captures are generated in AddMoves() (or the improvements for that suggested above) as attacks, and do not have to distinguish the piece color of their target. The algorithm used there is such that leaper moves are guaranteed to not go off board and only have to test for non-emptiness; slider moves are guaranteed to go to a non-empty square through the neighbor table, and only have to exclude edge guards. So both test for a single value, and it doesn't matter what that value is. Generation of pseudo-legal captures (rather than friendly captures) is enforced by picking the attackers set of a piece of the desired victim color, and masking the pieces of its own color out of it before bit-extraction commences.

So the pieces can better be encoded as:

0 = wP, bP, wP, bP, wP, bP, wP, bP, wP, bP, wP, bP, wP, bP, wP, bP, wN, bN, wN, bN, wB, bB, wB, bB, wR, bR, wR, bR, wQ, bQ, wK, bK, Edge Guard, Empty = 33

Using this encoding we could get rid of the branch in the critical loop of the capture generator, by extending the attack map with two dummy elements, for Edge Guards and Empty squares. We could then always do attackers[victim] += bit, even when 'victim' isn't really a valid victim. That would of course require a load and store operation on those attack-set elements that could have been avoided. But that takes at most two cycles, while a branch misprediction would cost far more (like 12?).
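
With those dummy elements the inner update loops could become branch-free, roughly like this (a sketch, reusing the leaps[], slides[] and neighbor[] machinery from AddMoves()):

Code: Select all

do {                                             // leaper loop: targets never go off board
  attackers[board[sqr + v]] += bit;              // an empty square harmlessly updates the dummy slot
} while((v = leaps[++dir]));

do {                                             // slider loop: neighbor[] never yields an empty square
  attackers[board[neighbor[sqr].d[d-1]]] += bit; // an edge guard harmlessly updates the other dummy
} while((d = slides[++dir]));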

The conversion from piece number to bit mask is now a simple shift: bit = 1<<piece. The conversion from attackers set to the piece number of the LVA would be a bare bsf(). But for move generation we usually extract bits from a pair of merged attackers sets, and there it is a bit more complex: the lsb of the bit number in such sets doesn't indicate the color of the attacker, but which of the victims whose attackers sets were merged is the target of the capture. But that can be handled by relatively simple operations (after bsf(mergedSet) gave us the bitnr): attacker = bitnr & ~1 | stm, and victim = baseVictim | (bitnr & 1). That seems a proper number of logical operations to make good use of the execution slots the CPU has in parallel to all the necessary loads.

We also use victim sets. But there the only requirement of the bit assignment is that pieces of the same color extract in MVV order. Using the reverse of the order in the attackers sets satisfies that. The conversion of bit number to piece is then simply victim = BLACK_KING - bitnr, and from piece to bitmask it would be mask = 1 << (piece ^ 31). Not unduly complex, so easily manageable without a lookup table.
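
Collected in one place, the conversions under this encoding then need no tables at all (a sketch; BLACK_KING = 31 in the proposed numbering):

Code: Select all

int bit    = 1 << piece;              // piece code -> bit in an attackers set
int lva    = BSF(attackSet);          // attackers set -> piece code of the least valuable attacker
int victim = BLACK_KING - bitnr;      // victim-set bit number -> piece code
int vicBit = 1 << (victim ^ 31);      // piece code -> bit in a victim set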

Packing Move Tables

Another place where we could eliminate, or at least reduce, load operations is in getting the move lists. Rather than step = leaps[dir++] we could have packed all steps in a single machine word, which is fetched into a CPU register in a single load, unsigned int list = moveLists[piece][sqr]. The loop could then get the board steps (or in the case of sliders the move direction numbers) through step = list & 255; list >>= 8. After all steps have been consumed, 'list' will be zero, and the loop terminates on that. The list >>= 8 takes over the role of dir++.

A complication here is that the board steps can be negative. (The direction numbers for the sliders do not suffer from that.) But this can be solved by storing them offsetted, and subtracting the offset: step = (list & 255) - 0x80. Using two operations (& and -) to save one load is a very good deal.
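
The consuming loop would then look something like this (a sketch with 8 bits per step; as noted below this forces a 64-bit list for Kings and Knights):

Code: Select all

unsigned long long list = moveLists[piece][sqr]; // one load fetches the entire move list
while(list) {                                    // a zero byte terminates the list
  int step   = (int)(list & 255) - 0x80;         // undo the 0x80 offset to get the signed board step
  int victim = board[sqr + step];
  if(victim) attackers[victim] += bit;           // same update as before
  list >>= 8;                                    // takes over the role of dir++
}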

Another complication is the word size. Kings and Knights can move in 8 directions, so taking 8 bits for each step requires a 64-bit int to pack them. I know, in these times of x64 architecture not a big deal. But still... On a 0x88 board Knight leaps run from -33 to +33. That is a span of 67, requiring 7 bits to encode 'verbatim'.
hgm
Posts: 27789
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: The mailbox trials

Post by hgm »

Magic Move Step

Leapers in Chess can only have King steps or Knight leaps, and of each there are only eight. So in principle 4 bits would be enough to encode what move is meant. But using a lookup table for decoding these would defeat the purpose of eliminating load instructions. I was hoping that some magic multiplier N would exist so that N*x >> 25 would produce all the K and N board steps for x=0-15, but this appears to be asking too much. (For 32-bit multipliers at least.) If it would have existed, this would have provided a simple prescription for converting 4-bit move codes into actual board steps through calculation. And then we could pack 8 of those 4-bit codes into a 32-bit word.

A next-best thing would be to have such a prescription that also depends on the piece type (N vs K or P). Then only 8 of the 16 possible codes would have to correspond to a valid move of the piece in question, and we don't care what the other codes give (as we won't use those on that piece). I could come up with the following prescription:

Assume that the 4 bits encoding the move step are not contiguously packed into the 32-bit word, but in two groups of 2, spaced by 14 bits (used for containing the other 7 moves of that piece from that square). So after having been shifted right so its turn comes up, a step descriptor would have to be extracted by stepCode = list & 0x30003. The idea is that the low bits encode the file, and the high bits the rank. The prescription would then bring the bits together through multiplying by 0x10010, shifting right 16 bits, and masking with 0x77. E.g. a code 0x20003 would give 0x00230030 after multiply, 0x23 after shift and 0x23 after mask. This doesn't account for negative numbers, but we can subtract 0x22 for that. (This is not an actual operation, but an offset when using the obtained number when accessing the board. Which is the only thing we will be using it for. So (board - 0x22)[from+step].) This way code 0x20002 would encode for a step 0, while using a 1 would encode a rank/file down, and 3 a rank/file up. That would take care of the King.

The Knight can have file or rank steps of -2, -1, +1 or +2, however. Given the 0x22 offset this will require the generated rank & file codes to be 0, 1, 3 or 4 (instead of the 0, 1, 2, 3 they were for the King). But that is a multiplication by 1.5, rounded down. So instead of multiplying by 0x10010, we multiply by 0x18018 (which is 1.5 times as large). 0x10003 (say) would then give 0x80048000 + 0x180048 = 0x801C8048 after multiply, 0x801C after shift, and 0x14 after mask. That is (compared to 0x22) 1 rank down and 2 files to the right.
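
A throwaway check of the two worked examples (not engine code, just verifying the arithmetic):

Code: Select all

#include <stdio.h>

int main()
{
  unsigned int kStep = 0x20003u * 0x10010u >> 16 & 0x77; // King/Pawn decode of code 0x20003
  unsigned int nStep = 0x10003u * 0x18018u >> 16 & 0x77; // Knight decode of code 0x10003
  printf("0x%x 0x%x\n", kStep, nStep);                   // prints 0x23 0x14
  return 0;
}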

It has to be encoded somewhere whether the piece in question is a slider or a leaper, and we might as well use a table of these multipliers for that. Sliders would get 0 (by which they are recognized as sliders), while the leapers would find either 0x18018 (N) or 0x10010 (K and P) there, and use it as the multiplier M in the statement victim = (board-0x22)[from + ((list & 0x30003)*M >> 16 & 0x77)].

Code: Select all

int list = moveList[piece][from];
int stepCode = list & 0x30003;
do {
  int step = stepCode*M >> 16 & 0x77;
  int victim = board[from+step-0x22];
  attackers[victim] -= bit;
  list >>= 2;
} while((stepCode = list & 0x30003));
Note that neither the Knight nor the King uses the stepCode 0 (which corresponds to a (-2,-2) step in both systems), so that it remains available as a sentinel. The & operation that masks out the next stepCode does set condition codes, and can thus be used to branch on (for looping) without the need for an additional TEST instruction (as would be needed when loading the next code from a memory table). The list shifting takes the place of dir++ in the old code. So new are the *, >> and &, which replace the load from the leaps[] array. The critical data-flow path here is the shifting of 'list'; after each shift the result can be used for the next shift, and the & then starts a long-latency side chain that nothing has to wait for until the attack map is needed in the child node for move generation.

To keep the multiplier available in a register after having tested it for deciding slider or leaper, it would be better to combine the two flavors of MovePiece() again, like

Code: Select all

void MovePiece(UndoInfo *u)
{
  int M = magics[u->piece-16]; // distinguishes N / KP / BRQ
  int list = moveList[u->piece][u->from];
  if(M) {
    // leaper stuff, using M to convert list elements to steps
  } else {
    // slider stuff, extracting direction numbers (which run 1-8) directly from list
  }
}