The mailbox trials

hgm · Post by **hgm** » Mon Apr 19, 2021 1:28 pm

This concern about in-check nodes is sparked by he wish to improve the speed of SearchCaptures() by eliminating branches. We can classify branches into 3 groups: 'big decisions', which prune entire search trees. No way to avoid taking these. Then there are trivial things that sometimes must be skipped. These we can get rid of by doing the thing always, but somehow subverting the action so that its effect goes unnoticed when it should really have been skipped. (E.g. by discarding what it pushed on a stack by resetting the stack pointer with a conditional move, or diverting a store to a table ot a dummy element.)

The hardest case is non-trivial tasks that would significantly drive up the amount of (dummy) work done if you would do it always (subverted when needed), controlled by a branch that is not mispredicted very often. (E.g. because the conditional code has to be executed only in 5=10% of the cases.)

In SearchCaptures() a number of big decisions are taken for the caputre under consideration:

1) does the move resolve an existing check?
2) is the move futile (reply depth <=0, eval + victimValue < alpha - MARGIN)?
3) is there a repetition? (Well, not after a capture)
4) is there a TT hit that causes a cutoff? (But TT no yet implemented)
5) does the move expose our King? (Requires pin- and recapture detection)
6) is the move overkill? (reply depth <= 1, eval + victimValue - replyMVV > beta + MARGIN)

All these tests can lead to pruning of the move, some (such as futility) just saving an unnecessary makemove / map update (as the child would cut off by stand pat immediately), others also saving us a fruitless move generation in the child. And the TT or repeitions could prune an enormous sub-tree. Some tests are cheap, only requiring calculaion of the new incremenal eval (2) or hash key (3), which would have to be done anyway if here is no pruning.

One strategy for getting rid of necessary branches is to prevent the decision has to be taken on every move, but take them 'in bulk' for the node that generated these moves a whole. For the futility pruning this is done by zeroing the attackers sets of futile victims with a conditional move before the capture extraction. This isn't a perfect method, because futility is based on alpha, and alpha can increase in the process of searching the moves compared to the initial value it has in the node at the time of move extraction. But increase of alpha should happen only in PV nodes, which should be rare. (I probably should take some actual statistics on that.) So the per-move test remains necessary (especially since it doesn't jus prune the current capture, but all subsequent moves in the node as well, as these have even lower-valued victims). But it should only rarely trigger a cutoff, which makes it easily predictable, so that the common case of no pruning is almost free.

The test for existing checks can similarly replaced by a bulk test at the start of the node, which then prevents captures that would not solve it tobe generated. That makes the per-move test for them completely unnecessary. Overkill pruning guarantees there is at least one move to search in every node, so even if the bulk test is equally expensive as the per-move test, you won't lose anyhing. I shouldn't be very much more expensive, though, as there aren't that much captures to be searched in a node. (Especially if it is a cut-node...)

A bulk test for pinned piece to avoid generation of illegal moves is no easy in he current conext, where moves are basically not generated, bu already there, waiting in the attack map to be extracted. We could detect pinned pieces in advance, and remove those from the presence mask while building the capture stack. And exempt the attackers set of the pinner from that. Pieces are rarely pinned on their King, though. So the branch that tests this will be predicted quite well. And there aren't too many captures to search in a node. So there isn't very much to save here, and idenifying pins in advance is most likely too complex to make it competitive.

The overkill pruning might still benefit from moving part of the task to a per-node effort, rather than a per-move effort. More about that later.

hgm · Post by **hgm** » Mon Apr 19, 2021 11:36 pm

The code I presented earlier for detecting the MVV was close to optimal for a test on a per-move basis. But could it still be impoved by sharing work between the moves from the same node? The problem is that the exact test that is needed is dependent on the move we want to play. This is already accounted for w.r.t. the new opponent moves a move creates (soft-pin enforcement and recapture); these are calculated after every move, dependent on the specifics of that move. But we also have to test for resolution of pre-existing threats. The code I presented solved that by redoing the test for every move, after two move-dependent modifications af the attack map: remove the attacks on the moved piece, (in the attackers[] array) and those by the captured piece (by clearing that from the presence mask).

Apart from those two changes, the test looks pretty much the same for every move: OR all attackers sets of non-futile victims together, and test it with the presence mask. This can require loading and ORing the sets for all 16 pieces, in 8 pairs, in the worst case. And if we want to do it branch free, every case is a worst case, as the operations we did not need are still done, as dummies. But we could do part of the work pre-emptively, e.g. ORing the attackers sets of pairs of piece together two at the time, to get four 'combi sets' (pawns1, pawns2, minors and majors). And store those in an array.

To judge an individual move, we then only have to load 4 combi sets, and OR those together, instead of 8 pairs. We can test the result of that with the presence mask that applies after the move.

There is one show stopper there: the attacks on the moved piece would still be in the ORed result. And these are no longer valid after the move. We really should have left out that piece when combining the atttacks. But that means that only one of the 4 combi sets was wrong, the one that contained the moved piece. So we could recalculate that set, with two loads and one OR, and use the pre-calculated values for the others.

Now if we did not have to deal with futility, we could extend our pre-emptive calculation by not only calculating the 4 combi sets, but also the OR of all combi sets except 1. This is about twice as much work: you can go through the list of combi-sets top-down, to create the OR of 1, 1|2, 1|2|3, ..., and then bottom up to create 8, 8|7, 8|7|6, ... (if there were 8), and store all these partial results. To get the OR of all but 3, you would then OR the element above it from the top-down cumulative table (1|2) with the element below it from the bottom-up table (4|5|6|7|8). If the moved piece was in combi-set 3, you would then recalculate set 3 without the piece, and OR that with the tabulated result for all-but-3. No need to OR a lot of elements together for each move.

But futility sort of spoils this. The move doesn't alter what we have to test by moving a different piece, or having a different victim. It also has a different 'gap' w.r.t. beta, when it captures a piece of different value. Beta is the same throughout a node (unlike alpha, which can get increased). But the eval after the move depends on what we capture. This means that we might want to omit different combi-sets at the low-value end of the list from the OR that combines them. This interferes with preparing a table of all-but-X combi-sets that would be valid for all moves.

I haven't really found a solution for that yet. The all-but method could still be used within a group of pieces of the same value. As this has to be included in its entirety, or not at all. E.g. the there are 4 Pawn pairs, and we could prepare a table of all-but-pair-X OR results. To calculate whether there still are attacks on the Pawn group after a move, we would just OR the attackers[] element for the pair the Pawn was in with the tabulated OR of the others. This then could conditionally be replaced by 0 if capturing a Pawn is futile, as usual, before ORing in attacks on other pieces. We would not have to OR all 4 Pawn pairs together for every move.

Making a set of 4 all-but combinations is not that much work. You would first OR pairs A=1|2, B=3|4, which you would have to do anyway when calculating th OR of all. But instead of continuing with A|B, you would then calculate A|3, A|4, B|1, B|2, and store those.

I will keep at it...

hgm · Post by **hgm** » Wed Apr 21, 2021 10:53 pm

Redesign

As I was away from home a few days without an opporunity to program, I had a good opportunity to rethink everything. Which is good, because in any case I will have to rewrite a lot of what I have so far, in a way that cannot be done with incremental changes of exising code. I came to the following conclusion:

Because the attack-map update for true attacks on pieces of the moving player is so cheap (pin enforcement, which in the common case boils down to testing the attackers set of the moved piece to conclude no enemy sliders were amongst those, and replacing the atackers set of the moved piece by that of the captured piece for the recaptures), it hardly hurts to do it in place pre-emptively (i.e. without being sure the move will not be pruned). Copy-make is only useful for the remaining part of update (attacks by the moved piece and discovered attacks), which is much more expensive, and equally expensive to undo.By always doing the update of pins and recaputure, we can do the 'advance' MVV detection for the child on an up-to-date attack map, which highly simplifies it:

Code: Select all

uint64_t attacks, presence64;
int mvvCode = 0;
undo.pstEval = -(u->pstEval + value[undo.victim]);           // calculate new PST eval from child POV

// speculative pre-update
undo.moverSet = atackers[undo.piece];                        // save for unmake
DISCOVER_SLIDERS(undo.moverSet & presence[stm] & SLIDERS);   // detect (soft-)pins
attackers[undo.piece] = attackers[undo.victim];              // detect recaptures

// detect MVV child
attacks = att64(P12) | att64(P34) | att64(P56) | att64(P78); // combine attacks on all four Pawn pairs
mvvCode = (attacks & presence64 ? PVAL : mvvCode);           // are any Pawns attacked?
attacks = att64(BN1) | att64(BN2);
mvvCode = (attacks & presence64 ? BVAL : mvvCode);           // are any minors attacked?
attacks = att64(QR) | att64(KR);
mvvCode = (attacks & presence64 ? RVAL : mvvCode);           // are any majors attacked?
mvvCode = (attacks << 32 & presence64 ? QVAL : mvvCode);     // did that include a Queen?

// decide on searching / pruning
if(attackers[KING] & presence[stm]) {                        // move exposed King to attacks
  UNDO_PRE_UPDATE;
  continue;                                                  // is illegal, ignore
} else
if(undo.pstEval + mvvCode < u->alpha - MARGIN) {             // best capture in child is futile (overkill)
  undo.parentAlpha = u->beta;                                // child fails low
  UNDO_PRE_UPDATE;
  goto cutoff;                                               // beta cutoff in parent
} else 
if(u->pstEval > u->beta + MARGIN) {                          // the current capture is futile
  UNDO_PRE_UPDATE;
  goto cutoff;                                               // alpha cutoff in parent
} else {                                                     // child has captures to make to decide issue
  COPY_MAP;                                                  // start copy-make
  FULLY_MAKE_MOVE;                                           // finishes attack-map update, updates board
  score = -Search(&undo);                                    // recurse
  RESTORE_BOARD;                                             // take back move on board
  SWITCH_BACK_TO_OLD_MAP;
  UNDO_PRE_UPDATE;
  if(score >= u->beta) goto cutoff;
}

This code could be placed inside Search(), in the loop that extracts the captures. The code that decides on wheher to prune contains several if-statements. Some are easy to predict, because they almost never apply. E.g. futility pruning should be rare, because the capture generaion already does bulk fuitlity pruning. An individual capture can only turn out futile when an earlier capure raised alpha compared to what it was at move-generation time (without causing a beta cutoff). Which can happen only in PV nodes. And PV nodes are rare. Having a King stumble into check happens only on some 5% of the moves. The overkill pruning is hard to predict, though: 1/3 of the captures is pruned that way. So mispredicts will be frequent there. The other branches probably won't hurt much.

An advantage of this that might not be obvious is that the common case of searching the move starts with copying the attack map for copy-make. And the 16 elements for the pieces of the moving player have jus been used (as 8 pairs) to determine the MVV. As x64 architecture has 16 registers, a smart optimizer would have kept all of those in memory. So we could start by storing all those to the copy map, saving us eight 64-bit loads.

But it is even better: these 8 pairs are also needed for building the capture stack in the child. We could also move the code for that to the child, to before the full update of the attack map. The attacks / protections by pieces of the moving player are not needed for that; he will not be on move in the child. (These would be needed to see whether any of the non-futile victims in the child is protected, though.) So we could also save us eight 64-bit loads there, and after storing the these attacker-set pairs to the map copy, also merge (interleave) them to create four capture sets, and push those on the capture stack as emptiness and futility require (possibly splitting them to get LVA order).

Code: Select all

#define att64(X) *(uint64_t*) (attackers + X)

// after pre-update of map:
// gather some pairs of attackers sets through 64-bit loads
uint64_t set1 = att64(P12);
uint64_t set2 = att64(P34);
uint64_t set3 = att64(P56);
uint64_t set4 = att64(P78);
uint64_t set5 = att64(BN1);
uint64_t set6 = att64(BN2);
uint64_t set7 = att64(QR);
uint64_t set8 = att64(KR);
// MVV determination
uint64_t presence64 = presence[stm]*0x100000001ull; // duplicate presence mask in high 32-bit
int mvvCode = 0;
attacks = set1 | set2 | set3 | set4;
mvvCode = (attacks & presence64 ? PVAL : mvvCode);
attacks = set5 | set6;
mvvCode = (attacks & presence64 ? BVAL : mvvCode);
attacks = set7 | set8;
mvvCode = (attacks & presence64 ? RVAL : mvvCode);
mvvCode = (attacks << 32 & presence64 ? QVAL : mvvCode);

if() { // abort the move
  ...
else { // search the move
  attackers += 48;   // new map
  att64(P12) = set1; // copy half-map
  att64(P34) = set2;
  att64(P56) = set3;
  att64(P78) = set4;
  att64(BN1) = set5;
  att64(BN2) = set6;
  att64(QR)  = set7;
  att64(KR)  = set8;

  // advance building of capture stack
  undo.SP = 0;
  // mask away protectors
  set1 &= presence64;
  set2 &= presence64;
  set3 &= presence64;
  set4 &= presence64;
  set5 &= presence64;
  set6 &= presence64;
  set7 &= presence64;
  set8 &= presence64;
  // merge two pairs of attackers sets to a capture set for four victims
  uint64_t pawns1 = set1 + 2*set2; // for black this would be: set1 >> 1 | set2
  uint64_t pawns2 = set3 + 2*set4;
  uint64_t minors = set5 + 2*set6;
  uint64_t majors = set7 + 2*set8;
  // kill the futile captures
  pawns1 = (gap > PVAL ? 0 : pawns1);
  pawns2 = (gap > PVAL ? 0 : pawns2);
  minors = (gap > BVAL ? 0 : minors);
  majors = (gap > RVAL ? majors & 0xFFFFFFFF : majors); 
  // split (where needed) and push onto capture stack
  undo.sets[undo.SP] = set1 = pawns1 & mask;  // Other x P (mask = 0xFFFF0000FFFF0000ull)
  undo.SP = (set1 ? undo.SP + 1 : undo.SP);
  undo.sets[undo.SP] = set2 = pawns2 & mask;  // Other x P
  undo.SP = (set2 ? undo.SP + 1 : undo.SP);
  undo.sets[undo.SP] = pawns1 -= set1;
  undo.SP = (pawns1 ? undo.SP + 1 : undo.SP); // P x P
  undo.sets[undo.SP] = pawns2 -= set2;
  undo.SP = (pawns2 ? undo.SP + 1 : undo.SP); // P x P
  ...
  undo.sets[undo.SP] = majors;
  undo.SP = (majors ? undo.SP + 1 : undo.SP); // any x Q, any x R
  // only now finish copying of attack map, (the other half)
  // and update these, and apply move to board
  int score = -Search(&undo); // pass the prepared capture set to the child
  // unmake
  attackers -= 48;
  ...
  undo.parentAlpha = (score > undo.parentAlpha ? score : undo.parentAlpha); // minimax
  if(score >= u->beta) {
    goto cutoff; // beta cutoff
  }
}

This code is a lot more economical with 64-bit memory loads (of which I am pretty sure CPUs can do only 1 per cycle) than wha I had before. It also has very few branches. Also note that we are back to calling Search() recursively direcly from Search() again, like for the non-captures. I is just that MakeMove() / UnMake() are explicitly written out here, because they (and especially MakeMove()) are no longer done as a single step, but interleaved with other tasks done on behalf of the child node. This should remove some calling overhead.

hgm · Post by **hgm** » Thu Apr 22, 2021 9:24 am

Capture Set or Capture list?

In the code shown above I again forgot to push the victim offset with the capture sets that make it to the capture stack. It is actually an inconvenience that this has to be done. It can be avoided by storing the capture sets in fixed locations, (so that the location implies the victims), and identify the non-empty ones by putting them in a linked list.

Code: Select all

  // split (where needed) and push onto capture stack
  undo.SP = 0;
  undo.sets[1] = set1 = pawns1 & mask;  // Other x P (mask = 0xFFFF0000FFFF0000ull)
  undo.next[1] = undo.SP;
  undo.SP = (set1 ? 1 : undo.SP);
  undo.sets[2] = set2 = pawns2 & mask;  // Other x P
  undo.next[2] = undo.SP;
  undo.SP = (set2 ? 2 : undo.SP);
  undo.sets[3] = pawns1 -= set1;
  undo.next[3] = undo.SP;
  undo.SP = (pawns1 ? 3 : undo.SP); // P x P
  undo.sets[4] = pawns2 -= set2;
  undo.next[4] = undo.SP;
  undo.SP = (pawns2 ? 4 : undo.SP); // P x P
  ...
  undo.sets[8] = majors;
  undo.next[8] = undo.SP;
  undo.SP = (majors ? 8 : undo.SP); // any x Q, any x R

This still requires storing something together with the capture sets, in this case for indicating what is the next set to be treated, instead of what the set itself represents. Instead of a stack-pointer that is conditionally incremented for non-empy capture sets, undo.SP now keeps track of the location in undo.sets[] that holds the start element of the link list. This gives a very slight speedup, as to conditionally increment SP you have to copy it to another register to increment it there, so that you can conditionally move it back to replace the original. Here you just have to conditionally move a constant, which doesn't have to be calculated first.

The main advantage is that all the partial sets now are stored in a fixed location. This would make it easy to do 'post-processing' on those. E.g. for further postponing major x minor captures (undo.sets[5]) on protected pieces, we could mask away the attacks on protected minors from those, and save those for after the regular extract-and-search loop. (Or somehow insert them at the tail of the list.) This would be hard if the major x minor capture set is somewhere on a stack, but you don't know where.

To handle the linked list, the extraction loop becomes:

Code: Select all

uint64_t todo = undo.sets[undo.SP];
do {
  int bit = BSF(todo);
  int captureNr = undo.SP + 8*bit;
  int nextSP = next[undo.SP]; // prefetch next capture set
  int nextSet = sets[nextSP];
  todo &= todo - 1;         // clear extracted capture from set
  undo.SP = (todo ? undo.SP : nextSP); // seamlessly glue sets
  todo = (todo ? todo : nextSet);
  undo.piece = capt2piece[captureNr];   // decode move
  undo.victim = capt2victim[captureNr];
  undo.from = location[undo.piece];
  undo.to = location[undo.victim];
  ... // make, search, unmake, minimax (or prune) code from previous posting
} while(undo.SP);

hgm · Post by **hgm** » Sat Apr 24, 2021 9:07 am

I did not make much progress yet rewriting the move generation based on a new piece numbering sysyem. One thing that makes it hard to progress is that the compiler I am using (gcc on Linux) loves branches very much. Or at least, hates conditional move instructions. It is really next to impossible to write code that doesn't use branch instructions on compilation. E.g. x = (a == b ? x : y);, which could be compiled (when all involved variables are already in registers) to the two simple instructions cmp %a, %b; cmovne %y, %x, will instead result in a jne to some code elsewhere that moves %y to %x, and then jumps back into the main control flow. That seems already undesirable from the POV of the number of instructions that have to be executed. But even that this introduces extra branches, which can be mispredicted, is seems a disastrous choice. (It also has a rather funny way to compile expressions x = (c ? b : 0);. It does use a branch-free method there by subtracting 1 from c to create a carry if c == 0, then subtract a reister from itself with borrow (sbb) to create -1 (=0xFFFFFFFF) if the carry was set, and 0 otherwise, then negate that, and finally & this with b. That takes many instructions where a simple cmov would have sufficed.)

Yet when I hand-edit the assembler produced by the compiler to use the conditional move, I hardly see any speedup. Apparently the prediction of these branches is very good. (But that seems pure luck on the part of the compiler; it could not possibly have known that without a profile-guided optimization.) Note that these CPUs do not predict branches on an individual basis, but take the global history of recently executed branches into account. So if two branches act on unpredictable data, but correlate with each other, only the first of the two will be frequently mispredicted, as for the prediction of the second it will predict what the other actually did. Eliminating the first branch will then just move the unpredictability problem to the second, and not reduce the number of mispredictions.

This could be the problem here; the difficult part could be predicting whether a piece is a slider or a leaper, and once you know that you can make a good guess whether its loop over moves will need 2, 4, or 8 iterations. Merging the loop over moves for sliders and leapers, so that you do not need the branch to separate their treatment, will then just displace the pain to the looping branch. Something like that must have thwarted my attempt to achieve a speedup through such uniform treatment of all piece types. Leapers need the data flow:

moveIndex -> boardStep(=moveList[moveIndex]) -> toSqr(=fromSqr+boardStep) -> victim(=board[toSqr]) -> attackersSet(=attackers[victim]),

while sliders need a lookup (in the neighbor table) instead of a calculation to get their to-square:

moveIndex -> direction(=moveList[moveIndex]) -> toSqr(=neighbor[fromSqr][direction]) -> victim(=board[toSqr]) -> attackersSet(=attackers[victim])

I combined this by always calculating the toSqr as neighbor[dummy][direction]+offset, where offset is fromSqr for leapers and 0 for sliders, while dummy is fromSqr for sliders, and some piece-type-dependent invalid square number for leapers. Where I stuffed the corresponding off-board element of the neighbor table with the board steps for that leaper type. I thought that was pretty smart (especially since the +offset should not require an extra instruction, but can be hidden in the address calculation for the board access). But it gave no speedup, because the compiler still implemented the choice between offset = 0 or fromSqr by branching based on leaper/slider classification. And even replacing that branching code by a conditional move gave me no speedup. The branchless approach is not 'unconditionally better', as we invest an extra memory load for obtaining the board steps of the leaper. The 64-bit code furthermore needs extra instructions for converting loaded or calculated 32-bit values to 64-bit, so that these can be used as pointers.

What actually did give me a speedup was combining the loops for removing the old and adding the new captures of the moved piece. These loops are nearly identical (the difference being that one of them subtracts, and the other adds the single-bit mask for the piece to the attackers sets of the victims). But they act on different data: removal uses the moves departing from the from-square, addition the moves departing from the to-square. They both loop over the same list of board steps (for leapers) or directions (for sliders), though.

These loops can be easily combined into one. E.g. for leapers:

Code: Select all

step = leaps[dir];
do {
  int victim = board[fromSqr + step];
  attackers[victim] -= bit; // remove old attack
  victim = board[toSqr + step];
  attackers[victim += bit; // add new attack
} while((step = leaps[++dir])); // 0 sentinel terminates list of leaps

If the captured piece still in board[toSqr] when this code is run, the attack on it by the moved piece will be removed from its attackers set. This set can then be inherited by the moved piece afterwards. (Which will overwrite any unwanted modification of the moved piece's attackers set by that piece seeing itself on the from-square when generating its moves from the to-square. For minimal confusion (e.g. when also using this loop to calculate the piece its mobility) it would be best to run it after the moved piece has already been cleared from board[fromSqr], though. An extra statement in the loop like mobilty += bonus[piece][victim] would then take care of the mobility contribution of captures. (Which in the simple move-counting approach to mobility would then have bonus[piece][victim] = 1 when piece and victim are real pieces of opposit color, and 0 for equal-colored pieces or empty-square and edge-guard victims.) Sliders would also need a contribution distanceMinusOne[targetSqr - fromSqr] to account for their non-captures, after the target square has been looked up in the neighbor table. (Note that this can be implemented by setting pointers p = &bonus[piece] and q = &distanceMinusOne[-fromSqr] before the loop, so that the loop only needs to do mobility += p[victim] + q[targetSqr], two double-indexed loads and two additions per move direction for sliders, while leapers only need half of that, because they don't need the q-term.)

We have to make sure that the neighbor table is also updated for evacuation of the from-square before the corresponding code for sliders is run. So that when the slider moves from the to-square are calculated, it already 'looks through' the from-square, and is attacking the piece down-stream from it, rather than blocking its own move. (The evacuated square will keep listing its former neighbors on evacuation, so it can still be used to generate slider captures departing from the from-square, or discovering slider moves that were hitting it.)

Anyway, merging the loops did give some speedup; the test search completes in 4.34 sec now on my laptop. The percentage of it spent in MovePiece() dropped from some 20% to about 15%.

hgm · Post by **hgm** » Sat Apr 24, 2021 11:27 am

Mobility

Now that I already mentioned mobility in the previous posting, let me explain my design for implementing the incremental update for that. So far all tests I have been doing were with a simple PST evaluation. As already mentioned, a quite general mobility term can be defined as SUM(bonus[piece][target]), where the sum runs over all pseudo-legal moves, target is the occupant of the to-square of these moves, and piece the piece that makes it. So I could also have written SUM(bonus[board[fromSqr]][board[toSqr]] for all moves given as (fromSqr, toSqr). Even more general would be a non-linear approach, where the bonuses come from a table indexed by the number of moves of a single piece: SUM_p(bonus[piece][SUM_m(w[target])], where SUM_p runs over all pieces, and SUM_m over all moves of that piece. Each type of target could be counted as a different number of 'move points' (usually one would just take 0 or 1), and the actual bonus for the number of move points would then come from the bonus[][] table.

The approach I had in mind would store pieceMob[piece] = SUM_m(w[target]) for each piece in the piece list. The total mobility term in the evaluation would then be SUM_p(pieceMob[piece]) or (in the more complex approach) SUM_p(bonus[pieceMob[piece]]). Both expressions are easy to update incremenally for the change in pieceMob[] for a single piece; just apply the change also to the total directly, (totalMob += new - old) or after the double lookup (totalMob += bonus[piece][new] - bonus[piece][old]), where 'new' and 'old' are values of pieceMob[piece].

In particular, when a piece gets captured, we can just subtract its contribution as specified in the piece list from the total entirely, and there is no reason to update the piece list for that. In general (e.g. when a piece is moved) we would have to calculate the new mobility, and add that to the total (after subtracting the stored value from the piece list), and put that new value in the piece list. After saving the old value, so we can restore it on unmake.

Now mobility can also change for non-moving pieces, e.g. by discovery of slider moves. The latter can be handled in the discovery loop, where we can get the number of new non-captures for a slider whose moves are discovered by distance[target - src] - distance[fromSqr - src], where fromSqr is the square evacuated by the move, src the location of the slider targeting that square, and target the new downstream target of that slider (all already calculaetd during that loop).

Now we also have to account for the change in nature of the target. So for slider discovery we would have to subtract the old bonus[slider][mover], and add bonus[slider][target], where mover is the piece that evacuated the square, and target the new downstram target. A similar correction is needed when pieces alter their capture target because it gets replaced in the process of capture. Any piece that was attacking the to-square of a move will now see another occupant there. Consequently it will have to get bonus[piece][victim] subtracted, and bonus[piece][mover] added to its pieceMob[piece] (where mover and victim are the actors in the current move). This could be the most costly part; we would have to loop through all attacks in the attackers set for the victim (both true attacks and protections) to update the pieceMob[] of those. A similar thing holds for the leaper attacks on the from-square of a move, which now turns empty. (For sliders this was already accounted for during the discovery of their moves).

All in all a lot of pieceMob[] elements could be changed in a single move: all friendly and enemy attackers of the from-square and to-square, plus the moved piece (if it was not already amongst those). And all this should be undone during unmake. This makes it likely that copy-make will be the more-efficient method for updating the pieceMob[] array. Especially since the number of moves for a single piece recorded in pieceMob[], even when somewhat weighted, is likely not exceed 255. So pieceMob[] can be an array of bytes, which can be copied for the entire set of 32 piece with just four 64-bit moves (4 load and 4 store instructions).

Mike Sherwin · Post by **Mike Sherwin** » Fri May 07, 2021 10:48 am

First let me say; I hope that you are okay!

So, are you in the finalizing stage?
Are you burned out and taking a break?
Did you determine that it was not going to work?
Maybe you are starting over from scratch?
Maybe you have decided to come out and surprise us with a fully functional engine?

This thread has 24016 views at this moment which means a lot of people are being left in the dark.

hgm · Post by **hgm** » Fri May 07, 2021 1:24 pm

I was taking a break; there are other things I have to attend to beside just this project. I do more or less have to start from scratch, with another encoding of the pieces. I am also re-organizing the code, so it will no longer be in a single file: The Search routines will be put in a separate file, that (depending on a macro flag) can be compiled either to give routines dedicated to white or to black (which need shifing in opposit direction), which then can be linked with the color-independent routines from another source file. As this makes building the executable non-trivial, I also need to make a Makefile.

But I was busy with upgrading my 'interactive diagram' web applet, and improving the appearence of the diarams posted with it on chessvariants.com. I also designed a new army for a popular chess variant, with novel pieces, and am currently running test games on my laptop to determine some piece values.

Mike Sherwin · Post by **Mike Sherwin** » Fri May 07, 2021 5:55 pm

hgm wrote: ↑Fri May 07, 2021 1:24 pm I was taking a break; there are other things I have to attend to beside just this project. I do more or less have to start from scratch, with another encoding of the pieces. I am also re-organizing the code, so it will no longer be in a single file: The Search routines will be put in a separate file, that (depending on a macro flag) can be compiled either to give routines dedicated to white or to black (which need shifing in opposit direction), which then can be linked with the color-independent routines from another source file. As this makes building the executable non-trivial, I also need to make a Makefile.

But I was busy with upgrading my 'interactive diagram' web applet, and improving the appearence of the diarams posted with it on chessvariants.com. I also designed a new army for a popular chess variant, with novel pieces, and am currently running test games on my laptop to determine some piece values.

+1

Jakob Progsch · Post by **Jakob Progsch** » Sat May 08, 2021 9:55 am

hgm wrote: ↑Sat Apr 24, 2021 9:07 am One thing that makes it hard to progress is that the compiler I am using (gcc on Linux) loves branches very much. Or at least, hates conditional move instructions. It is really next to impossible to write code that doesn't use branch instructions on compilation. E.g. x = (a == b ? x : y);, which could be compiled (when all involved variables are already in registers) to the two simple instructions cmp %a, %b; cmovne %y, %x, will instead result in a jne to some code elsewhere that moves %y to %x, and then jumps back into the main control flow. That seems already undesirable from the POV of the number of instructions that have to be executed.

Branches are weird... Most x86 architectures (and iirc also many ARM ones) can fuse conditional jumps into comparison and logic ops. So even if the assembly reads as "CMP... JNE..." most modern CPUs only execute one fused microp. While "CMP... CMOVxy..." is always two operations that add their full latency (of one or two cycles each) to the critical path. On top of that predicted branches are basically free since they don't even have to "execute" before the predicted path. Only before that path retires. So they usually add no latency at all to a predicted critical path. Which is pretty hard to beat. Between branches usually being free and even in the worst case saving you the execution of the other result they become the preferred solution most of the time.

The last part about branching away one path seems to be the main deciding factor in gcc from what I can tell.

See for example this (interactive on godbolt)

Code: Select all

uint64_t lone_ternary(uint64_t x, uint64_t y, uint64_t a, uint64_t b) {
    return x>y ? a : b; 
}

uint64_t bishopdiagonal2(uint64_t pos) {
    uint64_t x = (7-(pos&UINT64_C(7))) << 3u;
    uint64_t y = pos&UINT64_C(0x38);
    uint64_t diagonal = UINT64_C(0x0102040810204080);
    uint64_t a = diagonal << (x-y);
    uint64_t b = diagonal >> (y-x);
    
    return x>y ? a : b; 
}

uint64_t bishopdiagonal2_forced_computation(uint64_t pos, uint64_t &dummy) {
    uint64_t x = (7-(pos&UINT64_C(7))) << 3u;
    uint64_t y = pos&UINT64_C(0x38);
    uint64_t diagonal = UINT64_C(0x0102040810204080);
    uint64_t a = diagonal << (x-y);
    uint64_t b = diagonal >> (y-x);
    dummy = a+b;
    return x>y ? a : b; 
}

compiles to:

Code: Select all

lone_ternary(unsigned long, unsigned long, unsigned long, unsigned long):
        cmp     rdi, rsi
        mov     rax, rcx
        cmova   rax, rdx
        ret
bishopdiagonal2(unsigned long):
        mov     rax, rdi
        and     edi, 56
        not     rax
        sal     rax, 3
        and     eax, 56
        cmp     rax, rdi
        jbe     .L5
        sub     eax, edi
        movabs  rdi, 72624976668147840
        shlx    rax, rdi, rax
        ret
.L5:
        sub     edi, eax
        movabs  rax, 72624976668147840
        shrx    rax, rax, rdi
        ret
bishopdiagonal2_forced_computation(unsigned long, unsigned long&):
        mov     rax, rdi
        and     edi, 56
        movabs  rdx, 72624976668147840
        not     rax
        mov     ecx, edi
        sal     rax, 3
        and     eax, 56
        mov     r8d, eax
        sub     ecx, eax
        sub     r8d, edi
        cmp     rax, rdi
        shlx    r8, rdx, r8
        shrx    rdx, rdx, rcx
        lea     rcx, [r8+rdx]
        cmovbe  r8, rdx
        mov     QWORD PTR [rsi], rcx
        mov     rax, r8
        ret

In isolation if it just has the values it will use conditional moves just fine. Once there is any "diverging" computation involved in producing a and b it will use a branch. Presumably because it can skip half the computation that way. If however you force the computation of a and b anyway (by using them for that nonsensical sum up there for example) it will use CMOV again.

The mailbox trials

Re: The mailbox trials

Re: The mailbox trials

Re: The mailbox trials

Re: The mailbox trials

Re: The mailbox trials

Re: The mailbox trials

Re: The mailbox trials

Re: The mailbox trials

Re: The mailbox trials

Re: The mailbox trials