Fast capture sorting

Discussion of chess software programming and technical issues.

Moderator: Ras

hgm
Posts: 28353
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Fast capture sorting

Post by hgm »

I don't know if this has ever been discussed. But many engines sort captures by having the move generator put them in an array ('move list') together with an assigned sort key (e.g. MVV/LVA), and then, in the loop over moves, each time run through the remaining part of the list to extract the entry with the highest key, which becomes the move to search next. Something like:

Code: Select all

// generation:
moveList[nrOfMoves++] = { fromSqr, toSqr, SORTKEY(piece, victim) };

// searching:
for(k=0; k<nrOfMoves; k++) {
  move = moveList[k];
  for(i=k+1; i<nrOfMoves; i++) if(moveList[i].key > move.key) SWAP(move, moveList[i]); // extract next move
  moveList[k] = move; // in case we will make a second pass through the move list, e.g. for IID
  // search the move
  MakeMove(move.from, move.to);
  ...
}
This can be improved upon a lot if the sort key is a simple MVV/LVA. The idea is to store the generated captures in an attack map rather than a move list. In such a map each capture is represented by a single bit. The captures on the same victim are packed in the same word, and the bits within such a word (representing different attackers) are assigned such that they would extract in LVA order. The loop over moves can then simply extract the next bit from the attackers of the next victim to know what it has to search next, without additional sorting. So

Code: Select all

// generation:
  attackers[victim] |= attackBit[piece];
  
// searching
for(victim=QUEEN; victim >= LASTPAWN; victim--) { // run through opponent pieces top to bottom
  toSqr = location[victim];
  h = attackers[victim];
  while(h) { // extract attackers low to high
    int n = BSF(h);
    piece = fromTable[n];
    fromSqr = location[piece];
    // search the corresponding move
    MakeMove(fromSqr, toSqr);
    ...
    h &= h - 1;
  }
}
Note that attackBit[piece] is constant during the generation of the moves of one piece, so this already saves time during generation, compared to assigning sort keys. (!But!: attackers[] has to be initialized to 0 for all pieces.)

Note that in a mailbox move generator it can also save us the trouble of testing what actually is on the target square of a move: if we make sure that the empty squares (and possibly the boundary guards) have an element in the attackers[] array assigned to them, these act as 'dummies' that collect any non-captures (and moves that stray off board) in a place we would never look at. And the moves that hit friendly pieces would collect in their attackers[] elements, which we can ignore, or (when properly initialized to 0) could use to see whether our own pieces are protected (e.g. for evaluation purposes).
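To make the 'dummy victim' trick a bit more concrete, here is a minimal sketch of the recording step of a mailbox generator. The names board[], the dummy indices and the array sizes are assumptions for the example, not taken from any particular engine:

Code: Select all

#include <stdint.h>

enum { NR_VICTIMS = 16, OWN = 16, EMPTY_DUMMY = 32, EDGE_DUMMY = 33 };

extern int board[];           // mailbox board holding victim numbers, OWN+n for friendly pieces,
                              // EMPTY_DUMMY for empty squares and EDGE_DUMMY for the guard band
extern uint64_t attackBit[];  // one bit per attacker, assigned in LVA order
uint64_t attackers[34];       // 16 real victims + 16 own pieces + 2 dummies

void RecordMove(int piece, int toSqr)
{   // no occupancy test needed: non-captures and off-board moves simply accumulate
    // in entries that the capture search never looks at
    attackers[board[toSqr]] |= attackBit[piece];
}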

If running through all potential victims would be inefficient (e.g. because few pieces are left), we could maintain a separate bitmap 'attacked' to indicate which pieces are attacked (the bits assigned so they extract in MVV order), and would get

Code: Select all

// generation:
  attackers[victim] |= attackBit[piece]; attacked |= victimBit[victim];
  
// searching
v = attacked;
while(v) { // run through opponent pieces that are actually attacked, top to bottom
  int n = BSF(v);
  victim = victimTable[n];
  toSqr = location[victim];
  h = attackers[victim];
  do { // extract attackers low to high
    int n = BSF(h);
    piece = fromTable[n];
    fromSqr = location[piece];
    // search the corresponding move
    MakeMove(fromSqr, toSqr);
    ...
    h &= h - 1;
  } while(h);
  v &= v - 1;
}
Now the above examples don't really reproduce MVV/LVA order, because there can be multiple victims with the same value, and PxR2 should go before NxR1, even if R1 is before R2 in the victim order. This can be solved by merging attackers sets before extracting the attackers. Each merged set would contain all attackers of a certain value group, rather than just on an individual piece. Of course we have to preserve LVA ordering of the bits during the merge, so simple concatenation of the attackers[] elements of the pieces in the group would not work.

This can be solved by making the attackers[] elements overly large, and spreading out the bits in them, so that there is enough spacing between them to slip in the corresponding bits of other pieces of the same value. There are only 16 possible attackers, but if we make attackers[] an array of uint64_t, we can keep 3 unused bits between the different attackers, like:

MSB 000K000Q000R000R000B000B000N000N000P000P000P000P000P000P000P000P LSB

When we get to searching captures of minors, we can then first merge attackers[B1] + 2*attackers[B2] + 4*attackers[N1] + 8*attackers[N2], before going into an extraction loop as above, to search all captures on the minors in perfect MVV/LVA order. For captures on the Rooks we would merge two attackers[] elements, while captures of the Queen would require no merging.
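In code, the minor-group merge could look something like this (a sketch reusing the names of the earlier fragments; minorVictim[] is an assumed little lookup for decoding which of the four minors the extracted bit refers to):

Code: Select all

// captures on the minor pieces, searched from a single merged set
static const int minorVictim[4] = { B1, B2, N1, N2 };   // assumed piece numbers

uint64_t h = attackers[B1] + 2*attackers[B2] + 4*attackers[N1] + 8*attackers[N2];
while(h) {                          // extracts in LVA order over the whole value group
  int n      = BSF(h);
  int piece  = fromTable[n & ~3];   // nibble base = the bit this attacker has in an unmerged set
  int victim = minorVictim[n & 3];  // low 2 bits tell which of the four minors is captured
  MakeMove(location[piece], location[victim]);
  ...
  h &= h - 1;
}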

The captures on the Pawns are still problematic, though, because there can be 8 Pawns, and all potential captures on those will not fit in a uint64_t. We can at most merge the captures on 4 Pawns, splitting the Pawn value group into two. But then the problem returns that QxP on a Pawn of the first group would be searched before NxP on a Pawn of the second. We can at least make sure all PxP captures are searched before any others, by concatenating the P sections of the merged attacker sets of the two groups. After merging the attackers of 4 Pawns the merged set looks like

MSB KKKKQQQQRRRRRRRRBBBBBBBBNNNNNNNNPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPP LSB

By taking mergeP1 & 0xFFFFFFFFull | mergeP2 << 32 we get all PxP attacks in one set, which can go into a usual attacker extraction loop (except that it has to use a different fromTable[]). Captures of Pawns by other pieces can be merged in another uint64_t (from which we extract captures after the PxP), albeit in a more complex way. If we first clear the already-searched PxP bits from mergeP1 (mergeP1 &= ~0xFFFFFFFFull) and then calculate mergeP1 >> 16 | mergeP2 >> 32 & 0xFFFF we would get

MSB 0000000000000000KKKKQQQQRRRRRRRRBBBBBBBBNNNNNNNNBBBBBBBBNNNNNNNN LSB

where the low 16 bits (the second BBBBBBBBNNNNNNNN section) come from mergeP2. Rather than trying to fudge in the K, Q and R of mergeP2 in the right place, we could just mask away the K and Q attackers from P1 and run it through an extraction loop (which also does the RxP captures on P-group 1). The remaining captures can be collected in a third merge, as (mergeP2 >> 48 & 0xFFF) | (mergeP1 >> 44 & 0xFF000) | (mergeP2 & (0xFull << 60)), where the three terms in parentheses are

0000000000000000000000000000000000000000000000000000QQQQRRRRRRRR
00000000000000000000000000000000000000000000KKKKQQQQ000000000000
KKKK000000000000000000000000000000000000000000000000000000000000

which OR to

KKKK0000000000000000000000000000000000000000KKKKQQQQQQQQRRRRRRRR
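Collected in one place, the bookkeeping for the captures on Pawns could be sketched as follows (untested; P1..P8 are assumed piece numbers for the opponent Pawns, and each of the three sets would then go through an extraction loop like the ones above, each with its own decode tables):

Code: Select all

// merge the attacker sets of the two Pawn value groups
uint64_t mergeP1 = attackers[P1] + 2*attackers[P2] + 4*attackers[P3] + 8*attackers[P4];
uint64_t mergeP2 = attackers[P5] + 2*attackers[P6] + 4*attackers[P7] + 8*attackers[P8];

uint64_t pxp = mergeP1 & 0xFFFFFFFFull | mergeP2 << 32;    // all PxP captures, searched first
mergeP1 &= ~0xFFFFFFFFull;                                 // the PxP bits of group 1 are now done

uint64_t rest1 = (mergeP1 >> 16 | mergeP2 >> 32 & 0xFFFF)  // NxP, BxP, RxP on group 1 plus NxP, BxP on group 2,
               & ~(0xFFull << 40);                         // with the K and Q attackers of group 1 masked away

uint64_t rest2 = (mergeP2 >> 48 & 0xFFF)                   // QxP, RxP on group 2
               | (mergeP1 >> 44 & 0xFF000)                 // KxP, QxP on group 1
               | (mergeP2 & (0xFull << 60));               // KxP on group 2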

Note, however, that this juggling of merge sets (which is a kind of bulk sorting) is only needed for captures of Pawns, i.e. after all other captures have already been searched, and have had a chance to cause a cutoff. In addition, very often capturing a Pawn will be futile, so that you would have no need to search such captures at all, and would not care about their order. Finally, capturing a Pawn with a non-Pawn is very likely to be a bad capture, unless the Pawn is not protected; and in that case the LVA ordering no longer has any justification. Ordering should be based purely on SEE for these.

Finally, some thoughts about the overhead of clearing all attackers[] elements before the move generation. Obviously this overhead gets worse as we make the attacker sets larger; zeroing 16x8 bytes is more work than zeroing 16x2 bytes. There is a trade-off possible between the work of clearing the attackers[] array and the work we do when using its elements: we could scatter the bits over the word, to create the space needed for merging, only afterwards, so that initially we can use smaller words to collect the attackers. E.g. when we initially assign the bits in a 16-bit int as

MSB PPBKPPBQPPNRPPNR LSB

we could take h = attackers[victim] * 0x8000400020001ull & 0x8888888888888888ull; to space out the bits in a way suitable for merging. Although using 16-bit integers usually causes so much overhead that this will never be worth it, the same idea could be used when already storing the attacker bits with a single spacer bit in a 32-bit word, in the order P0K0P0Q0P0R0P0R0P0B0P0B0P0N0P0N0, and multiplying by 0x400000001ull (again masking with 0x8888888888888888ull). And we could do that after merging a pair (e.g. attackers[R1] + 2*attackers[R2]), to reduce the overhead of unpacking by a factor 2.
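As a quick sanity check of the multiply-and-mask spreading, a small self-contained example (the input layout is the PPBKPPBQPPNRPPNR order given above; everything else is just demonstration scaffolding):

Code: Select all

#include <stdint.h>
#include <stdio.h>

// Spread a 16-bit attacker set (order PPBKPPBQPPNRPPNR, MSB to LSB) to a 64-bit word
// with 3 spacer bits between attackers, so sets of up to 4 equal-valued victims can be merged.
uint64_t Spread(uint16_t a)
{
    return a * 0x8000400020001ull & 0x8888888888888888ull;
}

int main(void)
{
    uint16_t a = 1 << 0 | 1 << 8;   // attackers: a Rook (bit 0) and the Queen (bit 8), per the layout above
    printf("%016llx\n", (unsigned long long) Spread(a));
    return 0;                       // prints 0808000000000000: the Rook bit lands in a Rook
}                                   // nibble (bit 51), the Queen bit in the Queen nibble (bit 59)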

It would also be possible to directly use the words in the attackers[] array to contain the attackers on a pair of pieces already during collection. That would drive up the work during generation a bit, because we would have to record the attacks as

attackers[victim >> 1] |= attackBit[piece] << (victim & 1);

That would save you the work of initializing 32 bytes to zero, and the pair merging afterwards. But whether that would earn you back the extra cost of recording the captures would depend on how many captures there actually are. On a very sparsely populated board it probably would. But of course there are other methods to make things more efficient on sparsely populated boards. (Like not assigning piece numbers to pieces or piece types that are no longer present.)
tcusr
Posts: 325
Joined: Tue Aug 31, 2021 10:32 pm
Full name: tcusr

Re: Fast capture sorting

Post by tcusr »

how much improvement does this give you?
hgm
Posts: 28353
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Fast capture sorting

Post by hgm »

That obviously depends on how fast the rest of your code is. If your code spends 20% of its time on 5 different things, you will only speed up by a factor 1.22 when you make one of those 10 times faster. It would speed up by a factor 10, though, if you make all five tasks 10 times faster.

The golden rule is this: if a big improvement in one part of your engine doesn't cause a lot of speedup, it means that the other parts of your engine are way too slow.
Ras
Posts: 2696
Joined: Tue Aug 30, 2016 8:19 pm
Full name: Rasmus Althoff

Re: Fast capture sorting

Post by Ras »

Looks pretty complicated, certainly a lot more than just looping through a move list, and the question is whether that gives any measurable speedup at all. In particular because I suspect that in most cases, the highest ranked MVV/LVA move will already end the loop due to a cutoff.
Rasmus Althoff
https://www.ct800.net
Mike Sherwin
Posts: 965
Joined: Fri Aug 21, 2020 1:25 am
Location: Planet Earth, Sol system
Full name: Michael J Sherwin

Re: Fast capture sorting

Post by Mike Sherwin »

If I understand the situation correctly, all that's needed is a switch statement and a way to keep track of which cases are possible. So a u32 (or u64) can have a bit set for each case that is possible, and we just extract the bits one at a time to generate captures of that type.
hgm
Posts: 28353
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Fast capture sorting

Post by hgm »

Ras wrote: Sat Nov 05, 2022 8:17 pm Looks pretty complicated, certainly a lot more than just looping through a move list, and the question is whether that gives any measurable speedup at all. In particular because I suspect that in most cases, the highest ranked MVV/LVA move will already end the loop due to a cutoff.
That of course depends on whether the node is a cut-node or an all-node. About half the nodes are all-nodes...

And yes, doing things more efficiently often requires more complex code, although not necessarily the execution of more instructions. Dumb looping can hide a lot of execution time in a very small code section.

Looping through the move list for extracting the highest sort key might be simple, but it involves poorly predictable branching (on the result of a key comparison), which slows it down a lot. The map-based algorithm is branch-poor; the loop over victim types has a fixed number of iterations, and could be unrolled. In the simplest version that leaves only the branch at the end of the attacker extraction loop.

BTW, it occurred to me that instead of initializing the entire attackers[] array to 0 you could use the 'attacked' map to see if a victim has been attacked before, and use a conditional move instruction to initialize it if not. So instead of

attackers[victim] |= attackBit[piece]; attacked |= victimBit[victim];

you could do

h = victimBit[victim]; attackers[victim] = attackBit[piece] | (h & attacked ? attackers[victim] : 0); attacked |= victimBit[victim];

The first case would require 3x LOAD, 2x OR and 2x STORE (attackBit[piece] is a loop constant that would remain in a register), the second 3x LOAD, 2x OR, 2x STORE, 1x BIT and 1x CMOV. So it doesn't have as much extra complexity as the C source might suggest. Compare this to the number of instructions needed to calculate the sort key, merge it into the move struct, and store the move in the list, incrementing the list pointer, and you probably already have a win even in the second case.
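For reference, the two recording variants as compilable fragments (a sketch using the array names of the earlier posts; the types are assumptions, and whether the compiler really turns the ternary into a CMOV is of course not guaranteed):

Code: Select all

#include <stdint.h>

extern uint64_t attackers[];   // per-victim attacker sets
extern uint64_t attackBit[];   // loop constant for the moving piece
extern uint32_t victimBit[];   // one bit per victim, assigned in MVV order
extern uint32_t attacked;      // set of victims attacked so far

void Record1(int piece, int victim)   // attackers[] pre-cleared before generation
{
    attackers[victim] |= attackBit[piece];
    attacked          |= victimBit[victim];
}

void Record2(int piece, int victim)   // no pre-clearing; stale data is ignored via 'attacked'
{
    uint32_t h = victimBit[victim];
    attackers[victim] = attackBit[piece] | (h & attacked ? attackers[victim] : 0);
    attacked |= h;
}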
Ras
Posts: 2696
Joined: Tue Aug 30, 2016 8:19 pm
Full name: Rasmus Althoff

Re: Fast capture sorting

Post by Ras »

hgm wrote: Sat Nov 05, 2022 10:26 pmLooping through the move list for extracting the highest sort key might be simple, but it involves poorly predictable branching (on the result of a key comparison), which slows it down a lot.
I still doubt that this is an area where much improvement can be had because it doesn't take much time in the first place, which is why so many engines use this approach - even Stockfish if I'm reading the SF15 move picker code correctly. So how much speedup did you actually measure when putting that idea into one of your engines?
Rasmus Althoff
https://www.ct800.net
Mike Sherwin
Posts: 965
Joined: Fri Aug 21, 2020 1:25 am
Location: Planet Earth, Sol system
Full name: Michael J Sherwin

Re: Fast capture sorting

Post by Mike Sherwin »

Given these piece values:

P = 100
N = 290
B = 310
R = 500
Q = 900

Then the order would be:

PxQ 800
NxQ 610
BxQ 590
RxQ 400
PxR 400
PxB 210
NxR 210
PxN 190
BxR 190
NxB 20
QxQ 0
RxR 0
BxB 0
NxN 0
PxP 0
BxN -20
RxB -190
NxP -190
RxN -210
BxP -210
QxR -400
RxP -400
QxB -590
QxN -610
QxP -800

So an enum can be used:

enum { PXQ, NXQ, BXQ, RXQ, PXR, PXB, NXR, PXN, BXR, NXB, QXQ, RXR, BXB, NXN, PXP, BXN, RXB, NXP, RXN, BXP,
QXR, RXP, QXB, QXN, QXP };

The initial selector would be 1111111111111111111111111.

The deselector mask for queens would be 0001011111111101111110000.

So one & and all queen cases are disabled.

The values of the captured pieces can be increased to change the order of the cases.

The switch statement will be huge but it would be very fast!
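A minimal sketch of the mechanism (the per-case move generation is left out, the deselector masks would be ANDed into 'possible' as pieces disappear, and __builtin_ctz just stands in for the BSF used earlier in the thread):

Code: Select all

#include <stdint.h>

enum { PXQ, NXQ, BXQ, RXQ, PXR, PXB, NXR, PXN, BXR, NXB, QXQ, RXR, BXB, NXN, PXP, BXN, RXB, NXP, RXN, BXP,
       QXR, RXP, QXB, QXN, QXP };

void GenCaptures(uint32_t possible)      // selector with deselector masks already applied
{
    while (possible) {
        int c = __builtin_ctz(possible); // next possible capture type, in enum order
        switch (c) {
        case PXQ: /* generate all PxQ captures */ break;
        case NXQ: /* generate all NxQ captures */ break;
        /* ... one case per capture type ... */
        default: break;
        }
        possible &= possible - 1;
    }
}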

OOPS: The queen mask is wrong but I have no time right now to correct it.
hgm
Posts: 28353
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Fast capture sorting

Post by hgm »

This is not MVV/LVA ordering. MVV/LVA would be PxQ, NxQ, BxQ, RxQ, QxQ, PxR, NxR, BxR, RxR, (QxR), PxB, PxN, NxB, NxN, BxB, BxN, (RxB), (RxN), (QxB), (QxN), PxP, (NxP), (BxP), (RxP), (QxP). Where the parentheses indicate 'suspect captures', which might be bad when the victim is protected, and are better postponed when you have such info.

The order you sketch should be highly inferior.
Mike Sherwin
Posts: 965
Joined: Fri Aug 21, 2020 1:25 am
Location: Planet Earth, Sol system
Full name: Michael J Sherwin

Re: Fast capture sorting

Post by Mike Sherwin »

hgm wrote: Sun Nov 06, 2022 7:24 am This is not MVV/LVA ordering. MVV/LVA would be PxQ, NxQ, BxQ, RxQ, QxQ, PxR, NxR, BxR, RxR, (QxR), PxB, PxN, NxB, NxN, BxB, BxN, (RxB), (RxN), (QxB), (QxN), PxP, (NxP), (BxP), (RxP), (QxP). Where the parentheses indicate 'suspect captures', which might be bad when the victim is protected, and are better postponed when you have such info.

The order you sketch should be highly inferior.
The order was just an example. The method is the point. Did you not notice I gave an example that the order could be changed? That means the order can be any order one wants.