So I was wondering if it wouldn't be better to use 32-bit integers for the attack-map elements, and expand these to 64 bits when they are used for capture extraction. This is possible with a single extra multiply, if in the 32-bit format the Pawns are interleaved with the other pieces:
Code:
pPkKpPqQpPrRpPrRpPbBpPbBpPnNpPnN a = (uint64_t) attackers[victim];
KpPqQpPrRpPrRpPbBpPbBpPnNpPnN00pPkKpPqQpPrRpPrRpPbBpPbBpPnNpPnN0 a *= (8LL<<32) + 2;
K000Q000R000R000B000B000N000N000P000P000P000P000P000P000P000P000 a &= 0x8888888888888888LL;
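As a self-contained version of the three statements above, here is a minimal sketch. The function names expand_attackers and demo_expand are made up for illustration, and it assumes the layout from the diagram (leftmost character = bit 31, so the upper-case non-Pawn bits sit at bits 0, 4, ..., 28 and the upper-case Pawn bits at bits 2, 6, ..., 30). The lower-case bits land on positions the final mask clears.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: expand a packed 32-bit attackers set
   (pPkKpPqQ..., MSB first) to K000Q000...N000P000...P000.
   One multiply makes two shifted copies (a<<35 and a<<1) that do
   not overlap, so the addition inside the multiply never carries. */
uint64_t expand_attackers(uint32_t packed)
{
    uint64_t a = packed;
    a *= (8ULL << 32) + 2;          /* copy at bit 35 + copy at bit 1 */
    a &= 0x8888888888888888ULL;     /* keep the top bit of each nibble */
    return a;
}

/* Small usage check on single bits of the packed format. */
void demo_expand(void)
{
    assert(expand_attackers(1u << 28) == 1ULL << 63);  /* K */
    assert(expand_attackers(1u << 0)  == 1ULL << 35);  /* last N */
    assert(expand_attackers(1u << 30) == 1ULL << 31);  /* first P */
    assert(expand_attackers(0xAAAAAAAAu) == 0);        /* lower-case bits dropped */
}
```

The unsigned suffix ULL is used instead of LL only to keep all arithmetic unsigned; the bit pattern of the constants is the same.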
The 64-bit attack map could be copied with 32 load and 32 store instructions, the 32-bit attack map with 16 load and 16 store instructions (all transferring 64-bit words). So we would save 32 instructions. What might even be more important is that we reduce pressure on the cache bandwidth, which is likely the bottleneck in the attack-map update.
I guess you could also expand only after merging the attackers sets on two victims:
Code:
pPkKpPqQpPrRpPrRpPbBpPbBpPnNpPnN a = (uint64_t) attackers[victim1];
pPkKpPqQpPrRpPrRpPbBpPbBpPnNpPnN b = (uint64_t) attackers[victim2];
0P0K0P0Q0P0R0P0R0P0B0P0B0P0N0P0N a &= 0x55555555;
0P0K0P0Q0P0R0P0R0P0B0P0B0P0N0P0N b &= 0x55555555;
PPKKPPQQPPRRPPRRPPBBPPBBPPNNPPNN a += 2*b;
KKPPQQPPRRPPRRPPBBPPBBPPNNPPNN00PPKKPPQQPPRRPPRRPPBBPPBBPPNNPPNN a *= (4LL<<32) + 1;
KK00QQ00RR00RR00BB00BB00NN00NN00PP00PP00PP00PP00PP00PP00PP00PP00 a &= 0xCCCCCCCCCCCCCCCCLL;
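The merge-then-expand sequence can be sketched as a function like this. The names merge_and_expand and demo_merge are hypothetical, and the same bit layout as in the diagram is assumed. The final mask is 0xCCCCCCCCCCCCCCCC, since the diagram keeps the top two bits of every nibble (KK00QQ00...).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: merge the upper-case bits of two packed
   32-bit attackers sets into adjacent bit pairs, then expand the
   result to 64 bits with one multiply and one AND. */
uint64_t merge_and_expand(uint32_t set1, uint32_t set2)
{
    uint64_t a = set1 & 0x55555555u;   /* keep upper-case bits of victim 1 */
    uint64_t b = set2 & 0x55555555u;   /* keep upper-case bits of victim 2 */
    a += 2 * b;                        /* victim 2 goes into the odd bits */
    a *= (4ULL << 32) + 1;             /* copy at bit 34 + copy at bit 0 */
    a &= 0xCCCCCCCCCCCCCCCCULL;        /* top two bits of each nibble */
    return a;
}

/* Small usage check: the K bit (bit 28) of each victim ends up in
   the top pair, the first P bit (bit 30) in the low word. */
void demo_merge(void)
{
    assert(merge_and_expand(1u << 28, 0) == 1ULL << 62);
    assert(merge_and_expand(0, 1u << 28) == 1ULL << 63);
    assert(merge_and_expand(1u << 30, 0) == 1ULL << 30);
    assert(merge_and_expand(0, 1u << 30) == 1ULL << 31);
}
```

Because the two masked sets occupy disjoint bit positions after the doubling, the += behaves like an OR and cannot carry.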
I guess one can load the two attackers sets of two victims in the same value group with a single 64-bit load, effectively concatenating them. Then you could mask out the protector bits with a single AND, and use multiply + shift to interleave them:
Code:
pPkKpPqQpPrRpPrRpPbBpPbBpPnNpPnNpPkKpPqQpPrRpPrRpPbBpPbBpPnNpPnN a = *(uint64_t*)(attackers + victim1);
0P0K0P0Q0P0R0P0R0P0B0P0B0P0N0P0N0P0K0P0Q0P0R0P0R0P0B0P0B0P0N0P0N a &= 0x5555555555555555LL;
PPKKPPQQPPRRPPRRPPBBPPBBPPNNPPNN0P0K0P0Q0P0R0P0R0P0B0P0B0P0N0P0N a *= (2LL<<32) + 1;
PPKKPPQQPPRRPPRRPPBBPPBBPPNNPPNN a >>= 32;
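A rough sketch of this variant follows. The names load_pair, interleave_pair and demo_interleave are invented for the example. It assumes the two sets are adjacent 32-bit elements of the array, and it uses memcpy for the combined load, which compilers normally turn into a single 64-bit load while avoiding alignment and aliasing trouble. Which victim ends up in which half depends on byte order; on a little-endian machine the first element is the low half, which is the one that gets doubled.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: fetch two adjacent 32-bit sets as one
   64-bit word. */
uint64_t load_pair(const uint32_t *p)
{
    uint64_t both;
    memcpy(&both, p, sizeof both);
    return both;
}

/* Hypothetical helper: interleave the upper-case bits of the two
   halves. The constant is 64-bit so the shift by 32 is well defined. */
uint32_t interleave_pair(uint64_t both)
{
    uint64_t a = both & 0x5555555555555555ULL;  /* drop lower-case bits */
    a *= (2ULL << 32) + 1;       /* high half += 2 * low half */
    return (uint32_t)(a >> 32);  /* merged set is in the high half */
}

/* Small usage check; {1, 1} reads the same in either byte order. */
void demo_interleave(void)
{
    uint32_t attackers[2] = { 1u, 1u };
    assert(interleave_pair(load_pair(attackers)) == 3u);
}
```

The result has the same PPKKPPQQ... shape as a += 2*b on two separately loaded sets, so the multiply and AND of the expansion step can follow unchanged.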
Merging two attackers sets in the 64-bit format would have taken only 4 instructions (two loads, a LEA for a += 4*b, and an AND with 0x5555555555555555LL). So the unpacking takes 2 extra instructions per pair. For 8 pairs that is still 16 instructions. So there is not much to gain here.