Incremental update of PST values slower than not incremental?

hgm · Post by **hgm** » Fri May 12, 2023 12:01 pm

Mike Sherwin wrote: ↑Fri May 12, 2023 8:29 am I use a switch on piece type because I have pseudo piece types and even pseudo move types.
Here they are enumerated.
Code: Select all
enum {
  OO, WP, WN, WB, WR, WRC, WQ, WK, WC, BP, BN, BB, BR, BRC, BQ, BK, BC,
  Wd, We, Wb, Wr, Wn, Wq, Bd, Be, Bb, Br, Bn, Bq, WS, WL, BS, BL
};
...

You could still use general code (i.e. not dedicated to a specific piece type and color) for the non-special moves, and put all case labels for such moves above that code. That should greatly improve the predictability of the branch instruction that implements the switch: it would always predict execution of that code, and it would only be wrong in the rare case the move was special. By executing different code for each piece type, the probability it would correctly predict what code is to be executed after the branch becomes very small.

You have an extra 'type' field in your move representation, which does double duty of distinguishing special moves, as well as identifying the type of the moved piece. The latter is essentially redundant information, as you could have read it from the board (like you do for ctype) just as easily as from the move.

By encoding 'virgin' pieces as a different type you essentially introduced extra 'promotions'. If you would use m->type to indicate the 'promotion piece' rather than the moving piece (as it in a sense already does for the true promotions) you could always do

board[m->ts] = m->type;

That way the KC, RC and d cases could also be grouped with the non-special moves.

It is true that the d case distinguishes itself from the non-special moves in that it also needs to set the e.p. square. There might be a more efficient solution for this, though. The move generator would know when it was generating a double push, and could set a special flag bit in the move representation when it does that, outside the 'type' field that you switch on. If the daughter node would be aware of the move that led to it, it could examine that flag at the time it asks itself whether it has to generate an e.p. capture or not; there would be no need to set it in MakeMove through a separate switch case or dedicated test. Your MakeMove code only sets epbb[ply+1] for moving a virgin Pawn, but of course it would need to clear it for all other moves. Just passing m to the daughter is just as easy as passing it epbb, and the move generator of the daughter would then test the 'double-push flag' in m to see if it must attempt to generate an e.p. capture (m->ts telling it what to capture and where). Which is probably cheaper than wondering for every diagonal Pawn move whether it goes to the e.p. square.

That only leaves castlings, promotions and e.p. captures, all extremely rare. So having separate cases for those in the switch (which would always be mispredicted when they actually occur) probably won't measurably hurt. All these cases need to do something extra (moving the Rook, clearing the e.p. square or adjusting the material key). And considering how rare they are, it would probably be a bad idea to always do these things (in a dummy way) for all other moves, just to save the branch mispredict associated with distinguishing the cases.

The We case indeed is a good way to treat e.p. capture and normal Pawn moves the same way. But this is a bit wasted on your code, as you would only get to the We case if you already know you are dealing with an e.p. capture; otherwise you would have gone to the WP case. So you might as well have written

sq = m->ts - 8;

(Note that in a color-agnostic way this could be written as sq = m->ts ^ 8;.)

Mike Sherwin · Post by **Mike Sherwin** » Fri May 12, 2023 3:35 pm

Back when branch prediction first became a thing, jmp [piecetype + edi * 4], type instructions were not predicted.
Are they now also included in branch prediction?

Actual code from Godzilla:

Code: Select all

public      c           ptmf, iptmf

iptmf       dd          0
            dd          wpm,wpm,wpm,wpm,wpm,wpm,wpm,wpm
            dd          wnm,wnm,wbm,wbm,wrm,wrm,wqm,wkm
            dd          wck,wcq,wmter
            dd          cbcam
            dd          0
            dd          bpm,bpm,bpm,bpm,bpm,bpm,bpm,bpm
            dd          bnm,bnm,bbm,bbm,brm,brm,bqm,bkm
            dd          bck,bcq,bmter
            dd          cwcam

wmg:        push        ebp
            push        edi
            push        esi
            mov         ebp,[ply]
            mov         eax,[first+ebp*4]
            mov         [lis+ebp*4],eax
            mov         edi,[nxt]
            jmp         [ptmf+edi*4]

More code for queen as example:

Code: Select all

wqmf        dd          0
            dd          wqmnd,wqmnd,wqmnd,wqmnd,wqmnd,wqmnd,wqmnd,wqmnd
            dd          wqmnd,wqmnd,wqmnd,wqmnd,wqmnd,wqmnd,wqmnd,wqmnd
            dd          0,0,0
            dd          wqmrm
            dd          0
            dd          wqmrc,wqmrc,wqmrc,wqmrc,wqmrc,wqmrc,wqmrc,wqmrc
            dd          wqmrc,wqmrc,wqmrc,wqmrc,wqmrc,wqmrc,wqmrc,wmcbki
            dd          wnxtm

wnxtm:      mov         edi,[nxt+edi*4]
            jmp         [ptmf+edi*4]
        
wqm:        mov         ecx,[ps+edi*4]
            mov         esi,[qol+ecx*4]
            movsx       ebx,[qns+esi+ecx]
            mov         edx,[brd+ebx*4]
            jmp         [wqmf+edx*4]
        
wqmrm:      mov         [tree.fsq+eax*8],cl
            mov         [tree.tsq+eax*8],bl
            mov         [tree.typ+eax*8],QMOV
            inc         eax
            movsx       ebx,[qns+esi+ebx]
            mov         edx,[brd+ebx*4]
            jmp         [wqmf+edx*4]

wqmrc:      mov         [tree.fsq+eax*8],cl
            mov         [tree.tsq+eax*8],dl
            mov         [tree.typ+eax*8],QCAP
            inc         eax
wqmnd:      movsx       ebx,[qnd+esi+ebx]
            mov         edx,[brd+ebx*4]
            jmp         [wqmf+edx*4]

Mike Sherwin · Post by **Mike Sherwin** » Fri May 12, 2023 4:25 pm

Something has changed over the years. Godzilla is the same as Carnivor except in Godzilla the move generator, MakeMove and TakeBack are in 80386 assembler. Godzilla used to be ~60% faster than Carnivor in the Bench function.

Today on 3950x, single threaded, all moves made and unmade, no tricks.
Carnivor bench 6: n/s = 52541730
Godzilla bench 6: n/s = 68146767

68146767/52541730 = 1.29 = only 29% faster.

These are not bad numbers because in Bench() all piece information is incrementally updated as in a real search.

The question though is why has Godzilla taken such a hit. Is it because indexed jmps are now included in branch prediction?

hgm · Post by **hgm** » Fri May 12, 2023 5:59 pm

If indirect branches are not predicted, that would be a very good reason never to use them. Of course there is no guarantee a compiler would actually use them for implementing a switch statement; it could very well us a tree of two-way conditional branches. To avoid such uncertainties you could assign the type codes such that you can easily distinguish those that need special treatment, (i.e. castling, e.p. capture and promotion, e.g. by assigning all of these 'type' codes larger than all the others), and those which don't. Because special moves are so rare, the prediction would go to the generic code that handles all the normal moves, which only remove the piece on m->fs, and put m->type on m->ts. The special moves could then switch on m->type, with cases only for the special moves.

Note that you could also write generic code for handling all special moves, so that you would need no further branches after the first if statement has decided the move is special. The special moves differ from the normal moves by clearing an extra board square (castling and e.p.), or adding some constant to a memory location (the board square where the Rook goes, the material key, or the incremental evaluation). So you could use m->type for those as index in a table that contains the number of a square that has to be cleared (using an invalid number for promotions to act as a dummy), a pointer to the memory location that needs to be changed, a constant to be added to that, and what the piece type is that has to appear on m->ts (as m->type now no longer encodes that).

Code: Select all

struct { int clear, promoChoice, adjustment, *adjust; } specials[...] = { ... };
 
    atype = board[m->fs];
    if(m->type > SPECIALS) {
      ptype = specials[m->type].promoChoice;
      board[specials[m->type].clear] = 0; // remove Rook for promotion, Pawn for e.p. or act on dummy for promotion
      *(specials[m->type].adjust) += specials[m->type].adjustment; // adjust eval for promotion and e.p., or Rook destination for castling
    } else ptype = m->type;

    ctype = board[m->ts];
    mat[xstm] -= value[ctype];
    pos[stm] += (pcePST[ptype][m->ts] - pcePST[atype][m->fs]);
    pos[xstm] -= pcePST[ctype][m->ts];
    piece[xstm] ^= (u64)(ctype != OO) << m->ts;
    board[m->ts] = ptype;

As for the relative speed of compiled C and assembler: I guess compilers get better at optimizing the assembly code they produce, and that CPUs are getting better at executing crappy assembly code (by optimizing it at run time when they convert it to micro-Ops in the instruction cache).

Mike Sherwin · Post by **Mike Sherwin** » Fri May 12, 2023 6:16 pm

I did a first test.

depth score time nodes pv

Original:
Begin Depth 19
19 23 2835 262270574 e2e4 d7d5 e4d5 d8d5 d2d4 e7e5 d4e5 f8b4 c1d2 d5e5 d1e2 b4d2 e1d2 b8d7 g1f3 e5e6 e2b5 e6b6 d2e1

Combining WP, WN, WB, WR, WQ and BP, BN, BB, BR, BQ
Begin Depth 19
19 23 2788 262270574 e2e4 d7d5 e4d5 d8d5 d2d4 e7e5 d4e5 f8b4 c1d2 d5e5 d1e2 b4d2 e1d2 b8d7 g1f3 e5e6 e2b5 e6b6 d2e1

28.35 - 27.88 = 0.47 seconds faster = 0.47/28 = 0.017 = 1.7% faster. Assuming the same for TakeBack then 3.4% faster

It is not an insignificant speedup! But, I'm not sure how much more can be gained. BTW 'We' only knows it is a WP on RANK5. 'We' does not know if it is an actual epc or not. This code generated it then put it in a list.

Code: Select all

      case RANK5:
        bb1 = wPawnCapts[fs] & (enemy | epbb[ply]);
        bb2 = wPawnMoves[fs] & empty;
        type = We;
        break;

Mike Sherwin · Post by **Mike Sherwin** » Fri May 12, 2023 6:22 pm

hgm wrote: ↑Fri May 12, 2023 5:59 pm

Code: Select all

struct { int clear, promoChoice, adjustment, *adjust; } specials[...] = { ... };
 
    atype = board[m->fs];
    if(m->type > SPECIALS) {
      ptype = specials[m->type].promoChoice;
      board[specials[m->type].clear] = 0; // remove Rook for promotion, Pawn for e.p. or act on dummy for promotion
      *(specials[m->type].adjust) += specials[m->type].adjustment; // adjust eval for promotion and e.p., or Rook destination for castling
    } else ptype = m->type;

    ctype = board[m->ts];
    mat[xstm] -= value[ctype];
    pos[stm] += (pcePST[ptype][m->ts] - pcePST[atype][m->fs]);
    pos[xstm] -= pcePST[ctype][m->ts];
    piece[xstm] ^= (u64)(ctype != OO) << m->ts;
    board[m->ts] = ptype;

I actually understand that!

I'll give it a try.

THANKS

hgm · Post by **hgm** » Fri May 12, 2023 6:45 pm

One tricky thing, though: when the same 'adjust' pointer is used to place the Rook on the board or to add the piece values to the incremental (PST) evaluation, it could be that there is a data-type mismatch as the board might be an array of uint8_t, while the evaluation is probably int32_t. This can be solved by applying the adjustment to the board to 4 squares simultaneously, making the adjustment zero for the bytes that you don't really want to change.

BTW, when writing the code above I forgot you also keep track of white and black bitboards. This would require a second (64-bit) adjustment in the special-moves section.

Optimizing the code for the special moves can never result in a large gain in speed, as that code is hardly ever executed.

Rebel · Post by **Rebel** » Fri May 12, 2023 7:40 pm

JVMerlino wrote: ↑Thu May 11, 2023 1:22 am Finally got around to converting the PST portion of my eval to be incrementally updated with make/unmake, rather than doing it all from scratch whenever Evaluate() is called. And my engine is now about 10% slower....

When you do a depth=x search, do the old and new version produce the same number of nodes searched?

Mike Sherwin · Post by **Mike Sherwin** » Fri May 12, 2023 7:49 pm

hgm wrote: ↑Fri May 12, 2023 6:45 pm Optimizing the code for the special moves can never result in a large gain in speed, as that code is hardly ever executed.

I agree. Therefore the generic code should be kept completely separate from the special code.

Correctly predicted branches that fall through rather than jump around are faster.

if (pieceType < SPECIAL) {
// do generic
}
else
{
// do special
}

Then it really does not matter if the special moves are handled by a switch statement or not?

hgm · Post by **hgm** » Fri May 12, 2023 8:22 pm

Mike Sherwin wrote: ↑Fri May 12, 2023 7:49 pmThen it really does not matter if the special moves are handled by a switch statement or not?

I guess it indeed would not matter much. It is still fun to see if it could be done, though.

The more important thing is to treat Pawn double pushes by the same code as non-special moves, as these aren't all that rare. In none of my engines I realized that it would be possible to do that. But calculating an e.p. square is not a goal in itself; the only thing that has to be done is somehow inform the move generator in the daughter whether it should generate e.p. captures, and if so, where. Passing the move itself would be a good way to solve the 'where' issue, and the move generator should be able to hide a flag in that move somewhere to indicate it is a double push.

Deferring the calculation of the actual e.p. square from this to the daughter's move generator, instead of doing it in MakeMove makes sense: the daughter might not get to generating that e.p. capture at all. It could fail high on null move, or it might do staged move generation, and cut off on a normally generated capture of a Queen before getting to e.p. captures.

Incremental update of PST values slower than not incremental?

Re: Incremental update of PST values slower than not incremental?

Re: Incremental update of PST values slower than not incremental?

Re: Incremental update of PST values slower than not incremental?

Re: Incremental update of PST values slower than not incremental?

Re: Incremental update of PST values slower than not incremental?

Re: Incremental update of PST values slower than not incremental?

Re: Incremental update of PST values slower than not incremental?

Re: Incremental update of PST values slower than not incremental?

Re: Incremental update of PST values slower than not incremental?

Re: Incremental update of PST values slower than not incremental?