PEXT/PDEP are even slower than you think on Zen

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: PEXT/PDEP are even slower than you think on Zen

Post by dragontamer5788 »

At the moment I do a second calculation to get passed-pawn locations, so that I can calculate their interactions with pieces.
A passed pawn actually seems like an ideal candidate for pdep. Pull all the bits where pawns could block. Ex: White has a pawn on E4. Check for black pawns on D5, E5, F5, D6, E6, F6, etc. A natural pdep (generate the D5, E5, F5... mask to AND with the opponent's pawn structure).
The idea is to do it in a way that looks a bit like what you do in convolutional nets: take an area of, say, 3 bits wide and 4 bits high; with 2 PEXT operations (one for each color) you get two 12-bit indices that you can translate into a single index into an array of pre-calculated or texel-tuned values. After this, shift one column or row and repeat the procedure.
Hmmm... got it. I wish you luck with your method. It does seem like a good application of pext/pdep. You'll need to be careful about the ordering of bits for black vs. white; a bit-reverse would probably put things in the right order, but Intel x86 doesn't have a fast bit-reverse instruction. (GPUs and ARM, ironically, have a single-cycle bit-reverse.)
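To make the blocking-mask idea concrete, here is a minimal sketch (not anyone's actual engine code; square numbering and helper names are my assumptions). It builds the "front span" mask for a white pawn directly; in real code the 64 masks would be precomputed per square, or the relevant bits scattered/gathered with PDEP/PEXT as discussed above.

```cpp
#include <cstdint>

// Squares are numbered 0 = a1 .. 63 = h8 (little-endian rank-file).
// The "front span" of a white pawn: its own file plus the two
// neighbouring files, on every rank in front of the pawn.  A white
// pawn is passed iff this mask ANDed with the black pawn bitboard
// is empty.
uint64_t white_front_span(int sq) {
    const uint64_t A_FILE = 0x0101010101010101ULL;
    const uint64_t H_FILE = 0x8080808080808080ULL;
    uint64_t file  = A_FILE << (sq % 8);                 // pawn's file
    uint64_t files = file | ((file & ~A_FILE) >> 1)      // file to the left
                          | ((file & ~H_FILE) << 1);     // file to the right
    int next_rank = sq / 8 + 1;                          // pawns never sit on rank 8
    uint64_t ahead = next_rank >= 8 ? 0 : ~0ULL << (8 * next_rank);
    return files & ahead;
}

bool is_passed_white(int sq, uint64_t black_pawns) {
    return (white_front_span(sq) & black_pawns) == 0;
}
```

For the E4 example above: a black pawn on D5, E5, F5, D6, etc. lands inside the mask and kills the passed-pawn test; a black pawn behind the pawn or on a far file does not.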
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: PEXT/PDEP are even slower than you think on Zen

Post by Joost Buijs »

dragontamer5788 wrote: Tue Dec 10, 2019 10:09 pm
At the moment I do a second calculation to get passed-pawn locations, so that I can calculate their interactions with pieces.
A passed pawn actually seems like an ideal candidate for pdep. Pull all the bits where pawns could block. Ex: White has a pawn on E4. Check for black pawns on D5, E5, F5, D6, E6, F6, etc. A natural pdep (generate the D5, E5, F5... mask to AND with the opponent's pawn structure).
The idea is to do it in a way that looks a bit like what you do in convolutional nets: take an area of, say, 3 bits wide and 4 bits high; with 2 PEXT operations (one for each color) you get two 12-bit indices that you can translate into a single index into an array of pre-calculated or texel-tuned values. After this, shift one column or row and repeat the procedure.
Hmmm... got it. I wish you luck with your method. It does seem like a good application of pext/pdep. You'll need to be careful about the ordering of bits for black vs. white; a bit-reverse would probably put things in the right order, but Intel x86 doesn't have a fast bit-reverse instruction. (GPUs and ARM, ironically, have a single-cycle bit-reverse.)
Fortunately, Intel x64 processors have a single-cycle byte-reversal instruction, BSWAP64, which effectively flips the board vertically. However, in my implementation I solve it in the routine that calculates the final index. The difficulty is that I have to translate the index from 2^(2n) to 3^n and that it has to be fast; this is doable, and speed-wise it compares favorably with the traditional implementation I used before. It is possible that my old pawn evaluator is slow (I wrote it back in 2000 or 2001) and that current implementations like the one in Stockfish are faster; this is something I don't know.
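My reading of the 2^(2n) -> 3^n translation: each window square has three states (empty, white pawn, black pawn), so the two n-bit PEXT indices over the same window fold into one base-3 index in [0, 3^n). A hedged sketch of that fold (the digit-by-digit loop is purely illustrative; a real implementation would more likely use a small lookup table per nibble or byte, and `pext64` below is a portable stand-in for the single BMI2 `_pext_u64` instruction):

```cpp
#include <cstdint>

// Portable stand-in for the BMI2 PEXT instruction (_pext_u64 on x86):
// gathers the bits of `src` selected by `mask` into the low bits.
uint64_t pext64(uint64_t src, uint64_t mask) {
    uint64_t res = 0;
    for (uint64_t bb = 1; mask; bb += bb) {
        if (src & mask & -mask) res |= bb;   // lowest mask bit set in src?
        mask &= mask - 1;                    // clear lowest set bit of the mask
    }
    return res;
}

// Fold the two binary per-color indices (same n-square window) into a
// single base-3 index: digit 0 = empty, 1 = white pawn, 2 = black pawn.
uint32_t ternary_index(uint64_t white, uint64_t black, uint64_t window) {
    uint64_t w = pext64(white, window);
    uint64_t b = pext64(black, window);
    uint32_t idx = 0, pow3 = 1;
    while (w | b) {
        idx += pow3 * ((w & 1) + 2 * (b & 1));
        w >>= 1; b >>= 1; pow3 *= 3;
    }
    return idx;
}
```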
dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: PEXT/PDEP are even slower than you think on Zen

Post by dragontamer5788 »

Joost Buijs wrote: Wed Dec 11, 2019 11:13 am
dragontamer5788 wrote: Tue Dec 10, 2019 10:09 pm
At the moment I do a second calculation to get passed-pawn locations, so that I can calculate their interactions with pieces.
A passed pawn actually seems like an ideal candidate for pdep. Pull all the bits where pawns could block. Ex: White has a pawn on E4. Check for black pawns on D5, E5, F5, D6, E6, F6, etc. A natural pdep (generate the D5, E5, F5... mask to AND with the opponent's pawn structure).
The idea is to do it in a way that looks a bit like what you do in convolutional nets: take an area of, say, 3 bits wide and 4 bits high; with 2 PEXT operations (one for each color) you get two 12-bit indices that you can translate into a single index into an array of pre-calculated or texel-tuned values. After this, shift one column or row and repeat the procedure.
Hmmm... got it. I wish you luck with your method. It does seem like a good application of pext/pdep. You'll need to be careful about the ordering of bits for black vs. white; a bit-reverse would probably put things in the right order, but Intel x86 doesn't have a fast bit-reverse instruction. (GPUs and ARM, ironically, have a single-cycle bit-reverse.)
Fortunately, Intel x64 processors have a single-cycle byte-reversal instruction, BSWAP64, which effectively flips the board vertically.
Good point. Vertical flip might be sufficient, because all other pawn properties seem like they're horizontally symmetrical. A backwards pawn to the left is identical to a backwards pawn on the right. So a complete bit-reversal may not be necessary.
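For reference, the vertical flip really is just one byte swap, since each rank occupies one byte of the bitboard (a sketch assuming GCC/Clang built-ins; MSVC has `_byteswap_uint64` instead):

```cpp
#include <cstdint>

// Reversing the 8 bytes of a bitboard mirrors the board vertically:
// rank 1 <-> rank 8, rank 2 <-> rank 7, and so on.  Files are left
// untouched, which is why a vertical flip suffices for horizontally
// symmetric pawn features.  __builtin_bswap64 compiles to one BSWAP.
uint64_t flip_vertical(uint64_t bb) {
    return __builtin_bswap64(bb);
}

// The same flip expressed on a square index: XOR 56 inverts the
// three rank bits and leaves the three file bits alone.
int flip_square(int sq) { return sq ^ 56; }
```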
However, in my implementation I solve it in the routine that calculates the final index. The difficulty is that I have to translate the index from 2^(2n) to 3^n and that it has to be fast; this is doable, and speed-wise it compares favorably with the traditional implementation I used before. It is possible that my old pawn evaluator is slow (I wrote it back in 2000 or 2001) and that current implementations like the one in Stockfish are faster; this is something I don't know.
Slowish is fine if you're bitwise-SIMD'd to 256-bit AVX... I've seen some crazy magic happen with bitwise AVX tricks.

The main "problem" is that there comes a point where it's better to just go over the bitboard backwards: one pass left-to-right (handling white pieces), and a second pass right-to-left (handling black pieces). So whatever methodology you have for calculating indices or reversing bits, it has to be faster than that second pass; otherwise, you might as well just do the simpler two-pass methodology.
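The flip-once alternative to that second pass can be sketched as follows (hedged: `eval_pawns_white` is a hypothetical placeholder scorer, not anyone's real evaluator; the point is the structure, not the numbers): write the pawn code once from White's point of view, then reuse it for Black on BSWAP-flipped bitboards, so no right-to-left pass and no bit-reversal is needed.

```cpp
#include <cstdint>

// Hypothetical white-only pawn scorer: a popcount placeholder standing
// in for a real left-to-right pawn-structure pass.
int eval_pawns_white(uint64_t own, uint64_t opp) {
    return __builtin_popcountll(own) - __builtin_popcountll(opp);
}

// One-pass-per-color alternative: instead of writing a mirrored
// right-to-left pass for Black, flip both bitboards vertically
// (one BSWAP each) and reuse the White code unchanged.
int eval_pawns(uint64_t white, uint64_t black) {
    int w = eval_pawns_white(white, black);
    int b = eval_pawns_white(__builtin_bswap64(black), __builtin_bswap64(white));
    return w - b;   // positive favors White
}
```

Whether this beats the explicit second pass is exactly the trade-off above: two BSWAPs per evaluation versus a mirrored copy of the scanning code.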