PEXT/PDEP are even slower than you think on Zen

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Gian-Carlo Pascutto
Posts: 1243
Joined: Sat Dec 13, 2008 7:00 pm

PEXT/PDEP are even slower than you think on Zen

Post by Gian-Carlo Pascutto »

https://twitter.com/trav_downs/status/1 ... 7962364928

It was already known they were slow, but I didn't see anyone point out before that how slow they get actually depends on the argument.
dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: PEXT/PDEP are even slower than you think on Zen

Post by dragontamer5788 »

Very unfortunate.

PEXT / PDEP seem like they have a huge opportunity for sliding piece attacks in bitboards. Bishop, Queen, and Rook movement can clearly be calculated more efficiently with PEXT / PDEP.

I have a Zen processor though, so I basically have to use other methodologies. Intel having a fast (single-cycle throughput) PEXT / PDEP implementation is incredibly cool. But it's impractical for my home AMD processor.
dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: PEXT/PDEP are even slower than you think on Zen

Post by dragontamer5788 »

Since I don't have access to a PEXT / PDEP machine, I figure I might as well publish my basic idea here, and maybe someone else will want to run with it.

The basic idea is that sliding-piece attacks, that is rook, queen, and bishop, may be optimized even further than magic-bitboards, using PEXT / PDEP.

The current state-of-the-art seems to be PEXT/PDEP based magic bitboard techniques, which seem to get the answer in as little as ~5 operations. Can we do better? Maybe... but only if we parallelize the operation and try to use pext / pdep to perform the rook, queen, and bishop attacks simultaneously.

---------

Consider this board state.

[d]8/8/8/R3B3/8/8/8/Q7 w - - 0 1

Let's consider where the Rook can go first. The Rook is on A5. It can potentially go to B5, C5, D5, E5, F5, G5, and H5 along the horizontal ray. Of course, the Bishop at E5 blocks it.

This can be represented in binary as [0001000], each bit representing B5, C5, ... H5 respectively. The Bishop on E5 is represented as a 1: it is a "blocker". Along this ray, we want to generate [1110000]: the squares the Rook can reach before hitting the blocker (the capture on E5 itself would be handled separately). The following code does that easily:

Code: Select all

// Takes 0b0001000 and returns 0b1110000: a "ray scan" for where a bishop
// or rook can slide. The blocker square itself is excluded.
uint8_t castRay(uint8_t blockers){
	blockers |= (blockers >> 1);
	blockers |= (blockers >> 2);
	blockers |= (blockers >> 4);
	
	return ~blockers;
}
Okay, this is obviously slower than magic bitboards. There are 7 operations here, and we haven't even PEXTed the blockers out of the bitboard or PDEPed the result back into it yet. But I think this may be faster, because we can take advantage of the parallelism in 64-bit integers and perform 4x of these calculations in parallel.

Code: Select all

// Like "Cast Ray", but perform 4x of them in parallel
uint64_t castRayx4(uint64_t blockersX4){
	blockersX4 |= (blockersX4 >> 1);
	blockersX4 |= (blockersX4 >> 2);
	blockersX4 |= (blockersX4 >> 4);
	
	return ~blockersX4;
}
The 8 bytes of blockersX4 are: [Ray0, 0x00, Ray1, 0x00, Ray2, 0x00, Ray3, 0x00]. PDEP and PEXT can pack and unpack the bits to and from this format rather easily. Since we're doing 7 operations to perform 4 ray-scans, plus 4 PEXT + 4 PDEP operations, that averages (7 + 8) / 4 = 3.75 operations per ray-scan, which beats the ~5 operations per ray that Magic Bitboards need.

If we use 256-bit AVX registers, there's potential to run as many as 16 ray-scans in parallel in 7 clock ticks.

Note: You'll have to cast a ray "Right to Left", with the following code:

Code: Select all

uint64_t castRayx4_RTL(uint64_t blockersX4){
	blockersX4 |= (blockersX4 << 1);
	blockersX4 |= (blockersX4 << 2);
	blockersX4 |= (blockersX4 << 4);
	
	return ~blockersX4;
}
The Rook in this position

[d]8/8/8/4B2R/8/8/8/8 w - - 0 1

needs to be checked right-to-left, so all the shifts are reversed in this case. Top-to-bottom and bottom-to-top rays have a similar directional issue to work out... but that's all easily handled by PEXT and PDEP.

---------

Oh... and it looks like someone else already knew about this idea. Whatever. :-( I guess I should have studied the ChessProgramming wiki a bit more before typing this up.
User avatar
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: PEXT/PDEP are even slower than you think on Zen

Post by lucasart »

Gian-Carlo Pascutto wrote: Mon Dec 09, 2019 6:06 pm https://twitter.com/trav_downs/status/1 ... 7962364928

It was already known they were slow, but I didn't see anyone point out before that how slow they get actually depends on the argument.
Yes, AMD PEXT is a disaster. I noticed it recently, running Demolito pext on AMD machines via OpenBench. Why would AMD even bother to release a hardware instruction that is (much!) slower than its software emulation (magic bitboards)?

I think this was a pure box ticking exercise for them. They couldn't come up with something decent, but wanted to tick the box BMI2 to be considered on par with Intel.

That said, AMD still beats Intel hands down if you consider NPS/money, despite being pext-less. Their real competitive edge is price.
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
Dann Corbit
Posts: 12537
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: PEXT/PDEP are even slower than you think on Zen

Post by Dann Corbit »

lucasart wrote: Tue Dec 10, 2019 1:47 am
Gian-Carlo Pascutto wrote: Mon Dec 09, 2019 6:06 pm https://twitter.com/trav_downs/status/1 ... 7962364928

It was already known they were slow, but I didn't see anyone point out before that how slow they get actually depends on the argument.
Yes, AMD PEXT is a disaster. I noticed it recently, running Demolito pext on AMD machines via OpenBench. Why would AMD even bother to release a hardware instruction that is (much!) slower than its software emulation (magic bitboards)?

I think this was a pure box ticking exercise for them. They couldn't come up with something decent, but wanted to tick the box BMI2 to be considered on par with Intel.

That said, AMD still beats Intel pants down, if you consider NPS/money, despite being pext-less. Their real competitive edge is price.
I think what they have done is shoot themselves in the foot.

Sure, the box is ticked.

But default optimized builds will use PEXT/PDEP instructions because the CPU reports them as available, even though the binaries would be faster built without those instructions.
Stockfish is now a standard benchmark in many of the tool sets used to measure CPU horsepower.
Since those binaries will be running the slow instructions, AMD makes itself look worse.

Considering the test on the AMD Ryzen Threadripper 3 3970X @3.9Ghz with 64 threads:
http://www.ipmanchess.yolasite.com/amd- ... -bench.php
It appears that with the PEXT instructions the binary is only about 81% as fast as without them:
82180052 / 101442355 ≈ 0.81

I imagine there is a similar downgrade with every other application that has PEXT operations as an important phase of the calculations.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: PEXT/PDEP are even slower than you think on Zen

Post by Joost Buijs »

Pext/Pdep is not only very handy for sliding move generation; you can also make good use of it in the evaluation function, for instance to look at pawn configurations.

For this reason alone I ordered parts for a new workstation with an Intel i9-10980XE; 18 cores will be more than enough for the things I'm doing. I've been thinking about a Xeon W-2295, which is the same processor with ECC support, but I don't want to shell out another $350 (plus extra memory costs) for ECC (which is IMHO nonsense). Unfortunately it takes approximately 2 months before the processor arrives.

Another consideration is noise, the 180W thermals from the 10980XE can be cooled down very quietly with a Noctua NH-D15, I have two systems with these running and I can hardly hear that they are turned on. For a TR3-3970X you will start at 280W (probably even more), a 64 core TR3 will be like 500W, so you'll need a large water-cooling system which is in my experience very noisy and I can't have this in my workspace.
dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: PEXT/PDEP are even slower than you think on Zen

Post by dragontamer5788 »

Joost Buijs wrote: Tue Dec 10, 2019 8:41 am Pext/Pdep is not only very handy for sliding move generation, in the evaluation function you can also make good use of it, for instance to look at pawn configurations.
Could you elaborate on that?

Most of the pawn-configuration ideas (backwards pawn, isolated pawn, etc. etc.) seem like simple masks to me, without any need of pdep or pext.

EDIT: The reason why sliding piece attacks are useful for PEXT / PDEP, is that there is a calculation that needs to be done to "slide" a piece over. Granted, this calculation is very simple ( [Blocker on B1?] [Blocker on B1 OR C1?] [Blocker on B1 OR C1 OR D1?] ... [Blocker on B1 OR C1 OR D1 OR E1 OR F1 OR G1 OR H1?]), but this is a calculation nonetheless.

PEXT / PDEP help with sliding-piece attacks because, while the calculation itself is very simple (just OR operations), the complex part is gathering the list of blocker squares. A Bishop on A1 needs to check [B2, C3, D4, E5, F6, G7, H8]. A Bishop on A8 needs to check [B7, C6, ... H1]. Etc. etc. PEXT can generate this array in one step, while PDEP puts the array back into bitboard form in one step.

All pawn-configuration ideas seem to be of the form "Friendly Pawn on LocationX", not over an "array" of locations the way sliding-piece attacks are. I'm simply not seeing where being able to "generate a bitwise array" would be useful in pawn formations.

-------------

By the way, my mental model of PEXT / PDEP is that they are highly efficient bitwise gather (pext) and bitwise scatter (pdep) operations. Any parallel programmer familiar with gather/scatter operations will naturally find applications for PEXT and PDEP.
Gian-Carlo Pascutto
Posts: 1243
Joined: Sat Dec 13, 2008 7:00 pm

Re: PEXT/PDEP are even slower than you think on Zen

Post by Gian-Carlo Pascutto »

Joost Buijs wrote: Tue Dec 10, 2019 8:41 am Another consideration is noise, the 180W thermals from the 10980XE can be cooled down very quietly with a Noctua NH-D15, I have two systems with these running and I can hardly hear that they are turned on. For a TR3-3970X you will start at 280W (probably even more), a 64 core TR3 will be like 500W, so you'll need a large water-cooling system which is in my experience very noisy and I can't have this in my workspace.
Competition for the 10980X would be the 16-core 3950X though, at 105/140W, not the (way faster) ThreadRipper. For my main use of compiling stuff (https://techgage.com/article/a-linux-pe ... n-9-3950x/) the 3950X is faster.

I probably posted about this before, but my main concern is the lack of a correctly working PMU on Zen cores: https://github.com/mozilla/rr/issues/2034
dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: PEXT/PDEP are even slower than you think on Zen

Post by dragontamer5788 »

Gian-Carlo Pascutto wrote: Tue Dec 10, 2019 6:28 pm
Joost Buijs wrote: Tue Dec 10, 2019 8:41 am Another consideration is noise, the 180W thermals from the 10980XE can be cooled down very quietly with a Noctua NH-D15, I have two systems with these running and I can hardly hear that they are turned on. For a TR3-3970X you will start at 280W (probably even more), a 64 core TR3 will be like 500W, so you'll need a large water-cooling system which is in my experience very noisy and I can't have this in my workspace.
Competition for the 10980X would be the 16-core 3950X though, at 105/140W, not the (way faster) ThreadRipper. For my main use of compiling stuff (https://techgage.com/article/a-linux-pe ... n-9-3950x/) the 3950X is faster.
The 4x16MB of L3 cache (64MB aggregate) on the 3950X is probably a major contributor. The 10980X is Skylake with AVX-512, and 18 cores > 16 cores. But the 10980X has only 24.75MB of L3.
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: PEXT/PDEP are even slower than you think on Zen

Post by Joost Buijs »

dragontamer5788 wrote: Tue Dec 10, 2019 5:46 pm
Joost Buijs wrote: Tue Dec 10, 2019 8:41 am Pext/Pdep is not only very handy for sliding move generation, in the evaluation function you can also make good use of it, for instance to look at pawn configurations.
Could you elaborate on that?

Most of the pawn-configuration ideas (backwards pawn, isolated pawn, etc. etc.) seem like simple masks to me, without any need of pdep or pext.
Well, you mention the traditional way of calculating pawn configurations, which needs many steps for each pawn: doubled, backward, isolated, opposed, candidate, passed, etc. The idea is to do it in a way that looks a bit like what you do in convolutional nets: take an area of, say, 3 bits wide and 4 bits high; with 2 PEXT operations (one for each color) you get two 12-bit indices that you can combine into a single index into an array of pre-calculated or texel-tuned values. After this, shift one column or row and repeat the procedure. In this way you have to capture the 3x4 area 18 times to cover the whole board.

I've been experimenting with this, and the major drawback is that the density of pawns on the chessboard is rather low, so there were a lot of voids. Later I tried a scheme where I only capture a (3 wide, for edge pawns 2 wide) area from two rows in front to one row behind each pawn (if possible, otherwise a smaller area), and this works remarkably well. Of course you only get information about pawn structure.

At the moment I do a second calculation to get passed-pawn locations, which I need in order to calculate their interactions with pieces. It is also possible to store information about passed pawns etc. in the array of weights, but this will make the array rather large; still something I have to try.