PEXT/PDEP are even slower than you think on Zen

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Gian-Carlo Pascutto
Posts: 1243
Joined: Sat Dec 13, 2008 7:00 pm

PEXT/PDEP are even slower than you think on Zen

Post by Gian-Carlo Pascutto »

https://twitter.com/trav_downs/status/1 ... 7962364928

It was already known they were slow, but I didn't see anyone point out before that how slow they get actually depends on the argument.
dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: PEXT/PDEP are even slower than you think on Zen

Post by dragontamer5788 »

Very unfortunate.

PEXT / PDEP seem like they have a huge opportunity for sliding piece attacks in bitboards. Bishop, Queen, and Rook movement can clearly be calculated more efficiently with PEXT / PDEP.

I have a Zen processor though, so I basically have to use other methodologies. Intel having a fast (single-cycle throughput) PEXT / PDEP implementation is incredibly cool. But it's impractical for my home AMD processor.
dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: PEXT/PDEP are even slower than you think on Zen

Post by dragontamer5788 »

Since I don't have access to a PEXT / PDEP machine, I figure I might as well publish my basic idea here, and maybe someone else will want to run with it.

The basic idea is that sliding-piece attacks, that is rook, queen, and bishop, may be optimized even further than magic-bitboards, using PEXT / PDEP.

The current state-of-the-art seems to be PEXT/PDEP based magic bitboard techniques, which seem to get the answer in as little as ~5 operations. Can we do better? Maybe... but only if we parallelize the operation and try to use pext / pdep to perform the rook, queen, and bishop attacks simultaneously.

---------

Consider this board state.

[d]8/8/8/R3B3/8/8/8/Q7 w - - 0 1

Let's consider where the Rook can go first. The Rook is on A5. It can potentially go to B5, C5, D5, E5, F5, G5, and H5 along the horizontal ray. Of course, the Bishop at E5 blocks it.

This can be represented in binary as [0001000], each bit representing B5, C5, ... H5 respectively. The Bishop on E5 is represented as a 1: it is a "blocker". Along this ray, we want to generate [1110000]: the squares the Rook can reach before hitting the blocker (the capture on E5 itself would be handled separately). The following code does that easily:

Code: Select all

// Takes 0b0001000 and returns 0b1110000: a "ray scan" for where a bishop
// or rook can slide. The blocker square itself is excluded.
uint8_t castRay(uint8_t blockers){
	blockers |= (blockers >> 1);
	blockers |= (blockers >> 2);
	blockers |= (blockers >> 4);
	
	return ~blockers;
}
Okay, this is obviously slower than magic bitboards. There are 7 operations here, and we haven't even PEXTed the blockers out of the bitboard or PDEPed the result back into it yet. But I think this may be faster, because we can take advantage of the parallelism in 64-bit integers and perform 4x of these calculations in parallel.

Code: Select all

// Like "Cast Ray", but perform 4x of them in parallel
uint64_t castRayx4(uint64_t blockersX4){
	blockersX4 |= (blockersX4 >> 1);
	blockersX4 |= (blockersX4 >> 2);
	blockersX4 |= (blockersX4 >> 4);
	
	return ~blockersX4;
}
The 8 bytes of blockersX4 are: [Ray0, 0x00, Ray1, 0x00, Ray2, 0x00, Ray3, 0x00]. PDEP and PEXT can pack and unpack the bits to and from this format rather easily. Since we're doing 7 operations to perform 4 ray-scans, plus 4 PEXT + 4 PDEP operations, that averages (7 + 8) / 4 = 3.75 operations per ray-scan, which beats the ~5 operations per ray that Magic Bitboards need.

If we use 256-bit AVX registers, there's potential to run as many as 16 ray-scans in parallel in 7 clock ticks.

Note: You'll have to cast a ray "Right to Left", with the following code:

Code: Select all

uint64_t castRayx4_RTL(uint64_t blockersX4){
	blockersX4 |= (blockersX4 << 1);
	blockersX4 |= (blockersX4 << 2);
	blockersX4 |= (blockersX4 << 4);
	
	return ~blockersX4;
}
The Rook in this position

[d]8/8/8/4B2R/8/8/8/8 w - - 0 1

needs to be checked right-to-left, so all the shifts are reversed in this case. Top-to-bottom and bottom-to-top rays have a similar directional issue to work out... but that's all easily handled by PEXT and PDEP.

---------

Oh... and it looks like someone else already knew about this idea. Whatever. :-( I guess I should have studied the ChessProgramming wiki a bit more before typing this up.
User avatar
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: PEXT/PDEP are even slower than you think on Zen

Post by lucasart »

Gian-Carlo Pascutto wrote: Mon Dec 09, 2019 6:06 pm https://twitter.com/trav_downs/status/1 ... 7962364928

It was already known they were slow, but I didn't see anyone point out before that how slow they get actually depends on the argument.
Yes, AMD PEXT is a disaster. I noticed it recently, running Demolito pext on AMD machines via OpenBench. Why would AMD even bother to release a hardware instruction that is (much!) slower than its software emulation (magic bitboards)?

I think this was a pure box ticking exercise for them. They couldn't come up with something decent, but wanted to tick the box BMI2 to be considered on par with Intel.

That said, AMD still beats Intel hands down if you consider NPS/money, despite being pext-less. Their real competitive edge is price.
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
Dann Corbit
Posts: 12537
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: PEXT/PDEP are even slower than you think on Zen

Post by Dann Corbit »

lucasart wrote: Tue Dec 10, 2019 1:47 am
Gian-Carlo Pascutto wrote: Mon Dec 09, 2019 6:06 pm https://twitter.com/trav_downs/status/1 ... 7962364928

It was already known they were slow, but I didn't see anyone point out before that how slow they get actually depends on the argument.
Yes, AMD PEXT is a disaster. I noticed it recently, running Demolito pext on AMD machines via OpenBench. Why would AMD even bother to release a hardware instruction that is (much!) slower than its software emulation (magic bitboards)?

I think this was a pure box ticking exercise for them. They couldn't come up with something decent, but wanted to tick the box BMI2 to be considered on par with Intel.

That said, AMD still beats Intel pants down, if you consider NPS/money, despite being pext-less. Their real competitive edge is price.
I think what they have done is shoot themselves in the foot.

Sure, the box is ticked.

But default optimized builds will use PEXT/PDEP instructions because the CPU reports them as available, even though the binaries would be faster built without those instructions.
Stockfish is now a standard benchmark in many of the tool sets used to measure CPU horsepower.
Since those binaries will be running the slow instructions, AMD makes itself look worse.

Considering the test on the AMD Ryzen Threadripper 3 3970X @3.9Ghz with 64 threads:
http://www.ipmanchess.yolasite.com/amd- ... -bench.php
It appears that with the PEXT instructions the binary is only about 81% as fast as without them:
82180052 / 101442355 ≈ 0.81

I imagine there is a similar downgrade with every other application that has PEXT operations as an important phase of the calculations.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: PEXT/PDEP are even slower than you think on Zen

Post by Joost Buijs »

Pext/Pdep is not only very handy for sliding move generation; you can also make good use of it in the evaluation function, for instance to look at pawn configurations.

For this reason alone I ordered parts for a new workstation with an Intel i9-10980XE; 18 cores will be more than enough for the things I'm doing. I've been thinking about a Xeon W-2295, which is the same processor with ECC support, but I don't want to shell out another $350 (plus extra memory costs) for ECC (which is IMHO nonsense). Unfortunately it takes approximately 2 months before the processor arrives.

Another consideration is noise, the 180W thermals from the 10980XE can be cooled down very quietly with a Noctua NH-D15, I have two systems with these running and I can hardly hear that they are turned on. For a TR3-3970X you will start at 280W (probably even more), a 64 core TR3 will be like 500W, so you'll need a large water-cooling system which is in my experience very noisy and I can't have this in my workspace.
dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: PEXT/PDEP are even slower than you think on Zen

Post by dragontamer5788 »

Joost Buijs wrote: Tue Dec 10, 2019 8:41 am Pext/Pdep is not only very handy for sliding move generation, in the evaluation function you can also make good use of it, for instance to look at pawn configurations.
Could you elaborate on that?

Most of the pawn-configuration ideas (backwards pawn, isolated pawn, etc. etc.) seem like simple masks to me, without any need of pdep or pext.

EDIT: The reason why sliding piece attacks are useful for PEXT / PDEP, is that there is a calculation that needs to be done to "slide" a piece over. Granted, this calculation is very simple ( [Blocker on B1?] [Blocker on B1 OR C1?] [Blocker on B1 OR C1 OR D1?] ... [Blocker on B1 OR C1 OR D1 OR E1 OR F1 OR G1 OR H1?]), but this is a calculation nonetheless.

PEXT / PDEP help with sliding-piece attacks because, while the calculation itself is very simple (just OR operations), the complex part is gathering the list of blocker squares. A Bishop on A1 needs to check [B2, C3, D4, E5, F6, G7, H8]. A Bishop on A8 needs to check [B7, C6, ... H1]. Etc. etc. PEXT can generate this array in one step, while PDEP puts the array back into bitboard form in one step.

All pawn-configuration ideas seem to be of the form "Friendly Pawn on LocationX", not over an "array" of locations the way sliding-piece attacks are. I'm simply not seeing where being able to "generate a bitwise array" would be useful in pawn formations.

-------------

By the way, my mental model of PEXT / PDEP is that they are highly efficient bitwise gather (pext) and bitwise scatter (pdep) operations. Any parallel programmer familiar with gather/scatter operations will naturally find applications for PEXT and PDEP.
Gian-Carlo Pascutto
Posts: 1243
Joined: Sat Dec 13, 2008 7:00 pm

Re: PEXT/PDEP are even slower than you think on Zen

Post by Gian-Carlo Pascutto »

Joost Buijs wrote: Tue Dec 10, 2019 8:41 am Another consideration is noise, the 180W thermals from the 10980XE can be cooled down very quietly with a Noctua NH-D15, I have two systems with these running and I can hardly hear that they are turned on. For a TR3-3970X you will start at 280W (probably even more), a 64 core TR3 will be like 500W, so you'll need a large water-cooling system which is in my experience very noisy and I can't have this in my workspace.
Competition for the 10980X would be the 16-core 3950X though, at 105/140W, not the (way faster) ThreadRipper. For my main use of compiling stuff (https://techgage.com/article/a-linux-pe ... n-9-3950x/) the 3950X is faster.

I probably posted about this before, but my main concern is the lack of a correctly working PMU on Zen cores: https://github.com/mozilla/rr/issues/2034
dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: PEXT/PDEP are even slower than you think on Zen

Post by dragontamer5788 »

Gian-Carlo Pascutto wrote: Tue Dec 10, 2019 6:28 pm
Joost Buijs wrote: Tue Dec 10, 2019 8:41 am Another consideration is noise, the 180W thermals from the 10980XE can be cooled down very quietly with a Noctua NH-D15, I have two systems with these running and I can hardly hear that they are turned on. For a TR3-3970X you will start at 280W (probably even more), a 64 core TR3 will be like 500W, so you'll need a large water-cooling system which is in my experience very noisy and I can't have this in my workspace.
Competition for the 10980X would be the 16-core 3950X though, at 105/140W, not the (way faster) ThreadRipper. For my main use of compiling stuff (https://techgage.com/article/a-linux-pe ... n-9-3950x/) the 3950X is faster.
The 4x16MB of L3 cache (64MB aggregate) on the 3950X is probably a major contributor. The 10980X is Skylake with AVX-512, and 18 cores > 16 cores. But the 10980X has only 24.75MB of L3.
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: PEXT/PDEP are even slower than you think on Zen

Post by Joost Buijs »

dragontamer5788 wrote: Tue Dec 10, 2019 5:46 pm
Joost Buijs wrote: Tue Dec 10, 2019 8:41 am Pext/Pdep is not only very handy for sliding move generation, in the evaluation function you can also make good use of it, for instance to look at pawn configurations.
Could you elaborate on that?

Most of the pawn-configuration ideas (backwards pawn, isolated pawn, etc. etc.) seem like simple masks to me, without any need of pdep or pext.
Well, you mention the traditional way of calculating pawn configurations, which needs many steps for each pawn: doubled, backward, isolated, opposed, candidate, passed, etc. The idea is to do it in a way that looks a bit like what you do in convolutional nets: take an area of, say, 3 bits wide and 4 bits high; with 2 PEXT operations (one for each color) you get two 12-bit indices that you can combine into a single index into an array of pre-calculated or texel-tuned values. After this, shift one column or row and repeat the procedure. In this way you have to capture the 3x4 area 18 times to cover the whole board.

I've been experimenting with this, and the major drawback is that the density of pawns on the chessboard is rather low, so there were a lot of voids. Later I tried a scheme where I only capture a (3 wide, for edge pawns 2 wide) area from two rows in front to one row behind each pawn (if possible, otherwise a smaller area), and this works remarkably well. Of course you only get information about pawn structure.

At the moment I do a second calculation to get passed-pawn locations, which I need in order to calculate their interactions with pieces. It is also possible to store information about passed pawns etc. in the array of weights, but this will make the array rather large; still something I have to try.