PEXT/PDEP are even slower than you think on Zen
Posted: Mon Dec 09, 2019 6:06 pm
It was already known they were slow, but I hadn't seen anyone point out before that how slow they get actually depends on the argument.
Computer Chess Club — https://talkchess.com/
Code:

#include <stdint.h>

// Takes 0b00001000 and returns 0b11110000, performing a "ray scan" for
// where a bishop or rook can go.
uint8_t castRay(uint8_t blockers) {
    blockers |= (blockers >> 1);
    blockers |= (blockers >> 2);
    blockers |= (blockers >> 4);
    return ~blockers;
}
Code:

// Like castRay, but performs four of them in parallel.
uint64_t castRayx4(uint64_t blockersX4) {
    blockersX4 |= (blockersX4 >> 1);
    blockersX4 |= (blockersX4 >> 2);
    blockersX4 |= (blockersX4 >> 4);
    return ~blockersX4;
}
Code:

// Same as castRayx4, but scanning in the other direction (right to left).
uint64_t castRayx4_RTL(uint64_t blockersX4) {
    blockersX4 |= (blockersX4 << 1);
    blockersX4 |= (blockersX4 << 2);
    blockersX4 |= (blockersX4 << 4);
    return ~blockersX4;
}
Gian-Carlo Pascutto wrote: ↑Mon Dec 09, 2019 6:06 pm
It was already known they were slow, but I hadn't seen anyone point out before that how slow they get actually depends on the argument.

Yes, AMD PEXT is a disaster. I noticed it recently, running Demolito pext on AMD machines via OpenBench. Why would AMD even bother to release a hardware instruction that is (much!) slower than its software emulation (magic bb)?
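As an aside (my addition, not part of the thread): the software emulation meant here is magic bitboards, but even a naive bit-serial loop illustrates the problem — the microcoded PEXT on Zen 1/2 reportedly behaves much like such a serial loop, taking longer the denser the mask. A hedged sketch of a portable PEXT stand-in:

```c
#include <stdint.h>

// Hedged sketch: a portable bit-serial PEXT replacement.  One iteration
// per set bit in mask, which mirrors why Zen's microcoded PEXT gets
// slower as the mask gets denser.
uint64_t pext_soft(uint64_t x, uint64_t mask) {
    uint64_t res = 0;
    for (uint64_t bit = 1; mask != 0; bit <<= 1) {
        if (x & mask & -mask)   // test lowest remaining mask bit
            res |= bit;
        mask &= mask - 1;       // clear lowest remaining mask bit
    }
    return res;
}
```

For example, extracting the high nibble of 0xB2 through mask 0xF0 yields 0x0B.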
lucasart wrote: ↑Tue Dec 10, 2019 1:47 am
Yes, AMD PEXT is a disaster. I noticed it recently, running Demolito pext on AMD machines via OpenBench. Why would AMD even bother to release a hardware instruction that is (much!) slower than its software emulation (magic bb)?

I think what they have done is shoot themselves in the foot. This was a pure box-ticking exercise for them: they couldn't come up with something decent, but wanted to tick the BMI2 box to be considered on par with Intel.

That said, AMD still beats Intel pants down if you consider NPS/money, despite being pext-less. Their real competitive edge is price.
Joost Buijs wrote: ↑Tue Dec 10, 2019 8:41 am
Pext/Pdep is not only very handy for sliding move generation; in the evaluation function you can also make good use of it, for instance to look at pawn configurations.

Could you elaborate on that?
Joost Buijs wrote: ↑Tue Dec 10, 2019 8:41 am
Another consideration is noise. The 180W thermals of the 10980XE can be cooled very quietly with a Noctua NH-D15; I have two systems with these running and I can hardly hear that they are turned on. For a TR3-3970X you will start at 280W (probably even more), and a 64-core TR3 will be like 500W, so you'll need a large water-cooling system, which in my experience is very noisy, and I can't have that in my workspace.

Competition for the 10980X would be the 16-core 3950X though, at 105/140W, not the (way faster) Threadripper. For my main use of compiling stuff (https://techgage.com/article/a-linux-pe ... n-9-3950x/) the 3950X is faster.
Gian-Carlo Pascutto wrote: ↑Tue Dec 10, 2019 6:28 pm
Competition for the 10980X would be the 16-core 3950X though, at 105/140W, not the (way faster) Threadripper. For my main use of compiling stuff (https://techgage.com/article/a-linux-pe ... n-9-3950x/) the 3950X is faster.

The 4x16MB of L3 cache (64MB aggregate) on the 3950X is probably a major contributor. The 10980X is Skylake, has AVX-512, and 18 cores > 16 cores, but it carries only 24.75MB of L3.
dragontamer5788 wrote: ↑Tue Dec 10, 2019 5:46 pm
Could you elaborate on that?

Well, you mention the traditional way of calculating pawn configurations, which needs many steps for each pawn: doubled, backward, isolated, opposed, candidate, passed, etc. The idea is to do it in a way that looks a bit like what you do in convolutional nets: take an area of, say, 3 bits wide and 4 bits high; with 2 PEXT operations (one for each color) you get two 12-bit indices that you can translate into a single index into an array of pre-calculated or texel-tuned values. After this, shift one column or row and repeat the procedure. In this way you have to capture the 3x4 area 18 times to cover the whole board. I've been experimenting with this, and the major drawback is that the density of pawns on the chessboard is rather low, so there were a lot of voids. Later I tried a scheme where I only capture a (3 wide, for edge pawns 2 wide) area from two rows in front to one row behind each pawn (if possible, otherwise I use a smaller area), and this works remarkably well. Of course you will only get information about pawn structure. At the moment I do a second calculation to get passed-pawn locations, which I need to calculate their interactions with pieces. It is also possible to store information about passed pawns etc. in the array with weights, but this will make the array rather large; still something I have to try.
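To make the windowed idea concrete, here is a hedged sketch of my own (helper names and square numbering are assumptions, a1 = bit 0): a 3-file by 4-rank window mask, extracted into a 12-bit index with a portable PEXT stand-in:

```c
#include <stdint.h>

// Portable PEXT stand-in (hardware _pext_u64 would be used on Intel).
static uint64_t pext_soft(uint64_t x, uint64_t mask) {
    uint64_t res = 0;
    for (uint64_t bit = 1; mask != 0; bit <<= 1) {
        if (x & mask & -mask) res |= bit;
        mask &= mask - 1;
    }
    return res;
}

// Mask covering files f..f+2 and ranks r..r+3 (f in 0..5, r in 0..4);
// square numbering assumed: a1 = 0, b1 = 1, ..., h8 = 63.
static uint64_t window_mask(int f, int r) {
    uint64_t m = 0;
    for (int rr = r; rr < r + 4; rr++)
        for (int ff = f; ff < f + 3; ff++)
            m |= 1ULL << (8 * rr + ff);
    return m;
}

// 12-bit pawn-structure index for one window; do this once per color
// and combine the two indices to address the table of tuned values.
static unsigned pawn_window_index(uint64_t pawns, int f, int r) {
    return (unsigned)pext_soft(pawns, window_mask(f, r));
}
```

With pawns on b2 (bit 9) and c3 (bit 18), the a2-c5 window (f = 0, r = 1) yields index 34: bit 1 for b2 plus bit 5 for c3 within the window's bit order.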
Most of the pawn-configuration ideas (backward pawns, isolated pawns, etc.) seem like simple masks to me, with no need for pdep or pext.
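For example, isolated pawns fall out of a file fill and two shifts, no PEXT required. A hedged sketch (helper names are mine, a1 = bit 0):

```c
#include <stdint.h>

static const uint64_t FILE_A = 0x0101010101010101ULL;
static const uint64_t FILE_H = 0x8080808080808080ULL;

// Smear every set bit over its whole file.
static uint64_t file_fill(uint64_t b) {
    b |= b << 8;  b |= b << 16;  b |= b << 32;
    b |= b >> 8;  b |= b >> 16;  b |= b >> 32;
    return b;
}

// A pawn is isolated when neither adjacent file contains a friendly pawn.
static uint64_t isolated_pawns(uint64_t pawns) {
    uint64_t fills = file_fill(pawns);
    uint64_t neighbors = ((fills >> 1) & ~FILE_H)   // files to the west
                       | ((fills << 1) & ~FILE_A);  // files to the east
    return pawns & ~neighbors;
}
```

With pawns on a2 and c2, both are isolated; add a pawn on b4 and none are.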