yurikvelo wrote: ↑Mon May 18, 2020 9:59 am
Still true for Zen 2.
BMI2 compiles are much slower than POPCNT
I'm still waiting for the Intel 10980XE that I want to use for a new workstation; in the meantime I bought an AMD 3970X because I didn't want to wait any longer. It's a nice processor as long as you don't overclock it with Precision Boost (otherwise it runs extremely hot). PEXT and PDEP are unusable, maybe even worse than on Zen 1. I tried to emulate PEXT in software and that runs faster than the native CPU instruction. AVX2 on the AMD is slow too, and it lacks AVX-512.
Maybe scatter-gather is not so important for a chess engine, but there are other applications in which it is very useful.
I will keep the AMD 3970X for bulk applications, but as soon as the 10980XE is readily available I will use that one for a new workstation.
A. You are correct. No Intel asm or Intel-specific stuff was allowed.
B. The reason I stopped being in SpecInt was stupidity. They decided they wanted to move Crafty to the parallel benchmarks. I told them "bad idea" and explained the non-determinism problem. They said "no problem." I replied "node counts will not match, so nobody can verify their test results are correct." They said "no problem." A month or two later, I received a call. "Crafty doesn't produce node counts that match for each run with the same data." I replied "go back and look at all the emails we swapped about this." I got a short "ahhh... that is what you were talking about. sheesh."
I tried to emulate PEXT in software and that runs faster than the native CPU instruction.
Do you have a fast implementation that you'd want to make public domain?
The software implementation is very basic and unusably slow too; the only thing that strikes me is that it runs somewhat faster than the native CPU implementation (at least for the problem I used it on, a pattern evaluation routine).
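For readers wondering what such a basic software emulation might look like, here is a minimal sketch of a serial, bit-at-a-time PEXT and PDEP in C. It is not the routine discussed in this thread (that code was not posted); the names `pext_soft` and `pdep_soft` are hypothetical, and the loop simply walks the set bits of the mask from lowest to highest.

```c
#include <stdint.h>

/* Hypothetical serial PEXT: collect the bits of src selected by mask
   and pack them into the low bits of the result. */
static uint64_t pext_soft(uint64_t src, uint64_t mask)
{
    uint64_t result = 0;
    for (uint64_t bit = 1; mask != 0; bit <<= 1) {
        uint64_t lowest = mask & (~mask + 1); /* isolate lowest set mask bit */
        if (src & lowest)
            result |= bit;                    /* write next packed position */
        mask &= mask - 1;                     /* clear that mask bit */
    }
    return result;
}

/* Hypothetical serial PDEP: spread the low bits of src out to the
   positions of the set bits in mask. */
static uint64_t pdep_soft(uint64_t src, uint64_t mask)
{
    uint64_t result = 0;
    for (uint64_t bit = 1; mask != 0; bit <<= 1) {
        uint64_t lowest = mask & (~mask + 1); /* isolate lowest set mask bit */
        if (src & bit)
            result |= lowest;                 /* deposit at the mask position */
        mask &= mask - 1;                     /* clear that mask bit */
    }
    return result;
}
```

Each loop iteration handles one set bit of the mask, so the cost grows with the population count of the mask. Whether a loop like this beats the native instruction on a given CPU depends on the masks used and has to be measured.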
The serial implementations of PEXT and PDEP look quite similar.
Very well possible, but I'm sure I didn't get it from CPW because I never look there.
About five years ago I got this specific algorithm from somebody who has no connection with computer chess at all and claimed to be the original author, so I wonder what its origins are.
I only meant to say that the PEXT implementation of the Zen 2 is so bad that even software emulation runs faster. It's a pity because on the AMD I have to replace PEXT with a series of masks and shifts, which performs clearly worse.
Last edited by Joost Buijs on Sat May 30, 2020 5:00 pm, edited 1 time in total.
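As an illustration of what replacing PEXT with a series of masks and shifts can mean when the mask is known in advance, here is a small sketch in C. The mask 0x0F0F and the function name `extract_0f0f` are made up for this example and are not taken from any engine in this thread.

```c
#include <stdint.h>

/* Hypothetical fixed-mask extraction, equivalent to PEXT(x, 0x0F0F):
   bits 0-3 stay in place, bits 8-11 move down to bits 4-7. */
static uint32_t extract_0f0f(uint32_t x)
{
    return (x & 0x000Fu)          /* low nibble is already in position */
         | ((x >> 4) & 0x00F0u);  /* shift bits 8-11 down by 4, then mask */
}
```

Each contiguous run of mask bits costs one shift and one AND, so this only pays off for masks with few runs; a sparse, irregular mask needs many more steps.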
The routines are so obvious with some bit-twiddling experience that I would not claim ownership.