The serial implementations of PEXT and PDEP look quite similar.
Very well possible, but I'm sure I didn't get it from CPW because I never look there.
About five years ago I got this specific algorithm from somebody who has no connection with computer-chess at all and claimed to be the original author, so I wonder what its origins are.
I only meant to say that the PEXT implementation of the Zen 2 is so bad that even software emulation runs faster. It's a pity because on the AMD I have to replace PEXT with a series of masks and shifts, which perform clearly worse.
The routines are so obvious with some bit-twiddling experience - I would not claim ownership.
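For readers who have not seen them, here is a minimal sketch of what a serial PEXT and PDEP can look like. This is not necessarily the routine discussed above, just the common bit-by-bit formulation; the names pext_serial and pdep_serial are made up for illustration.

```c
#include <stdint.h>

/* Serial PEXT: walk the set bits of mask from low to high and pack the
   corresponding bits of src into the low bits of the result. */
static uint64_t pext_serial(uint64_t src, uint64_t mask) {
    uint64_t result = 0;
    for (uint64_t bit = 1; mask != 0; bit <<= 1) {
        uint64_t lowest = mask & (~mask + 1); /* isolate lowest set mask bit */
        if (src & lowest)
            result |= bit;
        mask &= mask - 1;                     /* clear that mask bit */
    }
    return result;
}

/* Serial PDEP: the mirror image. Walk the set bits of mask and scatter the
   low bits of src into those positions. */
static uint64_t pdep_serial(uint64_t src, uint64_t mask) {
    uint64_t result = 0;
    for (uint64_t bit = 1; mask != 0; bit <<= 1) {
        uint64_t lowest = mask & (~mask + 1);
        if (src & bit)
            result |= lowest;
        mask &= mask - 1;
    }
    return result;
}
```

The two loops differ only in which side is tested and which side is set, which is why the two routines look so alike. Each call loops once per set bit in the mask, so the cost grows with the mask's popcount.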
I agree it is a shame that Zen is so slow with pext. AMD has to spend some transistors on a fast hardware pext in one cycle!
Best regards,
Gerd
Sorry, but I didn't know that. I got it from somebody at a C++ programming forum, I think it was a few months after I built my 5960X PC somewhere late in 2014.
Recently I built a 3970X PC and was surprised to see how badly my engine performed on this machine. I use PEXT in several locations: move generation, some parts of the evaluation function, and the index calculation for my material-balance table. The engine ran (single core) about twice as slow as on my Intel 6950X PC. So I started fiddling, first with the PEXT emulation routine, which already gave somewhat better results (also very surprising BTW). Finally I had to replace everything with dedicated routines to (almost) get the old performance back.
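To illustrate what a "dedicated routine" can mean here: when the mask is a compile-time constant made of a few contiguous runs of bits, PEXT reduces to one shift and one AND per run. The sketch below is only an assumed example, not the poster's actual code; the mask EXAMPLE_MASK and the function name are hypothetical.

```c
#include <stdint.h>

/* Hypothetical constant mask with two runs: bits 8..11 and bits 16..23. */
#define EXAMPLE_MASK 0x0000000000FF0F00ULL

/* Equivalent to pext(x, EXAMPLE_MASK), written as shifts and masks:
   bits 8..11  land in result bits 0..3,
   bits 16..23 land in result bits 4..11. */
static uint64_t extract_example(uint64_t x) {
    return ((x >> 8) & 0x00FULL) | ((x >> 12) & 0xFF0ULL);
}
```

This only works when the mask is fixed and known in advance. With a variable mask (as in sliding-piece move generation) a different indexing scheme is needed.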
Maybe so, but unless it needs AVX512, I guess that my 3970x can keep up math wise with just about any IBM chip.
On the other hand, the BMI2 instructions sure would be nice for chess if they were done in silicon.
What they have done is worse than doing nothing. By that, I mean publishing through API that their hardware supports (for instance) PEXT, the compilers will generate it. That is why Stockfish builds for the native architecture STINK and SSE target builds perform well. Because AMD wanted a stupid little check box:
[x] BMI
So that people would think it is just as good as IBM hardware for that, but that lie is far, far worse than doing nothing.
They should either RIP OUT the terrible microcode they put in to "support" BMI, or they should implement it correctly in silicon. Nothing in between.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
By that, I mean publishing through API that their hardware supports (for instance) PEXT, the compilers will generate it.
There are many software vendors (proprietary software with binaries) which just ignore existence of non-Intel CPU.
They used to look for the string "GenuineIntel", but after some court decisions they were forced to check only the instruction set, not the CPU brand name.
Dann Corbit wrote: ↑Sat May 30, 2020 9:48 pm
Maybe so, but unless it needs AVX512, I guess that my 3970x can keep up math wise with just about any IBM chip.
Don't confuse Intel with IBM...
On the other hand, the BMI2 instructions sure would be nice for chess if they were done in silicon.
What they have done is worse than doing nothing. By that, I mean publishing through API that their hardware supports (for instance) PEXT, the compilers will generate it. That is why Stockfish builds for the native architecture STINK and SSE target builds perform well.
I doubt that a compiler like gcc generates pdep/pext instructions when compiling for the Ryzen architecture (in fact, I doubt that gcc ever generates them except where the code explicitly invokes them).
I do agree there is a problem though. Builds that use the pext/pdep instructions do work on Ryzen but perform horribly. It would be better if they crashed immediately so the user would know he should use another build.
I guess in theory an engine compiled to use pext/pdep could check whether it is running on Ryzen and exit with an error message...
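A rough sketch of such a check, using the __get_cpuid helpers from the <cpuid.h> header shipped with GCC and Clang. The function names are made up, and the rule "vendor AuthenticAMD, family 0x17, BMI2 present" is my assumption for identifying the Zen parts with slow PEXT/PDEP; verify it against AMD's documentation before relying on it.

```c
#include <string.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang CPUID helpers */
#endif

/* Display family: base family, plus the extended family when base is 0xF. */
static unsigned cpu_family_from_eax(unsigned eax) {
    unsigned base = (eax >> 8) & 0xF;
    unsigned ext  = (eax >> 20) & 0xFF;
    return base == 0xF ? base + ext : base;
}

/* Assumption: AMD family 0x17 reports BMI2 but runs PEXT/PDEP slowly. */
static int is_slow_pext_cpu(const char *vendor, unsigned family, int has_bmi2) {
    return has_bmi2 && strcmp(vendor, "AuthenticAMD") == 0 && family == 0x17;
}

/* Returns 1 if the current CPU matches the rule above, 0 otherwise
   (including on non-x86 builds, where there is nothing to check). */
static int pext_is_slow_here(void) {
#if defined(__x86_64__) || defined(__i386__)
    unsigned eax, ebx, ecx, edx;
    char vendor[13] = {0};

    if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx))
        return 0;
    memcpy(vendor + 0, &ebx, 4);   /* vendor string order: EBX, EDX, ECX */
    memcpy(vendor + 4, &edx, 4);
    memcpy(vendor + 8, &ecx, 4);

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;
    unsigned family = cpu_family_from_eax(eax);

    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;
    int has_bmi2 = (ebx >> 8) & 1; /* leaf 7, subleaf 0, EBX bit 8 = BMI2 */

    return is_slow_pext_cpu(vendor, family, has_bmi2);
#else
    return 0;
#endif
}
```

A pext build could call pext_is_slow_here() once at startup and either print a warning or exit with a message pointing the user to the non-pext build.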
This problem is the reason I'm not going to build a 3950X computer on a higher-end B550 board. I'd rather have a 109?0X series Intel (at least the 12 core version) on the old X299 chipset, but the 10X series CPUs are nowhere in sight in the Netherlands, last time I looked. Especially the 10980X.
Still, it's the old story with AMD.... They do something incredible, and then have some massive handicap that prevents me from even considering them. (Same with ATI... In that respect, they're good for one another.) I had one AMD/VIA based computer almost 20 years ago (Thunderbird 1400 CPU), and it took me half a year with just THAT driver for the sound card, SUCH driver for the main board and THIS setting to get it to run stable. And even then it ran incredibly hot and needed a positively huge cooler.
Since then I never had an AMD computer again.
Ryzen is tempting with its price for speed, but I refuse, because support for BMI2 and PEXT is so bad; and I know I'll eventually want to go and use those instructions.
I also wonder if Intel is going to release a successor for the X299, keeping the socket so the 10X CPUs can be used, or if they will introduce a new socket, chipset, and 11X series in 2021. In that case, the 10X series may actually turn out to be a paper launch.
syzygy wrote: ↑Mon Jun 01, 2020 2:26 am
I do agree there is a problem though. Builds that use the pext/pdep instructions do work on Ryzen but perform horribly. It would be better if they crashed immediately so the user would know he should use another build.
If you compile Stockfish using gcc with -march=znver2, it runs much slower.
Also -mtune=native is OK, but -march=native is not.
The best gcc build for my machine is to give it the SSE flags and that is all.
Though I admit I only tried it with the older version 9 compiler, so I guess it might be fixed now with version 10.
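One likely mechanism, sketched below under assumptions: gcc defines __BMI2__ whenever the target flags enable BMI2 (which -march=znver2 does), so source code that switches on that macro will pick the hardware instruction. This is not Stockfish's actual build logic. AVOID_HW_PEXT is a hypothetical opt-out macro and extract_bits a made-up name.

```c
#include <stdint.h>

/* Use the hardware instruction only when the compiler targets BMI2 and the
   builder has not opted out with -DAVOID_HW_PEXT (hypothetical macro). */
#if defined(__BMI2__) && defined(__x86_64__) && !defined(AVOID_HW_PEXT)
#include <immintrin.h>
#define HAVE_HW_PEXT 1
#else
#define HAVE_HW_PEXT 0
#endif

static inline uint64_t extract_bits(uint64_t src, uint64_t mask) {
#if HAVE_HW_PEXT
    return _pext_u64(src, mask);          /* BMI2 PEXT instruction */
#else
    uint64_t result = 0;                  /* portable serial fallback */
    for (uint64_t bit = 1; mask != 0; bit <<= 1, mask &= mask - 1)
        if (src & mask & (~mask + 1))     /* test lowest set mask bit */
            result |= bit;
    return result;
#endif
}
```

With a switch like this, a Zen 2 user could keep -march=znver2 for everything else and add -DAVOID_HW_PEXT, instead of dropping back to a plain SSE build.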