VPOPCNTDQ and VBMI2

xr_a_y
Posts: 1871
Joined: Sat Nov 25, 2017 2:28 pm
Location: France

VPOPCNTDQ and VBMI2

Post by xr_a_y »

Has anyone already tried VPOPCNTDQ and/or VBMI2? What is the expected performance improvement?
mar
Posts: 2561
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: VPOPCNTDQ and VBMI2

Post by mar »

xr_a_y wrote: Tue May 04, 2021 8:46 pm Has anyone already tried VPOPCNTDQ and/or VBMI2? What is the expected performance improvement?
no idea, but I assume those are vector instructions
for example, arm64 doesn't have popcnt for general-purpose registers, so you have to move the value into a SIMD register, execute cnt on an 8-bit vector, then accumulate the per-byte results and move the sum back to a general-purpose register.

doesn't seem to me like this should be a win at all, assuming you want to do popcnt on 64-bit bitboards
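
The arm64 sequence described above (move to a vector register, cnt per byte, sum the bytes, move back) can be sketched in C with NEON intrinsics. This is only a minimal sketch, not code from anyone's engine: the name popcount64 is made up for illustration, and the non-arm64 branch assumes GCC or Clang for __builtin_popcountll.

```c
#include <stdint.h>

#if defined(__aarch64__)
#include <arm_neon.h>

/* arm64 path: no scalar popcount in the base ISA, so go through a vector register. */
static inline int popcount64(uint64_t bb) {
    uint8x8_t bytes  = vcreate_u8(bb);   /* GPR -> 64-bit vector register */
    uint8x8_t counts = vcnt_u8(bytes);   /* CNT: population count of each byte */
    return (int)vaddv_u8(counts);        /* ADDV: sum the 8 byte counts, back to a GPR */
}
#else
/* Fallback for other targets (assumption: GCC or Clang builtin is available). */
static inline int popcount64(uint64_t bb) {
    return __builtin_popcountll(bb);
}
#endif
```

For example, popcount64(0xFFFF00000000FFFFULL), the occupancy of the starting position, gives 32 on either path.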
Joost Buijs
Posts: 1565
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: VPOPCNTDQ and VBMI2

Post by Joost Buijs »

These AVX-512 instructions are only supported by a few Intel architectures, like the new Rocket Lake i9-11900K. Nowadays everybody seems to buy AMD, so there won't be many people with access to one of these processors.
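
Since only a few CPUs have these instructions, code that uses VPOPCNTDQ normally has to check for it at run time and fall back otherwise. The following is a rough sketch under several assumptions: x86-64, a reasonably recent GCC or Clang (for __builtin_cpu_supports with the "avx512vpopcntdq" string and for the target attribute), and illustrative names (popcount8, popcount8_vpopcntdq, cpu_has_vpopcntdq) that do not come from any engine. It counts eight bitboards at once, which is the case where a vector popcount has a chance of paying off.

```c
#include <stdint.h>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>
#define HAS_X86_DISPATCH 1

/* Compiled for AVX-512 regardless of the global -m flags; only call it
   after the run-time check below has succeeded. */
__attribute__((target("avx512f,avx512vpopcntdq")))
static void popcount8_vpopcntdq(const uint64_t in[8], uint64_t out[8]) {
    __m512i v = _mm512_loadu_si512((const void *)in);          /* load 8 x 64 bits */
    _mm512_storeu_si512((void *)out, _mm512_popcnt_epi64(v));  /* VPOPCNTQ per lane */
}

/* Run-time feature check using the compiler's CPU detection builtins. */
static int cpu_has_vpopcntdq(void) {
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx512f") &&
           __builtin_cpu_supports("avx512vpopcntdq");
}
#endif

/* Population count of 8 bitboards; uses VPOPCNTDQ when the CPU has it. */
void popcount8(const uint64_t in[8], uint64_t out[8]) {
#ifdef HAS_X86_DISPATCH
    if (cpu_has_vpopcntdq()) {
        popcount8_vpopcntdq(in, out);
        return;
    }
#endif
    /* Scalar fallback: one popcount per bitboard. */
    for (int i = 0; i < 8; ++i)
        out[i] = (uint64_t)__builtin_popcountll(in[i]);
}
```

A real engine would do the feature check once at startup rather than on every call; it is kept inline here only to keep the sketch short.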
Ras
Posts: 2488
Joined: Tue Aug 30, 2016 8:19 pm
Full name: Rasmus Althoff

Re: VPOPCNTDQ and VBMI2

Post by Ras »

Joost Buijs wrote: Wed May 05, 2021 7:16 pm These AVX-512 instructions are only supported by a few Intel architectures
And if you actually use these instructions, the CPU will throttle down so much that it will be a loss in anything but crafted benchmarks for dubious marketing - which is what these instructions are actually for.
Rasmus Althoff
https://www.ct800.net
Raphexon
Posts: 476
Joined: Sun Mar 17, 2019 12:00 pm
Full name: Henk Drost

Re: VPOPCNTDQ and VBMI2

Post by Raphexon »

Ras wrote: Wed May 05, 2021 7:21 pm
Joost Buijs wrote: Wed May 05, 2021 7:16 pm These AVX-512 instructions are only supported by a few Intel architectures
And if you actually use these instructions, the CPU will throttle down so much that it will be a loss in anything but crafted benchmarks for dubious marketing - which is what these instructions are actually for.
AVX-512 is actually a huge speedup if the majority of the instructions are AVX-512.
In mixed loads the throttling can indeed cause slowdowns.

I think vnni256 is fastest for SF, followed by vnni512. But it also depends on the CPU; some throttle less than others.
Joost Buijs
Posts: 1565
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: VPOPCNTDQ and VBMI2

Post by Joost Buijs »

Raphexon wrote: Thu May 06, 2021 8:18 am
Ras wrote: Wed May 05, 2021 7:21 pm
Joost Buijs wrote: Wed May 05, 2021 7:16 pm These AVX-512 instructions are only supported by a few Intel architectures
And if you actually use these instructions, the CPU will throttle down so much that it will be a loss in anything but crafted benchmarks for dubious marketing - which is what these instructions are actually for.
AVX-512 is actually a huge speedup if the majority of the instructions are AVX-512.
In mixed loads the throttling can indeed cause slowdowns.

I think vnni256 is fastest for SF, followed by vnni512. But it also depends on the CPU; some throttle less than others.
Indeed it gives some gain. I never tried it with Stockfish, but with my own engine the gain with AVX-512 is 20 to 25% compared with AVX2. This is on my i9-10980XE. When properly cooled it doesn't throttle much with AVX-512 and still runs at 3800 MHz all-core.

The 8 core i9-11900K is even better in this respect but uses a shitload of power (300W), and this is a lot for just 8 cores. An acquaintance of mine has one, it is extremely fast with AVX2 and AVX-512.
Modern Times
Posts: 3554
Joined: Thu Jun 07, 2012 11:02 pm

Re: VPOPCNTDQ and VBMI2

Post by Modern Times »

Joost Buijs wrote: Thu May 06, 2021 2:50 pm The 8 core i9-11900K is even better in this respect but uses a shitload of power (300W), and this is a lot for just 8 cores. An acquaintance of mine has one, it is extremely fast with AVX2 and AVX-512.
And yet, apart from the AVX-512 support (or even despite the lack of it), you're probably better off with the 10-core 10900K.
Joost Buijs
Posts: 1565
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: VPOPCNTDQ and VBMI2

Post by Joost Buijs »

Modern Times wrote: Thu May 06, 2021 3:09 pm
Joost Buijs wrote: Thu May 06, 2021 2:50 pm The 8 core i9-11900K is even better in this respect but uses a shitload of power (300W), and this is a lot for just 8 cores. An acquaintance of mine has one, it is extremely fast with AVX2 and AVX-512.
And yet, apart from the AVX-512 support (or even despite the lack of it), you're probably better off with the 10-core 10900K.
It depends upon what you want to do with it. If you are a programmer and want to experiment with AVX-512, you don't have many options. For running a multi-core chess engine the 10-core 10900K is probably better: it is cheaper and draws less power. Performance-wise they are more or less on par. The 10900K has 2 extra cores and the 11900K has a higher IPC, so there is not much of a difference.

If you don't need AVX-512 you are better off with the AMD 5950X. In multi-core workloads it has more than twice the speed of the 11900K, though it is more expensive.
Modern Times
Posts: 3554
Joined: Thu Jun 07, 2012 11:02 pm

Re: VPOPCNTDQ and VBMI2

Post by Modern Times »

Or indeed the 12-core 5900X