__builtin_popcountll doesn't bring any gain
Posted: Fri Aug 28, 2020 9:15 pm
by OliverBr
Hello everyone,
until now I have been using my own bitcount methods:
A) for few bits:
Code: Select all
char _bitcnt(u64 bit) {
    char c = 0;
    while (bit) { bit &= (bit - 1); c++; }
    return c;
}
B) for many bits, using a precalculated array BITC:
Code: Select all
char bitcnt (u64 n) {
    return BITC[LOW16(n)]
         + BITC[LOW16(n >> 16)]
         + BITC[LOW16(n >> 32)]
         + BITC[LOW16(n >> 48)];
}
Now I replaced those methods with __builtin_popcountll, but it didn't bring any gain on an AMD EPYC 7502P 32-Core Processor with gcc -O9. What is the reason for it?
1) There is no hardware support.
2) The compiler doesn't use hardware support.
3) The hardware popcount isn't that fast.
Who can help with answers?
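The post does not show how BITC and LOW16 are defined. A minimal sketch of method B, assuming a 65536-entry byte table and a simple 16-bit mask (both definitions, and the helper name init_bitc, are assumptions made for illustration), could look like this:

```c
#include <stdint.h>

typedef uint64_t u64;

/* Assumed definitions: the original post does not show BITC or LOW16. */
#define LOW16(x) ((x) & 0xFFFF)
static unsigned char BITC[1 << 16];

/* Hypothetical helper: fill BITC once at startup so that
 * BITC[i] is the number of set bits in the 16-bit value i. */
static void init_bitc(void) {
    for (unsigned i = 1; i < (1u << 16); i++)
        BITC[i] = (unsigned char)(BITC[i >> 1] + (i & 1));
}

/* Method B from the post: four 16-bit table lookups. */
static int bitcnt(u64 n) {
    return BITC[LOW16(n)]
         + BITC[LOW16(n >> 16)]
         + BITC[LOW16(n >> 32)]
         + BITC[LOW16(n >> 48)];
}
```

init_bitc() has to run once before the first call to bitcnt(). The table takes 64 KiB, so how it behaves in the cache probably matters when comparing it against a hardware popcount.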
Re: __builtin_popcountll doesn't bring any gain
Posted: Fri Aug 28, 2020 9:40 pm
by mhouppin
The POPCNT instruction was introduced alongside the SSE4 (Streaming SIMD Extensions 4) instruction sets (SSE4.2 on Intel, SSE4a on AMD), so if you don't enable its use explicitly, by passing -mpopcnt to the compiler (or -march=nehalem for Intel, -march=barcelona for AMD), the compiler will replace the builtin popcount with its own software implementation (which could actually be your implementation B).
Even doing so, do not expect an incredible speedup: for my engine the NPS difference between the non-POPCNT and the POPCNT build is around 2%.
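One way to check whether a build actually has POPCNT enabled is the __POPCNT__ macro, which GCC and Clang define when -mpopcnt (or a -march value that implies it) is in effect. A small sketch; the function names are made up for illustration:

```c
#include <stdint.h>

/* Returns 1 if the compiler was told POPCNT is available
 * (e.g. via -mpopcnt), 0 otherwise. Hypothetical helper name. */
static int popcnt_enabled_at_compile_time(void) {
#ifdef __POPCNT__
    return 1;
#else
    return 0;
#endif
}

/* Thin wrapper: becomes a single popcnt instruction only when
 * POPCNT is enabled; otherwise the compiler emits a software fallback. */
static inline int popcount64(uint64_t x) {
    return __builtin_popcountll(x);
}
```

If popcnt_enabled_at_compile_time() returns 0, the builtin is not using the hardware instruction, which would match possibility 2) from the first post.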
Re: __builtin_popcountll doesn't bring any gain
Posted: Fri Aug 28, 2020 10:20 pm
by Dann Corbit
On newer AMD architectures, AVX2 helps a whole lot.
I use these command line options when I compile Stockfish:
g++ -Wall -Wcast-qual -fno-exceptions -std=c++17 -fprofile-generate -Wextra -Wshadow -DNDEBUG -O3 -mtune=native -DIS_64BIT -msse -msse3 -mpopcnt -DUSE_POPCNT -DUSE_AVX2 -mavx2 -DUSE_SSE41 -msse4.1 -DUSE_SSSE3 -mssse3 -DUSE_SSE2 -msse2 -c -o benchmark.o benchmark.cpp
The macro USE_AVX2 does not do anything except embellish the compiler information string in the program.
The flag that does the heavy lifting is:
-mavx2
Try it, you'll like it.
Re: __builtin_popcountll doesn't bring any gain
Posted: Fri Aug 28, 2020 11:05 pm
by Bo Persson
OliverBr wrote: ↑Fri Aug 28, 2020 9:15 pm
Hello everyone,
until now I have been using my own bitcount methods:
A) for few bits:
Code: Select all
char _bitcnt(u64 bit) {
    char c = 0;
    while (bit) { bit &= (bit - 1); c++; }
    return c;
}
B) for many bits, using a precalculated array BITC:
Code: Select all
char bitcnt (u64 n) {
    return BITC[LOW16(n)]
         + BITC[LOW16(n >> 16)]
         + BITC[LOW16(n >> 32)]
         + BITC[LOW16(n >> 48)];
}
Now I replaced those methods with __builtin_popcountll, but it didn't bring any gain on an AMD EPYC 7502P 32-Core Processor with gcc -O9. What is the reason for it?
1) There is no hardware support.
2) The compiler doesn't use hardware support.
3) The hardware popcount isn't that fast.
Who can help with answers?
or
4) The compiler recognizes version A and already generates a popcnt instruction. This is what godbolt.org produces for gcc:
Code: Select all
_bitcnt(unsigned long long):
        test    rdi, rdi
        je      .L3
        xor     eax, eax
        popcnt  rax, rdi
        ret
.L3:
        xor     eax, eax
        ret
Re: __builtin_popcountll doesn't bring any gain
Posted: Sat Aug 29, 2020 12:34 am
by syzygy
Bo Persson wrote: ↑Fri Aug 28, 2020 11:05 pm
4) The compiler recognizes version A and already generates a popcnt instruction. This is what godbolt.org produces for gcc:
Code: Select all
_bitcnt(unsigned long long):
        test    rdi, rdi
        je      .L3
        xor     eax, eax
        popcnt  rax, rdi
        ret
.L3:
        xor     eax, eax
        ret
I tried to replicate this but did not succeed.
Re: __builtin_popcountll doesn't bring any gain
Posted: Sat Aug 29, 2020 1:44 am
by jdart
Hardware popcnt has little effect in my program. If there is any gain at all, I believe it is under 5%. YMMV.
Re: __builtin_popcountll doesn't bring any gain
Posted: Sat Aug 29, 2020 4:25 am
by Dann Corbit
syzygy wrote: ↑Sat Aug 29, 2020 12:34 am
Bo Persson wrote: ↑Fri Aug 28, 2020 11:05 pm
4) The compiler recognizes version A and already generates a popcnt instruction. This is what godbolt.org produces for gcc:
Code: Select all
_bitcnt(unsigned long long):
        test    rdi, rdi
        je      .L3
        xor     eax, eax
        popcnt  rax, rdi
        ret
.L3:
        xor     eax, eax
        ret
I tried to replicate this but did not succeed.
I put in this source code:
Code: Select all
int popCount (unsigned long long x) {
    int count = 0;
    while (x) {
        count++;
        x &= x - 1; // reset LS1B
    }
    return count;
}
I put in these compiler flags:
Code: Select all
-O3 -msse -msse3 -mpopcnt -mavx2 -msse4.1 -mssse3 -msse2
And I got this assembly:
Code: Select all
popCount(unsigned long long):
        xor     eax, eax
        popcnt  rax, rdi
        ret
I guess that you got your output because the compiler did not know if your chip had the popcnt instruction.
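For anyone who wants to repeat the experiment locally instead of on godbolt.org, a sketch along these lines puts the loop and the builtin side by side so the generated assembly can be compared. The file name and function names are hypothetical, and whether the loop is turned into popcnt depends on the compiler version and flags:

```c
/* Possible way to inspect the output (file name is hypothetical):
 *   gcc -O3 -mpopcnt -S popcount_test.c
 * then repeat without -mpopcnt and compare the two .s files. */
#include <stdint.h>

/* The clear-lowest-set-bit loop discussed in this thread. */
int popcount_loop(uint64_t x) {
    int count = 0;
    while (x) {
        count++;
        x &= x - 1; /* reset the least significant 1 bit */
    }
    return count;
}

/* The compiler builtin, for comparison. */
int popcount_builtin(uint64_t x) {
    return __builtin_popcountll(x);
}
```

If both functions compile to the same popcnt sequence, switching from the loop to the builtin cannot change the speed, which would be possibility 4) above.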
Re: __builtin_popcountll doesn't bring any gain
Posted: Sat Aug 29, 2020 12:55 pm
by thomasahle
What does the "xor eax, eax" instruction do? It looks like it is setting some random register to 0?
Re: __builtin_popcountll doesn't bring any gain
Posted: Sat Aug 29, 2020 1:01 pm
by mar
thomasahle wrote: ↑Sat Aug 29, 2020 12:55 pm
What does the "xor eax, eax" instruction do? It looks like it is setting some random register to 0?
It breaks the false dependency on the destination register that popcnt has on some Intel CPUs:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62011
Re: __builtin_popcountll doesn't bring any gain
Posted: Sat Aug 29, 2020 1:02 pm
by NaltaP312
Try this in place of __builtin_popcountll:
Code: Select all
/* Requires an x86-64 CPU with POPCNT and GCC-style inline asm. */
static inline int getpopcnt(uint64_t b) {
    uint64_t r;
    asm("popcntq %1, %0" : "=r" (r) : "r" (b));
    return (int)r;
}
And test the speed.
For me, under Linux, there is a small difference: strange, but it is there.
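To "test the speed" in isolation from the engine, a rough timing harness might look like the sketch below. All names are hypothetical, the inputs come from a simple xorshift generator, and clock() only gives coarse CPU time. Calling through a function pointer may also hide or distort the difference, so treat the numbers as a hint, not as a measurement of NPS.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

typedef int (*popcount_fn)(uint64_t);

static int popcount_builtin(uint64_t x) { return __builtin_popcountll(x); }

static int popcount_loop(uint64_t x) {
    int c = 0;
    while (x) { x &= x - 1; c++; }
    return c;
}

/* One xorshift64 step, used only to produce varied inputs. */
static uint64_t next_input(uint64_t s) {
    s ^= s << 13;
    s ^= s >> 7;
    s ^= s << 17;
    return s;
}

/* Sum the popcounts of `iters` pseudo-random inputs; the sum is a
 * checksum that keeps the compiler from discarding the work. */
static uint64_t run_popcounts(popcount_fn fn, size_t iters) {
    uint64_t s = 0x9E3779B97F4A7C15ULL, sum = 0;
    for (size_t i = 0; i < iters; i++) {
        s = next_input(s);
        sum += (uint64_t)fn(s);
    }
    return sum;
}

/* CPU seconds spent in run_popcounts; checksum returned via pointer. */
static double seconds_for(popcount_fn fn, size_t iters, uint64_t *checksum) {
    clock_t t0 = clock();
    *checksum = run_popcounts(fn, iters);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

Calling seconds_for() with popcount_builtin and then with popcount_loop (or with getpopcnt above) for the same iteration count should give identical checksums, and the two times can then be compared.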