Page 1 of 2

__builtin_popcountll doesn't bring any gain

Posted: Fri Aug 28, 2020 9:15 pm
by OliverBr
Hello together,

until know I am using my own bitcount-methods:

A) for few bits:

Code: Select all

char _bitcnt(u64 bit) {
	char c = 0;
	while (bit) { bit &= (bit - 1); c++; }
	return c;
}
B) for many bits, which precalcuated an array BITC:

Code: Select all

char bitcnt (u64 n) {
     return BITC[LOW16(n)]
       +  BITC[LOW16(n >> 16)]
       +  BITC[LOW16(n >> 32)]
       +  BITC[LOW16(n >> 48)];
}
Not I replaced those methods by

Code: Select all

__builtin_popcountll()
but I didn't bring any gain on a AMD EPYC 7502P 32-Core Processor, gcc -O9. What is the reason for it?
1) There is no hardware support.
2) The compiler doesn't use hardware support.
3) The hardware popcount isn't that fast.

Who can help with answers?

Re: __builtin_popcountll doesn't bring any gain

Posted: Fri Aug 28, 2020 9:40 pm
by mhouppin
The POPCNT instruction is (somehow) part of the SSE4 (Streaming SIMD Extensions 4) instruction set, so if you don't enable its use explicitly (by specifying -mpopcnt to the compiler (or -march=nehalem for Intel, -march=barcelona for AMD)), the compiler with replace the builtin popcount by its own implementation (which could actually be your implementation B).

Even doing so, do not expect an incredible speedup: for my engine the NPS difference between the non-POPCNT and the POPCNT build is around 2%.

Re: __builtin_popcountll doesn't bring any gain

Posted: Fri Aug 28, 2020 10:20 pm
by Dann Corbit
For AMD new architectures, AVX2 helps a whole lot.
I use these command line options when I compile Stockfish:
g++ -Wall -Wcast-qual -fno-exceptions -std=c++17 -fprofile-generate -Wextra -Wshadow -DNDEBUG -O3 -mtune=native -DIS_64BIT -msse -msse3 -mpopcnt -DUSE_POPCNT -DUSE_AVX2 -mavx2 -DUSE_SSE41 -msse4.1 -DUSE_SSSE3 -mssse3 -DUSE_SSE2 -msse2 -c -o benchmark.o benchmark.cpp

The macro USE_AVX2 does not do anything except embellish the compiler information string in the program.
The flag that does the heavy lifing is :
-mavx2

Try it, you'll like it.

Re: __builtin_popcountll doesn't bring any gain

Posted: Fri Aug 28, 2020 11:05 pm
by Bo Persson
OliverBr wrote: Fri Aug 28, 2020 9:15 pm Hello together,

until know I am using my own bitcount-methods:

A) for few bits:

Code: Select all

char _bitcnt(u64 bit) {
	char c = 0;
	while (bit) { bit &= (bit - 1); c++; }
	return c;
}
B) for many bits, which precalcuated an array BITC:

Code: Select all

char bitcnt (u64 n) {
     return BITC[LOW16(n)]
       +  BITC[LOW16(n >> 16)]
       +  BITC[LOW16(n >> 32)]
       +  BITC[LOW16(n >> 48)];
}
Not I replaced those methods by

Code: Select all

__builtin_popcountll()
but I didn't bring any gain on a AMD EPYC 7502P 32-Core Processor, gcc -O9. What is the reason for it?
1) There is no hardware support.
2) The compiler doesn't use hardware support.
3) The hardware popcount isn't that fast.

Who can help with answers?
or

4) The compiler recognizes version A and already generates a popcnt instruction. This is what godbolt.org produces for gcc:

Code: Select all

_bitcnt(unsigned long long):
        test    rdi, rdi
        je      .L3
        xor     eax, eax
        popcnt  rax, rdi
        ret
.L3:
        xor     eax, eax
        ret
 

Re: __builtin_popcountll doesn't bring any gain

Posted: Sat Aug 29, 2020 12:34 am
by syzygy
Bo Persson wrote: Fri Aug 28, 2020 11:05 pm 4) The compiler recognizes version A and already generates a popcnt instruction. This is what godbolt.org produces for gcc:

Code: Select all

_bitcnt(unsigned long long):
        test    rdi, rdi
        je      .L3
        xor     eax, eax
        popcnt  rax, rdi
        ret
.L3:
        xor     eax, eax
        ret
 
I tried to replicate this but did not succeed.

Re: __builtin_popcountll doesn't bring any gain

Posted: Sat Aug 29, 2020 1:44 am
by jdart
Hardware popcnt has little effect in my program. If there is any gain at all, I believe it is under 5%. YMMV.

Re: __builtin_popcountll doesn't bring any gain

Posted: Sat Aug 29, 2020 4:25 am
by Dann Corbit
syzygy wrote: Sat Aug 29, 2020 12:34 am
Bo Persson wrote: Fri Aug 28, 2020 11:05 pm 4) The compiler recognizes version A and already generates a popcnt instruction. This is what godbolt.org produces for gcc:

Code: Select all

_bitcnt(unsigned long long):
        test    rdi, rdi
        je      .L3
        xor     eax, eax
        popcnt  rax, rdi
        ret
.L3:
        xor     eax, eax
        ret
 
I tried to replicate this but did not succeed.
I put in this source code:

Code: Select all

int popCount (unsigned long long x) {
   int count = 0;
   while (x) {
       count++;
       x &= x - 1; // reset LS1B
   }
   return count;
}
I put in these compiler flags:

Code: Select all

-O3 -msse -msse3 -mpopcnt  -mavx2   -msse4.1 -mssse3 -msse2 
And I got this assembly:

Code: Select all

popCount(unsigned long long):
        xor     eax, eax
        popcnt  rax, rdi
        ret
I guess that you got your output because the compiler did not know if your chip had the popcnt instruction.

Re: __builtin_popcountll doesn't bring any gain

Posted: Sat Aug 29, 2020 12:55 pm
by thomasahle
What does the "xor eax eax" instruction do? Looks like it is setting some random register to 0?

Re: __builtin_popcountll doesn't bring any gain

Posted: Sat Aug 29, 2020 1:01 pm
by mar
thomasahle wrote: Sat Aug 29, 2020 12:55 pm What does the "xor eax eax" instruction do? Looks like it is setting some random register to 0?
it breaks false dependency issues for popcnt on some Intel CPUs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62011

Re: __builtin_popcountll doesn't bring any gain

Posted: Sat Aug 29, 2020 1:02 pm
by NaltaP312
try this in replace of _builtin_popcount

Code: Select all

static inline int getpopcnt(uint64_t b) {
  uint64_t r;
  asm("popcntq %1, %0" : "=r" (r) : "r" (b));
  return r;
}
And test the speed.
For me under under linux there is a little difference : strange but there is.