__builtin_popcountll doesn't bring any gain

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

OliverBr
Posts: 725
Joined: Tue Dec 18, 2007 9:38 pm
Location: Munich, Germany
Full name: Dr. Oliver Brausch

__builtin_popcountll doesn't bring any gain

Post by OliverBr »

Hello together,

until know I am using my own bitcount-methods:

A) for few bits:

Code: Select all

char _bitcnt(u64 bit) {
	char c = 0;
	while (bit) { bit &= (bit - 1); c++; }
	return c;
}
B) for many bits, which precalcuated an array BITC:

Code: Select all

char bitcnt (u64 n) {
     return BITC[LOW16(n)]
       +  BITC[LOW16(n >> 16)]
       +  BITC[LOW16(n >> 32)]
       +  BITC[LOW16(n >> 48)];
}
Not I replaced those methods by

Code: Select all

__builtin_popcountll()
but I didn't bring any gain on a AMD EPYC 7502P 32-Core Processor, gcc -O9. What is the reason for it?
1) There is no hardware support.
2) The compiler doesn't use hardware support.
3) The hardware popcount isn't that fast.

Who can help with answers?
Chess Engine OliThink: http://brausch.org/home/chess
OliThink GitHub:https://github.com/olithink
mhouppin
Posts: 115
Joined: Wed Feb 12, 2020 5:00 pm
Full name: Morgan Houppin

Re: __builtin_popcountll doesn't bring any gain

Post by mhouppin »

The POPCNT instruction is (somehow) part of the SSE4 (Streaming SIMD Extensions 4) instruction set, so if you don't enable its use explicitly (by specifying -mpopcnt to the compiler (or -march=nehalem for Intel, -march=barcelona for AMD)), the compiler with replace the builtin popcount by its own implementation (which could actually be your implementation B).

Even doing so, do not expect an incredible speedup: for my engine the NPS difference between the non-POPCNT and the POPCNT build is around 2%.
Dann Corbit
Posts: 12538
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: __builtin_popcountll doesn't bring any gain

Post by Dann Corbit »

For AMD new architectures, AVX2 helps a whole lot.
I use these command line options when I compile Stockfish:
g++ -Wall -Wcast-qual -fno-exceptions -std=c++17 -fprofile-generate -Wextra -Wshadow -DNDEBUG -O3 -mtune=native -DIS_64BIT -msse -msse3 -mpopcnt -DUSE_POPCNT -DUSE_AVX2 -mavx2 -DUSE_SSE41 -msse4.1 -DUSE_SSSE3 -mssse3 -DUSE_SSE2 -msse2 -c -o benchmark.o benchmark.cpp

The macro USE_AVX2 does not do anything except embellish the compiler information string in the program.
The flag that does the heavy lifing is :
-mavx2

Try it, you'll like it.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
User avatar
Bo Persson
Posts: 243
Joined: Sat Mar 11, 2006 8:31 am
Location: Malmö, Sweden
Full name: Bo Persson

Re: __builtin_popcountll doesn't bring any gain

Post by Bo Persson »

OliverBr wrote: Fri Aug 28, 2020 9:15 pm Hello together,

until know I am using my own bitcount-methods:

A) for few bits:

Code: Select all

char _bitcnt(u64 bit) {
	char c = 0;
	while (bit) { bit &= (bit - 1); c++; }
	return c;
}
B) for many bits, which precalcuated an array BITC:

Code: Select all

char bitcnt (u64 n) {
     return BITC[LOW16(n)]
       +  BITC[LOW16(n >> 16)]
       +  BITC[LOW16(n >> 32)]
       +  BITC[LOW16(n >> 48)];
}
Not I replaced those methods by

Code: Select all

__builtin_popcountll()
but I didn't bring any gain on a AMD EPYC 7502P 32-Core Processor, gcc -O9. What is the reason for it?
1) There is no hardware support.
2) The compiler doesn't use hardware support.
3) The hardware popcount isn't that fast.

Who can help with answers?
or

4) The compiler recognizes version A and already generates a popcnt instruction. This is what godbolt.org produces for gcc:

Code: Select all

_bitcnt(unsigned long long):
        test    rdi, rdi
        je      .L3
        xor     eax, eax
        popcnt  rax, rdi
        ret
.L3:
        xor     eax, eax
        ret
 
syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

Re: __builtin_popcountll doesn't bring any gain

Post by syzygy »

Bo Persson wrote: Fri Aug 28, 2020 11:05 pm 4) The compiler recognizes version A and already generates a popcnt instruction. This is what godbolt.org produces for gcc:

Code: Select all

_bitcnt(unsigned long long):
        test    rdi, rdi
        je      .L3
        xor     eax, eax
        popcnt  rax, rdi
        ret
.L3:
        xor     eax, eax
        ret
 
I tried to replicate this but did not succeed.
jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: __builtin_popcountll doesn't bring any gain

Post by jdart »

Hardware popcnt has little effect in my program. If there is any gain at all, I believe it is under 5%. YMMV.
Dann Corbit
Posts: 12538
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: __builtin_popcountll doesn't bring any gain

Post by Dann Corbit »

syzygy wrote: Sat Aug 29, 2020 12:34 am
Bo Persson wrote: Fri Aug 28, 2020 11:05 pm 4) The compiler recognizes version A and already generates a popcnt instruction. This is what godbolt.org produces for gcc:

Code: Select all

_bitcnt(unsigned long long):
        test    rdi, rdi
        je      .L3
        xor     eax, eax
        popcnt  rax, rdi
        ret
.L3:
        xor     eax, eax
        ret
 
I tried to replicate this but did not succeed.
I put in this source code:

Code: Select all

int popCount (unsigned long long x) {
   int count = 0;
   while (x) {
       count++;
       x &= x - 1; // reset LS1B
   }
   return count;
}
I put in these compiler flags:

Code: Select all

-O3 -msse -msse3 -mpopcnt  -mavx2   -msse4.1 -mssse3 -msse2 
And I got this assembly:

Code: Select all

popCount(unsigned long long):
        xor     eax, eax
        popcnt  rax, rdi
        ret
I guess that you got your output because the compiler did not know if your chip had the popcnt instruction.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
thomasahle
Posts: 94
Joined: Thu Feb 27, 2014 8:19 pm

Re: __builtin_popcountll doesn't bring any gain

Post by thomasahle »

What does the "xor eax eax" instruction do? Looks like it is setting some random register to 0?
mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: __builtin_popcountll doesn't bring any gain

Post by mar »

thomasahle wrote: Sat Aug 29, 2020 12:55 pm What does the "xor eax eax" instruction do? Looks like it is setting some random register to 0?
it breaks false dependency issues for popcnt on some Intel CPUs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62011
Martin Sedlak
NaltaP312
Posts: 56
Joined: Wed Oct 29, 2008 1:06 pm
Full name: Marc Paule

Re: __builtin_popcountll doesn't bring any gain

Post by NaltaP312 »

try this in replace of _builtin_popcount

Code: Select all

static inline int getpopcnt(uint64_t b) {
  uint64_t r;
  asm("popcntq %1, %0" : "=r" (r) : "r" (b));
  return r;
}
And test the speed.
For me under under linux there is a little difference : strange but there is.