__builtin_popcountll doesn't bring any gain

Discussion of chess software programming and technical issues.

Moderators: Harvey Williamson, bob, hgm

Forum rules
This textbox is used to restore diagrams posted with the [d] tag before the upgrade.
OliverBr
Posts: 555
Joined: Tue Dec 18, 2007 8:38 pm
Location: Munich, Germany
Full name: Dr. Oliver Brausch
Contact:

__builtin_popcountll doesn't bring any gain

Post by OliverBr » Fri Aug 28, 2020 7:15 pm

Hello together,

until know I am using my own bitcount-methods:

A) for few bits:

Code: Select all

char _bitcnt(u64 bit) {
	char c = 0;
	while (bit) { bit &= (bit - 1); c++; }
	return c;
}
B) for many bits, which precalcuated an array BITC:

Code: Select all

char bitcnt (u64 n) {
     return BITC[LOW16(n)]
       +  BITC[LOW16(n >> 16)]
       +  BITC[LOW16(n >> 32)]
       +  BITC[LOW16(n >> 48)];
}
Not I replaced those methods by

Code: Select all

__builtin_popcountll()
but I didn't bring any gain on a AMD EPYC 7502P 32-Core Processor, gcc -O9. What is the reason for it?
1) There is no hardware support.
2) The compiler doesn't use hardware support.
3) The hardware popcount isn't that fast.

Who can help with answers?
Chess Engine OliThink: http://brausch.org/home/chess
OliThink GitHub:https://github.com/olithink

mhouppin
Posts: 67
Joined: Wed Feb 12, 2020 4:00 pm
Full name: Morgan Houppin

Re: __builtin_popcountll doesn't bring any gain

Post by mhouppin » Fri Aug 28, 2020 7:40 pm

The POPCNT instruction is (somehow) part of the SSE4 (Streaming SIMD Extensions 4) instruction set, so if you don't enable its use explicitly (by specifying -mpopcnt to the compiler (or -march=nehalem for Intel, -march=barcelona for AMD)), the compiler with replace the builtin popcount by its own implementation (which could actually be your implementation B).

Even doing so, do not expect an incredible speedup: for my engine the NPS difference between the non-POPCNT and the POPCNT build is around 2%.

Dann Corbit
Posts: 11622
Joined: Wed Mar 08, 2006 7:57 pm
Location: Redmond, WA USA
Contact:

Re: __builtin_popcountll doesn't bring any gain

Post by Dann Corbit » Fri Aug 28, 2020 8:20 pm

For AMD new architectures, AVX2 helps a whole lot.
I use these command line options when I compile Stockfish:
g++ -Wall -Wcast-qual -fno-exceptions -std=c++17 -fprofile-generate -Wextra -Wshadow -DNDEBUG -O3 -mtune=native -DIS_64BIT -msse -msse3 -mpopcnt -DUSE_POPCNT -DUSE_AVX2 -mavx2 -DUSE_SSE41 -msse4.1 -DUSE_SSSE3 -mssse3 -DUSE_SSE2 -msse2 -c -o benchmark.o benchmark.cpp

The macro USE_AVX2 does not do anything except embellish the compiler information string in the program.
The flag that does the heavy lifing is :
-mavx2

Try it, you'll like it.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.

User avatar
Bo Persson
Posts: 181
Joined: Sat Mar 11, 2006 7:31 am
Location: Malmö, Sweden
Full name: Bo Persson

Re: __builtin_popcountll doesn't bring any gain

Post by Bo Persson » Fri Aug 28, 2020 9:05 pm

OliverBr wrote:
Fri Aug 28, 2020 7:15 pm
Hello together,

until know I am using my own bitcount-methods:

A) for few bits:

Code: Select all

char _bitcnt(u64 bit) {
	char c = 0;
	while (bit) { bit &= (bit - 1); c++; }
	return c;
}
B) for many bits, which precalcuated an array BITC:

Code: Select all

char bitcnt (u64 n) {
     return BITC[LOW16(n)]
       +  BITC[LOW16(n >> 16)]
       +  BITC[LOW16(n >> 32)]
       +  BITC[LOW16(n >> 48)];
}
Not I replaced those methods by

Code: Select all

__builtin_popcountll()
but I didn't bring any gain on a AMD EPYC 7502P 32-Core Processor, gcc -O9. What is the reason for it?
1) There is no hardware support.
2) The compiler doesn't use hardware support.
3) The hardware popcount isn't that fast.

Who can help with answers?
or

4) The compiler recognizes version A and already generates a popcnt instruction. This is what godbolt.org produces for gcc:

Code: Select all

_bitcnt(unsigned long long):
        test    rdi, rdi
        je      .L3
        xor     eax, eax
        popcnt  rax, rdi
        ret
.L3:
        xor     eax, eax
        ret
 

syzygy
Posts: 4803
Joined: Tue Feb 28, 2012 10:56 pm

Re: __builtin_popcountll doesn't bring any gain

Post by syzygy » Fri Aug 28, 2020 10:34 pm

Bo Persson wrote:
Fri Aug 28, 2020 9:05 pm
4) The compiler recognizes version A and already generates a popcnt instruction. This is what godbolt.org produces for gcc:

Code: Select all

_bitcnt(unsigned long long):
        test    rdi, rdi
        je      .L3
        xor     eax, eax
        popcnt  rax, rdi
        ret
.L3:
        xor     eax, eax
        ret
 
I tried to replicate this but did not succeed.

jdart
Posts: 4013
Joined: Fri Mar 10, 2006 4:23 am
Location: http://www.arasanchess.org

Re: __builtin_popcountll doesn't bring any gain

Post by jdart » Fri Aug 28, 2020 11:44 pm

Hardware popcnt has little effect in my program. If there is any gain at all, I believe it is under 5%. YMMV.

Dann Corbit
Posts: 11622
Joined: Wed Mar 08, 2006 7:57 pm
Location: Redmond, WA USA
Contact:

Re: __builtin_popcountll doesn't bring any gain

Post by Dann Corbit » Sat Aug 29, 2020 2:25 am

syzygy wrote:
Fri Aug 28, 2020 10:34 pm
Bo Persson wrote:
Fri Aug 28, 2020 9:05 pm
4) The compiler recognizes version A and already generates a popcnt instruction. This is what godbolt.org produces for gcc:

Code: Select all

_bitcnt(unsigned long long):
        test    rdi, rdi
        je      .L3
        xor     eax, eax
        popcnt  rax, rdi
        ret
.L3:
        xor     eax, eax
        ret
 
I tried to replicate this but did not succeed.
I put in this source code:

Code: Select all

int popCount (unsigned long long x) {
   int count = 0;
   while (x) {
       count++;
       x &= x - 1; // reset LS1B
   }
   return count;
}
I put in these compiler flags:

Code: Select all

-O3 -msse -msse3 -mpopcnt  -mavx2   -msse4.1 -mssse3 -msse2 
And I got this assembly:

Code: Select all

popCount(unsigned long long):
        xor     eax, eax
        popcnt  rax, rdi
        ret
I guess that you got your output because the compiler did not know if your chip had the popcnt instruction.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.

thomasahle
Posts: 73
Joined: Thu Feb 27, 2014 7:19 pm

Re: __builtin_popcountll doesn't bring any gain

Post by thomasahle » Sat Aug 29, 2020 10:55 am

What does the "xor eax eax" instruction do? Looks like it is setting some random register to 0?

mar
Posts: 2185
Joined: Fri Nov 26, 2010 1:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: __builtin_popcountll doesn't bring any gain

Post by mar » Sat Aug 29, 2020 11:01 am

thomasahle wrote:
Sat Aug 29, 2020 10:55 am
What does the "xor eax eax" instruction do? Looks like it is setting some random register to 0?
it breaks false dependency issues for popcnt on some Intel CPUs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62011
Martin Sedlak

NaltaP312
Posts: 56
Joined: Wed Oct 29, 2008 12:06 pm
Full name: Marc Paule

Re: __builtin_popcountll doesn't bring any gain

Post by NaltaP312 » Sat Aug 29, 2020 11:02 am

try this in replace of _builtin_popcount

Code: Select all

static inline int getpopcnt(uint64_t b) {
  uint64_t r;
  asm("popcntq %1, %0" : "=r" (r) : "r" (b));
  return r;
}
And test the speed.
For me under under linux there is a little difference : strange but there is.

Post Reply