The POPCNT instruction was introduced together with SSE4 (Streaming SIMD Extensions 4) but has its own CPU feature flag, so if you don't enable its use explicitly (by passing -mpopcnt to the compiler, or -march=nehalem for Intel, -march=barcelona for AMD), the compiler will replace the builtin popcount with its own software implementation (which could actually be your implementation B).
Even then, do not expect an incredible speedup: for my engine the NPS difference between the non-POPCNT and the POPCNT build is around 2%.
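One way to see which path a given build takes is to test the `__POPCNT__` macro, which GCC and Clang define when the target is allowed to use the instruction (for example with -mpopcnt). A minimal sketch; the helper names `built_with_popcnt` and `popcount64` are made up for illustration and are not from any engine:

```c
#include <stdint.h>

/* Hypothetical helper: returns 1 if this translation unit was compiled
   with POPCNT enabled (e.g. -mpopcnt or a -march value that implies it). */
static inline int built_with_popcnt(void) {
#ifdef __POPCNT__
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical wrapper around the builtin. With POPCNT enabled the compiler
   can emit a single popcnt instruction; otherwise it falls back to a
   software routine, so the result is the same either way. */
static inline int popcount64(uint64_t b) {
    return __builtin_popcountll(b);
}
```

Printing `built_with_popcnt()` once at startup is a cheap way to confirm that the flags actually reached the compiler.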
__builtin_popcountll doesn't bring any gain
-
mhouppin
- Posts: 116
- Joined: Wed Feb 12, 2020 5:00 pm
- Full name: Morgan Houppin
-
Dann Corbit
- Posts: 12845
- Joined: Wed Mar 08, 2006 8:57 pm
- Location: Redmond, WA USA
Re: __builtin_popcountll doesn't bring any gain
For the newer AMD architectures, AVX2 helps a whole lot.
I use these command line options when I compile Stockfish:
g++ -Wall -Wcast-qual -fno-exceptions -std=c++17 -fprofile-generate -Wextra -Wshadow -DNDEBUG -O3 -mtune=native -DIS_64BIT -msse -msse3 -mpopcnt -DUSE_POPCNT -DUSE_AVX2 -mavx2 -DUSE_SSE41 -msse4.1 -DUSE_SSSE3 -mssse3 -DUSE_SSE2 -msse2 -c -o benchmark.o benchmark.cpp
The macro USE_AVX2 does not do anything except embellish the compiler information string in the program.
The flag that does the heavy lifting is:
-mavx2
Try it, you'll like it.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
-
Bo Persson
- Posts: 264
- Joined: Sat Mar 11, 2006 8:31 am
- Location: Malmö, Sweden
- Full name: Bo Persson
Re: __builtin_popcountll doesn't bring any gain
OliverBr wrote: ↑Fri Aug 28, 2020 9:15 pm
Hello together,
until now I have been using my own bitcount methods:
A) for few bits:
Code: Select all
char _bitcnt(u64 bit) {
    char c = 0;
    while (bit) {
        bit &= (bit - 1);
        c++;
    }
    return c;
}
B) for many bits, which uses a precalculated array BITC:
Code: Select all
char bitcnt (u64 n) {
    return BITC[LOW16(n)]
         + BITC[LOW16(n >> 16)]
         + BITC[LOW16(n >> 32)]
         + BITC[LOW16(n >> 48)];
}
Now I replaced those methods by
Code: Select all
__builtin_popcountll()
but it didn't bring any gain on an AMD EPYC 7502P 32-Core Processor, gcc -O9. What is the reason for it?
1) There is no hardware support.
2) The compiler doesn't use hardware support.
3) The hardware popcount isn't that fast.
Who can help with answers?
or
4) The compiler recognizes version A and already generates a popcnt instruction. This is what godbolt.org produces for gcc:
Code: Select all
_bitcnt(unsigned long long):
    test rdi, rdi
    je .L3
    xor eax, eax
    popcnt rax, rdi
    ret
.L3:
    xor eax, eax
    ret
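For anyone who wants to compare the three variants from the quoted post side by side, here is a self-contained sketch. The original post did not show how `BITC` is filled or how `LOW16` is defined, so both are assumptions here, and the function names are made up for illustration:

```c
#include <stdint.h>

typedef uint64_t u64;

/* Assumption: LOW16 keeps the low 16 bits; the original macro was not shown. */
#define LOW16(x) ((unsigned)((x) & 0xFFFFu))

/* Assumption: BITC holds the bit count of every 16-bit value. */
static unsigned char BITC[1 << 16];

/* Fill the table once at startup: count(i) = count(i / 2) + lowest bit of i. */
static void init_bitc(void) {
    BITC[0] = 0;
    for (unsigned i = 1; i < (1u << 16); i++)
        BITC[i] = (unsigned char)(BITC[i >> 1] + (i & 1u));
}

/* A) clear the lowest set bit until nothing is left */
static int bitcnt_loop(u64 bit) {
    int c = 0;
    while (bit) {
        bit &= bit - 1;
        c++;
    }
    return c;
}

/* B) four lookups in the 16-bit table */
static int bitcnt_table(u64 n) {
    return BITC[LOW16(n)] + BITC[LOW16(n >> 16)]
         + BITC[LOW16(n >> 32)] + BITC[LOW16(n >> 48)];
}

/* C) the compiler builtin */
static int bitcnt_builtin(u64 n) {
    return __builtin_popcountll(n);
}
```

`init_bitc()` has to run once before `bitcnt_table()` is used. Compiling this with and without -mpopcnt and looking at the assembly shows which of the three actually end up as a popcnt instruction.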
-
syzygy
- Posts: 5942
- Joined: Tue Feb 28, 2012 11:56 pm
Re: __builtin_popcountll doesn't bring any gain
Bo Persson wrote: ↑Fri Aug 28, 2020 11:05 pm
4) The compiler recognizes version A and already generates a popcnt instruction. This is what godbolt.org produces for gcc:
Code: Select all
_bitcnt(unsigned long long):
    test rdi, rdi
    je .L3
    xor eax, eax
    popcnt rax, rdi
    ret
.L3:
    xor eax, eax
    ret
I tried to replicate this but did not succeed.
-
jdart
- Posts: 4428
- Joined: Fri Mar 10, 2006 5:23 am
- Location: http://www.arasanchess.org
Re: __builtin_popcountll doesn't bring any gain
Hardware popcnt has little effect in my program. If there is any gain at all, I believe it is under 5%. YMMV.
-
Dann Corbit
- Posts: 12845
- Joined: Wed Mar 08, 2006 8:57 pm
- Location: Redmond, WA USA
Re: __builtin_popcountll doesn't bring any gain
syzygy wrote: ↑Sat Aug 29, 2020 12:34 am
Bo Persson wrote: ↑Fri Aug 28, 2020 11:05 pm
4) The compiler recognizes version A and already generates a popcnt instruction. This is what godbolt.org produces for gcc:
Code: Select all
_bitcnt(unsigned long long):
    test rdi, rdi
    je .L3
    xor eax, eax
    popcnt rax, rdi
    ret
.L3:
    xor eax, eax
    ret
I tried to replicate this but did not succeed.
I put in this source code:
Code: Select all
int popCount (unsigned long long x) {
    int count = 0;
    while (x) {
        count++;
        x &= x - 1; // reset LS1B
    }
    return count;
}
Code: Select all
-O3 -msse -msse3 -mpopcnt -mavx2 -msse4.1 -mssse3 -msse2
Code: Select all
popCount(unsigned long long):
    xor eax, eax
    popcnt rax, rdi
    ret
-
thomasahle
- Posts: 94
- Joined: Thu Feb 27, 2014 8:19 pm
Re: __builtin_popcountll doesn't bring any gain
What does the "xor eax eax" instruction do? Looks like it is setting some random register to 0?
-
mar
- Posts: 2676
- Joined: Fri Nov 26, 2010 2:00 pm
- Location: Czech Republic
- Full name: Martin Sedlak
Re: __builtin_popcountll doesn't bring any gain
thomasahle wrote: ↑Sat Aug 29, 2020 12:55 pm
What does the "xor eax eax" instruction do? Looks like it is setting some random register to 0?
It zeroes the destination register first, which breaks the false output dependency that popcnt has on some Intel CPUs:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62011
-
NaltaP312
- Posts: 56
- Joined: Wed Oct 29, 2008 1:06 pm
- Full name: Marc Paule
Re: __builtin_popcountll doesn't bring any gain
Try this in place of __builtin_popcountll
and test the speed.
For me under Linux there is a small difference: strange, but there is.
Code: Select all
static inline int getpopcnt(uint64_t b) {
    uint64_t r;
    /* emit popcnt directly; the CPU must support it */
    asm("popcntq %1, %0" : "=r" (r) : "r" (b));
    return (int)r;
}
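The inline assembly above executes popcnt unconditionally, so it faults on a CPU without the instruction. A hedged variant that asks the CPU first and otherwise falls back to the builtin could look like the sketch below. `popcnt_checked` is a made-up name; `__builtin_cpu_supports` is the GCC/Clang x86 builtin for run-time feature checks:

```c
#include <stdint.h>

/* Hypothetical helper: use the popcnt instruction directly when the CPU
   reports support at run time, otherwise fall back to the builtin. */
static inline int popcnt_checked(uint64_t b) {
#if defined(__x86_64__) && defined(__GNUC__)
    if (__builtin_cpu_supports("popcnt")) {
        uint64_t r;
        /* popcnt also writes the flags register, hence the "cc" clobber */
        __asm__("popcntq %1, %0" : "=r"(r) : "r"(b) : "cc");
        return (int)r;
    }
#endif
    /* Non-x86 targets and CPUs without popcnt end up here. */
    return __builtin_popcountll(b);
}
```

In a hot loop the feature check would normally be done once at startup and the result cached, since testing it on every call can cost more than the popcount itself.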
-
syzygy
- Posts: 5942
- Joined: Tue Feb 28, 2012 11:56 pm
Re: __builtin_popcountll doesn't bring any gain
Dann Corbit wrote: ↑Sat Aug 29, 2020 4:25 am
I guess that you got your output because the compiler did not know if your chip had the popcnt instruction.
I was careful to tell godbolt.org to compile for an architecture with popcnt. But I forgot to add -O3....
So indeed, gcc recognises this. I'm impressed.