__builtin_popcountll doesn't bring any gain

syzygy · Post by **syzygy** » Sat Aug 29, 2020 1:04 pm

Dann Corbit wrote: ↑Sat Aug 29, 2020 4:25 am I guess that you got your output because the compiler did not know if your chip had the popcnt instruction.

I was careful to tell godbolt.org to compile for an architecture with popcnt. But I forgot to add -O3....

So indeed, gcc recognises this. I'm impressed.

OliverBr · Post by **OliverBr** » Sat Aug 29, 2020 1:11 pm

Dann Corbit wrote: ↑Fri Aug 28, 2020 10:20 pm For AMD new architectures, AVX2 helps a whole lot.

The macro USE_AVX2 does not do anything except embellish the compiler information string in the program.
The flag that does the heavy lifing is :
-mavx2

Try it, you'll like it.

Thank you very much. "-mavx2" does the job (other options did not do much:

1) Use the source as is:

Code: Select all

47.6 kN/s -mavx2 => 49.3 kN/s

2) Removing the bitcnt array (BITC[]) method from the source, first it became slower, then gained 10%, but from the original 47.6% it's only about 5%.

Code: Select all

45.5 kN/s -mavx2 => 50.1 kN/s

On my Macbook with i7 intel the gain is notably smaller.

OliverBr · Post by **OliverBr** » Sat Aug 29, 2020 8:23 pm

Dann Corbit wrote: ↑Sat Aug 29, 2020 4:25 am
I put in these compiler flags:
Code: Select all
-O3 -msse -msse3 -mpopcnt  -mavx2   -msse4.1 -mssse3 -msse2 
And I got this assembly:
Code: Select all
popCount(unsigned long long):
        xor     eax, eax
        popcnt  rax, rdi
        ret
I guess that you got your output because the compiler did not know if your chip had the popcnt instruction.

It depends on the compiler.

I was wondering why it worked on my mac, but not linux. Then I remembered that my mac compiler is actually clang when typing gcc.
So I tried with clang in linux and => popcnt is used, too.

I looks like clang is producing faster code anyway. This produces the fastest code:

Code: Select all

clang -O3 -mavx2

OliverBr · Post by **OliverBr** » Mon Aug 31, 2020 10:11 pm

Funny fact: As it is known there is a direct 1:1 translation of OliThink into Java.
Since Version 5.6.9 the built-in-pop-count method is used, I had a look to the Java version and found this deep in the Java runtime library:

Code: Select all

     public static int bitCount(long i) {
        // HD, Figure 5-14
        i = i - ((i >>> 1) & 0x5555555555555555L);
        i = (i & 0x3333333333333333L) + ((i >>> 2) & 0x3333333333333333L);
        i = (i + (i >>> 4)) & 0x0f0f0f0f0f0f0f0fL;
        i = i + (i >>> 8);
        i = i + (i >>> 16);
        i = i + (i >>> 32);
        return (int)i & 0x7f;
     }

It's quite fast. Interesting!

Dann Corbit · Post by **Dann Corbit** » Tue Sep 01, 2020 4:19 am

same algorithm in rust here:

Bo Persson · Post by **Bo Persson** » Tue Sep 01, 2020 10:50 am

OliverBr wrote: ↑Mon Aug 31, 2020 10:11 pm Funny fact: As it is known there is a direct 1:1 translation of OliThink into Java.
Since Version 5.6.9 the built-in-pop-count method is used, I had a look to the Java version and found this deep in the Java runtime library:
Code: Select all
     public static int bitCount(long i) {
        // HD, Figure 5-14
        i = i - ((i >>> 1) & 0x5555555555555555L);
        i = (i & 0x3333333333333333L) + ((i >>> 2) & 0x3333333333333333L);
        i = (i + (i >>> 4)) & 0x0f0f0f0f0f0f0f0fL;
        i = i + (i >>> 8);
        i = i + (i >>> 16);
        i = i + (i >>> 32);
        return (int)i & 0x7f;
     }
It's quite fast. Interesting!

Yes, it is quite fast, especially if there are many bits set. If we do a quick C++ translation (so gcc can understand it)

Code: Select all

  int bitCount(unsigned long i) {
        // HD, Figure 5-14
        i = i - ((i >> 1) & 0x5555555555555555L);
        i = (i & 0x3333333333333333L) + ((i >> 2) & 0x3333333333333333L);
        i = (i + (i >> 4)) & 0x0f0f0f0f0f0f0f0fL;
        i = i + (i >> 8);
        i = i + (i >> 16);
        i = i + (i >> 32);
        return (int)i & 0x7f;

we get the straight line code sequence

Code: Select all

bitCount(unsigned long):
        movabs  rdx, 6148914691236517205
        mov     rax, rdi
        shr     rax
        and     rax, rdx
        movabs  rdx, 3689348814741910323
        sub     rdi, rax
        mov     rax, rdi
        shr     rdi, 2
        and     rdi, rdx
        and     rax, rdx
        add     rax, rdi
        mov     rdi, rax
        shr     rdi, 4
        add     rdi, rax
        movabs  rax, 1085102592571150095
        and     rdi, rax
        mov     rax, rdi
        shr     rax, 8
        add     rdi, rax
        mov     rax, rdi
        shr     rax, 16
        add     rdi, rax
        mov     rax, rdi
        shr     rax, 32
        add     rax, rdi
        and     eax, 127
        ret

I count that as 27 assembly instructions (excluding the RET) for the bit counting part. It has no loops and no conditionals, which is an advantage. Same exectution time regardless of number of bits set.

On the other hand, the "type A" code

Code: Select all

char _bitcnt(unsigned long bit) {
	char c = 0;
	while (bit) { bit &= (bit - 1); c++; }
	return c;
}

(assuming no popcnt available) translates to

Code: Select all

_bitcnt(unsigned long):
        xor     r8d, r8d
        test    rdi, rdi
        je      .L1
.L3:
        lea     rax, [rdi-1]
        add     r8d, 1
        and     rdi, rax
        jne     .L3
.L1:
        mov     eax, r8d
        ret

which has a 4 instructions per bit central loop. If you have very few bits set in your bitboard, this would result in fewer instructions executed.

So we have to consider which case is more common in a chess program - many bits set or just a few bits set.

OliverBr · Post by **OliverBr** » Tue Sep 01, 2020 5:40 pm

Bo Persson wrote: ↑Tue Sep 01, 2020 10:50 am
On the other hand, the "type A" code
Code: Select all
char _bitcnt(unsigned long bit) {
	char c = 0;
	while (bit) { bit &= (bit - 1); c++; }
	return c;
}
which has a 4 instructions per bit central loop. If you have very few bits set in your bitboard, this would result in fewer instructions executed.

So we have to consider which case is more common in a chess program - many bits set or just a few bits set.

Yes. For years I was using two different bitCount methods, one for those with many bits and one for few bits. Actually only 4 times the many-bits-method were used: 3 of them were to calculate the number of Squares Queen, Rook or Bishop can move for mobility.
The rest was not more than a few bits, e.g. a bit for each attacker in check. They can only be 2 anyway and this is doublecheck. There are no triplechecks.

__builtin_popcountll doesn't bring any gain

Re: __builtin_popcountll doesn't bring any gain

Re: __builtin_popcountll doesn't bring any gain

Re: __builtin_popcountll doesn't bring any gain

Re: __builtin_popcountll doesn't bring any gain

Re: __builtin_popcountll doesn't bring any gain

Re: __builtin_popcountll doesn't bring any gain

Re: __builtin_popcountll doesn't bring any gain