syzygy users (and Ronald)

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

syzygy users (and Ronald)

Post by bob »

I have been working with the new clang for macOS sierra, and my inline asm for MSB/LSB/PopCnt were not working correctly. Rather than trying to fix it, I decided it was time to eliminate that code anyway and use the built in intrinsics that seem to work fine for gcc/clang...

#define PopCnt(v) __builtin_popcountll(v)
#define LSB(v) __builtin_ctzll(v)
#define MSB(v) (63 - __builtin_clzll(v))

I found a few interesting things.

(1) the __builtin_popcount() intrinsic works with or without hardware popcnt. If you compile with -mpopcnt it produces the expected popcnt instruction, otherwise it does the usual shift/and/add trick. So it works on all platforms.

(2) builtin_ctzll (count trailing zeroes long long) converts directly to a BSF as expected.

(3) was a surprise. I could not find a BSR intrinsic, but the new count leading zeros was close. I didn't like the 63- trick to convert it to the proper numbering, but when I looked at the asm, the compiler was smart enough to realize that was equivalent to a BSR and did it correctly.

You (Ronald) might want to get rid of the HAS_POPCNT stuff and just use the intrinsic. I discovered this when I compiled without the -mpopcnt option and Crafty was running about 10% slower. When I added that option, it runs exactly as fast as the old inline asm version...

One less configuration thing to make the users get right, although you do need -mpopcnt as a compiler option to make it emit a popcnt opcode. Otherwise you get the folding popcnt version.
syzygy
Posts: 5563
Joined: Tue Feb 28, 2012 11:56 pm

Re: syzygy users (and Ronald)

Post by syzygy »

bob wrote:You (Ronald) might want to get rid of the HAS_POPCNT stuff and just use the intrinsic. I discovered this when I compiled without the -mpopcnt option and Crafty was running about 10% slower. When I added that option, it runs exactly as fast as the old inline asm version...
My probing code does not really use popcount except in two functions which are used only for initialisation of TBs.

It does use bsf to loop through the bitboards of pawns, knights, etc. That is the part that will have to be adapted for each engine. It probably needed not much adaptation for bitboard-based Crafty.

My TB generator uses popcnt hidden in a couple of macros. If USE_POPCNT is not defined, it uses a slightly different implementation for those macros which is hardly slower.
D Sceviour
Posts: 570
Joined: Mon Jul 20, 2015 5:06 pm

Re: syzygy users (and Ronald)

Post by D Sceviour »

bob wrote:1) the __builtin_popcount() intrinsic works with or without hardware popcnt. If you compile with -mpopcnt it produces the expected popcnt instruction, otherwise it does the usual shift/and/add trick. So it works on all platforms.
I don't follow here. Graham Banks and other users ask for both popcount and non-popcount versions to be available. One would still have to find a way of over-riding the intrinsic. I am not sure how this relates to syzygy, as I do not have an operational probing code yet.
bob wrote:You (Ronald) might want to get rid of the HAS_POPCNT stuff and just use the intrinsic.
A purpose of the HAS_POPCNT test is to inform the user if the hardware is suitable for the executable, and not just for the complier to know.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: syzygy users (and Ronald)

Post by bob »

D Sceviour wrote:
bob wrote:1) the __builtin_popcount() intrinsic works with or without hardware popcnt. If you compile with -mpopcnt it produces the expected popcnt instruction, otherwise it does the usual shift/and/add trick. So it works on all platforms.
I don't follow here. Graham Banks and other users ask for both popcount and non-popcount versions to be available. One would still have to find a way of over-riding the intrinsic. I am not sure how this relates to syzygy, as I do not have an operational probing code yet.
bob wrote:You (Ronald) might want to get rid of the HAS_POPCNT stuff and just use the intrinsic.
A purpose of the HAS_POPCNT test is to inform the user if the hardware is suitable for the executable, and not just for the complier to know.
You have to compile it twice. Once without -mpopcnt, which gives the "folding" popcnt code, and once with -mpopocnt which produces the actual popcnt instruction.

His "HAS_POPCNT" has to be defined by the user, before you compile. So I am not quite sure what you are thinking of here. I suppose one could write an all-inclusive popcnt that checks to see if it is available, and if so use it, otherwise use the folded approach. But I would not want that overhead myself.
D Sceviour
Posts: 570
Joined: Mon Jul 20, 2015 5:06 pm

Re: syzygy users (and Ronald)

Post by D Sceviour »

bob wrote:His "HAS_POPCNT" has to be defined by the user, before you compile. So I am not quite sure what you are thinking of here. I suppose one could write an all-inclusive popcnt that checks to see if it is available, and if so use it, otherwise use the folded approach. But I would not want that overhead myself.
That is the point. It has to be defined for the different types of compilation desired. How do you "get rid of the HAS_POPCNT stuff?"
There is a quick peek into the system registers, but I have a feeling this is not what you mean. An intel code for HavePopCnt():

Code: Select all

asm
	mov eax, 1
	push rbx
	cpuid
	pop rbx
	xor rax, rax
	bt ecx, 23
	jnc 0 
	mov rax, -1
0: 
	ret
end asm
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: syzygy users (and Ronald)

Post by bob »

D Sceviour wrote:
bob wrote:His "HAS_POPCNT" has to be defined by the user, before you compile. So I am not quite sure what you are thinking of here. I suppose one could write an all-inclusive popcnt that checks to see if it is available, and if so use it, otherwise use the folded approach. But I would not want that overhead myself.
That is the point. It has to be defined for the different types of compilation desired. How do you "get rid of the HAS_POPCNT stuff?"
There is a quick peek into the system registers, but I have a feeling this is not what you mean. An intel code for HavePopCnt():

Code: Select all

asm
	mov eax, 1
	push rbx
	cpuid
	pop rbx
	xor rax, rax
	bt ecx, 23
	jnc 0 
	mov rax, -1
0: 
	ret
end asm
You can use the CPUID instruction to check for the presence of the popcnt instruction as you suggested (I did not verify it is checking the right bit, but that is the idea). You could do that one time and set the POPCNT variable to 0 or 1, then test it in the popcnt() function and either use the popcnt or fold algorithm depending. But I don't like that overhead and would prefer to do as I do and just compile with or without depending on available hardware.


My point to Ronald was that the tests for HAS_POPCNT in his code can be eliminated since the intrinsic has options for hardware popcnt or no popcnt... leaving the conditional preprocessor stuff out of his code...
Dann Corbit
Posts: 12540
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: syzygy users (and Ronald)

Post by Dann Corbit »

This idea works if you compile the code on your own machine.

But if you run the compiled code on anther machine, we do not know if it should have the native popcount hardware instruction or the emulated popcount function call.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: syzygy users (and Ronald)

Post by bob »

Dann Corbit wrote:This idea works if you compile the code on your own machine.

But if you run the compiled code on anther machine, we do not know if it should have the native popcount hardware instruction or the emulated popcount function call.
So what? How is that different from what has to be done at the present? I am simply suggesting getting rid of the POPCNT conditional preprocessor directives and let the conditional directive in the include library for the intrinsics do the job. EXACTLY the same result, but without any extra code added to the syzygy code.

Currently, one has to edit the .h file and specify whether or not popcnt is available. With my suggestion, no editing, just add -mpopcnt to the compiler options if your machine has popcnt, leave it off and you get the folded popcnt. Same result, but simpler approach.
syzygy
Posts: 5563
Joined: Tue Feb 28, 2012 11:56 pm

Re: syzygy users (and Ronald)

Post by syzygy »

bob wrote:Currently, one has to edit the .h file and specify whether or not popcnt is available. With my suggestion, no editing, just add -mpopcnt to the compiler options if your machine has popcnt, leave it off and you get the folded popcnt. Same result, but simpler approach.
I'm still wondering a bit which code you are referring to?

The interface code you find here needs lots of editing to adapt it to another engine:
https://github.com/syzygy1/tb/blob/mast ... ce/types.h

Same with the code here:
https://github.com/syzygy1/Stockfish/tr ... src/syzygy

The calls to Stockfish functions have to be replaced with calls to the engine's own functions. I'm not sure if this is what you did, or if you took Basil's library version:
https://github.com/basil00/Fathom/tree/master/src
Aha, this uses popcount quite heavily in calc_key(), which is invoked at every probe. So this must be what you're using in Crafty.

Personally I have always preferred integrating TB probing code as tightly as possible with the engine for maximum performance. My probing code needs to generate and play out captures and what better functions to use for that than the engine's native move generation. (Back in the days of the Edwards tables I noticed that removing the multiple conversions of board representation on every probe doubled probing speed.)

The obvious disadvantage of this approach is that it becomes a lot of work to add TB support. I suspect that the majority of engines with Syzygy support instead use Basil's wrapper code. And I guess performance is good enough anyway.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: syzygy users (and Ronald)

Post by bob »

syzygy wrote:
bob wrote:Currently, one has to edit the .h file and specify whether or not popcnt is available. With my suggestion, no editing, just add -mpopcnt to the compiler options if your machine has popcnt, leave it off and you get the folded popcnt. Same result, but simpler approach.
I'm still wondering a bit which code you are referring to?

The interface code you find here needs lots of editing to adapt it to another engine:
https://github.com/syzygy1/tb/blob/mast ... ce/types.h

Same with the code here:
https://github.com/syzygy1/Stockfish/tr ... src/syzygy

The calls to Stockfish functions have to be replaced with calls to the engine's own functions. I'm not sure if this is what you did, or if you took Basil's library version:
https://github.com/basil00/Fathom/tree/master/src
Aha, this uses popcount quite heavily in calc_key(), which is invoked at every probe. So this must be what you're using in Crafty.

Personally I have always preferred integrating TB probing code as tightly as possible with the engine for maximum performance. My probing code needs to generate and play out captures and what better functions to use for that than the engine's native move generation. (Back in the days of the Edwards tables I noticed that removing the multiple conversions of board representation on every probe doubled probing speed.)

The obvious disadvantage of this approach is that it becomes a lot of work to add TB support. I suspect that the majority of engines with Syzygy support instead use Basil's wrapper code. And I guess performance is good enough anyway.
I was specifically talking about this:

/*
* Define TB_NO_HW_POP_COUNT if there is no hardware popcount instruction.
*/
/* #define TB_NO_HW_POP_COUNT */

which is in tbprobe.h, and then this from tbprobe.c:

#ifndef TB_NO_HW_POP_COUNT
# include <popcntintrin.h>
# define popcount(x) _mm_popcnt_u64((x))
#else
static inline unsigned popcount(uint64_t x) {
x = x - ((x >> 1) & 0x5555555555555555ull);
x = (x & 0x3333333333333333ull) + ((x >> 2) & 0x3333333333333333ull);
x = (x + (x >> 4)) & 0x0f0f0f0f0f0f0f0full;
return (x * 0x0101010101010101ull) >> 56;
}
#endif

If you switch to the intrinsic I gave (__builtin_popcountll()) then it does that automatically. If you compile with -mpopcnt, you get the hardware popcnt instruction, if you compile without the-mpopcnt, you get the "folded popcnt" which looks just about like yours above. And you get it without any of the above code being included in your stuff. Simpler, cleaner, exact same result...

I was not paying any attention to WHERE/WHEN you use the popcnt stuff, I was just looking at cleaning it up. I have now removed my inline.h file which had inline asm for MSB(), LSB() and PopCnt(). I no longer use "boolean.c" which is where I had C-coded versions of the above for non X-86 compiles (such as the IBM PPC I have been playing with).

I like simple, particularly when there is no down-side at all...