Is there such a thing as branchless move generation?

Daniel Shawul · Post by **Daniel Shawul** » Sun Jun 10, 2012 9:16 pm

Your calculation for piece list is totally wrong, at least according to what I am using now. The piece list entries with LIST { int sq, List* prev,List* next } requires far more than a thousand bytes. You need to declare an LIST[128] too...

Edmund · Post by **Edmund** » Sun Jun 10, 2012 11:24 pm

Daniel Shawul wrote:Your calculation for piece list is totally wrong, at least according to what I am using now. The piece list entries with LIST { int sq, List* prev,List* next } requires far more than a thousand bytes. You need to declare an LIST[128] too...

a) what do you need the *prev for?
b) why 128 and not just 2*16 entries?

Daniel Shawul · Post by **Daniel Shawul** » Sun Jun 10, 2012 11:54 pm

Edmund wrote:
Daniel Shawul wrote:Your calculation for piece list is totally wrong, at least according to what I am using now. The piece list entries with LIST { int sq, List* prev,List* next } requires far more than a thousand bytes. You need to declare an LIST[128] too...
a) what do you need the *prev for?
b) why 128 and not just 2*16 entries?

What is your point here ? You can try to avoid a byte or so but at the end of the day piece lists are going to consume a lot more than 4 bitboards. Here is your answer anyway
a) faster insert/remove piece. Sure you can avoid that but your make/unmake will be slower.
b) An array of LIST[128] should be separately kept and updated when a move is made. Thas was completely ignored in the original calculation. Pointers do require an int and using char tables is dead slow in gpus because of bank conflicts. For scorpio with piece lists I have the following. After that tell me if 0x88 is better than one quad bitboard.

Code: Select all

typedef struct LIST&#123;
	int   sq;
	LIST* prev;
	LIST* next;
&#125;*PLIST;

int temp_board&#91;192&#93;;
PLIST list&#91;128&#93;;
PLIST plist&#91;15&#93;;

That is a total of 192x4+128x12+15x4 = 2364 byte!! Compare that to 32 bytes. Now I am not saying scoprio's is optimal but I know that I needed what used for reasons that were necessary. And when copying positions you have to do this:

Code: Select all

 
void SEARCHER&#58;&#58;COPY&#40;SEARCHER* srcSearcher&#41; &#123;
	int i,sq;
	for&#40;sq = A1;sq <= H8; sq++) &#123;
		if&#40;!&#40;sq & 0x88&#41;) &#123;
			board&#91;sq&#93; = srcSearcher->board&#91;sq&#93;;
			list&#91;sq&#93;->sq = sq;
			if&#40;srcSearcher->list&#91;sq&#93;->prev&#41;
				list&#91;sq&#93;->prev = list&#91;srcSearcher->list&#91;sq&#93;->prev->sq&#93;;
			else
				list&#91;sq&#93;->prev = 0;
			if&#40;srcSearcher->list&#91;sq&#93;->next&#41;
				list&#91;sq&#93;->next = list&#91;srcSearcher->list&#91;sq&#93;->next->sq&#93;;
			else
				list&#91;sq&#93;->next = 0;
		&#125; else &#123;
			sq += 0x07;
		&#125;
	&#125;

	for&#40;i = wking;i <= bpawn;i++) &#123;	
		if&#40;srcSearcher->plist&#91;i&#93;)
			plist&#91;i&#93; = list&#91;srcSearcher->plist&#91;i&#93;->sq&#93;;
		else
			plist&#91;i&#93; = 0;
	&#125;

johnhamlen · Post by **johnhamlen** » Mon Jun 11, 2012 12:14 am

Daniel Shawul wrote:Your calculation for piece list is totally wrong, at least according to what I am using now. The piece list entries with LIST { int sq, List* prev,List* next } requires far more than a thousand bytes. You need to declare an LIST[128] too...

Agreed. Totally wrong if I was calculating for a doubly linked list. Woodpusher used a singularly linked list before swapping to bitboards, but I was trying to think of the most memory-efficient representation which (I think) is an array of 2 x 16 entries: one byte for each of the 16 possible pieces for white and black.

johnhamlen · Post by **johnhamlen** » Mon Jun 11, 2012 12:58 am

Daniel Shawul wrote: Last time I checked NVIDIA tesla's have 64kb that you can divide into 48kb + 16kb shared memory and local cache. If you just want the device to do most of the caching of global mem for you, use 48kb for that otherwise 48kb of shared memory per multi-processor is what you get.

Cheers Daniel. Now all I is to be able to afford a Tesla

Daniel Shawul wrote: Well I use 8 bitboards (2 for white/black pieces and 6 for knight,bishop ...). With the most compact format i.e quad only 4 bitboards = 32 are required.
Daniel Shawul wrote: Also since you are using 0x88 representation and piece lists you need 128x32bit + 16x32bits = 560bytes. That is a significant difference. You are going to allocate that much for each thread you spawn if you want them to do different work. Also you should be careful about using char tables on gpus because of bank conflicts. Use int whenever possible. Some have tried to use multiple threads (could be a warp) to do a single search (i.e use the threads for fast move generation or evaluation). If that is what you want to do then SIMD friendly move generation makes sense, otherwise the discussion should be a SIMD friendly search...
Wow, that is efficient. Woodpusher used 14. I kept separate boards for the white and black piece types to avoid the &s (the bad old days of 32-bit longs!). Gerd posted something about the quad format too. Will definitely check it out. Thanks also for the memory alignment tip. Yes, SIMD friendly (move gen + move selection + makemove) is what I'll be aiming at to begin with, then I'll start worrying about the search.
Daniel Shawul wrote: There is no magic formula. Once you write your kernel and determine the amount of registers , shared memory and other resources consumed, you can determine what limits your occupancy. Usually it is register usage that is the problem. For cuda, there is an excel table you can use to tune your usage to get maximum occupancy.
I was afraid that would be the answer . I'll start my moking up in C and then port to CUDA or OpenCL. Thanks again Daniel.
John

Daniel Shawul · Post by **Daniel Shawul** » Mon Jun 11, 2012 2:49 am

johnhamlen wrote:
Daniel Shawul wrote:Your calculation for piece list is totally wrong, at least according to what I am using now. The piece list entries with LIST { int sq, List* prev,List* next } requires far more than a thousand bytes. You need to declare an LIST[128] too...
Agreed. Totally wrong if I was calculating for a doubly linked list. Woodpusher used a singularly linked list before swapping to bitboards, but I was trying to think of the most memory-efficient representation which (I think) is an array of 2 x 16 entries: one byte for each of the 16 possible pieces for white and black.

Doesn't matter if it is doubly linked list or not. You still need a pointer to the next element if it is to be a linked list. Plus an array of list entry of atleast 128 in size,i.e besides int board[128]. I don't see that. I store and update a linked list for _each_ piece. I looked around and found Fruit that uses a less elaborate linked list representation (only distinction b/n pieces and pawns) ,but still the total seems to be close to mine.

Code: Select all

int square&#91;SquareNb&#93;;
   int pos&#91;SquareNb&#93;;

   sq_t piece&#91;ColourNb&#93;&#91;32&#93;; // only 17 are needed
   int piece_size&#91;ColourNb&#93;;

   sq_t pawn&#91;ColourNb&#93;&#91;16&#93;; // only 9 are needed
   int pawn_size&#91;ColourNb&#93;;

256x4 + 256x4 + 2 x 32 + 2 x 16 = 2144 bytes.

johnhamlen · Post by **johnhamlen** » Mon Jun 11, 2012 2:43 pm

Daniel Shawul wrote: You still need a pointer to the next element if it is to be a linked list.

Agreed, but I didn't explain myself very well. The idea was to not have any sort of linked list, just an array of 32 entries, one for each possible piece on the board. Before you mentioned issues with memory alignment I was I was hoping that the entries could be signed chars containing the index of the piece on the 0x88 board representation (0-127) or -1 if that piece has already been taken. You know the piece/pawn colour by its position in the array, and also piece types by their position in the array. (For 'pawn' types, you'd always need to go to the board[] to check in case the pawn had been promoted).

Of course this sort of piece list representation is nonsense for a "proper" program like Fruit, yours or, dare I say it, Woodpusher. Whilst being lovely and compact it becomes stupidly inefficient once the number of pieces on the board reduces

.

Daniel Shawul · Post by **Daniel Shawul** » Mon Jun 11, 2012 4:41 pm

johnhamlen wrote:
Daniel Shawul wrote: You still need a pointer to the next element if it is to be a linked list.
Agreed, but I didn't explain myself very well. The idea was to not have any sort of linked list, just an array of 32 entries, one for each possible piece on the board. Before you mentioned issues with memory alignment I was I was hoping that the entries could be signed chars containing the index of the piece on the 0x88 board representation (0-127) or -1 if that piece has already been taken. You know the piece/pawn colour by its position in the array, and also piece types by their position in the array. (For 'pawn' types, you'd always need to go to the board[] to check in case the pawn had been promoted).

Of course this sort of piece list representation is nonsense for a "proper" program like Fruit, yours or, dare I say it, Woodpusher. Whilst being lovely and compact it becomes stupidly inefficient once the number of pieces on the board reduces .

Well that is not a piece list. You should be able to loop over pieces of certain kind without testing any flag. If that was the case, why not do what TSCP does with only board[64]? The comparison of piece lists with bitboards should be without degrading performance. A set of bitboards for knight,bisthop,etc do allow you to loop over each kind of piece separately. Also chars infact do waste a lot of space, when 4bits per square would be enough to store the piece type. So you see if you go this road , you should call this method a bitboard too

smatovic · Post by **smatovic** » Mon Jun 11, 2012 4:53 pm

Hi John,

maybe you want to take a look at "Zeta CL"

http://zeta-chess.blogspot.de/

Code is published under GPL:

https://github.com/smatovic/Zeta/

Zeta uses QuadBitboards (thx to Gerd) for Board Presentation and a Magic-Bitboard Move Generator (parts ported from Stockfish).

I tried an 0x88 move generator (port of MicroMax) and it was slower than Magic-Bitboards.

I tried also an parallel Kogge-Stone approach with 8 threads for 8 directions, and it was slower.

Like Daniel already mentioned, Move Generation is only one point, the overall performance is tight to the post-process,:

http://zeta-chess.blogspot.de/2012/03/p ... llers.html

--
Srdja

Gerd Isenberg · Post by **Gerd Isenberg** » Mon Jun 11, 2012 10:55 pm

smatovic wrote:Hi John,

maybe you want to take a look at "Zeta CL"

http://zeta-chess.blogspot.de/

Code is published under GPL:

https://github.com/smatovic/Zeta/

Zeta uses QuadBitboards (thx to Gerd) for Board Presentation and a Magic-Bitboard Move Generator (parts ported from Stockfish).

I tried an 0x88 move generator (port of MicroMax) and it was slower than Magic-Bitboards.

I tried also an parallel Kogge-Stone approach with 8 threads for 8 directions, and it was slower.

Like Daniel already mentioned, Move Generation is only one point, the overall performance is tight to the post-process,:

http://zeta-chess.blogspot.de/2012/03/p ... llers.html

--
Srdja

Hi Srdja,

Hmm, magic bitboards with a ~800KByte rook table on a GPU? I thought it is overkill there.

Your qbb legal move generation in zeta.cl with domove/undomove, king in check test, x-times getting piece codes from qbb in domove/undomove and elsewhere, and lots of branches in inner loops, looks horrorful inefficient and ugly to me

Fillstuff with all 8 ray-directions inside one thread with some local memory or SIMD register file to keep own and disjoint opponent dir-attacks of sliders (you know fillstuff works with multiple rooks/queens or bishops/queens in one run) and kings during generation time, should be much better suited, I guess. But of course, I am not a GPU expert.

Determining absolutely pinned pieces on their pinned direction is then a simple intersection of own pieces with opponent sliding attacks in that ray-direction with the "pseudo" sliding attacks of the own king in the opposite ray-direction. Also, with all opponent sliding attacks (x-raying own king in case of check, thus removing own king from occupancy for opponent sliding attacks), along with the union of opponent pawn, knight and king-attacks available, you have all attacked taboo squares concerning legal king moves and castling. Additionally, the sliding king dir-fills provide all informations to determine check giving moves, either discovered checks (similar to pinned pieces) or direct checks i.e. for some qsearch or tactical search layer etc... .

Gerd

Is there such a thing as branchless move generation?

Re: Is there such a thing as branchless move generation?

Re: Is there such a thing as branchless move generation?

Re: Is there such a thing as branchless move generation?

Re: Is there such a thing as branchless move generation?

Re: Is there such a thing as branchless move generation?

Re: Is there such a thing as branchless move generation?

Re: Is there such a thing as branchless move generation?

Re: Is there such a thing as branchless move generation?

Re: Zeta CL uses Magic Bitboards

Re: Zeta CL uses Magic Bitboards