Is there such a thing as branchless move generation?
Moderators: hgm, Rebel, chrisw
-
- Posts: 4185
- Joined: Tue Mar 14, 2006 11:34 am
- Location: Ethiopia
Re: Is there such a thing as branchless move generation?
Your calculation for piece list is totally wrong, at least according to what I am using now. The piece list entries with LIST { int sq, List* prev,List* next } requires far more than a thousand bytes. You need to declare an LIST[128] too...
-
- Posts: 670
- Joined: Mon Dec 03, 2007 3:01 pm
- Location: Barcelona, Spain
Re: Is there such a thing as branchless move generation?
a) what do you need the *prev for?Daniel Shawul wrote:Your calculation for piece list is totally wrong, at least according to what I am using now. The piece list entries with LIST { int sq, List* prev,List* next } requires far more than a thousand bytes. You need to declare an LIST[128] too...
b) why 128 and not just 2*16 entries?
-
- Posts: 4185
- Joined: Tue Mar 14, 2006 11:34 am
- Location: Ethiopia
Re: Is there such a thing as branchless move generation?
What is your point here ? You can try to avoid a byte or so but at the end of the day piece lists are going to consume a lot more than 4 bitboards. Here is your answer anywayEdmund wrote:a) what do you need the *prev for?Daniel Shawul wrote:Your calculation for piece list is totally wrong, at least according to what I am using now. The piece list entries with LIST { int sq, List* prev,List* next } requires far more than a thousand bytes. You need to declare an LIST[128] too...
b) why 128 and not just 2*16 entries?
a) faster insert/remove piece. Sure you can avoid that but your make/unmake will be slower.
b) An array of LIST[128] should be separately kept and updated when a move is made. Thas was completely ignored in the original calculation. Pointers do require an int and using char tables is dead slow in gpus because of bank conflicts. For scorpio with piece lists I have the following. After that tell me if 0x88 is better than one quad bitboard.
Code: Select all
typedef struct LIST{
int sq;
LIST* prev;
LIST* next;
}*PLIST;
int temp_board[192];
PLIST list[128];
PLIST plist[15];
Code: Select all
void SEARCHER::COPY(SEARCHER* srcSearcher) {
int i,sq;
for(sq = A1;sq <= H8; sq++) {
if(!(sq & 0x88)) {
board[sq] = srcSearcher->board[sq];
list[sq]->sq = sq;
if(srcSearcher->list[sq]->prev)
list[sq]->prev = list[srcSearcher->list[sq]->prev->sq];
else
list[sq]->prev = 0;
if(srcSearcher->list[sq]->next)
list[sq]->next = list[srcSearcher->list[sq]->next->sq];
else
list[sq]->next = 0;
} else {
sq += 0x07;
}
}
for(i = wking;i <= bpawn;i++) {
if(srcSearcher->plist[i])
plist[i] = list[srcSearcher->plist[i]->sq];
else
plist[i] = 0;
}
-
- Posts: 25
- Joined: Sat Feb 18, 2012 10:56 am
- Location: United Kingdom
Re: Is there such a thing as branchless move generation?
Agreed. Totally wrong if I was calculating for a doubly linked list. Woodpusher used a singularly linked list before swapping to bitboards, but I was trying to think of the most memory-efficient representation which (I think) is an array of 2 x 16 entries: one byte for each of the 16 possible pieces for white and black.Daniel Shawul wrote:Your calculation for piece list is totally wrong, at least according to what I am using now. The piece list entries with LIST { int sq, List* prev,List* next } requires far more than a thousand bytes. You need to declare an LIST[128] too...
-
- Posts: 25
- Joined: Sat Feb 18, 2012 10:56 am
- Location: United Kingdom
Re: Is there such a thing as branchless move generation?
Cheers Daniel. Now all I is to be able to afford a TeslaDaniel Shawul wrote: Last time I checked NVIDIA tesla's have 64kb that you can divide into 48kb + 16kb shared memory and local cache. If you just want the device to do most of the caching of global mem for you, use 48kb for that otherwise 48kb of shared memory per multi-processor is what you get.
Daniel Shawul wrote: Well I use 8 bitboards (2 for white/black pieces and 6 for knight,bishop ...). With the most compact format i.e quad only 4 bitboards = 32 are required.Wow, that is efficient. Woodpusher used 14. I kept separate boards for the white and black piece types to avoid the &s (the bad old days of 32-bit longs!). Gerd posted something about the quad format too. Will definitely check it out. Thanks also for the memory alignment tip. Yes, SIMD friendly (move gen + move selection + makemove) is what I'll be aiming at to begin with, then I'll start worrying about the search.Daniel Shawul wrote: Also since you are using 0x88 representation and piece lists you need 128x32bit + 16x32bits = 560bytes. That is a significant difference. You are going to allocate that much for each thread you spawn if you want them to do different work. Also you should be careful about using char tables on gpus because of bank conflicts. Use int whenever possible. Some have tried to use multiple threads (could be a warp) to do a single search (i.e use the threads for fast move generation or evaluation). If that is what you want to do then SIMD friendly move generation makes sense, otherwise the discussion should be a SIMD friendly search...I was afraid that would be the answer . I'll start my moking up in C and then port to CUDA or OpenCL. Thanks again Daniel.Daniel Shawul wrote: There is no magic formula. Once you write your kernel and determine the amount of registers , shared memory and other resources consumed, you can determine what limits your occupancy. Usually it is register usage that is the problem. For cuda, there is an excel table you can use to tune your usage to get maximum occupancy.
John
-
- Posts: 4185
- Joined: Tue Mar 14, 2006 11:34 am
- Location: Ethiopia
Re: Is there such a thing as branchless move generation?
Doesn't matter if it is doubly linked list or not. You still need a pointer to the next element if it is to be a linked list. Plus an array of list entry of atleast 128 in size,i.e besides int board[128]. I don't see that. I store and update a linked list for _each_ piece. I looked around and found Fruit that uses a less elaborate linked list representation (only distinction b/n pieces and pawns) ,but still the total seems to be close to mine.johnhamlen wrote:Agreed. Totally wrong if I was calculating for a doubly linked list. Woodpusher used a singularly linked list before swapping to bitboards, but I was trying to think of the most memory-efficient representation which (I think) is an array of 2 x 16 entries: one byte for each of the 16 possible pieces for white and black.Daniel Shawul wrote:Your calculation for piece list is totally wrong, at least according to what I am using now. The piece list entries with LIST { int sq, List* prev,List* next } requires far more than a thousand bytes. You need to declare an LIST[128] too...
Code: Select all
int square[SquareNb];
int pos[SquareNb];
sq_t piece[ColourNb][32]; // only 17 are needed
int piece_size[ColourNb];
sq_t pawn[ColourNb][16]; // only 9 are needed
int pawn_size[ColourNb];
-
- Posts: 25
- Joined: Sat Feb 18, 2012 10:56 am
- Location: United Kingdom
Re: Is there such a thing as branchless move generation?
Agreed, but I didn't explain myself very well. The idea was to not have any sort of linked list, just an array of 32 entries, one for each possible piece on the board. Before you mentioned issues with memory alignment I was I was hoping that the entries could be signed chars containing the index of the piece on the 0x88 board representation (0-127) or -1 if that piece has already been taken. You know the piece/pawn colour by its position in the array, and also piece types by their position in the array. (For 'pawn' types, you'd always need to go to the board[] to check in case the pawn had been promoted).Daniel Shawul wrote: You still need a pointer to the next element if it is to be a linked list.
Of course this sort of piece list representation is nonsense for a "proper" program like Fruit, yours or, dare I say it, Woodpusher. Whilst being lovely and compact it becomes stupidly inefficient once the number of pieces on the board reduces .
-
- Posts: 4185
- Joined: Tue Mar 14, 2006 11:34 am
- Location: Ethiopia
Re: Is there such a thing as branchless move generation?
Well that is not a piece list. You should be able to loop over pieces of certain kind without testing any flag. If that was the case, why not do what TSCP does with only board[64]? The comparison of piece lists with bitboards should be without degrading performance. A set of bitboards for knight,bisthop,etc do allow you to loop over each kind of piece separately. Also chars infact do waste a lot of space, when 4bits per square would be enough to store the piece type. So you see if you go this road , you should call this method a bitboard toojohnhamlen wrote:Agreed, but I didn't explain myself very well. The idea was to not have any sort of linked list, just an array of 32 entries, one for each possible piece on the board. Before you mentioned issues with memory alignment I was I was hoping that the entries could be signed chars containing the index of the piece on the 0x88 board representation (0-127) or -1 if that piece has already been taken. You know the piece/pawn colour by its position in the array, and also piece types by their position in the array. (For 'pawn' types, you'd always need to go to the board[] to check in case the pawn had been promoted).Daniel Shawul wrote: You still need a pointer to the next element if it is to be a linked list.
Of course this sort of piece list representation is nonsense for a "proper" program like Fruit, yours or, dare I say it, Woodpusher. Whilst being lovely and compact it becomes stupidly inefficient once the number of pieces on the board reduces .
-
- Posts: 2663
- Joined: Wed Mar 10, 2010 10:18 pm
- Location: Hamburg, Germany
- Full name: Srdja Matovic
Re: Zeta CL uses Magic Bitboards
Hi John,
maybe you want to take a look at "Zeta CL"
http://zeta-chess.blogspot.de/
Code is published under GPL:
https://github.com/smatovic/Zeta/
Zeta uses QuadBitboards (thx to Gerd) for Board Presentation and a Magic-Bitboard Move Generator (parts ported from Stockfish).
I tried an 0x88 move generator (port of MicroMax) and it was slower than Magic-Bitboards.
I tried also an parallel Kogge-Stone approach with 8 threads for 8 directions, and it was slower.
Like Daniel already mentioned, Move Generation is only one point, the overall performance is tight to the post-process,:
http://zeta-chess.blogspot.de/2012/03/p ... llers.html
--
Srdja
maybe you want to take a look at "Zeta CL"
http://zeta-chess.blogspot.de/
Code is published under GPL:
https://github.com/smatovic/Zeta/
Zeta uses QuadBitboards (thx to Gerd) for Board Presentation and a Magic-Bitboard Move Generator (parts ported from Stockfish).
I tried an 0x88 move generator (port of MicroMax) and it was slower than Magic-Bitboards.
I tried also an parallel Kogge-Stone approach with 8 threads for 8 directions, and it was slower.
Like Daniel already mentioned, Move Generation is only one point, the overall performance is tight to the post-process,:
http://zeta-chess.blogspot.de/2012/03/p ... llers.html
--
Srdja
-
- Posts: 2250
- Joined: Wed Mar 08, 2006 8:47 pm
- Location: Hattingen, Germany
Re: Zeta CL uses Magic Bitboards
Hi Srdja,smatovic wrote:Hi John,
maybe you want to take a look at "Zeta CL"
http://zeta-chess.blogspot.de/
Code is published under GPL:
https://github.com/smatovic/Zeta/
Zeta uses QuadBitboards (thx to Gerd) for Board Presentation and a Magic-Bitboard Move Generator (parts ported from Stockfish).
I tried an 0x88 move generator (port of MicroMax) and it was slower than Magic-Bitboards.
I tried also an parallel Kogge-Stone approach with 8 threads for 8 directions, and it was slower.
Like Daniel already mentioned, Move Generation is only one point, the overall performance is tight to the post-process,:
http://zeta-chess.blogspot.de/2012/03/p ... llers.html
--
Srdja
Hmm, magic bitboards with a ~800KByte rook table on a GPU? I thought it is overkill there.
Your qbb legal move generation in zeta.cl with domove/undomove, king in check test, x-times getting piece codes from qbb in domove/undomove and elsewhere, and lots of branches in inner loops, looks horrorful inefficient and ugly to me
Fillstuff with all 8 ray-directions inside one thread with some local memory or SIMD register file to keep own and disjoint opponent dir-attacks of sliders (you know fillstuff works with multiple rooks/queens or bishops/queens in one run) and kings during generation time, should be much better suited, I guess. But of course, I am not a GPU expert.
Determining absolutely pinned pieces on their pinned direction is then a simple intersection of own pieces with opponent sliding attacks in that ray-direction with the "pseudo" sliding attacks of the own king in the opposite ray-direction. Also, with all opponent sliding attacks (x-raying own king in case of check, thus removing own king from occupancy for opponent sliding attacks), along with the union of opponent pawn, knight and king-attacks available, you have all attacked taboo squares concerning legal king moves and castling. Additionally, the sliding king dir-fills provide all informations to determine check giving moves, either discovered checks (similar to pinned pieces) or direct checks i.e. for some qsearch or tactical search layer etc... .
Gerd