Fast table-driven move generation

vittyvirus · Post by **vittyvirus** » Sun Jan 01, 2017 2:05 pm

Hi!
Gerd Isenberg had the of "magically" hashing the attack masks to a precomputed movelist (see this). Here's the source code he provided:

Code: Select all

moveTarget = KingAttacks&#91;sq&#93; & ~&#40;ownPieces | attackedSquares&#41;;
idx = &#40;moveTarget * kingMagic&#91;sq&#93;) >> kingShift&#91;sq&#93;);
movelists = kingMoveLists&#91;sq&#93;;
movelist  = movelists&#91;idx&#93;;
moveCpy&#40;target, movelist, movelist->n&#41;;

However, as my implementation of his idea shows, the straightforward approach is in fact slower than using traditional magic bitboards. Although, I no longer am into chess programming, this idea popped into my head: instead of copying moves (i.e. moveCpy()), just store the pointer to the movelist. Now this requires two move stacks, one for storing normal moves (eg pawn moves), the other for storing references to move-lists.
Here is the pseudo-code:

Code: Select all

struct MoveList &#123;
	move_t moves&#91;MaxMoves&#93;; // MaxMoves = 8 for Knight, 13 for Bishop etc.
	char size; // Number of moves stored in this movelist
&#125;;
void add_moves&#40;square from, bitboard attacks, piece p&#41; &#123;
	//while &#40;attacks&#41;
    //  *end++ = make_move&#40;from, pop_1st_bit&#40;&attacks&#41;);
    MoveList* MList = &MListTable&#91;p&#93;&#91;sq&#93;&#91;mlist_magic_index&#40;p, sq, attacks&#41;&#93;;
    move_lists&#91;total_move_lists&#93; = &&#40;MList->moves&#91;0&#93;);
    move_list_count&#91;total_move_lists&#93; = MList->size;
    total_move_lists++;
&#125;

Some (obvious) changes are required in the perft() routine as well (see the provided source code for more info).

Results
---
This (lazy) implementation is 10-14% faster than my implementation of fancy bitboard magics even when all the pieces are on the board! It gets even faster when there are less pawns on the board. I have provided the source code, and I strongly recommend you to test it for yourself.
Besides speed, total stack needed for move generation is reduced almost by a factor of 2 for 64-bits, and even more for 32-bits.

The bad part
---
It takes huge tables to store the movelists for sliders. The movelist tables for rooks are ~56MB in size(!), for bishops around 2MB, and *just* 96kB for Knights. It is possible to get the rook tables down to 28MB by finding optimal 14-bit magics, but well, that's still a lot. Note that I haven't used tables for kings (out of laziness).

Future work
---
There is a lot of scope for improvement. With a little bit of clever work, the whole thing can be made faster and more feasible. Here are some ideas currently floating in my head:
1. Use "variable" MaxMoves in the MoveList struct (see above). This can drastically bring down table sizes.
2. Magic Pawns: Although theoretically possible, see if it is feasible to use separate tables for single pushes, promotions, double promotions, and captures. If you manage to do this, then you don't need to use two movestacks, which means lot more speed, lot less stack needed.
3. There might be a better approach to store and use pointers to movelists, especially in perft routine.

Binaries and Source Code
---
Here you go: https://sites.google.com/site/sydfhd/projects/yaka
Download the file "YakaMoveListExperiment.zip", there are some similarly named files so beware (or check the date).

Postscript
---
Be alarmed people, there's something faster than magic bitboards in town!

hgm · Post by **hgm** » Sun Jan 01, 2017 2:36 pm

vittyvirus wrote:Be alarmed people, there's something faster than magic bitboards in town!

For move generation that has always be the case. Magic bitboards are pretty slow, as far as move-generation methods go. You have to earn it back in evaluation.

Steve Maughan · Post by **Steve Maughan** » Sun Jan 01, 2017 6:08 pm

Hi Syed,

If I've understood you correctly then this is the way Maverick generates moves. I'm happy with the speed. Early versions with little evaluation did about 8.5 million nodes on a 2.4 Ghz i7.

The size of the move table is only going to get less relevant as technology advances.

Steve

tttony · Post by **tttony** » Sun Jan 01, 2017 11:29 pm

Excelent!

I will to try to implement it in my engine

mcostalba · Post by **mcostalba** » Mon Jan 02, 2017 9:03 am

vittyvirus wrote: The bad part
---

How can you score and sort the generated moves in this way?

Apart from perft, in normal search you need to score them.

brtzsnr · Post by **brtzsnr** » Mon Jan 02, 2017 9:10 pm

mcostalba wrote: How can you score and sort the generated moves in this way?

One solution is not to use a list of lists, but just copy the moves into a final moves array and sort that. This assumes that copying is cheap compared to actual move generation.

I see another issue which applies to engines that encode the capture in the move besides from/to square (e.g. Zurichess). One needs to update the captured pieces (if any) which can be tricky. So, this optimization can work only for generation of quiet moves of sliding pieces. In my engine sliding moves takes roughly 5% of total time, but most of that is not generating quiet moves.

I will give this optimization a try, but overall I don't expect a huge improvement or even an Elo gain.

vittyvirus · Post by **vittyvirus** » Wed Jan 04, 2017 2:32 pm

mcostalba wrote:
vittyvirus wrote: The bad part
---
How can you score and sort the generated moves in this way?

Apart from perft, in normal search you need to score them.

Hi Marco,

Your question is certainly interesting. Let me answer your question in perspective of Stockfish. I'm using Stockfish 5 as reference, so pardon me if I say some outdated stuff. There are two approaches I could think of.
In Stockfish move generator, move generation is done is various stages. So you can use separate movelists for storing captures, quiets etc, in a way that doesn't increase the movelist table size.
But I'd recommend using movelists only while generating "quiets" and "non-evasions", and not when generating captures, because this stuff is not so fast for sparsely populated bitboards. So for captures, at least in the search routine, avoid movelists entirely, and use a movestack instead. But for non-evasions and quiets, here are the two approaches:

First Approach
===
The obvious approach, as Alexandru Mosoi suggested, would of course be to simply extract the moves from the movelist and add them to them movestack , e.g. while scoring the moves. Here's the pseudo-code:

Code: Select all

void score_quiets&#40;) &#123;	// or non-evasions
	for &#40;int i = 0; i < n_movelists; ++i&#41; &#123;
		for &#40;int j = 0; j < movelist&#91;i&#93;.size; ++j&#41; &#123;
			end->move = movelist&#91;i&#93;->moves&#91;j&#93;;
			end->value = move_value&#40;end->move&#41;;
			end++;
		&#125;
	&#125;
&#125;

Second Approach (complicated, requires some coding)
===
In Stockfish you use insertion_sort() only to sort "quiet" moves. You can even go without sorting in the traditional way, since our concern is only to pick the best move fast. Suppose you have a set of movelists, containing only quiet/non-evasion moves. Here's a pseudo-code for how you'd "sort" a set of "quiet" movelists:

Code: Select all

// Sort each movelist individually based on move score
for &#40;int i = 0; i < n_movelists; ++i&#41; &#123;
	sort&#40;movelist&#91;i&#93;.begin&#40;), movelist&#91;i&#93;.end&#40;));
&#125;

I don't think it'd be any slower than doing a full insertion_sort(), in fact I think this approach would actually be faster, since it doesn't really do a full sort, and the number are moves inside each movelist are relatively low (so it kinda uses divide-and-conquer), but I could be wrong...
There are several implementation details lacking from the code above. Firstly, how would you store the value of each move? A straightforward approach would be to store scored moves inside the movelists. This would of course increase double the table size, if we use 16-bits to use store the score.
Secondly, how'd you then pick the best move from a individually sorted set of movelists? Here's the algorithm I had to figure out:

Code: Select all

char BestMoveIdx&#91;n_movelists&#93;;	// Initially &#123;0, 0, ...&#125;

// Pick the next best move, requires individually sorted movelists
move pick_next_best&#40;MoveList* movelist&#41; &#123;
	move* best = &&#40;movelist&#91;0&#93;->moves&#91;BestMoveIdx&#91;0&#93;&#93;);
	int best_mlist = 0;
	for &#40;int i = 0; i < n_movelists; ++i&#41; &#123;
		if &#40;movelist&#91;i&#93;->moves&#91;BestMoveIdx&#91;i&#93;&#93; > best->value&#41; &#123;
			best = &&#40;movelist&#91;i&#93;->moves&#91;BestMoveIdx&#91;i&#93;&#93;);
			best_mlist = i;
		&#125;
	&#125;
	
	BestMoveIdx&#91;best_mlist&#93;++;
	return *best;
&#125;

This routine should be certainly faster than what is currently used (i.e. pick_best()), because n_movelists is guaranteed to be <= 5 in our case, which means only five branches needed! Some extra work needed to handle pawns could be avoided by using this approach: treat the pointer to pawn movestack as a movelist, either in the score<>() rountine, or even in the movegen itself.

I'm yet to test both of these approaches so I can't really say which one is faster, but my pick would be approach 2.

Ed Trice · Post by **Ed Trice** » Tue Jan 10, 2017 12:16 am

Why not just generate moves progressively until you get an alpha-beta cutoff? Saves on generating wasteful moves that only bloat the search tree.

ZirconiumX · Post by **ZirconiumX** » Tue Jan 10, 2017 2:26 pm

Ed Trice wrote:Why not just generate moves progressively until you get an alpha-beta cutoff? Saves on generating wasteful moves that only bloat the search tree.

Because in my engine at least, saving the 2.68% of execution time that move generation takes is not going to make any significant difference. Languages with built-in yield support are welcome to use it, but it's not going to make any significant difference (at least in the languages i know that support it, the overhead of the interpreter is many times the overhead of generating all moves)

brtzsnr · Post by **brtzsnr** » Tue Jan 10, 2017 4:30 pm

Why not just generate moves progressively until you get an alpha-beta cutoff? Saves on generating wasteful moves that only bloat the search tree.

Sorting: Most engines sort the moves in order of their relevance (MVV/LVA or history). Moves cannot be sorted if they are generated one by one.

Fast table-driven move generation

Fast table-driven move generation

Re: Fast table-driven move generation

Re: Fast table-driven move generation

Re: Fast table-driven move generation

Re: Fast table-driven move generation

Re: Fast table-driven move generation

Re: Fast table-driven move generation

Re: Fast table-driven move generation

Re: Fast table-driven move generation

Re: Fast table-driven move generation