Further progress on new move generator

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

hgm
Posts: 27788
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Further progress on new move generator

Post by hgm »

Steve Maughan wrote: Thu Mar 14, 2019 12:20 pm Maverick uses a hybrid magic bitboard and lookup-table method. The bitboards are used to generate the moves, which are then stored as a list of pointers into the move table. This has the advantage of bitboard move generation of captures etc., and the lookup-table advantage of make-unmake. I'm sure it could be optimized more, but Maverick was doing single-core 8 million nps on a 2015 i7 when it had a basic piece-square table.
I am not sure I understand this. What information is in the lookup tables that would not be provided by the bitboard generation already? Is it that you store so much info on every move that it pays to have an intermediate level of indirection through the pointers (so that later you would only have to sort the pointers, etc.)?

For me a move is usually just a from-square, to-square, promotion flag and sort key. This can almost always be packed into a 32-bit integer very easily, even on bloody-big boards. So I always manipulate the moves directly, with no need for pointers.

Only when I want to save time on sorting for the history heuristic would I assign two words to each: one for the move itself, and a 'pointer' (actually just an index into the move stack) to the 'next' move, so that I can put the non-captures in linked lists binned by history value. Later I can then just loop through the bins and the lists without any further sorting.
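Schematically, the binning looks something like this (a simplified sketch with made-up names and sizes, not literal engine code):

Code: Select all

#include <cstdint>

const int MAXMOVES = 256;   // per-node move list size (just an example)
const int NBINS    = 16;    // number of history-value bins (just an example)

struct MoveList {
    struct {
        uint32_t move;        // the packed move itself
        int      next;        // 'pointer': index of the next move in the same bin
    } stack[MAXMOVES];
    int head[NBINS];          // first move of each bin, -1 = empty bin
    int count;
};

void initList(MoveList *l)
{
    l->count = 0;
    for (int b = 0; b < NBINS; b++) l->head[b] = -1;
}

// prepend a generated non-capture to the linked list of its history bin
void addNonCapture(MoveList *l, uint32_t m, int histBin)
{
    int i = l->count++;
    l->stack[i].move = m;
    l->stack[i].next = l->head[histBin];
    l->head[histBin] = i;
}

// search loop: walk the bins from best to worst; no sorting needed
template <typename Visit>
void forEachNonCapture(const MoveList *l, Visit visit)
{
    for (int b = NBINS - 1; b >= 0; b--)
        for (int i = l->head[b]; i >= 0; i = l->stack[i].next)
            visit(l->stack[i].move);
}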
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: Further progress on new move generator

Post by Joost Buijs »

Indeed, you can store all necessary information in a 32 bit field. In my engine I don't even care to pack the move and use 8 bits from-square, 8 bits to-square, 8 bits flags, and 8 bits info. Adding a list of pointers to the move-list will only increase the load on the cache and doesn't enhance performance at all.

I never tried to see how fast my engine runs with material eval only. There are a lot of things that have a very big influence on the results, for instance using the TT or not, or using alpha-beta or plain minimax. The only thing I know is that my perft function with full incremental update of the material evaluation and the hash keys (without bulk counting) runs at 100 to 200 million moves per second, depending on the position I test with. The function that uses the most CPU is not the move generator nor the do/undo-move function, but the function that checks the legality of the move.
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: Further progress on new move generator

Post by Michael Sherwin »

hgm wrote: This sounds like a case of suspect counting.
And this is a strawman argument. You changed the argument from "is one faster than the other?" to "did you count nodes properly?". Well, if both count nodes in the same way and the table lookup is faster, which I can also demonstrate, your argument is a non sequitur.
hgm
Posts: 27788
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Further progress on new move generator

Post by hgm »

Joost Buijs wrote: Thu Mar 14, 2019 2:09 pm In my engine I don't even care to pack the move and use 8 bits from-square, 8 bits to-square, 8 bits flags, and 8 bits info.
Well, I consider that also packing, and in most engines it is exactly what I do too. (But some of my engines need 12-bit square numbers, to handle 25x25 boards with 0x88-type square numbering.)

BTW, I am not sure what would be faster: doing separate byte-size memory loads (with zero padding or sign extension) on each move component, or just loading the entire move into a register and extracting the components by shift & mask on register-to-register copies. I usually program the latter, with the idea that the optimizer would be smart enough to recognize the equivalence and generate the same code for both. I am not sure whether this would still be true if I first assign the entire move to an int variable (int move = moveStack[currentMove];), though. (Which I would sure hope the optimizer eliminates, rather than making an actual memory copy.) Things like fromSqr = move >> 8 & 0xFF; can be done with a single load if 'move' is in memory, but would actually require the specified ALU operations when 'move' is in a register.

One reason why I like to control which bits of the int32 are used for what is that I can then make sure the sort key goes into the high byte (so that I can sort by comparing the entire int, rather than also having to extract fields there), without relying on the endianness of the CPU architecture. (Which I would have to when I declare the Move as a union of an int32 and an array of 4 chars.)
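To make the comparison concrete, the two styles look roughly like this (a sketch with an example layout and made-up names, not code from any particular engine):

Code: Select all

#include <cstdint>
#include <algorithm>

// example layout: bits 8-15 = from, bits 16-23 = to, bits 24-31 = sort key
typedef uint32_t Move;

// style 1: keep the move in a register, extract the fields by shift & mask
inline int fromSqr(Move m) { return m >> 8  & 0xFF; }
inline int toSqr  (Move m) { return m >> 16 & 0xFF; }

// style 2: a byte overlay on the move in memory; which array element holds
// which field then depends on the endianness of the CPU
union MoveBytes {
    Move    move;
    uint8_t byte[4];
};

// with the sort key in the most significant byte, sorting the raw 32-bit
// values in descending order sorts the moves by key without extracting anything
inline void sortMoves(Move *list, int n) {
    std::sort(list, list + n, [](Move a, Move b) { return a > b; });
}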
hgm
Posts: 27788
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Further progress on new move generator

Post by hgm »

Michael Sherwin wrote: Thu Mar 14, 2019 4:01 pm And this is a strawman argument. You changed the argument from "is one faster than the other?" to "did you count nodes properly?". Well, if both count nodes in the same way and the table lookup is faster, which I can also demonstrate, your argument is a non sequitur.
It is not my intention at all to suggest that one way of counting is better than the other. I just want to point out that the 33MNPS you report is meaningless to me, as the engines I have to compare with count in a different way. So when Spartacus does 2MNPS, it might be faster or slower than Carnivor, I simply have no idea. So the fact that you can report an extremely high NPS doesn't convince me that it is really fast.

But this was really an unimportant side note to the main point of my post; it would be sad if it put you off from reading the main part:

The move generator you are presenting here does a great deal of work in every node (namely establishing whether the path of a slider capture is clear, which for mailbox boards requires many memory accesses to the board array), and then you try to have that work done as fast as possible.

I just want to point out that there is a different strategy, where you don't do the work at all. At least not in the leaf nodes, but somewhere much closer to the root, where it can be shared by perhaps thousands of nodes, so that it is hardly relevant at all whether it is fast or slow. Doing nothing at all will always be faster than doing a lot very fast.

The clear path of most slider moves will be determined by a blocking piece that has been sitting there for very many ply, and often was even there already in the root. If you establish that path at the moment it first comes into existence, and then remember it for all nodes in the sub-tree below to consult, you can save an enormous number of board accesses.

Note that even the generation of non-captures can be sped up by the neighbor table, as all squares in the applicable direction up to the listed neighbor are guaranteed to be empty. So there is no need to access the board during the generation, you just crank out the moves. (You would have to structure the neighbor table such that each point where a ray goes off-board uses a different dummy cell, however, as if there is one extra rim of squares just beyond the edge, to benefit maximally from that.)
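To sketch what I mean (simplified, with made-up names, and with the incremental updating of the neighbor table left out):

Code: Select all

#include <cstdint>

// 16x16 mailbox board with a guard rim; sizes and layout are just an example
const int BOARDSIZE  = 256;
const int DIRECTIONS = 8;

int    step[DIRECTIONS];                // board increment per ray direction
int    neighbor[BOARDSIZE][DIRECTIONS]; // nearest occupied square (or an
                                        // off-board dummy cell) on each ray,
                                        // kept up to date incrementally
int8_t board[BOARDSIZE];                // 0 = empty, >0 white, <0 black

// slider moves from 'from' along 'dir': every square strictly between 'from'
// and the stored neighbor is guaranteed empty, so the board array is never
// consulted while cranking out the non-captures
template <typename Emit>
void sliderRay(int from, int dir, Emit emit)
{
    int stop = neighbor[from][dir];
    for (int to = from + step[dir]; to != stop; to += step[dir])
        emit(from, to);                 // non-capture, no board access
    // 'stop' itself is a capture victim, an own piece, or the dummy cell of
    // an off-board ray, and is handled separately
}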
Steve Maughan
Posts: 1221
Joined: Wed Mar 08, 2006 8:28 pm
Location: Florida, USA

Re: Further progress on new move generator

Post by Steve Maughan »

hgm wrote: Thu Mar 14, 2019 1:31 pm I am not sure I understand this. What information is in the lookup tables that would not be provided by the bitboard generation already? Is it that you store so much info on every move that it pays to have an intermediate level of indirection through the pointers (so that later you would only have to sort the pointers, etc.)?
Hi H.G. - yes, I store all sorts of info in the table (e.g. hash delta, pawn hash delta, need to reset the 50-move count, etc.) and the type of move, e.g. castle, capture etc. The make/unmake routines are a large "switch" statement that uses the type-of-move info to make the necessary changes on the board. This can save time, e.g. I don't update the pawn hash when castling. The other advantage is that "killers" are simply a pointer into the move table, so they incorporate the piece, from square, and to square. Normal killers usually only have from- and to-square info. Anyway, it works well for me.
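Schematically, an entry in the move table looks something like this (a simplified sketch with illustrative field names, not the exact layout Maverick uses):

Code: Select all

#include <cstdint>

struct MoveTableEntry {
    uint8_t  from;          // from square
    uint8_t  to;            // to square
    uint8_t  piece;         // moving piece
    uint8_t  type;          // ordinary / capture / castle / promotion ...
    uint64_t hashDelta;     // XOR delta for the main hash key
    uint64_t pawnHashDelta; // XOR delta for the pawn hash (0 if unaffected)
    bool     resetsFifty;   // whether the move resets the 50-move counter
};

// a generated move, and also a killer, is then just a pointer into the table,
// so it carries piece, from square and to square with it
typedef const MoveTableEntry *MovePtr;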

Steve
http://www.chessprogramming.net - Maverick Chess Engine
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: Further progress on new move generator

Post by Joost Buijs »

hgm wrote: Thu Mar 14, 2019 4:09 pm
Joost Buijs wrote: Thu Mar 14, 2019 2:09 pm In my engine I don't even care to pack the move and use 8 bits from-square, 8 bits to-square, 8 bits flags, and 8 bits info.
Well, I consider that also packing, and in most engines it is exactly what I do too. (But some of my engines need 12-bit square numbers, to handle 25x25 boards with 0x88-type square numbering.)

BTW, I am not sure what would be faster: doing separate byte-size memory loads (with zero padding or sign extension) on each move component, or just loading the entire move into a register and extracting the components by shift & mask on register-to-register copies. I usually program the latter, with the idea that the optimizer would be smart enough to recognize the equivalence and generate the same code for both. I am not sure whether this would still be true if I first assign the entire move to an int variable (int move = moveStack[currentMove];), though. (Which I would sure hope the optimizer eliminates, rather than making an actual memory copy.) Things like fromSqr = move >> 8 & 0xFF; can be done with a single load if 'move' is in memory, but would actually require the specified ALU operations when 'move' is in a register.

One reason why I like to control which bits of the int32 are used for what is that I can then make sure the sort key goes into the high byte (so that I can sort by comparing the entire int, rather than also having to extract fields there), without relying on the endianness of the CPU architecture. (Which I would have to when I declare the Move as a union of an int32 and an array of 4 chars.)
What I meant was packing to non-native bit sizes. Usually I let the compiler do the packing and unpacking; when I use structs or arrays the performance is comparable to move >> x & 0xFF, but with bitfields of non-native size the performance is a lot worse. I never looked at the assembler output, so I have no idea what the compiler is actually doing.

I coded a move as a union of a packed struct with 4 bytes and an unsigned 32 bit integer:

Code: Select all


#include <cstdint>

#pragma pack(1)

union move_t
{
	struct
	{
		uint8_t from;	// from-square
		uint8_t to;	// to-square
		uint8_t type;	// move type / flags
		uint8_t info;	// additional info
	};

	uint32_t move;		// the whole move as one 32-bit integer
};

#pragma pack()
I don't know if this gives issues with little/big-endianness because I have never used a CPU other than x64. As long as I let the compiler do all the packing and unpacking it should be OK.
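A small usage example (on little-endian x64 the 'from' byte ends up in the low byte of 'move'; on a big-endian machine the fields would land in the opposite order, but the field accesses themselves would still be correct):

Code: Select all

#include <cstdint>
#include <cstdio>

#pragma pack(1)
union move_t {
	struct { uint8_t from, to, type, info; };
	uint32_t move;
};
#pragma pack()

int main()
{
	move_t m;
	m.move = 0;	// clear all four fields at once
	m.from = 12;	// e2
	m.to   = 28;	// e4
	// copying or comparing the whole move goes through the 32-bit member
	std::printf("packed move: 0x%08x\n", m.move);	// prints 0x00001c0c on x64
}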
Fulvio
Posts: 395
Joined: Fri Aug 12, 2016 8:43 pm

Re: Further progress on new move generator

Post by Fulvio »

hgm wrote: Thu Mar 14, 2019 4:33 pm
Michael Sherwin wrote: Thu Mar 14, 2019 4:01 pmAnd this is a strawman argument. You changed the argument from is one faster than the other to did you count nodes properly. Well, if both count nodes in the same way and the table lookup is faster, which I can also demonstrate, your argument is non sequitur.
It is not my intention at all to suggest that one way of counting is better than the other.
This discussion is interesting, but in my opinion all these claims about nodes/sec are confusing.
I think everyone will agree that a faster move generator does not automatically imply a stronger engine.
However, if it is about a new move generator, the interesting thing is to measure the time needed to generate the moves.
Lc0 uses bitboards:
https://github.com/LeelaChessZero/lc0/t ... /src/chess

This code measures how long it takes to generate roughly 200M nodes:

Code: Select all

#include "board.h"
#include <chrono>
#include <iostream>

#define NO_PEXT

long long perft(const lczero::ChessBoard &board, int max_depth, int depth = 0) {
  if (depth == max_depth)
    return 1;

  auto total_count = 0ll;
  const auto moves = board.GenerateLegalMoves();
  for (const auto &move : moves) {
    auto new_board = board;
    new_board.ApplyMove(move);
    new_board.Mirror();  // lc0 boards are always seen from the side to move, so flip after each move
    total_count += perft(new_board, max_depth, depth + 1);
  }
  return total_count;
}

int main(int argc, char **argv) {
  lczero::InitializeMagicBitboards();

  auto kiwipete = lczero::ChessBoard(
      "r3k2r/p1ppqpb1/bn2pnp1/3PN3/1p2P3/2N2Q1p/PPPBBPPP/R3K2R w KQkq - 0 1");

  const auto t_start = std::chrono::high_resolution_clock::now();

  const auto count = perft(kiwipete, 5);

  const auto t_end = std::chrono::high_resolution_clock::now();
  const std::chrono::duration<double> t_diff = t_end - t_start;

  std::cout << "Nodes: " << count << "\n";
  std::cout << "Time: " << t_diff.count() << " s\n";
  std::cout << "Nodes per sec: " << count / t_diff.count() << " s\n";
}
hgm
Posts: 27788
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Further progress on new move generator

Post by hgm »

Steve Maughan wrote: Thu Mar 14, 2019 5:00 pm Hi H.G. - yes, I store all sorts of info in the table (e.g. hash delta, pawn hash delta, need to reset the 50-move count, etc.) and the type of move, e.g. castle, capture etc. The make/unmake routines are a large "switch" statement that uses the type-of-move info to make the necessary changes on the board. This can save time, e.g. I don't update the pawn hash when castling. The other advantage is that "killers" are simply a pointer into the move table, so they incorporate the piece, from square, and to square. Normal killers usually only have from- and to-square info. Anyway, it works well for me.
You mean the move infos are static, precalculated tables, incorporating all conceivable moves in a chess game? I suppose that you would not only have to index them by from-square and to-square (and possibly promotion piece), but also by moving piece and capture victim, if you want to get the key changes. That would make them pretty large. Nevertheless it is a clever idea. (Not really practical for Tenjiku Shogi, though, with a 256-square board and some 60 piece types... :( )
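Just to put a number on 'pretty large' (a back-of-the-envelope estimate, not Maverick's actual figures): indexing by 64 from-squares x 64 to-squares x 12 moving piece types x 13 possible victims (including 'none') already gives 64 x 64 x 12 x 13 = 638,976 entries, and at a couple of dozen bytes per entry that is on the order of 15 MB, far too large to stay in cache.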
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: Further progress on new move generator

Post by Joost Buijs »

Fulvio wrote: Thu Mar 14, 2019 5:37 pm
hgm wrote: Thu Mar 14, 2019 4:33 pm
Michael Sherwin wrote: Thu Mar 14, 2019 4:01 pm And this is a strawman argument. You changed the argument from "is one faster than the other?" to "did you count nodes properly?". Well, if both count nodes in the same way and the table lookup is faster, which I can also demonstrate, your argument is a non sequitur.
It is not my intention at all to suggest that one way of counting is better than the other.
This discussion is interesting, but in my opinion all these claims about nodes/sec are confusing.
I think everyone will agree that a faster move generator does not automatically imply a stronger engine.
However, if it is about a new move generator, the interesting thing is to measure the time needed to generate the moves.
Lc0 uses bitboards:
https://github.com/LeelaChessZero/lc0/t ... /src/chess

This code measures how long it takes to generate roughly 200M nodes:

Code: Select all

#include "board.h"
#include <chrono>
#include <iostream>

#define NO_PEXT

long long perft(const lczero::ChessBoard &board, int max_depth, int depth = 0) {
  if (depth == max_depth)
    return 1;

  auto total_count = 0ll;
  const auto moves = board.GenerateLegalMoves();
  for (const auto &move : moves) {
    auto new_board = board;
    new_board.ApplyMove(move);
    new_board.Mirror();  // lc0 boards are always seen from the side to move, so flip after each move
    total_count += perft(new_board, max_depth, depth + 1);
  }
  return total_count;
}

int main(int argc, char **argv) {
  lczero::InitializeMagicBitboards();

  auto kiwipete = lczero::ChessBoard(
      "r3k2r/p1ppqpb1/bn2pnp1/3PN3/1p2P3/2N2Q1p/PPPBBPPP/R3K2R w KQkq - 0 1");

  const auto t_start = std::chrono::high_resolution_clock::now();

  const auto count = perft(kiwipete, 5);

  const auto t_end = std::chrono::high_resolution_clock::now();
  const std::chrono::duration<double> t_diff = t_end - t_start;

  std::cout << "Nodes: " << count << "\n";
  std::cout << "Time: " << t_diff.count() << " s\n";
  std::cout << "Nodes per sec: " << count / t_diff.count() << " s\n";
}
Well, I don't know how fast LC0 runs perft(kiwipete, 5), but my engine, which is also bitboard based, does the 193,690,690 moves in 1.039 seconds on a 3800 MHz Intel CPU. This is ~186 million moves/sec. As said, move generation speed is not very important. Positional evaluation, static exchange analysis and probing the transposition table take the largest part of the consumed time.

For LC0 it is even less important, because an NN-based evaluation function runs roughly 1000 times slower than a traditional hand-crafted one.