Memory usage in make/unmake vs logic complexity

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Memory usage in make/unmake vs logic complexity

Post by matthewlai »

I recently started rewriting my ancient engine. Nothing special, just a good old magic bitboard engine.

I spent last night optimizing move generation, and just wanted to share an observation.

Basically, memory access is slow slow slow. Whenever there is a possible tradeoff between extra logic and extra memory traffic, extra logic almost always wins. Obviously there is a limit to this, but the limit is much much higher than what I had imagined.

Needless to say, make/unmake is much much faster than copy/make.

Make/unmake is a whole lot more code and complexity and opportunity for bugs, but if you care about performance, that's probably the way to go.

Even just by eliminating a few bitboard copies here and there, I was able to get very substantial speedups.

I now have a perft of close to 20Mnps on my MacBook Pro (2.4 GHz Haswell), with no perft-specific optimizations (no hashing, and not even bulk counting, since I'm interested in make/unmake performance). Crafty 24 on the same machine gets 30Mnps. It may not sound very fast compared to Crafty... but I hacked together this whole thing in about 2 days, compared to a few decades for Crafty.

My old engine (on rotated bitboards and copy/make) was getting about 3Mnps.

This is not surprising, because for example, even on Haswell (newest Intel architecture),
L1 = 4-5 cycles
L2 = 12 cycles
L3 = 36 cycles
RAM = on the order of hundreds of cycles

It can execute 4 instructions PER CYCLE for simple instructions (which covers most of what we do with bitboards, including things like popcount). Obviously that's the best case, and it will depend on branch prediction, etc, but I think it's reasonable to assume about 2 instructions per cycle on average for a reasonably well-optimized program built with a reasonably competent optimizing compiler (minimize unpredictable branches, etc).

That means even an L1 hit is equivalent to about 10 instructions, and an L2 hit is about 24 instructions.

Most processors are now switching to larger L3 and smaller L2, and an L3 hit is about 72 instructions.

If you have ONE cache miss, that one DRAM access will probably take longer than your entire move generation/application routine.

This is also with just 1 process/thread running on the CPU. In an actual chess engine, there would be other functions competing for memory bandwidth and caches as well, making logic even cheaper by comparison. And when scaling to more cores on a shared memory bus, memory gets even worse (or, put another way, logic gets even better).

So if you do make/unmake, and have to make a decision on whether to save something or recompute it, just remember that a memory load takes about as long as 30-50 instructions.
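To make that concrete, here is the kind of recomputation I mean. This is just a minimal sketch with a made-up layout, not my actual code:

```cpp
#include <cstdint>

// Hypothetical piece-bitboard layout, purely for illustration.
struct Position {
    uint64_t pieces[2][6]; // [color][piece type]
};

// Rebuilding the occupancy union is ~11 ORs on data that is almost
// certainly hot in L1 during search. A single cold load of a separately
// cached copy could cost more than the whole recomputation.
inline uint64_t occupancy(const Position& pos) {
    uint64_t occ = 0;
    for (int c = 0; c < 2; ++c)
        for (int p = 0; p < 6; ++p)
            occ |= pos.pieces[c][p];
    return occ;
}
```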
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
syzygy
Posts: 5566
Joined: Tue Feb 28, 2012 11:56 pm

Re: Memory usage in make/unmake vs logic complexity

Post by syzygy »

matthewlai wrote:That means even an L1 hit is equivalent to about 10 instructions, and an L2 hit is about 24 instructions.
Not correct. The CPU can execute instructions while waiting for L1 or L2.

With copy/make both the old and new state memory will as a rule be in L1. The state is copied many bytes at a time. This is known to be quite competitive on modern CPUs.
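Something like this (field layout made up, roughly 130 bytes), which a modern compiler turns into a short run of wide vector moves:

```cpp
#include <cstdint>

// Hypothetical state layout; the exact fields vary by engine.
struct State {
    uint64_t bitboards[14];
    uint64_t hashKey;
    uint8_t  sideToMove, castling, epSquare, halfmoveClock;
};

// Copy/make: each ply of the search gets its own State, so unmake
// is free -- just step back to the parent's untouched copy.
void makeMove(const State& parent, State& child /*, Move m */) {
    child = parent; // one struct assignment, vectorized by the compiler
    // ... apply the move to `child` only ...
}
```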
matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: Memory usage in make/unmake vs logic complexity

Post by matthewlai »

syzygy wrote:
matthewlai wrote:That means even an L1 hit is equivalent to about 10 instructions, and an L2 hit is about 24 instructions.
Not correct. The CPU can execute instructions while waiting for L1 or L2.

With copy/make both the old and new state memory will as a rule be in L1. The state is copied many bytes at a time. This is known to be quite competitive on modern CPUs.
That is true, but only if there are other instructions to be executed.

If the program is just copying say 200 bytes from one place to another (in L1), there aren't really any other instructions that can be scheduled for most of that time.

I certainly hope it's true that copy/make is competitive. That would make my code much simpler. I'll give it a try and compare.
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
syzygy
Posts: 5566
Joined: Tue Feb 28, 2012 11:56 pm

Re: Memory usage in make/unmake vs logic complexity

Post by syzygy »

Personally I think neither a strict copy/make nor a strict make/unmake approach can be optimal.

I would handle state data that usually does not change between moves using make/unmake. For example a 64-byte array that holds the piece type for each square. Copying the whole array during make seems more expensive than undoing the few changes to this array during the unmake. Same for the bitboard of knights (I think). And same, for sure, for piece lists if present.

I would handle state data that usually does change between moves using copy/make:
- read the data into a register from the old state;
- apply the changes;
- write the data to the new state.

If make/unmake this would be:
- read the data into a register from the old state;
- apply the changes;
- write the data to the old state.

So no real difference during make, but a gain during unmake.

As an example, not keeping the old value of the hash key but recalculating it during unmake is a certain loss. Same for the bitboard of occupied squares (I think).
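In code, the hybrid could look something like this (names and fields invented for the example):

```cpp
#include <cstdint>

// Rarely-changing data: one shared copy, undone during unmake.
struct Board {
    uint8_t  squares[64]; // piece type per square
    uint64_t knights;     // etc.
};

// Frequently-changing data: one slot per ply, copied forward.
struct PlyState {
    uint64_t hashKey;
    uint64_t occupied;
    uint8_t  castling, epSquare;
};

void make(Board& board, const PlyState& prev, PlyState& next /*, Move m */) {
    next = prev;                 // read the old state into registers
    // next.hashKey ^= ...;      // apply the changes...
    // next.occupied ^= ...;     // ...and write them to the NEW state
    // board.squares[...] = ...; // Board changes are journaled for unmake
}
// Unmake: discard `next` entirely; only the Board changes get undone.
```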

How do you currently handle the hash key?
AlvaroBegue
Posts: 931
Joined: Tue Mar 09, 2010 3:46 pm
Location: New York
Full name: Álvaro Begué (RuyDos)

Re: Memory usage in make/unmake vs logic complexity

Post by AlvaroBegue »

syzygy wrote:As an example, not keeping the old value of the hash key but recalculating it during unmake is a certain loss. Same for the bitboard of occupied squares (I think).

How do you currently handle the hash key?
My hash key is the Zobrist key for the board XORed with numbers that encode castling rights and the en-passant square. I update the Zobrist key for the board in the very lowest-level functions that modify the board (place_piece() and remove_piece()). It is possible that there are some gains to be had by not updating the Zobrist key while undoing moves and having the value saved somewhere, but it is much easier to avoid bugs if every time the board changes the Zobrist key changes with it.
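In sketch form (the table layout and piece encoding are assumed here, not my exact code):

```cpp
#include <cstdint>

constexpr uint8_t EMPTY = 12;
uint64_t zobrist[12][64]; // [piece][square], filled with random numbers at startup

struct Board {
    uint8_t  squares[64];
    uint64_t hashKey;
};

// XOR is its own inverse, so these same two functions keep the key in
// sync during make AND unmake -- there is no separate undo path to get wrong.
inline void place_piece(Board& b, int piece, int sq) {
    b.squares[sq] = (uint8_t)piece;
    b.hashKey ^= zobrist[piece][sq];
}

inline void remove_piece(Board& b, int sq) {
    b.hashKey ^= zobrist[b.squares[sq]][sq];
    b.squares[sq] = EMPTY;
}
```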

My program fits the description of Matthew's program, and I get 22M nodes per second in perft from the root, on an i7-4650U CPU @ 1.70GHz.
Aleks Peshkov
Posts: 892
Joined: Sun Nov 19, 2006 9:16 pm
Location: Russia

Re: Memory usage in make/unmake vs logic complexity

Post by Aleks Peshkov »

I believe that many competitive board representations exist. Making another clone of the 25-year-old Crafty bitboard representation and magic attack generation is a boring and uncreative job.
matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: Memory usage in make/unmake vs logic complexity

Post by matthewlai »

syzygy wrote:Personally I think neither a strict copy/make nor a strict make/unmake approach can be optimal.

I would handle state data that usually does not change between moves using make/unmake. For example a 64-byte array that holds the piece type for each square. Copying the whole array during make seems more expensive than undoing the few changes to this array during the unmake. Same for the bitboard of knights (I think). And same, for sure, for piece lists if present.

I would handle state data that usually does change between moves using copy/make:
- read the data into a register from the old state;
- apply the changes;
- write the data to the new state.

If make/unmake this would be:
- read the data into a register from the old state;
- apply the changes;
- write the data to the old state.

So no real difference during make, but a gain during unmake.

As an example, not keeping the old value of the hash key but recalculating it during unmake is a certain loss. Same for the bitboard of occupied squares (I think).

How do you currently handle the hash key?
What I am doing (and I believe this is a pretty common way?) is keeping 2 arrays in my position representation - 1 of uint64_t's, and 1 of uint8_t's.

The uint64_t array stores bitboards (piece types, en passant square, white occupancy, and black occupancy).

The uint8_t array stores all other attributes, like side to move, castling rights, and the half-move clock.

In my ApplyMove routine, I keep 2 "undo lists". One for each array.

Each entry in the undo list is an array index and the old content of that element.

And in the ApplyMove routine, every time I change an element in one of the 2 arrays, I push the old value (along with the index) to the undo list. The lists are eventually pushed onto a stack.

Then for unmaking moves, I just pop a list from the stack, iterate through the list, and apply all the changes.

So in a way, I guess I'm not really unmaking moves in the strictest sense, but just keeping a journal of changes, to allow rollback.

With my representation I have proven that I need to update at most 6 bitboards and 8 uint8_t's per move. That's the absolute worst case, and most moves have much shorter undo lists.

I don't have Zobrist hashing yet, since I don't have a need for it at the moment. But when I add it I will shove the key into the undo list for sure, because updating the hash on unmake would require at least 1 access to the Zobrist tables, and that already breaks even with just saving the old key in the undo list.
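In rough code, the journal looks like this (container details simplified; the worst-case sizes are the ones proven above):

```cpp
#include <cstdint>
#include <vector>

struct BBChange { uint8_t index; uint64_t oldValue; };
struct U8Change { uint8_t index; uint8_t  oldValue; };

struct UndoFrame {                 // one frame per made move
    BBChange bb[6]; int nBB = 0;   // at most 6 bitboard changes
    U8Change u8[8]; int nU8 = 0;   // at most 8 uint8_t changes
};

struct Position {
    uint64_t boards[16];           // piece types, occupancies, en passant, ...
    uint8_t  attrs[8];             // side to move, castling rights, clocks, ...
    std::vector<UndoFrame> stack;

    // ApplyMove pushes an empty frame, then routes every write through
    // helpers like this one, so old values are journaled automatically.
    void setBoard(int i, uint64_t v) {
        UndoFrame& f = stack.back();
        f.bb[f.nBB++] = { (uint8_t)i, boards[i] };
        boards[i] = v;
    }

    void UndoMove() {              // pop a frame and roll the changes back
        const UndoFrame& f = stack.back();
        for (int i = f.nBB; i-- > 0; ) boards[f.bb[i].index] = f.bb[i].oldValue;
        for (int i = f.nU8; i-- > 0; ) attrs[f.u8[i].index]  = f.u8[i].oldValue;
        stack.pop_back();
    }
};
```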
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: Memory usage in make/unmake vs logic complexity

Post by matthewlai »

Aleks Peshkov wrote:I believe that many competitive board representations exist. Making another clone of the 25-year-old Crafty bitboard representation and magic attack generation is a boring and uncreative job.
For me move generation and make/unmake are not the interesting parts of chess programming.

I just want a correct and fast implementation as a basis for experimenting on different ways to do search and evaluation. And bitboard + magic is a tried and true fast algorithm. If you come up with something significantly faster and it's not terribly complicated, I'll copy that, too :).

I am rewriting my engine because I want to use it as basis for my master's research (I have a few novel ideas on how to apply machine learning), and I don't want to be handicapped by slow low level stuff.
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: Memory usage in make/unmake vs logic complexity

Post by matthewlai »

AlvaroBegue wrote:
syzygy wrote:As an example, not keeping the old value of the hash key but recalculating it during unmake is a certain loss. Same for the bitboard of occupied squares (I think).

How do you currently handle the hash key?
My hash key is the Zobrist key for the board XORed with numbers that encode castling rights and the en-passant square. I update the Zobrist key for the board in the very lowest-level functions that modify the board (place_piece() and remove_piece()). It is possible that there are some gains to be had by not updating the Zobrist key while undoing moves and having the value saved somewhere, but it is much easier to avoid bugs if every time the board changes the Zobrist key changes with it.

My program fits the description of Matthew's program, and I get 22M nodes per second in perft from the root, on an i7-4650U CPU @ 1.70GHz.
That's awesome! Looks like we are getting about the same performance, if you account for the difference in hardware (my CPU turbo boosts to 3.1 GHz, yours to 3.3 GHz, and since this is single threaded, they are probably working at max boost frequency).

I guess it makes sense if we are doing about the same thing.

I did some profiling, and it looks like generating pawn moves is taking a lot of time, but I can't really think of any way to improve that. Pawn move rules are complicated.
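For context, even the fully setwise core is only part of it (sketch below, white to move, a1 = bit 0, names and constants made up for the example); en passant, promotions, and legality checks all still need separate handling:

```cpp
#include <cstdint>

constexpr uint64_t RANK_3 = 0x0000000000FF0000ULL; // single-push targets from rank 2
constexpr uint64_t NOT_A  = 0xFEFEFEFEFEFEFEFEULL; // mask off a-file wraps
constexpr uint64_t NOT_H  = 0x7F7F7F7F7F7F7F7FULL; // mask off h-file wraps

struct PawnTargets { uint64_t single, dbl, capLeft, capRight; };

// One shift per move type handles ALL pawns at once -- no per-pawn loop.
inline PawnTargets whitePawnTargets(uint64_t pawns, uint64_t empty, uint64_t enemy) {
    PawnTargets t;
    t.single   = (pawns << 8) & empty;               // one square forward
    t.dbl      = ((t.single & RANK_3) << 8) & empty; // two squares, from rank 2 only
    t.capLeft  = (pawns << 7) & enemy & NOT_H;       // capture toward the a-file
    t.capRight = (pawns << 9) & enemy & NOT_A;       // capture toward the h-file
    return t;
}
```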
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
Aleks Peshkov
Posts: 892
Joined: Sun Nov 19, 2006 9:16 pm
Location: Russia

Re: Memory usage in make/unmake vs logic complexity

Post by Aleks Peshkov »

matthewlai wrote:I am rewriting my engine because I want to use it as basis for my master's research (I have a few novel ideas on how to apply machine learning), and I don't want to be handicapped by slow low level stuff.
You would be handicapped anyway unless you start from Stockfish.