performance of copy-make

Rein Halbersma · Post by **Rein Halbersma** » Tue Aug 02, 2011 12:52 pm

A while ago, a thread ran here about the costs of copy-make vs make/undo. As Don Dailey pointed out, copy-make makes the search interface simpler, especially in a parallel context and that you save the undo logic. The price you pay is copying around something like 192 bytes (3 chache lines!) every time you make a move. Compared to chess, the games of draughts and checkers need a lot less data to keep track of a position. Instead of bitboards for 6 types of pieces times 2 colors, one needs only 3 bitboards (24 bytes). This makes it even more attractive to use copy-make. For my draughts engine, copy-make was almost exactly a break-even transition from make/undo.

I ran a small experiment with my draughts engine. All in all my current position struct is 64 bytes, including the piece bitboards, 50ply counter, a hash key, and a pointer to the previous position. I created 7 test versions where I padded the struct to have 128, 192, 256, 512, 1024, 2048 and 4096 bytes, respectively. For each version, I ran a 15-ply search (29 seconds) on the initial position. Below are the timings and the percentage change with respect to the baseline version.

Code: Select all

state	time	change
64	  28672
128	 28828	 0,5%
192	 29281	 2,1%
256	 30688	 7,0%
512	 31875	11,2%
1024	33188	15,8%
2048	36547	27,5%
4096	38531	34,4%

Depending on whether one can emulate Don Dailey's 192 bytes or squeeze the entire state into 256 bytes, one would probably get a performance hit between 2 and 7 percent. But remember that my baseline version already has copy-make, so the above table is the pure "copy penalty". For me the padded versions don't receive the "undo bonus" that one would get if one were to try this for a make/undo engine such as Crafty/Fruit or Stockfish. E.g., the current Stateinfo struct in Stockfish (that is copied every move) is already about 72 bytes if I remember, so the impact of a full copy-make would likely be similar to the table above.

Conclusion: it seems copy-make can be made competitive compared to the usual make-undo. Plus of course the simplification of the whole split point business for multicore engines.

hgm · Post by **hgm** » Tue Aug 02, 2011 1:27 pm

Well, for me the make-undo for the board and piece-list is virtually unbeatable. I only use copy-abandon for things like the hash key and the incremental evaluation. Things passed as parameters to Search, such as alpha, beta and depth of course are implicit copy-abandons.

For the attack tableI also use make-undo, but the undo code is highly simplified by saveing every memoryword that gets changed on make onan undo stack together with its address. So it just has to pop that stack without figuring out anything, except whether it is done,and the loop branch for that has good predictability.

Joost Buijs · Post by **Joost Buijs** » Tue Aug 02, 2011 2:20 pm

I've been experimenting with this in the past quite a lot and I always found copy-make performing much worse as make/unmake.

In my engine I have to copy:
piece bitboards 96 bytes
occupied bitboard 8 bytes
piece array 64 bytes
king locations 2 bytes
hash 8 bytes
pawn hash 8 bytes
incremental evaluation for both colors 4 bytes
incremental stage for both colors 4 bytes
50 move counter 1 byte

So this is atleast 195 bytes and there even is some additional stuff like
castle status enpassant status etc.

Actually I don't use the 8 bit or 16 bit fields I mentioned above but I like to make most fields 64 bit because using 8 or 16 bit values in a packed struct gives a performance hit as well because it is likely that they are not properly aligned.

Joost Buijs · Post by **Joost Buijs** » Tue Aug 02, 2011 2:38 pm

For the split operation copying doesn't hurt that much because it usually takes place 3 or 4 ply below the leaves of the tree.

Rein Halbersma · Post by **Rein Halbersma** » Tue Aug 02, 2011 2:41 pm

Thanks for sharing this info. So you could comfortably fit the whole position within 256 bytes, while making sure of the various alignment issues? Did you record how much of a performance hit it gave? 50%? factor of 2?

Joost Buijs · Post by **Joost Buijs** » Tue Aug 02, 2011 2:59 pm

Yes the whole position fits comfortably within 256 bytes, no problem.
I don't have the exact numbers anymore for the copy-make because I abandoned it a long time ago.

The performance of my current engine with perft (move generation/make/unmake, no bulk counting ) with full incremental update of the static evaluation and hashkeys etc. is around 25 mln moves per second on a single core i7. I hope you can do something with these numbers.

Joost Buijs · Post by **Joost Buijs** » Tue Aug 02, 2011 3:17 pm

Maybe the performance issue of copy-make is not that important at all because 90% of the time is spent in the evaluation function. I assume you will hardly notice the performance penalty of copy-make on the program as a whole.

Rein Halbersma · Post by **Rein Halbersma** » Tue Aug 02, 2011 3:23 pm

My table in the top post of this thread was the time to depth for a full 15-ply search on the initial position in international draughts, not for a stripped down perft run! All versions with padding compare the extra time for this full search. So 7% extra time for the 256 struct compared to the 64 byte struct.

rbarreira · Post by **rbarreira** » Tue Aug 02, 2011 3:28 pm

Joost Buijs wrote:Maybe the performance issue of copy-make is not that important at all because 90% of the time is spent in the evaluation function. I assume you will hardly notice the performance penalty of copy-make on the program as a whole.

Even if 90% of the time is spent in the evaluation function, that doesn't mean copy-make can't have a very negative impact. For example, if it fills up the cache with garbage it will make the evaluation run slower.

Profiling code is very useful but it does not tell the whole story. It only shows where your program spends its time, not why it spends its time there.

Rein Halbersma · Post by **Rein Halbersma** » Tue Aug 02, 2011 3:38 pm

rbarreira wrote:
Joost Buijs wrote:Maybe the performance issue of copy-make is not that important at all because 90% of the time is spent in the evaluation function. I assume you will hardly notice the performance penalty of copy-make on the program as a whole.
Even if 90% of the time is spent in the evaluation function, that doesn't mean copy-make can't have a very negative impact. For example, if it fills up the cache with garbage it will make the evaluation run slower.

Profiling code is very useful but it does not tell the whole story. It only shows where your program spends its time, not why it spends its time there.

For a 16 ply search and 256 bytes per position, you are at about 4K per core. That should comfortably fit into cache.

performance of copy-make

performance of copy-make

Re: performance of copy-make

Re: performance of copy-make

Re: performance of copy-make

Re: performance of copy-make

Re: performance of copy-make

Re: performance of copy-make

Re: performance of copy-make

Re: performance of copy-make

Re: performance of copy-make