Don wrote:
I don't "stash away" any undo information, with the only exception being the hash signature, and the 50-move counter.
The Pentium Pro was not a bad machine. I had one with 4-way memory interleaving (a big server box), and getting rid of copy/make was a win on that machine as well. I suspect it is the same today: memory speeds have not changed much, yet processor speeds are way beyond that original 200MHz Pentium Pro. It is much faster to make/unmake data that is already in cache than to wait hundreds (or thousands) of cycles for something to come in from memory today.
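The copy/make alternative the two are arguing about looks roughly like this (again an illustrative sketch, not code from either program): every ply owns a slot in a small array, a child is made by copying the parent's ~100-byte state, and "unmake" is simply returning to the parent's slot.

[code]
#include <stdint.h>
#include <string.h>

#define MAX_PLY 64

typedef struct {
    uint64_t hash_key;
    int      fifty_counter;
    /* ... roughly 100 bytes of position state in total ... */
} Position;

typedef int Move;

void apply_move(Position *pos, Move m);  /* assumed, as above */

Position stack[MAX_PLY + 1];  /* ~6K total at ~100 bytes per ply */

void search_child(int ply, Move m) {
    memcpy(&stack[ply + 1], &stack[ply], sizeof(Position)); /* copy... */
    apply_move(&stack[ply + 1], m);                         /* ...make */
    /* recurse on stack[ply + 1]; returning to this frame is the
       "unmake" -- stack[ply] was never modified */
}
[/code]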
But I don't think that is happening. If it were as bad as you are saying, my program would be crawling, because I have hash tables.
A simple first approximation to the answer: what is your NPS compared to Crafty's on the same hardware? If you want, I can run your code on our 8-core box, and you can download Crafty's source and run it on your box. Crafty is not "tricky". It spends fully 50% of its total execution time in Evaluate() and its derivatives, so it is not a particularly "fast and simple" program at all. But it is a "fast program".
Due to the cache issue, I tried not storing hash entries in the quiescence search, and it had no impact on my nodes per second. This is random access, reads and writes, all over several megabytes of data.
Interesting, because Bruce Moreland, myself, and a couple of others ran this specific test several years ago. I found that hashing in the q-search slowed me down by about 10% in NPS, and sped me up by about 10% due to the reduced tree size. I elected to leave it out of the q-search anyway, since doing so puts less stress on the hash-table replacement policy and lets me get by with smaller tables than if I stored at least 4x as much data thanks to the q-search stores.
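One way to rerun that experiment is to gate the q-search probe/store behind a compile-time switch and compare NPS and tree size both ways. A sketch; the transposition-table interface below is made up for illustration:

[code]
#include <stdint.h>

#define HASH_QSEARCH 1  /* set to 0 to measure the no-qsearch-hashing case */

typedef struct Position Position;
typedef int Move;

/* hypothetical TT and engine interface, assumed defined elsewhere */
int      tt_probe(uint64_t key, int depth, int alpha, int beta, int *score);
void     tt_store(uint64_t key, int depth, int score, int bound);
uint64_t pos_key(const Position *pos);
int      evaluate(const Position *pos);

int quiesce(Position *pos, int alpha, int beta) {
#if HASH_QSEARCH
    int tt_score;
    if (tt_probe(pos_key(pos), 0, alpha, beta, &tt_score))
        return tt_score;           /* hit: skip stand-pat and captures */
#endif
    int best = evaluate(pos);      /* stand-pat score */
    /* ... generate captures, search them, raise best/alpha ... */
#if HASH_QSEARCH
    tt_store(pos_key(pos), 0, best, 0 /* bound type */);
#endif
    return best;
}
[/code]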
And yet you think a small stack of 30 or so commonly accessed position states is just trashing the whole cache of the machine. I really don't think you are thinking very clearly about this, because you seem to be picking and choosing your assumptions. You assume that the extra logic has no appreciable cost, but that copying 100 bytes is something modern processors just cannot handle.
I have no assumptions. Only concrete measurements. And not just from my program, but from a couple of others (including Ferret from Bruce).
Re-using the same state at every ply in the search is very cache-friendly. Using a different state at every ply is much less friendly and absolutely does increase cache fill rates. How much on today's machines I do not yet know, but I am going to find out.
You made the argument that the problem is so bad that they have to have level 1, level 2, level 3 cache, and so on. So I'll make the argument that this caching technology actually works, and that you can expect 6K of position state to be in the cache.
In _some_ cache. Which do you prefer, however: to access it in 4 clock cycles, 20 clock cycles, 100 clock cycles, or even (worst case) a couple of thousand clock cycles? L1 is not big.
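Those latency tiers are easy to see on any box with a classic pointer-chase loop: a randomly shuffled cycle defeats the prefetcher, so the average time per dependent load approximates the latency of whichever cache level the working set fits in. A self-contained sketch:

[code]
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    /* working sets from 4KB (fits in L1) up to 64MB (main memory) */
    for (size_t bytes = 1u << 12; bytes <= 1u << 26; bytes <<= 2) {
        size_t n = bytes / sizeof(size_t);
        size_t *next = malloc(n * sizeof(size_t));
        if (!next) return 1;
        for (size_t i = 0; i < n; i++) next[i] = i;
        /* Sattolo's algorithm: shuffle into one big random cycle
           (rand() is crude, but fine for a sketch) */
        for (size_t i = n - 1; i > 0; i--) {
            size_t j = (size_t)rand() % i;
            size_t t = next[i]; next[i] = next[j]; next[j] = t;
        }
        size_t q = 0, hops = 1u << 24;   /* ~16M dependent loads */
        clock_t start = clock();
        for (size_t i = 0; i < hops; i++) q = next[q];  /* chase */
        double secs = (double)(clock() - start) / CLOCKS_PER_SEC;
        printf("%8zu KB: %6.2f ns/load (sink %zu)\n",
               bytes / 1024, secs * 1e9 / (double)hops, q);
        free(next);
    }
    return 0;
}
[/code]

Printing q as a "sink" keeps the compiler from optimizing the chase loop away, since every load depends on the previous one.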