undo move vs. Position Cloning

Tord Romstad · Post by **Tord Romstad** » Wed Sep 16, 2009 11:37 pm

BubbaTough wrote:I haven't looked at the code, but it sounded like what Marco was doing was just saving what a lot of people put in an undo structure in a move stack which he indexes. The amount of copying he is doing is then just the same as if he was saving his undo info, no more. And his undo does not have to copy back from the undo structure. Thus, compared to a stack of Undo's, his system is clearly better.

Yes. This explanation is 100% correct.

Don · Post by **Don** » Wed Sep 16, 2009 11:41 pm

bob wrote: When I did this, I just indexed by ply. And I am fixing to test this copy/make vs make/unmake performance issue again, as it is not that hard to convert back since all of my board accesses are through macros that can be changed easily.

I do know that when I dumped copy/make 15 years ago or so on the original pentium and penrium pro, the speed difference was huge. Whether it still is or not is not known, hence my interest in answering the question precisely with a cluster test.

I think every 5 years a program has to be rewritten or at least re-visited for this very reason.

When I developed cilkchess, it was very fast on a 64 bit alpha machine - and it was a 64 bit program developed FOR the alpha machine.

When that same program was compiled and run on a Pentium, it was ridiculously slow. Of course the alpha was superior in every way and we were stepping down to 32 bits (with a 64 bit program), but that did not even begin to explain the difference which was something like an order of magnitude. When comparing the slowdown of a program like Crafty going from the alpha to the same pentium, it was not so bad. Perhaps 3 or 4 to 1.

My conclusion was cache, Crafty was probably very cache friendly compared to cilkchess. At MIT many students touched the code, and some of them went to ridiculous lengths to optimize it for the alpha - manually unrolling loops that the compiler wasn't catching, etc.

In fact some of the routines had multiple copies just so that more constants could be used in the program - which the alpha seemed to like. For instance the evaluation function had 2 versions, one for white and one for black. This was hidden cleverly with macros - but in each version everything that could be was a constant.

Our tentative conclusion was that we were blowing out the pentiums instruction cache.

At one tournament I was embarassed to play skittles games with cilchess running on my laptop. Thorsten and I played a series of games with Cilkchess vs Chess system Tal I think it was and Cilkchess did very poorly, running on a single processor at only a fraction of the speed it would have run on an alpha. On top of that Cilkchess was running on a slower computer ANYWAY than Tal, so it was an embarrassing experience for me. It was also beneath Cilkchess dignity to ask for a time advantage handicap.

I didn't try too hard to explain the losses against the might Ciklchess because it would have sounded like a sour grapes excuse! It was fun anyway watching Tal win those games with flashy and speculative sacafices which were mostly unsound - but they sure worked well against a cripple up Cilkchess!

Harald · Post by **Harald** » Thu Sep 17, 2009 12:40 am

As I understand this dicussion there are basically 2 methods:

("copy-make") We have an array or stack of positions.
make_move(): copy the position infos from ply to ply+1 and change the copy. ply++
undo_move(): ply--

("make-undo") We have only one position.
make_move(): store some parts in a stack. change the position. ply++
undo_move(): ply--; change the position. retrieve some parts.

In my engine Elephant I use copy-make but that will probably change in the next version.

There is a question related to make-undo:
Does it make sense to do some extra work in make_move() and store
all changes in an undo stack in a simplified form
list< pair<address, old_value> > ?
The undo_move() function does not have to look at the move and
find out what to change. It simply does something like this:
while (list not empty) { *address = old_value; }
(You dont have to use std::list. It is just pseudo code.)

Harald

Evert · Post by **Evert** » Thu Sep 17, 2009 12:41 am

bob wrote: You have to be _very_ careful when looking at profile output however. Yes, Make might be small. But it is kicking stuff out of cache that might make later things slower. And it is generating memory traffic that might be interfering with other threads in a parallel search. It is not so easy to attribute costs to the guilty party when the issue is not instructions being executed, but the cost of getting the needed data back into cache.

Wouldn't well-placed prefetch instructions that fetch larger data structures shortly before they're needed greatly reduce the cost of cache misses?

Don · Post by **Don** » Thu Sep 17, 2009 12:59 am

gingell wrote:Here's what I do in Chesley:
Code: Select all
child_position = parent_position;  // Copy the parent
child_position.apply_move &#40;move&#41;;  // Apply a move to the child.
search &#40;child_position&#41;            // Search the child
And then child_position is thrown away.

In Chesley a position is 168 bytes. The implementation is C++ and I've verified the compiler generates a memcpy as you'd expect. Experimenting with dummy fields to pad that out to 256 or 512 bytes makes very little difference in number of nodes per second I see.

The fact that virtually every other program I've looked at does move/undo worries me a little, since it's rather more likely I'm missing something everyone else sees than the other way round! But it's still hard for me to see a solution cheaper than doing that copy.

What you are seeing is just an example of heritage - make/unmake is how it's always been done. Your father did it that way, his father did it that way ...

It had to be done that way back then and thus every pseduo-code example of how to do some chess algorithm has undo() in it somewhere.

I also believe that modern processors are optimized for memory copies - it is a ridiculously common operation.

Your position state is just about the same size as mine and I believe that with states as small as ours, it's just not an issue.

It takes about the same time to UNMAKE a move as to make a move. But for us that time is cut in half, because we never have to undo() a move - we get that for free.

In theory we pay for that with a bit extra make time - but even that is not so obvious to me. Our make routines have less logic as we do not have the extra bookeeping and logic of stashing away undo information in yet another array.

I don't really know which is faster for programs like ours, but it would not suprise me if what we are doing is the faster way - it's certainly simple and clean compared to the alternative and simple and clean almost always win out (except when it doesn't.)

bob · Post by **bob** » Thu Sep 17, 2009 1:18 am

Evert wrote:
bob wrote: You have to be _very_ careful when looking at profile output however. Yes, Make might be small. But it is kicking stuff out of cache that might make later things slower. And it is generating memory traffic that might be interfering with other threads in a parallel search. It is not so easy to attribute costs to the guilty party when the issue is not instructions being executed, but the cost of getting the needed data back into cache.
Wouldn't well-placed prefetch instructions that fetch larger data structures shortly before they're needed greatly reduce the cost of cache misses?

Compilers are not doing this, as it is quite risky to pre-fetch and then not use the data because of an early exit or whatever...

Also, we are not just talking about cache misses due to the data not being present, there is a big cost to fetch the data, and fetching the data pushes something else out of cache, resulting in additional misses later. All of which add up to a lot of bandwidth.

bob · Post by **bob** » Thu Sep 17, 2009 1:28 am

Tord Romstad wrote:
bob wrote:
Tord Romstad wrote: I think you misunderstand what Marco is talking about: There is no extra copying. On the contrary, the ingenious trick Marco introduced in Stockfish is designed to reduce the amount of copying. At the same time, it reduces the number of instructions needed to unmake moves. It is so obviously faster that it isn't even necessary to run any benchmarks to be sure.
I am not convinced. What he explains as the "usual way" is anything but. I used copy/make in Cray Blitz and in early Crafty versions. I did not copy and then make, and then copy back to unmake. I copied to a new state, then updated that, keeping the original state. To unmake, I just used the old state again. I indexed the state by ply, which is incremented in the recursive search calls anyway, so that MakeMove() looked something like this:

position[ply+1] = position[ply];
update position[ply+1}

then I recursively call search with ply+1, so that it uses the new state. When it returns, there is no unmake of any kind done. And that was slower than the current make/unmake approach. It simply added too much memory bandwidth, and then when that is added to the other memory bandwidth (64 bytes per node for crafty for a hash probe, plus magic tables, etc, and then that is multiplied by the number of threads, the PCs of the past had no hope of keeping up. And bandwidth has not kept up with processor speed, which leads me to think it will be worse.

Again, faster to avoid unmaking is fine. But what is the cost of the copy and how does this affect everything else, particularly with multiple threads that can let this copy increase the needed memory bandwidth by 8x on an 8core box?
I agree entirely, but this is precisely the point: Marco's trick reduces the amount of copying even further. Stockfish has never used the copy-make approach, for exactly the reasons you mention. You are probably confused because the post which started the thread was about copy-make, but Marco's reply was about something very different.

I don't follow. If you are talking about the second post in this thread (from the beginning) then that is _exactly_ what I did in Cray Blitz and early Crafty. I copied the parts of the state that were incrementally updated (bitboards, hash signature, 50move counter, etc) and then updated the new copy, keeping the original in place. I then recursively called search which would use this updated copy. When it returned, no unmake was needed because the original values were still there for that ply. That is exactly what he is describing if I am reading it correctly. And it isn't new at all. You can look at the online source for Cray Blitz or if you can find a very old version of Crafty to see exactly what we were doing, and it sounds exactly like what he was describing. Back then, on the Cray, memory bandwidth was essentially unlimited with 4096 independent banks of memory, each of which could read and write simultaneously, and thanks to the vector load/store hardware, we could drive that memory quite effectively to produce bandwidth only dreamed about in today's PCs. T932 could do 64 64-bit reads, and 32 64 bit writes in one CPU cycle, using 32 cpus. At a clock speed of 2ns, that is 64*8 + 32*8 = 768 bytes, 500 million times a second. or a paultry 384 gigabytes per second, max. We could copy/make without thinking about the bandwidth we were wasting. The PC is a "little" more limited than that, and when I started Crafty I soon discovered that the "copy" operation was too expensive and letting a small set of data sit in cache for make/unmake was significantly faster. copy/make makes the code cleaner of course...

If I am looking at the wrong post, or am somehow misinterpreting what he wrote, let me know.

bob · Post by **bob** » Thu Sep 17, 2009 1:31 am

BubbaTough wrote:I haven't looked at the code, but it sounded like what Marco was doing was just saving what a lot of people put in an undo structure in a move stack which he indexes. The amount of copying he is doing is then just the same as if he was saving his undo info, no more. And his undo does not have to copy back from the undo structure. Thus, compared to a stack of Undo's, his system is clearly better. I am not sure what crafty does, but at least compared to one common method this is a good idea.

-Sam

Copy/Make as I have implemented it in the past never copied "back" either. I simply used a structure called "pos" that has the bitboards, hash signature, pawn hash signature, etc in it. I then define pos position[maxply].

To do a make, I simply do position[ply+1]=position[ply] and then update position[ply+1] and use that at the next level in the search. When I return to this level (ply) the current position is the same as it was before we did the copy since I didn't change it, so no unmake or copy back is needed. The idea of copying the current state to a backup, and then copying it back to unmake is not a solution I would have ever thought of, much less implemented. It is so obviously wrong as to not even be considered.

bob · Post by **bob** » Thu Sep 17, 2009 1:32 am

Tord Romstad wrote:
BubbaTough wrote:I haven't looked at the code, but it sounded like what Marco was doing was just saving what a lot of people put in an undo structure in a move stack which he indexes. The amount of copying he is doing is then just the same as if he was saving his undo info, no more. And his undo does not have to copy back from the undo structure. Thus, compared to a stack of Undo's, his system is clearly better.
Yes. This explanation is 100% correct.

And that is _exactly_ what was done in early Crafty versions. That is the very definition of "copy/make". Copy current state to next ply, then update that. When you return to this ply, you are back to the original state with no copy or unmake needed.

bob · Post by **bob** » Thu Sep 17, 2009 1:40 am

Don wrote:
gingell wrote:Here's what I do in Chesley:
Code: Select all
child_position = parent_position;  // Copy the parent
child_position.apply_move &#40;move&#41;;  // Apply a move to the child.
search &#40;child_position&#41;            // Search the child
And then child_position is thrown away.

In Chesley a position is 168 bytes. The implementation is C++ and I've verified the compiler generates a memcpy as you'd expect. Experimenting with dummy fields to pad that out to 256 or 512 bytes makes very little difference in number of nodes per second I see.

The fact that virtually every other program I've looked at does move/undo worries me a little, since it's rather more likely I'm missing something everyone else sees than the other way round! But it's still hard for me to see a solution cheaper than doing that copy.
What you are seeing is just an example of heritage - make/unmake is how it's always been done. Your father did it that way, his father did it that way ...

Actually, this is exactly the _opposite_ of what has happened. Many of us did copy/make in the 70's and beyond. But when we moved to the PC with the first pentiums, we discovered that memory bandwidth was limited, and the copy operation was too expensive to tolerate, and it only got worse if you added a second or third or fourth cpu to the mix.

Code: Select all

 *    9.16   Significant changes to the way MakeMove&#40;) operates.  Rather than  *
 *           copying the large position&#91;ply&#93; structure around, the piece       *
 *           location bitmaps and the array of which piece is on which square  *
 *           are now simple global variables.  This means that there is now an *
 *           UnmakeMove&#40;) function that must be used to restore these after a  *
 *           a move has been made.  The benefit is that we avoid copying 10    *
 *           bitmaps for piece locations when typically only one is changed,   *
 *           Ditto for the which piece is on which square array which is 64    *
 *           bytes long.  The result is much less memory traffic, with the     *
 *           probability that the bitmaps now stay in cache all the time.

That is right out of crafty's main.c and shows about when I stopped the copy/make approach and explains why. Copy/Make is not new. It is quite old, Make/Unmake is newer because it better matches up with the poor PC memory bandwidth limits.

It had to be done that way back then and thus every pseduo-code example of how to do some chess algorithm has undo() in it somewhere.

I also believe that modern processors are optimized for memory copies - it is a ridiculously common operation.

The old Cray T932 could copy almost 400 gigabytes per second. Check out your friendly PC's memory bandwidth. A processor with a single memory unit (as opposed to supercomputer-level interleaved memory banks) is hardly optimized for memory copying operations.

Your position state is just about the same size as mine and I believe that with states as small as ours, it's just not an issue.

It takes about the same time to UNMAKE a move as to make a move. But for us that time is cut in half, because we never have to undo() a move - we get that for free.

You might _think_ it is for free. But the copy is far from free. Add 7 more threads on an 8-core box and it is even less "free".

In theory we pay for that with a bit extra make time - but even that is not so obvious to me. Our make routines have less logic as we do not have the extra bookeeping and logic of stashing away undo information in yet another array.

I don't really know which is faster for programs like ours, but it would not suprise me if what we are doing is the faster way - it's certainly simple and clean compared to the alternative and simple and clean almost always win out (except when it doesn't.)

undo move vs. Position Cloning

Re: undo move vs. Position Cloning

Re: undo move vs. Position Cloning

Re: undo move vs. Position Cloning

Re: undo move vs. Position Cloning

Re: undo move vs. Position Cloning

Re: undo move vs. Position Cloning

Re: undo move vs. Position Cloning

Re: undo move vs. Position Cloning

Re: undo move vs. Position Cloning

Re: undo move vs. Position Cloning