undo move vs. Position Cloning

bob · Post by **bob** » Wed Sep 16, 2009 9:51 pm

mcostalba wrote:
bob wrote:
mcostalba wrote:
BoldReceiver wrote:To my surprise Running perft 6 moves deep revealed the the cloning method to be marginally quicker.
Of course is faster and not only marginally.

The traditional way of doing this is copy in makemove the current state in a backup storage, then update the state then in undo move to copy back from the backup storage to the position state.

This last step is totally uneeded and I still don't understand why nobody uses the opposite way that makes undo move way faster.

When I have implemented this new scheme in Stockfish I was very surprised that nobody uses it.

In Stockfish do_move() and undo_move() work like this:

In do_move() the subset of the position state that is updated incrementally, for instance position key, is copied in the new position state and then the new state is updated as usual. The fundamental point is that the original state is not lost but is still kept in memory and there is a pointer that points to it.

In undo_move() we simply redirect the pointer to the old one.

While in do_move() we don't have any gain (but also no slowdown) in undo_move() we have a big speedup because we can remove the copy operations almost entirely and just change a pointer.

I am still wondering why nobody uses this scheme...
Performance.
In SF this is faster. Measured.

Faster in real games, or faster in perft? There is a difference. And then there is the parallel search issue. You still have one memory system. 4 cores produce 4x the memory bandwidth. This extra copying can then become more significant on overall performance and cause the NPS to scale poorly with number of processors. It is _very_ complex and a simple perft is not very revealing.

bob · Post by **bob** » Wed Sep 16, 2009 9:55 pm

Don wrote:I thought everyone did it like this. In my chess program I copy state and undo is not another copy of course, the state already exists.

I started doing this when I wrote my first parallel chess program and noticed then that it was actually faster than tediously unmaking moves.

Since state copy is faster, it's ludicrous to do it any other way since this is a big code simplification too. Unless memory is a serious constraint.

With modern processors, the logic of unmaking a move can be a significant bottleneck and still requires a bunch of memory references. With state copy, undo is free and I'm not even convinced that make is more expensive. Copying a contiguous small area of memory is pretty fast and there are no logic bottlenecks to this.

However, in modern processors memory is a _huge_ bottleneck. Hence the development of L1/L2/L3 cache to try to help. The more cores you throw into the mix, the worse this gets because you still have just one memory subsystem that is now being beat on by N cores simultaneously. The memory can only sustain a pretty limited bandwidth since these addresses (across multiple threads) are not sequential.

Make/Unmake should be updating things that remain in cache since the board state is used by everything. That's the key advantage of make/unmake over copy/make, although I am going to test this on a relatively new core-2 with dual quad-core processors to see if there is a significant Elo difference between the two (speed difference is really irrelevant when one can measure Elo just as accurately).

mcostalba wrote:
BoldReceiver wrote:To my surprise Running perft 6 moves deep revealed the the cloning method to be marginally quicker.
Of course is faster and not only marginally.

The traditional way of doing this is copy in makemove the current state in a backup storage, then update the state then in undo move to copy back from the backup storage to the position state.

This last step is totally uneeded and I still don't understand why nobody uses the opposite way that makes undo move way faster.

When I have implemented this new scheme in Stockfish I was very surprised that nobody uses it.

In Stockfish do_move() and undo_move() work like this:

In do_move() the subset of the position state that is updated incrementally, for instance position key, is copied in the new position state and then the new state is updated as usual. The fundamental point is that the original state is not lost but is still kept in memory and there is a pointer that points to it.

In undo_move() we simply redirect the pointer to the old one.

While in do_move() we don't have any gain (but also no slowdown) in undo_move() we have a big speedup because we can remove the copy operations almost entirely and just change a pointer.

I am still wondering why nobody uses this scheme...

Don · Post by **Don** » Wed Sep 16, 2009 10:30 pm

bob wrote:
Don wrote:
Gian-Carlo Pascutto wrote:This thread got a bit confused because the both of you are talking about different things.

The fastest is to undo the changes to large data structures, and clone the small ones. You need to think about the trade off in effort between undoing a change and copying it back.

Undoing moves should be faster than copying 4.5K of memory. If it's not, something is wrong.

What Marco describes is a small optimization for the case where you copy a datastructure (to have only 1 copy instead of 2).
I misunderstood a lot. I missed that he only copies a portion of the state.

One other consideration is that if you profile your code, you may discover that making a move has trivial overhead. Even if full state copy is expensive compared to a complicated undo function, it may not be worth the extra complexity, especially for muti-processor programs.
You have to be _very_ careful when looking at profile output however. Yes, Make might be small. But it is kicking stuff out of cache that might make later things slower. And it is generating memory traffic that might be interfering with other threads in a parallel search. It is not so easy to attribute costs to the guilty party when the issue is not instructions being executed, but the cost of getting the needed data back into cache.

I guess I don't understand this. I have a position state that is 192 bytes. In a typical search I am guessing that 20 or 30 of these are alive at any given time (15 ply search, for instance, with extensions and quies.) At each level in the tree I share one state with all the sibling moves. So I have about 6k of state being operated on.

Is it your belief that this 6k of state is probably trashing my cache?

When I'm near leaf nodes, these nodes are allocated pretty furiously and come and go very quickly to be sure, but I don't use malloc and free of course, they are allocated in the normal stack frame of the recursive calls to the search routine. I would assume all of that is in cache. In other words at the top of my search routine I declare a new position state which is shared with all the sibling nodes. i.e.:

(in bastardized pseudo code:)

Code: Select all

   position search&#40; position_state  *prev,  ...  )
   &#123;
       position_state  new_position;
       ... other stuff
       foreach move in movelist do &#123;
            make&#40; prev,  &new_position,  move );
       &#125;

   &#125;

Would it be better to pre-allocate say 50 states (or however deep my search goes) and just index them by the ply number of the search? Are you saying that just these few K bytes of state would be trashing my cache?

I find that hard to believe.

Aleks Peshkov · Post by **Aleks Peshkov** » Wed Sep 16, 2009 10:41 pm

There are many variations in between.

For example, I clone parent node information, but all sibling nodes update the same position object incrementally (per element basis: either copy bytes from the parent or undoing from previous move or even leaving some previous move information as is when move generator sequence does guarantee that it is valid).

Tord Romstad · Post by **Tord Romstad** » Wed Sep 16, 2009 10:46 pm

bob wrote:
mcostalba wrote:
bob wrote:
mcostalba wrote:
BoldReceiver wrote:To my surprise Running perft 6 moves deep revealed the the cloning method to be marginally quicker.
Of course is faster and not only marginally.

The traditional way of doing this is copy in makemove the current state in a backup storage, then update the state then in undo move to copy back from the backup storage to the position state.

This last step is totally uneeded and I still don't understand why nobody uses the opposite way that makes undo move way faster.

When I have implemented this new scheme in Stockfish I was very surprised that nobody uses it.

In Stockfish do_move() and undo_move() work like this:

In do_move() the subset of the position state that is updated incrementally, for instance position key, is copied in the new position state and then the new state is updated as usual. The fundamental point is that the original state is not lost but is still kept in memory and there is a pointer that points to it.

In undo_move() we simply redirect the pointer to the old one.

While in do_move() we don't have any gain (but also no slowdown) in undo_move() we have a big speedup because we can remove the copy operations almost entirely and just change a pointer.

I am still wondering why nobody uses this scheme...
Performance.
In SF this is faster. Measured.
Faster in real games, or faster in perft? There is a difference. And then there is the parallel search issue. You still have one memory system. 4 cores produce 4x the memory bandwidth. This extra copying can then become more significant on overall performance and cause the NPS to scale poorly with number of processors. It is _very_ complex and a simple perft is not very revealing.

I think you misunderstand what Marco is talking about: There is no extra copying. On the contrary, the ingenious trick Marco introduced in Stockfish is designed to reduce the amount of copying. At the same time, it reduces the number of instructions needed to unmake moves. It is so obviously faster that it isn't even necessary to run any benchmarks to be sure.

bob · Post by **bob** » Wed Sep 16, 2009 11:17 pm

Don wrote:
bob wrote:
Don wrote:
Gian-Carlo Pascutto wrote:This thread got a bit confused because the both of you are talking about different things.

The fastest is to undo the changes to large data structures, and clone the small ones. You need to think about the trade off in effort between undoing a change and copying it back.

Undoing moves should be faster than copying 4.5K of memory. If it's not, something is wrong.

What Marco describes is a small optimization for the case where you copy a datastructure (to have only 1 copy instead of 2).
I misunderstood a lot. I missed that he only copies a portion of the state.

One other consideration is that if you profile your code, you may discover that making a move has trivial overhead. Even if full state copy is expensive compared to a complicated undo function, it may not be worth the extra complexity, especially for muti-processor programs.
You have to be _very_ careful when looking at profile output however. Yes, Make might be small. But it is kicking stuff out of cache that might make later things slower. And it is generating memory traffic that might be interfering with other threads in a parallel search. It is not so easy to attribute costs to the guilty party when the issue is not instructions being executed, but the cost of getting the needed data back into cache.
I guess I don't understand this. I have a position state that is 192 bytes. In a typical search I am guessing that 20 or 30 of these are alive at any given time (15 ply search, for instance, with extensions and quies.) At each level in the tree I share one state with all the sibling moves. So I have about 6k of state being operated on.

Is it your belief that this 6k of state is probably trashing my cache?

Hard to say. But, for comparison, typical L1 cache (data) is 16K-32K byters. That is a _big_ chunk of it.

When I'm near leaf nodes, these nodes are allocated pretty furiously and come and go very quickly to be sure, but I don't use malloc and free of course, they are allocated in the normal stack frame of the recursive calls to the search routine. I would assume all of that is in cache. In other words at the top of my search routine I declare a new position state which is shared with all the sibling nodes. i.e.:

(in bastardized pseudo code:)
Code: Select all
   position search&#40; position_state  *prev,  ...  )
   &#123;
       position_state  new_position;
       ... other stuff
       foreach move in movelist do &#123;
            make&#40; prev,  &new_position,  move );
       &#125;

   &#125;
Would it be better to pre-allocate say 50 states (or however deep my search goes) and just index them by the ply number of the search? Are you saying that just these few K bytes of state would be trashing my cache?

I find that hard to believe.

When I did this, I just indexed by ply. And I am fixing to test this copy/make vs make/unmake performance issue again, as it is not that hard to convert back since all of my board accesses are through macros that can be changed easily.

I do know that when I dumped copy/make 15 years ago or so on the original pentium and penrium pro, the speed difference was huge. Whether it still is or not is not known, hence my interest in answering the question precisely with a cluster test.

bob · Post by **bob** » Wed Sep 16, 2009 11:23 pm

Tord Romstad wrote:
bob wrote:
mcostalba wrote:
bob wrote:
mcostalba wrote:
BoldReceiver wrote:To my surprise Running perft 6 moves deep revealed the the cloning method to be marginally quicker.
Of course is faster and not only marginally.

The traditional way of doing this is copy in makemove the current state in a backup storage, then update the state then in undo move to copy back from the backup storage to the position state.

This last step is totally uneeded and I still don't understand why nobody uses the opposite way that makes undo move way faster.

When I have implemented this new scheme in Stockfish I was very surprised that nobody uses it.

In Stockfish do_move() and undo_move() work like this:

In do_move() the subset of the position state that is updated incrementally, for instance position key, is copied in the new position state and then the new state is updated as usual. The fundamental point is that the original state is not lost but is still kept in memory and there is a pointer that points to it.

In undo_move() we simply redirect the pointer to the old one.

While in do_move() we don't have any gain (but also no slowdown) in undo_move() we have a big speedup because we can remove the copy operations almost entirely and just change a pointer.

I am still wondering why nobody uses this scheme...
Performance.
In SF this is faster. Measured.
Faster in real games, or faster in perft? There is a difference. And then there is the parallel search issue. You still have one memory system. 4 cores produce 4x the memory bandwidth. This extra copying can then become more significant on overall performance and cause the NPS to scale poorly with number of processors. It is _very_ complex and a simple perft is not very revealing.
I think you misunderstand what Marco is talking about: There is no extra copying. On the contrary, the ingenious trick Marco introduced in Stockfish is designed to reduce the amount of copying. At the same time, it reduces the number of instructions needed to unmake moves. It is so obviously faster that it isn't even necessary to run any benchmarks to be sure.

I am not convinced. What he explains as the "usual way" is anything but. I used copy/make in Cray Blitz and in early Crafty versions. I did not copy and then make, and then copy back to unmake. I copied to a new state, then updated that, keeping the original state. To unmake, I just used the old state again. I indexed the state by ply, which is incremented in the recursive search calls anyway, so that MakeMove() looked something like this:

position[ply+1] = position[ply];
update position[ply+1}

then I recursively call search with ply+1, so that it uses the new state. When it returns, there is no unmake of any kind done. And that was slower than the current make/unmake approach. It simply added too much memory bandwidth, and then when that is added to the other memory bandwidth (64 bytes per node for crafty for a hash probe, plus magic tables, etc, and then that is multiplied by the number of threads, the PCs of the past had no hope of keeping up. And bandwidth has not kept up with processor speed, which leads me to think it will be worse.

Again, faster to avoid unmaking is fine. But what is the cost of the copy and how does this affect everything else, particularly with multiple threads that can let this copy increase the needed memory bandwidth by 8x on an 8core box?

I suspect it is far more than would be suspected when testing this with just one thread.

bob · Post by **bob** » Wed Sep 16, 2009 11:25 pm

Aleks Peshkov wrote:There are many variations in between.

For example, I clone parent node information, but all sibling nodes update the same position object incrementally (per element basis: either copy bytes from the parent or undoing from previous move or even leaving some previous move information as is when move generator sequence does guarantee that it is valid).

I did not have to "unmake" in the old version of crafty. Make just copied and then updated the new copy which was used by the next recursive search call. when that search returns, I am back to the pre-existing position automatically since it was never changed... But the copy is far from free... more after I test it again on today's current hardware.

BubbaTough · Post by **BubbaTough** » Wed Sep 16, 2009 11:31 pm

I haven't looked at the code, but it sounded like what Marco was doing was just saving what a lot of people put in an undo structure in a move stack which he indexes. The amount of copying he is doing is then just the same as if he was saving his undo info, no more. And his undo does not have to copy back from the undo structure. Thus, compared to a stack of Undo's, his system is clearly better. I am not sure what crafty does, but at least compared to one common method this is a good idea.

-Sam

Tord Romstad · Post by **Tord Romstad** » Wed Sep 16, 2009 11:35 pm

bob wrote:
Tord Romstad wrote: I think you misunderstand what Marco is talking about: There is no extra copying. On the contrary, the ingenious trick Marco introduced in Stockfish is designed to reduce the amount of copying. At the same time, it reduces the number of instructions needed to unmake moves. It is so obviously faster that it isn't even necessary to run any benchmarks to be sure.
I am not convinced. What he explains as the "usual way" is anything but. I used copy/make in Cray Blitz and in early Crafty versions. I did not copy and then make, and then copy back to unmake. I copied to a new state, then updated that, keeping the original state. To unmake, I just used the old state again. I indexed the state by ply, which is incremented in the recursive search calls anyway, so that MakeMove() looked something like this:

position[ply+1] = position[ply];
update position[ply+1}

then I recursively call search with ply+1, so that it uses the new state. When it returns, there is no unmake of any kind done. And that was slower than the current make/unmake approach. It simply added too much memory bandwidth, and then when that is added to the other memory bandwidth (64 bytes per node for crafty for a hash probe, plus magic tables, etc, and then that is multiplied by the number of threads, the PCs of the past had no hope of keeping up. And bandwidth has not kept up with processor speed, which leads me to think it will be worse.

Again, faster to avoid unmaking is fine. But what is the cost of the copy and how does this affect everything else, particularly with multiple threads that can let this copy increase the needed memory bandwidth by 8x on an 8core box?

I agree entirely, but this is precisely the point: Marco's trick reduces the amount of copying even further. Stockfish has never used the copy-make approach, for exactly the reasons you mention. You are probably confused because the post which started the thread was about copy-make, but Marco's reply was about something very different.

undo move vs. Position Cloning

Re: undo move vs. Position Cloning

Re: undo move vs. Position Cloning

Re: undo move vs. Position Cloning

Re: undo move vs. Position Cloning

Re: undo move vs. Position Cloning

Re: undo move vs. Position Cloning

Re: undo move vs. Position Cloning

Re: undo move vs. Position Cloning

Re: undo move vs. Position Cloning

Re: undo move vs. Position Cloning