undo move vs. Position Cloning

bob · Post by **bob** » Wed Sep 16, 2009 7:48 pm

BoldReceiver wrote:I started a hobby engine last month, my first programming project since college 12 years ago. Before I had written an "undo Move" function, I was lazy and simply cloned a position before making any alterations to it. So search looked like this:

for ( all moves) {
clone position to newPosition
score = -alphaBeta( -beta, -alpha, depthleft - 1 );
…
discard newPosition}

instead of

for ( all moves) {
make move;
score = -alphaBeta( -beta, -alpha, depthleft - 1 );
…
undo move}

Last night I added an array of 512 64bit Ints to track the game history (up to 256 full moves) to detect draws per the 3 move rule. Thinking that cloning and discarding millions of Position Objects during search was an increasingly bad idea now that the position Object was over 4k in size I wrote an “undo move” function, which seems to be the norm. To my surprise Running perft 6 moves deep revealed the the cloning method to be marginally quicker.
I am using Java and have no idea how memory is malloced or how the garbage collection works, but simply given that the piece array and position history stack (really the Zorbist key history) were being copied, it seems way off to have this be faster than undo move. Has anyone else tried anything like this? Anyone venture a guess as to what's going on? I'm a hardware guy and have never written code longer than a few hundred lines so I could be way off on this. If so, enlighten me. I have included stats and some code snippets below.

Perft from initial position with undo move function (time in ms):
Search time: 0
Nodes 20, Captues 0, EnPassant 0, Castles 0, Promotions 0, Checks 0, Mates 0
Search time: 14
Nodes 400, Captues 0, EnPassant 0, Castles 0, Promotions 0, Checks 0, Mates 0
Search time: 113
Nodes 8902, Captues 34, EnPassant 0, Castles 0, Promotions 0, Checks 12, Mates 0
Search time: 2099
Nodes 197281, Captues 1576, EnPassant 0, Castles 0, Promotions 0, Checks 469, Mates 0
Search time: 47758
Nodes 4865609, Captues 82719, EnPassant 258, Castles 0, Promotions 0, Checks 27351, Mates 0
Search time: 1193596
Nodes 119060324, Captues 2812008, EnPassant 5248, Castles 0, Promotions 0, Checks 809099, Mates 0

Perft with Cloning function.
Search time: 1
Nodes 20, Captues 0, EnPassant 0, Castles 0, Promotions 0, Checks 0, Mates 0
Search time: 14
Nodes 400, Captues 0, EnPassant 0, Castles 0, Promotions 0, Checks 0, Mates 0
Search time: 108
Nodes 8902, Captues 34, EnPassant 0, Castles 0, Promotions 0, Checks 12, Mates 0
Search time: 1920
Nodes 197281, Captues 1576, EnPassant 0, Castles 0, Promotions 0, Checks 469, Mates 0
Search time: 45269
Nodes 4865609, Captues 82719, EnPassant 258, Castles 0, Promotions 0, Checks 27351, Mates 0
Search time: 1134039
Nodes 119060324, Captues 2812008, EnPassant 5248, Castles 0, Promotions 0, Checks 809099, Mates 0

///Position Object
Code: Select all
class Position implements Cloneable&#123;
	int board&#91;&#93;=new int &#91;128&#93;; //0x88 Board representation
	long moveHistoryStack&#91;&#93; = new long &#91;512&#93;; //Position Zorbist key History
	int moveHistoryPointer; 
	int whiteKingSq;
	int blackKingSq;
	int moveNumber;
	int hundredHalfMoveCounter; // This is for the 50 move rule
	int enPassant; //  If en passant capture is legal it's for this column. A column of 8 means no En Passant is possible
	long hashKey=0;
	int materialSum;
	boolean castleKingSideBlack; //castle king side still legal
	boolean castleQueenSideBlack; // dito queen side
	boolean castleKingSideWhite; //castle king side still legal
	boolean castleQueenSideWhite; // dito queen side
	boolean whitesMove; //White to move &#40;else black&#41;
	boolean whiteCheck; //In check flag
	boolean blackCheck; //In check flag
	boolean draw;       //Draw flag
	
    void undoMove&#40;UnMove unMove&#41;&#123;
		this.enPassant = unMove.enPassant;
		this.materialSum = unMove.materialSum;
		this.hundredHalfMoveCounter = unMove.hundredHalfMoveCounter;
		this.castleKingSideBlack = unMove.castleKingSideBlack;
		this.castleQueenSideBlack = unMove.castleQueenSideBlack;
		this.castleKingSideWhite = unMove.castleKingSideWhite;
		this.castleQueenSideWhite = unMove.castleQueenSideWhite;
		this.blackCheck = unMove.blackCheck;
		this.whiteCheck = unMove.whiteCheck;
		if &#40;unMove.wasCastles&#41;&#123;
			switch&#40;unMove.to&#41;&#123; 
			case &#40;2&#41;&#58;&#123; //White castles Q
				this.board&#91;0&#93;=W_ROOK;
				this.board&#91;4&#93;=W_KING;
				this.board&#91;2&#93;=EMPTY;
				this.board&#91;3&#93;=EMPTY;
				this.whiteKingSq=4;
			&#125;
			case &#40;6&#41;&#58;&#123; //White castles K
				this.board&#91;7&#93;=W_ROOK;
				this.board&#91;4&#93;=W_KING;
				this.board&#91;5&#93;=EMPTY;
				this.board&#91;6&#93;=EMPTY;
				this.whiteKingSq=4;
			&#125;
			case &#40;114&#41;&#58;&#123; //Black castles Q
				this.board&#91;112&#93;=B_ROOK;
				this.board&#91;116&#93;=B_KING;
				this.board&#91;114&#93;=EMPTY;
				this.board&#91;115&#93;=EMPTY;
				this.blackKingSq=116;
			&#125;
			case &#40;118&#41;&#58;&#123; //Black castles Q
				this.board&#91;119&#93;=B_ROOK;
				this.board&#91;116&#93;=B_KING;
				this.board&#91;117&#93;=EMPTY;
				this.board&#91;118&#93;=EMPTY;
				this.blackKingSq=116;
			&#125;
			&#125;
			undoMoveCleanUp&#40;);
			return;
		&#125;
		if &#40;unMove.wasEnPassant&#41;&#123;
			this.board&#91;unMove.to&#93;=EMPTY;
			if &#40;this.whitesMove&#41;&#123;//side to move has not been undone, so this is blacks move
				this.board&#91;unMove.from&#93; = B_PAWN;
				this.board&#91;unMove.to+16&#93;=W_PAWN;
			&#125;
			else&#123;
				this.board&#91;unMove.from&#93; = W_PAWN;
				this.board&#91;unMove.to-16&#93;=B_PAWN;
			&#125;
			undoMoveCleanUp&#40;);
			return;
		&#125;
		this.board&#91;unMove.from&#93;=this.board&#91;unMove.to&#93;;
		this.board&#91;unMove.to&#93;=unMove.piece;
		if &#40;this.board&#91;unMove.from&#93;==W_KING&#41; this.whiteKingSq=unMove.from;
		if &#40;this.board&#91;unMove.from&#93;==B_KING&#41; this.blackKingSq=unMove.from;
		if &#40;unMove.wasPromotion&#41; &#123;
			if &#40;this.whitesMove&#41;//side to move has not been updated, so this is blacks move
				this.board&#91;unMove.from&#93;=B_PAWN;
			if &#40;this.whitesMove&#41; this.board&#91;unMove.from&#93;=W_PAWN;
		&#125;
    	undoMoveCleanUp&#40;);
    	return;
    	
    &#125;

    void undoMoveCleanUp&#40;)&#123;
    	this.whitesMove=!this.whitesMove;
    	this.moveHistoryPointer--;
    	this.hashKey=this.moveHistoryStack&#91;moveHistoryPointer&#93;;
    	this.moveNumber--;
    &#125;

Old news. But your test is broken, as perft does not do the same things a real search does. copy/make (the more common name for what you are calling "cloning") strains the PC's memory bandwidth. Most have found that it becomes a bottleneck to performance when you start doing lots of other stuff such as evaluation, hashing, move ordering, etc, none of which matters for perft results.

bob · Post by **bob** » Wed Sep 16, 2009 7:49 pm

mcostalba wrote:
BoldReceiver wrote:To my surprise Running perft 6 moves deep revealed the the cloning method to be marginally quicker.
Of course is faster and not only marginally.

The traditional way of doing this is copy in makemove the current state in a backup storage, then update the state then in undo move to copy back from the backup storage to the position state.

This last step is totally uneeded and I still don't understand why nobody uses the opposite way that makes undo move way faster.

When I have implemented this new scheme in Stockfish I was very surprised that nobody uses it.

In Stockfish do_move() and undo_move() work like this:

In do_move() the subset of the position state that is updated incrementally, for instance position key, is copied in the new position state and then the new state is updated as usual. The fundamental point is that the original state is not lost but is still kept in memory and there is a pointer that points to it.

In undo_move() we simply redirect the pointer to the old one.

While in do_move() we don't have any gain (but also no slowdown) in undo_move() we have a big speedup because we can remove the copy operations almost entirely and just change a pointer.

I am still wondering why nobody uses this scheme...

Performance.

mcostalba · Post by **mcostalba** » Wed Sep 16, 2009 8:28 pm

bob wrote:
mcostalba wrote:
BoldReceiver wrote:To my surprise Running perft 6 moves deep revealed the the cloning method to be marginally quicker.
Of course is faster and not only marginally.

The traditional way of doing this is copy in makemove the current state in a backup storage, then update the state then in undo move to copy back from the backup storage to the position state.

This last step is totally uneeded and I still don't understand why nobody uses the opposite way that makes undo move way faster.

When I have implemented this new scheme in Stockfish I was very surprised that nobody uses it.

In Stockfish do_move() and undo_move() work like this:

In do_move() the subset of the position state that is updated incrementally, for instance position key, is copied in the new position state and then the new state is updated as usual. The fundamental point is that the original state is not lost but is still kept in memory and there is a pointer that points to it.

In undo_move() we simply redirect the pointer to the old one.

While in do_move() we don't have any gain (but also no slowdown) in undo_move() we have a big speedup because we can remove the copy operations almost entirely and just change a pointer.

I am still wondering why nobody uses this scheme...
Performance.

In SF this is faster. Measured.

Gerd Isenberg · Post by **Gerd Isenberg** » Wed Sep 16, 2009 8:35 pm

bob wrote: Old news. But your test is broken, as perft does not do the same things a real search does. copy/make (the more common name for what you are calling "cloning") strains the PC's memory bandwidth. Most have found that it becomes a bottleneck to performance when you start doing lots of other stuff such as evaluation, hashing, move ordering, etc, none of which matters for perft results.

Agreed with your perft conclusion. Anyway, I also switched to an array of positions and copy/make. The pure position is 64-byte (quad-bb, two zobrist-keys, piece nibble counters, ep, castle rights, move50, etc.). The staged "plausible" movegen requires 16 direction wise legal move-target bitboards and a few more attack bitboards for a kind of "set-wise" SEE - no more movelists. In total 64 + 256 = 320 bytes per ply, including search variables and some padding.

How many bytes does Crafty need per ply? In total on the internal stack, explicit stacks or ply-arrays, and let say 16 moves in average per ply inside the movelist?

Don · Post by **Don** » Wed Sep 16, 2009 9:11 pm

I thought everyone did it like this. In my chess program I copy state and undo is not another copy of course, the state already exists.

I started doing this when I wrote my first parallel chess program and noticed then that it was actually faster than tediously unmaking moves.

Since state copy is faster, it's ludicrous to do it any other way since this is a big code simplification too. Unless memory is a serious constraint.

With modern processors, the logic of unmaking a move can be a significant bottleneck and still requires a bunch of memory references. With state copy, undo is free and I'm not even convinced that make is more expensive. Copying a contiguous small area of memory is pretty fast and there are no logic bottlenecks to this.

mcostalba wrote:
BoldReceiver wrote:To my surprise Running perft 6 moves deep revealed the the cloning method to be marginally quicker.
Of course is faster and not only marginally.

The traditional way of doing this is copy in makemove the current state in a backup storage, then update the state then in undo move to copy back from the backup storage to the position state.

This last step is totally uneeded and I still don't understand why nobody uses the opposite way that makes undo move way faster.

When I have implemented this new scheme in Stockfish I was very surprised that nobody uses it.

In Stockfish do_move() and undo_move() work like this:

In do_move() the subset of the position state that is updated incrementally, for instance position key, is copied in the new position state and then the new state is updated as usual. The fundamental point is that the original state is not lost but is still kept in memory and there is a pointer that points to it.

In undo_move() we simply redirect the pointer to the old one.

While in do_move() we don't have any gain (but also no slowdown) in undo_move() we have a big speedup because we can remove the copy operations almost entirely and just change a pointer.

I am still wondering why nobody uses this scheme...

Evert · Post by **Evert** » Wed Sep 16, 2009 9:16 pm

Gian-Carlo Pascutto wrote:Not really. If your position has enough state then the copying gets too expensive quickly. I don't think anyone starts out thinking he wants to unmove instead of just copying

It never occurred to me to do this until a month or two ago.
Simply copying the whole state over sounds awfully redundant and counter-intuitive to me.

bob · Post by **bob** » Wed Sep 16, 2009 9:29 pm

Gerd Isenberg wrote:
bob wrote: Old news. But your test is broken, as perft does not do the same things a real search does. copy/make (the more common name for what you are calling "cloning") strains the PC's memory bandwidth. Most have found that it becomes a bottleneck to performance when you start doing lots of other stuff such as evaluation, hashing, move ordering, etc, none of which matters for perft results.
Agreed with your perft conclusion. Anyway, I also switched to an array of positions and copy/make. The pure position is 64-byte (quad-bb, two zobrist-keys, piece nibble counters, ep, castle rights, move50, etc.). The staged "plausible" movegen requires 16 direction wise legal move-target bitboards and a few more attack bitboards for a kind of "set-wise" SEE - no more movelists. In total 64 + 256 = 320 bytes per ply, including search variables and some padding.

How many bytes does Crafty need per ply? In total on the internal stack, explicit stacks or ply-arrays, and let say 16 moves in average per ply inside the movelist?

I believe that at the time I used Copy/Make, I had to copy the usual 4 bitboards (rotated) plus the diagonal-movers and rank/file-movers, etc. It was not a lot. But If you get your NPS up into the 2.5-3M nodes per second on a single CPU, suddenly it becomes a _huge_ amount of memory traffic. 3M nodes per second, times 320 bytes (your numbers there since I do not do this any longer) is almost a gigabyte per second of memory bandwidth purely for the copy operations, ignoring all the other stuff we have to do. Eliminating that lets you keep the bitboards in cache, and update them twice (make/unmake) which is way cheaper in terms of cycles.

I suppose I could re-create the copy/make version since it is not very hard to do, and I still have a small copy/make piece anyway where I pass 4 bytes containing castling status, 50 move counter, etc. I'll give it a whirl later this week and report back. Maybe the PC has improved, but I suspect not. Copy/Make worked great on the older Cray super-computers, but they had far greater memory bandwidth than today's PCs.

Don · Post by **Don** » Wed Sep 16, 2009 9:32 pm

Gian-Carlo Pascutto wrote:This thread got a bit confused because the both of you are talking about different things.

The fastest is to undo the changes to large data structures, and clone the small ones. You need to think about the trade off in effort between undoing a change and copying it back.

Undoing moves should be faster than copying 4.5K of memory. If it's not, something is wrong.

What Marco describes is a small optimization for the case where you copy a datastructure (to have only 1 copy instead of 2).

I misunderstood a lot. I missed that he only copies a portion of the state.

One other consideration is that if you profile your code, you may discover that making a move has trivial overhead. Even if full state copy is expensive compared to a complicated undo function, it may not be worth the extra complexity, especially for muti-processor programs.

In my program making moves is less than 1 percent of the execution time (perhaps that is a bug?) of my code. Of course there are caching issue too to consider which could make this more expensive than I think.

I just measured the size of my state and it's only 192 bytes. I don't believe that is excessive or cache unfriendly. I do not keep any explicit lists, such as a list of moves or a list of hash keys or anything like this. 192 bytes fully describes the current position.

The position state has a pointer to the previous position state and that is how from any position I can reference previous positions.

I realize that some programs probably like to maintain much more information incrementally, such as attack boards and such. But each thing you add requres even more work in the "undo" function, so I don't see that there is much of a tradeoff, unless your state is enormous.

You mention 4.5 K of memory. Are there any chess programs that require that much memory to represent a single position?

Don · Post by **Don** » Wed Sep 16, 2009 9:43 pm

mjlef wrote:I just clone now, and the size of my board structure is about 500 bytes. Note you do no need a move stack for each board structure. You need a move stack for each processor, and it needs not be copied at all unless you are spiltting the search (just allocated at program startup).

I don't think you need a move stack at all. In my position state I just have a pointer to the parent position. I also keep the move that led to this position, so this is my move list.

Outside the search I just have a array of positions. The prototype to make a move is something like this:

int make( position *old_state, position *new_state, move_t move );

So outside the search just have a flat list of position states in an array. Inside the search I'm just passing states or pointers to them around as I can always discard them.

Before I decided to clone I did some timing tests and found way under 1% of processor time was being spent copying the board over, so I decided an undo would be a waste of time.

I think you are right. You would have to have a pretty large state for this not to be the case, perhaps something of the size Gian Carlo is talking about.

bob · Post by **bob** » Wed Sep 16, 2009 9:49 pm

Don wrote:
Gian-Carlo Pascutto wrote:This thread got a bit confused because the both of you are talking about different things.

The fastest is to undo the changes to large data structures, and clone the small ones. You need to think about the trade off in effort between undoing a change and copying it back.

Undoing moves should be faster than copying 4.5K of memory. If it's not, something is wrong.

What Marco describes is a small optimization for the case where you copy a datastructure (to have only 1 copy instead of 2).
I misunderstood a lot. I missed that he only copies a portion of the state.

One other consideration is that if you profile your code, you may discover that making a move has trivial overhead. Even if full state copy is expensive compared to a complicated undo function, it may not be worth the extra complexity, especially for muti-processor programs.

You have to be _very_ careful when looking at profile output however. Yes, Make might be small. But it is kicking stuff out of cache that might make later things slower. And it is generating memory traffic that might be interfering with other threads in a parallel search. It is not so easy to attribute costs to the guilty party when the issue is not instructions being executed, but the cost of getting the needed data back into cache.

In my program making moves is less than 1 percent of the execution time (perhaps that is a bug?) of my code. Of course there are caching issue too to consider which could make this more expensive than I think.

I just measured the size of my state and it's only 192 bytes. I don't believe that is excessive or cache unfriendly. I do not keep any explicit lists, such as a list of moves or a list of hash keys or anything like this. 192 bytes fully describes the current position.

The position state has a pointer to the previous position state and that is how from any position I can reference previous positions.

I realize that some programs probably like to maintain much more information incrementally, such as attack boards and such. But each thing you add requres even more work in the "undo" function, so I don't see that there is much of a tradeoff, unless your state is enormous.

You mention 4.5 K of memory. Are there any chess programs that require that much memory to represent a single position?

undo move vs. Position Cloning

Re: undo move vs. Position Cloning

Re: undo move vs. Position Cloning

Re: undo move vs. Position Cloning

Re: undo move vs. Position Cloning

Re: undo move vs. Position Cloning

Re: undo move vs. Position Cloning

Re: undo move vs. Position Cloning

Re: undo move vs. Position Cloning

Re: undo move vs. Position Cloning

Re: undo move vs. Position Cloning