performance of copy-make

Don · Post by **Don** » Thu Aug 04, 2011 10:06 pm

rbarreira wrote:
Don wrote: I once measured copy-make and I think the program spends 2% of the time there now. (At one time Doch spend about 3% of the time, but it's not clear to me how accurate these measurement are. That does not make Komodo 2% slower than it has to be, because making and unmaking a move would also have overhead. I don't really know if make/unmake would give me a speedup, but the upper bound on the potential speedup is 2 percent and my estimate of the true difference is zero. It's possible that copy-make is faster, but I'm not claiming that. I don't really know.
Actually the upper bound is bigger than 2 percent, due to the cache issue.

Yes, it's not strictly correct as an upper bound because the effect is not known.

But don't forget, I did not deduct the time it takes to make and unmake a move either. We know for sure that make and unmake is not free. I believe that make + unmake is more expensive than copy-make but it would be difficult to prove unless I actually implemented it. Komodo spends almost no time making moves which would make the measurement even more difficult unless it turns out cache issues really is a big deal.

I am probably going to test this sooner or later.

Don · Post by **Don** » Thu Aug 04, 2011 10:10 pm

bob wrote:
Don wrote:
bob wrote:
Don wrote:
bob wrote:I don't see how it simplifies the split point code at all. All it does is get rid of Unmake() for a gain, but adds the copy during the make process for a loss. But affecting the split point code? I don't see it. You certainly have to replicate everything for each thread...
The first time I started using copy-make it was for the idea of simplifying the code, getting rid of global variables and avoiding the complexities and extra conditional branches involved in unmaking a move. It's often, but not always, the case that simplifying the code comes with many advantages including performance.

I usually try to determine using logic and my own intuition what should happen if I do things a certain way but I have found that my intuition is a very poor guide. So I am very careful to not make any assumptions no matter how obvious they seem to me or how much the fly in the face of logic. Of course one still has to use reason as a guide.

So my first gut feeling on this issue was that it was a horrible idea. At MIT I was convinced to try it as I was given a lecture on cache sizes, that copying a few hundred bytes on modern CPU's was nothing compared to things like missed branches and missed cache hits and other consideration. So when I stopped putting undue weight on the speed of copying state, the only thing left was win, win, win. But of course the only thing that really mattered to me was whether it really worked or not.

There are only 2 potential downsides to copy-make. Possible cache blowout issues and the speed of copying. The upsides to copy-make are:

1. much reduced logic (which means less missed branches)
2. great simplification of code. (which can be somewhat mitigated with good design)
3. easier to make parallel code.
4. No dealing with unmake and it's complexities.
5. less data structures to support unmaking a move.

None of these things are major - with good design I don't think a program needs to have copy-make and can still be bug free and easy to work with - but I am a firm believer in keeping things as simple as possible (but no simpler.) Sometimes simple gets in the way and you need a little more complexity, but I have concluded that this is not one of those cases.

I once measured copy-make and I think the program spends 2% of the time there now. (At one time Doch spend about 3% of the time, but it's not clear to me how accurate these measurement are. That does not make Komodo 2% slower than it has to be, because making and unmaking a move would also have overhead. I don't really know if make/unmake would give me a speedup, but the upper bound on the potential speedup is 2 percent and my estimate of the true difference is zero. It's possible that copy-make is faster, but I'm not claiming that. I don't really know.

So I did an experiment in order to test the theory that copy state is blowing out the cache and slowing the program to a crawl - which seems to be the big fear that is being put out there. Before getting to that, I want to mention that most good chess programs use a number of vary large data structures and that storing a few position states is a trivial amount of space in comparison.

Komodo has a position state that is 208 bytes. It used to be 192 bytes for alignment purposes but it has grown - but I can get it back down to 192 easily enough. But that is besides the point.

So the experiment is to see how much slower the program will get if I double the size of the position state. I added a character array the middle of the state struct definition that is also 208 bytes large and I did not even bother to align it in any special way but I doubt that matters in this case as it is on a 4 byte boundary anyway. Then I doubled the size of that to get 3x larger state that is normal in Komodo.

I ran 3 timings (50 position to 13 ply) and took the median value as the time.

When I doubled the state size, the program slowed down by 0.7 percent, that's less than 1 percent but non-trivial. When I tripled the state size it slowed down by 1.4 percent compared to the regular version. So the state size appears to matter and is worthy of consideration. Note that by doubling and tripling the state size I have to suffer the additional cache overhead as well as the state copy time.

It's difficult to draw any conclusions from this because I don't know how much unmake and the associated logic and data structures to support it would slow things down. I would point out that make is also a little more complicated without copy-make - so the question is whether make/unmake is a major win over copy-make. I seriously doubt that it is. I would guess that with a relatively small state size it's a non-issue. If I thought it was more than 2 percent I would go through the pain of changing the program but I doubt it's a slowdown at all.

I know this is not proof of anything, but Komodo manages to be a very strong program using this scheme but as far as I know it's the only program that does this so perhaps I am missing something big that would propel us over the level of Houdini if I change it.

It would not be too difficult to convert my program - I could keep everything the same but have a special routine that makes and unmakes a move to a target state. However, guess what? Now I would have to add more code and global variables in order to maintain the information to undue a move and do repetition testing. In other words you STILL have to maintain state, now it's just more complicated.
Early Crafty's were copy/make, because that was highly efficient on the Cray with its ridiculous (even by PCs of today) memory bandwidth. But my question to him was "how does copy make make a parallel split easier"? Whatever you do, you have to replicate the board and everything N times to use N threads. Whether you have a single board (as I do) in a structure unique to each thread, or whether you have N boards.

The problem I would have with copy/make is that I allow _any_ thread to help any other thread, and then back up results. So you end up copying a lot of stuff at split points unless you choose to use a "master/slave" sort of arrangement where only the master can back anything up.

But ignoring that, since it can be dealt with, what is the benefit of copy/make with regard to doing a parallel split? I don't see any. And when I changed Crafty, it did not affect parallel search at all..
The primary benefit is that unmake goes away and thus costs nothing and does not produce bugs. I think it may be faster because of this because I don't believe make is twice as slow as it would normally be using copy-make. Also, there are less opportunities for conditional branches to miss.

Like malloc/free and open/close your code has to be peppered with unmake() and you cannot forget to match them up properly - and when modifying code I am sure this will be an extra error to annoy you with.

So I guess the short answer is that there is no major benefit to either method that I can tell other than what I layed out and other than any relative performance difference between the two - but I don't even know which is faster. Somehow I don't feel that 200 bytes per position is a major cache blowout issue when other data structure in chess programs are orders of magnitude larger than this.

I may take the time to implement the unmake just to see for myself - if I do I will report the results to you and the forum. I just don't look forward to building the unmake routine and getting it right for castling, en-pasant, promotions, etc.
My make/unmake have _nothing_ to do with my parallel search effectiveness. I'm not talking about performance. I'm certain make/unmake is faster, because it always works on the same data with likely cache hits, as opposed to copy/make which always works on different data as it is copied first for every last make operation. How significant? Don't know today. On the original pentium it was a _huge_ performance hit. Maybe 2-3x slower at a minimum. But that had just L1 cache and very small on top of that. Today? I have not tried it. With today's 4M+ L2 caches, and newer machines with L3 (or older but non-X86 machines) it might not be that much of a win, although I can not imagine that it can't be measured.

But, the poster said it made his parallel split stuff simpler and I could not understand how it is even related to parallel splits at all... I don't use board information there, nor do I make moves while splitting. I just have to copy the board to each thread that is going to work together on the same position, and that's true for copy/make or make/unmake... unless someone can point out something I am missing...

No, I don't think you are missing anything. I assume that your position state is larger because you have to keep all the history of the zobrist keys and moves etc, but if the copies are rare that is no concern. So I don't see what it should matter either way unless bad cache behavior really is an issue.

.

wgarvin · Post by **wgarvin** » Thu Aug 04, 2011 10:10 pm

Hybrid approaches are also possible.

I believe Ippolit keeps the large parts of the position (square array and bitboards) in a fixed global variable structure, and updates them in a make,unmake fashion (or "spell","unspell" in their weird englishization). The rest of the position kept in a stack of structures and updated with copy-make.

rbarreira · Post by **rbarreira** » Thu Aug 04, 2011 10:13 pm

Don wrote:
rbarreira wrote:
Don wrote: I once measured copy-make and I think the program spends 2% of the time there now. (At one time Doch spend about 3% of the time, but it's not clear to me how accurate these measurement are. That does not make Komodo 2% slower than it has to be, because making and unmaking a move would also have overhead. I don't really know if make/unmake would give me a speedup, but the upper bound on the potential speedup is 2 percent and my estimate of the true difference is zero. It's possible that copy-make is faster, but I'm not claiming that. I don't really know.
Actually the upper bound is bigger than 2 percent, due to the cache issue.
Yes, it's not strictly correct as an upper bound because the effect is not known.

But don't forget, I did not deduct the time it takes to make and unmake a move either. We know for sure that make and unmake is not free. I believe that make + unmake is more expensive than copy-make but it would be difficult to prove unless I actually implemented it. Komodo spends almost no time making moves which would make the measurement even more difficult unless it turns out cache issues really is a big deal.

I am probably going to test this sooner or later.

I found a thread by Bob where he posted benchmarks for Crafty. Apparently the performance difference between copy-make and make-unmake depended on the number of threads:

http://talkchess.com/forum/viewtopic.php?t=29798

Bottom line is that copy/make costs 5% on single-thread programs, and can cost 25% on an 8-core box if you use all 8 cores.

Rein Halbersma · Post by **Rein Halbersma** » Thu Aug 04, 2011 10:37 pm

bob wrote:I don't see how it simplifies the split point code at all. All it does is get rid of Unmake() for a gain, but adds the copy during the make process for a loss. But affecting the split point code? I don't see it. You certainly have to replicate everything for each thread...

This remark was meant in the context of a Cilk type of parallel search where the scheduling is done behind the scenes. Having a copy-make approach simplifies things because you don't have to write out any split-point code at all (because there is no shared state). For programs having an explicit scheduler with pthreads and split points, or what have you, make/unmake might not make a difference in code simplicity compared to copy-make. But make/undo is much harder to adapt to Cilk type of parallelism.

Don · Post by **Don** » Thu Aug 04, 2011 10:43 pm

Rein Halbersma wrote:
bob wrote:I don't see how it simplifies the split point code at all. All it does is get rid of Unmake() for a gain, but adds the copy during the make process for a loss. But affecting the split point code? I don't see it. You certainly have to replicate everything for each thread...
This remark was meant in the context of a Cilk type of parallel search where the scheduling is done behind the scenes. Having a copy-make approach simplifies things because you don't have to write out any split-point code at all (because there is no shared state). For programs having an explicit scheduler with pthreads and split points, or what have you, make/unmake might not make a difference in code simplicity compared to copy-make. But make/undo is much harder to adapt to Cilk type of parallelism.

I would think either method would be relatively simple whether using Cilk or not.

Instead of just talking about it, I'm going to bite the bullet and make a clone of Komodo (hope I don't get into any trouble for this) that uses make/unmake. I will report the results - probably tomorrow.

Of course I cannot comment on MP performance yet but that is coming to Komodo pretty soon.

Zach Wegner · Post by **Zach Wegner** » Fri Aug 05, 2011 4:44 am

Copy/make can be beneficial in parallel search if you do DTS...

bob · Post by **bob** » Fri Aug 05, 2011 7:01 pm

Don wrote:
bob wrote:
Don wrote:
bob wrote:
Don wrote:
bob wrote:I don't see how it simplifies the split point code at all. All it does is get rid of Unmake() for a gain, but adds the copy during the make process for a loss. But affecting the split point code? I don't see it. You certainly have to replicate everything for each thread...
The first time I started using copy-make it was for the idea of simplifying the code, getting rid of global variables and avoiding the complexities and extra conditional branches involved in unmaking a move. It's often, but not always, the case that simplifying the code comes with many advantages including performance.

I usually try to determine using logic and my own intuition what should happen if I do things a certain way but I have found that my intuition is a very poor guide. So I am very careful to not make any assumptions no matter how obvious they seem to me or how much the fly in the face of logic. Of course one still has to use reason as a guide.

So my first gut feeling on this issue was that it was a horrible idea. At MIT I was convinced to try it as I was given a lecture on cache sizes, that copying a few hundred bytes on modern CPU's was nothing compared to things like missed branches and missed cache hits and other consideration. So when I stopped putting undue weight on the speed of copying state, the only thing left was win, win, win. But of course the only thing that really mattered to me was whether it really worked or not.

There are only 2 potential downsides to copy-make. Possible cache blowout issues and the speed of copying. The upsides to copy-make are:

1. much reduced logic (which means less missed branches)
2. great simplification of code. (which can be somewhat mitigated with good design)
3. easier to make parallel code.
4. No dealing with unmake and it's complexities.
5. less data structures to support unmaking a move.

None of these things are major - with good design I don't think a program needs to have copy-make and can still be bug free and easy to work with - but I am a firm believer in keeping things as simple as possible (but no simpler.) Sometimes simple gets in the way and you need a little more complexity, but I have concluded that this is not one of those cases.

I once measured copy-make and I think the program spends 2% of the time there now. (At one time Doch spend about 3% of the time, but it's not clear to me how accurate these measurement are. That does not make Komodo 2% slower than it has to be, because making and unmaking a move would also have overhead. I don't really know if make/unmake would give me a speedup, but the upper bound on the potential speedup is 2 percent and my estimate of the true difference is zero. It's possible that copy-make is faster, but I'm not claiming that. I don't really know.

So I did an experiment in order to test the theory that copy state is blowing out the cache and slowing the program to a crawl - which seems to be the big fear that is being put out there. Before getting to that, I want to mention that most good chess programs use a number of vary large data structures and that storing a few position states is a trivial amount of space in comparison.

Komodo has a position state that is 208 bytes. It used to be 192 bytes for alignment purposes but it has grown - but I can get it back down to 192 easily enough. But that is besides the point.

So the experiment is to see how much slower the program will get if I double the size of the position state. I added a character array the middle of the state struct definition that is also 208 bytes large and I did not even bother to align it in any special way but I doubt that matters in this case as it is on a 4 byte boundary anyway. Then I doubled the size of that to get 3x larger state that is normal in Komodo.

I ran 3 timings (50 position to 13 ply) and took the median value as the time.

When I doubled the state size, the program slowed down by 0.7 percent, that's less than 1 percent but non-trivial. When I tripled the state size it slowed down by 1.4 percent compared to the regular version. So the state size appears to matter and is worthy of consideration. Note that by doubling and tripling the state size I have to suffer the additional cache overhead as well as the state copy time.

It's difficult to draw any conclusions from this because I don't know how much unmake and the associated logic and data structures to support it would slow things down. I would point out that make is also a little more complicated without copy-make - so the question is whether make/unmake is a major win over copy-make. I seriously doubt that it is. I would guess that with a relatively small state size it's a non-issue. If I thought it was more than 2 percent I would go through the pain of changing the program but I doubt it's a slowdown at all.

I know this is not proof of anything, but Komodo manages to be a very strong program using this scheme but as far as I know it's the only program that does this so perhaps I am missing something big that would propel us over the level of Houdini if I change it.

It would not be too difficult to convert my program - I could keep everything the same but have a special routine that makes and unmakes a move to a target state. However, guess what? Now I would have to add more code and global variables in order to maintain the information to undue a move and do repetition testing. In other words you STILL have to maintain state, now it's just more complicated.
Early Crafty's were copy/make, because that was highly efficient on the Cray with its ridiculous (even by PCs of today) memory bandwidth. But my question to him was "how does copy make make a parallel split easier"? Whatever you do, you have to replicate the board and everything N times to use N threads. Whether you have a single board (as I do) in a structure unique to each thread, or whether you have N boards.

The problem I would have with copy/make is that I allow _any_ thread to help any other thread, and then back up results. So you end up copying a lot of stuff at split points unless you choose to use a "master/slave" sort of arrangement where only the master can back anything up.

But ignoring that, since it can be dealt with, what is the benefit of copy/make with regard to doing a parallel split? I don't see any. And when I changed Crafty, it did not affect parallel search at all..
The primary benefit is that unmake goes away and thus costs nothing and does not produce bugs. I think it may be faster because of this because I don't believe make is twice as slow as it would normally be using copy-make. Also, there are less opportunities for conditional branches to miss.

Like malloc/free and open/close your code has to be peppered with unmake() and you cannot forget to match them up properly - and when modifying code I am sure this will be an extra error to annoy you with.

So I guess the short answer is that there is no major benefit to either method that I can tell other than what I layed out and other than any relative performance difference between the two - but I don't even know which is faster. Somehow I don't feel that 200 bytes per position is a major cache blowout issue when other data structure in chess programs are orders of magnitude larger than this.

I may take the time to implement the unmake just to see for myself - if I do I will report the results to you and the forum. I just don't look forward to building the unmake routine and getting it right for castling, en-pasant, promotions, etc.
My make/unmake have _nothing_ to do with my parallel search effectiveness. I'm not talking about performance. I'm certain make/unmake is faster, because it always works on the same data with likely cache hits, as opposed to copy/make which always works on different data as it is copied first for every last make operation. How significant? Don't know today. On the original pentium it was a _huge_ performance hit. Maybe 2-3x slower at a minimum. But that had just L1 cache and very small on top of that. Today? I have not tried it. With today's 4M+ L2 caches, and newer machines with L3 (or older but non-X86 machines) it might not be that much of a win, although I can not imagine that it can't be measured.

But, the poster said it made his parallel split stuff simpler and I could not understand how it is even related to parallel splits at all... I don't use board information there, nor do I make moves while splitting. I just have to copy the board to each thread that is going to work together on the same position, and that's true for copy/make or make/unmake... unless someone can point out something I am missing...
No, I don't think you are missing anything. I assume that your position state is larger because you have to keep all the history of the zobrist keys and moves etc, but if the copies are rare that is no concern. So I don't see what it should matter either way unless bad cache behavior really is an issue.

.

When I did copy / make, I did not copy all the position stuff. I never "unmade" hash signatures, I just saved the original in today's code. All I really had to copy in copy/make was just the board stuff. Which would be smaller today since there are no rotated boards to copy, saving at least 24 bytes over the rotated version that did copy/make years ago.

In a split today, or in copy/make days, I had to copy the same things. The bitboards (12 for individual pieces, 2 for white/black occupied, and 2 for bishops/queens and rooks/queens. 128 bytes of that. 64 byte array of square contents. 192 bytes. Rest was tiny (ep target, castle status are 1 byte each)...

So splitting incurs nothing different whether you do copy/make or make/unmake, and I was trying to understand the original poster's comment that it was more efficient. I don't see how.

bob · Post by **bob** » Fri Aug 05, 2011 7:04 pm

Zach Wegner wrote:Copy/make can be beneficial in parallel search if you do DTS...

I designed DTS.

And I didn't use copy/make in Cray Blitz. If you look at the source available on the net, it is clearly make/unmake, just like Crafty of today. How, exactly, does it help? Each thread has to have its own "state". Whether that state is copied from a local state that is updated with make/unmake, or copied from a state that is updated via copy/make, I don't see how it matters at all...

Now an iterated search is certainly needed, otherwise DTS will become a nightmare of details to deal with. But that's outside copy/make vs make/unmake.

bob · Post by **bob** » Fri Aug 05, 2011 7:09 pm

rbarreira wrote:
Don wrote:
rbarreira wrote:
Don wrote: I once measured copy-make and I think the program spends 2% of the time there now. (At one time Doch spend about 3% of the time, but it's not clear to me how accurate these measurement are. That does not make Komodo 2% slower than it has to be, because making and unmaking a move would also have overhead. I don't really know if make/unmake would give me a speedup, but the upper bound on the potential speedup is 2 percent and my estimate of the true difference is zero. It's possible that copy-make is faster, but I'm not claiming that. I don't really know.
Actually the upper bound is bigger than 2 percent, due to the cache issue.
Yes, it's not strictly correct as an upper bound because the effect is not known.

But don't forget, I did not deduct the time it takes to make and unmake a move either. We know for sure that make and unmake is not free. I believe that make + unmake is more expensive than copy-make but it would be difficult to prove unless I actually implemented it. Komodo spends almost no time making moves which would make the measurement even more difficult unless it turns out cache issues really is a big deal.

I am probably going to test this sooner or later.
I found a thread by Bob where he posted benchmarks for Crafty. Apparently the performance difference between copy-make and make-unmake depended on the number of threads:

http://talkchess.com/forum/viewtopic.php?t=29798

Bottom line is that copy/make costs 5% on single-thread programs, and can cost 25% on an 8-core box if you use all 8 cores.

That's an architectural issue. Copy/Make eats bandwidth in a fast program. With 8 threads one has to hope for a high cache hit rate to keep from totally smoking the memory bus path. Copy/Make has a much higher miss rate overall, because each ply grabs 4 more cache blocks, displacing something that was already there frequently. Today's memories are better (triple channel, for example) and these numbers might not be quite so bad.

I might try to re-test this again, although my comment of 3.5 hours to make the change might make me think a bit about whether the information is worth the effort...

At least it is 100x easier to go from make/unmake to copy/make than it is to go in the opposite direction.

performance of copy-make

Re: performance of copy-make

Re: performance of copy-make

Re: performance of copy-make

Re: performance of copy-make

Re: performance of copy-make

Re: performance of copy-make

Re: performance of copy-make

Re: performance of copy-make

Re: performance of copy-make

Re: performance of copy-make