Lazy SMP, part 2

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

dchoman
Posts: 171
Joined: Wed Dec 28, 2011 8:44 pm
Location: United States

Re: Lazy SMP, part 2

Post by dchoman »

bob wrote:
For parallel search measurements, the simplest test is to take a set of positions, search them to a fixed depth D using just one CPU/thread. Then search to the SAME depth D using multiple CPUS/threads. Your speedup = time(1cpu) / time(Ncpus).

That tells the tale. This type of algorithm fares poorly when # cpus > 2, and it doesn't do that well for 2, where a straightforward YBW algorithm can do 1.7x-1.8x using 2 cpus...
Actually, I measured things this way. The interesting thing is that this measure is misleading in this case. When compared to pure game results, time to depth does not explain the elo increase (which is quite a bit better than the time-to-depth numbers suggest). See the first post in this thread where I compare game-play results to a test just like you describe.

I am not saying that this approach will continue to scale well beyond 4 processors (in fact, I expect it won't), but there is more going on than just pure speed up.

- Dan
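
For reference, a minimal sketch of the fixed-depth measurement Bob describes, i.e. speedup = time(1 cpu) / time(N cpus) over a fixed position set, in C++. The search_to_depth() hook and the position list are placeholders for whatever engine interface is at hand, not anyone's actual code:

// Hypothetical time-to-depth harness: search every position of a test set
// to the same fixed depth with 1 thread and with N threads, then report
// speedup = total_time(1 thread) / total_time(N threads).
#include <cstdio>
#include <string>
#include <vector>

// Placeholder: run the engine on 'fen' to fixed 'depth' with 'threads'
// search threads and return the elapsed time in seconds.
double search_to_depth(const std::string& fen, int depth, int threads);

int main() {
    std::vector<std::string> positions;   // fill from an EPD/FEN test suite
    const int depth = 16, threads = 4;

    double t1 = 0.0, tN = 0.0;
    for (const std::string& fen : positions) {
        t1 += search_to_depth(fen, depth, 1);        // serial baseline
        tN += search_to_depth(fen, depth, threads);  // SMP run, same depth
    }
    std::printf("time-to-depth speedup with %d threads: %.2f\n", threads, t1 / tN);
    return 0;
}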
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Lazy SMP, part 2

Post by bob »

dchoman wrote:
bob wrote:
For parallel search measurements, the simplest test is to take a set of positions, search them to a fixed depth D using just one CPU/thread. Then search to the SAME depth D using multiple CPUS/threads. Your speedup = time(1cpu) / time(Ncpus).

That tells the tale. This type of algorithm fares poorly when # cpus > 2, and it doesn't do that well for 2, where a straightforward YBW algorithm can do 1.7x-1.8x using 2 cpus...
Actually, I measured things this way. The interesting thing is that this measure is misleading in this case. When compared to pure game results, time to depth does not explain the elo increase (which is quite a bit better than the time-to-depth numbers suggest). See the first post in this thread where I compare game-play results to a test just like you describe.

I am not saying that this approach will continue to scale well beyond 4 processors (in fact, I expect it won't), but there is more going on than just pure speed up.

- Dan
When results look strange, it is always advisable to figure out why. SMP ONLY gives you a faster search. Any other "gain" will always be suspicious. In an occasional position, you might get a better score due to HT interaction between threads (although not likely with your current implementation). But you will also see worse results/scores, and they balance out unless you are the luckiest person on the planet.

If you are seeing +50 Elo, you must be seeing an additional ply of depth, at a minimum. If not, I'd look carefully at everything to see what is going on... There is really nothing else to go on BUT speed up. Unless you have introduced some advantage you are not aware of, such as the SMP code taking more time to choose a move without it being intentional, or something similar.
Joerg Oster
Posts: 939
Joined: Fri Mar 10, 2006 4:29 pm
Location: Germany
Full name: Jörg Oster

Re: Lazy SMP, part 2

Post by Joerg Oster »

bob wrote:
dchoman wrote:
bob wrote:
For parallel search measurements, the simplest test is to take a set of positions, search them to a fixed depth D using just one CPU/thread. Then search to the SAME depth D using multiple CPUS/threads. Your speedup = time(1cpu) / time(Ncpus).

That tells the tale. This type of algorithm fares poorly when # cpus > 2, and it doesn't do that well for 2, where a straightforward YBW algorithm can do 1.7x-1.8x using 2 cpus...
Actually, I measured things this way. The interesting thing is that this measure is misleading in this case. When compared to pure game results, time to depth does not explain the elo increase (which is quite a bit better than the time-to-depth numbers suggest). See the first post in this thread where I compare game-play results to a test just like you describe.

I am not saying that this approach will continue to scale well beyond 4 processors (in fact, I expect it won't), but there is more going on than just pure speed up.

- Dan
When results look strange, it is always advisable to figure out why. SMP ONLY gives you a faster search. Any other "gain" will always be suspicious. In an occasional position, you might get a better score due to HT interaction between threads (although not likely with your current implementation). But you will also see worse results/scores, and they balance out unless you are the luckiest person on the planet.

If you are seeing +50 Elo, you must be seeing an additional ply of depth, at a minimum. If not, I'd look carefully at everything to see what is going on... There is really nothing else to go on BUT speed up. Unless you have introduced some advantage you are not aware of, such as the SMP code taking more time to choose a move without it being intentional, or something similar.
Wouldn't it be great if SMP resulted not only in a 'quantitative' gain (a speedup) but also in a 'qualitative' one, meaning a somewhat 'better' search?

Dan's and Martin's results seem to imply that their approach gives a significant gain in the quality of the search, and with the big advantage that you can make use of it right from search depth == 1. I don't know exactly how you do it in Crafty, but most engines, I believe, don't split the search until they reach a certain depth.

Maybe it is an idea to combine both approaches?
Jörg Oster
mar
Posts: 2559
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: Lazy SMP, part 2

Post by mar »

Joerg Oster wrote:Wouldn't it be great if SMP resulted not only in a 'quantitative' gain (a speedup) but also in a 'qualitative' one, meaning a somewhat 'better' search?

Dan's and Martin's results seem to imply that their approach gives a significant gain in the quality of the search, and with the big advantage that you can make use of it right from search depth == 1. I don't know exactly how you do it in Crafty, but most engines, I believe, don't split the search until they reach a certain depth.

Maybe it is an idea to combine both approaches?
Hi Joerg,
yes, it would, and I think it may be the case. I have noticed that it especially helps to resolve fail highs, and that it kicks in in the endgame. It may also be useful when a new iteration would otherwise see nothing new.
I had to add a condition to immediately abort all other threads when one of them finishes the root window faster than the others (it happens rarely, but it does happen; otherwise I could easily end up with one thread still crunching).
I think Dan's idea of adding 1 to the depth may actually be crucial.
Much more testing is needed, sure, but there's a lot of room for improvement, like trying to probe the hash table right after the next move comes out of the move generator (only testing will show whether it's worth it).
I have noticed a lot of nondeterminism, however.
I guess that's common for SMP, but I have noticed that if the SMP search is lucky it can find the solution two plies earlier, which means it can find the move in anywhere from 40 seconds to 3.5 minutes. It's possible that a state-of-the-art algorithm would do much better here.
Anyway, it looks promising according to the tests, and the only thing that's certain is that it's probably not a coincidence.
If you already have code for threads and condition variables, the implementation can take about two hours of work. Bob's lockless hashing was a matter of minutes, actually.
Since the implementation is straightforward and I've watched CPU usage and some games carefully, I think there is zero chance of thinking longer or consuming more CPU time than expected. It simply seems to work.

Martin
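
For readers who haven't seen it: the 'lockless hashing' Martin mentions is usually the XOR trick described by Hyatt and Mann, where each entry stores the Zobrist key XORed with its data word, so a torn write from another thread simply fails verification on the next probe. A minimal sketch in C++ (the data layout is illustrative, not any particular engine's; real engines often wrap the loads/stores in relaxed atomics):

// XOR-based lockless hashing: no locks, a corrupted entry just looks like a miss.
#include <cstdint>

struct TTEntry {
    std::uint64_t key_xor_data;   // Zobrist key ^ data
    std::uint64_t data;           // packed move/score/depth/bound (layout illustrative)
};

void tt_store(TTEntry& e, std::uint64_t key, std::uint64_t data) {
    e.data = data;
    e.key_xor_data = key ^ data;
}

bool tt_probe(const TTEntry& e, std::uint64_t key, std::uint64_t& data_out) {
    const std::uint64_t data = e.data;
    if ((e.key_xor_data ^ data) != key)
        return false;             // miss, or entry torn by a concurrent write
    data_out = data;
    return true;
}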
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Lazy SMP, part 2

Post by bob »

Joerg Oster wrote:
bob wrote:
dchoman wrote:
bob wrote:
For parallel search measurements, the simplest test is to take a set of positions, search them to a fixed depth D using just one CPU/thread. Then search to the SAME depth D using multiple CPUS/threads. Your speedup = time(1cpu) / time(Ncpus).

That tells the tale. This type of algorithm fares poorly when # cpus > 2, and it doesn't do that well for 2, where a straightforward YBW algorithm can do 1.7x-1.8x using 2 cpus...
Actually, I measured things this way. The interesting thing is that this measure is misleading in this case. When compared to pure game results, time to depth does not explain the elo increase (which is quite a bit better than the time-to-depth numbers suggest). See the first post in this thread where I compare game-play results to a test just like you describe.

I am not saying that this approach will continue to scale well beyond 4 processors (in fact, I expect it won't), but there is more going on than just pure speed up.

- Dan
When results look strange, it is always advisable to figure out why. SMP ONLY gives you a faster search. Any other "gain" will always be suspicious. In an occasional position, you might get a better score due to HT interaction between threads (although not likely with your current implementation). But you will also see worse results/scores, and they balance out unless you are the luckiest person on the planet.

If you are seeing +50 Elo, you must be seeing an additional ply of depth, at a minimum. If not, I'd look carefully at everything to see what is going on... There is really nothing else to go on BUT speed up. Unless you have introduced some advantage you are not aware of, such as the SMP code taking more time to choose a move without it being intentional, or something similar.
Wouldn't it be great if SMP resulted not only in a 'quantitative' gain (a speedup) but also in a 'qualitative' one, meaning a somewhat 'better' search?

Dan's and Martin's results seem to imply that their approach gives a significant gain in the quality of the search, and with the big advantage that you can make use of it right from search depth == 1. I don't know exactly how you do it in Crafty, but most engines, I believe, don't split the search until they reach a certain depth.

Maybe it is an idea to combine both approaches?
Where does this "quality" come from?

SMP helps depth, not knowledge and such...
dchoman
Posts: 171
Joined: Wed Dec 28, 2011 8:44 pm
Location: United States

Re: Lazy SMP, part 2

Post by dchoman »

bob wrote:
Joerg Oster wrote:
bob wrote:
dchoman wrote:
bob wrote:
For parallel search measurements, the simplest test is to take a set of positions, search them to a fixed depth D using just one CPU/thread. Then search to the SAME depth D using multiple CPUS/threads. Your speedup = time(1cpu) / time(Ncpus).

That tells the tale. This type of algorithm fares poorly when # cpus > 2, and it doesn't do that well for 2, where a straightforward YBW algorithm can do 1.7x-1.8x using 2 cpus...
Actually, I measured things this way. The interesting thing is that this measure is misleading in this case. When compared to pure game results, time to depth does not explain the elo increase (which is quite a bit better than the time-to-depth numbers suggest). See the first post in this thread where I compare game-play results to a test just like you describe.

I am not saying that this approach will continue to scale well beyond 4 processors (in fact, I expect it won't), but there is more going on than just pure speed up.

- Dan
When results look strange, it is always advisable to figure out why. SMP ONLY gives you a faster search. Any other "gain" will always be suspicious. In an occasional position, you might get a better score due to HT interaction between threads (although not likely with your current implementation). But you will also see worse results/scores, and they balance out unless you are the luckiest person on the planet.

If you are seeing +50 Elo, you must be seeing an additional ply of depth, at a minimum. If not, I'd look carefully at everything to see what is going on... There is really nothing else to go on BUT speed up. Unless you have introduced some advantage you are not aware of, such as the SMP code taking more time to choose a move without it being intentional, or something similar.
Wouldn't it be great if SMP resulted not only in a 'quantitative' gain (a speedup) but also in a 'qualitative' one, meaning a somewhat 'better' search?

Dan's and Martin's results seem to imply that their approach gives a significant gain in the quality of the search, and with the big advantage that you can make use of it right from search depth == 1. I don't know exactly how you do it in Crafty, but most engines, I believe, don't split the search until they reach a certain depth.

Maybe it is an idea to combine both approaches?
Where does this "quality" come from?

SMP helps depth, not knowledge and such...
Bob is correct. Depth appears to be the only factor. Out of curiosity, I ran a couple of very quick tests this evening using fixed depth, so time usage is not an issue.

Test 1: 1000 games, 4 cpus vs. 1 cpu with a fixed depth of 7. In this case, I used the implementation I described in an earlier post where, in the 4 cpu case, some of the threads are searching 1 or 2 ply beyond the nominal iteration depth in the main thread. If one of these other threads happens, by chance, to complete before the main thread, the best move and score from that thread are substituted. This kind of substitution appears to be rare, but those other threads are contributing to the hash table all the time... so the average depth of hash table entries is greater than the iteration depth might indicate.

Result: 4 cpus had a 99% LOS with a nominal elo difference of 26 +/- 13

Test 2: 1000 games, 4 cpus vs. 1 cpu with a fixed depth of 7. In this case, I rigorously forced all threads to search a depth of no more than 7 ply.

Result: 1 cpu had an 85% LOS with a nominal elo difference of 10 +/- 13

So any improvement at a 'fixed depth' seems to be due to the difference in depth of the search in the other threads. However, as I described in an earlier post, searching these extra depths in some of the threads *also* improves the time to depth. So it does seem to be an overall improvement to this simple scheme to have at least some of the other threads searching more deeply in parallel with the main thread.

- Dan
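
A rough sketch of the helper-thread arrangement Dan describes, with helpers searching 1 or 2 ply beyond the nominal iteration depth through the shared hash table and their result substituted only in the rare case they finish first. Position, SearchResult, search_root() and the depth offsets are placeholders for illustration, not Dan's actual code:

// Illustrative lazy-SMP iteration: helpers search the same position 1-2 ply
// deeper via the shared transposition table; the main thread's result is
// used unless a helper happens to complete first.
#include <atomic>
#include <thread>
#include <vector>

struct Position;                                   // engine's position type (placeholder)
struct SearchResult { int move = 0, score = 0, depth = 0; };

// Placeholder: full root search to 'depth', sharing the global hash table;
// 'stop' is polled inside the search and aborts it when set.
SearchResult search_root(const Position& pos, int depth, std::atomic<bool>& stop);

SearchResult lazy_smp_iteration(const Position& pos, int depth, int n_threads) {
    std::atomic<bool> stop{false};
    std::vector<SearchResult> helper(n_threads - 1);
    std::vector<std::thread> workers;

    for (int i = 0; i < n_threads - 1; ++i)        // helpers go 1 or 2 ply deeper
        workers.emplace_back([&, i] { helper[i] = search_root(pos, depth + 1 + (i & 1), stop); });

    SearchResult best = search_root(pos, depth, stop);   // main thread at the nominal depth
    stop = true;                                          // abort any helper still running
    for (std::thread& t : workers) t.join();

    for (const SearchResult& h : helper)           // rare: a deeper helper finished in time
        if (h.move != 0 && h.depth > best.depth)
            best = h;
    return best;
}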
Joerg Oster
Posts: 939
Joined: Fri Mar 10, 2006 4:29 pm
Location: Germany
Full name: Jörg Oster

Re: Lazy SMP, part 2

Post by Joerg Oster »

So maybe 'quality' is the wrong word.
But let me explain how I see it.

Let's start with a standard search with 1 Thread. Every time we repeat searching the same position to a certain depth, with a certain alpha and beta and with the same aspiration window, the exact same tree is going to be searched. We get the same best move, the same node count, etc. We have a deterministic search. Of course, we all know this.

Here is my analogy.
Let's think of a single tree (I choose a cypress tree*) standing in a meadow. Absolutely no wind, not the slightest breeze! And let's assume it is not growing in the meantime. ;-)
This represents our search tree from above. Whenever we look at it, we see the exact same tree.
*Mainly due to this remarkable post by Chris W.

Now let's turn to an SMP search with 4 threads, YBW type.
Suddenly there is wind blowing over our meadow, and it blows from different directions, and our tree is swaying back and forth, from left to right, etc. Now we see a different tree whenever we look! This also helps to visualize why we don't get a perfect 4x speedup with 4 threads, not to speak of 8 and more.
So every time we start this SMP search, we get different node counts and different best moves, and/or we find them at different depths. Non-deterministic. Sometimes we are lucky and the best move is found earlier, sometimes not, or it is even found later than in our deterministic search from above. These 'sideways movements' swallow maybe 10-15% of our search time, and the remaining 85-90% results in a faster search. Thus the Elo gain we measure is mainly due to this speedup. You're right that it makes sense to measure time-to-depth for this kind of SMP search.

Now let's move on to the kind of SMP search (shared hash table, SHT) that Dan and Martin are experimenting with.
Suddenly we have four cypress trees on our meadow! And at first glance we realize that it wouldn't make sense to search the exact same tree four times. So slightly modifying the search depth, the aspiration window, or maybe even alpha and beta is a must. Otherwise we would gain almost nothing, maybe a small speedup at best.
So to speak, we create our own wind. And perhaps we even have some control over it by choosing the right modifications to our standard search. As Martin pointed out, there is most likely still room for improvement.
But of course you see the main difference: we spend more of our time on these 'sideways movements', and thus we cannot get the same speedup as a YBW search. Maybe 50-80% goes into widening the search and 20-50% into a speedup; I can only guess.

Assuming that I'm right, do you still think it makes sense to compare both approaches simply by time-to-depth?

Where do you think the Elo gains with this SHT search, measured by both of them, come from? Especially at those lower depths, where a YBW SMP search does not unfold its full strength!?


But thinking further, it also seems rather logical that sooner or later we get outsearched by the YBW engine, which leads me to my idea of combining both approaches.
Why not start with an SHT implementation and then switch to a YBW search? Of course, the depth at which we switch would depend on the engine: for a slow searcher, depth 10 or 12 might be right, whereas for a fast searcher like Stockfish it might be depth 17 or 18. My idea is that the SHT search will 'guide' the YBW search in the right direction.
Jörg Oster
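
To make the proposal concrete, a very rough sketch of the hybrid Jörg suggests: run the shared-hash-table (lazy) scheme up to some engine-dependent depth, then hand the remaining iterations to a conventional YBW split. It reuses the hypothetical lazy_smp_iteration() from the sketch further up; ybw_search_root() is likewise a placeholder, and the switch depth is only the kind of number Jörg mentions:

// Hypothetical hybrid iterative deepening: SHT/lazy phase at shallow depths
// to seed the hash table, then classic YBW splitting above a threshold.
SearchResult ybw_search_root(const Position& pos, int depth, int n_threads);  // placeholder

SearchResult hybrid_iterative_deepening(const Position& pos, int max_depth, int n_threads) {
    const int switch_depth = 12;     // engine dependent: ~10-12 for slow, ~17-18 for fast searchers
    SearchResult best{};
    for (int depth = 1; depth <= max_depth; ++depth) {
        if (depth <= switch_depth)
            best = lazy_smp_iteration(pos, depth, n_threads);   // SHT phase 'guides' the hash table
        else
            best = ybw_search_root(pos, depth, n_threads);      // YBW takes over
    }
    return best;
}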
Joerg Oster
Posts: 939
Joined: Fri Mar 10, 2006 4:29 pm
Location: Germany
Full name: Jörg Oster

Re: Lazy SMP, part 2

Post by Joerg Oster »

dchoman wrote: Bob is correct. Depth appears to be the only factor. Out of curiosity, I ran a couple of very quick tests this evening using fixed depth, so time usage is not an issue.

Test 1: 1000 games, 4 cpus vs. 1 cpu with a fixed depth of 7. In this case, I used the implementation I described in an earlier post where, in the 4 cpu case, some of the threads are searching 1 or 2 ply beyond the nominal iteration depth in the main thread. If one of these other threads happens, by chance, to complete before the main thread, the best move and score from that thread are substituted. This kind of substitution appears to be rare, but those other threads are contributing to the hash table all the time... so the average depth of hash table entries is greater than the iteration depth might indicate.

Result: 4 cpus had a 99% LOS with a nominal elo difference of 26 +/- 13

Test 2: 1000 games, 4 cpus vs. 1 cpu with a fixed depth of 7. In this case, I rigorously forced all threads to search a depth of no more than 7 ply.
- Dan
Assuming you have a deterministic search with 1 Thread, do you think Test 2 makes sense?
Please, see also my answer to Bob.
Jörg Oster
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Lazy SMP, part 2

Post by bob »

Joerg Oster wrote:So maybe 'quality' is the wrong word.
But let me explain how I see it.

Let's start with a standard search with 1 Thread. Every time we repeat searching the same position to a certain depth, with a certain alpha and beta and with the same aspiration window, the exact same tree is going to be searched. We get the same best move, the same node count, etc. We have a deterministic search. Of course, we all know this.

Here is my analogy.
Let's think of a single tree (I choose a cypress tree*) standing in a meadow. Absolutely no wind, not the slightest breeze! And let's assume it is not growing in the meantime. ;-)
This represents our search tree from above. Whenever we look at it, we see the exact same tree.
*Mainly due to this remarkable post by Chris W.

Now let's turn to an SMP search with 4 threads, YBW type.
Suddenly there is wind blowing over our meadow, and it blows from different directions, and our tree is swaying back and forth, from left to right, etc. Now we see a different tree whenever we look! This also helps to visualize why we don't get a perfect 4x speedup with 4 threads, not to speak of 8 and more.
So every time we start this SMP search, we get different node counts and different best moves, and/or we find them at different depths. Non-deterministic. Sometimes we are lucky and the best move is found earlier, sometimes not, or it is even found later than in our deterministic search from above. These 'sideways movements' swallow maybe 10-15% of our search time, and the remaining 85-90% results in a faster search. Thus the Elo gain we measure is mainly due to this speedup. You're right that it makes sense to measure time-to-depth for this kind of SMP search.

Now let's move on to the kind of SMP search (shared hash table, SHT) that Dan and Martin are experimenting with.
Suddenly we have four cypress trees on our meadow! And at first glance we realize that it wouldn't make sense to search the exact same tree four times. So slightly modifying the search depth, the aspiration window, or maybe even alpha and beta is a must. Otherwise we would gain almost nothing, maybe a small speedup at best.
So to speak, we create our own wind. And perhaps we even have some control over it by choosing the right modifications to our standard search. As Martin pointed out, there is most likely still room for improvement.
But of course you see the main difference: we spend more of our time on these 'sideways movements', and thus we cannot get the same speedup as a YBW search. Maybe 50-80% goes into widening the search and 20-50% into a speedup; I can only guess.

Assuming that I'm right, do you still think it makes sense to compare both approaches simply by time-to-depth?

Where do you think the Elo gains with this SHT search, measured by both of them, come from? Especially at those lower depths, where a YBW SMP search does not unfold its full strength!?


But thinking further, it also seems rather logical that sooner or later we get outsearched by the YBW engine, which leads me to my idea of combining both approaches.
Why not start with an SHT implementation and then switch to a YBW search? Of course, the depth at which we switch would depend on the engine: for a slow searcher, depth 10 or 12 might be right, whereas for a fast searcher like Stockfish it might be depth 17 or 18. My idea is that the SHT search will 'guide' the YBW search in the right direction.
Yes, and here is why...

Certainly oddball search ordering will on occasion find something that the serial search will not. An example is a serial search on fine #70. This is a 26-ply search plus captures to find that white wins a pawn with Kb1. How do programs find that move with LESS than a 26-ply search? Bad move ordering. If you order the moves perfectly, it will take 26 plies every last time...

The moral of that story is that on occasion, that extra search space finds something good. On other occasions, it slows you down. It MUST average out to nothing. Otherwise you could improve your serial search to accomplish the same thing.

Why not do this:

start two "threads". But with just one CPU. That SHOULD be stronger, if this idea is correct. But it won't be. If the duplicated searching helps, it will help whether the two threads run in parallel, or are interleaved on a single CPU.

Short answer, then, is "No, I don't believe that there is ANY improvement from parallel search, OTHER than searching the tree faster..."

This is somewhat like the conservation of mass/energy in physics. If two are stronger than one, then four must be stronger than two, and so forth... We know that is not going to happen with the TT approach to sharing/searching, as opposed to a traditional alpha/beta search with tree splitting.
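
Bob's control experiment can be approximated without any real threading, e.g. by interleaving a deeper 'helper' pass and the normal pass on one CPU through the same hash table. This reuses the search_root() placeholder from the earlier sketch; interleaving at iteration granularity is only a crude stand-in for true time-slicing. If the duplicated searching itself were the source of the gain, this single-CPU version should show it too:

// Crude single-CPU version of the two-'thread' experiment: each iteration,
// the shared hash table is first filled by a pass one ply deeper, then the
// normal pass runs, all on one CPU.
SearchResult interleaved_single_cpu(const Position& pos, int max_depth) {
    std::atomic<bool> stop{false};
    SearchResult best{};
    for (int depth = 1; depth <= max_depth; ++depth) {
        search_root(pos, depth + 1, stop);        // 'helper' pass, one ply deeper
        best = search_root(pos, depth, stop);     // 'main' pass at the nominal depth
    }
    return best;
}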