Hi Robert!
I understand what you mean and of course I have to agree with you.
I have one possible suggestion for why it might help to search with one thread to depth d and another to depth d+1. If your threads share the same history table, then you might get some artificial aging that will be beneficial and might help both the depth d search and the depth d+1 search.
I am just speculating, but it might be the case that you will actually look at quite a different tree than if you do LMR based on your history table.
I might be very wrong, and I have not finished my chess engine, so be nice to me.
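The aging idea above can be made concrete with a toy model. The sketch below is speculative and self-contained: the butterfly table layout, the "gravity" update formula, and the bonus values are all assumptions for illustration, not taken from any particular engine. The point is only that a second searcher's writes into a shared entry pull it back toward zero, which behaves like implicit aging.

```python
HISTORY_MAX = 16384

class SharedHistory:
    """Butterfly-style history table indexed by [from_sq][to_sq]."""
    def __init__(self):
        self.table = [[0] * 64 for _ in range(64)]

    def update(self, frm, to, bonus):
        # "History gravity" update: every write shrinks the existing entry
        # a little, so frequent traffic on an entry acts like aging.
        h = self.table[frm][to]
        self.table[frm][to] = h + bonus - h * abs(bonus) // HISTORY_MAX

# One searcher alone, repeatedly rewarding the same quiet move:
alone = SharedHistory()
for _ in range(100):
    alone.update(12, 28, 300)

# Same rewards, but a hypothetical depth d+1 helper keeps writing small
# penalties into the shared entry -- its traffic ages the other's data:
shared = SharedHistory()
for _ in range(100):
    shared.update(12, 28, 300)
    shared.update(12, 28, -50)
```

The shared entry ends up noticeably lower than the single-searcher entry, so move ordering (and any LMR decisions based on it) would genuinely differ between the two setups.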
Lazy SMP, part 2
Moderators: hgm, Rebel, chrisw
-
- Posts: 5106
- Joined: Tue Apr 29, 2008 4:27 pm
Re: Lazy SMP, part 2
If you do a serial search that is iterative, you will get your baseline. If another thread is always doing depth + 1, the main thread is going to benefit from higher-quality moves that were put in the hash table by the main thread from the previous iteration. Of course, we would also expect the performance to increase due to moves already being searched.

Pio wrote: Hi Robert!
I understand what you mean and of course I have to agree with you.
I have one possible suggestion for why it might help to search with one thread to depth d and another to depth d+1. If your threads share the same history table, then you might get some artificial aging that will be beneficial and might help both the depth d search and the depth d+1 search.
I am just speculating, but it might be the case that you will actually look at quite a different tree than if you do LMR based on your history table.
I might be very wrong and I have not finished my chess engine so be nice to me
Due to the way LMR works, anything discovered by the deeper search in the previous iteration is going to have a beneficial effect on the move ordering, and hence you might not reduce some moves you ordinarily would have.
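That interaction can be sketched as a toy reduction rule. The function name, thresholds, and inputs below are hypothetical, chosen only to show how a hash-move or history hit produced by the deeper search can rescue a move from reduction:

```python
def lmr_reduction(move_index, depth, is_hash_move, history_score):
    """Toy LMR rule: reduce late moves unless the shared tables vouch for them."""
    if depth < 3 or move_index < 4:
        return 0                    # never reduce shallow nodes or early moves
    if is_hash_move or history_score > 1000:
        return 0                    # the deeper helper search promoted this move
    return 1 + move_index // 8      # reduce later moves progressively more
```

A move at index 10 would normally be reduced by 2 plies here; once the depth+1 thread installs it as the hash move, the reduction disappears and the shallower search explores a different tree.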
Another way to do this on a 4-core machine, for example, is to start with a 1-, 2-, 3-, and 4-ply search. When a search completes, it frees up a thread, so you immediately start the next un-started depth. Let's assume the 1-ply search completes while the 2-, 3-, and 4-ply searches are still running - you would start a 5-ply search on the newly freed thread. The program reports that a 1-ply search is complete, all the accounting focuses on the 2-ply search at that point, and so on.
You can do the same thing with half-ply increments, since it may be that running 4 full-ply iterations at the same time is not very efficient.
I suspect that stopping the other threads and moving to the next iteration has almost the same impact as keeping them all running in the fashion I describe because the hash table essentially records the state anyway when you stop. So your way of doing this may be simpler and just as good.
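The depth-handout scheme above can be sketched with a shared counter that each worker bumps to claim the next un-started depth. This is an illustrative skeleton, not an engine: `fake_search` stands in for a real iterative search, and the depth cap of 8 is arbitrary.

```python
import threading

next_depth = 1
depth_lock = threading.Lock()
completed = []                 # list.append is atomic under CPython's GIL

def fake_search(depth):
    return depth               # placeholder for an alpha-beta search to 'depth'

def worker(max_depth):
    global next_depth
    while True:
        with depth_lock:
            depth = next_depth        # claim the next un-started depth
            next_depth += 1
        if depth > max_depth:
            return
        completed.append(fake_search(depth))

workers = [threading.Thread(target=worker, args=(8,)) for _ in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()
```

Four workers start on depths 1-4; as each finishes, it immediately claims the next higher depth, exactly as described, and every depth is searched exactly once.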
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
-
- Posts: 5106
- Joined: Tue Apr 29, 2008 4:27 pm
Re: Lazy SMP, part 2
Correction! I meant that the main thread would benefit from the depth+1 search from the previous iteration.

Don wrote: If you do a serial search that is iterative, you will get your baseline. If another thread is always doing depth + 1, the main thread is going to benefit from higher-quality moves that were put in the hash table by the main thread from the previous iteration.

Pio wrote: Hi Robert!
I understand what you mean and of course I have to agree with you.
I have one possible suggestion for why it might help to search with one thread to depth d and another to depth d+1. If your threads share the same history table, then you might get some artificial aging that will be beneficial and might help both the depth d search and the depth d+1 search.
I am just speculating, but it might be the case that you will actually look at quite a different tree than if you do LMR based on your history table.
I might be very wrong and I have not finished my chess engine so be nice to me
Of course we would also expect the performance to increase due to moves already being searched.
Due to the way LMR works, anything discovered by the deeper search in the previous iteration is going to have a beneficial effect on the move ordering, and hence you might not reduce some moves you ordinarily would have.
Another way to do this on a 4 core machine for example is start with a 1,2,3 and 4 ply search. When 1 search completes, it frees up a thread so immediately start the next un-started depth. Let's assume the 1 ply search completes and the 2,3 and 4 are still running - you would start a 5 ply search on the newly freed thread. The program reports that a 1 ply search is complete and all the accounting is focused on the 2 ply search at this point and so on.
You can do the same thing with half ply increments since it may be that 4 full ply iterations at the same time is not very efficient.
I suspect that stopping the other threads and moving to the next iteration has almost the same impact as keeping them all running in the fashion I describe because the hash table essentially records the state anyway when you stop. So your way of doing this may be simpler and just as good.
-
- Posts: 937
- Joined: Fri Mar 10, 2006 4:29 pm
- Location: Germany
Re: Lazy SMP, part 2
That perfectly describes what's happening in a YBW search, doesn't it?

bob wrote: The moral of that story is that on occasion, that extra search space finds something good. On other occasions, it slows you down. It MUST average out to nothing. Otherwise, you could improve your serial search to accomplish the same thing.
All in all no qualitative gain, only speedup.
But I still think this is not true for the SHT-approach ...
I'm afraid our opinions differ here, which is ok.
Jörg Oster
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Lazy SMP, part 2
My point is this. IF your approach works, one can do that SAME thing in a purely serial/sequential search. You just write two identical search functions, and call the first, which selects one move, searches one node (using the shared HT), and returns. You call the other search, which does the same thing. Then you call the first, which searches one more node (deeper or wider, it doesn't matter). Then back to the second. In essence, you are doing a parallel search by interleaving the nodes. If the shared TT actually improves quality, so will this.

Joerg Oster wrote: That perfectly describes what's happening in a YBW search, doesn't it?

bob wrote: The moral of that story is that on occasion, that extra search space finds something good. On other occasions, it slows you down. It MUST average out to nothing. Otherwise, you could improve your serial search to accomplish the same thing.
All in all no qualitative gain, only speedup.
But I still think this is not true for the SHT-approach ...
I'm afraid our opinions differ here, which is ok.
I simply do not believe in the concept, myself.
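The interleaving argument can be mimicked with two generator-based toy searches stepping one "node" at a time over a shared table. Everything here is schematic and assumed for illustration: positions are just string labels, and the table stores only a depth per position.

```python
shared_tt = {}   # position -> depth of the entry stored for it

def toy_search(name, positions, depth):
    """Yield once per node visited, probing and storing in the shared TT."""
    for pos in positions:
        if shared_tt.get(pos, -1) >= depth:
            yield (name, pos, "tt hit")    # a deep-enough entry already exists
        else:
            shared_tt[pos] = depth         # pretend we searched this node
            yield (name, pos, "searched")

helper = toy_search("d+1", ["p1", "p2", "p3"], depth=7)
main   = toy_search("d",   ["p1", "p2", "p3"], depth=6)

# Strictly alternate single nodes between the two searches -- one serial
# program doing exactly what two lazy-SMP threads would do in parallel:
log = []
for helper_step, main_step in zip(helper, main):
    log.append(helper_step)
    log.append(main_step)
```

Because the helper's deeper entries land in the table first, every node of the main search becomes a TT hit, which is the whole claimed benefit; and since this runs in a single thread, any such gain is in principle available to a serial search too.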
-
- Posts: 5106
- Joined: Tue Apr 29, 2008 4:27 pm
Re: Lazy SMP, part 2
Or better yet, if there is something there, try to figure out what the principle is so that you can take advantage of it more purposefully.

bob wrote: My point is this. IF your approach works, one can do that SAME thing in a purely serial/sequential search. You just write two identical search functions, and call the first, which selects one move, searches one node (using the shared HT), and returns. You call the other search, which does the same thing. Then you call the first, which searches one more node (deeper or wider, it doesn't matter). Then back to the second. In essence, you are doing a parallel search by interleaving the nodes. If the shared TT actually improves quality, so will this.

Joerg Oster wrote: That perfectly describes what's happening in a YBW search, doesn't it?

bob wrote: The moral of that story is that on occasion, that extra search space finds something good. On other occasions, it slows you down. It MUST average out to nothing. Otherwise, you could improve your serial search to accomplish the same thing.
All in all no qualitative gain, only speedup.
But I still think this is not true for the SHT-approach ...
I'm afraid our opinions differ here, which is ok.
I simply do not believe in the concept, myself.
-
- Posts: 171
- Joined: Wed Dec 28, 2011 8:44 pm
- Location: United States
TurboBoost complications
Well, I discovered at least one issue that might have affected my results... Turbo-boost. I realized that Turbo-boost may be affecting my tests of multiple cores... both the time-to-depth test and the game-playing tests. My chip is an i7, so when I ran the single-core time-to-depth test, it was 15-20% faster than it would have been if the chip had been fully loaded. So the relative improvement in time to depth for multiple threads is actually somewhat better than my first post indicates.
On the game results, the interpretation is more difficult: in the single-thread case, all four cores are running all the time because I am playing four games in parallel, but in the four-thread case, one core is running for the opponents (and therefore might get turbo-boost) while all four are running for my program. In principle, this reduces the rating of the four-thread version relative to the single-thread version, because the opponents are stronger for the four-thread version, but given the tiny time-slices the opponents are playing in, I am not sure this matters.
The only way to tell is to try to disable TurboBoost. Does anyone know a safe way to do this? There is no option in the BIOS. Here is a website claiming to have an approach, but I don't know anything about the tool they are referencing.
http://luisjdominguezp.tumblr.com/post/ ... t-in-linux
Are there any other adaptive speed changes the i7 does that might also affect the results? What do others do when testing to avoid problems like these?
- Dan
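For what it's worth, on Linux the usual knobs live in sysfs rather than in a separate tool. Which file exists depends on the CPU frequency driver in use, so treat these as paths to check rather than guaranteed to be present on any given system:

```shell
# intel_pstate driver (common on i7 systems): writing 1 disables Turbo Boost
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo

# acpi-cpufreq driver: writing 0 disables boost
echo 0 | sudo tee /sys/devices/system/cpu/cpufreq/boost

# Pinning the governor removes other frequency-scaling noise while testing
sudo cpupower frequency-set -g performance
```

These settings do not persist across reboots, which is convenient for testing: disable turbo for the SMP measurements, then reboot to get it back for real games.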
-
- Posts: 937
- Joined: Fri Mar 10, 2006 4:29 pm
- Location: Germany
Re: TurboBoost complications
Hi Dan,
are you really sure there is no such option in the BIOS?
What mainboard do you have? Maybe you have to switch into an 'Advanced Mode' to see all the available options?
Maybe it will help to change the CPU settings to manual instead of using the automatic settings. Good luck.
Jörg Oster
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: TurboBoost complications
You need to disable that for sensible testing. It's ok to run with it in real games, as when it kicks in you get a performance boost. But it does screw up SMP comparisons significantly, because you now have yet another variable in the equation.

dchoman wrote: Well, I discovered at least one issue that might have affected my results... Turbo-boost. I realized that Turbo-boost may be affecting my tests of multiple cores... both the time-to-depth test and the game-playing tests. My chip is an i7, so when I ran the single-core time-to-depth test, it was 15-20% faster than it would have been if the chip had been fully loaded. So the relative improvement in time to depth for multiple threads is actually somewhat better than my first post indicates.
On the game results the interpretation is more difficult, in the single-thread case all four cores are running all the time because I am playing four games in parallel, but in the four-thread case, one core is running for the opponents (and therefore might get turbo-boost) while all four are running for my program. In principle, this reduces the rating of the four-thread version relative to the single-thread version because the opponents are stronger for the four-thread version, but given the tiny time-slices the opponents are playing in, I am not sure this matters.
The only way to tell is to try to disable TurboBoost. Anyone know a safe way to do this? There is no option in the BIOS. Here is a website claiming to have an approach, but I don't know anything about the tool they are referencing.
http://luisjdominguezp.tumblr.com/post/ ... t-in-linux
Are there any other adaptive speed changes the i7 does that might also affect the results? What do others do when testing to avoid problems like these?
- Dan
-
- Posts: 5106
- Joined: Tue Apr 29, 2008 4:27 pm
Re: TurboBoost complications
So the rating agencies should have this disabled. Do you have a sense of whether it would benefit MP or SP more, or is it just random?

bob wrote: You need to disable that for sensible testing. It's ok to run with it in real games, as when it kicks in you get a performance boost. But it does screw up SMP comparisons significantly, because you now have yet another variable in the equation.

dchoman wrote: Well, I discovered at least one issue that might have affected my results... Turbo-boost. I realized that Turbo-boost may be affecting my tests of multiple cores... both the time-to-depth test and the game-playing tests. My chip is an i7, so when I ran the single-core time-to-depth test, it was 15-20% faster than it would have been if the chip had been fully loaded. So the relative improvement in time to depth for multiple threads is actually somewhat better than my first post indicates.
On the game results the interpretation is more difficult, in the single-thread case all four cores are running all the time because I am playing four games in parallel, but in the four-thread case, one core is running for the opponents (and therefore might get turbo-boost) while all four are running for my program. In principle, this reduces the rating of the four-thread version relative to the single-thread version because the opponents are stronger for the four-thread version, but given the tiny time-slices the opponents are playing in, I am not sure this matters.
The only way to tell is to try to disable TurboBoost. Anyone know a safe way to do this? There is no option in the BIOS. Here is a website claiming to have an approach, but I don't know anything about the tool they are referencing.
http://luisjdominguezp.tumblr.com/post/ ... t-in-linux
Are there any other adaptive speed changes the i7 does that might also affect the results? What do others do when testing to avoid problems like these?
- Dan