Lazy SMP, part 3

dchoman · Post by **dchoman** » Sat Mar 09, 2013 3:23 pm

In the last lazy smp thread, I was asking about people's experience with a root-only, shared hash smp approach... where the parallel threads communicate only through the hash table. I was trying various ideas to improve this like searching some of the threads at iteration depth+1 rather than just iteration depth, and this approach was at least not terrible and seemed to have some promise. I also suggested searching the even/odd moves in even/odd threads at the root node to avoid duplication of work, but I was only applying this after the first move had been searched in parallel in each thread. The overall improvement from searching alternating moves after the first move was 5-10% in speed.

I recently fixed my linux box to turn-off turbo-boost, so that I could properly compare time-to-depth between different implementations, and the best I could get for 2 threads was a 1.57 time-to-depth improvement, 3 threads gave a 1.94 speed-up, and 4 threads gave a 2.13 speed-up. All with the implementation described above and in the previous thread.

This morning I woke up and realized that the restriction on searching the first move in all threads was perhaps not necessary. At least I cannot think of a good reason for it... I was just doing it because other SMP implementations rely on waiting for the first move to complete before splitting, but I realized this isn't really necessary at the root (I think). So I just removed my !first restriction on searching the even/odd moves in even/odd threads, and the improvement was dramatic. Here is the time-to-depth improvement I now get for 2, 3, and 4 threads compared to 1 thread.

For 317 positions, I get the following average time to depths
1 cores: NPS = 605952 <depth> = 12.0 <time to depth> = 0.430
2 cores: NPS = 1208744 <depth> = 12.0 <time to depth> = 0.225
3 cores: NPS = 1823264 <depth> = 12.1 <time to depth> = 0.184
4 cores: NPS = 2396858 <depth> = 12.1 <time to depth> = 0.160

So, comparing these to 1 core:
2 cores gives a 1.91 speed-up to finish iteration depth 12.
3 cores gives a 2.34 speed-up
4 cores gives a 2.69 speed-up

The 3rd and 4th threads are actually searching iteration depth+1, again with even/odd threads searching the even/odd moves at the root, which is why the average depth searched is a bit more than 12 in those cases. In principle I might speed the whole thing up further by having each of 3 threads search 1/3rd of the root moves and each of 4 threads searching 1/4th of the root moves, but that is more work to implement, and I haven't tried it yet.

- Dan

Daniel Shawul · Post by **Daniel Shawul** » Sat Mar 09, 2013 4:41 pm

This morning I woke up and realized that the restriction on searching the first move in all threads was perhaps not necessary. At least I cannot think of a good reason for it... I was just doing it because other SMP implementations rely on waiting for the first move to complete before splitting, but I realized this isn't really necessary at the root (I think).

You need to search the first move and set the lower bound to lower search overhead in PVS search. Here you are letting each thread set its own alpha so it is probably not a good idea but your result of 1.91x speedup is somewhat surprising. Maybe your PV don't change that much from depth to depth so that waiting for the first move's bound is not that important. If you reduce too much, you should try iterative deepening by 2 plies for the single threaded search to see if that improves things. That will make sure searching at depth and depth+1 simultaneously is not a big advantage for the parallel search. I seriously doubt that your results will translate to other engines, because i think there is something your parallel search improves a lot other than the occasional early fail highs you get from parallel search.

The 3rd and 4th threads are actually searching iteration depth+1, again with even/odd threads searching the even/odd moves at the root, which is why the average depth searched is a bit more than 12 in those cases. In principle I might speed the whole thing up further by having each of 3 threads search 1/3rd of the root moves and each of 4 threads searching 1/4th of the root moves, but that is more work to implement, and I haven't tried it yet.

Why don't you just use a lock on the root move list? That way a thread will search as much as it could and not be limited with an artificial limit of 1/n moves, that will incur some idle time anyway. If one thread happens to get the difficult ones, the others wait for it to finish. The fact that you now search the first and second with the same alpha may alleviate this problem because both moves are equally difficult. Anyway it is really simple to implement this with spinlocks that will let you get a move as fast as possible.

dchoman · Post by **dchoman** » Sat Mar 09, 2013 5:00 pm

Daniel Shawul wrote: Why don't you just use a lock on the root move list? That way a thread will search as much as it could and not be limited with an artificial limit of 1/n moves, that will incur some idle time anyway. If one thread happens to get the difficult ones, the others wait for it to finish. The fact that you now search the first and second with the same alpha may alleviate this problem because both moves are equally difficult. Anyway it is really simple to implement this with spinlocks that will let you get a move as fast as possible.

Good point, I agree, a lock on the movelist is better... the way this developed it was simpler at first to just search the alternate moves in two threads, but this is not the best solution generally (or even for two threads).

- Dan

dchoman · Post by **dchoman** » Sat Mar 09, 2013 5:03 pm

dchoman wrote:In the last lazy smp thread, I was asking about people's experience with a root-only, shared hash smp approach... where the parallel threads communicate only through the hash table. I was trying various ideas to improve this like searching some of the threads at iteration depth+1 rather than just iteration depth, and this approach was at least not terrible and seemed to have some promise. I also suggested searching the even/odd moves in even/odd threads at the root node to avoid duplication of work, but I was only applying this after the first move had been searched in parallel in each thread. The overall improvement from searching alternating moves after the first move was 5-10% in speed.

I recently fixed my linux box to turn-off turbo-boost, so that I could properly compare time-to-depth between different implementations, and the best I could get for 2 threads was a 1.57 time-to-depth improvement, 3 threads gave a 1.94 speed-up, and 4 threads gave a 2.13 speed-up. All with the implementation described above and in the previous thread.

This morning I woke up and realized that the restriction on searching the first move in all threads was perhaps not necessary. At least I cannot think of a good reason for it... I was just doing it because other SMP implementations rely on waiting for the first move to complete before splitting, but I realized this isn't really necessary at the root (I think). So I just removed my !first restriction on searching the even/odd moves in even/odd threads, and the improvement was dramatic. Here is the time-to-depth improvement I now get for 2, 3, and 4 threads compared to 1 thread.

For 317 positions, I get the following average time to depths
1 cores: NPS = 605952 <depth> = 12.0 <time to depth> = 0.430
2 cores: NPS = 1208744 <depth> = 12.0 <time to depth> = 0.225
3 cores: NPS = 1823264 <depth> = 12.1 <time to depth> = 0.184
4 cores: NPS = 2396858 <depth> = 12.1 <time to depth> = 0.160

So, comparing these to 1 core:
2 cores gives a 1.91 speed-up to finish iteration depth 12.
3 cores gives a 2.34 speed-up
4 cores gives a 2.69 speed-up

The 3rd and 4th threads are actually searching iteration depth+1, again with even/odd threads searching the even/odd moves at the root, which is why the average depth searched is a bit more than 12 in those cases. In principle I might speed the whole thing up further by having each of 3 threads search 1/3rd of the root moves and each of 4 threads searching 1/4th of the root moves, but that is more work to implement, and I haven't tried it yet.

- Dan

My apologies for my enthusiastic post... I am now running a large number of games and something is up. While this looked great in the test positions, it is actually playing weaker than the previous (slower) implementation. Probably it has something has something to do with having the wrong alpha limit when searching some moves.... or just another unanticipated bug somewhere. I need to give this some thought to sort it out.

- Dan

dchoman · Post by **dchoman** » Sat Mar 09, 2013 11:21 pm

dchoman wrote: My apologies for my enthusiastic post... I am now running a large number of games and something is up. While this looked great in the test positions, it is actually playing weaker than the previous (slower) implementation. Probably it has something has something to do with having the wrong alpha limit when searching some moves.... or just another unanticipated bug somewhere. I need to give this some thought to sort it out.

- Dan

Yes, it turns out that the 'improvement' was mostly a bug that caused the odd numbered threads to ignore the hash move entirely (rather than search it later, as intended). This naturally sped things up quite a bit, and while the testsuite results looked great and gave no hint of the missing moves, games showed it pretty clearly. My apologies again for not testing a 'simple' change much more thoroughly before getting excited.

- Dan

Daniel Shawul · Post by **Daniel Shawul** » Sun Mar 10, 2013 12:42 am

dchoman wrote:
Daniel Shawul wrote: Why don't you just use a lock on the root move list? That way a thread will search as much as it could and not be limited with an artificial limit of 1/n moves, that will incur some idle time anyway. If one thread happens to get the difficult ones, the others wait for it to finish. The fact that you now search the first and second with the same alpha may alleviate this problem because both moves are equally difficult. Anyway it is really simple to implement this with spinlocks that will let you get a move as fast as possible.
Good point, I agree, a lock on the movelist is better... the way this developed it was simpler at first to just search the alternate moves in two threads, but this is not the best solution generally (or even for two threads).

- Dan

If you have to implement that, maybe you can duplicate the movelist such that you can search on both depths simultaneously with any thread. Order the moves like:

Code: Select all

 e2e4 12, e2e4 13, d2d4 12, d2d4 13 ..... a2a3 12, a2a3 13

I think this might also be an interesting idea even for sequential search. It can help in time management because at any time you stop the moves at the front of the list are searched 2 plies deeper than those not searched yet. With the normal iterative deepening they are searched only 1 ply deeper...

Lazy SMP, part 3

Lazy SMP, part 3

Re: Lazy SMP, part 3

Re: Lazy SMP, part 3

Re: Lazy SMP, part 3

Re: Lazy SMP, part 3

Re: Lazy SMP, part 3