SMP search in Viper and idea about search in cluster system

nkg114mc · Post by **nkg114mc** » Fri Feb 22, 2013 8:25 am

Hi all:

I am studying about the parallel search implementation of the engine Viper by Tord Romstad. I have some questions about the implementation of YBW by Tord in Viper.

First is about Transposition Table and History Table. I notice that Viper did not use any lock on the TT (including Pawn Table) or History Table. Although the smp_search will not call the tt_store or tt_retrieve, the subtree of of smp_search will (because smp_search will call search to start an ordinary search later on). Is there any danger of synchronization problems when visit or revising TT and history table during search in this case?

The second is a question about parallel search in cluster system. Viper implements parallel search using Thread, while some other ones use process, specially the programs running in cluster system. Let's suppose that there is a cluster with 8 nodes, and each node contains 8 cores. It is easy to launch a parallel search with 64 processes to fully exploit the whole cluster system.
However, I have another way to do the search: make the splittings to two levels: process-level and thread-level. There is only 8 processes (equals to the number of nodes in the cluster), and each process have 8 threads (equals to the number of cores in each cluster node). In the first level, the search is split into 8 "level1-subtask"s, one for each process. When each process get their own "level1-subtask", it splits that into 8 smaller "level2-subtask"s, one for each thread. Thus, there are also 64 threads in the whole system. My question is: what's the difference between this "two-level architecture" implementation and the conventional "all process" implementation? Did someone once try this before?

Thanks!

Daniel Shawul · Post by **Daniel Shawul** » Fri Feb 22, 2013 3:05 pm

First is about Transposition Table and History Table. I notice that Viper did not use any lock on the TT (including Pawn Table) or History Table. Although the smp_search will not call the tt_store or tt_retrieve, the subtree of of smp_search will (because smp_search will call search to start an ordinary search later on). Is there any danger of synchronization problems when visit or revising TT and history table during search in this case?

There is a danger of collision so you either have to lock it which will be fatal to performance. Most don't use locks but just check validity of the move which viper probably does. You can also use one of the lockless schemes.

The second is a question about parallel search in cluster system. Viper implements parallel search using Thread, while some other ones use process, specially the programs running in cluster system. Let's suppose that there is a cluster with 8 nodes, and each node contains 8 cores. It is easy to launch a parallel search with 64 processes to fully exploit the whole cluster system.

Viper's or most other engines parallel implementations are for shared memeory systems. There are a handfull of cluster engines among which my engine and Toga are open source. You can look at my engine Scorpio's source code at github. Efficiency is massively dependent on the interconnect used.

However, I have another way to do the search: make the splittings to two levels: process-level and thread-level. There is only 8 processes (equals to the number of nodes in the cluster), and each process have 8 threads (equals to the number of cores in each cluster node). In the first level, the search is split into 8 "level1-subtask"s, one for each process. When each process get their own "level1-subtask", it splits that into 8 smaller "level2-subtask"s, one for each thread. Thus, there are also 64 threads in the whole system. My question is: what's the difference between this "two-level architecture" implementation and the conventional "all process" implementation? Did someone once try this before?

That is exactly what I do. At the time I wrote it I thought it was a unique idea since no engine i know of does it. Each node launches one process, and those processes automatically detect and launch enough threads to engage all cores fully. SMP implementation will be superior to message passing parallel search even if you keep all things same including a shared transposition table. MPI may use shared memory to pass messages when it knows the processes are on an SMP node but that will never be as efficient as the true SMP implementation. You can test these two level architectures you mentioned in scorpio by running an all process case (mpirun -n N) and a mixed process-thread case (mpirun -pernode -mt auto) .

hgm · Post by **hgm** » Fri Feb 22, 2013 4:03 pm

I never could pass the hurdle of actually producing a speedup with SMP. (I only worked on it for 3 days, though.) What I tried was running several engine processes with the hash table in a shared memory area. The idea was that one process would benefit from hash hits on searches done by the other, and people did report a 1.7-times speedup from such a completely unsynchronized approach. But what I got was an enormous slowdown; when I switched on the second process, (which was in analyze mode on the root position), it took the first far longer (like 1.5 times) to reach the same search depth.

All kind of clever schemes I had in mind to coordinate the search efforts did improve things, but never to the point where the master process would reach a given depth faster than when the other process was switched off. (And it was not that they were competing for CPU; when I ran them without sharing the memory area, they were both as fast as ever.)

Edmund · Post by **Edmund** » Fri Feb 22, 2013 4:42 pm

Daniel Shawul wrote:
However, I have another way to do the search: make the splittings to two levels: process-level and thread-level. There is only 8 processes (equals to the number of nodes in the cluster), and each process have 8 threads (equals to the number of cores in each cluster node). In the first level, the search is split into 8 "level1-subtask"s, one for each process. When each process get their own "level1-subtask", it splits that into 8 smaller "level2-subtask"s, one for each thread. Thus, there are also 64 threads in the whole system. My question is: what's the difference between this "two-level architecture" implementation and the conventional "all process" implementation? Did someone once try this before?
That is exactly what I do. At the time I wrote it I thought it was a unique idea since no engine i know of does it. Each node launches one process, and those processes automatically detect and launch enough threads to engage all cores fully. SMP implementation will be superior to message passing parallel search even if you keep all things same including a shared transposition table. MPI may use shared memory to pass messages when it knows the processes are on an SMP node but that will never be as efficient as the true SMP implementation. You can test these two level architectures you mentioned in scorpio by running an all process case (mpirun -n N) and a mixed process-thread case (mpirun -pernode -mt auto) .

Given the processes share the same memory:
If you compare multiple threads model with SMP implementation and multiple processor model with SMP implementation then the two options have similar speeds - right?

I know, its not part of the question from the OP. I just want to understand your terminology.

Daniel Shawul · Post by **Daniel Shawul** » Fri Feb 22, 2013 5:02 pm

hgm wrote:I never could pass the hurdle of actually producing a speedup with SMP. (I only worked on it for 3 days, though.) What I tried was running several engine processes with the hash table in a shared memory area. The idea was that one process would benefit from hash hits on searches done by the other, and people did report a 1.7-times speedup from such a completely unsynchronized approach. But what I got was an enormous slowdown; when I switched on the second process, (which was in analyze mode on the root position), it took the first far longer (like 1.5 times) to reach the same search depth.

All kind of clever schemes I had in mind to coordinate the search efforts did improve things, but never to the point where the master process would reach a given depth faster than when the other process was switched off. (And it was not that they were competing for CPU; when I ran them without sharing the memory area, they were both as fast as ever.)

I don't think 1.7x speed up for shared hash table idea is a correct figure,at least not for todays search depths. Even YBW searchers have difficult getting that much. The shared TT idea is embarrassingly parallel so nps scaling should be perfect. I would still guess something to be wrong in your setup for you to observe slowdowns, or it may be simply due to an anomalous test position. IMO if you don't lock the TT, then there should be nothing else to slow it down, and more TT hits should result in a faster search.

hgm · Post by **hgm** » Fri Feb 22, 2013 5:18 pm

Yes, that is what I thought too. I was completely flabbergasted by this. But after 3 days of trying I ran out of time, because the tourney started. (I did this in the plane to Japan for the ICGA Olympiad, and the two days in my hotel I had there before the Xiangqi tourney started. After the tourney the urgency for doing it had disappeared, and so far I never returned to it.)

Daniel Shawul · Post by **Daniel Shawul** » Fri Feb 22, 2013 5:19 pm

Edmund wrote: Given the processes share the same memory:
If you compare multiple threads model with SMP implementation and multiple processor model with SMP implementation then the two options have similar speeds - right?

Yes it shouldn't matter if you use processes or threads as long as you do the same SMP implementation. But we usually use processes for distributed computing , and that implementation will be slower than an SMP implementation with threads or processes. For parallelizing game tree search, it is worth it to have a separate SMP implementation but it gets complicated implementing the mixed SMP and MPI implemenations. Even some MPI calls are not thread safe, so you should make sure a designated thread say thread 0 does the message probing or use locks. When a thread runs out of work, now it can also get a job from another node which makes the implementation even more difficult. I am not sure if I solved all the nuisances but i recall there were some. In most other distributed computing, most people don't have a separate optimized SMP implemntation and rely on MPI alone.

bob · Post by **bob** » Fri Feb 22, 2013 7:00 pm

Edmund wrote:
Daniel Shawul wrote:
However, I have another way to do the search: make the splittings to two levels: process-level and thread-level. There is only 8 processes (equals to the number of nodes in the cluster), and each process have 8 threads (equals to the number of cores in each cluster node). In the first level, the search is split into 8 "level1-subtask"s, one for each process. When each process get their own "level1-subtask", it splits that into 8 smaller "level2-subtask"s, one for each thread. Thus, there are also 64 threads in the whole system. My question is: what's the difference between this "two-level architecture" implementation and the conventional "all process" implementation? Did someone once try this before?
That is exactly what I do. At the time I wrote it I thought it was a unique idea since no engine i know of does it. Each node launches one process, and those processes automatically detect and launch enough threads to engage all cores fully. SMP implementation will be superior to message passing parallel search even if you keep all things same including a shared transposition table. MPI may use shared memory to pass messages when it knows the processes are on an SMP node but that will never be as efficient as the true SMP implementation. You can test these two level architectures you mentioned in scorpio by running an all process case (mpirun -n N) and a mixed process-thread case (mpirun -pernode -mt auto) .
Given the processes share the same memory:
If you compare multiple threads model with SMP implementation and multiple processor model with SMP implementation then the two options have similar speeds - right?

I know, its not part of the question from the OP. I just want to understand your terminology.

There is no significant difference. Crafty started with threads. Then threads became an issue for a while due to various pthread implementations, so I switched to processes, using the system V shared memory calls (shmget, shmat, etc). Once pthreads became solid and portable again, I switched back. Took very little time. No performance issues at all. If you like egtbs, the egtb.cpp code IS thread-safe, and has the cache implemented as a shared object. If you use processes, egtb will allocate a different cache for each process, and the I/O will be duplicated for each cache fill...

bob · Post by **bob** » Fri Feb 22, 2013 9:15 pm

Daniel Shawul wrote:
Edmund wrote: Given the processes share the same memory:
If you compare multiple threads model with SMP implementation and multiple processor model with SMP implementation then the two options have similar speeds - right?
Yes it shouldn't matter if you use processes or threads as long as you do the same SMP implementation. But we usually use processes for distributed computing , and that implementation will be slower than an SMP implementation with threads or processes. For parallelizing game tree search, it is worth it to have a separate SMP implementation but it gets complicated implementing the mixed SMP and MPI implemenations. Even some MPI calls are not thread safe, so you should make sure a designated thread say thread 0 does the message probing or use locks. When a thread runs out of work, now it can also get a job from another node which makes the implementation even more difficult. I am not sure if I solved all the nuisances but i recall there were some. In most other distributed computing, most people don't have a separate optimized SMP implemntation and rely on MPI alone.

Even deep blue did this. shallow-level splits were across nodes on the SP box using message-passing, then each SP node had multiple chess processors to use in a more traditional parallel search model.

I could not imagine someone starting 8 individual processes on an 8-core node and then using message-passing from some master process to keep them busy. That's sloppy enough I wouldn't even consider it for a quick-and-dirty implementation.

nkg114mc · Post by **nkg114mc** » Sat Feb 23, 2013 1:15 am

Daniel Shawul wrote:
First is about Transposition Table and History Table. I notice that Viper did not use any lock on the TT (including Pawn Table) or History Table. Although the smp_search will not call the tt_store or tt_retrieve, the subtree of of smp_search will (because smp_search will call search to start an ordinary search later on). Is there any danger of synchronization problems when visit or revising TT and history table during search in this case?
There is a danger of collision so you either have to lock it which will be fatal to performance. Most don't use locks but just check validity of the move which viper probably does. You can also use one of the lockless schemes.
The second is a question about parallel search in cluster system. Viper implements parallel search using Thread, while some other ones use process, specially the programs running in cluster system. Let's suppose that there is a cluster with 8 nodes, and each node contains 8 cores. It is easy to launch a parallel search with 64 processes to fully exploit the whole cluster system.
Viper's or most other engines parallel implementations are for shared memeory systems. There are a handfull of cluster engines among which my engine and Toga are open source. You can look at my engine Scorpio's source code at github. Efficiency is massively dependent on the interconnect used.
However, I have another way to do the search: make the splittings to two levels: process-level and thread-level. There is only 8 processes (equals to the number of nodes in the cluster), and each process have 8 threads (equals to the number of cores in each cluster node). In the first level, the search is split into 8 "level1-subtask"s, one for each process. When each process get their own "level1-subtask", it splits that into 8 smaller "level2-subtask"s, one for each thread. Thus, there are also 64 threads in the whole system. My question is: what's the difference between this "two-level architecture" implementation and the conventional "all process" implementation? Did someone once try this before?
That is exactly what I do. At the time I wrote it I thought it was a unique idea since no engine i know of does it. Each node launches one process, and those processes automatically detect and launch enough threads to engage all cores fully. SMP implementation will be superior to message passing parallel search even if you keep all things same including a shared transposition table. MPI may use shared memory to pass messages when it knows the processes are on an SMP node but that will never be as efficient as the true SMP implementation. You can test these two level architectures you mentioned in scorpio by running an all process case (mpirun -n N) and a mixed process-thread case (mpirun -pernode -mt auto) .

Hi Daniel:

Thanks a lot for your reply, and also your work on Scorpio! I have download the code of Scropio the day before yesterday. Because someone said that Sungorus is very elo/code-lines
efficient engine, so I searched the Sungorus, and also found Scropio on Mr.Hair's website. Scorpio is the first engine I have seen so far, that implement both process and thread artitecture in its code. Implementing a thread-process engine for cluster system is something I want to do for several years. I will study the code of Scropio. Thanks again!

By the way: Have you seen the maximum improvements when run the Scripio on a larger scale hardware? I know that there is definitly speed-up when using more node machines and more cores, but it will also search more positions. What I concerned is about the performance. Because nowadays, the strength of hardware is much stronger than what it was in the years of Deep Blue. And also a lot of new techniqes and ideas has make the search more efficient. A strong engine running a multi-core CPU today is much stronger than Deep Blue in 1997. But I still curious that: How much could the search be benefited by only increasing the scale of hardware (like the number of node machines in the cluster)? What's the top-line performance we can get from a larger and larger scale parallel search when playing chess?

SMP search in Viper and idea about search in cluster system

SMP search in Viper and idea about search in cluster system

Re: SMP search in Viper and idea about search in cluster sys

Re: SMP search in Viper and idea about search in cluster sys

Re: SMP search in Viper and idea about search in cluster sys

Re: SMP search in Viper and idea about search in cluster sys

Re: SMP search in Viper and idea about search in cluster sys

Re: SMP search in Viper and idea about search in cluster sys

Re: SMP search in Viper and idea about search in cluster sys

Re: SMP search in Viper and idea about search in cluster sys

Re: SMP search in Viper and idea about search in cluster sys