SMP search in Viper and idea about search in cluster system

nkg114mc · Post by **nkg114mc** » Sat Feb 23, 2013 1:32 am

bob wrote:
Daniel Shawul wrote:
Edmund wrote: Given the processes share the same memory:
If you compare a multi-threaded model with an SMP implementation against a multi-process model with the same SMP implementation, then the two options have similar speeds, right?
Yes, it shouldn't matter whether you use processes or threads as long as you do the same SMP implementation. But we usually use processes for distributed computing, and that implementation will be slower than an SMP implementation with threads or processes. For parallelizing game tree search, it is worth having a separate SMP implementation, but implementing mixed SMP and MPI gets complicated. Some MPI calls are not even thread safe, so you should make sure that a designated thread, say thread 0, does the message probing, or else use locks. When a thread runs out of work, it can now also get a job from another node, which makes the implementation even more difficult. I am not sure if I solved all the nuisances, but I recall there were some. In most other distributed computing, people don't have a separate optimized SMP implementation and rely on MPI alone.
Even Deep Blue did this. Shallow-level splits were done across nodes on the SP box using message passing, and each SP node then had multiple chess processors to use in a more traditional parallel search model.

I could not imagine someone starting 8 individual processes on an 8-core node and then using message-passing from some master process to keep them busy. That's sloppy enough I wouldn't even consider it for a quick-and-dirty implementation.

Hi Dr. Hyatt:

Thanks for the reply! I think you probably have a lot of experience testing the engine on cluster systems. Do you have any curve of performance with respect to the number of node machines in a cluster when testing Crafty? (Suppose all the node machines are identical.) As in the reply above, I am just curious about how well we can do today by only increasing the scale of the hardware.

hgm · Post by **hgm** » Sat Feb 23, 2013 10:05 am

Crafty is not a cluster engine, is it? Just SMP.

nkg114mc · Post by **nkg114mc** » Sat Feb 23, 2013 12:21 pm

hgm wrote: Crafty is not a cluster engine, is it? Just SMP.

Hi Mr. Muller,

Thanks for the reply! Yes, at least the current Crafty is for SMP (with the Dynamic Tree Splitting algorithm, isn't it?). But as I remember, Dr. Hyatt once changed its implementation to use processes. Since a process-based implementation is more scalable, Dr. Hyatt might have some experience of how much his program improved from increasing the number of processes. I am just curious about how important the scale of hardware is to today's computer chess programs. Do you have any ideas or advice about this, Mr. Muller?

Thanks!

bob · Post by **bob** » Sat Feb 23, 2013 10:41 pm

nkg114mc wrote:
bob wrote:
Daniel Shawul wrote:
Edmund wrote: Given the processes share the same memory:
If you compare a multi-threaded model with an SMP implementation against a multi-process model with the same SMP implementation, then the two options have similar speeds, right?
Yes, it shouldn't matter whether you use processes or threads as long as you do the same SMP implementation. But we usually use processes for distributed computing, and that implementation will be slower than an SMP implementation with threads or processes. For parallelizing game tree search, it is worth having a separate SMP implementation, but implementing mixed SMP and MPI gets complicated. Some MPI calls are not even thread safe, so you should make sure that a designated thread, say thread 0, does the message probing, or else use locks. When a thread runs out of work, it can now also get a job from another node, which makes the implementation even more difficult. I am not sure if I solved all the nuisances, but I recall there were some. In most other distributed computing, people don't have a separate optimized SMP implementation and rely on MPI alone.
Even Deep Blue did this. Shallow-level splits were done across nodes on the SP box using message passing, and each SP node then had multiple chess processors to use in a more traditional parallel search model.

I could not imagine someone starting 8 individual processes on an 8-core node and then using message-passing from some master process to keep them busy. That's sloppy enough I wouldn't even consider it for a quick-and-dirty implementation.
Hi Dr. Hyatt:

Thanks for the reply! I think you probably have a lot of experience testing the engine on cluster systems. Do you have any curve of performance with respect to the number of node machines in a cluster when testing Crafty? (Suppose all the node machines are identical.) As in the reply above, I am just curious about how well we can do today by only increasing the scale of the hardware.

Crafty is not a cluster engine. I've had a couple of versions that were "close". But there are lots of issues and I have not gone back to work on that any further. Losing the shared hash table is a big loss, which really limits efficiency. Then the fact that splits are so expensive (a message rather than just storing a pointer in shared memory) restricts the depths at which splits can be done, which also influences efficiency.
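
A rough way to see why expensive splits push the minimum split depth up is a toy cost model. The sketch below assumes that a subtree with d plies of remaining depth costs about branching**d nodes, and that a split is only allowed when its overhead is a small fraction of that work. The function name `min_split_depth`, the 1 and 100 microsecond split costs, the branching factor of 2 and the 5% overhead limit are all illustrative assumptions, not measurements from Crafty.

```python
def min_split_depth(split_cost_us, nodes_per_us, branching=2.0,
                    overhead_limit=0.05, max_depth=64):
    """Smallest remaining depth at which a split is considered worthwhile.

    A split is accepted once its cost is at most overhead_limit times the
    estimated time to search the subtree (branching**depth nodes).
    All parameters are hypothetical knobs for this toy model.
    """
    for depth in range(max_depth + 1):
        subtree_us = (branching ** depth) / nodes_per_us
        if split_cost_us <= overhead_limit * subtree_us:
            return depth
    return None  # never worthwhile within max_depth


# Assumed costs: storing a pointer in shared memory vs. sending a message.
shared_memory_depth = min_split_depth(split_cost_us=1.0, nodes_per_us=1.0)
message_depth = min_split_depth(split_cost_us=100.0, nodes_per_us=1.0)
```

With these made-up numbers the message-based split needs several more plies of remaining depth before it pays off, so fewer positions in the tree are eligible split points.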

bob · Post by **bob** » Sat Feb 23, 2013 10:44 pm

nkg114mc wrote:
hgm wrote: Crafty is not a cluster engine, is it? Just SMP.
Hi Mr. Muller,

Thanks for the reply! Yes, at least the current Crafty is for SMP (with the Dynamic Tree Splitting algorithm, isn't it?). But as I remember, Dr. Hyatt once changed its implementation to use processes. Since a process-based implementation is more scalable, Dr. Hyatt might have some experience of how much his program improved from increasing the number of processes. I am just curious about how important the scale of hardware is to today's computer chess programs. Do you have any ideas or advice about this, Mr. Muller?

Thanks!

On an SMP box, there is absolutely no difference between processes and threads regarding parallel search efficiency. I STILL would never use message passing, even with processes. The older versions of Crafty (still plenty around on my FTP box) used the usual System V shared memory library (shmget, shmat and such) to do exactly the same thing that I do in threads. The changes were easy to make, and there was no gain or loss when I moved first from threads to processes (using fork()) or back to threads again when the pthread library became more stable...

Threads have a plus for endgame tables since Eugene's EGTB probe code was designed for threads and uses a shared cache. If you use processes, you lose the shared cache and see much duplicated I/O.
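
To illustrate how separate processes can share search data without any message passing, here is a minimal sketch in Python. It uses an anonymous shared `mmap` plus `os.fork()` as a stand-in for the System V calls (shmget, shmat) mentioned above. It is not Crafty's code: the entry layout, the table size and the names `store`, `probe` and `demo` are hypothetical, and it is Unix-only because it relies on fork().

```python
import mmap
import os
import struct

ENTRY = struct.Struct("<Qi")  # assumed entry layout: 64-bit key, 32-bit score
SLOTS = 1024                  # assumed table size


def store(table, key, score):
    """Write one entry into the slot selected by the key (always replace)."""
    ENTRY.pack_into(table, (key % SLOTS) * ENTRY.size, key, score)


def probe(table, key):
    """Return the stored score, or None if the slot holds a different key."""
    stored_key, score = ENTRY.unpack_from(table, (key % SLOTS) * ENTRY.size)
    return score if stored_key == key else None


def demo():
    # Anonymous shared mapping: its pages stay shared with forked children.
    table = mmap.mmap(-1, SLOTS * ENTRY.size)
    pid = os.fork()
    if pid == 0:
        # Child "searcher": publishes a result by a plain memory store.
        try:
            store(table, 0xDEADBEEF, 37)
        finally:
            os._exit(0)
    os.waitpid(pid, 0)
    # Parent sees the child's entry without receiving any message.
    return probe(table, 0xDEADBEEF)
```

The sketch ignores locking and the empty-slot case for key 0, both of which a real shared hash table has to deal with whether it is used from threads or from processes.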

Suj · Post by **Suj** » Tue Feb 26, 2013 11:50 am

It's not easy to find where the big drop-off is. First, it needs a lot of games; second, testing all the cluster parameters is cumbersome.

I initially thought the Sjeng cluster had a drop-off at around 80 cores back in 2009, but now it's closer to 300 cores, and maybe I am still wrong here, as I haven't really run that many games with over 300 cores. Only time will tell.
Tuning parameters on a cluster can mean a +200 Elo gain. I remember changing some parameters in Pamplona in 2009, and after 40-50 games that night I got a strange boost of 170-180 Elo over the local Sjeng. This was with the first cluster version I had.

In terms of the number of client nodes, I think Sjeng would be happy with up to at least 15 client nodes of 16 cores each.

As for Rybka, I thought that before Japan 2010 the cluster Rybka had issues above 8 nodes, but the Japan version was a big improvement over its previous cluster algorithm.

It's easy to run any number of client nodes and get great nps, but how is that going to translate into Elo and performance?

The only way is to run as many games as possible and make an educated guess. I have run maybe a few thousand games on up to 300 cores, but very few above that. On 832 cores I have run 11 games so far, and I can't judge any performance gain from that, but perhaps when I have more time I will retest it.
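
To give a feel for how little 40-50 games (let alone 11) can say, here is a small sketch that turns a match result into an Elo difference with a rough 95% interval. It uses the standard logistic Elo formula and a plain normal approximation on the score. The 28 wins, 8 losses and 9 draws in the example are made-up numbers, not actual results from the Sjeng cluster, and the function names are hypothetical.

```python
import math


def elo_from_score(score):
    """Elo difference implied by an expected score in (0, 1), logistic model."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def elo_interval(wins, losses, draws, z=1.96):
    """Return (low, estimate, high) Elo using a normal approximation."""
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games
    # Per-game variance of the result (win = 1, draw = 0.5, loss = 0).
    variance = (wins * (1.0 - score) ** 2
                + losses * score ** 2
                + draws * (0.5 - score) ** 2) / games
    margin = z * math.sqrt(variance / games)
    # Clamp so the logarithm stays defined for lopsided results.
    low = max(score - margin, 1e-6)
    high = min(score + margin, 1.0 - 1e-6)
    return elo_from_score(low), elo_from_score(score), elo_from_score(high)


# Hypothetical 45-game match.
low, estimate, high = elo_interval(wins=28, losses=8, draws=9)
```

For a match of this size the interval spans a couple of hundred Elo, so a large jump after one night of games is suggestive at best.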

hgm · Post by **hgm** » Tue Feb 26, 2013 12:17 pm

nkg114mc wrote: But as I remember, Dr. Hyatt once changed its implementation to use processes.

As Bob says, using processes for the SMP implementation does not imply that you could run the processes on different machines. If they need shared memory for essential functions (like communicating their search results to each other), they must run on the same machine.

Daniel Shawul · Post by **Daniel Shawul** » Tue Feb 26, 2013 5:13 pm

It may be better to test only the efficiency of the cluster implementation by treating all cores as nodes. It is difficult to implement full YBW on a cluster, so even on 2 cores you can see how much you are losing from that. For example, on a dual-core SMP machine I have, 1 thread does 1M nps and 2 threads do 2M nps, but 2 processes give me only about 1.4M nps. This loss is a combined effect of a poor cluster algorithm and communication latency. One example of an algorithmic deficiency is that instead of copying board positions and stacks, I preferred making and unmaking moves from whatever the node was searching previously. This choice reduces the size of the data to be transferred. I wonder if it may be better on SMP machines as well. Another is that a helper node cannot be helped by its master's threads. This is similar to a limitation that older algorithms such as PVS had. As for communication latency, MPI supposedly switches to shared memory for message passing on SMP machines, but sometimes using the network card may be faster.
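
The make/unmake idea can be sketched as follows, under the assumption that each node tracks the move path from the root to the position it is currently searching and receives only the path to the new split position. This is one reading of the description above, not the actual engine code; the function name `moves_to_reach` and the move strings are hypothetical.

```python
def moves_to_reach(current_path, target_path):
    """Return (moves to unmake, moves to make) to go from one line to another.

    Both arguments are move lists starting at the root. Only the part after
    the longest common prefix has to be unmade and remade, so the sender
    transfers a short move list instead of a full board and search stack.
    """
    common = 0
    for old_move, new_move in zip(current_path, target_path):
        if old_move != new_move:
            break
        common += 1
    unmake = list(reversed(current_path[common:]))  # undo deepest move first
    make = list(target_path[common:])
    return unmake, make


# The helper was searching 1.e4 e5 2.Nf3 and is asked to help at 1.e4 c5.
unmake, make = moves_to_reach(["e2e4", "e7e5", "g1f3"], ["e2e4", "c7c5"])
```

Here the helper unmakes g1f3 and e7e5, then makes c7c5, and only a handful of moves cross the network.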

bob · Post by **bob** » Tue Feb 26, 2013 7:24 pm

Suj wrote: It's not easy to find where the big drop-off is. First, it needs a lot of games; second, testing all the cluster parameters is cumbersome.

I initially thought the Sjeng cluster had a drop-off at around 80 cores back in 2009, but now it's closer to 300 cores, and maybe I am still wrong here, as I haven't really run that many games with over 300 cores. Only time will tell.
Tuning parameters on a cluster can mean a +200 Elo gain. I remember changing some parameters in Pamplona in 2009, and after 40-50 games that night I got a strange boost of 170-180 Elo over the local Sjeng. This was with the first cluster version I had.

In terms of the number of client nodes, I think Sjeng would be happy with up to at least 15 client nodes of 16 cores each.

As for Rybka, I thought that before Japan 2010 the cluster Rybka had issues above 8 nodes, but the Japan version was a big improvement over its previous cluster algorithm.

It's easy to run any number of client nodes and get great nps, but how is that going to translate into Elo and performance?

The only way is to run as many games as possible and make an educated guess. I have run maybe a few thousand games on up to 300 cores, but very few above that. On 832 cores I have run 11 games so far, and I can't judge any performance gain from that, but perhaps when I have more time I will retest it.

You need to ask GCP about this. Sjeng seems to be as good as any cluster program in terms of parallel search efficiency, and better than most. But it is a daunting problem, because the cost of message passing has to be factored in continually when thinking about sharing things. And you end up compromising on what you share, which hurts search efficiency.

Suj · Post by **Suj** » Wed Feb 27, 2013 3:51 pm

I know what he might reply.

Run test games and do a BayesElo analysis. But we only have limited time to test, and this is a huge compromise; I guess this is where limitations apply.

If we all had the time, then yes, we could look at testing much more, but right now I am afraid it's limited for both GCP and myself.
