Page 1 of 2

Parallel search: System-level programmin details

Posted: Sat Feb 09, 2013 4:19 pm
by AlvaroBegue
I am about to try to write a parallel alpha-beta search for the first time. My first idea is to implement a fairly vanilla version of YBWC, and I think I understand the algorithm well enough, but I am sure getting the details of the implementation right is also very important.

In particular, I think I know how to do this using multiple processes in Linux, where an idle process uses signals to ask a random process for work. But I understand there are many other options: One could use threads and "publish" the available split nodes using some sort of lockless queue, so other threads can get work from it. Or perhaps something with MPI so the transition to a cluster is more seamless (although for now I'd be happy to get something working just for SMP machines).

I would also like this to work on Windows with minimal changes, but if I understand correctly, Windows doesn't have inter-process signals...

Is there a standard approach to this? How do you guys do it?

Thanks!

Re: Parallel search: System-level programmin details

Posted: Sat Feb 09, 2013 5:23 pm
by jdart
Most engines use threads, not processes (since they run on one physical machine and not a cluster of machines). If you don't intend to run on a cluster, threads are more efficient.

My implementation is open source (http://www.arasanchess.org) and includes Linux, Windows and MacOS support.

C++ 11 includes builit-in support for threads and synchronization primitives (I don't use this because my code pre-dates C++ 11).

--Jon

Re: Parallel search: System-level programmin details

Posted: Sat Feb 09, 2013 7:35 pm
by AlvaroBegue
Thank you, Jon. I'll read your code and try to learn. By the little I have perused it, it seems to be pretty well written.

Re: Parallel search: System-level programmin details

Posted: Sat Feb 09, 2013 7:38 pm
by jdart
Note there is actually a bug (Linux-specific) in one of my thread control primitives. The fixed version is below (will be in the next release):

Code: Select all

void ThreadControl::reset() {
#ifdef _WIN32
   if (!ResetEvent(hEvent1)) {
        cerr << "ResetEvent failed" << endl;
        ASSERT&#40;0&#41;;
   &#125;
#elif defined&#40;_MAC&#41;
   // use semctl, not semop, to mimic Windows behavior where                                              
   // multiple resets have no extra effect.                                                               
   union semun val;
   // set semaphore to 1                                                                                  
   val.val = 1;
   if &#40;semctl&#40;sem, 0, SETVAL, val&#41;) &#123;
      perror&#40;"semctl");
   &#125;
#else
   // lock semaphore                                                                                      
   while (!sem_trywait&#40;&sem&#41;) ;
   if &#40;errno == EAGAIN&#41; &#123;
      // expected, semaphore is locked                                                                    
   &#125; else &#123;
      perror&#40;"reset");
   &#125;
#endif
&#125;

Re: Parallel search: System-level programmin details

Posted: Sat Feb 09, 2013 10:16 pm
by zamar
AlvaroBegue wrote:I am about to try to write a parallel alpha-beta search for the first time. My first idea is to implement a fairly vanilla version of YBWC, and I think I understand the algorithm well enough, but I am sure getting the details of the implementation right is also very important.

In particular, I think I know how to do this using multiple processes in Linux, where an idle process uses signals to ask a random process for work. But I understand there are many other options: One could use threads and "publish" the available split nodes using some sort of lockless queue, so other threads can get work from it. Or perhaps something with MPI so the transition to a cluster is more seamless (although for now I'd be happy to get something working just for SMP machines).

I would also like this to work on Windows with minimal changes, but if I understand correctly, Windows doesn't have inter-process signals...

Is there a standard approach to this? How do you guys do it?

Thanks!
Each thread has a boolean variable "idle". Every time after move is completed in high enough depth, check if there are threads with "idle = true". If so, create a split point, and assign all idle threads to it.

Not maybe the most efficient, but very simple method.

Re: Parallel search: System-level programmin details

Posted: Sun Feb 10, 2013 12:00 am
by AlvaroBegue
zamar wrote:
AlvaroBegue wrote:I am about to try to write a parallel alpha-beta search for the first time. My first idea is to implement a fairly vanilla version of YBWC, and I think I understand the algorithm well enough, but I am sure getting the details of the implementation right is also very important.

In particular, I think I know how to do this using multiple processes in Linux, where an idle process uses signals to ask a random process for work. But I understand there are many other options: One could use threads and "publish" the available split nodes using some sort of lockless queue, so other threads can get work from it. Or perhaps something with MPI so the transition to a cluster is more seamless (although for now I'd be happy to get something working just for SMP machines).

I would also like this to work on Windows with minimal changes, but if I understand correctly, Windows doesn't have inter-process signals...

Is there a standard approach to this? How do you guys do it?

Thanks!
Each thread has a boolean variable "idle". Every time after move is completed in high enough depth, check if there are threads with "idle = true". If so, create a split point, and assign all idle threads to it.

Not maybe the most efficient, but very simple method.
I read a description of YBWC by Valavan Manohararajah in his thesis titled "Parallel Alpha-Beta Search on Shared Memory Multiprocessors". I quote:
A processor, P1, that is idle selects another processor, P2, at random and transmits a message requesting work. Processor P2 has work available if there is at least one node in the subtree it is examining that satisfies the weak YBWC criterion. That is, P2 has work available if it owns a node at which the eldest brother has been evaluated. The node that
satisfies the criterion becomes the split-point; if there are many nodes that satisfy the criterion, then the node that is the highest in the tree is selected as the split-point.
That's why I was thinking of an implementation using signals, since I would need to be able to interrupt what another worker is doing and ask for work. It seems like communicating the other way around is easier to implement, but perhaps you'll miss opportunities to split at a deeper node, if at the time when it became splittable there were no idle workers.

Re: Parallel search: System-level programmin details

Posted: Sun Feb 10, 2013 2:29 pm
by zamar
AlvaroBegue wrote:
zamar wrote:
AlvaroBegue wrote:I am about to try to write a parallel alpha-beta search for the first time. My first idea is to implement a fairly vanilla version of YBWC, and I think I understand the algorithm well enough, but I am sure getting the details of the implementation right is also very important.

In particular, I think I know how to do this using multiple processes in Linux, where an idle process uses signals to ask a random process for work. But I understand there are many other options: One could use threads and "publish" the available split nodes using some sort of lockless queue, so other threads can get work from it. Or perhaps something with MPI so the transition to a cluster is more seamless (although for now I'd be happy to get something working just for SMP machines).

I would also like this to work on Windows with minimal changes, but if I understand correctly, Windows doesn't have inter-process signals...

Is there a standard approach to this? How do you guys do it?

Thanks!
Each thread has a boolean variable "idle". Every time after move is completed in high enough depth, check if there are threads with "idle = true". If so, create a split point, and assign all idle threads to it.

Not maybe the most efficient, but very simple method.
I read a description of YBWC by Valavan Manohararajah in his thesis titled "Parallel Alpha-Beta Search on Shared Memory Multiprocessors". I quote:
A processor, P1, that is idle selects another processor, P2, at random and transmits a message requesting work. Processor P2 has work available if there is at least one node in the subtree it is examining that satisfies the weak YBWC criterion. That is, P2 has work available if it owns a node at which the eldest brother has been evaluated. The node that
satisfies the criterion becomes the split-point; if there are many nodes that satisfy the criterion, then the node that is the highest in the tree is selected as the split-point.
That's why I was thinking of an implementation using signals, since I would need to be able to interrupt what another worker is doing and ask for work. It seems like communicating the other way around is easier to implement, but perhaps you'll miss opportunities to split at a deeper node, if at the time when it became splittable there were no idle workers.
That's definetily a downside, but in practice the effect is minor (at least when number of threads is at most 8). The most critical split happens when we are in PV-node and first move is completed. Then all threads are automatically available and start in parallel searching moves 2,3,4,5 (I assume number of threads is 4). This is the only split point until move n-3 is completed. Then other threads are searching moves n-2, n-1 and n. Thread which completed move n-3 is available as a helpful master.

In this case the schema I described doesn't behave optimally, but practice has shown that moves at the end of the move list are usually completed very quickly anyway, so even a split is a bit suboptimal in this case, the performance loss is minimalistic.

When computers with 16 or 32 threads become available, then we need to reconsider, maybe we'll even see a switch from YBW to DTS?

Re: Parallel search: System-level programmin details

Posted: Sun Feb 10, 2013 4:34 pm
by AlvaroBegue
I have one more question. I am using PVS, so I perform null-window searches for all the young brothers, and those are the searches that I would do in parallel. So what happens when one of those searches actually discovers that the move is better than its eldest brother? Of course we'll change the boundary to compare new nodes to. But, should the threads that are currently searching with the wrong boundary be stopped? Or is it OK to let them continue with the old boundary, and we'll re-search them if we have to?

At this point it would be great if someone could point me to a paper that goes into all these details. Or I risk asking too many questions and bothering everybody here...

Re: Parallel search: System-level programmin details

Posted: Sun Feb 10, 2013 9:30 pm
by zamar
AlvaroBegue wrote:I have one more question. I am using PVS, so I perform null-window searches for all the young brothers, and those are the searches that I would do in parallel. So what happens when one of those searches actually discovers that the move is better than its eldest brother? Of course we'll change the boundary to compare new nodes to. But, should the threads that are currently searching with the wrong boundary be stopped?
I think this issue was discussed here some time ago and every body agreed that it didn't make any measurable difference.

Re: Parallel search: System-level programmin details

Posted: Sun Feb 10, 2013 10:46 pm
by AlvaroBegue
zamar wrote:
AlvaroBegue wrote:I have one more question. I am using PVS, so I perform null-window searches for all the young brothers, and those are the searches that I would do in parallel. So what happens when one of those searches actually discovers that the move is better than its eldest brother? Of course we'll change the boundary to compare new nodes to. But, should the threads that are currently searching with the wrong boundary be stopped?
I think this issue was discussed here some time ago and every body agreed that it didn't make any measurable difference.
Great. That makes things much easier, then. Thanks!