Message passing parallel search on SMP system

Daniel Shawul · Post by **Daniel Shawul** » Thu Mar 07, 2013 6:10 am

\begin-essay

While pondering on my cluster search for Nebiyu, I thought about optimizing it for SMP systems first just to have some base to compare to.
I got a somewhat surprizing result, that it can be as good as a shared memory search that uses threads or processes. Note that both tests
are done on an SMP systems so there is no latency from interconnect. The goal here is just to investigate how message passing
parallel search compares to shared memory parallel search. I sensed some doubts about this here some time back. Note that MPI may
use shared memory to pass messages around making this distinction difficult to understand. Anyhow after modifying a couple of things mentioned below,
I was able to get the same NPS scaling for both methods, tested on 2 cores:). The MPI cluster code was never able to double nps for 2 processors using
a YBW algotihm, not some search from root stuff that scales perfectly nps wise. The changes I made will probably not improve
the cluster search because the bottelneck there is the network latency, and also because I never used message passing on SMP nodes.
So this is more of a what if scenario. Here are the changes:

a) Polling is a bad idea for message passing parallel search on SMP systems, which i think is the main reason for the improvement I got.
With shared memory parallel search this works well because worker threads can quickly check for themselves whether a flag is set (e.g QUIT),
and can also merge move results at split points etc... The latter is not possible in message passing parallel search because the master thread
has to probe for message, and do the update on behalf of the worker. This delay affects performance badly even if polling is done at very fast
rate say after 200-1000 nodes.

Ok so what is the solution for this? Using additional threads dedicated for message processing similar to what is used by some engines.
The problem with this is that on an 8-core system, you need to start 8 processes + 8 threads. This is a bit too much.
What makes this really bad is that MPI calls always spin consuming 100% cpu. Blocking calls such as MPI_Send and MPI_Recv do that, thus
the 8 threads end up consuming 50% cpu. This is rather annoying for a chess programmer who is used to a blocking stdin not consuming any cpu time.
This is default behaviour for MPI,but i gathered its runtime behaviour can be modified through --mca parameter but unfortunately not all vendors
have it. This is still unsolved issue but i guess sleeping those threads by a millisecond won't hurt that much.

b) The size of the message exchanged for SPLIT/MERGE operation. This is exactly the way i have it for cluster search and not optimized
for SMP system. Instead of copying the board position, stack of moves and other data, I opted for sending the moves necessary to reach the
split position from the root. The helper will just do/undo moves to get to the split position indirectly. On average <=5 do/undos are enough specially if the parallel
search goes in a PV-split manner. I think this may be a good alternative to data copying for SMP search as well. The cost of a couple of do/undos
may be small compared to copying huge bytes of data specially for those who use piece lists? Apparently it works judging from the result i have here.
Even if the cost is significant, increasing the smp split depth by 1 should reduce its effect.

c) A master can not become a slave to its helper in a message passing parallel search. Maybe i should say it was very messy for me, and i did not
figure out how to do it cleanly yet. This will probably affect performance but apprently not by much.

I probably missed some stuff but those are the major ones. Comments are welcome.

\end-essay