scorpio can run on 8192 cores

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

scorpio can run on 8192 cores

Post by Daniel Shawul »

Is this a record for a chess engine to 'use' this many cores? The performance is horrible though, for a couple of reasons: a) cluster scaling is bad beyond a couple of nodes (1 node = 16 cores, SMP, non-NUMA); b) performance on one BG/Q core is abysmal, mainly because the hardware design neglects integer performance. The hardware is meant for number crunching, but chess engines hardly use floating-point operations, if at all. Therefore, even though the core speed is 1.6 GHz, I get only about 70,000 nps on one core. It scales well intra-node, but YBW-like algorithms do not scale well beyond 256 cores IIRC. I have yet to test ABDADA with distributed hash tables, which should keep all cores busy ...
Daniel
User avatar
Rebel
Posts: 6991
Joined: Thu Aug 18, 2011 12:04 pm

Re: scorpio can run on 8192 cores

Post by Rebel »

Goodness :wink:

You must have a big cellar, or is the beast installed in the living room?
Zenmastur
Posts: 919
Joined: Sat May 31, 2014 8:28 am

Re: scorpio can run on 8192 cores

Post by Zenmastur »

So give us some figures for the number of nodes and the nps, so we can see exactly how bad the scaling really is.

Regards,

Zen
Only 2 defining forces have ever offered to die for you.....Jesus Christ and the American Soldier. One died for your soul, the other for your freedom.
petero2
Posts: 684
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: scorpio can run on 8192 cores

Post by petero2 »

Daniel Shawul wrote: Is this a record for a chess engine to 'use' this many cores? The performance is horrible though, for a couple of reasons: a) cluster scaling is bad beyond a couple of nodes (1 node = 16 cores, SMP, non-NUMA); b) performance on one BG/Q core is abysmal, mainly because the hardware design neglects integer performance. The hardware is meant for number crunching, but chess engines hardly use floating-point operations, if at all. Therefore, even though the core speed is 1.6 GHz, I get only about 70,000 nps on one core. It scales well intra-node, but YBW-like algorithms do not scale well beyond 256 cores IIRC. I have yet to test ABDADA with distributed hash tables, which should keep all cores busy ...
Do you know of any papers describing a tree search algorithm that would scale reasonably well to a huge number of processors in a cluster, such as the IBM Sequoia with 1.5 million cores?

It seems clear to me that you cannot have all processors trying to talk to all other processors. I have been thinking about arranging the processors in a tree structure (using MPI_Dist_graph_create). Each processor would only communicate with its parent/child nodes in this tree graph. A node can ask its child nodes to help search sub-trees, and the child nodes report search results back to their parent node. I don't know how well this would work in practice, though.
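
For concreteness, something like the following is what I have in mind, just as a sketch: the branching factor K and the parent/child numbering are assumptions for illustration, and reordering is switched off so the rank arithmetic stays valid.

/* Sketch: arrange MPI ranks in a K-ary tree so each rank only talks to its
   parent and its children.  K = 8 is an arbitrary choice here. */
#include <mpi.h>
#include <stdio.h>

#define K 8

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each rank declares edges from itself to its parent and its children,
       so both directions of every parent/child link end up in the graph. */
    int dests[K + 1];
    int ndests = 0;
    if (rank != 0)
        dests[ndests++] = (rank - 1) / K;      /* parent */
    for (int c = 1; c <= K; c++) {
        long child = (long)rank * K + c;
        if (child < size)
            dests[ndests++] = (int)child;      /* children */
    }

    int sources[1] = { rank };
    int degrees[1] = { ndests };

    MPI_Comm tree_comm;
    MPI_Dist_graph_create(MPI_COMM_WORLD, 1, sources, degrees, dests,
                          MPI_UNWEIGHTED, MPI_INFO_NULL,
                          0 /* no reorder, keep rank numbering */, &tree_comm);

    printf("rank %d has %d tree neighbours\n", rank, ndests);

    /* Work requests would flow down tree_comm and search results back up,
       either with MPI_Neighbor_* collectives or plain point-to-point. */

    MPI_Comm_free(&tree_comm);
    MPI_Finalize();
    return 0;
}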
Dann Corbit
Posts: 12538
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: scorpio can run on 8192 cores

Post by Dann Corbit »

petero2 wrote:
Daniel Shawul wrote: Is this a record for a chess engine to 'use' this many cores? The performance is horrible though, for a couple of reasons: a) cluster scaling is bad beyond a couple of nodes (1 node = 16 cores, SMP, non-NUMA); b) performance on one BG/Q core is abysmal, mainly because the hardware design neglects integer performance. The hardware is meant for number crunching, but chess engines hardly use floating-point operations, if at all. Therefore, even though the core speed is 1.6 GHz, I get only about 70,000 nps on one core. It scales well intra-node, but YBW-like algorithms do not scale well beyond 256 cores IIRC. I have yet to test ABDADA with distributed hash tables, which should keep all cores busy ...
Do you know of any papers describing a tree search algorithm that would scale reasonably well to a huge number of processors in a cluster, such as the IBM Sequoia with 1.5 million cores?

It seems clear to me that you can not have all processors trying to talk to all other processors. I have been thinking about arranging the processors in a tree structure (using MPI_Dist_graph_create). Each processor would only communicate with its parent/child nodes in this tree graph. A node can ask its child nodes to help searching sub-trees, and the child nodes report search results back to their parent node. I don't know how well this would work in practice though.
It feels right to have the shape of the search follow the shape of the tree.

The PV nodes get the lion's share of the processors, with smaller clumps going to each of the sibling nodes.
As data is reported back to the hash table, some watcher/manager distributes the nodes to the most promising-looking areas of the tree.

Considering the PV itself, imagine a logarithmic distribution of processors along it (echoing Dan Homan's idea somewhat):

Let's suppose 1024 CPUs for the root node of the PV.
The first and most promising forward node might get log2(1024) = 10 CPUs.
The most promising child of the most promising forward node would get log2(10) ≈ 3 CPUs.

Sibling nodes of the root node might get 10 CPUs each as well.
If some sibling node fails high, it informs the watcher/manager, which then gives it higher priority.
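
A toy sketch of the logarithmic part of the idea, with numbers that are purely illustrative (the function below just prints the allocation, it is not tied to any engine):

/* Walk down the PV, giving each deeper PV node log2() of what its parent
   got: 1024 -> 10 -> 3 -> 1. */
#include <math.h>
#include <stdio.h>

static void allocate_along_pv(int cpus_at_root, int max_ply) {
    int cpus = cpus_at_root;
    for (int ply = 0; ply < max_ply; ply++) {
        printf("PV node at ply %d gets %d CPUs\n", ply, cpus);
        if (cpus <= 1)
            break;
        cpus = (int)log2((double)cpus);
    }
}

int main(void) {
    allocate_along_pv(1024, 8);
    return 0;
}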

Other distribution schemes might work better.
It seems to me that a geometry that reduces communication traffic will work best.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: scorpio can run on 8192 cores

Post by Daniel Shawul »

petero2 wrote:
Daniel Shawul wrote: Is this a record for a chess engine to 'use' this many cores? The performance is horrible though, for a couple of reasons: a) cluster scaling is bad beyond a couple of nodes (1 node = 16 cores, SMP, non-NUMA); b) performance on one BG/Q core is abysmal, mainly because the hardware design neglects integer performance. The hardware is meant for number crunching, but chess engines hardly use floating-point operations, if at all. Therefore, even though the core speed is 1.6 GHz, I get only about 70,000 nps on one core. It scales well intra-node, but YBW-like algorithms do not scale well beyond 256 cores IIRC. I have yet to test ABDADA with distributed hash tables, which should keep all cores busy ...
Do you know of any papers describing a tree search algorithm that would scale reasonably well to a huge number of processors in a cluster, such as the IBM Sequoia with 1.5 million cores?
Not really, but I don't expect the 'standard' YBW work-stealing procedure to work well. I think transposition-table-driven work scheduling (http://citeseerx.ist.psu.edu/viewdoc/do ... 1&type=pdf), where work is migrated to the processor that has the TT entry, is an interesting approach. I haven't implemented this, but I do have a distributed TT that I use to test simple approaches (shared TT and ABDADA). This gives me awesome cumulative NPS, but not so awesome results in terms of depth searched. My YBW implementation is such that an idle processor requests work from a random processor, which is a very, very bad approach.
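
As a minimal sketch of what 'the processor that has the TT entry' could mean, assuming the owner is picked by hashing the position key across ranks (the key value and the bit split below are made up for illustration):

#include <stdint.h>
#include <stdio.h>

/* Map a 64-bit Zobrist key to the rank that owns its TT entry.  Using the
   high bits for the owner leaves the low bits free to index the local
   table on that rank. */
static inline int tt_owner(uint64_t zobrist_key, int num_ranks) {
    return (int)((zobrist_key >> 48) % (uint64_t)num_ranks);
}

int main(void) {
    uint64_t key = 0x9d39247e33776d41ULL;   /* some position hash */
    printf("key %016llx is owned by rank %d of 8192\n",
           (unsigned long long)key, tt_owner(key, 8192));
    return 0;
}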
It seems clear to me that you can not have all processors trying to talk to all other processors. I have been thinking about arranging the processors in a tree structure (using MPI_Dist_graph_create). Each processor would only communicate with its parent/child nodes in this tree graph. A node can ask its child nodes to help searching sub-trees, and the child nodes report search results back to their parent node. I don't know how well this would work in practice though.
That is true. I am trying to have multiple communicators that suit the underlying hardware better, e.g. for the BG/Q: an SMP node => a 32-node drawer => a rack, etc. Currently I use threaded code on the SMP node (16 cores) and there is only MPI_COMM_WORLD. But I still think the work-stealing procedure is not going to scale even with that, so I would have to implement a scheme that hands out N processors as helpers along with a move to search, e.g. 128 processors for the 1st move, 64 processors for the 2nd, etc. (as Dann may have suggested already). Since moves are generated one by one in an incremental move generator, the scheme would have to allocate helpers based on the number of idle processors available at the time. I realized this is a major bottleneck when I did the test, because processor 0 has everything (thousands of processors) but gives out only 1 processor per move when it distributes the search... In any case, I think we need new approaches to parallel search to use massive compute resources, whereas the SMP problem I believe is solved.
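
Roughly what I mean, as a sketch only (the halving scheme and the numbers are invented for illustration, not what Scorpio does now):

#include <stdio.h>

/* Give the 1st root move half of the idle pool, the 2nd half of that, and
   so on, until the idle processors run out. */
static void assign_helpers(int idle_procs, int num_root_moves) {
    int share = idle_procs / 2;
    for (int m = 0; m < num_root_moves && idle_procs > 0; m++) {
        if (share < 1) share = 1;
        if (share > idle_procs) share = idle_procs;
        printf("root move %d: %d helper processors\n", m + 1, share);
        idle_procs -= share;
        share /= 2;                        /* 128 -> 64 -> 32 -> ... */
    }
}

int main(void) {
    assign_helpers(256, 16);   /* e.g. 256 idle processors, 16 root moves */
    return 0;
}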

Daniel
EvgeniyZh
Posts: 43
Joined: Fri Sep 19, 2014 4:54 pm
Location: Israel

Re: scorpio can run on 8192 cores

Post by EvgeniyZh »

petero2 wrote: Do you know of any papers describing a tree search algorithm that would scale reasonably well to a huge number of processors in a cluster, such as the IBM Sequoia with 1.5 million cores?
I believe I've seen a few for medium-sized machines: PRBFM and its performance analysis, and also 'Dynamic Prediction of Minimal Trees in Large-Scale Parallel Game Tree Search'. Though I haven't read them yet...
petero2
Posts: 684
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: scorpio can run on 8192 cores

Post by petero2 »

Daniel Shawul wrote:
petero2 wrote:Do you know of any papers describing a tree search algorithm that would scale reasonably well to a huge number of processors in a cluster, such as the IBM Sequoia with 1.5 million cores?
Not really, but I don't expect the 'standard' YBW work-stealing procedure to work well. I think transposition-table-driven work scheduling (http://citeseerx.ist.psu.edu/viewdoc/do ... 1&type=pdf), where work is migrated to the processor that has the TT entry, is an interesting approach.
Thanks for the link, interesting approach. However, in the paper they used this idea to implement a distributed IDA* search algorithm. It is not obvious to me how to generalize this to alpha-beta search, where you also have to deal with alpha-bound updates, cut-offs and propagation of search results to the parent node. The paper just says this is the subject of ongoing research.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: scorpio can run on 8192 cores

Post by Daniel Shawul »

petero2 wrote:
Daniel Shawul wrote:
petero2 wrote:Do you know of any papers describing a tree search algorithm that would scale reasonably well to a huge number of processors in a cluster, such as the IBM Sequoia with 1.5 million cores?
Not really, but I don't expect the 'standard' YBW work-stealing procedure to work well. I think transposition-table-driven work scheduling (http://citeseerx.ist.psu.edu/viewdoc/do ... 1&type=pdf), where work is migrated to the processor that has the TT entry, is an interesting approach.
Thanks for the link, interesting approach. However, in the paper they used this idea to implement a distributed IDA* search algorithm. It is not obvious to me how to generalize this to alpha-beta search, where you also have to deal with alpha-bound updates, cut-offs and propagation of search results to the parent node. The paper just says this is the subject of ongoing research.
Alpha-beta is too sequential; nobody wants to try it :). Someone did try it for a two-player zero-sum game, however, using MTD(f). http://citeseerx.ist.psu.edu/viewdoc/su ... 1.132.8604

I like the TTD approach only because it is radically different from the way we often think of distributing work in parallel search -- which is probably what is needed for a breakthrough in massive parallelism of alpha-beta. There is also APHID, which is promising (the second link in Evgeniy's post). This is alpha-beta turned into a sort of Monte-Carlo-style tree searcher, where we do iterative deepening on the leaves to get an evaluation score and the upper part of the tree is stored in memory. APHID is dynamic in that the shape of the tree stored in memory changes over time (some moves get pruned as more iterations are conducted and the evaluation decides the branch is worthless). This is basically an enhancement over the static 'UDABS' approach, where for instance you do separate iterative deepening on each of the root moves.
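
As a very rough, single-process sketch of how I read the APHID idea (the frontier size, the prune margin and the fake deepen() below are all invented for illustration; in the real algorithm the leaves would be searched by worker processors):

#include <stdio.h>
#include <stdlib.h>

#define FRONTIER 8        /* leaves of the in-memory upper tree */
#define PRUNE_MARGIN 40   /* drop leaves this far below the best score */

typedef struct { int score; int alive; } Leaf;

/* Stand-in for a worker running one more iteration of iterative deepening
   on a leaf and coming back with a refined score. */
static void deepen(Leaf *leaf) {
    leaf->score += (rand() % 21) - 10;
}

int main(void) {
    Leaf frontier[FRONTIER];
    for (int i = 0; i < FRONTIER; i++)
        frontier[i] = (Leaf){ .score = rand() % 100, .alive = 1 };

    for (int iter = 0; iter < 5; iter++) {
        int best = -1000000, alive = 0;
        for (int i = 0; i < FRONTIER; i++)     /* "workers" deepen the leaves */
            if (frontier[i].alive) {
                deepen(&frontier[i]);
                if (frontier[i].score > best) best = frontier[i].score;
            }
        for (int i = 0; i < FRONTIER; i++)     /* master prunes hopeless ones */
            if (frontier[i].alive && frontier[i].score < best - PRUNE_MARGIN)
                frontier[i].alive = 0;
        for (int i = 0; i < FRONTIER; i++)
            alive += frontier[i].alive;
        printf("iteration %d: best score %d, %d leaves alive\n",
               iter, best, alive);
    }
    return 0;
}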

Daniel