NUMA in a YBWC implementation
-
- Posts: 803
- Joined: Mon Jul 17, 2006 5:53 am
- Full name: Edsel Apostol
NUMA in a YBWC implementation
The trend now and going forward is machines with two or more CPUs in a NUMA configuration. I would like to add NUMA awareness to a YBWC parallel search. Are there any tips from our SMP experts here, beyond pinning each thread to a fixed CPU core?
-
- Posts: 4366
- Joined: Fri Mar 10, 2006 5:23 am
- Location: http://www.arasanchess.org
Re: NUMA in a YBWC implementation
Pinning is not so easy: different operating systems have different APIs for it, and hyperthreading complicates things, since you want to pin to a physical core, not a virtual one. I am using the hwloc library to make this easier: https://www.open-mpi.org/projects/hwloc/
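To make the pinning step concrete, here is a minimal Linux-only sketch. It does not use hwloc and it is not Arasan's code: it calls the glibc `sched_getaffinity`/`sched_setaffinity` wrappers directly, and the helper names `allowed_cpus` and `pin_current_thread` are made up for illustration. It works on logical CPU ids only, so it does not solve the physical-versus-virtual core problem described above; mapping logical ids to physical cores still needs something like hwloc or the kernel's topology information.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   // needed for the CPU_* macros; g++ usually defines it already
#endif
#include <cassert>
#include <sched.h>
#include <vector>

// Returns the logical CPU ids the calling thread is currently allowed to run on.
std::vector<int> allowed_cpus() {
    std::vector<int> cpus;
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return cpus;                       // empty on failure
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu)
        if (CPU_ISSET(cpu, &set))
            cpus.push_back(cpu);
    return cpus;
}

// Pins the calling thread to one logical CPU (pid 0 means "this thread" on Linux).
// Each search thread would call this once at startup. Returns true on success.
bool pin_current_thread(int cpu) {
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set) == 0;
}
```

A search thread with index i could call `pin_current_thread(cpus[i % cpus.size()])` as its first action, where `cpus` is the list returned by `allowed_cpus()` in the main thread before any pinning. Windows and other systems need their own equivalents, which is the portability problem hwloc is meant to hide.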
On a NUMA system you also have to pay extra attention to shared memory access. The hash table is necessarily shared memory, although you can take care not to map the whole table into a single node's local memory (I think Crafty does some partial initialization of the table in each thread). Otherwise you should try to reduce global memory access. I copy some info, like search options, into each thread's stack so that the threads aren't all hitting a single copy. For things like performance counters, I also accumulate them in the threads and only periodically update a global copy. I would recommend running a profiler like oprofile or VTune to find hot spots.
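As a rough sketch of two of these ideas (not taken from Arasan or Crafty), the code below has each thread initialize its own slice of the hash table and keeps node counts in a local variable that is flushed to a global atomic only occasionally. It assumes the OS places a page on the NUMA node of the thread that first writes to it (the usual default on Linux), and that threads are already pinned. `HashEntry`, `touch_slice`, `worker`, `run`, and `kFlushInterval` are all hypothetical names, and the loop body merely stands in for a real search.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <memory>
#include <thread>
#include <vector>

struct HashEntry {                 // hypothetical 16-byte table entry
    std::uint64_t key;
    std::uint64_t data;
};

std::atomic<std::uint64_t> g_nodes{0};          // global counter, written rarely
constexpr std::uint64_t kFlushInterval = 1024;  // assumed flush period

// Zero this thread's slice of the table. Under a first-touch policy the pages
// of the slice end up on the NUMA node where the calling thread is running.
void touch_slice(HashEntry* table, std::size_t entries,
                 unsigned id, unsigned nthreads) {
    assert(nthreads > 0 && id < nthreads);
    std::size_t begin = entries * id / nthreads;
    std::size_t end = entries * (id + 1) / nthreads;
    std::memset(table + begin, 0, (end - begin) * sizeof(HashEntry));
}

void worker(HashEntry* table, std::size_t entries,
            unsigned id, unsigned nthreads, std::uint64_t work) {
    touch_slice(table, entries, id, nthreads);
    std::uint64_t local_nodes = 0;              // lives on this thread's stack
    for (std::uint64_t i = 0; i < work; ++i) {
        ++local_nodes;                          // stands in for visiting a node
        if (local_nodes == kFlushInterval) {    // only now touch shared memory
            g_nodes.fetch_add(local_nodes, std::memory_order_relaxed);
            local_nodes = 0;
        }
    }
    g_nodes.fetch_add(local_nodes, std::memory_order_relaxed);  // flush remainder
}

// Starts the workers and returns the total node count after they finish.
std::uint64_t run(unsigned nthreads, std::size_t entries, std::uint64_t work) {
    g_nodes.store(0);
    // new[] without () leaves the entries uninitialized, so the main thread
    // does not write to the table before the workers do.
    std::unique_ptr<HashEntry[]> table(new HashEntry[entries]);
    std::vector<std::thread> pool;
    for (unsigned id = 0; id < nthreads; ++id)
        pool.emplace_back(worker, table.get(), entries, id, nthreads, work);
    for (auto& t : pool) t.join();
    return g_nodes.load();
}
```

Note that `std::make_unique<HashEntry[]>(n)` would zero the whole table in the allocating thread, which is exactly the single-node placement this is trying to avoid. Probes into the table still cross nodes during search, so this only spreads the traffic rather than making it local.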
There are a lot of resources on this on the Web. See for example:
http://ircc.fiu.edu/download/sc13/Pract ... Slides.pdf
https://blogs.fau.de/hager/files/2010/0 ... ticore.pdf
http://www.cs.utexas.edu/~skeckler/pubs/ISPASS_2011.pdf
https://software.intel.com/sites/defaul ... r_NUMA.pdf
--Jon
-
- Posts: 12540
- Joined: Wed Mar 08, 2006 8:57 pm
- Location: Redmond, WA USA
Re: NUMA in a YBWC implementation
I guess that you have already looked at Daniel Shawul's Scorpio implementation, which allows simultaneous NUMA and SMP searching?
I think his search is YBW (IIRC).
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.