Last time I did measurements (with VTune and oprofile), locking wasn't an issue for me but every time there was a shared memory access it showed up as a hot spot. It was very clear in the profiler output.
--Jon
parallel speedup and assorted trivia
Moderators: hgm, Rebel, chrisw
-
- Posts: 4366
- Joined: Fri Mar 10, 2006 5:23 am
- Location: http://www.arasanchess.org
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: new data
jdart wrote:
Last time I did measurements (with VTune and oprofile), locking wasn't an issue for me, but every time there was a shared-memory access it showed up as a hot spot. It was very clear in the profiler output.
--Jon

I am assuming you mean shared memory that is written to, as opposed to memory that is merely shared? Modifying something that is shared is a cache burner for certain. Putting such things in separate cache blocks helps a lot.
-
- Posts: 4366
- Joined: Fri Mar 10, 2006 5:23 am
- Location: http://www.arasanchess.org
Re: new data
Yes, I meant modifying cross-thread common data, such as caches, counters, etc. The hash table is the biggest contributor to this, but I have other shared memory accesses as well. For example, there is a cache for some of the endgame scoring components. I tried removing it, and multi-threaded NPS went up, but it hurt single-threaded performance.
--Jon
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: new data
jdart wrote:
Yes, I meant modifying cross-thread common data, such as caches, counters, etc. The hash table is the biggest contributor to this, but I have other shared memory accesses as well. For example, there is a cache for some of the endgame scoring components. I tried removing it, and multi-threaded NPS went up, but it hurt single-threaded performance.
--Jon

I use a global history array, with the usual 9-bit index. At one point I had decided that had to hurt NPS, but when I made it thread-local, NPS did not change at all. I don't do history updates in q-search, of course, and they don't get done at ALL nodes either. Perhaps skipping those is enough. And of course, 90%+ of the time one just updates a single counter.
-
- Posts: 4366
- Joined: Fri Mar 10, 2006 5:23 am
- Location: http://www.arasanchess.org
Re: new data
This can be measured. Sometimes even a counter update is a pretty big hit. But I have mostly tested on older Xeons that have less cache than the latest chips.
--Jon
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: new data
jdart wrote:
This can be measured. Sometimes even a counter update is a pretty big hit. But I have mostly tested on older Xeons that have less cache than the latest chips.
--Jon

It is also an architectural issue. 2x10 is a lot easier than 4x2 (#chips x #cores per chip): four caches get really busy forwarding and invalidating stuff.

I don't have any shared counters of any sort. That was a no-no back in the early days of parallel programming, even when there was no cache involved, as on the Crays...