Last time I did measurements (with VTune and oprofile), locking wasn't an issue for me but every time there was a shared memory access it showed up as a hot spot. It was very clear in the profiler output.
--Jon
parallel speedup and assorted trivia
Moderators: hgm, Rebel, chrisw
-
- Posts: 4366
- Joined: Fri Mar 10, 2006 5:23 am
- Location: http://www.arasanchess.org
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: new data
jdart wrote:
Last time I did measurements (with VTune and oprofile), locking wasn't an issue for me, but every time there was a shared-memory access it showed up as a hot spot. It was very clear in the profiler output.
--Jon

I am assuming you mean shared memory that is written to, as opposed to memory that is merely shared? Modifying something that is shared is a cache burner for certain. Putting such things in separate cache blocks helps a lot.
-
- Posts: 4366
- Joined: Fri Mar 10, 2006 5:23 am
- Location: http://www.arasanchess.org
Re: new data
Yes, I meant modifying cross-thread common data, such as caches, counters, etc. The hash table is the biggest contributor to this, but I have other shared memory accesses as well. For example, there is a cache for some of the endgame scoring components. I tried removing it, and multi-threaded NPS went up, but it hurt single-threaded performance.
--Jon
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: new data
jdart wrote:
Yes, I meant modifying cross-thread common data, such as caches, counters, etc. The hash table is the biggest contributor to this, but I have other shared memory accesses as well. For example, there is a cache for some of the endgame scoring components. I tried removing it, and multi-threaded NPS went up, but it hurt single-threaded performance.
--Jon

I use a global history array, with the usual 9-bit index. At one point I had decided that had to hurt NPS, but when I made it thread-local, NPS did not change at all. I don't do history updates in q-search, of course, and they don't get done at ALL nodes either. Perhaps skipping those is enough. And of course, 90%+ of the time one just updates a single counter.
-
- Posts: 4366
- Joined: Fri Mar 10, 2006 5:23 am
- Location: http://www.arasanchess.org
Re: new data
This can be measured. Sometimes even a counter update is a pretty big hit. But I have mostly tested on older Xeons that have less cache than the latest chips.
--Jon
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: new data
jdart wrote:
This can be measured. Sometimes even a counter update is a pretty big hit. But I have mostly tested on older Xeons that have less cache than the latest chips.
--Jon

It is also an architectural issue. 2x10 is a lot easier than 4x2 (#chips x #cores per chip): four caches get really busy forwarding and invalidating stuff.

I don't have any shared counters of any sort. That was a no-no back in the early days of parallel programming, even when there was no cache involved, as on the Crays...