parallel speedup and assorted trivia

jdart

Re: new data

Post by jdart » Sat Jun 20, 2015 4:57 pm

The last time I did measurements (with VTune and oprofile), locking wasn't an issue for me, but every shared-memory access showed up as a hot spot. It was very clear in the profiler output.

--Jon

bob

Re: new data

Post by bob » Sat Jun 20, 2015 9:12 pm

jdart wrote:The last time I did measurements (with VTune and oprofile), locking wasn't an issue for me, but every shared-memory access showed up as a hot spot. It was very clear in the profiler output.

--Jon
I am assuming you mean shared memory that is written to, as opposed to just being shared? Modifying something that is shared is a cache burner for certain. Putting such things in separate cache blocks helps a lot.
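
To make that concrete, here is a minimal C++ sketch (not taken from any particular engine) of what putting hot data in separate cache blocks buys: two counters written by different threads, first packed into one 64-byte line, then each forced onto its own line with alignas.

Code: Select all

#include <atomic>
#include <cstdint>
#include <cstdio>
#include <thread>

// Packed back to back, the two counters share one 64-byte cache line, so an
// increment by one thread invalidates the line in the other thread's cache
// (false sharing) even though the threads never touch the same variable.
struct PackedCounters {
    std::atomic<uint64_t> a{0};
    std::atomic<uint64_t> b{0};
};

// Same data, but each counter is forced onto its own cache line; the two
// threads now write to independent lines and the coherence traffic goes away.
struct PaddedCounters {
    alignas(64) std::atomic<uint64_t> a{0};
    alignas(64) std::atomic<uint64_t> b{0};
};

int main() {
    PaddedCounters c;   // swap in PackedCounters to see the slowdown
    std::thread t1([&c] {
        for (int i = 0; i < 10000000; ++i)
            c.a.fetch_add(1, std::memory_order_relaxed);
    });
    std::thread t2([&c] {
        for (int i = 0; i < 10000000; ++i)
            c.b.fetch_add(1, std::memory_order_relaxed);
    });
    t1.join();
    t2.join();
    std::printf("a=%llu b=%llu\n",
                (unsigned long long)c.a.load(), (unsigned long long)c.b.load());
    return 0;
}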

jdart

Re: new data

Post by jdart » Sun Jun 21, 2015 11:54 am

Yes, I meant modifying cross-thread common data, such as caches, counters, etc. The hashtable is the biggest contributor to this, but I have other shared memory accesses as well. For example, there is a cache for some of the endgame scoring components. I tried removing that and it made multi-threaded NPS go up, but it hurt single-threaded performance.
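
As a rough illustration of that trade-off (hypothetical names, not Arasan's actual code): making such a cache thread_local removes the cross-thread writes entirely, but every thread then has to warm its own copy, which is where the single-threaded cost comes from.

Code: Select all

#include <cstddef>
#include <cstdint>

// Hypothetical per-thread cache for an expensive scoring term. Reads and
// writes are purely thread-local, so there is no coherence traffic, but the
// memory is duplicated per thread and every copy starts out cold.
struct EvalCacheEntry {
    uint64_t key   = 0;
    int      score = 0;
    bool     valid = false;
};

constexpr std::size_t CACHE_SIZE = 1 << 14;        // 16K entries per thread
thread_local EvalCacheEntry evalCache[CACHE_SIZE];

// Stand-in for the real (expensive) endgame scoring routine.
static int expensiveEndgameScore(uint64_t key) {
    return static_cast<int>(key % 1001) - 500;
}

int cachedEndgameScore(uint64_t key) {
    EvalCacheEntry& e = evalCache[key & (CACHE_SIZE - 1)];
    if (e.valid && e.key == key)
        return e.score;                            // hit: thread-local read only
    e.key   = key;
    e.score = expensiveEndgameScore(key);          // miss: recompute and store
    e.valid = true;
    return e.score;
}

int main() {
    // The second call hits the thread-local entry filled by the first call.
    return cachedEndgameScore(0x1234) == cachedEndgameScore(0x1234) ? 0 : 1;
}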

--Jon

bob

Re: new data

Post by bob » Sun Jun 21, 2015 3:32 pm

jdart wrote:Yes, I meant modifying cross-thread common data, such as caches, counters, etc. The hashtable is the biggest contributor to this, but I have other shared memory accesses as well. For example, there is a cache for some of the endgame scoring components. I tried removing that and it made multi-threaded NPS go up, but it hurt single-threaded performance.

--Jon
I use a global history array, with the usual 9-bit index. At one point I had decided that had to hurt NPS. I made it thread-local and the NPS did not change at all. I don't do history updates in q-search, of course, and they don't get done at ALL nodes either; perhaps skipping those is enough to avoid the contention. And of course, 90%+ of the time one just updates a single counter.
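
As a sketch (the index packing and table size here are illustrative, not Crafty's exact layout), the only difference between the shared version and the per-thread version is the thread_local keyword:

Code: Select all

#include <cstdint>

constexpr int HISTORY_SIZE = 1 << 9;            // "the usual 9-bit index"

// Shared version: one table for all threads, where every update can steal
// the cache line from another thread:
//     int history[HISTORY_SIZE];
// Per-thread version: each search thread keeps its own copy, so updates
// never generate coherence traffic.
thread_local int history[HISTORY_SIZE];

// Hypothetical 9-bit packing: 3 bits of piece type, 6 bits of target square.
inline int historyIndex(int piece, int to) {
    return ((piece & 7) << 6) | (to & 63);
}

// Typical depth-weighted bonus; usually only this one counter gets updated.
inline void updateHistory(int piece, int to, int depth) {
    history[historyIndex(piece, to)] += depth * depth;
}

int main() {
    updateHistory(2 /* e.g. knight */, 36 /* e.g. e5 */, 8);
    return history[historyIndex(2, 36)] == 64 ? 0 : 1;
}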

jdart

Re: new data

Post by jdart » Sun Jun 21, 2015 8:03 pm

This can be measured. Sometimes even a counter update is a pretty big hit. But I have mostly tested on older Xeons that have less cache than the latest chips.

--Jon

bob

Re: new data

Post by bob » Sun Jun 21, 2015 8:46 pm

jdart wrote:This can be measured. Sometimes even a counter update is a pretty big hit. But I have mostly tested on older Xeons that have less cache than the latest chips.

--Jon
It is also an architectural issue. 2x10 (# of chips x # of cores per chip) is a lot easier than 4x2: four separate caches get really busy forwarding and invalidating data between them.

I don't have any shared counters of any sort. That was a no-no back in the early days of parallel programming, even when there was no cache involved, as on the Crays...
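
For completeness, the usual alternative looks something like the sketch below (hypothetical names): each thread bumps a counter padded onto its own cache line, and the per-thread values are only summed when statistics actually get reported.

Code: Select all

#include <cstdint>
#include <cstdio>

constexpr int MAX_THREADS = 64;

// One counter per thread, each padded out to its own 64-byte cache line so
// that increments by different threads never contend.
struct alignas(64) NodeCounter {
    uint64_t nodes = 0;
};

NodeCounter nodeCounters[MAX_THREADS];

// Called constantly from the search: a plain write to a thread-private line.
inline void countNode(int thread) {
    ++nodeCounters[thread].nodes;
}

// Called rarely (end of an iteration, periodic output): cheap aggregation.
uint64_t totalNodes(int nthreads) {
    uint64_t total = 0;
    for (int t = 0; t < nthreads; ++t)
        total += nodeCounters[t].nodes;
    return total;
}

int main() {
    countNode(0);
    countNode(1);
    std::printf("total nodes: %llu\n", (unsigned long long)totalNodes(2));
    return 0;
}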
