parallel speedup and assorted trivia


jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: new data

Post by jdart »

Last time I did measurements (with VTune and oprofile), locking wasn't an issue for me but every time there was a shared memory access it showed up as a hot spot. It was very clear in the profiler output.

--Jon
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: new data

Post by bob »

jdart wrote: Last time I did measurements (with VTune and oprofile), locking wasn't an issue for me but every time there was a shared memory access it showed up as a hot spot. It was very clear in the profiler output.

--Jon

I am assuming you mean shared memory that is written to, as opposed to just being shared? Modifying something that is shared is a cache burner for certain. Putting such things in separate cache blocks helps a lot.
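
A minimal sketch of what "separate cache blocks" can look like in C++17. It is not taken from either engine. The 64-byte line size, the 8-thread limit, and the names `PackedSlot`, `PaddedSlot`, `g_slots`, `slot_for`, and `lines_spanned` are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed cache-line size; 64 bytes is common on x86 but not universal.
constexpr std::size_t kCacheLine = 64;
// Hypothetical upper bound on search threads.
constexpr std::size_t kMaxThreads = 8;

// Packed layout: eight of these fit in one 64-byte line, so a write by
// one thread invalidates the line holding its neighbours' data too.
struct PackedSlot {
    std::uint64_t value = 0;
};

// Padded layout: alignas rounds the struct size up to a full line,
// so each array element sits in its own cache block.
struct alignas(kCacheLine) PaddedSlot {
    std::uint64_t value = 0;
};

static_assert(sizeof(PaddedSlot) == kCacheLine, "one slot per cache line");

// Number of cache lines covered by an array of n elements of type T,
// assuming the array starts on a line boundary.
template <typename T>
constexpr std::size_t lines_spanned(std::size_t n) {
    return (n * sizeof(T) + kCacheLine - 1) / kCacheLine;
}

// One padded slot per thread; each thread writes only its own slot.
PaddedSlot g_slots[kMaxThreads];

std::uint64_t& slot_for(std::size_t thread_id) {
    assert(thread_id < kMaxThreads);
    return g_slots[thread_id].value;
}
```

With these assumptions, eight packed slots share a single line while eight padded slots occupy eight lines, which trades memory for fewer invalidations between cores.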
jdart

Re: new data

Post by jdart »

Yes, I meant modifying data that is common across threads, such as caches, counters, etc. The hashtable is the biggest contributor to this, but I have other shared memory accesses as well. For example, there is a cache for some of the endgame scoring components. I tried removing that and it made multi-thread NPS go up, but it hurt single-threaded performance.

--Jon
bob

Re: new data

Post by bob »

jdart wrote: Yes, I meant modifying cross-thread common data, such as caches, counters, etc. The hashtable is the biggest contributor to this but I have other memory access. For example there is a cache for some of the endgame scoring components. I tried removing that and it made multi-thread NPS go up, but it hurt single-threaded performance.

--Jon

I use a global history array, with the usual 9-bit index. At one point I had decided that had to hurt NPS. I made it thread-local and the NPS did not change at all. I don't do history updates in q-search of course, and they don't get done on ALL nodes either. Perhaps skipping those is enough to keep the cost down. And of course, 90%+ of the time one just updates one counter.
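
For readers who want to try the same experiment, here is a rough sketch of a thread-local history table in C++. It is not the code from either engine: only the 512-entry size follows from the 9-bit index mentioned above, while the names `history`, `history_update`, `history_get`, and `value_seen_by_other_thread`, the `int` element type, and the way the index is computed are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>

// A 9-bit index addresses 512 entries.
constexpr std::size_t kHistorySize = 1u << 9;

// thread_local gives every thread its own zero-initialised copy,
// so updates never touch a cache line another thread is using.
thread_local int history[kHistorySize];

// Hypothetical update: bump one counter. How the index is derived
// from a move is left out on purpose.
void history_update(std::size_t index, int bonus) {
    assert(index < kHistorySize);
    history[index] += bonus;
}

int history_get(std::size_t index) {
    assert(index < kHistorySize);
    return history[index];
}

// Demo helper: a second thread updates its own table and reports
// the value it sees there.
int value_seen_by_other_thread(std::size_t index, int bonus) {
    int seen = 0;
    std::thread t([&seen, index, bonus] {
        history_update(index, bonus);
        seen = history_get(index);
    });
    t.join();
    return seen;
}
```

The trade-off is that threads no longer learn from each other's history scores. Whether that matters for search quality or NPS has to be measured on the engine in question.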
jdart

Re: new data

Post by jdart »

This can be measured. Sometimes even a counter update is a pretty big hit. But I have mostly tested on older Xeons that have less cache than the latest chips.

--Jon
bob

Re: new data

Post by bob »

jdart wrote: This can be measured. Sometimes even a counter update is a pretty big hit. But I have mostly tested on older Xeons that have less cache than the latest chips.

--Jon

It is also an architectural issue. 2x10 is a lot easier than 4x2 (number of chips x cores per chip). Four caches get really busy forwarding and invalidating stuff.

I don't have any shared counters of any sort. That was a no-no back in the early days of parallel programming, even if there was no cache involved as on the Crays...
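
One common way to avoid shared counters, sketched in C++ under stated assumptions: each thread counts in a private local variable and publishes its total once at the end, and the totals are summed after the threads have joined. The function name `count_nodes` and the fixed per-thread workload are hypothetical stand-ins for a real search. This is not a description of how either engine does it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Runs n_threads workers that each "visit" nodes_per_thread nodes,
// then returns the combined node count.
std::uint64_t count_nodes(std::size_t n_threads,
                          std::uint64_t nodes_per_thread) {
    assert(n_threads > 0);
    std::vector<std::uint64_t> totals(n_threads, 0);
    std::vector<std::thread> workers;

    for (std::size_t i = 0; i < n_threads; ++i) {
        workers.emplace_back([&totals, i, nodes_per_thread] {
            // Private counter: lives on this thread's stack or in a
            // register, so incrementing it causes no cache traffic
            // between cores.
            std::uint64_t local = 0;
            for (std::uint64_t n = 0; n < nodes_per_thread; ++n) {
                ++local;  // stand-in for searching one node
            }
            // Single write to shared memory, to this thread's own slot.
            totals[i] = local;
        });
    }
    for (auto& w : workers) {
        w.join();
    }

    // Summing after join() needs no lock or atomic: join synchronises
    // with the end of each worker.
    std::uint64_t sum = 0;
    for (std::uint64_t t : totals) {
        sum += t;
    }
    return sum;
}
```

Calling `count_nodes(4, 1000)` starts four workers and adds up their private totals. An engine that needs a running total during the search would have to read the per-thread values periodically instead of waiting for the threads to finish.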