Explanation for non-expert?

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: Explanation for non-expert?

Post by mar »

bob wrote:On the 12-core box I use occasionally (dual 6 core chips) the scaling is almost perfect. 5.5M NPS on one core, 60M NPS on 12 cores.
11x speedup on 2x6 cores?...
zullil
Posts: 6442
Joined: Tue Jan 09, 2007 12:31 am
Location: PA USA
Full name: Louis Zulli

Re: Explanation for non-expert?

Post by zullil »

bob wrote:
zullil wrote:Question from non-expert:

Would it be possible to include a UCI option "Write SMP log" (false by default) that would dump possibly useful information about activity/inactivity of threads (or whatever), so users with high-core machines might be able to test/contribute feedback?

I understand that testing multi-threaded searching is difficult, because the searches themselves are highly indeterminate. Testing with very short time control matches seems problematic to me.

Thanks.
This is not easy. First, such a log could be quite big. As a humorous note, Cray once asked me to do this when I was testing on the T90 with 32 cpus. I asked "are you sure you want to see that?" I ran the test, told them where to find the log, Dennis called me back and said "7+ terabytes is a bit large, eh?" :)

There are several issues to deal with. IF NPS scaling is poor, that is almost always a result of either (a) threads sitting idle waiting for work too long or (b) cache/memory bottlenecks. Discovering which is not so easy. If you watch Crafty run, you will see it display time as "time=3:14(97%)". What that means is that 97% of the time ALL threads were busy, or the inverse, 3% of the total performance power of the machine (N cores) was lost due to waiting for a place to split. If NPS is low but that percentage is high, then we are looking at cache/memory issues which are difficult for a user to deal with. Intel changes their cache mechanism regularly, particularly with automatic pre-fetching. This makes figuring out where the NPS bottleneck is quite difficult at times. Intel / AMD have some internal tools I have used in the past to help me discover unknown cache hot-spots and the like. But every time they make an improvement, it bites some program until the programmer fixes that.

Easy-to-tune is a goal. But a VERY difficult one. I'm interested in how the dual 18-core boxes might run and what quirks they are going to have...
Thanks. Even having access to such a percentage (time threads are active as a fraction of total time) for Stockfish might be helpful. Perhaps there is some way to measure this by monitoring the Stockfish process as it runs. Don't know enough.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Explanation for non-expert?

Post by bob »

mar wrote:
bob wrote:On the 12-core box I use occasionally (dual 6 core chips) the scaling is almost perfect. 5.5M NPS on one core, 60M NPS on 12 cores.
11x speedup on 2x6 cores?...
No. 11x NPS scaling. I didn't take the time to run the long speedup test; it takes a lot of time when you want the 12-core or 16-core test to take 3 minutes per position, which means you want the 1-core test to take an hour per position or so. My test has over 300 positions, which means the 1-cpu run alone takes almost 2 weeks.

The NPS scaling is always a good starting point, because if it is bad, the parallel speedup will be even worse. It took quite a bit of program tweaking and tuning, as I started off on that particular box at maybe 8x NPS scaling, which translates to maybe a 6x speedup. Not so good. When I finished (this code is in 25.0 and will be released soon) it was "back to normal".

Until the next CPU change by Intel or AMD of course.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Explanation for non-expert?

Post by bob »

zullil wrote:
bob wrote:
zullil wrote:Question from non-expert:

Would it be possible to include a UCI option "Write SMP log" (false by default) that would dump possibly useful information about activity/inactivity of threads (or whatever), so users with high-core machines might be able to test/contribute feedback?

I understand that testing multi-threaded searching is difficult, because the searches themselves are highly indeterminate. Testing with very short time control matches seems problematic to me.

Thanks.
This is not easy. First, such a log could be quite big. As a humorous note, Cray once asked me to do this when I was testing on the T90 with 32 cpus. I asked "are you sure you want to see that?" I ran the test, told them where to find the log, Dennis called me back and said "7+ terabytes is a bit large, eh?" :)

There are several issues to deal with. IF NPS scaling is poor, that is almost always a result of either (a) threads sitting idle waiting for work too long or (b) cache/memory bottlenecks. Discovering which is not so easy. If you watch Crafty run, you will see it display time as "time=3:14(97%)". What that means is that 97% of the time ALL threads were busy, or the inverse, 3% of the total performance power of the machine (N cores) was lost due to waiting for a place to split. If NPS is low but that percentage is high, then we are looking at cache/memory issues which are difficult for a user to deal with. Intel changes their cache mechanism regularly, particularly with automatic pre-fetching. This makes figuring out where the NPS bottleneck is quite difficult at times. Intel / AMD have some internal tools I have used in the past to help me discover unknown cache hot-spots and the like. But every time they make an improvement, it bites some program until the programmer fixes that.

Easy-to-tune is a goal. But a VERY difficult one. I'm interested in how the dual 18-core boxes might run and what quirks they are going to have...
Thanks. Even having access to such a percentage (time threads are active as a fraction of total time) for Stockfish might be helpful. Perhaps there is some way to measure this by monitoring the Stockfish process as it runs. Don't know enough.
What I do is have each thread count the number of milliseconds it spends waiting on work. I know how long a search takes (elapsed time), so this usage number is

usage = 100 - 100 * total_idle_time / (#cores * elapsed_time)

I believe. If idle time is zero, which happens on my 4-core iMac all the time, then usage = 100%. With 4 cores I have been seeing 97-100%, which is what I want. On that 12-core box, I finally drove it up to the numbers previously given (92% or so). It's a challenge to work well everywhere.
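
For concreteness, the bookkeeping might look roughly like this (a minimal sketch with made-up names, not Crafty's actual source): each thread adds the milliseconds it spends waiting for work to a shared counter, and the usage percentage follows from the formula above.

#include <atomic>
#include <chrono>

std::atomic<long long> total_idle_ms{0};   // summed over all threads

// Each thread wraps its wait-for-work loop with timing like this.
void wait_for_work()
{
    auto start = std::chrono::steady_clock::now();
    // ... block here until some split point offers this thread work ...
    auto stop = std::chrono::steady_clock::now();
    total_idle_ms += std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
}

// After the search: elapsed_ms is the wall-clock time of the whole search.
double usage_percent(long long elapsed_ms, int num_cores)
{
    return 100.0 - 100.0 * total_idle_ms.load() / double(num_cores * elapsed_ms);
}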
petero2
Posts: 686
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: Explanation for non-expert?

Post by petero2 »

zamar wrote:
#2 is also not quite so simple. Better to have 8 threads at a sure ALL node than having 4 at that node and the other 4 somewhere else. One condition I didn't see being checked was one of the conditions I used in the DTS algorithm for Cray Blitz, which was "work left to be done at a split point." Even in Crafty I have been measuring "futile splits" which can happen in multiple ways.
Well, there is no such thing as a sure ALL node, but in principle I agree. An ideal splitting algorithm would take the node type and the count of already-searched moves into account, as this affects the probability of a cutoff.
Texel already does that. It collects cutoff statistics during search and uses the information to estimate the probability that a move at a potential split point would be searched by a corresponding single-threaded search. It also collects "overhead" statistics during search, which estimate how much time a thread needs to spend on starting/stopping the search, as a function of the depth of the subtree to search. For a given move at a potential split point it then computes the expected "efficiency", defined as p1 * p2 * ... * pN * (1 - overhead), where p1, ..., pN are the estimated probabilities that the moves for the current and all "parent" split points need to be searched. Finally it picks the move with the largest estimated efficiency.

An early version of the algorithm is described here.
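
For illustration, the selection could be sketched like this (hypothetical names; an approximation of the idea, not Texel's actual code):

#include <vector>

struct Candidate {
    std::vector<double> pathProbs; // p1..pN: estimated probabilities that the move at
                                   // this and each parent split point is searched by a
                                   // sequential search (from the cutoff statistics)
    double overhead;               // estimated start/stop overhead for this subtree depth
};

// Expected efficiency of searching this move in parallel:
// p1 * p2 * ... * pN * (1 - overhead).
double efficiency(const Candidate& c)
{
    double p = 1.0;
    for (double pi : c.pathProbs)
        p *= pi;
    return p * (1.0 - c.overhead);
}

// Pick the candidate move with the largest expected efficiency.
int bestCandidate(const std::vector<Candidate>& cands)
{
    int best = -1;
    double bestEff = -1.0;
    for (int i = 0; i < (int)cands.size(); ++i) {
        double e = efficiency(cands[i]);
        if (e > bestEff) {
            bestEff = e;
            best = i;
        }
    }
    return best;
}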
zamar
Posts: 613
Joined: Sun Jan 18, 2009 7:03 am

Re: Explanation for non-expert?

Post by zamar »

petero2 wrote: For a given move at a potential split point it then computes the expected "efficiency", defined as p1 * p2 * ... * pN * (1 - overhead), where p1, ..., pN are the estimated probabilities that the moves for the current and all "parent" split points need to be searched. Finally it picks the move with the largest estimated efficiency.

An early version of the algorithm is described here.
I've thought along similar lines, so that makes perfect sense to me.
About your algorithm: Why are you not just using YBW?
Joona Kiiski
petero2
Posts: 686
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: Explanation for non-expert?

Post by petero2 »

zamar wrote:
petero2 wrote: For a given move at a potential split point it then computes the expected "efficiency", defined as p1 * p2 * ... * pN * (1 - overhead), where p1, ..., pN are the estimated probabilities that the moves for the current and all "parent" split points need to be searched. Finally it picks the move with the largest estimated efficiency.

An early version of the algorithm is described here.
I've thought along similar lines, so that makes perfect sense to me.
About your algorithm: Why are you not just using YBW?
I wanted to design my own algorithm, and I wanted it to include some of the good properties from DTS.

I also wanted the algorithm to give a larger Elo improvement than YBWC when using many cores, but I don't know if that succeeded. At least YBWC does not seem to be as good to me, because it basically replaces all the probability estimates with 0 (when the first move has not been searched) and 1 (when it has been).
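
In the terms used above, the YBW criterion amounts to a degenerate probability estimate, roughly like this (illustrative only):

// YBW as a special case of the probability framework: later moves are
// assumed never needed until the first move at the node has been
// searched, and always needed afterwards.
double ybwProbNeeded(int movesAlreadySearched)
{
    return movesAlreadySearched >= 1 ? 1.0 : 0.0;
}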
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Explanation for non-expert?

Post by bob »

petero2 wrote:
zamar wrote:
#2 is also not quite so simple. Better to have 8 threads at a sure ALL node than having 4 at that node and the other 4 somewhere else. One condition I didn't see being checked was one of the conditions I used in the DTS algorithm for Cray Blitz, which was "work left to be done at a split point." Even in Crafty I have been measuring "futile splits" which can happen in multiple ways.
Well, there is no such thing as a sure ALL node, but in principle I agree. An ideal splitting algorithm would take the node type and the count of already-searched moves into account, as this affects the probability of a cutoff.
Texel already does that. It collects cutoff statistics during search and uses the information to estimate the probability that a move at a potential split point would be searched by a corresponding single-threaded search. It also collects "overhead" statistics during search, which estimate how much time a thread needs to spend on starting/stopping the search, as a function of the depth of the subtree to search. For a given move at a potential split point it then computes the expected "efficiency", defined as p1 * p2 * ... * pN * (1 - overhead), where p1, ..., pN are the estimated probabilities that the moves for the current and all "parent" split points need to be searched. Finally it picks the move with the largest estimated efficiency.

An early version of the algorithm is described here.
I looked at what you were doing. One thing you might not have noticed was that in DTS there is no "master" thread. Since this was an iterative rather than recursive search, ANY thread could complete the search at a node and back up to the previous ply. When any thread hits a fail high, it returns to the previous ply while stopping any siblings working there since that work is unnecessary.
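
The fail-high abort could be sketched like this (names are made up; this is not DTS or Crafty source): every split point carries a stop flag that helpers poll, and the first thread to fail high raises it so sibling work at that node ends immediately.

#include <atomic>

struct SplitPoint {
    std::atomic<bool> stop{false};
    // ... shared bounds, move list, helper bookkeeping ...
};

// Polled inside each helper's move loop at this split point.
bool should_abandon(const SplitPoint& sp)
{
    return sp.stop.load(std::memory_order_relaxed);
}

// Called by whichever thread fails high (score >= beta) at the node.
void fail_high(SplitPoint& sp)
{
    sp.stop.store(true, std::memory_order_relaxed);  // the siblings' remaining work is unnecessary
}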
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Explanation for non-expert?

Post by bob »

zamar wrote:
petero2 wrote: For a given move at a potential split point it then computes the expected "efficiency", defined as p1 * p2 * ... * pN * (1 - overhead), where p1, ..., pN are the estimated probabilities that the moves for the current and all "parent" split points need to be searched. Finally it picks the move with the largest estimated efficiency.

An early version of the algorithm is described here.
I've thought along similar lines, so that makes perfect sense to me.
About your algorithm: Why are you not just using YBW?
YBW is simple, but it means that when a processor becomes idle, it is going to wait until some other thread can pick it up at a node after satisfying the YBW criterion. That was one of the driving forces for me to do the DTS approach, to eliminate any waiting at all. And DTS did that flawlessly. If you had N processors, it searched at N*nps speeds with zero lost time. I just didn't want to rewrite all of that when I started on Crafty, but I will likely do it one day because the gains are worthwhile.
Rein Halbersma
Posts: 741
Joined: Tue May 22, 2007 11:13 am

Re: Explanation for non-expert?

Post by Rein Halbersma »

zamar wrote:
petero2 wrote: For a given move at a potential split point it then computes the expected "efficiency", defined as p1 * p2 * ... * pN * (1 - overhead), where p1, ..., pN are the estimated probabilities that the moves for the current and all "parent" split points need to be searched. Finally it picks the move with the largest estimated efficiency.

An early version of the algorithm is described here.
I've thought along similar lines, so that makes perfect sense to me.
About your algorithm: Why are you not just using YBW?
I'm nowhere near an expert on parallel alpha-beta, but it seems to me that people in this thread are reinventing optimal scheduling of the jobs associated with potential split points. AFAICS, this is mostly solved in the literature by randomized work-stealing, as pioneered by Cilkchess and currently available as Cilk Plus from Intel (there is a g++ branch).

The way they implement work-stealing is with a cactus stack, essentially one deque of split points for each hardware thread. During execution, each thread pushes and pops potential split points at the bottom of its own deque, and an idle thread steals from the top of a randomly chosen victim's deque. There are tons of papers and PhD theses from the Cilk project proving that this kind of scheduling is optimal for fork-join parallelism. Because the randomization automatically load-balances, there is no need for heuristics about the depth in the tree at which a split point is added.
The way they implement work-stealing is by using a cactus-stack, essentially one deque of split points for each hardware thread. During execution, each thread keeps popping the bottom of a randomly chosen deque, and also pushes each potential split point to its own deque. There are tons of papers and PhD theses from the Cilk project in which they prove that this kind of scheduling is optimal for fork-join parallelism. Because the randomization automatically load-balances, there is no need for heuristics on the depth in the tree where a split point is being added.