Stockfish still scales poorly?

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Dann Corbit
Posts: 12540
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Stockfish still scales poorly?

Post by Dann Corbit »

bob wrote:
Dann Corbit wrote:
syzygy wrote:
Dann Corbit wrote:
syzygy wrote:
Dann Corbit wrote:Don Homan does the same thing in his parallel search and he saw a speedup. But people do not do that in the normal alpha-beta serial search.
He certainly did not see a superlinear speedup.

I have no problem accepting that there might be good approaches to multi-threaded searching that to some extent leave the "try to mimick the serial search as best as we can"-approach. If a particular smp implementation happens to reach max Elo at 55 cores, because with 56 cores the extra synchronisation overhead outweighs the benefit of that extra core, then doing anything else with that core that makes remotely sense can only be an improvement. But you're just not going to get the performance of a single core search at 56x the speed.
What if it is possible to use the information gathered to reduce the branching factor by 0.15? By 0.25?
If pigs can fly in a parallel search, then they can fly equally well in a serial search. And probably better.

You can only use information gathered from nodes searched earlier in time. In a serial search it is much easier to control which nodes you search earlier. You are not dependent on other threads having done their work: if something is to be done at some other place of the tree, then a serial search can simply decide to do that first before continuing where it was searching before. This is actually precisely why there is always parallel overhead in multithreaded alpha-beta implementations.
Let's imagine, just for the sake of argument, that we have 50 threads, each of which is analyzing a sibling node.

The current state of analysis is sent to some central hub decision maker node.

The time allotted for each thread is one hour.

Now, some period into the search, our hub can see that 45 of the threads are faring very badly as to the score of the search from our {side to move} point of view.

The hub can now tell those 45 threads to work on something else.

It is true that we could write a serial search to do the same thing. But that serial search would be a lot more complicated. Would anyone really bother to write it?

The parallel version will be much simpler, since we have continuously updated information to some central location we can make a decision point anywhere that we like.

Equally interesting things can be done when something very good is discovered (new pv node, mate threats in current pv, whatever...)

Do you agree that the parallel version doing this will be much easier to write?

No. We already do this. IE alpha/beta pruning, where we decide to give up on this branch and move on. But if you can do it in parallel, one can do it serially and I don't see why it would not be simpler as well. Serial search doesn't require locks and synchronization and have to deal with race conditions as well.

Do you agree that it probably would never be written as a serial version?
No. If it worked in the parallel search, probably everyone would try it in the serial search as well. The current searches are not exactly simple as it stands, parallel or not.

All parallel searches of today handle fail highs (something very good) and fail lows (something very bad) equally well. Goal is to find the best move, not information about the rest, just prove them worse and move on.
Multi-pv search is an exception.

In my example, the fail highs and fail lows are live, real-time moving targets. it is not that way in standard serial search.
voyagerOne
Posts: 154
Joined: Tue May 17, 2011 8:12 pm

Re: Stockfish still scales poorly?

Post by voyagerOne »

@ Ronald

You are missing my point.
I am trying to state that the rate of NPS may be dependent upon the search depth.

So...
Using a fixed time and same engine.

At one core it spend 95% time at depth 21.
At four core it spends 95% time at depth 26.

However the NPS may decrease at higher depths.
Using the four core we found the NPS at depth 21 is 3000 kNPS
and at depth 26 it is 2500 kNPS

Since 4 core is faster than 1 core it reaches a higher depth...but at higher depth the NPS decrease.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Stockfish still scales poorly?

Post by bob »

Dann Corbit wrote:
bob wrote:
Dann Corbit wrote:
syzygy wrote:
Dann Corbit wrote:
syzygy wrote:
Dann Corbit wrote:Don Homan does the same thing in his parallel search and he saw a speedup. But people do not do that in the normal alpha-beta serial search.
He certainly did not see a superlinear speedup.

I have no problem accepting that there might be good approaches to multi-threaded searching that to some extent leave the "try to mimick the serial search as best as we can"-approach. If a particular smp implementation happens to reach max Elo at 55 cores, because with 56 cores the extra synchronisation overhead outweighs the benefit of that extra core, then doing anything else with that core that makes remotely sense can only be an improvement. But you're just not going to get the performance of a single core search at 56x the speed.
What if it is possible to use the information gathered to reduce the branching factor by 0.15? By 0.25?
If pigs can fly in a parallel search, then they can fly equally well in a serial search. And probably better.

You can only use information gathered from nodes searched earlier in time. In a serial search it is much easier to control which nodes you search earlier. You are not dependent on other threads having done their work: if something is to be done at some other place of the tree, then a serial search can simply decide to do that first before continuing where it was searching before. This is actually precisely why there is always parallel overhead in multithreaded alpha-beta implementations.
Let's imagine, just for the sake of argument, that we have 50 threads, each of which is analyzing a sibling node.

The current state of analysis is sent to some central hub decision maker node.

The time allotted for each thread is one hour.

Now, some period into the search, our hub can see that 45 of the threads are faring very badly as to the score of the search from our {side to move} point of view.

The hub can now tell those 45 threads to work on something else.

It is true that we could write a serial search to do the same thing. But that serial search would be a lot more complicated. Would anyone really bother to write it?

The parallel version will be much simpler, since we have continuously updated information to some central location we can make a decision point anywhere that we like.

Equally interesting things can be done when something very good is discovered (new pv node, mate threats in current pv, whatever...)

Do you agree that the parallel version doing this will be much easier to write?

No. We already do this. IE alpha/beta pruning, where we decide to give up on this branch and move on. But if you can do it in parallel, one can do it serially and I don't see why it would not be simpler as well. Serial search doesn't require locks and synchronization and have to deal with race conditions as well.

Do you agree that it probably would never be written as a serial version?
No. If it worked in the parallel search, probably everyone would try it in the serial search as well. The current searches are not exactly simple as it stands, parallel or not.

All parallel searches of today handle fail highs (something very good) and fail lows (something very bad) equally well. Goal is to find the best move, not information about the rest, just prove them worse and move on.
Multi-pv search is an exception.

In my example, the fail highs and fail lows are live, real-time moving targets. it is not that way in standard serial search.
Alpha/beta DEPENDS on constant fail highs and fail lows. Otherwise the search reverts to minimax.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Stockfish still scales poorly?

Post by bob »

Dann Corbit wrote:
bob wrote:
Dann Corbit wrote:
syzygy wrote:
Dann Corbit wrote:Would a parallel search that achieved 100x the NPS but a zero increase in Elo be implemented correctly?
No, but a parallel search that leaves a lot of potential untapped leaves a lot of potential untapped.

It seems people are thinking that Joona's patch is being criticised. It is not.

Before Joona's patch, most people were aware that SF scales poorly when going from 8 to 16 cores and that this in particular shows in the poor scaling of NPS. This means there is a treasure waiting to be found: one that fixes NPS scaling.

Joona comes up with a patch that seems to improve SMP significantly. But it does not fix NPS scaling. It means a treasure was found but that it is not the treasure we knew existed. That one is still hidden and waiting to be found.
I am neither criticizing nor defending Joona's patch.

I attempted to say that examination of NPS is not enough to know that an SMP implementation scales well.
Nobody has said otherwise. But if NPS scales poorly, it is an absolute certainty that the parallel search will scale at least that poorly and almost certainly worse since we have not seen any perfect speedups for more than a couple or maybe 4 cpus max.

No matter how many cores you use, if your single-cpu NPS increases by 8x, your speed up is bound by that 8x and will actually be significantly less due to the usual SMP search overhead issues. It is simple math and there is no way to escape it, ever.
If the search is unable to exploit parallel information, then I agree that this conclusion is inescapable.
As I have said repeatedly, if there is something to exploit, the serial search can also exploit it by multiplexing the search.
Dann Corbit
Posts: 12540
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Stockfish still scales poorly?

Post by Dann Corbit »

bob wrote:
Dann Corbit wrote:
bob wrote:
Dann Corbit wrote:
syzygy wrote:
Dann Corbit wrote:
syzygy wrote:
Dann Corbit wrote:Don Homan does the same thing in his parallel search and he saw a speedup. But people do not do that in the normal alpha-beta serial search.
He certainly did not see a superlinear speedup.

I have no problem accepting that there might be good approaches to multi-threaded searching that to some extent leave the "try to mimick the serial search as best as we can"-approach. If a particular smp implementation happens to reach max Elo at 55 cores, because with 56 cores the extra synchronisation overhead outweighs the benefit of that extra core, then doing anything else with that core that makes remotely sense can only be an improvement. But you're just not going to get the performance of a single core search at 56x the speed.
What if it is possible to use the information gathered to reduce the branching factor by 0.15? By 0.25?
If pigs can fly in a parallel search, then they can fly equally well in a serial search. And probably better.

You can only use information gathered from nodes searched earlier in time. In a serial search it is much easier to control which nodes you search earlier. You are not dependent on other threads having done their work: if something is to be done at some other place of the tree, then a serial search can simply decide to do that first before continuing where it was searching before. This is actually precisely why there is always parallel overhead in multithreaded alpha-beta implementations.
Let's imagine, just for the sake of argument, that we have 50 threads, each of which is analyzing a sibling node.

The current state of analysis is sent to some central hub decision maker node.

The time allotted for each thread is one hour.

Now, some period into the search, our hub can see that 45 of the threads are faring very badly as to the score of the search from our {side to move} point of view.

The hub can now tell those 45 threads to work on something else.

It is true that we could write a serial search to do the same thing. But that serial search would be a lot more complicated. Would anyone really bother to write it?

The parallel version will be much simpler, since we have continuously updated information to some central location we can make a decision point anywhere that we like.

Equally interesting things can be done when something very good is discovered (new pv node, mate threats in current pv, whatever...)

Do you agree that the parallel version doing this will be much easier to write?

No. We already do this. IE alpha/beta pruning, where we decide to give up on this branch and move on. But if you can do it in parallel, one can do it serially and I don't see why it would not be simpler as well. Serial search doesn't require locks and synchronization and have to deal with race conditions as well.

Do you agree that it probably would never be written as a serial version?
No. If it worked in the parallel search, probably everyone would try it in the serial search as well. The current searches are not exactly simple as it stands, parallel or not.

All parallel searches of today handle fail highs (something very good) and fail lows (something very bad) equally well. Goal is to find the best move, not information about the rest, just prove them worse and move on.
Multi-pv search is an exception.

In my example, the fail highs and fail lows are live, real-time moving targets. it is not that way in standard serial search.
Alpha/beta DEPENDS on constant fail highs and fail lows. Otherwise the search reverts to minimax.
Yes, but this is completely different.

All the live searches can change their goals in real-time.
In a normal alpha-beta search, the goal does not change over an iteration.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Stockfish still scales poorly?

Post by bob »

Dann Corbit wrote:
bob wrote:
Dann Corbit wrote:
bob wrote:
Dann Corbit wrote:
syzygy wrote:
Dann Corbit wrote:
syzygy wrote:
Dann Corbit wrote:Don Homan does the same thing in his parallel search and he saw a speedup. But people do not do that in the normal alpha-beta serial search.
He certainly did not see a superlinear speedup.

I have no problem accepting that there might be good approaches to multi-threaded searching that to some extent leave the "try to mimick the serial search as best as we can"-approach. If a particular smp implementation happens to reach max Elo at 55 cores, because with 56 cores the extra synchronisation overhead outweighs the benefit of that extra core, then doing anything else with that core that makes remotely sense can only be an improvement. But you're just not going to get the performance of a single core search at 56x the speed.
What if it is possible to use the information gathered to reduce the branching factor by 0.15? By 0.25?
If pigs can fly in a parallel search, then they can fly equally well in a serial search. And probably better.

You can only use information gathered from nodes searched earlier in time. In a serial search it is much easier to control which nodes you search earlier. You are not dependent on other threads having done their work: if something is to be done at some other place of the tree, then a serial search can simply decide to do that first before continuing where it was searching before. This is actually precisely why there is always parallel overhead in multithreaded alpha-beta implementations.
Let's imagine, just for the sake of argument, that we have 50 threads, each of which is analyzing a sibling node.

The current state of analysis is sent to some central hub decision maker node.

The time allotted for each thread is one hour.

Now, some period into the search, our hub can see that 45 of the threads are faring very badly as to the score of the search from our {side to move} point of view.

The hub can now tell those 45 threads to work on something else.

It is true that we could write a serial search to do the same thing. But that serial search would be a lot more complicated. Would anyone really bother to write it?

The parallel version will be much simpler, since we have continuously updated information to some central location we can make a decision point anywhere that we like.

Equally interesting things can be done when something very good is discovered (new pv node, mate threats in current pv, whatever...)

Do you agree that the parallel version doing this will be much easier to write?

No. We already do this. IE alpha/beta pruning, where we decide to give up on this branch and move on. But if you can do it in parallel, one can do it serially and I don't see why it would not be simpler as well. Serial search doesn't require locks and synchronization and have to deal with race conditions as well.

Do you agree that it probably would never be written as a serial version?
No. If it worked in the parallel search, probably everyone would try it in the serial search as well. The current searches are not exactly simple as it stands, parallel or not.

All parallel searches of today handle fail highs (something very good) and fail lows (something very bad) equally well. Goal is to find the best move, not information about the rest, just prove them worse and move on.
Multi-pv search is an exception.

In my example, the fail highs and fail lows are live, real-time moving targets. it is not that way in standard serial search.
Alpha/beta DEPENDS on constant fail highs and fail lows. Otherwise the search reverts to minimax.
Yes, but this is completely different.

All the live searches can change their goals in real-time.
In a normal alpha-beta search, the goal does not change over an iteration.
When the previous iteration best move fails low, the goal certainly changes.

Learning things in parallel is a hit-or-miss race condition. Not exactly what a good chess program is going to be built around...
Dann Corbit
Posts: 12540
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Stockfish still scales poorly?

Post by Dann Corbit »

bob wrote:
Dann Corbit wrote:
bob wrote:
Dann Corbit wrote:
syzygy wrote:
Dann Corbit wrote:Would a parallel search that achieved 100x the NPS but a zero increase in Elo be implemented correctly?
No, but a parallel search that leaves a lot of potential untapped leaves a lot of potential untapped.

It seems people are thinking that Joona's patch is being criticised. It is not.

Before Joona's patch, most people were aware that SF scales poorly when going from 8 to 16 cores and that this in particular shows in the poor scaling of NPS. This means there is a treasure waiting to be found: one that fixes NPS scaling.

Joona comes up with a patch that seems to improve SMP significantly. But it does not fix NPS scaling. It means a treasure was found but that it is not the treasure we knew existed. That one is still hidden and waiting to be found.
I am neither criticizing nor defending Joona's patch.

I attempted to say that examination of NPS is not enough to know that an SMP implementation scales well.
Nobody has said otherwise. But if NPS scales poorly, it is an absolute certainty that the parallel search will scale at least that poorly and almost certainly worse since we have not seen any perfect speedups for more than a couple or maybe 4 cpus max.

No matter how many cores you use, if your single-cpu NPS increases by 8x, your speed up is bound by that 8x and will actually be significantly less due to the usual SMP search overhead issues. It is simple math and there is no way to escape it, ever.
If the search is unable to exploit parallel information, then I agree that this conclusion is inescapable.
As I have said repeatedly, if there is something to exploit, the serial search can also exploit it by multiplexing the search.
I disagree. In the example I gave earlier, the threads have simultaneous access to information and they can all act on it at once.

While you could simulate this serially, it would be very inefficient.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Stockfish still scales poorly?

Post by bob »

Dann Corbit wrote:
bob wrote:
Dann Corbit wrote:
bob wrote:
Dann Corbit wrote:
syzygy wrote:
Dann Corbit wrote:Would a parallel search that achieved 100x the NPS but a zero increase in Elo be implemented correctly?
No, but a parallel search that leaves a lot of potential untapped leaves a lot of potential untapped.

It seems people are thinking that Joona's patch is being criticised. It is not.

Before Joona's patch, most people were aware that SF scales poorly when going from 8 to 16 cores and that this in particular shows in the poor scaling of NPS. This means there is a treasure waiting to be found: one that fixes NPS scaling.

Joona comes up with a patch that seems to improve SMP significantly. But it does not fix NPS scaling. It means a treasure was found but that it is not the treasure we knew existed. That one is still hidden and waiting to be found.
I am neither criticizing nor defending Joona's patch.

I attempted to say that examination of NPS is not enough to know that an SMP implementation scales well.
Nobody has said otherwise. But if NPS scales poorly, it is an absolute certainty that the parallel search will scale at least that poorly and almost certainly worse since we have not seen any perfect speedups for more than a couple or maybe 4 cpus max.

No matter how many cores you use, if your single-cpu NPS increases by 8x, your speed up is bound by that 8x and will actually be significantly less due to the usual SMP search overhead issues. It is simple math and there is no way to escape it, ever.
If the search is unable to exploit parallel information, then I agree that this conclusion is inescapable.
As I have said repeatedly, if there is something to exploit, the serial search can also exploit it by multiplexing the search.
I disagree. In the example I gave earlier, the threads have simultaneous access to information and they can all act on it at once.

While you could simulate this serially, it would be very inefficient.
If there is a gain from doing this, the gain will be present in parallel AND serial searches. Searching things out of order and hoping for something good to happen is NOT the way to develop a good parallel search however. If, by some miracle you can make this "something good happen" almost every time, one can do the same with a multiplex search and get that SAME something good.
syzygy
Posts: 5563
Joined: Tue Feb 28, 2012 11:56 pm

Re: Stockfish still scales poorly?

Post by syzygy »

Dann Corbit wrote:The current state of analysis is sent to some central hub decision maker node.

The time allotted for each thread is one hour.

Now, some period into the search, our hub can see that 45 of the threads are faring very badly as to the score of the search from our {side to move} point of view.

The hub can now tell those 45 threads to work on something else.
Serial alpha-beta does that, but much more efficiently.
Do you agree that the parallel version doing this will be much easier to write?
No. I'm sorry, but what you are proposing cannot even be called an idea. I think Bob picked a good word when he said mystical.
syzygy
Posts: 5563
Joined: Tue Feb 28, 2012 11:56 pm

Re: Stockfish still scales poorly?

Post by syzygy »

Dann Corbit wrote:If the search is unable to exploit parallel information, then I agree that this conclusion is inescapable.
"Parallel information" is a red herring.