Stockfish still scales poorly?

bob · Post by **bob** » Thu Feb 26, 2015 3:21 am

Dann Corbit wrote:
syzygy wrote:
Dann Corbit wrote:Don Homan does the same thing in his parallel search and he saw a speedup. But people do not do that in the normal alpha-beta serial search.
He certainly did not see a superlinear speedup.

I have no problem accepting that there might be good approaches to multi-threaded searching that to some extent leave the "try to mimick the serial search as best as we can"-approach. If a particular smp implementation happens to reach max Elo at 55 cores, because with 56 cores the extra synchronisation overhead outweighs the benefit of that extra core, then doing anything else with that core that makes remotely sense can only be an improvement. But you're just not going to get the performance of a single core search at 56x the speed.
What if it is possible to use the information gathered to reduce the branching factor by 0.15? By 0.25?

I am not saying that this can be done. But I don't know why it can't be done either.
1.4^56 = 152464944
1.2^56 = 27174
5611 times faster from a change in the exponent of 0.2 for a 56 ply search.
Note that I just chose 56 arbitrarily. It has nothing to do with your 56.

This is also why I think that any really large Elo gains will come from search enhancements rather than all the tweaking of the evaluation function.

This question has the same answer as previously given. If you can do that with a parallel search, you should be able to do that with a multiplexed serial search.

A simple example might put this to rest. Suppose that for some reason, your move generation and ordering is completely backward, where you search the worst move first and the best move last. A parallel search will produce a super-linear speedup at times, because it will search all moves at a ply at once, and get to that best move sooner than the serial search. But if you multiplex the search, you will still do a simulated split at that node, and search one node for each sub-tree, then another node for each subtree, and you get that same effect. Even though the serial search was completely broken initially because move ordering was backward.

syzygy · Post by **syzygy** » Thu Feb 26, 2015 9:32 am

Dann Corbit wrote:
syzygy wrote:
Dann Corbit wrote:Don Homan does the same thing in his parallel search and he saw a speedup. But people do not do that in the normal alpha-beta serial search.
He certainly did not see a superlinear speedup.

I have no problem accepting that there might be good approaches to multi-threaded searching that to some extent leave the "try to mimick the serial search as best as we can"-approach. If a particular smp implementation happens to reach max Elo at 55 cores, because with 56 cores the extra synchronisation overhead outweighs the benefit of that extra core, then doing anything else with that core that makes remotely sense can only be an improvement. But you're just not going to get the performance of a single core search at 56x the speed.
What if it is possible to use the information gathered to reduce the branching factor by 0.15? By 0.25?

If pigs can fly in a parallel search, then they can fly equally well in a serial search. And probably better.

You can only use information gathered from nodes searched earlier in time. In a serial search it is much easier to control which nodes you search earlier. You are not dependent on other threads having done their work: if something is to be done at some other place of the tree, then a serial search can simply decide to do that first before continuing where it was searching before. This is actually precisely why there is always parallel overhead in multithreaded alpha-beta implementations.

voyagerOne · Post by **voyagerOne** » Thu Feb 26, 2015 11:43 pm

My 2 cents:

I believe NPS is a misleading metric to determine scalability based on this logic:

An Engine's NPS/Depth may have a cubic relationship. So as depth increases the NPS decreases. (This might be from aggressive pruning or some other factors)

So as an Engine becomes faster it will get to deeper depth faster but as it do the NPS will decrease.

I think a better measurement will be how long an engine can solve a particular hard position.

zullil · Post by **zullil** » Thu Feb 26, 2015 11:51 pm

voyagerOne wrote:My 2 cents:

I believe NPS is a misleading metric to determine scalability based on this logic:

In think in this discussion NPS should be thought of as a (somewhat crude) proxy for time spent actually searching (as opposed to time the engine spends waiting to search due to locking, or waiting for data from memory). At least, that's my 2¢.

bob · Post by **bob** » Fri Feb 27, 2015 12:07 am

voyagerOne wrote:My 2 cents:

I believe NPS is a misleading metric to determine scalability based on this logic:

An Engine's NPS/Depth may have a cubic relationship. So as depth increases the NPS decreases. (This might be from aggressive pruning or some other factors)

So as an Engine becomes faster it will get to deeper depth faster but as it do the NPS will decrease.

I think a better measurement will be how long an engine can solve a particular hard position.

That's an effect I have not seen, nor could I explain why such would happen. The time to search a node doesn't particularly change with depth.

bob · Post by **bob** » Fri Feb 27, 2015 12:08 am

zullil wrote:
voyagerOne wrote:My 2 cents:

I believe NPS is a misleading metric to determine scalability based on this logic:

In think in this discussion NPS should be thought of as a (somewhat crude) proxy for time spent actually searching (as opposed to time the engine spends waiting to search due to locking, or waiting for data from memory). At least, that's my 2¢.

Notice also that NPS is just an upper bound. Actual SMP speedup will be less than the NPS speedup for well-known reasons. But if NPS speedup is bad, then certainly SMP speedup is going to be even worse. So it can point you toward whatever is the biggest limiting factor in the overall parallel speedup.

syzygy · Post by **syzygy** » Fri Feb 27, 2015 1:19 am

voyagerOne wrote:My 2 cents:

I believe NPS is a misleading metric to determine scalability based on this logic:

An Engine's NPS/Depth may have a cubic relationship.

That would be a very strange engine. For non-strange engines NPS can be measured reliably without much effort.

Note that we are not talking about comparing NPS between different engines or between different search implementations. We are talking about taking the same multithreaded engine and running it on 1 core, 2 cores, 3 cores, etc.

As has been explained several times, NPS speedup does not equate to Elo increase, but it is very literally an important factor of Elo increase.

The other factor is parallel search overhead in the sense of needing more nodes as the number of cores increases to achieve the same quality of play. Determine both factors, multiply them, and what you get is Elo increase.

I think a better measurement will be how long an engine can solve a particular hard position.

Time to solve a particular position is practically meaningless. Two runs on the same position with just two threads can already give very considerable differences in solution time. Taking averages won't help either, as a particular change might simply do well on a particular position without being good overall.

Dann Corbit · Post by **Dann Corbit** » Fri Feb 27, 2015 1:54 am

syzygy wrote:
Dann Corbit wrote:
syzygy wrote:
Dann Corbit wrote:Don Homan does the same thing in his parallel search and he saw a speedup. But people do not do that in the normal alpha-beta serial search.
He certainly did not see a superlinear speedup.

I have no problem accepting that there might be good approaches to multi-threaded searching that to some extent leave the "try to mimick the serial search as best as we can"-approach. If a particular smp implementation happens to reach max Elo at 55 cores, because with 56 cores the extra synchronisation overhead outweighs the benefit of that extra core, then doing anything else with that core that makes remotely sense can only be an improvement. But you're just not going to get the performance of a single core search at 56x the speed.
What if it is possible to use the information gathered to reduce the branching factor by 0.15? By 0.25?
If pigs can fly in a parallel search, then they can fly equally well in a serial search. And probably better.

You can only use information gathered from nodes searched earlier in time. In a serial search it is much easier to control which nodes you search earlier. You are not dependent on other threads having done their work: if something is to be done at some other place of the tree, then a serial search can simply decide to do that first before continuing where it was searching before. This is actually precisely why there is always parallel overhead in multithreaded alpha-beta implementations.

Let's imagine, just for the sake of argument, that we have 50 threads, each of which is analyzing a sibling node.

The current state of analysis is sent to some central hub decision maker node.

The time allotted for each thread is one hour.

Now, some period into the search, our hub can see that 45 of the threads are faring very badly as to the score of the search from our {side to move} point of view.

The hub can now tell those 45 threads to work on something else.

It is true that we could write a serial search to do the same thing. But that serial search would be a lot more complicated. Would anyone really bother to write it?

The parallel version will be much simpler, since we have continuously updated information to some central location we can make a decision point anywhere that we like.

Equally interesting things can be done when something very good is discovered (new pv node, mate threats in current pv, whatever...)

Do you agree that the parallel version doing this will be much easier to write?

Do you agree that it probably would never be written as a serial version?

Dann Corbit · Post by **Dann Corbit** » Fri Feb 27, 2015 1:59 am

bob wrote:
Dann Corbit wrote:
syzygy wrote:
Dann Corbit wrote:Would a parallel search that achieved 100x the NPS but a zero increase in Elo be implemented correctly?
No, but a parallel search that leaves a lot of potential untapped leaves a lot of potential untapped.

It seems people are thinking that Joona's patch is being criticised. It is not.

Before Joona's patch, most people were aware that SF scales poorly when going from 8 to 16 cores and that this in particular shows in the poor scaling of NPS. This means there is a treasure waiting to be found: one that fixes NPS scaling.

Joona comes up with a patch that seems to improve SMP significantly. But it does not fix NPS scaling. It means a treasure was found but that it is not the treasure we knew existed. That one is still hidden and waiting to be found.
I am neither criticizing nor defending Joona's patch.

I attempted to say that examination of NPS is not enough to know that an SMP implementation scales well.
Nobody has said otherwise. But if NPS scales poorly, it is an absolute certainty that the parallel search will scale at least that poorly and almost certainly worse since we have not seen any perfect speedups for more than a couple or maybe 4 cpus max.

No matter how many cores you use, if your single-cpu NPS increases by 8x, your speed up is bound by that 8x and will actually be significantly less due to the usual SMP search overhead issues. It is simple math and there is no way to escape it, ever.

If the search is unable to exploit parallel information, then I agree that this conclusion is inescapable.

bob · Post by **bob** » Fri Feb 27, 2015 2:00 am

Dann Corbit wrote:
syzygy wrote:
Dann Corbit wrote:
syzygy wrote:
Dann Corbit wrote:Don Homan does the same thing in his parallel search and he saw a speedup. But people do not do that in the normal alpha-beta serial search.
He certainly did not see a superlinear speedup.

I have no problem accepting that there might be good approaches to multi-threaded searching that to some extent leave the "try to mimick the serial search as best as we can"-approach. If a particular smp implementation happens to reach max Elo at 55 cores, because with 56 cores the extra synchronisation overhead outweighs the benefit of that extra core, then doing anything else with that core that makes remotely sense can only be an improvement. But you're just not going to get the performance of a single core search at 56x the speed.
What if it is possible to use the information gathered to reduce the branching factor by 0.15? By 0.25?
If pigs can fly in a parallel search, then they can fly equally well in a serial search. And probably better.

You can only use information gathered from nodes searched earlier in time. In a serial search it is much easier to control which nodes you search earlier. You are not dependent on other threads having done their work: if something is to be done at some other place of the tree, then a serial search can simply decide to do that first before continuing where it was searching before. This is actually precisely why there is always parallel overhead in multithreaded alpha-beta implementations.
Let's imagine, just for the sake of argument, that we have 50 threads, each of which is analyzing a sibling node.

The current state of analysis is sent to some central hub decision maker node.

The time allotted for each thread is one hour.

Now, some period into the search, our hub can see that 45 of the threads are faring very badly as to the score of the search from our {side to move} point of view.

The hub can now tell those 45 threads to work on something else.

It is true that we could write a serial search to do the same thing. But that serial search would be a lot more complicated. Would anyone really bother to write it?

The parallel version will be much simpler, since we have continuously updated information to some central location we can make a decision point anywhere that we like.

Equally interesting things can be done when something very good is discovered (new pv node, mate threats in current pv, whatever...)

Do you agree that the parallel version doing this will be much easier to write?

No. We already do this. IE alpha/beta pruning, where we decide to give up on this branch and move on. But if you can do it in parallel, one can do it serially and I don't see why it would not be simpler as well. Serial search doesn't require locks and synchronization and have to deal with race conditions as well.

Do you agree that it probably would never be written as a serial version?

No. If it worked in the parallel search, probably everyone would try it in the serial search as well. The current searches are not exactly simple as it stands, parallel or not.

All parallel searches of today handle fail highs (something very good) and fail lows (something very bad) equally well. Goal is to find the best move, not information about the rest, just prove them worse and move on.

Stockfish still scales poorly?

Re: Stockfish still scales poorly?

Re: Stockfish still scales poorly?

Re: Stockfish still scales poorly?

Re: Stockfish still scales poorly?

Re: Stockfish still scales poorly?

Re: Stockfish still scales poorly?

Re: Stockfish still scales poorly?

Re: Stockfish still scales poorly?

Re: Stockfish still scales poorly?

Re: Stockfish still scales poorly?