Stockfish still scales poorly?

syzygy · Post by **syzygy** » Sat Feb 28, 2015 12:42 am

voyagerOne wrote:I am trying to state that the rate of NPS may be dependent upon the search depth.

I've answered that, see my post.

syzygy · Post by **syzygy** » Sat Feb 28, 2015 12:46 am

Dann Corbit wrote:Yes, but this is completely different.

All the live searches can change their goals in real-time.
In a normal alpha-beta search, the goal does not change over an iteration.

Something along the lines you're proposing (but made much more conrete) could possibly be worth the effort if at some point adding extra cores does not increase the strength of a conventional multithreaded engine anymore. If the extra core does not help the conventional engine, then doing anything else with it can't be bad.

If you come up with a very good idea, it might even help before adding extra cores to a conventional engine stops contributing Elo.

But it will very definitely not make an engine with N cores play stronger than a single-core engine running at Nx the speed.

See my earlier post about confusing A) with B). Maybe I did not explain that point clearly enough?

Dann Corbit · Post by **Dann Corbit** » Sat Feb 28, 2015 6:22 am

syzygy wrote:
Dann Corbit wrote:The current state of analysis is sent to some central hub decision maker node.

The time allotted for each thread is one hour.

Now, some period into the search, our hub can see that 45 of the threads are faring very badly as to the score of the search from our {side to move} point of view.

The hub can now tell those 45 threads to work on something else.
Serial alpha-beta does that, but much more efficiently.

If the pv thread is doing a normal aspiration search and all the other threads are doing zero window searches, in what way is the normal serial search more efficient?

Do you agree that the parallel version doing this will be much easier to write?
No. I'm sorry, but what you are proposing cannot even be called an idea. I think Bob picked a good word when he said mystical.

syzygy · Post by **syzygy** » Sat Feb 28, 2015 3:12 pm

Dann Corbit wrote:
syzygy wrote:
Dann Corbit wrote:The current state of analysis is sent to some central hub decision maker node.

The time allotted for each thread is one hour.

Now, some period into the search, our hub can see that 45 of the threads are faring very badly as to the score of the search from our {side to move} point of view.

The hub can now tell those 45 threads to work on something else.
Serial alpha-beta does that, but much more efficiently.
If the pv thread is doing a normal aspiration search and all the other threads are doing zero window searches, in what way is the normal serial search more efficient?

Before I try to understand what you are proposing: is this an idea to get more than Nx speedup out of N threads? (I.e. N threads at 1 mnps outperforming 1 thread at N mpns?)

bob · Post by **bob** » Sat Feb 28, 2015 6:08 pm

Dann Corbit wrote:
syzygy wrote:
Dann Corbit wrote:The current state of analysis is sent to some central hub decision maker node.

The time allotted for each thread is one hour.

Now, some period into the search, our hub can see that 45 of the threads are faring very badly as to the score of the search from our {side to move} point of view.

The hub can now tell those 45 threads to work on something else.
Serial alpha-beta does that, but much more efficiently.
If the pv thread is doing a normal aspiration search and all the other threads are doing zero window searches, in what way is the normal serial search more efficient?

Do you agree that the parallel version doing this will be much easier to write?
No. I'm sorry, but what you are proposing cannot even be called an idea. I think Bob picked a good word when he said mystical.

The zero-window searches know the CORRECT aspiration window. The way you are doing it, you do not. That's why YBW works, you get the correct lower bound from the first move before searching all the others...

Dann Corbit · Post by **Dann Corbit** » Sat Feb 28, 2015 9:11 pm

syzygy wrote: Before I try to understand what you are proposing: is this an idea to get more than Nx speedup out of N threads? (I.e. N threads at 1 mnps outperforming 1 thread at N mpns?)

No.
Better NPS is clearly impossible, so far as I can see.

What I am saying is that better search might be possible when running parallel threads compared to single threads.

Information collected during the search process could possibly be used to prune better and create a smaller branching factor.

While it is in theory possible to collect the same information from several serial searches, in practice it seems to me more difficult because we can act on the information immediately as it arrives from parallel sources and it would be difficult to achieve the same thing serially.

I am in no way offering a proof. It is only a mind experiment. It seems possible to me that we might guide the energy of the search according to incoming information from multiple threads to add workers where it is needed and reduce effort where it is clearly wasted. For example, suppose that at parallel zero window search gets a big, fat fail high. We might subtract a thread from the least promising node and throw him at the fail high issue in tandem with the current thread. It seems logical to me that some especially clever person (such as yourself) could use the plethora of incoming information from multiple threads to write a more intelligent method of searching.

At the same time, I am not claiming this is a certainty or even that it will work at all. But it seems a possibility to me. It has been claimed that it is impossible for multiple threads to search *better* than a single thread at multiplied NPS. I agree that it is impossible to search faster. I do not agree that it is impossible to search better.

Dann Corbit · Post by **Dann Corbit** » Sat Feb 28, 2015 9:34 pm

Another notion for possibly better searching:
It is related to an innovation by Daniel Shawul.
His program can search SMP.
It can also search NUMA.
It can also search Cluster.
And it can do these things all at the same time.

Now, when we look at the SMP loss with multiple threads, even the best programs seem to have some sort of logarithmic looking loss when enough threads are thrown at the problem.

So my idea is this:
We find some happy medium point where adding more threads is not gaining much and call this the crossover point.

Let's pick some arbitrary position which has 36 possible moves.
Let's consider a cluster of 32 machines each having 32 threads.

Perhaps 16 threads is where the search starts to have diminishing returns.

So, for our root search, we launch one process with 16 threads analyzing the most promising node, and a second process examining the first sibling.
Our second machine analyzes siblings 3,4 and so on until 18 machines are used up.

I do realize that the zero window searches will run much faster, so the real distribution of work would require some coordination.

Now, we have 14 machines left. So 10 of them get the most promising forward nodes. One machine is used as a mate proof search machine for areas with king safety danger and one machine is used as a pv polisher (it analyzes the best pv collected so far node by node on some heartbeat interval looking for future danger) and one machine is used as a coordinator or possibly a endgame tablebase oracle and opening database oracle or some combination of functions.

The reason this idea might have merit is that we give the nodes big chunks of work so that communication effort is not the dominating factor.

Dann Corbit · Post by **Dann Corbit** » Sat Feb 28, 2015 9:53 pm

bob wrote:
Dann Corbit wrote:
syzygy wrote:
Dann Corbit wrote:The current state of analysis is sent to some central hub decision maker node.

The time allotted for each thread is one hour.

Now, some period into the search, our hub can see that 45 of the threads are faring very badly as to the score of the search from our {side to move} point of view.

The hub can now tell those 45 threads to work on something else.
Serial alpha-beta does that, but much more efficiently.
If the pv thread is doing a normal aspiration search and all the other threads are doing zero window searches, in what way is the normal serial search more efficient?

Do you agree that the parallel version doing this will be much easier to write?
No. I'm sorry, but what you are proposing cannot even be called an idea. I think Bob picked a good word when he said mystical.
The zero-window searches know the CORRECT aspiration window. The way you are doing it, you do not. That's why YBW works, you get the correct lower bound from the first move before searching all the others...

Aspiration windows are arbitrary estimates which can very possibly be improved by information from multiple parallel sources.

It is the base search that needs an aspiration window.

The zero window searches only need a single number (alpha) because by definition they are one unit wide.
value = PVS(-(alpha+1),-alpha)
If the search exceeds alpha, then we will need beta.
Alpha and beta themselves are only arbitrary estimates and are almost never the true values but are only the best possible estimates based on current and previous searches.
At any point in time, we can collect a best possible estimate for alpha and beta from all of the threads that are performing the search in some area of shared memory. It seems to me that we can form a much better estimate in this way. Of course, the estimate will vary over time and the search will be less deterministic and therefore much harder to debug.

bob · Post by **bob** » Sat Feb 28, 2015 10:09 pm

Dann Corbit wrote:
syzygy wrote: Before I try to understand what you are proposing: is this an idea to get more than Nx speedup out of N threads? (I.e. N threads at 1 mnps outperforming 1 thread at N mpns?)
No.
Better NPS is clearly impossible, so far as I can see.

What I am saying is that better search might be possible when running parallel threads compared to single threads.

Information collected during the search process could possibly be used to prune better and create a smaller branching factor.

While it is in theory possible to collect the same information from several serial searches, in practice it seems to me more difficult because we can act on the information immediately as it arrives from parallel sources and it would be difficult to achieve the same thing serially.

I am in no way offering a proof. It is only a mind experiment. It seems possible to me that we might guide the energy of the search according to incoming information from multiple threads to add workers where it is needed and reduce effort where it is clearly wasted. For example, suppose that at parallel zero window search gets a big, fat fail high. We might subtract a thread from the least promising node and throw him at the fail high issue in tandem with the current thread. It seems logical to me that some especially clever person (such as yourself) could use the plethora of incoming information from multiple threads to write a more intelligent method of searching.

At the same time, I am not claiming this is a certainty or even that it will work at all. But it seems a possibility to me. It has been claimed that it is impossible for multiple threads to search *better* than a single thread at multiplied NPS. I agree that it is impossible to search faster. I do not agree that it is impossible to search better.

Again, if you don't go faster, then a multiplexed search can get this same "better search results".

bob · Post by **bob** » Sat Feb 28, 2015 10:13 pm

Dann Corbit wrote:Another notion for possibly better searching:
It is related to an innovation by Daniel Shawul.
His program can search SMP.
It can also search NUMA.
It can also search Cluster.
And it can do these things all at the same time.

Now, when we look at the SMP loss with multiple threads, even the best programs seem to have some sort of logarithmic looking loss when enough threads are thrown at the problem.

So my idea is this:
We find some happy medium point where adding more threads is not gaining much and call this the crossover point.

Let's pick some arbitrary position which has 36 possible moves.
Let's consider a cluster of 32 machines each having 32 threads.

Perhaps 16 threads is where the search starts to have diminishing returns.

So, for our root search, we launch one process with 16 threads analyzing the most promising node, and a second process examining the first sibling.
Our second machine analyzes siblings 3,4 and so on until 18 machines are used up.

I do realize that the zero window searches will run much faster, so the real distribution of work would require some coordination.

Now, we have 14 machines left. So 10 of them get the most promising forward nodes. One machine is used as a mate proof search machine for areas with king safety danger and one machine is used as a pv polisher (it analyzes the best pv collected so far node by node on some heartbeat interval looking for future danger) and one machine is used as a coordinator or possibly a endgame tablebase oracle and opening database oracle or some combination of functions.

The reason this idea might have merit is that we give the nodes big chunks of work so that communication effort is not the dominating factor.

That's all well and good. And it has already been discussed many times. But you are ignoring the original point, NPS. NPS provides the upper bound for ANY possible parallel improvement. If you want to use 1/2 your NPS to search in parallel, and 1/2 of your NPS to search more moves (prune/reduce less, etc) that perfectly OK. But the primary gain comes from pure speed. Until you reach the point where you can't get any more speed out of it. If, at that point, you choose to start reducing the speculative stuff, that can work. IF you are sure that you can't squeeze any more parallel performance. And no matter what you do, your total Elo gain will STILL be bound by that NPS upper bound, and in real-world-cases, the actual performance bound (and Elo gain) will be LOWER than the NPS gain.

Stockfish still scales poorly?

Re: Stockfish still scales poorly?

Re: Stockfish still scales poorly?

Re: Stockfish still scales poorly?

Re: Stockfish still scales poorly?

Re: Stockfish still scales poorly?

Re: Stockfish still scales poorly?

Re: Stockfish still scales poorly?

Re: Stockfish still scales poorly?

Re: Stockfish still scales poorly?

Re: Stockfish still scales poorly?