Stockfish still scales poorly?

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Stockfish still scales poorly?

Post by bob »

Dann Corbit wrote:
bob wrote:
Dann Corbit wrote:
Michel wrote:
Evert wrote:
BBauer wrote: NPS speedup may be an upper bound on parallel search speedup, but it is *not* an upper bound on Elo improvement.
I've seen this argument before, but no one has ever given an explanation I can understand for this claim.

Suppose that we have a super-speedup Elo improvement from a parallel search. Where does this improvement come from? The only possibility I can see is that it comes from the parallel search examining nodes that the single thread search did not examine, but should have (counter argument: if it didn't have to examine them, then the parallel search would not have needed to examine them either). In other words, the parallel search has a super-speedup Elo improvement because it fixes a defect in the normal linear search. If you would fix that defect, the super-parallel improvement would go away.

So as far as I can see, yes, you can have super-speedup Elo gain from a parallel search, but only if the non-parallel search is sub-optimal. Can someone explain why that wouldn't be true?

Note that saying that the non-parallel search is sub-optimal and can be improved doesn't imply that it would be in any way easy to do. Relying on the non-determinism of a parallel search to repair defects in the normal search sounds unreliable though, and I would also qualify it as bad engineering.
I think this is essentially Bob Hyatt's argument. He has another argument which I actually like more. An SMP search with N threads can be emulated with virtual threads by a single-threaded search running on a clock that is N times faster.

So this shows that any (hypothetical) super speedup with an SMP search can be realized using a single threaded search as well.
The fastest parallel sorting algorithms are different than the fastest serial sorting algorithms because you can do things in parallel that are not efficiently done serially.

Yes, you could do things serially like look ahead with your one thread and this is even done with pondering. But it is a risky strategy for a single threaded app and I doubt if anyone is doing it.
IF it would work for a parallel algorithm, EVERYBODY would be doing it in the serial algorithm.
Crafty does not ponder the opponent's move after making a move. Instead, Crafty ponders the expected response. This is an example of speculation. You told me that you got a speedup from that. Don Homan does the same thing in his parallel search and he saw a speedup. But people do not do that in the normal alpha-beta serial search.
Your sort statement is not exactly correct. The fastest sort algorithms are always going to be N log N. But those are hard to do in parallel because of the synchronization while sub-splitting the lists, and going to something that works better in parallel takes a big performance hit. For example, a simple bubble sort scales really well in terms of processors, but it starts off at N*N vs N log N, so it starts off significantly worse than the N log N sorts.
The fastest serial comparison sorts are O(N*log(N)).
Bitonic Sort is a so-so sort on serial machines, but one of the fastest on parallel machines because of the difference in architecture.
Sample Sort is generally considered the fastest sort on SMP machines, but it is not the fastest sort on serial machines.
Alpha/beta searches roughly sqrt(W^D) nodes. Even 1024 processors doesn't make minimax overtake alpha/beta, and minimax parallelizes about as efficiently as a search can, since everything has to be searched anyway.

For example, given W^D of 10 billion (pretty easy to search today even on one CPU)

alpha/beta = 100K

minimax = 10B. How many processors needed to reduce 10B to 100K (the break-even point)? 100,000. Not doable in any affordable machine today or in 5 years. So alpha/beta rules the roost, and traditional parallel search rules alpha/beta. And note that 100K processors just breaks even with alpha/beta, it would be no faster or better. To even 2x faster would require 200K processors which is even more "out there". A good parallel search today can certainly produce 8x faster with 16 cores. Minimax would be utterly hopeless.
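
For concreteness, here is the break-even arithmetic above written out as a tiny program; it uses only the numbers already given in the example:

Code: Select all

#include <cmath>
#include <cstdio>

int main() {
    double full_tree = 1e10;                   // W^D = 10 billion, as in the example
    double alphabeta = std::sqrt(full_tree);   // ~100K nodes with good move ordering
    double breakeven = full_tree / alphabeta;  // minimax cores needed just to tie
    std::printf("alpha/beta nodes : %.0f\n", alphabeta);
    std::printf("break-even cores : %.0f (2x faster needs %.0f)\n", breakeven, 2.0 * breakeven);
    return 0;
}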
I agree that minimax is the wrong algorithm for parallel search in chess.
There's a reason. It has been explained many times. Try this:

I just looked at a 120 minute game played on ICC (very long time control). In that game Crafty correctly predicted its opponent's move exactly 75% of the time.

Two choices:

(1) ponder just the expected move. 75% of the time you are correct, and 75% of the time you save ALL of the pondering time.

(2) Do a search over all of the opponent's moves. When he moves, you have been searching everything, so you can't move yet. You may have spent 1/4 of the time on the right move, 3/4 (or worse) of the time on the other moves. You get to keep right on searching, saving 1/4 of the time for each move you ponder.

If the game lasts 60 moves, option (1) will give you 45 moves for free (pondering correct move for correct amount of time). Option 2 will give you 1/4 of 60 moves in times saved. That is 1/3 the savings. So which is better? This is an old topic. Many have tried alternatives. Everyone ends up predicting a move and pondering that, because it is the most efficient/effective use of time...
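
A quick back-of-the-envelope version of that comparison, using the numbers from this example (75% prediction rate, a 60-move game, and the assumption that searching all replies spends about 1/4 of the time on the right one):

Code: Select all

#include <cstdio>

int main() {
    double moves = 60.0;                          // game length from the example
    double hit_rate = 0.75;                       // prediction rate from the ICC game
    double saved_predict_one = moves * hit_rate;  // 45 full move-times saved
    double saved_search_all  = moves * 0.25;      // 15 move-times saved (1/4 of the time per move)
    std::printf("ponder the predicted move: %.0f move-times saved\n", saved_predict_one);
    std::printf("ponder all replies       : %.0f move-times saved\n", saved_search_all);
    return 0;
}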

For your sort issue, how do you measure speedup? Hopefully not by taking an inferior sort algorithm and running it on 1 and then 32 cores. The correct comparison is an optimal serial sort on 1 CPU against any sort you want on the parallel machine. For reasonable numbers of cores, it is very difficult to beat the N log N sorts.
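
To put rough numbers on the bubble-sort point above: even granting the O(N*N) sort perfect parallel scaling and zero communication cost (a generous assumption), the break-even core count against a serial N log N sort is enormous:

Code: Select all

#include <cmath>
#include <cstdio>

int main() {
    double n = 1e6;                                       // items to sort
    double serial_nlogn = n * std::log2(n);               // ~2.0e7 comparisons for the serial sort
    double parallel_n2  = n * n;                          // ~1.0e12 comparisons for the O(N*N) sort
    double breakeven_cores = parallel_n2 / serial_nlogn;  // cores needed just to tie
    std::printf("break-even cores for N = %.0f: ~%.0f\n", n, breakeven_cores);
    return 0;
}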
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Stockfish still scales poorly?

Post by bob »

Dann Corbit wrote:
syzygy wrote:
Dann Corbit wrote:Would a parallel search that achieved 100x the NPS but a zero increase in Elo be implemented correctly?
No, but a parallel search that leaves a lot of potential untapped leaves a lot of potential untapped.

It seems people are thinking that Joona's patch is being criticised. It is not.

Before Joona's patch, most people were aware that SF scales poorly when going from 8 to 16 cores and that this in particular shows in the poor scaling of NPS. This means there is a treasure waiting to be found: one that fixes NPS scaling.

Joona comes up with a patch that seems to improve SMP significantly. But it does not fix NPS scaling. It means a treasure was found but that it is not the treasure we knew existed. That one is still hidden and waiting to be found.
I am neither criticizing nor defending Joona's patch.

I attempted to say that examination of NPS is not enough to know that an SMP implementation scales well.
Nobody has said otherwise. But if NPS scales poorly, it is an absolute certainty that the parallel search will scale at least that poorly, and almost certainly worse, since we have not seen any perfect speedups beyond a couple of CPUs, maybe 4 at most.

No matter how many cores you use, if your NPS only increases by 8x over the single-CPU NPS, your speedup is bounded by that 8x and will actually be significantly less due to the usual SMP search overhead issues. It is simple math and there is no way to escape it, ever.
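
A sketch of that bound with made-up but plausible numbers (an 8x NPS gain, say on 16 cores, and a parallel tree 30% larger than the serial one; neither number is a measurement of any engine):

Code: Select all

#include <algorithm>
#include <cstdio>

int main() {
    double nps_ratio = 8.0;   // NPS only scales 8x (e.g. on 16 cores)
    double overhead  = 1.30;  // parallel tree 30% larger than the serial tree
    double speedup   = nps_ratio / overhead;
    std::printf("hard upper bound : %.2fx (the NPS ratio)\n", nps_ratio);
    std::printf("expected speedup : %.2fx\n", std::min(nps_ratio, speedup));
    return 0;
}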
Dann Corbit
Posts: 12540
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Stockfish still scales poorly?

Post by Dann Corbit »

syzygy wrote:
Dann Corbit wrote:I attempted to say that examination of NPS is not enough to know that an SMP implementation scales well.
But knowing that NPS scales badly is certainly enough to conclude that the SMP implementation does not scale well Elo-wise.
Agreed, assuming that the SMP search is not using additional information gathered from the search to perform additional pruning.

I am not sure that I really disagree with anyone (you or Dr. Hyatt in particular) about this.

I do disagree that it is provably impossible for lower NPS parallel search to produce equal or better results. Put another way, I do not think it has been disproved that a parallel search can (on average) find a correct result faster than the simple multiplication of thread count and NPS for one thread would imply.

On the other hand, if there is no modification to the search to collect some sort of special information then I must agree that it cannot exceed that bound and will most certainly not even be able to meet it because no search is 100% efficient due to Amdahl's law.
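
For reference, Amdahl's law written out as a small sketch; the 5% serial fraction is an illustrative assumption, not a measurement of any engine:

Code: Select all

#include <cstdio>

// Speedup on 'cores' processors when a fraction 's' of the work is inherently serial.
static double amdahl(double s, int cores) {
    return 1.0 / (s + (1.0 - s) / cores);
}

int main() {
    const int core_counts[] = {4, 8, 16, 64};
    for (int n : core_counts)
        std::printf("%2d cores, 5%% serial work -> at most %.2fx speedup\n", n, amdahl(0.05, n));
    return 0;
}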

I also agree that if threads are sitting idle, your efficiency is harmed.

I further agree that if threads are contending for the same resource, your efficiency is harmed.

I further agree that if work is duplicated, your efficiency is harmed.

According to my understanding of the posts in this thread, I think we can all agree on this:
Bad NPS scaling when taking the same algorithm from single-threaded to multi-threaded is an indication of a problem.

I think that many disagree with me on this:
It may be possible for a parallel algorithm to exploit information that is easier for an SMP/NUMA program to gather, use it to enhance pruning, and therefore solve positions faster on average.

I also think that excellent NPS is not proof of an excellent SMP search. I have shown several trivial examples, but I also think far more esoteric examples exist in reality (for instance, there are definitely chess programs that scale well up to some thread count and then show a horrid-looking Elo drop as processors increase, with some of them searching more and more poorly as additional processors are added).

I do not intend to attack or defend Joona's SMP corrections except to say that if they increase the efficiency of Stockfish for the general case then they are a good idea. Quite frankly, I have not studied them.
syzygy
Posts: 5563
Joined: Tue Feb 28, 2012 11:56 pm

Re: Stockfish still scales poorly?

Post by syzygy »

petero2 wrote:
syzygy wrote:
Dann Corbit wrote:But we already know that a perfect search can search as few as the square root of nodes remaining times some constant.
We know that you can't do better than the minimal tree.
How is the minimal tree defined in the presence of LMR guided by the history heuristic?
Dann's question assumes regular alpha-beta. That's where "the square root of nodes" comes from.
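
For reference: for regular alpha-beta on a uniform tree of width W and depth D with perfect move ordering, Knuth and Moore showed the minimal tree visits

W^ceil(D/2) + W^floor(D/2) - 1

leaf nodes, which for even D is roughly 2*sqrt(W^D). That is where the "square root of nodes" rule of thumb comes from.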
syzygy
Posts: 5563
Joined: Tue Feb 28, 2012 11:56 pm

Re: Stockfish still scales poorly?

Post by syzygy »

Dann Corbit wrote:
syzygy wrote:
Dann Corbit wrote:Now, I ask the kind audience to engage in a mind experiment with me. We are going to visualize something. Consider a CPU collection with as many real CPU cores as we see in today's high end video cards. We have literally tens of thousands of high energy threads at our disposal. (...)
You actually believe that there might just be an algorithm (we only have to find it!) that would turn that machine into a better chess-playing entity than a single-threaded engine on a single CPU core that is running at tens of thousands times the speed of each of your threads?

Not very realistic.
The assumption (which is what rules out a super-speedup) is that no special information can be gathered from parallel search. We already know that that assumption is not true for sorting. Perhaps it is true for tree search, perhaps not. Your hand-waving is not convincing to me, despite the chorus of agreement from others.
Of course I'm not going to convince someone who does not see the lack of realism of the above.

Regarding parallel sorting, using a fixed number of cores you aren't going to do better than O(n log n). (Using O(n) cores you can do better, but computers with O(n) cores do not exist.)

That a parallel sorting algorithm on N cores deviates from quicksort is not because it can go more than Nx faster than 1-core quicksort, but because quicksort has a serial step that runs into trouble with Amdahl's law.

If you have some weird "parallel" hardware architecture that benefits sorting applications, for example using some kind of analog circuits, then try running a chess engine on it. But I think we are considering a system with N regular cores.

I think what you are confusing is:
A) with N > 1 cores the optimal algorithm might be quite different from the optimal single-core algorithm; and
B) with N > 1 cores that optimal parallel algorithm might be more than Nx as fast as the optimal single-core algorithm running on a single core.

I am not contesting A). But B) is nonsense.
APassionForCriminalJustic
Posts: 417
Joined: Sat May 24, 2014 9:16 am

Re: Stockfish still scales poorly?

Post by APassionForCriminalJustic »

bob wrote:
(...)

I just looked at a 120 minute game played on ICC (very long time control). In that game Crafty correctly predicted its opponent's move exactly 75% of the time.

(...)
This is off topic, but the player on ICC that you were playing is probably my PC account called Polyphemus, since I have played numerous 120-minute games versus Crafty version 25. I've been running 20 cores with Komodo 8 and the latest Stockfish development. Do you use ICC just for testing Crafty (esp. the latest Crafty version)? It's good to see you return to the Internet Chess Club; I always watch the interesting games versus your computer.
syzygy
Posts: 5563
Joined: Tue Feb 28, 2012 11:56 pm

Re: Stockfish still scales poorly?

Post by syzygy »

Btw, since you are explicitly not assuming that N cores could achieve more than Nx the NPS of a single core, the emulation/multiplexing argument becomes trivially true.

Suppose you have a parallel algorithm that uses 1 second to search M nodes on N cores. I claim that there is a serial algorithm that uses at most N seconds to search the same M nodes and therefore achieves the same results (up to the normal randomness of parallel search).

Proof: simply multiplex the single core to execute N threads in a sequential fashion. Since we are not assuming a superlinear NPS speedup, this single core will take at most N seconds to search the M nodes.
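
A toy sketch of that multiplexing construction (the VirtualSearcher struct and its node counts are placeholders for illustration, not anything from a real engine):

Code: Select all

#include <cstdint>
#include <vector>

// Stand-in for one "virtual thread": step() represents expanding a slice of
// nodes from that thread's share of the parallel search tree.
struct VirtualSearcher {
    std::uint64_t nodes = 0;
    bool done = false;
    void step(std::uint64_t budget) {
        nodes += budget;                 // placeholder for real search work
        if (nodes >= 1000000) done = true;
    }
};

int main() {
    const int N = 8;                     // emulated thread count
    const std::uint64_t SLICE = 10000;   // nodes per time slice
    std::vector<VirtualSearcher> threads(N);

    // Round-robin the N virtual threads on a single core. The nodes searched
    // and their interleaving match an N-core run; only the wall clock is
    // (at most) N times slower, as in the proof above.
    bool all_done = false;
    while (!all_done) {
        all_done = true;
        for (auto& t : threads)
            if (!t.done) { t.step(SLICE); all_done = false; }
    }
    return 0;
}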

Without the assumption of non-superlinear speedup things are a bit different. With N cores you'll also have more cpu cache and it might well be possible to make use of that extra superfast memory to get more than Nx the NPS of a single core. (You'd have to very carefully construct the engine that does that, though.)
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Stockfish still scales poorly?

Post by bob »

APassionForCriminalJustic wrote:
(...)
This is off topic, but the player on ICC that you were playing is probably my PC account called Polyphemus, since I have played numerous 120-minute games versus Crafty version 25. I've been running 20 cores with Komodo 8 and the latest Stockfish development. Do you use ICC just for testing Crafty (esp. the latest Crafty version)? It's good to see you return to the Internet Chess Club; I always watch the interesting games versus your computer.
I always run the latest version, which is just another sanity test of recent changes. Cluster testing is my primary methodology, but ICC is nice because of the variability of openings, time controls, etc. I went back and looked (this was the last 120 0 game I saw in the logs) and you were correct, it was definitely your account.

As far as testing goes, I don't put things up on ICC until they have gone through a cluster-testing session first. Easier to catch problems where things are more repeatable. No pondering, book randomness, etc. Once a change makes it through that, it goes on to ICC. My hardware is not particularly fast, as you might have noticed. A quad-core i7 iMac. But it is not bad.

And yes, Crafty has been playing there almost forever, dating back to 1995 in fact when it was still "ICS".
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Stockfish still scales poorly?

Post by bob »

Dann Corbit wrote:
(...)

On the other hand, if there is no modification to the search to collect some sort of special information then I must agree that it cannot exceed that bound and will most certainly not even be able to meet it because no search is 100% efficient due to Amdahl's law.

(...)
If you can collect this "special information" in parallel, multiplex a serial search and collect it there too. If it helps the parallel search, it would help the serial search equally.
Dann Corbit
Posts: 12540
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Stockfish still scales poorly?

Post by Dann Corbit »

syzygy wrote:
Dann Corbit wrote:Don Homan does the same thing in his parallel search and he saw a speedup. But people do not do that in the normal alpha-beta serial search.
He certainly did not see a superlinear speedup.

I have no problem accepting that there might be good approaches to multi-threaded searching that to some extent depart from the "try to mimic the serial search as best we can" approach. If a particular SMP implementation happens to reach max Elo at 55 cores, because with 56 cores the extra synchronisation overhead outweighs the benefit of that extra core, then doing anything else with that core that remotely makes sense can only be an improvement. But you're just not going to get the performance of a single-core search at 56x the speed.
What if it is possible to use the information gathered to reduce the branching factor by 0.15? By 0.25?

I am not saying that this can be done. But I don't know why it can't be done either.
1.4^56 = 152464944
1.2^56 = 27174
About 5611 times faster from a 0.2 reduction in the branching factor over a 56-ply search.
Note that I just chose 56 arbitrarily. It has nothing to do with your 56.
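
The arithmetic from those three lines, for anyone who wants to reproduce it:

Code: Select all

#include <cmath>
#include <cstdio>

int main() {
    double depth  = 56.0;                  // ply count from the example (chosen arbitrarily)
    double before = std::pow(1.4, depth);  // ~1.52e8
    double after  = std::pow(1.2, depth);  // ~2.7e4
    std::printf("1.4^56 / 1.2^56 = ~%.0f\n", before / after);  // ~5611
    return 0;
}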

This is also why I think that any really large Elo gains will come from search enhancements rather than from tweaking the evaluation function.