SMP speed up

michiguel · Post by **michiguel** » Tue Sep 14, 2010 9:52 pm

This is a spin off of another thread.

I never gave it a lot of thought, but the well-known formula for Crafty

speedup = 1 + (NCPUS - 1) * 0.7

may indicate that the inefficiency is not related to Amdahl's law, even if this applies to a low number of CPUs. What is the cause of the parallel inefficiency? The shape of the tree? Still, it looks like it should either saturate quicker or the speedup with 2 cores should be higher than 1.7.

Was this investigated?

Miguel
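To make the contrast in the post above concrete, here is a minimal sketch that tabulates the linear rule next to Amdahl's law. The function names are made up for illustration, and the serial fraction is an assumption: it is chosen only so that both curves agree at 2 CPUs (speedup 1.7). It is not a measured property of Crafty.

```python
def crafty_speedup(ncpus: int, per_cpu_gain: float = 0.7) -> float:
    """Linear rule of thumb quoted in the thread: 1 + (NCPUS - 1) * 0.7."""
    return 1.0 + (ncpus - 1) * per_cpu_gain


def amdahl_speedup(ncpus: int, serial_fraction: float) -> float:
    """Amdahl's law: 1 / (s + (1 - s) / N), with s the serial fraction."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / ncpus)


# Assumption for illustration: pick s so that Amdahl's law also gives 1.7
# at 2 CPUs. Solving 1 / (s + (1 - s) / 2) = 1.7 gives s = 2 / 1.7 - 1.
SERIAL_FRACTION = 2.0 / 1.7 - 1.0

if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16, 32, 64):
        linear = crafty_speedup(n)
        amdahl = amdahl_speedup(n, SERIAL_FRACTION)
        print(f"{n:3d} CPUs  linear={linear:6.2f}  amdahl={amdahl:5.2f}")
```

Under that assumption the Amdahl curve can never exceed 1 / s (about 5.7), while the linear rule keeps growing by 0.7 per added CPU. That is the mismatch the question is about.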

rbarreira · Post by **rbarreira** » Tue Sep 14, 2010 10:04 pm

I think it pays off to simplify the formula for understanding:

1 + (NCPUS - 1) * 0.7 =

0.7 * NCPUS + 0.3

What strikes me as surprising about it is that it converges very quickly to 70% efficiency, which it will never drop below, of course. It implies that going from 1 CPU to 2 CPUs results in much more added overhead than going from 2 to 4 CPUs, and so on.

I think it's a strange formula, but if it has been correctly measured that way, what can I say... (though Robert did say he only measured up to 64 cores)
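The simplified form can be checked numerically. The sketch below (helper names are hypothetical) divides the formula by NCPUS to get the efficiency, 0.7 + 0.3 / NCPUS, and also prints how many CPUs' worth of work is lost in absolute terms, NCPUS minus the speedup.

```python
def crafty_speedup(ncpus: int, per_cpu_gain: float = 0.7) -> float:
    """Linear rule of thumb quoted in the thread: 1 + (NCPUS - 1) * 0.7."""
    return 1.0 + (ncpus - 1) * per_cpu_gain


def efficiency(ncpus: int) -> float:
    """Speedup per CPU; algebraically equal to 0.7 + 0.3 / NCPUS."""
    return crafty_speedup(ncpus) / ncpus


def lost_cpus(ncpus: int) -> float:
    """CPU-equivalents not turned into speedup: NCPUS - speedup."""
    return ncpus - crafty_speedup(ncpus)


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16, 32, 64):
        print(f"{n:3d} CPUs  efficiency={efficiency(n):.4f}  "
              f"lost={lost_cpus(n):5.2f}")
```

Efficiency falls from 1.00 to 0.85 to 0.775 for 1, 2, and 4 CPUs and then flattens toward 0.70, while the absolute loss keeps growing by 0.3 per added CPU (0, 0.3, 0.9, ...). Both readings come from the same formula.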

bob · Post by **bob** » Tue Sep 14, 2010 11:42 pm

michiguel wrote:This is a spin off of another thread.

I never gave it a lot of thought, but the well-known formula for Crafty

speedup = 1 + (NCPUS - 1) * 0.7

may indicate that the inefficiency is not related to Amdahl's law, even if this applies to a low number of CPUs. What is the cause of the parallel inefficiency? The shape of the tree? Still, it looks like it should either saturate quicker or the speedup with 2 cores should be higher than 1.7.

Was this investigated?

Miguel

I beat it to death for 1-8 cores. About all that can be said is that after the first CPU, every processor added makes the tree grow. If you run the old CB positions, which are all the positions from a single real game that everyone has seen, it is pretty consistent. There is lots of variation until you average over all the moves, and then that 30% extra nodes per CPU starts to settle down. You could drive this down by improving move ordering to get a higher percentage of fail highs on the first move. But I have been stuck at 90-92% first-move fail highs for years, which pretty well fixes the overhead, since one of every 10 splits (when I split right after the first move) is going to be at a bad point that adds overhead...

There is likely a mathematical model that considers fh % as computed in Crafty, and predicts the speedup, but I have never tried to quantify that at all since it would not be of any benefit. We all know that YBW depends on the first move being the one to cause a cutoff, else we think it is an ALL node.

In my dissertation I tackled this by searching perfectly ordered trees and got perfect linearity in the speedup. But I faked the eval to produce a monotonically decreasing value to simulate perfect ordering. I also tackled worst-first, which is pure minimax, and also got perfect speedups. But it was slow with no cutoffs at all. It is the very good trees that cause the problem.

bob · Post by **bob** » Tue Sep 14, 2010 11:49 pm

rbarreira wrote:I think it pays off to simplify the formula for understanding:

1 + (NCPUS - 1) * 0.7 =

0.7 * NCPUS + 0.3

What strikes me as surprising about it is that it converges very quickly to 70% efficiency, which it will never drop below, of course. It implies that going from 1 CPU to 2 CPUs results in much more added overhead than going from 2 to 4 CPUs, and so on.

How do you figure that?


There are lots of things at play. With 2 cpus, only 1 is doing unnecessary work at a split point that was poorly chosen. With 4, that goes up to 3. So although it might look like the overhead is going down, it really is not.

I think it's a strange formula, but if it has been correctly measured that way, what can I say... (though Robert did say he only measured up to 64 cores)

And remember, with 64 cores, you have 7 data points that don't lie on a perfectly straight line. I simply chose a good approximation. Originally that formula worked for 1-2-4. Then we added 8 and it still fit well. And then 16 and 32. Eugene ran it on a 64-core Itanium, which was possibly a tainted result since it was a different architecture, but the 1-2-4-8-16-32-64 numbers still stayed around that line. It is not a perfect fit. But it is a good first approximation, which is all I have ever called it...

Milos · Post by **Milos** » Wed Sep 15, 2010 12:20 am

Just a reply from the previous thread

"More on Bob's formula...

speedup_2=1+0.7=1.7
speedup_32=1+31*0.7=22.7
speedup_64=1+63*0.7=45.1

speedup_64/speedup_32=1.99!!!
speedup_2=1.7

17% more gain when going from 32 to 64 than from 1 to 2 cores."

I'll let people draw their own conclusions.
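The arithmetic quoted in this post can be reproduced with a few lines. This is only a sketch of what the linear formula itself implies when the CPU count doubles; it says nothing about measured results, and `doubling_gain` is a name introduced here for illustration.

```python
def crafty_speedup(ncpus: int, per_cpu_gain: float = 0.7) -> float:
    """Linear rule of thumb quoted in the thread: 1 + (NCPUS - 1) * 0.7."""
    return 1.0 + (ncpus - 1) * per_cpu_gain


def doubling_gain(ncpus: int) -> float:
    """Ratio of predicted speedups when going from ncpus to 2 * ncpus."""
    return crafty_speedup(2 * ncpus) / crafty_speedup(ncpus)


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16, 32):
        print(f"{n:2d} -> {2 * n:2d} CPUs: gain x{doubling_gain(n):.3f}")
    # Relative comparison made in the post: 32 -> 64 versus 1 -> 2.
    print(f"ratio of gains: {doubling_gain(32) / doubling_gain(1):.3f}")
```

For any line of the form a * N + b with b > 0, the doubling ratio rises toward 2 as N grows, so the 1.7 versus 1.99 gap is a property of the linear approximation itself.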

Milos · Post by **Milos** » Wed Sep 15, 2010 12:22 am

bob wrote:And remember, with 64 cores, you have 7 data points that don't lie on a perfectly straight line. I simply chose a good approximation. Originally that formula worked for 1-2-4. Then we added 8 and it still fit well. And then 16 and 32. Eugene ran it on a 64-core Itanium, which was possibly a tainted result since it was a different architecture, but the 1-2-4-8-16-32-64 numbers still stayed around that line. It is not a perfect fit. But it is a good first approximation, which is all I have ever called it...

I'll repeat. I hope you didn't publish a paper with these kinds of results, because that would be just a farce...

mhull · Post by **mhull** » Wed Sep 15, 2010 1:14 am

Milos wrote:Just a reply from the previous thread

"More on Bob's formula...

speedup_2=1+0.7=1.7
speedup_32=1+31*0.7=22.7
speedup_64=1+63*0.7=45.1

speedup_64/speedup_32=1.99!!!
speedup_2=1.7

17% more gain when going from 32 to 64 than from 1 to 2 cores."

I'll let people draw their own conclusions.

And your tests confirm Bob is wrong?

bob · Post by **bob** » Wed Sep 15, 2010 4:48 am

Milos wrote:
bob wrote:And remember, with 64 cores, you have 7 data points that don't lie on a perfectly straight line. I simply chose a good approximation. Originally that formula worked for 1-2-4. Then we added 8 and it still fit well. And then 16 and 32. Eugene ran it on a 64-core Itanium, which was possibly a tainted result since it was a different architecture, but the 1-2-4-8-16-32-64 numbers still stayed around that line. It is not a perfect fit. But it is a good first approximation, which is all I have ever called it...
I'll repeat. I hope you didn't publish a paper with these kinds of results, because that would be just a farce...

I wish you would offer some sort of supporting evidence for your arguments, otherwise you just look foolish.

bob · Post by **bob** » Wed Sep 15, 2010 4:49 am

mhull wrote:
Milos wrote:Just a reply from the previous thread

"More on Bob's formula...

speedup_2=1+0.7=1.7
speedup_32=1+31*0.7=22.7
speedup_64=1+63*0.7=45.1

speedup_64/speedup_32=1.99!!!
speedup_2=1.7

17% more gain when going from 32 to 64 than from 1 to 2 cores."

I'll let people draw their own conclusions.
And your tests confirm Bob is wrong?

No, his tests confirm he is either dense, an ass, or a troll. Nothing more or less. He is not offering _any_ data or observations of any kind, just boorish nonsense...

mhull · Post by **mhull** » Wed Sep 15, 2010 5:59 am

bob wrote:
mhull wrote:
Milos wrote:Just a reply from the previous thread

"More on Bob's formula...

speedup_2=1+0.7=1.7
speedup_32=1+31*0.7=22.7
speedup_64=1+63*0.7=45.1

speedup_64/speedup_32=1.99!!!
speedup_2=1.7

17% more gain when going from 32 to 64 than from 1 to 2 cores."

I'll let people draw their own conclusions.
And your tests confirm Bob is wrong?
No, his tests confirm he is either dense, an ass, or a troll. Nothing more or less. He is not offering _any_ data or observations of any kind, just boorish nonsense...

It was a dig, since he ran no tests.
