Real Speedup due to core doubling etc

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Real Speedup due to core doubling etc

Post by bob »

Vinvin wrote:
bob wrote:
Vinvin wrote:
bob wrote:
Vinvin wrote:
bob wrote:
CRoberson wrote:IIRC, the Rybka team knew of the equation NPS speedup = 1 + (N-1)*0.7, but they saw many customers getting confused when the TTP (Time To Ply) speedup didn't equal the same value as the NPS speedup due to the workload gain. So, they adjusted the equation to be a TTP equation.
That does not compute. :)

1 + (n-1)*.7 is time-to-depth speedup. Nothing to do with NPS. I have ALWAYS given speedup numbers as time-to-depth. NPS is irrelevant in that context.
Yes, but Vasik wanted a formula where "more NPS" always means "stronger" (taking into account the number of CPUs), so the NPS figure is converted with the help of a formula close to "1 + (N-1)*0.7".
Doesn't make any sense at all. More NPS is generally stronger. His formula doesn't seem to apply to any numbers I produce in Crafty. This sounds like more of his node nonsense to me... "In rybka I count nodes differently..."

What he REALLY meant was "In rybka, I obfuscate the node count to make it harder to figure out what I am doing."
4 Mn/s on 4 CPU is probably weaker than 3.5 Mn/s on 1 CPU.
That's why Rybka displays a converted number: 4 CPUs -> 1 + (0.7*3) = 3.1, so a ratio of 3.1/4 is applied to the displayed speed.
(obfuscated numbers on 1 CPU are another story ;) )
Rybka doesn't use 1 + .7*3. That is MY formula. And I agree, 4M with 1 cpu is stronger than 4M on 4 cpus, because of search overhead. Rybka used the n^.76 (or whatever the fraction was).

But in any case, when talking about SMP performance, NPS is not the right number to compare. time to depth is the reasonable measurement.
I can't find the formula used in Rybka ... only this post by Vasik in 2008:
Vasik wrote:When Rybka displays a 2x higher kn/s, she is effectively 2x faster and correspondingly stronger. It's no different than if you give her 2x more time.
Other engines don't make this adjustment, so it may look like they scale better. I don't really care about this - we're just going to do it the way I think is right.
http://rybkaforum.net/cgi-bin/rybkaforu ... 0#pid86950
I saw his exponential formula somewhere. But in any case,

s = 1 + (N-1) * 0.7

has been my linear approximation formula for 30 years. I personally believe time to depth is the ONLY valid measure of SMP performance. I think this "widening" stuff is just a lot of "stuff". If it was really worthwhile to be less selective in a parallel search, it would be JUST as worthwhile to do it in the sequential/serial search as well.
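As an aside for readers who want to see the two formulas side by side: the short sketch below (mine, not Bob's or Vas's code) tabulates the linear approximation against a power-law form of the kind attributed to Rybka above; the 0.76 exponent is only the value mentioned in the post ("or whatever the fraction was"), so treat it as a placeholder.

Code: Select all

# Sketch comparing two SMP speedup models from this thread (Python, for illustration).
# s_linear is the linear time-to-depth approximation: 1 + (N-1)*0.7.
# s_power is the power-law form attributed above to Rybka; the 0.76 exponent is a
# placeholder taken from the post, not a verified constant.

def s_linear(n_cores, factor=0.7):
    return 1.0 + (n_cores - 1) * factor

def s_power(n_cores, exponent=0.76):
    return n_cores ** exponent

if __name__ == "__main__":
    print(f"{'cores':>5} {'linear':>8} {'power':>8}")
    for n in (1, 2, 4, 8, 16, 32):
        print(f"{n:>5} {s_linear(n):>8.2f} {s_power(n):>8.2f}")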
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Real Speedup due to core doubling etc

Post by Laskos »

bob wrote: I saw his exponential formula somewhere. But in any case,

s = 1 + (N-1) * 0.7

has been my linear approximation formula for 30 years. I personally believe time to depth is the ONLY valid measure of SMP performance. I think this "widening" stuff is just a lot of "stuff". If it was really worthwhile to be less selective in a parallel search, it would be JUST as worthwhile to do it in the sequential/serial search as well.
1) Your linear approximation is not "stuff", it's "crap".
2) The "widening" "stuff" is confirmed experimentally on several top engines. Nothing beats an experiment, and certainly not your perorations.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Real Speedup due to core doubling etc

Post by bob »

Laskos wrote:
bob wrote: I saw his exponential formula somewhere. But in any case,

s = 1 + (N-1) * 0.7

has been my linear approximation formula for 30 years. I personally believe time to depth is the ONLY valid measure of SMP performance. I think this "widening" stuff is just a lot of "stuff". If it was really worthwhile to be less selective in a parallel search, it would be JUST as worthwhile to do it in the sequential/serial search as well.
1) Your linear approximation is not "stuff", it's "crap".
2) The "widening" "stuff" is confirmed experimentally on several top engines. Nothing beats an experiment, and certainly not your perorations.
Feel free to produce a BETTER linear approx. I've posted numbers for Crafty. Others have done the same in the past, for 1-8 cpus. And that fit is quite good. I guess you have a problem detecting "crap".

As far as the widening goes, did you read what I wrote? If "widening" helps a parallel search, that SAME widening would benefit the sequential search just as much. Hence it is basically "noise", "sloppiness in the parallel implementation", or whatever. Whether you like it or not. All your test proves is that the basic search has some room for improvement, regardless of the parallel algorithm used.

However, if you'd like to see the actual data that supports my SMP approximation, you can find the discussions here a few years back, using an 8-cpu opteron box, testing with 1, 2, 4 and 8 processors, just as I tested Cray Blitz in the DTS paper. Feel free to actually look at something rather than just spouting noise.
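For anyone wanting to reproduce this kind of time-to-depth measurement, the usual reduction is sketched below. This is a generic illustration (not Crafty's actual test harness); the position labels and times are hypothetical placeholders.

Code: Select all

from statistics import geometric_mean

# Hypothetical wall-clock times (seconds) to reach a fixed depth:
# (time on 1 core, time on N cores) per test position. Placeholder values.
timings = {
    "pos01": (120.0, 41.0),
    "pos02": (95.0, 33.0),
    "pos03": (150.0, 48.0),
}

# Per-position time-to-depth speedup = serial time / parallel time.
speedups = [t1 / tn for (t1, tn) in timings.values()]

# The geometric mean is a common way to summarize ratios across positions.
print("per-position:", [round(s, 2) for s in speedups])
print("geometric mean speedup:", round(geometric_mean(speedups), 2))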
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Real Speedup due to core doubling etc

Post by Laskos »

bob wrote:
Laskos wrote:
bob wrote: I saw his exponential formula somewhere. But in any case,

s = 1 + (N-1) * 0.7

has been my linear approximation formula for 30 years. I personally believe time to depth is the ONLY valid measure of SMP performance. I think this "widening" stuff is just a lot of "stuff". If it was really worthwhile to be less selective in a parallel search, it would be JUST as worthwhile to do it in the sequential/serial search as well.
1) Your linear approximation is not "stuff", it's "crap".
2) The "widening" "stuff" is confirmed experimentally on several top engines. Nothing beats an experiment, and certainly not your perorations.
Feel free to produce a BETTER linear approx. I've posted numbers for Crafty. Others have done the same in the past, for 1-8 cpus. And that fit is quite good. I guess you have a problem detecting "crap".

As far as the widening goes, did you read what I wrote? If "widening" helps a parallel search, that SAME widening would benefit the sequential search just as much. Hence it is basically "noise", "sloppiness in the parallel implementation", or whatever. Whether you like it or not. All your test proves is that the basic search has some room for improvement, regardless of the parallel algorithm used.

However, if you'd like to see the actual data that supports my SMP approximation, you can find the discussions here a few years back, using an 8-cpu opteron box, testing with 1, 2, 4 and 8 processors, just as I tested Cray Blitz in the DTS paper. Feel free to actually look at something rather than just spouting noise.
1) There were posts here recently with up to 16 and 32 cores, showing non-linear scaling. It's rather some power scaling. Then, by your crap formula for effective speedup (Time-To-Strength), Jonny with 2,000 cores should have been 500 Elo points above the opposition at the recent WCCC, where Junior on 24 cores in fact won. Rybka on 64 cores should have beaten Houdini on 16 cores, if linear scaling holds. You simply lack an elementary notion of scaling and of reading simple tests.

2) Widening, whether you like it or not, happens in 3 top engines: Komodo, Rybka, and recently even Stockfish. So Time-To-Depth (TTD) is not a universal measure of Time-To-Strength (Effective Speedup). You keep bragging about TTD as being always the correct thing to measure for Effective Speedup. Do you ever learn anything?
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Real Speedup due to core doubling etc

Post by bob »

Laskos wrote:
bob wrote:
Laskos wrote:
bob wrote: I saw his exponential formula somewhere. But in any case,

s = 1 + (N-1) * 0.7

has been my linear approximation formula for 30 years. I personally believe time to depth is the ONLY valid measure of SMP performance. I think this "widening" stuff is just a lot of "stuff". If it was really worthwhile to be less selective in a parallel search, it would be JUST as worthwhile to do it in the sequential/serial search as well.
1) Your linear approximation is not "stuff", it's "crap".
2) The "widening" "stuff" is confirmed experimentally on several top engines. Nothing beats an experiment, and certainly not your perorations.
Feel free to produce a BETTER linear approx. I've posted numbers for Crafty. Others have done the same in the past, for 1-8 cpus. And that fit is quite good. I guess you have a problem detecting "crap".

As far as the widening goes, did you read what I wrote? If "widening" helps a parallel search, that SAME widening would benefit the sequential search just as much. Hence it is basically "noise", "sloppiness in the parallel implementation", or whatever. Whether you like it or not. All your test proves is that the basic search has some room for improvement, regardless of the parallel algorithm used.

However, if you'd like to see the actual data that supports my SMP approximation, you can find the discussions here a few years back, using an 8-cpu opteron box, testing with 1, 2, 4 and 8 processors, just as I tested Cray Blitz in the DTS paper. Feel free to actually look at something rather than just spouting noise.
1) There were posts here recently with up to 16 and 32 cores, showing non-linear scaling. It's rather some power scaling. Then, by your crap formula for effective speedup (Time-To-Strength), Jonny with 2,000 cores should have been 500 Elo points above the opposition at the recent WCCC, where Junior on 24 cores in fact won. Rybka on 64 cores should have beaten Houdini on 16 cores, if linear scaling holds. You simply lack an elementary notion of scaling and of reading simple tests.

2) Widening, whether you like it or not, happens in 3 top engines: Komodo, Rybka, and recently even Stockfish. So Time-To-Depth (TTD) is not a universal measure of Time-To-Strength (Effective Speedup). You keep bragging about TTD as being always the correct thing to measure for Effective Speedup. Do you ever learn anything?
Why don't you learn to read. EVER seen me say that linear approximation works for 2000 cores? Ever seen me say that it even works for 128? No, because I haven't. So stop with these straw-man arguments, making up things I said and then trying to refute that. I CLEARLY said I had data through 16 cores where that formula was a reasonably accurate estimator. I've said nothing about larger numbers. I mentioned I had run on 64 cores but did not have enough time to run the necessary tests.

I don't dispute "widening" happens. I DO claim it is not a feature, however, it is a BUG. If widening is better than going deeper, the SAME would be true for the normal search as well. But we know it isn't and that depth always helps, for most.

To claim otherwise shows a complete lack of understanding of parallel vs serial algorithms. As far as "ever learning anything" it would seem that for you the answer is a resounding NO.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Real Speedup due to core doubling etc

Post by Laskos »

bob wrote:
Laskos wrote:
bob wrote:
Laskos wrote:
bob wrote: I saw his exponential formula somewhere. But in any case,

s = 1 + (N-1) * 0.7

has been my linear approximation formula for 30 years. I personally believe time to depth is the ONLY valid measure of SMP performance. I think this "widening" stuff is just a lot of "stuff". If it was really worthwhile to be less selective in a parallel search, it would be JUST as worthwhile to do it in the sequential/serial search as well.
1) Your linear approximation is not "stuff", it's "crap".
2) The "widening" "stuff" is confirmed experimentally on several top engines. Nothing beats an experiment, and certainly not your perorations.
Feel free to produce a BETTER linear approx. I've posted numbers for Crafty. Others have done the same in the past, for 1-8 cpus. And that fit is quite good. I guess you have a problem detecting "crap".

As far as the widening goes, did you read what I wrote? If "widening" helps a parallel search, that SAME widening would benefit the sequential search just as much. Hence it is basically "noise", "sloppiness in the parallel implementation", or whatever. Whether you like it or not. All your test proves is that the basic search has some room for improvement, regardless of the parallel algorithm used.

However, if you'd like to see the actual data that supports my SMP approximation, you can find the discussions here a few years back, using an 8-cpu opteron box, testing with 1, 2, 4 and 8 processors, just as I tested Cray Blitz in the DTS paper. Feel free to actually look at something rather than just spouting noise.
1) There were posts here recently with up to 16 and 32 cores, showing non-linear scaling. It's rather some power scaling. Then, by your crap formula for effective speedup (Time-To-Strength), Jonny with 2,000 cores should have been 500 Elo points above the opposition at the recent WCCC, where Junior on 24 cores in fact won. Rybka on 64 cores should have beaten Houdini on 16 cores, if linear scaling holds. You simply lack an elementary notion of scaling and of reading simple tests.

2) Widening, whether you like it or not, happens in 3 top engines: Komodo, Rybka, and recently even Stockfish. So Time-To-Depth (TTD) is not a universal measure of Time-To-Strength (Effective Speedup). You keep bragging about TTD as being always the correct thing to measure for Effective Speedup. Do you ever learn anything?
Why don't you learn to read. EVER seen me say that linear approximation works for 2000 cores? Ever seen me say that it even works for 128? No, because I haven't. So stop with these straw-man arguments, making up things I said and then trying to refute that. I CLEARLY said I had data through 16 cores where that formula was a reasonably accurate estimator. I've said nothing about larger numbers. I mentioned I had run on 64 cores but did not have enough time to run the necessary tests.

I don't dispute "widening" happens. I DO claim it is not a feature, however, it is a BUG. If widening is better than going deeper, the SAME would be true for the normal search as well. But we know it isn't and that depth always helps, for most.

To claim otherwise shows a complete lack of understanding of parallel vs serial algorithms. As far as "ever learning anything" it would seem that for you the answer is a resounding NO.
What's your point? That up to 8 cores you fitted 2-3 points (doublings) of what is anyway a monotonic function with a linear approximation? Don't you look at the asymptotic behavior? Your linear approximation is crap beyond 2-3 doublings, and you must have missed the tests here in CCC on 16 and 32 cores.
As for the TTD crap generalization you make, 3 of the top 6 engines do NOT obey it as far as Effective Speedup (TTS) goes. You seem to generalize your Crafty to "always". For many engines TTS can be derived only from a strength test with many games, using a reasonable time control. Effective Speedup depends on the time control used too, did you ever hear that? And never learnt it? TTS is NOT always TTD, NPS, or other misplaced generalizations.
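For context, the "Effective Speedup from games" that Kai refers to is usually backed out from a measured Elo difference with a rule of thumb like the one sketched below; the Elo-per-doubling value is an assumption that varies by engine and time control, which is exactly the dependence he is pointing at.

Code: Select all

def effective_speedup(elo_gain, elo_per_doubling):
    # Convert an Elo gain measured in games (N cores vs 1 core) into an
    # effective time-to-strength speedup, assuming each doubling of thinking
    # time is worth a fixed number of Elo at the chosen time control.
    return 2.0 ** (elo_gain / elo_per_doubling)

# Placeholder numbers: a 100 Elo gain at an assumed 70 Elo per doubling
# corresponds to roughly a 2.7x effective speedup.
print(round(effective_speedup(100.0, 70.0), 2))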
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Real Speedup due to core doubling etc

Post by bob »

Laskos wrote:
bob wrote:
Laskos wrote:
bob wrote:
Laskos wrote:
bob wrote: I saw his exponential formula somewhere. But in any case,

s = 1 + (N-1) * 0.7

has been my linear approximation formula for 30 years. I personally believe time to depth is the ONLY valid measure of SMP performance. I think this "widening" stuff is just a lot of "stuff". If it was really worthwhile to be less selective in a parallel search, it would be JUST as worthwhile to do it in the sequential/serial search as well.
1) Your linear approximation is not "stuff", it's "crap".
2) The "widening" "stuff" is confirmed experimentally on several top engines. Nothing beats an experiment, and certainly not your perorations.
Feel free to produce a BETTER linear approx. I've posted numbers for Crafty. Others have done the same in the past, for 1-8 cpus. And that fit is quite good. I guess you have a problem detecting "crap".

As far as the widening goes, did you read what I wrote? If "widening" helps a parallel search, that SAME widening would benefit the sequential search just as much. Hence it is basically "noise", "sloppiness in the parallel implementation", or whatever. Whether you like it or not. All your test proves is that the basic search has some room for improvement, regardless of the parallel algorithm used.

However, if you'd like to see the actual data that supports my SMP approximation, you can find the discussions here a few years back, using an 8-cpu opteron box, testing with 1, 2, 4 and 8 processors, just as I tested Cray Blitz in the DTS paper. Feel free to actually look at something rather than just spouting noise.
1) There were posts here recently with up to 16 and 32 cores, showing non-linear scaling. It's rather some power scaling. Then, by your crap formula for effective speedup (Time-To-Strength), Jonny with 2,000 cores should have been 500 Elo points above the opposition at the recent WCCC, where Junior on 24 cores in fact won. Rybka on 64 cores should have beaten Houdini on 16 cores, if linear scaling holds. You simply lack an elementary notion of scaling and of reading simple tests.

2) Widening, whether you like it or not, happens in 3 top engines: Komodo, Rybka, and recently even Stockfish. So Time-To-Depth (TTD) is not a universal measure of Time-To-Strength (Effective Speedup). You keep bragging about TTD as being always the correct thing to measure for Effective Speedup. Do you ever learn anything?
Why don't you learn to read. EVER seen me say that linear approximation works for 2000 cores? Ever seen me say that it even works for 128? No, because I haven't. So stop with these straw-man arguments, making up things I said and then trying to refute that. I CLEARLY said I had data through 16 cores where that formula was a reasonably accurate estimator. I've said nothing about larger numbers. I mentioned I had run on 64 cores but did not have enough time to run the necessary tests.

I don't dispute "widening" happens. I DO claim it is not a feature, however, it is a BUG. If widening is better than going deeper, the SAME would be true for the normal search as well. But we know it isn't and that depth always helps, for most.

To claim otherwise shows a complete lack of understanding of parallel vs serial algorithms. As far as "ever learning anything" it would seem that for you the answer is a resounding NO.
What's your point? That up to 8 cores you fitted 2-3 points (doublings) of what is anyway a monotonic function with a linear approximation? Don't you look at the asymptotic behavior? Your linear approximation is crap beyond 2-3 doublings, and you must have missed the tests here in CCC on 16 and 32 cores.
As for the TTD crap generalization you make, 3 of the top 6 engines do NOT obey it as far as Effective Speedup (TTS) goes. You seem to generalize your Crafty to "always". For many engines TTS can be derived only from a strength test with many games, using a reasonable time control. Effective Speedup depends on the time control used too, did you ever hear that? And never learnt it? TTS is NOT always TTD, NPS, or other misplaced generalizations.
Again, learn to read. I specifically pointed out I had tested on 2, 4, 8, 12 and 16 extensively, that I had some data on 32, but that beyond 32 all I had run on for any length of time was a 64 cpu Itanium box where most of the time was spent addressing the various NUMA issues. For my "approximation" it is quite accurate through 16. Not enough data for 32 and beyond, yet (if ever). But for usable / obtainable hardware, it is more than "good enough".

My DTS paper went through 16 processors, which was all we had at the time (Cray T90/16). So I am fully aware of how the numbers look. We don't have any 2000 core boxes, nor will we likely ever, except for message-passing clusters which are a completely different animal.

As far as "speedup depends on length of search" I believe _I_ am the first to point that out, in my DTS paper. Try again. Nothing new here so far other than that you don't understand the basic theoretical model of parallel vs serial search and why this "widening" is a crock caused by an overly aggressive search that is too selective.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Real Speedup due to core doubling etc

Post by Laskos »

bob wrote:
Laskos wrote:
bob wrote:
Laskos wrote:
bob wrote:
Laskos wrote:
bob wrote: I saw his exponential formula somewhere. But in any case,

s = 1 + (N-1) * 0.7

has been my linear approximation formula for 30 years. I personally believe time to depth is the ONLY valid measure of SMP performance. I think this "widening" stuff is just a lot of "stuff". If it was really worthwhile to be less selective in a parallel search, it would be JUST as worthwhile to do it in the sequential/serial search as well.
1) Your linear approximation is not "stuff", it's "crap".
2) The "widening" "stuff" is confirmed experimentally on several top engines. Nothing beats an experiment, and certainly not your perorations.
Feel free to produce a BETTER linear approx. I've posted numbers for Crafty. Others have done the same in the past, for 1-8 cpus. And that fit is quite good. I guess you have a problem detecting "crap".

As far as the widening goes, did you read what I wrote? If "widening" helps a parallel search, that SAME widening would benefit the sequential search just as much. Hence it is basically "noise", "sloppiness in the parallel implementation", or whatever. Whether you like it or not. All your test proves is that the basic search has some room for improvement, regardless of the parallel algorithm used.

However, if you'd like to see the actual data that supports my SMP approximation, you can find the discussions here a few years back, using an 8-cpu opteron box, testing with 1, 2, 4 and 8 processors, just as I tested Cray Blitz in the DTS paper. Feel free to actually look at something rather than just spouting noise.
1) There were posts here recently with up to 16 and 32 cores, showing non-linear scaling. It's rather some power scaling. Then, by your crap formula for effective speedup (Time-To-Strength), Jonny with 2,000 cores should have been 500 Elo points above the opposition at the recent WCCC, where Junior on 24 cores in fact won. Rybka on 64 cores should have beaten Houdini on 16 cores, if linear scaling holds. You simply lack an elementary notion of scaling and of reading simple tests.

2) Widening, whether you like it or not, happens in 3 top engines: Komodo, Rybka, and recently even Stockfish. So Time-To-Depth (TTD) is not a universal measure of Time-To-Strength (Effective Speedup). You keep bragging about TTD as being always the correct thing to measure for Effective Speedup. Do you ever learn anything?
Why don't you learn to read. EVER seen me say that linear approximation works for 2000 cores? Ever seen me say that it even works for 128? No, because I haven't. So stop with these straw-man arguments, making up things I said and then trying to refute that. I CLEARLY said I had data through 16 cores where that formula was a reasonably accurate estimator. I've said nothing about larger numbers. I mentioned I had run on 64 cores but did not have enough time to run the necessary tests.

I don't dispute "widening" happens. I DO claim it is not a feature, however, it is a BUG. If widening is better than going deeper, the SAME would be true for the normal search as well. But we know it isn't and that depth always helps, for most.

To claim otherwise shows a complete lack of understanding of parallel vs serial algorithms. As far as "ever learning anything" it would seem that for you the answer is a resounding NO.
What's your point? That up to 8 cores you fitted 2-3 points (doublings) of what is anyway a monotonic function with a linear approximation? Don't you look at the asymptotic behavior? Your linear approximation is crap beyond 2-3 doublings, and you must have missed the tests here in CCC on 16 and 32 cores.
As for the TTD crap generalization you make, 3 of the top 6 engines do NOT obey it as far as Effective Speedup (TTS) goes. You seem to generalize your Crafty to "always". For many engines TTS can be derived only from a strength test with many games, using a reasonable time control. Effective Speedup depends on the time control used too, did you ever hear that? And never learnt it? TTS is NOT always TTD, NPS, or other misplaced generalizations.
Again, learn to read. I specifically pointed out I had tested on 2, 4, 8, 12 and 16 extensively, that I had some data on 32, but that beyond 32 all I had run on for any length of time was a 64 cpu Itanium box where most of the time was spent addressing the various NUMA issues. For my "approximation" it is quite accurate through 16. Not enough data for 32 and beyond, yet (if ever). But for usable / obtainable hardware, it is more than "good enough".

My DTS paper went through 16 processors, which was all we had at the time (Cray T90/16). So I am fully aware of how the numbers look. We don't have any 2000 core boxes, nor will we likely ever, except for message-passing clusters which are a completely different animal.

As far as "speedup depends on length of search" I believe _I_ am the first to point that out, in my DTS paper. Try again. Nothing new here so far other than that you don't understand the basic theoretical model of parallel vs serial search and why this "widening" is a crock caused by an overly aggressive search that is too selective.
And after all your hard work you come with:

TTS (Time-To-Strength, or Effective Speedup) = 1 + (N-1) * 0.7

Where, here, is your "As far as "speedup depends on length of search" I believe _I_ am the first to point that out, in my DTS paper."? Your crap "formula" doesn't have the appropriate asymptotic behavior and doesn't have the depth dependency for specific engines.

Then, TTD clearly is NOT a universal measure of TTS, whatever buggy "crocks" you consider 3 of the top 6 programs to be. Buggy, too selective, whatever, they seem to scale reasonably with the number of cores. You simply seem to have a mental rupture: if it cannot be, then it cannot be, although I see a bunch of them, it cannot be because it cannot be, nobody understands parallel search, that's it.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Real Speedup due to core doubling etc

Post by bob »

Laskos wrote:
bob wrote:
Laskos wrote:
bob wrote:
Laskos wrote:
bob wrote:
Laskos wrote:
bob wrote: I saw his exponential formula somewhere. But in any case,

s = 1 + (N-1) * 0.7

has been my linear approximation formula for 30 years. I personally believe time to depth is the ONLY valid measure of SMP performance. I think this "widening" stuff is just a lot of "stuff". If it was really worthwhile to be less selective in a parallel search, it would be JUST as worthwhile to do it in the sequential/serial search as well.
1) Your linear approximation is not "stuff", it's "crap".
2) The "widening" "stuff" is confirmed experimentally on several top engines. Nothing beats an experiment, and certainly not your perorations.
Feel free to produce a BETTER linear approx. I've posted numbers for Crafty. Others have done the same in the past, for 1-8 cpus. And that fit is quite good. I guess you have a problem detecting "crap".

As far as the widening goes, did you read what I wrote? If "widening" helps a parallel search, that SAME widening would benefit the sequential search just as much. Hence it is basically "noise", "sloppiness in the parallel implementation", or whatever. Whether you like it or not. All your test proves is that the basic search has some room for improvement, regardless of the parallel algorithm used.

However, if you'd like to see the actual data that supports my SMP approximation, you can find the discussions here a few years back, using an 8-cpu opteron box, testing with 1, 2, 4 and 8 processors, just as I tested Cray Blitz in the DTS paper. Feel free to actually look at something rather than just spouting noise.
1) There were posts here recently with up to 16 and 32 cores, showing non-linear scaling. It's rather some power scaling. Then, by your crap formula for effective speedup (Time-To-Strength), Jonny with 2,000 cores should have been 500 Elo points above the opposition at the recent WCCC, where Junior on 24 cores in fact won. Rybka on 64 cores should have beaten Houdini on 16 cores, if linear scaling holds. You simply lack an elementary notion of scaling and of reading simple tests.

2) Widening, whether you like it or not, happens in 3 top engines: Komodo, Rybka, and recently even Stockfish. So Time-To-Depth (TTD) is not a universal measure of Time-To-Strength (Effective Speedup). You keep bragging about TTD as being always the correct thing to measure for Effective Speedup. Do you ever learn anything?
Why don't you learn to read. EVER seen me say that linear approximation works for 2000 cores? Ever seen me say that it even works for 128? No, because I haven't. So stop with these straw-man arguments, making up things I said and then trying to refute that. I CLEARLY said I had data through 16 cores where that formula was a reasonably accurate estimator. I've said nothing about larger numbers. I mentioned I had run on 64 cores but did not have enough time to run the necessary tests.

I don't dispute "widening" happens. I DO claim it is not a feature, however, it is a BUG. If widening is better than going deeper, the SAME would be true for the normal search as well. But we know it isn't and that depth always helps, for most.

To claim otherwise shows a complete lack of understanding of parallel vs serial algorithms. As far as "ever learning anything" it would seem that for you the answer is a resounding NO.
What's your point? That up to 8 cores you fitted 2-3 points (doublings) of what is anyway a monotonic function with a linear approximation? Don't you look at the asymptotic behavior? Your linear approximation is crap beyond 2-3 doublings, and you must have missed the tests here in CCC on 16 and 32 cores.
As for the TTD crap generalization you make, 3 of the top 6 engines do NOT obey it as far as Effective Speedup (TTS) goes. You seem to generalize your Crafty to "always". For many engines TTS can be derived only from a strength test with many games, using a reasonable time control. Effective Speedup depends on the time control used too, did you ever hear that? And never learnt it? TTS is NOT always TTD, NPS, or other misplaced generalizations.
Again, learn to read. I specifically pointed out I had tested on 2, 4, 8, 12 and 16 extensively, that I had some data on 32, but that beyond 32 all I had run on for any length of time was a 64 cpu Itanium box where most of the time was spent addressing the various NUMA issues. For my "approximation" it is quite accurate through 16. Not enough data for 32 and beyond, yet (if ever). But for usable / obtainable hardware, it is more than "good enough".

My DTS paper went through 16 processors, which was all we had at the time (Cray T90/16). So I am fully aware of how the numbers look. We don't have any 2000 core boxes, nor will we likely ever, except for message-passing clusters which are a completely different animal.

As far as "speedup depends on length of search" I believe _I_ am the first to point that out, in my DTS paper. Try again. Nothing new here so far other than that you don't understand the basic theoretical model of parallel vs serial search and why this "widening" is a crock caused by an overly aggressive search that is too selective.
And after all your hard work you come with:

TTS (Time-To-Strength, or Effective Speedup) = 1 + (N-1) * 0.7

Where, here, is your "As far as "speedup depends on length of search" I believe _I_ am the first to point that out, in my DTS paper."? Your crap "formula" doesn't have the appropriate asymptotic behavior and doesn't have the depth dependency for specific engines.

Then, TTD clearly is NOT a universal measure of TTS, whatever buggy "crocks" you consider 3 of the top 6 programs to be. Buggy, too selective, whatever, they seem to scale reasonably with the number of cores. You simply seem to have a mental rupture: if it cannot be, then it cannot be, although I see a bunch of them, it cannot be because it cannot be, nobody understands parallel search, that's it.
"where is the depth part?" I simply tested deeply enough to get good numbers. How simple was that? SMP performance doesn't continue to improve for each ply you go deeper, it just improves until you get beyond the "noise" point.

As far as the rest goes, I simply understand the alpha/beta model. And I don't accept handwaving nonsensical explanations. And this "widening" is EXACTLY that. Nothing more, nothing less. Just like Vincent's old super-linear speedup nonsense. It happens, but NOT with any regularity. Any good book explains why.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Real Speedup due to core doubling etc

Post by Laskos »

bob wrote:
Laskos wrote: And after all your hard work you come with:

TTS (Time-To-Strength, or Effective Speedup) = 1 + (N-1) * 0.7

Where, here, is your "As far as "speedup depends on length of search" I believe _I_ am the first to point that out, in my DTS paper."? Your crap "formula" doesn't have the appropriate asymptotic behavior and doesn't have the depth dependency for specific engines.

Then, TTD clearly is NOT a universal measure of TTS, whatever buggy "crocks" you consider 3 of the top 6 programs to be. Buggy, too selective, whatever, they seem to scale reasonably with the number of cores. You simply seem to have a mental rupture: if it cannot be, then it cannot be, although I see a bunch of them, it cannot be because it cannot be, nobody understands parallel search, that's it.
"where is the depth part?" I simply tested deeply enough to get good numbers. How simple was that? SMP performance doesn't continue to improve for each ply you go deeper, it just improves until you get beyond the "noise" point.

As far as the rest goes, I simply understand the alpha/beta model. And I don't accept handwaving nonsensical explanations. And this "widening" is EXACTLY that. Nothing more, nothing less. Just like Vincent's old super-linear speedup nonsense. It happens, but NOT with any regularity. Any good book explains why.
1) The dependence on depth is pretty significant even at high depths, so there is none of your "deeply enough". There were posts here showing that. And it's not "noise"; the speed-up is a monotonically increasing function of depth.

2) The results presented here (I'm too lazy to dig them up) showed nothing like linear behavior for top engines up to 32 cores. It was clearly some power function like 1 + (N-1)^a, or even logarithmic, a + b*log(N). Maybe only DTS would work this linear way to 16 cores, although it surely won't work linearly to 2,000 cores, and no top engine uses DTS.

3) Isn't your linear crap formula 1 + (N-1) * 0.7 showing some weird behavior, namely that with each doubling the scaling improves? Going from 2 -> 4 cores the doubling gives 1.82 in TTS, while 16 -> 32 cores gives 1.97 (see the sketch after this list). That's unreasonable; everybody knows (only you seem to miss it) that today's top engines don't gain much more than 1.4 (it depends on depth too) in TTS going from 16 to 32 cores at reasonable depths.

4) So, at least you admit that widening happens, maybe not with regularity, but with both of the top 2 engines, and with 3 of the top 6 engines. And for them TTD is a wrong substitute for the real TTS.
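To make the arithmetic in point 3 easy to check, here is a small sketch (mine, not either poster's) that computes the per-doubling ratios of the linear formula and contrasts them with the power and logarithmic forms mentioned in point 2; the exponent and coefficients are placeholders, since the thread does not give them.

Code: Select all

import math

def s_linear(n):
    return 1.0 + (n - 1) * 0.7             # the linear approximation

def s_power(n, a=0.7):
    return 1.0 + (n - 1) ** a              # 1 + (N-1)^a, 'a' is a placeholder

def s_log(n, a=1.0, b=3.0):
    return a + b * math.log2(n)            # a + b*log(N), placeholder coefficients

# The linear formula's per-doubling ratio grows toward 2 with more cores,
# reproducing the 1.82 (2->4) and 1.97 (16->32) figures quoted in point 3.
# Under the power form the ratio stays roughly constant, and under the
# logarithmic form it shrinks, which is the asymptotic behavior argued for.
for k in (2, 16):
    print(f"{k:>2} -> {2*k:<2}: linear {s_linear(2*k)/s_linear(k):.2f}, "
          f"power {s_power(2*k)/s_power(k):.2f}, log {s_log(2*k)/s_log(k):.2f}")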