back to the Komodo SMP issue


Laskos
Posts: 9539
Joined: Wed Jul 26, 2006 8:21 pm
Full name: Kai Laskos

Re: back to the Komodo SMP issue

Post by Laskos » Tue Jul 02, 2013 4:44 pm

bob wrote:Here's the key point that nobody has even begun to address. Let's take a hypothetical position, where a program takes exactly 10 minutes to search to depth D. And when it finishes we measure an EBF of 2.0 exactly (to keep the numbers simple).

You run this with 4 cpus, and discover that to search to depth D, we get a time of 7 minutes. Speedup = 10 / 7 = 1.4x. SMP efficiency = 1.4 / 4.0 = .35 (or 35%)


Now, since the SMP efficiency is so low (as is the speedup) let's relax the rules a bit to make the search wider. This eliminates errors due to pruning/reductions, but it slows the search down, and now our serial search is obviously going to take much longer than 10 minutes. If you make the tree 2x larger, which is probably the minimum you would expect to produce a +130 Elo gain, you just doubled the search time for one cpu to 20 minutes. 20 / 1.4 = 14.2 minutes for the parallel search. So we made the search wider, only to give us a larger tree that we don't search very efficiently, and the time is longer than the original serial search, which means it is not going to gain a thing.
14.2 minutes on 4 cores versus 20 minutes on a single core, to the same _depth_. But it's twice as wide at the same depth. I see no theoretical problem in 1->4 cores always being 2 times wider, with time-to-depth always 1.4 instead of 2.8. The gain IS there, although theoretically sub-optimal, but only theoretically, because practically EBF could vary a little without changing the strength.
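The arithmetic in this exchange can be checked with a short sketch. All numbers are the hypothetical ones from the example above (10-minute serial search, 7-minute parallel search, 2x-wider tree), not measurements from any engine:

```python
# Hypothetical figures from the example above.
serial_time = 10.0            # minutes to depth D on 1 core
parallel_time = 7.0           # minutes to depth D on 4 cores
ncores = 4

speedup = serial_time / parallel_time      # 10/7, about 1.43x
efficiency = speedup / ncores              # about 0.36, i.e. the ~35% quoted

# Make the tree 2x wider: serial time doubles, and the parallel
# search (same speedup) must cover the larger tree too.
wide_serial = 2 * serial_time              # 20 minutes on 1 core
wide_parallel = wide_serial / speedup      # 14 minutes on 4 cores

print(speedup, efficiency, wide_parallel)
```

With the exact 10/7 ratio the wider parallel search takes 14.0 minutes (the 14.2 in the thread comes from using the rounded 1.4x), which is still well over the 10-minute serial baseline to the original depth.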

bob
Posts: 20642
Joined: Mon Feb 27, 2006 6:30 pm
Location: Birmingham, AL

Re: back to the Komodo SMP issue

Post by bob » Tue Jul 02, 2013 5:43 pm

Laskos wrote:
bob wrote:
Laskos wrote:
Two observations:

1) With EBF = (Total nodes searched) ^ (1/depth), 4 cores has higher EBF than 1 core in Komodo.

2) You wrote "SMP efficiency = 1.4 / 4.0 = .35 (or 35%)". A bit strange formula, I guess it should be log(1.4)/log(4) ~ 24%. The speed-up of 1 is no speed-up, not 1/4=25% efficiency.
No, the formula I gave is the one referenced in SMP publications. 100% means that all 4 (in this case) of the processors are busy doing nothing but useful work. No idle time, no extra work.
Then you have lousy definitions there. From 1.4 to 1.96 the efficiency increases by a factor of 2=log(1.96)/log(1.4). For speedup 4 you still have 100%=log(4)/log(4). Anyway if that definition quoted by you is the standard, let's use it, although it's crappy.
"An introduction to parallel programming" by Peter Pacheco, page 58, near the bottom.

speedup = T(serial) / T(parallel)

efficiency = speedup / nprocs

It's not a matter of liking or not liking a term, it is one that works, is clear, and understandable. If you use 4 cpus to search twice as fast, your efficiency is 50%, where if you use 4 cpus to search 4x as fast, your efficiency is 100%. Zero lost effort...




I realize the higher EBF in Komodo. No doubt about it. The only doubt I have is "how does it help?"
Lower EBF is not necessarily better than higher EBF. Maybe Komodo is performing close to optimally with EBF 1.8 on one core and EBF 1.9 on 4 cores. Even on one core, maybe both EBF's are close to optimal. Don mentioned that he could make Komodo 1 ply higher or lower (lower or higher EBF) without changing the strength.


SMP efficiency is defined as speedup/#processors. Optimal is obviously 1.0 (100%).

If you use one cpu, the time taken for the serial search equals the time taken for the one-cpu parallel search; the speedup is 1.0, which is no speedup at all. speedup / #processors = 1.0 / 1 = 100%, which is expected.
The problem with a larger EBF is it is a double-whammy. It FURTHER slows the smp search, since its efficiency is fixed by the algorithm used. Make it wider, you make it slower, period...

I don't see how that can be beneficial, particularly down at 4 cores...
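To put numbers on the "double-whammy": with EBF defined as (total nodes)^(1/depth), as earlier in the thread, even a small EBF increase multiplies the tree size sharply at typical depths. The 1.8 and 1.9 figures are the ones mentioned above; the depth is an arbitrary illustrative choice:

```python
# EBF = (total nodes) ** (1/depth), so total nodes = EBF ** depth.
depth = 25                      # arbitrary illustrative depth
nodes_narrow = 1.8 ** depth     # tree size at EBF 1.8
nodes_wide = 1.9 ** depth       # tree size at EBF 1.9

growth = nodes_wide / nodes_narrow
print(growth)                   # (1.9/1.8)**25, roughly 3.9x more nodes
```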

Laskos
Posts: 9539
Joined: Wed Jul 26, 2006 8:21 pm
Full name: Kai Laskos

Re: back to the Komodo SMP issue

Post by Laskos » Tue Jul 02, 2013 6:20 pm

bob wrote:
"An introduction to parallel programming" by Peter Pacheco, page 58, near the bottom.

speedup = T(serial) / T(parallel)

efficiency = speedup / nprocs

It's not a matter of liking or not liking a term, it is one that works, is clear, and understandable. If you use 4 cpus to search twice as fast, your efficiency is 50%, where if you use 4 cpus to search 4x as fast, your efficiency is 100%. Zero lost effort...
I don't care about crappy CS manuals; without logarithms it's a LOUSY and unintuitive definition. By the way, you chose numbers where our definitions coincide: 2/4 = log(2)/log(4) = 0.5 (50%), 4/4 = log(4)/log(4) = 1.0 (100%). But, to show the unintuitiveness of crappy computer science manuals: if the speedup 1->4 cores is 1, you say 25% efficiency; if the speedup 1->8 cores is still 1, you say 12.5% efficiency, while there is no speedup at all in either case. With my logarithmic definition they are both 0% SMP efficiency. There could be negative efficiency in my case if the speedup is below 1; in your case, going 1->4 with speedup 0.8 you get a meaningless 20% efficiency (which is higher than 1->8 cores with 1.2 speedup).

Anyway, matters of definition, I like more intuitive ones. Besides that, strength in Elos and depth in plies are logarithms of time, not linear.
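The two competing definitions in this exchange can be put side by side. The functions below are simply the formulas as stated: the textbook (Pacheco) one and the logarithmic alternative; the 0.8x and 1.2x numbers are the ones from the argument above:

```python
import math

def eff_linear(speedup, nprocs):
    # Textbook definition: speedup / nprocs
    return speedup / nprocs

def eff_log(speedup, nprocs):
    # Logarithmic definition: log(speedup) / log(nprocs)
    return math.log(speedup) / math.log(nprocs)

# No speedup at all (1.0x) on 4 vs. 8 cores:
print(eff_linear(1.0, 4), eff_linear(1.0, 8))   # 0.25 vs 0.125
print(eff_log(1.0, 4), eff_log(1.0, 8))         # 0.0 vs 0.0

# The anomaly pointed out above: a 0.8x slowdown on 4 cores scores
# higher under the linear definition than a 1.2x speedup on 8 cores.
print(eff_linear(0.8, 4), eff_linear(1.2, 8))   # 0.2 vs 0.15
print(eff_log(0.8, 4), eff_log(1.2, 8))         # negative vs positive
```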





Don
Posts: 5106
Joined: Tue Apr 29, 2008 2:27 pm

Re: back to the Komodo SMP issue

Post by Don » Tue Jul 02, 2013 7:46 pm

Laskos wrote:
I don't care about crappy CS manuals, without logarithms it's a LOUSY and unintuitive definition.
The textbook definition is idealized so that it's easy to reason about formally. Usually what one does is take some serial problem such as sorting or n-body simulation and then speed it up on multiple processors. That is FAR easier to reason about, because the only variable is speed, which is trivially measured and is also the easiest to deal with formally.

But often, even in the case of pure speed, you will choose a different algorithm if you intend to run it on multiple processors. The algorithm you choose may be just as good or perhaps not quite as good but have much more parallelism to exploit. I think even Bob should understand this but I'm not sure that he does.

The MIT Cilk team, for example, chose a different sort algorithm that was not quite as efficient as the unix sort routine but was more amenable to parallelism. They did this as a kind of demo project to show off Cilk a long time ago. Of course they didn't claim that they "sped up" shell sort with MP, but they did speed up the sort utility more than they could have if they had tried to parallelize the shell sort that was in the "sort" utility.

But chess has a metric that is far more appropriate here than speed and that is ELO. It's a lot more awkward to reason about formally and much harder to measure and it's also not easily treated in a formal way mathematically. I think Bob is taking the point of view that it is not relevant unless it has these characteristics or else he just wants to bicker about some formal definition that only he cares about.


Laskos
Posts: 9539
Joined: Wed Jul 26, 2006 8:21 pm
Full name: Kai Laskos

Re: back to the Komodo SMP issue

Post by Laskos » Tue Jul 02, 2013 8:04 pm

Don wrote:
The textbook definition is idealized so that it's easy to reason about formally. Usually what one does is takes some serial problem such as sorting, n-body simulation or some other things and then speed it up on multiple processors. That is FAR easier to reason about because the only variable is speed which is trivially measured and it's also the easiest to deal with formally.

But often, even in the case of pure speed, you will choose a different algorithm if you intend to run it on multiple processors. The algorithm you choose may be just as good or perhaps not quite as good but have much more parallelism to exploit. I think even Bob should understand this but I'm not sure that he does.

The MIT Cilk team, for example, chose a different sort algorithm that was not quite as efficient as the unix sort routine but was more amenable to parallelism. They did this as a kind of demo project to show off Cilk a long time ago. Of course they didn't claim that they "sped up" shell sort with MP, but they did speed up the sort utility more than they could have if they had tried to parallelize the shell sort that was in the "sort" utility.
I don't think we even got to this point of the discussion yet; here we will be stuck forever with Bob. It was just a matter of what metric to use to measure SMP efficiency as defined by Bob. I can pass on this, as it's not so important and is a matter of definition; the important thing is spelled out by you in the above paragraph.
But chess has a metric that is far more appropriate here than speed and that is ELO. It's a lot more awkward to reason about formally and much harder to measure and it's also not easily treated in a formal way mathematically. I think Bob is taking the point of view that it is not relevant unless it has these characteristics or else he just wants to bicker about some formal definition that only he cares about.
Yes, I will abandon the argument on this, I just wanted to show that the manual or his definition is flawed, for example 1->4 core speedup of 0.8 having larger "efficiency" than 1->8 core 1.2 speedup. Then, for chess, all these scalings need logarithms to transform into Elos or plies.

michiguel
Posts: 6389
Joined: Thu Mar 09, 2006 7:30 pm
Location: Chicago, Illinois, USA

Re: back to the Komodo SMP issue

Post by michiguel » Tue Jul 02, 2013 8:24 pm

bob wrote:
"An introduction to parallel programming" by Peter Pacheco, page 58, near the bottom.

speedup = T(serial) / T(parallel)

efficiency = speedup / nprocs
As I said before, if I am allowed to be pedantic, these definitions may not apply to chess engines. The formula assumes the same task was done in the times Ts and Tp. We know that is not what we have, or at least not necessarily. Time to depth could be an approximation, but it may break down (certainly the trees are not the same). In fact, that was the point of the original thread: that it breaks in Komodo's case.

Miguel
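Miguel's objection can be illustrated with made-up numbers: if the 4-core search deliberately grows a wider tree, Ts and Tp no longer measure the same task, and "time to depth" can show almost no speedup even while the extra cores do real work. Every figure below (EBFs, depth, NPS, the 2.8x throughput factor) is an assumption for illustration:

```python
# All numbers are hypothetical, for illustration only.
depth = 20
nodes_1cpu = 1.8 ** depth        # 1-core tree, EBF 1.8
nodes_4cpu = 1.9 ** depth        # 4-core tree, EBF 1.9: a DIFFERENT task

nps = 1_000_000                  # nodes/sec on 1 core (made up)
t_serial = nodes_1cpu / nps
t_parallel = nodes_4cpu / (2.8 * nps)   # assume 4 cores give 2.8x the NPS

# Ts/Tp now compares two different trees:
speedup = t_serial / t_parallel
print(speedup)   # about 0.95: "no speedup", despite 2.8x useful throughput
```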


bob
Posts: 20642
Joined: Mon Feb 27, 2006 6:30 pm
Location: Birmingham, AL

Re: back to the Komodo SMP issue

Post by bob » Tue Jul 02, 2013 8:54 pm

Laskos wrote:
I don't care about crappy CS manuals; without logarithms it's a LOUSY and unintuitive definition. By the way, you chose numbers where our definitions coincide: 2/4 = log(2)/log(4) = 0.5 (50%), 4/4 = log(4)/log(4) = 1.0 (100%). But, to show the unintuitiveness of crappy computer science manuals: if the speedup 1->4 cores is 1, you say 25% efficiency; if the speedup 1->8 cores is still 1, you say 12.5% efficiency, while there is no speedup at all in either case. With my logarithmic definition they are both 0% SMP efficiency. There could be negative efficiency in my case if the speedup is below 1; in your case, going 1->4 with speedup 0.8 you get a meaningless 20% efficiency (which is higher than 1->8 cores with 1.2 speedup).

Anyway, matters of definition, I like more intuitive ones. Besides that, strength in Elos and depth in plies are logarithms of time, not linear.
First, if you use 8 cores and your "speedup" is one (1), then you used one of those 8 cores effectively. An efficiency of 12.5% is perfectly rational and understandable. The term "speedup" is somewhat unfortunate in that regard, since there is no speedup. Unless you use it as an absolute, which is what is done here. Then a single cpu does give "something" over absolutely nothing. And there are examples of < 1.0x speedups for poor parallel algorithms, BTW.

These "crappy CS books" have been around longer than you or me. So we are pretty well stuck with existing terminology. The 25% and 12.5% numbers you don't like are still pretty intuitive. With 8 cores, you ARE using 1/8 of the hardware effectively. 12.5%, in fact...







bob
Posts: 20642
Joined: Mon Feb 27, 2006 6:30 pm
Location: Birmingham, AL

Re: back to the Komodo SMP issue

Post by bob » Tue Jul 02, 2013 9:01 pm

Don wrote:
The textbook definition is idealized so that it's easy to reason about formally. Usually what one does is take some serial problem such as sorting or n-body simulation and then speed it up on multiple processors. That is FAR easier to reason about, because the only variable is speed, which is trivially measured and is also the easiest to deal with formally.

But often, even in the case of pure speed, you will choose a different algorithm if you intend to run it on multiple processors. The algorithm you choose may be just as good or perhaps not quite as good but have much more parallelism to exploit. I think even Bob should understand this but I'm not sure that he does.
Since you want to make such statements, I'll make one that is crystal clear. I doubt there is ANYTHING about parallel computing that you know and I don't. You are simply being juvenile here. Every parallel programming textbook starts off with the precept of "in many cases, the best parallel algorithm is NOT the well-known serial algorithm." So please stop with these nonsensical "I doubt's" and the like.

I have been teaching this stuff since the middle 70's when parallel hardware became readily available. None of this is new.


The MIT Cilk team, for example, chose a different sort algorithm that was not quite as efficient as the unix sort routine but was more amenable to parallelism. They did this as a kind of demo project to show off Cilk a long time ago. Of course they didn't claim that they "sped up" shell sort with MP, but they did speed up the sort utility more than they could have if they had tried to parallelize the shell sort that was in the "sort" utility.
Nothing new here either. In fact, there are two factions in performance analysis. I am in the group that says you compare the best parallel algorithm on N cores to the best serial algorithm on 1 core, and that is your speedup. Some prefer to compare the best parallel algorithm on N cores and compare that to the same (inferior to best serial algorithm) algorithm on 1 core to compute the speedup. I (and many) believe this is artificial.

But chess has a metric that is far more appropriate here than speed and that is ELO. It's a lot more awkward to reason about formally and much harder to measure and it's also not easily treated in a formal way mathematically. I think Bob is taking the point of view that it is not relevant unless it has these characteristics or else he just wants to bicker about some formal definition that only he cares about.
I simply addressed a clearly defined and well-understood aspect of parallel search. The gain is going deeper. I explained the side-effect of going wider, as you are guaranteed to go slower, since the same parallel algorithm (apparently inefficient if we are talking 1.4x speedup) has to search a still larger tree.

I posed a specific question: "How can this possibly work?" As expected, I get hand-waving and "you are not thinking about this correctly."



bob
Posts: 20642
Joined: Mon Feb 27, 2006 6:30 pm
Location: Birmingham, AL

Re: back to the Komodo SMP issue

Post by bob » Tue Jul 02, 2013 9:06 pm

Laskos wrote:
Yes, I will abandon the argument on this, I just wanted to show that the manual or his definition is flawed, for example 1->4 core speedup of 0.8 having larger "efficiency" than 1->8 core 1.2 speedup. Then, for chess, all these scalings need logarithms to transform into Elos or plies.
Strawman argument, however. NOBODY is trying to scale to depth or Elo, directly. Step one is "how much faster can I go?" I am 100% certain that if you increase the speed of a processor 1%, my program gets stronger. How much stronger? That's a point where one might measure via test games. But faster = stronger, without exception, until we reach the end of the game. Can you gain Elo by going wider? Maybe or maybe not. If going wider slows you down, you have a gain and a loss that have to be measured. And combined.

My question at the start of this new thread was quite specific. Please explain to me how going wider can be better, when a parallel search is the thing doing the searching...

Given a small number of cores, a parallel speedup should be reasonable in the first place. If you have a poorly scaling algorithm, then going wider would seem to be worse, since the thing will slow down. Significantly.

I asked nothing more, nothing less. Yet it turns into insults rather than technical answers???

Typical....
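The "gain and a loss that have to be measured. And combined." can be framed with the common rough rule that strength scales with the logarithm of effective speed. Everything below is an assumption for illustration: the 70 Elo per doubling is a frequently quoted ballpark, and the +130 Elo for a 2x-wider tree is the figure used earlier in the thread:

```python
import math

ELO_PER_DOUBLING = 70.0    # assumed Elo gained per doubling of speed

def elo_from_speed(factor):
    # Elo change from multiplying effective speed by `factor` (log2 scale).
    return ELO_PER_DOUBLING * math.log2(factor)

widening_gain = 130.0                # assumed Elo worth of the 2x-wider tree
slowdown_loss = elo_from_speed(0.5)  # halving effective speed costs 70 Elo
net = widening_gain + slowdown_loss
print(net)   # +60 Elo under these assumptions; negative if widening is worth < 70
```

Under these made-up numbers widening wins; if the wider tree were worth less than one doubling (70 Elo), it would lose, which is exactly the trade-off being disputed.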

bob
Posts: 20642
Joined: Mon Feb 27, 2006 6:30 pm
Location: Birmingham, AL

Re: back to the Komodo SMP issue

Post by bob » Tue Jul 02, 2013 9:10 pm

michiguel wrote:
As I said before, if I am allowed to be pedantic, these definitions may not apply to chess engines. The formula assumes the same task was done in the times Ts and Tp. We know that is not what we have, or at least not necessarily. Time to depth could be an approximation, but it may break down (certainly the trees are not the same). In fact, that was the point of the original thread: that it breaks in Komodo's case.

Miguel
Back to my original question. In a parallel model, HOW can you search wider, given the basic assumption that the parallel search only produces a 1.4x speedup? If you increase the EBF, you increase the size of the tree, which hurts the parallel search even more.

That is the question I posed, and which absolutely no one has addressed. If the parallel speedup on this "wider tree" is 1.4x, one can't possibly claim it would be worse on a narrower tree. So how does one actually use the extra processing power doing something OTHER than searching in parallel???


