back to the Komodo SMP issue
Moderators: Harvey Williamson, Dann Corbit, hgm
Here's the key point that nobody has even begun to address. Let's take a hypothetical position, where a program takes exactly 10 minutes to search to depth D. And when it finishes we measure an EBF of 2.0 exactly (to keep the numbers simple).
You run this with 4 cpus, and discover that to search to depth D, we get a time of 7 minutes. Speedup = 10 / 7 = 1.4x. SMP efficiency = 1.4 / 4.0 = .35 (or 35%)
Now, since the SMP efficiency is so low (as is the speedup), let's relax the rules a bit to make the search wider. This eliminates errors due to pruning/reductions, but it slows the search down, and now our serial search is obviously going to take much longer than 10 minutes. If you make the tree 2x larger, which is probably the minimum you would expect to produce a +130 Elo gain, you just doubled the search time for one cpu to 20 minutes. 20 / 1.4 ≈ 14.3 minutes for the parallel search. So we made the search wider, only to get a larger tree that we don't search very efficiently, and the time is longer than the original serial search, which means it is not going to gain a thing.
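The arithmetic above can be sketched directly. This is a minimal illustration using the post's hypothetical numbers; note that with the exact speedup 10/7 ≈ 1.43 the widened parallel search takes 14 minutes, while rounding the speedup to 1.4 gives the ~14.3 quoted.

```python
def smp_stats(serial_time, parallel_time, n_cpus):
    """Classic definitions: speedup = T_serial / T_parallel,
    SMP efficiency = speedup / #processors."""
    speedup = serial_time / parallel_time
    efficiency = speedup / n_cpus
    return speedup, efficiency

# Hypothetical numbers from the post: 10 min serial, 7 min on 4 cpus.
speedup, eff = smp_stats(10.0, 7.0, 4)   # ~1.43x speedup, ~36% efficiency

# Widen the tree 2x: the serial time doubles, and the parallel search
# only manages the same speedup on the larger tree.
widened_serial = 2 * 10.0                # 20 minutes
widened_parallel = widened_serial / speedup
# 14 minutes -- still slower than the original 10-minute serial search,
# so the widening buys nothing at this speedup.
```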
All this is predicated on the serial search being well-tuned regarding reductions and pruning, so that there is little expected gain from further tweaking when using only one cpu. After thinking about this from every reasonable angle, I am STILL not convinced that this concept is valid. You could claim "OK, if we use 2 cpus we get 1.4x, and if we use 4 cpus we STILL get 1.4x; that clearly shows the extra 2 cpus are not helping at all." But exactly HOW would you get them doing something helpful by making the tree wider, once you know that the FINAL tree result is still just 1.4x faster with 4 cores?
If one uses the usual rule of thumb that doubling speed = +70 Elo, we need two doublings to reach the +130 Elo number that was discussed. It is pretty straightforward to figure out what kind of EBF increase one needs (given fixed time) to produce about 2/3 of that Elo gain (widening the tree, given an SMP speedup of 1.4x). I almost went down that road, but as I started, the idea simply looked completely unsound for the reasons given above. As you go wider, you lose depth since the SMP search is doing so poorly. I don't see any way around this.
So perhaps someone is willing to participate in a technical discussion that is based on something beyond vague comments to come up with a way to make this actually work.
I think a reasonable starting point is depth=24, time=10 minutes, EBF=2.0, 1 cpu; depth=24, time=7 minutes, EBF=2.0, 4 cpus. Then figure out how we get to +130 Elo given that we are only expecting maybe 15-20 Elo from the SMP search gain...
Of course, in a real game, that last depth=24 will increase somewhat, but will not reach 25 since we need a 2x speedup to reach the next ply with an EBF of 2.0...
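As a cross-check of that last claim, the extra depth a given speedup buys follows directly from the EBF. A sketch with the post's numbers; each extra ply costs a factor of EBF in time:

```python
import math

ebf = 2.0       # effective branching factor from the example
speedup = 1.4   # SMP speedup from the example

# Each ply multiplies the time by ebf, so a speedup of s buys
# log(s) / log(ebf) extra plies at fixed time.
extra_plies = math.log(speedup) / math.log(ebf)
# ~0.49 plies: depth 24 becomes roughly 24.5, short of ply 25,
# which would require a full 2x speedup when EBF = 2.0.
```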
Re: back to the Komodo SMP issue
bob wrote: [full quote snipped; see the post above]

Two observations:
1) With EBF = (Total nodes searched) ^ (1/depth), 4 cores has higher EBF than 1 core in Komodo.
2) You wrote "SMP efficiency = 1.4 / 4.0 = .35 (or 35%)". A bit strange formula; I guess it should be log(1.4)/log(4) ≈ 24%. A speedup of 1 is no speedup at all, not 1/4 = 25% efficiency.
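The two competing definitions can be put side by side. A small sketch; neither formula is specific to any engine, and the numbers are the thread's own example:

```python
import math

def linear_efficiency(speedup, n_cpus):
    """Classic parallel-computing definition: speedup / #processors."""
    return speedup / n_cpus

def log_efficiency(speedup, n_cpus):
    """Alternative suggested here: log(speedup) / log(#processors),
    so no speedup (1.0) scores 0% and a perfect n-fold speedup 100%."""
    return math.log(speedup) / math.log(n_cpus)

# For a 1.4x speedup on 4 cores:
#   linear: 1.4 / 4 = 0.35          (35%)
#   log:    log(1.4)/log(4) ~ 0.24  (24%)
# For no speedup at all (1.0x):
#   linear gives 25%, log gives 0%.
```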

Re: back to the Komodo SMP issue
Perhaps another definition of SMP efficiency would be better in this case:
1 core: the engine has a base Elo at time control y (without permanent brain, to make testing easier - the quality of the permanent brain doesn't contribute)
n cores: the engine gets an Elo of base+x at time control y
Now, what time factor is needed for one core to reach Elo base+x as well?
As it is not easy to compute the exact time factor, it can be approximated by playing matches with one core at time controls y, 2*y, 4*y, ... and at time control y with 2 cores, 4 cores, ...
This would eliminate the reached depth as the key number, since it is more important to know how well the engine uses the computing resources to make a move.
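The approximation proposed above could be sketched as follows. Everything here is hypothetical: the ladder of match results and the 4-core Elo are made-up numbers, and log-linear interpolation between time-control doublings is just one reasonable choice for filling the gaps between measured points.

```python
# Hypothetical 1-core results: (time factor vs. base control y, Elo vs. a
# fixed reference). In practice these come from matches at y, 2*y, 4*y.
one_core_ladder = [(1, 0.0), (2, 70.0), (4, 140.0)]
four_core_elo = 50.0  # hypothetical n-core result at time control y

def interpolate_time_factor(ladder, target_elo):
    """Find the 1-core time factor whose Elo matches target_elo,
    interpolating log-linearly between adjacent ladder points."""
    for (f0, e0), (f1, e1) in zip(ladder, ladder[1:]):
        if e0 <= target_elo <= e1:
            t = (target_elo - e0) / (e1 - e0)
            return f0 * (f1 / f0) ** t
    raise ValueError("target Elo outside the measured range")

factor = interpolate_time_factor(one_core_ladder, four_core_elo)
# ~1.64x: with these made-up numbers, 4 cores are 'worth' a 1.64x
# faster single core.
```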
Re: back to the Komodo SMP issue
Laskos wrote: [quote snipped; see above]

No, the formula I gave is the one referenced in SMP publications. 100% means that all 4 (in this case) of the processors are busy doing nothing but useful work. No idle time, no extra work.
I realize the higher EBF in Komodo. No doubt about it. The only doubt I have is "how does it help?"
SMP efficiency is defined as speedup/#processors. Optimal is obviously 1.0 (100%).
If you use one cpu, the time taken for a serial search = time taken for the one-cpu parallel search, so the speedup is 1.0, which is no speedup at all. Speedup / #processors = 1.0 / 1 = 100%, which is expected.
Re: back to the Komodo SMP issue
Matthias Hartwich wrote: [quote snipped; see above]

That's an OK measurement; it is just not the classic SMP efficiency term, which has been in use for 40 years now. Perhaps "SMP effectiveness" might be the term to use, although it would need to be carefully defined to be useful in the future...
BTW, for nearly every program tested to date, the SMP speedup does that perfectly. If you use 3 cpus and run 2x faster (time to depth, which factors in search overhead), then you get the same Elo benefit as if you just ran on a 2x faster single-cpu box...
For almost any program, if you double the speed, you get something like 70 Elo. Whether you double the speed by making the CPU faster, or by using more than one CPU.
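That rule of thumb converts directly between speedup and Elo. A sketch; the +70-per-doubling constant is the approximation used in this thread, not a measured value:

```python
import math

ELO_PER_DOUBLING = 70.0  # rule of thumb from the discussion

def elo_from_speedup(speedup):
    """Elo gain from running effectively `speedup` times faster."""
    return ELO_PER_DOUBLING * math.log2(speedup)

def speedup_for_elo(elo_gain):
    """Effective speedup needed to gain `elo_gain` Elo."""
    return 2.0 ** (elo_gain / ELO_PER_DOUBLING)

# A 1.4x SMP speedup is worth about 34 Elo by this rule, and a
# +130 Elo gain would require an effective speedup of about 3.6x.
```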
Re: back to the Komodo SMP issue
bob wrote: [quote snipped; see above]

Then you have lousy definitions there. From 1.4 to 1.96 the efficiency increases by a factor of 2 = log(1.96)/log(1.4). For speedup 4 you still have 100% = log(4)/log(4). Anyway, if the definition you quoted is the standard, let's use it, although it's crappy.

Lower EBF is not necessarily better than higher EBF. Maybe Komodo is performing close to optimally with EBF 1.8 on one core and EBF 1.9 on 4 cores. Even on one core, maybe both EBFs are close to optimal. Don mentioned that he could make Komodo search 1 ply deeper or shallower (lower or higher EBF) without changing the strength.
Re: back to the Komodo SMP issue
Matthias Hartwich wrote: [quote snipped; see above]

In my case the only thing I care about is that Komodo plays as strong as possible on 4 cores. I don't care if the EBF goes up if I can get that Elo.
But if you want a practical definition of how well MP "works", you should find which level you need to run to get the same score against your single processor program. If I set the 4 core program to play at 5 minutes, for example, what level would the 1 core program need to compete?
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
Re: back to the Komodo SMP issue
Don wrote: [quote snipped; see above]

Whoops, I said that incorrectly. What I should have said is that you should find which level you need to run the SP program at to get the same score against your 4 core program. That should come out to be more than 1 and less than 4 if you are testing 4 core scalability.
Even this is not 100% precise as there are some variables:
1. starting level
2. non-MP scalability
I'm not sure the starting level matters much because you are running a self-test. I think it would take just about the same time handicap for MP regardless of level.
Non-MP scalability is a factor simply because you cannot completely isolate non-MP scalability from MP scalability, so with my proposed test you will be partially measuring non-MP scalability. If you run at long enough levels, this is probably a pretty minor factor.
Re: back to the Komodo SMP issue
Don wrote: [quote snipped; see above]

These two are particularly painful at ultra-short controls. One simply cannot measure 1->4 core scalability (defined as (time_1_core) / (time_4_cores) to the same _strength_) at ultra-short controls.
Re: back to the Komodo SMP issue
Laskos wrote: [quote snipped; see above]

I agree completely. There are also startup costs at ultra-blitz - some programs may take more time to start searching than others, which would make no noticeable difference at 5 minute chess but would at game in 5 seconds.
So for a practical test I would say you need games where both programs take at least a couple of minutes to complete a game, perhaps 60s + 1 or 2 minutes sudden death.
I also believe that strength is the real issue, not the level or time. For example, if you are measuring which program scales better, you have to start them both at the same STRENGTH, not the same level. Then see which improves most if you multiply the level by some constant factor. If you don't do this, a very weak program might appear to scale quite well when in fact it doesn't.