back to the Komodo SMP issue

Discussion of anything and everything relating to chess playing software and machines.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

back to the Komodo SMP issue

Post by bob »

Here's the key point that nobody has even begun to address. Let's take a hypothetical position, where a program takes exactly 10 minutes to search to depth D. And when it finishes we measure an EBF of 2.0 exactly (to keep the numbers simple).

You run this with 4 cpus, and discover that to search to depth D, we get a time of 7 minutes. Speedup = 10 / 7 = 1.4x. SMP efficiency = 1.4 / 4.0 = .35 (or 35%)


Now, since the SMP efficiency is so low (as is the speedup) let's relax the rules a bit to make the search wider. This eliminates errors due to pruning/reductions, but it slows the search down, and now our serial search is obviously going to take much longer than 10 minutes. If you make the tree 2x larger, which is probably the minimum you would expect to produce a +130 Elo gain, you just doubled the search time for one cpu to 20 minutes. 20 / 1.4 = 14.2 minutes for the parallel search. So we made the search wider, only to give us a larger tree that we don't search very efficiently, and the time is longer than the original serial search, whcih means it is not going to gain a thing.

All this is predicated on the serial search being well-tuned regarding reductions and pruning so that there is little expected gain from further tweaking, when using only one cpu. After thinking about this from every reasonable angle, I am STILL not convinced that this concept is valid. If you could claim "OK, if we use 2 cpus, we get 1.4x, or if we use 4 cpus we STILL get 1.4x, that clearly shows the extra 2 cpus are not helping at all. But exactly HOW would you get them to doing something helpful by making the tree wider, once you know that the FINAL tree result is still just 1.4x faster with 4 cores.

If one uses the usual doubling speed = +70 Elo, we need two doublings to reach that +130 Elo number that was discussed. It is pretty straightforward to figure out what kind of EBF increase one needs (given fixed time) to produce about 2/3 of that Elo gain (widening the tree, given an SMP speedup of 1.4x) I almost went down that road, but as I started the idea simply looked completely unsound for the reasons given above. As you go wider, you lose depth since the SMP search is doing so poorly. I don't see any way around this.

So perhaps someone is willing to participate in a technical discussion that is based on something beyond vague comments to come up with a way to make this actually work.

I think a reasonable starting point is depth=24, time=10 minutes, EBF=2.0, 1 cpu. Depth=24, time=7 minutes, EBF=2.0, 4 cpus, and figure out how we get to +130 Elo given that we are only expecting maybe 15 - 20 Elo from the SMP search gain...

Of course, in a real game, that last depth=24 will increase somewhat, but will not reach 25 since we need 2x speedup to reach the next ply with an EBF of 2.0...
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: back to the Komodo SMP issue

Post by Laskos »

bob wrote:Here's the key point that nobody has even begun to address. Let's take a hypothetical position, where a program takes exactly 10 minutes to search to depth D. And when it finishes we measure an EBF of 2.0 exactly (to keep the numbers simple).

You run this with 4 cpus, and discover that to search to depth D, we get a time of 7 minutes. Speedup = 10 / 7 = 1.4x. SMP efficiency = 1.4 / 4.0 = .35 (or 35%)


Now, since the SMP efficiency is so low (as is the speedup) let's relax the rules a bit to make the search wider. This eliminates errors due to pruning/reductions, but it slows the search down, and now our serial search is obviously going to take much longer than 10 minutes. If you make the tree 2x larger, which is probably the minimum you would expect to produce a +130 Elo gain, you just doubled the search time for one cpu to 20 minutes. 20 / 1.4 = 14.2 minutes for the parallel search. So we made the search wider, only to give us a larger tree that we don't search very efficiently, and the time is longer than the original serial search, whcih means it is not going to gain a thing.

All this is predicated on the serial search being well-tuned regarding reductions and pruning so that there is little expected gain from further tweaking, when using only one cpu. After thinking about this from every reasonable angle, I am STILL not convinced that this concept is valid. If you could claim "OK, if we use 2 cpus, we get 1.4x, or if we use 4 cpus we STILL get 1.4x, that clearly shows the extra 2 cpus are not helping at all. But exactly HOW would you get them to doing something helpful by making the tree wider, once you know that the FINAL tree result is still just 1.4x faster with 4 cores.

If one uses the usual doubling speed = +70 Elo, we need two doublings to reach that +130 Elo number that was discussed. It is pretty straightforward to figure out what kind of EBF increase one needs (given fixed time) to produce about 2/3 of that Elo gain (widening the tree, given an SMP speedup of 1.4x) I almost went down that road, but as I started the idea simply looked completely unsound for the reasons given above. As you go wider, you lose depth since the SMP search is doing so poorly. I don't see any way around this.

So perhaps someone is willing to participate in a technical discussion that is based on something beyond vague comments to come up with a way to make this actually work.

I think a reasonable starting point is depth=24, time=10 minutes, EBF=2.0, 1 cpu. Depth=24, time=7 minutes, EBF=2.0, 4 cpus, and figure out how we get to +130 Elo given that we are only expecting maybe 15 - 20 Elo from the SMP search gain...

Of course, in a real game, that last depth=24 will increase somewhat, but will not reach 25 since we need 2x speedup to reach the next ply with an EBF of 2.0...
Two observations:

1) With EBF = (Total nodes searched) ^ (1/depth), 4 cores has higher EBF than 1 core in Komodo.

2) You wrote "SMP efficiency = 1.4 / 4.0 = .35 (or 35%)". A bit strange formula, I guess it should be log(1.4)/log(4) ~ 24%. The speed-up of 1 is no speed-up, not 1/4=25% efficiency.
Matthias Hartwich
Posts: 38
Joined: Tue Jul 01, 2008 9:36 pm

Re: back to the Komodo SMP issue

Post by Matthias Hartwich »

Perhaps another definition of SMP efficiency would be better in this case:

1 core: engine has a base Elo at time control y (without permanent brain to make testing easier - the quality of the permanent brain doesn't contribute)

n core: engine gets an Elo base+x at time control y

Now what time factor is needed for one core to get Elo base+x as well?

As this calculation is not easy to get the exact time factor it can be approximated with playing out matches with one core and time controls y, 2*y, 4*y, ... and time control y with 2 cores, 4 cores, ...

It would eliminate the reached depth as key number as it is more important to know how well the engines uses the computing ressources to make a move.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: back to the Komodo SMP issue

Post by bob »

Laskos wrote:
bob wrote:Here's the key point that nobody has even begun to address. Let's take a hypothetical position, where a program takes exactly 10 minutes to search to depth D. And when it finishes we measure an EBF of 2.0 exactly (to keep the numbers simple).

You run this with 4 cpus, and discover that to search to depth D, we get a time of 7 minutes. Speedup = 10 / 7 = 1.4x. SMP efficiency = 1.4 / 4.0 = .35 (or 35%)


Now, since the SMP efficiency is so low (as is the speedup) let's relax the rules a bit to make the search wider. This eliminates errors due to pruning/reductions, but it slows the search down, and now our serial search is obviously going to take much longer than 10 minutes. If you make the tree 2x larger, which is probably the minimum you would expect to produce a +130 Elo gain, you just doubled the search time for one cpu to 20 minutes. 20 / 1.4 = 14.2 minutes for the parallel search. So we made the search wider, only to give us a larger tree that we don't search very efficiently, and the time is longer than the original serial search, whcih means it is not going to gain a thing.

All this is predicated on the serial search being well-tuned regarding reductions and pruning so that there is little expected gain from further tweaking, when using only one cpu. After thinking about this from every reasonable angle, I am STILL not convinced that this concept is valid. If you could claim "OK, if we use 2 cpus, we get 1.4x, or if we use 4 cpus we STILL get 1.4x, that clearly shows the extra 2 cpus are not helping at all. But exactly HOW would you get them to doing something helpful by making the tree wider, once you know that the FINAL tree result is still just 1.4x faster with 4 cores.

If one uses the usual doubling speed = +70 Elo, we need two doublings to reach that +130 Elo number that was discussed. It is pretty straightforward to figure out what kind of EBF increase one needs (given fixed time) to produce about 2/3 of that Elo gain (widening the tree, given an SMP speedup of 1.4x) I almost went down that road, but as I started the idea simply looked completely unsound for the reasons given above. As you go wider, you lose depth since the SMP search is doing so poorly. I don't see any way around this.

So perhaps someone is willing to participate in a technical discussion that is based on something beyond vague comments to come up with a way to make this actually work.

I think a reasonable starting point is depth=24, time=10 minutes, EBF=2.0, 1 cpu. Depth=24, time=7 minutes, EBF=2.0, 4 cpus, and figure out how we get to +130 Elo given that we are only expecting maybe 15 - 20 Elo from the SMP search gain...

Of course, in a real game, that last depth=24 will increase somewhat, but will not reach 25 since we need 2x speedup to reach the next ply with an EBF of 2.0...
Two observations:

1) With EBF = (Total nodes searched) ^ (1/depth), 4 cores has higher EBF than 1 core in Komodo.

2) You wrote "SMP efficiency = 1.4 / 4.0 = .35 (or 35%)". A bit strange formula, I guess it should be log(1.4)/log(4) ~ 24%. The speed-up of 1 is no speed-up, not 1/4=25% efficiency.
No, the formula I gave is the one referenced in SMP publications. 100% means that all 4 (in this case) of the processors are busy doing nothing but useful work. No idle time, no extra work.

I realize the higher EBF in Komodo. No doubt about it. The only doubt I have is "how does it help?"

SMP efficiency is defined as speedup/#processors. Optimal is obviously 1.0 (100%).

If you use one cpu, the time taken for a serial search = time taken for the one-cpu parallel search, the speedup is 1.0 which is no speedup at all. #processors / speedup = 100% which is expected.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: back to the Komodo SMP issue

Post by bob »

Matthias Hartwich wrote:Perhaps another definition of SMP efficiency would be better in this case:

1 core: engine has a base Elo at time control y (without permanent brain to make testing easier - the quality of the permanent brain doesn't contribute)

n core: engine gets an Elo base+x at time control y

Now what time factor is needed for one core to get Elo base+x as well?

As this calculation is not easy to get the exact time factor it can be approximated with playing out matches with one core and time controls y, 2*y, 4*y, ... and time control y with 2 cores, 4 cores, ...

It would eliminate the reached depth as key number as it is more important to know how well the engines uses the computing ressources to make a move.
That's an ok measurement, it is just not the classic SMP efficiency term, which has already been used for 40 years now. Perhaps "SMP effectiveness" might be the term to use, although it would need to be carefully defined to be useful in the future...

BTW for most every program tested to date, the SMP speedup does that perfectly. If you use 3 cpus and you run 2x faster (time to depth which factors in search overhead) then you get the same elo benefit as if you just run on a 2x faster single cpu box...

For almost any program, if you double the speed, you get something like 70 Elo. Whether you double the speed by making the CPU faster, or by using more than one CPU.
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: back to the Komodo SMP issue

Post by Laskos »

bob wrote:
Laskos wrote:
Two observations:

1) With EBF = (Total nodes searched) ^ (1/depth), 4 cores has higher EBF than 1 core in Komodo.

2) You wrote "SMP efficiency = 1.4 / 4.0 = .35 (or 35%)". A bit strange formula, I guess it should be log(1.4)/log(4) ~ 24%. The speed-up of 1 is no speed-up, not 1/4=25% efficiency.
No, the formula I gave is the one referenced in SMP publications. 100% means that all 4 (in this case) of the processors are busy doing nothing but useful work. No idle time, no extra work.
Then you have lousy definitions there. From 1.4 to 1.96 the efficiency increases by a factor of 2=log(1.96)/log(1.4). For speedup 4 you still have 100%=log(4)/log(4). Anyway if that definition quoted by you is the standard, let's use it, although it's crappy.

I realize the higher EBF in Komodo. No doubt about it. The only doubt I have is "how does it help?"
Lower EBF is not necessarily better than higher EBF. Maybe Komodo is performing close to optimally with EBF 1.8 on one core and EBF 1.9 on 4 cores. Even on one core, maybe both EBF's are close to optimal. Don mentioned that he could make Komodo 1 ply higher or lower (lower or higher EBF) without changing the strength.


SMP efficiency is defined as speedup/#processors. Optimal is obviously 1.0 (100%).

If you use one cpu, the time taken for a serial search = time taken for the one-cpu parallel search, the speedup is 1.0 which is no speedup at all. #processors / speedup = 100% which is expected.
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: back to the Komodo SMP issue

Post by Don »

Matthias Hartwich wrote:Perhaps another definition of SMP efficiency would be better in this case:

1 core: engine has a base Elo at time control y (without permanent brain to make testing easier - the quality of the permanent brain doesn't contribute)

n core: engine gets an Elo base+x at time control y

Now what time factor is needed for one core to get Elo base+x as well?

As this calculation is not easy to get the exact time factor it can be approximated with playing out matches with one core and time controls y, 2*y, 4*y, ... and time control y with 2 cores, 4 cores, ...

It would eliminate the reached depth as key number as it is more important to know how well the engines uses the computing ressources to make a move.
In my case the only thing I care about is that Komodo plays as strong as possible on 4 cores. I don't care if the EBF goes up if I can get that ELO.

But if you want a practical definition of how well MP "works" you should find which level you need to run to get the same score against your single processor program. If I set the 4 core program to play at 5 minutes for example, what level would the 1 core problem need to compete?
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: back to the Komodo SMP issue

Post by Don »

Don wrote:
Matthias Hartwich wrote:Perhaps another definition of SMP efficiency would be better in this case:

1 core: engine has a base Elo at time control y (without permanent brain to make testing easier - the quality of the permanent brain doesn't contribute)

n core: engine gets an Elo base+x at time control y

Now what time factor is needed for one core to get Elo base+x as well?

As this calculation is not easy to get the exact time factor it can be approximated with playing out matches with one core and time controls y, 2*y, 4*y, ... and time control y with 2 cores, 4 cores, ...

It would eliminate the reached depth as key number as it is more important to know how well the engines uses the computing ressources to make a move.
In my case the only thing I care about is that Komodo plays as strong as possible on 4 cores. I don't care if the EBF goes up if I can get that ELO.

But if you want a practical definition of how well MP "works" you should find which level you need to run to get the same score against your single processor program. If I set the 4 core program to play at 5 minutes for example, what level would the 1 core problem need to compete?
Woops, I said that incorrectly. What I should have said is that you should find which level you need to run the SP program at to get the same score against your 4 core program. That should come out to be more than 1 and less than 4 if you are testing the 4 core scalability.

Even this is not 100% precise as there are some variables:

1. starting level
2. non-MP scalability

I'm not sure the starting level matters much because you are running a self-test. I think it would take just about the same time handicap for MP regardless of level.

non-MP scalability is a factor simply because you cannot completely isolate non-MP scalability from MP scalability, so with my proposed test you will be partially measuring non-MP scalability. If you run at long enough levels this is probably a pretty minor factor.
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: back to the Komodo SMP issue

Post by Laskos »

Don wrote:
Don wrote:
Matthias Hartwich wrote:Perhaps another definition of SMP efficiency would be better in this case:

1 core: engine has a base Elo at time control y (without permanent brain to make testing easier - the quality of the permanent brain doesn't contribute)

n core: engine gets an Elo base+x at time control y

Now what time factor is needed for one core to get Elo base+x as well?

As this calculation is not easy to get the exact time factor it can be approximated with playing out matches with one core and time controls y, 2*y, 4*y, ... and time control y with 2 cores, 4 cores, ...

It would eliminate the reached depth as key number as it is more important to know how well the engines uses the computing ressources to make a move.
In my case the only thing I care about is that Komodo plays as strong as possible on 4 cores. I don't care if the EBF goes up if I can get that ELO.

But if you want a practical definition of how well MP "works" you should find which level you need to run to get the same score against your single processor program. If I set the 4 core program to play at 5 minutes for example, what level would the 1 core problem need to compete?
Woops, I said that incorrectly. What I should have said is that you should find which level you need to run the SP program at to get the same score against your 4 core program. That should come out to be more than 1 and less than 4 if you are testing the 4 core scalability.

Even this is not 100% precise as there are some variables:

1. starting level
2. non-MP scalability
These two are particularly painful at ultra-short controls. One simply cannot measure 1->4 core scalability (defined as (time_1_core) / (time_4_cores) to the same _strength_) at ultra-short controls.

I'm not sure the starting level matters much because you are running a self-test. I think it would take just about the same time handicap for MP regardless of level.

non-MP scalability is a factor simply because you cannot completely isolate non-MP scalability from MP scalability, so with my proposed test you will be partially measuring non-MP scalability. If you run at long enough levels this is probably a pretty minor factor.
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: back to the Komodo SMP issue

Post by Don »

Laskos wrote:
Don wrote:
Woops, I said that incorrectly. What I should have said is that you should find which level you need to run the SP program at to get the same score against your 4 core program. That should come out to be more than 1 and less than 4 if you are testing the 4 core scalability.

Even this is not 100% precise as there are some variables:

1. starting level
2. non-MP scalability
These two are particularly painful at ultra-short controls. One simply cannot measure 1->4 core scalability (defined as (time_1_core) / (time_4_cores) to the same _strength_) at ultra-short controls.
I agree completely. There is also startup costs at ultra-blitz - some programs may take more time to start searching than others and this would make no noticeable difference at 5 minute chess but it would at game in 5 seconds.

So for a practical test I would say you need at least games where both programs take a couple of minutes to complete a game, perhaps 60s + 1 or 2 minutes sudden death.

I also believe that strength is the real issue, not the level or time. For example if you are measuring which program scales better you have to start them both at the same STRENGTH, not the same level. Then see which improves most if you multiply the level by some constant factor. If you don't do this a very weak program might appear to scale quite well when in fact it doesn't.
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.