back to the Komodo SMP issue


bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: back to the Komodo SMP issue

Post by bob »

Laskos wrote:
bob wrote:
FlavusSnow wrote:I think about this very simply, and let me exaggerate just a little here for illustration purposes.

Say we have an algorithm that searches very narrowly by searching only the first move in the list for 4 ply before adding a second branch, and every 4 ply it adds another branch somewhere in the tree. This would search very deep on one thread, but not very wide. In fact this would probably result in very weak play on one thread. But now let's say I have 128 threads. Now I can unroll the search at the root a little and start to get some more information about the tree. Next thing you know, more branches are forming around the PV by a hive of searching threads and the tree is getting some meat. The engine sucked with one thread, but it might be plausible with 128 threads. These threads could each be on their own core, or maybe it would even function reasonably with the same 128 threads on one core. Would it perform like an engine designed for one core? Certainly not, but would the engine designed for one core get any gain from running on a 128 core system? Some gain, yes, but not nearly the gain obtained by the engine designed for 128 cores from the beginning.

I hope you can see the concept. To design an engine most efficiently for large numbers of cores, you must throw out the idea of optimizing it to run on one thread.
However, back to my original point. If the SMP speedup is ONLY 1.4x for 4 cores, you don't have a lot of room to add extra width before you completely consume that 40% speed (depth) gain and start actually slowing the program down. If it is still stronger even after being slowed down, that suggests it could be made stronger in the serial search just as easily, using this same idea...

That 1.4x seems to be a hurdle that is pretty high.
To illustrate the point practically at fixed _time_, I show results from LittleBlitzer, where time, speed and depth are measured:

Code: Select all

 1.  Komodo 5.1 4threads      	594.0/874	(tpm=107.5ms d=16.69 nps=7606453)
 2.  Komodo 5.1 1thread       	280.0/874	(tpm=108.4ms d=15.79 nps=2081020)
+131 points

 1.  Houdini 3 4threads       	180.5/265	(tpm=104.0ms d=18.37 nps=8163480)
 2.  Houdini 3 1thread        	 84.5/265	(tpm=103.8ms d=16.46 nps=3022500)
+132 points
This test is not meant to compare the gains, but to show depths and NPS while the Elo gains are comparable. For a similar Elo gain at fixed _time_, Komodo 1->4 cores gains 0.90 plies in depth, while Houdini 1->4 cores gains 1.91 plies. NPS for Komodo 1->4 increased by a factor of 3.66. While nearly all the effort in Houdini is devoted to depth, in Komodo it is divided fairly evenly between depth and width. In all your ad absurdum arguments, you talk as though this 3.66x increase in NPS doesn't exist. Nobody is saying that Komodo parallel at the _same_ speed is stronger at fixed _time_ than single-core Komodo, or that Komodo parallel is stronger than single-core Komodo at 4x time.
No, you are simply in over your head here. I explicitly asked how one widens the search without losing something in the process. I gave a specific example using your 2.1 EBF number, which seems to match Komodo pretty well. But that does not address the question. You just can't say "go wider". The question is how. And how do you avoid slowing things down, since a tree is a tree, and it still has to be searched? You can't just search the extra nodes and hand-wave them away. They have to be searched by the parallel search...

That's the issue here. How does one actually do this? Something that mathematically stands up to scrutiny.

The 3.66x NPS increase doesn't mean much, period, other than that the algorithm scales pretty well on that particular architecture, with no major memory bottlenecks or synchronization issues. But SOMETHING has to be done in parallel, and that is what I am trying to understand. I understand parallel alpha/beta perfectly. And I understand what "just going wider" is going to do.
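To put numbers on this trade-off, here is a minimal sketch, assuming the 2.1 EBF quoted earlier (an assumption, not a published Komodo figure) and Kai's LittleBlitzer measurements; it suggests the 3.66x extra nodes split roughly evenly between depth and width:

Code: Select all

from math import log

ebf        = 2.1    # assumed effective branching factor
nps_gain   = 3.66   # measured Komodo 1->4 core NPS ratio
depth_gain = 0.90   # measured Komodo 1->4 core depth gain (plies)

# If every extra node went into depth, 3.66x nodes would buy:
print(log(nps_gain) / log(ebf))        # ~1.75 plies

# Nodes actually spent per unit of depth ("width factor"):
print(nps_gain / ebf ** depth_gain)    # ~1.9x wider tree at equal depth

# A bare 1.4x time-to-depth speedup by itself is worth only:
print(log(1.4) / log(ebf))             # ~0.45 plies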
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: back to the Komodo SMP issue

Post by Laskos »

bob wrote:
Laskos wrote:
bob wrote:
FlavusSnow wrote:I think about this very simply, and let me exaggerate just a little here for illustration purposes.

Say we have an algorithm that searches very narrowly by searching only the first move in the list for 4 ply before adding a second branch, and every 4 ply it adds another branch somewhere in the tree. This would search very deep on one thread, but not very wide. In fact this would probably result in very weak play on one thread. But now let's say I have 128 threads. Now I can unroll the search at the root a little and start to get some more information about the tree. Next thing you know, more branches are forming around the PV by a hive of searching threads and the tree is getting some meat. The engine sucked with one thread, but it might be plausible with 128 threads. These threads could each be on their own core, or maybe it would even function reasonably with the same 128 threads on one core. Would it perform like an engine designed for one core? Certainly not, but would the engine designed for one core get any gain from running on a 128 core system? Some gain, yes, but not nearly the gain obtained by the engine designed for 128 cores from the beginning.

I hope you can see the concept. To design an engine most efficiently for large numbers of cores, you must throw out the idea of optimizing it to run on one thread.
However, back to my original point. If the SMP speedup is ONLY 1.4x for 4 cores, you don't have a lot of room to add extra width before you completely consume that 40% speed (depth) gain and start actually slowing the program down. If it is still stronger even after being slowed down, that suggests it could be made stronger in the serial search just as easily, using this same idea...

That 1.4x seems to be a hurdle that is pretty high.
To illustrate the point practically at fixed _time_, I show results from LittleBlitzer, where time, speed and depth are measured:

Code: Select all

 1.  Komodo 5.1 4threads      	594.0/874	(tpm=107.5ms d=16.69 nps=7606453)
 2.  Komodo 5.1 1thread       	280.0/874	(tpm=108.4ms d=15.79 nps=2081020)
+131 points

 1.  Houdini 3 4threads       	180.5/265	(tpm=104.0ms d=18.37 nps=8163480)
 2.  Houdini 3 1thread        	 84.5/265	(tpm=103.8ms d=16.46 nps=3022500)
+132 points
This test is not meant to compare the gains, but to show depths and NPS while the Elo gains are comparable. For a similar Elo gain at fixed _time_, Komodo 1->4 cores gains 0.90 plies in depth, while Houdini 1->4 cores gains 1.91 plies. NPS for Komodo 1->4 increased by a factor of 3.66. While nearly all the effort in Houdini is devoted to depth, in Komodo it is divided fairly evenly between depth and width. In all your ad absurdum arguments, you talk as though this 3.66x increase in NPS doesn't exist. Nobody is saying that Komodo parallel at the _same_ speed is stronger at fixed _time_ than single-core Komodo, or that Komodo parallel is stronger than single-core Komodo at 4x time.
No, you are simply in over your head here. I explicitly asked how one widens the search without losing something in the process. I gave a specific example using your 2.1 EBF number, which seems to match Komodo pretty well. But that does not address the question. You just can't say "go wider". The question is how. And how do you avoid slowing things down, since a tree is a tree, and it still has to be searched? You can't just search the extra nodes and hand-wave them away. They have to be searched by the parallel search...

That's the issue here. How does one actually do this? Something that mathematically stands up to scrutiny.

The 3.66x NPS increase doesn't mean much, period, other than that the algorithm scales pretty well on that particular architecture, with no major memory bottlenecks or synchronization issues. But SOMETHING has to be done in parallel, and that is what I am trying to understand. I understand parallel alpha/beta perfectly. And I understand what "just going wider" is going to do.
Again, suppose one is not deepening at all when going parallel. Won't he gain Elo points just by widening, going 1->4 cores with 3.66x NPS? I am at a loss with your "mathematical" argument. Are you saying that searching 3.66x more nodes to the _same_ depth at fixed time is not going to get you anywhere in Elo terms?
syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

Re: back to the Komodo SMP issue

Post by syzygy »

bob wrote:
syzygy wrote:
bob wrote:
syzygy wrote:
bob wrote:"An introduction to parallel programming" by Peter Pacheco, page 58, near the bottom.

speedup = T(serial) / T(parallel)

efficiency = speedup / nprocs

It's not a matter of liking or not liking a term, it is one that works, is clear, and understandable. If you use 4 cpus to search twice as fast, your efficiency is 50%, where if you use 4 cpus to search 4x as fast, your efficiency is 100%. Zero lost effort...
The definition clearly assumes that the same work is done.

When does a 4-core search perform the same work as a 1-core search?
The answer is obvious. It searches a position to a fixed depth.
That answer is very clearly false for Komodo. Komodo's 4-core search to a particular depth does more useful work than Komodo's 1-core search to that same depth. I thought that would have been obvious by now...
The definition doesn't change for a specific program. It is a general expression.
I'm not changing any definition. The definition is about same work. Same depth is not always same work. If you change the definition to "same depth", then you are just not being intellectually honest.
If a program violates the basic premise of "search the same tree but in parallel" then the program deviates from the formula, not the other way around.
Woohoo, the program is cheating?

Was Pacheco writing about computer chess? I don't think so. Did he write same depth? I don't think so.

If the definition does not apply to computer chess, it is not useful for the purpose of this forum. But it does apply, as long as you are being careful and objective. Same work, not same depth. It is obvious.
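For concreteness, here are the two formulas applied to hypothetical time-to-depth timings (the numbers below are invented purely for illustration):

Code: Select all

def smp_stats(t_serial, t_parallel, nprocs):
    # Pacheco: speedup = T(serial)/T(parallel), efficiency = speedup/nprocs
    speedup = t_serial / t_parallel
    return speedup, speedup / nprocs

# Invented timings (seconds) to reach a fixed depth on one position:
print(smp_stats(10.0, 7.14, 4))   # ~1.4x speedup -> ~35% efficiency
print(smp_stats(10.0, 2.50, 4))   # 4.0x speedup -> 100% efficiency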
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: back to the Komodo SMP issue

Post by bob »

Laskos wrote:
bob wrote:
Laskos wrote:
bob wrote:
FlavusSnow wrote:I think about this very simply, and let me exaggerate just a little here for illustration purposes.

Say we have an algorithm that searches very narrowly by searching only the first move in the list for 4 ply before adding a second branch, and every 4 ply it adds another branch somewhere in the tree. This would search very deep on one thread, but not very wide. In fact this would probably result in very weak play on one thread. But now let's say I have 128 threads. Now I can unroll the search at the root a little and start to get some more information about the tree. Next thing you know, more branches are forming around the PV by a hive of searching threads and the tree is getting some meat. The engine sucked with one thread, but it might be plausible with 128 threads. These threads could each be on their own core, or maybe it would even function reasonably with the same 128 threads on one core. Would it perform like an engine designed for one core? Certainly not, but would the engine designed for one core get any gain from running on a 128 core system? Some gain, yes, but not nearly the gain obtained by the engine designed for 128 cores from the beginning.

I hope you can see the concept. To design an engine most efficiently for large numbers of cores, you must throw out the idea of optimizing it to run on one thread.
However, back to my original point. If the SMP speedup is ONLY 1.4x for 4 cores, you don't have a lot of room to add extra width before you completely consume that 40% speed (depth) gain and start actually slowing the program down. If it is still stronger even after being slowed down, that suggests it could be made stronger in the serial search just as easily, using this same idea...

That 1.4x seems to be a hurdle that is pretty high.
To illustrate the point practically at fixed _time_, I show results from LittleBlitzer, where time, speed and depth are measured:

Code: Select all

 1.  Komodo 5.1 4threads      	594.0/874	(tpm=107.5ms d=16.69 nps=7606453)
 2.  Komodo 5.1 1thread       	280.0/874	(tpm=108.4ms d=15.79 nps=2081020)
+131 points

 1.  Houdini 3 4threads       	180.5/265	(tpm=104.0ms d=18.37 nps=8163480)
 2.  Houdini 3 1thread        	 84.5/265	(tpm=103.8ms d=16.46 nps=3022500)
+132 points
This test is not meant to compare the gains, but to show depths and NPS while the Elo gains are comparable. For a similar Elo gain at fixed _time_, Komodo 1->4 cores gains 0.90 plies in depth, while Houdini 1->4 cores gains 1.91 plies. NPS for Komodo 1->4 increased by a factor of 3.66. While nearly all the effort in Houdini is devoted to depth, in Komodo it is divided fairly evenly between depth and width. In all your ad absurdum arguments, you talk as though this 3.66x increase in NPS doesn't exist. Nobody is saying that Komodo parallel at the _same_ speed is stronger at fixed _time_ than single-core Komodo, or that Komodo parallel is stronger than single-core Komodo at 4x time.
No, you are simply in over your head here. I explicitly asked how one widens the search without losing something in the process. I gave a specific example using your 2.1 EBF number, which seems to match Komodo pretty well. But that does not address the question. You just can't say "go wider". The question is how. And how do you avoid slowing things down, since a tree is a tree, and it still has to be searched? You can't just search the extra nodes and hand-wave them away. They have to be searched by the parallel search...

That's the issue here. How does one actually do this? Something that mathematically stands up to scrutiny.

The 3.66x NPS increase doesn't mean much, period, other than that the algorithm scales pretty well on that particular architecture, with no major memory bottlenecks or synchronization issues. But SOMETHING has to be done in parallel, and that is what I am trying to understand. I understand parallel alpha/beta perfectly. And I understand what "just going wider" is going to do.
Again, suppose one is not deepening at all when going parallel. Won't he gain Elo points just by widening, going 1->4 cores with 3.66x NPS? I am at a loss with your "mathematical" argument. Are you saying that searching 3.66x more nodes to the _same_ depth at fixed time is not going to get you anywhere in Elo terms?
My question is, still, HOW does one do a parallel search that doesn't follow the normal tree shape? What I said about the 3.66x is that it says nothing at all. It MIGHT say you are 3.66x faster, or it MIGHT say you are searching a bunch of unnecessary nodes that add nothing, as in the usual parallel search overhead everyone sees. It is particularly interesting when one goes deeper AND wider. Trying to figure out how that might be done reasonably is an interesting exercise. Not saying it can't be done. Just asking "How?" Nothing simple comes to mind. You still have to search the tree, following alpha/beta rules, in parallel. Seems "interesting" at the very least...
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: back to the Komodo SMP issue

Post by Laskos »

bob wrote:
Laskos wrote:
Again, suppose one is not deepening at all when going parallel. Won't he gain Elo points just by widening, going 1->4 cores with 3.66x NPS? I am at a loss with your "mathematical" argument. Are you saying that searching 3.66x more nodes to the _same_ depth at fixed time is not going to get you anywhere in Elo terms?
My question is, still, HOW does one do a parallel search that doesn't follow the normal tree shape? What I said about the 3.66x is that it says nothing at all. It MIGHT say you are 3.66x faster, or it MIGHT say you are searching a bunch of unnecessary nodes that add nothing, as in the usual parallel search overhead everyone sees. It is particularly interesting when one goes deeper AND wider. Trying to figure out how that might be done reasonably is an interesting exercise. Not saying it can't be done. Just asking "How?" Nothing simple comes to mind. You still have to search the tree, following alpha/beta rules, in parallel. Seems "interesting" at the very least...
An algorithm better adapted to parallelism may not be as good as the one used on a single core, but it has much more parallelism to exploit. Whether this algorithm happens to have a larger or smaller EBF is irrelevant, because the Elo gain is all that matters. Even on a single core, Don implied that he could change the EBF a little without changing the strength.
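To make "widening" concrete, here is one speculative shape such a search could take; this is emphatically NOT a claim about Komodo's actual code, just a toy illustration. Helper threads run the same depth-limited search but with milder late-move reductions, so at equal nominal depth they verify moves the main thread reduced, and a shared hash table carries their results back. A self-contained Python sketch (the game tree, evaluation, and reduction schedule are all made up, and CPython's GIL means the threads only interleave):

Code: Select all

import threading
from math import inf
from random import Random

def children(pos):                  # deterministic toy game tree
    return [Random(pos * 8 + i).randrange(10**9) for i in range(8)]

def evaluate(pos):                  # toy leaf evaluation in [-1, 1]
    return Random(pos).uniform(-1.0, 1.0)

tt, tt_lock = {}, threading.Lock()  # shared "transposition table"

def reduction(tid, move_no):
    base = 1 if move_no >= 3 else 0 # toy late-move reduction schedule
    return base if tid == 0 else 0  # helpers do not reduce: wider tree

def negamax(pos, depth, alpha, beta, tid):
    if depth <= 0:
        return evaluate(pos)
    with tt_lock:                   # any thread can reuse another's work
        if (pos, depth) in tt:
            return tt[(pos, depth)]
    best = -inf
    for i, child in enumerate(children(pos)):
        score = -negamax(child, depth - 1 - reduction(tid, i),
                         -beta, -alpha, tid)
        best, alpha = max(best, score), max(alpha, score)
        if alpha >= beta:           # ordinary alpha/beta cutoff
            break
    with tt_lock:                   # bound types ignored in this toy
        tt[(pos, depth)] = best
    return best

threads = [threading.Thread(target=negamax, args=(1, 6, -inf, inf, t))
           for t in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(tt[(1, 6)])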
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: back to the Komodo SMP issue

Post by bob »

syzygy wrote:
bob wrote:
syzygy wrote:
bob wrote:
syzygy wrote:
bob wrote:"An introduction to parallel programming" by Peter Pacheco, page 58, near the bottom.

speedup = T(serial) / T(parallel)

efficiency = speedup / nprocs

It's not a matter of liking or not liking a term, it is one that works, is clear, and understandable. If you use 4 cpus to search twice as fast, your efficiency is 50%, where if you use 4 cpus to search 4x as fast, your efficiency is 100%. Zero lost effort...
The definition clearly assumes that the same work is done.

When does a 4-core search perform the same work as a 1-core search?
The answer is obvious. It searches a position to a fixed depth.
That answer is very clearly false for Komodo. Komodo's 4-core search to a particular depth does more useful work than Komodo's 1-core search to that same depth. I thought that would have been obvious by now...
The definition doesn't change for a specific program. It is a general expression.
I'm not changing any definition. The definition is about same work. Same depth is not always same work. If you change the definition to "same depth", then you are just not being intellectually honest.
If we were comparing depth from program A to depth from program B, I would agree. But we are NOT doing that. We are simply searching using program A to a fixed depth, and measuring how much faster we get there with parallel search. So this is not intellectually dishonest at all, it is a logical/rational methodology to measure parallel performance.
If a program violates the basic premise of "search the same tree but in parallel" then the program deviates from the formula, not the other way around.
Woohoo, the program is cheating?
Any chance the hyperbole can be toned down? One does not normally try to do more work with a parallel application; one tries to get the results of a fixed computation faster. That has always been the case.


Was Pacheco writing about computer chess? I don't think so. Did he write same depth? I don't think so.

If the definition does not apply to computer chess, it is not useful for the purpose of this forum. But it does apply, as long as you are being careful and objective. Same work, not same depth. It is obvious.
I will repeat, since you seem to ignore it repeatedly. Absolutely NOTHING is wrong with measuring Elo, EXCEPT for the computational cost. Hence everyone (except for me, perhaps) uses time-to-depth for 1, 2, 4, etc. cores to measure how effective their SMP search is. I do that most of the time because the time required is acceptable. Measuring Elo requires a FAR greater investment of computer time, since you must play a bunch of complete games to get reasonable error bars, as opposed to just testing a few hundred positions.

From a performance metric point of view, Elo is still pretty pointless as a measurement tool, because it doesn't address things like raw speed of the hardware and what happens with a faster single processor as opposed to two slower processors. So as a measurement, it still lacks specificity in terms of performance.
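For a rough sense of that cost gap: the 1-sigma error bar on a measured Elo difference shrinks only with the square root of the number of games. Back-of-envelope arithmetic (draws ignored, which overstates the variance somewhat):

Code: Select all

from math import log

def games_for(elo_error):
    # Elo(p) = 400*log10(p/(1-p)); its slope at p=0.5 is 400/(ln(10)*0.25)
    slope = 400 / (log(10) * 0.25)   # ~695 Elo per unit of score
    sigma = 0.5                      # per-game std. dev. of score, no draws
    return (slope * sigma / elo_error) ** 2

print(games_for(10))   # ~1200 games for a +-10 Elo (1 sigma) error bar
print(games_for(5))    # ~4800 games for +-5 Elo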
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: back to the Komodo SMP issue

Post by bob »

Laskos wrote:
bob wrote:
Laskos wrote:
Again, suppose one is not deepening at all when going parallel. Won't he gain Elo points just by widening, going 1->4 cores with 3.66x NPS? I am at a loss with your "mathematical" argument. Are you saying that searching 3.66x more nodes to the _same_ depth at fixed time is not going to get you anywhere in Elo terms?
My question is, still, HOW does one do a parallel search that doesn't follow the normal tree shape? What I said about the 3.66x is that it says nothing at all. It MIGHT say you are 3.66x faster, or it MIGHT say you are searching a bunch of unnecessary nodes that add nothing, as in the usual parallel search overhead everyone sees. It is particularly interesting when one goes deeper AND wider. Trying to figure out how that might be done reasonably is an interesting exercise. Not saying it can't be done. Just asking "How?" Nothing simple comes to mind. You still have to search the tree, following alpha/beta rules, in parallel. Seems "interesting" at the very least...
An algorithm better adapted to parallelism may not be as good as the one used on a single core, but it has much more parallelism to exploit. Whether this algorithm happens to have a larger or smaller EBF is irrelevant, because the Elo gain is all that matters. Even on a single core, Don implied that he could change the EBF a little without changing the strength.
The reduction and pruning guidelines are flexible, and are normally tuned for the "general case". And quite often it is a system with more unknowns than equations, giving a set of feasible solutions à la linear programming. Nothing new there.

The question remains: how does one change the EBF and STILL search in parallel, with a parallel search that doesn't scale very well? If it does scale well, additional depth almost certainly pays off more than increased width; otherwise the serial search would go for the increased width as well...
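One way to see the tension: fix the node budget and ask what depth survives as the EBF widens. Illustrative arithmetic only, again assuming the 2.1 EBF baseline and the 3.66x NPS figure from earlier in the thread:

Code: Select all

from math import log

nodes = 3.66 * 2.1 ** 16    # 3.66x the nodes of a depth-16, EBF-2.1 search
for ebf in (2.0, 2.1, 2.2, 2.3):
    print(ebf, round(log(nodes) / log(ebf), 2))   # reachable depth
# 2.0 -> 18.99, 2.1 -> 17.75, 2.2 -> 16.7, 2.3 -> 15.81
# Widening the EBF from 2.1 to ~2.2 already consumes most of the 3.66x.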
syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

Re: back to the Komodo SMP issue

Post by syzygy »

bob wrote:
syzygy wrote:
bob wrote:
syzygy wrote:
bob wrote:
syzygy wrote:
bob wrote:"An introduction to parallel programming" by Peter Pacheco, page 58, near the bottom.

speedup = T(serial) / T(parallel)

efficiency = speedup / nprocs

It's not a matter of liking or not liking a term, it is one that works, is clear, and understandable. If you use 4 cpus to search twice as fast, your efficiency is 50%, where if you use 4 cpus to search 4x as fast, your efficiency is 100%. Zero lost effort...
The definition clearly assumes that the same work is done.

When does a 4-core search perform the same work as a 1-core search?
The answer is obvious. It searches a position to a fixed depth.
That answer is very clearly false for Komodo. Komodo's 4-core search to a particular depth does more useful work than Komodo's 1-core search to that same depth. I thought that would have been obvious by now...
The definition doesn't change for a specific program. It is a general expression.
I'm not changing any definition. The definition is about same work. Same depth is not always same work. If you change the definition to "same depth", then you are just not being intellectually honest.
If we were comparing depth from program A to depth from program B, I would agree. But we are NOT doing that. We are simply searching using program A to a fixed depth, and measuring how much faster we get there with parallel search. So this is not intellectually dishonest at all, it is a logical/rational methodology to measure parallel performance.
The definition of SMP efficiency is about same work, not about "same depth". I am pretty sure your Peter Pacheco never talks about same depth. What is difficult about this?
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: back to the Komodo SMP issue

Post by bob »

syzygy wrote:
bob wrote:
syzygy wrote:
bob wrote:
syzygy wrote:
bob wrote:
syzygy wrote:
bob wrote:"An introduction to parallel programming" by Peter Pacheco, page 58, near the bottom.

speedup = T(serial) / T(parallel)

efficiency = speedup / nprocs

It's not a matter of liking or not liking a term, it is one that works, is clear, and understandable. If you use 4 cpus to search twice as fast, your efficiency is 50%, where if you use 4 cpus to search 4x as fast, your efficiency is 100%. Zero lost effort...
The definition clearly assumes that the same work is done.

When does a 4-core search perform the same work as a 1-core search?
The answer is obvious. It searches a position to a fixed depth.
That answer is very clearly false for Komodo. Komodo's 4-core search to a particular depth does more useful work than Komodo's 1-core search to that same depth. I thought that would have been obvious by now...
The definition doesn't change for a specific program. It is a general expression.
I'm not changing any definition. The definition is about same work. Same depth is not always same work. If you change the definition to "same depth", then you are just not being intellectually honest.
If we were comparing depth from program A to depth from program B, I would agree. But we are NOT doing that. We are simply searching using program A to a fixed depth, and measuring how much faster we get there with parallel search. So this is not intellectually dishonest at all, it is a logical/rational methodology to measure parallel performance.
The definition of SMP efficiency is about same work, not about "same depth". I am pretty sure your Peter Pacheco never talks about same depth. What is difficult about this?
Absolutely nothing. But neither did Pacheco consider the idea of "computing to the same accuracy", where a weather-forecasting model somehow relaxes accuracy when annealing in parallel, so that you have to do more work than the sequential model does.

Elo is a relative measure that is not as precise as what we have been using since the first parallel program was written. And again, there's nothing wrong with using Elo as the "measure", if you have the resources required to play the large number of games needed to produce reasonable accuracy. Time-to-depth is MUCH less computationally demanding, and it is much easier to tune the SMP search and measure the effect of the tuning in a reasonable amount of time.

I don't believe there is any doubt that depth is the key point. Otherwise why not have the serial search go wider rather than deeper? The answer is obvious. If one gives up on going deeper, which certainly becomes problematic as the number of processors grows, then other alternatives might well be viable. But for 4 cores, I remain unconvinced, given the lack of any sort of technical evidence on how/why this is a good idea...
Henk
Posts: 7216
Joined: Mon May 27, 2013 10:31 am

Re: back to the Komodo SMP issue

Post by Henk »

It may easily happen that, even when searching N times faster, the perhaps-small error in the value of the best move at the root is still too big.

This causes the better moves at the root to be rejected.

The player that makes the smallest errors in important positions wins the game.

All parameters that have any effect on basic evaluation or pruning should be 're-tuned'. Or basic evaluation should be rewritten. Or more accurate pruning methods should be invented and implemented.

Of course if it is only one parameter that causes the error bottleneck only that parameter needs to be changed.