back to the Komodo SMP issue

Henk · Post by **Henk** » Thu Jul 04, 2013 12:02 pm

So probably at search levels above 18 the average branching factor should be greater than 2 otherwise searching faster doesn't give better games in terms of ELO rating.

bob · Post by **bob** » Thu Jul 04, 2013 5:52 pm

Henk wrote:So probably at search levels above 18 the average branching factor should be greater than 2 otherwise searching faster doesn't give better games in terms of ELO rating.

I've not seen any sort of research that shows lower EBF plays worse. In fact, quite the contrary based on LMR/forward-pruning results we see today. One can get too speculative, of course. But 2.0 doesn't seem to be some sort of "barrier".

FlavusSnow · Post by **FlavusSnow** » Thu Jul 04, 2013 6:42 pm

I think about this very simply, and let me exaggerate just a little here for illustration purposes.

Say we have an algorithm that searches very narrowly by searching only the first move in the list for 4 ply before adding a second branch, and every 4 ply it adds another branch somewhere in the tree. This would search very deep on one thread, but not very wide. In fact this would probably result in very weak play on one thread. But now let's say I have 128 threads. Now I can unroll the search at the root a little and start to get some more information about the tree. Next thing you know, more branches are forming around the PV by a hive of searching threads and the tree is getting some meat. The engine sucked with one thread, but it might be plausible with 128 threads. These threads could each be on their own core, or maybe it would even function reasonably with the same 128 threads on one core. Would it perform like an engine designed for one core? Certainly not, but would the engine designed for one core get any gain from running on a 128 core system? Some gain, yes, but not nearly the gain obtained by the engine designed for 128 cores from the beginning.

I hope can see the concept. To most efficiently design an engine for large numbers of cores, you must throw out the idea of optimizing it to run on one thread.

Joerg Oster · Post by **Joerg Oster** » Thu Jul 04, 2013 8:08 pm

FlavusSnow wrote:I think about this very simply, and let me exaggerate just a little here for illustration purposes.

Say we have an algorithm that searches very narrowly by searching only the first move in the list for 4 ply before adding a second branch, and every 4 ply it adds another branch somewhere in the tree. This would search very deep on one thread, but not very wide. In fact this would probably result in very weak play on one thread. But now let's say I have 128 threads. Now I can unroll the search at the root a little and start to get some more information about the tree. Next thing you know, more branches are forming around the PV by a hive of searching threads and the tree is getting some meat. The engine sucked with one thread, but it might be plausible with 128 threads. These threads could each be on their own core, or maybe it would even function reasonably with the same 128 threads on one core. Would it perform like an engine designed for one core? Certainly not, but would the engine designed for one core get any gain from running on a 128 core system? Some gain, yes, but not nearly the gain obtained by the engine designed for 128 cores from the beginning.

I hope can see the concept. To most efficiently design an engine for large numbers of cores, you must throw out the idea of optimizing it to run on one thread.

I want to add, that today engines already reach in single mode high depths very fast. Due to extensive pruning, reductions etc. But they also miss many things.
Isn't it reasonable to think that they will gain by searching wider (less reductions and pruning), and not only by going deeper?

syzygy · Post by **syzygy** » Thu Jul 04, 2013 8:48 pm

bob wrote:
syzygy wrote:
bob wrote:"An introduction to parallel programming" by Peter Pacheco, page 58, near the bottom.

speedup = T(serial) / T(parallel)

efficiency = speedup / nprocs

It's not a matter of liking or not liking a term, it is one that works, is clear, and understandable. If you use 4 cpus to search twice as fast, your efficiency is 50%, where if you use 4 cpus to search 4x as fast, your efficiency is 100%. Zero lost effort...
The definition clearly assumes that the same work is done.

When does a 4-core search perform the same work as a 1-core search?
The answer is obvious. It searches a position to a fixed depth.

That answer is very clearly false for Komodo. Komodo's 4-core search to a particular depth does more useful work than Komodo's 1-core search to that same depth. I though that would have been obvious by now......

bob · Post by **bob** » Thu Jul 04, 2013 9:12 pm

syzygy wrote:
bob wrote:
syzygy wrote:
bob wrote:"An introduction to parallel programming" by Peter Pacheco, page 58, near the bottom.

speedup = T(serial) / T(parallel)

efficiency = speedup / nprocs

It's not a matter of liking or not liking a term, it is one that works, is clear, and understandable. If you use 4 cpus to search twice as fast, your efficiency is 50%, where if you use 4 cpus to search 4x as fast, your efficiency is 100%. Zero lost effort...
The definition clearly assumes that the same work is done.

When does a 4-core search perform the same work as a 1-core search?
The answer is obvious. It searches a position to a fixed depth.
That answer is very clearly false for Komodo. Komodo's 4-core search to a particular depth does more useful work than Komodo's 1-core search to that same depth. I though that would have been obvious by now......

The definition doesn't change for a specific program. It is a general expression. If a program violates the basic premise of "search the same tree but in parallel" then the program deviates from the formula, not the other way around.

More useful work seems probable. But we don't have the timing information to show what that costs. Clearly disabling LMR would make a program stronger, if time were not an issue, but that's not the case.

I'm STILL waiting on some rational explanation for how a program can get a speedup of 1.4x, but search more useful information. The speedup is STILL just 1.4x. Just increasing the EBF by 0.1 more than uses that .4x factor.

So, rather than continuing the argument about "if", I have tried, unsuccessfully so far, to try to discuss "how". So far, nothing technical, just the usual.

I tend to not accept things "just because". I want to understand. And I don't believe in things that can't be explained... so, tell me how...

bob · Post by **bob** » Thu Jul 04, 2013 9:17 pm

FlavusSnow wrote:I think about this very simply, and let me exaggerate just a little here for illustration purposes.

Say we have an algorithm that searches very narrowly by searching only the first move in the list for 4 ply before adding a second branch, and every 4 ply it adds another branch somewhere in the tree. This would search very deep on one thread, but not very wide. In fact this would probably result in very weak play on one thread. But now let's say I have 128 threads. Now I can unroll the search at the root a little and start to get some more information about the tree. Next thing you know, more branches are forming around the PV by a hive of searching threads and the tree is getting some meat. The engine sucked with one thread, but it might be plausible with 128 threads. These threads could each be on their own core, or maybe it would even function reasonably with the same 128 threads on one core. Would it perform like an engine designed for one core? Certainly not, but would the engine designed for one core get any gain from running on a 128 core system? Some gain, yes, but not nearly the gain obtained by the engine designed for 128 cores from the beginning.

I hope can see the concept. To most efficiently design an engine for large numbers of cores, you must throw out the idea of optimizing it to run on one thread.

However, back to my original point. If the SMP speedup is ONLY 1.4x for 4 cores, you don't have a lot of room to add extra width, before you completely consume that 40% speed (depth) gain, and start to actually slow the program down. If it is still stronger, even after being slowed down, that suggests that it could be made stronger in the serial search just as easily, using this same idea...

That 1.4x seems to be a hurdle that is pretty high.

bob · Post by **bob** » Thu Jul 04, 2013 9:18 pm

Joerg Oster wrote:
FlavusSnow wrote:I think about this very simply, and let me exaggerate just a little here for illustration purposes.

Say we have an algorithm that searches very narrowly by searching only the first move in the list for 4 ply before adding a second branch, and every 4 ply it adds another branch somewhere in the tree. This would search very deep on one thread, but not very wide. In fact this would probably result in very weak play on one thread. But now let's say I have 128 threads. Now I can unroll the search at the root a little and start to get some more information about the tree. Next thing you know, more branches are forming around the PV by a hive of searching threads and the tree is getting some meat. The engine sucked with one thread, but it might be plausible with 128 threads. These threads could each be on their own core, or maybe it would even function reasonably with the same 128 threads on one core. Would it perform like an engine designed for one core? Certainly not, but would the engine designed for one core get any gain from running on a 128 core system? Some gain, yes, but not nearly the gain obtained by the engine designed for 128 cores from the beginning.

I hope can see the concept. To most efficiently design an engine for large numbers of cores, you must throw out the idea of optimizing it to run on one thread.
I want to add, that today engines already reach in single mode high depths very fast. Due to extensive pruning, reductions etc. But they also miss many things.
Isn't it reasonable to think that they will gain by searching wider (less reductions and pruning), and not only by going deeper?

Doesn't that same concept apply to a serial search on a CPU that is 2x faster? You could fix the depth and throttle back a bit on the reductions and pruning to use that extra power to make the tree wider, not deeper, obviously...

bob · Post by **bob** » Thu Jul 04, 2013 9:20 pm

FlavusSnow wrote:I think about this very simply, and let me exaggerate just a little here for illustration purposes.

Say we have an algorithm that searches very narrowly by searching only the first move in the list for 4 ply before adding a second branch, and every 4 ply it adds another branch somewhere in the tree. This would search very deep on one thread, but not very wide. In fact this would probably result in very weak play on one thread. But now let's say I have 128 threads. Now I can unroll the search at the root a little and start to get some more information about the tree. Next thing you know, more branches are forming around the PV by a hive of searching threads and the tree is getting some meat. The engine sucked with one thread, but it might be plausible with 128 threads. These threads could each be on their own core, or maybe it would even function reasonably with the same 128 threads on one core. Would it perform like an engine designed for one core? Certainly not, but would the engine designed for one core get any gain from running on a 128 core system? Some gain, yes, but not nearly the gain obtained by the engine designed for 128 cores from the beginning.

I hope can see the concept. To most efficiently design an engine for large numbers of cores, you must throw out the idea of optimizing it to run on one thread.

However, back to my original point. If the SMP speedup is ONLY 1.4x for 4 cores, you don't have a lot of room to add extra width, before you completely consume that 40% speed (depth) gain, and start to actually slow the program down. If it is still stronger, even after being slowed down, that suggests that it could be made stronger in the serial search just as easily, using this same idea...

That 1.4x seems to be a hurdle that is pretty high.

Laskos · Post by **Laskos** » Thu Jul 04, 2013 9:40 pm

bob wrote:
FlavusSnow wrote:I think about this very simply, and let me exaggerate just a little here for illustration purposes.

Say we have an algorithm that searches very narrowly by searching only the first move in the list for 4 ply before adding a second branch, and every 4 ply it adds another branch somewhere in the tree. This would search very deep on one thread, but not very wide. In fact this would probably result in very weak play on one thread. But now let's say I have 128 threads. Now I can unroll the search at the root a little and start to get some more information about the tree. Next thing you know, more branches are forming around the PV by a hive of searching threads and the tree is getting some meat. The engine sucked with one thread, but it might be plausible with 128 threads. These threads could each be on their own core, or maybe it would even function reasonably with the same 128 threads on one core. Would it perform like an engine designed for one core? Certainly not, but would the engine designed for one core get any gain from running on a 128 core system? Some gain, yes, but not nearly the gain obtained by the engine designed for 128 cores from the beginning.

I hope can see the concept. To most efficiently design an engine for large numbers of cores, you must throw out the idea of optimizing it to run on one thread.
However, back to my original point. If the SMP speedup is ONLY 1.4x for 4 cores, you don't have a lot of room to add extra width, before you completely consume that 40% speed (depth) gain, and start to actually slow the program down. If it is still stronger, even after being slowed down, that suggests that it could be made stronger in the serial search just as easily, using this same idea...

That 1.4x seems to be a hurdle that is pretty high.

To illustrate practically the point at fixed _time_ I show the result in LittleBlitzer, where time, speed and depths are measured:

Code: Select all

 1.  Komodo 5.1 4threads      	594.0/874	&#40;tpm=107.5ms d=16.69 nps=7606453&#41;
 2.  Komodo 5.1 1thread       	280.0/874	&#40;tpm=108.4ms d=15.79 nps=2081020&#41;
+131 points

 1.  Houdini 3 4threads       	180.5/265	&#40;tpm=104.0ms d=18.37 nps=8163480&#41;
 2.  Houdini 3 1thread        	 84.5/265	&#40;tpm=103.8ms d=16.46 nps=3022500&#41;
+132 points

This test is not to compare the gain, but to see depths and NPS, while the Elo gain is comparable. To the similar Elo gain at fixed _time_, Komodo 1->4 cores gains 0.90 plies in depth, Houdini 1->4 cores gains 1.91 plies. NPS for Komodo 1->4 increased by a factor of 3.66. While all the effort in Houdini is devouted to depth, in Komodo it's pretty evenly divided for depth and width. In all your ad absurdum arguments, you are talking as though this 3.66 increase in NPS doesn't exist. Nobody is saying that Komodo parallel at the _same_ speed is stronger at fixed _time_ than Komodo single cored, or that Komodo parallel is stronger than 4x time single cored.

back to the Komodo SMP issue

Re: back to the Komodo SMP issue

Re: back to the Komodo SMP issue

Re: back to the Komodo SMP issue

Re: back to the Komodo SMP issue

Re: back to the Komodo SMP issue

Re: back to the Komodo SMP issue

Re: back to the Komodo SMP issue

Re: back to the Komodo SMP issue

Re: back to the Komodo SMP issue

Re: back to the Komodo SMP issue