back to the Komodo SMP issue
Moderators: hgm, Rebel, chrisw
-
- Posts: 7220
- Joined: Mon May 27, 2013 10:31 am
Re: back to the Komodo SMP issue
So probably at search levels above 18 the average branching factor should be greater than 2 otherwise searching faster doesn't give better games in terms of ELO rating.
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: back to the Komodo SMP issue
I've not seen any sort of research that shows lower EBF plays worse. In fact, quite the contrary based on LMR/forward-pruning results we see today. One can get too speculative, of course. But 2.0 doesn't seem to be some sort of "barrier".Henk wrote:So probably at search levels above 18 the average branching factor should be greater than 2 otherwise searching faster doesn't give better games in terms of ELO rating.
-
- Posts: 89
- Joined: Thu Apr 01, 2010 5:28 am
- Location: Omaha, NE
Re: back to the Komodo SMP issue
I think about this very simply, and let me exaggerate just a little here for illustration purposes.
Say we have an algorithm that searches very narrowly by searching only the first move in the list for 4 ply before adding a second branch, and every 4 ply it adds another branch somewhere in the tree. This would search very deep on one thread, but not very wide. In fact this would probably result in very weak play on one thread. But now let's say I have 128 threads. Now I can unroll the search at the root a little and start to get some more information about the tree. Next thing you know, more branches are forming around the PV by a hive of searching threads and the tree is getting some meat. The engine sucked with one thread, but it might be plausible with 128 threads. These threads could each be on their own core, or maybe it would even function reasonably with the same 128 threads on one core. Would it perform like an engine designed for one core? Certainly not, but would the engine designed for one core get any gain from running on a 128 core system? Some gain, yes, but not nearly the gain obtained by the engine designed for 128 cores from the beginning.
I hope can see the concept. To most efficiently design an engine for large numbers of cores, you must throw out the idea of optimizing it to run on one thread.
Say we have an algorithm that searches very narrowly by searching only the first move in the list for 4 ply before adding a second branch, and every 4 ply it adds another branch somewhere in the tree. This would search very deep on one thread, but not very wide. In fact this would probably result in very weak play on one thread. But now let's say I have 128 threads. Now I can unroll the search at the root a little and start to get some more information about the tree. Next thing you know, more branches are forming around the PV by a hive of searching threads and the tree is getting some meat. The engine sucked with one thread, but it might be plausible with 128 threads. These threads could each be on their own core, or maybe it would even function reasonably with the same 128 threads on one core. Would it perform like an engine designed for one core? Certainly not, but would the engine designed for one core get any gain from running on a 128 core system? Some gain, yes, but not nearly the gain obtained by the engine designed for 128 cores from the beginning.
I hope can see the concept. To most efficiently design an engine for large numbers of cores, you must throw out the idea of optimizing it to run on one thread.
-
- Posts: 937
- Joined: Fri Mar 10, 2006 4:29 pm
- Location: Germany
Re: back to the Komodo SMP issue
I want to add, that today engines already reach in single mode high depths very fast. Due to extensive pruning, reductions etc. But they also miss many things.FlavusSnow wrote:I think about this very simply, and let me exaggerate just a little here for illustration purposes.
Say we have an algorithm that searches very narrowly by searching only the first move in the list for 4 ply before adding a second branch, and every 4 ply it adds another branch somewhere in the tree. This would search very deep on one thread, but not very wide. In fact this would probably result in very weak play on one thread. But now let's say I have 128 threads. Now I can unroll the search at the root a little and start to get some more information about the tree. Next thing you know, more branches are forming around the PV by a hive of searching threads and the tree is getting some meat. The engine sucked with one thread, but it might be plausible with 128 threads. These threads could each be on their own core, or maybe it would even function reasonably with the same 128 threads on one core. Would it perform like an engine designed for one core? Certainly not, but would the engine designed for one core get any gain from running on a 128 core system? Some gain, yes, but not nearly the gain obtained by the engine designed for 128 cores from the beginning.
I hope can see the concept. To most efficiently design an engine for large numbers of cores, you must throw out the idea of optimizing it to run on one thread.
Isn't it reasonable to think that they will gain by searching wider (less reductions and pruning), and not only by going deeper?
Jörg Oster
-
- Posts: 5566
- Joined: Tue Feb 28, 2012 11:56 pm
Re: back to the Komodo SMP issue
That answer is very clearly false for Komodo. Komodo's 4-core search to a particular depth does more useful work than Komodo's 1-core search to that same depth. I though that would have been obvious by now......bob wrote:The answer is obvious. It searches a position to a fixed depth.syzygy wrote:The definition clearly assumes that the same work is done.bob wrote:"An introduction to parallel programming" by Peter Pacheco, page 58, near the bottom.
speedup = T(serial) / T(parallel)
efficiency = speedup / nprocs
It's not a matter of liking or not liking a term, it is one that works, is clear, and understandable. If you use 4 cpus to search twice as fast, your efficiency is 50%, where if you use 4 cpus to search 4x as fast, your efficiency is 100%. Zero lost effort...
When does a 4-core search perform the same work as a 1-core search?
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: back to the Komodo SMP issue
The definition doesn't change for a specific program. It is a general expression. If a program violates the basic premise of "search the same tree but in parallel" then the program deviates from the formula, not the other way around.syzygy wrote:That answer is very clearly false for Komodo. Komodo's 4-core search to a particular depth does more useful work than Komodo's 1-core search to that same depth. I though that would have been obvious by now......bob wrote:The answer is obvious. It searches a position to a fixed depth.syzygy wrote:The definition clearly assumes that the same work is done.bob wrote:"An introduction to parallel programming" by Peter Pacheco, page 58, near the bottom.
speedup = T(serial) / T(parallel)
efficiency = speedup / nprocs
It's not a matter of liking or not liking a term, it is one that works, is clear, and understandable. If you use 4 cpus to search twice as fast, your efficiency is 50%, where if you use 4 cpus to search 4x as fast, your efficiency is 100%. Zero lost effort...
When does a 4-core search perform the same work as a 1-core search?
More useful work seems probable. But we don't have the timing information to show what that costs. Clearly disabling LMR would make a program stronger, if time were not an issue, but that's not the case.
I'm STILL waiting on some rational explanation for how a program can get a speedup of 1.4x, but search more useful information. The speedup is STILL just 1.4x. Just increasing the EBF by 0.1 more than uses that .4x factor.
So, rather than continuing the argument about "if", I have tried, unsuccessfully so far, to try to discuss "how". So far, nothing technical, just the usual.
I tend to not accept things "just because". I want to understand. And I don't believe in things that can't be explained... so, tell me how...
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: back to the Komodo SMP issue
However, back to my original point. If the SMP speedup is ONLY 1.4x for 4 cores, you don't have a lot of room to add extra width, before you completely consume that 40% speed (depth) gain, and start to actually slow the program down. If it is still stronger, even after being slowed down, that suggests that it could be made stronger in the serial search just as easily, using this same idea...FlavusSnow wrote:I think about this very simply, and let me exaggerate just a little here for illustration purposes.
Say we have an algorithm that searches very narrowly by searching only the first move in the list for 4 ply before adding a second branch, and every 4 ply it adds another branch somewhere in the tree. This would search very deep on one thread, but not very wide. In fact this would probably result in very weak play on one thread. But now let's say I have 128 threads. Now I can unroll the search at the root a little and start to get some more information about the tree. Next thing you know, more branches are forming around the PV by a hive of searching threads and the tree is getting some meat. The engine sucked with one thread, but it might be plausible with 128 threads. These threads could each be on their own core, or maybe it would even function reasonably with the same 128 threads on one core. Would it perform like an engine designed for one core? Certainly not, but would the engine designed for one core get any gain from running on a 128 core system? Some gain, yes, but not nearly the gain obtained by the engine designed for 128 cores from the beginning.
I hope can see the concept. To most efficiently design an engine for large numbers of cores, you must throw out the idea of optimizing it to run on one thread.
That 1.4x seems to be a hurdle that is pretty high.
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: back to the Komodo SMP issue
Doesn't that same concept apply to a serial search on a CPU that is 2x faster? You could fix the depth and throttle back a bit on the reductions and pruning to use that extra power to make the tree wider, not deeper, obviously...Joerg Oster wrote:I want to add, that today engines already reach in single mode high depths very fast. Due to extensive pruning, reductions etc. But they also miss many things.FlavusSnow wrote:I think about this very simply, and let me exaggerate just a little here for illustration purposes.
Say we have an algorithm that searches very narrowly by searching only the first move in the list for 4 ply before adding a second branch, and every 4 ply it adds another branch somewhere in the tree. This would search very deep on one thread, but not very wide. In fact this would probably result in very weak play on one thread. But now let's say I have 128 threads. Now I can unroll the search at the root a little and start to get some more information about the tree. Next thing you know, more branches are forming around the PV by a hive of searching threads and the tree is getting some meat. The engine sucked with one thread, but it might be plausible with 128 threads. These threads could each be on their own core, or maybe it would even function reasonably with the same 128 threads on one core. Would it perform like an engine designed for one core? Certainly not, but would the engine designed for one core get any gain from running on a 128 core system? Some gain, yes, but not nearly the gain obtained by the engine designed for 128 cores from the beginning.
I hope can see the concept. To most efficiently design an engine for large numbers of cores, you must throw out the idea of optimizing it to run on one thread.
Isn't it reasonable to think that they will gain by searching wider (less reductions and pruning), and not only by going deeper?
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: back to the Komodo SMP issue
However, back to my original point. If the SMP speedup is ONLY 1.4x for 4 cores, you don't have a lot of room to add extra width, before you completely consume that 40% speed (depth) gain, and start to actually slow the program down. If it is still stronger, even after being slowed down, that suggests that it could be made stronger in the serial search just as easily, using this same idea...FlavusSnow wrote:I think about this very simply, and let me exaggerate just a little here for illustration purposes.
Say we have an algorithm that searches very narrowly by searching only the first move in the list for 4 ply before adding a second branch, and every 4 ply it adds another branch somewhere in the tree. This would search very deep on one thread, but not very wide. In fact this would probably result in very weak play on one thread. But now let's say I have 128 threads. Now I can unroll the search at the root a little and start to get some more information about the tree. Next thing you know, more branches are forming around the PV by a hive of searching threads and the tree is getting some meat. The engine sucked with one thread, but it might be plausible with 128 threads. These threads could each be on their own core, or maybe it would even function reasonably with the same 128 threads on one core. Would it perform like an engine designed for one core? Certainly not, but would the engine designed for one core get any gain from running on a 128 core system? Some gain, yes, but not nearly the gain obtained by the engine designed for 128 cores from the beginning.
I hope can see the concept. To most efficiently design an engine for large numbers of cores, you must throw out the idea of optimizing it to run on one thread.
That 1.4x seems to be a hurdle that is pretty high.
-
- Posts: 10948
- Joined: Wed Jul 26, 2006 10:21 pm
- Full name: Kai Laskos
Re: back to the Komodo SMP issue
To illustrate practically the point at fixed _time_ I show the result in LittleBlitzer, where time, speed and depths are measured:bob wrote:However, back to my original point. If the SMP speedup is ONLY 1.4x for 4 cores, you don't have a lot of room to add extra width, before you completely consume that 40% speed (depth) gain, and start to actually slow the program down. If it is still stronger, even after being slowed down, that suggests that it could be made stronger in the serial search just as easily, using this same idea...FlavusSnow wrote:I think about this very simply, and let me exaggerate just a little here for illustration purposes.
Say we have an algorithm that searches very narrowly by searching only the first move in the list for 4 ply before adding a second branch, and every 4 ply it adds another branch somewhere in the tree. This would search very deep on one thread, but not very wide. In fact this would probably result in very weak play on one thread. But now let's say I have 128 threads. Now I can unroll the search at the root a little and start to get some more information about the tree. Next thing you know, more branches are forming around the PV by a hive of searching threads and the tree is getting some meat. The engine sucked with one thread, but it might be plausible with 128 threads. These threads could each be on their own core, or maybe it would even function reasonably with the same 128 threads on one core. Would it perform like an engine designed for one core? Certainly not, but would the engine designed for one core get any gain from running on a 128 core system? Some gain, yes, but not nearly the gain obtained by the engine designed for 128 cores from the beginning.
I hope can see the concept. To most efficiently design an engine for large numbers of cores, you must throw out the idea of optimizing it to run on one thread.
That 1.4x seems to be a hurdle that is pretty high.
Code: Select all
1. Komodo 5.1 4threads 594.0/874 (tpm=107.5ms d=16.69 nps=7606453)
2. Komodo 5.1 1thread 280.0/874 (tpm=108.4ms d=15.79 nps=2081020)
+131 points
1. Houdini 3 4threads 180.5/265 (tpm=104.0ms d=18.37 nps=8163480)
2. Houdini 3 1thread 84.5/265 (tpm=103.8ms d=16.46 nps=3022500)
+132 points