back to the Komodo SMP issue

Discussion of anything and everything relating to chess playing software and machines.

Moderators: hgm, Rebel, chrisw

Henk
Posts: 7216
Joined: Mon May 27, 2013 10:31 am

Re: back to the Komodo SMP issue

Post by Henk »

So probably at search levels above 18 the average branching factor should be greater than 2 otherwise searching faster doesn't give better games in terms of ELO rating.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: back to the Komodo SMP issue

Post by bob »

Henk wrote:So probably at search levels above 18 the average branching factor should be greater than 2 otherwise searching faster doesn't give better games in terms of ELO rating.
I've not seen any sort of research that shows lower EBF plays worse. In fact, quite the contrary based on LMR/forward-pruning results we see today. One can get too speculative, of course. But 2.0 doesn't seem to be some sort of "barrier".
FlavusSnow
Posts: 89
Joined: Thu Apr 01, 2010 5:28 am
Location: Omaha, NE

Re: back to the Komodo SMP issue

Post by FlavusSnow »

I think about this very simply, and let me exaggerate just a little here for illustration purposes.

Say we have an algorithm that searches very narrowly by searching only the first move in the list for 4 ply before adding a second branch, and every 4 ply it adds another branch somewhere in the tree. This would search very deep on one thread, but not very wide. In fact this would probably result in very weak play on one thread. But now let's say I have 128 threads. Now I can unroll the search at the root a little and start to get some more information about the tree. Next thing you know, more branches are forming around the PV by a hive of searching threads and the tree is getting some meat. The engine sucked with one thread, but it might be plausible with 128 threads. These threads could each be on their own core, or maybe it would even function reasonably with the same 128 threads on one core. Would it perform like an engine designed for one core? Certainly not, but would the engine designed for one core get any gain from running on a 128 core system? Some gain, yes, but not nearly the gain obtained by the engine designed for 128 cores from the beginning.

I hope can see the concept. To most efficiently design an engine for large numbers of cores, you must throw out the idea of optimizing it to run on one thread.
Joerg Oster
Posts: 937
Joined: Fri Mar 10, 2006 4:29 pm
Location: Germany

Re: back to the Komodo SMP issue

Post by Joerg Oster »

FlavusSnow wrote:I think about this very simply, and let me exaggerate just a little here for illustration purposes.

Say we have an algorithm that searches very narrowly by searching only the first move in the list for 4 ply before adding a second branch, and every 4 ply it adds another branch somewhere in the tree. This would search very deep on one thread, but not very wide. In fact this would probably result in very weak play on one thread. But now let's say I have 128 threads. Now I can unroll the search at the root a little and start to get some more information about the tree. Next thing you know, more branches are forming around the PV by a hive of searching threads and the tree is getting some meat. The engine sucked with one thread, but it might be plausible with 128 threads. These threads could each be on their own core, or maybe it would even function reasonably with the same 128 threads on one core. Would it perform like an engine designed for one core? Certainly not, but would the engine designed for one core get any gain from running on a 128 core system? Some gain, yes, but not nearly the gain obtained by the engine designed for 128 cores from the beginning.

I hope can see the concept. To most efficiently design an engine for large numbers of cores, you must throw out the idea of optimizing it to run on one thread.
I want to add, that today engines already reach in single mode high depths very fast. Due to extensive pruning, reductions etc. But they also miss many things.
Isn't it reasonable to think that they will gain by searching wider (less reductions and pruning), and not only by going deeper?
Jörg Oster
syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

Re: back to the Komodo SMP issue

Post by syzygy »

bob wrote:
syzygy wrote:
bob wrote:"An introduction to parallel programming" by Peter Pacheco, page 58, near the bottom.

speedup = T(serial) / T(parallel)

efficiency = speedup / nprocs

It's not a matter of liking or not liking a term, it is one that works, is clear, and understandable. If you use 4 cpus to search twice as fast, your efficiency is 50%, where if you use 4 cpus to search 4x as fast, your efficiency is 100%. Zero lost effort...
The definition clearly assumes that the same work is done.

When does a 4-core search perform the same work as a 1-core search?
The answer is obvious. It searches a position to a fixed depth.
That answer is very clearly false for Komodo. Komodo's 4-core search to a particular depth does more useful work than Komodo's 1-core search to that same depth. I though that would have been obvious by now......
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: back to the Komodo SMP issue

Post by bob »

syzygy wrote:
bob wrote:
syzygy wrote:
bob wrote:"An introduction to parallel programming" by Peter Pacheco, page 58, near the bottom.

speedup = T(serial) / T(parallel)

efficiency = speedup / nprocs

It's not a matter of liking or not liking a term, it is one that works, is clear, and understandable. If you use 4 cpus to search twice as fast, your efficiency is 50%, where if you use 4 cpus to search 4x as fast, your efficiency is 100%. Zero lost effort...
The definition clearly assumes that the same work is done.

When does a 4-core search perform the same work as a 1-core search?
The answer is obvious. It searches a position to a fixed depth.
That answer is very clearly false for Komodo. Komodo's 4-core search to a particular depth does more useful work than Komodo's 1-core search to that same depth. I though that would have been obvious by now......
The definition doesn't change for a specific program. It is a general expression. If a program violates the basic premise of "search the same tree but in parallel" then the program deviates from the formula, not the other way around.

More useful work seems probable. But we don't have the timing information to show what that costs. Clearly disabling LMR would make a program stronger, if time were not an issue, but that's not the case.

I'm STILL waiting on some rational explanation for how a program can get a speedup of 1.4x, but search more useful information. The speedup is STILL just 1.4x. Just increasing the EBF by 0.1 more than uses that .4x factor.

So, rather than continuing the argument about "if", I have tried, unsuccessfully so far, to try to discuss "how". So far, nothing technical, just the usual.

I tend to not accept things "just because". I want to understand. And I don't believe in things that can't be explained... so, tell me how...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: back to the Komodo SMP issue

Post by bob »

FlavusSnow wrote:I think about this very simply, and let me exaggerate just a little here for illustration purposes.

Say we have an algorithm that searches very narrowly by searching only the first move in the list for 4 ply before adding a second branch, and every 4 ply it adds another branch somewhere in the tree. This would search very deep on one thread, but not very wide. In fact this would probably result in very weak play on one thread. But now let's say I have 128 threads. Now I can unroll the search at the root a little and start to get some more information about the tree. Next thing you know, more branches are forming around the PV by a hive of searching threads and the tree is getting some meat. The engine sucked with one thread, but it might be plausible with 128 threads. These threads could each be on their own core, or maybe it would even function reasonably with the same 128 threads on one core. Would it perform like an engine designed for one core? Certainly not, but would the engine designed for one core get any gain from running on a 128 core system? Some gain, yes, but not nearly the gain obtained by the engine designed for 128 cores from the beginning.

I hope can see the concept. To most efficiently design an engine for large numbers of cores, you must throw out the idea of optimizing it to run on one thread.
However, back to my original point. If the SMP speedup is ONLY 1.4x for 4 cores, you don't have a lot of room to add extra width, before you completely consume that 40% speed (depth) gain, and start to actually slow the program down. If it is still stronger, even after being slowed down, that suggests that it could be made stronger in the serial search just as easily, using this same idea...

That 1.4x seems to be a hurdle that is pretty high.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: back to the Komodo SMP issue

Post by bob »

Joerg Oster wrote:
FlavusSnow wrote:I think about this very simply, and let me exaggerate just a little here for illustration purposes.

Say we have an algorithm that searches very narrowly by searching only the first move in the list for 4 ply before adding a second branch, and every 4 ply it adds another branch somewhere in the tree. This would search very deep on one thread, but not very wide. In fact this would probably result in very weak play on one thread. But now let's say I have 128 threads. Now I can unroll the search at the root a little and start to get some more information about the tree. Next thing you know, more branches are forming around the PV by a hive of searching threads and the tree is getting some meat. The engine sucked with one thread, but it might be plausible with 128 threads. These threads could each be on their own core, or maybe it would even function reasonably with the same 128 threads on one core. Would it perform like an engine designed for one core? Certainly not, but would the engine designed for one core get any gain from running on a 128 core system? Some gain, yes, but not nearly the gain obtained by the engine designed for 128 cores from the beginning.

I hope can see the concept. To most efficiently design an engine for large numbers of cores, you must throw out the idea of optimizing it to run on one thread.
I want to add, that today engines already reach in single mode high depths very fast. Due to extensive pruning, reductions etc. But they also miss many things.
Isn't it reasonable to think that they will gain by searching wider (less reductions and pruning), and not only by going deeper?
Doesn't that same concept apply to a serial search on a CPU that is 2x faster? You could fix the depth and throttle back a bit on the reductions and pruning to use that extra power to make the tree wider, not deeper, obviously...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: back to the Komodo SMP issue

Post by bob »

FlavusSnow wrote:I think about this very simply, and let me exaggerate just a little here for illustration purposes.

Say we have an algorithm that searches very narrowly by searching only the first move in the list for 4 ply before adding a second branch, and every 4 ply it adds another branch somewhere in the tree. This would search very deep on one thread, but not very wide. In fact this would probably result in very weak play on one thread. But now let's say I have 128 threads. Now I can unroll the search at the root a little and start to get some more information about the tree. Next thing you know, more branches are forming around the PV by a hive of searching threads and the tree is getting some meat. The engine sucked with one thread, but it might be plausible with 128 threads. These threads could each be on their own core, or maybe it would even function reasonably with the same 128 threads on one core. Would it perform like an engine designed for one core? Certainly not, but would the engine designed for one core get any gain from running on a 128 core system? Some gain, yes, but not nearly the gain obtained by the engine designed for 128 cores from the beginning.

I hope can see the concept. To most efficiently design an engine for large numbers of cores, you must throw out the idea of optimizing it to run on one thread.
However, back to my original point. If the SMP speedup is ONLY 1.4x for 4 cores, you don't have a lot of room to add extra width, before you completely consume that 40% speed (depth) gain, and start to actually slow the program down. If it is still stronger, even after being slowed down, that suggests that it could be made stronger in the serial search just as easily, using this same idea...

That 1.4x seems to be a hurdle that is pretty high.
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: back to the Komodo SMP issue

Post by Laskos »

bob wrote:
FlavusSnow wrote:I think about this very simply, and let me exaggerate just a little here for illustration purposes.

Say we have an algorithm that searches very narrowly by searching only the first move in the list for 4 ply before adding a second branch, and every 4 ply it adds another branch somewhere in the tree. This would search very deep on one thread, but not very wide. In fact this would probably result in very weak play on one thread. But now let's say I have 128 threads. Now I can unroll the search at the root a little and start to get some more information about the tree. Next thing you know, more branches are forming around the PV by a hive of searching threads and the tree is getting some meat. The engine sucked with one thread, but it might be plausible with 128 threads. These threads could each be on their own core, or maybe it would even function reasonably with the same 128 threads on one core. Would it perform like an engine designed for one core? Certainly not, but would the engine designed for one core get any gain from running on a 128 core system? Some gain, yes, but not nearly the gain obtained by the engine designed for 128 cores from the beginning.

I hope can see the concept. To most efficiently design an engine for large numbers of cores, you must throw out the idea of optimizing it to run on one thread.
However, back to my original point. If the SMP speedup is ONLY 1.4x for 4 cores, you don't have a lot of room to add extra width, before you completely consume that 40% speed (depth) gain, and start to actually slow the program down. If it is still stronger, even after being slowed down, that suggests that it could be made stronger in the serial search just as easily, using this same idea...

That 1.4x seems to be a hurdle that is pretty high.
To illustrate practically the point at fixed _time_ I show the result in LittleBlitzer, where time, speed and depths are measured:

Code: Select all

 1.  Komodo 5.1 4threads      	594.0/874	(tpm=107.5ms d=16.69 nps=7606453)
 2.  Komodo 5.1 1thread       	280.0/874	(tpm=108.4ms d=15.79 nps=2081020)
+131 points

 1.  Houdini 3 4threads       	180.5/265	(tpm=104.0ms d=18.37 nps=8163480)
 2.  Houdini 3 1thread        	 84.5/265	(tpm=103.8ms d=16.46 nps=3022500)
+132 points
This test is not to compare the gain, but to see depths and NPS, while the Elo gain is comparable. To the similar Elo gain at fixed _time_, Komodo 1->4 cores gains 0.90 plies in depth, Houdini 1->4 cores gains 1.91 plies. NPS for Komodo 1->4 increased by a factor of 3.66. While all the effort in Houdini is devouted to depth, in Komodo it's pretty evenly divided for depth and width. In all your ad absurdum arguments, you are talking as though this 3.66 increase in NPS doesn't exist. Nobody is saying that Komodo parallel at the _same_ speed is stronger at fixed _time_ than Komodo single cored, or that Komodo parallel is stronger than 4x time single cored.