No, you are simply in over your head here. I explicitly asked "how does one widen the search?" without losing something in the process. I gave a specific example using your 2.1 EBF number, which seems to match Komodo pretty well. But that does not address the question. You can't just say "go wider". The question is how. And how do you avoid slowing things down, since a tree is a tree, and it still has to be searched. You can't just search the extra nodes and hand-wave them away. They have to be searched by the parallel search...

Laskos wrote:

> bob wrote:
>
> > FlavusSnow wrote:
> >
> > > I think about this very simply, and let me exaggerate just a little here for illustration purposes.
> > >
> > > Say we have an algorithm that searches very narrowly: it searches only the first move in the list for 4 plies before adding a second branch, and every 4 plies it adds another branch somewhere in the tree. This would search very deep on one thread, but not very wide. In fact, it would probably play very weakly on one thread. But now let's say I have 128 threads. Now I can unroll the search at the root a little and start to get some more information about the tree. Next thing you know, more branches are forming around the PV from a hive of searching threads, and the tree is getting some meat. The engine sucked with one thread, but it might be plausible with 128 threads. These threads could each be on their own core, or it might even function reasonably with the same 128 threads on one core. Would it perform like an engine designed for one core? Certainly not. But would the engine designed for one core get any gain from running on a 128-core system? Some gain, yes, but not nearly the gain obtained by the engine designed for 128 cores from the beginning.
> > >
> > > I hope you can see the concept. To most efficiently design an engine for large numbers of cores, you must throw out the idea of optimizing it to run on one thread.
> >
> > However, back to my original point. If the SMP speedup is ONLY 1.4x for 4 cores, you don't have a lot of room to add extra width before you completely consume that 40% speed (depth) gain and start to actually slow the program down. If it is still stronger even after being slowed down, that suggests it could be made stronger in the serial search just as easily, using this same idea...
> >
> > That 1.4x seems to be a hurdle that is pretty high.
>
> To illustrate the point practically at fixed _time_, I show the result in LittleBlitzer, where time, speed and depths are measured:
>
> Code:
>
>     1. Komodo 5.1 4threads   594.0/874  (tpm=107.5ms d=16.69 nps=7606453)
>     2. Komodo 5.1 1thread    280.0/874  (tpm=108.4ms d=15.79 nps=2081020)
>     +131 points
>
>     1. Houdini 3  4threads   180.5/265  (tpm=104.0ms d=18.37 nps=8163480)
>     2. Houdini 3  1thread     84.5/265  (tpm=103.8ms d=16.46 nps=3022500)
>     +132 points
>
> This test is not meant to compare the gains, but to look at depths and NPS while the Elo gain is comparable. For a similar Elo gain at fixed _time_, Komodo gains 0.90 plies of depth from 1->4 cores, while Houdini gains 1.91 plies. NPS for Komodo 1->4 increased by a factor of 3.66. While all the effort in Houdini is devoted to depth, in Komodo it is divided pretty evenly between depth and width. In all your ad absurdum arguments, you talk as though this 3.66x increase in NPS doesn't exist. Nobody is saying that Komodo parallel at the _same_ speed is stronger at fixed _time_ than single-core Komodo, or that Komodo parallel is stronger than single-core Komodo at 4x time.
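A quick sanity check on the numbers above: if one assumes the 2.1 EBF figure from earlier in the thread applies (an assumption for illustration, not something measured in this test), the total NPS factor can be split into a part "spent" on reaching extra depth and a leftover part attributable to extra width. A rough sketch:

```python
# Assumed effective branching factor (the 2.1 figure from the thread),
# treated as constant across depths -- a simplification for illustration.
EBF = 2.1

def split_speedup(nps_factor, depth_gain, ebf=EBF):
    """Split a total NPS speedup into the part consumed by extra depth
    (ebf ** depth_gain) and the remainder, attributed to extra width."""
    depth_part = ebf ** depth_gain
    width_part = nps_factor / depth_part
    return depth_part, width_part

# Laskos's measurements, 1 -> 4 cores at fixed time.
komodo = split_speedup(nps_factor=3.66, depth_gain=0.90)
houdini = split_speedup(nps_factor=8163480 / 3022500,  # ~2.70x from the NPS figures
                        depth_gain=1.91)

print(f"Komodo : depth x{komodo[0]:.2f}, width x{komodo[1]:.2f}")
print(f"Houdini: depth x{houdini[0]:.2f}, width x{houdini[1]:.2f}")
```

Under this crude constant-EBF model, Komodo's 3.66x splits into about 1.95x for its 0.90 extra plies and about 1.88x left over for width, consistent with the "evenly divided" observation. For Houdini, 2.1^1.91 ≈ 4.1 already exceeds its ~2.70x NPS factor, which mainly shows that a single fixed EBF cannot describe both engines; it does not change the qualitative picture that Houdini's gain is essentially all depth.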
That's the issue here. How does one actually do this, in a way that mathematically stands up to scrutiny?

The 3.66x NPS increase doesn't mean much by itself, other than that the algorithm scales pretty well on that particular architecture, with no major memory bottlenecks or synchronization issues. But SOMETHING has to be done with those nodes in parallel, and that is what I am trying to understand. I understand parallel alpha/beta perfectly. And I understand what "just going wider" is going to do.
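The budget argument here can be put in numbers. With an effective branching factor b, a time-to-depth speedup s buys log_b(s) extra plies at fixed time; deliberately widening the tree by a node factor w shrinks that to log_b(s/w), which goes negative once w exceeds s. A sketch using the 1.4x SMP speedup and the 2.1 EBF figures from the thread (both taken as given here, not measured):

```python
import math

# Assumed effective branching factor (the 2.1 figure from the thread).
EBF = 2.1

def extra_depth(speedup, widen_factor, ebf=EBF):
    """Extra plies reachable at fixed time, given a time-to-depth speedup
    and a deliberate widening that multiplies the node count by widen_factor.
    A negative result means the widened parallel search ends up *shallower*
    than the serial one."""
    return math.log(speedup / widen_factor) / math.log(ebf)

for w in (1.0, 1.2, 1.4, 1.6):
    print(f"widen x{w:.1f}: {extra_depth(1.4, w):+.2f} plies")
```

With only 1.4x to spend and an EBF of 2.1, the entire depth budget is about 0.45 ply; widen the search by more than a factor of 1.4 and the parallel program is actually shallower than the serial one at equal time, which is exactly the point that the extra nodes cannot be hand-waved away.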