bob wrote:
diep wrote:
bob wrote:
Jouni wrote: I had just one basic question in mind: why can't the operating system (OS) do the MP thing automatically? Simply run software on all available CPU units, automatically, as fast as possible. It can't be too hard for the OS - or is it?
Completely impossible until the O/S actually UNDERSTANDS the algorithm and can make the necessary changes to turn a single-CPU program into one that uses more than one. Take a single-threaded chess engine and rewrite it to use more than one CPU effectively - that is what the operating system / compiler would have to do to accomplish the same thing. Could a compiler really either (a) understand all possible computationally intensive algorithms or (b) look at source code, actually understand the underlying ideas, and then rewrite it to use more than one CPU?
Not impossible, but at least 20-30-40 years away from the present state of the art...
The major projects on doing this have basically faded away for pretty obvious reasons...
Not only is it impossible, it would also be a crap algorithm, as some loser would implement it.
If you used Cilk, which means you have to call Cilk functions to get things done - already a step beyond "have the OS figure it out" - you would still get maybe 2,000 chess positions a second on each machine.
That's a factor of 10,000 slower than current engines, or so?
So 500 CPUs with Cilk would be slower than a single-CPU program without it, with 100% certainty.
Simply because Leiserson & co might be good book authors, but they cannot program themselves.
Generic work-stealing/load-balancing/auto-parallelization algorithms will always be a factor of 1,000 too slow for computer chess. The latency of those algorithms is always in the dozens of milliseconds, as the run queue still ticks at 100 Hz (a 10 ms quantum), and that won't change any time soon.
So we can prove that a generic implementation will always fail to deliver, as splitting, aborting and the other management takes, at least in Diep, just a fraction of the time that the bookkeeping of Cilk and similar systems eats.
When running on machines with hundreds of cores you simply need those millions of splits, and Cilk just can't deliver them quickly enough. So it's always a gigantic factor slower.
And against that mathematical proof, not even the help of some clever PhD students, let alone Intel, will ever change anything.
Yet please realize that with Cilk you still have to implement the YBW algorithm yourself through Cilk function calls. So it really is far removed from the one-liner dropped here - to which we always gladly respond.
Vincent
One of the best automatic parallelizing projects was done by Kuck and Associates for Cray years ago. Cray asked me to try the compiler and I did. It had no clue how to parallelize what was actually an iterated (rather than recursive) search. It parallelized move input and output loops and such, and produced zero speedup for the actual search, because it could not understand the dependencies caused by a variable-depth search.
I don't think software will ever be able to deal with automatically transforming a normal alpha/beta search into a parallel search. At least not until the point where the software actually begins to understand source code. That is a long, long way out in the future, if ever...
Amazingly, it seems some government dudes have revived Cilk, as Intel is quoting it in their speeches. Speaking of worthless crap, Cilk sure belongs there.
Also note it isn't actually parallelizing your software: you still must tell Cilk where to split and how. So it's not auto-parallelization at all.
From my viewpoint, most scientific and government people totally underestimate how tough it is to parallelize a game-tree search. They just have no clue.
Realize that at the start of the 21st century, around the year 2000, when I asked around, of the programmers I met who had by then parallelized their own engine, zero believed it was possible to parallelize a MODERN chess engine with a low branching factor for a big supercomputer without losing some gigantic factor, rendering the supercomputer near worthless.
They just didn't believe it, because of the total inefficiency of the guys before them who had tried it, most of them losing a factor of 40 to 50 - and even then they only got a few extra plies because of big forward pruning.
So when comparing the searches one to one, they were crap everywhere.
And I have to admit: had I known in advance about the total crap latencies from CPU to CPU, I would NOT have started the project of parallelizing Diep for the supercomputer. When I finally figured out how bad the latencies were - after writing a test myself and waiting six months or so to run it on the 512-processor partition, using a couple of hundred CPUs - I didn't really know myself whether I would manage.
That basically shows the big difference between experts and guys on the street.
So we can prove very simply that it's impossible for an OS or some abstraction layer to do this for you, as the hardware simply changes shape every time.
So without an expert first getting it to run perfectly, an abstraction layer cannot do it.
As they will never hire an expert who can do it to put it into an abstraction layer, simply because that expert is not going to have a US passport, we can therefore prove that this is impossible.
It's like saying: "Oh, I have a car that drives me from A to B, so I can drink something tonight, as the car will bring me home."
Most people will simply offer you a beer and move on - not realizing how difficult such an achievement is.