bob wrote: A future version of Crafty (hopefully this year) will also do so.

Zach Wegner wrote: How's this going, by the way? Got anything running yet?

bob wrote: No. But I've got lots of the design and some of the coding done... Every time I get started on details, I discover a better approach and take about ten steps back for every step forward.

M ANSARI wrote: If Crafty does come out with a cluster version, will it be specific to Crafty only, or will other engines be able to piggyback on the platform?
Forget about Crafty, Rybka or Sjeng getting any significant speedup during gameplay from a 100 Mbit or 1 Gbit standard non-DMA Ethernet card on old machines.
Getting a speedup of 2.0 out of n nodes is already really tough on such a cluster, and you'll soon realize that some old junk hardware is more than a factor of 2 slower than the latest Intel, AMD or even Sun CPU.
If you buy a bunch of cheap Q6600s now and expect such a cluster built from standard components to be faster than an 8-core Sun CPU in 2010, that's wishful thinking.
Some people have been bragging too much about clusters.
What I've got at home is a QM400 switch (16-way), 17 network cards and 17 cables. It requires special PCI-X-capable mainboards.
If you bought a new network built from newer PCIe network cards, you'd get cards that are probably 8 times faster in bandwidth than this QM400.
Yet that has a node price of around $1000 per node, which is still cheap; for large clusters the node price usually goes up to $3300 per node.
That's why I bought this second hand. Getting it to work with modern CPUs is not so easy.
I could recently have bought second hand (but of course didn't) some old junk dual Xeon 3.06 GHz machines, all equipped with PCI-X.
Had I bought 16 of them at $150 each, that would have been 16 x $150 = $2400 in total.
Right now I don't have a penny of income, or I might have been tempted to do it.
Of course it is a big waste of power, but it is a relatively cheap way to experiment with a lot of nodes and a lot of cores on a cluster. On top of that, the hardware is guaranteed to work with this network.
A single such box achieves about 220k nps for Diep.
Set power usage aside for now; I'll come back to it below, because the real problem of clusters is power usage.
16 x 220k nps = 3520k nps, i.e. about 3.5 million nps.
A single i7-965 gets exactly 1.0 million nps, which is also what the Phenom II 955 gets when overclocked a tad.
You'll quickly figure out that I could easily build 4 AMD 955 nodes for $2400, which in practice would get more nodes per second.
Note that if I used Q6600s I couldn't overclock them, as the PCI bus can't be overclocked.
Newer hardware always slams old hardware to death so bigtime that it's nearly impossible for the old stuff to compete.
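The nps comparison above can be checked with a quick back-of-the-envelope sketch; all figures are taken from this post:

```python
# Back-of-the-envelope nps check (all figures taken from the post).

old_nodes = 16                  # dual Xeon 3.06 GHz boxes
old_nps_per_node = 220_000      # Diep on one such box
old_cluster_nps = old_nodes * old_nps_per_node   # 3,520,000 nps

new_nodes = 4                   # Phenom II 955 boxes
new_nps_per_node = 1_000_000    # overclocked a tad
new_cluster_nps = new_nodes * new_nps_per_node   # 4,000,000 nps

print(old_cluster_nps, new_cluster_nps)  # 3520000 4000000
```

So on raw nps alone the 4 new boxes already edge out the 16-node junk cluster, before even counting power.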
Now let's compare the power usage.
That oldie P4 Xeon 3.06 GHz ate 105 watts per CPU; 2 CPUs is 210 watts.
Add the rest of the box and you end up with each machine eating about 270 watts.
So 16 x 270 W = 4320 W or so.
Now the i7-965 is too expensive, so let's go with the Phenom II 955, which is 200 euro or so. It eats nearly as much as the i7, which also eats too much power; both are around 220 watts, I'm guessing, under full Diep load.
4 x 220 W = 880 W.
Add to those power costs 800 watts for the switch and another 25 watts or so for each QM400 card. Those high-end network cards are never idle, of course.
At 16 nodes you feel that: 16 x 25 W = 400 W.
So the old junk that gets 3.5 million nps uses:
4320 W + 1200 W (switch plus cards) = about 5.5 kilowatts in total for the cluster solution of just 16 old, outdated and cheapo nodes.
For the 4M nps Phenom II cluster of 4 nodes it is:
880 W + 700 W for a switch + 100 W for four cards = 1680 W.
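Putting the two power budgets side by side, and dividing nps by watts, shows how lopsided the efficiency is. All numbers are the post's own figures (including the 700 W switch figure for the small cluster):

```python
# Power budget of both clusters (figures from the post; the 700 W
# switch figure for the new cluster is the post's own number).

old_watts = 16 * 270 + 800 + 16 * 25   # boxes + switch + QM400 cards
new_watts = 4 * 220 + 700 + 4 * 25     # boxes + switch + cards

old_eff = 3_520_000 / old_watts        # nps per watt, old cluster
new_eff = 4_000_000 / new_watts        # nps per watt, new cluster

print(old_watts, new_watts)            # 5520 1680
```

That's roughly 640 nps per watt for the old cluster versus roughly 2400 nps per watt for the new one: a factor of almost 4 in efficiency.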
Now the actual performance. This might surprise you: with the way I do parallel search, the 'oldie' cluster at 3.5 million nps might in fact do not much worse than the 4M nps Phenom II cluster. It might even do better.
The reason is simple. One Phenom II node is about 5 times faster than one of those Xeon nodes, which means it searches a ply or 2 deeper in the part of the tree that can't use the global hashtable, whereas the Xeon 3.06 GHz doesn't lose that.
Missing the hashtable in the last few plies is really pricey.
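The "ply or 2" figure can be sanity-checked with a rough depth model. If each extra ply multiplies search time by an effective branching factor b, a node that is s times faster reaches about log(s)/log(b) extra plies in the same time. The branching factors tried below are my assumption, not from the post:

```python
import math

# Rough estimate of the extra depth a 5x faster node reaches in the
# same time. If time per extra ply grows by an effective branching
# factor b, a speed factor s buys about log(s)/log(b) extra plies.
# The branching factors below are my assumption, not from the post.
s = 5.0                       # Phenom II node vs old Xeon node
for b in (2.0, 3.0, 4.0):
    extra_plies = math.log(s) / math.log(b)
    print(b, round(extra_plies, 2))   # ~2.32, ~1.46, ~1.16
```

For plausible branching factors that lands at roughly 1 to 2 extra plies, consistent with the estimate above.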
Relative to the 220k nps of one node, the communication latency of this network is pretty OK: about 3 us for a one-way ping-pong. At Ethernet latencies, however, even 220k nps gets hurt bigtime.
These are QM400 cards from Quadrics; google for them on eBay.
Note they do not work as Ethernet cards.
The latency of your 100 Mbit or even 1 Gbit Ethernet is more like 0.3 ms, so at least a factor of 100 slower.
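One way to see why that factor of 100 matters is to express a message's latency in nodes of search time lost, using the figures above:

```python
# Cost of one message expressed in "nodes of search time" lost
# (latency figures from the post; nps is for one old Xeon box).
nps = 220_000

qm400_latency = 3e-6        # ~3 us one-way ping-pong on the QM400
ethernet_latency = 3e-4     # ~0.3 ms on 100 Mbit / 1 Gbit Ethernet

nodes_lost_qm400 = nps * qm400_latency        # ~0.66 nodes/message
nodes_lost_ethernet = nps * ethernet_latency  # ~66 nodes/message
```

Less than one node of work per message on the Quadrics network, versus dozens of nodes per message on Ethernet, and a faster engine multiplies that Ethernet cost further.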
Diep was perhaps the first program to deal reasonably with these latencies relative to its nps, something the 'old' guys didn't manage.
Note they were not very helpful in providing information either, despite my having emailed them. So it was uncertain whether Diep would cope with the latencies on the SGI, which has the same latency as the cards I own now. Note that those SGI latencies are for a HUGE partition of 128 nodes and 512 processors at 500 MHz.
On that machine Diep reached above 5 million nps, with a peak of exactly 9.99 million nps. Just not 10 million. Believe me, I browsed those logfiles to see whether it ever got over 10 million, but it never managed.
Now, today's networks still have those ugly latencies.
The fastest and most expensive network out there right now, which IBM sometimes uses with its POWER6, is about 1.0 us.
So that's less than a factor of 3 faster than "the old junk" I've got over here, and probably a factor of 20 more bandwidth thanks to 2 rails (note that bandwidth is not that relevant in this case).
That factor of 3 sounds like a lot, but it isn't, because that 1.0 us is of course the usual HPC marketing number. As soon as you get a BIG partition you're not going to see one-way ping-pong times of 1 us; you only get that with a single switch.
With a big network with level-2 and level-3 routers, say 512 nodes or so, that latency will again be similar to what I've got at home here.
Add to that, Diep's nps is quite low compared to fast beancounters like Rybka (2.4 million nps or so on a single core, according to the latest rumours I heard).
So forget about getting a cluster version that works for you from those commercial guys.
Buying a new 2-socket machine is always faster than that.
As for Diep, it might be the only program that will one day work well on a cluster, if I can find the money to buy some quad-core nodes. Right now I own an old 2.4 GHz quad-core AMD, which replaced the 6 machines stolen from here. I already need it for source code editing, so testing on it is tough: it means I can't work in the source code at the same time.
But I won't give you the illusion that you can buy a cluster version of Diep that easily works for you on 100 Mbit or 1 Gbit networks.
I won't even say it "maybe works" by the end of 2009. It won't, unless you buy a low-latency network and manage to configure it (which you'll never manage yourself; you need an HPC sysadmin, as someone without HPC experience won't manage it). In that case you CAN buy a cluster copy, and I'll customize it especially for you to work with the network you've got (provided it has MPI or something similar).
Low-latency networks are not easy to configure. Note that it IS possible to buy low-latency TCP/IP cards. They're quite expensive though: nearly $1000 a piece, and they do DMA. On top of that you need a low-latency switch, which also costs $5000 new. Maybe eBay has all that a lot cheaper.
So cluster versions MIGHT work quite well on that, and some of you might be able to obtain it cheap.
Other than for Diep, forget about cluster versions at your desk that get a speedup of over 2.0 out of the network. For all that money and effort,
it's easier to buy a 4-socket machine, of course.
In June AMD is supposed to release a 24-core box. Now THAT kicks butt.
More than a 16-node P4 Xeon 3.06 GHz cluster with the world's best parallel algorithm, that's for sure.
Vincent