Not carrying a date. This is probably the famous chip demonstrated a few years ago; Intel didn't continue it, AFAIK. Note that a simple GPU already delivers nearly 3 Tflops nowadays (single precision). The Nvidia Fermi chip seems interesting here, as it will be the first one to carry a real cache hierarchy.
A Go program that was implemented on the latest Nvidia chip (the ones sold now; a 295 chip or something, I thought) had the same speed as a dual-core Core 2 in nodes per second (let's not even discuss parallel speedup yet).
I felt that was a good achievement.
Yet it tells you something about the problems you have with manycore CPUs. Such a chip is, by the way, called a multicore CPU, but the terminology is dangerous.
A multicore chip doing integer work on 80 cores, each core with some local cache of its own, and especially a branch prediction unit that isn't painfully slow, that's worth something, you know.
A chip that's basically totally vector oriented is a much tougher nut when programming integer codes. Chess programs have many branches, which right now only run fast on x64 CPUs. On top of that you need a shared hashtable, and on GPUs having a big shared hashtable is complicated.
For example, on the dual-GPU cards the RAM is simply not shared between the two GPUs at all. Yes, there is some sort of link you can program for, but that doesn't make it easier to get software running on it, as every set of cores has to execute the same code at the same time.
For example, the 240 cores of the current Nvidia generation are split up into 30 multiprocessors of 8 cores each, and the hardware executes threads in lockstep groups (warps) of 32.
So 32 threads execute the same instruction at the same time. That doesn't make it easier, as in a chess program the ideal is to have each core busy on a different part of the code, with each core working on a different position.
Having 32 cores busy on the same position is no fun, of course.
Larrabee is even worse here, in that the indirection needed to address the cores at independent positions is really slow. All those instructions take well over 7 cycles, according to the information leaked so far on Larrabee.
So you directly lose basically a factor of 8 or so, which annihilates a lot of the advantage of having so many cores on these vector or manycore processors.
That's why those expensive machines with many x86 cores do so well at computer chess: each core can run a different part of your code, they have really good caches and really good branch prediction, and the hashtable is shared between the processors.
You don't want to search the same position twice, of course; that's where the transposition table kicks in.
If you search 100 million chess positions a second, ideally you want to do 100 MILLION lookups per second into the hashtable to see whether you already visited each position.
Now a cluster with just gigabit ethernet or so can do something like 1000 lookups per second there, so you're missing a factor of 100,000 in lookup speed, which makes your search really inefficient.
The 72- and 96-core machines with very fast shared memory that Rybka and DeepSjeng run on nowadays are this expensive precisely because they have fast lookup speeds.
I guess you can somehow call every shared-memory machine a cluster, but really it's much more than that. The memory subsystem is where all the money goes.
It was Frans Morsch who put it right, in a chat with me some years ago. He said: "above 400 MHz, the copper traces on mainboards are basically transistor radios, so not usable to connect other parts".
This is the big problem when scaling the memory system from 1 mainboard to 16 mainboards.
Solving the fast communication between all the processors (not to mention cache snooping and so on) is really complicated. Intel already has years of delay with their upcoming Xeon MP platform.
AMD has really dominated there in the 4-socket region for a few years now, since 2004.
So the next step is to go to manufacturers like HP and Unisys, which have supercomputers with this memory subsystem solved. Those machines are *really* expensive.
Vincent