Not carrying a date. This is probably the famous chip demonstrated a few years ago; Intel didn't continue it, AFAIK. Note that a simple GPU already delivers nearly 3 Tflops nowadays (single precision). The Nvidia Fermi chip seems interesting here, as it will be the first one to carry a real cache hierarchy.
A Go program that was implemented on the latest Nvidia chip (the ones sold now; a 295 chip or something, I thought) had the same speed as a dual-core Core 2 in nodes per second (let's not even discuss parallel speedup yet).
I felt that was a good achievement.
Yet it tells you something about the problems you have with manycore CPUs. Such a chip is, by the way, called a multicore CPU, but the terminology is dangerous.
A multicore chip doing integer work on 80 cores, each core with some local cache of its own, and especially a branch prediction unit that isn't painfully slow, that's worth something, you know.
A chip that's basically totally vector oriented is a much tougher nut when programming integer codes. Chess programs have many branches, which right now only run fast on x64 CPUs. On top of that you need a shared hashtable, and on GPUs having a big shared hashtable is complicated.
For example, on the dual-GPU cards the RAM is simply not shared between the two GPUs at all. Yes, there is some sort of link you can program for, but that doesn't make it easier to get software running on it, as every set of cores has to execute the same code at the same time.
For example, the 240 cores of the current Nvidia generation are split up into 30 multiprocessors of 8 cores each, and the hardware executes threads in lockstep groups (warps) of 32.
So 32 threads execute the same instruction at the same time. That doesn't make it easier, as in a chess program the ideal is to have each core busy on a different part of the code, with each core working on a different position.
Having 32 cores busy on the same position is no fun, of course.
Larrabee is even worse here, in that the indirection needed to address the cores at independent positions is really slow. All those instructions take well over 7 cycles, according to the information leaked so far on Larrabee.
So you directly lose basically a factor of 8 or so, which annihilates a lot of the advantage of having so many cores on these vector or manycore processors.
That's why those expensive machines with many x86 cores do so well at computer chess: each core can run a different part of your code, they have really good caches and really good branch prediction, and the hashtable is shared between the processors.
You don't want to search the same position twice, of course; that's where the transposition table kicks in.
If you search 100 million chess positions a second, ideally you want to do 100 MILLION lookups per second into the hashtable to see whether you already visited each position.
Now a cluster with just gigabit ethernet or so can do something like 1000 lookups per second there, so you're missing a factor of 100,000 in lookup speed, which makes your search really inefficient.
The 72- and 96-core machines with very fast shared memory that Rybka and DeepSjeng run on nowadays are this expensive precisely because they have fast lookup speeds.
I guess you can somehow call every shared-memory machine a cluster, but really it's much more than that. The memory subsystem is where all the money goes.
It was Frans Morsch who put it right, in a chat with me some years ago. He said: "above 400 MHz, the copper traces on mainboards are basically transistor radios, so not usable to connect other parts".
This is the big problem when scaling the memory system from 1 mainboard to 16 mainboards.
Solving the fast communication between all the processors (not to mention cache snooping and so on) is really complicated. Intel already has years of delay with their upcoming Xeon MP platform.
AMD has really dominated there in the 4-socket region for a few years now, since 2004.
So the next step is to go to manufacturers like HP and Unisys, which have supercomputers with this memory subsystem solved. Those machines are *really* expensive.
Vincent