On the 100 core Tilera cpu releasing maybe 2010

diep · Post by **diep** » Tue Dec 08, 2009 4:55 pm

Crossposted from another forum, where GCP calculated the chip's speed.

Hi GCP, I had missed this chip.

When the first chip of this kind came out, the Tilera with 64 cores, I tried to find out what it would cost to get 1000 of them.
That was some years ago, when they managed to get publicity for it.

The response was, to put it politely, not how a salesman would react. They really were after deals selling millions of them.

If a chip is really good you don't react like that. Remember that the first generation was their first chip of this sort.

I'm a bit amazed to hear of a new generation with dramatic improvements. We must be honest and fair:
this chip really has improved a lot.

However, your speedup calculation is overly optimistic IMHO, as CPUs like this don't work in multiple sockets,
yet that's how you have to compare it.

http://www.tgdaily.com/hardware-feature ... -processor

Note the statement: "with up to 26MB total L3 coherent cache across the device."

Cache coherency is really a big problem here. For our hashtable we in fact want no cache coherency at all,
as with so many cores it will just slow down the search immensely.

So in practice it will not be possible to store the last few plies in the hashtable. For Diep that would be a major loss,
not to mention the faster engines (DeepSjeng, Rybka etc).
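To make "not storing the last few plies" concrete, here is a minimal sketch of a shared hashtable that refuses entries below a depth threshold, so that the cheapest and most frequent nodes never touch shared memory. The class name, the `min_depth` parameter and the replacement scheme are assumptions for illustration only; this is not how Diep or any other engine actually implements it.

```python
class SharedHashTable:
    """Toy transposition table that skips shallow nodes (hypothetical design)."""

    def __init__(self, size, min_depth):
        self.size = size
        self.min_depth = min_depth      # assumed cut-off: remaining depth in plies
        self.slots = [None] * size

    def store(self, key, depth, score):
        # Nodes near the leaves (and qsearch, depth <= 0) are not stored at all,
        # so they generate no traffic on the shared, coherent memory.
        if depth < self.min_depth:
            return False
        self.slots[key % self.size] = (key, depth, score)   # always-replace
        return True

    def probe(self, key, depth):
        # Shallow nodes do not even look: the lookup itself is the cost to avoid.
        if depth < self.min_depth:
            return None
        entry = self.slots[key % self.size]
        if entry is not None and entry[0] == key and entry[1] >= depth:
            return entry[2]
        return None


table = SharedHashTable(size=1024, min_depth=3)
table.store(key=42, depth=1, score=10)   # ignored: too close to the leaves
table.store(key=42, depth=5, score=10)   # stored
```

The price is the one described above: every position in the last few plies and in qsearch has to be searched again instead of being looked up.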

Just losing the qsearch is already a 20% loss in the case of Diep. Losing the last few plies of the hashtable will be an even
bigger problem for engines faster than Diep. So let me calculate it for Diep first; for Rybka and others,
which do something like 2.4 million chess positions a second per core, the problem is even bigger.

I'm guessing this new core is of similar quality to a MIPS processor like the R14000,
as these guys are neither Intel nor AMD, let alone that they would have a compiler that is any good
at such short notice.

The R14000 was slightly worse than the K7 for Diep in IPC. Some years ago the PGO and other improvements
in Visual C++ 2005 really sped up the K7 a lot, by quite a bit more than 20%,
whereas we can surely expect that their GCC port, GCC being notoriously bad with PGO, won't really function well there.

I have now tested the speed of Diep on a K7 and it is 109531 nodes per second on a K7 at 2.13Ghz,
let's say 110k nps at 2.1Ghz.

At 1.5Ghz that K7 is roughly 1.4 times slower: 110 / 1.4 = 78.57k nps
Now the 20% reduction for having an ugly compiler: 78.57 * 0.8 = 62.9k nps

However, I gave it the advantage everywhere in the rounding, so let's call that 60k nps.

At 1500Mhz I estimate it gets about 60k nps per core. So it is 6 million nps at 100 cores in the case of ideal scaling.
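For anyone who wants to plug in their own engine's numbers, the per-core estimate above can be written out as a small script. The constants are the figures from this post; the function and variable names are made up for the sketch, and the linear scaling with clock speed is the same simplification used in the text.

```python
K7_NPS = 110_000          # Diep on a K7, rounded from the measured 109531 nps
K7_CLOCK_GHZ = 2.1        # rounded from 2.13 GHz
TILE_CLOCK_GHZ = 1.5      # announced clock of the 100 core chip
COMPILER_PENALTY = 0.20   # guessed loss for a weak GCC port without good PGO
CORES = 100


def per_core_nps(measured_nps=K7_NPS):
    """Scale a measured nps linearly with clock speed, then apply the compiler penalty."""
    clock_ratio = K7_CLOCK_GHZ / TILE_CLOCK_GHZ    # about 1.4
    scaled = measured_nps / clock_ratio            # about 78.57k nps
    return scaled * (1 - COMPILER_PENALTY)         # about 62.9k nps


estimate = per_core_nps()
rounded_down = 60_000                 # the post rounds in the chip's disfavour here
ideal_chip_nps = rounded_down * CORES

print(f"per core: {estimate:,.0f} nps, ideal chip total: {ideal_chip_nps:,} nps")
```

Changing `K7_NPS` to the figure of a faster engine shows how the raw total grows, but the losses discussed next grow with it.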

Now we instantly lose 20% because the quiescence search cannot be stored in the hashtable.
Another 10% loss is there for Diep, as I have to make its evaluation and pawn tables extra tiny to fit in L2.

So, put simplistically, we start by losing 30% to the cache coherency.

0.7 * 6 = 4.2M nps

Now the parallel speedup is quite OK in the case of Diep. However, please realize that under the Young Brothers Wait principle
each search has to do a much slower search first, before it can put other CPUs to work. That really cripples speedup.
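A toy illustration of why that hurts: with Young Brothers Wait the first move (the eldest brother) is searched alone, and only once its score is known are the remaining moves handed to other workers. The sketch below splits only at the root of a tiny hand-made tree and uses Python threads, which show the structure but give no real speedup. All names are invented for the example and it is not taken from any engine.

```python
from concurrent.futures import ThreadPoolExecutor

INF = 10**9


def negamax(node, alpha, beta):
    """Plain fail-soft alpha-beta on a nested-list tree; ints are leaf scores
    seen from the side to move at that leaf."""
    if isinstance(node, int):
        return node
    best = -INF
    for child in node:
        score = -negamax(child, -beta, -alpha)
        best = max(best, score)
        alpha = max(alpha, score)
        if alpha >= beta:
            break                      # cutoff
    return best


def ybw_root(children, workers=4):
    """Young Brothers Wait at the root only: eldest brother first, rest in parallel."""
    # Serial phase: every other worker sits idle while this runs.
    best = -negamax(children[0], -INF, INF)
    alpha = best
    # Parallel phase: the younger brothers reuse the bound from the eldest one.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(negamax, c, -INF, -alpha) for c in children[1:]]
        for future in futures:
            best = max(best, -future.result())
    return best


tree = [[3, 5], [2, 9], [0, 7]]
print(ybw_root(tree))
```

With 100 cores, 99 of them wait during every serial phase, and a real engine repeats this at each split point, which is where the efficiency goes.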

I would guess Diep might indeed get 50%; however, in the case of DeepSjeng and Rybka I really doubt it. More likely you'd be
really happy to get a 30% speedup.

4.2M nps * 0.5 = 2.1M nps
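The loss chain from the ideal 6M nps down to this figure, as a small function. The default percentages are the guesses made in this post for Diep; the names are invented for the sketch, and adding the two cache-related losses together (rather than multiplying them) follows the arithmetic used above.

```python
def effective_nps(raw_nps,
                  qsearch_hash_loss=0.20,    # qsearch cannot use the hashtable
                  small_tables_loss=0.10,    # eval/pawn tables shrunk to fit L2
                  parallel_efficiency=0.50): # guessed efficiency at 100 cores
    """Apply the cache-related losses, then the parallel search efficiency."""
    after_cache = raw_nps * (1 - (qsearch_hash_loss + small_tables_loss))
    return after_cache * parallel_efficiency


diep = effective_nps(6_000_000)                                  # about 2.1M
pessimistic = effective_nps(6_000_000, parallel_efficiency=0.30) # about 1.26M

print(f"Diep: {diep / 1e6:.2f}M nps, at 30% efficiency: {pessimistic / 1e6:.2f}M nps")
```

The second call uses the 30% efficiency guessed here for DeepSjeng and Rybka but still with Diep's raw nps, so it only shows the direction of the effect, not an estimate for those engines.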

In short, an existing 2 socket Nehalem 2.53Ghz @ 16 cores total already hammers it for Diep.

In the case of Rybka and DeepSjeng I would guess your speedup is roughly 30%. Maybe 35% for DeepSjeng,
and as you get more nps than Diep, of course the impact of missing hash at bigger depths will be bigger than
the 20% loss of Diep; so it's not unimaginable that an i7-965 with some manual turbo boost set is quite a bit better.

I'd say the 2 points where this chip loses most are the compiler and especially its clockspeed.
1.5Ghz is simply too low, but you already noticed that. It's losing a factor 2 or more to clockspeed, as such a chip
should of course clock at least 3Ghz.

You simply can't afford such huge losses.

Please note that for telecommunication and defense a chip like this is really interesting, especially because of its built-in encryption
and relatively low power. Being low power is a requirement there, not a feature.

jwes · Post by **jwes** » Tue Dec 08, 2009 8:02 pm

I think this would be interesting as a test machine, like Bob's cluster but much cheaper.

diep · Post by **diep** » Tue Dec 08, 2009 9:22 pm

jwes wrote: I think this would be interesting as a test machine, like Bob's cluster but much cheaper.

To test how good your parallel speedup is, sure.
It's sampling in 2010 however, so maybe in 2011 it's available.

By mid 2011 it's probably a factor 10 slower or so than a single processor you can get in a shop for $200.

Possibly even by the end of April 2011 you can get, quite cheaply,
some sort of dual socket AMD box with 24 cores and 48 'SMT' type cores,
and something like that from Intel as well.

So the question is whether you would want to buy it for that 'factor 2 more cores', or as a collector's item for the museum.

The thing is: it's low power and also relatively fast. If you look around at other low power devices, they really aren't in the same speed league as this.

Vincent

bob · Post by **bob** » Tue Dec 08, 2009 9:29 pm

diep wrote: Cache coherency is really a big problem here. For our hashtable we in fact want no cache coherency at all,
as with so many cores it will just slow down the search immensely.

So in practice it will not be possible to store the last few plies in the hashtable. For Diep that would be a major loss,
not to mention the faster engines (DeepSjeng, Rybka etc).

Perhaps they will copy Intel's MTRR stuff so that you can set parts of RAM to be "non-cacheable", which would probably be a good idea for hash tables.


diep · Post by **diep** » Tue Dec 08, 2009 9:56 pm

bob wrote: Perhaps they will copy Intel's MTRR stuff so that you can set parts of RAM to be "non-cacheable", which would probably be a good idea for hash tables.

That's a good remark and would need to be investigated. Once they have managed to produce it by 2011, let's see what their final product has become and whether they managed to get it to 1.5Ghz.

As for my calculations, that doesn't change too much. Basically it wins back the 30% lost to cache coherency, though caching in qsearch will still be rather difficult for Diep. In Crafty's case you'd still lose the last 3 plies I guess, so you need a factor 2 more nodes because of that. I hadn't factored that in yet, as I was just speaking about Diep, whose relatively low nps of course lets it run more easily on CPUs like this than other software.

6 million nps seems attractive for a chip, but remember it's the end of 2011 by then. We really have to compare it with what we can get in a 2 socket machine by then, which is possibly 48 single integer execution unit cores from AMD and something similar from Intel.

At near or just over 3Ghz that will of course totally blow away this chippie.

At the embedded level this chip suddenly looks a lot better. I'd say it looks like genius. However, I remember the email communication I had with them about their first chip; you don't respond like that unless you have to hide your CPU's potential performance.

Intel's 48 core chip should scale quite a bit better there. It doesn't have cache coherency.

You know that if there is performance to win somewhere, we chess programmers squeeze it out of the chip.

That won't be easy with this 100 core Tilera chip, because 1 core is ugly slow. My nps figure won't be far off. Realize the MIPS R14000 is a 4-issue, 3-retire processor, so it might be just as good as this core, if not better. It probably will be very similar in performance. This isn't AMD or Intel.

See also the ugly IPC that all manufacturers other than AMD and Intel get.

Some cheapo 16 core quad socket AMD solution already hammers this totally away, and it's 2009 now. Sure, that eats a lot of power, so it is not a fair comparison; also it is 4 sockets and this is 1 chip.

But that's the reality.


Gian-Carlo Pascutto · Tue Dec 08, 2009 10:01 pm

diep wrote: 6 million nps seems attractive for a chip, but remember it's the end of 2011 by then. We really have to compare it with what we can get in a 2 socket machine by then, which is possibly 48 single integer execution unit cores from AMD and something similar from Intel.

Need to factor price into that equation somewhere too. Might improve the look of this chip or make it worse. Who knows?

diep wrote: At the embedded level this chip suddenly looks a lot better. I'd say it looks like genius. However, I remember the email communication I had with them about their first chip; you don't respond like that unless you have to hide your CPU's potential performance.

Ah, they responded? I had no such luck.

And I agree, that's just a very bad sign.
