Werewolf wrote:
Bob, thank you.
Can I ask one more question: how relevant is the size of the cache in a processor? I notice Intel have gone from 2 MB per core to 2.5 MB per core in their Xeons, but some people say cache isn't relevant for chess.
Can you say whether it is or isn't, with a brief (and simple) explanation of why this is so, please?
It's not so much the size of a cache; the speed of a cache can make or break a processor. For the L1 cache, though, the size really is relevant. 8 KB is really too tiny. 16 KB is still tiny (i7 and Bulldozer effectively have 16 KB per minicore), 32 KB tends to be seen as 'ok' again, and 64 KB is luxury for the data cache.
Usually the quoted cache size combines instruction and data cache, and it is also for 2 cores.
So a 64 KB L1 for i7 means in reality effectively 16 KB a core: first you have to split it into 32 + 32 for data and instruction cache, and then somehow that has to be shared by 2 cores. Bulldozer has a hard split, so it splits it into 16 KB of data cache for each core. About the instruction cache they are a bit more vague.
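To make that arithmetic concrete, here is a minimal sketch of the two splits. The function name and its parameters are made up for illustration, and the even 50/50 split between data and instruction cache is the assumption stated above, not a vendor specification.

```python
def effective_l1_data_kb(quoted_l1_kb: float, sharers: int) -> float:
    """Data-cache KB left per core after the two splits described above."""
    # First split: half of the quoted figure is instruction cache, half is data cache.
    data_kb = quoted_l1_kb / 2
    # Second split: the data half is shared by `sharers` cores.
    return data_kb / sharers


# The i7 case from the post: 64 KB quoted, shared by 2 logical cores.
i7_kb = effective_l1_data_kb(quoted_l1_kb=64, sharers=2)
print(f"i7: {i7_kb:.0f} KB of L1 data cache per logical core")
```

With the figures quoted in this post it prints 16 KB for the i7.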
The chip with the best cache is simply the fastest chip. The real problem that usually determines how fast a chip is for computer chess is the speed at which it can decode integer instructions from the L1 instruction cache to the execution cores/units. i7 can decode 4 instructions a cycle per core, which 2 logical cores share; compare that with the P4, which could decode just 1 instruction a cycle. That some trace cache could on paper retrieve 3 a cycle somehow didn't really speed it up for computer chess. A Bulldozer module can decode 4 instructions a cycle. Not surprisingly, for Diep a 4-module Bulldozer is the same speed as an i7 quad-core at the same GHz, with the 6-core i7 Gulftown (i7-970, i7-980, i7-990) totally destroying it.
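As a rough back-of-the-envelope model of that comparison, the sketch below multiplies decode units by decode width by clock. The function name and the 3.0 GHz clock are hypothetical placeholders, and this is only a peak figure that ignores everything else that affects engine speed.

```python
def peak_decode_rate(units: int, decode_per_cycle: int, ghz: float) -> float:
    """Peak decoded instructions per second for the whole chip, in billions."""
    return units * decode_per_cycle * ghz


GHZ = 3.0  # hypothetical clock, the same for every chip so only the design differs

quad_i7 = peak_decode_rate(units=4, decode_per_cycle=4, ghz=GHZ)    # 4 cores
bulldozer = peak_decode_rate(units=4, decode_per_cycle=4, ghz=GHZ)  # 4 modules
gulftown = peak_decode_rate(units=6, decode_per_cycle=4, ghz=GHZ)   # 6 cores

print(f"quad i7 vs Bulldozer: {quad_i7 / bulldozer:.2f}x")
print(f"Gulftown vs quad i7:  {gulftown / quad_i7:.2f}x")
```

Under these assumptions the quad-core i7 and the 4-module Bulldozer come out equal, and the six-core is ahead by the ratio of core counts, 1.5x.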
It's relatively cheap and easy to put a big cache onto a chip. A fast cache, however, is really complicated.
Right now the L1 caches are somewhat comparable in speed. Say 4 cycles or so, but they get prefetched pretty ok, so you don't suffer that 4 cycle penalty on an access to the L1 cache.
The real difference right now is the L2 cache. i7 has the fastest L2 cache. It's just 256 KB a core, yet for computer chess this is enough. Above 512 KB you can't even measure a difference between 512 KB and 1 MB, let alone between 1 MB and 2 MB. AMD's bulldroop (Bulldozer) has 2 MB of cache per module (the same thing as a core in i7), and it's ugly slow. Like over 2 times slower than the i7's.
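A simple way to see why latency matters more than size once the working set fits: the sketch below charges each L1 miss the full L2 latency. It is a toy model. The miss rate and both latencies are invented placeholder numbers, not measurements of any chip, and it assumes every L1 miss hits in L2.

```python
def l2_stall_cycles_per_access(l1_miss_rate: float, l2_cycles: float) -> float:
    """Average extra cycles per memory access spent waiting on the L2 cache.

    Assumes every L1 miss hits in L2, i.e. the working set already fits.
    """
    return l1_miss_rate * l2_cycles


MISS_RATE = 0.05  # hypothetical L1 miss rate
FAST_L2 = 10      # hypothetical latency in cycles for a small, fast L2
SLOW_L2 = 20      # hypothetical latency for a big L2 that is 2 times slower

fast = l2_stall_cycles_per_access(MISS_RATE, FAST_L2)
slow = l2_stall_cycles_per_access(MISS_RATE, SLOW_L2)
print(f"slow L2 stalls {slow / fast:.1f}x as long per access")
```

Note that the L2 size does not appear in the formula at all: in this model, extra capacity beyond what the program touches buys nothing, while doubling the latency doubles the stall time.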
AMD has a fundamental problem from a performance viewpoint. Intel, on the other hand, is a tad expensive with its quad-core i7s, given that AMD is a lot cheaper there with Bulldozer. Should Intel soon release their new Sandy Bridge six-cores, they can of course sweep Bulldozer clean away and annihilate it. All based upon having 50% more cores, from 4 to 6, and a fast L2 cache.
Bulldozer needs a whopping 2 billion transistors; compare that with a Gulftown i7 six-core at 1.2 billion transistors. That's with everything counted, huh - recently AMD released a vague statement that it's 1.2 billion transistors as well, but that seems to be without counting its L3 caches, which alone make up some 850 million transistors.
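The tally behind those figures, as a quick sketch. Adding AMD's stated count to the L3 count is my reading of the numbers quoted in this post, and the variable names are just for illustration.

```python
# All counts in millions of transistors, taken from the figures quoted in this post.
AMD_STATED_M = 1200  # AMD's statement, apparently without the L3 caches
L3_CACHES_M = 850    # the L3 caches alone
GULFTOWN_M = 1200    # Gulftown i7 six-core, everything counted

bulldozer_total_m = AMD_STATED_M + L3_CACHES_M
print(f"Bulldozer total: about {bulldozer_total_m / 1000:.2f} billion")
print(f"Bulldozer vs Gulftown: {bulldozer_total_m / GULFTOWN_M:.2f}x the transistors")
```

The sum comes to 2050 million, which is where the roughly 2 billion figure comes from.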
All those CPUs are very well optimized internally, so some simple change won't win them speed. The entire design of i7 requires a fast tiny cache, and the entire design of Bulldozer requires a huge slow cache. There is no way to change that except by designing an entirely new CPU.
It's unclear to me why AMD designed this totally failed Bulldozer chip, which basically achieves performance similar to the i7 quad-cores that Intel already released in November 2008 with the i7-965.
Sure, it's a lot cheaper. So Intel will wipe those Bulldozers away when they release a cheap six-core Sandy Bridge. Note that the Sandy Bridge design is an 8-core design, but it seems they usually turn off 2 cores of it. The Xeons have the full 8 cores.
Vincent