OK, but a logical "core" is not a real core. I have tested on i7s, of course. We have a cluster of 6-core i7s that I use occasionally, including one node in the last ACCA tournament... We don't overclock, however. Not interested in the debugging. I have seen machines that pass every test when overclocked, except the test that counts: the actual application someone wants to run.

diep wrote:
The idea is that the i7 shares its L1 between both logical cores. Just google it. Each core can decode 4 instructions per cycle, but as the tests from the prime-number guys maintaining the table at http://gmplib.org/~tege/x86-timing.pdf show, it has a maximum throughput of 3 instructions per cycle, which explains why hyperthreading makes sense, at least on paper. Many of the bit-manipulation instructions Crafty uses for bitboards have a peak throughput of 2 per cycle on the chip, whereas Diep possibly executes cheaper instructions; that's a possible explanation for why HT scaling for Crafty might be worse than for Diep, if that is the case, despite some claims Nalimov made about Crafty years ago, before he moved from Wintel to Google (I bet that doubled his salary).

bob wrote:
I've not seen any Intel chips that have a shared L1. And most new ones don't have a shared L2 either. They all have a shared L3 (if they have an L3), else a shared L2. But a shared L1? I have not seen that anywhere on anything I have tested, and I have tested on most everything they have done, or are going to do within 6 months or so...

diep wrote:
It's not so much the size of a cache; the speed of a cache can make or break a processor. For the L1, though, the size really is relevant. 8KB is far too tiny, 16KB is very tiny (i7 and Bulldozer effectively have 16KB per logical core), 32KB tends to be seen as 'OK' again, and 64KB is luxury for the data cache.

Werewolf wrote:
Bob, thank you.
Can I ask one more question: how relevant is the size of the cache in a processor? I notice Intel have gone from 2 MB per core to 2.5 MB per core in their Xeons, but some people say cache isn't relevant for chess.
Can you say whether it is or isn't, with a brief (and simple) explanation of why, please?
Usually the quoted cache size lumps the instruction and data caches together, and it also covers 2 cores.
So a 64KB L1 on the i7 means effectively 16KB per logical core: first you split it into 32KB data + 32KB instruction cache, and then each half somehow has to be shared by 2 logical cores. Bulldozer has a hard split, giving each core its own 16KB data cache; about the instruction cache they are a bit more vague.
The chip with the best cache is simply the fastest chip. The real problem that determines how fast a chip is for computer chess is usually the speed at which it can decode integer instructions from the L1 instruction cache to the execution units. The i7 can decode 4 instructions per cycle per core, shared between the 2 logical cores; compare the P4, which could decode just 1 instruction per cycle. That its trace cache could, on paper, deliver 3 per cycle somehow didn't really speed it up for computer chess. A Bulldozer module can also decode 4 instructions per cycle. Not surprisingly, for Diep a 4-module Bulldozer is the same speed as a quad-core i7 at the same GHz, with the 6-core i7 Gulftown (i7-970, i7-980, i7-990) totally destroying it.
It's relatively cheap and easy to put a big cache onto a chip. A fast cache, however, is really complicated.
Right now the L1 caches are somewhat comparable in speed, say 4 cycles or so, but they get prefetched pretty well, so you usually don't suffer that 4-cycle penalty on an L1 access.
The real difference right now is the L2 cache. The i7 has the fastest L2. It's just 256KB per core, yet for computer chess that is enough: above 512KB you can't even measure a difference between 512KB and 1MB, let alone between 1MB and 2MB. AMD's "bulldroop" (Bulldozer) has 2MB of cache per module (the equivalent of an i7 core), and it's ugly slow: more than 2 times slower than the i7's.
AMD has a fundamental problem from a performance viewpoint. Intel, on the other hand, is a tad expensive with its quad-core i7s, while AMD is a lot cheaper there with Bulldozer. Once Intel releases its new Sandy Bridge six-cores, they can clean-sweep Bulldozer and annihilate it, simply on the strength of 50% more cores, from 4 to 6, and a fast L2 cache.
Bulldozer needs a whopping 2 billion transistors, whereas a Gulftown six-core i7 is 1.2 billion transistors. That's counting everything, huh. Recently AMD released a vague statement that it's 1.2 billion transistors as well, but that seems to exclude its L3 caches, which alone account for some 850 million transistors.
All these CPUs are very well optimized internally, so no simple change will win them speed. The entire design of the i7 requires a fast, tiny cache, and the entire design of Bulldozer requires a huge, slow cache. There is no way to change that short of designing an entirely new CPU.
It's unclear to me why AMD designed this totally failed Bulldozer chip, which basically achieves the same performance as the quad-core i7s Intel already released in November 2008 with the i7-965.
Sure, it's a lot cheaper. But Intel will wipe those Bulldozers away when they release a cheap six-core Sandy Bridge. Note that the Sandy Bridge design is an 8-core design, but they usually turn off 2 of the cores; the Xeons have the full 8.
Bob, you should really retest all this on high-clocked i7s, and since uncle Bill pays for the power anyway, maybe you can find a student somewhere to study the effects of overclocking for you; after all, if it fails you can still tell 'em a student caused it.
Overclocking a slower variant of a specific processor up to a faster one's speed is not as risky, since Intel tends to produce every chip of a given type on the same fab line and simply "under-mark" some of them in terms of speed. But if you want to buy the fastest thing available and overclock that, feel free. Not for me...
BTW, the i7 can issue 4 instructions per clock, but can only retire 3. So you can never sustain more than 3 IPC; any short burst above 3 has to be matched by a stretch below 3 so that the average never exceeds the retire limit of 3...
The primary problem for Crafty is that I have spent years optimizing memory accesses. In the last edition of Hennessy/Patterson, which referenced SPEC 2000, no other program was as efficient with L1 hits as Crafty. HT really helps when you have two threads with lots of dependencies that prevent instructions from issuing every cycle, or with lots of memory accesses that miss L1 and add latency, providing free clock cycles where the other logical core can step in and keep the execution units busy...