wgarvin wrote:I looked up the latency of integer multiply on Core2 and it's faster than I thought, only 3-4 cycles of latency. So even with the array arranged in the original order, I can't imagine what could be slowing this down so much.
Do you use lots of arrays or data tables for other things (like eval)? Perhaps the zobrist key entries are not able to stay in the L1 cache. But even if it has to get them from L2 all the time, that is also pretty damn fast on Core2 chips, taking 14 or 15 cycles (instead of 3 cycles for L1 hit).
If you can, post the assembly code? I'm really curious now.
On L2 and L3 caches: you really don't want to fetch anything out of L2, let alone L3. That would slow down your program big time.
Luckily the L1 hit rate is very high.
So for Zobrist keys, the statistical odds that the lookup gets automatically prefetched and served from L1 are over 99%. Far over 99%.
One limitation on Intel is that the L1 has just one read port, so in Cachegrind I can see it busy full time, every cycle, trying to read something from L1.
The latencies Intel and AMD quote for their L2 and L3 caches are baloney; they're really a lot slower than what they quote. Never confuse throughput with latency. They're assuming, for example, that you got lucky and the prefetcher grabbed the line a few cycles before it was asked for.
So you can safely add a number of cycles to the quoted L2/L3 latencies (the L1 latency is a different story).
Every latency in those handbooks is a 'good weather' latency.
Not so long ago, for example, we had Intel's P4. On paper it could supposedly execute 4 integer instructions per cycle, and a shift took at most 2 cycles.
Of course that 2-cycle shift penalty was bad news delivered only later.
All baloney, of course. Even before it was released, hardware experts made clear that the trace cache could deliver at most 3 instructions per cycle in the best case, so there was NO WAY it could execute 4 per cycle.
And later it turned out that Prescott had a latency of 7 cycles for a right shift.
On paper Core2's integer multiply speed is similar to the P4's, yet in reality Core2 is nearly twice as fast there. Still far from optimal, though.
None of that was in the handbooks.
AMD, on the other hand, has no good compiler. How about that?
Even today GCC goes out of its way to avoid generating CMOV-type instructions, rewriting code into spaghetti that is objectively SLOWER on both Core2 and AMD K8 and newer.
Yet AMD can, on paper, execute 3 of them per cycle. Of course you really want to avoid pushing that limit, as it will probably get you into trouble, but one now and then is really no problem; it's a DirectPath instruction.
Yet the compilers aren't generating the faster code, which is easily 50% faster for a lot of loops with small, simple branches.
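For a concrete picture of the kind of loop meant here, this is my own illustration: the same max-of-array loop written with a branch and written branchlessly. The ternary form maps naturally onto a `cmovg`; whether your compiler actually emits it depends on the compiler version and flags, so check with `gcc -O2 -S`.

```c
#include <stddef.h>

/* Branchy version: on random data the `if` mispredicts often. */
int max_branchy(const int *a, size_t n)
{
    int best = a[0];
    for (size_t i = 1; i < n; i++) {
        if (a[i] > best)
            best = a[i];
    }
    return best;
}

/* Branchless version: the ternary is a natural candidate for cmovg,
   so the loop carries no unpredictable branch. */
int max_branchless(const int *a, size_t n)
{
    int best = a[0];
    for (size_t i = 1; i < n; i++)
        best = (a[i] > best) ? a[i] : best;
    return best;
}
```

Both functions return the same result; the interesting part is the generated assembly, not the C.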
A jump (for a loop or whatever) seems to cost 0 cycles on Intel and 1 extra cycle on AMD that shouldn't be there (though it is documented), and so on.
Trusting paper is not very clever.
Just test it yourself, and don't forget to look at what code the compiler generates.