Code: Select all
#include <stdio.h>
#include <time.h>

#define SIZE (128*1024*1024)

char mem[SIZE];
int n;

int main(void)
{
    int i;
    int mask;

    for(i=0; i<SIZE; i++) mem[i] = 0;              /* touch the whole array once */

    for(mask=63; mask<SIZE; mask = 2*mask + 1) {   /* array size = mask+1, doubling every pass */
        clock_t t = clock();
        unsigned int k = 2.4e9, j = 0;             /* 2.4e9 iterations: on a 2.4 GHz core, seconds = clocks per iteration */
        do {
            j = (j + 64) & mask;                   /* step one cache line, wrap at the array size */
            n += mem[j];                           /* read one byte from each line */
            // mem[j+1] = n;                       /* enable this store to make the line dirty */
        } while( --k );
        t = clock() - t;
        printf("%10d %10d %6.3f sec\n", n, mask+1, t*(1./CLOCKS_PER_SEC));
    }
    return 0;
}
Code: Select all
0 64 2.109 sec
0 128 2.063 sec
0 256 2.047 sec
0 512 2.047 sec
0 1024 2.062 sec
0 2048 2.063 sec
0 4096 2.047 sec
0 8192 2.046 sec
0 16384 2.047 sec
0 32768 2.203 sec
0 65536 2.188 sec
0 131072 2.109 sec
0 262144 2.703 sec
0 524288 4.985 sec
0 1048576 5.000 sec
0 2097152 5.484 sec
0 4194304 9.610 sec
0 8388608 13.062 sec
0 16777216 13.781 sec
0 33554432 13.750 sec
0 67108864 13.813 sec
0 134217728 13.765 sec
But then I get a surprise: when I cycle through an array of 64KB or 128KB, it stays at 2 clocks! L1 is supposed to be 32KB, so at these sizes I am definitely overflowing it, and every access must be an L2 access. I am measuring throughput here, rather than latency, but I am very surprised to see that L2 can sustain a bandwidth of 1 cache line per 2 clocks. I have never seen that in any other CPU before.
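To spell out how I read clocks from the table: the loop runs 2.4e9 iterations, so assuming a 2.4 GHz core clock (which is what that iteration count suggests) the seconds column is numerically the number of clocks per iteration. A minimal sketch of the arithmetic, with the 2.4 GHz figure as the assumption:
Code: Select all
/* Sketch: how the seconds in the table map to clocks per iteration.
   Assumes a 2.4 GHz core clock; the 2.4e9 iteration count then makes
   seconds and clocks-per-iteration numerically equal. */
double clocks_per_iteration(double seconds)
{
    const double cpu_hz     = 2.4e9;        /* assumed core frequency */
    const double iterations = 2.4e9;        /* loop count used above */
    return seconds * cpu_hz / iterations;   /* e.g. 2.109 sec -> ~2.1 clocks */
}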
At 256KB it gets a bit slower, which is expected: this is the full L2 size, and L2 is shared between data and instructions, so I am already overflowing it a little. The 1MB and 2MB arrays measure L3 throughput, which seems to be 1 cache line per 5 clocks. My i3 is supposed to have 3MB of L3 cache, so at 4MB I am now overflowing it (though probably still getting some hits, because 3 is not a power of 2), and beyond that I am measuring the DRAM on-page access time. (Not bad at all, at only 13 clocks!)
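As an aside on the throughput/latency distinction above: a latency measurement would need each load to depend on the previous one, e.g. by chasing pointers, rather than the independent loads used here. This is not what I ran, just a sketch of what such a loop could look like; the chain below is laid out sequentially for simplicity, so a hardware prefetcher would still hide much of the latency (a randomly permuted chain avoids that).
Code: Select all
/* Sketch only (not the test above): a dependent-load loop for latency.
   Each load's address comes from the previous load, so accesses cannot
   overlap the way they do in the throughput loop. */
#include <stdio.h>
#include <time.h>

#define CHAIN_LEN (16*1024)                /* 16K entries of 4 bytes = 64KB, the L2-resident case */

unsigned int chain[CHAIN_LEN];

void measure_latency(void)
{
    unsigned int i, p = 0;

    /* simple sequential chain: each hop advances one cache line (16 ints).
       A real latency test would permute the chain to defeat the prefetcher. */
    for(i = 0; i < CHAIN_LEN; i++)
        chain[i] = (i + 16) & (CHAIN_LEN - 1);

    clock_t t = clock();
    unsigned int k = 2.4e9;
    do {
        p = chain[p];                      /* next address depends on this load */
    } while( --k );
    t = clock() - t;

    printf("%10u %6.3f sec\n", p, t*(1./CLOCKS_PER_SEC));  /* print p so the chain is not optimized away */
}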
This is all for clean cache lines. When I make the lines dirty (by enabling the commented-out store), I get:
Code: Select all
0 64 2.078 sec
0 128 2.063 sec
0 256 2.062 sec
0 512 2.063 sec
0 1024 2.109 sec
0 2048 2.078 sec
0 4096 2.063 sec
0 8192 2.093 sec
0 16384 2.079 sec
0 32768 2.093 sec
0 65536 6.391 sec
0 131072 6.469 sec
0 262144 6.609 sec
0 524288 9.047 sec
0 1048576 9.078 sec
0 2097152 9.484 sec
0 4194304 15.782 sec
0 8388608 26.890 sec
0 16777216 28.766 sec
0 33554432 29.078 sec
0 67108864 29.000 sec
0 134217728 29.313 sec
The big surprise is that L2 is so fast. For clean data it really doesn't matter much whether you are working from L1 or from L2. Unless you need more than one memory read every 2 clocks, I suppose; in principle L1 should be able to do 2 reads per clock.
[Edit] Indeed, when I modify the loop to read mem[j] as well as mem[j+64] in the same iteration (stepping j by 128), it can still do the loop in 2 clocks when these are all L1 hits. (I had to hand-optimize the loop a bit so that the number of instructions would not become the bottleneck.) But with all L1 misses / L2 hits this increases to 4 clocks. I still think that is a pretty amazing speed for an L2.
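For reference, the inner loop of that modified test looked roughly like this (a sketch using the same mem, n and mask as above; the hand-optimized loop I actually ran was arranged a little differently):
Code: Select all
/* Sketch of the two-reads-per-iteration variant: touch mem[j] and
   mem[j+64] (two cache lines) each pass and step j by 128. */
unsigned int k = 2.4e9, j = 0;      /* same count, so seconds still read as clocks per iteration */
do {
    j = (j + 128) & mask;           /* advance two cache lines, wrap at the array size */
    n += mem[j];                    /* first line */
    n += mem[j + 64];               /* second line (valid for array sizes of 128 bytes and up) */
} while( --k );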