Code: Select all
#include <stdio.h>
#include <time.h>

#define SIZE (128*1024*1024)

char mem[SIZE];
int n;

int main(void)
{
    int i;
    int mask;

    for(i=0; i<SIZE; i++) mem[i] = 0;              /* touch the whole array once */

    for(mask=63; mask<SIZE; mask = 2*mask + 1) {   /* array size = mask+1, doubling every pass */
        clock_t t = clock();
        unsigned int k = 2.4e9, j = 0;             /* 2.4e9 iterations: on a 2.4 GHz core, seconds = clocks per iteration */
        do {
            j = (j + 64) & mask;                   /* step one cache line, wrap at the array size */
            n += mem[j];                           /* read one byte from each line */
            // mem[j+1] = n;                       /* enable this store to make the line dirty */
        } while( --k );
        t = clock() - t;
        printf("%10d %10d %6.3f sec\n", n, mask+1, t*(1./CLOCKS_PER_SEC));
    }
    return 0;
}
Code: Select all
0 64 2.109 sec
0 128 2.063 sec
0 256 2.047 sec
0 512 2.047 sec
0 1024 2.062 sec
0 2048 2.063 sec
0 4096 2.047 sec
0 8192 2.046 sec
0 16384 2.047 sec
0 32768 2.203 sec
0 65536 2.188 sec
0 131072 2.109 sec
0 262144 2.703 sec
0 524288 4.985 sec
0 1048576 5.000 sec
0 2097152 5.484 sec
0 4194304 9.610 sec
0 8388608 13.062 sec
0 16777216 13.781 sec
0 33554432 13.750 sec
0 67108864 13.813 sec
0 134217728 13.765 sec
But then I get a surprise: when I cycle through an array of 64KB or 128KB, it stays at 2 clocks! L1 is supposed to be 32KB, so at these sizes I am definitely overflowing it, and every access must be an L2 access. I am measuring throughput here, rather than latency, but I am very surprised to see that L2 can sustain a bandwidth of 1 cache line per 2 clocks. I have never seen that in any other CPU before.
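To spell out how I read clocks from the table: the loop runs 2.4e9 iterations, so assuming a 2.4 GHz core clock (which is what that iteration count suggests) the seconds column is numerically the number of clocks per iteration. A minimal sketch of the arithmetic, with the 2.4 GHz figure as the assumption:
Code: Select all
/* Sketch: how the seconds in the table map to clocks per iteration.
   Assumes a 2.4 GHz core clock; the 2.4e9 iteration count then makes
   seconds and clocks-per-iteration numerically equal. */
double clocks_per_iteration(double seconds)
{
    const double cpu_hz     = 2.4e9;        /* assumed core frequency */
    const double iterations = 2.4e9;        /* loop count used above */
    return seconds * cpu_hz / iterations;   /* e.g. 2.109 sec -> ~2.1 clocks */
}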
At 256KB it gets a bit slower, which is expected: this is the full L2 size, and L2 is shared between data and instructions, so I am already overflowing it a little. The 1MB and 2MB arrays measure L3 throughput, which seems to be 1 cache line per 5 clocks. My i3 is supposed to have 3MB of L3 cache, so at 4MB I am now overflowing it (though probably still getting some hits, because 3 is not a power of 2), and beyond that I am measuring the DRAM on-page access time. (Not bad at all, at only 13 clocks!)
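As an aside on the throughput/latency distinction above: a latency measurement would need each load to depend on the previous one, e.g. by chasing pointers, rather than the independent loads used here. This is not what I ran, just a sketch of what such a loop could look like; the chain below is laid out sequentially for simplicity, so a hardware prefetcher would still hide much of the latency (a randomly permuted chain avoids that).
Code: Select all
/* Sketch only (not the test above): a dependent-load loop for latency.
   Each load's address comes from the previous load, so accesses cannot
   overlap the way they do in the throughput loop. */
#include <stdio.h>
#include <time.h>

#define CHAIN_LEN (16*1024)                /* 16K entries of 4 bytes = 64KB, the L2-resident case */

unsigned int chain[CHAIN_LEN];

void measure_latency(void)
{
    unsigned int i, p = 0;

    /* simple sequential chain: each hop advances one cache line (16 ints).
       A real latency test would permute the chain to defeat the prefetcher. */
    for(i = 0; i < CHAIN_LEN; i++)
        chain[i] = (i + 16) & (CHAIN_LEN - 1);

    clock_t t = clock();
    unsigned int k = 2.4e9;
    do {
        p = chain[p];                      /* next address depends on this load */
    } while( --k );
    t = clock() - t;

    printf("%10u %6.3f sec\n", p, t*(1./CLOCKS_PER_SEC));  /* print p so the chain is not optimized away */
}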
This is all for clean cache lines. When I make the lines dirty (by enabling the commented-out store), I get:
Code: Select all
0 64 2.078 sec
0 128 2.063 sec
0 256 2.062 sec
0 512 2.063 sec
0 1024 2.109 sec
0 2048 2.078 sec
0 4096 2.063 sec
0 8192 2.093 sec
0 16384 2.079 sec
0 32768 2.093 sec
0 65536 6.391 sec
0 131072 6.469 sec
0 262144 6.609 sec
0 524288 9.047 sec
0 1048576 9.078 sec
0 2097152 9.484 sec
0 4194304 15.782 sec
0 8388608 26.890 sec
0 16777216 28.766 sec
0 33554432 29.078 sec
0 67108864 29.000 sec
0 134217728 29.313 sec
The big surprise is that L2 is so fast. For clean data it really doesn't matter much whether you are working from L1 or from L2. Unless you need more than one memory read every 2 clocks, I suppose; in principle L1 should be able to do 2 reads per clock.
[Edit] Indeed, when I modify the loop to read mem[j] as well as mem[j+64] in the same iteration (stepping j by 128), it can still do the loop in 2 clocks when these are all L1 hits. (I had to hand-optimize the loop a bit so that the number of instructions would not become the bottleneck.) But with all L1 misses / L2 hits this increases to 4 clocks. I still think that is a pretty amazing speed for an L2.
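For reference, the inner loop of that modified test looked roughly like this (a sketch using the same mem, n and mask as above; the hand-optimized loop I actually ran was arranged a little differently):
Code: Select all
/* Sketch of the two-reads-per-iteration variant: touch mem[j] and
   mem[j+64] (two cache lines) each pass and step j by 128. */
unsigned int k = 2.4e9, j = 0;      /* same count, so seconds still read as clocks per iteration */
do {
    j = (j + 128) & mask;           /* advance two cache lines, wrap at the array size */
    n += mem[j];                    /* first line */
    n += mem[j + 64];               /* second line (valid for array sizes of 128 bytes and up) */
} while( --k );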