Nehalem uArch - 33% more micro-ops in flight

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Nid Hogge

Nehalem uArch - 33% more micro-ops in flight

Post by Nid Hogge »


Nehalem allows for 33% more micro-ops in flight compared to Penryn (128 micro-ops vs. 96 in Penryn); this increase was achieved simply by increasing the size of the re-order window and other such buffers throughout the pipeline.

With more micro-ops in flight, Nehalem can extract greater instruction level parallelism (ILP), and it can also accommodate the extra micro-ops that come from each core now handling two threads at once.

Despite the increase in ability to support more micro-ops in flight, there have been no significant changes to the decoder or front end of Nehalem. Nehalem is still fundamentally the same 4-issue design we saw introduced with the first Core 2 microprocessors. The next time we'll see a re-evaluation of this front end will most likely be 2 years from now with the 32nm "tock" processor, codenamed Sandy Bridge.

Nehalem also improves unaligned cache access performance. In SSE there are two types of load instructions: one for data aligned to a 16-byte boundary, and one for unaligned data. On current Core 2 based processors the aligned instructions can execute faster than the unaligned instructions, and every now and then a compiler would emit an unaligned instruction on data that was actually aligned, incurring a performance penalty. Nehalem fixes this case (through some circuit tricks): unaligned instructions operating on aligned data are now just as fast.

In many applications (e.g. video encoding) you're walking through bytes of data through a stream. If you happen to cross a cache line boundary (64-byte lines) and an instruction needs data from both sides of that boundary you encounter a latency penalty for the unaligned cache access. Nehalem significantly reduces this latency penalty, so algorithms for things like motion estimation will be sped up significantly (hence the improvement in video encode performance).
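To make the aligned/unaligned distinction concrete, here is a minimal C sketch (an illustration added for this discussion, not code from the article). _mm_load_ps compiles to the aligned movaps and requires a 16-byte-aligned pointer; _mm_loadu_ps compiles to movups and accepts any address, and it is this unaligned form, used on data that happens to be aligned, that pre-Nehalem cores ran slowly:

Code: Select all

#include <xmmintrin.h>   /* SSE intrinsics */

/* Sum n floats (n a multiple of 4) using SSE loads.
   The aligned path requires p to be 16-byte aligned;
   the unaligned path works for any address. */
float sum_sse(const float *p, int n, int use_aligned)
{
    __m128 acc = _mm_setzero_ps();
    for (int i = 0; i < n; i += 4) {
        __m128 v = use_aligned ? _mm_load_ps(p + i)    /* movaps */
                               : _mm_loadu_ps(p + i);  /* movups */
        acc = _mm_add_ps(acc, v);
    }
    float out[4];
    _mm_storeu_ps(out, acc);         /* spill the four partial sums */
    return out[0] + out[1] + out[2] + out[3];
}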

Nehalem also introduces a second level branch predictor per core. This new branch predictor augments the normal one that sits in the processor pipeline and aids it much like an L2 cache works with an L1 cache. The second level predictor has a much larger set of history data it can use to predict branches, but since its branch history table is much larger, this predictor is much slower. The first level predictor works as it always has, predicting branches as best it can, but simultaneously the new second level predictor will also be evaluating branches. There may be cases where the first level predictor makes a prediction based on the type of branch but doesn't really have the historical data to make a highly accurate prediction, while the second level predictor does. Since the second level predictor has a larger history window to predict from, it has higher accuracy and can, on the fly, help catch mispredicts and correct them before a significant penalty is incurred.
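As a rough software analogy of the scheme described above (a toy model only; Intel has not published the real Nehalem predictor organisation), think of a small table that answers immediately and a larger, history-indexed table that answers later but more accurately and overrides the first one when they disagree:

Code: Select all

#include <stdint.h>

/* Toy two-level ("overriding") predictor: 2-bit saturating counters in
   both tables; the second table is bigger and uses branch history, so it
   answers later but with better accuracy. */
static uint8_t  fast_tbl[256];     /* level 1: small, indexed by PC only      */
static uint8_t  slow_tbl[65536];   /* level 2: large, indexed by PC ^ history */
static uint32_t history;           /* global taken/not-taken history bits     */

int predict(uint32_t pc)
{
    int fast = fast_tbl[pc & 255u] >= 2;                 /* early guess       */
    int slow = slow_tbl[(pc ^ history) & 65535u] >= 2;   /* later, better one */
    return (slow != fast) ? slow : fast;  /* late prediction overrides        */
}

void update(uint32_t pc, int taken)
{
    uint8_t *f = &fast_tbl[pc & 255u];
    uint8_t *s = &slow_tbl[(pc ^ history) & 65535u];
    if (taken) { if (*f < 3) ++*f; if (*s < 3) ++*s; }
    else       { if (*f > 0) --*f; if (*s > 0) --*s; }
    history = (history << 1) | (uint32_t)(taken != 0);
}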

The renamed return stack buffer is also a very important enhancement to Nehalem. Mispredicts in the pipeline can result in incorrect data being populated into Penryn's return stack (a data structure that keeps track of where in memory the CPU should begin executing after working on a function). A return stack with renaming support prevents corruption in the stack, so as long as the calls/returns are properly paired you'll always get the right data out of Nehalem's stack even in the event of a mispredict.

http://www.anandtech.com/cpuchipsets/sh ... i=3264&p=2

This feature seems to be liked >>

• Faster Synchronization Primitives: As multi-threaded software becomes more prevalent, the need to synchronize threads is also becoming more common. Next generation Intel microarchitecture (Nehalem) speeds up the common legacy synchronization primitives (such as instructions with a LOCK prefix or the XCHG instruction) so that existing threaded software will see a performance boost.

Linus Torvalds wrote:
> If using the lock prefix is a legacy operation what are the modern ones?

I don't think there are any - I think they just meant that they made the old legacy instructions run faster, instead of trying to introduce anything new.

Which I really look forward to testing. The serialization overhead of Core 2 is better than many other processors, but everything else is so good that it still stands out like a sore thumb. We have lots of kernel loads where one of the biggest costs is just locking (even without any nasty contention and cacheline ping-pong), because of how it serializes the pipeline.

Now that people are trying to push more and more multi-threaded programming paradigms, the locking is finally getting some real exposure. It's always been a big issue in kernels, but now all the fast user-level locking is making it show up in "normal" loads too.

Linus
http://realworldtech.com/forums/index.c ... 0&roomid=2
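For readers who want to see what the "legacy" primitives in question look like in code, here is a minimal sketch (not from either article) using GCC's __sync builtins, which on x86 compile to exactly the XCHG and LOCK-prefixed instructions Nehalem is said to speed up:

Code: Select all

/* Test-and-set spinlock built on the legacy primitives discussed above. */
static volatile int lock = 0;
static long counter = 0;

static void spin_lock(void)
{
    while (__sync_lock_test_and_set(&lock, 1))   /* atomic exchange: xchg */
        while (lock)                             /* spin read-only while held */
            ;
}

static void spin_unlock(void)
{
    __sync_lock_release(&lock);                  /* store 0 with release semantics */
}

void bump(void)
{
    spin_lock();
    counter++;                                   /* critical section */
    spin_unlock();
    /* lock-free alternative: __sync_fetch_and_add(&counter, 1);  -> LOCK-prefixed add */
}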

Let’s now explain other microarchitecture enhancements that Nehalem will incorporate.

First Nehalem will have four dispatch units instead of three. So what does that mean? This means that internally the CPU can have four microinstructions processing at the same time instead of three like on other Core-based CPUs (Core 2 Duo, for example). This represents a 33% improvement in the CPU processing capability. Translation: this CPU will be faster than Core 2 Duo CPUs under the same clock rate because it can process four microinstructions at the same time instead of three.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Nehalem uArch - 33% more micro-ops in flight

Post by bob »

Nid Hogge wrote:
First Nehalem will have four dispatch units instead of three. So what does that mean? This means that internally the CPU can have four microinstructions processing at the same time instead of three like on other Core-based CPUs (Core 2 Duo, for example). This represents a 33% improvement in the CPU processing capability. Translation: this CPU will be faster than Core 2 Duo CPUs under the same clock rate because it can process four microinstructions at the same time instead of three.
Here's the only issue. It is going to be hard to issue those extra micro-ops if they don't actually exist due to various data dependencies, or control dependencies. So to put it in perspective, a 33% improvement in speed is the theoretical limit that you are guaranteed never to exceed, and which you might also never reach...

But it does look good overall, particularly on the cache side and the better TLB support for those using large memory blocks for things like hashing.
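To put the dependency argument in code (an added illustration, not bob's example): the first loop below is a single serial chain, so a wider dispatch window has nothing extra to issue, while the second exposes four independent chains that extra issue slots can actually use:

Code: Select all

/* The same work expressed as one serial dependency chain versus four
   independent chains.  A wider dispatch window cannot speed up the first
   version, because every multiply needs the previous result. */
double serial_chain(const double *x, int n)
{
    double p = 1.0;
    for (int i = 0; i < n; i++)
        p *= x[i];                    /* each iteration waits for the last */
    return p;
}

double split_chains(const double *x, int n)   /* n assumed a multiple of 4 */
{
    double p0 = 1.0, p1 = 1.0, p2 = 1.0, p3 = 1.0;
    for (int i = 0; i < n; i += 4) {  /* four independent multiplies per pass */
        p0 *= x[i];
        p1 *= x[i + 1];
        p2 *= x[i + 2];
        p3 *= x[i + 3];
    }
    return (p0 * p1) * (p2 * p3);
}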
User avatar
hgm
Posts: 27837
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Nehalem uArch - 33% more micro-ops in flight

Post by hgm »

bob wrote:Here's the only issue. It is going to be hard to issue those extra micro-ops if they don't actually exist due to various data dependencies, or control dependencies.
This is why they also increased the size of the re-order buffer. In the 33% that was added to the re-order buffer it should be able to find 33% more independent instructions, enough to keep the extra dispatch unit busy.

And even if it can't, hyperthreading will provide more independent instructions in a re-order buffer of the same size (which presumably will be split in two parts, one for each thread), as the threads are by definition independent of each other.

This really seems a good machine, where hyper-threading might produce a serious performance enhancement. I can hardly wait to have one. At 16 threads SMP starts to be interesting enough to try it. With two cores anything works, so why bother?
User avatar
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Re: Nehalem uArch - 33% more micro-ops in flight

Post by sje »

I have a Pentium 4 machine and HT has never given more than a 15% speed up, and that was in rare cases. The real number is more like 5% or so.

HT on a single core machine is a pain when trying to run two processes at different priorities; the scheduler thinks there are two full cores available, so it runs one process on each logical CPU and the two end up sharing the single physical core as equals, regardless of priority.

I can disable HT in the BIOS, but then Ubuntu/Debian gets confused and won't run.

Really, is HT anything much more than a cheap marketing claim?
User avatar
hgm
Posts: 27837
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Nehalem uArch - 33% more micro-ops in flight

Post by hgm »

But Nehalem is no Pentium IV. Since Pentium-M, Intel designs really look like they have been made by extremely clever people that know what they are doing. (Before that it looked like they were made by idiots...)

So I have every confidence that this extra dispatch unit and the larger re-order buffer are really what is needed to make hyperthreading work, if the current designers think that this is needed to make it work.

Let's face it: none of our codes can currently run at 4 instructions per clock, or even near it. Sustained performance of 2 instructions per clock is already very good. So there must be lots of stalls and bubbles in the pipeline, and hyper-threading is, at least in theory, an ideal way to squeeze out those bubbles, provided the bubbles are caused by data dependencies in the instruction stream and not by a shortage of dispatch units, register-file ports, execution units, etc. Netburst was a very minimal architecture, with a shortage of almost every resource compared to P6-based designs. And it seems that in Nehalem they attacked the bottlenecks that still existed.

I would be surprised if hyper-threading in Nehalem would not give you an extra performance of 50% for highly optimized integer code, and close to 100% for very poor code (for code that is not limited by cache / memory bandwidth, of course).

Of course none of this excludes that incompetence of the OS can make it backfire. But the foreground/background-task problems are not very relevant for SMP Chess applications, where you don't want to run tasks at different priorities, but just squeeze as many MIPS from the hardware as possible.
User avatar
Bo Persson
Posts: 243
Joined: Sat Mar 11, 2006 8:31 am
Location: Malmö, Sweden
Full name: Bo Persson

Re: Nehalem uArch - 33% more micro-ops in flight

Post by Bo Persson »

hgm wrote:But Nehalem is no Pentium IV. Since Pentium-M, Intel designs really look like they have been made by extremely clever people that know what they are doing. (Before that it looked like they were made by idiots...)
You are a bit unfair here - the P4 design spec was "Highest clock rate, no matter what!".

So it finally ended up with a 30-stage pipeline and no heavy functional units. It almost reached 4 GHz, the designers earned their bonuses, but the chip didn't work very well.

So now the new spec is "Work well, no matter what!". Surprise!
User avatar
hgm
Posts: 27837
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Nehalem uArch - 33% more micro-ops in flight

Post by hgm »

OK, I grant them that. And actually, when you want to do decision-free FPU number crunching, the P4 is not a bad machine at all. You can squeeze enormous MFLOPS out of it.

In fact hyperthreading does work quite well there too, for algorithms that cannot do without a loop with a long critical path. Take for instance the following:

Code: Select all

int j = 2000000000;            /* 2e9 iterations */
double a = 1., b = 1. + 1./j;
do {
    a *= b;                    /* each multiply depends on the previous one */
} while (--j);
Each iteration of this loop can never execute in less than 7 cycles, as this is the latency of the floating-point multiply, and there are not enough other instructions to keep the processor busy for those 7 cycles. And indeed, running this program in parallel with itself on a P-IV with HT enabled does not give a measurable slowdown for either of them. In other words, HT gave you a 100% performance boost: you now run two jobs in the same time as one.
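For anyone who wants to repeat the experiment, here is a small self-contained harness (an added sketch, not hgm's original test) that times one copy of the loop and then two copies started simultaneously; if HT is absorbing the pipeline bubbles, the two elapsed times should come out nearly equal:

Code: Select all

#include <stdio.h>
#include <sys/time.h>
#include <sys/wait.h>
#include <unistd.h>

static void burn(void)
{
    int j = 500000000;                 /* 5e8 iterations keeps runtimes short */
    double a = 1., b = 1. + 1./j;
    do { a *= b; } while (--j);
    if (a == 0.) putchar('.');         /* keep the result live */
}

static double run(int copies)          /* run 'copies' processes, return seconds */
{
    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (int i = 0; i < copies; i++)
        if (fork() == 0) { burn(); _exit(0); }
    while (wait(NULL) > 0)
        ;
    gettimeofday(&t1, NULL);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
}

int main(void)
{
    printf("one copy  : %.2f s\n", run(1));
    printf("two copies: %.2f s\n", run(2));
    return 0;
}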
User avatar
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Re: Nehalem uArch - 33% more micro-ops in flight

Post by sje »

hgm wrote:Of course none of this excludes that incompetence of the OS can make it backfire. But the foreground/background-task problems are not very relevant for SMP Chess applications, where you don't want to run tasks at different priorities, but just squeeze as many MIPS from the hardware as possible.
Well, one problem here is when your chess program is playing in an international event and it's three in the morning local time and your Unix/BSD/Linux decides to run a bunch of system maintenance scripts. The scripts are run at priority 19 (lowest Unix level) and that would be fine on a non-HT machine. But the HT-deluded scheduler effectively gives the maintenance work the same CPU share as the chess program.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Nehalem uArch - 33% more micro-ops in flight

Post by bob »

hgm wrote:
bob wrote:Here's the only issue. It is going to be hard to issue those extra micro-ops if they don't actually exist due to various data dependencies, or control dependencies.
This is why they also increased the size of the re-order buffer. In the 33% that was added to the re-order buffer it should be able to find 33% more independent instructions, enough to keep the extra dispatch unit busy.
You need to read Hennessy/Patterson, "Computer Architecture: A Quantitative Approach". They answer this exact question. In fact, they take some well-known applications and stretch the size of the reorder buffer to infinity to see what the maximum theoretical performance would be. The gain is nowhere near what you would expect.

This is must-read material for anyone interested in this kind of topic. Finding more and more parallelism is very difficult, with horribly diminishing returns, no matter how far you look ahead. There are always substantial numbers of data dependencies that simply defeat any such strategy...

And even if it can't, hyperthreading will provide more independent instructions in a re-order buffer of the same size (which presumably will be split in two parts, one for each thread), as the threads are by definition independent of each other.
Again, visit "the bible" (H-P above). The effect is far weaker than you might imagine. Chapter 3 of that book covers this effect for a wide variety of different aspects. What would happen if branch prediction were perfect? What if you had an infinite number of renaming registers? What if you had perfect memory aliasing detection so that you could do loads and stores in any order without wrecking the program?
This really seems a good machine, where hyper-threading might produce a serious performance enhancement. I can hardly wait to have one. At 16 threads SMP starts to be interesting enough to try it. With two cores anything works, so why bother?
You might find that even with two cores, not everything works very well... "works?" probably. "well" hardly... It's not an easy problem...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Nehalem uArch - 33% more micro-ops in flight

Post by bob »

hgm wrote:But Nehalem is no Pentium IV. Since Pentium-M, Intel designs really look like they have been made by extremely clever people that know what they are doing. (Before that it looked like they were made by idiots...)

So I have every confidence that this extra dispatch unit and the larger re-order buffer are really what is needed to make hyperthreading work, if the current designers think that this is needed to make it work.

Let's face it: none of our codes can currently run at 4 instructions per clock, or even near it. Sustained performance of 2 instructions per clock is already very good. So there must be lots of stalls and bubbles in the pipeline, and hyper-threading is, at least in theory, an ideal way to squeeze out those bubbles, provided the bubbles are caused by data dependencies in the instruction stream and not by a shortage of dispatch units, register-file ports, execution units, etc. Netburst was a very minimal architecture, with a shortage of almost every resource compared to P6-based designs. And it seems that in Nehalem they attacked the bottlenecks that still existed.

I would be surprised if hyper-threading in Nehalem would not give you an extra performance of 50% for highly optimized integer code, and close to 100% for very poor code (for code that is not limited by cache / memory bandwidth, of course).

Of course none of this excludes that incompetence of the OS can make it backfire. But the foreground/background-task problems are not very relevant for SMP Chess applications, where you don't want to run tasks at different priorities, but just squeeze as many MIPS from the hardware as possible.
Hyper-threading works only for programs that are memory-latency bound. The idea is that while one "thread" waits on a memory transfer, another thread can run using the cache. It is not a winner for chess. I have a PIV in my office, and I have had hyper-threading turned off for years. You can actually get better performance by working on your program's memory usage than you could ever get with hyper-threading. And when you factor in the SMP overhead, hyperthreading is a loser in parallel chess, as the overhead more than offsets the tiny gain SMT offers.