Nehalem uArch - 33% more micro-ops in flight

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Nehalem uArch - 33% more micro-ops in flight

Post by bob »

hgm wrote:OK, I grant them that. And actually, when you want to do decision-free FPU number crunching, the P4 is not a bad machine at all. You can squeeze enormous MFLOPS out of it.

In fact the hyper-threading works quite well there too, for algorithms that cannot avoid a loop with a long critical path. Take for instance the following:

Code: Select all

int j = 2e9;
double a = 1., b = 1. + 1./j;
do {
    a *= b;
} while(--j);
This loop can never execute in fewer than 7 cycles per iteration, as that is the latency of the floating-point multiply, and there are not enough other instructions to keep the processor busy for those 7 cycles. And indeed, running this program in parallel with itself on a P-IV with HT enabled does not give a measurable slowdown for either of them. In other words, the HT gave you a 100% performance boost: you now run two jobs in the same time as one.
Just a minute... hyper-threading does not behave as you are describing. There is exactly _one_ FP unit in a PIV. If one thread is using it, the other thread can't use it at the same time. It's a serial resource. hyperthreading primarily tries to hide memory latency so that one thread can run out of cache/registers while the other is waiting on a cache-line fill from memory. That's the only case where the PIV hyper-threading has anything to offer...
User avatar
hgm
Posts: 27822
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Nehalem uArch - 33% more micro-ops in flight

Post by hgm »

You might not have expected this from me, but this actually was real data! :wink: I really tried it on a friend's computer, a P-IV system that supports hyper-threading.

The explanation is that the code above is not limited by _throughput_ of the FPU, but by _latency_. An fmul takes 7 clocks on the P-IV. Apart from that, the optimized loop only contains a decrement and a branch. In principle these three instructions represent only a single cycle of workload. (But a loop on the P-IV takes at least 2 clocks due to a pre-fetcher bubble.)

There is a bad dependency chain, though: each fmul needs the outcome of the previous one. So the loop will take 7 clocks, and on 6 of them nothing is done at all (except the multiply data rippling through the multiplier pipeline of the FPU). The multiplier can accept double-precision operands every two clocks, though (and single precision every clock). So even the two hyper-threads combined only utilize 4/7th of the available FPU throughput.
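To make the latency argument concrete (a sketch added here, not code from the thread): the bottleneck disappears if the same product is computed with several independent accumulators, because the multiplies no longer wait on each other and the FPU pipeline stays fed.

```c
/* One long dependency chain: each fmul must wait for the previous
 * result, so the loop runs at the multiplier's *latency*. */
double compound_serial(int j) {
    double b = 1. + 1. / j, a = 1.;
    while (j--)
        a *= b;
    return a;
}

/* Four short chains: the multiplies are independent, so they can
 * overlap in the multiplier pipeline and the limit becomes the
 * multiplier's *throughput* instead. */
double compound_interleaved(int j) {
    double b = 1. + 1. / j;
    double a0 = 1., a1 = 1., a2 = 1., a3 = 1.;
    int i;
    for (i = 0; i + 4 <= j; i += 4) {
        a0 *= b; a1 *= b; a2 *= b; a3 *= b;
    }
    for (; i < j; i++)  /* leftover iterations, if j is not a multiple of 4 */
        a0 *= b;
    return a0 * a1 * a2 * a3;
}
```

Both compute (1 + 1/j)^j, so the results agree up to rounding; only the shape of the dependency graph differs, which is exactly the freedom a second hyper-thread exploits when the code itself does not.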

Try it, it really works! HT can be very useful in programs that have long dependency chains.
User avatar
hgm
Posts: 27822
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Nehalem uArch - 33% more micro-ops in flight

Post by hgm »

bob wrote:Hyper-threading works only for programs that are memory-latency bound. The idea is that while one "thread" waits on a memory transfer, another thread can run using cache. It is not a winner for chess. I have a PIV in my office. I have had hyper-threading turned off for years. You can actually get better performance by working on your program's memory usage than you could ever get with hyper-threading. And when you factor in the SMP overhead, hyperthreading is a loser in parallel chess, as the overhead more than offsets the tiny gain SMT offers.
Well, I tried the effect of HT with Joker as well, and indeed there was zero gain. OTOH, I know that Joker is not executing anywhere near 4 uOps per clock. So there must be very many execute bubbles that in principle could be utilized by another thread. Unless, of course, these bubbles occur not because of data dependencies (waiting for operands to become available), but because of bottlenecks in the machine (register-file ports, specialized execute units, dispatchers).

I am sure the P-IV had many bottlenecks (most of them not advertised...), and I cannot predict which will be removed by the time we have Nehalem. Nevertheless, it cannot be excluded that I should be able to do much better on HT with Joker even on the P-IV. When I started working on Joker, I noticed that the assembly generated by my compiler looked awful. The code was speckled with an enormous number of unnecessary loads and stores of stuff that could all have been held in registers. So I hand-optimized the compiler-generated assembly code. In the end I was able to reduce the number of instructions in the time-critical part of the code by a factor of 2 (!). But when I then timed the code, there was exactly 0% gain in speed!

My conclusion was that the abundant use of loads and stores apparently had no impact on performance, which justifies the compiler not optimizing them away. Since then I have not tried to outsmart the compiler, as it seemed pointless.

But I can very well imagine that all these superfluous loads and stores provide a full workload for the load-store unit. As long as its utilization is below 100%, reducing it would not improve performance, as the unit works in parallel with the rest of the execution. Apparently the decoders and retirement unit are not the bottleneck, as optimizing away the loads and stores did not produce any speedup. And the loads and stores don't compete with other instructions for any other resources.

But a throughput of 90% on the load-store unit would strongly hinder a second hyper-thread that also wants to use it 90% of the time. So an optimization that does not benefit single-threaded code could be crucial for good HT performance. And in the current compiles of Joker it might very well be the loads and stores that do it. On Crafty, which is very L2-hungry, the bottleneck might be the L2 -> L1 replacement path. On the P-IV, L1 replacement is likely to be a problem even in Joker, as L1 is tiny (8KB). This bottleneck will have gone away in Nehalem for sure, even without optimizing any code... 8-)

In summary: I don't think that trying a program that was not optimized with HT in mind, and concluding that it doesn't benefit, proves much. The example I gave, with a program that is designed to be ideal for HT, shows that it _can_ work for the full 100%. You just have to figure out how to exploit it. That might not be impossible for a chess program even on the P-IV. But I hope it will be much easier on Nehalem. (And I don't have a P-IV, but I will have a Nehalem! 8-) )
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Nehalem uArch - 33% more micro-ops in flight

Post by bob »

hgm wrote:You might not have expected this from me, but this actually was real data! :wink: I really tried it on a friend's computer, a P-IV system that supports hyper-threading.

The explanation is that the code above is not limited by _throughput_ of the FPU, but by _latency_. An fmul takes 7 clocks on the P-IV. Apart from that, the optimized loop only contains a decrement and a branch. In principle these three instructions represent only a single cycle of workload. (But a loop on the P-IV takes at least 2 clocks due to a pre-fetcher bubble.)

There is a bad dependency chain, though: each fmul needs the outcome of the previous one. So the loop will take 7 clocks, and on 6 of them nothing is done at all (except the multiply data rippling through the multiplier pipeline of the FPU). The multiplier can accept double-precision operands every two clocks, though (and single precision every clock). So even the two hyper-threads combined only utilize 4/7th of the available FPU throughput.

Try it, it really works! HT can be very useful in programs that have long dependency chains.
I'd believe that, since you used the magic word "latency". But it takes such tiny loops executed large numbers of times, or else latency elsewhere such as for memory (most common for a chess engine), for this to work. Of course in a chess engine, I doubt anyone is going to be worrying about FP latency. :)
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: Nehalem uArch - 33% more micro-ops in flight

Post by Gerd Isenberg »

bob wrote:Of course in a chess engine, I doubt anyone is going to be worrying about FP latency. :)
No, but SSE2/SSE3/SSE4 SIMD performance for quad-bitboards and fill stuff.
Processing vectors of four floats inside one register, for dot products of eval features and weights, and for instance sigmoid game-state scaling, may be interesting applications. I wonder how to define a null window with floats?
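A minimal scalar sketch of the kind of evaluation Gerd describes (the function and array names here are hypothetical, not from any engine): a dot product of feature values and tuned weights. An SSE compiler maps this inner loop onto mulps/addps, four floats per instruction, and SSE4.1 even has a single dot-product instruction (dpps) for the 4-wide case.

```c
/* Dot product of evaluation features and their weights.
 * With four floats per XMM register, SSE processes this loop
 * in quarter the number of multiply/add operations. */
float eval_dot(const float *features, const float *weights, int n) {
    float sum = 0.0f;
    for (int i = 0; i < n; i++)
        sum += features[i] * weights[i];    /* vectorizes to mulps + addps */
    return sum;
}
```

For instance, features {1, 2, 3, 4} with uniform weights of 0.5 give 5.0; a sigmoid applied to such a sum would then do the game-state scaling Gerd mentions.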
Nid Hogge

Re: Nehalem uArch - 33% more micro-ops in flight

Post by Nid Hogge »

Here's the full white paper on Nehalem from Intel's presentation:

Next Generation Intel® Microarchitecture (Nehalem)
http://www.intel.com/pressroom/archive/ ... ehalem.pdf (PDF).

Performance Improvement Features:

With the next generation microarchitecture, Intel made significant core enhancements to further improve
the performance of the individual processor cores. Below we describe some of these enhancements.

Instructions per cycle improvements. The more instructions that can be run per clock cycle, the greater the performance. In addition, in many cases, by running more instructions in a given clock cycle, the work task can complete sooner, enabling the processor to get back into a lower power state more quickly. To run more instructions per cycle, Intel made several key innovations.

• Greater parallelism. One way to extract more parallelism from software code is to increase the
number of instructions that can be run “out of order.” This enables more simultaneous processing and
overlapping of latency. To be able to identify more independent operations that can be run in parallel, Intel
increased the size of the out-of-order window and scheduler, giving them a wider window in
which to look for these operations. Intel also increased the size of the other buffers in the core to
ensure they wouldn’t become a limiting factor.

• More efficient algorithms. With each new microarchitecture, Intel has included improved algorithms in places where previous processor generations saw lost performance due to stalls (dead cycles). Next generation Intel microarchitecture (Nehalem) brings many such improved algorithms to increase performance. These include:

• Faster Synchronization Primitives: As multi-threaded software becomes more prevalent, the
need to synchronize threads is also becoming more common. Next generation Intel
microarchitecture (Nehalem) speeds up the common legacy synchronization primitives (such
as instructions with a LOCK prefix or the XCHG instruction) so that existing threaded
software will see a performance boost.
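The "legacy synchronization primitives" mentioned here are what portable threaded code compiles down to on x86. As a sketch (mine, not Intel's — the counter is a hypothetical example), a C11 atomic increment becomes exactly such a LOCK-prefixed instruction:

```c
#include <stdatomic.h>

/* A shared counter such as a chess engine's node count.
 * On x86 this atomic_fetch_add compiles to LOCK XADD --
 * precisely the kind of LOCK-prefixed primitive the white
 * paper says Nehalem speeds up. */
static atomic_long node_count;

long count_node(void) {
    /* atomically increment; returns the value *before* the add */
    return atomic_fetch_add(&node_count, 1);
}
```

Because the speedup is in the hardware's handling of the LOCK prefix, existing binaries using such primitives benefit without recompilation, which is the point the white paper is making.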

• Faster Handling of Branch Mispredictions: A common way to increase performance is
through the prediction of branches. Next generation Intel microarchitecture (Nehalem)
optimizes the cases where the predictions are wrong, so that the effective penalty of
branch mispredictions overall is lower than on prior processors.

• Improved hardware prefetch and better load-store scheduling: Next generation Intel
microarchitecture (Nehalem) continues the many advances Intel made with the 45nm next
generation Intel Core microarchitecture (Penryn) family of processors in reducing memory
access latencies through prefetch and load-store scheduling improvements.

Enhanced branch prediction. Branch prediction attempts to guess whether a conditional branch will be taken or not. Branch predictors are crucial in today's processors for achieving high performance. They allow processors to fetch and execute instructions without waiting for a branch to be resolved. Processors also use branch target prediction to attempt to guess the target of the branch or unconditional jump before it is computed by parsing the instruction itself. In addition to greater performance, an additional benefit of increased branch prediction accuracy is that it can enable the processor to consume less energy by spending less time executing mis-predicted branch paths.

Next generation Intel microarchitecture (Nehalem) uses several innovations to reduce branch mispredicts
that can hinder performance and to improve the handling of branch mispredicts.

• New second-level branch target buffer (BTB). To improve branch predictions in applications that have large code footprints, such as database applications, Intel added a second-level branch target buffer (BTB). BTBs reduce the performance penalty of branches in pipelined processors by predicting the
path of the branch and caching information used by the branch.

• New renamed return stack buffer (RSB). RSBs store forward and return pointers associated with call and return instructions. Next generation microarchitecture’s renamed RSB helps avoid many common return instruction mispredictions.
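Mispredictions can also be reduced on the software side; a classic trick (my illustration, not part of the white paper) is to replace a data-dependent, hard-to-predict branch with arithmetic the compiler can turn into a conditional move, so there is no branch to mispredict at all:

```c
/* Branchy version: every wrong guess by the predictor costs
 * a pipeline flush (the penalty Nehalem works to reduce). */
int max_branchy(int a, int b) {
    if (a > b)
        return a;
    return b;
}

/* Branchless version: compilers typically emit CMOV (or this
 * mask arithmetic directly), removing the branch entirely. */
int max_branchless(int a, int b) {
    int mask = -(a > b);            /* all ones if a > b, else zero */
    return (a & mask) | (b & ~mask);
}
```

For predictable branches the branchy form is usually fine; the branchless form pays off when the comparison outcome is close to random, as in much chess-engine data.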

Intel Smart Cache Enhancements:

The new three-level cache hierarchy for next generation Intel microarchitecture (Nehalem) consists of:

• Same L1 cache as Intel Core microarchitecture (32 KB Instruction Cache, 32 KB Data Cache)
• New L2 cache per core for very low latency (256 KB per core for handling data and instruction)
• New fully inclusive, fully shared 8MB L3 cache (all applications can use entire cache)

A new two-level Translation Lookaside Buffer (TLB) hierarchy is also included in next generation Intel
microarchitecture (Nehalem). A TLB is a processor cache that is used by memory management hardware to improve the speed of virtual address translation. The TLB references physical memory addresses in its table.

All current desktop and server processors use a TLB, but next generation Intel microarchitecture (Nehalem)
adds a new second level 512 entry TLB to further improve performance.

Improved virtualization performance. Next generation Intel microarchitecture (Nehalem) adds new features that enable software to further improve its performance in virtualized environments. For example, the next generation microarchitecture includes an Extended Page Table (EPT) for reconciling memory type specification in a guest operating system with memory type specification in the host operating system in virtualization systems that support memory type specification.
Nid Hogge

Re: Nehalem uArch - 33% more micro-ops in flight

Post by Nid Hogge »

BTW here's the new instruction set coming with Sandy Bridge (32nm, 2010): AVX.


256-bit, three operands.
Nid Hogge

Re: Nehalem uArch - 33% more micro-ops in flight

Post by Nid Hogge »

Bo Persson wrote:
hgm wrote:But Nehalem is no Pentium IV. Since Pentium-M, Intel designs really look like they have been made by extremely clever people that know what they are doing. (Before that it looked like they were made by idiots...)
You are a bit unfair here - the P4 design spec was "Highest clock rate, no matter what!".

So it finally ended up with a 30 stage pipeline and no heavy functional units. Almost reached 4 GHz, the designers earned their bonuses, but the chip didn't work very well.

So now the new spec is "Work well, no matter what!". Surprise!
This is the advantage of having 2 design teams, where each one pursues different ideas and researches 2 paths. Without the IDT (Israeli Design Team), Intel would probably have gone bust once again. It saved their a**. They had working samples indoors of Tejas-designed CPUs running 5.6 GHz STOCK at ~170W TDP. At the same time, Core (Yonah) was debuting in Centrino platforms for notebooks and proved to be superior, which finally made the CEO pick them as the primary design team. It's an inner competition inside Intel that keeps R&D going and also provides a safety net in case of failure (P4/Netburst). Of course, working in collaboration and sharing design discoveries is very helpful too.

But I'm _so_ glad they dumped the MHz race... so useless. They figured out that faster overall matters much more than a faster MHz number; performance per clock is what matters, and they seem to be heading in the right direction. Now we just have to hope AMD follows and won't abandon the high-end and performance segments; that would really hurt innovation and let Intel sit comfortably once again. Especially since Nehalem matches the one advantage AMD did have up until now: the IMC.
Nid Hogge

Re: Nehalem uArch - 33% more micro-ops in flight

Post by Nid Hogge »

BTW, since this is far too technical for people like me, I suggest you join the discussion over at RWT, where knowledgeable people and CPU architects are regulars. Highly technical and interesting discussions. It would be nice for you guys, I guess, and a chance to witness high-level debates.

Nehalem discussion,
http://realworldtech.com/forums/index.c ... 0&roomid=2

All
http://realworldtech.com/forums/index.c ... t&roomid=2