Exceptions and branches are processed in the retirement stage of the pipeline, no? This is why mispredictions are so costly: by the time you discover them, you have to discard a completely filled pipeline of uOps.
Every instruction that can cause an exception is logically equivalent to a branch. As it branches only under rare conditions, it is 'predicted' as not taken. The branch only affects control flow when it reaches the retirement unit, and at that time it is no longer speculative, as all earlier instructions have already been retired. If the retirement of an earlier instruction caused the pipeline contents to be discarded, a later instruction that causes an exception will never reach the retirement unit, and thus never affects the flow of control at the instruction (pre-)fetcher.
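A minimal sketch of how costly such a flush is (illustration only, not code from the thread): time the same counting loop with an unpredictable branch and again with a predictable one. On an out-of-order x86 the random case typically runs several times slower. Note that depending on flags the compiler may convert the branch into a conditional move, which hides the effect, so something like gcc -O1 is a safer bet here.

Code:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 22)

/* The branch is taken ~50% of the time on random data (worst case for
   the predictor) and is almost perfectly predictable once sorted. */
static long count_big(const int *v, long n)
{
    long hits = 0;
    for (long i = 0; i < n; i++)
        if (v[i] > RAND_MAX / 2)
            hits++;
    return hits;
}

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

int main(void)
{
    static int v[N];
    for (long i = 0; i < N; i++)
        v[i] = rand();

    clock_t t0 = clock();
    long r1 = count_big(v, N);              /* unpredictable branch   */
    clock_t t1 = clock();

    qsort(v, N, sizeof v[0], cmp_int);

    clock_t t2 = clock();
    long r2 = count_big(v, N);              /* same work, predictable */
    clock_t t3 = clock();

    printf("random %ld (%ld ticks), sorted %ld (%ld ticks)\n",
           r1, (long)(t1 - t0), r2, (long)(t3 - t2));
    return 0;
}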
Zobrist key random numbers
-
- Posts: 28326
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
-
- Posts: 1822
- Joined: Thu Mar 09, 2006 11:54 pm
- Location: The Netherlands
Re: Zobrist key random numbers
hgm wrote: Exceptions and branches are processed in the retirement stage of the pipeline, no? This is why mispredictions are so costly: by the time you discover them, you have to discard a completely filled pipeline of uOps. [...]
Well, a lot of the transistor count of today's OoO processors goes into branch logic. The manufacturers do all kinds of things there to speed up IPC.
There is a huge difference between short branches and long branches in terms of misprediction. I notice that simply by looking at how much faster code gets on modern processors, even when you last went through that routine only a number of instructions ago, so it is definitely in L1i.
It really has to do with lookahead.
Yet of course, as also shown by Gerd, compilers must make safe assumptions, and where compilers CAN optimize they usually do not optimize very well.
In 64-bit mode processors cannot do much there right now. What is it they fetch at once, 16 bytes (with a huge difference between Intel and AMD in lookahead)? So that's about 3 instructions in 64-bit mode?
In 32-bit mode you can do more there; it's more like 8 instructions or so (I'm not sure about the exact sizes). That is very helpful in Diep.
I'm glad I'm still on 32 bits when we look at how clumsy the PC processors still are in 64-bit mode. I must admit that some years ago I wouldn't have believed that in 2009 the PC processors would still be this clumsy in 64-bit mode.
Basically they improved SSE2 a lot, from practically 1 or 2 floating-point operations per cycle to several per cycle now. That is a big improvement, but one we simply cannot profit from. Also, the latencies of SSE2 make it difficult to use for chess.
Some puzzlers like Gerd manage to get something out of it, but if you ask me, you'd lose so much time puzzling with those instructions that keeping a good overview of the engine is tough.
Additionally, it is not portable code.
Yet if we look at the huge difference in penalty between short branches and longer (JZ/JZE etc.) jumps, it is obvious that the additional transistor logic works really well in the short term.
Way more tricks get applied than we think, and in the future they'll improve on that once again. I have faith in the engineers there.
What really lags behind most is compiler technology. While they keep improving the compilers for SPECint, progress of the compilers on my Diep code is very small. They just create ugly code that objectively runs slower. Really, really ugly it is.
However, when I spoke some nine years ago with a GCC team member, he predicted all of this very exactly. That wasn't a self-fulfilling prophecy, simply a realistic view of what volunteers can achieve, and of how, in today's more complex software, one bad individual can mess up more than a hundred wise men can repair.
(The GCC team is about 15 members or so, last time I checked.)
Vincent
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Zobrist key random numbers
hgm wrote: Exceptions and branches are processed in the retirement stage of the pipeline, no? [...] Every instruction that can cause an exception is logically equivalent to a branch. [...]
Correct, so far as I know. The issue was the exceptions. By the time the instruction that controls whether or not another instruction is executed (the control dependency I mentioned) is completed (retired), the instruction that branch "protects" (the divide by zero in my example) has also been executed, and the "exception" condition is attached to it so that if, and only if, that instruction gets retired, the exception condition is raised. So semantically an if (c) sees the sub-components of (c) evaluated left-to-right, but actually they are executed in whatever order is most efficient, and it is up to the CPU to do this and make sure that exceptions don't arise for instructions that don't actually get executed (to the point where they would cause the program to crash because of an exception condition) even though they were speculatively executed out of order...
The old IBM /360 model 91 had an imprecise exception mechanism as a result, where it would report "exception near instruction xxx" because it was ahead in the instruction stream and by the time it was ready to "retire" an instruction that raised an exception, it had lost the program counter for the instruction causing the problem...
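Bob's earlier divide-by-zero example is not quoted in this thread, but the situation he describes presumably looks like the minimal C sketch below: the language guarantees left-to-right short-circuit evaluation, so the divide must never fault when b is zero, even if the hardware executed it speculatively before the test retired.

Code:

/* The divide is control-dependent on the b != 0 test.  An OoO core may
   compute a / b speculatively, but the divide-fault condition travels
   with the uOp and is raised only if that uOp actually retires. */
int exceeds_limit(int a, int b, int limit)
{
    if (b != 0 && a / b > limit)
        return 1;
    return 0;
}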
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Zobrist key random numbers
diep wrote: I'm glad I'm still on 32 bits when we look at how clumsy the PC processors still are in 64-bit mode. [...]
I have no idea what you mean by "clumsiness of 64 bit instructions." They don't seem clumsy to me, considering the speeds I see on 64-bit processors. Yes, the compilers have to produce machine language that is semantically equivalent to the original source code you wrote. And yes, the hardware has to execute things so that the final result is identical to what would be produced if the compiler output were executed exactly in the order produced by the compiler, without any effects of speculative execution or out-of-order execution changing a single thing other than making it run faster...
-
- Posts: 2251
- Joined: Wed Mar 08, 2006 8:47 pm
- Location: Hattingen, Germany
Re: Zobrist key random numbers
bob wrote: I have no idea what you mean by "clumsiness of 64 bit instructions." They don't seem clumsy to me, considering the speeds I see on 64-bit processors. [...]
The point is that mailbox programs, which would even compile and run efficiently on a 16-bit OS, did not take advantage of 64-bit mode, despite eight additional registers and implicit fastcall.
Using r8-r15 takes an additional opcode byte (a REX prefix). The x64 calling convention "allocates" stack space to save caller-safe registers via mov [rbp + offset], reg instead of push/pop (a much longer opcode). Global data is accessed via register indirection. And probably the 64-bit compilers are not (yet) as good at optimizing (despite PGO) as their 32-bit counterparts. In general, if you don't use 64-bit values that much, most of the programs mentioned likely become larger compiled for 64-bit mode than for 32-bit mode.
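To put byte counts on Gerd's point, here are a few encodings taken from the x86-64 manuals (the arrays are purely illustrative, not code from any engine):

Code:

/* Each array holds the machine-code bytes of one instruction. */
unsigned char mov_eax_ebx[]  = { 0x89, 0xD8 };             /* mov eax, ebx: 2 bytes       */
unsigned char mov_rax_rbx[]  = { 0x48, 0x89, 0xD8 };       /* mov rax, rbx: +REX.W prefix */
unsigned char mov_r8_rbx[]   = { 0x49, 0x89, 0xD8 };       /* mov r8, rbx: a REX byte is
                                                              needed just to reach r8-r15 */
unsigned char push_rbx[]     = { 0x53 };                   /* push rbx: 1 byte            */
unsigned char mov_rbp8_rbx[] = { 0x48, 0x89, 0x5D, 0xF8 }; /* mov [rbp-8], rbx: 4 bytes   */

One byte for push/pop against four for the mov form is what the "much longer opcode" amounts to.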
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Zobrist key random numbers
Gerd Isenberg wrote: The point is that mailbox programs, which would even compile and run efficiently on a 16-bit OS, did not take advantage of 64-bit mode, despite eight additional registers and implicit fastcall. [...]
Unfortunately X86-64 was an "add-on" to X86 rather than what was needed, which was a complete "re-do". But so far, and this covers a _lot_ of applications besides Crafty, I have yet to find one that slows down using 64-bit stuff. I'm not particularly interested in exactly why, but I assume that for most, it is the extra 8 registers which overcome a major bottleneck in the normal X86 instruction set.
Vincent's comments seem directed at something else, however: particularly at OOE and at optimizing the instruction stream, which the compilers I use seem to be quite good at, judging from the many .S files I have looked at over time...
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Zobrist key random numbers
diep wrote:
hgm wrote: OK, now we are getting somewhere. And I did not use the word 'netburst' at all: I try to avoid that in general, as each time I use it, I feel like I have to rinse my mouth...
Sorry about pushing the point so aggressively, but it is totally pointless to discuss this issue if we are not even in sync on the very basics. OK, so the PII and PIII have a 32-byte L1 cache line. But everything later has a 64-byte L1 line, as Gerd pointed out. This is what I meant by 'nearly a decade', which was not so far off the mark, as the P-IV was introduced in 2000Q4, according to Wikipedia.
I have no PII or PIII around anymore (funny enough, I do have a still-operational 100MHz 'P-I'!), and do not optimize my engines for those architectures. But that misaligned accesses are not really expensive at all, I actually first discovered on a P-I, because the DJGPP compiler was not smart enough to align the stack on 8-byte boundaries, which led to misalignment of any locally declared 'double' with 50% probability. Globals were always aligned, so at first I thought the slower execution when I used locals was due to the indexed addressing mode needed for stack-frame addressing. On the P-I the difference was 2 clocks (as a pipeline stall, since the P-I did not have any out-of-order execution).
You say the misalignments are bad, but that seems more a matter of principle than anything else. In practice they work. To my amazement, on the Pentium M they even work perfectly, as I reported above. I had not tested it on that machine before. But both an aligned and a misaligned L1 hit have a 3-cycle latency. That is what I call a zero-cycle penalty for the misalignment. Only when I straddle a cache-line boundary is there a 9-cycle penalty, so the latency goes up to 12 cycles. Note that this is always for 4-byte loads; I do not use any MMX or SSE, as my compiler doesn't generate those. So long-long arithmetic is always done in two halves, and only one of those halves can straddle the cache-line boundary.
Note that it is not true that there are extra uOps involved. This would be true for unaligned SSE loads, but can't be the case for normal x86 instructions. At decode time the CPU has no idea yet that a misaligned load will be requested, as this only becomes apparent when the instruction is already being executed, in the address calculation, when the output of the AGU becomes available. It will be a single LOAD uOp, and only the cache unit will be presented with a problem. Apparently the Pentium M can handle that problem very well, by allowing the fetch of two neighboring banks in the same cycle. And apparently the multiplexer that has to be there anyway to select the addressed byte out of a word has been given 32-bit width in the P-M.
Unfortunately I don't recall for which CPU the numbers were that I quoted initially (2 clocks penalty). I could not find a program to measure it on my AMD Athlon XP (K7) or my Core 2 Duo (E6600). So I fear it might indeed have been the dreaded P-IV in my office at work. And it was probably not for straddling a cache line.
My philosophy is: "if the penalty is negligible on almost any machine I know, I would be a fool not to use it." And it seems that the penalty is _much_ lower than the penalty for an L1 cache miss. If others call that "bad practice", so be it. I'd rather be guilty of "bad practice" and have a fast program than do things "by the book" and have worse performance. My programs also contain goto statements...
Now I don't worry much about the other objections you bring up. For one, the straddling of cache lines can be avoided. A 0x88 board aligned on a 64-byte boundary only uses bytes 0-7, 16-23, 32-39 and 48-55 for board rows. Even if I did an 8-byte load from address 55 it would only extend to byte 62, i.e. not straddle the cache-line boundary. (This would even be true for 32-byte cache lines.) This would reduce the compression from a factor 8 to a factor 4, but that can still be a very worthwhile factor. And even if there are L1 misses, there never would be double L1 misses, as there are no straddlers. On a Pentium M this size reduction is apparently absolutely free. (I still have to measure it on the E6600, but will report the result of that here later.) Only if you want the factor 8 do you pay something for it. But as this is apparently very little (0.5 clock per access), it might very well be worth it.
Hi HGM, you mention that L1d misses are bad. That's true. However, a much bigger problem in Diep is not L1d but L1i.
Misses in L1d are about 0.6-0.9%, of which 0.5% to 0.8% goes to the memory controller anyway (and we know why that is), so basically the L2 is only 'useful' for 0.1%. This is on Core 2.
By contrast, the L1i misses on Core 2 are 1.34%.
So that forms a much bigger problem. I was a bit too lazy to figure out how much the misses hurt, in cycles, on the latest architectures (Phenom 2 and Nehalem). Any thoughts on that?
Because it is really ugly to have 1.34% misses in L1i. The biggest problem in Diep is whether I should replace branches by more code in some cases, or add branches and remove code. What's more expensive, considering all the tricks they use today?
Measured with cachegrind under Linux on my laptop.
Thanks,
Vincent
I'm not sure why L1i is worse. The reorder buffer typically is full. An L1i stall will tend to see the reorder buffer drain down until the new cache line is partially read and the instruction stream reaches the decoder and starts catching up again. Since the thing can decode faster than it can retire, catching back up doesn't take that long. And since we don't have to wait for an _entire_ L1i cache line to be read from memory before we resume decoding (critical-word-first approach), we can resume fetching/decoding as soon as the data starts to flow into the L1i block...
OTOH, L1d stalls can potentially cause one or more pipes to stall when there are no micro-ops that can be issued, because they are all waiting on a memory read to occur first...
-
- Posts: 2251
- Joined: Wed Mar 08, 2006 8:47 pm
- Location: Hattingen, Germany
Re: Zobrist key random numbers
bob wrote: But so far, and this covers a _lot_ of applications besides Crafty, I have yet to find one that slows down using 64-bit stuff. [...]
IIRR there were mailbox programmers reporting a slowdown in 64-bit mode.
Leaving the instructions in 32-bit default mode (eax vs. rax) with almost identical opcodes was fine, and an additional prefix opcode byte for 64-bit width as well. My main criticism of x64 is that it has no compact 32-bit address-space memory mode. For me 4 GByte is enough:
Code:
mov rax, [array + 8*ebx]
xor rax, [array + 8*r10d]
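For comparison, a sketch (not Gerd's code) of what 64-bit mode demands instead: the 32-bit index first has to be zero-extended into a full 64-bit register before it can be used for scaled addressing.

Code:

mov ebx, ebx              ; writing a 32-bit register implicitly zero-extends into rbx
mov rax, [array + 8*rbx]  ; works only while 'array' sits in the low 2GB (sign-extended disp32)
xor rax, [array + 8*r10]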
-
- Posts: 1822
- Joined: Thu Mar 09, 2006 11:54 pm
- Location: The Netherlands
Re: Zobrist key random numbers
Gerd Isenberg wrote: The point is that mailbox programs, which would even compile and run efficiently on a 16-bit OS, did not take advantage of 64-bit mode, despite eight additional registers and implicit fastcall. [...]
Eugene has regrettably proven to be very correct, and he still said it politely compared to GCC team member Marc Lehmann.
Eugene: "32 bits will be faster for most programs than 64 bits".
Correct, and a LOT faster.
Only bitboard engines profited, for a simple reason that has nothing to do with the number of registers.
The K7 already had 44 rename registers or so, and the P4 something like 128, so there never was a real shortage of registers in x86.
Marc Lehmann: "x86-64 will be the ultimate disaster for GCC"
He has proven to be very right.
Add to this that GCC was already quite bad at efficient code optimization.
Compilers have become one big Intel show.
The more complex processors get now, with all their transistors, the bigger Intel's compiler advantage. AMD has lost the battle there.
Theoretically speaking, the Phenom 2 is a fantastic processor, definitely not worse than Nehalem, other than the 200 MHz gap between them.
In reality Nehalem is 13.5% faster for Diep, IPC-wise.
That is just thanks to the compiler and nothing else: the Wintel compiler.
You can put the problem mathematically: suppose M$ lets features into the Wintel compiler only when they don't hurt one of the two manufacturers in a major-league way; then Intel already has a major advantage.
Realize what an Intel CPU is: the cheapest design they could do that executes a single small path very quickly. Anything else that tries to use the full resources is ugly slow. The Core 2 (I still have to figure out Nehalem) already executes CMOV-type instructions at a luxurious 1 instruction per cycle; AMD can do 3 within 1 cycle on paper. Core 2 also has the luxury, from Intel's viewpoint, of being able to execute 2 shift instructions per cycle; AMD could already do 3 of those within 1 cycle 10 years ago.
Of course we're never going to see a compiler that generates 3 shift instructions in such a way that the CPU can reorder them into what it can actually do.
That is already luxury from Intel's viewpoint: in P4 Prescott times, even AMD flags would not generate CMOV instructions at all. With good reason on Prescott, as a right shift was 7 cycles.
Those who write assembler complain loudly about every Intel design having a problem: the latency of this instruction sucks, or the latency of that instruction sucks.
Compilers avoid all those Intel problems.
That's something totally different from TAKING ADVANTAGE of what AMD can do fast. Compilers just do NOT generate that. GCC with AMD flags in fact generates code so silly that even my nephew of 1.5 years, if he could already write assembler, would do light-years better.
I didn't even need to check the Intel documents to realize that this code, though objectively slower than doing a few CMOVs, would be ugly slow on AMD, triggering all kinds of penalties, while on Intel it wasn't a problem, as it again took advantage of that 'single fast path'.
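For illustration, this is the kind of source the CMOV point is about (a sketch; the per-cycle throughput figures above are Vincent's):

Code:

/* Branchless selection: compilers typically turn the ternary below into
   cmp + cmovg at -O2, trading a possible misprediction for the fixed
   latency of a conditional move. */
static inline int max_int(int a, int b)
{
    return a > b ? a : b;
}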
It is not realistic to discuss complicated phenomena when compilers basically carry an Intel logo and no compiler will ever be ABLE to use all the registers, as that's ugly slow on Intel.
Forget the 16 registers of x64; it's too slow on Intel to use them all at the same time.
Just try in assembler.
Vincent
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Zobrist key random numbers
diep wrote: The K7 already had 44 rename registers or so, and the P4 something like 128, so there never was a real shortage of registers in x86. [...]
This is _completely_ false. Renaming registers does _not_ reduce the effects of a limited number of registers for most programming examples. Renaming allows an easy way to track the data flow for an OOE processor. But when you just have eax-edx visible to the compiler, the extra 8 registers on X86-64 make a significant difference...
diep wrote: Add to this that GCC was already quite bad at efficient code optimization. Compilers have become one big Intel show. [...]
I use gcc on AMD processors exclusively, so I am not sure what you are talking about. It does quite well...
diep wrote: No compiler will ever be ABLE to use all the registers, as that's ugly slow on Intel. Forget the 16 registers of x64; it's too slow on Intel to use them all at the same time. Just try in assembler. [...]
I do, and I use all the registers in Crafty. Feel free to dump the .S files to see. It shortens some pieces of code significantly by avoiding a register jam...
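The "register jam" is easy to picture with a hypothetical function like the sketch below (not taken from Crafty): it keeps roughly a dozen values live at once, which forces stack spills with the 6 or 7 usable general-purpose registers of 32-bit x86 but fits comfortably in the 15 of x86-64.

Code:

/* Eight accumulators plus two pointers and an index are live inside the
   loop: too many for 32-bit x86, fine for x86-64. */
unsigned dot8(const unsigned *a, const unsigned *b, long n)
{
    unsigned s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    unsigned s4 = 0, s5 = 0, s6 = 0, s7 = 0;
    for (long i = 0; i + 8 <= n; i += 8) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
        s4 += a[i + 4] * b[i + 4];
        s5 += a[i + 5] * b[i + 5];
        s6 += a[i + 6] * b[i + 6];
        s7 += a[i + 7] * b[i + 7];
    }
    return s0 + s1 + s2 + s3 + s4 + s5 + s6 + s7;  /* leftover elements ignored in this sketch */
}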