
Sorry about pushing the point so aggressively, but it is totally pointless to discuss this issue if we are not even in sync on the very basics. OK, so PII and PIII have 32-byte L1 cache lines. But everything later has 64-byte L1 lines, as Gerd pointed out. This is what I meant by 'nearly a decade', which was not so far off the mark, as the P-IV was introduced in 2000Q4, according to Wikipedia.
I have no PII or PIII around anymore (funny enough, I do have a still-operational 100MHz 'P-I'!), and do not optimize my engines for those architectures. But that misaligned accesses are not really expensive at all is something I actually first discovered on a P-I, because the DJGPP compiler was not smart enough to align the stack on 8-byte boundaries, which led to misalignment of any locally declared 'double' with 50% probability. Globals were always aligned, so at first I thought the slower execution when I used locals was due to the indexed addressing mode needed for stack-frame addressing. On the P-I the difference was 2 clocks (as a pipeline stall, since the P-I did not have any out-of-order execution).
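To make that concrete, here is a minimal sketch (my own construction, not the original DJGPP-era code, and assuming GCC-style attributes) that prints the alignment of a local double. On modern GCC the stack is kept 16-byte aligned, so it will always print 0, but on a compiler that only maintains 4-byte stack alignment the local would show a remainder of 4 about half the time:

#include <stdio.h>
#include <stdint.h>

static void leaf(int depth)
{
    double d = (double)depth;       /* local, so it lives on the stack */
    printf("depth %d: &d %% 8 = %u\n",
           depth, (unsigned)((uintptr_t)&d % 8));
}

static void recurse(int depth)
{
    char pad[4];                    /* shifts the next frame by 4 bytes */
    pad[0] = (char)depth;           /* if frames are packed tightly     */
    leaf(depth);
    if (depth > 0)
        recurse(depth - 1);
    (void)pad[0];
}

int main(void)
{
    recurse(4);
    return 0;
}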
You say the misalignments are bad, but that seems more a matter of principle than anything else. In practice they work. To my amazement, on the Pentium M they even work perfectly, as I reported above. I had not tested it on that machine before. But both an aligned and a misaligned L1 hit have a 3-cycle latency. That is what I call a zero-cycle penalty for the misalignment. Only when I straddle a cache-line boundary is there a 9-cycle penalty, so the latency goes up to 12 cycles. Note that this is always for 4-byte loads; I do not use any MMX or SSE, as my compiler doesn't generate those. So long-long arithmetic is always done in two halves, and only one of those halves can straddle the cache-line boundary.
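For those who want to measure this themselves, something like the following pointer-chasing sketch would do (my own construction, assuming GCC/Clang with __rdtsc() from <x86intrin.h>; the buffer size and offsets are just illustrative). Because every load depends on the previous one, the cycle count per iteration approximates the load-to-use latency for the chosen alignment:

#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>

#define LINES 256            /* 16 KB working set, fits in L1 */
#define ITER  1000000

static char buf[LINES * 64 + 64] __attribute__((aligned(64)));

static uint32_t probe(int offset)
{
    /* Build a circular chain of 4-byte offsets, one per 64-byte line,
       each stored at the given byte offset within its line. The
       deliberately misaligned accesses through a cast pointer are
       exactly what we want to measure; they work fine on x86. */
    for (int i = 0; i < LINES; i++)
        *(uint32_t *)(buf + i * 64 + offset) =
            (uint32_t)(((i + 1) % LINES) * 64 + offset);

    volatile uint32_t start = (uint32_t)offset; /* defeat constant folding */
    uint32_t q = start;
    uint64_t t0 = __rdtsc();
    for (int i = 0; i < ITER; i++)
        q = *(uint32_t *)(buf + q);   /* each load depends on the last */
    uint64_t t1 = __rdtsc();
    start = q;                        /* keep the chain result live    */
    return (uint32_t)((t1 - t0) / ITER);
}

int main(void)
{
    printf("aligned    (offset  0): %u cycles/load\n", probe(0));
    printf("misaligned (offset  1): %u cycles/load\n", probe(1));
    printf("straddling (offset 61): %u cycles/load\n", probe(61));
    return 0;
}

Offset 61 makes a 4-byte load cover bytes 61-64 of a line, i.e. it straddles the 64-byte boundary, while offset 1 is misaligned but stays within one line.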
Note that it is not true that there are extra uOps involved. This would be true for unaligned SSE loads, but can't be the case for normal x86 instructions. At decode time the CPU has no idea yet that a misaligned load will be requested, as this only becomes apparent when the instruction is already being executed, in the address calculation, when the output of the AGU becomes available. It will be a single LOAD uOp, and only the cache unit will be presented with a problem. Apparently it can handle that problem very well in the Pentium M, by allowing the fetch of two neighboring banks in the same cycle. And apparently the multiplexer that has to be there anyway to select the addressed byte out of a word has been given 32-bit width in the P-M.
Unfortunately I don't recall which CPU the numbers I quoted initially (2-clock penalty) were for. I could not find a program to measure it on my AMD Athlon XP (K7) or my Core 2 Duo (E6600). So I fear it might indeed have been the dreaded P-IV in my office at work. And it was probably not for straddling a cache line.
My philosophy is: "if the penalty is negligible on almost any machine I know, I would be a fool not to use it." And it seems that the penalty is _much_ lower than the penalty for an L1 cache miss. If others call that "bad practice", so be it. I'd rather be guilty of "bad practice" and have a fast program than do things "by the book" and have worse performance. My programs also contain goto statements...
Now I don't worry much about the other objections you bring up. For one, the straddling of cache lines can be avoided. A 0x88 board aligned on a 64-byte boundary only uses bytes 0-7, 16-23, 32-39 and 48-55 for board rows. Even if I did a 64-bit (8-byte) load from address 55, it would only extend to byte 62, i.e. not straddle the cache-line boundary. (This would even be true for 32-byte cache lines.) This would reduce the compression from a factor 8 to a factor 4, but that can still be a very worthwhile factor. And even if there are L1 misses, there would never be double L1 misses, as there are no straddlers. On a Pentium M this size reduction is apparently absolutely free. (I still have to measure it on the E6600, but will report the result of that here later.) Only if you want the factor 8 do you pay something for it. But as this is apparently very little (0.5 clocks per access), it might very well be worth it.
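To make the arithmetic concrete, here is a small sketch (layout and names are mine) that checks the worst case: the last used byte within each 64-byte line of the board sits at offset 55 of that line, so even an unaligned 8-byte load starting there ends at byte 62 and stays within the line:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* The 0x88 board: 8 ranks of 16 squares; only squares with
   (sq & 0x88) == 0 are on the board, i.e. bytes 0-7 of every
   16-byte rank. Aligned to 64 bytes as described above. */
static unsigned char board[128] __attribute__((aligned(64)));

int main(void)
{
    /* Worst case per cache line: the rank starting at line offset 48
       ends at offset 55, so an 8-byte load from there touches bytes
       55..62 of the same line. */
    for (int line = 0; line < 128; line += 64) {
        int start = line + 55;
        int last  = start + 7;               /* last byte the load touches */
        uint64_t v;
        memcpy(&v, board + start, sizeof v); /* the unaligned 8-byte load  */
        printf("load %d..%d, line ends at %d -> %s (v=%llu)\n",
               start, last, line + 63,
               last <= line + 63 ? "no straddle" : "straddle",
               (unsigned long long)v);
    }
    return 0;
}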