
Sorry about pushing the point so aggressively, but it is totally pointless to discuss this issue if we are not even in sync on the very basics. OK, so PII and PIII have 32-byte L1 cache lines. But everything later has 64-byte L1 lines, as Gerd pointed out. This is what I meant by 'nearly a decade', which was not so far off the mark, as the P-IV was introduced in 2000Q4, according to Wikipedia.
I have no PII or PIII around anymore (funny enough, I do have a still-operational 100MHz 'P-I'!), and do not optimize my engines for those architectures. But that misaligned accesses are not really expensive at all is something I actually first discovered on a P-I, because the DJGPP compiler was not smart enough to align the stack on 8-byte boundaries, which led to misalignment of any locally declared 'double' with 50% probability. Globals were always aligned, so at first I thought the slower execution when I used locals was due to the indexed addressing mode needed for stack-frame addressing. On the P-I the difference was 2 clocks (as a pipeline stall, since the P-I did not have any out-of-order execution).
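To make that concrete, here is a minimal sketch (my own construction, not the original DJGPP-era code, and assuming GCC-style attributes) that prints the alignment of a local double. On modern GCC the stack is kept 16-byte aligned, so it will always print 0, but on a compiler that only maintains 4-byte stack alignment the local would show a remainder of 4 about half the time:

#include <stdio.h>
#include <stdint.h>

static void leaf(int depth)
{
    double d = (double)depth;       /* local, so it lives on the stack */
    printf("depth %d: &d %% 8 = %u\n",
           depth, (unsigned)((uintptr_t)&d % 8));
}

static void recurse(int depth)
{
    char pad[4];                    /* shifts the next frame by 4 bytes */
    pad[0] = (char)depth;           /* if frames are packed tightly     */
    leaf(depth);
    if (depth > 0)
        recurse(depth - 1);
    (void)pad[0];
}

int main(void)
{
    recurse(4);
    return 0;
}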
You say the misalignments are bad, but that seems more a matter of principle than anything else. In practice they work. To my amazement, on the Pentium M they even work perfectly, as I reported above. I had not tested it on that machine before. But both an aligned and a misaligned L1 hit have a 3-cycle latency. That is what I call a zero-cycle penalty for the misalignment. Only when I straddle a cache-line boundary is there a 9-cycle penalty, so the latency goes up to 12 cycles. Note that this is always for 4-byte loads; I do not use any MMX or SSE, as my compiler doesn't generate those. So long-long arithmetic is always done in two halves, and only one of those halves can straddle the cache-line boundary.
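For those who want to measure this themselves, something like the following pointer-chasing sketch would do (my own construction, assuming GCC/Clang with __rdtsc() from <x86intrin.h>; the buffer size and offsets are just illustrative). Because every load depends on the previous one, the cycle count per iteration approximates the load-to-use latency for the chosen alignment:

#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>

#define LINES 256            /* 16 KB working set, fits in L1 */
#define ITER  1000000

static char buf[LINES * 64 + 64] __attribute__((aligned(64)));

static uint32_t probe(int offset)
{
    /* Build a circular chain of 4-byte offsets, one per 64-byte line,
       each stored at the given byte offset within its line. The
       deliberately misaligned accesses through a cast pointer are
       exactly what we want to measure; they work fine on x86. */
    for (int i = 0; i < LINES; i++)
        *(uint32_t *)(buf + i * 64 + offset) =
            (uint32_t)(((i + 1) % LINES) * 64 + offset);

    volatile uint32_t start = (uint32_t)offset; /* defeat constant folding */
    uint32_t q = start;
    uint64_t t0 = __rdtsc();
    for (int i = 0; i < ITER; i++)
        q = *(uint32_t *)(buf + q);   /* each load depends on the last */
    uint64_t t1 = __rdtsc();
    start = q;                        /* keep the chain result live    */
    return (uint32_t)((t1 - t0) / ITER);
}

int main(void)
{
    printf("aligned    (offset  0): %u cycles/load\n", probe(0));
    printf("misaligned (offset  1): %u cycles/load\n", probe(1));
    printf("straddling (offset 61): %u cycles/load\n", probe(61));
    return 0;
}

Offset 61 makes a 4-byte load cover bytes 61-64 of a line, i.e. it straddles the 64-byte boundary, while offset 1 is misaligned but stays within one line.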
Note that it is not true that there are extra uOps involved. This would be true for unaligned SSE loads, but can't be the case for normal x86 instructions. At decode time the CPU has no idea yet that a misaligned load will be requested, as this only becomes apparent when the instruction is already being executed, in the address calculation, when the output of the AGU becomes available. It will be a single LOAD uOp, and only the cache unit will be presented with a problem. Apparently it can handle that problem very well in the Pentium M, by allowing the fetch of two neighboring banks in the same cycle. And apparently the multiplexer that has to be there anyway to select the addressed byte out of a word has been given 32-bit width in the P-M.
Unfortunately I don't recall which CPU the numbers I quoted initially (2-clock penalty) were for. I could not find a program to measure it on my AMD Athlon XP (K7) or my Core 2 Duo (E6600). So I fear it might indeed have been the dreaded P-IV in my office at work. And it was probably not for straddling a cache line.
My philosophy is: "if the penalty is negligible on almost any machine I know, I would be a fool not to use it." And it seems that the penalty is _much_ lower than the penalty for an L1 cache miss. If others call that "bad practice", so be it. I'd rather be guilty of "bad practice" and have a fast program than do things "by the book" and have worse performance. My programs also contain goto statements...
Now I don't worry much about the other objections you bring up. For one, the straddling of cache lines can be avoided. A 0x88 board aligned on a 64-byte boundary only uses bytes 0-7, 16-23, 32-39 and 48-55 for board rows. Even if I did a 64-bit (8-byte) load from address 55, it would only extend to byte 62, i.e. not straddle the cache-line boundary. (This would even be true for 32-byte cache lines.) This would reduce the compression from a factor 8 to a factor 4, but that can still be a very worthwhile factor. And even if there are L1 misses, there would never be double L1 misses, as there are no straddlers. On a Pentium M this size reduction is apparently absolutely free. (I still have to measure it on the E6600, but will report the result of that here later.) Only if you want the factor 8 do you pay something for it. But as this is apparently very little (0.5 clocks per access), it might very well be worth it.
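To make the arithmetic concrete, here is a small sketch (layout and names are mine) that checks the worst case: the last used byte within each 64-byte line of the board sits at offset 55 of that line, so even an unaligned 8-byte load starting there ends at byte 62 and stays within the line:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* The 0x88 board: 8 ranks of 16 squares; only squares with
   (sq & 0x88) == 0 are on the board, i.e. bytes 0-7 of every
   16-byte rank. Aligned to 64 bytes as described above. */
static unsigned char board[128] __attribute__((aligned(64)));

int main(void)
{
    /* Worst case per cache line: the rank starting at line offset 48
       ends at offset 55, so an 8-byte load from there touches bytes
       55..62 of the same line. */
    for (int line = 0; line < 128; line += 64) {
        int start = line + 55;
        int last  = start + 7;               /* last byte the load touches */
        uint64_t v;
        memcpy(&v, board + start, sizeof v); /* the unaligned 8-byte load  */
        printf("load %d..%d, line ends at %d -> %s (v=%llu)\n",
               start, last, line + 63,
               last <= line + 63 ? "no straddle" : "straddle",
               (unsigned long long)v);
    }
    return 0;
}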