hgm wrote:
> Everyone knows that the cache-line length has been 64 bytes for nearly a decade! (Ever since DDR memory was introduced, and the memory burst cycle was increased from 4 to 8 words.)

Care to make a small wager? L1 is _not_ 64 bytes per line, and L2 on the PIV was _not_ 64 bytes. Occasionally you might want to check your facts before writing something that _anybody_ can prove wrong, as in this trivial case. My Core 2 has 32/64-byte line sizes for L1/L2. The machine in my office (PIV Xeon) is 32/128. The cluster machines with dual quads are all 32/64 line sizes.
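If anyone wants to check their own hardware instead of guessing, here is a minimal sketch, assuming Linux with glibc (the _SC_LEVEL*_LINESIZE queries are glibc extensions and may report 0 or -1 if the kernel does not expose a value):

```c
/* Minimal sketch: ask the running machine what line sizes it reports.
   Assumes Linux/glibc; these sysconf names are glibc extensions and may
   return 0 or -1 when the value is not available. */
#include <stdio.h>
#include <unistd.h>

int main(void) {
    printf("L1 data line size: %ld bytes\n", sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
    printf("L2 line size:      %ld bytes\n", sysconf(_SC_LEVEL2_CACHE_LINESIZE));
    printf("L3 line size:      %ld bytes\n", sysconf(_SC_LEVEL3_CACHE_LINESIZE));
    return 0;
}
```

On Linux you can get the same numbers without writing any code at all, from /sys/devices/system/cpu/cpu0/cache/index*/coherency_line_size.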
Do you have any more "wisdom" to pass along???

Just do a google search, then report back with what "everybody" knows, yourself excluded...
And, for the record, my point still stands. Unaligned memory accesses are _bad_. I don't care what you "think" the penalty is; I _know_ what it is. It ranges from extra micro-ops generated at instruction decode, to almost 25% of L1/L2 cache hits taking twice the normal hit latency, to the miss penalty of filling two lines from the next level down. Whether your L1 miss turns into a double-penalty L2 miss depends on the line sizes, since, contrary to your belief, L1 and L2 do _not_ have the same line size on x86 or x86-64. And there are other parts of this that are also bad: an unaligned access can spill over into the next virtual page as well, which is _really_ bad.
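If you want to see the split-line hit penalty for yourself, here is a rough microbenchmark sketch, nothing rigorous: it assumes a 64-byte line and compares 8-byte loads that sit inside one line against loads placed at offset 60 so every one of them straddles a line boundary. The sizes and names are just illustration; the numbers will vary by microarchitecture and compiler.

```c
/* Rough sketch: time 8-byte loads that stay inside one cache line versus
   loads at offset 60 of a 64-byte line, so each straddles two lines.
   The working set is kept at 32 KB so both runs hit in L1. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define LINES 512                 /* 512 * 64 bytes = 32 KB */
#define ITERS (1 << 26)

static double time_loads(const char *base, size_t offset) {
    volatile uint64_t sink = 0;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < ITERS; i++) {
        uint64_t v;
        memcpy(&v, base + (i % LINES) * 64 + offset, sizeof v);  /* compiles to a plain load */
        sink += v;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    (void)sink;
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
}

int main(void) {
    char *buf;
    if (posix_memalign((void **)&buf, 64, LINES * 64 + 64)) return 1;
    memset(buf, 1, LINES * 64 + 64);
    printf("line-aligned loads:    %.3f s\n", time_loads(buf, 0));
    printf("line-straddling loads: %.3f s\n", time_loads(buf, 60));
    free(buf);
    return 0;
}
```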
And there are many processors, the SPARC among them, that don't allow unaligned accesses at all, so you exclude yourself from running on those. And the unaligned penalty is getting _worse_, not better, as architectures march on, the i7 being a good example.
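For the record, the portable way to cope with data that might not be aligned, both on SPARC-class machines that fault outright and on x86 where it is merely slow, is to keep hot structures line-aligned in the first place and, where you truly can't, to read through memcpy instead of a pointer cast. A small sketch (the helper names here are just illustration):

```c
/* Sketch: keep tables line-aligned, and read byte streams through memcpy
   rather than a cast, so strict-alignment targets don't SIGBUS. */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Read a 64-bit value from an arbitrary (possibly misaligned) position.
   Compilers turn this memcpy into a single load on x86/x86-64. */
static inline uint64_t load_u64(const void *p) {
    uint64_t v;
    memcpy(&v, p, sizeof v);      /* never *(const uint64_t *)p here */
    return v;
}

/* Allocate a table whose entries start on a 64-byte line boundary. */
static void *alloc_lines(size_t bytes) {
    void *p = NULL;
    if (posix_memalign(&p, 64, bytes) != 0)
        return NULL;
    return p;
}
```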
But "most" of us understand and know that... I was just trying to help you join "most". If it didn't work, that's ok. Follow all the bad programming practices you want. It is your program...