Zobrist key random numbers

Discussion of chess software programming and technical issues.

Moderator: Ras

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Zobrist key random numbers

Post by bob »

hgm wrote:Everyone knows that cache-line length has been 64 bytes for nearly a decade! (Ever since DDR memory was introduced, and the memory burst cycle was increased from 4 to 8 words.)
Care to make a small wager? L1 is _not_ 64 bytes per line. And L2 on the PIV was _not_ 64 bytes. Occasionally you might want to check your facts before writing something that _anybody_ can prove wrong, as in this trivial case. My core-2 is 32/64 line size for L1/L2. The machine in my office (PIV xeon) is 32/128. The cluster with dual quads is all 32/64 line sizes.

Do you have any more "wisdom" to pass along??? :)

Just do a google search, then report back with what "everybody" knows, yourself excluded...

And, for the record, my point still stands. Unaligned memory accesses are _bad_. I don't care what you "think" the penalty is. I _know_ what it is. From extra micro-ops generated at instruction decode, to almost 25% of L1/L2 cache hits taking 2x the normal hit penalty, and then there is the miss penalty of filling two lines from the next lower level down, depending on whether your L1 cache miss is also a dual-penalty L2 cache miss since, contrary to your belief, L1 and L2 do _not_ have the same line size in X86 or X86-64. And there are other parts of this that are also bad, as you can overflow to the next virtual page as well, which is _really_ bad.
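To make the split-load point concrete, here is a minimal sketch (not part of the original argument) that times repeated loads from an address fully inside a 64-byte line against loads that straddle a line boundary. It assumes an x86-64 compiler with GCC-style __rdtsc() and aligned-attribute support, and it leans on x86 tolerating misaligned loads, which strict C does not guarantee:

#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>      /* __rdtsc() */

#define ITER 10000000L

/* time ITER loads through a volatile pointer so the compiler keeps them */
static uint64_t time_loads(volatile uint64_t *p) {
    uint64_t t0 = __rdtsc();
    uint64_t sum = 0;
    for (long i = 0; i < ITER; i++)
        sum += *p;
    uint64_t t1 = __rdtsc();
    (void)sum;
    return t1 - t0;
}

int main(void) {
    /* 64-byte aligned buffer so we control exactly where each load falls */
    static uint64_t buf[32] __attribute__((aligned(64)));
    unsigned char *base = (unsigned char *)buf;

    uint64_t inside   = time_loads((volatile uint64_t *)(base + 8));   /* fully inside one line */
    uint64_t straddle = time_loads((volatile uint64_t *)(base + 60));  /* crosses a 64-byte boundary */

    printf("inside one line: %llu cycles\n", (unsigned long long)inside);
    printf("line-straddling: %llu cycles\n", (unsigned long long)straddle);
    return 0;
}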

And there are many processors that don't even allow unaligned accesses so you exclude yourself from using those, such as the SPARC, etc... And the unaligned penalty is getting _worse_ not better as architecture marches on. I7 being a good example.

But "most" of us understand and know that... I was just trying to help you join "most". If it didn't work, that's ok. Follow all the bad programming practices you want. It is your program...
hgm
Posts: 28326
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Zobrist key random numbers

Post by hgm »

Now that you propose it, a friendly wager might indeed have a very beneficial effect here. So how about this:

You (Bob Hyatt) claim the cache line of L1 on an Intel Celeron M is 32 bytes long
I (H.G.Muller) claim that the cache line on an Intel Celeron M is 64 bytes long

For clarity: a Celeron M is a Pentium M with a smaller L2 cache.

He who is proved wrong will start a new thread in the programming section of CCC titled "L1 cache-line length", in which he makes a post that repeats the line "I am an idiot and know zilch about CPU architecture!" one hundred times, and signs it with his own name.

In case a web search does not conclusively indicate who is right, a test program will be run on the Pentium M to decide the question.
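For what it is worth, a crude stride test along the following lines would settle it (a sketch only; the buffer size, stride range, and use of clock() are arbitrary choices, and hardware prefetching can blur the knee). The time per access stops growing once the stride reaches the line length, because from that point on every access misses:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define BUF_SIZE (64 * 1024 * 1024)   /* far larger than any L2 of that era */
#define PASSES   4

int main(void) {
    volatile unsigned char *buf = malloc(BUF_SIZE);
    if (!buf) return 1;

    for (size_t stride = 16; stride <= 512; stride *= 2) {
        clock_t t0 = clock();
        for (int pass = 0; pass < PASSES; pass++)
            for (size_t i = 0; i < BUF_SIZE; i += stride)
                buf[i]++;                       /* one touch per stride */
        clock_t t1 = clock();
        double accesses = (double)PASSES * (BUF_SIZE / stride);
        printf("stride %4zu: %.2f ns/access\n", stride,
               1e9 * (double)(t1 - t0) / CLOCKS_PER_SEC / accesses);
    }
    free((void *)buf);
    return 0;
}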

You agree to those conditions? :roll:
Gerd Isenberg
Posts: 2251
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: Zobrist key random numbers

Post by Gerd Isenberg »

bob wrote:
hgm wrote:Everyone knows that cache-line length has been 64 bytes for nearly a decade! (Ever since DDR memory was introduced, and the memory burst cycle was increased from 4 to 8 words.)
Care to make a small wager? L1 is _not_ 64 bytes per line. And L2 on the PIV was _not_ 64 bytes. Occasionally you might want to check your facts before writing something that _anybody_ can prove wrong, as in this trivial case. My core-2 is 32/64 line size for L1/L2. The machine in my office (PIV xeon) is 32/128. The cluster with dual quads is all 32/64 line sizes.

Do you have any more "wisdom" to pass along??? :)

Just do a google search, then report back with what "everybody" knows, yourself excluded...

And, for the record, my point still stands. Unaligned memory accesses are _bad_. I don't care what you "think" the penalty is. I _know_ what it is. From extra micro-ops generated at instruction decode, to almost 25% of L1/L2 cache hits taking 2x the normal hit penalty, and then there is the miss penalty of filling two lines from the next lower level down, depending on whether your L1 cache miss is also a dual-penalty L2 cache miss since, contrary to your belief, L1 and L2 do _not_ have the same line size in X86 or X86-64. And there are other parts of this that are also bad, as you can overflow to the next virtual page as well, which is _really_ bad.

And there are many processors that don't even allow unaligned accesses so you exclude yourself from using those, such as the SPARC, etc... And the unaligned penalty is getting _worse_ not better as architecture marches on. I7 being a good example.

But "most" of us understand and know that... I was just trying to help you join "most". If it didn't work, that's ok. Follow all the bad programming practices you want. It is your program...
Can you provide a link to a site documenting 32-byte cache lines on recent x86 processors? Like HG, I am only aware of 64-byte L1 cache lines for Intel as well as AMD.

Some quotes from Agner Fog's microarchitecture.pdf, size: 1093888, last modified: 2009-Jan-23
http://www.agner.org/optimize/microarchitecture.pdf
6 Pentium M pipeline
Memory access
If the program is accessing large amounts of data, or if the data are scattered around everywhere in the memory, then you will have many data cache misses. Accessing uncached data is so time consuming that all other optimization considerations are unimportant. The caches are organized as aligned lines of 64 bytes each. If one byte within an aligned 64-byte block has been accessed, then you can be certain that all 64 bytes will be loaded into the level-1 data cache and can be accessed at no extra cost.


7 Core 2 pipeline
7.13 Cache and memory access
The level-1 data cache is 32 kB, dual port, 8 way, 64 byte line size. The level-1 code cache has the same size as the data cache.

Cache bank conflicts
Each 64-bytes line in the data cache is divided into 4 banks of 16 bytes each. It is not possible to do a memory read and a memory write in the same clock cycle if the two memory addresses have the same bank number, i.e. if bit 4 and 5 in the two addresses are the same. Example:
; Example 7.14. Core2 cache bank conflict
mov eax, [esi] ; Use bank 0, assuming esi is divisible by 40H
mov [esi+100H], ebx ; Use bank 0. Cache bank conflict
mov [esi+110H], ebx ; Use bank 1. No cache bank conflict

Misaligned memory accesses
Misaligned memory reads have a penalty of approximately 12 clock cycles when a cache line boundary (64 bytes) is crossed.
Misaligned memory writes have a penalty of approximately 10 clock cycles when a cache line boundary (64 bytes) is crossed.


8 Pentium 4 (NetBurst) pipeline
Memory access
If the program is accessing large amounts of data, or if the data are scattered around everywhere in the memory, then we will have many data cache misses. Accessing uncached data is so time consuming that all other optimization considerations are unimportant. The caches are organized as aligned lines of 64 bytes each. If one byte within an aligned 64-bytes block has been accessed, then we can be certain that all 64 bytes will be loaded into the level-1 data cache and can be accessed at no extra cost. To improve caching, it is recommended that data that are used in the same part of the program be stored together. You may align large arrays and structures by 64. Store local variables on the stack if you don’t have enough registers.
The level-1 data cache is only 8 kb on the P4 and 16 kb on P4E. This may not be enough to hold all the data, but the level-2 cache is more efficient on the P4/P4E than on previous processors. Fetching data from the level-2 cache will cost only a few clock cycles extra.

9 AMD pipeline
9.15 Cache
The level-1 code cache and the level-1 data cache are both 64 Kbytes, 2 way set associative and a line size of 64 bytes. The data cache has two ports which can be used for either read or write. This means that it can do two reads or two writes or one read and one write in the same clock cycle. Each read port is 128 bits wide on K10, 64 bits on K8. The write ports are 64 bits on both K8 and K10. This means that a 128-bit write operation requires two macro-operations.

Each 64 byte line in the code cache line is split into 4 blocks of 16 bytes each. Each 64 byte line in the data cache is split into 8 banks of 8 bytes each. The data cache cannot do two memory operations in the same clock cycle if they use the same bank, except for two reads from the same cache line.
See also the intel manual: http://www.intel.com/design/processor/m ... 248966.pdf
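As an aside, following the "align large arrays and structures by 64" advice for a Zobrist table is a one-liner in C. This is only a sketch; the attribute spellings are the usual GCC/MSVC ones, and the 12x64 layout is just an illustrative choice, not taken from any particular engine:

#include <stdint.h>

#ifdef _MSC_VER
#define CACHE_ALIGN __declspec(align(64))
#else
#define CACHE_ALIGN __attribute__((aligned(64)))
#endif

/* 12 * 64 keys * 8 bytes = 6144 bytes = 96 full 64-byte lines, starting on
   a line boundary, so no individual key ever straddles two lines */
CACHE_ALIGN uint64_t zobrist[12][64];

uint64_t update_hash(uint64_t key, int piece, int from, int to) {
    key ^= zobrist[piece][from];   /* each lookup is one aligned 8-byte read */
    key ^= zobrist[piece][to];
    return key;
}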
Gerd Isenberg
Posts: 2251
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: Zobrist key random numbers

Post by Gerd Isenberg »

bob wrote:
Gerd Isenberg wrote: Probably the unaligned load instruction via LDDQU:
http://software.intel.com/en-us/article ... hitecture/
http://forums.amd.com/devblog/blogpost. ... adid=94377
The problem is, this just makes it _work_. It doesn't make it _efficient_ to do unaligned accesses...
Nehalem as well as Sandy Bridge will further improve misaligned access, due to faster video stream processing, where misaligned reads are an issue. They already did so for later Core 2 Duos with the LDDQU instruction to prevent the cache-line split. So HG's "dubious" technique of cutting the memory space for the overlapped Zobrist keys by 7/8 (12 cache lines versus 96 cache lines) has some merit with respect to L1 misses. It is easy to test, anyway.
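In intrinsic form the LDDQU idea looks roughly like this. A sketch only: the compact 768-byte table and the byte-offset addressing are stand-ins for HG's overlapped-key scheme, which is not spelled out here, and a 64-bit SSE3 build is assumed:

#include <stdint.h>
#include <pmmintrin.h>   /* SSE3: _mm_lddqu_si128 */

/* compact random-byte table; 768 bytes = 12 cache lines of 64 bytes,
   padded so a 16-byte load never runs off the end */
extern unsigned char key_bytes[768 + 16];

/* fetch a 64-bit key starting at an arbitrary, possibly unaligned, byte
   offset (caller keeps offset <= 768) */
uint64_t overlapped_key(int offset) {
    __m128i v = _mm_lddqu_si128((const __m128i *)(key_bytes + offset));
    return (uint64_t)_mm_cvtsi128_si64(v);   /* keep the low 8 of the 16 bytes loaded */
}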
hgm
Posts: 28326
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Zobrist key random numbers

Post by hgm »

One thing is certain: my Core 2 Duo (E6600) also has 64-byte L1.

But that was not really relevant for the earlier test results I reported, as those were stated to be for a Pentium M.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Zobrist key random numbers

Post by bob »

Gerd Isenberg wrote:
bob wrote:
Gerd Isenberg wrote: Probably the unaligned load instruction via LDDQU:
http://software.intel.com/en-us/article ... hitecture/
http://forums.amd.com/devblog/blogpost. ... adid=94377
The problem is, this just makes it _work_. It doesn't make it _efficient_ to do unaligned accesses...
Nehalem as well as Sandy Bridge will further improve misaligned access, due to faster video stream processing, where misaligned reads are an issue. They already did so for later Core 2 Duos with the LDDQU instruction to prevent the cache-line split. So HG's "dubious" technique of cutting the memory space for the overlapped Zobrist keys by 7/8 (12 cache lines versus 96 cache lines) has some merit with respect to L1 misses. It is easy to test, anyway.
I understand, but, for example, Nehalem added one cycle to the L1 hit penalty, so that it takes 4 clocks to fetch a value from L1. If it is unaligned, and wraps from one line to the next, that turns into 8 clocks, rather than the 3 of previous L1's...

And there are so many other issues. Wrapping to the next page is _much_ worse, and will cause pipeline stalls during speculative execution. It requires a second TLB lookup. Possibly a second virtual-to-real translation if TLB fails, etc.

It's just a bad idea and is impossible on some processors anyway...

HG is putting _far_ too much emphasis on L1 footprint at the expense of efficiency...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Zobrist key random numbers

Post by bob »

Gerd Isenberg wrote:
bob wrote:
hgm wrote:Everyone knows that cache-line length has been 64 bytes for nearly a decade! (Ever since DDR memory was introduced, and the memory burst cycle was increased from 4 to 8 words.)
Care to make a small wager? L1 is _not_ 64 bytes per line. And L2 on the PIV was _not_ 64 bytes. Occasionally you might want to check your facts before writing something that _anybody_ can prove wrong, as in this trivial case. My core-2 is 32/64 line size for L1/L2. The machine in my office (PIV xeon) is 32/128. The cluster with dual quads is all 32/64 line sizes.

Do you have any more "wisdom" to pass along??? :)

Just do a google search, then report back with what "everybody" knows, yourself excluded...

And, for the record, my point still stands. Unaligned memory accesses are _bad_. I don't care what you "think" the penalty is. I _know_ what it is. From extra micro-ops generated at instruction decode, to almost 25% of L1/L2 cache hits taking 2x the normal hit penalty, and then there is the miss penalty of filling two lines from the next lower level down, depending on whether your L1 cache miss is also a dual-penalty L2 cache miss since, contrary to your belief, L1 and L2 do _not_ have the same line size in X86 or X86-64. And there are other parts of this that are also bad, as you can overflow to the next virtual page as well, which is _really_ bad.

And there are many processors that don't even allow unaligned accesses so you exclude yourself from using those, such as the SPARC, etc... And the unaligned penalty is getting _worse_ not better as architecture marches on. I7 being a good example.

But "most" of us understand and know that... I was just trying to help you join "most". If it didn't work, that's ok. Follow all the bad programming practices you want. It is your program...
Can you provide a link to a site documenting 32-byte cache lines on recent x86 processors? Like HG, I am only aware of 64-byte L1 cache lines for Intel as well as AMD.

Some quotes from Agner Fog's microarchitecture.pdf, size: 1093888, last modified: 2009-Jan-23
http://www.agner.org/optimize/microarchitecture.pdf
6 Pentium M pipeline
Memory access
If the program is accessing large amounts of data, or if the data are scattered around everywhere in the memory, then you will have many data cache misses. Accessing uncached data is so time consuming that all other optimization considerations are unimportant. The caches are organized as aligned lines of 64 bytes each. If one byte within an aligned 64-byte block has been accessed, then you can be certain that all 64 bytes will be loaded into the level-1 data cache and can be accessed at no extra cost.


7 Core 2 pipeline
7.13 Cache and memory access
The level-1 data cache is 32 kB, dual port, 8 way, 64 byte line size. The level-1 code cache has the same size as the data cache.

Cache bank conflicts
Each 64-bytes line in the data cache is divided into 4 banks of 16 bytes each. It is not possible to do a memory read and a memory write in the same clock cycle if the two memory addresses have the same bank number, i.e. if bit 4 and 5 in the two addresses are the same. Example:
; Example 7.14. Core2 cache bank conflict
mov eax, [esi] ; Use bank 0, assuming esi is divisible by 40H
mov [esi+100H], ebx ; Use bank 0. Cache bank conflict
mov [esi+110H], ebx ; Use bank 1. No cache bank conflict

Misaligned memory accesses
Misaligned memory reads have a penalty of approximately 12 clock cycles when a cache line boundary (64 bytes) is crossed.
Misaligned memory writes have a penalty of approximately 10 clock cycles when a cache line boundary (64 bytes) is crossed.


8 Pentium 4 (NetBurst) pipeline
Memory access
If the program is accessing large amounts of data, or if the data are scattered around everywhere in the memory, then we will have many data cache misses. Accessing uncached data is so time consuming that all other optimization considerations are unimportant. The caches are organized as aligned lines of 64 bytes each. If one byte within an aligned 64-bytes block has been accessed, then we can be certain that all 64 bytes will be loaded into the level-1 data cache and can be accessed at no extra cost. To improve caching, it is recommended that data that are used in the same part of the program be stored together. You may align large arrays and structures by 64. Store local variables on the stack if you don’t have enough registers.
The level-1 data cache is only 8 kb on the P4 and 16 kb on P4E. This may not be enough to hold all the data, but the level-2 cache is more efficient on the P4/P4E than on previous processors. Fetching data from the level-2 cache will cost only a few clock cycles extra.

9 AMD pipeline
9.15 Cache
The level-1 code cache and the level-1 data cache are both 64 Kbytes, 2 way set associative and a line size of 64 bytes. The data cache has two ports which can be used for either read or write. This means that it can do two reads or two writes or one read and one write in the same clock cycle. Each read port is 128 bits wide on K10, 64 bits on K8. The write ports are 64 bits on both K8 and K10. This means that a 128-bit write operation requires two macro-operations.

Each 64 byte line in the code cache line is split into 4 blocks of 16 bytes each. Each 64 byte line in the data cache is split into 8 banks of 8 bytes each. The data cache cannot do two memory operations in the same clock cycle if they use the same bank, except for two reads from the same cache line.
See also the intel manual: http://www.intel.com/design/processor/m ... 248966.pdf
Note that (again) I did not mention a specific processor, except for the reference HGM made to the PIV 64-byte L2 line size. I said that cache line sizes have varied from 32 to 128 in X86. I have a bunch of "cpuid output" from the machines I have run on over the past 5 years. About 1/2 of them had 32-byte L1, the other half had 64. The dividing point seemed to be more related to the size of L2: larger L2's seem to go with 64, small L2's with 32. Recently all we have had is large-L2 machines, so everything _today_ may well be 64/64, at least on the upper end of Intel's microprocessor product line. It used to be that the "xeon" end of the product line was more aggressive, but also far more expensive. When "netburst" came along, Intel doubled the line size in an effort to drive RAMBUS and such faster. It was a miserable failure. And they finally did things right when the "core" architecture(s) came along. I couldn't tell you with any accuracy what the low-end processors do; not having any around means I don't try to figure that out. But I do run on PIII/PIV/Core/Core2 all the time, and I also worry about performance on _all_ processors, since Crafty is run on everything going all the way back to Pentium Pro 200 boxes, even today.
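For reference, something like the following asks the running machine directly instead of keeping cpuid dumps around. A sketch only: _SC_LEVEL1_DCACHE_LINESIZE is a glibc extension, and the 0x80000006 CPUID leaf reports the L2 line size on AMD and most newer Intel parts, while older Intel CPUs hide line sizes in the leaf 2/4 descriptors, so treat this as a starting point rather than a universal probe:

#include <stdio.h>
#include <unistd.h>    /* sysconf() */
#include <cpuid.h>     /* GCC/Clang __get_cpuid() */

int main(void) {
#ifdef _SC_LEVEL1_DCACHE_LINESIZE
    printf("sysconf L1d line size: %ld bytes\n",
           sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
#endif
    unsigned int eax, ebx, ecx, edx;
    if (__get_cpuid(0x80000006, &eax, &ebx, &ecx, &edx))
        printf("CPUID 80000006h L2 line size: %u bytes\n", ecx & 0xff);
    return 0;
}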

Today, across all vendors (not just X86), we see L1 line sizes from 32 to 256 bytes. We see L2 from 64 to who-knows-what, but at least 256. In the high-end of Intel, all I have seen, thinking carefully, is 64 on the most recent editions.

However, back to the main topic, which is unaligned addresses. They are bad. They have always been bad. And they are worse than the claimed "0-2 cycle penalty", _far_ worse.
hgm
Posts: 28326
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Zobrist key random numbers

Post by hgm »

bob wrote:Note that (again) I did not mention a specific processor, except for the reference HGM made to the PIV 64 byte L2 line size.
Excuse me??? Are you now blind as well? Where did I mention PIV? I did write this:
bob wrote:
hgm wrote:Well, this is what I measured. I can't take any responsibility for what you read or what Intel says.

This number seems to be highly CPU dependent anyway, so it would always be good to check if what you read in fact applies to your own model. I just repeated the test for my Celeron-M and guess what:

There isn't a mis-alignment penalty at all there! :shock:
All this shows is that your measurement is flawed. Simple case: an L1 cache line is 32 bytes long.
Can you read that?

Celeron-M

Since when is a Celeron M a Pentium IV?

And you don't mention anything about cache-line length having varied; you plainly state that my test results done on the Celeron M cannot be right, because the L1 cache-line length is 32 (repeatedly!).

Is your memory span that short, or have you resorted to plain lying to weasel your way out of this?
Spock

Re: Zobrist key random numbers

Post by Spock »

bob wrote:Crafty is run on everything going all the way back to pentium pro 200 boxes, even today.
Beautiful machines. I had a slower Pentium Pro 150 running IBM OS/2 back then, though I wasn't doing computer chess at the time.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Zobrist key random numbers

Post by bob »

hgm wrote:
bob wrote:Note that (again) I did not mention a specific processor, except for the reference HGM made to the PIV 64 byte L2 line size.
Excuse me??? Are you now blind as well? Where did I mention PIV? I did write this:

"netburst". Did you not mention that? The PIV is where netburst started, although the processor was a colossal flop comared to AMD and the later core architecture.
bob wrote:
hgm wrote:Well, this is what I measured. I can't take any responsibility for what you read or what Intel says.

This number seems to be highly CPU dependent anyway, so it would always be good to check if what you read in fact applies to your own model. I just repeated the test for my Celeron-M and guess what:

There isn't a mis-alignment penalty at all there! :shock:
All this shows is that your measurement is flawed. Simple case: an L1 cache line is 32 bytes long.
Can you read that?

Celeron-M

Since when is a Celeron M a Pentium IV?

And you don't mention anything about cache-line length having varied; you plainly state that my test results done on the Celeron M cannot be right, because the L1 cache-line length is 32 (repeatedly!).

Is your memory span that short, or have you resorted to plain lying to weasel your way out of this?

HG, I did not notice the Celeron-M reference, sorry. I don't own a celeron-based machine of any kind. I have told you what I test on every day. I have a quad PII/450 box. I have a quad PIII/700 box. I have my dual PIV/2.8 box. I have some K6/K7/opteron boxes plus an opteron cluster. I have a core-1 cluster and a core-2 cluster.

I was absolutely inaccurate with respect to recent processors. I knew the core architectures used 64, not specifically for memory bandwidth issues, but because the data requirements for the CPUs doubled with 64-bit data widths. And I had read that low-end parts still used 32 bytes.

But, back to the original point, now that I have admitted what I did not know about your Celeron-M: your unaligned references are _bad_. It is a bad programming practice for all the reasons I have given. It incurs a decode penalty because it requires multiple micro-ops to piece the misaligned data together; it incurs a 2x L1, L2, or memory penalty when the misaligned data spans two cache lines/blocks. On some architectures, it will earn you an instant crash since they don't allow unaligned addresses. On the Cray, for example, you can address memory _only_ by words. If you want two pieces, as would happen in an unaligned reference, you get to program that yourself.

I would _never_ advise anyone to use a scheme that is slower, not recommended by any chip maker, and which will not work on all processors. You act like cache footprint is the only issue of any significance. It is not. And in an SMP-type box, cache footprint takes on an even less significant role than making sure you use cache in a way that doesn't produce excessive cache-to-cache traffic dealing with shared data.
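The cache-to-cache point is easy to illustrate: per-thread counters that land in the same 64-byte line ping-pong between cores. Padding each thread's slot out to its own line, as in this sketch (the struct layout and the GCC attribute spelling are illustrative, not Crafty's), trades a little footprint for far less coherence traffic:

#include <stdint.h>

#define CACHE_LINE  64
#define MAX_THREADS 8

/* one slot per search thread, padded and aligned so no two threads
   ever write into the same cache line */
struct thread_stats {
    uint64_t nodes;       /* bumped on every node searched */
    uint64_t tb_hits;
    char     pad[CACHE_LINE - 2 * sizeof(uint64_t)];
} __attribute__((aligned(CACHE_LINE)));

struct thread_stats stats[MAX_THREADS];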

As I said, and will repeat here, there are right ways to do things, and wrong ways to do them. Unaligned data is simply a bad idea, and one Intel does not guarantee will work on future generations of processors. Dealing with it causes them issues as well.

That said, do it how you want. I reserve the right to tell others how they _ought_ to do it. Although most already know...