Zobrist key random numbers

Discussion of chess software programming and technical issues.

Moderator: Ras

diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Zobrist key random numbers

Post by diep »

bob wrote:
diep wrote:
Gerd Isenberg wrote:
bob wrote: I have no idea what you mean by "clumsiness of 64 bit instructions." They don't seem clumsy to me considering the speeds I see on 64 bit processors. Yes, the compilers have to produce machine language that is semantically equivalent to the original source code you wrote. And yes, the hardware has to execute things so that the final result is identical with what would be produced if the compiler output was executed exactly in the order produced by the compiler, without any effects of speculative execution or out of order execution changing a single thing other than making it run faster...
The point is that mailbox programs, which would even compile and run efficiently on a 16-bit os, did not take advantage from 64-bit mode, despite eight additional registers and implicit fastcall.

Using r08-r15 takes an additional opcode byte. x64 calling conventions "allocate" stack space to save callee-saved registers via mov [rbp + offset], reg instead of push/pop (a much longer opcode). Accessing global data goes via register indirection. Probably compilers are not (yet) as good at optimizing for 64-bit mode (despite PGO) as their 32-bit counterparts. In general, if you don't use 64-bit values that much, most of the mentioned programs likely become larger if compiled for 64-bit mode than for 32-bit mode.
Eugene has regrettably proven to be very correct, and he still put it politely compared to GCC team member Marc Lehmann.

Eugene: "32 bits will be faster for most programs than 64 bits".

Correct, and a LOT faster.
Only bitboard engines profited, for a simple reason that has nothing to do with the number of registers.

The K7 already had some 44 rename registers and the P4 something like 128, so there never was a real shortage of registers in x86.
This is _completely_ false. Renaming registers does _not_ reduce the effects of a limited number of registers for most programming examples. Renaming allows an easy way to track the data flow for an OOE processor. But when you just have eax-edx visible to the compiler, the extra 8 registers on X86-64 make a significant difference...

Marc Lehmann: "x86-64 will be the ultimate disaster for GCC"

He has proven to be very right.

Add to this that GCC was already ugly bad in efficient code optimization.
Compilers have become one big intel show.
I use gcc on AMD processors exclusively, so I am not sure what you are talking about. It does quite well...
I am using GCC on AMD processors as well, and I agree with Marc Lehmann.

Vincent
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Zobrist key random numbers

Post by diep »

I moved the discussion of GCC's compiler behaviour to another thread:

http://talkchess.com/forum/viewtopic.php?t=26338
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Zobrist key random numbers

Post by bob »

diep wrote:
bob wrote:
diep wrote:
Gerd Isenberg wrote:
bob wrote: I have no idea what you mean by "clumsiness of 64 bit instructions." They don't seem clumsy to me considering the speeds I see on 64 bit processors. Yes, the compilers have to produce machine language that is semantically equivalent to the original source code you wrote. And yes, the hardware has to execute things so that the final result is identical with what would be produced if the compiler output was executed exactly in the order produced by the compiler, without any effects of speculative execution or out of order execution changing a single thing other than making it run faster...
The point is that mailbox programs, which would even compile and run efficiently on a 16-bit os, did not take advantage from 64-bit mode, despite eight additional registers and implicit fastcall.

Using r08-r15 takes an additional opcode byte. x64 calling conventions "allocate" stack space to save callee-saved registers via mov [rbp + offset], reg instead of push/pop (a much longer opcode). Accessing global data goes via register indirection. Probably compilers are not (yet) as good at optimizing for 64-bit mode (despite PGO) as their 32-bit counterparts. In general, if you don't use 64-bit values that much, most of the mentioned programs likely become larger if compiled for 64-bit mode than for 32-bit mode.
Eugene has regrettably proven to be very correct, and he still put it politely compared to GCC team member Marc Lehmann.

Eugene: "32 bits will be faster for most programs than 64 bits".

Correct, and a LOT faster.
Only bitboard engines profited, for a simple reason that has nothing to do with the number of registers.

The K7 already had some 44 rename registers and the P4 something like 128, so there never was a real shortage of registers in x86.
This is _completely_ false. Renaming registers does _not_ reduce the effects of a limited number of registers for most programming examples. Renaming allows an easy way to track the data flow for an OOE processor. But when you just have eax-edx visible to the compiler, the extra 8 registers on X86-64 make a significant difference...

Marc Lehmann: "x86-64 will be the ultimate disaster for GCC"

He has proven to be very right.

Add to this that GCC was already ugly bad in efficient code optimization.
Compilers have become one big intel show.
I use gcc on AMD processors exclusively, so I am not sure what you are talking about. It does quite well...
I am using GCC on AMD processors as well, and I agree with Marc Lehmann.

Vincent
That's all I run on AMD as well and if he is saying "gcc sucks" then I don't agree. It's always produced good speed numbers for me...
hgm
Posts: 28326
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Zobrist key random numbers

Post by hgm »

In my experience, the out-of-order x86 CPUs benefit very little from extra registers. There are enough registers for efficient evaluation of expressions, even when they contain a fair number of repeating sub-expressions, and due to the out-of-order execution there is never any need to work on several expressions at the same time.

The only thing that you in general cannot afford is to keep variables in a register all the time; after every expression you have to store the result to free the register set, and let the following expression load all its operands from memory again. At first glance, this might lead to unnecessary load on the cache unit.

In practice, however, this does not seem to cause any slowdown. On the contrary, when I tried to hand-optimize assembly code, I was surprised to discover that code running only from registers often runs slower than the same code using memory-based temporary variables! There seems to be some internal bottleneck in the number of registers that you can read simultaneously, i.e. the number of read ports to the register file seems not enough to fetch all operands of all instructions that could execute in parallel. Having some of the operands come not from the register file but from the load unit instead bypasses this bottleneck, and leads to higher IPC.

To this it should be added that the accumulator-like x86 instructions (i.e. with one read-modify-write operand) often do require copying of a register if you don't want its contents destroyed. This leads to a lot of data shuttling for in-register code as well. This register-to-register data shuttling occupies execution units, while shuttling the same data to and from memory would be done by the load and store units, which could operate in parallel with the execute units. So the Intel design philosophy of allowing one uOp per cycle more than the number of execution units the CPU has pretty much makes the overhead associated with memory-based temporaries completely invisible.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Zobrist key random numbers

Post by bob »

hgm wrote:In my experience, the out-of-order x86 CPUs benefit very little from extra registers. There are enough registers for efficient evaluation of expressions, even when they contain a fair number of repeating sub-expressions, and due to the out-of-order execution there is never any need to work on several expressions at the same time.

The only thing that you in general cannot afford is to keep variables in a register all the time; after every expression you have to store the result to free the register set, and let the following expression load all its operands from memory again. At first glance, this might lead to unnecessary load on the cache unit.

In practice, however, this does not seem to cause any slowdown. On the contrary, when I tried to hand-optimize assembly code, I was surprised to discover that code running only from registers often runs slower than the same code using memory-based temporary variables! There seems to be some internal bottleneck in the number of registers that you can read simultaneously, i.e. the number of read ports to the register file seems not enough to fetch all operands of all instructions that could execute in parallel. Having some of the operands come not from the register file but from the load unit instead bypasses this bottleneck, and leads to higher IPC.

To this it should be added that the accumulator-like x86 instructions (i.e. with one read-modify-write operand) often do require copying of a register if you don't want its contents destroyed. This leads to a lot of data shuttling for in-register code as well. This register-to-register data shuttling occupies execution units, while shuttling the same data to and from memory would be done by the load and store units, which could operate in parallel with the execute units. So the Intel design philosophy of allowing one uOp per cycle more than the number of execution units the CPU has pretty much makes the overhead associated with memory-based temporaries completely invisible.
I simply don't see how memory can be faster, because cache has the same "port" limitation as well. Even if it takes two cycles to read 4 registers in total, any cache hit is going to take 4 cycles for a single value on today's best hardware.

Several chess programs produced significant speedup when moving to a 64 bit platform, even though they were not 64 bit programs. I used to use gnu chess for testing and when I moved to 64 bit platforms I saw a significant speed increase when I moved to a 64 bit OS/Compiler (the first such machine I tried came with 32bit stuff). Crafty was significantly faster still, but that was because it needs 64 bit boolean operations and does 1/2 as many total operations on 64 bit as opposed to 32 bits...
hgm
Posts: 28326
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Zobrist key random numbers

Post by hgm »

My observation was on machines that take 3 cycles for a hit (Pentium II/III and M). But that is latency, so it does not necessarily affect throughput. And it does not seem to apply to data that was just written: there seems to be a bypass from the store to the load unit. So any load request not only probes the cache, but also the store queue, and it seems the response from the latter is faster than the cache latency. (Well, I guess it must be, as it must pre-empt the cache when there is a hit.) This effectively makes the store queue an independent register file, with its own port. Code that does not shuttle variables to memory and back does not make use of this extra bandwidth.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Zobrist key random numbers

Post by bob »

hgm wrote:My observation was on machines that take 3 cycles for a hit (Pentium II/III and M). But that is latency, so it does not necessarily affect throughput. And it does not seem to apply to data that was just written: there seems to be a bypass from the store to the load unit. So any load request not only probes the cache, but also the store queue, and it seems the response from the latter is faster than the cache latency. (Well, I guess it must be, as it must pre-empt the cache when there is a hit.) This effectively makes the store queue an independent register file, with its own port. Code that does not shuttle variables to memory and back does not make use of this extra bandwidth.
But here's the question. How can more registers make it _worse_??? That is what makes no sense. I don't think ports are an issue, otherwise why have all those functional units to do integer math, since registers are the only place the micro-ops that actually do computation can get their data from. And with register renaming, there is already a pot-load of registers in use at any instant of time, even with the old x86 instruction set.

The only thing that hits the load/store units is load and store micro-ops. Everything else is the usual RISC-like register-only operations. So how extra registers would hurt is beyond me, unless you are running into the longer instruction format issues. Some code won't be helped by extra registers because of how the renaming eliminates traditional register name dependencies, and such code compiled to x86-64 instructions with longer opcodes might run into a slight penalty for that...
hgm
Posts: 28326
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Zobrist key random numbers

Post by hgm »

More registers per se cannot make it worse, as the compiler could always refrain from using them. (But will it?)

Look, I have no experience with 64-bit mode, as I have only one machine that has it, and the OS with which it was delivered prevents me from running in 64-bit mode. So I have no first-hand data on it. I just wanted to point out that most of these out-of-order CPUs really seem to benefit from frequent loading and storing. Which was just as unexpected to me as it will be to you. Naively 'optimizing' the code by eliminating memory variables and better use of registers often produced very counter-productive results.

So I can imagine that an optimizing compiler that uses the same algorithm as before, with just the parameter for the number of registers set to a larger value, would in fact make all the wrong decisions.
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Zobrist key random numbers

Post by diep »

hgm wrote:More registers per se cannot make it worse, as the compiler could always refrain from using them. (But will it?)

Look, I have no experience with 64-bit mode, as I have only one machine that has it, and the OS with which it was delivered prevents me from running in 64-bit mode. So I have no first-hand data on it. I just wanted to point out that most of these out-of-order CPUs really seem to benefit from frequent loading and storing. Which was just as unexpected to me as it will be to you. Naively 'optimizing' the code by eliminating memory variables and better use of registers often produced very counter-productive results.

So I can imagine that an optimizing compiler that uses the same algorithm as before, with just the parameter for the number of registers set to a larger value, would in fact make all the wrong decisions.
Processors can fake having 16 GPRs and get away with just a handful, simply a matter of temporarily renaming some registers.

Vincent
wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: Zobrist key random numbers

Post by wgarvin »

hgm wrote:More registers per se cannot make it worse, as the compiler could always refrain from using them. (But will it?)

Look, I have no experience with 64-bit mode, as I have only one machine that has it, and the OS with which it was delivered prevents me from running in 64-bit mode. So I have no first-hand data on it. I just wanted to point out that most of these out-of-order CPUs really seem to benefit from frequent loading and storing. Which was just as unexpected to me as it will be to you. Naively 'optimizing' the code by eliminating memory variables and better use of registers often produced very counter-productive results.

So I can imagine that an optimizing compiler that uses the same algorithm as before, with just the parameter for the number of registers set to a larger value, would in fact make all the wrong decisions.
The algorithms used in x86 32-bit compilers would probably do all right on x86-64. Since the x86 register file is so small that traditional register allocators (e.g. graph-coloring ones) didn't work very well, the x86 compiler vendors put a lot of effort into instruction selection and scheduling techniques. They want to minimize dependencies and the length of dependency chains, and make sure they can issue lots of instructions to get as much ILP as possible. That's still a sound strategy for x86-64: even though they have more architecturally-visible registers in 64-bit mode, these chips have the same number of internal registers in both modes.

Since the compiler writers never had enough registers to play with, it's not surprising that the hardware evolved to make spilling to and from L1 as inexpensive as possible. I'm a little fuzzy on the details these days... but I think on most x86 chips since about the Pentium II, spilling values to L1 and then reading them back is usually very cheap. The store-forwarding hardware helps if you read a value back a few cycles later, and even if you don't read it for a few hundred cycles, it's still pretty much guaranteed to be an L1 hit, which is extremely cheap if the latency can be hidden by out-of-order execution, which it usually can be. Also, I think if you move a value to an architecturally-visible register and let it go cold (more than a few dozen cycles?), there is a small cost when it gets written back to the architecturally-visible register file to free up its temporary register. Then it has to be read back from the register file before it can be used. So if you had written it to L1 instead and then read it back from L1 a little later (while it was still a cache hit), the cost would be nearly the same. You can pretty much treat L1 as an "extended register set".

[Edit: okay, I think when you read a register that you haven't touched in a while, you can get an "ROB read port stall". Only two regular registers plus one index register can be read from the ROB per cycle. Usually on x86-32 it's not an issue, because there aren't very many registers, so values seldom sit in them until they get cold. But it does suggest that for an x86-64 compiler, using a classical RISC register allocator (e.g. a graph-coloring allocator) might not be a good idea.]