Two small in-register-lookups

Gerd Isenberg · Post by **Gerd Isenberg** » Tue Apr 24, 2007 12:28 am

Color of square - assumes 32-bit shift by mod 32.
Is that portable for 32-bit ints?
(otherwise sq&31 or sq&15 is necessary)

Code: Select all

unsigned int colorOfSquare (unsigned int sq) {
   return  (0xAA55AA55 >> sq) & 1; 
}

Center-Distance in the 0..3 range as a folded [32]-element lookup of two bits:

Code: Select all

 3 3 3 3 3 3 3 3
 3 2 2 2 2 2 2 3
 3 2 1 1 1 1 2 3
 3 2 1 0 0 1 2 3
 3 2 1 0 0 1 2 3
 3 2 1 1 1 1 2 3
 3 2 2 2 2 2 2 3
 3 3 3 3 3 3 3 3

Code: Select all

unsigned int centerDistance(unsigned int a) {
    const u64 _13 = 0xffffeaabe55be41b;
    a ^= (a - 32) >> 27;
    return  (_13  >> 2*a) & 3;
}

or 32-bit friendly:

Code: Select all

unsigned int centerDistance(unsigned int a) {
    a ^=  (a  -  32)  >> 27;
    return ((-8274523 >> a) & 1)
         | ((  -30842 >> a) & 2);
}

Cheers,
Gerd

jwes · Post by **jwes** » Tue Apr 24, 2007 11:08 am

Gerd Isenberg wrote:Color of square - assumes 32-bit shift by mod 32.
Is that portable for 32-bit ints?
(otherwise sq&31 or sq&15 is necessary)
Code: Select all
unsigned int colorOfSquare (unsigned int sq) {
   return  (0xAA55AA55 >> sq) & 1; 
}

or

Code: Select all

unsigned int colorOfSquare (unsigned int sq) {
   return  ((sq>>3)^ sq) & 1; 
}

hgm · Post by **hgm** » Tue Apr 24, 2007 11:51 am

Why would you want to use in-register lookups? Fetching such simple things from a memory table should be very competative, if not faster. I noticed that these out-of-order CPUs often perform very poorly if you only use registers, it seems that there is some internal bottleneck on how many registers you can access per clock. Having some of the operands come from memory is usually more successful in keeping the ALUs busy.

On AMD machines load and stores are piggybacked on ALU makro ops, so there is no decoder penalty in using instructions with memory operands. uOp fusion in Intel machines now does the same (Pentium M and Core CPUs), while Pentium IV circumvented decoder limits by using a trace cache.

In addition, shifts are usually a bottleneck, as most CPUs have only one ALU capable of doing shifts, wich can become a bottleneck if your code uses them heavily. So better reserve them for what you really can't do without. (Plus, the i86 machine language requires a variable shift count to be in one specific register, leading to extra register-to-register moves to implement them.)

If you are worried about L1 footprint, you could pack several such tables in 64 byte. E.g. bits 0 and 1 the center distance, bit 2 the square color, etc.

Code: Select all

colorOfSquare  = Data[sqr] & 4; // assuming it is used as a boolean, so 4=true
centerDistance = Data[sqr] & 3;

(Btw, Joker uses Wesley's method for square color, but I am considering to change that.)

Gerd Isenberg · Post by **Gerd Isenberg** » Tue Apr 24, 2007 12:58 pm

jwes wrote:
Gerd Isenberg wrote:Color of square - assumes 32-bit shift by mod 32.
Is that portable for 32-bit ints?
(otherwise sq&31 or sq&15 is necessary)
Code: Select all
unsigned int colorOfSquare (unsigned int sq) {
   return  (0xAA55AA55 >> sq) & 1; 
}
or
Code: Select all
unsigned int colorOfSquare (unsigned int sq) {
   return  ((sq>>3)^ sq) & 1; 
}

Yes, I am aware of that. It has one dependent instruction more and with my mapping (a1 == 0, -> dark square), I even need one additional xor 1 or not

Thus loading the immediate constant to eax, plus shift, mask looks not that bad, despite variable shift needs cl.

Code: Select all

; ecx == sq
mov eax, AA55AA55H
shr eax, cl
and eax, 1

Gerd Isenberg · Post by **Gerd Isenberg** » Tue Apr 24, 2007 1:00 pm

hgm wrote:Why would you want to use in-register lookups? Fetching such simple things from a memory table should be very competative, if not faster. I noticed that these out-of-order CPUs often perform very poorly if you only use registers, it seems that there is some internal bottleneck on how many registers you can access per clock. Having some of the operands come from memory is usually more successful in keeping the ALUs busy.

Bit-twiddling addiction

Some alternative methods to compute such data - no serious claims intended. Quite modest register usage on x64, RAX (EAX) and CL, one variable shift. Sometimes with some other memory accesses around such computation may be helpfull to find the "best" balance between memory accesss and register computation...

hgm wrote:On AMD machines load and stores are piggybacked on ALU makro ops, so there is no decoder penalty in using instructions with memory operands. uOp fusion in Intel machines now does the same (Pentium M and Core CPUs), while Pentium IV circumvented decoder limits by using a trace cache.

In addition, shifts are usually a bottleneck, as most CPUs have only one ALU capable of doing shifts, wich can become a bottleneck if your code uses them heavily. So better reserve them for what you really can't do without. (Plus, the i86 machine language requires a variable shift count to be in one specific register, leading to extra register-to-register moves to implement them.)

If you are worried about L1 footprint, you could pack several such tables in 64 byte. E.g. bits 0 and 1 the center distance, bit 2 the square color, etc.
Code: Select all
colorOfSquare  = Data[sqr] & 4; // assuming it is used as a boolean, so 4=true
centerDistance = Data[sqr] & 3;
(Btw, Joker uses Wesley's method for square color, but I am considering to change that.)

hgm · Post by **hgm** » Tue Apr 24, 2007 2:37 pm

You could of course replace the (sqr>>3^sqr)&1 by

Code: Select all

9*sqr & 8

for the purpose of making a boolean.

The multiplication could be implemented with a single LEA instruction:

Code: Select all

    movl  _sqr, %eax
    leal (%eax, %eax, 8), %eax
    andl  $8, %eax

Gerd Isenberg · Post by **Gerd Isenberg** » Tue Apr 24, 2007 8:36 pm

hgm wrote:You could of course replace the (sqr>>3^sqr)&1 by
Code: Select all
9*sqr & 8
for the purpose of making a boolean.

The multiplication could be implemented with a single LEA instruction:
Code: Select all
    movl  _sqr, %eax
    leal (%eax, %eax, 8), %eax
    andl  $8, %eax

Cute, lea takes only one register.
Since I need "1" for dark squares (square a1 == 0) and "0" for light squares, lea eax+8*eax+8 would be nice...

But... lea with scaling takes two cycles on K8.
And i can't force vc2005 to generate that code. It always uses imul -7

Code: Select all

((9*sqr+8)>>3) & 1

Code: Select all

 0000  8D 44 C9 08  lea eax, [eax*8 + eax + 8]
 0004  C1 E8 01     shr eax, 3
 0007  83 E0 01     and eax, 1

But with sq already in ecx, i still would prefere the in-register lookup.
Codesize is exactly the same

Code: Select all

 0000  B8 55 AA 55 AA mov eax, aa55aa55
 0005  D3 E8          shr eax, cl
 0007  83 E0 01       and eax, 1

Gerd Isenberg · Post by **Gerd Isenberg** » Tue Apr 24, 2007 10:11 pm

Three of zillions versions for a boolean subroutine, to look whether two squares share same color. And what vc2005 makes out of it...

Code: Select all

bool sameSquareColor(unsigned int sq1, unsigned int sq2) {
    //return  ((0xAA55AA55 >> sq1) ^ (sq2*9 >> 3)) & 1;
    //return  (( ((sq1 ^ sq2)>>3) ^ sq1 ^ sq2 ) ^ 1) & 1;
    return  ( (0xAA55AA55 >> sq1) ^ (sq2>>3) ^ sq2 ) & 1;
}

Code: Select all

?sameSquareColor@@YA_NII@Z PROC
  00000	6b d2 f9	 imul	 edx, -7
  00003	b8 55 aa 55 aa	 mov	 eax, aa55aa55H
  00008	d3 e8		 shr	 eax, cl
  0000a	c1 ea 03	 shr	 edx, 3
  0000d	33 c2		 xor	 eax, edx
  0000f	83 e0 01	 and	 eax, 1
  00012	c3		 ret	 0

Code: Select all

?sameSquareColor@@YA_NII@Z PROC
  00000	8b c1		 mov	 eax, ecx
  00002	33 c2		 xor	 eax, edx
  00004	c1 e8 03	 shr	 eax, 3
  00007	f7 d0		 not	 eax
  00009	33 c1		 xor	 eax, ecx
  0000b	33 c2		 xor	 eax, edx
  0000d	83 e0 01	 and	 eax, 1
  00010	c3		 ret	 0

Code: Select all

?sameSquareColor@@YA_NII@Z PROC
  00000	b8 55 aa 55 aa	 mov	 eax, aa55aa55H
  00005	d3 e8		 shr	 eax, cl
  00007	8b ca		 mov	 ecx, edx
  00009	c1 e9 03	 shr	 ecx, 3
  0000c	33 c1		 xor	 eax, ecx
  0000e	33 c2		 xor	 eax, edx
  00010	83 e0 01	 and	 eax, 1
  00013	c3		 ret	 0

handmade assembly (ml64)

Code: Select all

 00000000  _sameSquareColor PROC
 00000000  8D 44 C9 08		lea eax, [rcx + 8*rcx + 8]
 00000004  8D 0C D2		lea ecx, [rdx + 8*rdx]
 00000007  33 C1		xor eax, ecx
 00000009  C1 E8 03		shr eax, 3
 0000000C  83 E0 01		and eax, 1
 0000000F  C3			ret	 0

hgm · Post by **hgm** » Tue Apr 24, 2007 11:48 pm

You can XOR the squares first, and then do the test for one square.

Code: Select all

bool sameSquareColor(unsigned int sq1, unsigned int sq2) { 
    sq1 ^= sq2;
    return  ( (0xAA55AA55 >> sq1) & 1); 
}

Gerd Isenberg · Post by **Gerd Isenberg** » Tue Apr 24, 2007 11:57 pm

hgm wrote:You can XOR the squares first, and then do the test for one square.
Code: Select all
bool sameSquareColor(unsigned int sq1, unsigned int sq2) { 
    sq1 ^= sq2;
    return  ( (0xAA55AA55 >> sq1) & 1); 
}

Of course! Looks quite competetive now.

Code: Select all

?sameSquareColor@@YA_NII@Z PROC
  00000   33 ca       xor    ecx, edx
  00002   b8 55 aa 55 aa    mov    eax, aa55aa55H
  00005   d3 e8       shr    eax, cl
  00007   83 e0 01    and    eax, 1

Two small in-register-lookups

Two small in-register-lookups

Re: Two small in-register-lookups

Re: Two small in-register-lookups

Re: Two small in-register-lookups

Re: Two small in-register-lookups

Re: Two small in-register-lookups

Re: Two small in-register-lookups

Re: Two small in-register-lookups

Re: Two small in-register-lookups

Re: Two small in-register-lookups