Distinguish rook and bishop move efficiently

hgm · Post by **hgm** » Fri Sep 17, 2021 10:31 am

klx wrote: ↑Fri Sep 17, 2021 5:01 amWith all due respect, you are wrong on all accounts.

1. No it does not need a ULL suffix.

It does for me (gcc 3.3.4). I get a warning that the constant will be clipped. So your code is not portable to 32-bit architectures.
Also note that when 'bool' would not be a 64-bit int type, the result would get clipped to the length of the bool.

2. On x64, "movabs" loads a 64 bit immediate into a register. It does not go through memory.

Good catch. I thought immediate operands were limited to 32 bits in x64, but MOV appears to be an exception I was not aware of.

Not that it matters in terms of instruction count, though: loading from a fixed address in a 'constant table' is also a single instruction. Of course you cannot afford to keep the constant in a register permanently, in a real-life chess p'engine.

3. As I wrote in another post, at least on x64 systems and with a decent compiler, the specific expression I wrote will compile to the BT (bit test) instruction. There's no loading 1ULL, no shifting. There's essentially only MOV, XOR and BT, and then the result is in the carry flag for whatever use case you have. And if this check is repeated many times (which is likely to be the case if it needs to be optimized), the constant would be kept in a register, in which case there's only XOR and BT.

Your version is not bad, but a memory lookup for each test, even if L1 cached, will nonetheless be slower.

Best regards / Emanuel

I did actually compile your code, (-O3 optimization), and unlike what you say, it does generate the shift:

Code: Select all

	.file	"bit.c"
	.text
	.p2align 4,,15
	.globl	test
	.type	test, @function
test:
.LFB0:
	.cfi_startproc
	movabsq	$1157442765409227006, %rax
	xorl	%esi, %edi
	movl	%edi, %ecx
	shrq	%cl, %rax
	andl	$1, %eax
	ret
	.cfi_endproc
.LFE0:
	.size	test, .-test
	.ident	"GCC: (Ubuntu 8.3.0-6ubuntu1~18.10.1) 8.3.0"
	.section	.note.GNU-stack,"",@progbits

My code compiles to

Code: Select all

	.file	"bit2.c"
	.text
	.p2align 4,,15
	.globl	test
	.type	test, @function
test:
.LFB0:
	.cfi_startproc
	xorl	%esi, %edi
	leaq	kind(%rip), %rax
	movslq	%edi, %rsi
	movzbl	(%rax,%rsi), %eax
	ret
	.cfi_endproc
.LFE0:
	.size	test, .-test
	.ident	"GCC: (Ubuntu 8.3.0-6ubuntu1~18.10.1) 8.3.0"
	.section	.note.GNU-stack,"",@progbits

That does use fewer instructions. But still disappointingly many, this is probably why 64-bit code is often slower than 32-bit code, despite the larger number of registers. You need extra instructions here to extend the result of the XOR to 64-bit in order to use it as an address. The IP-relative addressing also takes an extra instruction here.

R. Tomasi · Post by **R. Tomasi** » Fri Sep 17, 2021 10:42 am

One note concerning the ULL suffix: that is per se not portable, since the width of unsigned long long is not the same across different platforms or compilers. In practice that usually makes no difference, since in most cases it will be wide enough to represent a 64bit integer. The "recommended" way to do it would be to use the UINT64_C(...) macro to write 64bit literals.

Mergi · Post by **Mergi** » Fri Sep 17, 2021 10:43 am

hgm wrote: ↑Fri Sep 17, 2021 10:31 am
I did actually compile your code, (-O3 optimization), and unlike what you say, it does generate the shift:

I think he is right, but it depends on the compiler. This is the output from clang.

EDIT: And it seems that next release version of GCC will compile to this as well, not doing any shifts anymore.

Code: Select all

        xor     edi, esi
        movabs  rax, 72340172838076926
        bt      rax, rdi
        setb    al
        ret

hgm · Post by **hgm** » Fri Sep 17, 2021 10:57 am

Still just as many instructions as the compiled version of my code. Which in a larger context would probably be improved on, by reserving one register for holding a 'global data pointer' compared to which all accesses to global variables could be references, instead of using IP-relative addressing through an extra LEA. And in a 32-bit architecture it would just be two instructions: XOR and LOAD.

klx · Post by **klx** » Fri Sep 17, 2021 12:07 pm

hgm wrote: ↑Fri Sep 17, 2021 10:57 am Still just as many instructions as the compiled version of my code. Which in a larger context would probably be improved on, by reserving one register for holding a 'global data pointer' compared to which all accesses to global variables could be references, instead of using IP-relative addressing through an extra LEA. And in a 32-bit architecture it would just be two instructions: XOR and LOAD.

In practice, once this is inlined, you'd not need the "SETB AL" since the result is in the carry flag which arguably is more useful than in a register. So in practice, I'd expect it to be fewer instructions. Also, in that larger context, as I mentioned earlier, a register could also be reserved for holding the magic constant.

The other point I'd like to make is we can't just count number of instructions. There's additional latency with the memory lookup that's not present in my version.

hgm · Post by **hgm** » Fri Sep 17, 2021 12:47 pm

Latency (if it is not extreme) is usually not very important, with out-of-order execution. The 64-byte table would typically be in L1 or L2 cache. Of course packing it into 8 bytes as a packed array of bits, which is essentially what you are doing, requires even less space. It would not work so smoothly on a 32-bit architecture, though. As I already pointed out, you won't be able to dedicate a register to such a minor task.

In my experience it is important to have a proper mix of ALU and memory operations. Code that only uses registers often performs like crap. It is not only that one of the execution units is dedicated to memory operations, so that you effectively can do one instruction per cycle less when there aren't any LOADs or STOREs. It also seems the number of ports to the register file is a bottleneck, which makes it impossible to execute as many ALU instructions as you have ALUs if not some of the operands come directly from memory (through the register bypass).

klx · Post by **klx** » Fri Sep 17, 2021 4:19 pm

hgm wrote: ↑Fri Sep 17, 2021 12:47 pm Latency (if it is not extreme) is usually not very important, with out-of-order execution. /../ As I already pointed out, you won't be able to dedicate a register to such a minor task.

Yeah that's possible, on the other hand, if there's a need to shave 0.25 nanoseconds off this method, I'm going to assume it's part of some hot loop in which case latency and register availability might be a thing. Also, the OP specifically asked for a method that avoided array lookups.

Another point here, my version is much more SIMD friendly than an array lookup. E.g. if you need to call it for multiple from/to cells, the compiler will be able to turn many expressions into SIMD code.

hgm · Post by **hgm** » Fri Sep 17, 2021 4:27 pm

klx wrote: ↑Fri Sep 17, 2021 4:19 pmYeah that's possible, on the other hand, if there's a need to shave 0.25 nanoseconds off this method, I'm going to assume it's part of some hot loop in which case latency and register availability might be a thing. Also, the OP specifically asked for a method that avoided array lookups.

Ah, I missed that.

Another point here, my version is much more SIMD friendly than an array lookup. E.g. if you need to call it for multiple from/to cells, the compiler will be able to turn many expressions into SIMD code.

Well, forget about that. A chess engine will have no use for that. In fact I see very little application for diagnosing moves this way.

klx · Post by **klx** » Fri Sep 17, 2021 5:20 pm

hgm wrote: ↑Fri Sep 17, 2021 4:27 pm Well, forget about that. A chess engine will have no use for that. In fact I see very little application for diagnosing moves this way.

Well I wouldn't be so sure about that, let's say we have this method:

Code: Select all

int num_rook_moves(int f1, int t1, int f2, int t2, int f3, int t3, int f4, int t4) {
    return is_rook_move(f1, t1) + is_rook_move(f2, t2) + is_rook_move(f3, t3) + is_rook_move(f4, t4);
}

My version would SIMD this into:

Code: Select all

num_rook_moves(int, int, int, int, int, int, int, int):             # @num_rook_moves(int, int, int, int, int, int, int, int)
        vmovd   xmm0, ecx
        vpinsrd xmm0, xmm0, esi, 1
        vpinsrd xmm0, xmm0, r9d, 2
        vpinsrd xmm0, xmm0, dword ptr [rsp + 16], 3
        vmovd   xmm1, edx
        vpinsrd xmm1, xmm1, edi, 1
        vpinsrd xmm1, xmm1, r8d, 2
        vpinsrd xmm1, xmm1, dword ptr [rsp + 8], 3
        vpxor   xmm0, xmm0, xmm1
        vpmovzxdq       ymm0, xmm0              # ymm0 = xmm0[0],zero,xmm0[1],zero,xmm0[2],zero,xmm0[3],zero
        vpbroadcastq    ymm1, qword ptr [rip + .LCPI9_0] # ymm1 = [1,1,1,1]
        vpsllvq ymm0, ymm1, ymm0
        vpbroadcastq    ymm1, qword ptr [rip + .LCPI9_1] # ymm1 = [72340172838076926,72340172838076926,72340172838076926,72340172838076926]
        vpand   ymm0, ymm0, ymm1
        vpxor   xmm1, xmm1, xmm1
        vpcmpeqq        ymm0, ymm0, ymm1
        vpcmpeqd        ymm1, ymm1, ymm1
        vpxor   ymm0, ymm0, ymm1
        vextracti128    xmm1, ymm0, 1
        vpackssdw       xmm0, xmm0, xmm1
        vpbroadcastd    xmm1, dword ptr [rip + .LCPI9_2] # xmm1 = [1,1,1,1]
        vpand   xmm1, xmm0, xmm1
        vpshufd xmm1, xmm1, 238                 # xmm1 = xmm1[2,3,2,3]
        vpsubd  xmm0, xmm1, xmm0
        vpshufd xmm1, xmm0, 85                  # xmm1 = xmm0[1,1,1,1]
        vpaddd  xmm0, xmm0, xmm1
        vmovd   eax, xmm0
        vzeroupper
        ret

While yours would fail:

Code: Select all

num_rook_moves(int, int, int, int, int, int, int, int):             # @num_rook_moves(int, int, int, int, int, int, int, int)
        mov     eax, dword ptr [rsp + 16]
        xor     edi, esi
        movsxd  rsi, edi
        movzx   esi, byte ptr [rsi + kind]
        xor     edx, ecx
        movsxd  rcx, edx
        movzx   ecx, byte ptr [rcx + kind]
        add     ecx, esi
        xor     r8d, r9d
        movsxd  rdx, r8d
        movzx   edx, byte ptr [rdx + kind]
        add     edx, ecx
        xor     eax, dword ptr [rsp + 8]
        cdqe
        movzx   eax, byte ptr [rax + kind]
        add     eax, edx
        ret

hgm · Post by **hgm** » Fri Sep 17, 2021 7:16 pm

The point is that that method would be totally useless.

Distinguish rook and bishop move efficiently

Re: Distinguish rook and bishop move efficiently

Re: Distinguish rook and bishop move efficiently

Re: Distinguish rook and bishop move efficiently

Re: Distinguish rook and bishop move efficiently

Re: Distinguish rook and bishop move efficiently

Re: Distinguish rook and bishop move efficiently

Re: Distinguish rook and bishop move efficiently

Re: Distinguish rook and bishop move efficiently

Re: Distinguish rook and bishop move efficiently

Re: Distinguish rook and bishop move efficiently