Kindergarten bitboards without multiplying

Discussion of chess software programming and technical issues.

Moderator: Ras

wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: Kindergarten bitboards without multiplying

Post by wgarvin »

I wonder if anyone has tried the PMOVMSKB instruction for files? It should be a pretty fast way to collect the bits. You could PCMPEQB with zero, and then PMOVMSKB. There's some additional cost to get it into an XMM register, but the result goes into a general-purpose register.
Gerd Isenberg
Posts: 2251
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: Kindergarten bitboards without multiplying

Post by Gerd Isenberg »

Gian-Carlo Pascutto wrote: In any case there is no need to do 64 bit multiplies with Kindergarten.
Sure, imul64 on 32-bit is a call with three imuls. But Piotr's proposal seems even a faster solution than a dedicated 2 times 32-bit mul approach.
Gerd Isenberg
Posts: 2251
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: Kindergarten bitboards without multiplying

Post by Gerd Isenberg »

Piotr Cichy wrote:I measured both the cycles of single (inlined) call to PackV/PackH
and the cycles of loop calling PackV/PackH.

For single call I got 29 (for mul) and 10 (for Pack).
For loops (100000000 calls) I got: 1 728 175 538 and 608 045 461.

The time was measured with simple use of RDTSC:

inline __int64 Ticks()
{
__int64 x;
_asm {
rdtsc
mov dword ptr x+4,edx
mov dword ptr x,eax
}
return x;
}

I don't know how exact this is, but should be much better than GetTickCount() of Windows API.
Sure, GetTickCount() resolution is far to coarse for that purpose, but you may try cpuid with eax zero before rdtsc, to make sure it doesn't executed out of order.
Gerd Isenberg
Posts: 2251
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: Kindergarten bitboards without multiplying

Post by Gerd Isenberg »

wgarvin wrote:I wonder if anyone has tried the PMOVMSKB instruction for files? It should be a pretty fast way to collect the bits. You could PCMPEQB with zero, and then PMOVMSKB. There's some additional cost to get it into an XMM register, but the result goes into a general-purpose register.
Yes, I think on core2 and K10 a PMOVMSKB approach is very competitive.
Gian-Carlo Pascutto
Posts: 1260
Joined: Sat Dec 13, 2008 7:00 pm

Re: Kindergarten bitboards without multiplying

Post by Gian-Carlo Pascutto »

Gerd Isenberg wrote:
Gian-Carlo Pascutto wrote: In any case there is no need to do 64 bit multiplies with Kindergarten.
Sure, imul64 on 32-bit is a call with three imuls. But Piotr's proposal seems even a faster solution than a dedicated 2 times 32-bit mul approach.
Ah, really? I would have expected that the 32 bit folded kindergarten (1 mul per direction) would be faster than a series of dependent shifts.
Gian-Carlo Pascutto
Posts: 1260
Joined: Sat Dec 13, 2008 7:00 pm

Re: Kindergarten bitboards without multiplying

Post by Gian-Carlo Pascutto »

Gerd Isenberg wrote: Sure, GetTickCount() resolution is far to coarse for that purpose, but you may try cpuid with eax zero before rdtsc, to make sure it doesn't executed out of order.
Still doesn't guarantee it stays on the same CPU. Again: use QueryPerformanceCounter or get burned.
Gerd Isenberg
Posts: 2251
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: Kindergarten bitboards without multiplying

Post by Gerd Isenberg »

Gian-Carlo Pascutto wrote:
Gerd Isenberg wrote:
Gian-Carlo Pascutto wrote: In any case there is no need to do 64 bit multiplies with Kindergarten.
Sure, imul64 on 32-bit is a call with three imuls. But Piotr's proposal seems even a faster solution than a dedicated 2 times 32-bit mul approach.
Ah, really? I would have expected that the 32 bit folded kindergarten (1 mul per direction) would be faster than a series of dependent shifts.
Ok, I was probably too optimistic, since I wrongly referred two 32-bit imuls per direction. Still even with one imul you have some shifts around. Processing two lines in parallel (for bishop or rook) relaxes the dependencies, and the 32-bit shifts are nops in 32-bit mode, so Collapsed Files for diagonals, anti-diagonal and ranks seems very competitive, even if compiler don't emit or al, ah. I think it is worth a try if you focus on 32-bit machines.
Gerd Isenberg
Posts: 2251
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: Kindergarten bitboards without multiplying

Post by Gerd Isenberg »

Gian-Carlo Pascutto wrote:
Gerd Isenberg wrote: Sure, GetTickCount() resolution is far to coarse for that purpose, but you may try cpuid with eax zero before rdtsc, to make sure it doesn't executed out of order.
Still doesn't guarantee it stays on the same CPU. Again: use QueryPerformanceCounter or get burned.
Ok, is something similar available for gcc/linux?
Gian-Carlo Pascutto
Posts: 1260
Joined: Sat Dec 13, 2008 7:00 pm

Re: Kindergarten bitboards without multiplying

Post by Gian-Carlo Pascutto »

gettimeofday()

If you really absolutely need nanoseconds, there is clock_gettime().
wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: Kindergarten bitboards without multiplying

Post by wgarvin »

Gian-Carlo Pascutto wrote:
Gerd Isenberg wrote:
Gian-Carlo Pascutto wrote: In any case there is no need to do 64 bit multiplies with Kindergarten.
Sure, imul64 on 32-bit is a call with three imuls. But Piotr's proposal seems even a faster solution than a dedicated 2 times 32-bit mul approach.
Ah, really? I would have expected that the 32 bit folded kindergarten (1 mul per direction) would be faster than a series of dependent shifts.
Unless I misunderstand something, the shifts are not dependent on each other (but combining the result together is).