Does anyone have fast x86/x64 bitscan functions (fsb, lsb, popfirst) in C# (which are not Mono bound)? What do you use?
Playing around with these ideas, but not sure that they are worth the possible overhead (like having a delegate):
http://blogs.msdn.com/b/devinj/archive/ ... 38323.aspx
http://stackoverflow.com/questions/3216 ... in-c-sharp
http://blogs.microsoft.co.il/blogs/sash ... rom-c.aspx
Unsafe code ideas are welcome aswell.
Any thoughts?
Thanks, Balint
[.Net only] - fast bit operations
Moderators: hgm, Rebel, chrisw
-
- Posts: 71
- Joined: Fri Mar 16, 2012 10:16 am
Re: [.Net only] - fast bit operations
Any dirty tricks regarding biboard related operations are welcome. Checked/unchecked, safe/unsafe.
My current ones are available in Portfish: https://github.com/bpfliegel/Portfish
Cheers, Balint
My current ones are available in Portfish: https://github.com/bpfliegel/Portfish
Cheers, Balint
-
- Posts: 71
- Joined: Fri Mar 16, 2012 10:16 am
Re: [.Net only] - fast bit operations
Okay, so here we go - going for an x64 pop first bit function.
Idea from here:
http://blogs.msdn.com/b/devinj/archive/ ... 38323.aspx
Original C code looks like this:
C# code with some tests like this (can't even use FastCall...)
Slow as hell! (~around 1:5)
There is still some hope: the bytecode is full of junk for the first look (the 2 lines for checking was removed by me btw) . If someone could help me out to replace it into a specific optimized bytecode - we might be able to move forward...
Anyone in?
Cheers, Balint
Idea from here:
http://blogs.msdn.com/b/devinj/archive/ ... 38323.aspx
Original C code looks like this:
Code: Select all
#include "stdafx.h"
#include <intrin.h>
int __declspec(noinline) __stdcall pop_1st_bit_64(unsigned long& b) {
unsigned long index;
_BitScanForward64(&index, b);
b &= ~(1ULL<<(index));
return index;
}
Code: Select all
using System;
using System.Collections.Generic;
using System.Linq;
using System.Runtime.InteropServices;
using System.Text;
using System.Threading.Tasks;
namespace Popcount
{
// Original idea here: http://stackoverflow.com/questions/3216535/x86-x64-cpuid-in-c-sharp
public static class PopFirst
{
#region Interop
[DllImport("kernel32.dll", SetLastError = true)]
private static extern IntPtr VirtualAlloc(IntPtr lpAddress, UIntPtr dwSize, AllocationType flAllocationType,
MemoryProtection flProtect);
[DllImport("kernel32")]
private static extern bool VirtualFree(IntPtr lpAddress, UInt32 dwSize, UInt32 dwFreeType);
[Flags()]
private enum AllocationType : uint
{
COMMIT = 0x1000,
RESERVE = 0x2000,
RESET = 0x80000,
LARGE_PAGES = 0x20000000,
PHYSICAL = 0x400000,
TOP_DOWN = 0x100000,
WRITE_WATCH = 0x200000
}
[Flags()]
private enum MemoryProtection : uint
{
EXECUTE = 0x10,
EXECUTE_READ = 0x20,
EXECUTE_READWRITE = 0x40,
EXECUTE_WRITECOPY = 0x80,
NOACCESS = 0x01,
READONLY = 0x02,
READWRITE = 0x04,
WRITECOPY = 0x08,
GUARD_Modifierflag = 0x100,
NOCACHE_Modifierflag = 0x200,
WRITECOMBINE_Modifierflag = 0x400
}
#endregion
[UnmanagedFunctionPointerAttribute(CallingConvention.Cdecl)]
public delegate int PopFirstBit(ref UInt64 b);
static byte[] pop_first_x64 = new byte[]
{
0x48 ,0x89 ,0x4C ,0x24 ,0x08 //mov qword ptr [rsp+8],rcx
,0x57 //push rdi
,0x48 ,0x83 ,0xEC ,0x40 //sub rsp,40h
,0x48 ,0x8B ,0xFC //mov rdi,rsp
,0xB9 ,0x10 ,0x00 ,0x00 ,0x00 //mov ecx,10h
,0xB8 ,0xCC ,0xCC ,0xCC ,0xCC //mov eax,0CCCCCCCCh
,0xF3 ,0xAB //rep stos dword ptr [rdi]
,0x48 ,0x8B ,0x4C ,0x24 ,0x50 //mov rcx,qword ptr [b]
//unsigned long index;
//_BitScanForward64(&index, b);
,0x48 ,0x8B ,0x44 ,0x24 ,0x50 //mov rax,qword ptr [b]
,0x8B ,0x00 //mov eax,dword ptr [rax]
,0x48 ,0x0F ,0xBC ,0xC0 //bsf rax,rax
,0x89 ,0x44 ,0x24 ,0x24 //mov dword ptr [index],eax
//b &= ~(1ULL<<(index));
,0x8B ,0x44 ,0x24 ,0x24 //mov eax,dword ptr [index]
,0xB9 ,0x01 ,0x00 ,0x00 ,0x00 //mov ecx,1
,0x48 ,0x89 ,0x4C ,0x24 ,0x38 //mov qword ptr [rsp+38h],rcx
,0x0F ,0xB6 ,0xC8 //movzx ecx,al
,0x48 ,0x8B ,0x44 ,0x24 ,0x38 //mov rax,qword ptr [rsp+38h]
,0x48 ,0xD3 ,0xE0 //shl rax,cl
,0x48 ,0xF7 ,0xD0 //not rax
,0x48 ,0x8B ,0x4C ,0x24 ,0x50 //mov rcx,qword ptr [b]
,0x8B ,0x09 //mov ecx,dword ptr [rcx]
,0x48 ,0x23 ,0xC8 //and rcx,rax
,0x48 ,0x8B ,0xC1 //mov rax,rcx
,0x48 ,0x8B ,0x4C ,0x24 ,0x50 //mov rcx,qword ptr [b]
,0x89 ,0x01 //mov dword ptr [rcx],eax
//return index;
,0x8B ,0x44 ,0x24 ,0x24 // mov eax,dword ptr [index]
//}
,0x8B ,0xF8 // mov edi,eax
,0x48 ,0x8B ,0xCC // mov rcx,rsp
//,0x48 ,0x8D ,0x15 ,0x83 ,0x5F ,0x00 ,0x00 // lea rdx,[std::numeric_limits<float>::min_exponent10+26Ch (013FACBEB0h)]
//,0xE8 ,0xFE ,0xDB ,0xFF ,0xFF // call _RTC_CheckStackVars (013FAC3B30h)
,0x8B ,0xC7 // mov eax,edi
,0x48 ,0x83 ,0xC4 ,0x40 // add rsp,40h
,0x5F // pop rdi
,0xC3 // ret
};
// Taken from Portfish/Stockfish
internal static readonly int[] PopTable = new int[64];
internal static int pop_first_x64_sw(ref UInt64 b)
{
UInt64 bb = b;
b &= (b - 1);
return (PopTable[((bb & (0xffffffffffffffff - bb + 1)) * 0x218A392CD3D5DBFUL) >> 58]);
}
public static void PopFirstTest()
{
// Init (taken from Portfish/Stockfish)
for (int i = 0; i < 64; i++)
{
PopTable[((1UL << i) * 0x218A392CD3D5DBFUL) >> 58] = i;
}
IntPtr codepointer = IntPtr.Zero;
try
{
byte[] codeBytes = pop_first_x64;
codepointer = VirtualAlloc(
IntPtr.Zero,
new UIntPtr((uint)codeBytes.Length),
AllocationType.COMMIT,
MemoryProtection.EXECUTE_READWRITE
);
Marshal.Copy(codeBytes, 0, codepointer, codeBytes.Length);
PopFirstBit pop1st = (PopFirstBit)Marshal.GetDelegateForFunctionPointer(codepointer, typeof(PopFirstBit));
Console.WriteLine("test prepared, press any key");
Console.ReadKey();
for (UInt64 i = 0; i < 5000000; i++)
{
UInt64 target = i;
while (target != 0) { int first = pop1st(ref target); }
}
Console.WriteLine("first passed");
Console.ReadKey();
for (UInt64 i = 0; i < 5000000; i++)
{
UInt64 target = i;
while (target != 0) { int first = pop_first_x64_sw(ref target); }
}
Console.WriteLine("second passed");
Console.ReadKey();
}
finally
{
if (codepointer != IntPtr.Zero)
{
VirtualFree(codepointer, 0, 0x8000);
codepointer = IntPtr.Zero;
}
}
}
}
}
There is still some hope: the bytecode is full of junk for the first look (the 2 lines for checking was removed by me btw) . If someone could help me out to replace it into a specific optimized bytecode - we might be able to move forward...
Anyone in?
Cheers, Balint
-
- Posts: 2250
- Joined: Wed Mar 08, 2006 8:47 pm
- Location: Hattingen, Germany
Re: [.Net only] - fast bit operations
Hi Balint,bpfliegel wrote:
C# code with some tests like this (can't even use FastCall...)
Slow as hell! (~around 1:5)Code: Select all
[UnmanagedFunctionPointerAttribute(CallingConvention.Cdecl)] public delegate int PopFirstBit(ref UInt64 b); static byte[] pop_first_x64 = new byte[] { 0x48 ,0x89 ,0x4C ,0x24 ,0x08 //mov qword ptr [rsp+8],rcx ,0x57 //push rdi ,0x48 ,0x83 ,0xEC ,0x40 //sub rsp,40h ,0x48 ,0x8B ,0xFC //mov rdi,rsp ,0xB9 ,0x10 ,0x00 ,0x00 ,0x00 //mov ecx,10h ,0xB8 ,0xCC ,0xCC ,0xCC ,0xCC //mov eax,0CCCCCCCCh ,0xF3 ,0xAB //rep stos dword ptr [rdi] ,0x48 ,0x8B ,0x4C ,0x24 ,0x50 //mov rcx,qword ptr [b] //unsigned long index; //_BitScanForward64(&index, b); ,0x48 ,0x8B ,0x44 ,0x24 ,0x50 //mov rax,qword ptr [b] ,0x8B ,0x00 //mov eax,dword ptr [rax] ,0x48 ,0x0F ,0xBC ,0xC0 //bsf rax,rax ,0x89 ,0x44 ,0x24 ,0x24 //mov dword ptr [index],eax //b &= ~(1ULL<<(index)); ,0x8B ,0x44 ,0x24 ,0x24 //mov eax,dword ptr [index] ,0xB9 ,0x01 ,0x00 ,0x00 ,0x00 //mov ecx,1 ,0x48 ,0x89 ,0x4C ,0x24 ,0x38 //mov qword ptr [rsp+38h],rcx ,0x0F ,0xB6 ,0xC8 //movzx ecx,al ,0x48 ,0x8B ,0x44 ,0x24 ,0x38 //mov rax,qword ptr [rsp+38h] ,0x48 ,0xD3 ,0xE0 //shl rax,cl ,0x48 ,0xF7 ,0xD0 //not rax ,0x48 ,0x8B ,0x4C ,0x24 ,0x50 //mov rcx,qword ptr [b] ,0x8B ,0x09 //mov ecx,dword ptr [rcx] ,0x48 ,0x23 ,0xC8 //and rcx,rax ,0x48 ,0x8B ,0xC1 //mov rax,rcx ,0x48 ,0x8B ,0x4C ,0x24 ,0x50 //mov rcx,qword ptr [b] ,0x89 ,0x01 //mov dword ptr [rcx],eax //return index; ,0x8B ,0x44 ,0x24 ,0x24 // mov eax,dword ptr [index] //} ,0x8B ,0xF8 // mov edi,eax ,0x48 ,0x8B ,0xCC // mov rcx,rsp //,0x48 ,0x8D ,0x15 ,0x83 ,0x5F ,0x00 ,0x00 // lea rdx,[std::numeric_limits<float>::min_exponent10+26Ch (013FACBEB0h)] //,0xE8 ,0xFE ,0xDB ,0xFF ,0xFF // call _RTC_CheckStackVars (013FAC3B30h) ,0x8B ,0xC7 // mov eax,edi ,0x48 ,0x83 ,0xC4 ,0x40 // add rsp,40h ,0x5F // pop rdi ,0xC3 // ret }; // Taken from Portfish/Stockfish internal static readonly int[] PopTable = new int[64]; internal static int pop_first_x64_sw(ref UInt64 b) { UInt64 bb = b; b &= (b - 1); return (PopTable[((bb & (0xffffffffffffffff - bb + 1)) * 0x218A392CD3D5DBFUL) >> 58]); } public static void PopFirstTest() { // Init (taken from Portfish/Stockfish) for (int i = 0; i < 64; i++) { PopTable[((1UL << i) * 0x218A392CD3D5DBFUL) >> 58] = i; } IntPtr codepointer = IntPtr.Zero; try { byte[] codeBytes = pop_first_x64; codepointer = VirtualAlloc( IntPtr.Zero, new UIntPtr((uint)codeBytes.Length), AllocationType.COMMIT, MemoryProtection.EXECUTE_READWRITE ); Marshal.Copy(codeBytes, 0, codepointer, codeBytes.Length); PopFirstBit pop1st = (PopFirstBit)Marshal.GetDelegateForFunctionPointer(codepointer, typeof(PopFirstBit)); Console.WriteLine("test prepared, press any key"); Console.ReadKey(); for (UInt64 i = 0; i < 5000000; i++) { UInt64 target = i; while (target != 0) { int first = pop1st(ref target); } } Console.WriteLine("first passed"); Console.ReadKey(); for (UInt64 i = 0; i < 5000000; i++) { UInt64 target = i; while (target != 0) { int first = pop_first_x64_sw(ref target); } } Console.WriteLine("second passed"); Console.ReadKey(); } finally { if (codepointer != IntPtr.Zero) { VirtualFree(codepointer, 0, 0x8000); codepointer = IntPtr.Zero; } } } } }
There is still some hope: the bytecode is full of junk for the first look (the 2 lines for checking was removed by me btw) . If someone could help me out to replace it into a specific optimized bytecode - we might be able to move forward...
Anyone in?
Cheers, Balint
The machine language PopFirstBit seems to come from a debug version. There is much to throw out - a naked function without locals and stackframe should suffice. In bitboard serialization, I prefer a (inlined) routine for bitscan forward via call by value to reset the bit with independend "and" with one's decrement.
Code: Select all
if ( x ) do {
int idx = bitScanForward(x); // square index from 0..63
*list++ = foo(idx, ...);
} while (x &= x-1); // reset LS1B
https://chessprogramming.wikispaces.com/BitScan
https://chessprogramming.wikispaces.com ... %20Version
why (0xffffffffffffffff - bb + 1) intstead of (-bb) or (0-bb)?
Gerd
-
- Posts: 71
- Joined: Fri Mar 16, 2012 10:16 am
Re: [.Net only] - fast bit operations
Thanks Gerd, agree with all of what you wrote!
- the code itself is just a lame initial version, I have to work on the .Net related parts.
- right, bb & ~(0-bb) might be used aswell, bb & ~-bb won't compile as I remember.
- the while trick is lovely!
- I have near zero x86/x64 asm knowledge, so the code might be best replaced anyhow by something written by a human expert and not by any compiler at the end, I think. But this link is great: https://chessprogramming.wikispaces.com ... %20Version
Have to run,
Balint
- the code itself is just a lame initial version, I have to work on the .Net related parts.
- right, bb & ~(0-bb) might be used aswell, bb & ~-bb won't compile as I remember.
- the while trick is lovely!
- I have near zero x86/x64 asm knowledge, so the code might be best replaced anyhow by something written by a human expert and not by any compiler at the end, I think. But this link is great: https://chessprogramming.wikispaces.com ... %20Version
Have to run,
Balint
-
- Posts: 13
- Joined: Sat May 12, 2012 9:45 pm
Re: [.Net only] - fast bit operations
I think you mean -bb or ~bb + 1. It is indeed annoying that you can't just negate a UInt64 value. For the pop function I found that isolating the LSB first was a bit faster in practice (perft) even though it was slower when testing with random values in a loop:
Code: Select all
public static Int32 pop_first_x64_sw(ref UInt64 b) {
UInt64 bb = b & (0UL - b);
b &= b - 1;
return BitIndex[(bb * 0x218A392CD3D5DBFUL) >> 58];
}
-
- Posts: 71
- Joined: Fri Mar 16, 2012 10:16 am
Re: [.Net only] - fast bit operations
Correct. Thanks for the runtime info!zongli wrote:I think you mean -bb or ~bb + 1. It is indeed annoying that you can't just negate a UInt64 value. For the pop function I found that isolating the LSB first was a bit faster in practice (perft) even though it was slower when testing with random values in a loop:
Code: Select all
public static Int32 pop_first_x64_sw(ref UInt64 b) { UInt64 bb = b & (0UL - b); b &= b - 1; return BitIndex[(bb * 0x218A392CD3D5DBFUL) >> 58]; }
Created a new code from release build and - hopefully - proper optimizations set. Performance is still around 1:2,5 even with code security supressed, too much overhead...
Balint
Code: Select all
using System;
using System.Security;
using System.Runtime.InteropServices;
namespace Popcount
{
// Idea: http://ybeernet.blogspot.hu/2011/03/techniques-of-calling-unmanaged-code.html
// Original idea here: http://stackoverflow.com/questions/3216535/x86-x64-cpuid-in-c-sharp
public static class PopFirst
{
#region Interop
[DllImport("kernel32.dll", SetLastError = true)]
private static extern IntPtr VirtualAlloc(IntPtr lpAddress, UIntPtr dwSize, AllocationType flAllocationType,
MemoryProtection flProtect);
[DllImport("kernel32")]
private static extern bool VirtualFree(IntPtr lpAddress, UInt32 dwSize, UInt32 dwFreeType);
[Flags()]
private enum AllocationType : uint
{
COMMIT = 0x1000,
RESERVE = 0x2000,
RESET = 0x80000,
LARGE_PAGES = 0x20000000,
PHYSICAL = 0x400000,
TOP_DOWN = 0x100000,
WRITE_WATCH = 0x200000
}
[Flags()]
private enum MemoryProtection : uint
{
EXECUTE = 0x10,
EXECUTE_READ = 0x20,
EXECUTE_READWRITE = 0x40,
EXECUTE_WRITECOPY = 0x80,
NOACCESS = 0x01,
READONLY = 0x02,
READWRITE = 0x04,
WRITECOPY = 0x08,
GUARD_Modifierflag = 0x100,
NOCACHE_Modifierflag = 0x200,
WRITECOMBINE_Modifierflag = 0x400
}
#endregion
[SuppressUnmanagedCodeSecurity]
[UnmanagedFunctionPointerAttribute(CallingConvention.Cdecl)]
public delegate int PopFirstBit(ref UInt64 b);
static byte[] pop_first_x64 = new byte[]
{
0x44 ,0x8B ,0x01 //mov r8d,dword ptr [rcx]
,0x4C ,0x8B ,0xC9 //mov r9,rcx
,0xBA ,0x01 ,0x00 ,0x00 ,0x00 //mov edx,1
,0x49 ,0x0F ,0xBC ,0xC0 //bsf rax,r8
,0x8B ,0xC8 //mov ecx,eax
,0x48 ,0xD3 ,0xE2 //shl rdx,cl
,0xF7 ,0xD2 //not edx
,0x41 ,0x23 ,0xD0 //and edx,r8d
,0x41 ,0x89 ,0x11 //mov dword ptr [r9],edx
,0xC3 //ret
};
// Taken from Portfish/Stockfish
internal static readonly int[] PopTable = new int[64];
internal static int pop_first_x64_sw(ref UInt64 b)
{
UInt64 bb = b;
b &= (b - 1);
return (PopTable[((bb & (0xffffffffffffffff - bb + 1)) * 0x218A392CD3D5DBFUL) >> 58]);
}
public static void PopFirstTest()
{
// Init (taken from Portfish/Stockfish)
for (int i = 0; i < 64; i++)
{
PopTable[((1UL << i) * 0x218A392CD3D5DBFUL) >> 58] = i;
}
IntPtr codepointer = IntPtr.Zero;
try
{
byte[] codeBytes = pop_first_x64;
codepointer = VirtualAlloc(
IntPtr.Zero,
new UIntPtr((uint)codeBytes.Length),
AllocationType.COMMIT,
MemoryProtection.EXECUTE_READWRITE
);
Marshal.Copy(codeBytes, 0, codepointer, codeBytes.Length);
PopFirstBit pop1st = (PopFirstBit)Marshal.GetDelegateForFunctionPointer(codepointer, typeof(PopFirstBit));
Console.WriteLine("test prepared, press any key");
Console.ReadKey();
for (UInt64 i = 0; i < 5000000; i++)
{
UInt64 target = i;
while (target != 0)
{
int first = pop1st(ref target);
}
}
Console.WriteLine("first passed");
Console.ReadKey();
for (UInt64 i = 0; i < 5000000; i++)
{
UInt64 target = i;
while (target != 0)
{
int first = pop_first_x64_sw(ref target);
}
}
Console.WriteLine("second passed");
Console.ReadKey();
}
finally
{
if (codepointer != IntPtr.Zero)
{
VirtualFree(codepointer, 0, 0x8000);
codepointer = IntPtr.Zero;
}
}
}
}
}
-
- Posts: 71
- Joined: Fri Mar 16, 2012 10:16 am
Re: [.Net only] - fast bit operations
I retested UInt64 bb = b & (0UL - b); instead of the old code and performs somewhat better. I have a very minimal movegen code and now it's visible - did not detect it when I tested this change on Portfish earlier. But is very logical, one op less.
Thanks anyway! Balint
Thanks anyway! Balint
-
- Posts: 13
- Joined: Sat May 12, 2012 9:45 pm
Re: [.Net only] - fast bit operations
Well, assuming the compiler understands the commutative property of addition, it's the same number of operations. I think the gain comes from slightly better parallelism once the method gets inlined.
I'm not sure if you'll be able to get much out of the assembly code; it might look terse but there's a lot of instructions needed to put it in a delegate. Also, it probably can't get inlined.
I'm not sure if you'll be able to get much out of the assembly code; it might look terse but there's a lot of instructions needed to put it in a delegate. Also, it probably can't get inlined.
-
- Posts: 71
- Joined: Fri Mar 16, 2012 10:16 am
Re: [.Net only] - fast bit operations
Agree again. Let's hope someone will come up with a better technical idea - I'm abandoning this attempt.zongli wrote:Well, assuming the compiler understands the commutative property of addition, it's the same number of operations. I think the gain comes from slightly better parallelism once the method gets inlined.
I'm not sure if you'll be able to get much out of the assembly code; it might look terse but there's a lot of instructions needed to put it in a delegate. Also, it probably can't get inlined.
Thanks, Balint