Page 1 of 2

Haswell New Instructions

Posted: Fri Sep 09, 2011 4:48 am
by Zach Wegner
The new instruction specs for Haswell have been released: http://software.intel.com/file/36945

The instructions, particularly the bit manipulation instructions, look pretty awesome for chess. Check out PEXT/PDEP: instead of magic bitboards, you can just get the attack mask, extract out the relevant bits with PEXT, do a table lookup of a 2-byte value (just a compressed attack bitboard), and use PDEP to decompress back to an attack bitboard. Too bad there's not a vector version too :)

ANDN/BLSI/BLSMSK/BLSR/BZHI/SH*X should speed up many common bitboard operations, too.

Re: Haswell New Instructions

Posted: Fri Sep 09, 2011 11:27 am
by rbarreira
Very interesting.

The BLSR/BLSI instruction reset/extract the lowest set bit. I'm not aware of compiler intrinsics for those operations, so I hope that compilers will detect idioms like "x & (x-1)" and "x & (-x)" and convert them to those instructions.

Re: Haswell New Instructions

Posted: Fri Sep 09, 2011 7:46 pm
by Gerd Isenberg
Zach Wegner wrote:The new instruction specs for Haswell have been released: http://software.intel.com/file/36945

The instructions, particularly the bit manipulation instructions, look pretty awesome for chess. Check out PEXT/PDEP: instead of magic bitboards, you can just get the attack mask, extract out the relevant bits with PEXT, do a table lookup of a 2-byte value (just a compressed attack bitboard), and use PDEP to decompress back to an attack bitboard. Too bad there's not a vector version too :)

ANDN/BLSI/BLSMSK/BLSR/BZHI/SH*X should speed up many common bitboard operations, too.
Wonderful!
Time to write some new stuff in CPW ;-)

Looking for latencies ...
Intrinsics are already listed in the paper but possibly not yet available.

Code: Select all

PEXT unsigned __int64 _pext_u64(unsigned __int64 src, unsigned __int32 mask);
PDEP unsigned __int64 _pdep_u64(unsigned __int64 src, unsigned __int32 mask);
BLSR unsigned __int64 _blsr_u64(unsigned __int64 src);
Cheers,
Gerd

Re: Haswell New Instructions

Posted: Fri Sep 09, 2011 9:27 pm
by Zach Wegner
Gerd Isenberg wrote:
Zach Wegner wrote:The new instruction specs for Haswell have been released: http://software.intel.com/file/36945

The instructions, particularly the bit manipulation instructions, look pretty awesome for chess. Check out PEXT/PDEP: instead of magic bitboards, you can just get the attack mask, extract out the relevant bits with PEXT, do a table lookup of a 2-byte value (just a compressed attack bitboard), and use PDEP to decompress back to an attack bitboard. Too bad there's not a vector version too :)

ANDN/BLSI/BLSMSK/BLSR/BZHI/SH*X should speed up many common bitboard operations, too.
Wonderful!
Time to write some new stuff in CPW ;-)

Looking for latencies ...
Intrinsics are already listed in the paper but possibly not yet available.

Code: Select all

PEXT unsigned __int64 _pext_u64(unsigned __int64 src, unsigned __int32 mask);
PDEP unsigned __int64 _pdep_u64(unsigned __int64 src, unsigned __int32 mask);
BLSR unsigned __int64 _blsr_u64(unsigned __int64 src);
Cheers,
Gerd
There's a bug in their intrinsic definitions there: the 64-bit versions should take a 64-bit mask, of course.

Can't wait to get my hand on some silicon...

Re: Haswell New Instructions

Posted: Fri Sep 09, 2011 10:06 pm
by wgarvin
Zach Wegner wrote:The new instruction specs for Haswell have been released: http://software.intel.com/file/36945

The instructions, particularly the bit manipulation instructions, look pretty awesome for chess. Check out PEXT/PDEP: instead of magic bitboards, you can just get the attack mask, extract out the relevant bits with PEXT, do a table lookup of a 2-byte value (just a compressed attack bitboard), and use PDEP to decompress back to an attack bitboard. Too bad there's not a vector version too :)

ANDN/BLSI/BLSMSK/BLSR/BZHI/SH*X should speed up many common bitboard operations, too.
Using PEXT for indices would give away the advantage of magic multiply where most squares have magics that give you a 1-bit-smaller index. Maybe it will be faster than a multiply, and still worth it?

I'm sure several of these instructions will find niche uses though for chessprogramming. Compressing attack bitboards and using PDEP to uncompress them does sound like an easy win.

Re: Haswell New Instructions

Posted: Sat Sep 10, 2011 1:33 am
by rbarreira
Gerd Isenberg wrote: Intrinsics are already listed in the paper but possibly not yet available.

Code: Select all

PEXT unsigned __int64 _pext_u64(unsigned __int64 src, unsigned __int32 mask);
PDEP unsigned __int64 _pdep_u64(unsigned __int64 src, unsigned __int32 mask);
BLSR unsigned __int64 _blsr_u64(unsigned __int64 src);
Cheers,
Gerd
Right... Unfortunately they're probably icc only, not gcc. But if gcc adds intrinsics for it (and I'm guessing they will), icc will probably add the same ones since it's always gcc compatible. So it's probably all good.

Re: Haswell New Instructions

Posted: Sat Sep 10, 2011 7:32 am
by jwes
wgarvin wrote:
Zach Wegner wrote:The new instruction specs for Haswell have been released: http://software.intel.com/file/36945

The instructions, particularly the bit manipulation instructions, look pretty awesome for chess. Check out PEXT/PDEP: instead of magic bitboards, you can just get the attack mask, extract out the relevant bits with PEXT, do a table lookup of a 2-byte value (just a compressed attack bitboard), and use PDEP to decompress back to an attack bitboard. Too bad there's not a vector version too :)

ANDN/BLSI/BLSMSK/BLSR/BZHI/SH*X should speed up many common bitboard operations, too.
Using PEXT for indices would give away the advantage of magic multiply where most squares have magics that give you a 1-bit-smaller index. Maybe it will be faster than a multiply, and still worth it?
By not extracting the bits on the edges of the board and the bit of the moving piece, wouldn't you come up with 10 bits for rook attacks and 9 bits for bishop attacks, thus needing a 1K by 64 qword table for rooks and 512 by 64 qword table for bishops.

Re: Haswell New Instructions

Posted: Sat Sep 10, 2011 7:54 pm
by wgarvin
jwes wrote:
wgarvin wrote:
Zach Wegner wrote:The new instruction specs for Haswell have been released: http://software.intel.com/file/36945

The instructions, particularly the bit manipulation instructions, look pretty awesome for chess. Check out PEXT/PDEP: instead of magic bitboards, you can just get the attack mask, extract out the relevant bits with PEXT, do a table lookup of a 2-byte value (just a compressed attack bitboard), and use PDEP to decompress back to an attack bitboard. Too bad there's not a vector version too :)

ANDN/BLSI/BLSMSK/BLSR/BZHI/SH*X should speed up many common bitboard operations, too.
Using PEXT for indices would give away the advantage of magic multiply where most squares have magics that give you a 1-bit-smaller index. Maybe it will be faster than a multiply, and still worth it?
By not extracting the bits on the edges of the board and the bit of the moving piece, wouldn't you come up with 10 bits for rook attacks and 9 bits for bishop attacks, thus needing a 1K by 64 qword table for rooks and 512 by 64 qword table for bishops.
You'd extract exactly the same bits that are used as input to a variable-shift magic multiply. However, the index formed by magic mul, shift usually has 1 bit less. So the tables would approximately double in size.
[edit: it might still be worth it though, since the PEXT replaces both the multiply and the variable shift... in effect you get the variable index sizes "for free" with PEXT, which is pretty nice]

However, the PDEP at the end shrinks table size by a factor of four and requires no extra memory indirection, so that seems like a pure win.

Re: Haswell New Instructions

Posted: Sun Sep 11, 2011 12:19 am
by wgarvin
I thought of another idea... If there wasn't already a dedicated POPCNT instruction, it could easily be implemented with a PEXT, TZCNT sequence [edit: or PEXT, NOT, LZCNT might be better, I guess]

Re: Haswell New Instructions

Posted: Mon Sep 12, 2011 8:12 pm
by Gerd Isenberg
If pext/pdep pairs well, minimal Kindergarten BBs will get a new kick as well, ignoring redundancy of the outer squares for same extract and deposit masks.

Code: Select all

rankMask dq 64 dup (?) ; include squares
fileMask dq 64 dup (?) ; include squares
firstRankLookup db 256 dup (?)

; input:  rcx square index 0..63, not changed
;         rdx occupancy, not changed
; output: rax rook attacks

  pext    rax, rdx, quad word ptr [rankMask + 8*rcx]
  pext    r08, rdx, quad word ptr [fileMask + 8*rcx]
  movzx   rax, byte ptr [firstRankLookup + rax]
  movzx   r08, byte ptr [firstRankLookup + r08]
  pdep    rax, rax, quad word ptr [rankMask + 8*rcx]
  pdep    r08, r08, quad word ptr [fileMask + 8*rcx]
  or      rax, r08