Haswell New Instructions

Zach Wegner · Post by **Zach Wegner** » Fri Sep 09, 2011 4:48 am

The new instruction specs for Haswell have been released: http://software.intel.com/file/36945

The instructions, particularly the bit manipulation instructions, look pretty awesome for chess. Check out PEXT/PDEP: instead of magic bitboards, you can just get the attack mask, extract out the relevant bits with PEXT, do a table lookup of a 2-byte value (just a compressed attack bitboard), and use PDEP to decompress back to an attack bitboard. Too bad there's not a vector version too

ANDN/BLSI/BLSMSK/BLSR/BZHI/SH*X should speed up many common bitboard operations, too.

rbarreira · Post by **rbarreira** » Fri Sep 09, 2011 11:27 am

Very interesting.

The BLSR/BLSI instruction reset/extract the lowest set bit. I'm not aware of compiler intrinsics for those operations, so I hope that compilers will detect idioms like "x & (x-1)" and "x & (-x)" and convert them to those instructions.

Gerd Isenberg · Post by **Gerd Isenberg** » Fri Sep 09, 2011 7:46 pm

Zach Wegner wrote:The new instruction specs for Haswell have been released: http://software.intel.com/file/36945

The instructions, particularly the bit manipulation instructions, look pretty awesome for chess. Check out PEXT/PDEP: instead of magic bitboards, you can just get the attack mask, extract out the relevant bits with PEXT, do a table lookup of a 2-byte value (just a compressed attack bitboard), and use PDEP to decompress back to an attack bitboard. Too bad there's not a vector version too

ANDN/BLSI/BLSMSK/BLSR/BZHI/SH*X should speed up many common bitboard operations, too.

Wonderful!
Time to write some new stuff in CPW

Looking for latencies ...
Intrinsics are already listed in the paper but possibly not yet available.

Code: Select all

PEXT unsigned __int64 _pext_u64&#40;unsigned __int64 src, unsigned __int32 mask&#41;;
PDEP unsigned __int64 _pdep_u64&#40;unsigned __int64 src, unsigned __int32 mask&#41;;
BLSR unsigned __int64 _blsr_u64&#40;unsigned __int64 src&#41;;

Cheers,
Gerd

Zach Wegner · Post by **Zach Wegner** » Fri Sep 09, 2011 9:27 pm

Gerd Isenberg wrote:
Zach Wegner wrote:The new instruction specs for Haswell have been released: http://software.intel.com/file/36945

The instructions, particularly the bit manipulation instructions, look pretty awesome for chess. Check out PEXT/PDEP: instead of magic bitboards, you can just get the attack mask, extract out the relevant bits with PEXT, do a table lookup of a 2-byte value (just a compressed attack bitboard), and use PDEP to decompress back to an attack bitboard. Too bad there's not a vector version too

ANDN/BLSI/BLSMSK/BLSR/BZHI/SH*X should speed up many common bitboard operations, too.
Wonderful!
Time to write some new stuff in CPW

Looking for latencies ...
Intrinsics are already listed in the paper but possibly not yet available.
Code: Select all
PEXT unsigned __int64 _pext_u64&#40;unsigned __int64 src, unsigned __int32 mask&#41;;
PDEP unsigned __int64 _pdep_u64&#40;unsigned __int64 src, unsigned __int32 mask&#41;;
BLSR unsigned __int64 _blsr_u64&#40;unsigned __int64 src&#41;;
Cheers,
Gerd

There's a bug in their intrinsic definitions there: the 64-bit versions should take a 64-bit mask, of course.

Can't wait to get my hand on some silicon...

wgarvin · Post by **wgarvin** » Fri Sep 09, 2011 10:06 pm

Zach Wegner wrote:The new instruction specs for Haswell have been released: http://software.intel.com/file/36945

The instructions, particularly the bit manipulation instructions, look pretty awesome for chess. Check out PEXT/PDEP: instead of magic bitboards, you can just get the attack mask, extract out the relevant bits with PEXT, do a table lookup of a 2-byte value (just a compressed attack bitboard), and use PDEP to decompress back to an attack bitboard. Too bad there's not a vector version too

ANDN/BLSI/BLSMSK/BLSR/BZHI/SH*X should speed up many common bitboard operations, too.

Using PEXT for indices would give away the advantage of magic multiply where most squares have magics that give you a 1-bit-smaller index. Maybe it will be faster than a multiply, and still worth it?

I'm sure several of these instructions will find niche uses though for chessprogramming. Compressing attack bitboards and using PDEP to uncompress them does sound like an easy win.

rbarreira · Post by **rbarreira** » Sat Sep 10, 2011 1:33 am

Gerd Isenberg wrote: Intrinsics are already listed in the paper but possibly not yet available.
Code: Select all
PEXT unsigned __int64 _pext_u64&#40;unsigned __int64 src, unsigned __int32 mask&#41;;
PDEP unsigned __int64 _pdep_u64&#40;unsigned __int64 src, unsigned __int32 mask&#41;;
BLSR unsigned __int64 _blsr_u64&#40;unsigned __int64 src&#41;;
Cheers,
Gerd

Right... Unfortunately they're probably icc only, not gcc. But if gcc adds intrinsics for it (and I'm guessing they will), icc will probably add the same ones since it's always gcc compatible. So it's probably all good.

jwes · Post by **jwes** » Sat Sep 10, 2011 7:32 am

wgarvin wrote:
Zach Wegner wrote:The new instruction specs for Haswell have been released: http://software.intel.com/file/36945

The instructions, particularly the bit manipulation instructions, look pretty awesome for chess. Check out PEXT/PDEP: instead of magic bitboards, you can just get the attack mask, extract out the relevant bits with PEXT, do a table lookup of a 2-byte value (just a compressed attack bitboard), and use PDEP to decompress back to an attack bitboard. Too bad there's not a vector version too

ANDN/BLSI/BLSMSK/BLSR/BZHI/SH*X should speed up many common bitboard operations, too.
Using PEXT for indices would give away the advantage of magic multiply where most squares have magics that give you a 1-bit-smaller index. Maybe it will be faster than a multiply, and still worth it?

By not extracting the bits on the edges of the board and the bit of the moving piece, wouldn't you come up with 10 bits for rook attacks and 9 bits for bishop attacks, thus needing a 1K by 64 qword table for rooks and 512 by 64 qword table for bishops.

wgarvin · Post by **wgarvin** » Sat Sep 10, 2011 7:54 pm

jwes wrote:
wgarvin wrote:
Zach Wegner wrote:The new instruction specs for Haswell have been released: http://software.intel.com/file/36945

The instructions, particularly the bit manipulation instructions, look pretty awesome for chess. Check out PEXT/PDEP: instead of magic bitboards, you can just get the attack mask, extract out the relevant bits with PEXT, do a table lookup of a 2-byte value (just a compressed attack bitboard), and use PDEP to decompress back to an attack bitboard. Too bad there's not a vector version too

ANDN/BLSI/BLSMSK/BLSR/BZHI/SH*X should speed up many common bitboard operations, too.
Using PEXT for indices would give away the advantage of magic multiply where most squares have magics that give you a 1-bit-smaller index. Maybe it will be faster than a multiply, and still worth it?
By not extracting the bits on the edges of the board and the bit of the moving piece, wouldn't you come up with 10 bits for rook attacks and 9 bits for bishop attacks, thus needing a 1K by 64 qword table for rooks and 512 by 64 qword table for bishops.

You'd extract exactly the same bits that are used as input to a variable-shift magic multiply. However, the index formed by magic mul, shift usually has 1 bit less. So the tables would approximately double in size.
[edit: it might still be worth it though, since the PEXT replaces both the multiply and the variable shift... in effect you get the variable index sizes "for free" with PEXT, which is pretty nice]

However, the PDEP at the end shrinks table size by a factor of four and requires no extra memory indirection, so that seems like a pure win.

wgarvin · Post by **wgarvin** » Sun Sep 11, 2011 12:19 am

I thought of another idea... If there wasn't already a dedicated POPCNT instruction, it could easily be implemented with a PEXT, TZCNT sequence [edit: or PEXT, NOT, LZCNT might be better, I guess]

Gerd Isenberg · Post by **Gerd Isenberg** » Mon Sep 12, 2011 8:12 pm

If pext/pdep pairs well, minimal Kindergarten BBs will get a new kick as well, ignoring redundancy of the outer squares for same extract and deposit masks.

Code: Select all

rankMask dq 64 dup (?) ; include squares
fileMask dq 64 dup (?) ; include squares
firstRankLookup db 256 dup (?)

; input&#58;  rcx square index 0..63, not changed
;         rdx occupancy, not changed
; output&#58; rax rook attacks

  pext    rax, rdx, quad word ptr &#91;rankMask + 8*rcx&#93;
  pext    r08, rdx, quad word ptr &#91;fileMask + 8*rcx&#93;
  movzx   rax, byte ptr &#91;firstRankLookup + rax&#93;
  movzx   r08, byte ptr &#91;firstRankLookup + r08&#93;
  pdep    rax, rax, quad word ptr &#91;rankMask + 8*rcx&#93;
  pdep    r08, r08, quad word ptr &#91;fileMask + 8*rcx&#93;
  or      rax, r08

Haswell New Instructions

Haswell New Instructions

Re: Haswell New Instructions

Re: Haswell New Instructions

Re: Haswell New Instructions

Re: Haswell New Instructions

Re: Haswell New Instructions

Re: Haswell New Instructions

Re: Haswell New Instructions

Re: Haswell New Instructions

Re: Haswell New Instructions