Haswell New Instructions

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

User avatar
Zach Wegner
Posts: 1922
Joined: Thu Mar 09, 2006 12:51 am
Location: Earth

Haswell New Instructions

Post by Zach Wegner »

The new instruction specs for Haswell have been released: http://software.intel.com/file/36945

The instructions, particularly the bit manipulation instructions, look pretty awesome for chess. Check out PEXT/PDEP: instead of magic bitboards, you can just get the attack mask, extract out the relevant bits with PEXT, do a table lookup of a 2-byte value (just a compressed attack bitboard), and use PDEP to decompress back to an attack bitboard. Too bad there's not a vector version too :)

ANDN/BLSI/BLSMSK/BLSR/BZHI/SH*X should speed up many common bitboard operations, too.
rbarreira
Posts: 900
Joined: Tue Apr 27, 2010 3:48 pm

Re: Haswell New Instructions

Post by rbarreira »

Very interesting.

The BLSR/BLSI instruction reset/extract the lowest set bit. I'm not aware of compiler intrinsics for those operations, so I hope that compilers will detect idioms like "x & (x-1)" and "x & (-x)" and convert them to those instructions.
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: Haswell New Instructions

Post by Gerd Isenberg »

Zach Wegner wrote:The new instruction specs for Haswell have been released: http://software.intel.com/file/36945

The instructions, particularly the bit manipulation instructions, look pretty awesome for chess. Check out PEXT/PDEP: instead of magic bitboards, you can just get the attack mask, extract out the relevant bits with PEXT, do a table lookup of a 2-byte value (just a compressed attack bitboard), and use PDEP to decompress back to an attack bitboard. Too bad there's not a vector version too :)

ANDN/BLSI/BLSMSK/BLSR/BZHI/SH*X should speed up many common bitboard operations, too.
Wonderful!
Time to write some new stuff in CPW ;-)

Looking for latencies ...
Intrinsics are already listed in the paper but possibly not yet available.

Code: Select all

PEXT unsigned __int64 _pext_u64(unsigned __int64 src, unsigned __int32 mask);
PDEP unsigned __int64 _pdep_u64(unsigned __int64 src, unsigned __int32 mask);
BLSR unsigned __int64 _blsr_u64(unsigned __int64 src);
Cheers,
Gerd
User avatar
Zach Wegner
Posts: 1922
Joined: Thu Mar 09, 2006 12:51 am
Location: Earth

Re: Haswell New Instructions

Post by Zach Wegner »

Gerd Isenberg wrote:
Zach Wegner wrote:The new instruction specs for Haswell have been released: http://software.intel.com/file/36945

The instructions, particularly the bit manipulation instructions, look pretty awesome for chess. Check out PEXT/PDEP: instead of magic bitboards, you can just get the attack mask, extract out the relevant bits with PEXT, do a table lookup of a 2-byte value (just a compressed attack bitboard), and use PDEP to decompress back to an attack bitboard. Too bad there's not a vector version too :)

ANDN/BLSI/BLSMSK/BLSR/BZHI/SH*X should speed up many common bitboard operations, too.
Wonderful!
Time to write some new stuff in CPW ;-)

Looking for latencies ...
Intrinsics are already listed in the paper but possibly not yet available.

Code: Select all

PEXT unsigned __int64 _pext_u64(unsigned __int64 src, unsigned __int32 mask);
PDEP unsigned __int64 _pdep_u64(unsigned __int64 src, unsigned __int32 mask);
BLSR unsigned __int64 _blsr_u64(unsigned __int64 src);
Cheers,
Gerd
There's a bug in their intrinsic definitions there: the 64-bit versions should take a 64-bit mask, of course.

Can't wait to get my hand on some silicon...
wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: Haswell New Instructions

Post by wgarvin »

Zach Wegner wrote:The new instruction specs for Haswell have been released: http://software.intel.com/file/36945

The instructions, particularly the bit manipulation instructions, look pretty awesome for chess. Check out PEXT/PDEP: instead of magic bitboards, you can just get the attack mask, extract out the relevant bits with PEXT, do a table lookup of a 2-byte value (just a compressed attack bitboard), and use PDEP to decompress back to an attack bitboard. Too bad there's not a vector version too :)

ANDN/BLSI/BLSMSK/BLSR/BZHI/SH*X should speed up many common bitboard operations, too.
Using PEXT for indices would give away the advantage of magic multiply where most squares have magics that give you a 1-bit-smaller index. Maybe it will be faster than a multiply, and still worth it?

I'm sure several of these instructions will find niche uses though for chessprogramming. Compressing attack bitboards and using PDEP to uncompress them does sound like an easy win.
rbarreira
Posts: 900
Joined: Tue Apr 27, 2010 3:48 pm

Re: Haswell New Instructions

Post by rbarreira »

Gerd Isenberg wrote: Intrinsics are already listed in the paper but possibly not yet available.

Code: Select all

PEXT unsigned __int64 _pext_u64(unsigned __int64 src, unsigned __int32 mask);
PDEP unsigned __int64 _pdep_u64(unsigned __int64 src, unsigned __int32 mask);
BLSR unsigned __int64 _blsr_u64(unsigned __int64 src);
Cheers,
Gerd
Right... Unfortunately they're probably icc only, not gcc. But if gcc adds intrinsics for it (and I'm guessing they will), icc will probably add the same ones since it's always gcc compatible. So it's probably all good.
jwes
Posts: 778
Joined: Sat Jul 01, 2006 7:11 am

Re: Haswell New Instructions

Post by jwes »

wgarvin wrote:
Zach Wegner wrote:The new instruction specs for Haswell have been released: http://software.intel.com/file/36945

The instructions, particularly the bit manipulation instructions, look pretty awesome for chess. Check out PEXT/PDEP: instead of magic bitboards, you can just get the attack mask, extract out the relevant bits with PEXT, do a table lookup of a 2-byte value (just a compressed attack bitboard), and use PDEP to decompress back to an attack bitboard. Too bad there's not a vector version too :)

ANDN/BLSI/BLSMSK/BLSR/BZHI/SH*X should speed up many common bitboard operations, too.
Using PEXT for indices would give away the advantage of magic multiply where most squares have magics that give you a 1-bit-smaller index. Maybe it will be faster than a multiply, and still worth it?
By not extracting the bits on the edges of the board and the bit of the moving piece, wouldn't you come up with 10 bits for rook attacks and 9 bits for bishop attacks, thus needing a 1K by 64 qword table for rooks and 512 by 64 qword table for bishops.
wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: Haswell New Instructions

Post by wgarvin »

jwes wrote:
wgarvin wrote:
Zach Wegner wrote:The new instruction specs for Haswell have been released: http://software.intel.com/file/36945

The instructions, particularly the bit manipulation instructions, look pretty awesome for chess. Check out PEXT/PDEP: instead of magic bitboards, you can just get the attack mask, extract out the relevant bits with PEXT, do a table lookup of a 2-byte value (just a compressed attack bitboard), and use PDEP to decompress back to an attack bitboard. Too bad there's not a vector version too :)

ANDN/BLSI/BLSMSK/BLSR/BZHI/SH*X should speed up many common bitboard operations, too.
Using PEXT for indices would give away the advantage of magic multiply where most squares have magics that give you a 1-bit-smaller index. Maybe it will be faster than a multiply, and still worth it?
By not extracting the bits on the edges of the board and the bit of the moving piece, wouldn't you come up with 10 bits for rook attacks and 9 bits for bishop attacks, thus needing a 1K by 64 qword table for rooks and 512 by 64 qword table for bishops.
You'd extract exactly the same bits that are used as input to a variable-shift magic multiply. However, the index formed by magic mul, shift usually has 1 bit less. So the tables would approximately double in size.
[edit: it might still be worth it though, since the PEXT replaces both the multiply and the variable shift... in effect you get the variable index sizes "for free" with PEXT, which is pretty nice]

However, the PDEP at the end shrinks table size by a factor of four and requires no extra memory indirection, so that seems like a pure win.
wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: Haswell New Instructions

Post by wgarvin »

I thought of another idea... If there wasn't already a dedicated POPCNT instruction, it could easily be implemented with a PEXT, TZCNT sequence [edit: or PEXT, NOT, LZCNT might be better, I guess]
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: Haswell New Instructions

Post by Gerd Isenberg »

If pext/pdep pairs well, minimal Kindergarten BBs will get a new kick as well, ignoring redundancy of the outer squares for same extract and deposit masks.

Code: Select all

rankMask dq 64 dup (?) ; include squares
fileMask dq 64 dup (?) ; include squares
firstRankLookup db 256 dup (?)

; input:  rcx square index 0..63, not changed
;         rdx occupancy, not changed
; output: rax rook attacks

  pext    rax, rdx, quad word ptr [rankMask + 8*rcx]
  pext    r08, rdx, quad word ptr [fileMask + 8*rcx]
  movzx   rax, byte ptr [firstRankLookup + rax]
  movzx   r08, byte ptr [firstRankLookup + r08]
  pdep    rax, rax, quad word ptr [rankMask + 8*rcx]
  pdep    r08, r08, quad word ptr [fileMask + 8*rcx]
  or      rax, r08