SSE5 AMD

Nid Hogge · Post by **Nid Hogge** » Thu Aug 30, 2007 10:46 pm

Hi, thought it could be interesting for you guys.

AMD Announces SSE5 :

New instructions include:
* Fused multiply accumulate (FMACxx) instructions
* Integer multiply accumulate (IMAC, IMADC) instructions
* Permutation and conditional move instructions
* Vector compare and test instructions
* Precision control, rounding, and conversion instructions
* 46 base instructions that expand to 170 total instructions.

Here's the whole definition :
http://developer.amd.com/assets/sse5_43 ... -27-07.pdf (Long!)

Examples :

SSE3 :

SSE5:

Article about the new instruction set :
http://www.anandtech.com/cpuchipsets/sh ... i=3073&p=1

Official page :
http://developer.amd.com/sse5.jsp
Lots of new instructions! Wonder if they could help chess at some point!?

Gerd Isenberg · Post by **Gerd Isenberg** » Fri Aug 31, 2007 8:35 am

Nid Hogge wrote:Hi, thought it could be interesting for you guys.

Thanks. Some very interesting integer instructions: conditional SIMD moves, dot products, rotates, byte-wise shifts, and packed permute bytes with bit reversal as an operator. The question is whether Intel and AMD will come up with more and more disjoint instruction sets in the future, or whether they will cooperate.

Nid Hogge · Post by **Nid Hogge** » Fri Aug 31, 2007 8:52 am

Gerd Isenberg wrote:
Nid Hogge wrote:Hi, thought it could be interesting for you guys.
Thanks. Some very interesting integer instructions: conditional SIMD moves, dot products, rotates, byte-wise shifts, and packed permute bytes with bit reversal as an operator. The question is whether Intel and AMD will come up with more and more disjoint instruction sets in the future, or whether they will cooperate.

Welcome. It's hard to imagine Intel and AMD cooperating on anything, tbh.
I see this path continuing.

But at least the court ruling says that they can implement each other's instruction sets for free.
AMD has adopted SSE, SSE2, and SSE3, and will probably implement SSE4 in the next generations.
I'm sure Intel is already working on a new instruction set of their own as we speak.

Gerd Isenberg · Post by **Gerd Isenberg** » Fri Aug 31, 2007 11:27 am

PPERM is really a freaky instruction. It allows, for instance, all kinds of reversals, byte as well as bit reversals. Each of the 16 destination bytes can be any one of the 32 source bytes, with an individual unary logical operation such as invert or reverse applied to it.

PPERM Packed Permute Bytes

Moves any of the 32 packed bytes in the source operands to each byte of the destination XMM register. Each byte of the result can optionally have a logical operation applied to it. The 32-byte source operand consists of the second source operand (src2) concatenated with the first source operand (src1). The third source operand (src3) contains control bytes specifying the source byte and the logical operation for each destination byte. The destination register is an XMM register addressed by the DREX.dest field.

The PPERM instruction requires four operands:

PPERM dest, src1, src2, src3

For each byte of the 16-byte result, the corresponding byte in src3 is used as follows:
• bits 4:0 of src3 select one of the 32 bytes from src2:src1
• bits 7:5 of src3 select the logical operation applied.

000 Source byte (no logical operation)
001 Invert source byte
010 Bit reverse of source byte
011 Bit reverse of inverted source byte
100 0x00
101 0xFF
110 Most significant bit of source byte replicated in all bit positions.
111 Invert most significant bit of source byte and replicate in all bit positions.
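
A minimal scalar sketch of the behaviour described above may make the control-byte encoding easier to follow. It is only a model in plain C, not an intrinsic: the names pperm_model and reverse_bits8 are made up for illustration, and it assumes that source indices 0..15 address src1 and 16..31 address src2 (one reading of "src2:src1").

```c
#include <stdint.h>

/* Reverse the bit order within one byte (bit 0 <-> bit 7, and so on). */
static uint8_t reverse_bits8(uint8_t b)
{
    b = (uint8_t)((b >> 4) | (b << 4));
    b = (uint8_t)(((b & 0xCC) >> 2) | ((b & 0x33) << 2));
    b = (uint8_t)(((b & 0xAA) >> 1) | ((b & 0x55) << 1));
    return b;
}

/* Hypothetical scalar model of PPERM dest, src1, src2, src3.
   Assumption: indices 0..15 select from src1, 16..31 from src2. */
static void pperm_model(uint8_t dest[16], const uint8_t src1[16],
                        const uint8_t src2[16], const uint8_t src3[16])
{
    uint8_t out[16];   /* temporary, so dest may alias a source */
    int i;

    for (i = 0; i < 16; ++i) {
        unsigned sel = src3[i] & 0x1Fu;   /* bits 4:0: source byte index */
        unsigned op  = src3[i] >> 5;      /* bits 7:5: logical operation */
        uint8_t  s   = sel < 16 ? src1[sel] : src2[sel - 16];

        switch (op) {
        case 0:  out[i] = s;                            break; /* 000 */
        case 1:  out[i] = (uint8_t)~s;                  break; /* 001 */
        case 2:  out[i] = reverse_bits8(s);             break; /* 010 */
        case 3:  out[i] = reverse_bits8((uint8_t)~s);   break; /* 011 */
        case 4:  out[i] = 0x00;                         break; /* 100 */
        case 5:  out[i] = 0xFF;                         break; /* 101 */
        case 6:  out[i] = (s & 0x80) ? 0xFF : 0x00;     break; /* 110 */
        default: out[i] = (s & 0x80) ? 0x00 : 0xFF;     break; /* 111 */
        }
    }
    for (i = 0; i < 16; ++i)
        dest[i] = out[i];
}
```

For example, a control byte of 0x41 (op 010, index 1) would place the bit-reversed copy of source byte 1 into the corresponding destination byte.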

Harald · Post by **Harald** » Fri Aug 31, 2007 9:38 pm

Gerd Isenberg wrote:PPERM is really a freaky instruction.

Do you already have a new bitboard method for us?

Perhaps something based on the hyperbola or kindergarten idea.

At least the whole 64 bits of a bitboard could be reversed with one
pperm instruction (reorder 8 bytes and for each reverse the bits).
You could even reverse two bitboards with one instruction.
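
To illustrate the reversal idea, here is a rough scalar sketch in C of what such a pperm would compute for one bitboard. The control bytes follow the PPERM description quoted earlier in the thread (010 in bits 7:5 for bit reverse, the source byte index in bits 4:0); the function names are hypothetical, and nothing here uses a real SSE5 intrinsic.

```c
#include <stdint.h>

/* Reverse the bit order within one byte. */
static uint8_t reverse_bits8(uint8_t b)
{
    b = (uint8_t)((b >> 4) | (b << 4));
    b = (uint8_t)(((b & 0xCC) >> 2) | ((b & 0x33) << 2));
    b = (uint8_t)(((b & 0xAA) >> 1) | ((b & 0x55) << 1));
    return b;
}

/* Control bytes for one bitboard: destination byte i takes source
   byte 7-i with operation 010 (bit reverse). Assumed encoding. */
static void make_reverse_control(uint8_t ctrl[8])
{
    int i;
    for (i = 0; i < 8; ++i)
        ctrl[i] = (uint8_t)(0x40 | (7 - i));
}

/* Scalar model: reorder the 8 bytes and reverse the bits in each one,
   which reverses all 64 bits of the bitboard. */
static uint64_t reverse_bitboard(uint64_t bb)
{
    uint8_t  ctrl[8];
    uint64_t r = 0;
    int i;

    make_reverse_control(ctrl);
    for (i = 0; i < 8; ++i) {
        unsigned sel = ctrl[i] & 0x1Fu;                /* source byte index */
        uint8_t  s   = (uint8_t)(bb >> (8 * sel));     /* fetch source byte */
        r |= (uint64_t)reverse_bits8(s) << (8 * i);    /* store reversed    */
    }
    return r;
}
```

Sixteen such control bytes (the second eight selecting source bytes 15 down to 8) would cover two bitboards held in one 128-bit register.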

Or could we do a make_move() function with a few pperm instructions?
Dest, src1 and src2 could be parts of the char board[64], src3 would be
filled with the from and to infos of the move. Captures included.

Is there an instruction that takes a bit pattern from a bitboard, multiplies
it as a vector with a byte vector and calculates the sum as word? Like this:
short mobility_value = sum(center_bit_pattern * center_mobility_weights);
short king_safety_value = sum(danger_square_bits * danger_square_weights);

Harald

Gerd Isenberg · Post by **Gerd Isenberg** » Sat Sep 01, 2007 12:14 am

Harald wrote:
Gerd Isenberg wrote:PPERM is really a freaky instruction.
Do you already have a new bitboard method for us?
Perhaps something based on the hyperbola or kindergarten idea.

At least the whole 64 bits of a bitboard could be reversed with one
pperm instruction (reorder 8 bytes and for each reverse the bits).
You could even reverse two bitboards with one instruction.

Or could we do a make_move() function with a few pperm instructions?
Dest, src1 and src2 could be parts of the char board[64], src3 would be
filled with the from and to infos of the move. Captures included.

Is there an instruction that takes a bit pattern from a bitboard, multiplies
it as a vector with a byte vector and calculates the sum as word? Like this:
short mobility_value = sum(center_bit_pattern * center_mobility_weights);
short king_safety_value = sum(danger_square_bits * danger_square_weights);

Harald

Of course you can apply a pperm-hyperbola with ranks also, a generalized algorithm for ray[2] by passing twoMasks[sq] like antidia:diagonal[sq] or rank:file[sq]. I wonder how fast pperm will be in two or three years, when we see the first processors with this instruction set. Smells like an expensive vector path instruction with several cycles of latency.

Interesting are the generalized SIMD shifts/rotates, which shift/rotate byte/word/dword/qword-wise left with a positive amount but right with a negative amount. You may also shift each element left/right by a different amount. The SSE5 three/four-operand opcodes, with a disjoint target register and two/three sources, save some moves.
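
As a rough illustration of the signed shift amounts, here is a scalar sketch in C for two qword lanes. It only models the behaviour as described in this post (positive count shifts left, negative shifts right, one count per element); the helper names are hypothetical, and the handling of counts with magnitude 64 or more is an assumption of the sketch, not taken from the specification.

```c
#include <stdint.h>

/* One 64-bit lane: positive count = logical shift left,
   negative count = logical shift right.
   Assumption: counts of magnitude >= 64 give 0 in this model. */
static uint64_t shift_signed64(uint64_t x, int count)
{
    if (count >= 64 || count <= -64)
        return 0;
    return count >= 0 ? x << count : x >> -count;
}

/* Two qword lanes, each shifted by its own signed amount. */
static void shift_lanes64(uint64_t dest[2], const uint64_t src[2],
                          const int8_t counts[2])
{
    int i;
    for (i = 0; i < 2; ++i)
        dest[i] = shift_signed64(src[i], counts[i]);
}
```

With counts of {8, -8}, for instance, the same bitboard could be shifted one rank up in the low lane and one rank down in the high lane in a single step.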

On the bit[64]*byte[64] dot-product - I haven't studied the new muladd-instructions that closely - but I guess there are some improvements possible, compared to the SSE2-one:


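#include <emmintrin.h>   // SSE2 intrinsics
#include <stdint.h>

typedef uint64_t u64;
typedef uint8_t  u8;

// XMM_ALIGN is not defined in the post; a 16-byte alignment attribute is assumed here.
#ifdef _MSC_VER
#define XMM_ALIGN __declspec(align(16))
#else
#define XMM_ALIGN __attribute__((aligned(16)))
#endif

// weights must point to 64 bytes, 16-byte aligned
int dotProduct64(u64 bb, u8 weights[] /* XMM_ALIGN */)
{
   static const u64 XMM_ALIGN sbitmask[2] = {
      0x8040201008040201ULL,
      0x8040201008040201ULL
   };
   __m128i x0, x1, x2, x3, bm;
   bm = _mm_load_si128  ( (const __m128i*) sbitmask);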
   x0 = _mm_loadl_epi64 ( (__m128i*) &bb);  // 0000000000000000:8040201008040201
   // extend bits to bytes 
   x0 = _mm_unpacklo_epi8  (x0, x0);        // 8080404020201010:0808040402020101
   x2 = _mm_unpackhi_epi16 (x0, x0);        // 8080808040404040:2020202010101010
   x0 = _mm_unpacklo_epi16 (x0, x0);        // 0808080804040404:0202020201010101
   x1 = _mm_unpackhi_epi32 (x0, x0);        // 0808080808080808:0404040404040404
   x0 = _mm_unpacklo_epi32 (x0, x0);        // 0202020202020202:0101010101010101
   x3 = _mm_unpackhi_epi32 (x2, x2);        // 8080808080808080:4040404040404040
   x2 = _mm_unpacklo_epi32 (x2, x2);        // 2020202020202020:1010101010101010
   x0 = _mm_cmpeq_epi8 (_mm_and_si128 (x0, bm), bm);
   x1 = _mm_cmpeq_epi8 (_mm_and_si128 (x1, bm), bm);
   x2 = _mm_cmpeq_epi8 (_mm_and_si128 (x2, bm), bm);
   x3 = _mm_cmpeq_epi8 (_mm_and_si128 (x3, bm), bm);
   // multiply by "and" with -1 or 0
   __m128i* pw = (__m128i*) weights;
   x0 = _mm_and_si128  (x0, pw[0]);		
   x1 = _mm_and_si128  (x1, pw[1]);
   x2 = _mm_and_si128  (x2, pw[2]);
   x3 = _mm_and_si128  (x3, pw[3]);
   // add all bytes (with saturation)
   x0 = _mm_adds_epu8  (x0, x1);		
   x0 = _mm_adds_epu8  (x0, x2);
   x0 = _mm_adds_epu8  (x0, x3);
   x0 = _mm_sad_epu8   (x0, _mm_setzero_si128 ());
   return _mm_extract_epi16 (x0, 0)
        + _mm_extract_epi16 (x0, 4);
}

Gerd
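
For checking the SSE2 routine above, a plain scalar reference can be handy. The following is a sketch in C that tries to mirror the routine's arithmetic, including the saturating byte adds: squares j, 16+j, 32+j and 48+j share one byte lane, so each of the 16 lanes is clamped at 255 before the lanes are summed. The name dotProduct64_ref is made up for illustration.

```c
#include <stdint.h>

/* Scalar reference for the bit[64] * byte[64] dot product.
   Mirrors the saturating adds of the SSE2 version: each of the 16
   byte lanes accumulates up to four weights and is clamped at 255. */
static int dotProduct64_ref(uint64_t bb, const uint8_t weights[64])
{
    int total = 0;
    int lane, k;

    for (lane = 0; lane < 16; ++lane) {
        int laneSum = 0;
        for (k = 0; k < 4; ++k) {
            int sq = 16 * k + lane;          /* square sharing this lane */
            if ((bb >> sq) & 1)
                laneSum += weights[sq];      /* weight counts if bit set */
        }
        total += laneSum > 255 ? 255 : laneSum;   /* unsigned saturation */
    }
    return total;
}
```

As long as the four weights that share a lane sum to at most 255, the clamp never triggers and the result is the ordinary dot product.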
