Diep tested on latest AMD and Intel processors

diep · Post by **diep** » Mon Mar 31, 2008 3:01 am

http://arstechnica.com/reviews/hardware ... view.ars/3

The most important thing to look at is the last page: power consumption, which was also measured with Diep.

(Thanks to Joel Hruska for using Diep for that.)

95 watt difference. Boy oh boy.

Vincent

Gerd Isenberg · Post by **Gerd Isenberg** » Mon Mar 31, 2008 6:14 pm

diep wrote: http://arstechnica.com/reviews/hardware ... view.ars/3

The most important thing to look at is the last page: power consumption, which was also measured with Diep.

(Thanks to Joel Hruska for using Diep for that.)

95 watt difference. Boy oh boy.

Vincent

Hi Vincent,

How many microwatt-seconds does Diep take per node on those processors?

Intel clearly has the edge now. The future Nehalem, and even more so Sandy Bridge with its 256-bit wide vector extensions (AVX) and a three-operand, RISC-like instruction set similar to AMD's announced 128-bit SSE5, sound very interesting. Four bitboards in one register! Each bitboard may be shifted by a different generalized +/- amount. A lot of shuffling and permutation stuff - Wow!

Cheers,
Gerd

Pradu · Post by **Pradu** » Mon Mar 31, 2008 11:32 pm

Gerd Isenberg wrote: Four bitboards in one register! Each bitboard may be shifted by a different generalized +/- amount. A lot of shuffling and permutation stuff - Wow!

Cheers,
Gerd

How much faster do you guess fills become with quadbitboards? I'm sure there'll be a plethora of new bitboard tricks we can do with 256 bits if it becomes competitively fast.

Gerd Isenberg · Post by **Gerd Isenberg** » Tue Apr 01, 2008 8:32 am

Pradu wrote: How much faster do you guess fills become with quadbitboards? I'm sure there'll be a plethora of new bitboard tricks we can do with 256 bits if it becomes competitively fast.

With recent Intel CPUs or K10 we already have a throughput of three independent 128-bit instructions per cycle, so this is guesswork. Assuming 256-bit ALUs and buses (which might not be the case in the first processor generation of Sandy Bridge), we may perform some stuff 1.5 to 2 times faster.

The generalized, independent shifts are definitely useful for single bitboards, where you initialize a 256-bit register with copies (shuffle) of one bitboard in order to shift in four directions with one instruction. But with distinct bitboards inside a quad you'll likely shift each vector by the same amount, with an immediate shift per direction, to do several directions in parallel with multiple registers. Actually MSVC schedules the C++ code, based on an SSE2-intrinsic wrapper, quite nicely. It fills a quadbitboard by Kogge-Stone (e.g. wSliders:bSliders, bKing:wKing) interlaced with two opposite directions (north, south) in one run. With wider registers we often need additional shuffles or unpacks to arrange bitboards horizontally to further combine them.
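To make the interlaced Kogge-Stone fill concrete, here is a minimal portable sketch. It is not the SSE2 wrapper mentioned above: `Quad`, `Fills` and `fillNorthSouth` are hypothetical names, and a plain array of four 64-bit lanes stands in for one wide register. Every lane is shifted by the same amount per step, and the north and south fills run in the same loop.

```cpp
#include <array>
#include <cstdint>

using U64 = std::uint64_t;
using Quad = std::array<U64, 4>;   // four bitboards, modelling one wide register

struct Fills { Quad north, south; };

// Kogge-Stone occluded fill of all four lanes, north and south interlaced.
// sliders: generator sets; empty: propagator (empty squares) per lane.
// Assumes bit 0 = a1 and bit 63 = h8, so north is a left shift by 8.
Fills fillNorthSouth(const Quad& sliders, const Quad& empty) {
    Quad gN = sliders, pN = empty;
    Quad gS = sliders, pS = empty;
    for (int shift = 8; shift <= 32; shift *= 2) {
        for (int i = 0; i < 4; ++i) {
            gN[i] |= pN[i] & (gN[i] << shift);   // extend north by 'shift'
            pN[i] &= pN[i] << shift;
            gS[i] |= pS[i] & (gS[i] >> shift);   // extend south by 'shift'
            pS[i] &= pS[i] >> shift;
        }
    }
    return {gN, gS};
}
```

The result is an occluded fill: it contains the sliders plus the empty squares they can reach, and stops before the first blocker. A real SIMD version would replace the inner lane loop with one vector instruction per line.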

PPERM, as specified by AMD's SSE5, is able to reverse bitboards and looks interesting for pure calculation of attacks à la hyperbola quintessence. With 256-bit registers we would be able to calculate the four attacking lines of queen attacks in one run, likely at max IPC.
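For readers who have not seen hyperbola quintessence, a scalar sketch of one of the four lines (the file) may help. It uses a byte swap as the reversal, which is sufficient for files; `byteswap` and `fileAttacks` are illustrative names, and the square mapping (bit 0 = a1, bit 63 = h8) is an assumption.

```cpp
#include <cstdint>

using U64 = std::uint64_t;

// Portable byte swap: reverses the order of the eight ranks.
U64 byteswap(U64 x) {
    x = ((x & 0x00FF00FF00FF00FFULL) << 8)  | ((x >> 8)  & 0x00FF00FF00FF00FFULL);
    x = ((x & 0x0000FFFF0000FFFFULL) << 16) | ((x >> 16) & 0x0000FFFF0000FFFFULL);
    return (x << 32) | (x >> 32);
}

// File attacks of a slider on square sq, given the full occupancy.
// The subtraction finds the first blocker above the slider; the same
// subtraction on the reversed board finds the first blocker below it.
U64 fileAttacks(U64 occ, int sq) {
    const U64 slider  = 1ULL << sq;
    const U64 line    = (0x0101010101010101ULL << (sq & 7)) & ~slider;
    U64 forward = occ & line;            // blockers on the file, slider excluded
    U64 reverse = byteswap(forward);
    forward -= slider;
    reverse -= byteswap(slider);
    forward ^= byteswap(reverse);
    return forward & line;               // attacked squares, blockers included
}
```

With a wide register, the idea in the post is to run this same subtract, reverse, xor sequence on the file, rank and both diagonals at once, one line per 64-bit lane.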

bit[64]*char[64] or bit[64]*short[64] dot products will clearly profit, even more so with future 512-bit register sets.
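As a scalar reference for what such a dot product computes, here is a small sketch: each set bit of the bitboard selects the weight of its square, and the selected weights are summed. The name `dotProduct` and the `signed char` weight table are assumptions for illustration; a SIMD version would expand the bits to byte masks and add many weights per instruction.

```cpp
#include <array>
#include <cstdint>

using U64 = std::uint64_t;

// bit[64] * char[64]: sum of weights[i] over all squares i set in bb.
// Written without branches so that each term is (bit i) * weights[i].
int dotProduct(U64 bb, const std::array<signed char, 64>& weights) {
    int sum = 0;
    for (int i = 0; i < 64; ++i) {
        sum += static_cast<int>((bb >> i) & 1ULL) * weights[i];
    }
    return sum;
}
```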
