Stockfish haswell optimized build

j_romang
Posts: 79
Joined: Mon May 16, 2011 2:52 am

Stockfish haswell optimized build

Post by j_romang »

Hello,
I just tried making a Haswell-optimized build: https://www.dropbox.com/s/ghbs1vw18q6q4 ... 8_bmi2.zip
Thanks to Ronald de Man's code, I implemented BMI2 instructions in Stockfish. This build also supports his Syzygy tablebases.
Please tell me if it works; you should see a ~4% speedup over the corresponding abrok.eu version. Of course, you need a Haswell processor to run it!
Please note that this is NOT an official build, just an experiment :wink:
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: Stockfish haswell optimized build

Post by Gerd Isenberg »

j_romang wrote: Hello,
I just tried making a Haswell-optimized build: https://www.dropbox.com/s/ghbs1vw18q6q4 ... 8_bmi2.zip
Thanks to Ronald de Man's code, I implemented BMI2 instructions in Stockfish. This build also supports his Syzygy tablebases.
Please tell me if it works; you should see a ~4% speedup over the corresponding abrok.eu version. Of course, you need a Haswell processor to run it!
Please note that this is NOT an official build, just an experiment :wink:
Is this the attack code you used instead of magics?
https://github.com/syzygy1/tb/blob/master/src/bmi2.h

Or is there something better in the meantime, or are there more BMI2 changes elsewhere?

Thanks,
Gerd
j_romang
Posts: 79
Joined: Mon May 16, 2011 2:52 am

Re: Stockfish haswell optimized build

Post by j_romang »

Here is the attack code; there are no changes elsewhere:
https://github.com/jromang/Stockfish/bl ... ard.h#L271
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: Stockfish haswell optimized build

Post by Gerd Isenberg »

j_romang wrote: Here is the attack code; there are no changes elsewhere:
https://github.com/jromang/Stockfish/bl ... ard.h#L271
Thanks. Yes, that is Ronald's PDEP/PEXT approach with its 210.5k lookup table. Wow, 4% is a huge speedup considering that only the attack getters changed, though of course it also affects memory and cache behavior in other areas of the program. Assuming you also tried the PEXT-only approach with tables four times larger, Ronald had the right instinct. Congratulations!
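
For reference, the "4 times" figure follows from the entry width alone. The sketch below is only an illustration: it takes the 107648-entry table size from the code posted further down in this thread, the names kEntries and table_bytes are hypothetical, and it counts the attack table itself without the small per-square info structs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Number of attack-table entries, as declared in the code posted in this thread.
constexpr std::size_t kEntries = 107648;

// Size in bytes of the attack table for a given entry width.
constexpr std::size_t table_bytes(std::size_t entry_size) {
  return kEntries * entry_size;
}

// PEXT+PDEP variant stores 16-bit compressed attack sets;
// PEXT-only variant stores full 64-bit bitboards.
inline void check_table_sizes() {
  assert(table_bytes(sizeof(std::uint16_t)) == 215296);  // roughly 210 KiB
  assert(table_bytes(sizeof(std::uint64_t)) == 861184);  // exactly 841 KiB
  assert(table_bytes(sizeof(std::uint64_t)) ==
         4 * table_bytes(sizeof(std::uint16_t)));
}
```

So the trade-off under discussion is one extra PDEP per lookup against a table that is about 630 KiB smaller.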

http://www.talkchess.com/forum/viewtopi ... 11&start=3

Cheers,
Gerd
phenri
Posts: 284
Joined: Tue Aug 13, 2013 9:44 am

Re: Stockfish haswell optimized build

Post by phenri »

I will probably have no use for this patch for some months, or even years,
but thank you in advance.

I have a question that has nothing to do with BMI2.
I'm asking you because I do not really expect a response from Marco.

I am probably wrong, but I want to understand.
Why does POPCNT come with the flag -msse3 in the makefile instead of -msse4.2, when POPCNT is present only on architectures with at least SSE4.2?

And why is the flag -mpopcnt not included?

Code:

### 3.9 popcnt
ifeq ($(popcnt),yes)
	CXXFLAGS += -msse3 -DUSE_POPCNT
endif
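
One possible reading, offered as an assumption and not confirmed anywhere in this thread: the -m flags only control which instructions the compiler may generate on its own (for example, GCC emits popcnt for __builtin_popcountll only with -mpopcnt or a flag that implies it). Code that writes the instruction out in inline assembly does not depend on such a flag. A minimal sketch of that idea follows; popcount64 is a hypothetical name and this is not Stockfish's code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: counts the set bits of a 64-bit word.
inline int popcount64(std::uint64_t b) {
#if defined(USE_POPCNT) && defined(__GNUC__) && defined(__x86_64__)
  // Hardware path: the instruction is spelled out explicitly, so the
  // compiler does not need -mpopcnt to be allowed to generate it.
  // The CPU must still support popcnt at run time.
  __asm__("popcnt %1, %0" : "=r"(b) : "r"(b));
  return static_cast<int>(b);
#else
  // Portable fallback: clear the lowest set bit until none remain.
  int n = 0;
  for (; b; b &= b - 1) ++n;
  return n;
#endif
}

// Small sanity check usable with either path.
inline void popcount_self_test() {
  assert(popcount64(0) == 0);
  assert(popcount64(0xFFULL) == 8);
}
```

Built without -DUSE_POPCNT this takes the fallback path and runs on any CPU.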

The same goes for prefetch: why is the flag so low?

Code:

### 3.7 prefetch
ifeq ($(prefetch),yes)
	ifeq ($(sse),yes)
		CXXFLAGS += -msse
		DEPENDFLAGS += -msse
	endif
else
	CXXFLAGS += -DNO_PREFETCH
endif

Regards,

Paul
syzygy
Posts: 5559
Joined: Tue Feb 28, 2012 11:56 pm

Re: Stockfish haswell optimized build

Post by syzygy »

It might be interesting to try this.
In bitboard.h change

Code:

struct BMI2Info {
  unsigned short *data;
  uint64_t mask1;
  uint64_t mask2;
};

extern unsigned short attack_table[107648];
extern struct BMI2Info bishop_bmi2[64];
extern struct BMI2Info rook_bmi2[64];

template<PieceType Pt>
inline Bitboard attacks_bb(Square s, Bitboard occ) {
  struct BMI2Info *info = (Pt == ROOK ? &rook_bmi2[s] : &bishop_bmi2[s]);
  return _pdep_u64(info->data[_pext_u64(occ, info->mask1)], info->mask2);
}
into

Code:

struct BMI2Info {
  uint64_t *data;
  uint64_t mask;
};

extern struct BMI2Info bishop_bmi2[64];
extern struct BMI2Info rook_bmi2[64];

template<PieceType Pt>
inline Bitboard attacks_bb(Square s, Bitboard occ) {
  struct BMI2Info *info = (Pt == ROOK ? &rook_bmi2[s] : &bishop_bmi2[s]);
  return info->data[_pext_u64(occ, info->mask)];
}
In bitboard.cpp change the lines

Code:

static unsigned short attacks_table[107648];
...
        info[sq].mask1 = bb
...
          if (i == 0)
            info[sq].mask2 = bb2;
          attacks_table[idx++] = _pext_u64(bb2, info[sq].mask2);
into

Code:

static uint64_t attacks_table[107648];
...
        info[sq].mask = bb
...
          attacks_table[idx++] = bb2;
I did not test this, so maybe something is wrong or missing.

This might be faster and this might be slower, but it would be interesting to know.
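
To make the two variants concrete, here is a self-contained sketch, restricted to rooks, that builds both tables and looks attacks up either way. It is an illustration under stated assumptions, not the Stockfish or tb code: soft_pext and soft_pdep are slow portable stand-ins for _pext_u64 and _pdep_u64 so that it runs on any CPU, and RookTables, build_rook_tables and the other names are hypothetical. It shows only that both lookups return the same attack sets; it says nothing about which is faster.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Bitboard = std::uint64_t;

// Portable stand-in for _pext_u64: gather the bits of src selected by mask.
Bitboard soft_pext(Bitboard src, Bitboard mask) {
  Bitboard res = 0;
  for (Bitboard bit = 1; mask; bit <<= 1, mask &= mask - 1)
    if (src & mask & (0 - mask)) res |= bit;  // lowest remaining mask bit
  return res;
}

// Portable stand-in for _pdep_u64: scatter the low bits of src onto mask.
Bitboard soft_pdep(Bitboard src, Bitboard mask) {
  Bitboard res = 0;
  for (Bitboard bit = 1; mask; bit <<= 1, mask &= mask - 1)
    if (src & bit) res |= mask & (0 - mask);
  return res;
}

const int kDr[4] = {1, -1, 0, 0}, kDf[4] = {0, 0, 1, -1};
bool on_board(int r, int f) { return r >= 0 && r < 8 && f >= 0 && f < 8; }

// Reference rook attacks from sq (a1 = 0): walk each ray up to a blocker.
Bitboard slow_rook_attacks(int sq, Bitboard occ) {
  Bitboard att = 0;
  for (int d = 0; d < 4; ++d)
    for (int r = sq / 8 + kDr[d], f = sq % 8 + kDf[d]; on_board(r, f);
         r += kDr[d], f += kDf[d]) {
      Bitboard b = Bitboard(1) << (r * 8 + f);
      att |= b;
      if (occ & b) break;
    }
  return att;
}

// Relevant occupancy mask: ray squares except the last square of each ray.
Bitboard rook_mask(int sq) {
  Bitboard m = 0;
  for (int d = 0; d < 4; ++d)
    for (int r = sq / 8 + kDr[d], f = sq % 8 + kDf[d];
         on_board(r + kDr[d], f + kDf[d]); r += kDr[d], f += kDf[d])
      m |= Bitboard(1) << (r * 8 + f);
  return m;
}

struct RookTables {
  Bitboard mask1[64];                  // relevant occupancy (PEXT mask)
  Bitboard mask2[64];                  // attacks on an empty board (PDEP mask)
  std::size_t offset[64];              // start of each square's entries
  std::vector<std::uint16_t> compact;  // PEXT+PDEP variant
  std::vector<Bitboard> full;          // PEXT-only variant
};

RookTables build_rook_tables() {
  RookTables t;
  for (int sq = 0; sq < 64; ++sq) {
    t.mask1[sq] = rook_mask(sq);
    t.mask2[sq] = slow_rook_attacks(sq, 0);
    t.offset[sq] = t.full.size();
    Bitboard occ = 0;
    do {  // subsets of mask1 in increasing order, i.e. in PEXT index order
      Bitboard att = slow_rook_attacks(sq, occ);
      t.full.push_back(att);
      t.compact.push_back(
          static_cast<std::uint16_t>(soft_pext(att, t.mask2[sq])));
      occ = (occ - t.mask1[sq]) & t.mask1[sq];
    } while (occ);
  }
  return t;
}

Bitboard rook_pext_pdep(const RookTables& t, int sq, Bitboard occ) {
  return soft_pdep(t.compact[t.offset[sq] + soft_pext(occ, t.mask1[sq])],
                   t.mask2[sq]);
}

Bitboard rook_pext_only(const RookTables& t, int sq, Bitboard occ) {
  return t.full[t.offset[sq] + soft_pext(occ, t.mask1[sq])];
}
```

The 16-bit entries suffice because a rook never attacks more than 14 squares, so the attack set compressed against mask2 always fits.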
syzygy
Posts: 5559
Joined: Tue Feb 28, 2012 11:56 pm

Re: Stockfish haswell optimized build

Post by syzygy »

Gerd Isenberg wrote: Assuming you also tried the PEXT-only approach with tables four times larger, Ronald had the right instinct. Congratulations!
I don't think he tried, but we can now find out :-)
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: Stockfish haswell optimized build

Post by Gerd Isenberg »

syzygy wrote:
Gerd Isenberg wrote: Assuming you also tried the PEXT-only approach with tables four times larger, Ronald had the right instinct. Congratulations!
I don't think he tried, but we can now find out :-)
It would be nice if Jean-Francois or you could try it and report back: U64 versus pdep_u64(U16, mask2).
j_romang
Posts: 79
Joined: Mon May 16, 2011 2:52 am

Re: Stockfish haswell optimized build

Post by j_romang »

I didn't try :wink:
j_romang
Posts: 79
Joined: Mon May 16, 2011 2:52 am

Re: Stockfish haswell optimized build

Post by j_romang »

Gerd Isenberg wrote: Wow, 4% is a huge speedup considering that only the attack getters changed, though of course it also affects memory and cache behavior in other areas of the program.
According to my profiling experiments, Stockfish spends about 5-6% of its time computing attack bitboards; that's why I wanted to give the pext solution a try.