Hello,
I just tried to make a Haswell-optimized build: https://www.dropbox.com/s/ghbs1vw18q6q4 ... 8_bmi2.zip
Thanks to Ronald de Man's code, I implemented BMI2 instructions in Stockfish. This build also supports his Syzygy tablebases.
Please tell me if it works; you should see a ~4% speedup compared with the corresponding abrok.eu version. Of course, you need a Haswell processor to run it!
Please note that this is NOT an official build, just an experiment.
Stockfish haswell optimized build
-
- Posts: 2250
- Joined: Wed Mar 08, 2006 8:47 pm
- Location: Hattingen, Germany
Re: Stockfish haswell optimized build
j_romang wrote:
Hello,
I just tried to make a Haswell-optimized build: https://www.dropbox.com/s/ghbs1vw18q6q4 ... 8_bmi2.zip
Thanks to Ronald de Man's code, I implemented BMI2 instructions in Stockfish. This build also supports his Syzygy tablebases.
Please tell me if it works; you should see a ~4% speedup compared with the corresponding abrok.eu version. Of course, you need a Haswell processor to run it!
Please note that this is NOT an official build, just an experiment.

This attack code instead of magics?
https://github.com/syzygy1/tb/blob/master/src/bmi2.h
Or is there something better in the meantime or more BMI2 changes elsewhere?
Thanks,
Gerd
-
- Posts: 79
- Joined: Mon May 16, 2011 2:52 am
Re: Stockfish haswell optimized build
Here is the attack code, with no changes elsewhere :
https://github.com/jromang/Stockfish/bl ... ard.h#L271
-
- Posts: 2250
- Joined: Wed Mar 08, 2006 8:47 pm
- Location: Hattingen, Germany
Re: Stockfish haswell optimized build
j_romang wrote:
Here is the attack code, with no changes elsewhere:
https://github.com/jromang/Stockfish/bl ... ard.h#L271

Thanks, yes, Ronald's PDEP/PEXT 210.5k lookup approach. Wow, 4% is a huge speedup considering only the attack getters are changed, but of course this also affects memory and cache behavior in other areas of the program. Assuming you also tried the PEXT-only approach with four times larger tables, Ronald had the right sense, congratulations!
http://www.talkchess.com/forum/viewtopi ... 11&start=3
Cheers,
Gerd
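The "4 times greater tables" remark can be checked with a little arithmetic. The sketch below assumes the 107648-entry table size that appears later in this thread, with 16-bit entries for the PEXT + PDEP variant and 64-bit entries for the PEXT-only variant. The constant names are hypothetical and only serve the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical names; the entry count is the one quoted later in the thread.
constexpr std::size_t kEntries = 107648;

// PEXT + PDEP variant: each entry is a 16-bit compressed attack set.
constexpr std::size_t kCompactBytes = kEntries * sizeof(std::uint16_t);

// PEXT-only variant: each entry is a full 64-bit attack bitboard.
constexpr std::size_t kWideBytes = kEntries * sizeof(std::uint64_t);

static_assert(kWideBytes == 4 * kCompactBytes,
              "64-bit entries need four times the space of 16-bit entries");
```

That is 215,296 bytes (roughly 210 KiB) against 861,184 bytes (roughly 841 KiB), not counting the per-square masks and pointers.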
-
- Posts: 284
- Joined: Tue Aug 13, 2013 9:44 am
Re: Stockfish haswell optimized build
I will probably have no use for this patch for some months, or even some years.
But thank you for the future.
I have a question that has nothing to do with BMI2.
I'm asking you because I do not really expect a response from Marco.
I am probably wrong, but I want to understand.
Why, in the makefile, does POPCNT come with the flag -msse3 instead of -msse4.2, when POPCNT is only present on architectures with at least SSE4.2?
And why is the flag -mpopcnt not included?
Code: Select all
### 3.9 popcnt
ifeq ($(popcnt),yes)
CXXFLAGS += -msse3 -DUSE_POPCNT
endif
Same for prefetch: why is the flag so low?
Code: Select all
### 3.7 prefetch
ifeq ($(prefetch),yes)
ifeq ($(sse),yes)
CXXFLAGS += -msse
DEPENDFLAGS += -msse
endif
else
CXXFLAGS += -DNO_PREFETCH
endif
Regards,
Paul
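As background for the -mpopcnt question, here is a minimal sketch of how a USE_POPCNT-style macro could select between a hardware-backed and a portable bit count. This is not Stockfish's actual code: the function names are hypothetical, and the use of the GCC/Clang builtin `__builtin_popcountll` is an assumption made for illustration. With that builtin, the compiler only emits a single POPCNT instruction when the target enables it (for example via -mpopcnt). Code that emits the instruction through inline assembly does not depend on that flag, so which flag a makefile needs depends on how the instruction is generated.

```cpp
#include <cassert>
#include <cstdint>

// Portable fallback: clear the lowest set bit until none remain.
inline int popcount_sw(std::uint64_t b) {
    int n = 0;
    for (; b; b &= b - 1)
        ++n;
    return n;
}

// Hypothetical dispatcher keyed on a USE_POPCNT-style macro.
inline int popcount(std::uint64_t b) {
#if defined(USE_POPCNT) && (defined(__GNUC__) || defined(__clang__))
    // Becomes one POPCNT instruction only if the target allows it
    // (e.g. -mpopcnt); otherwise the compiler falls back to a routine.
    return __builtin_popcountll(b);
#else
    return popcount_sw(b);
#endif
}
```

Both paths return the same values, so the macro only changes speed, not behavior.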
-
- Posts: 5563
- Joined: Tue Feb 28, 2012 11:56 pm
Re: Stockfish haswell optimized build
It might be interesting to try this.
I did not test this, so maybe something is wrong or missing.
In bitboard.h change
Code: Select all
struct BMI2Info {
  unsigned short *data;
  uint64_t mask1;
  uint64_t mask2;
};
extern unsigned short attack_table[107648];
extern struct BMI2Info bishop_bmi2[64];
extern struct BMI2Info rook_bmi2[64];
template<PieceType Pt>
inline Bitboard attacks_bb(Square s, Bitboard occ) {
  struct BMI2Info *info = (Pt == ROOK ? &rook_bmi2[s] : &bishop_bmi2[s]);
  return _pdep_u64(info->data[_pext_u64(occ, info->mask1)], info->mask2);
}
into
Code: Select all
struct BMI2Info {
  uint64_t *data;
  uint64_t mask;
};
extern struct BMI2Info bishop_bmi2[64];
extern struct BMI2Info rook_bmi2[64];
template<PieceType Pt>
inline Bitboard attacks_bb(Square s, Bitboard occ) {
  struct BMI2Info *info = (Pt == ROOK ? &rook_bmi2[s] : &bishop_bmi2[s]);
  return info->data[_pext_u64(occ, info->mask)];
}
In bitboard.cpp change the lines
Code: Select all
static unsigned short attacks_table[107648];
...
info[sq].mask1 = bb;
...
if (i == 0)
  info[sq].mask2 = bb2;
attacks_table[idx++] = _pext_u64(bb2, info[sq].mask2);
into
Code: Select all
static uint64_t attacks_table[107648];
...
info[sq].mask = bb;
...
attacks_table[idx++] = bb2;
This might be faster and this might be slower, but it would be interesting to know.
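To make the two variants concrete, here is a self-contained sketch that builds both tables and looks attacks up either way. It is not the code from the linked repositories: all names (`soft_pext`, `soft_pdep`, `Tables`, `build_tables`, and so on) are hypothetical, and the BMI2 intrinsics are replaced by slow portable loops so the example runs on any CPU. On a Haswell build, `_pext_u64` and `_pdep_u64` would take their place.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Bitboard = std::uint64_t;

// Portable stand-in for _pext_u64: gather the bits of src selected by mask.
Bitboard soft_pext(Bitboard src, Bitboard mask) {
    Bitboard res = 0;
    for (Bitboard bit = 1; mask; bit <<= 1, mask &= mask - 1)
        if (src & mask & (0 - mask)) res |= bit;
    return res;
}

// Portable stand-in for _pdep_u64: scatter the low bits of src into mask.
Bitboard soft_pdep(Bitboard src, Bitboard mask) {
    Bitboard res = 0;
    for (Bitboard bit = 1; mask; bit <<= 1, mask &= mask - 1)
        if (src & bit) res |= mask & (0 - mask);
    return res;
}

// Reference ray walk; pt: 0 = bishop, 1 = rook. Stops on the first blocker.
Bitboard slow_attacks(int sq, Bitboard occ, int pt) {
    static const int dirs[2][4][2] = {{{1, 1}, {1, -1}, {-1, 1}, {-1, -1}},
                                      {{1, 0}, {-1, 0}, {0, 1}, {0, -1}}};
    Bitboard att = 0;
    for (int d = 0; d < 4; ++d) {
        int r = sq / 8 + dirs[pt][d][0], f = sq % 8 + dirs[pt][d][1];
        for (; r >= 0 && r < 8 && f >= 0 && f < 8;
             r += dirs[pt][d][0], f += dirs[pt][d][1]) {
            Bitboard b = 1ULL << (r * 8 + f);
            att |= b;
            if (occ & b) break;
        }
    }
    return att;
}

// Relevant occupancy: empty-board attacks minus edges the piece is not on.
Bitboard relevant_mask(int sq, int pt) {
    const Bitboard rank1 = 0xFFULL, fileA = 0x0101010101010101ULL;
    Bitboard rankEdges = (rank1 | rank1 << 56) & ~(rank1 << (8 * (sq / 8)));
    Bitboard fileEdges = (fileA | fileA << 7) & ~(fileA << (sq % 8));
    return slow_attacks(sq, 0, pt) & ~(rankEdges | fileEdges);
}

struct Tables {
    std::vector<std::uint16_t> compact;  // PEXT + PDEP variant
    std::vector<Bitboard> wide;          // PEXT-only variant
    std::size_t offset[2][64];
    Bitboard mask1[2][64], mask2[2][64];
};

Tables build_tables() {
    Tables t;
    for (int pt = 0; pt < 2; ++pt)
        for (int sq = 0; sq < 64; ++sq) {
            Bitboard m1 = relevant_mask(sq, pt);
            Bitboard m2 = slow_attacks(sq, 0, pt);  // empty-board attacks
            t.mask1[pt][sq] = m1;
            t.mask2[pt][sq] = m2;
            t.offset[pt][sq] = t.wide.size();
            Bitboard n = soft_pext(m1, m1) + 1;     // 2^(number of bits in m1)
            for (Bitboard i = 0; i < n; ++i) {
                Bitboard att = slow_attacks(sq, soft_pdep(i, m1), pt);
                t.wide.push_back(att);
                t.compact.push_back(
                    static_cast<std::uint16_t>(soft_pext(att, m2)));
            }
        }
    return t;
}

// 16-bit entry, expanded back to a bitboard with PDEP.
Bitboard attacks_compact(const Tables& t, int pt, int sq, Bitboard occ) {
    std::size_t i = t.offset[pt][sq] + soft_pext(occ, t.mask1[pt][sq]);
    return soft_pdep(t.compact[i], t.mask2[pt][sq]);
}

// 64-bit entry, returned directly.
Bitboard attacks_wide(const Tables& t, int pt, int sq, Bitboard occ) {
    return t.wide[t.offset[pt][sq] + soft_pext(occ, t.mask1[pt][sq])];
}
```

The trade-off under discussion is visible here: `attacks_wide` saves one PDEP per lookup, while `attacks_compact` keeps the table four times smaller. Which one wins in practice depends on cache behavior and can only be settled by measurement.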
-
- Posts: 5563
- Joined: Tue Feb 28, 2012 11:56 pm
Re: Stockfish haswell optimized build
Gerd Isenberg wrote:
Assuming you also tried the PEXT only approach with 4 times greater tables, Ronald had the right sense, congratulations!

I don't think he tried, but we can now find out.
-
- Posts: 2250
- Joined: Wed Mar 08, 2006 8:47 pm
- Location: Hattingen, Germany
Re: Stockfish haswell optimized build
syzygy wrote:
Gerd Isenberg wrote:
Assuming you also tried the PEXT only approach with 4 times greater tables, Ronald had the right sense, congratulations!
I don't think he tried, but we can now find out

Would be nice if Jean-Francois or you could try and report: U64 versus pdep_u64(U16, mask2).
-
- Posts: 79
- Joined: Mon May 16, 2011 2:52 am
Re: Stockfish haswell optimized build
I didn't try
-
- Posts: 79
- Joined: Mon May 16, 2011 2:52 am
Re: Stockfish haswell optimized build
Gerd Isenberg wrote:
Wow, 4% is a huge speedup considering only attack-getters are changed, but of course affecting other memory and cache issues in other areas of the program.

According to my profiling experiments, Stockfish spends about 5-6% of its time computing attack bitboards; that's why I wanted to give the PEXT solution a try.
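The two figures in this exchange can be related with Amdahl's law. The sketch below assumes that "4% speedup" means the whole program runs 1.04 times faster and that the 5-6% profile share is the only part that changed. Both are simplifications, since cache effects elsewhere are ignored, and the function name is hypothetical.

```cpp
#include <cassert>
#include <cmath>

// How much faster must a region become, if it takes `fraction` of the
// original runtime, for the whole program to run `overall` times faster?
// (Amdahl's law solved for the local speedup; hypothetical helper.)
double required_local_speedup(double fraction, double overall) {
    double new_total = 1.0 / overall;                  // new runtime, old = 1
    double new_region = new_total - (1.0 - fraction);  // time left for region
    return fraction / new_region;
}
```

Under these assumptions, a 1.04x overall gain needs the attack code to become roughly 2.8x faster if it took 6% of the time, and roughly 4.3x faster if it took 5%. This fits the remark that the change probably also helps memory and cache behavior elsewhere.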