So in that spirit I have taken a few months and re-implemented NNUE from scratch, starting from nothing but the architecture screenshot here:
https://github.com/glinscott/nnue-pytor ... chitecture
What I want to share is applicable to AVX2 and some of AVX512, and I want to share what is possible without spilling the source code for it. I was actually surprised that the innermost hot loop of NNUE can be optimized with pure AVX2/AVX512 intrinsics, and it was a very fun journey.
After I was done, I compared what I had with what the official SF repo implements for the same ideas.
The incremental part of NNUE is optimal with good intrinsics etc. (I mean the part where we maintain 1536 int16_t by incrementally updating the state, into what they call accumulators). Having two of them is already not optimal, since both can fit in one linear array that feeds the next part.
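As a rough sketch of what I mean (sizes and names here are mine, placeholders, not from either codebase):
Code: Select all
#include <cstdint>

// Sketch: both accumulator halves live back to back in one aligned array, so the
// next layer reads one contiguous block of int16_t through a single pointer
// instead of gathering from two separate accumulator objects.
constexpr int HalfDims = 768;                  // placeholder; the real width depends on the net

struct alignas(64) Accumulator {
    std::int16_t values[2 * HalfDims];         // side-to-move half first, then the other half
};

// The next layer can then take acc.values directly as its single input pointer.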
From an architecture perspective I can tell you that ALL of NNUE (including init + file loads and hash comparison) fits in around 150 lines of code, with all of the actual non-incremental code (propagate_layers) fitting in 54 lines of readable C++.
Of course, if compilers were perfect we would be done here, but even clang 1) does not reorder memory read from a file to reshuffle it for optimal intrinsics, and 2) does not emit optimal intrinsics from loops to begin with. So add around 60 LOC for the AVX2 path.
All told it fits in a single file, which makes it maintainable, and as a .h there is no linking, which makes clang much more efficient compared to .cpp + .h.
Optimisations missing from SF repo
0) Board Layout. I had the luck that in Gigantua my layout is color agnostic, making the player's pawns always move like this: pawn >>= 1.
Incidentally it seems that NNUE prefers this as well, and SF has to jump through some hoops to align indices.
0a) Memory Layout. Having an optimal board layout means that the binary net files are not compatible and need to be reshuffled. Here I can show you exactly what my philosophy is, and I can make choices that are not possible otherwise.
This code is called for each and every change in NNUE, which is between 2 and 6 times per move:
Code: Select all
//SF schema: IndexType(orient(perspective, s, ksq) + PieceSquareIndex[perspective][pc] + PS_NB * KingBuckets[o_ksq]);
//Gigantua schema: feature_idx_own[pc][sq];
//2nd optimisation: return pointer instead of index!
https://github.com/official-stockfish/S ... hm.cpp#L30
Worse, the indices get put into a list, which is unnecessary when using a template visitor pattern. That makes the implementation of the same idea 20x faster for that snippet: 1 function, 4 lookups and some multiplications all get replaced by an instant 2D lookup.
Returning a pointer directly is also a nice speedup compared to returning an index in this case.
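To make the 2D lookup + pointer idea concrete, here is a minimal sketch under my own assumptions (table names, sizes and the init function are mine; king buckets would be folded in whenever the table is rebuilt):
Code: Select all
#include <cstdint>

constexpr int PieceNb  = 16;     // placeholder piece encoding
constexpr int SquareNb = 64;
constexpr int HalfDims = 768;    // placeholder accumulator width per perspective

// Raw first-layer weights: one HalfDims-wide column per (piece, square) feature.
static std::int16_t first_layer_weights[PieceNb * SquareNb][HalfDims];

// Built once (and rebuilt when the king bucket changes): every (piece, square)
// maps straight to a pointer at its weight column, replacing the per-update
// orient() + PieceSquareIndex[] + PS_NB * KingBuckets[] arithmetic.
static const std::int16_t* feature_idx_own[PieceNb][SquareNb];

static void init_feature_table() {
    for (int pc = 0; pc < PieceNb; ++pc)
        for (int sq = 0; sq < SquareNb; ++sq)
            feature_idx_own[pc][sq] = first_layer_weights[pc * SquareNb + sq];
}

// The incremental update then just adds/subtracts the pointed-to column:
// const std::int16_t* col = feature_idx_own[pc][sq];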
1) These definitions mean that the compiler has to iterate over multiple indices, which does not get optimized away even at -O3. You can consolidate AffineTransform and Relu into a single function.
https://github.com/official-stockfish/S ... ure.h#L117
Going forward, we can consolidate all layers into a single function definition, which is invoked with an up-to-date accumulator pointer. Notice how that is a single pointer even when NNUE updates both colors.
Code: Select all
static inline int propagate_layers(const std::int16_t* acc, int8_t* w0, int32_t* b0, int8_t* w1, int32_t* b1, int8_t* w2, int32_t* b2) noexcept
Expanding on that idea, we can even consolidate and shuffle the memory layout of the weights so that all weights sit in a linear, padded layout perfect for AVX2 or AVX512:
Code: Select all
static inline int propagate_layers(const __m256i* __restrict acc, const __m256i* __restrict w0, const __m256i* __restrict b0)
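A hedged sketch of the load-time reshuffle that makes such a signature possible (helper name and the 32-byte padding are my assumptions):
Code: Select all
#include <immintrin.h>
#include <cstdint>
#include <cstring>

// Sketch: copy the int8_t weights from the file once into a single 32-byte
// aligned, zero-padded block, so the hot loop can walk one const __m256i*
// linearly. Any interleaving required by maddubs/madd would also be done
// here, once, at load time instead of in the hot path.
static __m256i* pack_weights_avx2(const std::int8_t* file_weights, std::size_t count) {
    const std::size_t bytes = (count + 31) & ~std::size_t(31);       // pad to a multiple of 32
    __m256i* packed = static_cast<__m256i*>(_mm_malloc(bytes, 32));  // 32-byte alignment for AVX2
    std::memset(packed, 0, bytes);
    std::memcpy(packed, file_weights, count);                        // real code would also reorder here
    return packed;                                                   // release later with _mm_free
}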
2) These buffers can be removed completely. You don't need them and can work with registers directly; incidentally, the maximum usage is much smaller than defined here and fits in registers.
https://github.com/official-stockfish/S ... ture.h#L95
3) memcpy is used, which is quite slow when the domain already guarantees (but the stdlib cannot assume, so it has to run a few ifs) that the pointers are aligned and non-overlapping.
https://godbolt.org/z/15oYqMKjn
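For illustration, a copy that bakes those guarantees in (alignment, no overlap, compile-time size) so none of the runtime checks survive - a sketch, not what SF does:
Code: Select all
#include <immintrin.h>
#include <cstddef>

// When both pointers are known to be 32-byte aligned and non-overlapping and
// the size is a compile-time multiple of 32 bytes, the copy degenerates to
// plain aligned loads and stores with no dispatch logic at all.
template <std::size_t Bytes>
static inline void copy_aligned(void* __restrict dst, const void* __restrict src) {
    static_assert(Bytes % sizeof(__m256i) == 0, "size must be a multiple of 32 bytes");
    auto*       d = static_cast<__m256i*>(dst);
    const auto* s = static_cast<const __m256i*>(src);
    for (std::size_t i = 0; i < Bytes / sizeof(__m256i); ++i)
        _mm256_store_si256(d + i, _mm256_load_si256(s + i));
}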
4)
The NNUE weighted sums can be computed faster by skipping some intrinsics on AVX2.
Making this much faster: https://github.com/official-stockfish/S ... 40-L211C40
That is a throwaway sentence above, but it is the most important part right here; it has maybe 30% of the overall impact.
If you are an SF developer, read this and you will understand instantly. It is applicable for all AVX and AVX512, except with VNNI (then it is already good).
The relu layer clips the inputs to 0..128, making the widening from packed 16-bit to 32-bit accumulators unnecessary on every iteration.
So you don't need _mm256_madd_epi16 every iteration; overflow is only possible on every 32nd iteration. Skipping all of these intrinsics leads to
this perfect inner loop: acc = _mm256_add_epi16(acc, _mm256_maddubs_epi16(input_simd[m], *w0++));
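To make the shape concrete, here is a hedged sketch of that block (names are mine; the block length just has to stay below whatever bound keeps the int16 lanes from overflowing):
Code: Select all
#include <immintrin.h>

// Sketch: because the clipped inputs are small, the maddubs results are summed
// in 16-bit lanes for a whole block of iterations. The single widening
// madd(_, 1) per accumulator is deferred into the final reduction instead of
// being paid on every iteration.
static inline void dot_block_avx2(const __m256i* input_simd, const __m256i*& w0,
                                  int iterations, __m256i accs[8]) {
    for (int k = 0; k < 8; ++k)
        accs[k] = _mm256_setzero_si256();
    for (int m = 0; m < iterations; ++m)          // iterations chosen so the
        for (int k = 0; k < 8; ++k)               // int16 lanes cannot overflow
            accs[k] = _mm256_add_epi16(accs[k],
                      _mm256_maddubs_epi16(input_simd[m], *w0++));
}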
As in the sketch, that line runs for 8 accumulators. Using 8 accumulators instead of 4 has another advantage:
5)
Without register spilling it is possible to increase the number of internal accumulators to 8, making this function an m256_haddx8 - and that allows the function to never mix SSE and AVX, which is a slowdown.
https://github.com/official-stockfish/S ... #LL196C37-
For this one I can share my code.
Of course my style is to overload this function so it does more than it says (adding the biases, for example).
Code: Select all
static inline __m256i accumulator_reduce(__m256i accs[8], __m256i bias) {
    const __m256i one = _mm256_set1_epi16(1);
    accs[0] = _mm256_hadd_epi32(_mm256_madd_epi16(accs[0], one), _mm256_madd_epi16(accs[1], one));
    accs[1] = _mm256_hadd_epi32(_mm256_madd_epi16(accs[2], one), _mm256_madd_epi16(accs[3], one));
    accs[2] = _mm256_hadd_epi32(_mm256_madd_epi16(accs[4], one), _mm256_madd_epi16(accs[5], one));
    accs[3] = _mm256_hadd_epi32(_mm256_madd_epi16(accs[6], one), _mm256_madd_epi16(accs[7], one));

    //a0 a1 a2 a3; b0 b1 b2 b3; c0 c1 c2 c3; d0 d1 d2 d3; a4 a5 a6 a7; b4 b5 b6 b7; c4 c5 c6 c7; d4 d5 d6 d7
    //e0 e1 e2 e3; f0 f1 f2 f3; g0 g1 g2 g3; h0 h1 h2 h3; e4 e5 e6 e7; f4 f5 f6 f7; g4 g5 g6 g7; h4 h5 h6 h7
    //a4 a5 a6 a7; b4 b5 b6 b7; c4 c5 c6 c7; d4 d5 d6 d7; e0 e1 e2 e3; f0 f1 f2 f3; g0 g1 g2 g3; h0 h1 h2 h3
    accs[0] = _mm256_hadd_epi32(accs[0], accs[1]);
    accs[1] = _mm256_hadd_epi32(accs[2], accs[3]);
    accs[2] = _mm256_permute2x128_si256(accs[0], accs[1], 0b100001);

    //Blend and add bias
    return _mm256_add_epi32(bias, _mm256_blend_epi32(
        _mm256_add_epi32(accs[0], accs[2]),
        _mm256_add_epi32(accs[1], accs[2]),
        0b11110000));
}
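Roughly how I would expect it to be used after the 8-accumulator block above (the surrounding names come from my sketches, not from the real code):
Code: Select all
__m256i accs[8];
dot_block_avx2(input_simd, w0, iterations, accs);   // fills the 8 int16 accumulators
__m256i out32 = accumulator_reduce(accs, bias8);    // 8 widened int32 sums, biases included
// lane k of out32 is now the 32-bit pre-activation of output neuron k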
Disclaimer: some of this is not applicable when VNNI is available, but most of it is, and I can't say what the improved memory layout does for NEON, but going from many pointers to two aligned SIMD pointers should help.
Code: Select all
//17.04.23 0.04 MNPS
//17.04.23 0.045 MNPS
//17.04.23 0.054 MNPS
//18.04.23 0.054 MNPS
//18.04.23 0.146 MNPS
//20.04.23 0.253 MNPS
//20.04.23 0.262 MNPS
//21.04.23 0.266 MNPS
//22.04.23 0.269 MNPS
//25.04.23 3.067 MNPS
//27.04.23 3.320 MNPS
//27.04.23 3.370 MNPS
//29.04.23 4.450 MNPS
//01.05.23 4.491 MNPS
//02.05.23 4.712 MNPS
It boils down to:
1) having everything readable inside a single function with const and non-const pointers and no outside references etc. (this helps the compiler A LOT)
2) improving the memory layout and reshuffling some weights to get from the incremental layer to the output faster with better intrinsics
3) decreasing the overall cost of intrinsics by finding redundancies from domain knowledge (for example, knowing a value is strictly below 128 suddenly lets you remove some instructions, because an overflow cannot happen)