Making NNUE 60% faster


AndrewGrant
Posts: 1955
Joined: Tue Apr 19, 2016 6:08 am
Location: U.S.A
Full name: Andrew Grant

Re: Making NNUE 60% faster

Post by AndrewGrant »

dangi12012 wrote: Wed Jun 07, 2023 2:16 pm
AndrewGrant wrote: Wed Jun 07, 2023 1:29 pm For any aspiring NNUE developers working on their own engines, you should know that most if not all of what is posted here is bunk. You are much better off hopping into the Stockfish or OpenBench discords and having a conversation about these topics, than trying to "uncover" the secrets that are supposedly offered here.

Of all of his points, the only one that I see as having any merit is the 0a comment regarding the way you compute the index of a given weight, which can maybe save you a few operations. But the real concern in NNUE is L1, which dwarfs any other optimizations you might make.
For any aspiring NNUE developers, no one has offered a simple one-file solution yet. Worse -
If you read exactly how Andrew implemented NNUE, I would not put too much weight on his comments. It is one of the slower versions. https://github.com/AndyGrant/Ethereal/b ... nue/nnue.c
A slower version - implying you are actually able to run that code and see for yourself, which I doubt.

There is a reason for floats, but you are not worth the time it takes to explain it. I'll happily explain the trade-offs for floats in L2/L3 to anyone else who may be interested.

I'm going to mute this thread, so as not to get drawn into your brain-damaged grifting. Godspeed to anyone else who sticks it out.
Sopel
Posts: 391
Joined: Tue Oct 08, 2019 11:39 pm
Full name: Tomasz Sobczyk

Re: Making NNUE 60% faster

Post by Sopel »

dangi12012 wrote: Wed Jun 07, 2023 2:16 pm For example, I am asserting that _mm256_madd_epi16 does not need to be invoked every single time because you clamp between 0..127. No one has really engaged with that point yet.
Sopel: see https://github.com/official-stockfish/S ... 7ceb83dbf8

Sopel:
You run the risk of overflow after 32 calls
Sopel: no, we run the risk of overflow after accumulating two maddubs results in this case
Sopel: because 4*127*127>=2^15
Sopel: in other words, m256_add_dpbusd_epi32x2 can overflow int16
Sopel: in other words
__m256i product0 = _mm256_maddubs_epi16(a0, b0);
__m256i product1 = _mm256_maddubs_epi16(a1, b1);
_mm256_add_epi16(product0, product1);
can overflow
For completeness, I know you dislike Discord; maybe you will be able to process this message here.
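
For reference, a quick self-contained check of that bound (a sketch, assuming inputs clipped to 0..127 and int8 weights in -128..127):

Code: Select all

#include <cstdint>
#include <cstdio>

//Each maddubs lane is a0*b0 + a1*b1 with a in [0,127] and b in [-128,127]:
//a single result still fits in int16, but the sum of two such results does not.
int main() {
    const int one_maddubs = 2 * 127 * 127; //32258 <= INT16_MAX, fits
    const int two_maddubs = 4 * 127 * 127; //64516 >  INT16_MAX, overflows int16
    std::printf("%d %d (INT16_MAX = %d)\n", one_maddubs, two_maddubs, INT16_MAX);
}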
dangi12012 wrote:No one wants to touch anything you have posted. That proves you now have a negative reputation, since everyone already knows you are a forum troll.

Maybe you copied your Stockfish commits from someone else too?
I will look into that.
ImNotStockfish
Posts: 56
Joined: Tue Sep 14, 2021 12:29 am
Full name: .

Re: Making NNUE 60% faster

Post by ImNotStockfish »

dangi12012 wrote: Wed Jun 07, 2023 3:00 pm I have it, and if you know a way to submit to Fishtest a commit on top of the SF15 release branch without leaking my source code, I can do that today. Cherry-pick or some sort of private commit.
PM me.
If you have a working implementation, that is great, but Stockfish is based on open-source collaboration (GPL 3). This means that it is not possible to distribute a modification of Stockfish to the hardware donors (or anyone) without disclosing the source code.

If you don't want to disclose the code, then you can test it yourself on your own hardware using the same parameters as Fishtest does (adjusted TC and book, for example) against the latest development version of Stockfish with a (really) high sample size. :mrgreen:
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Making NNUE 60% faster

Post by dangi12012 »

ImNotStockfish wrote: Wed Jun 07, 2023 9:25 pm
dangi12012 wrote: Wed Jun 07, 2023 3:00 pm I have it, and if you know a way to submit to Fishtest a commit on top of the SF15 release branch without leaking my source code, I can do that today. Cherry-pick or some sort of private commit.
PM me.
If you have a working implementation, that is great, but Stockfish is based on open-source collaboration (GPL 3). This means that it is not possible to distribute a modification of Stockfish to the hardware donors (or anyone) without disclosing the source code.

If you don't want to disclose the code, then you can test it yourself on your own hardware using the same parameters as Fishtest does (adjusted TC and book, for example) against the latest development version of Stockfish with a (really) high sample size. :mrgreen:
Well then, I will just publish the binary here over the weekend. There can be no doubt when there is proof: one binary with a commit_id + compiler version + the resulting file hash, versus another version from commit_id + 1 local commit, same compile flags, etc.

To hammer the point home:
This gets put into an added-and-removed list for lazy evaluation later:

Code: Select all

return IndexType((int(s) ^ OrientTBL[Perspective][ksq]) + PieceSquareIndex[Perspective][pc] + KingBuckets[Perspective][ksq]);
My impl does all that at init time:

Code: Select all

//pc and ksq come from enclosing loops over pieces and king squares (not shown in this snippet)
for (int sq = 0; sq < 64; sq++) {
	int sq_flip = (((sq >> 3) | (sq << 3)) & 63) ^ 63; //flip sq too

	int idx = (ksq << 6) | sq_flip;
	int px = std::distance(piece_list, std::find(piece_list, piece_list + 12, pc));
	if (px >= 12) continue; //we condense 16 into 12

	//Move lea and shift by 12 to initialisation time!
	int own_idx = ((sq ^ piece_rotate[0][ksq]) + piece_offset[0][pc] * 64 + king_buckets[0][ksq]);
	int opp_idx = ((sq ^ piece_rotate[1][ksq]) + piece_offset[1][pc] * 64 + king_buckets[1][ksq]);
	idx_own[px][idx] =
	{
		FeatureTransform<simdlevel>::weights + traits * own_idx,
		FeatureTransform<simdlevel>::psqtWeights + dims * own_idx
	};
	idx_opp[px][idx] =
	{
		FeatureTransform<simdlevel>::weights + traits * opp_idx,
		FeatureTransform<simdlevel>::psqtWeights + dims * opp_idx
	};
}
I even store pointers, to make the runtime a little bit faster yet again:

Code: Select all

//asm proof https://godbolt.org/z/4WhEddEr4
struct nnue_idx
{
	int16_t* weight;
	int32_t* psqt;
};

//Square, King, Piece => index lookups
static inline nnue_idx idx_own[12][64 * 64];
static inline nnue_idx idx_opp[12][64 * 64];
I just saw in my comments that there is a proof kernel for this: https://godbolt.org/z/4WhEddEr4
It removes LEA among other things.
So when we activate a feature now, it looks like this (no calculation needed anymore):

Code: Select all

activate_feature(own, own_q, idx_own[piece][w]);
Feature activation itself is fast in SF:

Code: Select all

for (const auto index : active)
        {
          const IndexType offset = HalfDimensions * index + j * TileHeight;
          auto column = reinterpret_cast<const vec_t*>(&weights[offset]);

          for (unsigned k = 0; k < NumRegs; ++k)
            acc[k] = vec_add_16(acc[k], column[k]);
        }
Versus mine again, where we already get a pointer instead of calculating an index:

Code: Select all

inline static void activate_feature(int16_t* acc, int32_t* psqt, const nnue_idx& feature_idx) when_avx2 {
	auto a = reinterpret_cast<__m256i*>(acc);
	auto w = reinterpret_cast<const __m256i*>(feature_idx.weight);

	for (int j = 0; j < traits / 16; ++j)
		a[j] = _mm256_add_epi16(a[j], w[j]);

	for(int j=0;j<8;++j)
		psqt[j] += feature_idx.psqt[j];
}
This function is called at least 2x per eval, and many more times because the lazy SF list grows a bit.
I really don't see why there is an antibody reaction from the SF Discord against specific improvement ideas.

What really helps a lot on top of that is color agnosticism, where we either bit-rotate the bitboard or call std::byteswap, depending on whether we want our pawns to always walk up or always walk left. I found that the US-vs-THEM approach (instead of Black-vs-White) is also a good performance gain.
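
A minimal sketch of that flip, assuming a rank-major bitboard layout (one byte per rank) and C++23 for std::byteswap; names are illustrative:

Code: Select all

#include <bit>      //std::byteswap (C++23)
#include <cstdint>

//Mirrors the board vertically (rank 1 <-> rank 8), so after flipping, "their" pawns
//push in the same direction as ours and a single code path serves both colors.
constexpr uint64_t flip_vertical(uint64_t bb) {
    return std::byteswap(bb);
}

//Square indices mirror with sq ^ 56 in the same layout.
constexpr int flip_square(int sq) { return sq ^ 56; }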

So the binary is coming, so doubters can see with their own eyes. I will probably append MILLION_NNUE_EVALS_PER_SECOND to the UCI output.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
AndrewGrant
Posts: 1955
Joined: Tue Apr 19, 2016 6:08 am
Location: U.S.A
Full name: Andrew Grant

Re: Making NNUE 60% faster

Post by AndrewGrant »

What you just posted is probably a minor gainer - very small gains, just on the transformer.

The comments about 60% stemming from maddubs are still bunk.

And the only reason I still see this thread is because you PM'ed me some garbage. Don't do that again.
connor_mcmonigle
Posts: 544
Joined: Sun Sep 06, 2020 4:40 am
Full name: Connor McMonigle

Re: Making NNUE 60% faster

Post by connor_mcmonigle »

dangi12012 wrote: Tue Jun 06, 2023 6:26 pm For the past few years people have said that movegen is not where engines spend most of their time - which of course is true, but the design philosophy of taking an idea (not the implementation), re-imagining memory layouts, intrinsics, etc., and pushing it to the fullest extent is always a nice thing to have.
So in that spirit I have taken a few months and re-implemented NNUE from scratch, starting just from the screenshot here:
https://github.com/glinscott/nnue-pytor ... chitecture


So what I want to share is applicable to AVX2 and some of AVX512. I want to share what is possible without spilling the source code for it. I was actually surprised that the innermost hot loop of NNUE can be optimized for all AVX2 and AVX512 intrinsic types, and it was a very fun journey.

After I was done, I compared what I have with what the official SF repo has implemented for the same ideas.
The incremental part of NNUE is optimal with good intrinsics etc. (I mean the part where we maintain 1536 int16_t by incrementally updating the state, into what they call accumulators). Having two of them is already not optimal, since both of them can fit in one linear array for input into the next part.

From an architecture perspective, I can tell you that ALL of NNUE (including init + file loads and hash comparison) can fit in around 150 lines of code, with all of the actual non-incremental code (propagate_layers) fitting in 54 lines of readable C++.

Of course, if compilers were perfect we would be done here, but even clang 1) does not reorder memory read from a file to reshuffle it in order to use optimal intrinsics, and 2) does not emit optimal intrinsics from loops to begin with. So add around 60 LOC for AVX2.
All being said, it fits in a single file, which makes it maintainable, and as a .h there is no linking, making clang much more efficient compared to .cpp + .h.

Optimisations missing from SF repo

0) Board Layout. I had the luck that in Gigantua my layout is color agnostic, making the player's pawns always move like this: pawn >>= 1.
Incidentally, it seems that NNUE prefers this as well, and SF has to go through some hoops to align indices.

0a) Memory Layout. Having an optimal board layout means that the binary files are not compatible and need to be reshuffled. Here I can show you exactly what my philosophy is, and I can make choices here that are not possible otherwise.

This code is called for each and every change in NNUE, which is between 2x and 6x per move:

Code: Select all

//SF schema: IndexType(orient(perspective, s, ksq) + PieceSquareIndex[perspective][pc] + PS_NB * KingBuckets[o_ksq]);
//Gigantua schema: feature_idx_own[pc][sq];
//2nd optimisation: return pointer instead of index!
source:
https://github.com/official-stockfish/S ... hm.cpp#L30

Worse, the indices get put in a list, which is unnecessary when using a template visitor pattern - making the implementation of the same idea 20x faster for that snippet. One function, four lookups, and some multiplications all get replaced by an instant 2D lookup (see the sketch below).
Also, returning a pointer directly is a nice speedup compared to returning an index in this case.
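
A minimal sketch of the visitor idea, with purely illustrative names and types (not the actual Stockfish or Gigantua code): instead of appending indices to a list and looping over it later, each changed feature is handed straight to a templated callback, so the accumulator update can be inlined at the call site.

Code: Select all

//Illustrative only - FeatureChange and the callbacks are hypothetical
struct FeatureChange { int piece; int from_sq; int to_sq; };

template <typename OnAdd, typename OnRemove>
inline void visit_changes(const FeatureChange* changes, int count, int ksq,
                          OnAdd&& on_add, OnRemove&& on_remove) {
    for (int i = 0; i < count; ++i) {
        //the callbacks would do the 2D table lookup and the SIMD add/sub directly
        on_remove(changes[i].piece, (ksq << 6) | changes[i].from_sq);
        on_add(changes[i].piece, (ksq << 6) | changes[i].to_sq);
    }
}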

1) These definitions mean that the compiler has to iterate over multiple indices, which does not get optimized away even at -O3. You can consolidate AffineTransform and ReLU into a single function.
https://github.com/official-stockfish/S ... ure.h#L117

Going forward, we can consolidate all layers into a simple function definition, which is invoked with an up-to-date accumulator pointer. Notice how that is a single pointer even when NNUE updates both colors.

Code: Select all

static inline int propagate_layers(const std::int16_t* acc, int8_t* w0, int32_t* b0, int8_t* w1, int32_t* b1, int8_t* w2, int32_t* b2) noexcept
This correctly returns the material bypass value, material = (material * 600 * 16) / (127 * (1 << weight_scale)), plus the positional bypass + eval, as defined in the V6 network.

Expanding on that idea, we can even consolidate and reshuffle the memory layout of the weights to have all weights in a linear, padded layout perfect for AVX2 or AVX512:

Code: Select all

static inline int propagate_layers(const __m256i* restrict acc, const __m256i* restrict w0, const __m256i* restrict b0)
2)
These buffers can be removed completely - you don't need them and can work with registers directly; incidentally, the maximum usage is much smaller than defined here and fits in registers.
https://github.com/official-stockfish/S ... ture.h#L95

3) Memcpy is used, which is quite slow when the domain already guarantees that pointers are aligned and non-overlapping (the stdlib cannot assume this and has to run a few checks).
https://godbolt.org/z/15oYqMKjn
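
A minimal sketch of the alternative, assuming 32-byte-aligned, non-overlapping buffers whose size is a multiple of 32 bytes - exactly the guarantees memcpy cannot assume (names are illustrative):

Code: Select all

#include <immintrin.h>
#include <cstddef>

static inline void copy_aligned_avx2(void* dst, const void* src, std::size_t bytes) {
    auto*       d = static_cast<__m256i*>(dst);
    const auto* s = static_cast<const __m256i*>(src);
    for (std::size_t i = 0; i < bytes / sizeof(__m256i); ++i)
        _mm256_store_si256(d + i, _mm256_load_si256(s + i)); //aligned load/store, no runtime checks
}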

4)
NNUE weights can be calculated faster by skipping some intrinsics for AVX2.
Making this much faster: https://github.com/official-stockfish/S ... 40-L211C40

This is a throwaway sentence above, but it's the most important part right here. If you read this, it has maybe 30% of the overall impact.
If you are a SF developer, read this sentence and you will understand instantly. It is applicable for all AVX and AVX512, except for VNNI (there it is already good).
The ReLU layer clips the inputs to 0..128, making the transformation from packed 16-bit to 32-bit accumulators unnecessary on every iteration.
So you don't need _mm256_madd_epi16 every iteration; only on every 32nd iteration is overflow possible. Skipping all of these intrinsics leads perfectly to this, for 8 accumulators:
acc = _mm256_add_epi16(acc, _mm256_maddubs_epi16(input_simd[m], *w0++));
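
For reference, a minimal sketch of the per-iteration widening pattern referred to above, assuming the usual maddubs-plus-madd-with-ones emulation of dpbusd (names are illustrative, not the exact SF code):

Code: Select all

#include <immintrin.h>

//Baseline: widen every iteration so the 32-bit accumulator can never overflow.
static inline void add_dpbusd_epi32(__m256i& acc32, __m256i input_u8, __m256i weights_i8) {
    const __m256i ones16 = _mm256_set1_epi16(1);
    __m256i prod16 = _mm256_maddubs_epi16(input_u8, weights_i8); //u8*i8 -> int16 pair sums
    __m256i prod32 = _mm256_madd_epi16(prod16, ones16);          //int16 pairs -> int32
    acc32 = _mm256_add_epi32(acc32, prod32);
}
How often the widening step can safely be skipped is exactly the overflow question raised earlier in this thread.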

Using 8 accumulators instead of 4 has another advantage:

5)
Without register spilling, it's possible to increase the internal accumulators to 8, making this function m256_haddx8 - that allows this function to never mix SSE and AVX, which is a slowdown.
https://github.com/official-stockfish/S ... #LL196C37-

For this one I can share my code.
Of course, my style is to overload this function so it does more than it says (adding biases, for example).

Code: Select all

		static inline __m256i accumulator_reduce(__m256i accs[8], __m256i bias) {
			const __m256i one = _mm256_set1_epi16(1);

			accs[0] = _mm256_hadd_epi32(_mm256_madd_epi16(accs[0], one), _mm256_madd_epi16(accs[1], one));
			accs[1] = _mm256_hadd_epi32(_mm256_madd_epi16(accs[2], one), _mm256_madd_epi16(accs[3], one));
			accs[2] = _mm256_hadd_epi32(_mm256_madd_epi16(accs[4], one), _mm256_madd_epi16(accs[5], one));
			accs[3] = _mm256_hadd_epi32(_mm256_madd_epi16(accs[6], one), _mm256_madd_epi16(accs[7], one));

			//a0 a1 a2 a3; b0 b1 b2 b3; c0 c1 c2 c3; d0 d1 d2 d3; a4 a5 a6 a7; b4 b5 b6 b7; c4 c5 c6 c7; d4 d5 d6 d7
			//e0 e1 e2 e3; f0 f1 f2 f3; g0 g1 g2 g3; h0 h1 h2 h3; e4 e5 e6 e7; f4 f5 f6 f7; g4 g5 g6 g7; h4 h5 h6 h7
			//a4 a5 a6 a7; b4 b5 b6 b7; c4 c5 c6 c7; d4 d5 d6 d7; e0 e1 e2 e3; f0 f1 f2 f3; g0 g1 g2 g3; h0 h1 h2 h3
			accs[0] = _mm256_hadd_epi32(accs[0], accs[1]);
			accs[1] = _mm256_hadd_epi32(accs[2], accs[3]);
			accs[2] = _mm256_permute2x128_si256(accs[0], accs[1], 0b100001);

			//Blend and add bias
			return _mm256_add_epi32(bias, _mm256_blend_epi32(
					_mm256_add_epi32(accs[0], accs[2]),
					_mm256_add_epi32(accs[1], accs[2]),
				0b11110000));
		}
All in all, I want to share my performance log. This is inferences per second on random positions (so including a full accumulator rebuild and no incrementality) on a single thread.
Disclaimer: some of this is not applicable when VNNI is available, but most of it is, and I can't say what the improved memory layout does for NEON, but going from many pointers to 2 aligned SIMD pointers should help.

Code: Select all

//17.04.23 0.04 MNPS
//17.04.23 0.045 MNPS
//17.04.23 0.054 MNPS
//18.04.23 0.054 MNPS
//18.04.23 0.146 MNPS
//20.04.23 0.253 MNPS
//20.04.23 0.262 MNPS
//21.04.23 0.266 MNPS
//22.04.23 0.269 MNPS
//25.04.23 3.067 MNPS
//27.04.23 3.320 MNPS
//27.04.23 3.370 MNPS
//29.04.23 4.450 MNPS
//01.05.23 4.491 MNPS
//02.05.23 4.712 MNPS
Going forward, I can say that the post format here is too limited and I should finally get around to publishing all of this knowledge on a separate blog-like website. All in all, I described 7 ideas (of around 35) relevant to NNUE performance that increased performance.

It boils down to:
1) having everything readable inside a single function with const and non-const pointers and no outside references, etc. (helps the compiler A LOT)
2) improving the memory layout and reshuffling some weights to get from the incremental layer to the output faster with better intrinsics
3) decreasing the overall cost of intrinsics by finding redundancies from domain knowledge (for example, knowing a value is strictly below 128 means you can suddenly remove some instructions because it cannot overflow an integer)
Let's assume you're not exaggerating and you've somehow actually "Made NNUE 60% faster" and Stockfish, as a consequence, some ~30% faster in my estimation. In this case, why would you start by making a lousy, self-aggrandizing post on TalkChess rather than simply forking SF, pushing your changes to a public branch and submitting a corresponding test to FishTest like every other reasonable person would do in this situation? Were you to have done this first, you would be celebrated by the community, there would be no doubt as to the truth of your claims and this conversation would actually be productive. If your claims are true, I can only presume you are a fool for proceeding in your chosen fashion.

In any case, I'll be very surprised if your proposed implementation is both correct (you've still neglected to address the proposed overflow issue) and anywhere near your claimed 60% faster.
syzygy
Posts: 5694
Joined: Tue Feb 28, 2012 11:56 pm

Re: Making NNUE 60% faster

Post by syzygy »

dangi12012 wrote: Wed Jun 07, 2023 2:16 pm For any aspiring NNUE developers, no one has offered a simple one-file solution yet. Worse -
If you read exactly how Andrew implemented NNUE, I would not put too much weight on his comments. It is one of the slower versions. https://github.com/AndyGrant/Ethereal/b ... nue/nnue.c
I've seen a lot of talk from you but nothing more.

I remember having to make this post:
forum3/viewtopic.php?p=925834#p925834
syzygy
Posts: 5694
Joined: Tue Feb 28, 2012 11:56 pm

Re: Making NNUE 60% faster

Post by syzygy »

Please do not share binaries with SF code without releasing the source code. Otherwise it is a GPL violation and you will be sued.
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Making NNUE 60% faster

Post by dangi12012 »

syzygy wrote: Sat Jun 10, 2023 5:45 pm Please do not share binaries with SF code without releasing the source code. Otherwise it is a GPL violation and you will be sued.
Ah wow, thanks - I am such a noob. I updated the zip to make sure the GPLv3 is adhered to.
The old link is invalidated.

Darmok and Jalad on the ocean
https://1drv.ms/u/s!AoDIyNlDG2QTg84rZuD ... g?e=H1Tjc1
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
Modern Times
Posts: 3703
Joined: Thu Jun 07, 2012 11:02 pm

Re: Making NNUE 60% faster

Post by Modern Times »

dangi12012 wrote: Sat Jun 10, 2023 7:17 pm

Darmok and Jalad on the ocean
Good episode that.