So in that spirit I have taken a few months and re-implemented NNUE from scratch, starting from nothing but the architecture screenshot here:
https://github.com/glinscott/nnue-pytor ... chitecture
What I want to share is applicable to AVX2 and some of AVX512, and I want to share what is possible without spilling the source code for it. I was actually surprised that the innermost hot loop of NNUE can be optimized with pure AVX2/AVX512 intrinsics, and it was a very fun journey.
After I was done, I compared what I had with what the official SF repo implements for the same ideas.
The incremental part of NNUE is optimal with good intrinsics etc. (I mean the part where we maintain 1536 int16_t by incrementally updating the state, into what they call accumulators). Having two of them is already not optimal, since both can fit in one linear array that feeds the next part.
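As a rough sketch of what I mean (sizes and names here are mine, placeholders, not from either codebase):
Code: Select all
#include <cstdint>

// Sketch: both accumulator halves live back to back in one aligned array, so the
// next layer reads one contiguous block of int16_t through a single pointer
// instead of gathering from two separate accumulator objects.
constexpr int HalfDims = 768;                  // placeholder; the real width depends on the net

struct alignas(64) Accumulator {
    std::int16_t values[2 * HalfDims];         // side-to-move half first, then the other half
};

// The next layer can then take acc.values directly as its single input pointer.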
From an architecture perspective I can tell you that ALL of NNUE (including init + file loads and hash comparison) fits in around 150 lines of code, with all of the actual non-incremental code (propagate_layers) fitting in 54 lines of readable C++.
Of course, if compilers were perfect we would be done here, but even clang 1) does not reorder memory read from a file to reshuffle it for optimal intrinsics, and 2) does not emit optimal intrinsics from loops to begin with. So add around 60 LOC for the AVX2 path.
All told it fits in a single file, which makes it maintainable, and as a .h there is no linking, which makes clang much more efficient compared to .cpp + .h.
Optimisations missing from SF repo
0) Board Layout. I had the luck that in Gigantua my layout is color agnostic, making the player's pawns always move like this: pawn >>= 1.
Incidentally it seems that NNUE prefers this as well, and SF has to jump through some hoops to align indices.
0a) Memory Layout. Having an optimal board layout means that the binary net files are not compatible and need to be reshuffled. Here I can show you exactly what my philosophy is, and I can make choices that are not possible otherwise.
This code is called for each and every change in NNUE, which is between 2 and 6 times per move:
Code: Select all
//SF schema: IndexType(orient(perspective, s, ksq) + PieceSquareIndex[perspective][pc] + PS_NB * KingBuckets[o_ksq]);
//Gigantua schema: feature_idx_own[pc][sq];
//2nd optimisation: return pointer instead of index!
https://github.com/official-stockfish/S ... hm.cpp#L30
Worse, the indices get put into a list, which is unnecessary when using a template visitor pattern. That makes the implementation of the same idea 20x faster for that snippet: 1 function, 4 lookups and some multiplications all get replaced by an instant 2D lookup.
Returning a pointer directly is also a nice speedup compared to returning an index in this case.
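To make the 2D lookup + pointer idea concrete, here is a minimal sketch under my own assumptions (table names, sizes and the init function are mine; king buckets would be folded in whenever the table is rebuilt):
Code: Select all
#include <cstdint>

constexpr int PieceNb  = 16;     // placeholder piece encoding
constexpr int SquareNb = 64;
constexpr int HalfDims = 768;    // placeholder accumulator width per perspective

// Raw first-layer weights: one HalfDims-wide column per (piece, square) feature.
static std::int16_t first_layer_weights[PieceNb * SquareNb][HalfDims];

// Built once (and rebuilt when the king bucket changes): every (piece, square)
// maps straight to a pointer at its weight column, replacing the per-update
// orient() + PieceSquareIndex[] + PS_NB * KingBuckets[] arithmetic.
static const std::int16_t* feature_idx_own[PieceNb][SquareNb];

static void init_feature_table() {
    for (int pc = 0; pc < PieceNb; ++pc)
        for (int sq = 0; sq < SquareNb; ++sq)
            feature_idx_own[pc][sq] = first_layer_weights[pc * SquareNb + sq];
}

// The incremental update then just adds/subtracts the pointed-to column:
// const std::int16_t* col = feature_idx_own[pc][sq];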
1) These definitions mean that the compiler has to iterate over multiple indices, which does not get optimized away even at -O3. You can consolidate AffineTransform and Relu into a single function.
https://github.com/official-stockfish/S ... ure.h#L117
Going forward, we can consolidate all layers into a single function definition, which is invoked with an up-to-date accumulator pointer. Notice how that is a single pointer even when NNUE updates both colors.
Code: Select all
static inline int propagate_layers(const std::int16_t* acc, int8_t* w0, int32_t* b0, int8_t* w1, int32_t* b1, int8_t* w2, int32_t* b2) noexcept
Expanding on that idea, we can even consolidate and shuffle the memory layout of the weights so that all weights sit in a linear, padded layout perfect for AVX2 or AVX512:
Code: Select all
static inline int propagate_layers(const __m256i* __restrict acc, const __m256i* __restrict w0, const __m256i* __restrict b0)
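A hedged sketch of the load-time reshuffle that makes such a signature possible (helper name and the 32-byte padding are my assumptions):
Code: Select all
#include <immintrin.h>
#include <cstdint>
#include <cstring>

// Sketch: copy the int8_t weights from the file once into a single 32-byte
// aligned, zero-padded block, so the hot loop can walk one const __m256i*
// linearly. Any interleaving required by maddubs/madd would also be done
// here, once, at load time instead of in the hot path.
static __m256i* pack_weights_avx2(const std::int8_t* file_weights, std::size_t count) {
    const std::size_t bytes = (count + 31) & ~std::size_t(31);       // pad to a multiple of 32
    __m256i* packed = static_cast<__m256i*>(_mm_malloc(bytes, 32));  // 32-byte alignment for AVX2
    std::memset(packed, 0, bytes);
    std::memcpy(packed, file_weights, count);                        // real code would also reorder here
    return packed;                                                   // release later with _mm_free
}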
2) These buffers can be removed completely. You don't need them and can work with registers directly; incidentally, the maximum usage is much smaller than defined here and fits in registers.
https://github.com/official-stockfish/S ... ture.h#L95
3) memcpy is used, which is quite slow when the domain already guarantees (but the stdlib cannot assume, so it has to run a few ifs) that the pointers are aligned and non-overlapping.
https://godbolt.org/z/15oYqMKjn
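For illustration, a copy that bakes those guarantees in (alignment, no overlap, compile-time size) so none of the runtime checks survive - a sketch, not what SF does:
Code: Select all
#include <immintrin.h>
#include <cstddef>

// When both pointers are known to be 32-byte aligned and non-overlapping and
// the size is a compile-time multiple of 32 bytes, the copy degenerates to
// plain aligned loads and stores with no dispatch logic at all.
template <std::size_t Bytes>
static inline void copy_aligned(void* __restrict dst, const void* __restrict src) {
    static_assert(Bytes % sizeof(__m256i) == 0, "size must be a multiple of 32 bytes");
    auto*       d = static_cast<__m256i*>(dst);
    const auto* s = static_cast<const __m256i*>(src);
    for (std::size_t i = 0; i < Bytes / sizeof(__m256i); ++i)
        _mm256_store_si256(d + i, _mm256_load_si256(s + i));
}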
4)
The NNUE weighted sums can be computed faster by skipping some intrinsics on AVX2.
Making this much faster: https://github.com/official-stockfish/S ... 40-L211C40
That is a throwaway sentence above, but it is the most important part right here; it has maybe 30% of the overall impact.
If you are an SF developer, read this and you will understand instantly. It is applicable for all AVX and AVX512, except with VNNI (then it is already good).
The relu layer clips the inputs to 0..128, making the widening from packed 16-bit to 32-bit accumulators unnecessary on every iteration.
So you don't need _mm256_madd_epi16 every iteration; overflow is only possible on every 32nd iteration. Skipping all of these intrinsics leads to
this perfect inner loop: acc = _mm256_add_epi16(acc, _mm256_maddubs_epi16(input_simd[m], *w0++));
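To make the shape concrete, here is a hedged sketch of that block (names are mine; the block length just has to stay below whatever bound keeps the int16 lanes from overflowing):
Code: Select all
#include <immintrin.h>

// Sketch: because the clipped inputs are small, the maddubs results are summed
// in 16-bit lanes for a whole block of iterations. The single widening
// madd(_, 1) per accumulator is deferred into the final reduction instead of
// being paid on every iteration.
static inline void dot_block_avx2(const __m256i* input_simd, const __m256i*& w0,
                                  int iterations, __m256i accs[8]) {
    for (int k = 0; k < 8; ++k)
        accs[k] = _mm256_setzero_si256();
    for (int m = 0; m < iterations; ++m)          // iterations chosen so the
        for (int k = 0; k < 8; ++k)               // int16 lanes cannot overflow
            accs[k] = _mm256_add_epi16(accs[k],
                      _mm256_maddubs_epi16(input_simd[m], *w0++));
}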
As in the sketch, that line runs for 8 accumulators. Using 8 accumulators instead of 4 has another advantage:
5)
Without register spilling it is possible to increase the number of internal accumulators to 8, making this function an m256_haddx8 - and that allows the function to never mix SSE and AVX, which is a slowdown.
https://github.com/official-stockfish/S ... #LL196C37-
For this one I can share my code.
Of course my style is to overload this function so it does more than it says (adding the biases, for example).
Code: Select all
static inline __m256i accumulator_reduce(__m256i accs[8], __m256i bias) {
    const __m256i one = _mm256_set1_epi16(1);
    accs[0] = _mm256_hadd_epi32(_mm256_madd_epi16(accs[0], one), _mm256_madd_epi16(accs[1], one));
    accs[1] = _mm256_hadd_epi32(_mm256_madd_epi16(accs[2], one), _mm256_madd_epi16(accs[3], one));
    accs[2] = _mm256_hadd_epi32(_mm256_madd_epi16(accs[4], one), _mm256_madd_epi16(accs[5], one));
    accs[3] = _mm256_hadd_epi32(_mm256_madd_epi16(accs[6], one), _mm256_madd_epi16(accs[7], one));

    //a0 a1 a2 a3; b0 b1 b2 b3; c0 c1 c2 c3; d0 d1 d2 d3; a4 a5 a6 a7; b4 b5 b6 b7; c4 c5 c6 c7; d4 d5 d6 d7
    //e0 e1 e2 e3; f0 f1 f2 f3; g0 g1 g2 g3; h0 h1 h2 h3; e4 e5 e6 e7; f4 f5 f6 f7; g4 g5 g6 g7; h4 h5 h6 h7
    //a4 a5 a6 a7; b4 b5 b6 b7; c4 c5 c6 c7; d4 d5 d6 d7; e0 e1 e2 e3; f0 f1 f2 f3; g0 g1 g2 g3; h0 h1 h2 h3
    accs[0] = _mm256_hadd_epi32(accs[0], accs[1]);
    accs[1] = _mm256_hadd_epi32(accs[2], accs[3]);
    accs[2] = _mm256_permute2x128_si256(accs[0], accs[1], 0b100001);

    //Blend and add bias
    return _mm256_add_epi32(bias, _mm256_blend_epi32(
        _mm256_add_epi32(accs[0], accs[2]),
        _mm256_add_epi32(accs[1], accs[2]),
        0b11110000));
}
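Roughly how I would expect it to be used after the 8-accumulator block above (the surrounding names come from my sketches, not from the real code):
Code: Select all
__m256i accs[8];
dot_block_avx2(input_simd, w0, iterations, accs);   // fills the 8 int16 accumulators
__m256i out32 = accumulator_reduce(accs, bias8);    // 8 widened int32 sums, biases included
// lane k of out32 is now the 32-bit pre-activation of output neuron k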
Disclaimer: some of this is not applicable when VNNI is available, but most of it is, and I can't say what the improved memory layout does for NEON, but going from many pointers to two aligned SIMD pointers should help.
Code: Select all
//17.04.23 0.04 MNPS
//17.04.23 0.045 MNPS
//17.04.23 0.054 MNPS
//18.04.23 0.054 MNPS
//18.04.23 0.146 MNPS
//20.04.23 0.253 MNPS
//20.04.23 0.262 MNPS
//21.04.23 0.266 MNPS
//22.04.23 0.269 MNPS
//25.04.23 3.067 MNPS
//27.04.23 3.320 MNPS
//27.04.23 3.370 MNPS
//29.04.23 4.450 MNPS
//01.05.23 4.491 MNPS
//02.05.23 4.712 MNPS
It boils down to:
1) having everything readable inside a single function with const and non-const pointers and no outside references etc. (this helps the compiler A LOT)
2) improving the memory layout and reshuffling some weights to get from the incremental layer to the output faster with better intrinsics
3) decreasing the overall cost of intrinsics by finding redundancies from domain knowledge (for example, knowing a value is strictly below 128 suddenly lets you remove some instructions, because an overflow cannot happen)