Making NNUE 60% faster

Discussion of chess software programming and technical issues.

Moderator: Ras

syzygy
Posts: 5694
Joined: Tue Feb 28, 2012 11:56 pm

Re: Making NNUE 60% faster

Post by syzygy »

[deleted]
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Making NNUE 60% faster

Post by dangi12012 »

Great news!

So over the weekend I had a nice conversation with some SF devs who convinced me to share all the changes that can be made on top of NNUEV5 to increase performance that much. They said that the suggested changes for index calculation look like an improvement, but they are very sceptical of a 60% performance increase. (This is a repost - the old one was deleted because it had an invalid URL.)


I agree that it's kind of ridiculous to announce a 60% increase and then not share code!

So as a start, here are binaries compiled on Ubuntu under WSL2 that everyone can run, with increased NPS compared to the commit they emerge from. As there is no way to fake playing strength, I think it's verifiable with cutechess - and from the commit id you can build and get the same SHA256 hash. GPLv3 is adhered to, and the details are in the zip.

This build is for AVX2 only - and to reiterate: if you have VNNI, the SF forward propagation code is optimal imo.

To clear a few things up - forward propagation ONLY, with a new memory layout and alternative intrinsics, can be done this much faster:

Code: Select all

VANILLA - AVX2 Codepath (c3ce2204083400267592dc088b8ad9e88aed56b1)
===========================
Total time (ms) : 1461
Nodes searched  : 3548023
Nodes/second    : 2428489
Million NNUE evals per second: 2.0348

Reimpl - AVX2 Codepath
===========================
Total time (ms) : 1394
Nodes searched  : 3548023
Nodes/second    : 2545210
Million NNUE evals per second: 2.11533
The other 5% comes from updated index calculation with the new memory layout, and the remaining 50% comes from the change below - which I did not use in SF at this point, as it is compatible only with a LOT of rework (which I have done for my own ambitions, so the last thing I will do is fishtest + merge something like this into SF).

The 1st thing that helps a lot is a color-agnostic board structure where 1 full board fits inside 1 AVX2 register and is rotated into the perspective of the player in 1 instruction. This allows for superior movegen, as we go into it with a ymm and come out of it with a ymm - but more importantly, NNUE already likes color-agnostic code, as the accumulator etc. are in US vs THEM perspective.
I mentioned it here:
https://www.talkchess.com/forum3/viewto ... is#p945750
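As a rough sketch of the idea - assuming a hypothetical nibble-per-square encoding (64 squares x 4 bits = 32 bytes = one YMM), since the exact layout is not spelled out here - a vertical flip of the ranks is a single cross-lane permute:

Code: Select all

//Sketch only: nibble-per-square board, one rank per 32-bit dword.
//The real gigantua layout may differ; flipping the piece-color bit
//of a color-aware encoding would need one extra XOR on top.
#include <immintrin.h>

static inline __m256i flip_perspective(__m256i board) {
	//Reversing the 8 rank-dwords mirrors the board vertically
	//in one cross-lane instruction (vpermd).
	const __m256i rank_reverse = _mm256_setr_epi32(7, 6, 5, 4, 3, 2, 1, 0);
	return _mm256_permutevar8x32_epi32(board, rank_reverse);
}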

2nd)
With a very large increase in memory size you can consolidate added_idx and removed_idx into a delta_index that you add or remove only once, as there are only so many possible moves for each chessboard (knight on a3 takes queen on b5, relative to the king square). That means that, on top of a quick index, every move is registered not in terms of added and removed features - but only once, as a delta set.

So we get 1.7x faster incremental updates (not 2x), because of the memory increase.
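A minimal sketch of what such a fused delta could look like - the table layout, the move-key encoding and the kHalfDims width are illustrative assumptions, not taken from this post:

Code: Select all

//Sketch only: assumes a table built at init where each row is
//delta[key] = ft_weights[added_feature] - ft_weights[removed_feature],
//so an incremental update touches the accumulator once, not twice.
#include <immintrin.h>
#include <cstdint>
#include <cstddef>

constexpr int kHalfDims = 512;  //illustrative accumulator width

static inline void apply_move_delta(int16_t* acc, const int16_t* delta_rows,
                                    size_t move_key) {
	const int16_t* d = delta_rows + move_key * kHalfDims;
	for (int i = 0; i < kHalfDims; i += 16) {  //16 x int16 per YMM
		__m256i a = _mm256_loadu_si256((const __m256i*)(acc + i));
		__m256i v = _mm256_loadu_si256((const __m256i*)(d + i));
		_mm256_storeu_si256((__m256i*)(acc + i), _mm256_add_epi16(a, v));
	}
}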

3rd) For the remaining peanuts up to 60%, we can for example load all 6 pointers for weights and biases into 2 linearized memory arrays (taking padding into account for L1):
static inline int propagate_layers(const __m256i* restrict w, const __m256i* restrict b)
meaning that all the index calculation inside propagation can go away - we just do *w++ and *b++ any time we use a weight or a bias in propagation.
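A self-contained sketch of that pointer-bump idea, with illustrative int16 weights and layer shapes (the real SF layers use int8 quantization and different sizes):

Code: Select all

//Sketch only: weights and biases of all layers are concatenated at init
//into two contiguous streams, so propagation consumes them strictly in
//order via *w++ / *b++ - no per-layer index math remains.
#include <immintrin.h>
#include <cstdint>

static inline int32_t hsum_epi32(__m256i v) {
	__m128i s = _mm_add_epi32(_mm256_castsi256_si128(v),
	                          _mm256_extracti128_si256(v, 1));
	s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(1, 0, 3, 2)));
	s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1)));
	return _mm_cvtsi128_si32(s);
}

//One int16 affine layer: n_out neurons over n_in_regs YMMs of input.
static inline void affine_layer(const __m256i*& w, const int32_t*& b,
                                const __m256i* in, int n_in_regs,
                                int32_t* out, int n_out) {
	for (int o = 0; o < n_out; ++o) {
		__m256i acc = _mm256_setzero_si256();
		for (int i = 0; i < n_in_regs; ++i)
			acc = _mm256_add_epi32(acc, _mm256_madd_epi16(*w++, in[i]));
		out[o] = hsum_epi32(acc) + *b++;  //streams consumed in order
	}
}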

This post was meant as an inspirational post, to show that every system can always be improved: change some memory layouts and use different AVX2 instructions and you can be faster than SF. The silent lurkers have seen some merit in this. To be honest, I would not even have tried if it were not for that memcpy in the middle of the hot loop of SF: https://github.com/official-stockfish/S ... ure.h#L117
where I thought to myself - why don't we store into the right slot to begin with, etc.? Some of these code snippets have been unchanged for 22 months, so of course they can be improved.

So the full code will land in this repo: https://github.com/Gigantua/Chess_Movegen - we will add a new section for NNUE comparison, where different implementations of a specific network can be tested in isolation against each other in terms of million evals/second.

Summary:
NNUE has 3 parts that can be improved -

Incremental Update (50%)
FeatureTransform (5%)
Propagation (5%)

The binary proves that propagation can be done 5% faster with a better memory layout, and there are some intrinsics that can be used to shave off a few instructions. By reshuffling the weights for the feature transform we gain 5% there too.

This outputs in the same feature-transform format - but the permute op can be moved to init(). This removal, for example, is not done in SF's master, which calls vec_msb_pack_16; there we can remove the _mm256_permute4x64_epi64 and reduce runtime cost.

Code: Select all

//Featuretransform - without weightshuffle during init
for (int j = 0; j < 512 / 16; ++j) {
	auto sum0 = _mm256_max_epi16(zero, _mm256_min_epi16(max, own1[j]));
	auto sum1 = _mm256_max_epi16(zero, _mm256_min_epi16(max, own2[j]));
	auto sum2 = _mm256_max_epi16(zero, _mm256_min_epi16(max, opp1[j]));
	auto sum3 = _mm256_max_epi16(zero, _mm256_min_epi16(max, opp2[j]));
	
	//We could permute Layer0 weights during init to skip the permute and the separate lower/upper 128-bit stores completely!
	//Optimisation into a direct store is possible (removal of the permute; own/opp interleaved in the input weights).
	//Summary: removal of 2 instructions (costing 2 cpi) is possible if we do some work at init.
	auto result = _mm256_permute4x64_epi64(
		_mm256_packus_epi16(
			_mm256_srli_epi16(_mm256_mullo_epi16(sum0, sum1), 7), 
			_mm256_srli_epi16(_mm256_mullo_epi16(sum2, sum3), 7)),
		_MM_SHUFFLE(3, 1, 2, 0));

	own_half[j] = _mm256_castsi256_si128(result);
	opp_half[j] = _mm256_extractf128_si256(result, 1);
}
And with full vectorisation, a better memory layout, and weights shuffled during loading time - this is equivalent to both loops from here, at once: https://github.com/official-stockfish/S ... mer.h#L274

Code: Select all

//Featuretransform
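//Assumption from the surrounding post: the next layer's input weights were
//pre-permuted at load time to match packus lane order, so the result can be
//stored directly - no _mm256_permute4x64_epi64 and no 128-bit half stores.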
for (int j = 0; j < 512 / 16; ++j) {
	auto sum0 = _mm256_max_epi16(zero, _mm256_min_epi16(max, acc[j]));
	auto sum1 = _mm256_max_epi16(zero, _mm256_min_epi16(max, acc[j + 32]));
	auto sum2 = _mm256_max_epi16(zero, _mm256_min_epi16(max, acc[j + 64]));
	auto sum3 = _mm256_max_epi16(zero, _mm256_min_epi16(max, acc[j + 96]));

	input_simd[j] = _mm256_packus_epi16(
		_mm256_srli_epi16(_mm256_mullo_epi16(sum0, sum1), 7),
		_mm256_srli_epi16(_mm256_mullo_epi16(sum2, sum3), 7));
}
More information is in the readme - including the validation layer that shows the absence of overflows. With any overflows, the node count in bench() would very likely not match either! Source code available! :D
https://1drv.ms/u/s!AoDIyNlDG2QTg84rZuD ... g?e=H1Tjc1
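For reference, a minimal sketch of what such a validation layer could look like (an illustration, not the code from the zip): rebuild a slow int32 reference accumulator from scratch and compare it with the incremental int16 path, which would wrap silently on overflow.

Code: Select all

//Sketch only - not the author's actual validation code.
#include <cstdint>
#include <cassert>

void validate_accumulator(const int16_t* incremental,
                          const int32_t* reference, int n) {
	for (int i = 0; i < n; ++i) {
		//reference is recomputed from scratch in int32, so any value
		//outside the int16 range means the incremental path overflowed
		assert(reference[i] >= INT16_MIN && reference[i] <= INT16_MAX);
		assert(incremental[i] == reference[i]);
	}
}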
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
Sopel
Posts: 391
Joined: Tue Oct 08, 2019 11:39 pm
Full name: Tomasz Sobczyk

Re: Making NNUE 60% faster

Post by Sopel »

The pattern continues. You promise X, show X*0.1, and provide nothing of actual usefulness.
dangi12012 wrote: No one wants to touch anything you have posted. That proves you now have a negative reputation, since everyone already knows you are a forum troll.

Maybe you copied your stockfish commits from someone else too?
I will look into that.
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Making NNUE 60% faster

Post by dangi12012 »

Sopel wrote: Sun Jun 11, 2023 12:01 am The pattern continues. You promise X, show X*0.1, and provide nothing of actual usefulness.
No my friend, you are the one who claimed none of this is even possible. When asked, I delivered.
Yes, the logs on page 1 are running the same code, and yes, even a lot faster - and all the ideas are listed with their appropriate effect. A full rewrite of SF for color agnosticism and a move to a 90-degree-rotated YMM board is nothing I have to prove to strangers. It's already running nicely in gigantua. All of this will be ported to CUDA and C# soon enough. You cannot discuss this away, and I am looking forward to seeing some of this land in the main branch one way or another, now that the source code - but more importantly the improvement ideas - are validatable and public.

[Moderation] Please do not try to chastise others for charter violations that have already been moderated away; this should be left to the moderators.

Also, since we are now talking to the mr/ms who only reads and never posts - YES, YOU! - I wanted to share this.
Search algo in micromice - maybe mate search?:
This video reminded me of poisoned chess moves:
And here we have examples of what you can do when you allow yourself to think fixed concepts anew (yes they work)

Greetings.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
syzygy
Posts: 5694
Joined: Tue Feb 28, 2012 11:56 pm

Re: Making NNUE 60% faster

Post by syzygy »

dangi12012 wrote: Sun Jun 11, 2023 12:59 am For the lurking reader - you, the actual audience :) - this person is responsible for writing some of the original code, so it's hard to accept improvements.
Improvements are made all the time and people are generally happy to see their code being sped up or otherwise improved.

Indeed this does not apply to your case. Ask yourself why.
Modern Times
Posts: 3703
Joined: Thu Jun 07, 2012 11:02 pm

Re: Making NNUE 60% faster

Post by Modern Times »

dangi12012 wrote: Sun Jun 11, 2023 12:59 am When asked I delivered.
Well, another option for you - ShashChess is perhaps the most well-known Stockfish derivative or fork. Supply your source code changes to the person who administers ShashChess and he may be willing to incorporate them.
connor_mcmonigle
Posts: 544
Joined: Sun Sep 06, 2020 4:40 am
Full name: Connor McMonigle

Re: Making NNUE 60% faster

Post by connor_mcmonigle »

dangi12012 wrote: Sat Jun 10, 2023 10:36 pm ...

Code: Select all

VANILLA - AVX2 Codepath (c3ce2204083400267592dc088b8ad9e88aed56b1)
===========================
Total time (ms) : 1461
Nodes searched  : 3548023
Nodes/second    : 2428489
Million NNUE evals per second: 2.0348

Reimpl - AVX2 Codepath
===========================
Total time (ms) : 1394
Nodes searched  : 3548023
Nodes/second    : 2545210
Million NNUE evals per second: 2.11533
...
So a ~4% overall speedup, which should equate to maybe 2-3 Elo, assuming the proposed overflow issue isn't actually a problem in practice. This falls well short of your claimed "NNUE 60% faster" by my calculations, but is still a nice improvement all the same. It's not too uncommon for Stockfish to get such optimization patches. What is uncommon is the author of said patch creating a series of self-congratulatory posts on TalkChess and turning the release of the associated source code into a multi-week process. Just submit your patch to fishtest instead of subjecting everyone to all this nonsense next time around.

Just look at how dramatically the efforts of Sopel and others have improved the NPS of Stockfish historically:
https://github.com/official-stockfish/S ... provements
This should put your ~4% speed up into context.
connor_mcmonigle
Posts: 544
Joined: Sun Sep 06, 2020 4:40 am
Full name: Connor McMonigle

Re: Making NNUE 60% faster

Post by connor_mcmonigle »

dangi12012 wrote: Sat Jun 10, 2023 10:36 pm ...
2nd)
With a very large increase in memory size you can consolidate added_idx and removed_idx into a delta_index that you add or remove only once, as there are only so many possible moves for each chessboard (knight on a3 takes queen on b5, relative to the king square). That means that, on top of a quick index, every move is registered not in terms of added and removed features - but only once, as a delta set.

So we get 1.7x faster incremental updates (not 2x), because of the memory increase.
...
Have you actually tested this beyond a toy benchmark? I tried this in my engine several months ago and the reduction in the number of feature transformer updates was not worth the increased cache pressure. More recently, in Berserk, a very small "delta cache" was attempted (-> adjacent positions in the search tree share many deltas), but this idea also did not bear fruit in practice.
Sopel
Posts: 391
Joined: Tue Oct 08, 2019 11:39 pm
Full name: Tomasz Sobczyk

Re: Making NNUE 60% faster

Post by Sopel »

[Moderation] Quoting of deleted material removed.

No my friend, you are the one who claimed none of this is even possible.
1. Why are you lying? 2. Am I wrong?

The only thing you have shown so far is that it's possible to pre-permute the weights for a questionable gain. Yes, we've known about this since the beginning of NNUE. Even Cfish did it. It's just an unmaintainable mess.


edit. I'll also include an answer to your next message because I don't feel like sitting here all the time. Yes, you did do int16 accumulators. It is an invalid optimization; it has now been said at least 3 times why.
dangi12012 wrote: No one wants to touch anything you have posted. That proves you now have a negative reputation, since everyone already knows you are a forum troll.

Maybe you copied your stockfish commits from someone else too?
I will look into that.
AndrewGrant
Posts: 1955
Joined: Tue Apr 19, 2016 6:08 am
Location: U.S.A
Full name: Andrew Grant

Re: Making NNUE 60% faster

Post by AndrewGrant »

I've been permuting input weights on startup to avoid a later permute since the inception of NNUE in Ethereal.
You should have read my code a little closer before calling it one of the "slower implementations". :D