Making NNUE 60% faster

dangi12012 · Post by **dangi12012** » Sun Jun 11, 2023 2:17 pm

AndrewGrant wrote: ↑Sun Jun 11, 2023 1:11 pm I've been permuting input weights on startup to avoid a later permute since the inception of NNUE in Ethereal.
Should have read my code a little closer before calling it one of the "slower implementations".

Well nice of you to call most ideas "bunk" in post 1 when you have used these ideas yourself independently and know they are valid. Makes you a liar if nothing else. There are many parts to increase performance, not just this - but 10 smaller ideas together are getting you forward etc..

What I did in this thread was calling out what sprung to my eye and where I think SF code can be improved.
To add the full context for readers - and also documented in the ZIP i posted earlier. You can shuffle L0 for a smoother Featuretransform:
https://github.com/AndyGrant/Ethereal/b ... nue.c#L110

Same comments in zip file:

Code: Select all

//We could permute Layer0 weights during init to skip permute and storing lower and upper 128 bits completely!
//optimisation into two a direct stores possible. (removal of permute and interleaved own, opp in input weights)
//summary: removal of 2 instructions (costing 2 cpi) possible if we do some work on init and linear memory access during runtime!
//we just create a mapping table to permute all 1024 weights now.

Absent in SF leading to the need to permute here.
https://github.com/official-stockfish/S ... rmer.h#L82
Also this is inside a hot loop - so really worth the offort to shuffle on init. Do I want to create a fishtest commit for it? No. Probably for the same reason Andrew didnt either.

Other issue 1:
Yes the delta update means consolidating a remove and an add into a single delta and its much faster, in a benchmark where we see how fast we can do nnue eval() on all silent nodes (without AB essentially in perfT). Of course this depends on the existing cache pressure as well - so PEXT movegen vs KGSSB or Galoisfield movegen makes a difference as well as other cache pressure sources, but to see the difference (hand waving argument) performance on page1 vs performance above is the valid indicator.

Other issue 2:
And to adress this the 10th time - if there was an overflow bug the bench() values would already missmatch! There is a way to circumvent for the 10th time. By using 8 accumulators, splitting single loops into two parts and among other things you can see in the sourcecode that I clamp some values to -8192, 8192. Yes it can be done and removes some instructions from a hot loop. If not I will the first to admit mistake. Its a small improvement and maybe 3% increase in performance, but for a part of code that hasnt changed in 22 months its a valid improvement.

Code: Select all

//Second optimisation - work on 8 accumulators at a time. 2 iterations for all 30 relu and relusqr bits
//This is correct for 99.98% of positions but once every few 10000 the mullo_epi32 would need more than 32bits. 
//_mm256_srli_epi32(_mm256_mullo_epi32(sum8, sum8), 19) 
//Solution:
//sum*sum >> 19 is smaller than 127 (2^7) if and only if sum < 8192 - so we clamp it and dont get overflows. 
//sum >> 6 is not impacted by this clamping.

Doing a minmax analysis proves this on the actual weights of the actual trunk networks.

Summary
From my perspective I have shared everything I wanted to and from my POV this is closed - and the real interesting part will be the comparison of implementations inside C++ and finally in different languages like C# of at least SFNNUEv5 and above in single header files without external dependencies.
Where stuff like __viaddmax_s16x2_relu from here: https://docs.nvidia.com/cuda/cuda-math- ... ccd2894937 will be cool to test.
Probably a unified interface for static classes like this:

Code: Select all

template <typename T>
concept is_eval_capable = requires {
    { T::AddPiece(int, int, int) } -> std::same_as<void>;
    { T::RemovePiece(int, int, int) } -> std::same_as<void>;
    { T::TakesPiece(int, int, int, int) } -> std::same_as<void>;
};

connor_mcmonigle · Post by **connor_mcmonigle** » Sun Jun 11, 2023 6:14 pm

dangi12012 wrote: ↑Sun Jun 11, 2023 2:17 pm
Other issue 1:
Yes the delta update means consolidating a remove and an add into a single delta and its much faster, in a benchmark where we see how fast we can do nnue eval() on all silent nodes (without AB essentially in perfT). Of course this depends on the existing cache pressure as well - so PEXT movegen vs KGSSB or Galoisfield movegen makes a difference as well as other cache pressure sources, but to see the difference (hand waving argument) performance on page1 vs performance above is the valid indicator.

Other issue 2:
And to adress this the 10th time - if there was an overflow bug the bench() values would already missmatch! There is a way to circumvent for the 10th time. By using 8 accumulators, splitting single loops into two parts and among other things you can see in the sourcecode that I clamp some values to -8192, 8192. Yes it can be done and removes some instructions from a hot loop. If not I will the first to admit mistake. Its a small improvement and maybe 3% increase in performance, but for a part of code that hasnt changed in 22 months its a valid improvement.
Code: Select all
//Second optimisation - work on 8 accumulators at a time. 2 iterations for all 30 relu and relusqr bits
//This is correct for 99.98% of positions but once every few 10000 the mullo_epi32 would need more than 32bits. 
//_mm256_srli_epi32(_mm256_mullo_epi32(sum8, sum8), 19) 
//Solution:
//sum*sum >> 19 is smaller than 127 (2^7) if and only if sum < 8192 - so we clamp it and dont get overflows. 
//sum >> 6 is not impacted by this clamping.
Doing a minmax analysis proves this on the actual weights of the actual trunk networks.

The bench matching is not a sufficient condition for a rare overflow condition not existing. As far as I can tell, you've still not published the source code anywhere. Rather you've asked anyone who wants it to email you which is unreasonable and, consequently, it is unreasonable to expect readers to know the specific details of your implementation when the source code is still not publicly available.

Regarding the delta consolidation idea, in practice, any modern engine is going to have far more significant contributors to cache pressure than move generation LUTs anyways. I'm rather confident this optimization, as described, won't bear fruit in an actual chess engine (which is the whole point).

chrisw · Post by **chrisw** » Sun Jun 11, 2023 7:09 pm

Is the problem that he is claiming an improvement, or that he is claiming an improvement without providing the code? In ye olde days it was just fine to claim improvements without providing code, the proof (or not) would be in the sometime-to-be-provided executable. Is it reasonable to demand he back his claims up with source code?

Sopel · Post by **Sopel** » Sun Jun 11, 2023 9:52 pm

chrisw wrote: ↑Sun Jun 11, 2023 7:09 pm Is the problem that he is claiming an improvement, or that he is claiming an improvement without providing the code? In ye olde days it was just fine to claim improvements without providing code, the proof (or not) would be in the sometime-to-be-provided executable. Is it reasonable to demand he back his claims up with source code?

This is a matter of "extraordinary claims require extraordinary evidence", exacerbated by intentional malicious withholding of evidence.

Sopel · Post by **Sopel** » Sun Jun 11, 2023 9:57 pm

dangi12012 wrote: ↑Sun Jun 11, 2023 2:17 pm And to adress this the 10th time - if there was an overflow bug the bench() values would already missmatch!

Okay, maybe 4th time is the charm, please tell me whether you agree with the following assessment, or not, and provide your reasoning:

Code: Select all

acc = _mm256_add_epi16(acc, _mm256_maddubs_epi16(input_simd[m], *w0++));

`input_simd[m]` is in range <0, 127>, because it is after activation

`*w0` is in range <-128, 127> because it's the weight

`input_data * *w0` is therefore in range <-128, 127> * <0, 127> == <-16256, 16129>

therefore `_mm256_maddubs_epi16` gives values in range <-128*127 == -16256*2, 127*127 == 16129*2>

acc is any previous result, so after the first iteration is it in range <-16256*2, 16129*2>

so in the second iteration the addition, in the worst case, results in `<-16256*2, 16129*2>` * 2 == <-65024, 64516> > <-2^15, 2^15-1>

so it can overflow on the third iteration. You're doing 32.

AndrewGrant · Post by **AndrewGrant** » Mon Jun 12, 2023 5:29 pm

Sopel wrote: ↑Sun Jun 11, 2023 9:57 pm
dangi12012 wrote: ↑Sun Jun 11, 2023 2:17 pm And to adress this the 10th time - if there was an overflow bug the bench() values would already missmatch!
Okay, maybe 4th time is the charm, please tell me whether you agree with the following assessment, or not, and provide your reasoning:
Code: Select all
acc = _mm256_add_epi16(acc, _mm256_maddubs_epi16(input_simd[m], *w0++));
`input_simd[m]` is in range <0, 127>, because it is after activation

`*w0` is in range <-128, 127> because it's the weight

`input_data * *w0` is therefore in range <-128, 127> * <0, 127> == <-16256, 16129>

therefore `_mm256_maddubs_epi16` gives values in range <-128*127 == -16256*2, 127*127 == 16129*2>

acc is any previous result, so after the first iteration is it in range <-16256*2, 16129*2>

so in the second iteration the addition, in the worst case, results in `<-16256*2, 16129*2>` * 2 == <-65024, 64516> > <-2^15, 2^15-1>

so it can overflow on the third iteration. You're doing 32.

It _tends_ not to overflow until much deeper. which is why he thinks whatever he is doing works. But I don't think Stockfish devs care about "tends" to be correct, and prefer correct in theory.

chrisw · Post by **chrisw** » Mon Jun 12, 2023 7:07 pm

AndrewGrant wrote: ↑Mon Jun 12, 2023 5:29 pm
Sopel wrote: ↑Sun Jun 11, 2023 9:57 pm
dangi12012 wrote: ↑Sun Jun 11, 2023 2:17 pm And to adress this the 10th time - if there was an overflow bug the bench() values would already missmatch!
Okay, maybe 4th time is the charm, please tell me whether you agree with the following assessment, or not, and provide your reasoning:
Code: Select all
acc = _mm256_add_epi16(acc, _mm256_maddubs_epi16(input_simd[m], *w0++));
`input_simd[m]` is in range <0, 127>, because it is after activation

`*w0` is in range <-128, 127> because it's the weight

`input_data * *w0` is therefore in range <-128, 127> * <0, 127> == <-16256, 16129>

therefore `_mm256_maddubs_epi16` gives values in range <-128*127 == -16256*2, 127*127 == 16129*2>

acc is any previous result, so after the first iteration is it in range <-16256*2, 16129*2>

so in the second iteration the addition, in the worst case, results in `<-16256*2, 16129*2>` * 2 == <-65024, 64516> > <-2^15, 2^15-1>

so it can overflow on the third iteration. You're doing 32.
It _tends_ not to overflow until much deeper. which is why he thinks whatever he is doing works. But I don't think Stockfish devs care about "tends" to be correct, and prefer correct in theory.

SF devs were historically only interested in risk analysis, if it worked (gave Elo) it stayed in - which seems reasonable. Overflow isn't going to break anything (assuming the NNUE out is clamped to something sensible).

Joost Buijs · Post by **Joost Buijs** » Mon Jun 12, 2023 8:00 pm

With saturated arithmetic an occasional overflow is not so severe, AVX and AVX2 are such a mess, some instructions have saturated counterparts and some don't, this seems to be completely arbitrary.

To be on the safe side I only use 16x16->32 bit instructions, on my machines this is maybe 10% slower overall, and much easier to handle. The error with 8 bit weights is larger too, just not enough resolution, maybe using Quantization Aware Training could improve this, but I'm just to lazy to implement this.

dangi12012 · Post by **dangi12012** » Mon Jun 12, 2023 8:31 pm

Finally we arrive at a point where a specific question is asked and I have the feeling that this thread is productive and intellectually honest.
Let me put one different point aside first - above is the argument that consolidation of added_idx and removed_idx increases cache pressure - its exactly the opposite. Instead of having to load 2 independent cache lines in chunks of 1024 shorts = 2048byte we only need to do that once. reading half as much memory and at 1 location instead of 2. Together with faster idx calculation that is the most substantial improvement.

Background context:
I created this post to share what I found when creating a binary compatible version of silent eval for my own projects. The ideas around not doing memcpy inside NNUE when you could emit into the right memory location to begin with all stand - the improvement of FT also stands an everyone who is interested has access to the sourcecode and due to GPL3 you can even host it. Yes I have a 60% faster NNUE in my project - no that does not mean I can increase SFs nps by 30% and thats not what I claimed. Guess what - when using cuda it will be much more than 60% it could be many orders of magnitude. That does not mean its applicable or portable to SF.

What my claim is EXACTLY: There exists a binary compatible version of NNUE that can work with NNUEv5 weigths to produce the same eval as the reference implementation of the official SF repo is written in c++ but can do incremental updates + silent eval 60% faster. In the future to repeat this it can grow more on different languages etc. This claim has nothing to do with stockfish at all. My thread was intended to share what is universally improvable and to give SF devs a hint that the end is not reached!
In the future I might create a thread to say I made NNUE 20x faster. This claim would not mean that Stockfish suddenly can run gpu shaders. So lets take that inspiration and move forward.

Now we will have a "Sokath, his eyes uncovered" moment together.
Regarding maddubs.

There are 2 reasons why this works:
1) You keep talking about overflows - _mm256_maddubs_epi16 is using signed saturation arithmetic. So the notion of overflow is wrong. When you add too much your result is limited to the types integer limit and now your result is wrong going forward. This sentence will be 1 essence and I am proud to have found it. No the result is not wrong by default - you will clamp to 127 soon! - So you also first have to reach an integer limit AND you have to come back to 127 or below for the wrong result to be Actually wrong. As in i dont care if we calculate 4080 or 4081 and clamp to 127.
So the signed saturation arithmetic saves us and makes this possible. No overflow - a few wrong bits but we are above 127 or below anyways... This is essence 1. (Its more complicated because we dont do relu instantly but square and relu that but thats why in my code you can see I have a very specific clamping value not found anywhere else)

2) Is that with the knowledge of that we have to go up so far to reach the limit - and then back down to reach into clamping and I mentioned this: You have to do minmax analysis and you wont like this: on the domain knowledge of the weights themselves for a particular network. This way you get the leeway to go up to 32, 16, 8, or only 4 iterations depending on what the trainer gave you. This is done beforehand when picking a network and you see the code too. This insight came from when I tried to make NNUE a C++ compiled network - meaning 0, 1, -1 and all powers of 2 get replaced by c++ code for exactly that but that didnt go anywhere. Compile times were horrendous code size insane and avx2 we dont care so much when we multiply and 1/16 slots is zero, so avx2 is still faster. I got sidetracked so lets print what we actually work with (and here we assume 126 as the maximum for the righthand side and the proof is at the bottom on that point).

So lets print the actual calculations shall we:

Code: Select all

Sum over input dimensions
        2904 += -4x0
        2800 += -4x26
        2800 += -4x0
        2800 += -4x0
        2800 += -4x0
        2800 += -4x0
        2800 += -4x0
        2800 += -4x0
        2800 += -4x0
        2800 += -4x0
        2800 += -4x0
        2800 += -4x0
        2800 += -4x0
        2800 += -4x0
        2800 += -4x0
        2800 += -4x0
        2800 += -4x0
        2800 += -4x0
        2821 += -4x21
        2821 += -4x0
        2821 += -4x0
        2821 += -4x0

Eminating from here:

Code: Select all

for (int i = 0; i < 15; ++i) {
	int8_t* pos_ptr = w0 + i * Layer0::PaddedInputDimensions;
	std::int32_t sum = b0[i];
	std::cout << "Sum over input dimensions\n";
	for (int j = 0; j < Layer0::InputDimensions; ++j) {
		sum += pos_ptr[j] * input[j];
		max_weight = std::max(max_weight, (int)pos_ptr[j]);
		min_weight = std::min(min_weight, (int)pos_ptr[j]);
		std::cout << "\t" << sum << " += " << (int)pos_ptr[i] << "x" << (int)input[j] << "\n";
	}
	input0[i] = static_cast<uint8_t>(std::max(0ll, std::min(127ll, (((long long)sum * sum) >> (2 * weight_scale)) / 128)));
	input1[i] = static_cast<uint8_t>(std::max(0, std::min(127, sum >> weight_scale)));
}
std::cout << "DONE";

Minmax analysis is to set the righthand side to 126 by default - and 0 for negative values and vice versa. But the lefthand side is known for all published NNUEv5 sets. Minamx analysis was the first thing I mentioned. That means the domain knowledge of the specific network weights too. What saves us from some overflows is that we work with 16 shorts at a piece meaning even when we jump 16 inputs at a time and a individual avx2 slot sees many different weights after seeing a big 127 or -127 too. This can be Proven. Even when we reach the integer limit of signed 16bit arithmetic we have the chance to prove that its impossible to come back to 127 in due time most of the time!

Indeed that is still not enough so we have to limit the number of executions further AND have to split the outer loop into two parts. The number 32 is not some random coincidence its the number of valid iterations is valid for all sfnnuev5 networks and I tested. If minmax analysis with loops and split execution would yield 16 than thats the proof that we can safely use 16 iterations. In essence we have a program to do minmax analysis that tells us how often we can do _mm256_maddubs_epi16 before converting to bigger epi32s. The number of correct iterations could be 16 too for another set of weights. Good thing is we can check that when loading weights or have the nnue hash baked into a header.

Summary why you can trust my answer: One of the lead devs in SF said this above - `input_simd[m]` is in range <0, 127>, because it is after activation. Very first sentence already mistaken. In essence we clamp two values to 0..127 multiply them.. and divide by 128 for the FT. This yields a maximum of 126.

I thought this through from beginning to end this is not some random half thought through post also - thats why results match show and show that too. Statistical arguments do not hold as we want to be correct. Signed arithmetic, jumping 16 indices at a time from the perspective of a avx2 slot - and hitting the limit means having to come down again is the proof and indeed and after all that also splitting loops and having the minmax analysis tell us how often we can iterate. Finally it all just checks out:

Code: Select all

SF Vanilla				SF Reimpl
===========================             ===========================
Total time (ms) : 18539                 Total time (ms) : 18031
Nodes searched  : 42575839              Nodes searched  : 42575839
Nodes/second    : 2296555               Nodes/second    : 2361257
Million NNUE evals per second: 2.01711  Million NNUE evals per second: 2.08054
bench 512 1 20

Please read everything carefully and if you have further questions let me know.
Greetings - Daniel

Sopel · Post by **Sopel** » Mon Jun 12, 2023 10:34 pm

The issue was significant enough that we've hit it with just 2 maddubs results added together (yes, we actually did, someone reported it, it's buried somewhere in the issues). I already linked it (https://github.com/official-stockfish/S ... 7ceb83dbf8), to which you responded with an onion emote. You're definitely right that there's a subset of networks that can accumulate more than 1 iteration into int16, but at this point the gains are too small to be worth the complexity of identifying them, and potentially fixing (if at all possible) the unsuitable networks.

Next time please specify the exact assumptions you're working under.

Making NNUE 60% faster

Re: Making NNUE 60% faster

Re: Making NNUE 60% faster

Re: Making NNUE 60% faster

Re: Making NNUE 60% faster

Re: Making NNUE 60% faster

Re: Making NNUE 60% faster

Re: Making NNUE 60% faster

Re: Making NNUE 60% faster

Re: Making NNUE 60% faster

Re: Making NNUE 60% faster