Well, nice of you to call most of these ideas "bunk" in post 1 when you have used them yourself independently and know they are valid. That makes you a liar if nothing else. There are many parts to increasing performance, not just this one - but ten smaller ideas together move you forward.

AndrewGrant wrote: ↑Sun Jun 11, 2023 1:11 pm
I've been permuting input weights on startup to avoid a later permute since the inception of NNUE in Ethereal.

You should have read my code a little closer before calling it one of the "slower implementations".
What I did in this thread was call out what sprang to my eye and where I think the SF code can be improved.
To add full context for readers - this is also documented in the ZIP I posted earlier - you can shuffle L0 for a smoother feature transform:
https://github.com/AndyGrant/Ethereal/b ... nue.c#L110
The same comments are in the ZIP file:
Code: Select all
//We could permute the Layer0 weights during init to skip the permute and the split storing of the lower and upper 128 bits completely!
//Optimisation into two direct stores is possible (removal of the permute and of the interleaved own/opp layout in the input weights).
//Summary: removal of 2 instructions (costing 2 cpi) is possible if we do some work on init and keep linear memory access during runtime!
//We just create a mapping table to permute all 1024 weights on init.
https://github.com/official-stockfish/S ... rmer.h#L82
Also, this is inside a hot loop - so it is really worth the effort to shuffle on init. Do I want to create a fishtest commit for it? No. Probably for the same reason Andrew didn't either.
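For readers who want the concrete shape of this idea, here is a minimal sketch (all names hypothetical - this is not the actual Ethereal or Stockfish code) of permuting the L0 weights once at init via a mapping table, so the hot loop can use plain linear loads and stores instead of a runtime permute:

Code: Select all

#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical: order[o] gives the stored index of the weight that the hot
// loop wants at output slot o - built once at init, e.g. to undo the
// 128-bit lane interleave left behind by _mm256_packs_epi16.
void permute_l0_weights(int16_t* weights, int inputs, int outputs,
                        const std::vector<int>& order)
{
    std::vector<int16_t> tmp(outputs);
    for (int i = 0; i < inputs; ++i) {
        int16_t* row = weights + static_cast<std::ptrdiff_t>(i) * outputs;
        for (int o = 0; o < outputs; ++o)
            tmp[o] = row[order[o]];   // gather into the permuted order
        for (int o = 0; o < outputs; ++o)
            row[o] = tmp[o];          // write back once - never again at runtime
    }
}

After this one-time shuffle the runtime path is just two direct 256-bit stores with no permute in between.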
Other issue 1:
Yes, the delta update means consolidating a remove and an add into a single delta, and it is much faster in a benchmark that measures how fast nnue eval() can run on all quiet nodes (essentially perft without AB). Of course this also depends on the existing cache pressure - PEXT movegen vs. KGSSB or Galois-field movegen makes a difference, as do other sources of cache pressure - but to see the difference (hand-waving argument), the performance on page 1 vs. the performance above is the valid indicator.
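To make the hand-waving concrete, here is a minimal sketch (assumed layout, hypothetical names - not anyone's actual implementation) of consolidating a remove and an add into a single delta: one linear pass that touches every cache line of the accumulator once, where two separate sub/add loops would touch each line twice.

Code: Select all

#include <cstdint>

constexpr int HIDDEN = 1024; // assumed accumulator width

void delta_update(int16_t* acc,
                  const int16_t* w_removed,  // weights of the removed feature
                  const int16_t* w_added)    // weights of the added feature
{
    // Compilers auto-vectorize this single pass; the arithmetic is done
    // in int before narrowing back to int16.
    for (int i = 0; i < HIDDEN; ++i)
        acc[i] = static_cast<int16_t>(acc[i] + w_added[i] - w_removed[i]);
}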
Other issue 2:
And to address this for the 10th time: if there were an overflow bug, the bench() values would already mismatch! There is a way to circumvent it, as said for the 10th time: by using 8 accumulators, splitting single loops into two parts, and - among other things you can see in the source code - clamping some values to [-8192, 8192]. Yes, it can be done, and it removes some instructions from a hot loop. If not, I will be the first to admit the mistake. It is a small improvement, maybe a 3% increase in performance, but for a part of the code that hasn't changed in 22 months it is a valid improvement.
Code: Select all
//Second optimisation - work on 8 accumulators at a time: 2 iterations for all 30 relu and relusqr bits.
//This is correct for 99.98% of positions, but once every few 10000 positions the mullo_epi32 would need more than 32 bits:
//_mm256_srli_epi32(_mm256_mullo_epi32(sum8, sum8), 19)
//Solution:
//sum*sum >> 19 is at most 127 (< 2^7) if and only if sum < 8192 - so we clamp it and don't get overflows.
//sum >> 6 is not impacted by this clamping.
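As a minimal sketch of that clamp (assuming AVX2 and 8 x int32 sums per register - not the actual source, which you can check yourself):

Code: Select all

#include <immintrin.h>

// With sum clamped to [-8192, 8192], sum*sum is at most 2^26, which fits
// exactly in 32 bits, so _mm256_mullo_epi32 cannot overflow anymore.
static inline __m256i clamped_relusqr(__m256i sum8)
{
    const __m256i lo = _mm256_set1_epi32(-8192);
    const __m256i hi = _mm256_set1_epi32( 8192);
    __m256i s = _mm256_min_epi32(_mm256_max_epi32(sum8, lo), hi);
    return _mm256_srli_epi32(_mm256_mullo_epi32(s, s), 19);
}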
Summary
From my perspective I have shared everything I wanted to, and from my POV this topic is closed. The really interesting part will be the comparison of implementations within C++ and finally in different languages like C#, of at least SFNNUEv5 and above, in single header files without external dependencies.
It will be cool to test stuff like __viaddmax_s16x2_relu from here: https://docs.nvidia.com/cuda/cuda-math- ... ccd2894937
Probably with a unified interface for static classes, something like this:
Code: Select all
#include <concepts>

template <typename T>
concept is_eval_capable = requires(int a, int b, int c, int d) {
    { T::AddPiece(a, b, c) } -> std::same_as<void>;
    { T::RemovePiece(a, b, c) } -> std::same_as<void>;
    { T::TakesPiece(a, b, c, d) } -> std::same_as<void>;
};
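And a hypothetical backend satisfying it, so the search can be templated on the eval with zero virtual dispatch:

Code: Select all

struct DummyEval {
    static void AddPiece(int, int, int) {}
    static void RemovePiece(int, int, int) {}
    static void TakesPiece(int, int, int, int) {}
};
static_assert(is_eval_capable<DummyEval>);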