Hacking around CFish NNUE

maksimKorzh · Post by **maksimKorzh** » Thu Oct 15, 2020 8:27 pm

hgm wrote: ↑Thu Oct 15, 2020 8:20 pm Note that I just fixed a few typos in the code (well, actually copy-paste errors, where I forgot to make the intended modifications to the copied code). The weights of all layers of course had to be different, and the last layer only needs a 1-d array of weights, as there is only a single output.

Thanks, that clears up my confusion a bit.

maksimKorzh · Post by **maksimKorzh** » Fri Oct 16, 2020 12:20 am

Guys, I can't believe that I've finally found exactly what I was looking for!

So I wanted to see the following implementation: take a FEN string as input -> get an NNUE score as output.

And OMG! Here it is! https://hxim.github.io/Stockfish-Evaluation-Guide/ (NNUE tab)
It lets the user load an NNUE network in the browser and gives a score for whatever position is on the board! Can you believe it?!
So now I can implement it in C, embed it into my engine and make a tutorial series on it!
Yes, it would be slow and inefficient, but I'm interested in a proof of concept.

So thanks to everybody participating, you've helped me finally find the right solution.

Daniel Shawul · Post by **Daniel Shawul** » Fri Oct 16, 2020 12:56 am

I just finished implementing the library without incremental updates.

https://github.com/dshawul/nnue-probe.git

It has a FEN interface and a pieces[],squares[] interface as well

DLLExport void _CDECL nnue_init(const char * evalFile);
DLLExport int _CDECL nnue_evaluate(int player, int* pieces, int* squares);
DLLExport int _CDECL nnue_evaluate_fen(const char* fen);
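
For anyone who wants to feed the pieces[],squares[] interface from their own board, here is a minimal sketch of turning the board field of a FEN into those two arrays. The piece codes (wking=1 ... wpawn=6, bking=7 ... bpawn=12), the A1=0 ... H8=63 square numbering, the "kings first" ordering and the 0 terminator are assumptions based on my reading of the nnue-probe README, so check them against nnue.h before relying on this. `fen_to_arrays` and `piece_code` are hypothetical helper names, not part of the library.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* ASSUMED piece codes: K=1 Q=2 R=3 B=4 N=5 P=6, k=7 ... p=12; 0 ends the list. */
static int piece_code(char c)
{
    static const char order[] = "KQRBNPkqrbnp";
    const char *p = c ? strchr(order, c) : NULL;
    return p ? (int)(p - order) + 1 : 0;
}

/* Fills pieces[] and squares[] (room for 33 ints each) from the board field
   of a FEN string. The white king goes to slot 0, the black king to slot 1,
   all other pieces follow in FEN order. Returns the number of pieces, or -1
   if the board field is malformed. */
int fen_to_arrays(const char *fen, int *pieces, int *squares)
{
    assert(fen && pieces && squares);
    int rank = 7, file = 0, n = 2, wk = 0, bk = 0;

    for (; *fen && *fen != ' '; fen++) {
        char c = *fen;
        if (c == '/') {                       /* next rank down */
            if (file != 8 || rank == 0) return -1;
            rank--;
            file = 0;
        } else if (isdigit((unsigned char)c)) {   /* run of empty squares */
            if (c < '1' || c > '8') return -1;
            file += c - '0';
            if (file > 8) return -1;
        } else {
            int code = piece_code(c);
            if (code == 0 || file > 7) return -1;
            int sq = rank * 8 + file;         /* ASSUMED: A1 = 0, H8 = 63 */
            if (c == 'K') {
                if (wk) return -1;
                pieces[0] = code; squares[0] = sq; wk = 1;
            } else if (c == 'k') {
                if (bk) return -1;
                pieces[1] = code; squares[1] = sq; bk = 1;
            } else {
                if (n >= 32) return -1;
                pieces[n] = code; squares[n] = sq; n++;
            }
            file++;
        }
    }
    if (rank != 0 || file != 8 || !wk || !bk) return -1;
    pieces[n] = 0;                            /* terminator */
    squares[n] = 0;
    return n;
}
```

The filled arrays would then be passed to nnue_evaluate(player, pieces, squares); how the side to move is encoded in player is also something to confirm in the header.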

Funny thing is that incremental updates give only a 4.5% speedup on the start position, so they may not be worth it at all.
The "NNUE" implementation below was directly implemented in my engine with all the incremental update etc.
The "NNUE without increment" is through the library.

No NNUE                         = 2100 knps
NNUE                            = 1400 knps
NNUE without increment          = 1337 knps
NNUE without increment + memcpy = 1100 knps

NNUE is about 65% of the speed of classic.
NNUE without incremental evaluation is just 4.5% slower on the start position.
When I added the updates in makemove with memcpy for Accumulator/DirtyPiece it was 14-18% slower.
But without it, the difference is too small to worry about.
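
To make the comparison concrete, here is a toy sketch of what "with" and "without" incremental updates means for the first layer. It is not the library's code: the sizes, the `FirstLayer` struct and the function names are made up for illustration. A full refresh rebuilds the accumulator from the bias plus one weight column per active feature; an incremental update only subtracts the column of the feature that disappeared and adds the column of the one that appeared. Since a position has only a few dozen active features, the full refresh is a small number of column additions, which may be part of why the gap measured above is so small.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { N_FEATURES = 16, N_HIDDEN = 8 };    /* toy sizes, not the real network */

typedef struct {
    int16_t bias[N_HIDDEN];
    int16_t weight[N_FEATURES][N_HIDDEN];  /* one column per input feature */
} FirstLayer;

/* Deterministic toy weights so the two paths can be compared. */
void toy_layer(FirstLayer *l)
{
    for (int j = 0; j < N_HIDDEN; j++)
        l->bias[j] = (int16_t)j;
    for (int f = 0; f < N_FEATURES; f++)
        for (int j = 0; j < N_HIDDEN; j++)
            l->weight[f][j] = (int16_t)((f * 7 + j * 3) % 11 - 5);
}

/* Full refresh: bias plus the column of every active feature. */
void acc_refresh(const FirstLayer *l, const int *active, int n_active,
                 int16_t *acc)
{
    memcpy(acc, l->bias, sizeof l->bias);
    for (int i = 0; i < n_active; i++) {
        assert(active[i] >= 0 && active[i] < N_FEATURES);
        for (int j = 0; j < N_HIDDEN; j++)
            acc[j] = (int16_t)(acc[j] + l->weight[active[i]][j]);
    }
}

/* Incremental update: one feature turned off, one turned on. */
void acc_update(const FirstLayer *l, int removed, int added, int16_t *acc)
{
    assert(removed >= 0 && removed < N_FEATURES);
    assert(added >= 0 && added < N_FEATURES);
    for (int j = 0; j < N_HIDDEN; j++)
        acc[j] = (int16_t)(acc[j] - l->weight[removed][j]
                                  + l->weight[added][j]);
}
```

Both paths must produce the same accumulator; the incremental one just touches two columns instead of all active ones, at the cost of carrying the accumulator (and the list of changed pieces) through makemove.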

syzygy · Post by **syzygy** » Fri Oct 16, 2020 1:07 am

Daniel Shawul wrote: ↑Thu Oct 15, 2020 7:07 pmI wonder why auto-vectorization is not used instead of the manual SIMD code NNUE currently has. There is separate code for AVX2, SSE3,SSE2,SSE etc which is kind of ugly. Your code above can be easily auto-vectorized by the compiler, so I wonder why this approach is not taken. I don't see any operation preventing auto-vectorization in a simple dense network. The NNUE code either doesn't have easily vectorizable "default code" or compilers do a really bad job at it as it seems it is 3x slower without vectorization.

Autovectorization might do fine on some parts but probably not on all parts. Also, some speed is gained by reordering the weights in the right way for the vector instruction set being used, which autovectorization won't be able to do (for example, with SSE2, storing the weights as 16-bit ints is a huge win). It may be ugly but it only needs to be done once (until the network architecture changes).

But autovectorization is worth a try if one wants the cleanest possible code. One could maybe also use the gcc vector extensions.

If you tried the default code and found it to be 3x slower, that might be because the Makefile did not enable the avx2/sse instructions.
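
As a rough illustration of the gcc vector extensions mentioned above (also supported by clang), here is a sketch that adds one 16-bit weight column into an accumulator eight lanes at a time, without naming any particular instruction set. The `v8i16` typedef and `add_column` are hypothetical names, and this is not how CFish does it; the compiler picks SSE2, AVX2, NEON or scalar code depending on the target flags.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Eight 16-bit lanes in one 16-byte vector (GCC/Clang vector extension). */
typedef int16_t v8i16 __attribute__((vector_size(16)));

/* acc[i] += col[i] for i in [0, n); n must be a multiple of 8. */
void add_column(int16_t *acc, const int16_t *col, int n)
{
    assert(n >= 0 && n % 8 == 0);
    for (int i = 0; i < n; i += 8) {
        v8i16 a, c;
        memcpy(&a, acc + i, sizeof a);   /* unaligned-safe loads */
        memcpy(&c, col + i, sizeof c);
        a += c;                          /* element-wise add on all 8 lanes */
        memcpy(acc + i, &a, sizeof a);
    }
}
```

This keeps a single code path for every target, but it does not by itself solve the weight-layout point: the data still has to be stored in the order the chosen vector width wants.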

maksimKorzh · Post by **maksimKorzh** » Fri Oct 16, 2020 1:41 am

Daniel Shawul wrote: ↑Fri Oct 16, 2020 12:56 am I just finished implementing the library without incremental updates.


OMG! Seems exactly what I was dreaming of!
Thank you so much, Daniel!
Can't be grateful enough!

maksimKorzh · Post by **maksimKorzh** » Fri Oct 16, 2020 2:22 am

Daniel Shawul wrote: ↑Fri Oct 16, 2020 12:56 am I just finished implementing the library without incremental updates.


I've retrieved the score via FEN:

#include <stdio.h>
#include "nnue.h"   /* nnue_init() and nnue_evaluate_fen() from nnue-probe */

int main()
{
    nnue_init("nn-04cf2b4ed1da.nnue");
    int score = nnue_evaluate_fen("rnbqkbnr/pppppppp/8/8/4P3/8/PPPP1PPP/RNBQKBNR b KQkq e3 0 1");
    printf("score: %d\n", score);
    return 0;
}

But the output score confuses me a bit (probably I'm doing something wrong).
E.g. the above code gives output: 108
while the same network in the JS interface gives: 57 (0.28)

To try it yourself, navigate here: file:///home/maksim/Desktop/nnue.html -> go to the NNUE tab and download the network (I used nn-04cf2b4ed1da.nnue from https://tests.stockfishchess.org/nns)

Is this a matter of different implementations, or did I do something horribly wrong?

Daniel Shawul · Post by **Daniel Shawul** » Fri Oct 16, 2020 2:49 am

maksimKorzh wrote: ↑Fri Oct 16, 2020 2:22 am
But the output score confuses me a bit (probably I'm doing something wrong).
E.g. the above code gives output: 108
while the same network in the JS interface gives: 57 (0.28)

To try it yourself, navigate here: file:///home/maksim/Desktop/nnue.html -> go to the NNUE tab and download the network (I used nn-04cf2b4ed1da.nnue from https://tests.stockfishchess.org/nns)

Is this a matter of different implementations, or did I do something horribly wrong?

There was a bug in the FEN decoding that I just fixed.
Here is how you probe it from a FEN:

from ctypes import cdll

# Under Python 3, ctypes passes bytes (not str) as const char*.
nnue = cdll.LoadLibrary("libnnueprobe.so")
nnue.nnue_init(b"/home/daniel/Scorpio/nets-scorpio/nn-baeb9ef2d183.nnue")
score = nnue.nnue_evaluate_fen(b"rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1")
print("Score =", score)

Daniel Shawul · Post by **Daniel Shawul** » Fri Oct 16, 2020 2:52 am

syzygy wrote: ↑Fri Oct 16, 2020 1:07 am
If you tried the default code and found it to be 3x slower, that might be because the Makefile did not enable the avx2/sse instructions.

Hmm, I believe I had all the -mavx2 etc. flags defined, without -DUSE_AVX2, for that test, but maybe I made a mistake.

Daniel Shawul · Post by **Daniel Shawul** » Fri Oct 16, 2020 4:03 am

Daniel Shawul wrote: ↑Fri Oct 16, 2020 2:52 am
Hmm, I believe I had all the -mavx2 etc. flags defined, without -DUSE_AVX2, for that test, but maybe I made a mistake.

It seems the issue is the compiler.
clang does vectorization well and now the slowdown is about 1.7x.
OTOH, gcc version 7.5 seems to have a problem unless I missed some flag

gcc -c -O3 -ftree-vectorize -ftree-vectorizer-verbose=2 -msse2 -msse -mavx2 vect.c

This doesn't report any vectorized loops.
GCC is actually 6.2x slower when I don't do incremental update, and with incremental update it is about 3x slower.
For clang, I used

clang -c -O3 -ftree-vectorize -Rpass=loop-vectorize -Rpass-missed=loop-vectorize -Rpass-analysis=loop-vectorize vect.c

and it does report vectorized loops and is fast.
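
The vect.c used for this test is not shown in the thread, so the following is only a hypothetical stand-in: a plain-C dense layer with 8-bit inputs and weights, a 32-bit accumulator and a clipped ReLU, written the way one would hope a compiler can auto-vectorize. The function name, the row-major weight layout and the shift parameter are assumptions for illustration, not the layout of any real network file.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* out[o] = clamp((bias[o] + sum_i weights[o][i] * in[i]) >> shift, 0, 127)
   weights is row-major: n_out rows of n_in values each. */
void dense_clipped_relu(int n_in, int n_out,
                        const int8_t *in, const int8_t *weights,
                        const int32_t *bias, int shift, int8_t *out)
{
    assert(n_in > 0 && n_out > 0 && shift >= 0 && shift < 31);
    for (int o = 0; o < n_out; o++) {
        const int8_t *w = weights + (size_t)o * (size_t)n_in;
        int32_t sum = bias[o];
        /* Simple reduction over contiguous data: the loop the vectorizer
           is expected to pick up. */
        for (int i = 0; i < n_in; i++)
            sum += w[i] * in[i];
        /* Clamp negatives first so the shift only sees non-negative values. */
        int32_t v = sum < 0 ? 0 : sum >> shift;
        out[o] = (int8_t)(v > 127 ? 127 : v);
    }
}
```

When checking whether gcc vectorized a loop like this, it may be worth trying -fopt-info-vec-optimized and -fopt-info-vec-missed instead of -ftree-vectorizer-verbose. As far as I know the latter is deprecated in newer GCC releases and kept only for backward compatibility, so getting no report from it does not necessarily mean nothing was vectorized.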

maksimKorzh · Post by **maksimKorzh** » Fri Oct 16, 2020 10:37 am

Daniel Shawul wrote: ↑Fri Oct 16, 2020 2:49 am
There was a bug in the FEN decoding that I just fixed.
Here is how you probe it from a FEN:

Thank you so much for fixing it!
Your snippet is in Python, but can't I use the lib in C? I had some errors when I included nnue.h, but then I just added main() to the nnue.cpp file and called nnue_evaluate_fen() from there. In the future I just want to compile nnue.cpp and misc.cpp along with my engine, is that OK?
