Non-GPL NNUE probing code.


Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Non-GPL NNUE probing code.

Post by Daniel Shawul »

Hello,

I know some programmers, like Tornado's author, wanted non-GPL NNUE probing code.
So here I give you an NNUE probing code that is not GPL, so you can use it in your private or even commercial engines.
https://github.com/dshawul/nncpu-probe
It has a similar interface to the one derived from CFish, i.e. libnnueprobe.so, so it should be a drop-in replacement.
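
Loading it at runtime might look like this (a minimal sketch: the nnue_init / nnue_evaluate_fen entry points follow the nnue-probe-style interface and should be checked against the actual exports of the library you build):

Code:

#include <dlfcn.h>
#include <cstdio>

// Assumed exports, following the nnue-probe-style interface; verify
// against the actual header of the library you load.
typedef void (*nnue_init_fn)(const char* eval_file);
typedef int  (*nnue_evaluate_fen_fn)(const char* fen);

int main()
{
	void* lib = dlopen("./libnnueprobe.so", RTLD_LAZY);  // or ./libnncpu.so
	if (!lib) { std::fprintf(stderr, "dlopen: %s\n", dlerror()); return 1; }

	nnue_init_fn nnue_init = (nnue_init_fn)dlsym(lib, "nnue_init");
	nnue_evaluate_fen_fn nnue_evaluate_fen =
		(nnue_evaluate_fen_fn)dlsym(lib, "nnue_evaluate_fen");
	if (!nnue_init || !nnue_evaluate_fen) { std::fprintf(stderr, "missing symbol\n"); return 1; }

	nnue_init("nn-04cf2b4ed1da.nnue");
	int score = nnue_evaluate_fen("rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1");
	std::printf("score = %d\n", score);

	dlclose(lib);
	return 0;
}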

Caveats:
a) Only AVX2 is supported. The only other option is no vectorization at all, which is 2x slower.

b) libnncpu is slower than libnnue, but only by a very small margin.

c) Not tested thoroughly yet, but on the start position it seems to give identical results.
I also played a handful of games, but an exhaustive test comparing the two would be good.

libnnue

Code:

$ ./scorpio.sh use_nn 0 use_nnue 1 nnue_type 0 montecarlo 0 nnue_path ../nets-nnue/nn-04cf2b4ed1da.nnue go quit
feature done=0
Number of cores 4 of 4
treeht 83886080 X 320 = 25600.0 MB
processors [1]
ht 67108864 X 16 = 1024.0 MB
eht 262144 X 8 X 1 = 4.0 MB
pht 32768 X 24 X 1 = 0.8 MB
EgbbProbe 4.3 by Daniel Shawul
egbb_cache 4084 X 8216 = 32.0 MB
180 egbbs loaded !      
Loading NNUE : ../nets-nnue/nn-04cf2b4ed1da.nnue
NNUE loaded !
loading_time = 0s
# rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1
# [st = 8335ms, mt = 29250ms , hply = 0 , moves_left 10]
2 68 0 44 2 44000 0 d2-d4 d7-d5
3 5 0 257 3 51400 0 d2-d4 Ng8-f6 Ng1-f3
4 63 1 849 4 70750 0 d2-d4 Ng8-f6 c2-c4 d7-d5
5 11 2 3026 5 151300 0 d2-d4 d7-d5 c2-c4 d5xc4 Ng1-f3
6 73 3 9679 6 261594 0 d2-d4 e7-e6 c2-c4 c7-c5 d4-d5 e6xd5
7 23 4 17227 7 358895 0 d2-d4 d7-d5 Ng1-f3 Ng8-f6 c2-c4 d5xc4 Qd1-a4 c7-c6
7 31 6 29395 7 459296 0 c2-c4 e7-e6 Ng1-f3 d7-d5 d2-d4 d5xc4 Nb1-c3
8 -2 7 38961 8 526500 0 c2-c4 e7-e5 e2-e4 Nb8-c6 Ng1-f3 Bf8-c5 Nb1-c3 d7-d6
8 59 8 43406 8 535876 0 d2-d4 e7-e6 Ng1-f3 Ng8-f6 c2-c4 c7-c5 d4xc5 Bf8xc5
9 25 9 58641 9 604546 0 d2-d4 Ng8-f6 c2-c4 e7-e6 Ng1-f3 d7-d5 Nb1-c3 d5xc4 e2-e4
10 35 13 95271 10 690369 0 d2-d4 d7-d5 Ng1-f3 Ng8-f6 e2-e3 c7-c5 c2-c4 c5xd4 e3xd4 Nb8-c6 c4xd5 Nf6xd5
11 40 16 123288 11 729514 0 d2-d4 d7-d5 Ng1-f3 Ng8-f6 c2-c4 d5xc4 Nb1-c3 c7-c5 d4-d5 e7-e6 e2-e4 e6xd5 e4xd5
11 45 20 155620 11 774228 0 e2-e4 e7-e5 Ng1-f3 Ng8-f6 Nb1-c3 Nb8-c6 d2-d4 e5xd4 Nf3xd4 Nc6xd4 Qd1xd4 Bf8-b4
12 64 24 201187 12 831351 0 e2-e4 e7-e5 Ng1-f3 Ng8-f6 Nb1-c3 Nb8-c6 d2-d4 e5xd4 Nf3xd4 Bf8-b4 Nd4xc6 b7xc6
13 37 32 293342 13 911000 0 e2-e4 e7-e5 Ng1-f3 Ng8-f6 Nf3xe5 Nf6xe4 d2-d4 d7-d5 Bf1-d3 Nb8-d7 Ke1-g1 Nd7xe5 d4xe5
14 33 43 424744 14 976422 0 e2-e4 e7-e5 Ng1-f3 Ng8-f6 Nb1-c3 Nb8-c6 d2-d4 e5xd4 Nf3xd4 Bf8-b4 Nd4xc6 Bb4xc3 b2xc3 b7xc6 e4-e5
14 60 56 578867 14 1024543 0 d2-d4 d7-d5 c2-c4 e7-e6 Nb1-c3 c7-c5 c4xd5 e6xd5 d4xc5 d5-d4 Nc3-e4 Bc8-f5 Ne4-d6 Bf8xd6 c5xd6 Qd8xd6
15 34 83 941375 15 1127395 0 d2-d4 d7-d5 c2-c4 e7-e6 Nb1-c3 Ng8-f6 Ng1-f3 Bf8-b4 e2-e3 Ke8-g8 Bc1-d2 d5xc4 Bf1xc4 c7-c5 d4xc5
16 32 124 1525609 16 1221464 0 d2-d4 Ng8-f6 c2-c4 e7-e6 Ng1-f3 b7-b6 Nb1-c3 Bf8-b4 e2-e3 Bc8-b7 Bc1-d2 c7-c5 d4xc5 Bb4xc5 Bf1-d3 Ke8-g8
16 33 134 1656870 16 1236470 0 e2-e4 e7-e5 Ng1-f3 Nb8-c6 d2-d4 e5xd4 Nf3xd4 Ng8-f6 Nb1-c3 Bf8-b4 Nd4xc6 b7xc6 Bf1-d3 Ke8-g8 Ke1-g1 Bb4xc3 b2xc3
17 26 227 3275693 17 1441131 0 e2-e4 c7-c5 c2-c3 Nb8-c6 d2-d4 d7-d5 e4xd5 Qd8xd5 Ng1-f3 Ng8-f6 Bf1-e2 c5xd4 c3xd4 e7-e6 Nb1-c3 Qd5-d8 Ke1-g1
18 40 302 4711082 18 1556353 0 e2-e4 c7-c5 c2-c3 Ng8-f6 e4-e5 Nf6-d5 Ng1-f3 Nb8-c6 Bf1-c4 e7-e6 d2-d4 c5xd4 c3xd4 d7-d6 Bc1-g5 Bf8-e7 Bc4xd5 Be7xg5
19 24 406 6825623 19 1677469 0 e2-e4 c7-c5 Nb1-c3 d7-d6 Ng1-f3 Ng8-f6 d2-d4 c5xd4 Nf3xd4 a7-a6 Bf1-d3 Nb8-d7 Ke1-g1 e7-e6 f2-f4 Bf8-e7 Qd1-e2 Ke8-g8 Bc1-e3
19 24 473 8124708 19 1717334 0 e2-e4 c7-c5 Nb1-c3 d7-d6 Ng1-f3 Ng8-f6 d2-d4 c5xd4 Nf3xd4 a7-a6 Bf1-d3 Nb8-d7 Ke1-g1 e7-e6 f2-f4 Bf8-e7 Qd1-e2 Ke8-g8 Bc1-e3
# Stat: nodes 8124708 <13% qnodes> tbhits 0 splits 0 badsplits 0 time 4731ms nps 1717334 eps 604811 nneps 0
move e2e4
Bye Bye
libnncpu

Code:

$ ./scorpio.sh use_nn 0 use_nnue 1 nnue_type 1 montecarlo 0 nnue_path ../nets-nnue/nn-04cf2b4ed1da.nnue go quit
feature done=0
Number of cores 4 of 4
treeht 83886080 X 320 = 25600.0 MB
processors [1]
ht 67108864 X 16 = 1024.0 MB
eht 262144 X 8 X 1 = 4.0 MB
pht 32768 X 24 X 1 = 0.8 MB
EgbbProbe 4.3 by Daniel Shawul
egbb_cache 4084 X 8216 = 32.0 MB
180 egbbs loaded !      
Loading NNCPU : ../nets-nnue/nn-04cf2b4ed1da.nnue
NNCPU loaded !
loading_time = 0s
# rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1
# [st = 8335ms, mt = 29250ms , hply = 0 , moves_left 10]
2 68 0 44 2 44000 0 d2-d4 d7-d5
3 5 0 257 3 257000 0 d2-d4 Ng8-f6 Ng1-f3
4 63 0 849 4 283000 0 d2-d4 Ng8-f6 c2-c4 d7-d5
5 11 0 3026 5 432285 0 d2-d4 d7-d5 c2-c4 d5xc4 Ng1-f3
6 73 2 9679 6 483950 0 d2-d4 e7-e6 c2-c4 c7-c5 d4-d5 e6xd5
7 23 3 17227 7 555709 0 d2-d4 d7-d5 Ng1-f3 Ng8-f6 c2-c4 d5xc4 Qd1-a4 c7-c6
7 31 4 29395 7 653222 0 c2-c4 e7-e6 Ng1-f3 d7-d5 d2-d4 d5xc4 Nb1-c3
8 -2 5 38961 8 695732 0 c2-c4 e7-e5 e2-e4 Nb8-c6 Ng1-f3 Bf8-c5 Nb1-c3 d7-d6
8 59 6 43406 8 711573 0 d2-d4 e7-e6 Ng1-f3 Ng8-f6 c2-c4 c7-c5 d4xc5 Bf8xc5
9 25 7 58641 9 751807 0 d2-d4 Ng8-f6 c2-c4 e7-e6 Ng1-f3 d7-d5 Nb1-c3 d5xc4 e2-e4
10 35 11 95271 10 821301 0 d2-d4 d7-d5 Ng1-f3 Ng8-f6 e2-e3 c7-c5 c2-c4 c5xd4 e3xd4 Nb8-c6 c4xd5 Nf6xd5
11 40 14 123288 11 838693 0 d2-d4 d7-d5 Ng1-f3 Ng8-f6 c2-c4 d5xc4 Nb1-c3 c7-c5 d4-d5 e7-e6 e2-e4 e6xd5 e4xd5
11 45 18 155620 11 859779 0 e2-e4 e7-e5 Ng1-f3 Ng8-f6 Nb1-c3 Nb8-c6 d2-d4 e5xd4 Nf3xd4 Nc6xd4 Qd1xd4 Bf8-b4
12 64 22 201187 12 906247 0 e2-e4 e7-e5 Ng1-f3 Ng8-f6 Nb1-c3 Nb8-c6 d2-d4 e5xd4 Nf3xd4 Bf8-b4 Nd4xc6 b7xc6
13 37 31 293342 13 946264 0 e2-e4 e7-e5 Ng1-f3 Ng8-f6 Nf3xe5 Nf6xe4 d2-d4 d7-d5 Bf1-d3 Nb8-d7 Ke1-g1 Nd7xe5 d4xe5
14 33 42 424744 14 997051 0 e2-e4 e7-e5 Ng1-f3 Ng8-f6 Nb1-c3 Nb8-c6 d2-d4 e5xd4 Nf3xd4 Bf8-b4 Nd4xc6 Bb4xc3 b2xc3 b7xc6 e4-e5
14 60 55 578867 14 1039258 0 d2-d4 d7-d5 c2-c4 e7-e6 Nb1-c3 c7-c5 c4xd5 e6xd5 d4xc5 d5-d4 Nc3-e4 Bc8-f5 Ne4-d6 Bf8xd6 c5xd6 Qd8xd6
15 34 82 941375 15 1135554 0 d2-d4 d7-d5 c2-c4 e7-e6 Nb1-c3 Ng8-f6 Ng1-f3 Bf8-b4 e2-e3 Ke8-g8 Bc1-d2 d5xc4 Bf1xc4 c7-c5 d4xc5
16 32 124 1525609 16 1226373 0 d2-d4 Ng8-f6 c2-c4 e7-e6 Ng1-f3 b7-b6 Nb1-c3 Bf8-b4 e2-e3 Bc8-b7 Bc1-d2 c7-c5 d4xc5 Bb4xc5 Bf1-d3 Ke8-g8
16 33 133 1656870 16 1239244 0 e2-e4 e7-e5 Ng1-f3 Nb8-c6 d2-d4 e5xd4 Nf3xd4 Ng8-f6 Nb1-c3 Bf8-b4 Nd4xc6 b7xc6 Bf1-d3 Ke8-g8 Ke1-g1 Bb4xc3 b2xc3
17 26 228 3275693 17 1432936 0 e2-e4 c7-c5 c2-c3 Nb8-c6 d2-d4 d7-d5 e4xd5 Qd8xd5 Ng1-f3 Ng8-f6 Bf1-e2 c5xd4 c3xd4 e7-e6 Nb1-c3 Qd5-d8 Ke1-g1
18 40 307 4711082 18 1533056 0 e2-e4 c7-c5 c2-c3 Ng8-f6 e4-e5 Nf6-d5 Ng1-f3 Nb8-c6 Bf1-c4 e7-e6 d2-d4 c5xd4 c3xd4 d7-d6 Bc1-g5 Bf8-e7 Bc4xd5 Be7xg5
19 24 415 6825623 19 1643936 0 e2-e4 c7-c5 Nb1-c3 d7-d6 Ng1-f3 Ng8-f6 d2-d4 c5xd4 Nf3xd4 a7-a6 Bf1-d3 Nb8-d7 Ke1-g1 e7-e6 f2-f4 Bf8-e7 Qd1-e2 Ke8-g8 Bc1-e3
19 24 483 8124708 19 1682134 0 e2-e4 c7-c5 Nb1-c3 d7-d6 Ng1-f3 Ng8-f6 d2-d4 c5xd4 Nf3xd4 a7-a6 Bf1-d3 Nb8-d7 Ke1-g1 e7-e6 f2-f4 Bf8-e7 Qd1-e2 Ke8-g8 Bc1-e3
# Stat: nodes 8124708 <13% qnodes> tbhits 0 splits 0 badsplits 0 time 4830ms nps 1682134 eps 592414 nneps 0
move e2e4
Bye Bye
As you can see, the number of nodes searched (4th column) is identical at each depth. The NPS (6th column) is slightly lower for libnncpu.

Daniel
jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: Non-GPL NNUE probing code.

Post by jdart »

I have also done an alternative implementation of the Stockfish NNUE reading code, but it is not ready to release yet. I have not optimized it for vector instructions, but I have noticed that GCC, at least, generates them anyway for part of the code.

--Jon
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Non-GPL NNUE probing code.

Post by Daniel Shawul »

GCC was really bad for me even with the CFish code. Clang, OTOH, was pretty good at vectorization.
The one operation that really benefits from vectorization is the "dot product" in the dense layers.
That one needed to be manually vectorized even for Clang.
Vectorizing the addition (for the accumulator) and the clamping gives some benefit too, but that is something you can skip if you don't have time.
Also, with quantized weights (INT8), vectorization may be more difficult than with float32.

SF nets store quantized weights, but in Scorpio's format I store float32 and scale the weights when the net is loaded.
That way I was able to compare against float32: the quantized network was 2x faster than float32.
If you don't quantize and don't vectorize, you probably lose by a factor of 4 speed-wise.
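
For reference, a manually vectorized INT8 dot product along these lines might look like the following. This is a generic AVX2 sketch, not Scorpio's actual code; the function name is mine, and it assumes the usual NNUE convention of unsigned 8-bit activations times signed 8-bit weights:

Code:

#include <immintrin.h>
#include <cstdint>

// 32-element dot product: unsigned 8-bit activations x signed 8-bit weights.
int avx2_dot_product_i8(const uint8_t* act, const int8_t* w)
{
	__m256i a = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(act));
	__m256i b = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(w));
	// u8 x i8 -> adjacent pairs summed into int16 (saturating)
	__m256i prod16 = _mm256_maddubs_epi16(a, b);
	// int16 pairs -> int32 sums
	__m256i prod32 = _mm256_madd_epi16(prod16, _mm256_set1_epi16(1));
	// horizontal reduction of the 8 int32 lanes
	__m128i sum = _mm_add_epi32(_mm256_castsi256_si128(prod32),
	                            _mm256_extracti128_si256(prod32, 1));
	sum = _mm_add_epi32(sum, _mm_shuffle_epi32(sum, _MM_SHUFFLE(1, 0, 3, 2)));
	sum = _mm_add_epi32(sum, _mm_shuffle_epi32(sum, _MM_SHUFFLE(2, 3, 0, 1)));
	return _mm_cvtsi128_si32(sum);
}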
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: Non-GPL NNUE probing code.

Post by Milos »

Daniel Shawul wrote: Sun Jan 31, 2021 3:58 am
GCC was really bad for me even with the CFish code. Clang, OTOH, was pretty good at vectorization. [...]
If you don't quantize and don't vectorize, you probably lose by a factor of 4 speed-wise.
I don't get this reinventing-the-wheel approach. Why don't you just use the dot product from OpenBLAS, MKL, or Eigen?
Or is it an exercise in SIMD instruction coding?
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: Non-GPL NNUE probing code.

Post by Joost Buijs »

Because it is fun to do. Why would you use a library when you can write a SIMD dot product in just a few lines of code?

At least on my 10980XE with AVX-512/VNNI, a 16-bit dot product with a vector length of 32 is just a few lines:

Code:

#include <immintrin.h>
#include <cstdint>

// 32-element int16 dot product using AVX-512 VNNI.
// Returns the horizontal sum of the 16 bias lanes plus the dot product.
int vnni_dot_product(int32_t* bias, int16_t* w1, int16_t* w2)
{
	__m512i v1 = _mm512_loadu_si512(bias);               // accumulator, 16 x int32
	__m512i v2 = _mm512_loadu_si512(w1);                 // 32 x int16
	__m512i v3 = _mm512_loadu_si512(w2);                 // 32 x int16
	__m512i vresult = _mm512_dpwssd_epi32(v1, v2, v3);   // v1 + pairwise products
	return _mm512_reduce_add_epi32(vresult);             // horizontal add
}
Under the hood the horizontal add still consists of several machine instructions, but the intrinsics are already there so why not use them.

Like Daniel, I read in the weights as float32; this makes it easy to compare SIMD results with float results. I use int16 SIMD code to avoid the hassles of 8-bit quantization, and this is fast enough for testing purposes. Maybe an int8 dot product is twice as fast, but in practice there are so many other things in the engine giving overhead that the actual difference is much smaller.
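
As one way to do that comparison, a plain scalar reference with the same integer semantics as the VNNI routine above (a sketch; the name is mine) can serve as ground truth for the SIMD path:

Code:

#include <cstdint>

// Scalar ground truth matching vnni_dot_product: the horizontal sum of
// the 16 bias lanes plus a 32-element int16 dot product.
int scalar_dot_product(const int32_t* bias, const int16_t* w1, const int16_t* w2)
{
	int64_t sum = 0;
	for (int i = 0; i < 16; i++) sum += bias[i];
	for (int i = 0; i < 32; i++) sum += int32_t(w1[i]) * int32_t(w2[i]);
	return int(sum);
}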
mar
Posts: 2555
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: Non-GPL NNUE probing code.

Post by mar »

Hmm, I don't see why people don't use a transposed weight matrix:
instead of [output][input], use [input][output].
This is 1) cache-friendly when the input is a bit plane where you iterate over a set of indices of 1 => no need to even touch the input weights, just collect the nonzero indices (plus cache-align the weights),
and 2) a breeze for compilers to auto-vectorize.

The plain, simple forward pass is below.
(Also note that the "dot product loop" actually just fetches one input value and simply sums the scaled weights into a temp buffer.)

The init: just memcpy the biases.

Code: Select all

// Assumes members defined elsewhere: weights (transposed, [input][output]
// layout, row-contiguous), bias[outputSize], inputSize, outputSize, and
// an activate() function.
void forward(const float *input, float *output)
{
	float tmp[MAX_OUTPUT_LAYER_SIZE];

	// init: copy biases
	for (int i=0; i<outputSize; i++)
		tmp[i] = bias[i];

	// "dot product loop": fetch one input, sum its scaled weight row
	for (int i=0; i<inputSize; i++)
	{
		const float *w = weights + i*outputSize;

		auto inputw = input[i];

		for (int j=0; j<outputSize; j++)
			tmp[j] += inputw * w[j];
	}

	// final activation pass
	for (int i=0; i<outputSize; i++)
		output[i] = activate(tmp[i]);
}
Also, in C++ with a fixed-topology net, one can use templates with int parameters to squeeze the maximum out of it, though I'm pretty sure many already do this.
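
To make point 1 above concrete: with the [input][output] layout, the first-layer pass over a sparse binary input reduces to summing a few contiguous weight rows. A minimal sketch, reusing the names from the forward pass above:

Code:

// First-layer pass over a sparse binary input: only the rows of active
// features are touched, and each row is contiguous in memory.
void forward_sparse(const int* activeIdx, int numActive, float* output)
{
	for (int j = 0; j < outputSize; j++)
		output[j] = bias[j];

	for (int k = 0; k < numActive; k++)
	{
		const float* w = weights + activeIdx[k] * outputSize;  // contiguous row
		for (int j = 0; j < outputSize; j++)
			output[j] += w[j];
	}
}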
Martin Sedlak
mar
Posts: 2555
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: Non-GPL NNUE probing code.

Post by mar »

I can't edit my post anymore, but here is what GCC generates for the hot inner loop in AVX2 mode, for a fixed output_size of 32:

Code:

.Loop:
        vbroadcastss    ymm5, DWORD PTR [r8-8+rdx*4]
        vbroadcastss    ymm0, DWORD PTR [r8-4+rdx*4]
        mov     ecx, edx
        add     rdx, 2
        vfmadd231ps     ymm4, ymm5, YMMWORD PTR [rax]
        vfmadd231ps     ymm3, ymm5, YMMWORD PTR [rax+32]
        add     rax, 256
        vfmadd231ps     ymm2, ymm5, YMMWORD PTR [rax-192]
        vfmadd231ps     ymm4, ymm0, YMMWORD PTR [rax-128]
        vfmadd231ps     ymm1, ymm5, YMMWORD PTR [rax-160]
        vfmadd231ps     ymm3, ymm0, YMMWORD PTR [rax-96]
        vfmadd231ps     ymm2, ymm0, YMMWORD PTR [rax-64]
        vfmadd231ps     ymm1, ymm0, YMMWORD PTR [rax-32]
        cmp     rdx, input_size
        jne     .Loop
Not so bad, I guess? (This doesn't include the init and the final activation pass.)

Note that the loop is even 2x unrolled, since I used a fixed layout known at compile time.
Martin Sedlak
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: Non-GPL NNUE probing code.

Post by Joost Buijs »

I use [output][input] because it seems more logical to me, but I still have a transposed implementation on my todo list. The whole inference code is pretty simple anyway, all basic stuff. Getting good results with NN evaluation depends solely upon the data you use to train the network; that is the most difficult part to get right.
mar
Posts: 2555
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: Non-GPL NNUE probing code.

Post by mar »

My point was simply that the inference can be efficient without non-portable intrinsics. Also, transposed weights have another advantage (point 1).
I never liked the idea of having an accumulator for "efficient updates": extra boilerplate and extra state you drag around with your board representation.
That being said, I'm not interested in the "NNUE" topology; for starters, I think it makes sense to experiment with much smaller nets with a "natural topology" in the vein of what Halogen does, YMMV of course.
Also, if you don't transpose, then you eat lots of cache misses in the "efficient update" part, if I'm not mistaken, because then the output weights are far away from each other in memory (see the sketch below).

Besides, if I'm not mistaken, this thread is all about inference.
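
For illustration, the "efficient update" in question amounts to adding or subtracting one weight row per changed feature; with the transposed [input][output] layout each row is contiguous, which is the cache point above. A minimal sketch, again reusing the names from the forward pass:

Code:

// Incremental accumulator update: when a feature flips on/off, add or
// subtract its contiguous weight row instead of recomputing the layer.
void add_feature(float* acc, int featureIdx)
{
	const float* w = weights + featureIdx * outputSize;
	for (int j = 0; j < outputSize; j++)
		acc[j] += w[j];
}

void remove_feature(float* acc, int featureIdx)
{
	const float* w = weights + featureIdx * outputSize;
	for (int j = 0; j < outputSize; j++)
		acc[j] -= w[j];
}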
Martin Sedlak
jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: Non-GPL NNUE probing code.

Post by jdart »

Milos wrote:
I don't get this reinventing-the-wheel approach. Why don't you just use the dot product from OpenBLAS, MKL, or Eigen?
Or is it an exercise in SIMD instruction coding?
I actually considered using mlpack, which is based on the Armadillo vector library. But it doesn't support vectors of 8-bit quantities. It's also really big. Most of the NN libraries are really overkill for a project like this.

There are a number of SIMD libraries out there, for example:

https://p12tic.github.io/libsimdpp/
https://github.com/VcDevel/Vc
https://github.com/Twon/std-experimental-simd
https://github.com/jeffamstutz/tsimd

These are certainly an option. What I've found, though, is that in general they are good shortcuts assuming you have fixed data types. Same with the NN libraries. If you're trying to write code that can be easily reconfigured (different weight sizes or dimensions, for example), then you have an issue.
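
One middle ground is to keep plain loops but lift the dimensions and weight type into template parameters: the compiler sees fixed trip counts it can unroll and auto-vectorize, while changing a layer's size or precision stays a one-line edit. A minimal sketch with invented names, not taken from any engine discussed here:

Code:

#include <cstdint>

// Dimensions and weight type as compile-time parameters: fixed trip
// counts for the optimizer, reconfigurable at the source level.
template <int IN, int OUT, typename WT>
struct DenseLayer
{
	int32_t bias[OUT];
	WT      weights[IN][OUT];   // transposed: [input][output]

	void forward(const uint8_t* input, int32_t* output) const
	{
		for (int j = 0; j < OUT; j++)
			output[j] = bias[j];
		for (int i = 0; i < IN; i++)
			for (int j = 0; j < OUT; j++)
				output[j] += int32_t(input[i]) * int32_t(weights[i][j]);
	}
};

// Usage: DenseLayer<512, 32, int8_t> hidden1;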