It could be beneficial to transpose the input layer, something I want to try when I start working on optimization. I'm not interested in the network topology they use in Stockfish either. I'm experimenting with a fully connected net with 736 inputs, which seems to perform a lot better than my old HCE.

mar wrote: ↑Sun Jan 31, 2021 12:13 pm
My point was simply that the inference can be efficient without non-portable intrinsics. Also, transposed weights have another advantage (point 1).
I never liked the idea of having an accumulator for "efficient updates": extra boilerplate and extra state you drag around with your board representation.
That being said, I'm not interested in the "NNUE" topology; for starters I think it makes sense to experiment with much smaller nets with a "natural" topology, in the vein of what Halogen does. YMMV, of course.
Also, if you don't transpose, then (if I'm not mistaken) you eat lots of cache misses in the "efficient update" part, because the output weights are far away from each other in memory.
Besides, if I'm not mistaken, this thread is all about inference.
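To make the transposed-weights point concrete, here is a minimal sketch of the "efficient update" path, assuming a net like the 736-input one mentioned above (the names HIDDEN, weights, and update_accumulator are made up for illustration). With the matrix stored as weights[feature][neuron], all the weights touched when a single input feature flips form one contiguous run of memory; in the untransposed layout the same update strides across the entire matrix, roughly one cache line per neuron.

```cpp
#include <cstdint>

constexpr int INPUTS = 736;   // input features, as in the net described above
constexpr int HIDDEN = 256;   // hypothetical first-layer width

// Stored transposed: weights[feature][neuron]. The HIDDEN weights for one
// feature are contiguous, so an incremental update is a single streaming pass.
extern int16_t weights[INPUTS][HIDDEN];

// Incrementally update the first-layer accumulator when one input feature
// turns on (sign = +1) or off (sign = -1), e.g. after a piece moves.
void update_accumulator(int16_t* acc, int feature, int sign)
{
    const int16_t* row = weights[feature];
    for (int i = 0; i < HIDDEN; ++i)
        acc[i] += static_cast<int16_t>(sign * row[i]);
}
```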
Since I don't like Python stuff and writing a trainer from scratch would be very time-consuming, I've written the trainer in C++ using Facebook's libTorch library (which is mostly Caffe2). Last week I got my RTX 3090 Turbo; this brought the training time down by almost a factor of 3, which is very convenient: it enables me to do several tests per day.
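In case anyone is curious, the core of such a libTorch trainer is quite compact. This is a minimal sketch, not my actual trainer: the hidden layer sizes, the random data, and the hyperparameters are placeholders; only the 736-input shape matches what I described above.

```cpp
#include <torch/torch.h>

// Hypothetical fully connected eval net: 736 inputs -> 1 output.
// The hidden sizes (256, 32) are placeholders, not the real topology.
struct EvalNet : torch::nn::Module {
    torch::nn::Linear fc1{nullptr}, fc2{nullptr}, fc3{nullptr};
    EvalNet() {
        fc1 = register_module("fc1", torch::nn::Linear(736, 256));
        fc2 = register_module("fc2", torch::nn::Linear(256, 32));
        fc3 = register_module("fc3", torch::nn::Linear(32, 1));
    }
    torch::Tensor forward(torch::Tensor x) {
        x = torch::relu(fc1->forward(x));
        x = torch::relu(fc2->forward(x));
        return fc3->forward(x);
    }
};

int main() {
    // Train on the GPU (e.g. the RTX 3090) when available.
    torch::Device device(torch::cuda::is_available() ? torch::kCUDA : torch::kCPU);
    auto net = std::make_shared<EvalNet>();
    net->to(device);
    torch::optim::Adam opt(net->parameters(), torch::optim::AdamOptions(1e-3));

    // Placeholder batch; a real trainer streams positions and target evals.
    auto inputs  = torch::rand({4096, 736}, device);
    auto targets = torch::rand({4096, 1}, device);

    for (int epoch = 0; epoch < 100; ++epoch) {
        opt.zero_grad();
        auto loss = torch::mse_loss(net->forward(inputs), targets);
        loss.backward();
        opt.step();
    }
}
```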
For the inference code I stick with AVX2, because AMD doesn't support AVX-512. Out of curiosity I've been experimenting with AVX-512 on my i9-10980XE; it's somewhat faster, but certainly not by a factor of 2. Maybe the VNNI dot product itself runs twice as fast, but all the surrounding code (like the horizontal add) still adds a lot of overhead.
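For reference, the kind of AVX2 dot-product kernel I mean looks roughly like this. A simplified sketch, not my actual inference code: it assumes activations quantized to unsigned 8-bit (e.g. clipped ReLU output in [0, 127]), signed 8-bit weights, and a vector length that is a multiple of 32.

```cpp
#include <immintrin.h>
#include <cstdint>

// Dot product of n u8 activations with n s8 weights (n a multiple of 32).
int32_t dot_u8s8_avx2(const uint8_t* x, const int8_t* w, int n)
{
    __m256i acc = _mm256_setzero_si256();
    const __m256i ones = _mm256_set1_epi16(1);
    for (int i = 0; i < n; i += 32) {
        __m256i a = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(x + i));
        __m256i b = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(w + i));
        // u8 * s8 -> pairwise-summed s16 (no overflow with activations <= 127)
        __m256i prod = _mm256_maddubs_epi16(a, b);
        // widen s16 pairs to s32 and accumulate
        acc = _mm256_add_epi32(acc, _mm256_madd_epi16(prod, ones));
    }
    // horizontal add of the 8 lanes: this tail is part of the overhead
    // that doesn't shrink when the multiply itself gets faster
    __m128i sum = _mm_add_epi32(_mm256_castsi256_si128(acc),
                                _mm256_extracti128_si256(acc, 1));
    sum = _mm_add_epi32(sum, _mm_shuffle_epi32(sum, _MM_SHUFFLE(1, 0, 3, 2)));
    sum = _mm_add_epi32(sum, _mm_shuffle_epi32(sum, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtsi128_si32(sum);
}
```

With AVX-512 VNNI the maddubs/madd pair collapses into a single _mm512_dpbusd_epi32 on twice-as-wide registers, which is where the theoretical factor of 2 comes from; the loads, the loop, and the final horizontal reduction remain, so the kernel as a whole doesn't speed up by that factor.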