nnue-trainer


jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

nnue-trainer

Post by jdart »

I am doing some experiments with https://github.com/bmdanielsson/nnue-trainer and have a couple of questions:
1. What's a typical number of positions for training and validation?
2. I am seeing that the objective values seem to go down with increasing training set size. For example, 5 million positions generated with depth 8 and 500k positions for validation produce output like this:

Code: Select all

Epoch 0, 100% (611/611) => 0.04480
Epoch 1, 100% (611/611) => 0.03570
Epoch 2, 100% (611/611) => 0.03236
Epoch 3, 100% (611/611) => 0.02443
... 
In my limited experience even larger input sets produce very small errors; might this make optimization problematic?
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: nnue-trainer

Post by Joost Buijs »

jdart wrote: Sat Mar 27, 2021 7:53 pm I am doing some experiments with https://github.com/bmdanielsson/nnue-trainer and have a couple of questions:
1. What's a typical number of positions for training and validation?
2. I am seeing that the objective values seem to go down with increasing training set size. For example, 5 million positions generated with depth 8 and 500k positions for validation produce output like this:

Code: Select all

Epoch 0, 100% (611/611) => 0.04480
Epoch 1, 100% (611/611) => 0.03570
Epoch 2, 100% (611/611) => 0.03236
Epoch 3, 100% (611/611) => 0.02443
... 
In my limited experience even larger input sets produce very small errors; might this make optimization problematic?
I am using a much larger number of positions, from several hundred million to a few billion. My current validation set has around 24 million positions. It is very difficult and time-consuming to get a good training set; I've already spent months on it and I'm still not satisfied.

With 5 million positions it is very likely that the net over-fits, which is why you see very low errors. In my case adding dropout helps a lot; somehow L1/L2 regularization gives me bad results. Increasing the number of distinct positions helps against over-fitting too; with more distinct positions the error usually ends up higher.
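For illustration, a minimal PyTorch sketch of what adding dropout to a small NNUE-style net could look like; the layer sizes, clipped-ReLU activations and dropout rate are illustrative assumptions, not my actual architecture:

Code: Select all

import torch
import torch.nn as nn

class TinyNNUE(nn.Module):
    def __init__(self, num_features=41024, hidden=256, dropout=0.1):
        super().__init__()
        self.input_layer = nn.Linear(num_features, hidden)  # feature transform
        self.dropout = nn.Dropout(p=dropout)                # active only in train() mode
        self.hidden_layer = nn.Linear(hidden, 32)
        self.output_layer = nn.Linear(32, 1)

    def forward(self, x):
        x = torch.clamp(self.input_layer(x), 0.0, 1.0)  # clipped ReLU
        x = self.dropout(x)                             # regularize the accumulator
        x = torch.clamp(self.hidden_layer(x), 0.0, 1.0)
        return self.output_layer(x)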

To keep the training time bearable I usually limit the number of training positions to about 500 million, which is large enough to give me reasonably good results.

On the Stockfish Discord many people are discussing these things; it could be interesting to take a look over there.
Martin
Posts: 51
Joined: Sun Jul 10, 2016 9:12 pm
Location: Lund, Sweden
Full name: Martin Danielsson

Re: nnue-trainer

Post by Martin »

For my nets I have used 1M positions for validation and about 600M positions for training. But I think most people use a lot more positions for training, 1B positions or more. In my case I didn't see much improvement beyond 600M, but my nets are quite small compared to others.

When using the nnue-trainer code it is also interesting to look in the output folder. The loss value included in the file names is the loss calculated against the validation set. If the running loss (printed in the terminal) continues to decrease but the validation loss doesn't, then that's a sign of overfitting.
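As a rough sketch of that check, assuming model, the data loaders, optimizer and max_epochs are set up elsewhere; train_one_epoch and validation_loss are hypothetical stand-ins for the trainer's own loops:

Code: Select all

import torch

best_val = float("inf")
for epoch in range(max_epochs):
    running_loss = train_one_epoch(model, train_loader, optimizer)
    val_loss = validation_loss(model, val_loader)
    print(f"Epoch {epoch}: train {running_loss:.5f}, val {val_loss:.5f}")
    if val_loss < best_val:
        best_val = val_loss
        # mirror the trainer's habit of encoding the validation loss in the file name
        torch.save(model.state_dict(), f"out/epoch{epoch}_val{val_loss:.5f}.pt")
    # if running_loss keeps falling while val_loss stalls or rises,
    # the net is starting to overfit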
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: nnue-trainer

Post by Joost Buijs »

It's also important that the training positions are very quiet. Although I only use positions from the tip of the PV in quiescence, these are still not quiet enough. My positions are labeled with both the evaluation (HCE) and the score of a 6-ply full-width search (full-width alpha-beta without any pruning). If the signs of the evaluation and the search score differ, or the difference between them is larger than 50 cp, I skip those positions completely, and there are a lot of them.
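A sketch of this filter in Python; static_eval, search_score and candidate_positions are hypothetical bindings to the engine's HCE, its unpruned 6-ply search and the generated position stream:

Code: Select all

MAX_DIFF_CP = 50

def is_quiet(position):
    ev = static_eval(position)             # HCE score, in centipawns
    sc = search_score(position, depth=6)   # unpruned 6-ply alpha-beta score
    if (ev < 0) != (sc < 0):               # signs differ: something tactical going on
        return False
    return abs(ev - sc) <= MAX_DIFF_CP

quiet_positions = [p for p in candidate_positions if is_quiet(p)]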
Ferdy
Posts: 4833
Joined: Sun Aug 10, 2008 3:15 pm
Location: Philippines

Re: nnue-trainer

Post by Ferdy »

jdart wrote: Sat Mar 27, 2021 7:53 pm I am doing some experiments with https://github.com/bmdanielsson/nnue-trainer and have a couple of questions:
1. What's a typical number of positions for training and validation?
2. I am seeing that the objective values seem to go down with increasing training set size. For example, 5 million positions generated with depth 8 and 500k positions for validation produce output like this:

Code: Select all

Epoch 0, 100% (611/611) => 0.04480
Epoch 1, 100% (611/611) => 0.03570
Epoch 2, 100% (611/611) => 0.03236
Epoch 3, 100% (611/611) => 0.02443
... 
In my limited experience even larger input sets produce very small errors; might this make optimization problematic?
This is what I have at 500M positions when using the nodchip/Stockfish position generator and trainer.

Code: Select all

csvsql --query "SELECT tn,tpos,tdep,vdep,valcnt,rmpv,rmpvdi,lambda,mgrad,elo,remarks FROM nnue_training_log WHERE (tpos = '500mill') ORDER BY elo DESC" nnue_training_log.csv  | csvlook

| tn | tpos    | tdep | vdep |  valcnt | rmpv | rmpvdi | lambda | mgrad | elo | remarks                                           |
| -- | ------- | ---- | ---- | ------- | ---- | ------ | ------ | ----- | --- | ------------------------------------------------- |
| 30 | 500mill |    5 |   10 | 100,000 |    4 |    200 |      1 |   0.3 |  10 | +10 (500 games tc5s+50ms test) over not using net |
| 20 | 500mill |    5 |   10 |   2,000 |    4 |    100 |      1 |   0.3 | -10 |                                                   |
tpos: training positions
tdep: training depth
vdep: validation depth
valcnt: validation count used in learning
rmpv: random_multi_pv
rmpvdi: random_multi_pv_diff
mgrad: max_grad

The net is tested using the Deuterium engine with the slower nnue_cpu probing code from Daniel Shawul.

From your test I think the objective is to minimize the loss, so that is fine.
I will later test this tn 30 data with the pytorch trainer and compare its performance against the nodchip/Stockfish trainer.

From what I have observed, higher tpos, tdep, vdep and valcnt perform better. But there are also other factors, such as rmpv, that might affect the performance of the output net. I am working on a soon-to-be-released hyperparameter optimizer to possibly address these optimization challenges.
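That optimizer is not released yet, but as an illustration only, here is the kind of search such a tool might perform, sketched with the Optuna library; train_and_match is a hypothetical helper that would train a net with the given generator/trainer settings and return an Elo estimate from a gauntlet:

Code: Select all

import optuna

def objective(trial):
    rmpv = trial.suggest_int("random_multi_pv", 2, 8)
    rmpvdi = trial.suggest_int("random_multi_pv_diff", 50, 300)
    mgrad = trial.suggest_float("max_grad", 0.1, 1.0)
    # train a net with these settings and measure it in games
    return train_and_match(rmpv=rmpv, rmpvdi=rmpvdi, max_grad=mgrad)

study = optuna.create_study(direction="maximize")  # maximize Elo gain
study.optimize(objective, n_trials=50)
print(study.best_params)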
jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: nnue-trainer

Post by jdart »

Thanks for the responses. It just does not seem intuitive to me that the objective would depend on input size. That's not what I see with Texel tuning, for example.

But maybe it is overfitted and/or the input should be more diverse. The gensfen program by default outputs all game positions past the book phase and not just a sample of them, so it does occur to me that this would generate a lot of positions with zero training error, especially those near the end of the game. It might be better to sample only some of the positions within a game, although this would of course increase the compute effort needed to generate a fixed number of positions.
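A sketch of what such sampling could look like; game_positions is a hypothetical iterator over the FEN records of one self-play game, and the rate is arbitrary:

Code: Select all

import random

SAMPLE_RATE = 0.25  # keep roughly one position in four

def sampled_positions(game):
    # each position is kept independently, so one game contributes
    # fewer, more spread-out training positions
    return [pos for pos in game_positions(game) if random.random() < SAMPLE_RATE]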
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: nnue-trainer

Post by Joost Buijs »

It's all about feature-set size and feature frequency. With Texel tuning you look at a small set of linearly added features that occur frequently; the network looks at a much larger set of features that occur less frequently, which makes it orders of magnitude more difficult to train.

I observe the same with the 10x10 Draughts program I sometimes work on. The nowadays common evaluation system with linearly added 12-bit patterns is relatively easy to tune; just 15 million training positions already give me very good results, while training the FCN needs at least 40 times as many positions to give comparable results.

Since I label with evaluation/search scores I only use distinct positions, with doubles removed, while for pure logistic regression it is better to keep the doubles in, to give you better statistics when you train with WLD data from an existing set of games.
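To illustrate why doubles help the pure logistic-regression case: each (position, result) pair is one noisy WLD sample, so repeated positions effectively average their outcomes. A minimal sketch of the usual Texel-style loss, with an assumed scaling constant k:

Code: Select all

def texel_loss(samples, k=1.0 / 400.0):
    """samples: list of (eval_cp, result) pairs, result in {0.0, 0.5, 1.0}."""
    total = 0.0
    for eval_cp, result in samples:
        # logistic link mapping a centipawn score to a win probability
        win_prob = 1.0 / (1.0 + 10.0 ** (-k * eval_cp))
        total += (result - win_prob) ** 2
    return total / len(samples)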

Labeling the positions with a shallow search works fine; removing positions with tactics in the PV of the shallow search gives me better results. I still have to work on improving the detection of these (unstable) positions.
Ferdy
Posts: 4833
Joined: Sun Aug 10, 2008 3:15 pm
Location: Philippines

Re: nnue-trainer

Post by Ferdy »

jdart wrote: Sun Mar 28, 2021 7:22 pm The gensfen program by default outputs all game positions past the book phase and not just a sample of them, so it does occur to me that this would generate a lot of positions with zero training error, especially those near the end of the game.
Have you read the gensfen description? There is random_multi_pv, which by default will generate random moves.
jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: nnue-trainer

Post by jdart »

I am aware of the random move options. Actually, as I have posted before, the way gensfen does this seems unsound to me, because it will score the whole game with the ultimate result even though a random move inserted in the middle can change the eval significantly; the positions before the random move really shouldn't get the same result label as the positions after it.

I use random moves, but I invalidate any FEN records before the random move to avoid this issue. I have also tried a different method of randomization: once in a while, don't do a fixed-depth search but a fixed-node search, with a number of nodes equal to the nodes of the previous search plus some random variation. This may or may not produce a different move than the fixed-depth search would, a bit like the indeterminacy you get from multi-threading. Whether this helps is TBD: I have not done enough testing.
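In sketch form; search_depth and search_nodes are hypothetical wrappers around the generator's search calls, and the probability and jitter values are arbitrary:

Code: Select all

import random

RANDOMIZE_PROB = 0.1   # how often to use the node-based search instead
JITTER = 0.2           # +/-20% variation around the previous node count

def pick_move(position, depth, prev_nodes):
    if prev_nodes and random.random() < RANDOMIZE_PROB:
        # fixed-node search with a budget near the previous search's nodes
        budget = int(prev_nodes * (1.0 + random.uniform(-JITTER, JITTER)))
        return search_nodes(position, max_nodes=budget)
    return search_depth(position, depth=depth)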
Ferdy
Posts: 4833
Joined: Sun Aug 10, 2008 3:15 pm
Location: Philippines

Re: nnue-trainer

Post by Ferdy »

Ferdy wrote: Sun Mar 28, 2021 1:54 pm
jdart wrote: Sat Mar 27, 2021 7:53 pm I am doing some experiments with https://github.com/bmdanielsson/nnue-trainer and have a couple of questions:
1. What's a typical number of positions for training and validation?
2. I am seeing that the objective values seem to go down with increasing training set size. For example, 5 million positions generated with depth 8 and 500k positions for validation produce output like this:

Code: Select all

Epoch 0, 100% (611/611) => 0.04480
Epoch 1, 100% (611/611) => 0.03570
Epoch 2, 100% (611/611) => 0.03236
Epoch 3, 100% (611/611) => 0.02443
... 
In my limited experience even larger input sets produce very small errors; might this make optimization problematic?
This is what I have at 500M positions when using the nodchip/Stockfish position generator and trainer.

Code: Select all

csvsql --query "SELECT tn,tpos,tdep,vdep,valcnt,rmpv,rmpvdi,lambda,mgrad,elo,remarks FROM nnue_training_log WHERE (tpos = '500mill') ORDER BY elo DESC" nnue_training_log.csv  | csvlook

| tn | tpos    | tdep | vdep |  valcnt | rmpv | rmpvdi | lambda | mgrad | elo | remarks                                           |
| -- | ------- | ---- | ---- | ------- | ---- | ------ | ------ | ----- | --- | ------------------------------------------------- |
| 30 | 500mill |    5 |   10 | 100,000 |    4 |    200 |      1 |   0.3 |  10 | +10 (500 games tc5s+50ms test) over not using net |
| 20 | 500mill |    5 |   10 |   2,000 |    4 |    100 |      1 |   0.3 | -10 |                                                   |
tpos: training positions
tdep: training depth
vdep: validation depth
valcnt: validation count used in learning
rmpv: random_multi_pv
rmpvdi: random_multi_pv_diff
mgrad: max_grad

The net is tested using the Deuterium engine with the slower nnue_cpu probing code from Daniel Shawul.

From your test I think the objective is to minimize the loss, so that is fine.
I will later test this tn 30 data with the pytorch trainer and compare its performance against the nodchip/Stockfish trainer.
Results comparing the SF and pytorch trainers. The input data is 500M d5 training positions and 1M d10 validation positions from gensfen. The SF-trained net uses only 100k validation positions, while the pytorch-trained net uses 1M.

nsf_2021-02-08_tn30: nodchip/stockfish with the SF-trained net from tn30. Trained for around 13 hours.
nsf_2021-02-08_e31: nodchip/stockfish with the pytorch-trained net; training reached epoch 56 with 680k steps, but only e31 (epoch 31, 390k steps) is used, as it has the lowest validation loss. Trained for around 13 hours.

TC 10s+50ms, noomen_3move.pgn for 1000 games.

Code: Select all

Score of nsf_2021-02-08_tn30 vs nsf_2021-02-08_e31: 266 - 130 - 604  [0.568] 1000
...      nsf_2021-02-08_tn30 playing White: 151 - 42 - 307  [0.609] 500
...      nsf_2021-02-08_tn30 playing Black: 115 - 88 - 297  [0.527] 500
...      White vs Black: 239 - 157 - 604  [0.541] 1000
Elo difference: 47.5 +/- 13.5, LOS: 100.0 %, DrawRatio: 60.4 %
The SF-trained net is better.