Why does the NNUE trainer require an online qsearch on each training position?

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

nkg114mc
Posts: 74
Joined: Sat Dec 18, 2010 5:19 pm
Location: Tianjin, China
Full name: Chao M.

Why does the NNUE trainer require an online qsearch on each training position?

Post by nkg114mc »

Hi all,

I have been going through the NNUE trainer implementation recently, but there is one part that I still do not fully understand. In NNUE network training, inference on each training position involves the following steps:

1. For the training chess position pos1, run qsearch with the currently learned NNUE evaluation function to get a quiescence PV and an evaluation score;
2. Find the tail position pos2 of that quiescence PV, which is where the evaluation score actually comes from, then do inference on pos2 (featurization followed by forward propagation);
3. Add pos2 as one training example in the current mini-batch. During the optimization step, compute the gradient on pos2 and update the network weights (a rough sketch of these steps is given below).
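To make this concrete, here is a minimal sketch of what one pass over a single training position could look like. The helpers qsearch(), apply_moves() and featurize() are placeholder names I made up for illustration, not the actual nodchip or nnue-pytorch API:

import torch
import torch.nn.functional as F

def train_on_position(model, optimizer, pos1, target_score):
    # Step 1: quiescence search with the current network yields a PV and a score.
    qs_score, qs_pv = qsearch(pos1, evaluate=model)
    # Step 2: the PV tail pos2 is the quiet position the score actually comes from.
    pos2 = apply_moves(pos1, qs_pv)
    features = featurize(pos2)  # e.g. HalfKP feature indices for pos2
    # Step 3: forward pass on pos2 and a gradient step against the stored target.
    prediction = model(features)
    target = torch.tensor([target_score], dtype=torch.float32)
    loss = F.mse_loss(prediction, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()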

I understand that we should only invoke the evaluation function on quiescent positions, but why does the trainer need to run an online qsearch on each training position?

Let's think about three variations of the current inference method:

Variation 1: Do not run qsearch on the training position. Instead, do inference directly on each training position (pos1 above);
Variation 2: Do not run qsearch on the training position. Instead, skip all non-quiescent positions and do inference directly on each quiescent training position;
Variation 3: During training data generation, instead of storing the search root position, store the quiescent position at the PV tail (so that all training examples are quiescent from the very beginning). Then, in training, apply the same inference strategy as variation 1.

According to my understanding, (1) would definitely be worse because it includes too many non-quiescent positions. But would (2) and (3) also work well? From the nnue-pytorch experiments, it looks like at least method (2) can still achieve reasonable results.

"An inference with the online qsearch can provide more dynamic training positions (because after each epoch the network changed, thus the PV and tail position would change as well), but probably that's the only benefit we can get from it." Is this assumption correct?

Thanks!
Sopel
Posts: 389
Joined: Tue Oct 08, 2019 11:39 pm
Full name: Tomasz Sobczyk

Re: Why does the NNUE trainer require an online qsearch on each training position?

Post by Sopel »

I assume you're talking about https://github.com/nodchip/Stockfish, which is no longer being used or maintained. We now exclusively use https://github.com/glinscott/nnue-pytorch for training and https://github.com/official-stockfish/S ... tree/tools for data generation. With that said though, your question is still valid.

Variation 1 was tried as an experiment, and we determined that training on non-quiet positions results in significantly worse networks (it was around 100 Elo IIRC).
Variation 2 is effectively what we've been doing ever since the pytorch trainer started being used, with the difference that we do an approximate test for quietness by just checking whether the bestmove in the training data is a capture, or whether the king is in check. Anywhere between 20% and 40% of the data is skipped, depending on the source.
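For illustration, a rough python-chess sketch of that approximate filter might look like the following (the real check is done directly on the binary training data inside the data loader, not with python-chess; this is only a sketch of the same idea, with best_move standing for the move stored in the training record):

import chess

def is_probably_quiet(board: chess.Board, best_move: chess.Move) -> bool:
    # Skip the position if the side to move is in check...
    if board.is_check():
        return False
    # ...or if the stored best move is a capture (including en passant).
    if board.is_capture(best_move):
        return False
    return True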
Variation 3 was tried briefly, but due to how we compress the training data it resulted in about 15x larger files, which was unworkable, so we can't really test it properly: the required datasets would be around 500GB in size. The advantage over variation 2 (as we do it) would be that quietness would be determined more accurately, but there is currently no data indicating that this would be beneficial, or that the current approximate scheme is worse than an accurate one.
"An inference with the online qsearch can provide more dynamic training positions (because after each epoch the network changed, thus the PV and tail position would change as well), but probably that's the only benefit we can get from it." Is this assumption correct?
This might be a desirable side effect, but its impact was never tested.
Rebel
Posts: 6991
Joined: Thu Aug 18, 2011 12:04 pm

Re: Why does the NNUE trainer require an online qsearch on each training position?

Post by Rebel »

Sopel wrote: Sat Jan 01, 2022 2:09 pm Variation 2 is effectively what we've been doing ever since the pytorch trainer started being used, with the difference that we do an approximate test for quietness by just checking whether the bestmove in the training data is a capture, or whether the king is in check. Anywhere between 20% and 40% of the data is skipped, depending on the source.
Have you also tried to skip positions that are really quiet? With that I mean no hanging pieces. Just as QS does. As after all the final score comes from QS evaluation. The skipped data will increase, maybe even to 50-60% :lol:
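For what it's worth, a crude python-chess sketch of such a "no hanging pieces" test could look like this (it ignores pins, x-rays and exchanges on defended pieces, so it is only an approximation of what qsearch would actually see):

import chess

def has_hanging_piece(board: chess.Board) -> bool:
    # True if the side to move attacks an enemy piece that has no defender.
    them = not board.turn
    for square, piece in board.piece_map().items():
        if piece.color != them or piece.piece_type == chess.KING:
            continue
        if board.attackers(board.turn, square) and not board.attackers(them, square):
            return True
    return False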
Rebel
Posts: 6991
Joined: Thu Aug 18, 2011 12:04 pm

Re: Why does the NNUE trainer require an online qsearch on each training position?

Post by Rebel »

Rebel wrote: Sat Jan 01, 2022 6:52 pm
Sopel wrote: Sat Jan 01, 2022 2:09 pm Variation 2 is effectively what we've been doing ever since the pytorch trainer started being used, with the difference that we do an approximate test for quietness by just checking whether the bestmove in the training data is a capture, or whether the king is in check. Anywhere between 20% and 40% of the data is skipped, depending on the source.
Have you also tried to skip positions that are really quiet? With that I mean no hanging pieces. Just as QS does. As after all the final score comes from QS evaluation. The skipped data will increase, maybe even to 50-60% :lol:
Should be - Have you also tried positions that are really quiet?
Sopel
Posts: 389
Joined: Tue Oct 08, 2019 11:39 pm
Full name: Tomasz Sobczyk

Re: Why does the NNUE trainer require an online qsearch on each training position?

Post by Sopel »

Rebel wrote: Sat Jan 01, 2022 7:34 pm
Rebel wrote: Sat Jan 01, 2022 6:52 pm
Sopel wrote: Sat Jan 01, 2022 2:09 pm Variation 2 is effectively what we've been doing ever since the pytorch trainer started being used, with the difference that we do an approximate test for quietness by just checking whether the bestmove in the training data is a capture, or whether the king is in check. Anywhere between 20% and 40% of the data is skipped, depending on the source.
Have you also tried to skip positions that are really quiet? With that I mean no hanging pieces. Just as QS does. As after all the final score comes from QS evaluation. The skipped data will increase, maybe even to 50-60% :lol:
Should be - Have you also tried positions that are really quiet?
Currently we skip a superset of this, and with how much data we're throwing at it I don't think it would have an effect. But still worth trying I guess.
nkg114mc
Posts: 74
Joined: Sat Dec 18, 2010 5:19 pm
Location: Tianjin, China
Full name: Chao M.

Re: Why does the NNUE trainer require an online qsearch on each training position?

Post by nkg114mc »

Hi Sopel, thank you for the detailed reply!

The repo I was referring to is this one: https://github.com/joergoster/Stockfish-NNUE (but I guess it is no different from the nodchip version). Thanks for the link to the Stockfish tools branch as the latest data generation tool. I will go through it and refresh my knowledge of NNUE data generation.
training on non-quiet positions results in significantly worse networks (it was around 100 Elo IIRC).
Actually, I observed similar results. In my experiments with Senpai 2 (with NNUE data and training method, but just a linear model), weights trained with the non-quiet positions are consistently 60~80 Elo worse than the ones trained with quiet positions only.
Anywhere between 20% and 40% of the data is skipped, depending on the source.
I computed the same statistics on the three datasets I generated earlier. Consistently, around 23% of positions are skipped in all three. No idea why I got such a magic number. Maybe it is related to the data generation configuration?

Thanks for the explanation about variations 2 and 3. This actually resolved a question that had confused me for a long time. In the original PyTorch NNUE training thread, Gary once mentioned that you implemented a new version of gensfen that "does the same as the training qsearch evaluation" in the original nodchip code, but I was still not clear on the details. Now I have a better idea of how that was done and how far that line of work progressed. BTW, I am still quite surprised that with such a simple approach as variation 2, nnue-pytorch can still reach a reasonably good training result. Really appreciate all your efforts in this exploration work 8-)
Sopel
Posts: 389
Joined: Tue Oct 08, 2019 11:39 pm
Full name: Tomasz Sobczyk

Re: Why does the NNUE trainer require an online qsearch on each training position?

Post by Sopel »

nkg114mc wrote: Sun Jan 02, 2022 12:45 am
Anywhere between 20% and 40% of the data is skipped, depending on the source.
I computed the same statistics on the three datasets I generated earlier. Consistently, around 23% of positions are skipped in all three. No idea why I got such a magic number. Maybe it is related to the data generation configuration?
It may depend on depth/node count, contempt, "style", adjudication, and temperature. All of these can influence the rate of captures in a game.
Gary once mentioned that you implemented a new version of gensfen that "does the same as the training qsearch evaluation" in the original nodchip code, but I was still not clear on the details.
Yes, it was the "ensure_quiet" option in the generator (we may have removed it at some point, I don't remember), which could later be used with the "assume_quiet" option in the trainer. It basically skipped saving positions whose qsearch PV length was greater than 0. I didn't have any conclusive results because I couldn't get large enough datasets, and the training process with the nodchip trainer was really unstable and immature.
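As an illustration only (the real ensure_quiet logic lives in the C++ gensfen code; qsearch() and write_training_entry() below are placeholder names), the idea during data generation is roughly:

def maybe_store_position(pos, search_score, writer):
    qs_score, qs_pv = qsearch(pos)
    if len(qs_pv) > 0:
        # qsearch still wants to play a capture or evade a check: not quiet, skip.
        return False
    # An empty qsearch PV means the static evaluation already matches the
    # qsearch score, so the position can later be trained on directly.
    write_training_entry(writer, pos, search_score)
    return True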
nkg114mc
Posts: 74
Joined: Sat Dec 18, 2010 5:19 pm
Location: Tianjin, China
Full name: Chao M.

Re: Why does the NNUE trainer require an online qsearch on each training position?

Post by nkg114mc »

Thanks again for the reply, Sopel. I will go through the generator/trainer options to get a better understanding.