Understanding neural networks in chess... from an RL point of view.

Discussion of chess software programming and technical issues.

Moderator: Ras

osvitashev
Posts: 13
Joined: Tue Sep 07, 2021 6:17 pm
Full name: Alex S

Understanding neural networks in chess... from an RL point of view.

Post by osvitashev »

Armchair chess engine programmer here...

Below is my broad understanding of how neural networks are usually used in chess engines; a rough sketch in code follows the list. (Yes, we use a lot of tricks on top of minimax to make it run faster, and there are some very smart optimizations in NNUEs, but for the purpose of this question that is not relevant.)
Please correct me if I am totally wrong...

1. Use a static collection of chess games to train a network to predict the game outcome from a position.
2. Plug this network into a minimax algorithm as the evaluation function. A position that is more 'won' corresponds to a higher eval score.
3. Optionally, generate more games/positions through self-play and repeat step #1
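
In code, I picture steps 1 and 2 roughly like this (a toy sketch, not how a real engine is written; encode(), legal_moves(), make() and is_terminal() are placeholders, and a real NNUE updates the first layer incrementally instead of running a full forward pass at every leaf):

import torch
import torch.nn as nn

# Step 1: a small value network trained to predict the game outcome
# (+1 win / 0 draw / -1 loss, from the side to move's point of view).
value_net = nn.Sequential(
    nn.Linear(768, 256),  # 768 = 12 piece types x 64 squares, one-hot
    nn.ReLU(),
    nn.Linear(256, 1),
    nn.Tanh(),            # output in [-1, 1], so negation flips perspective
)

# Step 2: plug the network into plain negamax as the leaf evaluation.
def negamax(position, depth):
    if depth == 0 or position.is_terminal():
        with torch.no_grad():
            return value_net(encode(position)).item()
    best = -float("inf")
    for move in position.legal_moves():  # placeholder move generator
        best = max(best, -negamax(position.make(move), depth - 1))
    return best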

I have started reading up on reinforcement learning algorithms and got confused about how this plugs into chess engines.

Step #1 is effectively an offline deep Q-learning (DQL) algorithm. Offline DQL is generally terrible at handling distribution shift, as it tends to overestimate the value of actions underrepresented in the training data set.

So, my question is... how does it all work?
More specifically:
Does filtering a potentially bad (biased) evaluation function through several levels of minimax somehow average it out and make it usable?
Is there some deep connection between the Bellman equations that are all over RL and the minimax algorithm? (See the comparison after these questions.)
Is DQL an oversimplification on my part, and does the evaluation network need to be trained via something more sophisticated, like CQL (conservative Q-learning)?
Or... is the solution mainly the iterative exploration in step #3, i.e. self-play, so that it is not really an offline RL problem?
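
To make the resemblance behind the second question concrete, here are the two recursions side by side:

V*(s) = max_a [ r(s, a) + gamma * V*(s') ]    (Bellman optimality)
V(s)  = max_m [ -V(s after move m) ]          (negamax form of minimax)

With gamma = 1, rewards only at terminal positions, and the sign flip standing in for the opponent's turn, negamax reads like the Bellman optimality equation for a two-player zero-sum game, which is what makes me suspect the connection is not a coincidence.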
ZirconiumX
Posts: 1355
Joined: Sun Jul 17, 2011 11:14 am
Full name: Hannah Ravensloft

Re: Understanding neural networks in chess... from an RL point of view.

Post by ZirconiumX »

I'm going to draw a distinction here for clarity. "Value networks" take a position as input and output one score, and are used for the evaluation. "Policy networks" take a position as input and output scores for all possible moves, and are used for move scoring. (Policy networks are much rarer.)
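
Concretely, the two shapes look like this (a minimal PyTorch sketch; the layer sizes and the 1858-move encoding are just examples, not anyone's actual architecture):

import torch.nn as nn

N_FEATURES = 768  # e.g. 12 piece types x 64 squares, one-hot
N_MOVES = 1858    # e.g. a fixed from/to move encoding like Lc0's

# Value network: position in, one score out.
value_net = nn.Sequential(
    nn.Linear(N_FEATURES, 256), nn.ReLU(),
    nn.Linear(256, 1), nn.Tanh(),  # single eval score in [-1, 1]
)

# Policy network: position in, one score per encodable move out.
policy_net = nn.Sequential(
    nn.Linear(N_FEATURES, 256), nn.ReLU(),
    nn.Linear(256, N_MOVES),       # logits, masked to legal moves at use time
)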

Value network training is not reinforcement learning, but supervised learning (specifically regression). The value network itself makes no judgement on what the best move in a position is; that's entirely left to minimax.

As such, simple methods like gradient descent, or variants thereof like Adam, are sufficient for training.
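
In other words, the whole training loop can be as plain as this (a sketch; it assumes a DataLoader yielding pre-encoded (position, outcome) batches):

import torch
import torch.nn as nn

value_net = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 1))
optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()  # plain regression against the game outcome

for x, y in loader:                  # assumed: x = encoded positions, y = outcomes
    optimizer.zero_grad()
    pred = value_net(x).squeeze(-1)  # one predicted score per position
    loss = loss_fn(pred, y)
    loss.backward()
    optimizer.step()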
tu ne cede malis, sed contra audentior ito
jdart
Posts: 4405
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: Understanding neural networks in chess... from an RL point of view.

Post by jdart »

The first step is supervised learning. The third step is a form of reinforcement learning, or at least that term is commonly used for it, because the training starts with and builds on top of the existing network. It typically uses a different learning rate and schedule, because you are presumably already close to the optimum point.
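
As a sketch of what that retraining phase might look like (the checkpoint name, epoch count, helper function and exact schedule are all made up; every engine does this differently):

import torch
import torch.nn as nn

value_net = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 1))
value_net.load_state_dict(torch.load("existing_net.pt"))  # start from the current net

# Lower learning rate than from-scratch training, decayed further over time,
# since the network is presumably already close to the optimum.
optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.5)

for epoch in range(20):
    train_one_epoch(value_net, selfplay_loader, optimizer)  # assumed helper
    scheduler.step()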
Aleks Peshkov
Posts: 903
Joined: Sun Nov 19, 2006 9:16 pm
Location: Russia
Full name: Aleks Peshkov

Re: Understanding neural networks in chess... from an RL point of view.

Post by Aleks Peshkov »

I have a question about the practical usage of NNs.

I do not want the NN to measure WDL or even a full centipawn evaluation. I want to use the NN as a small positional correction on top of a basic 1-3-3-5-9 or 1-4-4-6-12 material count as the static evaluation.

Can we use a simple and fast NN as a PST substitute? It should make an NN playable, rather than a random mover, even with almost random values.
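
In pseudo-Python, what I mean is roughly this (the position API and encode() are placeholders):

import torch
import torch.nn as nn

MATERIAL = {"P": 1, "N": 3, "B": 3, "R": 5, "Q": 9}  # or the 1-4-4-6-12 scale

# A tiny net whose output is only a small positional correction.
correction_net = nn.Sequential(nn.Linear(768, 32), nn.ReLU(), nn.Linear(32, 1))

def evaluate(position):
    material = sum(v * (position.count(p, "white") - position.count(p, "black"))
                   for p, v in MATERIAL.items())        # placeholder API
    with torch.no_grad():
        corr = correction_net(encode(position)).item()  # placeholder encode()
    return material + max(-0.9, min(0.9, corr))  # correction stays below a pawn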
ZirconiumX
Posts: 1355
Joined: Sun Jul 17, 2011 11:14 am
Full name: Hannah Ravensloft

Re: Understanding neural networks in chess... from an RL point of view.

Post by ZirconiumX »

Aleks Peshkov wrote: Thu Aug 14, 2025 12:48 pm I have a question about the practical usage of NNs.

I do not want the NN to measure WDL or even a full centipawn evaluation. I want to use the NN as a small positional correction on top of a basic 1-3-3-5-9 or 1-4-4-6-12 material count as the static evaluation.

Can we use a simple and fast NN as a PST substitute? It should make an NN playable, rather than a random mover, even with almost random values.
Sure you can, although a simple NNUE usually uses PST inputs anyway.
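
To spell that out: a single linear layer over one-hot (piece, square) inputs is literally a learned piece-square table, so even near-random weights give you arbitrary-but-consistent square preferences rather than a random mover. Sketch:

import torch.nn as nn

# One-hot (piece type, square) inputs: 12 * 64 = 768 features.
pst_net = nn.Linear(768, 1, bias=False)

# Each weight is "the value of this piece standing on this square", so
# pst_net.weight.view(12, 64) is, by construction, a piece-square table.
# NNUE-style nets simply stack hidden layers on top of these same inputs.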
tu ne cede malis, sed contra audentior ito