TL;DR: why is mean-squared-error used as a NNUE loss function, instead of Kullback-Leibler divergence?
Hi all! I was using an NNUE network in a chess-related game (https://www.codecup.nl/0.1/rules.php), and recently found out that most (all?) people in chess use mean-squared-error loss, which was a surprise to me. More specifically, the loss is
(sigmoid(output) - target)^2, with target equal to a mix of sigmoid(search eval) and the WDL game outcome.
Coming from a statistical background, it feels odd to me to use MSE loss on a bounded outcome (between 0 and 1). To me it makes more sense to model these as probabilities (ignoring draws by treating them as 50% win + 50% loss; including draws correctly is a much larger effort). Then we can see the target as the true probability of winning, and use the Kullback-Leibler divergence of sigmoid(output) as the loss. So to me, that is a better theoretical foundation.
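Concretely, the two options look roughly like this (a minimal PyTorch sketch; the 400 scale and the lambda mixing weight are illustrative values, not taken from any particular trainer):

Code: Select all

import torch

def make_target(search_eval_cp, wdl, scale=400.0, lambda_=0.7):
    # Blend the search eval (squashed to a win probability) with the game
    # outcome encoded as win=1.0, draw=0.5, loss=0.0.
    return lambda_ * torch.sigmoid(search_eval_cp / scale) + (1.0 - lambda_) * wdl

def mse_loss(output_logits, t):
    return ((torch.sigmoid(output_logits) - t) ** 2).mean()

def kl_loss(output_logits, t, eps=1e-12):
    # KL(t || sigmoid(z)), written naively in probability space; a numerically
    # safer logit-space version is sketched further below.
    q = torch.sigmoid(output_logits).clamp(eps, 1.0 - eps)
    t = t.clamp(eps, 1.0 - eps)
    return (t * torch.log(t / q) + (1.0 - t) * torch.log((1.0 - t) / (1.0 - q))).mean()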
Of course, there’s the practical side too. You can look at the mathematical derivations to get a feeling, but I think this graph of the derivatives of the loss functions helps: https://www.desmos.com/calculator/dwl8oth8dn (red = KL divergence, green = MSE; x-axis = output logits; y-axis = derivative of the loss; the value a is the target probability; you can also view the loss functions themselves if you want).
From that you can see that both loss functions act similarly when the target and output probabilities are close, and that both do well in the 0.5-probability region, which is what we want. We see big differences when the probabilities are very different, though (i.e. high target, low output and vice versa): the MSE derivative first dips a bit, then goes to zero, resulting in slow or no learning. Effectively, it ignores samples where the output is completely off target. The KL-divergence derivative converges to a stable value, so the network always learns from such an example.
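To make the gradient behaviour concrete: the derivative of the MSE loss with respect to the output logit z is 2*(sigmoid(z) - a)*sigmoid(z)*(1 - sigmoid(z)), which vanishes as |z| grows, while for the KL divergence it is simply sigmoid(z) - a, which stays bounded. A quick numeric check (a sketch, assuming PyTorch autograd):

Code: Select all

import torch

a = torch.tensor(0.9)            # target probability
for z0 in [0.0, -4.0, -10.0]:    # output logits, increasingly far off target
    z = torch.tensor(z0, requires_grad=True)
    mse = (torch.sigmoid(z) - a) ** 2
    mse.backward()
    g_mse = z.grad.item()        # = 2*(sigmoid(z)-a)*sigmoid(z)*(1-sigmoid(z))

    z = torch.tensor(z0, requires_grad=True)
    kl = (a * torch.log(a / torch.sigmoid(z))
          + (1 - a) * torch.log((1 - a) / (1 - torch.sigmoid(z))))
    kl.backward()
    g_kl = z.grad.item()         # = sigmoid(z) - a

    print(f"z={z0:6.1f}  dMSE/dz={g_mse:+.5f}  dKL/dz={g_kl:+.5f}")
# The MSE gradient shrinks towards zero as z moves far from the target,
# while the KL gradient approaches a stable value of roughly -a.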
Computationally, there is no big difference: neither is particularly complex to implement without numerical issues. For the Kullback-Leibler divergence loss, you do this by fusing it with the sigmoid computation.
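By "fusing with the sigmoid" I mean something like the following (a sketch, assuming PyTorch): stay in logit space and use logsigmoid, so neither the sigmoid nor the log can overflow. Up to the target-entropy term, which is constant with respect to the network, this is just binary cross-entropy with logits.

Code: Select all

import torch
import torch.nn.functional as F

def stable_kl_loss(z, t):
    # -log(sigmoid(z)) = -logsigmoid(z); -log(1 - sigmoid(z)) = -logsigmoid(-z)
    cross_entropy = -(t * F.logsigmoid(z) + (1.0 - t) * F.logsigmoid(-z))
    # Entropy of the target: constant w.r.t. the network parameters, kept only
    # so the loss is exactly the KL divergence (zero when sigmoid(z) == t).
    t_safe = t.clamp(1e-12, 1.0 - 1e-12)
    target_entropy = -(t_safe * t_safe.log() + (1.0 - t_safe) * (1.0 - t_safe).log())
    return (cross_entropy - target_entropy).mean()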
For what it’s worth, I also did a quick experiment. For me, the difference between the two loss functions is statistically insignificant (confidence interval of [-10 to +21] ELO). Also training convergence is similar, maybe a tad quicker with KL divergence. So perhaps this is a fuss over nothing…
I couldn’t find anything written on why MSE loss is used. So does anyone know why it’s used in practice? Perhaps historical reasons, and it being “good enough”? And are there places this topic was discussed that I missed?
NNUE loss function, why MSE?
Moderator: Ras
-
mtijink
- Posts: 2
- Joined: Sun Feb 01, 2026 7:39 pm
- Full name: Matthijs Tijink
-
sscg13
- Posts: 24
- Joined: Mon Apr 08, 2024 8:57 am
- Full name: Chris Bao
Re: NNUE loss function, why MSE?
For a single "internal unit" -> "expected score" conversion it is tough to define KL divergence, since I think draws are important. Actually the most important positions usually have an expected score of 0.75 (or 0.25). A position with 0.75 expected score is almost certainly 1/2 win, 1/2 draw, not 3/4 win and 1/4 loss. Similarly, 0.5 is almost certainly 1-eps draw, not 1/2 win, 1/2 loss, so not considering draws doesn't make much sense to me.
For nets that directly predict WDL (say, via softmax), it has been found that using cross entropy (similar to KL divergence) is significantly worse in elo than MSE.
In terms of general machine learning principles, one should treat NNUE as a "regression" task, not really a "classification" task, and hence MSE makes more sense.
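For concreteness, the WDL comparison referred to here looks roughly like this (a sketch only, PyTorch assumed; the 1.0/0.5/0.0 score weights are the usual convention, everything else is illustrative rather than any specific engine's trainer):

Code: Select all

import torch
import torch.nn.functional as F

def wdl_ce_loss(wdl_logits, wdl_target):
    # wdl_logits: (batch, 3) raw scores for win/draw/loss
    # wdl_target: (batch, 3) target distribution over win/draw/loss
    return -(wdl_target * F.log_softmax(wdl_logits, dim=1)).sum(dim=1).mean()

def wdl_mse_loss(wdl_logits, wdl_target):
    return ((F.softmax(wdl_logits, dim=1) - wdl_target) ** 2).sum(dim=1).mean()

def expected_score(wdl_probs):
    # win=1, draw=0.5, loss=0; e.g. (0.5, 0.5, 0.0) gives the 0.75 mentioned above
    return wdl_probs @ torch.tensor([1.0, 0.5, 0.0])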
-
mtijink
- Posts: 2
- Joined: Sun Feb 01, 2026 7:39 pm
- Full name: Matthijs Tijink
Re: NNUE loss function, why MSE?
Right, draws are important! Including them correctly is not trivial though. For example, by using cross entropy you have two problems to solve:
- How to include search evals (important because they are a very strong signal!). You need something like KL-divergence, and need to translate search evals to loss/draw/win categories somehow (see the sketch after this list); and
- You throw away information, namely that predicting "loss" is worse than predicting "draw" if the true outcome is "win".
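To illustrate the first problem: one hypothetical way to translate a search eval into a WDL distribution (purely a sketch, not something I have validated; the scale and the toy draw model are made up) would be:

Code: Select all

import torch

def eval_to_wdl(search_eval_cp, scale=400.0):
    s = torch.sigmoid(search_eval_cp / scale)   # expected score, ignoring draws
    # "Independent Bernoulli" toy model: win = s^2, draw = 2*s*(1-s), loss = (1-s)^2.
    # Draws peak at equality, and the expected score win + 0.5*draw stays equal to s.
    win = s * s
    draw = 2.0 * s * (1.0 - s)
    loss = (1.0 - s) * (1.0 - s)
    return torch.stack([win, draw, loss], dim=-1)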
Both MSE and KL-divergence (ignoring draws) are more pragmatic and ad-hoc. To me the balance seems to lie on the side of KL-divergence, but... that's not what people seem to use. So I'm still curious what other people found; I'm trying to understand what works and what doesn't, and why. I did search a bit on cross entropy, but couldn't find anyone saying it does worse than MSE. Do you have any links or other search terms for me?
Regarding it being a regression task, not a classification task: I think it's a bit of both. On one hand, there's a clear connection to loss/draw/win; this is why we search for the best move, the highest eval, in the first place! But it's not a perfect match, no. So regression makes sense, and MSE makes sense if the residual (= search eval - NNUE eval) is approximately normally distributed. But it isn't, exactly because we apply the sigmoid function. The statistician in me says to find a transform that fixes that issue and makes sense from domain knowledge. Yet that means not applying the sigmoid, which has real practical issues, mostly because the region around probability 0.5 is not emphasized. So MSE can be a fine ad-hoc choice! Still, there are the issues I saw when the search/NNUE probabilities are very different, so I'd find it hard to recommend it given there's a good alternative. That's what I'm trying to understand better, though.
-
JacquesRW
- Posts: 133
- Joined: Sat Jul 30, 2022 12:12 pm
- Full name: Jamie Whiting
Re: NNUE loss function, why MSE?
Citation needed? This is not the case in Monty, at least.
EDIT: CE was worse than MSE when tested in Viri and Plenty on WDL outputs.
Last edited by JacquesRW on Tue Feb 03, 2026 5:28 pm, edited 1 time in total.
-
JacquesRW
- Posts: 133
- Joined: Sat Jul 30, 2022 12:12 pm
- Full name: Jamie Whiting
Re: NNUE loss function, why MSE?
mtijink wrote: ↑Sun Feb 01, 2026 9:47 pm
For what it’s worth, I also did a quick experiment. For me, the difference between the two loss functions is statistically insignificant (confidence interval of [-10 to +21] ELO). Also training convergence is similar, maybe a tad quicker with KL divergence. So perhaps this is a fuss over nothing…

That is far too short a test to determine anything. I'm gonna ask the Plenty and Reckless authors to run a proper test, as it is probable that no one has ever run a proper SPRT on this specific thing in a relevant (i.e. strong engine) setting. At the end of the day, elo is all that matters.
-
Sopel
- Posts: 397
- Joined: Tue Oct 08, 2019 11:39 pm
- Full name: Tomasz Sobczyk
Re: NNUE loss function, why MSE?
It's kind of a grey area; we're not even strictly using MSE. Early Stockfish networks used cross-entropy. Then we switched to MSE (l2-norm). Now it has been fine-tuned to an l2.5-norm with some additional correction factors. It's just whatever works in practice. https://github.com/official-stockfish/n ... ule.py#L57, https://github.com/official-stockfish/n ... fig.py#L14
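The general shape of such a loss is a p-norm in win-probability space; this is only a rough sketch of the idea with p = 2.5, not the actual nnue-pytorch code (the linked model.py is the authoritative version, correction factors and all):

Code: Select all

import torch

def pnorm_loss(output_logits, target_prob, p=2.5):
    # |sigmoid(z) - t|^p averaged over the batch; p=2 recovers plain MSE.
    return (torch.sigmoid(output_logits) - target_prob).abs().pow(p).mean()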
-
jorose
- Posts: 384
- Joined: Thu Jan 22, 2015 3:21 pm
- Location: Zurich, Switzerland
- Full name: Jonathan Rosenthal
Re: NNUE loss function, why MSE?
Winter's net is trained with a weighted average of MSE (in win-percent space) and cross-entropy (in WDL space). I remember being unsatisfied by the fact that this was better than pure cross-entropy and also better than pure MSE, but it was consistent across more than one net.
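Roughly, such a blend looks like the following (a sketch only, not Winter's actual code; the alpha weight and the 1.0/0.5/0.0 score weights are placeholder assumptions):

Code: Select all

import torch
import torch.nn.functional as F

def blended_loss(wdl_logits, wdl_target, alpha=0.5):
    log_probs = F.log_softmax(wdl_logits, dim=1)
    # Cross-entropy in WDL space.
    ce = -(wdl_target * log_probs).sum(dim=1).mean()
    # MSE in win-percent (expected score) space.
    score_weights = torch.tensor([1.0, 0.5, 0.0])       # win / draw / loss
    pred_score = log_probs.exp() @ score_weights
    target_score = wdl_target @ score_weights
    mse = ((pred_score - target_score) ** 2).mean()
    return alpha * mse + (1.0 - alpha) * ce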
That being said, it has been a long time since I tested this and Winter's correction history is actually based on Wasserstein distance, so I feel I need to revisit it as soon as I get back to network training.
-Jonathan