Threat inputs and engine eval stability

sscg13 · Post by **sscg13** » Wed Jan 28, 2026 2:20 am

disabled code that auto-downloads the network.

You can upload an NNUE to fishtest, and then this code will work fine. That will also let the source code actually compile into an executable.

(In fact, you do not need fishtest specifically; any link capable of hosting a download will do, including Google Drive, etc.)

syzygy · Post by **syzygy** » Wed Jan 28, 2026 4:13 am

FireDragon761138 wrote: ↑Wed Jan 28, 2026 2:16 am
syzygy wrote: ↑Wed Jan 28, 2026 1:17 am https://www.theoriachess.org/research/c ... -analysis/
Modern chess engines employ fundamentally different approaches to position analysis. Stockfish-17.1 represents the state-of-the-art in brute-force calculation, evaluating positions through deep alpha-beta pruning and neural network evaluation. Theoria 0.1 incorporates conceptual frameworks that annotate chess motifs and themes. This research investigates which approach yields more interpretable strategic analysis for human learners.
All you did was take Stockfish-16.1 and call it Theoria 0.1.
Stockfish-16.1 does not "incorporate conceptual frameworks that annotate chess motifs and themes" any less than Stockfish-17.1.

What is the grift exactly? Are you cheating to get a PhD?
No, we trained a new NNUE, disabled the secondary network, and disabled code that auto-downloads the network. We also experimented with turning off and on various algorithms in the engine.

Disabling the secondary network was that one line.
It seems you have now added an NNUE net to the download page. It wasn't there before.
So what you did was use the standard tools in the standard way on standard data to create a standard network.

In no way does replacing the NNUE result in "conceptual frameworks that annotate chess motifs and themes" that were not there already in SF-16.1.

And replacing the NNUE obviously does not turn SF-16.1 into a different engine that you could call "yours".
There is not even copyright on an NNUE.

Ciekce · Post by **Ciekce** » Wed Jan 28, 2026 12:00 pm

FireDragon761138 wrote: ↑Tue Jan 27, 2026 11:08 pm Threat inputs adds nothing positive to our engine, either in evaluation stability or elo.

I do not believe you are capable of properly testing this.

> Stockfish-17.1 represents the state-of-the-art in brute-force calculation

Brute force? A modern engine?

You'll probably reply "yes", as in my (fairly painful) interactions with you in the SF server you convinced me that you entirely lack self-awareness and literacy, so I'll spell it out for you: I am ridiculing the idea that any modern engine can be remotely accurately described as "brute force".

FireDragon761138 · Post by **FireDragon761138** » Wed Jan 28, 2026 9:34 pm

Ciekce wrote: ↑Wed Jan 28, 2026 12:00 pm
FireDragon761138 wrote: ↑Tue Jan 27, 2026 11:08 pm Threat inputs adds nothing positive to our engine, either in evaluation stability or elo.
I do not believe you are capable of properly testing this.

SPRT runs aren't rocket science, and I can use Claude Opus to do scale-invariant analysis, so that isn't correct.

syzygy · Post by **syzygy** » Wed Jan 28, 2026 10:45 pm

FireDragon761138 wrote: ↑Wed Jan 28, 2026 9:34 pm
Ciekce wrote: ↑Wed Jan 28, 2026 12:00 pm
FireDragon761138 wrote: ↑Tue Jan 27, 2026 11:08 pm Threat inputs adds nothing positive to our engine, either in evaluation stability or elo.
I do not believe you are capable of properly testing this.

SPRT runs aren't rocket science, and I can use Claude Opus to do scale-invariant analysis, so that isn't correct.

What do you mean "adds ... to 'our' engine" ?
"Your" engine is SF16.1 with a net you allegedly trained yourself.

1.
Did you add threat inputs to your SF16.1 net?
Could you explain what you did exactly?

2.
How did you test Elo vs default SF16.1 exactly?
Could you explain your methodology?

sscg13 · Post by **sscg13** » Thu Jan 29, 2026 4:58 am

FireDragon761138 wrote: ↑Wed Jan 28, 2026 9:34 pm SPRT runs aren't rocket science

SPRT tells you whether a certain change is sufficiently good or not. It does not tell you if you have implemented the change properly.
I strongly suspect your implementation of threat inputs is not correct but I cannot say anything further without the exact code difference.

You have also been told multiple times that the "averaging" used in MCTS by default is good for "stability", which is part of the reason why most people consider "stability" a nonsense metric.
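For readers unfamiliar with the test being discussed, here is a minimal sketch of an SPRT decision on win/draw/loss counts. It uses a simple trinomial normal approximation of the log-likelihood ratio, which is an assumption made for illustration and not necessarily what fishtest or any particular tool computes. The function names (`elo_to_score`, `sprt_llr`, `sprt_decision`), the Elo bounds, and the 0.05 error rates are all hypothetical choices. Note that the output only says whether the tested build passes against the baseline; it cannot say whether the feature behind it was implemented correctly.

```python
import math


def elo_to_score(elo):
    """Expected score for a given Elo difference (logistic model)."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))


def sprt_llr(wins, draws, losses, elo0, elo1):
    """Approximate log-likelihood ratio of H1 (elo1) against H0 (elo0).

    Uses the observed mean and variance of the per-game score
    (win = 1, draw = 0.5, loss = 0). This is an illustrative
    approximation, not a reference implementation.
    """
    n = wins + draws + losses
    if n == 0:
        return 0.0
    mean = (wins + 0.5 * draws) / n
    var = (wins + 0.25 * draws) / n - mean * mean
    if var <= 0.0:
        # All games had the same result; the approximation is undefined.
        return 0.0
    s0, s1 = elo_to_score(elo0), elo_to_score(elo1)
    return n * (s1 - s0) * (2.0 * mean - s0 - s1) / (2.0 * var)


def sprt_decision(llr, alpha=0.05, beta=0.05):
    """Compare the LLR against Wald's stopping bounds."""
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    if llr >= upper:
        return "accept H1"
    if llr <= lower:
        return "accept H0"
    return "continue"


if __name__ == "__main__":
    # Made-up counts, purely to show the call pattern.
    llr = sprt_llr(wins=600, draws=300, losses=100, elo0=0.0, elo1=5.0)
    print(sprt_decision(llr))
```

A test like this only compares two binaries. If the patch under test is buggy, SPRT will simply report that the buggy version fails to gain, which is easy to misread as "the idea does not work".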

sscg13 · Post by **sscg13** » Thu Jan 29, 2026 8:32 am

Even if you correctly implemented threat inputs, threat inputs need to be optimized well. There is also a fixed overhead to speed, meaning that it will only become better with large NNUE. Correspondingly, larger NNUE also requires longer training time.

FireDragon761138 · Post by **FireDragon761138** » Thu Jan 29, 2026 1:27 pm

sscg13 wrote: ↑Thu Jan 29, 2026 8:32 am Even if you correctly implemented threat inputs, threat inputs need to be optimized well. There is also a fixed overhead to speed, meaning that it will only become better with large NNUE. Correspondingly, larger NNUE also requires longer training time.

We tested a 300 MB NNUE with threat inputs. The 300 MB NNUE was only marginally better than the 100 MB NNUE with threat inputs in terms of SPRT testing.

The main problem we had with threat inputs was the loss of engine evaluation stability. It was fairly dramatic.

chrisw · Post by **chrisw** » Thu Jan 29, 2026 1:41 pm

FireDragon761138 wrote: ↑Thu Jan 29, 2026 1:27 pm
sscg13 wrote: ↑Thu Jan 29, 2026 8:32 am Even if you correctly implemented threat inputs, threat inputs need to be optimized well. There is also a fixed overhead to speed, meaning that it will only become better with large NNUE. Correspondingly, larger NNUE also requires longer training time.
We tested a 300 MB NNUE with threat inputs. The 300 MB NNUE was only marginally better than the 100 MB NNUE with threat inputs in terms of SPRT testing.

The main problem we had with threat inputs was the loss of engine evaluation stability. It was fairly dramatic.

"We" is normally used when there's more than one worker. It is unlikely any other worker would stick around for more than five minutes with the degree of narcissism on display here. "We tested" is therefore oxymoronic.

sscg13 · Post by **sscg13** » Thu Jan 29, 2026 3:08 pm

FireDragon761138 wrote: ↑Thu Jan 29, 2026 1:27 pm
sscg13 wrote: ↑Thu Jan 29, 2026 8:32 am Even if you correctly implemented threat inputs, threat inputs need to be optimized well. There is also a fixed overhead to speed, meaning that it will only become better with large NNUE. Correspondingly, larger NNUE also requires longer training time.
We tested a 300 MB NNUE with threat inputs. The 300 MB NNUE was only marginally better than the 100 MB NNUE with threat inputs in terms of SPRT testing.

I am frankly shocked by this, since our scaling data suggests a 300 MB NNUE (L1=3072?) would be around 100 elo worse compared to 100 MB (L1=1024?), not to mention undertraining.