Maybe that has something to do with different training data. Stockfish probably benefits a lot more from speed if it's being trained on heterogeneous types of data optimized for finding forcing moves, instead of evaluating relatively quiet positions.

sscg13 wrote: ↑Thu Jan 29, 2026 3:08 pm
I am frankly shocked by this, since our scaling data suggests a 300 MB NNUE (L1=3072?) would be around 100 elo worse compared to 100 MB (L1=1024?), not to mention undertraining.

FireDragon761138 wrote: ↑Thu Jan 29, 2026 1:27 pm
We tested a 300 mb NNUE with threat inputs. The 300 MB NNUE was only marginally better than the 100 MB NNUE with threat inputs in terms of SPRT testing.
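For a sense of scale, the "around 100 elo worse" figure quoted above can be converted into an expected match score with the standard logistic Elo formula. This is only a rough sketch: the function names are made up for illustration and are not taken from any engine or testing framework, and it ignores draw rates entirely.

```python
import math


def expected_score(elo_diff: float) -> float:
    """Expected score (0..1) for a side rated elo_diff above its opponent,
    using the standard logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


def elo_from_score(score: float) -> float:
    """Inverse of expected_score: Elo difference implied by a score in (0, 1)."""
    if not 0.0 < score < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    return -400.0 * math.log10(1.0 / score - 1.0)


if __name__ == "__main__":
    # A net that is 100 Elo weaker is expected to score about 36% in a match.
    print(f"{expected_score(-100):.3f}")
```

So under this model a 100 Elo deficit is roughly a 36% expected score, which is why a result of "marginally better" looks so far from that scaling estimate.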
Threat inputs and engine eval stability
Moderator: Ras
-
FireDragon761138
- Posts: 71
- Joined: Sun Dec 28, 2025 7:25 am
- Full name: Aaron Munn
Re: Threat inputs and engine eval stability
-
syzygy
- Posts: 5868
- Joined: Tue Feb 28, 2012 11:56 pm
Re: Threat inputs and engine eval stability
Royal we.

chrisw wrote: ↑Thu Jan 29, 2026 1:41 pm
We is normally used when there’s more than one worker. Unlikely any other worker would stick around for more than five minutes with the degree of narcissism on display here. We tested is therefore oxymoronic.

FireDragon761138 wrote: ↑Thu Jan 29, 2026 1:27 pm
We tested a 300 mb NNUE with threat inputs. The 300 MB NNUE was only marginally better than the 100 MB NNUE with threat inputs in terms of SPRT testing.
The main problem we had with threat inputs was the loss of engine evaluation stability. It was fairly dramatic.
-
mar
- Posts: 2673
- Joined: Fri Nov 26, 2010 2:00 pm
- Location: Czech Republic
- Full name: Martin Sedlak
-
syzygy
- Posts: 5868
- Joined: Tue Feb 28, 2012 11:56 pm
Re: Threat inputs and engine eval stability
I don't think it is humility that led to the use of "we" here.
-
FireDragon761138
- Posts: 71
- Joined: Sun Dec 28, 2025 7:25 am
- Full name: Aaron Munn
Re: Threat inputs and engine eval stability
That assertion is untrue. There are two humans working on this project.

chrisw wrote: ↑Thu Jan 29, 2026 1:41 pm
We is normally used when there’s more than one worker. Unlikely any other worker would stick around for more than five minutes with the degree of narcissism on display here. We tested is therefore oxymoronic.

FireDragon761138 wrote: ↑Thu Jan 29, 2026 1:27 pm
We tested a 300 mb NNUE with threat inputs. The 300 MB NNUE was only marginally better than the 100 MB NNUE with threat inputs in terms of SPRT testing.
The main problem we had with threat inputs was the loss of engine evaluation stability. It was fairly dramatic.