elo and engine interpretability

Discussion of chess software programming and technical issues.

Moderator: Ras

FireDragon761138
Posts: 50
Joined: Sun Dec 28, 2025 7:25 am
Full name: Aaron Munn

elo and engine interpretability

Post by FireDragon761138 »

Based on what I'm finding in testing, around 3600 Elo (CCRL) may be the threshold above which an engine struggles to produce strategically/positionally coherent principal variations. Dragon sits right at the upper end of that boundary (with Dynamism at 100, rather than the more pedagogical 80), while the Theoria engine we're developing comes in just south of it, at around 3580. Above this point, perhaps the curvature of the otherwise flat-seeming manifold starts to become evident, and chess analysis is no longer easily resolved into a coherent narrative using traditional positional principles like center control, weak squares, piece development, and so on.

How we're measuring this:

I'm tracking the variance of the evaluation across nodes/depth as a proxy for strategic coherence. Essentially: does the engine's assessment converge smoothly as it calculates deeper, or does it thrash around with wild evaluation swings and constant PV changes?

Engines below the threshold tend to show stable convergence, with the principal variation maintaining thematic consistency. The position "makes sense" to the engine in terms of positional factors it can evaluate consistently.

Above 3600, you see dramatic instability, with the PV changing character more readily. Classical positional heuristics break down. It's like navigating the position by brute-force calculation rather than being guided by something analogous to coherent, interpretable strategic understanding.
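Here's roughly what the measurement looks like in practice; a minimal sketch using python-chess, with a placeholder engine path and depth range rather than our actual harness:

Code: Select all

# Sketch of the stability metric described above: analyse one position at
# increasing depths, then report evaluation variance and PV churn.
# Requires python-chess; the engine path and depth range are placeholders.
import statistics

import chess
import chess.engine

def stability_profile(fen, engine_path, max_depth=24):
    board = chess.Board(fen)
    scores, first_moves = [], []
    with chess.engine.SimpleEngine.popen_uci(engine_path) as engine:
        for depth in range(8, max_depth + 1):
            info = engine.analyse(board, chess.engine.Limit(depth=depth))
            # Score from the side to move's point of view, in centipawns.
            scores.append(info["score"].relative.score(mate_score=100000))
            first_moves.append(info["pv"][0])
    eval_variance = statistics.pvariance(scores)  # lower = smoother convergence
    pv_churn = sum(a != b for a, b in zip(first_moves, first_moves[1:]))
    return eval_variance, pv_churn

An engine that "thrashes" shows up as high evaluation variance and a first PV move that keeps flipping between depths.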

The manifold interpretation:

The manifold hypothesis from neural network information theory suggests that high-dimensional data (chess positions with hundreds of variables) actually occupies a much simpler low-dimensional structure - a manifold embedded in that high space, like Earth's 2D surface embedded in 3D space.

Below ~3600 Elo, chess appears to lie on a relatively flat manifold where traditional strategic principles work as local approximations. The geometry is simple enough that heuristics like "control the center" or "improve your worst piece" reliably point you in the right direction.

Above that threshold, the manifold's curvature becomes evident: it bends in ways that simple linear approximations (classical strategy) can't capture. The engine is navigating regions where the true geometry requires higher dimensionality, exhibits complex curvature, or has topological features that resist flattening into human-comprehensible principles. Moves that look strategically nonsensical might be following the manifold's actual structure through spaces only accessible via deep tactical calculation.
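As a toy numerical illustration of that last point (generic numpy, nothing chess-specific): fit the best flat plane to a patch of a curved surface, and the residual error of the "linear approximation" grows with the curvature:

Code: Select all

# Toy model: z = c * (x^2 + y^2) is a curved 2-D surface in 3-D space.
# A least-squares plane is the best "flat" (linear) approximation of the
# patch; its RMS error grows with the curvature c.
import numpy as np

rng = np.random.default_rng(0)

def plane_fit_error(curvature, radius=0.5, n=500):
    xy = rng.uniform(-radius, radius, size=(n, 2))
    z = curvature * (xy ** 2).sum(axis=1)
    A = np.column_stack([xy, np.ones(n)])         # plane: z ~ a*x + b*y + d
    coef, *_ = np.linalg.lstsq(A, z, rcond=None)
    return np.sqrt(np.mean((z - A @ coef) ** 2))

for c in (0.1, 1.0, 10.0):
    print(f"curvature {c:5.1f}: RMS plane-fit error {plane_fit_error(c):.4f}")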

In fact, it seems analogous to a phase transition: just a few Elo above or below this threshold produces dramatic differences in the strategic coherence of the engine's principal variations. Perhaps the mathematical complexity of chess itself is responsible for this.

Caveats:

Having said all that, the relationship only appears weakly correlated: roughly a -0.36 correlation between rating and the stability metric. So there are probably other factors. Network architecture, training data, and evaluation function design all matter independently. Simply using a weaker engine probably won't automatically give clearer or more transparent analysis; the engine needs to be designed for strategic coherence, as Theoria aims to be.

What this means:

If ~3600 really is where the chess manifold's complexity exceeds what human strategic frameworks can approximate, then Theoria at 3580 sits at an optimal point: strong enough to be accurate and useful, yet still operating in the region where chess positions can be explained through classical positional principles rather than dissolving into purely tactical calculation that is strategically opaque.
hgm
Posts: 28454
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: elo and engine interpretability

Post by hgm »

Interesting observation. But it could also be an artifact of the way Elo is measured:

Perhaps some randomness in the evaluation is needed to stand a chance of beating engines with a search deep enough to reach 3600 Elo. If you always headed for a predictable 'best' goal, the opponent would be able to recognize that too, and there would be no way to tempt it into making errors.

Also, evaluation noise helps work around a fundamental problem with alpha-beta search, namely that it blindly trusts its own analysis, without paying any attention to contingency planning. It will play a line that can force it to reach the leaf it considers best, even if any deviation from that line on its own part would lead to disaster. Then, when it discovers on a later move (when it has already committed itself to the line) that the leaf it was aiming for is no good, it is in deep trouble. A simple example of this effect is a 3-ply search that somehow thinks it would be good in the opening to have a white Knight on c4, which it could reach from b1 via either a3 or d2. There is then no reason for pure alpha-beta search to prefer one path over the other, because only the score with the Knight on c4 counts. But if that plan turns sour, e.g. because deeper search reveals that the Knight would simply be chased away from c4, so that going there would just waste two tempi, it would be better if the Knight were not stuck on a3. So Nd2 would be the safer plan, but alpha-beta search does not reward that.

Evaluation noise helps in that respect: if there are many leaves with nearly equal evaluation that the engine could choose between by deviating from its own PV, there is a larger chance that one of them gets a large random bonus on its evaluation. So the engine tends to select PVs that keep open the option of steering toward any position in that group, rather than a PV that depends entirely on the evaluation of a single leaf node, which is much less likely to receive the highest random bonus (the 'Beal effect').
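In toy form the idea is no more than this (generic negamax over an abstract game tree, not code from any real engine):

Code: Select all

# Negamax alpha-beta with a small random bonus at the leaves ('Beal
# effect'). A PV backed by many near-equal leaves is more likely to
# contain one leaf that drew a high bonus, so such PVs get preferred over
# PVs that hang on a single leaf. 'evaluate' and 'children' are assumed
# callbacks supplied by the caller.
import random

def noisy_alphabeta(node, depth, alpha, beta, evaluate, children, noise=5):
    moves = children(node)
    if depth == 0 or not moves:
        return evaluate(node) + random.randint(0, noise)
    best = -float("inf")
    for child in moves:
        score = -noisy_alphabeta(child, depth - 1, -beta, -alpha,
                                 evaluate, children, noise)
        best = max(best, score)
        alpha = max(alpha, score)
        if alpha >= beta:
            break  # beta cutoff
    return best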

The 'phase transition' could just be a consequence of the fact that the Elo of top engines is measured by playing them against a pool of opponents that on average have that same rating.
FireDragon761138
Posts: 50
Joined: Sun Dec 28, 2025 7:25 am
Full name: Aaron Munn

Re: elo and engine interpretability

Post by FireDragon761138 »

hgm wrote: Mon Jan 26, 2026 9:25 am Interesting observation. But it could also be an artifact of the way Elo is measured: [...]
That's food for thought.

It might explain why, at the 3600+ level, you can only discern clear differences in playing strength by playing thousands of games against other engines. A lot of it is just throwing stuff at the wall and hoping it sticks. That's terrible for the evaluation of any particular position, but probably great if the goal is to catch the other engine off balance in competitive play. Then it becomes an arms race over who can sling the most random moves and manage the complications.
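A back-of-envelope check on "thousands of games", assuming the standard logistic Elo model and a 50% draw rate (both assumptions on my part, not measurements):

Code: Select all

# How many games before an Elo edge of d points clears ~2 standard errors?
# Uses the logistic Elo model; the 50% draw rate is an assumption.
import math

def games_needed(elo_diff, draw_rate=0.5, z=1.96):
    expected = 1 / (1 + 10 ** (-elo_diff / 400))   # expected score per game
    p_win = expected - draw_rate / 2               # since E = p_win + draw/2
    variance = p_win + 0.25 * draw_rate - expected ** 2
    edge = expected - 0.5
    return math.ceil((z / edge) ** 2 * variance)

for d in (2, 5, 10, 20):
    print(f"{d:>3} Elo edge: ~{games_needed(d):,} games")

Under those assumptions, even a 10 Elo edge wants a couple of thousand games before it separates from noise.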

I started all this from a place of realist metaphysics, assuming there's objective structure in chess, and what I've found appears more nuanced: there is structure, and almost all of it is humanly comprehensible. But the upper reaches of it get strange, perhaps? What starts off as a courtly joust ends in a cafeteria food fight.
FireDragon761138
Posts: 50
Joined: Sun Dec 28, 2025 7:25 am
Full name: Aaron Munn

Re: elo and engine interpretability

Post by FireDragon761138 »

A thought... could this have something to do with the optimism algorithm in Stockfish?

I noticed while experimenting with it that toggling it off or on changes the engine's willingness to look for speculative sacrifices. It could induce something that resembles LLM hallucinations under deep search: moves that look plausible but carry no real information (noise). As the engine explores the game tree, the principal variation gets increasingly speculative and noisy.

Also, adjusting Dragon's Dynamism downward to 80 has a dramatic effect on the stability of its analysis, but it also weakens play by about 30 Elo.
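For anyone who wants to poke at this themselves, the toggle is roughly this via python-chess; the binary path is a placeholder, and "Dynamism" as the exact UCI option name is an assumption based on the setting discussed above:

Code: Select all

# Compare analysis of the same position at two Dynamism settings.
# "./dragon" is a placeholder path; "Dynamism" as the UCI option name is
# an assumption.
import chess
import chess.engine

with chess.engine.SimpleEngine.popen_uci("./dragon") as engine:
    for setting in (80, 100):
        engine.configure({"Dynamism": setting})
        info = engine.analyse(chess.Board(), chess.engine.Limit(depth=20))
        print(setting, info["score"], [m.uci() for m in info["pv"][:5]])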
FireDragon761138
Posts: 50
Joined: Sun Dec 28, 2025 7:25 am
Full name: Aaron Munn

Re: elo and engine interpretability

Post by FireDragon761138 »

I tested optimism, and it actually makes the evaluations more stable, not less. That's counter-intuitive, but it seems to be the case.