Real engine ELO - normalised to classic time controls

Discussion of anything and everything relating to chess playing software and machines.

Moderator: Ras

ydebilloez
Posts: 188
Joined: Tue Jun 27, 2017 11:01 pm
Location: Lubumbashi
Full name: Yves De Billoëz

Real engine ELO - normalised to classic time controls

Post by ydebilloez »

I wonder if blitz and rapid ratings are inflated. As per the elo calculation, anything above a 700 elo difference means practically a 0% chance for the weaker party, even after 100 games.
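As a quick sanity check on that figure, here is a minimal sketch of the standard logistic Elo expectancy formula (my own illustration; the exact percentage depends on which rating model a list uses):

Code: Select all

def expected_score(elo_diff):
    """Expected score per game for the weaker player at a given rating gap,
    using the standard logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (elo_diff / 400.0))

# At a 700-point gap the weaker side expects roughly 0.017 points per game,
# i.e. about 1-2 drawn games out of 100 - effectively zero in practice.
print(expected_score(700))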

Stockfish 17.1 has almost 3800 elo in blitz and 3600 in rapid. That is a 200 elo gap, so at least one of the scores is wrong. (Magnus has 2800 in classical.) If we ran Stockfish at 40 moves in 120 minutes, how much elo would it have in classical?

A good way of re-evaluating would be to have it play with rapid time controls against reference engines in standard time controls. I presume someone has already done this.

Another way of asking the same question is, how much elo is gained by increasing from 5s/move to 180s/move.

The best thing would be to measure it against humans. If comparing against humans, we need to remove some unfair practices. I would remove tablebases and pre-compiled specific NNUE tables and let it play with a limited opening book comparable to what GMs know. (Humans are not allowed to consult any outside documentation while playing, so engines should be restricted the same way.)

Then, in self-play, we can add back the complete opening books and all the other goodies, just to see how much elo that adds.
Yves De Billoëz @ macchess belofte chess
Once owner of a Mephisto I, II, challenger, ... chess computer.
jkominek
Posts: 105
Joined: Tue Sep 04, 2018 5:33 am
Full name: John Kominek

Re: Real engine ELO - normalised to classic time controls

Post by jkominek »

ydebilloez wrote: Mon Jan 05, 2026 6:25 pm I wonder if blitz and rapid rates are inflated. ... Stockfish 17.1 has almost 3800 elo in blitz, 3600 in rapid. This is a 200 elo gap, so at least one of the scores is wrong.
I would put it differently. Instead of saying that at least one of the (CCRL) lists is wrong, i.e. miscalibrated, it is more accurate to say that they are two unconnected pools of players, but that within each pool the rating system faithfully represents the relative performance of the participants. The Blitz lists are not inflated. Rather it is a time control that heightens sensitivity between engines.

It is fair to talk about rating inflation and deflation in human lists. This is because the ratings are adjusted with an incremental update equation, and the methods for handling the introduction of new players into the list - which injects rating points into the pool with high uncertainty - have been a thorn in the side of ratings stewards from the beginning. But engine ratings are computed using simultaneous likelihood optimization. Inflation as a concept does not apply.
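To make the distinction concrete, here is a minimal sketch of the two approaches (my own illustration, not any federation's or tester's actual implementation): an incremental K-factor update of the kind human lists use, versus a batch maximum-likelihood fit of the kind engine lists use.

Code: Select all

import math

SCALE = 400.0 / math.log(10.0)   # Elo points per natural-log unit

def win_prob(diff):
    """Logistic expected score given a rating difference in natural units."""
    return 1.0 / (1.0 + math.exp(-diff))

def incremental_update(r_a, r_b, score_a, k=20.0):
    """Human-list style: nudge two Elo-scale ratings after a single game
    (K-factor update). Points can enter or leave the pool when new players
    are seeded, which is where inflation/deflation talk comes from."""
    e = win_prob((r_a - r_b) / SCALE)
    return r_a + k * (score_a - e), r_b - k * (score_a - e)

def fit_ratings(results, players, iters=5000, lr=0.5):
    """Engine-list style: find the ratings that jointly maximize the likelihood
    of ALL results at once (plain gradient ascent on the Bradley-Terry model)."""
    r = {p: 0.0 for p in players}
    for _ in range(iters):
        grad = {p: 0.0 for p in players}
        for a, b, s in results:                 # s = score of a against b (1, 0.5, 0)
            e = win_prob(r[a] - r[b])
            grad[a] += s - e
            grad[b] -= s - e
        for p in players:
            r[p] += lr * grad[p]
    mean = sum(r.values()) / len(r)
    return {p: SCALE * (v - mean) for p, v in r.items()}   # anchor pool mean to 0

# Made-up results, purely for illustration:
games = [("A", "B", 1), ("A", "B", 0.5), ("B", "C", 1), ("A", "C", 1), ("B", "C", 0.5)]
print(fit_ratings(games, ["A", "B", "C"]))

The batch fit has no notion of rating points flowing into or out of a pool over time; it only reports the relative spacing that best explains the full result matrix.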
A good way of re-evaluating would be to have it play with rapid time controls against reference engines in standard time controls. I presume someone has already done this.
I can't think of anyone who has done that. It would take a lot of compute resources to calibrate against a set of classical time control "standard candles," a term I borrow from astronomy. An easier step towards compatibility would be to cross-link the Blitz and Rapid lists. But to date Graham and his fellow testers have shown no interest in doing that.

The now-defunct SSDF rating list maintained continuity when they upgraded test computers by playing games between the old and new computers.
Another way of asking the same question is, how much elo is gained by increasing from 5s/move to 180s/move.
That depends on the engines being tested and also on the opening book employed, so there is no single answer. But I can give you an idea of how Stockfish scales with the "Chess 324" opening book. In the following plot I have established Stockfish 10 as the baseline against which future releases are compared.



It's interesting that the maximal relative difference between Stockfish versions has shifted over releases, and that it is found at quite short time controls. In my fixed-nodes-per-move tests, for SF12 and above the peak difference lies between 2^14 and 2^18 nodes/move, single threaded. On modern high-end hardware such as is used at CCC and TCEC, roughly: 2^25 = Hyper Bullet, 2^28 = Bullet, 2^31 = Blitz, 2^34 = Rapid, 2^37 = Classical.
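For a rough feel of what those node budgets mean in wall-clock terms, here is a back-of-envelope conversion; the 400 MN/s figure is purely my assumption for CCC/TCEC-class multi-core hardware, not a measured value:

Code: Select all

ASSUMED_NPS = 400e6   # assumed nodes/second; real figures vary widely with hardware and threads

for exp, label in [(25, "Hyper Bullet"), (28, "Bullet"), (31, "Blitz"),
                   (34, "Rapid"), (37, "Classical")]:
    nodes = 2 ** exp
    print(f"2^{exp} = {nodes:.2e} nodes  ~ {nodes / ASSUMED_NPS:.1f} s/move  ({label})")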

When CCRL or someone else has two or more separate ratings lists at different time controls, think of it as taking vertical slices through the graph. The separation will be increasingly compacted at long time controls due to improved play, and inversely it will be dilated at fast time controls. Except at the super-short range, where the engines are too weak to take advantage of the higher mistake rates.

All that said, I get the impression that you want rating values to have stable interpretation on an absolute scale, e.g. 2800 Elo software should be an even match for Caruana. That's a pretty common sentiment. But not an inherent part of paired-comparison rating systems.
The best thing would be to measure it against humans.
That would be nice. It might happen if a big pile of sponsor cash were made available for prize money. It's hard for me to see it happening without some real incentive at play, though.
User avatar
pohl4711
Posts: 2856
Joined: Sat Sep 03, 2011 7:25 am
Location: Berlin, Germany
Full name: Stefan Pohl

Re: Real engine ELO - normalised to classic time controls

Post by pohl4711 »

ydebilloez wrote: Mon Jan 05, 2026 6:25 pm I wonder if blitz and rapid rates are inflated.
Why do you wonder? Celos (Computer Elo) are always inflating when you run a rating list. That is a very well known phenomenon and does not depend on blitz/rapid; it also happens with long TC. Already in the stone age, the old SSDF rating list (a 3min/move rating list (!!!)) was lowered twice (?) by -100 Celos over 20 years or so.
jkominek
Posts: 105
Joined: Tue Sep 04, 2018 5:33 am
Full name: John Kominek

Re: Real engine ELO - normalised to classic time controls

Post by jkominek »

https://www.chessprogramming.org/SSDF

Rating Calibration

In the beginning a lot of calibrations were done to match human ratings [4]:

Year Calibration
1985 For the next list the level is lowered by 49 points. The level is lowered by 23 points more.
1986 Once again the level is lowered, this time by 17-22 points.
1987 The list's level is lowered by about 3 points.
1988 The rating list is lowered by about 18 points.
1989 The level of the rating list gets a huge lowering of about 70 points.
2000 Computer ratings lowered by 100 points again because the top is too high compared to humans.
---------------------

My commentary: This was a time when aligning computer ratings to FIDE Elo was highly desired, as "How strong are they compared to Masters/Grandmasters/Kasparov?" was a question of great interest. The data used by SSDF for cross linkage was never as good as one would have liked. But they did what they could.

For a visual depiction, here's a plot I made tracking the leading engines' ratings over time as tested by SSDF. This is after all calibration adjustments. The year 2000 marked the time when PC software was becoming competitive with the elite Grandmaster class.

jkominek
Posts: 105
Joined: Tue Sep 04, 2018 5:33 am
Full name: John Kominek

Re: Real engine ELO - normalised to classic time controls

Post by jkominek »

In line with the topic of this thread, a couple years ago I got a result that made me go "Hmm, now that's interesting." I've not seen anyone else mention it so perhaps I'm the only person who's spotted it. The Elo scaling of HCE (pre-NNUE) engines is much steeper than that of NNUE engines. (For Stockfish at least.) I'll illustrate.



To create this plot I began with a fully connected round robin tournament of up to 2^18 fixed nodes per move, in node doubling steps, with the absolute scale anchored to Gaviota 1.0 on CCRL 40-15. Above that threshold each Stockfish version played only against itself at different node budget odds.

As you can see, Stockfish 10 and 11 shoot up over 4000 with no clear asymptote in sight. Stockfish 12 to 15.1 rise more slowly, converge together, and are leveling off at a ceiling of around 3600. Obviously if I had played SF 15.1 against SF 10 at 2^29 nodes/mv each, Stockfish 10 would get walloped, completely at odds with a naive prediction based on the rating differences above. Hence the need for continued cross-linking.

Digging into it, I found that starting with Stockfish 12 the NNUE-based engines were able to "solve" a subset of the Chess 324 games, always being able to secure a certain percentage of draws, which was not true of Stockfish 10 and 11. In the case where a certain percentage of draws is guaranteed, then, no matter how much stronger the opponent is in an absolute sense, the relative rating gap between them will be capped. (*)
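To put rough numbers on that cap, here is a minimal sketch: if the weaker side always secures some fraction of draws and loses the rest, its score has a floor of half that fraction, which bounds the Elo gap the model can report (the draw-floor values below are illustrative, not taken from my data):

Code: Select all

import math

def max_elo_gap(draw_floor):
    """Largest rating gap the logistic Elo model can show if the weaker side
    always secures this fraction of draws and loses every other game."""
    min_score = draw_floor / 2.0
    return 400.0 * math.log10((1.0 - min_score) / min_score)

for d in (0.05, 0.10, 0.20, 0.40):
    print(f"guaranteed draws {d:.0%} -> gap capped at ~{max_elo_gap(d):.0f} Elo")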

When I compute a ratings list over the whole database and contrast it to pre/post-NNUE subsets it is evident that versions of Stockfish up to 11 exert an expansive "force" while versions 12 and above exert a compressive "force". The overall ratings list is a (max log likelihood) averaging, or compromise between the two forces.

I find this curiously similar to how human lists are compressed when compared to engine lists. It is also the case that in human pairings between players of very distant ratings, there is a statistically confirmed finding that the stronger player under-performs compared to the Elo ratings-based prediction. My working explanation is that human players have the ability and inclination to go for the quick draw. No engine has the concept of taking an easy last round draw in order to secure 3rd place prize money, for example. They grind on regardless.

A final note of commentary. Notice I have not once used the word inflation in this post (until now). In my opinion it is not helpful terminology. For one thing, with its common connection to economics it carries the connotation of being a bad thing -- of being something in our computer rating lists that should be fixed. Scaling divergence is a phenomenon, not just between human and computer lists, but within computer lists too. To me it is a curious phenomenon to be understood. I advocate for the more neutral terminology of compression and dilation.

(*) Scoring color-swapped game pairs as a single result is a reasonable way to raise the ratings cap.
Uri Blass
Posts: 11136
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Real engine ELO - normalised to classic time controls

Post by Uri Blass »

I see that even before NNUE, a rating list based on SF11 at different numbers of nodes per move gives a higher rating relative to a rating list based on SF10.

The steps, if I understand correctly, are the following:
1) make a round robin tournament for different engines with 2^n nodes per move for 6<=n<=18, so you have a rating for every engine at 2^18 nodes
2) make engine-engine matches between 2^n nodes per move and 2^(n+1) nodes per move for 18<=n<=28, to get a rating for every engine at 2^n nodes.

I think it may be interesting to add bigger n than 28 to see which engine gets the biggest rating.

If I understand correctly, you claim that Stockfish 10 gets a better rating than Stockfish 11, and Stockfish 11 gets a better rating than the new Stockfish, if you do not allow matches between different Stockfishes in your list for n>=18.
jkominek
Posts: 105
Joined: Tue Sep 04, 2018 5:33 am
Full name: John Kominek

Re: Real engine ELO - normalised to classic time controls

Post by jkominek »

The date on this experiment is Jan 1 2023, so in describing it here I'm working from 3 year old memory. And yes, it is true; dabbling in computer chess measurements is how I occupy my time on New Year's Day.

Up to 2^18 nodes/move the "contestants" are heavily connected. It is all-play-all for engines (engine settings) within about 1000 Elo of each other. This gives the curves stability. Above that threshold the connections are only within a single Stockfish version, and if I remember correctly, are multi-point. Example: SF15 at 2^22 nodes plays downward to 2^21, 2^20, 2^19, 2^18, as well as upward to an extent, i.e. 2^23, 2^24. The nodes at the upper reaches of 2^28 and 2^29 have the fewest connections. So for example SF 11 at 2^29 nodes might only have a single match of 648 games against SF 11 at 2^28 nodes. In contrast, SF11 at say 2^14 nodes will have played tens of thousands of games. I'd have to reprocess the data afresh to put precise numbers on it.
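For concreteness, here is a rough sketch of that pairing structure as described from memory above (a reconstruction, not my actual scripts, and it ignores the 1000-Elo window restriction in the dense region):

Code: Select all

def node_ladder(versions, lo=6, hi=29, dense_top=18, reach=2):
    """Dense all-play-all up to 2^dense_top nodes/move, then sparser
    node-odds matches within a single engine version above that."""
    pairings = []
    dense = [(v, n) for v in versions for n in range(lo, dense_top + 1)]
    for i, a in enumerate(dense):
        for b in dense[i + 1:]:
            pairings.append((a, b))               # every setting plays every other
    for v in versions:
        for n in range(dense_top, hi):
            for step in range(1, reach + 1):
                if n + step <= hi:
                    pairings.append(((v, n), (v, n + step)))   # same version, node odds
    return pairings

sched = node_ladder(["SF10", "SF11", "SF15"])
print(len(sched), "pairings; topmost:", sched[-1])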

Returning to the purpose of the experiment, what I attempted to compare was how well a particular version of Stockfish is able to exploit weaker, node-limited versions of itself. Curiously, the two tested HCE versions, SF10 and SF11, find a way to keep exploiting mistakes in weaker versions of themselves, much more so than the NNUE versions can, even when starting from roughly the same baseline Elo.

And yes, in this data SF10 exploited itself to a somewhat greater extent than did SF11, leading to a higher phony-rating. But as you'd expect, in head-to-head competitions SF11 at equal node budget in those upper ranges beats SF10.

Direct head-to-head results collected at a later date.

Code: Select all

 53) Stockfish 11 Hash 8192 Threads 1 Nodes 536870912   3809.3 :   6480 (+569,=4364,-1547)  42.5%
     vs.                                                       :  games (   +,    =,    -)  Draw  Perc    Perf :    Diff    SD    LOS
     Stockfish 11 Hash 4096 Threads 1 Nodes 268435456          :    648 ( 122,  487,   39)  75.2  56.4   +44.7 :   +15.4   4.3  100.0
     Stockfish 10 Hash 8192 Threads 1 Nodes 536870912          :    648 ( 123,  475,   50)  73.3  55.6   +39.3 :   +27.5   4.8  100.0
Unlike most other testers I don't care about measuring (nearly) every engine under the sun. I only chase the Big Fish. So I have specialty interests. But there is a take-home lesson for me: maintain strong cross-connections between versions, above and below, to as great an extent as possible.

If not, you can end up with systematic errors that dwarf the error-bar estimates. The different scaling behavior of SF-HCE versus SF-NNUE is outside the i.i.d. (independent, identically distributed) random variable assumptions that statisticians use in proving theorems.
I think it may be interesting to add bigger n than 28 to see which engine gets the biggest rating.
I agree. But I'd need the computing resources of technologov or noobpwndftw to make much headway in that direction.
FireDragon761138
Posts: 18
Joined: Sun Dec 28, 2025 7:25 am
Full name: Aaron Munn

Re: Real engine ELO - normalised to classic time controls

Post by FireDragon761138 »

jkominek wrote: Tue Jan 06, 2026 2:42 am
...
The best thing would be to measure it against humans.
That would be nice. It might happen if a big pile of sponsor cash were made available for prize money. It's hard for me to see it happening without some real incentive at play, though.
You can't compare engine vs. human time controls, that would be my guess. Engines benefit a lot more from the extra time in rapid play vs. blitz. That can mean searching a lot more of the less promising lines and finding hidden resources in rapid games vs. blitz.
Peter Berger
Posts: 767
Joined: Thu Mar 09, 2006 2:56 pm

Re: Real engine ELO - normalised to classic time controls

Post by Peter Berger »

ydebilloez wrote: Mon Jan 05, 2026 6:25 pm I wonder if blitz and rapid rates are inflated. As per elo calculation, anything above 700 elo difference means a 0% chance for the weaker party, even after 100 games.
The modern engines (I am mainly talking Stockfish and lc0 here) clearly have a problem beating way, way weaker engines often enough at classical time controls from the standard opening position.

From time to time I download their latest and greatest versions and pit them against Crafty for fun. It simply never happens that Crafty fails to get a draw in a run of, say, 10 games. That should happen like never.

Some of this could be cured by simple opening books, but from what I have seen in my fun experiments, this simply hides the phenomenon, and it is still there.

Chess is drawish, but not +that+ drawish. If these engines were playing against weaker opponents more often in their testing procedure, I am convinced that Crafty wouldn't get these draws.
jkominek
Posts: 105
Joined: Tue Sep 04, 2018 5:33 am
Full name: John Kominek

Re: Real engine ELO - normalised to classic time controls

Post by jkominek »

One nice thing about the Chess 324 positions is that they include the standard, or orthodox, chess opening. The first plot below contrasts the average draw rate per opening versus its evaluation in pawn units, as provided by Komodo Dragon 3.2. The evaluations are from Stefan Pohl's analysis as part of his UHO opening book and related work.

The symmetrical positions are colored magenta, with the standard opening position colored red (if your eyes can make out the distinction). It is located at the very top-middle of the non-cyan cluster of dots. In the data collection I selected for analysis the standard opening had the 3rd highest draw rate out of the full set of 324. Chess might not be that drawish, but it is that drawish.



The image link is to a pair of plots. If you hover over the right edge you can click to the second one. It is even more interesting.

I analyzed the 324 opening positions to compare their individual capacity for ratings discrimination. Overly drawish openings will be poor at separating engines of course. But in the other direction highly imbalanced openings will also have poor separation ability because the results will be dominated by 1-0, 0-1 game pairs.

What I did was construct two pools of players. Pool A consisted of the 10 highest rated engines. Pool B consisted of 12 engines rated around 160 Elo lower on the full data list. I then broke down the pairings between members of Pool A and members of Pool B on a per-opening basis. From that I computed the average rating difference between Pool A and Pool B for each opening. The values range from around 80 to 240. The black line is a 4th-degree polynomial fit to the set of dots.
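For anyone who wants to reproduce the idea, a minimal sketch of a per-opening breakdown of this kind (my own simplification: it converts each opening's pooled Pool-A score straight into a performance-rating gap rather than refitting a full ratings model per opening):

Code: Select all

import math

def perf_gap(score_fraction):
    """Rating gap implied by a pooled score fraction (logistic Elo model)."""
    s = min(max(score_fraction, 0.01), 0.99)      # clamp away from 0/1
    return 400.0 * math.log10(s / (1.0 - s))

def per_opening_gaps(games):
    """games: (opening_id, pool_a_score) tuples with score in {1, 0.5, 0}.
    Returns the average Pool A vs Pool B gap implied by each opening."""
    totals, counts = {}, {}
    for opening, score in games:
        totals[opening] = totals.get(opening, 0.0) + score
        counts[opening] = counts.get(opening, 0) + 1
    return {o: perf_gap(totals[o] / counts[o]) for o in totals}

# Made-up results for two openings, purely illustrative:
sample = [("324", 1), ("324", 0.5), ("324", 1), ("001", 0.5), ("001", 0.5), ("001", 1)]
print(per_opening_gaps(sample))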

As before symmetric positions are colored magenta with the standard opening colored red. In this analysis the standard opening again comes in third place, this time as the 3rd worst opening in terms of capacity for ratings separation. To my surprise the opening best at separation was one of the symmetric positions.

Another surprise was that many of the openings that would be considered balanced or near-balanced based on engine evaluation proved themselves good separators. In the Pohl approach one selects a band around 1.0 as an on-the-edge opening book, for example (0.85, 1.15) or (0.9, 1.1). I have no qualms about that. But one can also take a post-hoc approach: run a ton of games and then, after the fact, select those openings that prove most able to separate engine A from engine A prime. The Stockfish team could do that. It might speed up their testing.

If you have an interest in examining some layouts, the top-ranked FENs are:
1. rbqnknbr/pppppppp/8/8/8/8/PPPPPPPP/RBQNKNBR w KQkq - 0 1; elodiff = 244.5, eval = 0.34, draws = 51.6%
2. rbqnknbr/pppppppp/8/8/8/8/PPPPPPPP/RNBNKBQR w KQkq - 0 1; elodiff = 231.9, eval = -0.04, draws = 52.5%
3. rbbnknqr/pppppppp/8/8/8/8/PPPPPPPP/RBNNKQBR w KQkq - 0 1; elodiff = 230.3, eval = 1.01, draws = 43.2%
4. rnbnkbqr/pppppppp/8/8/8/8/PPPPPPPP/RBQNKNBR w KQkq - 0 1; elodiff = 229.0, eval = 0.79, draws = 47.4%
5. rnbnkbqr/pppppppp/8/8/8/8/PPPPPPPP/RNQNKBBR w KQkq - 0 1; elodiff = 228.8, eval = 0.73, draws = 48.2%
I'm far from qualified to offer any chessic insights. But what does jump out at me is the positioning of the knights. They begin separated by one file.

Here are the worst two.
323. rbbqknnr/pppppppp/8/8/8/8/PPPPPPPP/RQBNKBNR w KQkq - 0 1; elodiff = 84.6, eval = 0.00, draws = 78.8%
324. rbbqknnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1; elodiff = 83.8, eval = 0.04, draws = 80.7%