a direct comparison of FIDE and CCRL rating systems

nimh · Post by **nimh** » Tue Feb 23, 2016 12:06 am

http://www.chessanalysis.ee/rating%20comparison.pdf

This paper offers a brief demonstration of the relationship between the quality of play and ratings in both human and computer rating systems, using Komodo 8. In a few months I'll upload a longer overview where I'll look at further differences between the way engines and humans play chess. Instead of the more usual centipawns, at the urging of users Kai Laskos and Larry Kaufman, I've transformed Komodo 8 evaluations into expected scores using the logistic function.

Code:

p = cp/100
a = 1.1 (normalization factor)

ExpectedScore = {1 + (Exp[p/a] - Exp[-p/a])/(Exp[p/a] + Exp[-p/a])}/2

This table compares the two methods:

eval (pawns)  expected score
0.00 50.00%
0.50 69.71%
1.00 84.11%
1.50 92.41%
2.00 96.56%
2.50 98.47%
3.00 99.33%
3.50 99.71%
4.00 99.87%
4.50 99.94%
5.00 99.98%
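
A minimal Python sketch of the formula above, for anyone who wants to reproduce the table. The function name `expected_score` and its parameters are illustrative and not taken from the paper. The bracketed expression is algebraically (1 + tanh(p/a))/2, i.e. the logistic curve 1/(1 + exp(-2p/a)). One caveat: when evaluated, the percentages in the table correspond to a = 1.2, whereas a = 1.1 as written gives about 86.0% at 1.00, so the sketch keeps `a` as a parameter and prints both rather than guessing which value was intended.

```python
import math


def expected_score(cp, a=1.1):
    """Map a centipawn evaluation to an expected score in [0, 1]."""
    p = cp / 100.0  # centipawns -> pawns
    # Same expression as in the post; identical to (1 + tanh(p / a)) / 2.
    num = math.exp(p / a) - math.exp(-p / a)
    den = math.exp(p / a) + math.exp(-p / a)
    return (1 + num / den) / 2


# Print the conversion for both candidate normalization factors.
for cp in range(0, 501, 50):
    print(f"{cp / 100:.2f}  a=1.1: {expected_score(cp):7.2%}  "
          f"a=1.2: {expected_score(cp, a=1.2):7.2%}")
```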

As you can see, using expected scores has two advantages over centipawns:
1) it eliminates the need for artificial thresholds or cut-offs. There's virtually no difference whether the evaluation swings from 0.12 to 2.96 or to 9.05; it is a lost position in either case. Measured in centipawns, however, that difference may have an unwanted and distorting effect on results in relatively small datasets.
2) a high evaluation, like the difficulty of a position, affects the accuracy of play. It does so in two different ways: a) the scaling effect - higher evaluations are accompanied by larger eval gaps between move choices; b) human players' tendency to make desperate moves when behind in eval, and to seek easier and riskless paths to mate the opponent when ahead in eval. Using expected scores eliminates the former, the scaling effect, since the expected score is independent of the magnitude of the evaluation.
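
The first point can be illustrated with the numbers from the example. This is a sketch only: `expected_score` is an illustrative helper using the formula given earlier with a = 1.1, and the variable names are made up for the illustration.

```python
import math


def expected_score(pawns, a=1.1):
    """Expected score for an evaluation given in pawns (formula from the post)."""
    return (1 + math.tanh(pawns / a)) / 2


before = 0.12                      # evaluation before the blunder
small_swing, large_swing = 2.96, 9.05

# Size of the swing measured in pawns: very different for the two cases.
cp_small = small_swing - before
cp_large = large_swing - before

# Size of the swing measured in expected score: almost identical.
es_small = expected_score(small_swing) - expected_score(before)
es_large = expected_score(large_swing) - expected_score(before)

print(f"pawn swing:           {cp_small:.2f} vs {cp_large:.2f}")
print(f"expected-score swing: {es_small:.3f} vs {es_large:.3f}")
```

In pawns the second swing is more than three times the first, while in expected score the two differ by less than half a percentage point, so no cut-off is needed to keep the larger one from dominating an average.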

This is not the first time I've attempted to compare engines and humans. Previously I've been criticized by users who have played against low-rated engines and concluded that these are actually weaker than the papers indicate. However, the graphs are only intended to demonstrate the relative playing strength of humans and engines under the assumption that humans play against engines without employing anti-computer strategy. It is impossible to take into account a hypothetical increase in strength due to anti-computer strategy, as we do not yet know what factors ultimately determine its efficiency.

It should be stressed that the 'error' on the Y axis of both graphs represents the average expected error, i.e. the estimated hypothetical accuracy of play if all entities involved had faced positions of the same difficulty. If the average difficulty of moves by an entity is lower than on average, then

What can we conclude from the results? Some people will certainly be surprised, because doesn't it seem most logical to assume that engines and humans share the same accuracy-strength relationship? To me it indeed seemed so, and when I first undertook such a comparison a couple of years ago and saw the final results, it greatly surprised me. It turns out that while humans experience diminishing rating gains per equivalent accuracy increase, engines are the other way round: increasing the accuracy of play leads to increasing rating gains (!). This does not mean that it is getting increasingly easier to make progress in computer chess, of course; that is another matter entirely.
In retrospect, that should not come as a complete surprise, because we know that engines and humans play chess very differently. The differences are as follows:
1) engines do not play pragmatically; they always strive for objectivity, whereas humans do play pragmatically.
2) engines have a broader search tree than humans. They include a wide array of move choices in their calculations; even absurd-looking ones get calculated a few plies deep. Humans, on the contrary, first look at the position to find potentially good moves, pick a few candidates and only then start calculating.
3) the most obvious difference lies in the fact that engines, unlike humans, rely almost purely on calculation; the relative importance of the evaluation function keeps diminishing with advances in hardware and search. This implies that changes in the difficulty of positions affect humans more and engines less.
I think we can dismiss the first one for now, as it is more about deliberate and conscious choices than about the fundamental nature of the move selection process.

Unfortunately, these differences still don't explain why the relationships are the way they are and not reversed, i.e. humans with increasing gains and engines with diminishing gains. And what about other skill-based board games? Will go, checkers, arimaa etc. display exactly the same phenomenon? These are intriguing questions and I think they are worth future research.

Here are tables making it easy to convert CCRL into FIDE and vice versa.

FIDE CCRL | CCRL FIDE
3100 3375 | 3600 3130
3000 2915 | 3500 3118
2900 2646 | 3400 3104
2800 2461 | 3300 3088
2700 2324 | 3200 3069
2600 2216 | 3100 3048
2500 2127 | 3000 3024
2400 2053 | 2900 2996
2300 1989 | 2800 2963
2200 1934 | 2700 2924
2100 1885 | 2600 2878
2000 1841 | 2500 2824
1900 1802 | 2400 2759
1800 1767 | 2300 2680
1700 1734 | 2200 2584
1600 1704 | 2100 2466
1500 1677 | 2000 2318
1400 1651 | 1900 2132
1300 1628 | 1800 1894
1200 1605 | 1700 1584
1100 1585 | 1600 1175
1000 1565 | 1500 624
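
For ratings that fall between table rows, a small lookup helper can be sketched as follows. Linear interpolation between neighbouring rows is an assumption made here for simplicity (the paper may use a fitted curve instead), and the name `ccrl_to_fide` is illustrative. The data are the CCRL to FIDE columns of the table above.

```python
from bisect import bisect_right

# (CCRL, FIDE) pairs copied from the table, in ascending CCRL order.
CCRL_TO_FIDE = [
    (1500, 624), (1600, 1175), (1700, 1584), (1800, 1894), (1900, 2132),
    (2000, 2318), (2100, 2466), (2200, 2584), (2300, 2680), (2400, 2759),
    (2500, 2824), (2600, 2878), (2700, 2924), (2800, 2963), (2900, 2996),
    (3000, 3024), (3100, 3048), (3200, 3069), (3300, 3088), (3400, 3104),
    (3500, 3118), (3600, 3130),
]


def ccrl_to_fide(ccrl):
    """Linearly interpolate a FIDE rating from the tabulated CCRL values."""
    xs = [c for c, _ in CCRL_TO_FIDE]
    if not xs[0] <= ccrl <= xs[-1]:
        raise ValueError("CCRL rating outside the tabulated range")
    # Index of the upper neighbour, clamped so the top row also works.
    i = min(bisect_right(xs, ccrl), len(xs) - 1)
    (x0, y0), (x1, y1) = CCRL_TO_FIDE[i - 1], CCRL_TO_FIDE[i]
    return y0 + (y1 - y0) * (ccrl - x0) / (x1 - x0)


print(round(ccrl_to_fide(3341)))
```

The helper refuses to extrapolate, since the table gives no information outside 1500-3600 CCRL and the curve is very steep at the low end.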

The hardware that is used in creating CCRL lists is outdated by today's standards, and time controls are ca 3x shorter. How well would Stockfish 7 (3341 CCRL) perform in terms of FIDE 2014, if we had the best hardware possible and standard time controls? A direct comparison of data shows that 3341 CCRL corresponds to 3095 FIDE. According to the PassMark website, the Athlon 64 X2 4600+ (2.4 GHz) has an Average CPU Mark of 1365, whereas the strongest CPU, the Intel Xeon E5-2698 v3 @ 2.30GHz, has 22309, a ratio of ca 16.3x. Together with the 3x longer time control, this amounts to a ca 49x advantage in search quantity. Log2 of 49 is 5.6 doublings. At that level each doubling is actually worth less than the conventionally used estimate of 50 Elo; user Kai Laskos has researched this and found that at TCEC level (faster than the example given previously) the gain per doubling is below 40 Elo. Hence, using 40 Elo per doubling, the end result is 5.6 x 40 + 3341 = 3565 CCRL, which corresponds to 3126 FIDE.

So the final conclusion I draw is that, provided humans do not use anti-computer strategy, Stockfish 7 on top hardware would perform at 3100-3150 against humans.
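
The arithmetic in this estimate can be retraced step by step. The sketch below assumes that the ca 49x figure combines the CPU Mark ratio with the 3x longer time control, which is how the numbers work out; all variable names are illustrative.

```python
import math

athlon_cpu_mark = 1365      # Athlon 64 X2 4600+, PassMark Average CPU Mark
xeon_cpu_mark = 22309       # Intel Xeon E5-2698 v3
time_control_factor = 3     # standard time controls are ca 3x longer
elo_per_doubling = 40       # estimate used in the post
stockfish7_ccrl = 3341

hardware_ratio = xeon_cpu_mark / athlon_cpu_mark
total_speedup = hardware_ratio * time_control_factor

# Rounded to one decimal, as in the post.
doublings = round(math.log2(total_speedup), 1)
projected_ccrl = stockfish7_ccrl + doublings * elo_per_doubling

print(f"hardware ratio: {hardware_ratio:.1f}x")
print(f"total speedup:  {total_speedup:.0f}x")
print(f"doublings:      {doublings}")
print(f"projected CCRL: {projected_ccrl:.0f}")
```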

Frank Quisinsky · Post by **Frank Quisinsky** » Tue Feb 23, 2016 1:34 am

That is realistic.
3150 should be right.

Years ago I conducted an interview with GM Meyer (2675 Elo at the time, no. 2 in Germany at the time). GM Meyer thought that Shredder 12 had no more than 2800 Elo. That is the reason I changed to 2800 Elo in my older SWCR Rating List. Today many others use that 2800 Elo for the Shredder engine, following the interview with GM Meyer (I did it with GM Hickl for Schachwelt).

The final result is that Stockfish has around 3175 in my current FCT Rating List.

CCRL ratings are clearly too high!

But the fact is that, compared to grandmasters, engines are much stronger in blitz games than in games with longer time controls. It is possible that Stockfish has 3275 Elo in extreme blitz vs. humans.

Best
Frank

PS: Another GM wrote me that he plays his games vs. Junior 13. Junior is around 150 Elo stronger ... the GM has 2630 Elo. This one is also right.

Mike S. · Post by **Mike S.** » Tue Feb 23, 2016 2:11 am

You fail plenty by NOT giving any significant information as for what your method is.

Mike S. · Post by **Mike S.** » Tue Feb 23, 2016 2:36 am

Frank Quisinsky wrote: That is realistic.
3150 should be right.

Please, Frank, what some run-of-the-mill GM said in an interview ages ago is supposed to provide a relevant basis for any estimate? Shredder 12? On what hardware...? Really, that is no reasonable basis for any number.

Frank Quisinsky · Post by **Frank Quisinsky** » Tue Feb 23, 2016 2:57 am

Hello Mike,

nice to read you.

A good idea is to read the interview.
You will be able to find all the information you need there.

I believe the older Schachwelt computer chess news I wrote is still online on the site today.

1 core on the hardware that was current at the time (Intel processors from before the first i7 came out ... Q series ... Q9550, Q9650). This was a private interview with GMs Hickl and Meyer at my home. It later appeared in the German newspaper Schachwelt, of course along with many other interesting things about chess. But my main topic is questions like these.

Again, many other GMs have the same opinion!
Rybka 4.0, for example, at around 2900 Elo.

Another GM gave me the information about Junior 13 ... as I wrote before. I think this one is realistic.

We cannot give exact information about it, but the information we have is good enough.

CEGT, IPON or others ... using Shredder at 2800 Elo should be more correct than CCRL.

Best
Frank

Graham Banks · Post by **Graham Banks** » Tue Feb 23, 2016 2:58 am

nimh wrote: The hardware that is used in creating CCRL lists is outdated by today's standards, and time controls are ca 3x shorter.

That is not the actual hardware that we use.
We just used that machine as our original benchmark to find the adapted time controls to use on our computers.

For example, Nathanael's overclocked i7 Haswell quad uses 40/15, whereas an older Q6600 uses 40/32.
CCRL 40/40 equates to around 40/18 on modern hardware.

Hope that explains things.
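
The adapted time controls quoted above can be read as speed factors relative to the benchmark machine. The sketch below assumes that thinking time scales inversely with machine speed, which is a simplification and not a description of how CCRL actually calibrates its testers; the function names are illustrative.

```python
def implied_speed_factor(benchmark_minutes, adapted_minutes):
    """How much faster a machine is than the benchmark, given its adapted time control."""
    return benchmark_minutes / adapted_minutes


def adapted_minutes(benchmark_minutes, speed_factor):
    """Minutes per 40 moves that give a faster machine the same search effort."""
    return benchmark_minutes / speed_factor


BENCHMARK = 40  # CCRL 40/40 on the benchmark machine

# Adapted time controls mentioned in the post (minutes per 40 moves).
for name, minutes in [("overclocked i7 Haswell quad", 15),
                      ("Q6600", 32),
                      ("typical modern hardware", 18)]:
    factor = implied_speed_factor(BENCHMARK, minutes)
    print(f"{name}: 40/{minutes} -> about {factor:.2f}x the benchmark")
```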

Frank Quisinsky · Post by **Frank Quisinsky** » Tue Feb 23, 2016 3:21 am

Hi Graham,

I think on an i7 Haswell, 40 in 13/14 is more correct, if I compare my 40/10 Haswell results with the CCRL ratings at 40 in 40.

Best
Frank

But one minute more or less ...
Not important.

drj4759 · Post by **drj4759** » Tue Feb 23, 2016 8:36 am

The primary reason why CCRL and CEGT have seemingly bloated Elo rating lists, reaching as high as 3400+ Elo, is that their chess engine tournaments distinguish entries by the number of CPU cores. For example, Stockfish 7 has versions named Stockfish 7 x64 12CPU, Stockfish 7 x64 8CPU, Stockfish 7 x64 4CPU, etc. The higher the number of CPUs, the higher the Elo rating will be.

When the CPU core counts are stripped, the highest Elo rating goes down to 3200+. The actual information can be viewed at chessowl.blogspot.com under the Combined Rating List section.

Shredder 12 x64 with Elo 2800 was used as the anchor in all the different rating list presentations. The CCRL and CEGT rating lists are correct as far as the publishers are concerned. They will answer a specific question like "Will chess engines have a higher Elo rating when played with more CPU cores?" As the data show, the answer is yes! But for questions like "What is the normal Elo rating of Stockfish 7 (for example)?", the CCRL and CEGT rating lists will not apply. It is better to look for a generic rating list site that does not include more than 2 CPU cores in the tournament.

Laskos · Post by **Laskos** » Tue Feb 23, 2016 8:38 am

I fitted the points with FIDE = a + b/(CCRL+c)^3, the fit is almost perfect, R^2 being 0.99999990. The fit and the points given are here:

At a glance, I don't much like the tails, but I will take a look later in the day.
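
A fit of this functional form can be reproduced approximately with the standard library alone. This is a sketch under stated assumptions: the post does not say which points were fitted or with what tool, so the CCRL to FIDE columns of nimh's table are used here, and the method (a grid search over c with an ordinary least-squares solve for a and b at each step) is a stand-in, not necessarily what was actually done.

```python
# CCRL -> FIDE points from nimh's table (3600 down to 1500 CCRL).
CCRL = list(range(3600, 1400, -100))
FIDE = [3130, 3118, 3104, 3088, 3069, 3048, 3024, 2996, 2963, 2924, 2878,
        2824, 2759, 2680, 2584, 2466, 2318, 2132, 1894, 1584, 1175, 624]


def fit_for_c(c):
    """For a fixed c, FIDE = a + b * u with u = 1/(CCRL + c)^3 is linear in a, b."""
    us = [1.0 / (x + c) ** 3 for x in CCRL]
    n = len(us)
    mu, my = sum(us) / n, sum(FIDE) / n
    suu = sum((u - mu) ** 2 for u in us)
    suy = sum((u - mu) * (y - my) for u, y in zip(us, FIDE))
    b = suy / suu
    a = my - b * mu
    sse = sum((a + b * u - y) ** 2 for u, y in zip(us, FIDE))
    return sse, a, b, c


# Grid over c; the lower bound keeps CCRL + c positive for every data point.
sse, a, b, c = min(fit_for_c(c) for c in range(-1400, 1001))

mean_fide = sum(FIDE) / len(FIDE)
sst = sum((y - mean_fide) ** 2 for y in FIDE)
r2 = 1 - sse / sst


def predict(ccrl):
    return a + b / (ccrl + c) ** 3


print(f"a = {a:.1f}, b = {b:.4g}, c = {c}, R^2 = {r2:.8f}")
```

Because the grid for c is coarse (steps of 1), the resulting R^2 will not necessarily reach the value quoted in the post; a proper nonlinear optimizer would refine all three parameters jointly.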

nimh · Post by **nimh** » Tue Feb 23, 2016 9:22 am

Mike S. wrote:You fail plenty by NOT giving any significant information as for what your method is.

Harsh words... but as I mentioned, this is just a preliminary overview, I'll explain more thoroughly in my upcoming paper. You can feel free to ask anything, I'll answer gladly.
