This paper offers a brief demonstration of the relationship between the quality of play and ratings of both human and computer rating systems using Komodo 8. In a few months I'll upload a longer overview where I'll look at further differences between the way engines and humans play chess. Instead of more usual centipawns, upon urging by users Kai Laskos and Larry Kaufman, I've transformed Komodo 8 evaluations into expected scores using the logistic function.
Code: Select all
p=cp/100
a=1.1 normalization factor
ExpectedScore = {1 + (Exp[p/a] - Exp[-p/a])/(Exp[p/a] + Exp[-p/a])}/2
cp exp score
0.00 50.00%
0.50 69.71%
1.00 84.11%
1.50 92.41%
2.00 96.56%
2.50 98.47%
3.00 99.33%
3.50 99.71%
4.00 99.87%
4.50 99.94%
5.00 99.98%
As you can see, using expected scores has two advantages over centipawns:
1) it eliminates the need for artifical thresholds or cut-offs. There's virtually no difference whether the evaluation swings from 0.12 to 2.96, or to 9.05; it is a lost position in either case. But it may have an unwanted and distorting effect on results in relatively small datasets.
2) high evaluation, like the difficulty of positions affects the accuracy of play. There are two different ways: a) the scaling effect - higher evaluations are accompanied by larger eval gaps between move choices; b) human players' tendency to make desperate moves when behind in eval, and to seek easier and riskless paths to mate the opponent when ahead in eval. Using expected scores eliminates the former - the scaling effect since it is independent of evaluation numbers.
This is not the first time I've attempted to compare engines and humans. Previously I've been subject to criticism by users who have played against low-rated engines and concluded that these are actually weaker than indicated by papers. However, the graphs are only intended to demonstrate the relative playing strength of humans and engines under assumption that humans play against engines without employing anti-computer strategy. It is impossible to take into account hypothetical increase in strength due to anti-computer strategy, as we do not know yet what factors ultimately determine its efficiency.
It should be stressed that the 'error' on both graphs on the Y axis represents the average expected error, i. e. the estimated hypothetical accuracy of play in case all entities involved had the same difficulty of positions. If the average difficulty of moves by an entity is lower than on average, then
What can we conclude from the results? Some people will certainly be surprised, because doesn't it seem most logical to assume that engines and humans share the same accuracy-strength relationship? To me it indeed seemed so and when I first undertook such comparison and saw the final results couple of years ago, it greatly surprised me. It turns out that while humans experience diminishing rating gains per equivalent accuracy increase, engines are the other way round: adding the accuracy of play leads to increasing rating gains (!). Not that it is increasingly easier to make progress in computer chess, of course, it is a completely another matter.
In retrospect, that should not entirely come as a complete surprise, because we know that engines and humans play chess very differently. These differences are following:
1) engines do not play pragmatically, they always strive for objectivity; but humans do.
2) engines have a broader search-tree than humans. They include in calculations a wide array of move choices. Even absurd-looking ones get calculated a few plies deep. Humans, on the contrary first look at the position to find potentially good moves, pick few candidates and then start calculating.
3) the most obvious difference lies in the fact that engines, unlike humans, almost purely rely on calculations, the relative importance of evaluation function is ever-diminishing by advances in hardware and search function. It implies that changes in the level of the difficulty of positions affects humans more, but engines less.
I think we can dismiss the first one for now, as it is more about deliberate and concscious choices than the fundamental nature of move selecting processes.
Unfortunately, they still don't explain the causes why the relationships are just like that and not reversed, i. e. humans increasing gains and engines diminishing gains. And what about other skill-based board games? Will go, checkers, arimaa etc discplay the exactly same phenomenon? These are intriguing questions and I think it's worth to make future research on it.
Here are tables making it easy to convert CCRL into FIDE and vice versa.
FIDE CCRL CCRL FIDE
3100 3375 3600 3130
3000 2915 3500 3118
2900 2646 3400 3104
2800 2461 3300 3088
2700 2324 3200 3069
2600 2216 3100 3048
2500 2127 3000 3024
2400 2053 2900 2996
2300 1989 2800 2963
2200 1934 2700 2924
2100 1885 2600 2878
2000 1841 2500 2824
1900 1802 2400 2759
1800 1767 2300 2680
1700 1734 2200 2584
1600 1704 2100 2466
1500 1677 2000 2318
1400 1651 1900 2132
1300 1628 1800 1894
1200 1605 1700 1584
1100 1585 1600 1175
1000 1565 1500 624
The hardware that is used in creating CCRL lists is outdated by todays's standards, and time controls are ca 3x shorter. How well would Stockfish 7 (3341 CCRL) perform in terms of FIDE 2014, if we had the best hardware possible and standard time controls? A direct comparison of data shows that 3341 CCRL corresponds to 3095 FIDE. According to the PassMark website, Athlon 64 X2 4600+ (2.4 GHz) has an Average CPU Mark of 1365, whereas the strongest one - Intel Xeon E5-2698 v3 @ 2.30GHz - has 22309. They altogether amount to ca 49x advantage in the search quantity. LOG2 of 49 is 5.6 doublings. At that level each doubling is actually worth less than a conventionally used estimate of 50 ELO; user Kai Laskos has done a reseach into this and found that at TCEC level (faster than the example given previously) the gain per doubling is below 40 ELO. Hence, using 40 ELO per doubling, the end result turns out to be 5.6 x 40 + 3341 = 3565, which equals to 3126 FIDE.
So the final conclusion I draw is that given humans do not use anti-computer strategy, Stockfish 7 on top hardware would perform 3100-3150 against humans.