Questions regarding rating systems of humans and engines

nimh
Posts: 46
Joined: Sun Nov 30, 2014 12:06 am

Post by nimh »

A few weeks ago I finished and uploaded this paper:

http://www.chessanalysis.ee/Quality%20o ... suring.pdf

Relevant to this forum is section 4.2 on page 16, which offers a relation between FIDE and CCRL ratings.

Below are comparison tables.
  CCRL 40/40   FIDE 2008 (40/90+30)
  3400         2926
  3300         2921
  3200         2916
  3100         2911
  3000         2904
  2900         2897
  2800         2888
  2700         2877
  2600         2864
  2500         2849
  2400         2830
  2300         2805
  2200         2774
  2100         2734
  2000         2679
  1900         2603
  1800         2493
  1700         2325
  1600         2054

  FIDE 2008 (40/90+30)   CCRL 40/40
  2900                   2941
  2850                   2507
  2800                   2281
  2750                   2136
  2700                   2034
  2650                   1957
  2600                   1896
  2550                   1847
  2500                   1805
  2450                   1770
  2400                   1739
  2350                   1712
  2300                   1688
  2250                   1667
  2200                   1647
  2150                   1630
  2100                   1614
  2050                   1599
  2000                   1586

It appears that the rating-accuracy relationships in the two systems are polar opposites. Going towards the bottom of the rating scale, human accuracy decreases logarithmically, but engine accuracy decreases exponentially. These facts have the following implications:

a) it is impossible to have a negative rating on engine rating lists, but easy to achieve in human rating systems

b) rating progress for humans is extremely difficult, but in the world of computer chess we still can expect engines to improve by hundreds of rating points in the near future.

Hence there's no simple linear relationship between the two very different systems. At the bottom, human rating rises at a quicker pace than engine rating, whereas at the top the process is reversed: going from 2400 to 3400 on the CCRL scale is equivalent to merely about 100 FIDE rating points (2830 to 2926 in the first table).
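
For anyone who wants to check the arithmetic, here is a minimal sketch that looks up FIDE equivalents from the first table. The straight-line interpolation between table rows and the function names `ccrl_to_fide` and `fide_gain` are assumptions of this sketch, not something taken from the paper.

```python
from bisect import bisect_right

# (CCRL 40/40, FIDE 2008) pairs copied from the first table above.
CCRL_TO_FIDE = [
    (1600, 2054), (1700, 2325), (1800, 2493), (1900, 2603), (2000, 2679),
    (2100, 2734), (2200, 2774), (2300, 2805), (2400, 2830), (2500, 2849),
    (2600, 2864), (2700, 2877), (2800, 2888), (2900, 2897), (3000, 2904),
    (3100, 2911), (3200, 2916), (3300, 2921), (3400, 2926),
]


def ccrl_to_fide(ccrl):
    """FIDE equivalent of a CCRL rating, interpolating linearly between rows.

    Linear interpolation is an assumption made for illustration only.
    """
    xs = [c for c, _ in CCRL_TO_FIDE]
    if not xs[0] <= ccrl <= xs[-1]:
        raise ValueError("rating outside the tabulated range")
    # Index of the upper row of the bracketing pair, clamped at the last row.
    i = min(bisect_right(xs, ccrl), len(xs) - 1)
    (x0, y0), (x1, y1) = CCRL_TO_FIDE[i - 1], CCRL_TO_FIDE[i]
    return y0 + (y1 - y0) * (ccrl - x0) / (x1 - x0)


def fide_gain(ccrl_from, ccrl_to):
    """FIDE-equivalent points gained when moving between two CCRL ratings."""
    return ccrl_to_fide(ccrl_to) - ccrl_to_fide(ccrl_from)


print(fide_gain(2400, 3400))  # 96.0
print(fide_gain(1600, 1800))  # 439.0
```

So 1000 CCRL points at the top are worth 96 FIDE points according to the table, while 200 CCRL points at the bottom are worth 439.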

So I have the following questions:

Did you know or at least suspect that the strength-accuracy relationships of engines and humans are of an opposite nature?

What are the exact reasons for this phenomenon?
carldaman
Posts: 2283
Joined: Sat Jun 02, 2012 2:13 am

Post by carldaman »

Hi Erik,

I think the main reason has a lot to do with the engines' strength being primarily tactical, whereas humans are typically positionally-oriented and far weaker tactically.

The best way for human players to try to close the gap with engines is by adopting effective anti-engine positional strategies (i.e. steering the game into closed positions with some attacking potential, such as the King's Indian, Stonewall, etc). Battling the engines tactically will backfire even for the best human players.

Basically, engines can gain lots and lots of rating points against each other, but these don't translate into much of a gain vs humans since the nature of their strength is tactical, and therefore already far superior to humans. Further Elo gains for *top* engines thus mean little relative to human strength.

Regards,
CL
nimh
Posts: 46
Joined: Sun Nov 30, 2014 12:06 am

Post by nimh »

Thanks for the answer, but what you said is common knowledge among engine enthusiasts. :)

My curiosity is more about why differences in style and move selection cause such diametrically opposite behaviour of the accuracy-strength relationship.

Also, why do lower-rated engines gain a lot against humans, but little against other engines? For example, if your engine rises from 1600 to 1800, its strength against humans, provided they don't use anti-computer strategies, rises by more than 400 Elo.
nimh
Posts: 46
Joined: Sun Nov 30, 2014 12:06 am

Post by nimh »

It should be noted that a random move generator has a rating of 209 on the CCRL 40/4 list.

http://www.computerchess.org.uk/ccrl/40 ... t_all.html

I wonder what the rating difference between Brutus RND and a top engine that's set to always make the worst move would be? :)
Ferdy
Posts: 4833
Joined: Sun Aug 10, 2008 3:15 pm
Location: Philippines

Post by Ferdy »

Thanks for sharing. Lots of interesting stuff there.
Still reading ...
Janne Kokkala
Posts: 2
Joined: Sun Nov 23, 2014 12:46 am

Post by Janne Kokkala »

A couple of questions/comments:

- How did you choose to fit a logarithmic curve to the data in graph 20? To me there does not seem to be enough data to show that the relationship is indeed logarithmic and not, for example, linear.

- If we take two chess-playing entities of equal strength (i.e. they have the same expected score against a set of opponents close to their level), differences in their playing style may result in very different values of the accuracy metric as defined in this paper. In particular, comparing the play of a <2000 human to a <2000 engine using the accuracy metric may be very far from actually predicting the outcome of a game between them. There is a competition on Kaggle (Predict a chess player's FIDE Elo rating from one game) which might give other interesting metrics to try, but we'll have to wait until March before the solutions can be freely published and discussed.
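
To spell out what "equal strength" means here, a small sketch using the standard Elo logistic expectation. The opponent pool and the function names are hypothetical and only serve as illustration.

```python
def expected_score(rating, opponent):
    """Standard Elo logistic expectation for `rating` playing `opponent`."""
    return 1.0 / (1.0 + 10 ** ((opponent - rating) / 400.0))


def mean_expected_score(rating, pool):
    """Average expected score against every opponent in `pool`."""
    return sum(expected_score(rating, r) for r in pool) / len(pool)


# Hypothetical pool of opponents close to the 2000 level.
pool = [1900, 1950, 2000, 2050, 2100]

print(expected_score(2000, 2000))       # 0.5
print(mean_expected_score(2000, pool))  # about 0.5, the pool is symmetric
```

Two entities with the same rating get the same number out of this by construction, whether they are human or engine. Nothing in the formula says how accurate their individual moves are, which is why the accuracy metric and the rating can come apart.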
nimh
Posts: 46
Joined: Sun Nov 30, 2014 12:06 am

Post by nimh »

In statistics there is a measure called the coefficient of determination (R²), which indicates how well a model fits the actual data. More information is here: http://en.wikipedia.org/wiki/Coefficien ... ermination

Here are the four possible trend lines and their R² values:

linear 0.84
logarithmic 0.86
exponential 0.77
power 0.82
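
For readers who want to try this kind of comparison themselves, a minimal sketch follows. It fits the four trend-line families by least squares and reports R² in the original y units. The data are synthetic (generated from a logarithmic curve), not the data from the paper, and the helper names are made up for this example. Spreadsheet tools may compute R² for the exponential and power fits on log-transformed data instead, which would give different numbers.

```python
import math


def linear_fit(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx
    return my - b * mx, b


def r_squared(ys, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    my = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predicted))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


def trendline_r2(xs, ys):
    """R^2 for four trend-line families; requires all x > 0 and y > 0."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    results = {}
    a, b = linear_fit(xs, ys)            # y = a + b*x
    results["linear"] = r_squared(ys, [a + b * x for x in xs])
    a, b = linear_fit(lx, ys)            # y = a + b*ln(x)
    results["logarithmic"] = r_squared(ys, [a + b * u for u in lx])
    a, b = linear_fit(xs, ly)            # ln(y) = a + b*x
    results["exponential"] = r_squared(ys, [math.exp(a + b * x) for x in xs])
    a, b = linear_fit(lx, ly)            # ln(y) = a + b*ln(x)
    results["power"] = r_squared(ys, [math.exp(a + b * u) for u in lx])
    return results


# Synthetic data generated from y = 2 + 3*ln(x). NOT data from the paper.
xs = [1, 2, 4, 8, 16, 32]
ys = [2 + 3 * math.log(x) for x in xs]

results = trendline_r2(xs, ys)
for name, r2 in results.items():
    print(f"{name:12s} {r2:.3f}")
```

On this synthetic input the logarithmic model fits essentially perfectly and the other three fit worse, as expected. With real, noisy data and few points the four values can be close together, as in the list above.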

In 2009 I uploaded this paper:
http://www.chessanalysis.ee/summary450.pdf
Here also the logarithmic curve fits the best.

Kaggle is an interesting project, but don't expect any breakthroughs: human play is far too unstable and erratic to be accurately described by one game only. In my opinion at least 400-500 moves are needed to get satisfactory results. At best, it may yield interesting methods to describe the difficulty of positions and eliminate the objectivity-practicality bias.
carldaman
Posts: 2283
Joined: Sat Jun 02, 2012 2:13 am

Post by carldaman »

nimh wrote: Thanks for the answer, but what you said is common knowledge among engine enthusiasts. :)

My curiosity is more about why differences in style and move selection cause such diametrically opposite behaviour of the accuracy-strength relationship.

Also, why do lower-rated engines gain a lot against humans, but little against other engines? For example, if your engine rises from 1600 to 1800, its strength against humans, provided they don't use anti-computer strategies, rises by more than 400 Elo.
I think the short, oversimplified answer is that it has something to do with diminishing returns (vs humans) from engine Elo gains at the top of the scale, whereas the curve is flipped/inverted at the lower end.

CL
lkaufman
Posts: 5960
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA

Post by lkaufman »

nimh wrote:
Did you know or at least suspect that the strength-accuracy relationships of engines and humans are of an opposite nature?

I have for many years claimed that engine vs engine rating differences overstate what their differences would be against humans, but my studies indicate that the proper contraction factor is about 25%, i.e. that a 100 elo difference on an engine list means about a 75 elo difference on a human list. The notion that 100 engine points would equate to only five elo points on a human list is completely ridiculous and absurd. When I have the time I'll review the paper to see what is wrong with their methodology, but it is clearly very faulty.
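
To make the size of the disagreement concrete, the arithmetic can be sketched as follows. The function names `human_gap` and `implied_contraction` are made up for this illustration, and the 5-point figure is simply the difference between the 3300 and 3400 rows of the quoted CCRL-to-FIDE table.

```python
def human_gap(engine_gap, contraction=0.25):
    """Shrink an engine-list Elo gap by a contraction factor (0.25 means 25%)."""
    return engine_gap * (1.0 - contraction)


def implied_contraction(engine_gap, human_equivalent_gap):
    """Contraction factor that would turn `engine_gap` into `human_equivalent_gap`."""
    return 1.0 - human_equivalent_gap / engine_gap


# Top of the quoted table: CCRL 3300 -> 3400 maps to FIDE 2921 -> 2926.
table_gap = 2926 - 2921

print(human_gap(100))                       # 75.0
print(table_gap)                            # 5
print(implied_contraction(100, table_gap))  # roughly 0.95
```

In other words, a 25% contraction turns 100 engine points into 75 human points, whereas the table implies a contraction of roughly 95% at the top of the scale.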

What are the exact reasons for this phenomenon?
Komodo rules!
nimh
Posts: 46
Joined: Sun Nov 30, 2014 12:06 am

Post by nimh »

Perhaps against humans who employ anti-computer strategies the gains would be much larger than indicated in that study, but unfortunately I know of no method to take that into account. That may explain the differences in our conclusions.

Could I have a link to your studies?

How can a method be faulty if it compares the relationship between ratings and accuracy in both systems, and the accuracy of play is determined rigorously in the same manner for all games?