Questions regarding rating systems of humans and engines

Discussion of anything and everything relating to chess playing software and machines.

Moderators: hgm, Rebel, chrisw

nimh
Posts: 46
Joined: Sun Nov 30, 2014 12:06 am

Re: Questions regarding rating systems of humans and engines

Post by nimh »

I'm not sure what football results tell us about chess, but a hundred elo represents quite a large win to loss ratio in chess when you are talking about grandmasters. Anyway it doesn't matter what causes the zigzag line for human play; unless you have enough data to make the connecting line look somewhat like a line or a curve, it's pretty hard to guess what its real shape would be.
It tells us that even clubs considered more-or-less equal can have large rating differences. You can see teams and players rated more than 100 Elo lower beat their stronger counterparts quite often.

Removing the effect of the instability of human play would require a large number of moves, probably well over 1000, which is impractical for me: I write results into spreadsheets by hand.

In 2009 I produced the following study, where the Elo-vs-accuracy graph had a 200-Elo difference between cohorts. There were no zigzags.

http://www.chessanalysis.ee/summary450.pdf

My next study will have 150-Elo gaps, spanning 2800-1750.

You got wrong cut-offs and asymptotic behavior with centipawns. All three of Houdini, Komodo and Stockfish obey pretty much the same logistic curve when transforming centipawns (cp) to expected score:


Code:
p=cp/100
a=1.1 normalization factor

ExpectedScore = {1 + (Exp[p/a] - Exp[-p/a])/(Exp[p/a] + Exp[-p/a])}/2


You would get more reliable fit using this approximation with correct asymptotic behavior.
Thank you for the formula; I had no idea this actually existed. I'm going to use it alongside the good old centipawn method and compare them. If the expected-score method really is superior, then the coefficient of determination of the rating-vs-accuracy trend lines should be bigger than in the centipawn case.
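For reference, the quoted mapping is straightforward to implement. A minimal Python sketch, using the identity (Exp[x] - Exp[-x])/(Exp[x] + Exp[-x]) = tanh(x):

```python
import math

def expected_score(cp, a=1.1):
    """Map a centipawn evaluation to an expected score in [0, 1].

    Same as the quoted formula, since
    (Exp[x] - Exp[-x])/(Exp[x] + Exp[-x]) = tanh(x).
    a is the empirical normalization factor (about 1.1-1.3).
    """
    p = cp / 100.0                          # convert centipawns to pawns
    return (1.0 + math.tanh(p / a)) / 2.0
```

An eval of 0.00 maps to 0.5, and the scores for +cp and -cp always sum to 1, as a symmetric model should.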
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Questions regarding rating systems of humans and engines

Post by Laskos »

nimh wrote:
You got wrong cut-offs and asymptotic behavior with centipawns. All three of Houdini, Komodo and Stockfish obey pretty much the same logistic curve when transforming centipawns (cp) to expected score:

Code:

p=cp/100 
a=1.1  normalization factor 

ExpectedScore = {1 + (Exp[p/a] - Exp[-p/a])/(Exp[p/a] + Exp[-p/a])}/2 
 

You would get more reliable fit using this approximation with correct asymptotic behavior.
Thank you for the formula; I had no idea this actually existed. I'm going to use it alongside the good old centipawn method and compare them. If the expected-score method really is superior, then the coefficient of determination of the rating-vs-accuracy trend lines should be bigger than in the centipawn case.
This ExpectedScore is valid in games against an equal opponent at blitz time control. I don't know how the empirical factor a would behave at longer time controls, but it shouldn't shift much. Generally, with the top three engines, a is between 1.1 and 1.3, so using a=1.2 is maybe safer (instead of the a=1.1 I wrote earlier).
nimh
Posts: 46
Joined: Sun Nov 30, 2014 12:06 am

Re: Questions regarding rating systems of humans and engines

Post by nimh »

I could easily try many different values of a to see which one gives the best fit to the curve. Theoretically there could even be different values of a for the CCRL and FIDE game sets, but I'm not sure comparisons would be valid then.
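Trying many values of a is easy to automate. A brute-force sketch in Python; `samples` and the candidate grid below are made-up placeholders, not data from the study:

```python
import math

def expected_score(cp, a):
    """Centipawn eval -> expected score, as in the quoted formula."""
    return (1.0 + math.tanh(cp / 100.0 / a)) / 2.0

def best_a(samples, candidates):
    """Return the candidate a minimizing the squared error between
    predicted expected scores and observed game results."""
    def sse(a):
        return sum((expected_score(cp, a) - s) ** 2 for cp, s in samples)
    return min(candidates, key=sse)

# toy illustration: (centipawn eval, observed score) pairs
samples = [(0, 0.5), (100, 1.0), (-100, 0.0), (50, 0.5), (200, 1.0)]
candidates = [1.0 + 0.05 * i for i in range(9)]   # a = 1.00 ... 1.40
a_fit = best_a(samples, candidates)
```

The same idea works per game set, which would indeed produce separate fitted values of a for CCRL and FIDE data.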
User avatar
Ajedrecista
Posts: 1971
Joined: Wed Jul 13, 2011 9:04 pm
Location: Madrid, Spain.

Questions regarding rating systems of humans and engines.

Post by Ajedrecista »

Hello Kai:
Laskos wrote:
nimh wrote:
You got wrong cut-offs and asymptotic behavior with centipawns. All three of Houdini, Komodo and Stockfish obey pretty much the same logistic curve when transforming centipawns (cp) to expected score:

Code:

p=cp/100 
a=1.1  normalization factor 

ExpectedScore = {1 + (Exp[p/a] - Exp[-p/a])/(Exp[p/a] + Exp[-p/a])}/2 
 

You would get more reliable fit using this approximation with correct asymptotic behavior.
Thank you for the formula; I had no idea this actually existed. I'm going to use it alongside the good old centipawn method and compare them. If the expected-score method really is superior, then the coefficient of determination of the rating-vs-accuracy trend lines should be bigger than in the centipawn case.
This ExpectedScore is valid in games against an equal opponent at blitz time control. I don't know how the empirical factor a would behave at longer time controls, but it shouldn't shift much. Generally, with the top three engines, a is between 1.1 and 1.3, so using a=1.2 is maybe safer (instead of the a=1.1 I wrote earlier).
I read your post today. Just writing your formula in a more compact way:

Code:

µ = [1 + tanh(p/a)]/2
Working a little with your formula:

Code:

µ_a = [1 + tanh(p/a)]/2
µ_b = [1 + tanh(p/b)]/2

With a > 0, b > 0 and a ≠ b (i.e. |a - b| > 0).

------------------------

µ_a - µ_b = [tanh(p/a) - tanh(p/b)]/2 =
= (with the help of Derive 6) =
= [exp(2p/a) - exp(2p/b)]/{[1 + exp(2p/a)]·[1 + exp(2p/b)]}

------------------------

Where is the maximum of |µ_a - µ_b|? How much is this value?

I set the derivatives equal:

[d(µ_a)]/dp = 1/{2a·[cosh(p/a)]²}
[d(µ_b)]/dp = 1/{2b·[cosh(p/b)]²}

Hence:

a·[cosh(p/a)]² = b·[cosh(p/b)]²

(I will take absolute values):

[sqrt(a)]·cosh(p/a) = [sqrt(b)]·cosh(p/b)
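This transcendental equation has no closed form, but the positive root is easy to find numerically. A minimal bisection sketch in Python (assuming 0 < a < b, so the left side starts smaller at p = 0 and eventually overtakes the right side):

```python
import math

def crossover_p(a, b, lo=0.0, hi=10.0, tol=1e-9):
    """Solve sqrt(a)*cosh(p/a) = sqrt(b)*cosh(p/b) for p > 0.

    With 0 < a < b, f(0) = sqrt(a) - sqrt(b) < 0 while cosh(p/a)
    grows faster than cosh(p/b), so bisection on [lo, hi] brackets
    the single positive root.
    """
    f = lambda p: (math.sqrt(a) * math.cosh(p / a)
                   - math.sqrt(b) * math.cosh(p / b))
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if f(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2.0
```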
I used Derive 6 again for the special cases of (a, b) = {(1.1, 1.3), (1.1, 1.2), (1.2, 1.3)}:

Code:

a = 1.1, b = 1.3: |p| ~ 0.921312 (~ 92 cp); |µ_a - µ_b| ~ 0.037325 ~ 3.73%
a = 1.1, b = 1.2: |p| ~ 0.886225 (~ 89 cp); |µ_a - µ_b| ~ 0.035376 ~ 3.54%
a = 1.2, b = 1.3: |p| ~ 0.963494 (~ 96 cp); |µ_a - µ_b| ~ 0.035008 ~ 3.50%
I did these calculations just to give a rough estimate of some differences when choosing the parameter a. I hope there are no typos. Any correction is welcome.

Regards from Spain.

Ajedrecista.
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Questions regarding rating systems of humans and engines

Post by Laskos »

Ajedrecista wrote:
[...]

Code:

a = 1.1, b = 1.3: |p| ~ 0.921312 (~ 92 cp); |µ_a - µ_b| ~ 0.037325 ~ 3.73%
a = 1.1, b = 1.2: |p| ~ 0.886225 (~ 89 cp); |µ_a - µ_b| ~ 0.035376 ~ 3.54%
a = 1.2, b = 1.3: |p| ~ 0.963494 (~ 96 cp); |µ_a - µ_b| ~ 0.035008 ~ 3.50%
I did these calculations just to give a rough estimate of some differences when choosing the parameter a. I hope there are no typos. Any correction is welcome.
Thanks Jesus for computing the maximal differences for reasonable values of a. It seems we differ on the last two values; mine are 1.95% and 1.79%, and they add up pretty accurately to the first one (as intuitively one would expect).

I have tentatively computed the values of a for the top three engines in blitz games (equal strength, self-play). For Komodo 8, a ~ 1.1; for Stockfish and Houdini 4, a ~ 1.2, maybe a bit higher for Stockfish. I took late-opening and middlegame positions; I'm not sure whether endgames preserve these values of a, but they shouldn't be far off.
User avatar
Ajedrecista
Posts: 1971
Joined: Wed Jul 13, 2011 9:04 pm
Location: Madrid, Spain.

Questions regarding rating systems of humans and engines.

Post by Ajedrecista »

Hello again:
Laskos wrote:It seems we differ for the last two values, mine are 1.95% and 1.79%, and they add up pretty accurately to the first one (as intuitively one would expect).
I did it again and obtained your values. The values for |p| were good, but only God knows what I did wrong yesterday (I surely picked the wrong formula). I now get (rounded to 1e-6) 0.019469 ~ 1.95% and 0.017911 ~ 1.79%, respectively... if I am not wrong again.

Thanks for your warning. I must admit that I was a bit confused yesterday by the similar values of the three cases, where as a first approximation I expected the (1.1, 1.3) case to be the sum of the other two, as you already explained.
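These corrected figures are easy to cross-check with a brute-force scan over p; a small Python sketch:

```python
import math

def mu(p, a):
    """Expected score as a function of the eval p (in pawns)."""
    return (1.0 + math.tanh(p / a)) / 2.0

def max_gap(a, b, step=1e-4, upper=3.0):
    """Scan p in (0, upper] and return (argmax, max) of |mu_a - mu_b|."""
    best_p, best_gap = 0.0, 0.0
    steps = int(upper / step)
    for i in range(1, steps + 1):
        p = i * step
        gap = abs(mu(p, a) - mu(p, b))
        if gap > best_gap:
            best_p, best_gap = p, gap
    return best_p, best_gap
```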

Regards from Spain.

Ajedrecista.
nimh
Posts: 46
Joined: Sun Nov 30, 2014 12:06 am

Re: Questions regarding rating systems of humans and engines

Post by nimh »

I have now finished analyzing Ziggurat 0.22, rated 1746 - the weakest on the CCRL 40/40 list - and FIDE 1725-1775-rated (avg 1750) humans.

The average errors are as follows:
centipawns: Ziggurat 0.22 - 0.256; FIDE 1750 - 0.299;
expected scores: Ziggurat 0.22 - 6.08%; FIDE 1750 - 6.00%.

They have quite similar raw move accuracy, but the engine faced more difficult positions, confirming that at the 1700 level CCRL play is of higher quality, although the gap appears to be much smaller than my previous analysis showed.

This paper demonstrates how the average errors rise with increasing difficulty, according to the three difficulty factors and the absolute evaluation.

http://www.chessanalysis.ee/ziggurat%20 ... 201750.pdf

As expected, the engine's play is relatively less affected by the difficulty of positions. It's also noteworthy how expected scores are virtually immune to the evaluation of the best move.
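The two error metrics above (average error in pawns and in expected score) can be computed per move roughly as follows; a hypothetical sketch with a = 1.2 assumed, not the actual spreadsheet procedure used in the study:

```python
import math

def expected_score(cp, a=1.2):
    """Centipawn eval -> expected score (a = 1.2 assumed here)."""
    return (1.0 + math.tanh(cp / 100.0 / a)) / 2.0

def move_errors(evals_best, evals_played, a=1.2):
    """Per-move errors in both metrics.

    evals_best / evals_played: centipawn evals of the engine's best
    move and of the move actually played, from the mover's viewpoint.
    Returns (errors in pawns, errors in expected score).
    """
    cp_err = [(b - m) / 100.0 for b, m in zip(evals_best, evals_played)]
    es_err = [expected_score(b, a) - expected_score(m, a)
              for b, m in zip(evals_best, evals_played)]
    return cp_err, es_err
```

Averaging each list over all analyzed moves gives the two figures quoted above. Because the tanh curve flattens at large |cp|, a blunder in an already won or lost position costs almost nothing in expected score, which is why that metric is nearly immune to the evaluation of the best move.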