Questions regarding rating systems of humans and engines

Discussion of anything and everything relating to chess playing software and machines.

Moderators: hgm, Rebel, chrisw

nimh
Posts: 46
Joined: Sun Nov 30, 2014 12:06 am

Re: Questions regarding rating systems of humans and engines

Post by nimh »

Pawel Koziol

Your theory of comfort zones makes sense.

Chess play can be divided into two parts:

1) what one can play according to one's taste, style, etc.
2) what one must play in order to maintain one's level

It has long been known that styles tend to blend at higher levels; the styles of top masters are very similar to one another. Nowadays it is impossible to play like Capablanca or Tal. The level of play has risen, and whoever wants to be successful must have a universal style.

As a consequence, zones of comfort and discomfort are much more polarized at the bottom tiers of the rating ladder. Weak players have spent less time studying the game, and their degree of specialization in certain aspects of the game is greater.


Carl Langan

Indeed, the premise of the comparison was that humans play against engines as they would against other humans, because I analyzed human vs human games.


To both of you - how about conducting a little research on the effect of anti-computer strategy against engines of various levels? :) I wonder whether weaker engines are more susceptible.


Larry Kaufman

Thanks for the support! :)

You said you'd use expected scores. But do you really have a good conversion formula? Surely you cannot derive it from centipawns only - sometimes drawn positions get high evals.

Which of these two do you think is more likely to be a win for White? :)

A: d5 2.06, d10 2.08, d15 2.07, d20 2.06, d25 2.08
B: d5 -1.43, d10 -0.60, d15 0.00, d20 0.32, d25 0.75

I think I have enough data points - each rating cohort had at least 450 positions. As I explained above, a logarithmic curve fits best.

As for the Naka-Stockfish match, 4 games is too short a format to yield reliable TPRs; plus they obviously had better hardware and longer time controls.

I'm currently running another analysis project with Komodo 8 and a new computer, with the aim of comparing human and engine play. This time I have expanded the rating ranges: 1745-3200 for CCRL and 1750-2800 for FIDE.
If a logarithmic curve turns out to fit best for the third time, then it cannot be just a coincidence.

Could you post some queen odds games between players rated 500-1000? I'd like to have a closer look. Perhaps you are underestimating their level of accuracy compared to that of 2000-rated players.
lkaufman
Posts: 5960
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA

Re: Questions regarding rating systems of humans and engines

Post by lkaufman »

nimh wrote:Pawel Koziol

Your theory of comfort zones makes sense.

Chess play can be divided into two parts:

1) what one can play according to one's taste, style, etc.
2) what one must play in order to maintain one's level

It has long been known that styles tend to blend at higher levels; the styles of top masters are very similar to one another. Nowadays it is impossible to play like Capablanca or Tal. The level of play has risen, and whoever wants to be successful must have a universal style.

As a consequence, zones of comfort and discomfort are much more polarized at the bottom tiers of the rating ladder. Weak players have spent less time studying the game, and their degree of specialization in certain aspects of the game is greater.


Carl Langan

Indeed, the premise of the comparison was that humans play against engines as they would against other humans, because I analyzed human vs human games.


To both of you - how about conducting a little research on the effect of anti-computer strategy against engines of various levels? :) I wonder whether weaker engines are more susceptible.


Larry Kaufman

Thanks for the support! :)

You said you'd use expected scores. But do you really have a good conversion formula? Surely you cannot derive it from centipawns only - sometimes drawn positions get high evals.

Which of these two do you think is more likely to be a win for White? :)

A: d5 2.06, d10 2.08, d15 2.07, d20 2.06, d25 2.08
B: d5 -1.43, d10 -0.60, d15 0.00, d20 0.32, d25 0.75

I think I have enough data points - each rating cohort had at least 450 positions. As I explained above, a logarithmic curve fits best.

As for the Naka-Stockfish match, 4 games is too short a format to yield reliable TPRs; plus they obviously had better hardware and longer time controls.

I'm currently running another analysis project with Komodo 8 and a new computer, with the aim of comparing human and engine play. This time I have expanded the rating ranges: 1745-3200 for CCRL and 1750-2800 for FIDE.
If a logarithmic curve turns out to fit best for the third time, then it cannot be just a coincidence.

Could you post some queen odds games between players rated 500-1000? I'd like to have a closer look. Perhaps you are underestimating their level of accuracy compared to that of 2000-rated players.
Of course, given more data than just the current eval, you can make a better estimate of the expected result. But given only the final eval of a search, the best approach is to map each score to a specific win probability, and in my opinion the logistic function is the best one to use for this purpose. Houdini actually claims to have mapped each reported score to a win probability; for Komodo and Stockfish this is not done explicitly, but there is a presumption that a given score should represent the same probability of winning in all phases of the game. At least the Rybka and Komodo teams believed that to be desirable; I'm not so sure about Stockfish.
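To make the point concrete, here is a minimal sketch of such a score-to-probability mapping in Python. The logistic shape is what is described above; the scale constant is purely an illustrative assumption, not Houdini's (or any engine's) actual calibration.

Code: Select all

import math

def win_probability(cp, scale=110.0):
    # Plain logistic: 0.5 at cp == 0, saturating toward 0 or 1.
    # `scale` (in centipawns) is an illustrative guess, not any
    # engine's real calibration.
    return 1.0 / (1.0 + math.exp(-cp / scale))

# win_probability(0) -> 0.5, win_probability(100) -> ~0.71,
# win_probability(-300) -> ~0.06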
As for the Naka-SF match, yes, 4 games is too few, but I have something like 30 handicap games with GMs that put the rating of Rybka 3 on a quad- or octa-core machine somewhere near 3000 in FIDE terms.
As for players rated in the 500 vicinity, they are usually unable to keep score well enough to replay a game, and anyway no one records queen-odds games. I probably have more experience giving odds of a queen and various other pieces to kids than just about anyone on earth, and I can estimate a kid's rating rather accurately from the handicap he needs to beat me. Roughly, rook odds is 1500, queen odds is 1000, queen and rook is 700, and queen and two rooks is 500; interpolate/extrapolate from there.

As for your data being enough to support your conclusion about the shape of the curve fitting the human data, the zigzag line seems to clearly refute that. It is a sign of far too little data (or else some flaw in methodology) to draw any conclusion about the true shape of the curve. Anyway, fitting to Komodo 8 should be a big step forward from using Rybka 3.
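As a back-of-the-envelope companion to the odds scale above, one could put the handicaps on a conventional material-point scale and interpolate. The material values are my assumption; only the rating anchors come from the post.

Code: Select all

# Rating anchors from the post, with handicaps expressed in
# conventional material points (R=5, Q=9, Q+R=14, Q+2R=19).
ANCHORS = [(5, 1500), (9, 1000), (14, 700), (19, 500)]

def rating_from_handicap(material):
    # Piecewise-linear interpolation, extrapolating at both ends.
    for (x0, y0), (x1, y1) in zip(ANCHORS, ANCHORS[1:]):
        if material <= x1 or (x1, y1) == ANCHORS[-1]:
            return y0 + (material - x0) * (y1 - y0) / (x1 - x0)

# e.g. rating_from_handicap(8) -> 1125 for rook-and-knight odds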
Looking forward to your next report.
Komodo rules!
nimh
Posts: 46
Joined: Sun Nov 30, 2014 12:06 am

Re: Questions regarding rating systems of humans and engines

Post by nimh »

I think that as long as there is no reliable way to transform centipawn values into expected scores, one should simply avoid them and stick to good old centipawns. Scores expressed as percentages are of course preferable, but the proper way to derive them is already described in my study.

Another promising possibility is to use percentages instead of centipawns, i.e. something similar to the Monte-Carlo method. The percentage indicates White's score against Black after a move. To find the score, an engine is set to play a certain number of games against itself; higher scores would then indicate preferable moves. The downside is that it takes a lot of time to get a statistically valid number of games, especially taking into account the need to give the engine enough time per move - otherwise its usefulness in more complicated positions becomes questionable due to the horizon effect. Its advantage lies primarily in theoretically drawn endgame positions, where evaluation-based estimates are known to be unreliable. When a position is 100% drawn, the Monte-Carlo method always shows a 50% score, while the evaluation may in some instances be over 3.00. Therefore, there is always a risk that an evaluation-based estimate will detect false 'blunders', assigning large evaluation differences to moves that actually preserve the draw.

I hope that in the near future such a method will become feasible.
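For what such a Monte-Carlo scorer might look like in practice, here is a rough sketch assuming the python-chess library and some UCI engine binary; the path, game count and time limit are placeholders, and the horizon-effect caveat above applies at fast time limits.

Code: Select all

import chess
import chess.engine

def self_play_score(fen, engine_path, games=100, movetime=0.5):
    # Estimate White's expected score by letting one engine play both
    # sides to the end from the given position.
    engine = chess.engine.SimpleEngine.popen_uci(engine_path)
    total = 0.0
    for _ in range(games):
        board = chess.Board(fen)
        while not board.is_game_over():
            result = engine.play(board, chess.engine.Limit(time=movetime))
            board.push(result.move)
        total += {"1-0": 1.0, "0-1": 0.0, "1/2-1/2": 0.5}[board.result()]
    engine.quit()
    return total / games

# Note: a deterministic engine repeats the same game; real use would
# need some randomization (an opening book, MultiPV sampling, etc.).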

Well, if your results put it at near-3000 strength on much faster hardware and at longer TCs than were used in the CCRL games, then I have no objections.

The zigzag is due to two obvious factors: the instability of human play, and the rating cohorts being very close to each other. 100 Elo is actually a small difference. We chess fans tend not to notice that, because there are tens or hundreds of players within any 100-Elo band, but take a look at the Elo rating table of football clubs.

http://clubelo.com/All/Ranking.html

The difference between the best and the 100th best club is more than 500 points.
lkaufman
Posts: 5960
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA

Re: Questions regarding rating systems of humans and engines

Post by lkaufman »

nimh wrote:I think that as long as there is no reliable way to transform centipawn values into expected scores, one should simply avoid them and stick to good old centipawns. Scores expressed as percentages are of course preferable, but the proper way to derive them is already described in my study.

Another promising possibility is to use percentages instead of centipawns, i.e. something similar to the Monte-Carlo method. The percentage indicates White's score against Black after a move. To find the score, an engine is set to play a certain number of games against itself; higher scores would then indicate preferable moves. The downside is that it takes a lot of time to get a statistically valid number of games, especially taking into account the need to give the engine enough time per move - otherwise its usefulness in more complicated positions becomes questionable due to the horizon effect. Its advantage lies primarily in theoretically drawn endgame positions, where evaluation-based estimates are known to be unreliable. When a position is 100% drawn, the Monte-Carlo method always shows a 50% score, while the evaluation may in some instances be over 3.00. Therefore, there is always a risk that an evaluation-based estimate will detect false 'blunders', assigning large evaluation differences to moves that actually preserve the draw.

I hope that in the near future such a method will become feasible.

Well, if your results put it at near-3000 strength on much faster hardware and at longer TCs than were used in the CCRL games, then I have no objections.

The zigzag is due to two obvious factors: the instability of human play, and the rating cohorts being very close to each other. 100 Elo is actually a small difference. We chess fans tend not to notice that, because there are tens or hundreds of players within any 100-Elo band, but take a look at the Elo rating table of football clubs.

http://clubelo.com/All/Ranking.html

The difference between the best and the 100th best club is more than 500 points.
The Monte-Carlo-style idea you mention is interesting, but I claim that simply transforming centipawn scores to expected results with the logistic function, using any reasonable estimate for the coefficients, will produce better results than using centipawn scores themselves. Basically, the further the score is from zero, the less a one-centipawn difference matters, and using this transformation avoids the need for arbitrary cutoffs.
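A quick numerical illustration of that point, with an arbitrary (illustrative) logistic scale; the exact coefficients change the numbers but not the picture:

Code: Select all

import math

def expected(cp, s=110.0):
    return 1.0 / (1.0 + math.exp(-cp / s))  # illustrative scale only

# the same 10 cp step matters far less away from zero:
print(expected(10) - expected(0))     # ~0.023
print(expected(310) - expected(300))  # ~0.005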
I'm not sure what football results tell us about chess, but a hundred Elo represents quite a large win-to-loss ratio in chess when you are talking about grandmasters. Anyway, it doesn't matter what causes the zigzag in the human data; unless you have enough data to make the connecting line look somewhat like a line or a curve, it's pretty hard to guess what its real shape would be.
Regarding the rating of top engines on the FIDE scale, perhaps we'll have more matches of Komodo or Stockfish vs. top human players with handicaps in the near future. The problem (aside from money) is this: if our goal is to demonstrate that Komodo on a good PC plays at the 3200 level or higher on the FIDE scale (which I believe), how do we prove it, given that we can only play small numbers of games and cannot get 2800+ players to play for reasonable sums? Letting the GM use a computer himself seems unhelpful for this purpose, while giving odds of material or moves or castling rights or bad openings requires some reasonable way to estimate the Elo value of the handicap. The only handicap with a clearly defined Elo value is the first move, but that is obviously insufficient for such a match. Any suggestions or comments on this question would be welcome.

My own opinion is that the best way to estimate the human (FIDE) Elo of an engine is by a handicap of 3 or 4 free (non-capturing) moves to start the game, since we can extrapolate: if White (having the first move) is worth 40 Elo, then a full tempo is worth 80 Elo, so two moves = 120 Elo, three moves = 200 Elo, etc. It's not perfect, but it should be close enough.
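Taking those numbers at face value, the extrapolation and its consequences are easy to sketch (the ratings in the example are hypothetical):

Code: Select all

def handicap_elo(free_moves):
    # Kaufman's extrapolation: White's first move = 40 Elo,
    # each extra tempo = 80 Elo.
    return 40 + 80 * (free_moves - 1)

def elo_expected(diff):
    # standard logistic Elo expectation for a rating difference
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

# e.g. a (hypothetical) 3200 engine giving three free moves (200 Elo)
# to a 2800 GM would still be favored: elo_expected(3200 - 200 - 2800) ~ 0.76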
Komodo rules!
Uri Blass
Posts: 10314
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Questions regarding rating systems of humans and engines

Post by Uri Blass »

I think that the only way to demonstrate that Komodo is at the 3200 level, even against anti-computer strategy, is to take some risk of losing $100,000.

You should offer every human with a FIDE rating above 2400 the chance to pay money for every game he plays against Komodo, where the human gets $100,000 for the first draw and pays a per-game fee based on his rating (a higher-rated human should pay more; the idea is that if Komodo performs at 3200 you do not lose money, and if Komodo performs above 3200 you even earn money).

If no human accepts your challenge, that suggests no human believes he can force Komodo to perform below 3200.

If enough humans accept your challenge, then we can see whether Komodo can achieve a performance of at least 3200.

If you think 3200 is too risky because some humans may perform better thanks to anti-computer strategy, you can use the same idea with a performance target of 3000.

Edit: Note that I ignore the case where the human wins against Komodo, because I think the probability of that is almost 0; you can pay the same for humans who draw as for humans who win.
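The break-even fee in such a scheme is easy to sketch. Under the assumptions above (human wins negligible, draws and wins paid alike), a human whose expected score against a true-3200 engine is e draws roughly 2e of the time, so a fair per-game fee is about 2e times the prize; the figures below are only that rough model.

Code: Select all

def elo_expected(diff):
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

def break_even_fee(rating, engine_perf=3200, prize=100_000):
    e = elo_expected(rating - engine_perf)  # human's expected score
    return prize * 2.0 * e                  # draw probability ~ 2e

# break_even_fee(2650) -> ~$8,100; break_even_fee(2400) -> ~$2,000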
lkaufman
Posts: 5960
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA

Re: Questions regarding rating systems of humans and engines

Post by lkaufman »

Uri Blass wrote:I think that the only way to demonstrate that Komodo is at the 3200 level, even against anti-computer strategy, is to take some risk of losing $100,000.

You should offer every human with a FIDE rating above 2400 the chance to pay money for every game he plays against Komodo, where the human gets $100,000 for the first draw and pays a per-game fee based on his rating (a higher-rated human should pay more; the idea is that if Komodo performs at 3200 you do not lose money, and if Komodo performs above 3200 you even earn money).

If no human accepts your challenge, that suggests no human believes he can force Komodo to perform below 3200.

If enough humans accept your challenge, then we can see whether Komodo can achieve a performance of at least 3200.

If you think 3200 is too risky because some humans may perform better thanks to anti-computer strategy, you can use the same idea with a performance target of 3000.

Edit: Note that I ignore the case where the human wins against Komodo, because I think the probability of that is almost 0; you can pay the same for humans who draw as for humans who win.
Interesting idea, but several issues:
1. I don't claim Komodo would perform at 3200 against someone who is just a specialist in anti-computer play; rather that it would maintain a 3200 FIDE rating if regularly invited to play in RR tournaments with the top ten humans. That's pretty much how Carlsen is rated.
2. I would have to spend massive amounts of time writing a draw-avoidance opening book, to offset the draw-seeking behavior of the GMs.
3. There is no point in playing players below about 2700, because a huge number of games would be needed to prove a 3200 performance against weaker opposition. The larger the Elo spread, the more games you need (a rough sketch of the statistics follows this list).
4. Since the players would presumably spend huge amounts of time preparing and optimizing anti-computer play, I wouldn't have confidence in 3200, but I would in 3100. But I doubt that any top players would take up the bet at that level. Carlsen would need to score 20%, which seems impossible to me. Anyway top players want to be paid a lot for such matches, not just to make "fair" bets.
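Here is the rough statistics behind point 3, as a sketch (normal approximation, wins and losses only; real draw rates would change the constants but not the trend):

Code: Select all

import math

def elo_expected(diff):
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

def games_for_margin(opp_elo, perf=3200, margin=50.0, z=1.96):
    # games needed for a ~95% error bar of +/- `margin` Elo around
    # the measured performance, against opposition rated opp_elo
    p = elo_expected(perf - opp_elo)
    elo_per_score = 400.0 / (math.log(10.0) * p * (1.0 - p))
    return math.ceil((z * elo_per_score) ** 2 * p * (1.0 - p) / margin ** 2)

# games_for_margin(2700) -> ~920 games, games_for_margin(3100) -> ~200:
# the wider the rating gap, the more games the same error bar costs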

The pawn-odds SF vs. Nakamura match might have answered the question if they had played the full 16 possible pawn-odds games (both colors) instead of just two. I don't think any GM would doubt that this is at least roughly a 200-Elo handicap, so a 50% score would mean almost 3000, and 75% (as achieved in the two games) would mean almost 3200 (or more - a pawn might be worth more than 200 Elo), etc.
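The arithmetic behind those estimates, as a sketch; the 2800 rating for Nakamura and the 200-Elo pawn value are the assumptions stated above, and the formula is the usual logistic performance rating:

Code: Select all

import math

def perf_rating(opp_elo, score, handicap_elo=200):
    # performance rating plus the assumed Elo value of the odds given
    return opp_elo + handicap_elo + 400.0 * math.log10(score / (1.0 - score))

# perf_rating(2800, 0.50) -> 3000.0 ("almost 3000")
# perf_rating(2800, 0.75) -> ~3191  ("almost 3200")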
Komodo rules!
Uri Blass
Posts: 10314
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Questions regarding rating systems of humans and engines

Post by Uri Blass »

lkaufman wrote:
Uri Blass wrote:I think that the only way to demonstrate that Komodo is at the 3200 level, even against anti-computer strategy, is to take some risk of losing $100,000.

You should offer every human with a FIDE rating above 2400 the chance to pay money for every game he plays against Komodo, where the human gets $100,000 for the first draw and pays a per-game fee based on his rating (a higher-rated human should pay more; the idea is that if Komodo performs at 3200 you do not lose money, and if Komodo performs above 3200 you even earn money).

If no human accepts your challenge, that suggests no human believes he can force Komodo to perform below 3200.

If enough humans accept your challenge, then we can see whether Komodo can achieve a performance of at least 3200.

If you think 3200 is too risky because some humans may perform better thanks to anti-computer strategy, you can use the same idea with a performance target of 3000.

Edit: Note that I ignore the case where the human wins against Komodo, because I think the probability of that is almost 0; you can pay the same for humans who draw as for humans who win.
Interesting idea, but several issues:
1. I don't claim Komodo would perform at 3200 against someone who is just a specialist in anti-computer play; rather that it would maintain a 3200 FIDE rating if regularly invited to play in RR tournaments with the top ten humans. That's pretty much how Carlsen is rated.
2. I would have to spend massive amounts of time writing a draw-avoidance opening book, to offset the draw-seeking behavior of the GMs.
3. There is no point in playing players below about 2700, because a huge number of games would be needed to prove a 3200 performance against weaker opposition. The larger the Elo spread, the more games you need.
4. Since the players would presumably spend huge amounts of time preparing and optimizing anti-computer play, I wouldn't have confidence in 3200, but I would in 3100. But I doubt that any top players would take up the bet at that level. Carlsen would need to score 20%, which seems impossible to me. Anyway top players want to be paid a lot for such matches, not just to make "fair" bets.

The pawn-odds SF vs. Nakamura match might have answered the question if they had played the full 16 possible pawn-odds games (both colors) instead of just two. I don't think any GM would doubt that this is at least roughly a 200-Elo handicap, so a 50% score would mean almost 3000, and 75% (as achieved in the two games) would mean almost 3200 (or more - a pawn might be worth more than 200 Elo), etc.
1) Note that the time control in the Nakamura match (45+30 Fischer, if I remember correctly) was faster than the time control humans use in tournaments, and it may be interesting to use the time control that humans actually play in human-human games.

For example, in the European championship the time control is 90 minutes for 40 moves plus 30 minutes for the rest of the game, with an increment of 30 seconds per move starting from move one.

2) It may be interesting if the Komodo team issues an open invitation to humans with FIDE ratings of 2400-2650, who would play with White, under these rules: the human pays $10,000 per game and gets $100,000 if he does not lose to Komodo with White; if he loses, he is free to buy another game with White for an additional $10,000.

Humans should be able to get a game at most 3 months after they pay.

Maybe no human accepts this invitation, but in that case it would say that probably no human (except maybe top GMs) believes he can draw often against Komodo with the white pieces.

If many humans accept the invitation, then of course you would need to turn some of them away, because you cannot promise money you do not have against the case that all of them win, even if you do not believe that case can happen.
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Questions regarding rating systems of humans and engines

Post by Laskos »

nimh wrote:I think that as long as there is no reliable way to transform centipawn values into expected scores, one should simply avoid them and stick to good old centipawns.
With centipawns you get the wrong cut-offs and the wrong asymptotic behavior. All three of Houdini, Komodo and Stockfish obey pretty much the same logistic curve when transforming centipawns (cp) to expected score:

Code: Select all

import math

def ExpectedScore(cp, a=1.1):
    # a = normalization factor; p = eval in pawns
    p = cp / 100.0
    # (exp(x) - exp(-x)) / (exp(x) + exp(-x)) is just tanh(x)
    return (1.0 + math.tanh(p / a)) / 2.0
You would get a more reliable fit using this approximation, which has the correct asymptotic behavior.
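For example, plugging nimh's two positions from earlier in the thread into this mapping: position A's steady +2.06 gives ExpectedScore(206) ≈ 0.98, while the +0.75 that position B reaches at depth 25 gives ExpectedScore(75) ≈ 0.80.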
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Questions regarding rating systems of humans and engines

Post by Laskos »

lkaufman wrote: 1. I don't claim Komodo would perform at 3200 against someone who is just a specialist in anti-computer play; rather that it would maintain a 3200 FIDE rating if regularly invited to play in RR tournaments with the top ten humans. That's pretty much how Carlsen is rated.
2. I would have to spend massive amounts of time writing a draw-avoidance opening book, to offset the draw-seeking behavior of the GMs.
3. There is no point in playing players below about 2700, because a huge number of games would be needed to prove a 3200 performance against weaker opposition. The larger the Elo spread, the more games you need.
4. Since the players would presumably spend huge amounts of time preparing and optimizing anti-computer play, I wouldn't have confidence in 3200, but I would in 3100. But I doubt that any top players would take up the bet at that level. Carlsen would need to score 20%, which seems impossible to me. Anyway top players want to be paid a lot for such matches, not just to make "fair" bets.
Maybe, for a match under normal tournament conditions:

5. Set Komodo's contempt to 150 in the middlegame.
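For anyone who wants to try that, a minimal sketch using the python-chess library; the engine path is a placeholder, and the exact option name and units depend on the Komodo version.

Code: Select all

import chess.engine

engine = chess.engine.SimpleEngine.popen_uci("./komodo")  # placeholder path
# equivalent to the raw UCI command: setoption name Contempt value 150
engine.configure({"Contempt": 150})
# (restricting contempt to the middlegame is engine-internal behavior,
# not something the UCI option itself controls)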
Uri Blass
Posts: 10314
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Questions regarding rating systems of humans and engines

Post by Uri Blass »

Thinking about it again, I think it is not good to allow the same human to bet $10,000 again and again.

It is possible that you would find some person who loses all of his money this way, and I do not like that (there are foolish people who have lost all their money gambling on other things).

The best solution is to find a sponsor, if possible.