Yes, I think 80% at +1 was for a win, so the expected score would be 87-88% or so (depending on drawelo).
Steve Maughan wrote:
Hi Kai,
Good stuff!
If I remember correctly Houdini is tuned to win 80% of the games at +1 pawns. This matches Komodo's score profile almost perfectly.
Steve
Expected performance and eval of Komodo 8 and SF 6
Moderator: Ras
-
Laskos
- Posts: 10948
- Joined: Wed Jul 26, 2006 10:21 pm
- Full name: Kai Laskos
Re: Expected performance and eval of Komodo 8 and SF 6
-
nimh
- Posts: 46
- Joined: Sun Nov 30, 2014 12:06 am
Re: Expected performance and eval of Komodo 8 and SF 6
As long as it still remains superior to pure evaluations, I see no problem. The jump from centipawns to expected scores is a big one, and I'm now going to determine how big the increase in reliability will be.
Laskos wrote: The logistic still somehow fits the curve, say to move 40, although not very well. Less material (on average after moves 35-40) changes the shape considerably.
Tentative results show that the centipawn method may underrate low-end human players. It looks like humans commit a relatively larger share of their inaccuracies in the high/low evaluation regions.
FIDE 1750: cps 0.299; exp score 6.00%
Ziggurat (CCRL 1745) cps 0.256; exp score 6.08% in slightly more difficult positions.
-
Laskos
- Posts: 10948
- Joined: Wed Jul 26, 2006 10:21 pm
- Full name: Kai Laskos
Re: Expected performance and eval of Komodo 8 and SF 6
You mean where the eval is far away from zero?
nimh wrote:
As long as it still remains superior to pure evaluations, I see no problem. The jump from centipawns to expected scores is a big one, and I'm now going to determine how big the increase in reliability will be.
Tentative results show that the centipawn method may underrate low-end human players. It looks like humans commit a relatively larger share of their inaccuracies in the high/low evaluation regions.
I just now saw your post with the pdf; what engine did you use to check for errors? I don't understand those "difficulty criteria" in the plots. Using the Logistic (Tanh) as the expected score is not perfect, but it is far better than centipawns, whose distribution has the wrong tails. Also, are the games of the 1750 humans and of the engine comparable in length? Engines tend to play longer games, but maybe the "difficulty criteria" take care of that.
-
Laskos
- Posts: 10948
- Joined: Wed Jul 26, 2006 10:21 pm
- Full name: Kai Laskos
Re: Expected performance and eval of Komodo 8 and SF 6
I actually tested Houdini 4 a bit; the claimed numbers seem quite a bit off.
Laskos wrote:
Yes, I think 80% at +1 was for a win, so the expected score would be 87-88% or so (depending on drawelo).
Steve Maughan wrote:
Hi Kai,
Good stuff!
If I remember correctly Houdini is tuned to win 80% of the games at +1 pawns. This matches Komodo's score profile almost perfectly.
Steve
Here is what is written on the Houdini 4 webpage:
Houdini 4 uses calibrated evaluations in which engine scores correlate directly with the win expectancy in the position. A +1.00 pawn advantage gives a 80% chance of winning the game against an equal opponent at blitz time control. At +2.00 the engine will win 95% of the time, and at +3.00 about 99% of the time. If the advantage is +0.50, expect to win nearly 50% of the time.
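The quoted 80% figure is a win probability; converting it to an expected score needs a draw model. A minimal sketch, where the 15% draw share is a made-up figure for illustration (the real share depends on drawelo):

```python
def expected_score(win_prob, draw_prob):
    # Expected score = P(win) + 0.5 * P(draw); losses contribute 0.
    return win_prob + 0.5 * draw_prob

# If the engine wins 80% at +1.00 and draws, say, 15% of the remaining games:
print(expected_score(0.80, 0.15))  # -> 0.875, the 87-88% mentioned earlier
```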
On 1,200 games at 10 minutes + 6 seconds, the results for win probabilities are:
Code: Select all
Eval   Claimed      Actual     Actual
H4     win prob.    move 20    move 70
0.5    50%          41%        12%
1.0    80%          61%        32%
2.0    95%          90%        81%
-
petero2
- Posts: 734
- Joined: Mon Apr 19, 2010 7:07 pm
- Location: Sweden
- Full name: Peter Osterlund
Re: Expected performance and eval of Komodo 8 and SF 6
I think it would be interesting to test if engines such as Junior, Gaviota, Texel, Rhetoric and Cheng, which have had their evaluation functions tuned to match the logistic curve, are more consistent between middle game and end game.
Laskos wrote:
Now I was curious about the issue, and performed the whole thing again (I plotted only Komodo 8 now) for different stages of the game, moves 15, 25, 35, 50, 70. Moves 15, 25, 35 are almost indistinguishable, then it diverges quite a bit, so endgames are a different matter.
Isaac wrote:
I remember Joseph Koss did a very similar study. He took over 30k games from the CCRL (40 moves in 40 minutes TC) from engines over 3000 Elo. He then analyzed the position of each game at moves 15, 30, 45, 60 and 75 for SF DD and Houdini 4 at a fixed depth (14 if I remember well). I think he also did that for Komodo 6, but I don't remember exactly.
Here are his results: https://public.bn1302.livefilestore.com ... png?psid=1, https://public.bn1.livefilestore.com/y2 ... png?psid=1.
As we can see, an eval of 1.0 at move 15 yields a higher expected result than at moves 30, 45, 60 and 75. This holds for both engines, and there is a general trend that a higher eval at later moves yields a lower expected result than at earlier moves.
I wish Joseph would post here though; I may not remember it all well.
In principle they should have the same centipawn-to-expected-score relation in all phases of the game, since that is what the tuning tries to achieve. In practice, though, that may not be the case, for a number of reasons:
1. Fortress positions are more common in the endgame, and if the evaluation function is not able to detect such positions, there is nothing the tuning can do to fix it.
2. The tuning typically adjusts weights to make the evaluation function more consistent between game stages, but since the engine typically searches deeper in the endgame, the search scores could be inconsistent anyway.
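The tuning objective being described can be sketched roughly as follows: map the engine's score through a logistic curve and minimize the squared error against actual game results. The scaling constant k below is illustrative, not any engine's actual value:

```python
def win_prob(eval_cp, k=1.13):
    # Logistic mapping from a centipawn score to an expected score.
    # k is a hypothetical scaling constant fit to game data.
    return 1.0 / (1.0 + 10.0 ** (-k * eval_cp / 400.0))

def tuning_error(positions):
    # positions: list of (eval_cp, game_result), result in {0, 0.5, 1}.
    # The tuner adjusts evaluation weights to minimize this error,
    # which pushes the eval toward the same logistic shape in all game phases.
    return sum((r - win_prob(e)) ** 2 for e, r in positions) / len(positions)

print(tuning_error([(100, 1.0), (0, 0.5), (-150, 0.0)]))
```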
-
nimh
- Posts: 46
- Joined: Sun Nov 30, 2014 12:06 am
Re: Expected performance and eval of Komodo 8 and SF 6
I used Komodo 8 on an AMD FX 9590 at 2 minutes per move.
Laskos wrote:
You mean where the eval is far away from zero? I just now saw your post with the pdf; what engine did you use to check for errors? I don't understand those "difficulty criteria" in the plots. Also, are the games of the 1750 humans and of the engine comparable in length? Engines tend to play longer games, but maybe the "difficulty criteria" take care of that.
Yes, precisely.
There are two main reasons why humans show relatively lower accuracy in such positions:
1) In won positions there is no longer a need to expend energy finding the moves that lead to the quickest mate or the largest material advantage, when there are easier, more playable alternatives.
2) In lost positions the focus again shifts away from objectivity towards desperado moves, i.e. swindling, setting traps, or increasing the difficulty of the position in the hope of tricking the opponent into blundering. Sometimes the engine's suggestions are fairly easy to reply to.
According to K. W. Regan, the latter has the bigger effect on accuracy; see figure 2, page 6:
http://www.cse.buffalo.edu/~regan/papers/pdf/RMH11b.pdf
There are three difficulty categories that hopefully correspond to separate aspects of the difficulty of positions.
1) eval stability - the difference between the lowest and the highest eval in a particular position measured across all depths. For example, if at d4 the eval is 0.12 and at d21 -0.04, then the eval stability is 0.16.
2) difference - the difference between the two best moves at the latest ply.
3) equal moves - the number of moves within 0.30 cp distance.
The average length of human games is indeed shorter: 41.7 vs 53.7.
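For concreteness, the three criteria above can be sketched on made-up numbers (all evals here are hypothetical, not taken from the actual analysis):

```python
# Hypothetical evals of one position at several search depths (in pawns).
evals_by_depth = {4: 0.12, 9: 0.05, 15: -0.01, 21: -0.04}

# 1) eval stability: spread between the highest and lowest eval across depths.
eval_stability = round(max(evals_by_depth.values()) - min(evals_by_depth.values()), 2)

# Hypothetical multi-PV scores of the legal moves at the latest ply, best first.
move_scores = [0.31, 0.12, 0.08, -0.20]

# 2) difference: gap between the two best moves.
difference = round(move_scores[0] - move_scores[1], 2)

# 3) equal moves: moves within 0.30 of the best move.
equal_moves = sum(1 for s in move_scores if move_scores[0] - s <= 0.30)

print(eval_stability, difference, equal_moves)  # -> 0.16 0.19 3
```

The 0.16 reproduces the d4/d21 example given above.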
-
Laskos
- Posts: 10948
- Joined: Wed Jul 26, 2006 10:21 pm
- Full name: Kai Laskos
Re: Expected performance and eval of Komodo 8 and SF 6
I tested Texel 1.05 a bit; the database consists of only 200 games at 10 minutes + 3 seconds, so 1 SD errors in the expected scores are about 2%.
petero2 wrote:
I think it would be interesting to test if engines such as Junior, Gaviota, Texel, Rhetoric and Cheng, which have had their evaluation functions tuned to match the logistic curve, are more consistent between middle game and end game.
Code: Select all
Eval         Move 30     Move 70
Texel 1.05   Exp. score  Exp. score
1.0          61%         52%
1.5          70%         55%
2.0          75%         64%
-
Laskos
- Posts: 10948
- Joined: Wed Jul 26, 2006 10:21 pm
- Full name: Kai Laskos
Re: Expected performance and eval of Komodo 8 and SF 6
Ok, clearer now. K8 at 2 min/move is a very good error checker.
nimh wrote:
I used Komodo 8 on AMD FX 9590 at 2 minutes per move.
There are three difficulty categories that hopefully correspond to separate aspects of the difficulty of positions.
1) eval stability - the difference between the lowest and the highest eval in a particular position measured across all depths. For example, if at d4 the eval is 0.12 and at d21 -0.04, then the eval stability is 0.16.
2) difference - the difference between the two best moves at the latest ply.
3) equal moves - the number of moves within 0.30 cp distance.
The average length of human games is indeed shorter: 41.7 vs 53.7.
Isn't 2) a bit like a blunder check? And humans would naturally blunder more? I mean, in your plots anything larger than 0.5 or 1.0 is a blunder.
Aren't 2) and 3) a bit anti-correlated: more equal moves -> less important blunders?
If the games can be trimmed to 40 moves or so, a simple quasi-logistic curve can be applied to the Komodo 8 expected score, and maybe the differences in style between humans and engines of this comparable ~1750 rating are less of a problem. Although I am not sure; endgames, after all, are part of the game.
-
Laskos
- Posts: 10948
- Joined: Wed Jul 26, 2006 10:21 pm
- Full name: Kai Laskos
Re: Expected performance and eval of Komodo 8 and SF 6
I added Houdini 4 to the Expected Performance graph, valid for moves in the range 8 to 40.

The fits are
Komodo 8: (tanh[eval^1.757/1.313] + 1) / 2
SF6: (tanh[eval^1.4707/1.7645] + 1) / 2,
Houdini 4: (tanh[eval^1.5225/1.49175] + 1) / 2
It seems that for evals in the range [0, 0.5] the Expected Score is the same for the top engines; for larger evals, at a fixed value, the Expected Scores order as K8 > H4 > SF6.
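Assuming eval is in pawns and nonnegative, the three fits above can be evaluated directly, which also reproduces the K8 > H4 > SF6 ordering at +1.00:

```python
import math

def fit_score(eval_pawns, a, b):
    # Expected score = (tanh(eval^a / b) + 1) / 2, for eval >= 0.
    return (math.tanh(eval_pawns ** a / b) + 1.0) / 2.0

# Fit parameters (a, b) quoted in the post above.
fits = {
    "Komodo 8":  (1.757, 1.313),
    "SF 6":      (1.4707, 1.7645),
    "Houdini 4": (1.5225, 1.49175),
}
for name, (a, b) in fits.items():
    print(name, round(fit_score(1.0, a, b), 3))
```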

-
Vinvin
- Posts: 5316
- Joined: Thu Mar 09, 2006 9:40 am
- Full name: Vincent Lejeune
Re: Expected performance and eval of Komodo 8 and SF 6
Nice graphic.
SF has better resolution for scores in [0..1.5].
The opposite view would also be interesting: the most positive eval in a losing game.