Expected performance and eval of Komodo 8 and SF 6


Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Expected performance and eval of Komodo 8 and SF 6

Post by Laskos »

Steve Maughan wrote:Hi Kai,

Good stuff!

If I remember correctly Houdini is tuned to win 80% of the games at +1 pawns. This matches Komodo's score profile almost perfectly.

Steve
Yes, I think 80% at +1 was for a win, so the expected score would be 87-88% or so (depending on drawelo).
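The 87-88% follows from counting a full point per win plus half a point per draw. A minimal sketch; the 0.15/0.05 draw/loss split is illustrative, since the real split depends on drawelo:

```python
# If +1.00 wins 80% of the games, the expected score also counts half a point
# per draw. The draw/loss split below is an assumption, not a measurement.
win, draw, loss = 0.80, 0.15, 0.05
expected = win + 0.5 * draw
print(expected)  # ~0.875, i.e. roughly the 87-88% quoted above
```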
nimh
Posts: 46
Joined: Sun Nov 30, 2014 12:06 am

Re: Expected performance and eval of Komodo 8 and SF 6

Post by nimh »

Laskos wrote: The logistic still fits the curve somewhat, say up to move 40, although not very well. Less material (on average after moves 35-40) changes the shape considerably.
As long as it still remains superior to pure evaluations, I see no problem. The jump from centipawns to exp scores is a big one and I'm right now going to determine how big the increase in reliability will be.

Tentative results show that the centipawn method may underrate low-end human players. It looks like humans have a relatively larger portion of inaccuracies in the high/low evaluation range.

FIDE 1750: cps 0.299; exp score 6.00%
Ziggurat (CCRL 1745): cps 0.256; exp score 6.08% in slightly more difficult positions.
Laskos

Re: Expected performance and eval of Komodo 8 and SF 6

Post by Laskos »

nimh wrote:
Laskos wrote: The logistic still fits the curve somewhat, say up to move 40, although not very well. Less material (on average after moves 35-40) changes the shape considerably.
As long as it still remains superior to pure evaluations, I see no problem. The jump from centipawns to exp scores is a big one and I'm right now going to determine how big the increase in reliability will be.

Tentative results show that the centipawn method may underrate low-end human players. It looks like humans have a relatively larger portion of inaccuracies in the high/low evaluation range.
You mean where the eval is far away from zero?

FIDE 1750: cps 0.299; exp score 6.00%
Ziggurat (CCRL 1745): cps 0.256; exp score 6.08% in slightly more difficult positions.
I just now saw your post with the pdf; what engine did you use to check for errors? I don't understand those "difficulty criteria" in the plots. Using the logistic (tanh) as the expected score is not perfect, but it is far better than centipawns, whose distribution has the wrong tails. Also, are the games of the 1750-rated humans and of the engine comparable in length? Engines tend to play longer games, but maybe the "difficulty criteria" take care of that.
Laskos

Re: Expected performance and eval of Komodo 8 and SF 6

Post by Laskos »

Laskos wrote:
Steve Maughan wrote:Hi Kai,

Good stuff!

If I remember correctly Houdini is tuned to win 80% of the games at +1 pawns. This matches Komodo's score profile almost perfectly.

Steve
Yes, I think 80% at +1 was for a win, so the expected score would be 87-88% or so (depending on drawelo).
I actually tested Houdini 4 a bit; the claimed numbers seem quite a bit off.

Here is what is written on Houdini 4 webpage:
Houdini 4 uses calibrated evaluations in which engine scores correlate directly with the win expectancy in the position. A +1.00 pawn advantage gives an 80% chance of winning the game against an equal opponent at blitz time control. At +2.00 the engine will win 95% of the time, and at +3.00 about 99% of the time. If the advantage is +0.50, expect to win nearly 50% of the time.

On 1,200 games at 10 minutes + 6 seconds, the results for win probabilities are:

Code: Select all

Eval       Claimed        Actual       Actual
H4         win. prob.     move 20      move 70

0.5          50%            41%          12%          
1.0          80%            61%          32%
2.0          95%            90%          81% 
The actual results are not only off the claimed win probabilities, but also depend on the stage of the game. IIRC, R. Houdart claimed he normalized the eval across all stages, to give the same win probability for the same eval.
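A table like this can be built by binning games according to their eval at a fixed move number and counting the observed wins. A minimal sketch, with an assumed data format (this is not the actual tooling used for the 1,200 games):

```python
def empirical_win_prob(samples, target_eval, tol=0.1):
    """Fraction of wins among games whose eval at the chosen move number
    fell within `tol` pawns of `target_eval`.

    samples: list of (eval_at_move_N, result) pairs, with result 1 for a
    win, 0.5 for a draw, 0 for a loss, from the evaluated side's view.
    """
    hits = [result for ev, result in samples if abs(ev - target_eval) <= tol]
    if not hits:
        return None  # no games landed in this eval bin
    return sum(1 for r in hits if r == 1) / len(hits)

# Toy data: two games near +0.5 (one won, one drawn) and one game at +2.0
print(empirical_win_prob([(0.48, 1), (0.52, 0.5), (2.0, 1)], 0.5))  # 0.5
```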
petero2
Posts: 734
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: Expected performance and eval of Komodo 8 and SF 6

Post by petero2 »

Laskos wrote:
Isaac wrote: I remember Joseph Koss did a very similar study. He took over 30k games from the CCRL (40 moves in 40 minutes TC) from engines over 3000 elo. He then analyzed the position of each game at moves 15, 30, 45, 60 and 75 for SF DD and Houdini 4 at a fixed depth (14, if I remember correctly). I think he also did that for Komodo 6 but I don't remember exactly.
Here are his results: https://public.bn1302.livefilestore.com ... png?psid=1, https://public.bn1.livefilestore.com/y2 ... png?psid=1.

As we can see, an eval of 1.0 at move 15 yields a higher expected result than at moves 30, 45, 60 and 75. This holds for both engines, and there is a general trend that a given eval at later moves yields a lower expected result than at earlier moves.

I wish Joseph would post here though; I may not remember it all correctly.
Now I was curious about the issue, and performed the whole thing again (I plotted only Komodo 8 now) for different stages of the game: moves 15, 25, 35, 50, 70. Moves 15, 25, 35 are almost indistinguishable; then it diverges quite a bit, so endgames are a different matter.
I think it would be interesting to test whether engines such as Junior, Gaviota, Texel, Rhetoric and Cheng, which have had their evaluation functions tuned to match the logistic curve, are more consistent between middle game and end game.

In principle they should have the same centipawn to expected score relation in all phases of the game, since that is what the tuning tries to achieve. In practice though that may not be the case for a number of reasons:

1. Fortress positions are more common in the end game, and if the evaluation function is not able to detect such positions, there is nothing the tuning can do to fix it.

2. The tuning typically adjusts weights to make the evaluation function be more consistent between game stages, but since the engine typically searches deeper in the endgame, the search scores could be inconsistent anyway.
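The logistic target such tuning aims at can be sketched as follows. This is a minimal illustration of a Texel-style tuning objective: map a centipawn score to an expected score and minimize the squared error against actual game results. The scaling constant K is engine-specific; 1.13 here is purely illustrative.

```python
def sigmoid(score_cp, k=1.13):
    # Map a centipawn score to an expected score in (0, 1).
    # k is an illustrative engine-specific scaling constant.
    return 1.0 / (1.0 + 10.0 ** (-k * score_cp / 400.0))

def tuning_error(positions, k=1.13):
    # positions: list of (eval_cp, game_result) pairs, result in {0, 0.5, 1}.
    # The tuner adjusts eval weights to minimize this mean squared error.
    return sum((r - sigmoid(s, k)) ** 2 for s, r in positions) / len(positions)

print(sigmoid(0.0))  # 0.5: an equal position is expected to score 50%
```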
nimh

Re: Expected performance and eval of Komodo 8 and SF 6

Post by nimh »

Laskos wrote:
nimh wrote:
Laskos wrote: The logistic still fits the curve somewhat, say up to move 40, although not very well. Less material (on average after moves 35-40) changes the shape considerably.
As long as it still remains superior to pure evaluations, I see no problem. The jump from centipawns to exp scores is a big one and I'm right now going to determine how big the increase in reliability will be.

Tentative results show that the centipawn method may underrate low-end human players. It looks like humans have a relatively larger portion of inaccuracies in the high/low evaluation range.
You mean where the eval is far away from zero?

Yes, precisely.

There are two main reasons why humans have relatively lower accuracy in such positions:

1) In won positions there is no longer a need to expend energy finding the moves that lead to the quickest mate or largest material advantage when there are easier, more playable alternatives.

2) In lost positions the focus again shifts away from objectivity to making desperado moves, i.e. swindling, setting traps or increasing the difficulty of the position in the hope of tricking the opponent into blundering. Sometimes engine suggestions are fairly easy to reply to.

According to K. W. Regan, the latter has a bigger effect on accuracy; see figure 2, page 6 below:

http://www.cse.buffalo.edu/~regan/papers/pdf/RMH11b.pdf

FIDE 1750: cps 0.299; exp score 6.00%
Ziggurat (CCRL 1745): cps 0.256; exp score 6.08% in slightly more difficult positions.
I just now saw your post with the pdf; what engine did you use to check for errors? I don't understand those "difficulty criteria" in the plots. Using the logistic (tanh) as the expected score is not perfect, but it is far better than centipawns, whose distribution has the wrong tails. Also, are the games of the 1750-rated humans and of the engine comparable in length? Engines tend to play longer games, but maybe the "difficulty criteria" take care of that.
I used Komodo 8 on AMD FX 9590 at 2 minutes per move.

There are three difficulty categories that hopefully correspond to separate aspects of the difficulty of positions.

1) eval stability - the difference between the lowest and the highest eval in a particular position measured across all depths. For example, if at d4 the eval is 0.12 and at d21 -0.04, then the eval stability is 0.16.

2) difference - the difference between the two best moves at the latest ply.

3) equal moves - the number of moves within 0.30 cp distance.


The average length of human games is indeed shorter: 41.7 vs 53.7.
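The three criteria above can be sketched as follows, assuming we already have the engine's eval at every search depth and the scores of all candidate moves at the final depth (function and parameter names are mine; all values in pawns):

```python
def eval_stability(evals_by_depth):
    # Spread between the highest and lowest eval across all depths,
    # e.g. 0.12 at d4 vs -0.04 at d21 gives 0.16, as in the example above.
    return max(evals_by_depth) - min(evals_by_depth)

def best_move_gap(move_scores):
    # Difference between the two best moves at the latest ply.
    top = sorted(move_scores, reverse=True)
    return top[0] - top[1] if len(top) > 1 else float("inf")

def near_equal_moves(move_scores, window=0.30):
    # Number of moves within `window` of the best move.
    best = max(move_scores)
    return sum(1 for s in move_scores if best - s <= window)

print(round(eval_stability([0.12, 0.05, -0.04]), 2))  # 0.16
```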
Laskos

Re: Expected performance and eval of Komodo 8 and SF 6

Post by Laskos »

petero2 wrote:
Laskos wrote:
Isaac wrote: I remember Joseph Koss did a very similar study. He took over 30k games from the CCRL (40 moves in 40 minutes TC) from engines over 3000 elo. He then analyzed the position of each game at moves 15, 30, 45, 60 and 75 for SF DD and Houdini 4 at a fixed depth (14, if I remember correctly). I think he also did that for Komodo 6 but I don't remember exactly.
Here are his results: https://public.bn1302.livefilestore.com ... png?psid=1, https://public.bn1.livefilestore.com/y2 ... png?psid=1.

As we can see, an eval of 1.0 at move 15 yields a higher expected result than at moves 30, 45, 60 and 75. This holds for both engines, and there is a general trend that a given eval at later moves yields a lower expected result than at earlier moves.

I wish Joseph would post here though; I may not remember it all correctly.
Now I was curious about the issue, and performed the whole thing again (I plotted only Komodo 8 now) for different stages of the game: moves 15, 25, 35, 50, 70. Moves 15, 25, 35 are almost indistinguishable; then it diverges quite a bit, so endgames are a different matter.
I think it would be interesting to test whether engines such as Junior, Gaviota, Texel, Rhetoric and Cheng, which have had their evaluation functions tuned to match the logistic curve, are more consistent between middle game and end game.

In principle they should have the same centipawn to expected score relation in all phases of the game, since that is what the tuning tries to achieve. In practice though that may not be the case for a number of reasons:

1. Fortress positions are more common in the end game, and if the evaluation function is not able to detect such positions, there is nothing the tuning can do to fix it.

2. The tuning typically adjusts weights to make the evaluation function be more consistent between game stages, but since the engine typically searches deeper in the endgame, the search scores could be inconsistent anyway.
I tested Texel 1.05 a bit; the database consists of only 200 games at 10 minutes + 3 seconds, so 1SD errors in expected scores are about 2%.

Code: Select all

Eval         Move 30       Move 70
Texel 1.05   Exp. score    Exp. score   

1.0            61%           52%        
1.5            70%           55%
2.0            75%           64% 
I never quite see stability from the opening/middlegame to the endgame.
Laskos

Re: Expected performance and eval of Komodo 8 and SF 6

Post by Laskos »

nimh wrote:
I used Komodo 8 on AMD FX 9590 at 2 minutes per move.

There are three difficulty categories that hopefully correspond to separate aspects of the difficulty of positions.

1) eval stability - the difference between the lowest and the highest eval in a particular position measured across all depths. For example, if at d4 the eval is 0.12 and at d21 -0.04, then the eval stability is 0.16.

2) difference - the difference between the two best moves at the latest ply.

3) equal moves - the number of moves within 0.30 cp distance.


The average length of human games is indeed shorter: 41.7 vs 53.7.
Ok, clearer now. K8 at 2 min/move is a very good error checker.
Isn't 2) a bit like a blunder check? And wouldn't humans naturally blunder more? I mean, in your plots, anything larger than 0.5 or 1.0 is a blunder.
Aren't 2) and 3) a bit anti-correlated: more equal moves -> fewer important blunders?

If the games can be trimmed to 40 moves or so, a simple quasi-logistic curve can be applied to the Komodo 8 expected score, and maybe the differences in style between humans and engines of this comparable ~1750 rating are less of a problem. Although I'm not sure; endgames are, after all, part of the game.
Laskos

Re: Expected performance and eval of Komodo 8 and SF 6

Post by Laskos »

I added Houdini 4 to the Expected Performance graph, valid for moves in the range 8 to 40.

[Graph: Expected Performance vs eval for Komodo 8, SF 6 and Houdini 4]

The fits are
Komodo 8: (tanh[eval^1.757/1.313] + 1) / 2
SF 6: (tanh[eval^1.4707/1.7645] + 1) / 2
Houdini 4: (tanh[eval^1.5225/1.49175] + 1) / 2

It seems that for evals in the range [0, 0.5] the Expected Score is the same for the top engines; for larger evals, at a fixed value, the Expected Scores rank K8 > H4 > SF6.
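A minimal sketch evaluating these fitted curves. Note that because of the fractional power of eval, the fits only apply for eval >= 0, i.e. from the winning side's perspective:

```python
import math

def expected_score(eval_pawns, exponent, scale):
    # Fitted curve form from the post: (tanh[eval^exponent / scale] + 1) / 2.
    # Only valid for eval_pawns >= 0 (fractional powers of negatives fail).
    return (math.tanh(eval_pawns ** exponent / scale) + 1) / 2

FITS = {  # (exponent, scale) per engine, as given above
    "Komodo 8":  (1.757,  1.313),
    "SF 6":      (1.4707, 1.7645),
    "Houdini 4": (1.5225, 1.49175),
}

for name, (p, s) in FITS.items():
    print(f"{name}: expected score at +1.00 = {expected_score(1.0, p, s):.3f}")
```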
Vinvin
Posts: 5316
Joined: Thu Mar 09, 2006 9:40 am
Full name: Vincent Lejeune

Re: Expected performance and eval of Komodo 8 and SF 6

Post by Vinvin »

Nice graphic.
SF has better resolution for scores in [0..1.5] :-)

The opposite view would also be interesting: the most positive eval reached in a losing game :-)