Expected performance and eval of Komodo 8 and SF 6

Discussion of anything and everything relating to chess playing software and machines.

Moderator: Ras

nimh
Posts: 46
Joined: Sun Nov 30, 2014 12:06 am

Re: Expected performance and eval of Komodo 8 and SF 6

Post by nimh »

Laskos wrote:
nimh wrote:

There is actually a positive correlation between the distance between the 2 best engine suggestions and the number of more-or-less equal moves.
I must be missing something; apparently they exclude each other, one measuring the "forced moves", the other the "quiet moves".
I see what you mean; my statement implied that a bigger number of equal moves correlates with a bigger gap between the 2 best moves. :) I should have used more careful wording. High numbers of equal moves and small differences between the 1st and 2nd engine suggestions both occur primarily in quiet strategical positions, and are hence quite strongly correlated in the sense that accurate play is easy to maintain.

But consider the following examples.

Position A:

1. Rd1 0.30
2. Re1 0.30
3. Qf5 -1.04

Position B:

1. e5 0.03
2. f3 0.31
3. Bg2 0.31
4. Nd3 0.32
5. a3 0.33
6. Rc2 0.33

It shows that both factors complement each other nicely: only 2 moves at 0.30 (Position A) is far below average, while the difference of 0.28 between the top two choices (Position B) is about average. Sometimes in highly tactical positions there are 2 or 3 moves evaluated as equal, so relying on the difference alone would be misleading.
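For concreteness, both measures can be computed from a MultiPV analysis. A minimal sketch, assuming the list is sorted with the engine's first choice first; the 0.10-pawn threshold for counting "equal" moves is my assumption, not a value from this thread:

```python
# Two complexity measures from a MultiPV list of (move, eval) pairs:
# the gap between the top two choices, and the count of near-equal moves.
# The 0.10-pawn "equal move" threshold is an assumed value.

def top2_gap(multipv):
    """Absolute eval difference between the engine's 1st and 2nd choices."""
    return abs(multipv[0][1] - multipv[1][1])

def num_equal_moves(multipv, threshold=0.10):
    """Number of moves whose eval is within `threshold` of the first choice."""
    best = multipv[0][1]
    return sum(1 for _, score in multipv if abs(score - best) <= threshold)

# Position A from the post: two equal top moves, then a large drop.
pos_a = [("Rd1", 0.30), ("Re1", 0.30), ("Qf5", -1.04)]
# Position B from the post: a 0.28 gap between the top two choices.
pos_b = [("e5", 0.03), ("f3", 0.31), ("Bg2", 0.31),
         ("Nd3", 0.32), ("a3", 0.33), ("Rc2", 0.33)]
```

As the post argues, Position A scores low on the equal-move count while Position B scores average on the top-2 gap, so the two measures capture different things.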
Ozymandias
Posts: 1537
Joined: Sun Oct 25, 2009 2:30 am

Re: Expected performance and eval of Komodo 8 and SF 6

Post by Ozymandias »

Vinvin wrote:The opposite view would be interesting: the most positive eval for a losing game :-)
Either that, or a zoom on the rest of the graph; let's say from evals 3 to 6 and a 99 to 100% expected performance. Useful for game adjudication.
petero2
Posts: 734
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: Expected performance and eval of Komodo 8 and SF 6

Post by petero2 »

Laskos wrote:
petero2 wrote: I think it would be interesting to test if engines such as Junior, Gaviota, Texel, Rhetoric and Cheng, who have had their evaluation functions tuned to match the logistic curve, are more consistent between middle game and end game.

In principle they should have the same centipawn to expected score relation in all phases of the game, since that is what the tuning tries to achieve. In practice though that may not be the case for a number of reasons:

1. Fortress positions are more common in the end game, and if the evaluation function is not able to detect such positions, there is nothing the tuning can do to fix it.

2. The tuning typically adjusts weights to make the evaluation function be more consistent between game stages, but since the engine typically searches deeper in the endgame, the search scores could be inconsistent anyway.
I tested Texel 1.05 a bit; the database consists of only 200 games at 10 minutes + 3 seconds, and the 1 SD errors in the expected scores are about 2%.

Code: Select all

Eval         Move 30       Move 70
Texel 1.05   Exp. score    Exp. score

1.0            61%           52%
1.5            70%           55%
2.0            75%           64%
I never quite see stability from opening/middlegame to endgame.
Thank you.

So a calibrated evaluation function does not necessarily mean that the result returned by search is calibrated.

It seems like it should be possible to improve an engine by making the search scores more consistent between middle game and endgame. Otherwise there is a risk that the engine when ahead chooses to trade down into a position that has a higher centipawn score, but a lower expected game score.
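The calibration being discussed maps a centipawn score to an expected game score via a logistic curve. A minimal sketch of that model, and of inverting it against Laskos' +1.00 rows; the 10^x form and the scale parameter are one common convention, not necessarily Texel's exact formula:

```python
import math

# Logistic model mapping an eval in pawns to an expected score:
#   E = 1 / (1 + 10^(-eval / s))
# The scale s is what tuning fits. This form is one common convention,
# not necessarily the exact formula Texel uses.

def expected_score(eval_pawns, scale):
    return 1.0 / (1.0 + 10.0 ** (-eval_pawns / scale))

def implied_scale(eval_pawns, score):
    """Invert the model: the scale that maps eval_pawns to score."""
    return -eval_pawns / math.log10(1.0 / score - 1.0)

# Plugging in Laskos' measurements for a +1.00 eval shows the effective
# scale differs sharply by phase, i.e. the same search score buys much
# less expected score at move 70 than at move 30.
s30 = implied_scale(1.0, 0.61)  # move 30 column
s70 = implied_scale(1.0, 0.52)  # move 70 column
```

A single fitted scale cannot match both columns at once, which is one way to state the inconsistency between middlegame and endgame search scores.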
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Expected performance and eval of Komodo 8 and SF 6

Post by Laskos »

petero2 wrote:
Laskos wrote:
petero2 wrote: I think it would be interesting to test if engines such as Junior, Gaviota, Texel, Rhetoric and Cheng, who have had their evaluation functions tuned to match the logistic curve, are more consistent between middle game and end game.

In principle they should have the same centipawn to expected score relation in all phases of the game, since that is what the tuning tries to achieve. In practice though that may not be the case for a number of reasons:

1. Fortress positions are more common in the end game, and if the evaluation function is not able to detect such positions, there is nothing the tuning can do to fix it.

2. The tuning typically adjusts weights to make the evaluation function be more consistent between game stages, but since the engine typically searches deeper in the endgame, the search scores could be inconsistent anyway.
I tested Texel 1.05 a bit; the database consists of only 200 games at 10 minutes + 3 seconds, and the 1 SD errors in the expected scores are about 2%.

Code: Select all

Eval         Move 30       Move 70
Texel 1.05   Exp. score    Exp. score

1.0            61%           52%
1.5            70%           55%
2.0            75%           64%
I never quite see stability from opening/middlegame to endgame.
Thank you.

So a calibrated evaluation function does not necessarily mean that the result returned by search is calibrated.

It seems like it should be possible to improve an engine by making the search scores more consistent between middle game and endgame. Otherwise there is a risk that the engine when ahead chooses to trade down into a position that has a higher centipawn score, but a lower expected game score.
Do you have a PGN database from cutechess-cli of Texel self-games of similar strength at blitz TC? In thousands of games. Thanks.
lkaufman
Posts: 6284
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA
Full name: Larry Kaufman

Re: Expected performance and eval of Komodo 8 and SF 6

Post by lkaufman »

petero2 wrote:
Laskos wrote:
petero2 wrote: I think it would be interesting to test if engines such as Junior, Gaviota, Texel, Rhetoric and Cheng, who have had their evaluation functions tuned to match the logistic curve, are more consistent between middle game and end game.

In principle they should have the same centipawn to expected score relation in all phases of the game, since that is what the tuning tries to achieve. In practice though that may not be the case for a number of reasons:

1. Fortress positions are more common in the end game, and if the evaluation function is not able to detect such positions, there is nothing the tuning can do to fix it.

2. The tuning typically adjusts weights to make the evaluation function be more consistent between game stages, but since the engine typically searches deeper in the endgame, the search scores could be inconsistent anyway.
I tested Texel 1.05 a bit; the database consists of only 200 games at 10 minutes + 3 seconds, and the 1 SD errors in the expected scores are about 2%.

Code: Select all

Eval         Move 30       Move 70
Texel 1.05   Exp. score    Exp. score

1.0            61%           52%
1.5            70%           55%
2.0            75%           64%
I never quite see stability from opening/middlegame to endgame.
Thank you.

So a calibrated evaluation function does not necessarily mean that the result returned by search is calibrated.

It seems like it should be possible to improve an engine by making the search scores more consistent between middle game and endgame. Otherwise there is a risk that the engine when ahead chooses to trade down into a position that has a higher centipawn score, but a lower expected game score.
I noticed a while ago that Komodo didn't score as well in the endgame as in the middlegame for a given positive eval. We tried a couple of ways to eliminate this difference, but the resulting version was weaker than the original. I don't really know why, but one theory is that the deeper into the game you are, the more likely it is that the score has persisted for a while and cannot easily be improved. I'm not sure what to do about this. Comments welcome.
Komodo rules!
petero2
Posts: 734
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: Expected performance and eval of Komodo 8 and SF 6

Post by petero2 »

Laskos wrote:
petero2 wrote:
Laskos wrote:
petero2 wrote: I think it would be interesting to test if engines such as Junior, Gaviota, Texel, Rhetoric and Cheng, who have had their evaluation functions tuned to match the logistic curve, are more consistent between middle game and end game.

In principle they should have the same centipawn to expected score relation in all phases of the game, since that is what the tuning tries to achieve. In practice though that may not be the case for a number of reasons:

1. Fortress positions are more common in the end game, and if the evaluation function is not able to detect such positions, there is nothing the tuning can do to fix it.

2. The tuning typically adjusts weights to make the evaluation function be more consistent between game stages, but since the engine typically searches deeper in the endgame, the search scores could be inconsistent anyway.
I tested Texel 1.05 a bit; the database consists of only 200 games at 10 minutes + 3 seconds, and the 1 SD errors in the expected scores are about 2%.

Code: Select all

Eval         Move 30       Move 70
Texel 1.05   Exp. score    Exp. score

1.0            61%           52%
1.5            70%           55%
2.0            75%           64%
I never quite see stability from opening/middlegame to endgame.
Thank you.

So a calibrated evaluation function does not necessarily mean that the result returned by search is calibrated.

It seems like it should be possible to improve an engine by making the search scores more consistent between middle game and endgame. Otherwise there is a risk that the engine when ahead chooses to trade down into a position that has a higher centipawn score, but a lower expected game score.
Do you have a PGN database from cutechess-cli of Texel self-games of similar strength at blitz TC? In thousands of games. Thanks.
I don't have a lot of blitz games, but I do have lots of hyper-bullet (1s + 0.08s/move) games. Here are 37100 such games: http://dl.dropboxusercontent.com/u/8968 ... s105a32.xz. These games were played under the same conditions as I use when tuning the evaluation function.

If you want, I can play some games at a longer time control; just specify the time control and the number of games you want.
Pio
Posts: 338
Joined: Sat Feb 25, 2012 10:42 pm
Location: Stockholm

Re: Expected performance and eval of Komodo 8 and SF 6

Post by Pio »

Hi Larry!

Maybe you could use the logistic function with different parameters depending on how far the game has progressed (not w.r.t. move number, but perhaps w.r.t. material left) to modify your score, so that it better reflects the probability of winning the game.

Good luck!
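This remapping idea can be sketched as follows; the two scale values and the linear interpolation over a material-based phase are invented purely for illustration:

```python
import math

# Remap a raw score through a phase-dependent logistic so that equal
# scores in different phases imply equal winning probability, then express
# that probability back on a common (middlegame) scale. All constants are
# made up for illustration.

SCALE_MG = 4.0  # assumed middlegame scale, in pawns
SCALE_EG = 8.0  # assumed endgame scale: the same eval wins less often

def remap(score_pawns, phase):
    """phase: 1.0 = full material (middlegame), 0.0 = bare endgame."""
    scale = SCALE_EG + phase * (SCALE_MG - SCALE_EG)
    win_prob = 1.0 / (1.0 + 10.0 ** (-score_pawns / scale))
    # convert the probability back to a middlegame-equivalent score
    return -SCALE_MG * math.log10(1.0 / win_prob - 1.0)
```

With these made-up numbers, a +1.00 middlegame score is returned unchanged, while a +1.00 bare-endgame score becomes +0.50, so the search would no longer prefer trading into an ending whose raw score is nominally equal. As Larry's reply below notes, this kind of adjustment matched winning probabilities better in Komodo yet played weaker, so it is a sketch of the idea rather than a recommendation.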
lkaufman
Posts: 6284
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA
Full name: Larry Kaufman

Re: Expected performance and eval of Komodo 8 and SF 6

Post by lkaufman »

Pio wrote:Hi Larry!

Maybe you could use the logistic function with different parameters depending on how far the game has progressed (not w.r.t. move number, but perhaps w.r.t. material left) to modify your score, so that it better reflects the probability of winning the game.

Good luck!
Well, the problem is that modifying the score based on material does make Komodo reflect winning probabilities better, but the revised version loses to the normal one. That is what we don't understand.
Komodo rules!
Pio
Posts: 338
Joined: Sat Feb 25, 2012 10:42 pm
Location: Stockholm

Re: Expected performance and eval of Komodo 8 and SF 6

Post by Pio »

Hi Larry!

I guess the problem is that when you adjust the scores to better reflect winning probabilities, you will not search as deeply into the simplified positions, which are much easier to resolve to a win, draw or loss, since the search is guided by the evaluation.

I guess you could fix this problem by modifying your search a little. I had an idea http://www.talkchess.com/forum/viewtopi ... 76&t=42677 of how to do this.

What I want to say is that I think you should search a lot deeper in the simplified parts of the tree, since it is cheaper, but that you should not play the moves leading into those simplified parts unless you are fairly sure they are good.

Good luck!
kbhearn
Posts: 411
Joined: Thu Dec 30, 2010 4:48 am

Re: Expected performance and eval of Komodo 8 and SF 6

Post by kbhearn »

It sounds to me like you're just running into the second purpose of an evaluation function: progressing the game. At some point you have to accept an ending that is closer to drawn than the middlegame you're currently in, preferably before the 50-move rule looms and it is too late to make the right decision, and as such the evaluation of those endings needs to be slightly higher than that of the middlegames they come from.