There is actually a positive correlation between the gap between the two best engine suggestions and the number of more-or-equal moves.
I must be missing something; they seem to exclude each other: one measures the "forced moves", the other the "quiet moves".
I see what you mean; my statement implied that a higher number of equal moves correlates with a bigger gap between the two best moves. I should have used more careful wording. A high number of equal moves and a small difference between the first and second engine suggestion both occur primarily in quiet strategical positions, and are hence quite strongly correlated in the sense that accurate play is easy to maintain.
It shows that both factors complement each other nicely: two moves within 0.30 is far below average, while the difference of 0.28 is about average. Sometimes in highly tactical positions there are two or three moves evaluated as equal, so relying on the difference alone would be misleading.
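As an illustration of how the two factors could be measured together, here is a hypothetical helper (the function name and the 30-centipawn threshold are my own choices; it assumes MultiPV scores in centipawns, sorted best-first):

```python
def position_metrics(multipv_cp, threshold=30):
    """Return (number of moves within `threshold` centipawns of the best
    move, gap in centipawns between the best and second-best move) for a
    list of MultiPV scores sorted best-first."""
    best = multipv_cp[0]
    near_equal = sum(1 for cp in multipv_cp if best - cp <= threshold)
    gap = best - multipv_cp[1] if len(multipv_cp) > 1 else None
    return near_equal, gap

# A quiet position: several near-equal moves and a tiny gap on top.
print(position_metrics([35, 30, 21, 10, -50]))   # (4, 5)
# A tactical position: a single good move and a huge gap.
print(position_metrics([120, -80, -95]))         # (1, 200)
```

The two numbers can disagree (e.g. two equal moves far above all the rest), which is exactly why using both is more informative than the gap alone.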
petero2 wrote:
I think it would be interesting to test whether engines such as Junior, Gaviota, Texel, Rhetoric and Cheng, whose evaluation functions have been tuned to match the logistic curve, are more consistent between middlegame and endgame.
In principle they should have the same centipawn to expected score relation in all phases of the game, since that is what the tuning tries to achieve. In practice though that may not be the case for a number of reasons:
1. Fortress positions are more common in the end game, and if the evaluation function is not able to detect such positions, there is nothing the tuning can do to fix it.
2. The tuning typically adjusts weights to make the evaluation function be more consistent between game stages, but since the engine typically searches deeper in the endgame, the search scores could be inconsistent anyway.
I tested Texel 1.05 a bit; the database consists of only 200 games at 10 minutes + 3 seconds, and the 1 SD errors in expected scores are about 2%.
I don't quite see stability from opening/middlegame to endgame.
Thank you.
So a calibrated evaluation function does not necessarily mean that the result returned by search is calibrated.
It seems like it should be possible to improve an engine by making the search scores more consistent between middlegame and endgame. Otherwise there is a risk that the engine, when ahead, chooses to trade down into a position that has a higher centipawn score but a lower expected game score.
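For concreteness, the calibration being discussed is the logistic mapping from a centipawn score to an expected game score. A minimal sketch (the scale constant k = 1.13 is an assumed illustrative value, not any particular engine's fitted constant):

```python
import math

def expected_score(cp, k=1.13):
    """Logistic centipawn-to-expected-score mapping into [0, 1].
    The scale constant k is what evaluation tuning effectively fits."""
    return 1.0 / (1.0 + 10.0 ** (-k * cp / 400.0))

print(expected_score(0))               # 0.5: an equal position scores 50%
print(round(expected_score(100), 3))   # somewhere around 0.65 with this k
```

If the search returned scores consistent with one such mapping in all game phases, trading down could never raise the centipawn score while lowering the expected score.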
Do you have a PGN database from cutechess-cli of Texel self-games of similar strength at blitz TC? In thousands of games. Thanks.
I noticed a while ago that Komodo didn't score as well in the endgame as in the middlegame for a given positive eval. We tried a couple of ways to eliminate this difference, but the resulting version was weaker than the original. I don't really know why, but one theory is that the deeper into the game you are, the more likely it is that the score has persisted for a while and cannot easily be improved. I'm not sure what to do about this. Comments welcome.
Do you have a PGN database from cutechess-cli of Texel self-games of similar strength at blitz TC? In thousands of games. Thanks.
I don't have a lot of blitz games, but I do have lots of hyper-bullet (1s + 0.08s/move) games. Here are 37100 such games: http://dl.dropboxusercontent.com/u/8968 ... s105a32.xz. These games were played under the same conditions as I use when tuning the evaluation function.
If you want, I can play some games at a longer time control. Just specify the time control and the number of games you want.
Maybe you could use the logistic function with different parameters depending on how far the game has progressed (of course not with respect to move number, but perhaps with respect to material left) to modify your score so that it better reflects the probability of winning the game.
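A sketch of that suggestion, with all constants being illustrative assumptions (the two scale values and the material maximum are not fitted): interpolate the logistic scale by material left, then map the calibrated winning probability back onto the centipawn scale of the full-board mapping, so scores stay comparable throughout the tree.

```python
import math

def win_prob(cp, k):
    """Logistic centipawn -> expected-score mapping with scale k."""
    return 1.0 / (1.0 + 10.0 ** (-k * cp / 400.0))

def adjusted_score(cp, material, k_mid=1.2, k_end=0.8, max_material=78):
    """Rescale a raw score using a material-dependent logistic so the same
    adjusted score means the same winning probability in every phase.
    `material` is total material left in pawns (78 = full board);
    k_mid, k_end and max_material are illustrative, not fitted values."""
    phase = material / max_material              # 1.0 = full board
    k = k_end + (k_mid - k_end) * phase          # interpolated scale
    p = win_prob(cp, k)                          # calibrated win probability
    # Invert the full-board mapping to express p in centipawns again.
    return -400.0 / k_mid * math.log10(1.0 / p - 1.0)

# The same raw +150 is worth less, probability-wise, with little material:
print(round(adjusted_score(150, material=20)))   # 113
print(round(adjusted_score(150, material=78)))   # 150
```

Since the logistic and its inverse cancel exactly, this particular form reduces to rescaling cp by k/k_mid; the point is only that the search then compares all scores on one calibrated scale. As reported above, a correction of this kind actually made Komodo weaker when tried, so this is a sketch of the mechanism rather than a recommendation.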
Good luck!
Well, the problem is that modifying the score based on material does make Komodo better reflect winning probabilities, but the revised version loses to the normal one. This is what we don't understand.
I guess the problem you have is that, when adjusting the scores to better reflect the winning probabilities, you will not search as deep into the simplified positions, which are much easier to resolve to a win, draw or loss, since the search is guided by the evaluation.
What I want to say is that I think you should search a lot deeper in the simplified parts of the tree, since it is cheaper, but that you should not play the moves going into the simplified parts unless you are pretty sure they are good.
It sounds to me like you're just running into the second purpose of an evaluation function: progressing the game. At some point, preferably not one imposed by the proximity of the 50-move rule, when it may be too late to make the right decision, you have to accept an ending that is closer to drawn than the middlegame you're currently in. As such, the evaluation of those endings needs to be slightly higher than that of the middlegames they come from.