Question: Why is the Sicilian misevaluated by engines?

Larry Kaufman wrote:
That seems like a good theory. The question now arises: can I have my cake and eat it too? In other words, is there a way to steer the search towards such promising areas without resorting to artificially high mobility scores? The problem is that programs which use such unrealistic scores are not very useful for opening analysis by humans, because the evaluations are just way out of line with results in actual play (whether human or engine play) from positions where one side has much more mobility but worse structure. Humans tend to prefer (and score better from) the positions with the better structure but worse mobility.

Tord Romstad wrote:
Now to the question of why such high mobility scores work: I don't know, and honestly I wasn't even aware that our mobility scores were unreasonably high (I'm not a chess player). If they are, here's a possible explanation of why these high bonuses work so well in practice:
Except in positions with a single long, completely forced line, the quality of the last few moves of a long PV usually isn't high. The position at the end of the PV will never appear on the board. When the program has an advantage in mobility at the position at the end of the PV, however, it probably also has an advantage in mobility at most positions close to the end position in the search tree. This means that in the last few nodes along the PV, where the PV moves are probably not good, the program will probably have many reasonable alternative moves and the opponent considerably fewer. High mobility scores therefore steer the search towards areas of the tree where there is a good chance to find unexpected resources for the program, and not for the opponent.
Maximizing the chance of pleasant surprises towards the end of the PV while minimizing the chance of unpleasant surprises seems like a good idea, in general.
Because mobility is overevaluated. I see two related reasons for that.
This is the first reason:
Mobility is good, because if you are mobile, you have a better chance of finding an alternative when your principal variation turns out to be wrong.
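To make this concrete, here is a minimal sketch of what such a mobility term can look like in code. The bonus table and all the numbers are invented for illustration, not taken from any particular engine:

    #include <cstdint>

    using Bitboard = uint64_t;

    // Attack bitboard of a knight on square sq (0..63, a1 = 0).
    Bitboard knight_attacks(int sq) {
        Bitboard b = 0;
        int r = sq / 8, f = sq % 8;
        const int dr[8] = {1, 1, -1, -1, 2, 2, -2, -2};
        const int df[8] = {2, -2, 2, -2, 1, -1, 1, -1};
        for (int i = 0; i < 8; ++i) {
            int nr = r + dr[i], nf = f + df[i];
            if (nr >= 0 && nr < 8 && nf >= 0 && nf < 8)
                b |= 1ULL << (nr * 8 + nf);
        }
        return b;
    }

    // Hypothetical centipawn bonus indexed by the number of reachable
    // squares. The shape (steep punishment at the bottom, flattening
    // at the top) is typical; the exact values are made up here.
    const int KnightMobilityBonus[9] = {-30, -15, -5, 0, 8, 15, 21, 26, 30};

    int knight_mobility_score(int sq, Bitboard own_pieces) {
        Bitboard reachable = knight_attacks(sq) & ~own_pieces;
        return KnightMobilityBonus[__builtin_popcountll(reachable)];  // GCC/Clang builtin
    }

The debate above is essentially about how steep such a table should be: tuned by self-play it comes out steeper than human practice would suggest.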
I like the way two different threads came to similar conclusions:
Entirely random evaluation steers the search toward positions with many choices. That is a naive form of mobility. And with that alone, random evaluation gives a performance that is not completely weak.
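The mathematics behind this is simple: negamax over random leaf scores backs up the maximum of N draws at each node, and the expected maximum grows with N. A toy simulation (assuming uniform random scores) shows it:

    #include <algorithm>
    #include <cstdio>
    #include <random>

    int main() {
        std::mt19937 rng(42);
        std::uniform_real_distribution<double> eval(0.0, 1.0);
        const int trials = 100000;
        const int move_counts[] = {2, 5, 10, 20, 40};

        for (int n : move_counts) {
            double sum = 0.0;
            for (int t = 0; t < trials; ++t) {
                // One-ply negamax with a random evaluation: the backed-up
                // value is the maximum of n random leaf scores.
                double best = 0.0;
                for (int i = 0; i < n; ++i)
                    best = std::max(best, eval(rng));
                sum += best;
            }
            std::printf("%2d legal moves -> average backed-up score %.3f\n",
                        n, sum / trials);
        }
        return 0;
    }

For uniform scores the average comes out as n/(n+1): a node with 20 legal moves backs up about 0.952 on average, a node with 2 moves only about 0.667. More choices means a higher backed-up value, which is exactly a mobility bonus in disguise.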
This is the second reason:
Finding tactical solutions. Usually, the middlegame is more important than the opening. In the opening, it is reasonable that the side behind in development can be doing quite well, because of some structural compensation. But in the middlegame, when one side is still underdeveloped, that usually means it is desperately defending some weak points. This is a clear sign for the engine to go for the kill. Even if the engine does not understand what the weakness is, it is often sufficient to rely on tactics here. So overall, overevaluating mobility helps in punishing mistakes, and this is more important than performance in openings (or in closed positions).
Humans do not work in this simple way, because they have better heuristics. Overloaded defenders and multiple weaknesses are better indicators of a tactical solution than merely restricted mobility. Also, in the opening phase, humans understand that you need to develop an initiative to actually make something tangible out of better mobility: you have to be able to attack multiple possible targets, and merely moving pieces around achieves next to nothing.
About having the cake and eating it too:
Half of the second reason can be saved by giving the engine bigger and better static knowledge, to mimic human heuristics: something like computing a "speed of activation" for pieces, comparing it to the "time to develop" derived from the speed of enemy threats, and evaluating underdevelopment as very bad only when it is not likely to vanish in a few moves. I think the future lies here.
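As a very rough sketch of what this could look like (every name and number here is hypothetical, only to make the idea concrete):

    #include <vector>

    // Hypothetical piece record: how many tempi this piece needs to
    // reach an active square. In a real engine this would come from
    // cheap path analysis; here it is simply given.
    struct Piece {
        int tempo_to_activate;
    };

    // Underdevelopment is evaluated as very bad only when it cannot be
    // repaired before the opponent's fastest threat lands. The constants
    // (25 and 5 centipawns) are invented for this sketch.
    int underdevelopment_penalty(const std::vector<Piece>& undeveloped,
                                 int opponent_tempo_to_threat) {
        int penalty = 0;
        for (const Piece& p : undeveloped) {
            int lag = p.tempo_to_activate - opponent_tempo_to_threat;
            if (lag > 0)
                penalty += 25 * lag;  // development will arrive too late
            else
                penalty += 5;         // likely to vanish in a few moves
        }
        return penalty;
    }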
Half of the first reason is even harder. It would be nice if, besides the score, we also had something like a measure of danger (of finding something bad along the PV). Humans do that: they choose lines that are safe for them and risky for the opponent. But I am not sure how that dangerousness would be computed or propagated towards the root in an alpha-beta search. It is possible that measuring safety the human way is only feasible in massively parallel environments, and traditional engines will always overevaluate mobility (and king safety, ...) and use some counter-intuitive search tricks to mimic a naive danger sense while maintaining depth of search.
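Purely as speculation, one shape such a propagation could take is backing up a (score, danger) pair through negamax, with danger approximated by how few near-best alternatives a line leaves behind. All definitions and numbers below are assumptions:

    #include <algorithm>
    #include <vector>

    // A node of a toy game tree: leaves carry a static score, inner
    // nodes carry children. This stands in for a real position plus
    // move generator.
    struct Node {
        int static_score = 0;
        std::vector<Node> children;
    };

    // What gets backed up: the usual negamax score plus a 'danger'
    // count that grows at every node where the best move has few
    // near-best alternatives. The 50-centipawn window is an assumption.
    struct Value {
        int score;
        int danger;
    };

    Value search(const Node& n) {
        if (n.children.empty())
            return {n.static_score, 0};

        int best = -1000000;
        int danger_below = 0;
        std::vector<int> scores;
        for (const Node& c : n.children) {
            Value v = search(c);
            int s = -v.score;                       // negamax flip
            if (s > best) { best = s; danger_below = v.danger; }
            scores.push_back(s);
        }
        // Fallbacks: alternatives scoring within 50cp of the best move.
        int fallbacks = (int)std::count_if(scores.begin(), scores.end(),
                            [&](int s) { return s >= best - 50; }) - 1;
        // Fewer fallbacks = more fragile if the PV turns out wrong.
        // (A full version would track danger separately for each side.)
        return {best, danger_below + std::max(0, 3 - fallbacks)};
    }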
And about the evaluation score:
It is anything that makes the engine play well using search. It is only needed to compare two positions, and the result is not completely determined by which position has the better winning probability. Both a 'danger sense' and a 'smell for combinations' are effects that make the best move found by the search better overall, while the reported score remains less reliable because of the overevaluations. Maybe propagating two scores ('active' for search decisions and 'objective' to report) would be nicer, but it is not clear how to tune the latter evaluation.
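A minimal sketch of what such a two-score propagation could look like, assuming the 'active' score carries the inflated terms and only it drives the alpha-beta decisions:

    // The 'active' score (with the inflated mobility term) drives all
    // search decisions; the 'objective' score rides along the chosen
    // line and is what gets reported at the root. Tuning the objective
    // part is exactly the open problem mentioned above.
    struct DualScore {
        int active;     // used for move ordering and alpha-beta cutoffs
        int objective;  // carried along the PV, shown to the user
    };

    DualScore negate(const DualScore& v) { return {-v.active, -v.objective}; }

    // Inside negamax: compare on 'active', keep the winner's 'objective'.
    DualScore pick_best(const DualScore& a, const DualScore& b) {
        return (a.active >= b.active) ? a : b;
    }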