In the days when I knew even less about chess programming than I do now, I used to think that writing a good evaluation function was simply about identifying pieces of knowledge that my program lacked, finding good ways to quantify the missing knowledge, choosing good weights, and implementing it all without bugs.
But over the years, it slowly became clear to me that there must be something more to writing a good evaluation function. A very frustrating experience repeated itself over and over: I noticed some obvious evaluation weakness in my program, and attempted to fix it (using more or less the procedure described above). Initially, the new evaluation term usually seemed to work: The previously misevaluated positions were understood better, and I saw plenty of games where the new knowledge helped my program to win. Nevertheless, the results in long matches were very often worse with the new evaluation term implemented, even when there was no significant slowdown, and no implementation bugs.
I think there are two main reasons for this phenomenon. The first one is that whenever you add a new term to the evaluation function, you have to carefully analyse how it interacts with all the previously existing evaluation terms. It is possible that some of the existing terms should be rewritten, re-tuned, or removed entirely.
This is best understood by an example: Assume that you have written a basic chess engine where the evaluation consists of material and piece-square tables only. To this program, you add a single new evaluation term: a small bonus for rooks on open files. By making this seemingly minor and innocent change, you are actually doing something potentially very dangerous: You are increasing the average value of a rook over the set of all possible chess positions. This has some probably undesired side effects. The program will, for instance, be slightly more inclined than before to exchange two minor pieces for a rook and a pawn. In order to prevent this, you should either give the rook a small penalty for being on a closed file, or (equivalently) slightly lower the base material value of the rook.
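To make the interaction concrete, here is a minimal sketch in C++. All the numbers (a 500 cp rook, a 20 cp open-file bonus, rooks standing on open files roughly half the time) are made-up assumptions for illustration, not values taken from any real engine:

```cpp
#include <cassert>

// Made-up values in centipawns, for illustration only.
constexpr int ROOK_BASE            = 500;
constexpr int ROOK_OPEN_FILE_BONUS = 20;

// Naive version: the bonus silently raises the rook's average value
// over all positions, distorting material trade-offs elsewhere.
int rook_value_naive(bool on_open_file) {
    return ROOK_BASE + (on_open_file ? ROOK_OPEN_FILE_BONUS : 0);
}

// Adjusted version: lower the base value by half the bonus, so that if
// rooks stand on open files about half the time (an assumed figure),
// the average rook value stays close to the original 500.
int rook_value_adjusted(bool on_open_file) {
    return (ROOK_BASE - ROOK_OPEN_FILE_BONUS / 2)
         + (on_open_file ? ROOK_OPEN_FILE_BONUS : 0);
}
```

Equivalently, one could keep the base value at 500 and score open files as +10 and closed files as -10; the resulting evaluations are identical.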
Similar effects occur in almost every case when you add a new evaluation term, but usually the interactions between the new and old evaluation terms are much more subtle and hard to spot than in the simple example above. Identifying and fixing the potential problems is generally very difficult.
The second reason is that evaluation rules which are correct only in the vast majority of chess positions (as opposed to all chess positions) can very often make the program play worse in practice. The point is that a chess player's strength is mainly determined by the quality of his/her/its worst moves, and not by the best moves. Therefore, the price of misevaluating 1% of the nodes will sometimes more than outweigh the benefit of evaluating the remaining 99% of the nodes more precisely. This is particularly obvious for evaluation terms which can take very large absolute values, but the same thing can happen even with small bonuses and penalties.
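A back-of-the-envelope calculation (with made-up numbers) shows why averages are misleading here: a term can reduce the mean evaluation error and still hurt playing strength, because the rare large errors are the ones that lose games.

```cpp
#include <cassert>

// Made-up illustration: suppose a new term sharpens the evaluation by
// 5 cp in 99% of positions, but introduces a 300 cp error in the
// remaining 1%.
constexpr double mean_gain = 0.99 * 5.0;    // 4.95 cp average improvement
constexpr double mean_cost = 0.01 * 300.0;  // 3.00 cp average damage

// The term improves the *average* error (gain > cost), yet a single
// 300 cp misevaluation can throw away a game outright -- and playing
// strength is governed by the worst moves, not the average move.
```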
In summary, an evaluation function is not just the sum of its parts. You can't just heap lots of correct (in isolation) pieces of chess knowledge on top of each other and expect to end up with a good evaluation function. You need patience, restraint, thorough testing, and a sound basic philosophy to succeed.
The following are, in my opinion, the most important properties of a good evaluation function:
- Orthogonality. When it is possible, it is better to avoid having two different evaluation components which to some extent quantify the same aspect of the position. When adding a new evaluation component which has a non-zero "orthogonal projection" (in a metaphorical sense, of course) onto a previously existing component, try to adjust the two components in such a way that the projection is minimized, or to generalize and combine the two components into one.
- Continuity. If two positions X and Y are "close to each other", in the sense that it is possible to get from position X to position Y by a short sequence of good moves, the two positions should ideally have similar evaluations. As a corollary, when one adds a big bonus or penalty for some particular pattern, one should also consider introducing a smaller bonus or penalty for getting close to that pattern. For instance, when adding a big bonus for a knight on an outpost square, it might be a good idea to add a smaller bonus for a knight attacking an outpost square.
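The outpost example might look something like this in code; the weights and the detection predicates are hypothetical:

```cpp
#include <cassert>

// Hypothetical weights in centipawns.
constexpr int OUTPOST_BONUS         = 30;
constexpr int ATTACKS_OUTPOST_BONUS = 10;  // smaller reward for being "close"

// Score a knight: full bonus on the outpost, partial bonus when one good
// move away from it, so the evaluation changes gradually rather than
// jumping as the knight approaches.
int knight_outpost_term(bool on_outpost, bool attacks_outpost) {
    if (on_outpost)      return OUTPOST_BONUS;
    if (attacks_outpost) return ATTACKS_OUTPOST_BONUS;
    return 0;
}
```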
- Sense of progress. It is much more important that the evaluation function is able to accurately judge which of two very similar positions is better, than that it is able to judge which of two totally different positions is better. The evaluation function doesn't need to be able to answer questions like whether a certain classical King's Indian middlegame is better than an endgame arising from a Richter-Rauzer Sicilian. What it needs to be able to decide is things like whether one side should try to exchange a bishop for a knight, or whether it is better to castle kingside or queenside.
- Good worst-case behavior. It is better to be wrong by 10 centipawns all the time than to be completely correct 99.9% of the time and wrong by 300 centipawns the remaining 0.1% of the time.
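As a concrete (and entirely hypothetical) illustration of bounding the worst case, a speculative evaluation term can simply be clamped, trading a little peak accuracy for a bounded maximum error:

```cpp
#include <algorithm>
#include <cassert>

// One possible safeguard, sketched as an assumption rather than a rule:
// clamp a speculative term so its worst-case contribution is bounded.
// (std::clamp requires C++17.)
int capped_term(int raw_term, int cap) {
    return std::clamp(raw_term, -cap, cap);
}
```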
Stan Arts wrote:(A good example is Fruit. (I think Fruit's evaluation is not amazing like some claim, simply because my engine can knock of plenty of games at fixed depth. If it is amazing it is only amazing as it works so well with it's search.) There are a great deal of programs with more knowledgeable evals, yet Fruit outsearches anything and ends up at the top. It is terribly hard to compensate 1-2 extra ply with evaluation.)
Fruit's evaluation function is actually very good. It is true that there are many programs with more knowledgeable evals, but as explained above, this is not the same as better evals. Fruit's evaluation is founded on a sound philosophy, and has very few bugs. This is far more important than how much knowledge it contains.
BTW: Using fixed-depth search games to compare the quality of evaluation functions isn't very meaningful, but that's a topic for another post (and probably a different day): This post is already far too long.
Tord