In SF development we put a lot of effort into removing useless evaluation terms.
mjlef wrote:
Measuring this is pretty hard. Larry and I have discussed this a lot. It is not very hard to make two programs (with full source code, of course) search alike, so we can play them against each other to try to measure the evaluation quality. But values that work at shallow depths do not always work in deeper searches. One example is king safety. The strongest programs whose source code I have seen (or written) have very high values for, say, the ability to check the opponent's king. The values often look crazy high. This works in deep searches but seems bad in shallow searches. So the effect is that a program tuned for shallow search might look like it has a better eval than one better suited to deep searches.
But anyway, we love trying to measure these things. I can confirm that Komodo's eval is "bigger" (has more terms and does more things) than Stockfish's. I hope it is better, but it is very hard to prove, or even measure.
If you have seen the patches of the last 2 months, many are what we call "simplifications", which means code removal. This is valuable to us for the long-term maintainability of the code base. For instance, a simplification patch has more relaxed constraints to pass testing, and we even accept that a simplification could sometimes yield a small Elo decrease. Adding a new evaluation term, by contrast, has to be proved useful under much stricter statistical constraints. This asymmetry in patch acceptance, which we introduced consciously, is a testament to how much more important it is for us to remove code than to add it.
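To make the acceptance asymmetry concrete, here is a minimal Python sketch of a sequential probability ratio test (SPRT) applied with two different pairs of Elo bounds. It is an illustration only, not the actual test code: the function names, the normal approximation of the log-likelihood ratio, and the bound values in NEW_TERM_BOUNDS and SIMPLIFICATION_BOUNDS are all assumptions chosen for the example.

```python
import math

# Hypothetical Elo bounds (elo0, elo1), chosen only for illustration.
NEW_TERM_BOUNDS = (0.0, 5.0)         # a new term must show a real gain
SIMPLIFICATION_BOUNDS = (-3.0, 1.0)  # a simplification only must not lose much


def elo_to_score(elo):
    """Expected score (0..1) for a given Elo difference (logistic model)."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))


def sprt_llr(wins, draws, losses, elo0, elo1):
    """Log-likelihood ratio of H1 (elo1) vs H0 (elo0), normal approximation."""
    n = wins + draws + losses
    if n == 0:
        return 0.0
    mean = (wins + 0.5 * draws) / n
    # Per-game variance of the score (win = 1, draw = 0.5, loss = 0).
    var = (wins * (1.0 - mean) ** 2
           + draws * (0.5 - mean) ** 2
           + losses * (0.0 - mean) ** 2) / n
    if var == 0.0:
        return 0.0
    s0, s1 = elo_to_score(elo0), elo_to_score(elo1)
    return n * (s1 - s0) * (2.0 * mean - s0 - s1) / (2.0 * var)


def sprt_decision(wins, draws, losses, elo0, elo1, alpha=0.05, beta=0.05):
    """Return 'accept', 'reject' or 'continue' for the patch under test."""
    llr = sprt_llr(wins, draws, losses, elo0, elo1)
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    if llr >= upper:
        return "accept"
    if llr <= lower:
        return "reject"
    return "continue"


# The same dead-even result, judged under the two sets of bounds.
result = (15000, 30000, 15000)  # wins, draws, losses
as_simplification = sprt_decision(*result, *SIMPLIFICATION_BOUNDS)
as_new_term = sprt_decision(*result, *NEW_TERM_BOUNDS)
```

With these assumed bounds, a dead-even match result passes as a simplification but fails as a new evaluation term, which is exactly the asymmetry described above.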
Being able to test single changes with hundreds of thousands of games is the enabling technology that lets us test simplifications, and it is a recent possibility for us (mainly since we got the fishtest framework, a few years ago). In the past, once you added a new evaluation term you were more or less doomed to live with it for the foreseeable future. This is because proving that a term is almost neutral is much harder, and requires many more games, than proving that a term is good.
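A rough back-of-the-envelope sketch of why neutrality is so expensive to show: the number of games needed grows with the inverse square of the Elo difference you want to resolve. The Python below is only an estimate under stated assumptions, namely a fixed-length test between roughly equal engines, a draw ratio of 60%, and a 95% confidence level; the function name games_needed and all parameter values are hypothetical.

```python
import math


def elo_to_score(elo):
    """Expected score (0..1) for a given Elo difference (logistic model)."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))


def games_needed(elo_resolution, draw_ratio=0.6, z=1.96):
    """Approximate games needed to resolve a given Elo difference.

    Assumes two roughly equal engines, so the per-game score variance
    is (1 - draw_ratio) / 4, and a one-sample normal approximation.
    """
    sigma = math.sqrt((1.0 - draw_ratio) / 4.0)
    delta_score = elo_to_score(elo_resolution) - 0.5
    return math.ceil((z * sigma / delta_score) ** 2)


# Showing a clear gain versus pinning a term down as "almost neutral".
games_for_5_elo = games_needed(5.0)
games_for_1_elo = games_needed(1.0)
cost_ratio = games_for_1_elo / games_for_5_elo
```

Under these assumptions, resolving a 1 Elo difference takes about 25 times as many games as resolving a 5 Elo difference, since the cost scales with the square of the ratio of the two resolutions.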
Personally I think that testing for neutral simplifications is one of the newest and most powerful advances in chess engine testing technology, and the key to avoiding a rewrite of the engine (or important parts of it) from scratch every 10 years.