mjlef wrote:Measuring this is pretty hard. Larry and I have discussed this a lot. It is not very hard to make two programs (with full source code, of course) search alike, so we can play them against each other to try to measure the evaluation quality. But values that work at shallow depths do not always work in deeper searches. One example is king safety. The strongest programs whose source code I have seen (or written) have very high values for, say, the ability to check the opponent's king. The values often look crazy high. This works in deep searches but seems bad at shallow searches. So the effect is that a program tuned for a shallow search might look like it has a better eval than one better suited for deep searches.
But anyway, we love trying to measure these things. I can confirm that Komodo's eval is "bigger" (has more terms and does more things) than Stockfish's. I hope it is better, but it is very hard to prove, or even measure.
I think the way to decide which evaluation is better is an evaluation contest based on fixed search rules, testing both evaluations with the same number of nodes.
The question is how to define the fixed search rules.
The evaluation should be able to compare positions at different depths (otherwise a bonus for the side to move gains nothing), so obviously alpha-beta with no extensions and no pruning is not relevant here, because every leaf then sits at the same depth.
I suggest alpha-beta with random reduction.
At every node you reduce the remaining depth by 1 ply with a probability of 50%.
I suggest no qsearch because I think that a good evaluation should also be good at evaluating positions with many captures without qsearch.
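To make the proposed rules concrete, here is a minimal sketch in Python. It is only one possible reading of the idea: the toy take-away game, the placeholder `evaluate`, and the names `children`, `search`, `run` and `p_reduce` are all assumptions made for illustration, not part of any real engine. The leaves call the static evaluation directly (no qsearch), each node loses one extra ply with probability `p_reduce`, and a node counter is kept so two evaluations could be compared at the same number of nodes.

```python
import random

INF = float("inf")

def children(pile):
    """Toy game (assumed for illustration): take 1 to 3 stones from a pile."""
    return [pile - k for k in (1, 2, 3) if pile - k >= 0]

def evaluate(pile):
    """Static score from the side to move's point of view.
    An empty pile means the side to move cannot move and has lost."""
    if pile == 0:
        return -1.0
    return 0.0  # deliberately ignorant placeholder evaluation

def search(pos, depth, alpha, beta, rng, stats, p_reduce=0.5):
    """Negamax alpha-beta with random reduction and no qsearch."""
    stats["nodes"] += 1
    # Random reduction: with probability p_reduce, lose one extra ply here.
    if rng.random() < p_reduce:
        depth -= 1
    moves = children(pos) if depth > 0 else []
    if not moves:
        # Leaf: call the static evaluation directly, no capture search.
        return evaluate(pos)
    best = -INF
    for child in moves:
        score = -search(child, depth - 1, -beta, -alpha, rng, stats, p_reduce)
        best = max(best, score)
        alpha = max(alpha, best)
        if alpha >= beta:
            break  # beta cutoff
    return best

def run(pile, depth, seed, p_reduce=0.5):
    """Search one position; return (score, nodes visited)."""
    stats = {"nodes": 0}
    rng = random.Random(seed)  # seeded so a contest game can be replayed
    score = search(pile, depth, -INF, INF, rng, stats, p_reduce)
    return score, stats["nodes"]
```

With `p_reduce=0.0` this falls back to plain fixed-depth alpha-beta, which is a handy sanity check. Seeding the generator is a design choice of this sketch so that both evaluations can be run against the same sequence of reductions.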
I also suggest a rule that the engine has to search at least 1,000,000 positions per second on some known hardware from every position (you can decide on a different number, but the idea is not to allow too much work in the evaluation, because by definition doing much work is the job of the search).
The goal is to prevent the engine from searching many lines in the qsearch and claiming that this heavy qsearch is part of the evaluation function.
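The speed rule could be checked along these lines. This is only a sketch: the helper names `measure` and `meets_speed_rule` are made up for the example, the 1,000,000 figure is just the number suggested above, and `search_fn` stands for any hypothetical callable that runs a search and returns the number of nodes it visited.

```python
import time

def meets_speed_rule(nodes, seconds, min_nps=1_000_000):
    """True if the measured speed reaches the minimum nodes per second."""
    if seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return nodes / seconds >= min_nps

def measure(search_fn):
    """Time one call of search_fn, which must return its node count.
    Returns (nodes, elapsed seconds)."""
    start = time.perf_counter()
    nodes = search_fn()
    elapsed = time.perf_counter() - start
    return nodes, elapsed
```

For the contest, the check would have to be run on the agreed reference hardware and from every test position, since an evaluation can be fast in quiet positions and slow in tactical ones.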