
Re: Chessprogams with the most chessknowing

Posted: Sun Feb 19, 2017 12:13 am
by Cardoso
Komodo is said to have one of the best evals.
But Vincent Diepeveen also claimed his program Diep had a very good eval; I remember he even challenged the original Komodo programmer to a one-ply match, which didn't happen.
I remember that in one of the old Fritz releases Frans Morsch claimed Fritz had the most knowledgeable evaluation function of any chess engine at the time. When asked how he maintained speed, he answered "there are pretty smart data structures". I don't know what that means exactly, but probably something like not needing to recompute all eval terms every time the eval is called.
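Something like the sketch below is how I imagine that working: a material + piece-square score kept up to date incrementally as moves are made and unmade, instead of being recomputed from scratch in every eval call. This is purely an illustration with made-up names and values, not anything from Fritz.

[code]
// Purely illustrative: a material + piece-square-table score kept
// incrementally, so make/unmake adjust a running total instead of
// recomputing the whole evaluation each time. Names/values are made up.
enum Piece { PAWN, KNIGHT, BISHOP, ROOK, QUEEN, KING, PIECE_NB };

constexpr int PieceValue[PIECE_NB] = { 100, 320, 330, 500, 900, 0 };
int Pst[PIECE_NB][64] = {};   // piece-square bonuses, filled in elsewhere

struct IncrementalEval {
    int score = 0;   // running material+PST score, White's point of view

    // Called from make_move when a piece lands on a square.
    void add(Piece pc, int sq, bool white) {
        int d = PieceValue[pc] + Pst[pc][sq];
        score += white ? d : -d;
    }
    // Called from make/unmake when a piece leaves a square (moves away or is captured).
    void remove(Piece pc, int sq, bool white) {
        int d = PieceValue[pc] + Pst[pc][sq];
        score -= white ? d : -d;
    }
    // The expensive, slowly changing terms (pawn structure, king safety, ...)
    // can still be cached in hash tables; only the cheap terms live here.
    int material_pst() const { return score; }
};
[/code]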
Anyway, an engine is the sum of eval + search, and only that sum can produce a program that can actually play chess at a high level.
I used to consider the eval probably the most important part of an engine.
It turns out I was wrong: time proved (at least to me) that engines that prune like hell and have light evals can be really strong.

best regards,
Alvaro

Re: Chessprogams with the most chessknowing

Posted: Sun Feb 19, 2017 12:45 am
by mjlef
Measuring this is pretty hard. Larry and I have discussed it a lot. It is not very hard to make two programs (with full source code, of course) search alike, so we can play them against each other to try to measure evaluation quality. But values that work at shallow depths do not always work in deeper searches as well. One example is king safety. The strongest programs whose source code I have seen (or have written) have very high values for, say, the ability to check the opponent's king. The values often look crazy high. This works in deep searches but seems bad at shallow searches. So the effect is that if a program is tuned for a shallow search, it might look like it has a better eval than one better suited for deep searches.

But anyway, we love trying to measure these things. I can confirm that Komodo's eval is "bigger" (has more terms and does more things) than Stockfish's. I hope it is better, but it is very hard to prove, or even measure.

Re: Chessprogams with the most chessknowing

Posted: Sun Feb 19, 2017 5:14 am
by MikeB
Cardoso wrote:....
Anyway, an engine is the sum of eval + search, and only that sum can produce a program that can actually play chess at a high level.
I used to consider the eval probably the most important part of an engine.
It turns out I was wrong: time proved (at least to me) that engines that prune like hell and have light evals can be really strong.

best regards,
Alvaro
+1 good comment
Search is key, eval is secondary; it's probably the 80/20 rule. Stockfish is where it is because of search.cpp, not because of evaluate.cpp. I have played around with both a lot and I speak from my experience.

Re: Chessprogams with the most chessknowing

Posted: Sun Feb 19, 2017 6:50 am
by Uri Blass
mjlef wrote:Measuring this is pretty hard. Larry and I have discussed it a lot. It is not very hard to make two programs (with full source code, of course) search alike, so we can play them against each other to try to measure evaluation quality. But values that work at shallow depths do not always work in deeper searches as well. One example is king safety. The strongest programs whose source code I have seen (or have written) have very high values for, say, the ability to check the opponent's king. The values often look crazy high. This works in deep searches but seems bad at shallow searches. So the effect is that if a program is tuned for a shallow search, it might look like it has a better eval than one better suited for deep searches.

But anyway, we love trying to measure these things. I can confirm that Komodo's eval is "bigger" (has more terms and does more things) than Stockfish's. I hope it is better, but it is very hard to prove, or even measure.
I think the way to decide which evaluation is better should be an evaluation contest based on fixed search rules, testing both evaluations with the same number of nodes.

The question is how to define the fixed search rules.

The evaluation should be able to compare positions at different depths (otherwise a bonus for the side to move is going to give nothing), so obviously alpha-beta with no extensions and no pruning is not relevant here.

I suggest alpha-beta with random reductions.
At every node you reduce by 1 ply with probability 50%.

I suggest no qsearch, because I think a good evaluation should also be good at evaluating positions with many captures, without a qsearch.

I also suggest a rule that the engine has to search at least 1,000,000 positions per second on some known hardware from every position (you can decide on a different number, but the idea is not to allow doing too much work in the evaluation, because by definition doing much work is the job of the search).

The goal is to prevent the engine from searching many lines in the qsearch and then claiming that this heavy qsearch is part of the evaluation function.
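A rough sketch of what I mean is below: plain alpha-beta, static eval at the leaves instead of a qsearch, and a 50% chance of one extra ply of reduction at every node. Position, Move and evaluate() are placeholders for whatever engine is being tested, not code from any real engine.

[code]
// Sketch of the proposed fixed search for an evaluation contest:
// plain negamax alpha-beta, no extensions, no pruning beyond the
// alpha-beta cutoff, no qsearch, and at every node a 50% chance of
// reducing the remaining depth by one extra ply.
// Position, Move and evaluate() stand in for the engine under test;
// evaluate() is assumed to score from the side to move's point of view.
#include <algorithm>
#include <random>
#include <vector>

struct Move {};
struct Position {
    std::vector<Move> legal_moves() const;   // supplied by the tested engine
    void make(const Move&);
    void unmake(const Move&);
    bool game_over() const;
};
int evaluate(const Position&);               // the evaluation being compared

std::mt19937 rng(12345);                     // fixed seed: both evals see the
std::bernoulli_distribution reduce(0.5);     // same sequence of reductions

int search(Position& pos, int depth, int alpha, int beta) {
    if (depth <= 0 || pos.game_over())
        return evaluate(pos);                // leaf: static eval, no qsearch

    int d = depth - 1 - (reduce(rng) ? 1 : 0);   // random 1-ply reduction

    for (const Move& m : pos.legal_moves()) {
        pos.make(m);
        int score = -search(pos, d, -beta, -alpha);
        pos.unmake(m);
        alpha = std::max(alpha, score);
        if (alpha >= beta)
            break;                           // the only pruning allowed
    }
    return alpha;
}
[/code]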

Re: Chessprogams with the most chessknowing

Posted: Sun Feb 19, 2017 10:30 am
by mcostalba
mjlef wrote:Measuring this is pretty hard. Larry and I have discussed it a lot. It is not very hard to make two programs (with full source code, of course) search alike, so we can play them against each other to try to measure evaluation quality. But values that work at shallow depths do not always work in deeper searches as well. One example is king safety. The strongest programs whose source code I have seen (or have written) have very high values for, say, the ability to check the opponent's king. The values often look crazy high. This works in deep searches but seems bad at shallow searches. So the effect is that if a program is tuned for a shallow search, it might look like it has a better eval than one better suited for deep searches.

But anyway, we love trying to measure these things. I can confirm that Komodo's eval is "bigger" (has more terms and does more things) than Stockfish's. I hope it is better, but it is very hard to prove, or even measure.
In SF development we put a lot of effort into removing useless evaluation terms.

If you have looked at the patches of the last two months, many are what we call "simplifications", which means code removal. This is valuable to us for the long-term maintainability of the code base: for instance, a simplification patch has more relaxed constraints for being considered as passing its tests, and we even accept that a simplification may sometimes yield a small Elo decrease. Adding a new evaluation term, instead, has to be proved useful under much stricter statistical constraints. This patch acceptance asymmetry, which we consciously introduced, is a testament to how much more important removing code is to us than adding it.

The possibility of testing single changes with hundreds of thousands of games is the enabling technology that allows us to test simplifications, and it is a recent possibility for us (mainly since we got the fishtest framework a few years ago). In the past, once you added a new evaluation term you were more or less doomed to live with it for the foreseeable future, because proving that a term is almost neutral is much harder and requires many more games than proving that a term is good.

Personally, I think that testing for neutral simplifications is one of the newest and most powerful advances in chess engine testing technology, and the key to avoiding a rewrite of the engine (or of important parts of it) from scratch every 10 years.
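For readers who have not seen how such an asymmetry can be expressed in practice, here is a rough sketch of a sequential test (SPRT) over win/draw/loss counts in the general style fishtest uses; the Elo bounds in it are made up for the example and are not the actual fishtest parameters.

[code]
// Sketch of an SPRT over game results using the usual Gaussian
// approximation of the win/draw/loss model. A "simplification" and an
// "Elo-gaining" patch differ only in the (elo0, elo1) hypotheses they
// are tested against; the bounds used in main() are illustrative only.
#include <cmath>
#include <cstdio>

double elo_to_score(double elo) {            // expected score for an Elo edge
    return 1.0 / (1.0 + std::pow(10.0, -elo / 400.0));
}

// Log-likelihood ratio of H1 (elo = elo1) against H0 (elo = elo0),
// given wins/draws/losses of the patch versus the base version.
double llr(double w, double d, double l, double elo0, double elo1) {
    double n   = w + d + l;
    double x   = (w + 0.5 * d) / n;          // mean score per game
    double x2  = (w + 0.25 * d) / n;         // mean squared score per game
    double var = x2 - x * x;                 // per-game variance
    double s0 = elo_to_score(elo0), s1 = elo_to_score(elo1);
    return (s1 - s0) * (2.0 * x - s0 - s1) * n / (2.0 * var);
}

int main() {
    double alpha = 0.05, beta = 0.05;              // error probabilities
    double lower = std::log(beta / (1.0 - alpha)); // stop & reject below this
    double upper = std::log((1.0 - beta) / alpha); // stop & accept above this

    double W = 8500, D = 15400, L = 8300;          // hypothetical running totals

    // A new eval term must show a gain, e.g. test elo0 = 0 vs elo1 = 4.
    double llr_gainer = llr(W, D, L, 0.0, 4.0);
    // A simplification only has to avoid a loss, e.g. elo0 = -4 vs elo1 = 0.
    double llr_simpl  = llr(W, D, L, -4.0, 0.0);

    std::printf("bounds [%.2f, %.2f]\n", lower, upper);
    std::printf("gainer LLR %.2f, simplification LLR %.2f\n",
                llr_gainer, llr_simpl);
}
[/code]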

Re: Chessprogams with the most chessknowing

Posted: Sun Feb 19, 2017 4:28 pm
by mjlef
mcostalba wrote:
mjlef wrote:Measuring this is pretty hard. Larry and I have discussed it a lot. It is not very hard to make two programs (with full source code, of course) search alike, so we can play them against each other to try to measure evaluation quality. But values that work at shallow depths do not always work in deeper searches as well. One example is king safety. The strongest programs whose source code I have seen (or have written) have very high values for, say, the ability to check the opponent's king. The values often look crazy high. This works in deep searches but seems bad at shallow searches. So the effect is that if a program is tuned for a shallow search, it might look like it has a better eval than one better suited for deep searches.

But anyway, we love trying to measure these things. I can confirm that Komodo's eval is "bigger" (has more terms and does more things) than Stockfish's. I hope it is better, but it is very hard to prove, or even measure.
In SF development we put a lot of effort into removing useless evaluation terms.

If you have looked at the patches of the last two months, many are what we call "simplifications", which means code removal. This is valuable to us for the long-term maintainability of the code base: for instance, a simplification patch has more relaxed constraints for being considered as passing its tests, and we even accept that a simplification may sometimes yield a small Elo decrease. Adding a new evaluation term, instead, has to be proved useful under much stricter statistical constraints. This patch acceptance asymmetry, which we consciously introduced, is a testament to how much more important removing code is to us than adding it.

The possibility of testing single changes with hundreds of thousands of games is the enabling technology that allows us to test simplifications, and it is a recent possibility for us (mainly since we got the fishtest framework a few years ago). In the past, once you added a new evaluation term you were more or less doomed to live with it for the foreseeable future, because proving that a term is almost neutral is much harder and requires many more games than proving that a term is good.

Personally, I think that testing for neutral simplifications is one of the newest and most powerful advances in chess engine testing technology, and the key to avoiding a rewrite of the engine (or of important parts of it) from scratch every 10 years.
I am all for simplification as long as it is Elo neutral. We certainly do remove terms in Komodo when we prove they do not help. We use similar tests with similar error margins whether we are trying to prove something helps or not. Even things that only help a small amount are rejected unless they pass some reasonable error margin. But we do not have your fantastic testing framework and millions of hours of donated computer time, so that limits what we can test (to improve or to simplify).

Re: Chessprogams with the most chessknowing

Posted: Sun Feb 19, 2017 9:45 pm
by MikeB
Vinvin wrote:
pkumar wrote:
1 6k1/8/6PP/3B1K2/8/2b5/8/8 b - - 0 1
2 8/8/r5kP/6P1/1R3K2/8/8/8 w - - 0 1
3 7k/R7/7P/6K1/8/8/2b5/8 w - - 0 1
4 8/8/5k2/8/8/4qBB1/6K1/8 w - - 0 1
5 8/8/8/3K4/8/4Q3/2p5/1k6 w - - 0 1
6 8/8/4nn2/4k3/8/Q4K2/8/8 w - - 0 1
7 8/k7/p7/Pr6/K1Q5/8/8/8 w - - 0 1
8 k7/p4R2/P7/1K6/8/6b1/8/8 w - - 0 1
Nice draw positions for fooling engines! Are there some more?
Sure :-)

[d]8/8/8/8/2b1k3/3R4/3RK3/8 w - - 0 1

[d]8/3k4/8/8/P2B4/P2K4/P7/8 w - - 0 1
And more here: https://en.wikipedia.org/wiki/Fortress_(chess)
I guess you both need to go back to the drawing board, since with 6-man EGTBs they are all evaluated correctly by SF.

Re: Chessprogams with the most chessknowing

Posted: Mon Feb 20, 2017 9:08 pm
by corres
[quote="Uri Blass"][quote="mjlef"]Measuring this is pretty hard. Larry and I have discussed this a lot. It is not very hard to make two programs (with full source code, of course) search alike, so we can play them against each other to try and measure the evaluation quality. But values that work at shallow depths do not always also work in deeper searches. One example is king safety. The strongest programs I have seen source code (or written) have very high values for say the ability to check the opponent's king. The values often look crazy high. This works in deep searches but seem bad at shallow searches. So the effect is if a program is tuned for a shallow search it might look like it has a better eval than one better suited for deep searches.

But anyway, we love trying to measure these things. I can confirm that Komodo's eval is "bigger" (has more terms and does more things) that Stockfish. I hope it is better, but it is very hard to prove, or even measure.[/quote]

I think that the way to try to decide which evaluation is better should be by evaluation contests based on a fixed search rules to test both evaluations with the same number of nodes.

The question is how to define the fixed search rules.

Evaluation should be able to compare between positions at different depths(otherwise bonus for the side to move is going to give nothing) so obviously alpha beta with no extensions and no pruning is not relevant here.

I suggest alpha beta with random reduction.
At every node you reduce 1 ply with probability of 50%.

I suggest no qsearch because I think that a good evaluation should be good also at evaluating positions with many captures without qsearch.

I suggest also to have a rule that the engine has to search at least 1,000,000 positions per second in some known hardware from every position(you can decide about a different number but the idea is not to allow doing too much work in the evaluation because by definition doing much work is the job of the search).

The target is to prevent the engine to search many lines in the qsearch and claim that this heavy qsearch is part of the evaluation function.[/quote]

I think the quality of the static evaluation was very important in the old days, when the maximum search depth was very low. That was the time of Chess Genius, the Mephistos, the Novags, Rebel, early Fritz, early Hiarcs, early Shredder, etc.
Nowadays, thanks to the huge power of CPUs, the search of top engines may span the whole middlegame. Setting aside the forced variations, there can be many combinations of engine parameters that result in the same strength for a given engine. This is the circumstance that makes it possible to simplify the evaluation function.

Re: Chessprogams with the most chessknowing

Posted: Tue Feb 21, 2017 3:15 am
by leavenfish
A TOTAL non-programmer observation:

Can't one infer how important chess knowledge is by how many ply it takes an engine to 'decide' on a move and generally stick to it?

I wonder because, when using Stockfish 8 and Komodo 8 to analyse a position, Komodo often settles on a move at, say, 25 ply when Stockfish might take 31 ply.

Granted... it might take Komodo a hair longer to actually get to 25 ply than it takes SF to get to 31 ply...

Re: Chessprogams with the most chessknowing

Posted: Tue Feb 21, 2017 4:13 am
by Ferdy
leavenfish wrote:A TOTAL non-programmer observation:

Can't one infer how important chess knowledge is by how many ply it takes an engine to 'decide' on a move and generally stick to it?

I wonder because, when using Stockfish 8 and Komodo 8 to analyse a position, Komodo often settles on a move at, say, 25 ply when Stockfish might take 31 ply.

Granted... it might take Komodo a hair longer to actually get to 25 ply than it takes SF to get to 31 ply...
That is certainly possible, but the meaning of a ply is different for every engine, as you pointed out for K and S.

I still think the best way to compare is by time limit. Removing the time limit and getting the static eval directly is also fine. However, to get a reasonable score from the engine (as an engine is designed around search plus eval), it is better to allow a little bit of time.
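As a rough illustration of the "static eval directly" idea, something like the sketch below pipes one position into an engine binary and prints whatever it reports. It assumes a ./stockfish executable that understands "position fen ..." followed by its non-UCI "eval" command, so adjust the path and commands for other engines.

[code]
// Rough sketch: ask an engine binary for its static evaluation of one
// position by piping commands to it and echoing its output. Assumes a
// ./stockfish binary accepting "position fen ..." and the non-standard
// "eval" command; paths and commands may need adjusting for other engines.
#include <cstdio>
#include <string>

int main() {
    std::string fen =
        "r1bqkbnr/pppp1ppp/2n5/4p3/4P3/5N2/PPPP1PPP/RNBQKB1R w KQkq - 2 3";
    std::string cmd =
        "printf 'position fen " + fen + "\\neval\\nquit\\n' | ./stockfish";

    FILE* pipe = popen(cmd.c_str(), "r");    // POSIX: read the engine's stdout
    if (!pipe) { std::perror("popen"); return 1; }

    char line[512];
    while (std::fgets(line, sizeof line, pipe))
        std::fputs(line, stdout);            // print the eval report as-is

    return pclose(pipe);
}
[/code]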