Chessprogams with the most chessknowing

mcostalba · Post by **mcostalba** » Sun Feb 19, 2017 10:30 am

mjlef wrote:Measuring this is pretty hard. Larry and I have discussed this a lot. It is not very hard to make two programs (with full source code, of course) search alike, so we can play them against each other to try and measure the evaluation quality. But values that work at shallow depths do not always work in deeper searches. One example is king safety. The strongest programs whose source code I have seen (or written) have very high values for, say, the ability to check the opponent's king. The values often look crazy high. This works in deep searches but seems bad in shallow searches. So the effect is that a program tuned for a shallow search might look like it has a better eval than one better suited to deep searches.

But anyway, we love trying to measure these things. I can confirm that Komodo's eval is "bigger" (has more terms and does more things) than Stockfish's. I hope it is better, but it is very hard to prove, or even measure.

In SF development we put a lot of effort into removing useless evaluation terms.

If you have seen the patches of the last 2 months, many are what we call "simplifications", which means code removal. This is valuable to us for the long-term maintainability of the code base. For instance, a simplification patch has more relaxed constraints to be considered as passing the tests; we even accept that a simplification could sometimes yield a small Elo decrease. Adding a new evaluation term, instead, has to be proved useful under much stricter statistical constraints. This patch acceptance asymmetry, which we consciously introduced, is a testament to how much more important removing code is to us than adding it.
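
An acceptance asymmetry like the one described above is commonly expressed as a sequential probability ratio test (SPRT) run with different Elo bounds for the two kinds of patch. The sketch below is an illustration under stated assumptions, not fishtest's actual code: the post does not name the test used, the bounds `[0, 5]` and `[-3, 1]` are hypothetical, the match result is made up, and the log-likelihood ratio uses a simple normal approximation.

```python
import math


def expected_score(elo: float) -> float:
    """Expected score per game for a given Elo advantage (logistic model)."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))


def sprt(wins: int, draws: int, losses: int,
         elo0: float, elo1: float,
         alpha: float = 0.05, beta: float = 0.05):
    """Test H0 (patch is worth elo0) against H1 (patch is worth elo1).

    Returns (verdict, llr), where verdict is "pass", "fail" or "continue".
    """
    n = wins + draws + losses
    if n == 0:
        return "continue", 0.0
    mean = (wins + 0.5 * draws) / n
    var = (wins + 0.25 * draws) / n - mean * mean  # per-game score variance
    if var <= 0.0:
        return "continue", 0.0
    s0, s1 = expected_score(elo0), expected_score(elo1)
    # Normal approximation of the log-likelihood ratio of H1 versus H0.
    llr = n * (s1 - s0) * (2.0 * mean - s0 - s1) / (2.0 * var)
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    if llr >= upper:
        return "pass", llr
    if llr <= lower:
        return "fail", llr
    return "continue", llr


# Made-up result of an exactly neutral patch: 50% score over 60,000 games.
result = (15000, 30000, 15000)  # wins, draws, losses
# Hypothetical bounds for a new term: it has to show a gain.
gain_verdict, _ = sprt(*result, elo0=0.0, elo1=5.0)
# Hypothetical bounds for a simplification: it must not lose much.
simpl_verdict, _ = sprt(*result, elo0=-3.0, elo1=1.0)
print(gain_verdict, simpl_verdict)
```

With these made-up numbers the same neutral result fails the gain test but passes the simplification test, which is the asymmetry in a nutshell.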

The ability to test single changes with hundreds of thousands of games is the enabling technology that allows us to test simplifications, and it is a recent possibility for us (mainly since we have had the fishtest framework, for a few years now). In the past, once you added a new evaluation term you were more or less doomed to live with it for the foreseeable future. This is because proving that a term is almost neutral is much harder, and requires many more games, than proving that a term is good.
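
A rough calculation shows why small effects need so many games. The sketch below assumes two equally strong engines, a draw ratio of 50% (an assumed figure, not one from the post), independent games and a normal approximation; the function names are illustrative.

```python
import math

# Slope of the Elo curve at a 50% score: Elo points per unit of score.
ELO_PER_SCORE = 1600.0 / math.log(10.0)


def elo_margin(games: int, draw_ratio: float = 0.5, z: float = 1.96) -> float:
    """Approximate 95% error margin, in Elo, of a match between equal engines."""
    variance = (1.0 - draw_ratio) / 4.0  # per-game score variance
    standard_error = math.sqrt(variance / games)
    return z * standard_error * ELO_PER_SCORE


def games_needed(margin_elo: float, draw_ratio: float = 0.5, z: float = 1.96) -> int:
    """Games needed to shrink the error margin to margin_elo."""
    variance = (1.0 - draw_ratio) / 4.0
    return math.ceil(variance * (z * ELO_PER_SCORE / margin_elo) ** 2)


for margin in (5.0, 2.0, 1.0):
    print(f"+/-{margin} Elo: about {games_needed(margin)} games")
```

Under these assumptions, pinning a result down to about 1 Elo takes a couple of hundred thousand games, while 5 Elo needs fewer than ten thousand.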

Personally I think that testing for neutral simplifications is one of the newest and most powerful advancements in chess engine testing technology, and the key to avoiding rewriting the engine (or important parts of it) from scratch every 10 years.

mjlef · Post by **mjlef** » Sun Feb 19, 2017 4:28 pm

mcostalba wrote:
mjlef wrote:Measuring this is pretty hard. Larry and I have discussed this a lot. It is not very hard to make two programs (with full source code, of course) search alike, so we can play them against each other to try and measure the evaluation quality. But values that work at shallow depths do not always also work in deeper searches. One example is king safety. The strongest programs I have seen source code (or written) have very high values for say the ability to check the opponent's king. The values often look crazy high. This works in deep searches but seem bad at shallow searches. So the effect is if a program is tuned for a shallow search it might look like it has a better eval than one better suited for deep searches.

But anyway, we love trying to measure these things. I can confirm that Komodo's eval is "bigger" (has more terms and does more things) that Stockfish. I hope it is better, but it is very hard to prove, or even measure.
In SF development we put a lot of efforts in removing useless evaluation terms.

If you have seen the patches of the last 2 months, many are what we call "simplifications" that it means code removal. This is valuable to us for long term maintainability of the code base, for instance a simplification patch has more relaxed constrains to be considered passed at tests, we even accept that sometime a simplification could yield a small ELO decrease. Instead adding a new evaluation term has to be proved useful with much stricter statistical constraints. This patch acceptance asymmetry, that we consciously introduced, is a testament to the importance for us of removing code more than to add it.

The possibility to test single changes with hundreds of thousands of games is the enabling technology that allows to test simplifications and is a recent possibility for us (mainly since when we have fishtest framework, few years ago). In the past, once you added a new evaluation term you were more or less doomed to live with it for all the foreseeable future. This is because to prove for a term is almost neutral it is much harder and requires much more games than to prove a term is good.

Personally I think that testing for neutral simplifications is one of the new and most powerful advancement in chess engine testing technology and the key to avoid rewriting the engine (or important parts of it) from scratch every 10 years.

I am all for simplification as long as it is Elo neutral. We certainly do remove terms in Komodo when we prove they do not help. We use similar tests with similar error margins whether we are trying to prove something helps or not. Even things that only help a small amount are rejected unless they pass some reasonable error margin. But we do not have your fantastic testing framework and millions of hours of donated computer time, so that limits what we can test (to improve or simplify).

MikeB · Post by **MikeB** » Sun Feb 19, 2017 9:45 pm

Vinvin wrote:
pkumar wrote:
1 6k1/8/6PP/3B1K2/8/2b5/8/8 b - - 0 1
2 8/8/r5kP/6P1/1R3K2/8/8/8 w - - 0 1
3 7k/R7/7P/6K1/8/8/2b5/8 w - - 0 1
4 8/8/5k2/8/8/4qBB1/6K1/8 w - - 0 1
5 8/8/8/3K4/8/4Q3/2p5/1k6 w - - 0 1
6 8/8/4nn2/4k3/8/Q4K2/8/8 w - - 0 1
7 8/k7/p7/Pr6/K1Q5/8/8/8 w - - 0 1
8 k7/p4R2/P7/1K6/8/6b1/8/8 w - - 0 1
Nice draw positions for fooling engines! Are there some more?
Sure

8/8/8/8/2b1k3/3R4/3RK3/8 w - - 0 1

8/3k4/8/8/P2B4/P2K4/P7/8 w - - 0 1
And more here : https://en.wikipedia.org/wiki/Fortress_(chess)

I guess you both need to go back to the drawing board, since with 6-man EGTBs they are all evaluated correctly by SF.
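
This remark rests on every position quoted above having at most six men, which is what a 6-man endgame tablebase covers. A quick way to check that, as a sketch that only counts pieces in the FEN strings and does not probe any tablebase (the helper name is illustrative):

```python
def count_men(fen: str) -> int:
    """Count the pieces (kings included) in the piece-placement field of a FEN."""
    placement = fen.split()[0]
    return sum(1 for ch in placement if ch.isalpha())


# The ten positions quoted in this post.
positions = [
    "6k1/8/6PP/3B1K2/8/2b5/8/8 b - - 0 1",
    "8/8/r5kP/6P1/1R3K2/8/8/8 w - - 0 1",
    "7k/R7/7P/6K1/8/8/2b5/8 w - - 0 1",
    "8/8/5k2/8/8/4qBB1/6K1/8 w - - 0 1",
    "8/8/8/3K4/8/4Q3/2p5/1k6 w - - 0 1",
    "8/8/4nn2/4k3/8/Q4K2/8/8 w - - 0 1",
    "8/k7/p7/Pr6/K1Q5/8/8/8 w - - 0 1",
    "k7/p4R2/P7/1K6/8/6b1/8/8 w - - 0 1",
    "8/8/8/8/2b1k3/3R4/3RK3/8 w - - 0 1",
    "8/3k4/8/8/P2B4/P2K4/P7/8 w - - 0 1",
]

counts = [count_men(fen) for fen in positions]
print(counts)
print(all(n <= 6 for n in counts))
```

Every position has six men or fewer, so an engine with 6-man tablebases can look the result up instead of evaluating it.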

corres · Post by **corres** » Mon Feb 20, 2017 9:08 pm

Uri Blass wrote:
mjlef wrote:Measuring this is pretty hard. Larry and I have discussed this a lot. It is not very hard to make two programs (with full source code, of course) search alike, so we can play them against each other to try and measure the evaluation quality. But values that work at shallow depths do not always work in deeper searches. One example is king safety. The strongest programs whose source code I have seen (or written) have very high values for, say, the ability to check the opponent's king. The values often look crazy high. This works in deep searches but seems bad in shallow searches. So the effect is that a program tuned for a shallow search might look like it has a better eval than one better suited to deep searches.

But anyway, we love trying to measure these things. I can confirm that Komodo's eval is "bigger" (has more terms and does more things) than Stockfish's. I hope it is better, but it is very hard to prove, or even measure.

I think that the way to decide which evaluation is better should be evaluation contests based on fixed search rules, testing both evaluations with the same number of nodes.

The question is how to define the fixed search rules.

The evaluation should be able to compare positions at different depths (otherwise a bonus for the side to move would give nothing), so obviously alpha-beta with no extensions and no pruning is not relevant here.

I suggest alpha-beta with random reduction.
At every node you reduce 1 ply with a probability of 50%.

I suggest no qsearch, because I think that a good evaluation should also be good at evaluating positions with many captures without qsearch.

I also suggest a rule that the engine has to search at least 1,000,000 positions per second on some known hardware from every position (you can decide on a different number, but the idea is not to allow too much work in the evaluation, because by definition doing much work is the job of the search).

The aim is to prevent the engine from searching many lines in the qsearch and claiming that this heavy qsearch is part of the evaluation function.
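
The search rules proposed in the quote above can be sketched as a plain negamax alpha-beta with no quiescence search and a coin-flip reduction at every node. This is only one reading of the proposal: the node-count and speed rules are left out, and the subtraction game used to exercise it (`nim_moves`, `nim_eval`) is a stand-in chosen for the example, not chess.

```python
import random
from typing import Callable, List


def search(state, depth: int, alpha: float, beta: float,
           moves: Callable, evaluate: Callable,
           rng: random.Random, reduce_prob: float = 0.5) -> float:
    """Negamax alpha-beta with no qsearch and no extensions.

    At every node the remaining depth is cut by one extra ply with
    probability reduce_prob, so leaves end up at different depths.
    """
    children = moves(state)
    if depth <= 0 or not children:
        return evaluate(state)  # static evaluation only, no qsearch
    if rng.random() < reduce_prob:
        depth -= 1  # random one-ply reduction
    best = -float("inf")
    for child in children:
        score = -search(child, depth - 1, -beta, -alpha,
                        moves, evaluate, rng, reduce_prob)
        best = max(best, score)
        alpha = max(alpha, score)
        if alpha >= beta:
            break  # beta cutoff
    return best


# Stand-in game: take 1 to 3 stones, whoever takes the last stone wins.
def nim_moves(pile: int) -> List[int]:
    return [pile - take for take in (1, 2, 3) if take <= pile]


def nim_eval(pile: int) -> float:
    # Score from the side to move's point of view: an empty pile means
    # the opponent took the last stone; anything else is "unknown".
    return -1.0 if pile == 0 else 0.0


inf = float("inf")
rng = random.Random(0)
print(search(5, 20, -inf, inf, nim_moves, nim_eval, rng))
print(search(4, 20, -inf, inf, nim_moves, nim_eval, rng))
```

To compare two evaluation functions under such rules, both would be plugged in as `evaluate` with the same `search` and the same node budget.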

I think the quality of the static evaluation was very important in the old days, when the maximum search depth was very low. This was the time of Chess Genius, Mephistos, Novags, Rebel, early Fritz, early Hiarcs, early Shredder, etc.
Nowadays, thanks to the huge power of CPUs, the search of top engines may cover the whole middlegame. Apart from the straightforward variations, there may be many combinations of engine parameters that result in the same strength for a given engine. This is the circumstance that makes it possible to simplify the evaluation function.

leavenfish · Post by **leavenfish** » Tue Feb 21, 2017 3:15 am

A TOTAL non-programmer observation:

Can't one infer how important chess knowledge may be from how many ply it takes an engine to 'decide' on a move and generally stick to it?

I wonder because when using Stockfish 8 and Komodo 8 to analyse a position, Komodo often settles on a move at say 25 ply when Stockfish might take 31 ply.

Granted...it might take Komodo a hair longer to actually get to 25 ply than it does SF to get to 31 ply...
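
One way to make this observation measurable is to record the best move an engine reports at each completed depth and find the depth after which it never changes. A minimal sketch, assuming the (depth, best move) pairs have already been collected from the engine's output; the sample list is invented for illustration and is not real engine output.

```python
from typing import List, Tuple


def settle_depth(history: List[Tuple[int, str]]) -> int:
    """Return the shallowest depth from which the best move never changes again."""
    if not history:
        raise ValueError("history must not be empty")
    final_move = history[-1][1]
    settled = history[-1][0]
    # Walk back from the deepest iteration while the move stays the same.
    for depth, move in reversed(history):
        if move != final_move:
            break
        settled = depth
    return settled


# Invented example: (depth, best move) per completed iteration.
sample = [(1, "e2e4"), (2, "d2d4"), (3, "e2e4"),
          (4, "g1f3"), (5, "g1f3"), (6, "g1f3")]
print(settle_depth(sample))
```

Since a ply does not mean the same thing in every engine, the settle depths of two engines are only comparable alongside the time each took to reach them.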

Ferdy · Post by **Ferdy** » Tue Feb 21, 2017 4:13 am

leavenfish wrote:A TOTAL non-programmer observation:

Can't one infer how important chess knowledge may be by how many ply it takes an engine to 'decided' and generally stick to it?

I wonder because when using Stockfish 8 and Komodo 8 to analyse a position, Komodo often settles on a move at say 25 ply when Stockfish might take 31 ply.

Granted...it might take Komodo a hair longer to actually get to 25 ply than it does SF to get to 31 ply...

That is certainly possible, but the meaning of a ply is different for every engine, as you pointed out for K and S.

I still think the best way to compare is by time limit. Removing the time limit and getting the static eval directly is also fine. However, to get a reasonable score from the engine (as an engine is designed around both search and eval), it is better to give it a little bit of time.

corres · Post by **corres** » Tue Feb 21, 2017 10:55 am

leavenfish wrote:A TOTAL non-programmer observation:

Can't one infer how important chess knowledge may be by how many ply it takes an engine to 'decided' and generally stick to it?

I wonder because when using Stockfish 8 and Komodo 8 to analyse a position, Komodo often settles on a move at say 25 ply when Stockfish might take 31 ply.

Granted...it might take Komodo a hair longer to actually get to 25 ply than it does SF to get to 31 ply...

That is the behavior you would expect from a very selective program like Stockfish and a moderately selective one like Komodo.
Users and programmers of chess programs are interested only in the position their program occupies on the rating lists, not in the real chess knowledge of the engine.

leavenfish · Post by **leavenfish** » Sat Feb 25, 2017 5:51 pm

MikeB wrote:
Cardoso wrote:....
Anyway an engine is the sum of eval + search, and only that sum can produce a program that can actually play chess at a high level.
I used to consider the eval as probably the most important part of an engine.
Turns out I was wrong, time proved (at least to me) that engines that prune like hell and have light evals can be really strong.

best regards,
Alvaro
+1 good comment
search is key, eval is secondary, it's probably the 80/20 rule - Stockfish is where it is because of search.cpp, not because of evaluate.cpp. I have played around with both a lot and I speak from my experience.

This is where the non-programmer (like me) probably has a mental block. I mean, search obviously is important - if you are missing 'good' (better?) lines in the search, the evals might be misleading when it comes to the given position at hand - but conversely, if the evals are less than optimal and the search is more precise, the engine might be guided down a road that is not optimal.

So, I can see where the balance is important.

But from a strictly 'judgement from a given end position in the search' standpoint... it would seem that people (who purchase these products) would favor an engine which searches to a set depth and then uses the greater 'knowledge' it has to evaluate the... say pawn structure, weak squares, open lines/diagonals, king safety, in some descending order of importance given the general look of the board at that end point.

That, I think, is why people like myself have always gravitated to Komodo, as maybe its 'static eval' was (in the past at least) a bit more precise than that of other (particularly free) engines.

I mean, if there is no discernible (important) difference in rating between the top engines in 'engine vs engine' play (and I do not see any), the person actually purchasing an engine needs to have SOMETHING to argue for purchasing a given commercial engine.

That is why I have purchased Komodo 3 times over the years...

Uri Blass · Post by **Uri Blass** » Sat Feb 25, 2017 10:16 pm

corres wrote:
Uri Blass wrote:
mjlef wrote:Measuring this is pretty hard. Larry and I have discussed this a lot. It is not very hard to make two programs (with full source code, of course) search alike, so we can play them against each other to try and measure the evaluation quality. But values that work at shallow depths do not always also work in deeper searches. One example is king safety. The strongest programs I have seen source code (or written) have very high values for say the ability to check the opponent's king. The values often look crazy high. This works in deep searches but seem bad at shallow searches. So the effect is if a program is tuned for a shallow search it might look like it has a better eval than one better suited for deep searches.

But anyway, we love trying to measure these things. I can confirm that Komodo's eval is "bigger" (has more terms and does more things) that Stockfish. I hope it is better, but it is very hard to prove, or even measure.
I think that the way to try to decide which evaluation is better should be by evaluation contests based on a fixed search rules to test both evaluations with the same number of nodes.

The question is how to define the fixed search rules.

Evaluation should be able to compare between positions at different depths(otherwise bonus for the side to move is going to give nothing) so obviously alpha beta with no extensions and no pruning is not relevant here.

I suggest alpha beta with random reduction.
At every node you reduce 1 ply with probability of 50%.

I suggest no qsearch because I think that a good evaluation should be good also at evaluating positions with many captures without qsearch.

I suggest also to have a rule that the engine has to search at least 1,000,000 positions per second in some known hardware from every position(you can decide about a different number but the idea is not to allow doing too much work in the evaluation because by definition doing much work is the job of the search).

The target is to prevent the engine to search many lines in the qsearch and claim that this heavy qsearch is part of the evaluation function.
I think the quality of the static evaluation was very important that old days when the maximum of search depth was very low. This was the time of Chess Genius, Mephistos, Novags, Rebel, early Fritz, early Hiarcs, early Shredder, etc.
Nowadays due to the huge power of CPUs the search of top engines may comprehend the whole middle game. Disregarding the straightforward variations might be a lot of combinations of the engine parameters what
results the same power of a given engine. This is that circumstance what give possibility to simplify the function of evaluation.

I disagree.
I do not think that the quality of the static evaluation was higher in the old days.

Chess Genius was basically a stupid program that was a preprocessor.
I remember that I analyzed positions with Chess Genius, and after trading queens the evaluation suddenly changed, and not because of a deeper search.

Simplifications in Stockfish are good enough to make it better even at a super-bullet time control of 10+0.1 with 1 CPU, so I do not believe this idea was bad in the old days; people simply did not know what was good or bad at that time.

corres · Post by **corres** » Mon Feb 27, 2017 10:23 am

Uri Blass wrote:
corres wrote:
I think the quality of the static evaluation was very important in the old days, when the maximum search depth was very low. This was the time of Chess Genius, Mephistos, Novags, Rebel, early Fritz, early Hiarcs, early Shredder, etc.
Nowadays, thanks to the huge power of CPUs, the search of top engines may cover the whole middlegame. Apart from the straightforward variations, there may be many combinations of engine parameters that result in the same strength for a given engine. This is the circumstance that makes it possible to simplify the evaluation function.

I disagree.
I do not think that the quality of the static evaluation was higher in the old days.

Chess Genius was basically a stupid program that was a preprocessor.
I remember that I analyzed positions with Chess Genius, and after trading queens the evaluation suddenly changed, and not because of a deeper search.

Simplifications in Stockfish are good enough to make it better even at a super-bullet time control of 10+0.1 with 1 CPU, so I do not believe this idea was bad in the old days; people simply did not know what was good or bad at that time.

As you can see above, I did not write that the evaluation of the old chess engines was better than that of the modern ones. At that time engine makers did not use supercomputers or networks of computers to optimize engine parameters. I only stated that engines (like Genius, etc.) running on a very slow computer need a better evaluation function than, for example, Stockfish has. The static evaluation is a prediction of the probable result of the game. Old engines made their prophecy from only a short sequence of moves, so they needed a better evaluation function for a good "prophecy".
The main aims of simplification are more readable source code and some speed gain. A faster engine is stronger not only at short time controls but at long time controls, too.
