Houdini wrote:
Don wrote: I have written many times that I BELIEVE we have the best positional program in the world. There is no test that can prove that I am right or wrong. I base this on the fact that we are one of the top 2 programs and yet we are probably only top 10 in tactical problem sets. We must be doing something right.

Apparently you don't understand your own engine. Being poor in tactics but having a strong engine over-all doesn't demonstrate the quality of the evaluation; it's a by-product of the LMR and null-move reductions. Tactics are based on playing non-obvious, apparently unsound moves. If you LMR/NMR much, you'll miss tactics, it's as simple as that. Stockfish is, probably to an even higher degree than Komodo, relatively poor in tactical tests but very good over-all, for exactly the same reason.

Instead I would measure the quality of the evaluation function by the performance at very fast TC. If you take out most of the search, what remains is evaluation.

Robert

Seriously, you don't think we checked that out?

Doesn't that measure how efficiently the code is written? At very fast time controls my program may be better or worse by 20 or 30% just because I use a better or worse data structure. I also don't understand why you don't know this, but at really low search depths a tiny speedup is a big ELO gain, while at great depths it is much less. So if I'm running on a computer that is 5% faster, it may be worth 20 ELO or more at 3 ply but only 5 ELO or less at 20 ply. Comparing two programs at really fast time controls, as you suggest, DIMINISHES the impact of evaluation; it does not ISOLATE the effect of evaluation. You guys are making me feel like I'm a genius when even top people don't understand the basics.
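To put some numbers on the diminishing-returns point, here is a minimal back-of-the-envelope sketch in Python. It assumes only the rough figures above (5% faster being worth about 20 ELO at 3 ply and about 5 ELO at 20 ply) plus the common assumption that ELO grows roughly linearly in the log of speed; none of this is from the original tests.

Code:
import math

def elo_per_doubling(elo_gain: float, speedup: float) -> float:
    """Scale the ELO value of a small speedup up to a full 2x speedup,
    assuming ELO is roughly linear in log(speed)."""
    return elo_gain * math.log(2) / math.log(1.0 + speedup)

print(round(elo_per_doubling(20, 0.05)))  # ~284 ELO per doubling at 3 ply
print(round(elo_per_doubling(5, 0.05)))   # ~71 ELO per doubling at 20 ply

In other words, the same hardware advantage is worth several times more ELO at shallow depths, which is why a fast-time-control comparison rewards raw speed rather than isolating the evaluation.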
Seriously, there is no simple test you can construct that measures what you think you are measuring. This should be painfully obvious to any good scientist or engineer. You should also realize that every program's behavior varies even at 1 ply, and these short depths are so heavily dominated by tactics that the slightest search change can add 50 ELO. You cannot take two different programs, run a fast-time-control test, and use that to declare which program has the better evaluation function.
The only reasonable way to TRY to construct such a test is to have a third party build an old-fashioned brute-force search and have each programmer supply a module that does nothing but provide an evaluation function, in black-box fashion. But I STILL don't think that would prove which evaluation function was truly superior, because the evaluation function affects so many aspects of the search - such as the speed of the search. The only way to remove most of that is to run fixed-depth tests (and even fixed-node tests would be deceptive here). But even if we did that, there is the possibility that a single "tactical" term would have an unusually strong impact on the results - especially at the low depths you believe isolate the most sophisticated evaluation terms. For example, if I were in such a 1-ply contest I would add a heavy penalty against the side NOT on the move whenever one of its pieces was subject to a profitable capture, and this single term would cause my evaluation to blow away everyone else's that didn't have it. It would not prove my evaluation function had all sorts of sophisticated pawn-structure knowledge that someone else's does not have.
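To make that example concrete, here is a minimal sketch of such a one-off "tactical" evaluation term, written with the python-chess library. The piece values, the crude test for a "profitable capture", and the sample position are illustrative assumptions of this sketch only; it is not Komodo's code or anyone's actual term.

Code:
import chess

# Illustrative centipawn values; the king is priced high so it never counts
# as a "cheap" attacker.
VALUE = {chess.PAWN: 100, chess.KNIGHT: 300, chess.BISHOP: 300,
         chess.ROOK: 500, chess.QUEEN: 900, chess.KING: 10000}

def hanging_piece_penalty(board: chess.Board) -> int:
    """Penalty (in centipawns) charged to the side NOT on the move if it has
    a piece that looks capturable at a profit: attacked and either undefended
    or attacked by something cheaper than itself."""
    victim_color = not board.turn          # the side that does not move next
    worst = 0
    for sq in chess.SQUARES:
        piece = board.piece_at(sq)
        if piece is None or piece.color != victim_color:
            continue
        attackers = board.attackers(board.turn, sq)
        if not attackers:
            continue
        defenders = board.attackers(victim_color, sq)
        cheapest_attacker = min(VALUE[board.piece_at(a).piece_type] for a in attackers)
        if not defenders or cheapest_attacker < VALUE[piece.piece_type]:
            worst = max(worst, VALUE[piece.piece_type])
    return worst

# Example position: Black's queen on d5 is attacked by White's knight on c3.
board = chess.Board("rnb1kbnr/ppp1pppp/8/3q4/8/2N5/PPPP1PPP/R1BQKBNR w KQkq - 0 1")
print(hanging_piece_penalty(board))  # 900, charged against the side not on the move

At 1 ply a term like this catches a large share of the simple tactics a deeper search would find anyway, which is exactly why it would distort a low-depth evaluation contest.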
I'm starting to believe that I AM in a Monty Python skit now.

Tell me what you make of the data in the table below. Presumably you are buying into Vincent's argument that Komodo overdoes LMR and that this makes it play really weakly in tactics. So I ran a big round robin of Houdini 1.5 vs Komodo at fixed depths 1 through 6. After looking at the result I can conclude absolutely nothing from it. What it shows is that if you consider DEPTH only, Komodo is much stronger than Houdini 1.5. But we BOTH know that Komodo is not better. So does this data prove that Komodo has better and safer LMR? Does it prove that Komodo has better evaluation? Maybe it shows that Houdini just misses things that Komodo picks up? Honestly, I don't think it proves anything at all, because your concept of running games (at any level) to prove which program has the better evaluation function is broken.
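As an aside, here is a minimal sketch of how a single fixed-depth game between two UCI engines can be played with the python-chess library. The engine paths are hypothetical placeholders, and this is not the harness used for the results below; a full round robin would loop this over the engine/depth pairings and a set of openings.

Code:
import chess
import chess.engine

def play_fixed_depth_game(white_cmd: str, black_cmd: str,
                          white_depth: int, black_depth: int,
                          max_plies: int = 400) -> str:
    """Play one game with each engine searching to a fixed depth.
    Returns the result string ("1-0", "0-1", "1/2-1/2" or "*")."""
    board = chess.Board()
    white = chess.engine.SimpleEngine.popen_uci(white_cmd)
    black = chess.engine.SimpleEngine.popen_uci(black_cmd)
    try:
        while not board.is_game_over(claim_draw=True) and board.ply() < max_plies:
            engine, depth = (white, white_depth) if board.turn == chess.WHITE else (black, black_depth)
            result = engine.play(board, chess.engine.Limit(depth=depth))
            board.push(result.move)
        return board.result(claim_draw=True)
    finally:
        white.quit()
        black.quit()

# Hypothetical binaries -- substitute your own engine paths.
print(play_fixed_depth_game("./komodo", "./houdini15", white_depth=4, black_depth=4))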
So here is what you said:
Instead I would measure the quality of the evaluation function by the performance at very fast TC. If you take out most of the search, what remains is evaluation.
And here is the data from tests at "very fast" time controls. I'm not including the timing data (which shows that your program is faster) because we are not measuring who has the faster program or the faster evaluation function:
Code:
h4 = Houdini 1.5 at 4 ply
k4 = Komodo at 4 ply
Komodo does not play Komodo and Houdini does not play Houdini; otherwise it is a round robin.
Rank Name     Elo     +     -  games  score   oppo.  draws
   1   k6  3000.0  29.2  29.2   1183  87.4%  2480.2   8.5%
   2   h6  2924.7  25.8  25.8   1184  79.9%  2529.6  11.6%
   3   k5  2801.4  24.7  24.7   1183  74.9%  2480.2  11.9%
   4   h5  2776.8  23.7  23.7   1185  69.3%  2529.5  13.8%
   5   k4  2648.2  23.4  23.4   1182  62.8%  2480.3  13.3%
   6   h4  2587.1  23.5  23.5   1184  54.0%  2529.6  12.0%
   7   k3  2486.2  23.2  23.2   1183  49.7%  2480.9  13.3%
   8   h3  2412.0  23.9  23.9   1183  40.4%  2529.5  10.5%
   9   k2  2202.5  24.8  24.8   1189  28.2%  2480.2  10.8%
  10   h2  2188.1  26.4  26.4   1184  24.2%  2529.6   7.5%
  11   k1  2043.3  27.1  27.1   1188  17.7%  2480.6  10.4%
  12   h1  1995.2  29.0  29.0   1188  11.9%  2530.0  12.7%
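As a short aside for reading rating tables like this one (my note, not part of the original data): under the standard logistic model, a rating gap maps to an expected score. The table itself comes from a rating tool whose fit may differ slightly; this is only the textbook formula.

Code:
import math

def expected_score(rating_diff: float) -> float:
    """Expected score for the higher-rated side, given the rating gap in Elo."""
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / 400.0))

# e.g. the ~75-point gap between k6 and h6 corresponds to roughly a 61% expected score.
print(round(expected_score(3000.0 - 2924.7), 3))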
So my conclusion from this data is ... nothing. No conclusion is possible. Houdini 1.5 is slightly stronger than Komodo because it's faster - enough to overcome the per-ply superiority (which is a meaningless measure) - and this data does NOT in any way prove who has the superior evaluation function. I believe Komodo has a superior evaluation function, but I'm not so stupid as to believe this is the definitive test to prove it. I think there is no reasonable test you can construct to prove it either way; "positional strength" is a real concept, but it is way too abstract to give a formal definition for. It's like trying to measure who loves their children more.