lkaufman wrote:
We have a different definition of eval from Vince's. He refers to the eval function, while we are talking about eval in positions where search won't help appreciably. Probably Diep has a better eval function, because it gives up search depth in order to evaluate tactics. We claim that Komodo has the best eval when tactics don't matter, and I don't know of a way to prove this. When tactics do matter, search depth is extremely important, and comparing us to Diep at equal depth has no value.
So a redefined claim which you cannot prove? What purpose does that serve?
Well, it's like claiming to be the best artist or best musician; it is a subjective claim, judged by the opinion of those who see the art or hear the music. So it could be judged by a poll here, if we can trust people to answer only from their own experience and not from what they have read.
Such a poll would be nothing more than a popularity contest and would prove nothing. It won't work for Diep, because no one has it. It won't work for Komodo, because nobody has the source code of its evaluation function (and even with the source code, nobody would really be able to say much without putting in hundreds or thousands of hours of testing).
I think Don has already admitted that his claim lacks an objective basis. An evaluation function cannot be judged independently of the rest of the engine. It is easy to create a very powerful evaluation function by simply letting it do a 10-ply search. What counts between engines is overall strength. Comparing the strength of evaluations makes sense between versions of the same engine, but not so much between entirely different engines.
Don wrote: I have written many times that I BELIEVE we have the best positional program in the world. There is no test that can prove that I am right or wrong. I base this on the fact that we are one of the top 2 programs and yet we are probably only top 10 in tactical problem sets. We must be doing something right.
Apparently you don't understand your own engine.
Being poor in tactics but having a strong engine overall doesn't demonstrate the quality of the evaluation; it's a by-product of the LMR and null-move reductions. Tactics are based on playing non-obvious, apparently unsound moves. If you apply a lot of LMR and null-move reduction, you'll miss tactics; it's as simple as that.
Stockfish is, probably to an even higher degree than Komodo, relatively poor in tactical tests but very good overall, for exactly the same reason.
Instead, I would measure the quality of the evaluation function by performance at a very fast TC. If you take out most of the search, what remains is evaluation.
Robert
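For readers who haven't looked inside an alpha-beta searcher, here is a minimal C sketch of the mechanism Robert describes. It is not Komodo's, Houdini's or Stockfish's actual code; the types, helper functions and constants are assumptions of mine, and move ordering, re-searches and mate handling are left out. The point is simply where null-move and late-move reductions sit: quiet, late-ordered moves get less depth, which is exactly where "non-obvious, apparently unsound" tactical moves live.

Code:
/* A minimal sketch, NOT any engine's actual code. Position, Move and the
   helper functions are assumed to be provided elsewhere by the engine. */
typedef struct Position Position;
typedef int Move;

int  quiescence(Position *pos, int alpha, int beta);
int  generate_moves(Position *pos, Move *moves);
int  is_capture(const Position *pos, Move m);
void make_move(Position *pos, Move m);
void undo_move(Position *pos, Move m);
void make_null_move(Position *pos);
void undo_null_move(Position *pos);
int  in_check(const Position *pos);

enum { MAX_MOVES = 256, NULL_R = 2, LMR_FIRST = 4, LMR_R = 1 };

int search(Position *pos, int depth, int alpha, int beta)
{
    if (depth <= 0)
        return quiescence(pos, alpha, beta);   /* resolve captures only */

    /* Null-move reduction: give the opponent a free move and search to a
       reduced depth; if we still hold beta, cut the node. Deep threats
       hiding behind the "free move" can be missed here. */
    if (!in_check(pos) && depth > NULL_R + 1) {
        make_null_move(pos);
        int score = -search(pos, depth - 1 - NULL_R, -beta, -beta + 1);
        undo_null_move(pos);
        if (score >= beta)
            return beta;
    }

    Move moves[MAX_MOVES];
    int n = generate_moves(pos, moves);
    for (int i = 0; i < n; i++) {
        int quiet = !is_capture(pos, moves[i]);
        make_move(pos, moves[i]);
        int gives_check = in_check(pos);       /* opponent to move now */

        /* LMR: late, quiet, non-checking moves are searched to reduced
           depth (real engines re-search at full depth if the reduced
           search unexpectedly beats alpha). The unlikely-looking move
           that starts a combination is usually ordered late and quiet,
           so this is where tactics are traded for extra nominal depth. */
        int new_depth = depth - 1;
        if (i >= LMR_FIRST && depth >= 3 && quiet && !gives_check)
            new_depth -= LMR_R;

        int score = -search(pos, new_depth, -beta, -alpha);
        undo_move(pos, moves[i]);

        if (score >= beta)
            return beta;                       /* fail high */
        if (score > alpha)
            alpha = score;
    }
    return alpha;
}

Tuning NULL_R, LMR_FIRST and LMR_R more aggressively buys nominal depth at the cost of exactly the tactical shots that test suites measure.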
If we miss tactics but have the same rating as another engine, we must be better at positional play; what else is there? If doing more LMR causes us to miss tactics, it also gains us a bit more depth, which improves evaluations.
As for using a fast TC to measure eval, that is exactly backward. The faster the game, the greater the likelihood that it will be decided by tactics. You are a decent chess player yourself, in the Expert range; you should know this. I myself have a lower blitz rating than standard rating because, as is typical of older players, my positional understanding is good but my tactics are poor compared to other 2400-level players. This should apply to engines as well.
I'm starting to see that Komodo outperforms Houdini when tested with primarily non-tactical openings (like very short books), but when tested with long, sharp opening books Houdini wins. This supports the theory that Houdini is stronger in tactics, Komodo in positional play.
"1 ply" means very different things for different engines, it's exactly what makes the suggested match between Komodo and Diep completely nonsensical.
The same applies to using fixed node counts, playing a "20,000 node" match between different engines would be pointless.
The only sensible approach is equal time.
lkaufman wrote: If we miss tactics but have the same rating as another engine, we must be better at positional play; what else is there? If doing more LMR causes us to miss tactics, it also gains us a bit more depth, which improves evaluations.
By tweaking LMR and NMR I can easily create a Houdini compile that is significantly weaker on tactical tests yet doesn't lose any Elo strength.
In your assessment this would imply that the "tactically weaker" compile has a far better evaluation, when in fact both compiles use exactly the same evaluation function.
The bottom line is that relatively poor performance in tactical test suites gives no indication of the quality of the evaluation function, contrary to what you and Don seem to suggest (then again, that is probably just a marketing ploy...).
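To make that concrete: in a scheme like the sketch above, the "tactically weaker" compile is nothing more than different constants in a reduction formula; the evaluation function is never touched. The formula and numbers below are purely illustrative, not Houdini's actual parameters.

Code:
/* Hypothetical, parameter-driven reduction formula. Raising "aggr" buys
   nominal depth (and may keep Elo) at the cost of missed tactics, while
   the evaluation function stays exactly the same. Illustrative numbers,
   not Houdini's real settings. */
static int lmr_reduction(int depth, int move_number, int aggr)
{
    if (depth < 3 || move_number < 4)
        return 0;                 /* never reduce the first few moves  */
    int r = 1 + aggr;             /* base reduction, tunable           */
    if (move_number > 12)
        r += 1;                   /* reduce very late moves even more  */
    return r;
}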
Rebel wrote:
And another fun experiment: the "best" full static eval. No QS, so one has to evaluate right in the middle of the chaos on the board, with multiple hanging pieces and counterattacks, and checking moves need special code to see that the move is not mate. Cool.
So we have two challenges: the best dynamic eval (with QS) and the best static eval. Now we need participants.
This is just a silly experiment. Probably all the top programs would lose badly to any program that makes an attempt to resolve tactics in some way. I'm not going to devote the next week to fixing up Komodo to win some competition that has nothing to do with real chess skill.
There is a contest you can have right now: just test several programs doing a 1-ply search. I don't know what it would prove, but it's a lot better than trying to design an experiment that requires everyone except Vincent to dumb down their search (presumably to prove that Vincent really does have the best program of all).
Then don't make statements you cannot prove.
New idea:
1. A depth=8 (or 9 or 10) match, as long as it is fast, so that many games can be played.
2. Brute force.
3. No extensions.
4. Standard QS.
So 100% equal search.
Not much work involved; in my case, (2) and (3) are parameter-driven, and for (4) I don't expect more than an hour of work. (A rough sketch of such a stripped-down search follows below.)
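For concreteness, here is a rough sketch of what points (1) to (4) amount to, with assumed types and helper functions (this is nobody's actual code): a fixed-depth, brute-force alpha-beta search with no null move, no LMR, no extensions, and a standard capture-only quiescence search.

Code:
/* Sketch of the "100% equal search" proposal: fixed depth, brute force,
   no extensions, standard QS. Types and helpers are assumed; evaluate()
   is the only engine-specific part. Mate/stalemate handling omitted. */
typedef struct Position Position;
typedef int Move;

int  evaluate(const Position *pos);
int  generate_moves(Position *pos, Move *moves);
int  generate_captures(Position *pos, Move *moves);
void make_move(Position *pos, Move m);
void undo_move(Position *pos, Move m);

enum { MAX_MOVES = 256 };

/* 4. Standard quiescence search: stand pat, then captures only. */
static int qsearch(Position *pos, int alpha, int beta)
{
    int stand_pat = evaluate(pos);
    if (stand_pat >= beta) return beta;
    if (stand_pat > alpha) alpha = stand_pat;

    Move caps[MAX_MOVES];
    int n = generate_captures(pos, caps);
    for (int i = 0; i < n; i++) {
        make_move(pos, caps[i]);
        int score = -qsearch(pos, -beta, -alpha);
        undo_move(pos, caps[i]);
        if (score >= beta) return beta;
        if (score > alpha) alpha = score;
    }
    return alpha;
}

/* 1.-3. Fixed-depth brute-force alpha-beta: every move searched to the
   same depth, no reductions, no extensions, no null move. */
static int brute_force(Position *pos, int depth, int alpha, int beta)
{
    if (depth == 0)
        return qsearch(pos, alpha, beta);

    Move moves[MAX_MOVES];
    int n = generate_moves(pos, moves);
    for (int i = 0; i < n; i++) {
        make_move(pos, moves[i]);
        int score = -brute_force(pos, depth - 1, -beta, -alpha);
        undo_move(pos, moves[i]);
        if (score >= beta) return beta;
        if (score > alpha) alpha = score;
    }
    return alpha;
}

At a fixed depth with no reductions or extensions, the value alpha-beta returns does not depend on move ordering (ordering only changes how many nodes it takes), so what such a match compares is essentially the quiescence search and the evaluation.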
Robert wrote:
lkaufman wrote: If we miss tactics but have the same rating as another engine, we must be better at positional play; what else is there? If doing more LMR causes us to miss tactics, it also gains us a bit more depth, which improves evaluations.
By tweaking LMR and NMR I can easily create a Houdini compile that is significantly weaker on tactical tests yet doesn't lose any Elo strength.
In your assessment this would imply that the "tactically weaker" compile has a far better evaluation, when in fact both compiles use exactly the same evaluation function.
The bottom line is that relatively poor performance in tactical test suites gives no indication of the quality of the evaluation function, contrary to what you and Don seem to suggest (then again, that is probably just a marketing ploy...).
Robert
You are equating "evaluation function" with "quality of evaluation". The tweaked Houdini you describe would presumably search a bit deeper and play better than the normal one in non-tactical positions, and hence could be said to evaluate better than the normal one. No one should care whether better moves are found by code in the eval function or by code in the search function; all that matters is the moves. If you weaken tactics and preserve the rating, you must have improved positional play, which I equate with evaluation quality. Maybe they are not exactly the same, but they are very highly correlated.
Rebel wrote:
Then don't make statements you cannot prove.
New idea:
1. A depth=8 (or 9 or 10) match, as long as it is fast, so that many games can be played.
2. Brute force.
3. No extensions.
4. Standard QS.
So 100% equal search.
Not much work involved; in my case, (2) and (3) are parameter-driven, and for (4) I don't expect more than an hour of work.
The winning program will be the one that does the most tactical work in the eval function, so probably Diep. So what? This tells us nothing about which program finds better moves in non-tactical positions, which is (or should be) the relevant question.