Houdini wrote:
Don wrote: I have written many times that I BELIEVE we have the best positional program in the world. There is no test that can prove that I am right or wrong. I base this on the fact that we are one of the top 2 programs and yet we are probably only top 10 in tactical problem sets. We must be doing something right.

Apparently you don't understand your own engine. Being poor in tactics but having a strong engine over-all doesn't demonstrate the quality of the evaluation; it's a by-product of the LMR and null-move reductions. Tactics are based on playing non-obvious, apparently unsound moves. If you LMR/NMR much, you'll miss tactics, it's as simple as that. Stockfish is, probably to an even higher degree than Komodo, relatively poor in tactical tests but very good over-all, for exactly the same reason.

Instead I would measure the quality of the evaluation function by the performance at very fast TC. If you take out most of the search, what remains is evaluation.

Robert

Seriously, you don't think we checked that out?

Doesn't that measure how efficiently the code is written? At very fast time controls my program may be better or worse by 20 or 30% just because I use a better or worse data structure. I also don't understand why you don't know this, but at really low search depths a tiny speedup is a big ELO gain, while at great depths it is much less. So if I'm running on a computer that is 5% faster, it may be worth 20 ELO or more at 3 ply but only 5 ELO or less at 20 ply. Comparing two programs at really fast time controls, as you suggest, DIMINISHES the impact of evaluation; it does not ISOLATE the effect of evaluation. You guys are making me feel like I'm a genius when even top people don't understand the basics.
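To put some numbers on the diminishing-returns point, here is a minimal back-of-the-envelope sketch in Python. It assumes only the rough figures above (5% faster being worth about 20 ELO at 3 ply and about 5 ELO at 20 ply) plus the common assumption that ELO grows roughly linearly in the log of speed; none of this is from the original tests.

Code:
import math

def elo_per_doubling(elo_gain: float, speedup: float) -> float:
    """Scale the ELO value of a small speedup up to a full 2x speedup,
    assuming ELO is roughly linear in log(speed)."""
    return elo_gain * math.log(2) / math.log(1.0 + speedup)

print(round(elo_per_doubling(20, 0.05)))  # ~284 ELO per doubling at 3 ply
print(round(elo_per_doubling(5, 0.05)))   # ~71 ELO per doubling at 20 ply

In other words, the same hardware advantage is worth several times more ELO at shallow depths, which is why a fast-time-control comparison rewards raw speed rather than isolating the evaluation.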
Seriously, there is no simple test you can construct that measures what you think you are measuring. This should be painfully obvious to any good scientist or engineer. You should also realize that every program's behavior varies even at 1 ply, and these short depths are so heavily dominated by tactics that the slightest search change can add 50 ELO. You cannot take two different programs, run a fast-time-control test, and use that to declare which program has the better evaluation function.
The only reasonable way to TRY to construct such a test is to have a third party build an old-fashioned brute-force search and have each programmer supply a module that does nothing but provide an evaluation function, in black-box fashion. But I STILL don't think that would prove which evaluation function was truly superior, because the evaluation function affects so many aspects of the search - such as the speed of the search. The only way to remove most of that is to run fixed-depth tests (and even fixed-node tests would be deceptive here). But even if we did that, there is the possibility that a single "tactical" term would have an unusually strong impact on the results - especially at the low depths you believe isolate the most sophisticated evaluation terms. For example, if I were in such a 1-ply contest I would add a heavy penalty against the side NOT on the move whenever one of its pieces was subject to a profitable capture, and this single term would cause my evaluation to blow away everyone else's that didn't have it. It would not prove my evaluation function had all sorts of sophisticated pawn-structure knowledge that someone else's does not have.
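To make that example concrete, here is a minimal sketch of such a one-off "tactical" evaluation term, written with the python-chess library. The piece values, the crude test for a "profitable capture", and the sample position are illustrative assumptions of this sketch only; it is not Komodo's code or anyone's actual term.

Code:
import chess

# Illustrative centipawn values; the king is priced high so it never counts
# as a "cheap" attacker.
VALUE = {chess.PAWN: 100, chess.KNIGHT: 300, chess.BISHOP: 300,
         chess.ROOK: 500, chess.QUEEN: 900, chess.KING: 10000}

def hanging_piece_penalty(board: chess.Board) -> int:
    """Penalty (in centipawns) charged to the side NOT on the move if it has
    a piece that looks capturable at a profit: attacked and either undefended
    or attacked by something cheaper than itself."""
    victim_color = not board.turn          # the side that does not move next
    worst = 0
    for sq in chess.SQUARES:
        piece = board.piece_at(sq)
        if piece is None or piece.color != victim_color:
            continue
        attackers = board.attackers(board.turn, sq)
        if not attackers:
            continue
        defenders = board.attackers(victim_color, sq)
        cheapest_attacker = min(VALUE[board.piece_at(a).piece_type] for a in attackers)
        if not defenders or cheapest_attacker < VALUE[piece.piece_type]:
            worst = max(worst, VALUE[piece.piece_type])
    return worst

# Example position: Black's queen on d5 is attacked by White's knight on c3.
board = chess.Board("rnb1kbnr/ppp1pppp/8/3q4/8/2N5/PPPP1PPP/R1BQKBNR w KQkq - 0 1")
print(hanging_piece_penalty(board))  # 900, charged against the side not on the move

At 1 ply a term like this catches a large share of the simple tactics a deeper search would find anyway, which is exactly why it would distort a low-depth evaluation contest.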
I'm starting to believe that I AM in a Monty Python skit now.

Tell me what you make of the data in the table below. Presumably you are buying into Vincent's argument that Komodo overdoes LMR and that this makes it play really weakly in tactics. So I ran a big round robin of Houdini 1.5 vs Komodo at fixed depths 1 through 6. After looking at the result I can conclude absolutely nothing from it. What it shows is that if you consider DEPTH only, Komodo is much stronger than Houdini 1.5. But we BOTH know that Komodo is not better. So does this data prove that Komodo has better and safer LMR? Does it prove that Komodo has better evaluation? Maybe it shows that Houdini just misses things that Komodo picks up? Honestly, I don't think it proves anything at all, because your concept of running games (at any level) to prove which program has the better evaluation function is broken.
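As an aside, here is a minimal sketch of how a single fixed-depth game between two UCI engines can be played with the python-chess library. The engine paths are hypothetical placeholders, and this is not the harness used for the results below; a full round robin would loop this over the engine/depth pairings and a set of openings.

Code:
import chess
import chess.engine

def play_fixed_depth_game(white_cmd: str, black_cmd: str,
                          white_depth: int, black_depth: int,
                          max_plies: int = 400) -> str:
    """Play one game with each engine searching to a fixed depth.
    Returns the result string ("1-0", "0-1", "1/2-1/2" or "*")."""
    board = chess.Board()
    white = chess.engine.SimpleEngine.popen_uci(white_cmd)
    black = chess.engine.SimpleEngine.popen_uci(black_cmd)
    try:
        while not board.is_game_over(claim_draw=True) and board.ply() < max_plies:
            engine, depth = (white, white_depth) if board.turn == chess.WHITE else (black, black_depth)
            result = engine.play(board, chess.engine.Limit(depth=depth))
            board.push(result.move)
        return board.result(claim_draw=True)
    finally:
        white.quit()
        black.quit()

# Hypothetical binaries -- substitute your own engine paths.
print(play_fixed_depth_game("./komodo", "./houdini15", white_depth=4, black_depth=4))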
So here is what you said:
Instead I would measure the quality of the evaluation function by the performance at very fast TC. If you take out most of the search, what remains is evaluation.
And here is the data from tests at "very fast" time controls. I'm not including the timing data (which shows that your program is faster) because we are not measuring who has the faster program or the faster evaluation function:
Code:
h4 = Houdini 1.5 at 4 ply
k4 = Komodo at 4 ply
Komodo does not play Komodo and Houdini does not play Houdini; otherwise it is a round robin.
Rank Name     Elo     +     -  games  score   oppo.  draws
   1   k6  3000.0  29.2  29.2   1183  87.4%  2480.2   8.5%
   2   h6  2924.7  25.8  25.8   1184  79.9%  2529.6  11.6%
   3   k5  2801.4  24.7  24.7   1183  74.9%  2480.2  11.9%
   4   h5  2776.8  23.7  23.7   1185  69.3%  2529.5  13.8%
   5   k4  2648.2  23.4  23.4   1182  62.8%  2480.3  13.3%
   6   h4  2587.1  23.5  23.5   1184  54.0%  2529.6  12.0%
   7   k3  2486.2  23.2  23.2   1183  49.7%  2480.9  13.3%
   8   h3  2412.0  23.9  23.9   1183  40.4%  2529.5  10.5%
   9   k2  2202.5  24.8  24.8   1189  28.2%  2480.2  10.8%
  10   h2  2188.1  26.4  26.4   1184  24.2%  2529.6   7.5%
  11   k1  2043.3  27.1  27.1   1188  17.7%  2480.6  10.4%
  12   h1  1995.2  29.0  29.0   1188  11.9%  2530.0  12.7%
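As a short aside for reading rating tables like this one (my note, not part of the original data): under the standard logistic model, a rating gap maps to an expected score. The table itself comes from a rating tool whose fit may differ slightly; this is only the textbook formula.

Code:
import math

def expected_score(rating_diff: float) -> float:
    """Expected score for the higher-rated side, given the rating gap in Elo."""
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / 400.0))

# e.g. the ~75-point gap between k6 and h6 corresponds to roughly a 61% expected score.
print(round(expected_score(3000.0 - 2924.7), 3))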
So my conclusion from this data is ... nothing. No conclusion is possible. Houdini 1.5 is slightly stronger than Komodo because it's faster - enough to overcome the per-ply superiority (which is a meaningless measure) - and this data does NOT in any way prove who has the superior evaluation function. I believe Komodo has a superior evaluation function, but I'm not so stupid as to believe this is the definitive test to prove it. I think there is no reasonable test you can construct to prove it either way; "positional strength" is a real concept, but it is way too abstract to give a formal definition for. It's like trying to measure who loves their children more.