How effective is move ordering from TT?

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Off-topic, sorry...

Post by Laskos »

Ajedrecista wrote:Hello Kai:
Laskos wrote:
Houdini wrote:
Don wrote:I have written many times that I BELIEVE we have the best positional program in the world. There is no test that can prove whether I am right or wrong. I base this on the fact that we are one of the top 2 programs and yet we are probably only top 10 on tactical problem sets. We must be doing something right.
Apparently you don't understand your own engine ;).

Being poor in tactics but having a strong engine overall doesn't demonstrate the quality of the evaluation; it's a by-product of the LMR and null-move reductions. Tactics are based on playing non-obvious, apparently unsound moves. If you LMR/NMR a lot, you'll miss tactics; it's as simple as that.
Stockfish is, probably to an even higher degree than Komodo, relatively poor in tactical tests but very good overall, for exactly the same reason.

Instead I would measure the quality of the evaluation function by the performance at very fast TC. If you take out most of the search, what remains is evaluation.

Robert
1-ply match 400 games

Code: Select all

Games Completed = 400 of 10000 (Avg game length = 1.029 sec)
Settings = RR/16MB/100ms per move/M 400000cp for 1000 moves, D 120000 moves/PGN:C:\Users\Ani\Downloads\LittleBlitzer\swcr.pgn(5120)
Time = 516 sec elapsed, 12385 sec remaining
 1.  Komodo 4                 	246.0/400	188-96-116  	(L: m=96 t=0 i=0 a=0)	(D: r=94 i=4 f=6 s=12 a=0)	(tpm=11.6 d=1.00 nps=843878)
 2.  Houdini 1.5a             	154.0/400	96-188-116  	(L: m=188 t=0 i=0 a=0)	(D: r=94 i=4 f=6 s=12 a=0)	(tpm=10.3 d=1.00 nps=543249)
Sorry for being off-topic: may I ask how you managed to run a fixed-depth test under LB? I have run some tests using this GUI, so I know how to build the Engines.lbe file, but I am unable to get the engines to play to the fixed depth I want... what option must I write in the Engines.lbe file?

I have LB 2.5 (I know the latest version is 2.74, which is what you must have used given the two decimal places of average depth), but I suppose there will not be big differences. Thanks in advance!

Regards from Spain.

Ajedrecista.
I am using the InBetween adapter between the engines and LB. Google and download InBetween. Set these options in InBetween.ini, which must be in the same folder as InBetween.exe and the engine:

Code: Select all

CommandLine := Critter_1.4_64bit.exe
go movetime 100 := go nodes 200
This takes the engine Critter 1.4 from calculating for 100 ms per move to calculating 200 nodes per move. When you set a fixed time/move of 100 ms in LB, it is turned into a fixed 200 nodes per move. You could just as well use

Code: Select all

go movetime 100 := go depth 1

This maps LB's fixed 100 ms/move to a fixed depth of 1 instead. It depends on the engine which UCI commands it understands; for example, I cannot run Houdini 1.5 and Rybka 4 at fixed nodes.
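
If you are curious what such a wrapper does internally, here is a minimal sketch of the command-rewriting idea in Python. This is not InBetween's actual code, only an illustration of the principle; the engine path is a placeholder, and just the single movetime command is rewritten while everything else is passed through unchanged.

Code: Select all

# Minimal sketch of an InBetween-style UCI command rewriter (illustration only).
# The GUI talks to this script; the script talks to the real engine and
# rewrites "go movetime 100" into "go nodes 200" on the way through.
import subprocess
import sys
import threading

ENGINE = "Critter_1.4_64bit.exe"                 # placeholder path
REWRITE = {"go movetime 100": "go nodes 200"}    # the same mapping as above

engine = subprocess.Popen([ENGINE], stdin=subprocess.PIPE,
                          stdout=subprocess.PIPE, text=True, bufsize=1)

def gui_to_engine():
    # Forward GUI commands, rewriting the ones listed in REWRITE.
    for line in sys.stdin:
        cmd = line.strip()
        engine.stdin.write(REWRITE.get(cmd, cmd) + "\n")
        engine.stdin.flush()

threading.Thread(target=gui_to_engine, daemon=True).start()

# Engine output goes back to the GUI untouched.
for line in engine.stdout:
    sys.stdout.write(line)
    sys.stdout.flush()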

Here is an example. Correcting for each engine's speed on my computer, I set fixed nodes/move for the engines as:

Critter 1.4 200 nodes/move
Komodo 4 110 nodes/move
StockFish 2.2 140 nodes/move
Rybka 3 10 nodes/move

These are adjusted for each engine's speed on my computer, so that the time taken per move is similar for all engines (the arithmetic is sketched below the results). I get this:

Code: Select all

Games Completed = 1200 of 10000 (Avg game length = 1.750 sec)
Settings = RR/16MB/100ms per move/M 400000cp for 1000 moves, D 120000 moves/PGN:C:\Users\Ani\Downloads\LittleBlitzer\swcr.pgn(5120)
Time = 2658 sec elapsed, 19492 sec remaining
 1.  Komodo 4                 	254.5/600	215-306-79  	(L: m=306 t=0 i=0 a=0)	(D: r=31 i=26 f=15 s=7 a=0)	(tpm=29.9 d=2.21 nps=1457556)
 2.  Critter 1.4              	172.0/600	131-387-82  	(L: m=387 t=0 i=0 a=0)	(D: r=37 i=22 f=11 s=12 a=0)	(tpm=12.6 d=3.30 nps=24557)
 3.  Rybka 3                  	341.0/600	293-211-96  	(L: m=211 t=0 i=0 a=0)	(D: r=36 i=35 f=11 s=14 a=0)	(tpm=10.8 d=1.47 nps=27803)
 4.  StockFish 2.2            	432.5/600	390-125-85  	(L: m=125 t=0 i=0 a=0)	(D: r=40 i=31 f=5 s=9 a=0)	(tpm=11.6 d=4.35 nps=254992)

This is roughly equivalent to an extremely short time control, something like 0.1 ms per move or 10 ms per game, which is impossible to set as time/move but possible as nodes/move. Critter 1.4 is fairly representative of IvanHoe and Houdini, and I suspect they have a lighter eval than Komodo, while Stockfish and Rybka 3 have fairly solid evals. This is consistent with Richard Vida's claim that the Ippos have lighter evals than Rybka 3.
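
The speed adjustment mentioned above is nothing more than nodes ≈ nps × target time per move. A toy sketch of it in Python; the nps figures are invented so that the arithmetic lands on the node counts listed above, they are not my measurements:

Code: Select all

# Toy sketch of the speed adjustment: choose nodes/move so that every engine
# spends roughly the same wall time per move. The nps values are placeholders
# chosen to reproduce the node counts listed above, not actual measurements.
TARGET_SECONDS = 0.0001          # aim for roughly 0.1 ms of search per move

nps = {
    "Critter 1.4":   2_000_000,
    "Komodo 4":      1_100_000,
    "StockFish 2.2": 1_400_000,
    "Rybka 3":         100_000,
}

for name, speed in nps.items():
    nodes = max(1, round(speed * TARGET_SECONDS))
    print(f"{name}: go movetime 100 := go nodes {nodes}")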

Kai
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: How effective is move ordering from TT?

Post by Don »

diep wrote:
lkaufman wrote:
diep wrote: If they don't accept any match as proving anything, then it doesn't make sense to do any test, Uri.

If I have a normal match against Komodo, they claim Diep is parallel and uses all cores while they use just one, so logically they lose.

On the other hand, I didn't optimize Diep for single-core contests, since forward pruning a lot and a super-selective search help you there; today's evaluations seem to need 20 plies (selective plies, not really comparable with real plies) to get the maximum Elo scaling, and above 20 plies most engines hardly gain Elo per ply.

So the struggle is to reach those 20 plies quickly, no matter how dubious your search is. If you search SMP you can get there at rapid levels if you have a bunch of cores.

In most of those super-bullet tests people run nowadays, of course, no one reaches those 20 plies yet.
Okay, Vincent, here is my proposal. If you look at the CCRL 40/40 rating list you will see that Komodo 5 on one core outrates a recent Ivanhoe running on FOUR cores by 13 elo. So let's have a match between Komodo 5 on one core and Diep on FOUR cores. If you win the match it will at least imply that Diep is stronger than Ivanhoe and is therefore one of the top 6 engines. Regarding time limit, I suggest 30' + 30" increment, since repeat time control testing is a big waste of time, playing out dead drawn endings at the same slow pace as the early middlegame. Any fairly short opening book/test suite is fine, as long as you reverse colors with the same opening after each game. Maybe fifty game match?
Note that this test does not require us to do anything, it only requires that you send a version of DIEP to the person who will run the match.
I expect Komodo will win the match. I say this because I believe that if Diep already had the level of Ivanhoe you would be selling it now. If you win the match you can safely go commercial and expect decent sales.
Oh, you are sure you get 20 plies in 30 minutes a move, I guess?

As for playing Diep: some months from now it should play online, so you can play it at whatever time control you like there, I suppose.

But the whole point is, for a few months now I have seen you guys post about how 'strong' Komodo's evaluation is.

To me it seems very similar to Deep Sjeng's evaluation, which in turn looks, from a distance, like the Rybka/Fruit evaluation with a few small additions - and no, I don't claim anything has been copied.

I say *similar*.

You already admitted that you took over the material evaluation from Rybka.

Now, all those months you have been claiming that something which is largely a bean counter has the world's 'strongest' evaluation and a 'lot of knowledge' (my own words).

In fact Diep has 20x more knowledge. Sure, very badly tuned, despite big experiments on my side already to see what bugs I can get out using automatic parameter tuning.

Those claims are what I responded against.
In fact, it's obvious you both don't even know how things have been tuned in Komodo or what the effect of parameter tuning is; otherwise you would have grabbed the opportunity for a 1-ply match blindfolded.

In fact, I'm sure that if Christophe Theron were still on this forum, he would have immediately offered to play 1-ply matches against some very old Tiger (and I bet Tiger would win every single match), as he knows something about making your own evaluation and you guys obviously do NOT.

That's what I wanted to demonstrate, and it's more than obvious to everyone.

There are like a hundred engines out there now with nearly identical evaluations and only minor search differences, and some of them, like Deep Sjeng and Komodo, seemingly know a tad more about passed pawns. Those are such tiny differences that it's not even funny to make any claim about your evaluation function, I'd say.

Vincent
Vincent,

We have been friends for many years now and I still don't understand you. You are definitely a classic.

You are the only one I know of who will make some assertion that is complete nonsense and do it with a straight face.

So I think it's time for me to stop entertaining you now - the silliness is liable to rub off on me. I think it is starting to already ...

So let's assume that Komodo really does have a horrible evaluation function and a horrible search, uses LMR (to cover up its horrible move ordering), is a bean counter, and in general is far inferior to Diep in every imaginable way. OK, fine. As long as it continues to top the rating lists and beat everyone it plays, I will go ahead and live with the fact that it's the weakest program in the world.

You unlock this door with the key of imagination. Beyond it is another dimension - a dimension of sound, a dimension of sight, a dimension of mind. You're moving into a land of both shadow and substance, of things and ideas. You've just crossed over into the Vincent Zone.


Don
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
syzygy
Posts: 5566
Joined: Tue Feb 28, 2012 11:56 pm

Re: How effective is move ordering from TT?

Post by syzygy »

Don wrote:
syzygy wrote:
Don wrote:I can put this another way. I DO believe our evaluation function is one of the best and possibly the best.
"The best" relative to what... How do you compare the strength of two evaluation functions of two different engines?
Have you not read the posts? I do not believe there is a reasonable test for this.
That is what I read, and I agree.

This is a technical forum, so I assumed that your "belief" was at least a scientific belief. Something that could somehow be tested. Not to prove you wrong, because you did not state it as a fact but only as a belief, but to find out if your belief is correct or not.

As it is, I guess it is fair to say your "belief" is more comparable to a religious belief (but without the religion)?
However, I DO believe the concept is valid. You cannot measure "love" scientifically but we recognize it when we see it.
Sure, and we can recognize an evaluation function when we see one, even if we might not be able in general to sharply define the boundary between instructions that are part of the search and instructions that are part of the evaluation. Continuing the analogy, I would say that the belief that the evaluation of engine A is better than the evaluation function of engine B is as meaningful as the belief that the "love" of person A is greater than the "love" of person B.
syzygy
Posts: 5566
Joined: Tue Feb 28, 2012 11:56 pm

Re: How effective is move ordering from TT?

Post by syzygy »

Uri Blass wrote:
syzygy wrote:
Don wrote:I can put this another way. I DO believe our evaluation function is one of the best and possibly the best.
"The best" relative to what... How do you compare the strength of two evaluation functions of two different engines?
I think that there is one case when it is possible to say that one evaluation is superior.

If engine A replaces its evaluation with one that is semantically equivalent to the evaluation of engine B and gets better results, despite not getting more nodes per second, then it is possible to say that B has a superior evaluation.
Well, in this case at least the "semantically equivalent" evaluation function of engine A could be said to be "better" than the original evaluation function of engine A, where "better" applies only to engine A. It cannot be excluded that transplanting the evaluation function of engine A into engine B would show that the resulting "semantically equivalent" evaluation function of engine B is "better" than the original evaluation function of engine B. The first "better" (applying only to engine A) is a different relation from the second "better" (applying only to engine B). (Of course, with well-tuned engines it is far more likely that the transplanted evaluation function of A/B into B/A is in both cases worse than the original evaluation function of B/A, rather than better.)

You could define "the evaluation of B is better than the evaluation of A" as "the transplanted evaluation function of B is better than the evaluation function of A in respect of engine A AND the evaluation function of B is better than the transplanted evaluation function of A in respect of engine B", but then you will have evaluation functions of different engines that are not comparable anymore.

Another major complication is that the "transplantation" itself is not well-defined. Semantic equivalence is not sufficient. Implementation details will affect speed and thereby strength.
Last edited by syzygy on Mon Aug 13, 2012 10:23 pm, edited 1 time in total.
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: How effective is move ordering from TT?

Post by Desperado »

Hello Don,

I really don't want to stress this already long thread any further, but
I am curious.

Here is a serious question for you and for Larry.

What makes you believe that Komodo has such a really strong,
maybe the strongest, evaluation (I think you are talking about chess knowledge)
of all the engines out there? What is the basis?

a. do you have 2x more evaluation features than others?
b. do you have the same quantity but better quality,
better tuned, so to say?
c. do you simply have different evaluation features?
d. do you consider your search an average search and
think the strength has to come from the evaluation?
e. a combination of the points above?
f. some other reason I don't have in mind at the moment?

So, what is it? If you have already answered this question, please just
refer me to it, because I am not in the mood to read every post of this thread again.
Finally, I am not interested in any proof, just a hint at what
makes the difference in your opinion.

Thank you,

Michael
Last edited by Desperado on Mon Aug 13, 2012 10:29 pm, edited 2 times in total.
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: How effective is move ordering from TT?

Post by Don »

syzygy wrote:
Don wrote:
syzygy wrote:
Don wrote:I can put this another way. I DO believe our evaluation function is one of the best and possibly the best.
"The best" relative to what... How do you compare the strength of two evaluation functions of two different engines?
Have you not read the posts? I do not believe there is a reasonable test for this.
That is what I read, and I agree.

This is a technical forum, so I assumed that your "belief" was at least a scientific belief.
It's a technical forum, but that doesn't mean nobody can make a statement without supplying some sort of formal proof. You are just being silly. It's supposed to be completely open to speculation, assertions, bouncing ideas around and so on.

I can only surmise that you have not been here long if you think I'm the only one to express a belief without also publishing a paper here.

Something that could somehow be tested. Not to prove you wrong, because you did not state it as a fact but only as a belief, but to find out if your belief is correct or not.

As it is, I guess it is fair to say your "belief" is more comparable to a religious belief (but without the religion)?
You are being silly again. When I play tennis and notice my opponent has a weak backhand, I hit as many balls to his backhand as I can. It's evident to me that his backhand is weak, and exploiting that is a technical skill that good players have. It's not religious. I cannot prove his backhand is weak with some scientific test, but that doesn't mean I should not take advantage of it.
However, I DO believe the concept is valid. You cannot measure "love" scientifically but we recognize it when we see it.
Sure, and we can recognize an evaluation function when we see one, even if we might not be able in general to sharply define the boundary between instructions that are part of the search and instructions that are part of the evaluation. Continuing the analogy, I would say that the belief that the evaluation of engine A is better than the evaluation function of engine B is as meaningful as the belief that the "love" of person A is greater than the "love" of person B.
That's clearly not the case. Good players can easily see when one program seems to understand the game better than another if there is a significant difference. It's just difficult to construct a reasonable test for it. But you can probably easily construct CRUDE tests that will work pretty well. I don't think that is so easy for "love." The STS is probably a pretty good "crude" test for such a thing.
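
To give a flavor of what such a crude test could look like, here is a sketch only, using the python-chess library; the engine path, positions and "best" moves below are placeholders, and a real suite like the STS has on the order of 1500 scored positions.

Code: Select all

# Rough sketch of a crude "knowledge" test: run a handful of quiet positions
# at a very shallow search limit and count how often the engine picks the move
# the suite considers best. Entries here are illustrative placeholders.
import chess
import chess.engine

ENGINE_PATH = "./komodo"          # placeholder path to a UCI engine

# (FEN, expected best move in UCI notation)
suite = [
    ("r1bqkbnr/pppp1ppp/2n5/4p3/4P3/5N2/PPPP1PPP/RNBQKB1R w KQkq - 2 3", "f1b5"),
    ("rnbqkb1r/pppppppp/5n2/8/3P4/8/PPP1PPPP/RNBQKBNR w KQkq - 1 2", "c2c4"),
]

engine = chess.engine.SimpleEngine.popen_uci(ENGINE_PATH)
score = 0
for fen, best in suite:
    board = chess.Board(fen)
    result = engine.play(board, chess.engine.Limit(depth=1))  # shallow search
    if result.move == chess.Move.from_uci(best):
        score += 1
engine.quit()
print(f"matched {score}/{len(suite)} suite moves")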

You do bring up a good point about distinguishing between evaluation and search. It's a naive simplification to separate them - trying to pretend that a weak evaluation with a deeper search plays the same as a strong evaluation with a less deep search. Is that your mental model of how this works?

No, if you take a program with a weak evaluation you can find the right handicap to equalize the two programs. But that does not mean the weaker program suddenly is playing the same way. A good player carefully studying the games will notice that the programs are each winning their games for different reasons. I would assume his impression would be that the handicapped program seems to know more about chess and that the other program is making up for this in other ways, perhaps tactics. It might come across to him as if the weaker program is getting away with a lot of crap.
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
Rebel
Posts: 6995
Joined: Thu Aug 18, 2011 12:04 pm

Re: How effective is move ordering from TT?

Post by Rebel »

bob wrote:
Rebel wrote:
bob wrote:
chrisw wrote: Yes, I have indeed been on about "heavy eval" for years. Curious now how my "heavy eval" from early 1990s is the basis of Fruit. I think you use it too now.

On GM play, the sad truth is that you have no idea.
Fruit? Heavy Eval?
Chris did not say the Fruit eval is heavy.
How does that jive with the following quote?

Curious now how my "heavy eval" from early 1990s is the basis of Fruit. I think you use it too now.
From the mouth of Dann Corbit, a quote from Anthony Cozzie:

Anthony Cozzie quit chess programming because he felt that the winning program was nothing more than the biggest bag of tricks collected from all the existing programs.

How could you miss Chris' point?

Insult snipped.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: How effective is move ordering from TT?

Post by bob »

Rebel wrote:
bob wrote:
Rebel wrote:
bob wrote:
chrisw wrote: Yes, I have indeed been on about "heavy eval" for years. Curious now how my "heavy eval" from early 1990s is the basis of Fruit. I think you use it too now.

On GM play, the sad truth is that you have no idea.
Fruit? Heavy Eval?
Chris did not say the Fruit eval is heavy.
How does that jive with the following quote?

Curious now how my "heavy eval" from early 1990s is the basis of Fruit. I think you use it too now.
From the mouth of Dann Corbit, a quote from Anthony Cozzie:

Anthony Cozzie quit chess programming because he felt that the winning program was nothing more than the biggest bag of tricks collected from all the existing programs.

How could you miss Chris' point?

Insult snipped.
How can one make a point with an outright false statement? There is nothing "heavy" (heavy is commonly a synonym for complex or computationally expensive in computer-chess discussions) in Fruit's eval at all. Does he REALLY want to claim that a very simple bean-counter evaluation was derived from HIS evaluation? :) One never knows what one is going to read here nowadays. But clearly, accuracy is not exactly the hallmark of claims like this.

Fruit's eval is NOT "heavy" by any possible measure of the term. Here's the typical "slang" definition of heavy (not particularly computer-science specific, but the last one is on point):

heavy

* good.

Usher is a heavy dancer.

* Cool, Awesome

Wow your website is heavy.

* deep, hard to deal with, burdensome, or hard to understand

BTW, where does CSTal come in here? It was never open source, and I have never heard of anyone reverse-engineering the thing, since it was never near the top of any rating list or tournament. How, then, did someone (Fabien) supposedly hijack his eval ideas? So, other than being totally unrelated to the current state of affairs, how exactly was there a point to be made???
lkaufman
Posts: 5960
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA

Re: How effective is move ordering from TT?

Post by lkaufman »

Desperado wrote:Hello Don,

I really don't want to stress this already long thread any further, but
I am curious.

Here is a serious question for you and for Larry.

What makes you believe that Komodo has such a really strong,
maybe the strongest, evaluation (I think you are talking about chess knowledge)
of all the engines out there? What is the basis?

a. do you have 2x more evaluation features than others?
b. do you have the same quantity but better quality,
better tuned, so to say?
c. do you simply have different evaluation features?
d. do you consider your search an average search and
think the strength has to come from the evaluation?
e. a combination of the points above?
f. some other reason I don't have in mind at the moment?

So, what is it? If you have already answered this question, please just
refer me to it, because I am not in the mood to read every post of this thread again.
Finally, I am not interested in any proof, just a hint at what
makes the difference in your opinion.

Thank you,

Michael
In my own case, I need to use the program that evaluates positions best in my iDeA work, which was the basis for my book, and is vital to me for my teaching, my writing, and my tournament play. I've tried all of the top-five engines, and the resulting evals convinced me that I should use Komodo for this work. The evals and order of moves in theoretical positions simply agree much better with GM theory. I guess what I'm saying is that considering both speed and quality of evaluation function, I think Komodo gives the most reliable evals in positions where seeing very deep tactics is not critical, which is most of the time. I also think it follows that if one of two closely rated engines (Houdini) is stronger in deep/surprising tactics, as we acknowledge, Komodo is probably stronger in positions where they are not present. Otherwise Komodo should be clearly weaker. At the non-blitz time controls, the ratings for Komodo and Houdini are very close now. I think any engine that is not within 50 elo of the top is unlikely to be best at anything important.
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: How effective is move ordering from TT?

Post by Don »

Desperado wrote:Hello Don,

I really don't want to stress this already long thread any further, but
I am curious.

Here is a serious question for you and for Larry.

What makes you believe that Komodo has such a really strong,
maybe the strongest, evaluation (I think you are talking about chess knowledge)
of all the engines out there? What is the basis?
The answer is pretty simple. Komodo is in the top 2 - so if it does not have a great evaluation function then it must have a great search or some combination of both.

But I don't believe Komodo has a search that is very special - it's certainly not fast. If one aspect of the program is poor, the other parts must carry the burden of having to compensate.

Larry and I also set out specifically to build an exceptionally good evaluation - and I wanted this perhaps even more than Larry did. But we are both very pragmatic, so we are going to do whatever works best. If it's a close call, we choose evaluation. If it somehow turned out that the only right way to build a chess program is to ignore evaluation completely and just count wood, that's what we would do.


a. do you have 2x more evaluation features than others?
b. do you have the same quantity but better quality,
better tuned, so to say?
I believe in quality over quantity, and I cannot compare Komodo to other programs because I have never specifically tried to count how many evaluation terms are in other programs. Some are difficult to count due to semantics: for example, if you have a table built with a formula that has 3 parameters, do you count 3 terms or do you count each table element? Do you count each square of a piece-square table as an evaluation parameter? Technically it is. We have 156 terms that are directly adjustable via the UCI interface when we have that activated, but there are more that are not directly accessible. Most of them are hard-coded tables, tables built by formulas, and other carefully tuned constants for evaluation. I don't know or care how that compares to other programs, because we only care whether it works, not how many there are. If we see an evaluation problem we try to solve it by making adjustments, but if we have to we will add an evaluation term, and then we have to see that it works before it gets into Komodo.
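
To make "directly adjustable via the UCI interface" concrete, here is a generic sketch of how any engine can expose evaluation terms as UCI spin options; the term names, defaults and ranges are invented for illustration and have nothing to do with Komodo's actual parameters.

Code: Select all

# Generic sketch: exposing evaluation weights as UCI spin options so that a
# GUI or tuner can adjust them with "setoption". Term names and values are
# made up for illustration.
import sys

eval_terms = {                   # hypothetical tunable weights, in centipawns
    "BishopPairBonus": 45,
    "PassedPawnBonus": 20,
    "RookOnOpenFile":  25,
}

def uci_loop():
    for line in sys.stdin:
        tokens = line.split()
        if not tokens:
            continue
        if tokens[0] == "uci":
            # Advertise every tunable term as a spin option.
            for name, value in eval_terms.items():
                print(f"option name {name} type spin default {value} min -200 max 200")
            print("uciok", flush=True)
        elif tokens[0] == "setoption" and "name" in tokens and "value" in tokens:
            # Expected form: "setoption name <Term> value <N>"
            # (single-word term names only, to keep the sketch short).
            name = tokens[tokens.index("name") + 1]
            value = int(tokens[tokens.index("value") + 1])
            if name in eval_terms:
                eval_terms[name] = value
        elif tokens[0] == "quit":
            break

if __name__ == "__main__":
    uci_loop()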

I don't think we do this any differently than most of the other programs - it's basic software engineering 101 and many other programs have fine evaluation functions too.


c. do you simply have different evaluation features?
d. do you consider your search an average search and
think the strength has to come from the evaluation?
Yes.
e. a combination of the points above?
f. some other reason I don't have in mind at the moment?

So, what is it? If you have already answered this question, please just
refer me to it, because I am not in the mood to read every post of this thread again.

Thank you,

Michael
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.