Crafty and Stockfish question

Volker Annuss · Post by **Volker Annuss** » Sun Jul 18, 2010 9:48 am

lkaufman wrote:Here is a good example of what I'm talking about. The following opening line is one of the most common positions after 11 moves in GM chess; it is a very typical Sicilian. Both sides have castled so that's not an issue.

1. e4 c5 2. Nf3 d6 3. d4 cxd4 4. Nxd4
Nf6 5. Nc3 a6 6. Be2 e6 7. O-O Be7 8. f4 Qc7 9. Be3 Nc6 10. Kh1 O-O 11. a4 Re8

[...]

Does anyone know of any other program that evaluates this at two ply at around 0.3 or less?
[...]

Hermann 2.5 evaluates gives a score of 0.23:

Code: Select all

position fen r1b1r1k1/1pq1bppp/p1nppn2/8/P2NPP2/2N1B3/1PP1B1PP/R2Q1R1K w - - 1 12
go depth 2
info depth 1 seldepth 6 time 0 nodes 76 pv d4c6 c7c6 score cp 23 hashfull 0 nps 76000
info depth 2 seldepth 11 time 15 nodes 440 hashfull 1 nps 29333 score cp 23
bestmove d4c6 ponder c7c6

My current develoment version 2.5.36 gives a score of 0.24.

bhlangonijr · Post by **bhlangonijr** » Sun Jul 18, 2010 10:30 am

lkaufman wrote:Here is a good example of what I'm talking about. The following opening line is one of the most common positions after 11 moves in GM chess; it is a very typical Sicilian. Both sides have castled so that's not an issue.

1. e4 c5 2. Nf3 d6 3. d4 cxd4 4. Nxd4
Nf6 5. Nc3 a6 6. Be2 e6 7. O-O Be7 8. f4 Qc7 9. Be3 Nc6 10. Kh1 O-O 11. a4 Re8

There have been more than two thousand games at IM/GM level from this position, with White scoring the usual 55%. So a proper eval would be about 1/5 of the value of a pawn in the opening position, so about 0.2 for most programs, about 0.15 for Rybka.

I did a two ply search from here on a lot of programs. Stockfish 1.8 evaluated it at +.56, Rybka 4 as +.57, Crafty 23 as +.60, Deep Shredder 12 at 0.69, Fritz 12 at +.79, Naum 4 at +.89. All ridiculously optimistic for white. The only program that was even close to reasonable, much to my surprise, was Komodo 1.2 at +.31. Does anyone know of any other program that evaluates this at two ply at around 0.3 or less?

I'll now run a randomized playout with Rybka to see how the engine actually performs from here.

just for curiosity I tried this position using Houdini 1.02 (It's based on ipp* sources so I guess it's from Rybka guts

). Houldini evaluates this position as 0.08 and I think Houldini is as strong as Stockfish (maybe stronger).

Code: Select all

Houdini 1.02 w32 1_CPU
build 2010-06-18
by Robert Houdart
position fen r1b1r1k1/1pq1bppp/p1nppn2/8/P2NPP2/2N1B3/1PP1B1PP/R2Q1R1K w - - 1 12
go depth 2
info depth 2 seldepth 15 score cp 8  time 3 nodes 1525 nps 508000 pv d4c6 c7c6 d1d4
bestmove d4c6 ponder c7c6

Houldini's author pointed to this thread when asked about Houldini's evaluation, which makes me think he has already addressed this issue in his last developments.

http://talkchess.com/forum/viewtopic.php?t=35453

Regards,

mcostalba · Post by **mcostalba** » Sun Jul 18, 2010 11:25 am

lkaufman wrote:This reference is amusing to me, because it mimics the study I did back in 1999 on a human 2300+ database.

You have not answered my question.

BTW the in the above study "Only data pertaining to the material configuration was taken" so I still miss the link with the evaluation, especially in an opening position where, if is considered only material configuration, evaluation should be zero.

I think only Bob has understood my point. Pretending an engine evaluation of 5 plies depth of an opening position is able to give winning percentage is pure science fiction. It seems people find, surprisingly, very difficut to dissociate the concept of position evaluation by a GM from the little number that an engine spits outs together with the best move when returning from the search.

The fact that the numbers could be "similar" and are called in the same way is an obstacle in realizing that actually are two different things. We arrive at the paradox that people tries to make the latter to mimic the first, in a totally misguided goal effort.

lkaufman · Post by **lkaufman** » Sun Jul 18, 2010 4:05 pm

bhlangonijr wrote: just for curiosity I tried this position using Houdini 1.02 (It's based on ipp* sources so I guess it's from Rybka guts ). Houldini evaluates this position as 0.08 and I think Houldini is as strong as Stockfish (maybe stronger).
Code: Select all
Houdini 1.02 w32 1_CPU
build 2010-06-18
by Robert Houdart
position fen r1b1r1k1/1pq1bppp/p1nppn2/8/P2NPP2/2N1B3/1PP1B1PP/R2Q1R1K w - - 1 12
go depth 2
info depth 2 seldepth 15 score cp 8  time 3 nodes 1525 nps 508000 pv d4c6 c7c6 d1d4
bestmove d4c6 ponder c7c6
Houldini's author pointed to this thread when asked about Houldini's evaluation, which makes me think he has already addressed this issue in his last developments.

http://talkchess.com/forum/viewtopic.php?t=35453

Regards,

Houdini multiplies the score by a variable fraction of 0.5 or less, apparently in an attempt to hide the link to IPP, so that a clean pawn up is scored as only around 0.4. So 0.08 is really around 0.2. Still it's much less than Rybka. But I forgot that Rybka does a five ply search when you tell it to do 2 ply, and there is no way to force Rybka to do only a 2 ply real search. Probably if I could the score would be close to 0.2, so I guess it comes the closest to being reasonable.

lkaufman · Post by **lkaufman** » Sun Jul 18, 2010 4:17 pm

mcostalba wrote: You have not answered my question.

BTW the in the above study "Only data pertaining to the material configuration was taken" so I still miss the link with the evaluation, especially in an opening position where, if is considered only material configuration, evaluation should be zero.

I think only Bob has understood my point. Pretending an engine evaluation of 5 plies depth of an opening position is able to give winning percentage is pure science fiction. It seems people find, surprisingly, very difficut to dissociate the concept of position evaluation by a GM from the little number that an engine spits outs together with the best move when returning from the search.

The fact that the numbers could be "similar" and are called in the same way is an obstacle in realizing that actually are two different things. We arrive at the paradox that people tries to make the latter to mimic the first, in a totally misguided goal effort.

I initially chose the evaluation parameters for both Rybka and Komodo based on my judgment of what they should be to give "correct" evaluations in human GM terms. I modified them somewhat where I had reason to believe that other values would give scores that were better predictors of engine vs engine performance. Then these values were gradually modified based on self-play, but in most cases the values did not change so dramatically. Vas thought this to be a sound approach, and as you know Rybka is not such a bad program. So I have trouble understanding why you think there should be no link between displayed scores of an engine and reasonable evaluations designed to predict percentage scores. I do understand that your approach in Stockfish has some advantages, in that it may steer the search towards more promising parts of the tree even if this means choosing a less optimum move if all the scores were "correct". But now I see why Rybka and Komodo seem to evaluate much more like human GMs, although there are still some very noticeable biases in favor of dynamics over statics.

bob · Post by **bob** » Sun Jul 18, 2010 4:33 pm

michiguel wrote:
bob wrote:
michiguel wrote:
mcostalba wrote:
lkaufman wrote:My point is that a clean pawn up translates to a certain winning percentage, maybe 75% or so (it varies a bit with time limit and engine).
And why should be like that in your opinion ?

http://chessprogramming.wikispaces.com/ ... e,+and+ELO

Miguel

Knowing how evaluation is tuned according to if a change makes an engine wins more then losing I see absoultly no direct relation between evaluation score and winning percentage, perhaps there is but it is not immediate to figure it out nor it is trivial to demonstarte a certain relation does exsist.

Intutively, but just intuitively because I have no element to state a fixed link between evaluation score and winning percentage, I could argue that if this relation does exist it would be, in case, in endgame / late midgame evaluations, but more diffcult in middlegame or even in opening positions.

But anyhow my opinion is that evaluation score is just a tool to let the engine to choose a move among a given set, no more then that. It is designed for this goal and all the changes are aimed at improving this goal, not something else, for sure not to help humans to understand the positions, this, in case, could be just a nice side effect but not the reason why evaluation exists and is tuned.
That's interesting, but totally worthless in this context. What do you do when program A says +2.1, program B says +1.2, and program C say s+0.3, all for the same position? Trying to have a single "unified field theory" to translate score to winning percentage for all chess engines is as elusive as Einstein's goal was/is. You could certainly do one for each distinct program, but a unified formula "one size fits all"? Doubt it is possible, much less practical. Ever seen a game where computers are playing both sides, and both sides think they are winning?
That is not the point of the link I provided. It does not deal with score, but it tries to relate material vs winning percentage. I thought that was the question MC asked to LK.

Miguel

Larry's original question was about raw score, the +0.6-+0.7 in a position he considered equal... However, I still believe that trying to convert raw material advantage to a probability is tough to impossible. If a program has large positional scores, it will quite often be down in material and do OK if it is well-tuned...

bob · Post by **bob** » Sun Jul 18, 2010 8:13 pm

bob wrote:
lkaufman wrote:
bob wrote:[quote
For crafty, it is pretty easy to grasp the score. For the position from the second post in this thread, you can discover via the "score" command that some of this is from development (knights on the edge, unconnected rooks, uncastled, etc. Not much of that comes from mobility in our case, it is mainly the special case "uncastled development scoring"...

Remember, it is not the score that counts, it is the move. I suppose everyone could just add in a -50 constant to their scores and make them appear more conservative, but it would not change the move at all...
Changing the scores by a constant would solve nothing, because they are interpreted relative to material and to static factors. The issue is about the relative weighting of static vs. dynamic factors (leaving out king safety as it has elements of both). Perhaps I am mistaken about Crafty overweighting dynamics; I have spent far more time with Stockfish which displays similar behavior in the opening. For me (and surely many others) what I want most from an engine is to get an accurate evaluation of an opening line (which may extend all the way to the endgame!). I put the scores in an IDeA tree using Aquarium and research openings this way. If the evals systematically overrate positions where White has more mobility, it will be "recommending" the wrong lines. So for me, a correct eval of the end node is more important than the rating of an engine.
I spent literally months of time on "centralizing" the evaluation. And outside of the development issues, most scores are pretty well centered around zero. This may have changed during testing, since anything is possible there, but we always wanted "equal" positions to be somewhere near zero. But development is different and trickier. If you do that completely symmetrically, then you either develop a piece or prevent your opponent from developing a piece. This can backfire, thanks to tempo's importance.

There are certainly issues between "real" and "imagined" positional advantages that we do not handle very well (nor does any other program I have seen so far.)

Here's an idea I might could make happen:

1. Set up a bunch of buckets, say -10, -9.5, ..., 0.0, +.5, ..., +10. Those represent evaluation scores (explicitly Crafty of course).

2. Write a program to eat Crafty log files and first notice the result (win, lose or draw) and then look at each last evaluation displayed when making a move, to see what the outcome for that evaluation averages over millions of games.

3. From that, I could assign a probability of winning or losing for each evaluation, and then when I display an evaluation, I could map it through that function to convert it into a probability of winning from 0.0 to 1.0, and perhaps double that and subtract 1.0 so that the scores come out as -1.0 ... 0.0 ... +1.0 with -1.0 being a certain loss, +1.0 being a certain win, and scores distributed in that range.

That is doable, for Crafty only...

Milos · Post by **Milos** » Mon Jul 19, 2010 12:33 am

lkaufman wrote:Rybka does a five ply search when you tell it to do 2 ply, and there is no way to force Rybka to do only a 2 ply real search.

Of course there is. Just put -1 as the search depth

.

jhaglund · Post by **jhaglund** » Mon Jul 19, 2010 8:41 pm

bob wrote:
lkaufman wrote:My "proof" that my human evaluation (and that of all GMs) is the right one is that even in engine databases, the openings score more or less similarly to the results in human play (in most cases). Thus you would never find 75% scores for White in any database of Sicilian games, which might be expected from the huge evals in Crafty and Stockfish. I also try randomized playouts with Rybka from major openings and usually get similar results to human databases, i.e. scores around 55% for White.

As for earlier starting positions, what do you think about randomized moves for the first N ply, filtering out all positions unbalanced by more than X?
Aha, now we get to the bottom of the "symantic" war here.

What you are doing, and incorrectly, is to assume that the scores match up proportionally to winning or losing. But if you think about the score as something different entirely, it becomes less muddy. A program thinks in terms of "best move" and has to assign some numeric value so that minimax works. But nothing says that +1.5 means "winning advantage". To understand, what if I multiplied all scores by 100, including material. Now you would see truly huge scores with no greater probability of winning than before. The scores help the program choose between moves. I'd like to have scores between -1.0 and +1.0, where -1.0 is absolutely lost and +1.0 is absolutely won. But that would not necessarily help the thing play one bit better. Would be a more useful number for humans to see, I agree. But that would be all.

Comparing scores between engines is very much like comparing depths, or branching factors, etc. And in the end, the only thing that really matters is who wins and who loses...

What I think could be useful here:

Implementing a concurrent search (SFSB)...
http://talkchess.com/forum/viewtopic.ph ... 70&t=35066

Playing out the moves to a complete game or just to x plies + y and generating statistics of each evaluation +/-... and using that to chose your move or for move ordering.

1.e4 ...
... e5 (34%W)(32%L)(34%D)
... c5 (33%W)(30%L)(37%D)
... c6 (31%W)(28%L)(41%D)
...Nc6(29%W)(30%L)(41%D)
...
or...

Average eval at end of x plies...

1.e4 ...
... e5 (+.34)
... c5 (+.15)
... c6 (+.11)
...Nc6(+.09)
...

or...

Order move after Eval pv[x]+pv[ply[y]]

1.e4 ...
... e5 (+.4)
... c5 (+.3)
... c6 (+.2)
...Nc6(+.1)
...

lkaufman · Post by **lkaufman** » Mon Jul 19, 2010 8:53 pm

Milos wrote:
lkaufman wrote:Rybka does a five ply search when you tell it to do 2 ply, and there is no way to force Rybka to do only a 2 ply real search.
Of course there is. Just put -1 as the search depth .

Is there some interface that allows a negative depth input and handles it properly? The Fritz interface I normally use doesn't allow this.

Crafty and Stockfish question

Re: Crafty and Stockfish question

Re: Crafty and Stockfish question

Re: Crafty and Stockfish question

Re: Crafty and Stockfish question

Re: Crafty and Stockfish question

Re: Crafty and Stockfish question

Re: Crafty and Stockfish question

Re: Crafty and Stockfish question

Re: Crafty and Stockfish question

Re: Crafty and Stockfish question