Crafty and Stockfish question

Discussion of chess software programming and technical issues.

Moderator: Ras

QED
Posts: 60
Joined: Thu Nov 05, 2009 9:53 pm

Re: Crafty and Stockfish question

Post by QED »

Larry Kaufman wrote:
Tord Romstad wrote:
Now to the question of why such high mobility scores work: I don't know, and honestly I wasn't even aware that our mobility scores were unreasonably high (I'm not a chess player). If they are, here's a possible explanation of why these high bonuses work so well in practice:

Except in positions with a single long, completely forced line, the quality of the last few moves of a long PV usually isn't high. The position at the end of the PV will never appear on the board. When the program has an advantage in mobility at the position at the end of the PV, however, it probably also has an advantage in mobility at most positions close to the end position in the search tree. This means that in the last few nodes along the PV, where the PV moves are probably not good, the program will probably have many reasonable alternative moves and the opponent considerably fewer. High mobility scores therefore steer the search towards areas of the tree where there is a good chance of finding unexpected resources for the program, and not for the opponent.

Maximizing the chance of pleasant surprises towards the end of the PV while minimizing the chance of unpleasant surprises seems like a good idea, in general.
That seems like a good theory. The question now arises: Can I have my cake and eat it too? In other words, is there a way to steer the search towards such promising areas without resorting to artificially high mobility scores? The problem is that programs which use such unrealistic scores are not very useful for opening analysis by humans, because the evaluations are just way out of line with results in actual play (whether human or engine play) from positions where one side has much more mobility but worse structure. Humans tend to prefer (and score better from) the positions with the better structure but worse mobility.
Question: Why is the Sicilian misevaluated by engines?
Because mobility is overvalued, and I see two related reasons for that.

This is the first reason:
Mobility is good because, if you are mobile, you have a better chance of finding an alternative when your principal variation turns out to be wrong.

I like the way two different threads came to similar conclusions:
An entirely random evaluation steers the search towards positions with many choices. That is a naive form of mobility, and with it, random evaluation gives a performance that is not completely weak.
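
For illustration, here is a minimal sketch of the kind of mobility term being discussed: count the squares a piece can reach and award a small bonus per square. Everything here (the board representation, the single-knight scope, the 4 cp weight) is invented for the example, not taken from any engine in this thread.

Code:

#include <stdio.h>

/* Hypothetical mobility term: count the squares one knight can reach
   and award a small bonus per square.  Real engines do this for every
   piece, usually with attack bitboards; the weight is made up. */

static const int KNIGHT_STEPS[8][2] = {
    { 1, 2}, { 2, 1}, { 2,-1}, { 1,-2}, {-1,-2}, {-2,-1}, {-2, 1}, {-1, 2}
};

/* board[rank][file]: 0 = empty, +1 = own piece, -1 = enemy piece */
static int knight_mobility(int board[8][8], int r, int f)
{
    int squares = 0;
    for (int i = 0; i < 8; i++) {
        int nr = r + KNIGHT_STEPS[i][0], nf = f + KNIGHT_STEPS[i][1];
        if (nr >= 0 && nr < 8 && nf >= 0 && nf < 8 && board[nr][nf] <= 0)
            squares++;               /* empty or enemy-occupied square */
    }
    return squares;
}

int main(void)
{
    int board[8][8] = {{0}};
    board[4][3] = 1;                 /* our knight on d5: 8 target squares */
    printf("mobility bonus: %d cp\n", 4 * knight_mobility(board, 4, 3));
    return 0;
}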

This is the second reason:
Finding tactical solutions. Usually the middlegame is more important than the opening. In the opening it is reasonable for the side behind in development to be doing quite well because of some structural compensation. But when one side is still underdeveloped in the middlegame, that usually means it is desperately defending some weak points. This is a clear sign for the engine to go for the kill. Even if the engine does not understand what the weakness is, relying on tactics is often sufficient here. So overall, overvaluing mobility helps in punishing mistakes, and this is more important than performance in openings (or in closed positions).

Humans do not work in this simple way, because they have better heuristics. Overloaded defenders and multiple weaknesses are better indicators of a tactical solution than merely restricted mobility. Also, in the opening phase, humans understand that you need to develop an initiative to actually make something tangible out of better mobility. You have to be able to attack multiple possible targets; merely moving pieces around achieves next to nothing.

About having the cake and eating it:
Half of the second reason can be saved by giving the engine bigger and better static knowledge, to mimic human heuristics: something like computing a "speed of activation" for the pieces, comparing it to a "time to develop" computed from the speed of the enemy threats, and evaluating underdevelopment as very bad only when it is not likely to vanish within a few moves. I think the future lies here.
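
Purely to illustrate the shape of such a term (both tempo counts and the 25 cp per-tempo cost are invented; computing the real inputs would be the hard part):

Code:

#include <stdio.h>

/* Hypothetical: a development lag is evaluated as bad only when it
   cannot be repaired before the opponent's threats land. */
static int development_penalty(int tempi_to_develop, int tempi_until_threat)
{
    int deficit = tempi_to_develop - tempi_until_threat;
    if (deficit <= 0)
        return 0;               /* the lag will vanish in time: harmless */
    return -25 * deficit;       /* lasting underdevelopment: penalize it */
}

int main(void)
{
    printf("%d cp\n", development_penalty(3, 5)); /* develops in time: 0 */
    printf("%d cp\n", development_penalty(5, 2)); /* caught lagging: -75 */
    return 0;
}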

Half of the first reason is even harder. It would be nice if, besides the score, we also had something like a measure of danger (of finding something bad along the PV). Humans do this: they choose lines that are safe for them and risky for the opponent. But I am not sure how such dangerousness would be computed, or how it would be propagated towards the root in an alpha-beta search. It is possible that measuring safety the human way is only feasible in massively parallel environments, and that traditional engines will always overvalue mobility (and king safety, ...) and use some counter-intuitive search tricks to mimic a naive danger sense while maintaining search depth.

And about the evaluation score:
It is whatever makes the engine play well using search. It is only needed to compare two positions, and the result is not completely determined by which position has the better winning probability. Both "danger sense" and a "smell for combinations" are effects that make the searched best move better overall, while the reported score remains less reliable because of the overvaluations. Maybe propagating two scores ("active" for search decisions and "objective" to report) would be nicer, but it is not clear how the latter evaluation would be tuned.
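
On a toy game tree, and ignoring all the real difficulties (search windows, pruning, tuning), the two-score idea could look like this (hypothetical; no engine named in this thread does it):

Code:

#include <stdio.h>

/* Hypothetical two-score minimax: decisions follow the 'active'
   score, while the 'objective' score of the chosen line rides along
   and is what would be reported to the user. */

typedef struct { int active, objective; } Score2;

typedef struct Node {
    int          nchildren;
    struct Node *child;
    Score2       leaf;                   /* used when nchildren == 0 */
} Node;

static Score2 negamax(const Node *n)
{
    if (n->nchildren == 0)
        return n->leaf;
    Score2 best = { -32000, -32000 };
    for (int i = 0; i < n->nchildren; i++) {
        Score2 s = negamax(&n->child[i]);
        s.active = -s.active;
        s.objective = -s.objective;
        if (s.active > best.active)      /* decide by the active score */
            best = s;                    /* ...the objective tags along */
    }
    return best;
}

int main(void)
{
    /* Two replies, scored from the opponent's side: the first looks
       best on the mobility-inflated active score, but its objective
       score is more modest, and that is the number we would report. */
    Node kids[2] = {
        { 0, NULL, { -60, -20 } },
        { 0, NULL, { -40, -35 } },
    };
    Node root = { 2, kids, { 0, 0 } };
    Score2 s = negamax(&root);
    printf("search on %+d cp, report %+d cp\n", s.active, s.objective);
    return 0;
}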
lkaufman
Posts: 6258
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA
Full name: Larry Kaufman

Re: Crafty and Stockfish question

Post by lkaufman »

You make some good points here, but I don't see anything we could quickly implement and test. Maybe that's asking for too much.
Meanwhile, here is the result of my playout of 5,000 Rybka 4 games at "5 ply" (which is really 8 ply) from my Sicilian position after 11...Re8: Black actually came out ahead, 51% to 49%! This really shows how far from the truth all the engines are here.
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Crafty and Stockfish question

Post by michiguel »

lkaufman wrote:Here is a good example of what I'm talking about. The following opening line is one of the most common positions after 11 moves in GM chess; it is a very typical Sicilian. Both sides have castled so that's not an issue.

1. e4 c5 2. Nf3 d6 3. d4 cxd4 4. Nxd4
Nf6 5. Nc3 a6 6. Be2 e6 7. O-O Be7 8. f4 Qc7 9. Be3 Nc6 10. Kh1 O-O 11. a4 Re8

There have been more than two thousand games at IM/GM level from this position, with White scoring the usual 55%. So a proper eval would be about 1/5 of the value of a pawn in the opening position: about 0.2 for most programs, about 0.15 for Rybka.

I did a two-ply search from here on a lot of programs. Stockfish 1.8 evaluated it at +0.56, Rybka 4 at +0.57, Crafty 23 at +0.60, Deep Shredder 12 at +0.69, Fritz 12 at +0.79, Naum 4 at +0.89. All ridiculously optimistic for White. The only program that was even close to reasonable, much to my surprise, was Komodo 1.2 at +0.31. Does anyone know of any other program that evaluates this at two ply at around 0.3 or less?

I'll now run a randomized playout with Rybka to see how the engine actually performs from here.
Of course, Gaviota is not that strong, but... The static eval is 0.29. After 11 plies it is still pretty much the same.

It does not choose the right move though.

Miguel

Code:

+-----------------+
| r . b . r . k . |
| . p q . b p p p |
| p . n p p n . . |
| . . . . . . . . |    Castling: 
| P . . N P P . . |    ep: -
| . . N . B . . . |
| . P P . B . P P |
| R . . Q . R . K | [White]
+-----------------+

score
===> 0.29

analyze
********* Starts iterative deepening, thread = 0
set timer to infinite
        82   1:      0.0    +0.52  12.Nxc6 Qxc6
       448   2:      0.0    +0.52  12.Nxc6 Qxc6
      2326   3:      0.0    +0.48  12.Nxc6 Qxc6 13.Qd3
      8791   4:      0.0    +0.44  12.Nxc6 Qxc6 13.Bf3 b6
     25491   5:      0.1    +0.49  12.Nxc6 bxc6 13.Qe1 d5 14.e5
     38468   6       0.2    +0.39  12.Nxc6 bxc6 13.Qe1 Rb8 14.b3 d5
    147610   6:      0.5    +0.39  12.Nxc6 bxc6 13.Qe1 Rb8 14.b3 d5
    181949   7       0.5    +0.29  12.Nxc6 bxc6 13.Qe1 c5 14.Qg3 Bb7
                                   15.Bd3
    566885   7       1.4    +0.34  12.Nb3 b6 13.Nd2 d5 14.e5 d4 15.exf6
                                   dxe3 16.fxe7 exd2 17.Qxd2 Qxe7
    584205   7:      1.4    +0.34  12.Nb3 b6 13.Nd2 d5 14.e5 d4 15.exf6
                                   dxe3 16.fxe7 exd2 17.Qxd2 Qxe7
    734304   8       1.8    +0.40  12.Nb3 b6 13.Qe1 Nb4 14.Nd4 e5 15.fxe5
                                   dxe5
    948431   8:      2.2    +0.40  12.Nb3 b6 13.Qe1 Nb4 14.Nd4 e5 15.fxe5
                                   dxe5
   1255373   9       2.9    +0.37  12.Nb3 b6 13.Nd2 d5 14.e5 d4 15.exf6
                                   dxe3 16.fxe7 exd2 17.Ne4
   2129662   9:      4.5    +0.37  12.Nb3 b6 13.Nd2 d5 14.e5 d4 15.exf6
                                   dxe3 16.fxe7 exd2 17.Ne4
   4005124  10       8.0    +0.25  12.Nb3 d5 13.e5 Ne4 14.Bd3 Nxc3 15.bxc3
                                   Bd7 16.Qh5 g6 17.Qh3
   5323484  10      10.4    +0.32  12.Nxc6 bxc6 13.Qd3 e5 14.a5 Rb8 15.b3
                                   Rb4 16.Bb6 Rxb6 17.axb6 Qxb6
   8424159  10:     15.8    +0.32  12.Nxc6 bxc6 13.Qd3 e5 14.a5 Rb8 15.b3
                                   Rb4 16.Bb6 Rxb6 17.axb6 Qxb6
  10654679  11      19.6    +0.29  12.Nxc6 bxc6 13.Qd3 e5 14.a5 Rb8 15.b3
                                   exf4 16.Bxf4 Rb4 17.h3
  26211062  11:     45.8    +0.29  12.Nxc6 bxc6 13.Qd3 e5 14.a5 Rb8 15.b3
                                   exf4 16.Bxf4 Rb4 17.h3
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Crafty and Stockfish question

Post by michiguel »

michiguel wrote:
lkaufman wrote:Here is a good example of what I'm talking about. The following opening line is one of the most common positions after 11 moves in GM chess; it is a very typical Sicilian. Both sides have castled so that's not an issue.

1. e4 c5 2. Nf3 d6 3. d4 cxd4 4. Nxd4
Nf6 5. Nc3 a6 6. Be2 e6 7. O-O Be7 8. f4 Qc7 9. Be3 Nc6 10. Kh1 O-O 11. a4 Re8

There have been more than two thousand games at IM/GM level from this position, with White scoring the usual 55%. So a proper eval would be about 1/5 of the value of a pawn in the opening position: about 0.2 for most programs, about 0.15 for Rybka.

I did a two-ply search from here on a lot of programs. Stockfish 1.8 evaluated it at +0.56, Rybka 4 at +0.57, Crafty 23 at +0.60, Deep Shredder 12 at +0.69, Fritz 12 at +0.79, Naum 4 at +0.89. All ridiculously optimistic for White. The only program that was even close to reasonable, much to my surprise, was Komodo 1.2 at +0.31. Does anyone know of any other program that evaluates this at two ply at around 0.3 or less?

I'll now run a randomized playout with Rybka to see how the engine actually performs from here.
Of course, Gaviota is not that strong, but... The static eval is 0.29. After 11 plies it is still pretty much the same.

It does not choose the right move though.

Miguel

Code:

+-----------------+
| r . b . r . k . |
| . p q . b p p p |
| p . n p p n . . |
| . . . . . . . . |    Castling: 
| P . . N P P . . |    ep: -
| . . N . B . . . |
| . P P . B . P P |
| R . . Q . R . K | [White]
+-----------------+

score
===> 0.29

analyze
********* Starts iterative deepening, thread = 0
set timer to infinite
        82   1:      0.0    +0.52  12.Nxc6 Qxc6
       448   2:      0.0    +0.52  12.Nxc6 Qxc6
      2326   3:      0.0    +0.48  12.Nxc6 Qxc6 13.Qd3
      8791   4:      0.0    +0.44  12.Nxc6 Qxc6 13.Bf3 b6
     25491   5:      0.1    +0.49  12.Nxc6 bxc6 13.Qe1 d5 14.e5
     38468   6       0.2    +0.39  12.Nxc6 bxc6 13.Qe1 Rb8 14.b3 d5
    147610   6:      0.5    +0.39  12.Nxc6 bxc6 13.Qe1 Rb8 14.b3 d5
    181949   7       0.5    +0.29  12.Nxc6 bxc6 13.Qe1 c5 14.Qg3 Bb7
                                   15.Bd3
    566885   7       1.4    +0.34  12.Nb3 b6 13.Nd2 d5 14.e5 d4 15.exf6
                                   dxe3 16.fxe7 exd2 17.Qxd2 Qxe7
    584205   7:      1.4    +0.34  12.Nb3 b6 13.Nd2 d5 14.e5 d4 15.exf6
                                   dxe3 16.fxe7 exd2 17.Qxd2 Qxe7
    734304   8       1.8    +0.40  12.Nb3 b6 13.Qe1 Nb4 14.Nd4 e5 15.fxe5
                                   dxe5
    948431   8:      2.2    +0.40  12.Nb3 b6 13.Qe1 Nb4 14.Nd4 e5 15.fxe5
                                   dxe5
   1255373   9       2.9    +0.37  12.Nb3 b6 13.Nd2 d5 14.e5 d4 15.exf6
                                   dxe3 16.fxe7 exd2 17.Ne4
   2129662   9:      4.5    +0.37  12.Nb3 b6 13.Nd2 d5 14.e5 d4 15.exf6
                                   dxe3 16.fxe7 exd2 17.Ne4
   4005124  10       8.0    +0.25  12.Nb3 d5 13.e5 Ne4 14.Bd3 Nxc3 15.bxc3
                                   Bd7 16.Qh5 g6 17.Qh3
   5323484  10      10.4    +0.32  12.Nxc6 bxc6 13.Qd3 e5 14.a5 Rb8 15.b3
                                   Rb4 16.Bb6 Rxb6 17.axb6 Qxb6
   8424159  10:     15.8    +0.32  12.Nxc6 bxc6 13.Qd3 e5 14.a5 Rb8 15.b3
                                   Rb4 16.Bb6 Rxb6 17.axb6 Qxb6
  10654679  11      19.6    +0.29  12.Nxc6 bxc6 13.Qd3 e5 14.a5 Rb8 15.b3
                                   exf4 16.Bxf4 Rb4 17.h3
  26211062  11:     45.8    +0.29  12.Nxc6 bxc6 13.Qd3 e5 14.a5 Rb8 15.b3
                                   exf4 16.Bxf4 Rb4 17.h3
With a little more time, the move selection is more reasonable (not the PV though...)

Miguel

Code:

  10654679  11      19.6    +0.29  12.Nxc6 bxc6 13.Qd3 e5 14.a5 Rb8 15.b3
                                   exf4 16.Bxf4 Rb4 17.h3
  26211062  11:     45.8    +0.29  12.Nxc6 bxc6 13.Qd3 e5 14.a5 Rb8 15.b3
                                   exf4 16.Bxf4 Rb4 17.h3
  37302458  12      64.8    +0.24  12.Nxc6 bxc6 13.Qd2 e5 14.a5 Rb8 15.b3
                                   Ng4 16.Bb6 Rxb6 17.axb6 Qxb6
  62507966  12     110.6    +0.25  12.Nb3 b6 13.Qe1 Nb4 14.Qf2 Nd7 15.Nd4
                                   Bb7 16.Qf3 e5 17.fxe5 Nxe5
  98707521  12     172.0    +0.30  12.Bf3 Bd7 13.Nxc6 Bxc6 14.Qd3 Kh8
                                   15.b3 Rg8 16.f5 Rgf8 17.Qc4 Qa5 18.fxe6
                                   fxe6
 105227111  12:    183.0    +0.30  12.Bf3 Bd7 13.Nxc6 Bxc6 14.Qd3 Kh8
                                   15.b3 Rg8 16.f5 Rgf8 17.Qc4 Qa5 18.fxe6
                                   fxe6
 196954190  13     341.8    +0.33  12.Bf3 Na5 13.Qd3 Nc4 14.Bc1 e5 15.Nf5
                                   exf4 16.Nxe7+ Rxe7 17.Bxf4 Bg4 18.b3
                                   Bxf3 19.gxf3
 242867629  13:    416.9    +0.33  12.Bf3 Na5 13.Qd3 Nc4 14.Bc1 e5 15.Nf5
                                   exf4 16.Nxe7+ Rxe7 17.Bxf4 Bg4 18.b3
                                   Bxf3 19.gxf3
 428994936  14     719.2    +0.32  12.Bf3 Bd7 13.Nxc6 Bxc6 14.Qd3 Kh8
                                   15.Bd4 Rf8 16.b3 Qa5 17.Be2 Rg8 18.f5
                                   Nd7 19.fxe6 fxe6

mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: Crafty and Stockfish question

Post by mcostalba »

lkaufman wrote:My point is that a clean pawn up translates to a certain winning percentage, maybe 75% or so (it varies a bit with time limit and engine).
And why should it be like that, in your opinion?

Knowing how the evaluation is tuned, namely according to whether a change makes the engine win more than it loses, I see absolutely no direct relation between evaluation score and winning percentage. Perhaps there is one, but it is not immediate to figure out, nor is it trivial to demonstrate that such a relation exists.

Intuitively, but only intuitively, because I have no evidence of a fixed link between evaluation score and winning percentage, I could argue that if this relation does exist it would be found, if anywhere, in endgame / late middlegame evaluations, and would be more difficult to find in middlegame or even opening positions.

But anyhow, my opinion is that the evaluation score is just a tool to let the engine choose a move from a given set, no more than that. It is designed for this goal, and all the changes are aimed at improving it, not something else, and certainly not at helping humans understand positions; that, if anything, could be just a nice side effect, but it is not the reason the evaluation exists and is tuned.
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Crafty and Stockfish question

Post by michiguel »

mcostalba wrote:
lkaufman wrote:My point is that a clean pawn up translates to a certain winning percentage, maybe 75% or so (it varies a bit with time limit and engine).
And why should it be like that, in your opinion?
http://chessprogramming.wikispaces.com/ ... e,+and+ELO

Miguel

Knowing how the evaluation is tuned, namely according to whether a change makes the engine win more than it loses, I see absolutely no direct relation between evaluation score and winning percentage. Perhaps there is one, but it is not immediate to figure out, nor is it trivial to demonstrate that such a relation exists.

Intuitively, but only intuitively, because I have no evidence of a fixed link between evaluation score and winning percentage, I could argue that if this relation does exist it would be found, if anywhere, in endgame / late middlegame evaluations, and would be more difficult to find in middlegame or even opening positions.

But anyhow, my opinion is that the evaluation score is just a tool to let the engine choose a move from a given set, no more than that. It is designed for this goal, and all the changes are aimed at improving it, not something else, and certainly not at helping humans understand positions; that, if anything, could be just a nice side effect, but it is not the reason the evaluation exists and is tuned.
lkaufman
Posts: 6258
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA
Full name: Larry Kaufman

Re: Crafty and Stockfish question

Post by lkaufman »

This reference is amusing to me, because it mimics the study I did back in 1999 on a database of 2300+ human games. I also concluded that the data implied that an extra pawn was on average worth 100 Elo (!). However, I went on to explain that very often the player who lost a pawn did so on purpose, or at least got compensation, and I suggested it might be a good guess that on average the pawn-down side had 50% compensation, which would make the true value of an uncompensated extra pawn 200 Elo points, equal to roughly 75%. Of course the precise percentage depends on the level of play. Since it is generally believed that a clean extra pawn in the opening (no compensation) should be a winning advantage, this percentage should rise with longer time limits and/or stronger players, eventually reaching 100% if we can be sure of "no compensation". In fact my testing confirms that this percentage does indeed rise with depth. My 75% figure is roughly correct for human masters in tournament chess and for strong engines in bullet chess; maybe 80% is about right for serious engine play.
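
For anyone checking the arithmetic, the 100 and 200 Elo figures drop straight out of the standard Elo expectancy formula; this little program (a sketch, not anyone's engine code) prints 64.0% and 76.0%:

Code:

#include <stdio.h>
#include <math.h>

/* Standard Elo expectancy: E = 1 / (1 + 10^(-diff/400)). */
static double expected_score(double elo_diff)
{
    return 1.0 / (1.0 + pow(10.0, -elo_diff / 400.0));
}

int main(void)
{
    printf("+100 Elo -> %.1f%%\n", 100.0 * expected_score(100.0));
    printf("+200 Elo -> %.1f%%\n", 100.0 * expected_score(200.0));
    return 0;
}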
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Crafty and Stockfish question

Post by bob »

lkaufman wrote:
bob wrote:
lkaufman wrote:My "proof" that my human evaluation (and that of all GMs) is the right one is that even in engine databases, the openings score more or less similarly to the results in human play (in most cases). Thus you would never find 75% scores for White in any database of Sicilian games, which might be expected from the huge evals in Crafty and Stockfish. I also try randomized playouts with Rybka from major openings and usually get similar results to human databases, i.e. scores around 55% for White.

As for earlier starting positions, what do you think about randomized moves for the first N ply, filtering out all positions unbalanced by more than X?
Aha, now we get to the bottom of the "semantic" war here. :)

What you are doing, and incorrectly, is assuming that the scores match up proportionally to winning or losing. But if you think of the score as something different entirely, it becomes less muddy. A program thinks in terms of "best move" and has to assign some numeric value so that minimax works. But nothing says that +1.5 means "winning advantage". To see this, suppose I multiplied all scores by 100, including material. You would then see truly huge scores with no greater probability of winning than before. The scores help the program choose between moves. I'd like to have scores between -1.0 and +1.0, where -1.0 is absolutely lost and +1.0 is absolutely won. But that would not necessarily help the thing play one bit better. It would be a more useful number for humans to see, I agree. But that would be all.

Comparing scores between engines is very much like comparing depths, or branching factors, etc. And in the end, the only thing that really matters is who wins and who loses...

I think we are still miscommunicating. My point is that a clean pawn up translates to a certain winning percentage, maybe 75% or so (it varies a bit with time limit and engine). So if you report a score of +0.75, this implies a winning percentage that can be calculated from the Elo tables based on the winning percentage for plus one pawn. Multiplying the scores by a constant would change nothing. So if you report a score of +0.75 for the position after 1. e4 c5, something is very wrong if you also report a score of +1.00 for an extra pawn in the opening with no compensation, as I assume is roughly true.
To a human, +1.00 might mean a win xx% of the time. I don't believe it does to most computer programs, however. There is probably some correlation between the score and being the best move, but if you look at most programs, a score of +2.0 can mean several things: two pawns up; a piece up but a wrecked position; one side with a very weak king position but even material; etc. As a human I don't find myself thinking like that. I am acutely aware of material, and am willing to make concessions to obtain an attack or create a target. But there is never any doubt in my mind that I am down a pawn and up some compensation. A computer doesn't really separate those, which, IMHO, leads to the current "somewhat offbeat scores" that we are talking about.
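
If one did want to publish a human-friendly number, the conversion being debated might look like this sketch. It assumes Larry's figure of roughly 200 Elo for a clean pawn, and the pawn-to-Elo slope is left as a parameter, since it would have to be calibrated separately for each engine:

Code:

#include <stdio.h>
#include <math.h>

/* Map a centipawn score to a winning percentage via Elo.  The
   elo_per_pawn slope is engine-specific and purely illustrative. */
static double score_to_winpct(double centipawns, double elo_per_pawn)
{
    double elo = elo_per_pawn * centipawns / 100.0;
    return 100.0 / (1.0 + pow(10.0, -elo / 400.0));
}

int main(void)
{
    printf("+0.30 -> %.0f%%\n", score_to_winpct( 30.0, 200.0)); /* ~59% */
    printf("+0.56 -> %.0f%%\n", score_to_winpct( 56.0, 200.0)); /* ~66% */
    printf("+1.00 -> %.0f%%\n", score_to_winpct(100.0, 200.0)); /* ~76% */
    return 0;
}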
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Crafty and Stockfish question

Post by bob »

michiguel wrote:
mcostalba wrote:
lkaufman wrote:My point is that a clean pawn up translates to a certain winning percentage, maybe 75% or so (it varies a bit with time limit and engine).
And why should it be like that, in your opinion?
http://chessprogramming.wikispaces.com/ ... e,+and+ELO

Miguel

Knowing how the evaluation is tuned, namely according to whether a change makes the engine win more than it loses, I see absolutely no direct relation between evaluation score and winning percentage. Perhaps there is one, but it is not immediate to figure out, nor is it trivial to demonstrate that such a relation exists.

Intuitively, but only intuitively, because I have no evidence of a fixed link between evaluation score and winning percentage, I could argue that if this relation does exist it would be found, if anywhere, in endgame / late middlegame evaluations, and would be more difficult to find in middlegame or even opening positions.

But anyhow, my opinion is that the evaluation score is just a tool to let the engine choose a move from a given set, no more than that. It is designed for this goal, and all the changes are aimed at improving it, not something else, and certainly not at helping humans understand positions; that, if anything, could be just a nice side effect, but it is not the reason the evaluation exists and is tuned.
That's interesting, but totally worthless in this context. What do you do when program A says +2.1, program B says +1.2, and program C says +0.3, all for the same position? Trying to have a single "unified field theory" to translate score to winning percentage for all chess engines is as elusive as Einstein's goal was/is. You could certainly do one for each distinct program, but a unified "one size fits all" formula? I doubt it is possible, much less practical. Ever seen a game where computers are playing both sides, and both sides think they are winning? :)
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Crafty and Stockfish question

Post by michiguel »

bob wrote:
michiguel wrote:
mcostalba wrote:
lkaufman wrote:My point is that a clean pawn up translates to a certain winning percentage, maybe 75% or so (it varies a bit with time limit and engine).
And why should it be like that, in your opinion?
http://chessprogramming.wikispaces.com/ ... e,+and+ELO

Miguel

Knowing how the evaluation is tuned, namely according to whether a change makes the engine win more than it loses, I see absolutely no direct relation between evaluation score and winning percentage. Perhaps there is one, but it is not immediate to figure out, nor is it trivial to demonstrate that such a relation exists.

Intuitively, but only intuitively, because I have no evidence of a fixed link between evaluation score and winning percentage, I could argue that if this relation does exist it would be found, if anywhere, in endgame / late middlegame evaluations, and would be more difficult to find in middlegame or even opening positions.

But anyhow, my opinion is that the evaluation score is just a tool to let the engine choose a move from a given set, no more than that. It is designed for this goal, and all the changes are aimed at improving it, not something else, and certainly not at helping humans understand positions; that, if anything, could be just a nice side effect, but it is not the reason the evaluation exists and is tuned.
That's interesting, but totally worthless in this context. What do you do when program A says +2.1, program B says +1.2, and program C says +0.3, all for the same position? Trying to have a single "unified field theory" to translate score to winning percentage for all chess engines is as elusive as Einstein's goal was/is. You could certainly do one for each distinct program, but a unified "one size fits all" formula? I doubt it is possible, much less practical. Ever seen a game where computers are playing both sides, and both sides think they are winning? :)
That is not the point of the link I provided. It does not deal with scores; it tries to relate material to winning percentage. I thought that was the question MC asked LK.

Miguel