A bizarre evaluation.

lkaufman
Posts: 3724
Joined: Sun Jan 10, 2010 5:15 am
Location: Maryland USA

A bizarre evaluation.

Post by lkaufman » Sun Mar 20, 2016 3:01 am

Consider the standard Gruenfeld Defense variation: 1.d4 Nf6 2.c4 g6 3.Nc3 d5 4.cxd5 Nxd5 5.e4 Nxc3 6.bxc3 Bg7. This is a major line, so theory's evaluation (and results in practical play) shows White with just his normal slight opening advantage, in the 0.15 to 0.20 range. Normal moves are 7.Nf3, 7.Bc4, and 7.Be3.
Now have both Komodo and Stockfish (any recent versions) do a one ply search from here. Both engines play the unusual 7.Qb3, which is understandable as it "threatens" some checks and the f7 square near the king. But both engines give ridiculous scores of around two pawns or more advantage to White, implying that the position is totally winning for White, even though in fact White has probably thrown away his opening advantage already! How can the two strongest engines in the world misevaluate this position by two pawns or more? It is incredible to me.
Of course, I know that either Komodo or Stockfish could "fix" this position by drastically lowering king safety weights, but this would lose a ton of Elo. Is it really true that we need absurd evaluations like this to achieve strongest play, and if so, why should this be? I have my own ideas about this, but I'd rather hear what others have to say about it first. Perhaps there is some way to have plausible evals without losing much Elo, but we sure haven't found it yet. Note that the problem applies to both engines, which are quite different, so it is not due to some program bug or anything like that.
Komodo rules!

lucasart
Posts: 3040
Joined: Mon May 31, 2010 11:29 am
Full name: lucasart

Re: A bizarre evaluation.

Post by lucasart » Sun Mar 20, 2016 3:17 am

There is no problem. So there is nothing to fix. Searching 1 ply gives notoriously stupid results.

To get any kind of decent evaluation with SF at fixed depth, you need to disable PV node pruning (Steps 7 and 13 in search) and you should use depth = 10 at least.
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.

kgburcham
Posts: 2016
Joined: Sun Feb 17, 2008 3:19 pm

Re: A bizarre evaluation.

Post by kgburcham » Sun Mar 20, 2016 3:40 am

I don't understand the concern. Who plays one ply games?


[D] rnbqk2r/ppp1ppbp/6p1/8/3PP3/2P5/P4PPP/R1BQKBNR w KQkq -

Engine: Komodo 9.4 64-bit (8192 MB)
by Don Dailey, Larry Kaufman, Mark Lefler
Threads now set to 8

26.01 0:15 +0.34 7.Bc4 O-O 8.Be3 c5 9.Ne2 Qc7 10.Rc1 Nc6 11.O-O Na5 12.Bd3 Bd7 13.Qd2 b6 14.Bh6 f6 15.Ng3 Bxh6 16.Qxh6 e5 17.f4 exf4 18.Ne2 Qd6 19.Qxf4 Qxf4 20.Nxf4 (222.041.713) 13918
27.01 0:23 +0.36 7.Bc4 O-O 8.Be3 c5 9.Ne2 Qc7 10.Rc1 Nc6 11.O-O Na5 12.Bd3 b6 13.d5 c4 14.Bc2 Bg4 15.f3 Bd7 16.f4 Rad8 17.e5 Bg4 18.h3 Bxe2 19.Qxe2 Rxd5 20.Be4 (315.090.628) 13653
28.01 0:30 +0.29-- 7.Bc4 O-O (415.403.187) 13529
28.01 1:30 +0.36++ 7.Bc4 O-O 8.Be3 (1.253.308.335) 13882
28.01 1:34 +0.29-- 7.Bc4 O-O (1.305.877.654) 13850
28.01 1:40 +0.29 7.Bc4 O-O 8.Be3 e5 9.Nf3 Nc6 10.O-O exd4 11.cxd4 Bg4 12.e5 Qd7 13.Rc1 Rad8 14.Bb5 Be6 15.Bxc6 bxc6 16.Ng5 Bd5 17.f3 f6 18.exf6 Bxf6 19.Qc2 Rfe8 20.Ne4 (1.393.721.865) 13809
29.01 1:51 +0.26 7.Bc4 O-O 8.Be3 e5 9.Nf3 Nc6 10.d5 Na5 11.Bb3 b6 12.O-O Qd6 13.Rc1 f5 14.c4 Nb7 15.Ba4 f4 16.Bd2 a5 17.Bc6 Rb8 18.h3 Nc5 19.Qc2 Ba6 20.Bc3 (1.539.086.668) 13776
30.01 2:08 +0.30 7.Bc4 O-O 8.Be3 e5 9.Nf3 Nc6 10.d5 Na5 11.Bb3 b6 12.O-O Ba6 13.Re1 Bc4 14.Bxc4 Nxc4 15.Qb3 Nxe3 16.Rxe3 Qd6 17.h3 a6 18.c4 Rfe8 19.Rb1 f6 20.Ree1 (1.761.395.516) 13752
31.01 2:43 +0.32 7.Bc4 O-O 8.Be3 e5 9.Nf3 Nc6 10.d5 Na5 11.Bb3 b6 12.O-O Qd6 13.Re1 Nb7 14.Nd2 Nc5 15.Nc4 Qe7 16.d6 cxd6 17.Qxd6 Re8 18.Bxc5 bxc5 19.Rab1 Be6 20.Qxe7 (2.243.589.936) 13718
32.01 3:42 +0.33 7.Bc4 O-O 8.Be3 e5 9.Nf3 Nc6 10.d5 Na5 11.Bb3 b6 12.O-O Nb7 13.Re1 Bd7 14.Rc1 Qe7 15.h3 Nc5 16.Bc2 f6 17.Qe2 Rad8 18.Rb1 a6 19.Qd1 a5 20.Kh1 (3.062.826.194) 13744
33.01 5:09 +0.26-- 7.Bc4 O-O (4.259.957.431) 13779


Engine: Stockfish 7Beta1 64 BMI2 (8192 MB)
by T. Romstad, M. Costalba, J. Kiiski, G.

27/40 0:48 +0.24 7.Be3 c5 8.Nf3 Qa5 9.Qd2 O-O 10.Rc1 cxd4 11.cxd4 Qxd2+ 12.Nxd2 e6 13.Nf3 Bd7 14.Bc4 Nc6 15.O-O Rfc8 16.h3 a6 17.d5 Nb4 18.dxe6 Bxe6 19.Bxe6 fxe6 20.Rxc8+ (710.263.288) 14732
28/40 1:08 +0.31++ 7.Bc4 (1.013.745.541) 14788
28/40 1:09 +0.28 7.Bc4 O-O 8.Ne2 Nc6 9.Bd3 Na5 10.O-O b6 11.Rb1 Bb7 12.Be3 Rc8 13.Qd2 c5 14.d5 e6 15.c4 Ba6 16.Qc2 Re8 17.Rfe1 Qh4 18.Nf4 Be5 19.g3 Qe7 20.Bd2 (1.027.508.252) 14787
29/40 1:13 +0.28 7.Bc4 O-O 8.Ne2 Nc6 9.Bd3 Na5 10.O-O b6 11.Rb1 Bb7 12.Be3 Rc8 13.Qd2 c5 14.d5 e6 15.c4 Ba6 16.Qc2 Re8 17.Rfe1 Qh4 18.Nf4 Be5 19.g3 Qe7 20.Bd2 (1.085.807.430) 14794
30/40 1:26 +0.26 7.Bc4 O-O 8.Ne2 Nc6 9.Bd3 Na5 10.O-O b6 11.Rb1 Bb7 12.Be3 Rc8 13.Qd2 c5 14.d5 e6 15.c4 Ba6 16.Qc2 Re8 17.Rfe1 Qh4 18.Nf4 Be5 19.g3 Qe7 20.Bd2 (1.282.041.741) 14821
31/45 2:22 +0.20 7.Be3 c5 8.Nf3 Qa5 9.Qd2 O-O 10.Rc1 cxd4 11.cxd4 Qxd2+ 12.Nxd2 e6 13.e5 Nc6 14.Ne4 Rd8 15.Rc4 Bd7 16.h4 Na5 17.Rc1 Rdc8 18.h5 Rxc1+ 19.Bxc1 Rc8 20.Bd2 (2.129.395.512) 14895
32/45 2:58 +0.27++ 7.Be3 (2.664.517.141) 14925
32/46 3:56 +0.20-- 7.Be3 c5 (3.546.003.916) 14973
32/46 5:49 +0.19 7.Be3 c5 8.Nf3 Qa5 9.Qd2 O-O 10.Rc1 Rd8 11.Be2 b6 12.d5 Ba6 13.c4 Qxd2+ 14.Nxd2 e6 15.O-O Nd7 16.Rfe1 Re8 17.g3 Rad8 18.Kg2 Bb7 19.h4 f5 20.Bg5 (5.237.809.140) 15000
no chess program was born totally from one mind. all chess programs have many ideas from many minds.

lkaufman
Posts: 3724
Joined: Sun Jan 10, 2010 5:15 am
Location: Maryland USA

Re: A bizarre evaluation.

Post by lkaufman » Sun Mar 20, 2016 6:00 am

The point is that the position after 7.Qb3 is given an absurd evaluation by both engines. Never mind what search leads to this position, shouldn't the evaluation be a reasonable one? Any human who evaluated such a position as easily winning for White would be called a moron. Why should such poor evaluation work for engines? Yet it does.
Komodo rules!

cdani
Posts: 2104
Joined: Sat Jan 18, 2014 9:24 am
Location: Andorra

Re: A bizarre evaluation.

Post by cdani » Sun Mar 20, 2016 6:26 am

lkaufman wrote: The point is that the position after 7.Qb3 is given an absurd evaluation by both engines. Never mind what search leads to this position, shouldn't the evaluation be a reasonable one? Any human who evaluated such a position as easily winning for White would be called a moron. Why should such poor evaluation work for engines? Yet it does.
One idea that comes to my mind is that an engine is a "function" of search + static evaluation, and at depth 1 you are basically looking only at the static evaluation.
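
A minimal sketch of that point, with made-up numbers: at depth 1 the engine simply picks the move whose resulting position has the best static score, so whatever the evaluation function says about the position after 7.Qb3 is what gets reported. The `static_eval` table and the helper `one_ply_search` are hypothetical, not real Komodo or Stockfish output, and real engines also run a quiescence search at the leaves.

```python
# Hypothetical static scores (in pawns, from White's point of view) for the
# positions reached after each candidate 7th move. Illustrative values only.
static_eval = {
    "7.Nf3": 0.30,
    "7.Bc4": 0.35,
    "7.Be3": 0.30,
    "7.Qb3": 2.00,  # assumed to be inflated by a large king-safety term
}


def one_ply_search(scores):
    """Return (best_move, score): the move whose resulting position has the
    highest static evaluation. No opponent reply is considered."""
    best_move = max(scores, key=scores.get)
    return best_move, scores[best_move]


move, score = one_ply_search(static_eval)
print(move, score)
```

With no reply from Black in the tree, nothing can correct the inflated number, which is why a one ply search exposes the raw evaluation so directly.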

But I agree with you that this evaluation is absurd. My view, which I have already expressed elsewhere in other words, is that this is an intrinsic weakness of current engines: their evaluation function is poor because it is very limited in parameters and algorithms. So I expect that in the coming years improvements will come mostly from the evaluation function rather than the search, because the latter already resolves most tactical issues for the best engines.

Rebel
Posts: 4706
Joined: Thu Aug 18, 2011 10:04 am

Re: A bizarre evaluation.

Post by Rebel » Sun Mar 20, 2016 7:16 am

kgburcham wrote: I don't understand the concern. Who plays one ply games?
Many programmers ;)

Just to check their eval and if it's reasonable.

The concern is the knowledge that similar positions with such crazy high scores are evaluated as leaves as well and will eventually influence how root moves are evaluated.

Eval is the heart of a chess program. Search is the tool that is fed with evaluation scores and via mini-max decides what is the best move.
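
A toy illustration of how leaf scores reach the root, assuming a plain minimax over a tiny hand-built tree (the numbers and the helper `best_root_move` are invented for the example, and real engines use alpha-beta with many refinements). When every reply to a root move leads to a position of the overrated type, the opponent cannot steer away from it and the inflated score decides the root choice.

```python
def minimax(node, maximizing):
    """Plain minimax over a nested-list game tree; numbers are leaf evals
    from White's point of view."""
    if isinstance(node, (int, float)):
        return node
    values = [minimax(child, not maximizing) for child in node]
    return max(values) if maximizing else min(values)


def best_root_move(tree):
    """Return (index of the move White would choose, list of root scores)."""
    scores = [minimax(child, False) for child in tree]
    return scores.index(max(scores)), scores


# Root has two moves; each is answered by two replies; leaves are static evals.
sane_tree = [[0.3, 0.4], [0.2, 0.25]]
# Same tree, but every leaf under the second move carries an inflated bonus.
inflated_tree = [[0.3, 0.4], [2.2, 2.25]]

print(best_root_move(sane_tree))
print(best_root_move(inflated_tree))
```

If only one of the two replies led to an inflated leaf, the minimizing side would avoid it and the root score would stay sane, so the damage depends on how widespread the overrated pattern is in the tree.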

lucasart
Posts: 3040
Joined: Mon May 31, 2010 11:29 am
Full name: lucasart

Re: A bizarre evaluation.

Post by lucasart » Sun Mar 20, 2016 8:10 am

lkaufman wrote: The point is that the position after 7.Qb3 is given an absurd evaluation by both engines. Never mind what search leads to this position, shouldn't the evaluation be a reasonable one? Any human who evaluated such a position as easily winning for White would be called a moron. Why should such poor evaluation work for engines? Yet it does.
I don't know what you're going on about. Even with no search, and just looking at the static eval, SF reports only +0.8 for White (including 0.06 for tempo). And this is perfectly normal, and can be explained by better space, center control etc. Of course, deeper search will show that the White advantage is smaller (as Black is given a chance by search to play development moves):


position fen rnbqk2r/ppp1ppbp/6p1/8/3PP3/2P5/P4PPP/R1BQKBNR w KQkq - 0 1
d  

 +---+---+---+---+---+---+---+---+
 | r | n | b | q | k |   |   | r |
 +---+---+---+---+---+---+---+---+
 | p | p | p |   | p | p | b | p |
 +---+---+---+---+---+---+---+---+
 |   |   |   |   |   |   | p |   |
 +---+---+---+---+---+---+---+---+
 |   |   |   |   |   |   |   |   |
 +---+---+---+---+---+---+---+---+
 |   |   |   | P | P |   |   |   |
 +---+---+---+---+---+---+---+---+
 |   |   | P |   |   |   |   |   |
 +---+---+---+---+---+---+---+---+
 | P |   |   |   |   | P | P | P |
 +---+---+---+---+---+---+---+---+
 | R |   | B | Q | K | B | N | R |
 +---+---+---+---+---+---+---+---+

Fen: rnbqk2r/ppp1ppbp/6p1/8/3PP3/2P5/P4PPP/R1BQKBNR w KQkq - 0 1
Key: 59157A21AC3BA42B
Checkers: 
eval
      Eval term |    White    |    Black    |    Total    
                |   MG    EG  |   MG    EG  |   MG    EG  
----------------+-------------+-------------+-------------
       Material |   ---   --- |   ---   --- |  0.16 -0.12 
      Imbalance |   ---   --- |   ---   --- |  0.00  0.00 
          Pawns |   ---   --- |   ---   --- |  0.10  0.01 
        Knights |  0.06  0.00 |  0.06  0.00 |  0.00  0.00 
         Bishop | -0.12 -0.32 | -0.09 -0.33 | -0.03  0.01 
          Rooks | -0.27  0.00 | -0.19  0.00 | -0.09  0.00 
         Queens |  0.00  0.00 |  0.00  0.00 |  0.00  0.00 
       Mobility |  0.41  0.61 |  0.11  0.20 |  0.29  0.41 
    King safety |  0.89 -0.06 |  0.78 -0.06 |  0.11  0.00 
        Threats |  0.00  0.00 |  0.00  0.00 |  0.00  0.00 
   Passed pawns |  0.00  0.00 |  0.00  0.00 |  0.00  0.00 
          Space |  0.33  0.00 |  0.13  0.00 |  0.20  0.00 
----------------+-------------+-------------+-------------
          Total |   ---   --- |   ---   --- |  0.74  0.34 

Total Evaluation: 0.80 (white side)
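
For readers unfamiliar with the MG/EG columns: engines of this kind typically blend the middlegame and endgame totals according to a game-phase value and then add the tempo bonus. The sketch below illustrates that interpolation under stated assumptions; it is not Stockfish's actual code, and the 0 to 128 phase scale and the function name `tapered` are chosen for the example. If the phase is taken to be fully middlegame, the trace's 0.74 plus the 0.06 tempo is consistent with the reported 0.80.

```python
def tapered(mg, eg, phase, max_phase=128):
    """Linear blend of middlegame and endgame scores.
    phase == max_phase means pure middlegame, phase == 0 pure endgame."""
    return (mg * phase + eg * (max_phase - phase)) / max_phase


MG_TOTAL, EG_TOTAL = 0.74, 0.34  # "Total" row of the trace above
TEMPO = 0.06                     # tempo bonus quoted in the post

# Assumption: the opening position counts as fully middlegame.
score = tapered(MG_TOTAL, EG_TOTAL, phase=128) + TEMPO
print(round(score, 2))
```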
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.

Rebel
Posts: 4706
Joined: Thu Aug 18, 2011 10:04 am

Re: A bizarre evaluation.

Post by Rebel » Sun Mar 20, 2016 8:18 am

Looks normal indeed.

hgm
Posts: 23718
Joined: Fri Mar 10, 2006 9:06 am
Location: Amsterdam
Full name: H G Muller

Re: A bizarre evaluation.

Post by hgm » Sun Mar 20, 2016 8:52 am

But this is a 0-ply search for the position, while Larry was complaining about the 1-ply result. For that it is rather immaterial what the evaluation of the position itself is.

The problem was diagnosed as the evaluation of the position after Qb3. Apparently this is highly overrated for White. One suspects that this is because of King Safety, and in particular the attack on f7, which is after all what distinguishes this move from others.

It is of course fundamentally wrong to evaluate KS in the normal way on an uncastled King that still has castling rights. Just like it is wrong to evaluate material (or passer bonuses) on an opponent Pawn that is hanging. The latter case is solved by QS, however, which makes the material and associated bonuses go away by capturing the Pawn. Here the KS penalty (which seems even larger than a Pawn) would go away by castling. But you don't search the castling in QS.

Lesson: evaluation in non-quiet positions can give nonsensical results. And when the side to move can play a move that greatly improves the position for him, the position is by definition non-quiet.
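
The contrast between the hanging Pawn and the uncastled King can be sketched with a toy captures-only quiescence search. Everything here is an assumption made for illustration: the `Pos` class, the scores, and the negamax form without alpha-beta; no real engine is structured this simply.

```python
from dataclasses import dataclass, field


@dataclass
class Pos:
    static: float                                 # eval for the side to move
    captures: list = field(default_factory=list)  # positions after captures
    quiets: list = field(default_factory=list)    # positions after quiet moves


def qsearch(pos):
    """Captures-only quiescence search in negamax form (no alpha-beta)."""
    best = pos.static  # stand pat
    for child in pos.captures:
        best = max(best, -qsearch(child))
    return best


def one_ply(pos):
    """One full-width ply (captures and quiet moves), then quiescence."""
    children = pos.captures + pos.quiets
    if not children:
        return qsearch(pos)
    return max(-qsearch(child) for child in children)


# Side to move is a Pawn down statically, but can capture a hanging Pawn.
hanging_pawn = Pos(static=-1.0, captures=[Pos(static=0.0)])
# Side to move carries a big KS penalty that a quiet move (castling) removes.
uncastled = Pos(static=-2.0, quiets=[Pos(static=0.2)])

print(qsearch(hanging_pawn))  # the capture repairs the score
print(qsearch(uncastled))     # castling is never tried, penalty stays
print(one_ply(uncastled))     # one real ply of search finds castling
```

The capture is visible to the quiescence search and so the material term corrects itself, while the castling move only appears once a full-width ply is searched.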

Dann Corbit
Posts: 10112
Joined: Wed Mar 08, 2006 7:57 pm
Location: Redmond, WA USA

Re: A bizarre evaluation.

Post by Dann Corbit » Sun Mar 20, 2016 8:55 am

Yes, it is easily solved by search.
But a human can evaluate it instantly.
Why can't the computer?

His point is that he would like evaluation to be smarter for positions like this one.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
