Eval Dilemma

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Edsel Apostol
Posts: 803
Joined: Mon Jul 17, 2006 5:53 am
Full name: Edsel Apostol

Eval Dilemma

Post by Edsel Apostol »

I guess some of you may have encountered this. It's somewhat annoying. I'm currently in the process of trying out some new things on my eval function. Let's say I have an old eval feature I'm going to denote as F1 and a new implementation of this eval feature as F2.

I have tested F1 against a set of opponents using a set of test positions in a blitz tournament.

I then replaced F1 with F2, but by some twist of fate I accidentally enabled F2 for the white side only and F1 for the black side. I tested it and it scored way higher than F1 under the same test conditions. I said to myself, the new implementation works well, but then when I reviewed the code I found out that it was not implemented as I intended.

I then fixed the asymmetry bug and went on to implement the correct F2 feature. To my surprise it scored between the F1 result and that of the accidental F1/F2 combination. Note that I have not tried F2 for white and F1 for black to see if it still performs well.

Now here's my dilemma: if you were in my place, would you keep the bug that performs well, or implement the correct feature that doesn't perform as well?
Gian-Carlo Pascutto
Posts: 1243
Joined: Sat Dec 13, 2008 7:00 pm

Re: Eval Dilemma

Post by Gian-Carlo Pascutto »

First thing to check in a case like this: are the test results really statistically significant?
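A quick way to gauge this is to put an error bar on the match score. Here is a minimal Python sketch that treats each game as a win/loss trial, a simplification that ignores draws and so, if anything, overstates the margin:

```python
import math

def score_confidence_interval(score, n_games, z=1.96):
    """Approximate 95% confidence interval for a match score.

    Models each game as a Bernoulli trial with success probability
    `score`.  Draws only shrink the per-game variance, so this
    interval is a conservative (wide) bound.
    """
    se = math.sqrt(score * (1.0 - score) / n_games)
    return score - z * se, score + z * se

# A few hundred blitz games leave a wide margin:
lo, hi = score_confidence_interval(0.30, 240)
print(f"{lo:.3f} .. {hi:.3f}")   # 0.242 .. 0.358
```

With 240 games and a score near 30%, anything inside roughly a +/-6% band is indistinguishable from noise.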
Edsel Apostol
Posts: 803
Joined: Mon Jul 17, 2006 5:53 am
Full name: Edsel Apostol

Re: Eval Dilemma

Post by Edsel Apostol »

Gian-Carlo Pascutto wrote:First thing to check in a case like this: are the test results really statistically significant?
I will post here whether the re-run of the test is consistent. The test is just a few hundred games, so it might not be very accurate.
Edmund
Posts: 670
Joined: Mon Dec 03, 2007 3:01 pm
Location: Barcelona, Spain

Re: Eval Dilemma

Post by Edmund »

I don't know about your specific example. Generally, an engine might well benefit from certain asymmetric evaluation terms. But what I found with Glass when I played around with them is that you have to be very careful about other things when you use them. For example, you get search inconsistencies when you first analyze a position with white to move and then the next position with black to move (after white's move).
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: Eval Dilemma

Post by mcostalba »

Gian-Carlo Pascutto wrote:First thing to check in a case like this: are the test results really statistically significant?
I would second this.

Redo the tests with a higher number of games until you are sure the result is reliable.

If you still get the same result, then you have learned something new :-) I would try to understand *why* it is better this way; perhaps in doing so you'll find new and better ways to exploit this property.

In any case, engine strength is the only metric to me.
Edsel Apostol
Posts: 803
Joined: Mon Jul 17, 2006 5:53 am
Full name: Edsel Apostol

Re: Eval Dilemma

Post by Edsel Apostol »

Codeman wrote:I don't know about your specific example. Generally it might well be that an engine benefits from certain asymmetric evaluation functions. But what I encountered with Glass when I played around with them was that you have to be very careful with other things when you use them. For example you get search inconsistencies when you first analyze a position with white to move and then the next position black to move (after white's move)
I don't think it would result in search inconsistencies, but I might be wrong. The eval will return the same score for a position whether it is white or black to move.

What I mean by eval asymmetry is that when you flip the position, its eval differs from that of the original position.

The feature I'm referring to here is backward pawns: F1 is the original implementation, F2 is the new one, and I had F1 set for white and F2 for black. Suppose both sides have a backward pawn by F1's definition, but F2's definition doesn't consider it backward. In that case only white gets the penalty while black is not penalized.
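The mismatch can be shown with a toy sketch. The definitions, squares, and penalty value below are hypothetical stand-ins, not the actual engine code:

```python
# Toy illustration: F1 is a looser backward-pawn definition than F2,
# so a pawn F1 flags as backward may not be flagged by F2.  Applying
# F1 to white and F2 to black then penalizes only white for the same
# pawn structure.

F1_BACKWARD = {"d4"}   # F1 flags this pawn as backward (hypothetical)
F2_BACKWARD = set()    # F2's stricter rule flags nothing here
PENALTY = 15           # centipawns (made-up value)

def eval_asymmetric(white_pawn, black_pawn):
    """Score from white's point of view with mismatched definitions."""
    score = 0
    if white_pawn in F1_BACKWARD:    # white judged by F1
        score -= PENALTY
    if black_pawn in F2_BACKWARD:    # black judged by F2
        score += PENALTY
    return score

# Structurally identical backward pawn for both sides, yet the
# score is not zero -- only white is penalized:
print(eval_asymmetric("d4", "d4"))   # -15
```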

One more thing I've noticed: the version with the bug scored better against Rybka 1.2f x64 than the bugfixed one.
User avatar
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Eval Dilemma

Post by hgm »

I think you still ignore the most important point: how much better did it do (percentage-wise), and over how many games?
Edsel Apostol
Posts: 803
Joined: Mon Jul 17, 2006 5:53 am
Full name: Edsel Apostol

Re: Eval Dilemma

Post by Edsel Apostol »

hgm wrote:I think you still ignore the most important point: how much better did it do (percentage-wise), and over how many games?
The result is 31.0417% with the bug, 26.25% without the bug.

The opponents are Rybka1.2f 64 bit, Naum 3 64 bit, Thinker 5.4ai 64 bit and HIARCS11MP. My engine here is only in 32 bit. The opening position is Noomen Test Suite 2008. Time control is blitz 1'+1". GUI is Arena, and the number of games is only 240 for each version.

I know it's too few games, but I would expect the bugfixed version to at least be stronger.

Note that the version with the bug performed well against Rybka and HIARCS, resulting in a higher score.
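Plugging the reported numbers into a two-proportion z-test supports the significance worry. This is only a rough check: it treats games as independent win/loss trials, which overstates the variance when there are draws, making it conservative:

```python
import math

# Reported scores: 31.04% with the bug vs 26.25% without,
# each over 240 games.
p1, p2, n = 0.310417, 0.2625, 240

# Standard error of the difference between two proportions.
se = math.sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)
z = (p1 - p2) / se
print(f"z = {z:.2f}")   # z = 1.16, well below 1.96: not significant at 95%
```

In other words, a 4.8% gap over 240 games per version is comfortably within the noise band.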
CRoberson
Posts: 2055
Joined: Mon Mar 13, 2006 2:31 am
Location: North Carolina, USA

Re: Eval Dilemma

Post by CRoberson »

There exists an obvious possibility here: BOTH F1 and F2 have bugs!

If F1 has bugs in how it handles black, and F2 has bugs in how it handles white, then you will get exactly the results you are getting now.

In using F1 for white and F2 for black, you have eliminated the chance for F1's black bugs to reveal themselves, and likewise for F2's white bugs.
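One way to surface exactly this kind of color-dependent bug is a mirror self-test: if the evaluation is meant to be symmetric, the score of any position from white's point of view must equal minus the score of the color-flipped position. A sketch, with a deliberately buggy toy eval standing in for hypothetical `evaluate()` and `mirror()` helpers:

```python
def check_eval_symmetry(evaluate, mirror, positions):
    """Return the positions where eval(p) != -eval(mirror(p))."""
    return [p for p in positions if evaluate(p) != -evaluate(mirror(p))]

# Tiny demonstration.  A "position" here is just a
# (white_material, black_material) pair.
def toy_eval(pos):
    w, b = pos
    return w - b + 1      # the "+1" is a color-dependent bug

def toy_mirror(pos):
    w, b = pos
    return (b, w)         # swap the colors

positions = [(3, 3), (5, 2), (0, 9)]
print(check_eval_symmetry(toy_eval, toy_mirror, positions))
# every position fails the symmetry check, exposing the "+1" bug
```

Running this over a large set of random positions after every eval change catches asymmetry bugs in both the old and the new term before any games are played.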
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: Eval Dilemma

Post by mcostalba »

Edsel Apostol wrote: I know it's too few games, but I would expect the bugfixed version to at least be stronger.
If the bug is small, if it weighs only a few Elo points of difference, then with only 240 games it is very easy to completely lose it in the background noise.

In my experience at 1+0 time control, patches that looked good up to 500-600 games, even by a good margin of 10-15 Elo, have completely dropped off or even reversed after 1000 games.

In my limited experience in this world, I have come to believe that testing accounts for 90% of the final engine quality, especially with strong engines.

Unfortunately, testing a chess engine is an art in its own right... you need A LOT of patience :-)
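As a rough back-of-the-envelope check of why 1000+ games are needed, one can estimate how many games it takes for a given Elo difference to clear two standard errors. The per-game score deviation of 0.4 is an assumption (a typical value once draws are counted; tune it for your draw rate):

```python
import math

def games_to_resolve(elo_diff, sigma_per_game=0.4, z=1.96):
    """Rough number of games for an Elo difference to clear the noise.

    Converts the Elo difference to an expected score margin via the
    logistic model, then asks when z standard errors fit inside it.
    sigma_per_game ~ 0.4 is an assumed per-game score deviation.
    """
    margin = 1.0 / (1.0 + 10 ** (-elo_diff / 400.0)) - 0.5
    return math.ceil((z * sigma_per_game / margin) ** 2)

print(games_to_resolve(15))   # on the order of 1300 games
```

Which is consistent with 10-15 Elo patches flipping sign somewhere past the 1000-game mark.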