Eval Dilemma

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Eval Dilemma

Post by bob »

Edsel Apostol wrote:
Gian-Carlo Pascutto wrote:First thing to check in a case like this: are the test results really statistically significant?
I will post here if the re-run of the test gives consistent results. The test is just a few hundred games, so it might not be super accurate.
It is going to have a huge error bar, which can really be misleading when the changes are not that significant...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Eval Dilemma

Post by bob »

Edsel Apostol wrote:I guess some of you may have encountered this. It's somewhat annoying. I'm currently in the process of trying out some new things on my eval function. Let's say I have an old eval feature I'm going to denote as F1 and a new implementation of this eval feature as F2.

I have tested F1 against a set of opponents using a set of test positions in a blitz tournament.

I then replaced F1 with F2, but by some twist of fate I accidentally enabled F2 for the white side only and F1 for the black side. I tested it, and it scored way higher than F1 under the same test conditions. I said to myself, the new implementation works well, but when I reviewed the code I found out that it was not implemented as I intended.

I then fixed the asymmetry bug and went on to implement the correct F2 feature. To my surprise, its score fell between that of F1 and that of the F1/F2 combination. Note that I have not tried F2 for white and F1 for black to see if it still performs well.

Now here's my dilemma: if you were in my place, would you keep the bug that performs well, or implement the correct feature that doesn't perform as well?
I never accept bugs just because they are better. The idea is to understand what is going on, and _why_ the bug is making it play better (this is assuming it really is, which may well require a ton of games to verify) and go from there. Once you understand the "why" then you can probably come up with an implementation that is symmetric and still works well.
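Bob's point that verifying a small improvement "may well require a ton of games" can be made concrete. The sketch below is my own illustration, not Crafty's methodology; the `games_needed` helper, its default 35% draw fraction, and the z = 1.96 threshold are assumptions. It uses the standard Elo logistic model to estimate roughly how many games are needed to separate a given Elo difference from noise at ~95% confidence:

```python
import math

def expected_score(elo_diff: float) -> float:
    """Expected score under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def games_needed(elo_diff: float, draw_fraction: float = 0.35,
                 z: float = 1.96) -> int:
    """Rough game count to resolve `elo_diff` at ~95% confidence (z = 1.96).

    The per-game standard deviation of the score comes from a
    win/draw/loss model with the given draw fraction, near a 50% score.
    """
    delta = expected_score(elo_diff) - 0.5          # score gain to detect
    sigma = math.sqrt(0.25 - 0.25 * draw_fraction)  # per-game std dev
    return math.ceil((z * sigma / delta) ** 2)

# A 10-Elo change is only ~1.4% in score, so thousands of games are needed.
print(games_needed(10))
```

With a 35% draw rate, a 10-Elo improvement takes on the order of 3000 games to resolve, which is why a few hundred games cannot settle the F1-vs-F2 question.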
Gian-Carlo Pascutto
Posts: 1243
Joined: Sat Dec 13, 2008 7:00 pm

Re: Eval Dilemma

Post by Gian-Carlo Pascutto »

Edsel Apostol wrote: The opponents are Rybka1.2f 64 bit, Naum 3 64 bit, Thinker 5.4ai 64 bit and HIARCS11MP[...] the number of games is only 240 for each version.

I know it's too few games, but I would expect that the bug-fixed version should at least be stronger.

Note that the version with the bug performed well against Rybka and Hiarcs, resulting in a higher score.
240 games / 4 opponents = 60 games per program.

With 60 games per program it is completely meaningless to talk about "perform better against this or that".

Your problem is that you do not respect statistics.
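As a rough sketch of why 60 games per opponent tell you almost nothing, using the common rule of thumb of a ~0.40 per-game standard deviation near a 50% score (the helper name is my own):

```python
import math

def error_bar(games: int, sigma_per_game: float = 0.40) -> float:
    """One-sigma error bar on the score fraction; rule of thumb:
    per-game standard deviation ~0.40 near a 50% score."""
    return sigma_per_game / math.sqrt(games)

# 60 games per opponent gives roughly a 5-point (1-sigma) error bar,
# about 10 points at 95% confidence -- far larger than the effect
# of a typical eval change.
print(round(100 * error_bar(60), 1))      # 1-sigma, in percent
print(round(100 * 2 * error_bar(60), 1))  # ~95% confidence
```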
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Eval Dilemma

Post by diep »

Gian-Carlo Pascutto wrote:
Edsel Apostol wrote: The opponents are Rybka1.2f 64 bit, Naum 3 64 bit, Thinker 5.4ai 64 bit and HIARCS11MP[...] the number of games is only 240 for each version.

I know it's too few games, but I would expect that the bug-fixed version should at least be stronger.

Note that the version with the bug performed well against Rybka and Hiarcs, resulting in a higher score.
240 games / 4 opponents = 60 games per program.

With 60 games per program it is completely meaningless to talk about "perform better against this or that".

Your problem is that you do not respect statistics.
If I had posted something like this a few years ago (before 2004), Frans Morsch would have shipped me an email or told me during a tournament: "please don't tell them that; right now they stand no chance of getting a strong engine, let alone becoming one of my competitors, and competing with the current ones is already hard enough".

Vincent
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Eval Dilemma

Post by diep »

Gian-Carlo Pascutto wrote:
Edsel Apostol wrote: The opponents are Rybka1.2f 64 bit, Naum 3 64 bit, Thinker 5.4ai 64 bit and HIARCS11MP[...] the number of games is only 240 for each version.

I know it's too few games, but I would expect that the bug-fixed version should at least be stronger.

Note that the version with the bug performed well against Rybka and Hiarcs, resulting in a higher score.
240 games / 4 opponents = 60 games per program.

With 60 games per program it is completely meaningless to talk about "perform better against this or that".

Your problem is that you do not respect statistics.
Yeah, they'll soon figure out they need 20000 games for that, which is quite some CPU time. So in Diep I just keep adding patterns, and if it loses a game I print the patterns verbosely and have a look, as a chess player, at what on earth could have caused it.

Human intuition is the biggest energy saver of 'em all.

Imagine you'd make an evaluation function with 20k patterns and would need 20k games for each pattern you added.

Additionally, there is a huge difference between bullet/blitz and slower time controls. A few patzer moves work really well in bullet/blitz, but stand no chance of ever impressing the deep searchers at slower time controls. King safety in particular seems in many cases to be a temporary advantage once you get to 14+ plies (with LMR-type reductions that is, as you will realize, 21 plies).

Vincent
User avatar
Kirill Kryukov
Posts: 492
Joined: Sun Mar 19, 2006 4:12 am

Re: Eval Dilemma

Post by Kirill Kryukov »

Edsel Apostol wrote:
hgm wrote:I think you still ignore the most important point: how much better did it do (percentage-wise), and over how many games?
The result is 31.0417% with the bug, 26.25% without the bug.

The opponents are Rybka1.2f 64 bit, Naum 3 64 bit, Thinker 5.4ai 64 bit and HIARCS11MP. My engine here is only in 32 bit. The opening position is Noomen Test Suite 2008. Time control is blitz 1'+1". GUI is Arena, and the number of games is only 240 for each version.

I know it's too few games, but I would expect that the bug-fixed version should at least be stronger.

Note that the version with the bug performed well against Rybka and Hiarcs, resulting in a higher score.
Since you used a stronger set of opponents, it's possible that your best version is just the most defensive. You need an even mix of stronger and weaker opponents to notice a real improvement.

(Also, 240 games per version is too few, as others said.)
MattieShoes
Posts: 718
Joined: Fri Mar 20, 2009 8:59 pm

Re: Eval Dilemma

Post by MattieShoes »

Can you or anybody point me to how the error bars are calculated? I was browsing Wikipedia on hypothesis testing, and there's a whole bunch of different ways to do it depending on the situation; my statistics-fu is too weak to know which would be appropriate.

Links, or even the name of the hypothesis-testing method one would use to calculate them, would be great... :-)
krazyken

Re: Eval Dilemma

Post by krazyken »

If you are willing to assume that your results are normally distributed, the easiest method is a paired-samples t-test: compare the results of A with the results of A' from identical test matches to see whether they differ. Depending on the confidence level and the size of |A - A'|, the number of games needed can vary greatly.
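A minimal stdlib-only sketch of the paired t-test krazyken describes; the per-opponent score fractions below are made-up numbers for illustration, not real test results:

```python
import math
import statistics

def paired_t(scores_a, scores_b):
    """Paired-samples t statistic for matched score pairs.

    Returns (t, degrees_of_freedom); compare |t| against a t-table
    at the chosen confidence level to decide significance.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)   # sample std dev of the differences
    t = mean_d / (sd_d / math.sqrt(n))
    return t, n - 1

# Hypothetical score fractions of versions A and A' against the
# same four opponents under identical conditions.
a_scores  = [0.35, 0.30, 0.28, 0.31]
a2_scores = [0.30, 0.27, 0.25, 0.28]
t, df = paired_t(a_scores, a2_scores)
print(round(t, 2), df)
```

With only a few pairs (here df = 3) the t threshold is large, so only big, consistent differences register as significant.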
User avatar
hgm
Posts: 27809
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Eval Dilemma

Post by hgm »

MattieShoes wrote:Can you or anybody point me to how the error bars are calculated?
I think the rule-of-thumb Error = 40%/sqrt(numberOfGames) is accurate enough in practice, for scores in the 65%-35% range. (This is for the 1-sigma or 84% confidence level; for 95% confidence, double it.) For very unbalanced scores, you would have to take into account the fact that the 40% goes down; the exact formula for this is

100%*sqrt(score*(1-score) - 0.25*drawFraction)

where the score is given as a fraction. The 40% is based on 35% draws and a score of 0.5. In the case mentioned (score around 0.25, presumably 15% wins and 20% draws), you would get 100%*sqrt(0.25*0.75 - 0.25*0.2) = 37%. So in 240 games you would have a 2.4% error bar (1 sigma).

When comparing the results from two independent gauntlets, the error bar in the difference is the Pythagorean sum of the individual error bars (i.e. sqrt(error1^2 + error2^2)). For results with equal numbers of games, this means multiplying the individual error bars by sqrt(2).
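The formulas above can be checked with a short script (the function name is my own; the numbers are the ones from the post):

```python
import math

def error_bar(score: float, draw_fraction: float, games: int) -> float:
    """One-sigma error bar on the score fraction:
    per-game sigma = sqrt(score*(1 - score) - 0.25*drawFraction)."""
    sigma = math.sqrt(score * (1.0 - score) - 0.25 * draw_fraction)
    return sigma / math.sqrt(games)

# Score 0.5 with 35% draws: per-game sigma is about 0.40.
print(round(error_bar(0.5, 0.35, 1), 3))   # ~0.403

# Score ~0.25 with 20% draws over 240 games: ~2.4% (1 sigma).
print(round(100 * error_bar(0.25, 0.20, 240), 1))

# Difference of two equal-sized gauntlets: multiply by sqrt(2).
print(round(100 * math.sqrt(2) * error_bar(0.25, 0.20, 240), 1))
```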
Edsel Apostol
Posts: 803
Joined: Mon Jul 17, 2006 5:53 am
Full name: Edsel Apostol

Re: Eval Dilemma

Post by Edsel Apostol »

CRoberson wrote:There exists an obvious possibility here: BOTH F1 and F2 have bugs!

If F1 has bugs in how it handles black and F2 has bugs in how it handles white, then you will get exactly the results you are getting now.

By using F1 for white and F2 for black, you have eliminated the chance for F1's black bugs to reveal themselves, and the same for F2's white bugs.
I think I've implemented both correctly.

I tried disabling the feature and it scored 3 more points under the same test conditions. It seems the backward-pawn term is not that beneficial to my engine.

A question for Bob: have you determined how much Elo the backward-pawn evaluation is worth in Crafty?