Since I lack the resources to test them thoroughly, I mostly rely on intuition. Since this one is so counter-intuitive, I don't know what to decide. Well, I guess I will just have to choose the right implementation even if it seems to be weaker in my limited tests.

bob wrote: I never accept bugs just because they are better. The idea is to understand what is going on, and _why_ the bug is making it play better (this is assuming it really is, which may well require a ton of games to verify) and go from there. Once you understand the "why" then you can probably come up with an implementation that is symmetric and still works well.

Edsel Apostol wrote: I guess some of you may have encountered this. It's somewhat annoying. I'm currently in the process of trying out some new things in my eval function. Let's say I have an old eval feature I'll denote as F1 and a new implementation of this eval feature as F2.
I have tested F1 against a set of opponents using a set of test positions in a blitz tournament.
I then replaced F1 with F2, but by some twist of fate I accidentally enabled F2 for the white side only and F1 for the black side. I tested it and it scored way higher than F1 under the same test conditions. I said to myself that the new implementation works well, but when I reviewed the code I found out that it was not implemented as I intended.
I then fixed the asymmetry bug and went on to implement the correct F2 feature. To my surprise, it scored only between F1 and the F1/F2 combination. Note that I have not tried F2 for white and F1 for black to see if it still performs well.
Now here's my dilemma: if you were in my place, would you keep the bug that performs well, or implement the correct feature that doesn't perform as well?
Eval Dilemma
-
- Posts: 803
- Joined: Mon Jul 17, 2006 5:53 am
- Full name: Edsel Apostol
Re: Eval Dilemma
Edsel Apostol
https://github.com/ed-apostol/InvictusChess
Re: Eval Dilemma
I know it's too few games. I would have run tons of games if I had the resources, but I don't. I mostly rely on intuition to decide, from the small amount of data that I have, which version is better.

Gian-Carlo Pascutto wrote: 240 games / 4 opponents = 60 games per program.

Edsel Apostol wrote: The opponents are Rybka1.2f 64 bit, Naum 3 64 bit, Thinker 5.4ai 64 bit and HIARCS11MP [...] the number of games is only 240 for each version.
I know it's too few games, but I would expect that the bugfixed version should at least be stronger.
Note that the version with the bug performed well against Rybka and Hiarcs, resulting in a higher score.
With 60 games per program it is completely meaningless to talk about "perform better against this or that".
Your problem is that you do not respect statistics.
I actually have no PC of my own; those test results are run for me by my tester. At least I have achieved the current strength of my engine even though its development is limited. Just wait until I have my own PC and can test my ideas thoroughly; then it will trash those commercial engines.
Edsel Apostol
https://github.com/ed-apostol/InvictusChess
Re: Eval Dilemma
It seems like being wary that the newbies will catch up with the veterans. It's just a fact of life that old bulls get replaced by young ones.

diep wrote: If I had posted something like this a few years ago (before 2004), Frans Morsch would have shipped me an email or told me during a tournament: "please don't tell them that; right now they have no chance to get a strong engine, let alone become one of my competitors, and competing with the current ones is already hard enough".

Gian-Carlo Pascutto wrote: 240 games / 4 opponents = 60 games per program.

Edsel Apostol wrote: The opponents are Rybka1.2f 64 bit, Naum 3 64 bit, Thinker 5.4ai 64 bit and HIARCS11MP [...] the number of games is only 240 for each version.
I know it's too few games, but I would expect that the bugfixed version should at least be stronger.
Note that the version with the bug performed well against Rybka and Hiarcs, resulting in a higher score.
With 60 games per program it is completely meaningless to talk about "perform better against this or that".
Your problem is that you do not respect statistics.
Vincent
Edsel Apostol
https://github.com/ed-apostol/InvictusChess
Re: Eval Dilemma
Thanks Kirill. I think I do indeed need to add some opponents that are about as strong as my engine to see the real improvement.

Kirill Kryukov wrote: Since you used a stronger set of opponents, it's possible that your best version is just the most defensive. You need an even mix of stronger and weaker opponents to notice a real improvement.

Edsel Apostol wrote: The result is 31.0417% with the bug, 26.25% without the bug.

hgm wrote: I think you still ignore the most important point: how much better did it do (percentage-wise), and over how many games?
The opponents are Rybka1.2f 64 bit, Naum 3 64 bit, Thinker 5.4ai 64 bit and HIARCS11MP. My engine here is only 32-bit. The opening positions are from the Noomen Test Suite 2008. The time control is blitz 1'+1". The GUI is Arena, and the number of games is only 240 for each version.
I know it's too few games, but I would expect that the bugfixed version should at least be stronger.
Note that the version with the bug performed well against Rybka and Hiarcs, resulting in a higher score.
(Also, 240 games per version is too few, as others said.)
Edsel Apostol
https://github.com/ed-apostol/InvictusChess
-
- Posts: 1822
- Joined: Thu Mar 09, 2006 11:54 pm
- Location: The Netherlands
Re: Eval Dilemma
A score difference of 30% (35% ==> 65%) is so big that, if just 1 added pattern in your evaluation has such a huge impact, we can only hope it is not the 'average' pattern that you add.

hgm wrote: I think the rule-of-thumb Error = 40%/sqrt(numberOfGames) is accurate enough in practice, for scores in the 65%-35% range. (This is for the 1-sigma or 84% confidence level; for 95% confidence, double it.) For very unbalanced scores, you would have to take into account the fact that the 40% goes down; the exact formula for this is

MattieShoes wrote: Can you or anybody point me to how the error bars are calculated?
100%*sqrt(score*(1-score) - 0.25*drawFraction)
where the score is given as a fraction. The 40% is based on 35% draws and a score of 0.5. In the case mentioned (score around 0.25, presumably 15% wins and 20% draws), you would get 100%*sqrt(0.25*0.75 - 0.25*0.2) = 37%. So in 240 games you would have a 2.4% error bar (1 sigma).
When comparing the results from two independent gauntlets, the error bar of the difference is the Pythagorean sum of the individual error bars (i.e. sqrt(error1^2 + error2^2)). For results with equal numbers of games, this means multiplying the individual error bars by sqrt(2).
More likely, you will see a score difference of 1 point over 200 games when adding just 1 tiny pattern.
Vincent
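hgm's error-bar formula above is easy to try out in a few lines of Python. This is a minimal sketch of the arithmetic in the quoted post; the function names are mine, not from any engine's code:

```python
import math

def error_bar(score, draw_fraction, games):
    """1-sigma error bar (as a fraction) for a gauntlet result, using
    hgm's formula: sqrt(score*(1-score) - 0.25*drawFraction) / sqrt(games)."""
    per_game_sigma = math.sqrt(score * (1 - score) - 0.25 * draw_fraction)
    return per_game_sigma / math.sqrt(games)

def difference_error(err_a, err_b):
    """Error bar of the difference between two independent gauntlets:
    the Pythagorean sum of the individual error bars."""
    return math.sqrt(err_a ** 2 + err_b ** 2)

# hgm's worked example: score ~0.25, 20% draws, 240 games
print(round(100 * error_bar(0.25, 0.20, 240), 1))  # ~2.4% (1 sigma)
```

With score = 0.5 and 35% draws this reproduces the 40%/sqrt(n) rule of thumb.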
-
- Posts: 6401
- Joined: Thu Mar 09, 2006 8:30 pm
- Location: Chicago, Illinois, USA
Re: Eval Dilemma
He was not talking about increases from 35% to 65%. The formula is valid if both score A and score B (from versions A and B) are within 35% to 65%. In other words, if score A is 48% and score B is 52%, you can apply the formula. If score A is 8% and score B is 12%, you cannot.

diep wrote: A score difference of 30% (35% ==> 65%) is so big that, if just 1 added pattern in your evaluation has such a huge impact, we can only hope it is not the 'average' pattern that you add.

hgm wrote: I think the rule-of-thumb Error = 40%/sqrt(numberOfGames) is accurate enough in practice, for scores in the 65%-35% range. (This is for the 1-sigma or 84% confidence level; for 95% confidence, double it.) For very unbalanced scores, you would have to take into account the fact that the 40% goes down; the exact formula for this is

MattieShoes wrote: Can you or anybody point me to how the error bars are calculated?
100%*sqrt(score*(1-score) - 0.25*drawFraction)
where the score is given as a fraction. The 40% is based on 35% draws and a score of 0.5. In the case mentioned (score around 0.25, presumably 15% wins and 20% draws), you would get 100%*sqrt(0.25*0.75 - 0.25*0.2) = 37%. So in 240 games you would have a 2.4% error bar (1 sigma).
When comparing the results from two independent gauntlets, the error bar of the difference is the Pythagorean sum of the individual error bars (i.e. sqrt(error1^2 + error2^2)). For results with equal numbers of games, this means multiplying the individual error bars by sqrt(2).
More likely, you will see a score difference of 1 point over 200 games when adding just 1 tiny pattern.
Vincent
Miguel
Re: Eval Dilemma
You said that you knew it was too few games, but I do not think you knew the magnitude of games needed to reach a conclusion. What Giancarlo was pointing out can be translated to: "Neither version looks any weaker or stronger than the other". So your test does not look counter-intuitive.

Edsel Apostol wrote: Since I lack the resources to test them thoroughly, I mostly rely on intuition. Since this one is so counter-intuitive, I don't know what to decide. Well, I guess I will just have to choose the right implementation even if it seems to be weaker in my limited tests.

bob wrote: I never accept bugs just because they are better. The idea is to understand what is going on, and _why_ the bug is making it play better (this is assuming it really is, which may well require a ton of games to verify) and go from there. Once you understand the "why" then you can probably come up with an implementation that is symmetric and still works well.

Edsel Apostol wrote: I guess some of you may have encountered this. It's somewhat annoying. I'm currently in the process of trying out some new things in my eval function. Let's say I have an old eval feature I'll denote as F1 and a new implementation of this eval feature as F2.
I have tested F1 against a set of opponents using a set of test positions in a blitz tournament.
I then replaced F1 with F2, but by some twist of fate I accidentally enabled F2 for the white side only and F1 for the black side. I tested it and it scored way higher than F1 under the same test conditions. I said to myself that the new implementation works well, but when I reviewed the code I found out that it was not implemented as I intended.
I then fixed the asymmetry bug and went on to implement the correct F2 feature. To my surprise, it scored only between F1 and the F1/F2 combination. Note that I have not tried F2 for white and F1 for black to see if it still performs well.
Now here's my dilemma: if you were in my place, would you keep the bug that performs well, or implement the correct feature that doesn't perform as well?
To make a decision based only on the number of wins you had in your tests is almost the same as basing it on flipping coins. The difference you got was ~10 wins over 240 games. You had a performance of ~33%. This is not the same (because you have draws), but just to get an idea, throw a die 240 times and count how many times you get 1 or 2 (a 33% chance). Do it again and again. The number will oscillate around 80, but getting close to 70 or 90 is not that unlikely. This is pretty well established. The fact that you are using only 20 positions and 4 engines makes the differences even less significant (statistically speaking).
Miguel
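Miguel's die-throwing illustration is easy to run for yourself. A small standalone simulation (not anything from the engines discussed; the seed is arbitrary, fixed only for reproducibility):

```python
import random

random.seed(42)  # fixed seed only so the run is reproducible

# Repeat the experiment many times: throw a die 240 times and
# count how often it shows 1 or 2 (a 1-in-3 chance, like a ~33% score).
counts = [sum(1 for _ in range(240) if random.randint(1, 6) <= 2)
          for _ in range(1000)]

# The counts oscillate around 80, but excursions toward 70 or 90
# (and beyond) are common, exactly as described above.
print(min(counts), max(counts))
```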
Re: Eval Dilemma
Taking the Strelka code and carrying on might win against some old chaps, but it sure doesn't make the entire new generation better, as they 'lift' a bit too much there from someone else's work.

Edsel Apostol wrote: It seems like being wary that the newbies will catch up with the veterans. It's just a fact of life that old bulls get replaced by young ones.

diep wrote: If I had posted something like this a few years ago (before 2004), Frans Morsch would have shipped me an email or told me during a tournament: "please don't tell them that; right now they have no chance to get a strong engine, let alone become one of my competitors, and competing with the current ones is already hard enough".

Gian-Carlo Pascutto wrote: 240 games / 4 opponents = 60 games per program.

Edsel Apostol wrote: The opponents are Rybka1.2f 64 bit, Naum 3 64 bit, Thinker 5.4ai 64 bit and HIARCS11MP [...] the number of games is only 240 for each version.
I know it's too few games, but I would expect that the bugfixed version should at least be stronger.
Note that the version with the bug performed well against Rybka and Hiarcs, resulting in a higher score.
With 60 games per program it is completely meaningless to talk about "perform better against this or that".
Your problem is that you do not respect statistics.
Vincent
Re: Eval Dilemma
If A scores 48% and B scores 52%, that's basically blowing 2 games, with maybe just 2 very bad moves in total, as that can give a 4-point swing overall.

michiguel wrote: He was not talking about increases from 35% to 65%. The formula is valid if both score A and score B (from versions A and B) are within 35% to 65%. In other words, if score A is 48% and score B is 52%, you can apply the formula. If score A is 8% and score B is 12%, you cannot.

diep wrote: A score difference of 30% (35% ==> 65%) is so big that, if just 1 added pattern in your evaluation has such a huge impact, we can only hope it is not the 'average' pattern that you add.

hgm wrote: I think the rule-of-thumb Error = 40%/sqrt(numberOfGames) is accurate enough in practice, for scores in the 65%-35% range. (This is for the 1-sigma or 84% confidence level; for 95% confidence, double it.) For very unbalanced scores, you would have to take into account the fact that the 40% goes down; the exact formula for this is

MattieShoes wrote: Can you or anybody point me to how the error bars are calculated?
100%*sqrt(score*(1-score) - 0.25*drawFraction)
where the score is given as a fraction. The 40% is based on 35% draws and a score of 0.5. In the case mentioned (score around 0.25, presumably 15% wins and 20% draws), you would get 100%*sqrt(0.25*0.75 - 0.25*0.2) = 37%. So in 240 games you would have a 2.4% error bar (1 sigma).
When comparing the results from two independent gauntlets, the error bar of the difference is the Pythagorean sum of the individual error bars (i.e. sqrt(error1^2 + error2^2)). For results with equal numbers of games, this means multiplying the individual error bars by sqrt(2).
More likely, you will see a score difference of 1 point over 200 games when adding just 1 tiny pattern.
Vincent
Miguel
First of all, the odds that these 2 bad moves were caused by the specific pattern are tiny. It could be some fluctuation, book learning, or whatever other effect.
So you will really soon conclude that you need THOUSANDS of games for good statistical significance. I usually go for 95%.
Vincent
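Vincent's "thousands of games" point can be sanity-checked against hgm's rule of thumb from earlier in the thread. This is a hedged sketch of that back-of-the-envelope calculation; the function and its defaults are mine, and it uses the simplified 40%/sqrt(n) figure rather than the exact draw-dependent formula:

```python
import math

def games_needed(diff, per_game_sigma=0.40, z=2.0):
    """Games per gauntlet needed so that a score difference `diff`
    (as a fraction) between two independent, equally sized gauntlets
    is z-sigma significant. Uses the 40%/sqrt(n) rule of thumb; the
    sqrt(2) accounts for comparing two independent runs."""
    return math.ceil((z * math.sqrt(2) * per_game_sigma / diff) ** 2)

print(games_needed(0.04))  # 48% vs 52%: about 800 games per version
print(games_needed(0.02))  # 49% vs 51%: about 3200 games per version
```

So resolving even a 2% score difference at roughly 95% confidence already takes thousands of games, in line with the post above.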
Re: Eval Dilemma
Going from 50-50 to 52-48 is an increase of ~15 Elo points. Yes, you need thousands of games to make sure it is real with a good level of confidence.

diep wrote: If A scores 48% and B scores 52%, that's basically blowing 2 games, with maybe just 2 very bad moves in total, as that can give a 4-point swing overall.

michiguel wrote: He was not talking about increases from 35% to 65%. The formula is valid if both score A and score B (from versions A and B) are within 35% to 65%. In other words, if score A is 48% and score B is 52%, you can apply the formula. If score A is 8% and score B is 12%, you cannot.

diep wrote: A score difference of 30% (35% ==> 65%) is so big that, if just 1 added pattern in your evaluation has such a huge impact, we can only hope it is not the 'average' pattern that you add.

hgm wrote: I think the rule-of-thumb Error = 40%/sqrt(numberOfGames) is accurate enough in practice, for scores in the 65%-35% range. (This is for the 1-sigma or 84% confidence level; for 95% confidence, double it.) For very unbalanced scores, you would have to take into account the fact that the 40% goes down; the exact formula for this is

MattieShoes wrote: Can you or anybody point me to how the error bars are calculated?
100%*sqrt(score*(1-score) - 0.25*drawFraction)
where the score is given as a fraction. The 40% is based on 35% draws and a score of 0.5. In the case mentioned (score around 0.25, presumably 15% wins and 20% draws), you would get 100%*sqrt(0.25*0.75 - 0.25*0.2) = 37%. So in 240 games you would have a 2.4% error bar (1 sigma).
When comparing the results from two independent gauntlets, the error bar of the difference is the Pythagorean sum of the individual error bars (i.e. sqrt(error1^2 + error2^2)). For results with equal numbers of games, this means multiplying the individual error bars by sqrt(2).
More likely, you will see a score difference of 1 point over 200 games when adding just 1 tiny pattern.
Vincent
Miguel
First of all, the odds that these 2 bad moves were caused by the specific pattern are tiny. It could be some fluctuation, book learning, or whatever other effect.
So you will really soon conclude that you need THOUSANDS of games for good statistical significance. I usually go for 95%.
Vincent
Miguel
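The ~15 Elo figure above follows from the standard logistic relation between expected score and rating difference. A quick standalone check (the exact number depends on the rating model, so treat it as an approximation):

```python
import math

def elo_from_score(score):
    """Elo difference implied by an expected score under the standard
    logistic model: E = 1 / (1 + 10^(-elo/400))."""
    return -400 * math.log10(1 / score - 1)

print(round(elo_from_score(0.52), 1))  # roughly 14 Elo for a 52% score
```

So a 52% score corresponds to roughly 14-15 Elo, which is why such a small edge needs thousands of games to confirm.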