Argh, my program gets worse test suite results after bugfix


Argh, my program gets worse test suite results after bugfix

Post by Ratta »

Hi, have you ever experienced anything like this?
I usually run a mix of test suites (arasan+bt+bs+ecmgcp) to get an idea of the tactical strength of my engine, and after I fixed a couple of *bad* bugs in the evaluation and search functions I'm getting much worse results, something like 62/381 instead of 109/381.
Yes, they really were bugs, and the program plays much better in real games (no more blunders), and I know that in chess it is better to choose a decent move 100% of the time than to play the best move 99.9% of the time and blunder the remaining 0.1%.
Still, this behaviour makes me feel really frustrated, and that's why I'm writing this post.
I'm now looking for a way to restore the bugs, without the blunders if possible :)
Regards!

Re: Argh, my program gets worse test suite results after bugfix

Post by Ovyron »

I think that back when Rybka was bad at test suites and members were ranting about that, Vas just released Rybka Winfinder, an engine that was weaker than the original but was good at these tests.

You may try something similar, having 2 engines, one for games and one for test suites.
Your beliefs create your reality, so be careful what you wish for.

Re: Argh, my program gets worse test suite results after bugfix

Post by Dann Corbit »

What exactly were the things that you fixed?

Re: Argh, my program gets worse test suite results after bugfix

Post by Graham Banks »

Hi Maurizio,

is your engine named Rattate Chess or Ratta Te Chess?

Regards, Graham.
gbanksnz at gmail.com

Re: Argh, my program gets worse test suite results after bugfix

Post by bob »

Ratta wrote:Hi, have you ever experienced anything like this?
I usually run a mix of test suites (arasan+bt+bs+ecmgcp) to get an idea of the tactical strength of my engine, and after I fixed a couple of *bad* bugs in the evaluation and search functions I'm getting much worse results, something like 62/381 instead of 109/381.
Yes, they really were bugs, and the program plays much better in real games (no more blunders), and I know that in chess it is better to choose a decent move 100% of the time than to play the best move 99.9% of the time and blunder the remaining 0.1%.
Still, this behaviour makes me feel really frustrated, and that's why I'm writing this post.
I'm now looking for a way to restore the bugs, without the blunders if possible :)
Regards!
It is so common it isn't funny... The bad thing is that introducing that bug actually improved your results, which would lead you to believe it works...

Re: Argh, my program gets worse test suite results after bugfix

Post by Tony »

Ratta wrote:Hi, have you ever experienced anything like this?
I usually run a mix of test suites (arasan+bt+bs+ecmgcp) to get an idea of the tactical strength of my engine, and after I fixed a couple of *bad* bugs in the evaluation and search functions I'm getting much worse results, something like 62/381 instead of 109/381.
Yes, they really were bugs, and the program plays much better in real games (no more blunders), and I know that in chess it is better to choose a decent move 100% of the time than to play the best move 99.9% of the time and blunder the remaining 0.1%.
Still, this behaviour makes me feel really frustrated, and that's why I'm writing this post.
I'm now looking for a way to restore the bugs, without the blunders if possible :)
Regards!
You might have created dependencies. Maybe, for example, you adjusted pruning thresholds based on the buggy behaviour.

Tony

Re: Argh, my program gets worse test suite results after bugfix

Post by Ratta »

Hi, thanks for your replies.

@robert: eh, it looks like I'm not the only one who has run into this :)

@graham: the correct spelling is RattateChess, or Rattatechess if you don't like the uppercase 'C'. Other spellings are also fine with me, but let's say that "RattateChess" is the official one :)

@dann, tony: The first bug was a threshold bug (and when I started this thread it was not fixed properly; now it should be, and I get a slightly better test score, but still not as good as with the bug :) ). I store upper and lower bounds in the hashtable, and at quiescence nodes, if I get a lower bound that is > -WORST_MATE+MAX_PLY, it means that the evaluation function has already been called (and the lower bound has already been improved), so I don't call it again. The bug was that, although I have an aspiration window of [-WORST_MATE, +WORST_MATE], I was testing whether the lower bound is > -INFINITY instead. The buggy behaviour in negascout is that with the first call you are asking "Is there anything better than being checkmated?", and you get "Yes, there is!", then at the second (PV) search you ask "How good is it?" and get "It is like being checkmated + 1 centipawn".
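
In code, the fixed test looks roughly like this (the names around the bound check are simplified):

Code:

    // Buggy version: always true, because with an aspiration window of
    // [-WORST_MATE, +WORST_MATE] every stored lower bound is > -INFINITY:
    //     if (entry.lower > -INFINITY) { /* skip evaluate() */ }

    // Fixed version: only a non-mate lower bound proves that evaluate()
    // has already been called (and improved) at this quiescence node.
    if (entry.lower > -WORST_MATE + MAX_PLY)
        best = entry.lower;                  // evaluation already in the bound
    else
        best = evaluate(pos);                // first visit: call the evaluation
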
The second bug was in some experimental code to evaluate the dynamism of the position: I add a malus for each of your pawns/pieces that is defended and attacked at the same time, and a bigger malus if you have two or more pieces that are being attacked by pawns, or something like that (with some small variations depending on whether you are on move or not). I expected this last term, which is a bit like a "global SEE", to just smooth out the quiescence, while the first was designed to penalize things like the "passive rook" in the endgame, or the bishop on c1 that cannot be developed because it must defend the pawn on b2, etc.
My original hope was that this change would "drive" the search towards more attacking/aggressive positions, and it actually made a real difference in many tactical problems.
The bug was that the values with which I estimated the "strength" of an attack were completely messed up (an attack by a pawn is strong because the pawn can be sacrificed, an attack by the queen is weaker, etc.).
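
Roughly sketched (simplified, with placeholder constants and helpers), the idea was something like:

Code:

    // Experimental "dynamism" term: penalize tension against our pieces.
    // TENSION_MALUS and FORK_MALUS are placeholder values.
    int dynamism_malus(const Position &pos, Color us) {
        int malus = 0, pawn_attacked = 0;
        for (Square s : pos.pieces(us)) {
            // a unit that is attacked and must be defended is tied down
            if (pos.attacked_by(~us, s) && pos.defended_by(us, s))
                malus += TENSION_MALUS;
            // count non-pawn pieces attacked by enemy pawns
            if (pos.attacked_by_pawn(~us, s) && !pos.is_pawn(s))
                ++pawn_attacked;
        }
        if (pawn_attacked >= 2)              // two pieces forked by pawns
            malus += FORK_MALUS;             // usually means losing material
        return malus;
    }
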

BTW, the source code of my engine is available at http://repo.or.cz with full revision history in a Git repository (if you don't know what Git is, it is a revision control system, like CVS or SVN but much more powerful). The code in the repository is very weak at the moment, because I have an extremely simple evaluation function and no fancy search features like futility or late move pruning, since I wanted to test ideas like the one above in a very simple and safe setting (with no success, as this thread is here to prove).
Regards!

Re: Argh, my program gets worse test suite results after bugfix

Post by Tony »

Ratta wrote:Hi, thanks for your replies.

@robert: eh, it looks like I'm not the only one who has run into this :)

@graham: the correct spelling is RattateChess, or Rattatechess if you don't like the uppercase 'C'. Other spellings are also fine with me, but let's say that "RattateChess" is the official one :)

@dann, tony: The first bug was a threshold bug (and when I started this thread it was not fixed properly; now it should be, and I get a slightly better test score, but still not as good as with the bug :) ). I store upper and lower bounds in the hashtable, and at quiescence nodes, if I get a lower bound that is > -WORST_MATE+MAX_PLY, it means that the evaluation function has already been called (and the lower bound has already been improved), so I don't call it again. The bug was that, although I have an aspiration window of [-WORST_MATE, +WORST_MATE], I was testing whether the lower bound is > -INFINITY instead. The buggy behaviour in negascout is that with the first call you are asking "Is there anything better than being checkmated?", and you get "Yes, there is!", then at the second (PV) search you ask "How good is it?" and get "It is like being checkmated + 1 centipawn".
The second bug was in some experimental code to evaluate the dynamism of the position: I add a malus for each of your pawns/pieces that is defended and attacked at the same time, and a bigger malus if you have two or more pieces that are being attacked by pawns, or something like that (with some small variations depending on whether you are on move or not). I expected this last term, which is a bit like a "global SEE", to just smooth out the quiescence, while the first was designed to penalize things like the "passive rook" in the endgame, or the bishop on c1 that cannot be developed because it must defend the pawn on b2, etc.
My original hope was that this change would "drive" the search towards more attacking/aggressive positions, and it actually made a real difference in many tactical problems.
The bug was that the values with which I estimated the "strength" of an attack were completely messed up (an attack by a pawn is strong because the pawn can be sacrificed, an attack by the queen is weaker, etc.).

BTW, the source code of my engine is available at http://repo.or.cz with full revision history in a Git repository (if you don't know what Git is, it is a revision control system, like CVS or SVN but much more powerful). The code in the repository is very weak at the moment, because I have an extremely simple evaluation function and no fancy search features like futility or late move pruning, since I wanted to test ideas like the one above in a very simple and safe setting (with no success, as this thread is here to prove).
Regards!
Not sure how much time I'll spend on this, but here are a couple of things. (The tips range from essential to optimization; you can do the categorizing yourself.)

- Your hashtable writing code overwrites deep entries with shallow entries (when the key is the same). Pretty disastrous in endgames (see the sketch after this list)

- I seriously question doing a nullmove when only 1 piece is present (and you have 2 pawns)

- Don't adjust alpha and beta based on the hashtable scores. It seems theoretically correct, but in practice it isn't

- (Especially for the FIRST ply in quiescence) Don't take the evaluation as best score when you're in check. (Maybe it can't happen in your code)

- Do the (material) counting stuff incrementally.

- Don't know if I understood this correctly, but giving more than half a pawn of bonus for attacking a piece with a pawn gives serious horizon effects.

- I would suggest splitting normal search and quiescence search in your code. They behave quite differently.
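
For the hashtable point, something like this depth-preferred store is what I mean (a rough sketch, the names are made up):

Code:

    // Depth-preferred store: never replace a same-key entry that was
    // searched deeper than the new result (vital in endgames, where the
    // same positions come back again and again).
    void store(HashEntry &e, uint64_t key, int depth, int score, Bound bound) {
        if (e.key == key && e.depth > depth)
            return;                          // keep the deeper entry
        e = HashEntry{key, depth, score, bound};
    }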

Hope I have given you some suggestions to work on.

Tony

Re: Argh, my program gets worse test suite results after bugfix

Post by Ratta »

Tony wrote: Not sure how much time I'll spend on this, but here are a couple of things. (The tips range from essential to optimization; you can do the categorizing yourself.)
Wow, you are amazing. Thanks a lot!
Tony wrote: - Your hashtable writing code overwrites deep entries with shallow entries (when the key is the same). Pretty disastrous in endgames
Mh, no, this is not happening (up to a programming mistake). The only deeper entries that can be overwritten are the "old" ones, i.e. those resulting from a previous call to "find_best_move" (this could make "pondering" less effective, but it is required to avoid clogging the hashtable with useless positions).
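I.e. the replacement test is roughly this (simplified sketch, with an "age" field stamped per search):

Code:

    // Stale entries from a previous find_best_move may always be
    // overwritten; within the current search, prefer the deeper entry.
    bool may_overwrite(const HashEntry &e, int new_depth, int current_age) {
        if (e.age != current_age)
            return true;                     // left over from an earlier search
        return new_depth >= e.depth;
    }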
Tony wrote: - I seriously question doing a nullmove when only 1 piece is present (and you have 2 pawns)
Yeah, my null-move checking function is still very rough.
Tony wrote: - Don't adjust alpha and beta based on the hashtable scores. It seems theoretically correct, but in practice it isn't
Mh, I would like to understand this issue better. Suppose I store the lower bound "correctly", i.e. when a previous search at the same depth fails high (IIUC this should mean that the "true" value of the position, the value that would be computed with a [-INF,+INF] window, is higher). If so, when doing a PV search I should be able to adjust the window, because I'm only removing value ranges that cannot contain the "true value". Is there anything wrong with this? Or is there some other kind of "practical" issue?
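To be explicit, the adjustment I mean is roughly this (simplified sketch, using the stored bounds):

Code:

    // Tighten the window with the bounds stored in the hashtable
    // before searching the node.
    if (entry.lower > alpha) alpha = entry.lower;
    if (entry.upper < beta)  beta  = entry.upper;
    if (alpha >= beta)
        return entry.lower;                  // the stored bounds already cut off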
Tony wrote: - (Especially for the FIRST ply in quiescence) Don't take the evaluation as best score when you're in check. (Maybe it can't happen in your code)
This can't happen (up to a programming mistake, as usual).
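The relevant part of my quiescence entry looks roughly like this (simplified sketch):

Code:

    int qsearch(Position &pos, int alpha, int beta) {
        int best = -INFINITY_SCORE;          // no stand-pat score when in check
        if (!pos.in_check())
            best = evaluate(pos);            // stand pat only out of check
        if (best >= beta)
            return best;
        // ... generate and search captures (all evasions when in check) ...
        return best;
    }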
Tony wrote: - Do the (material) counting stuff incrementally.
Yeah, let's say that at the moment I'm just trying to achieve the highest strength/speed ratio :)
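(For reference, the incremental version Tony suggests would look roughly like this; a simplified sketch, the helper names are made up:)

Code:

    // Update the material balance in make_move instead of recounting
    // every piece at each call to the evaluation.
    void make_move(Position &pos, Move m) {
        Color us = pos.side_to_move();
        if (m.is_capture())
            pos.material[~us] -= piece_value[m.captured_piece()];
        if (m.is_promotion())
            pos.material[us] += piece_value[m.promotion_piece()] - piece_value[PAWN];
        // ... the rest of make_move, with the mirror update in unmake_move ...
    }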
Tony wrote: - Don't know if I understood this correctly, but giving more than half a pawn of bonus for attacking a piece with a pawn gives serious horizon effects.
The bonus is given only if the attacking pawn is on move, or if there are two (or more) pawns attacking two (or more) different pieces.
Tony wrote: - I would suggest splitting normal search and quiescence search in your code. They behave quite differently.
Yeah, there is still a lot of cleanup waiting :)

Regards!

Re: Argh, my program gets worse test suite results after bugfix

Post by Graham Banks »

Ratta wrote: @graham: the correct spelling is RattateChess, or Rattatechess if you don't like the uppercase 'C'. Other spellings are also fine with me, but let's say that "RattateChess" is the official one :)
Thanks Maurizio,

I wanted to make sure that I used the correct name when I begin testing it. :wink:

Regards, Graham.
gbanksnz at gmail.com