Another testing question

lkaufman
Posts: 5960
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA

Another testing question

Post by lkaufman »

Let's assume that we want to test our engine and candidate new versions of it against unrelated engines. There are two obvious ways to go about this.
Method 1: Run our engine against the gauntlet. Then run the candidate new version against the same gauntlet. If the candidate gets a better rating (by any margin), it becomes our new best engine. Each new candidate has to surpass the rating of the previous best version to become our new best engine.
Method 2: Run the best engine and the candidate at the same time against the gauntlet. Promote the candidate only if it does better by some significant margin. Then, when a new candidate comes up, run it and the best version simultaneously against the gauntlet, with something that guarantees we won't just be repeating the same games from the previous test for the best version (i.e. different openings, a different time control, some randomizing element). Again, only promote the candidate if it wins by a significant margin.
Which method of testing is superior, or is there some third way that is better still? Method 1 has the advantage of pretty much guaranteeing against regressions, since you can always see your total gain since the release. The drawback is that once one version gets lucky, it is extremely hard for any subsequent version to surpass it unless it is quite a bit superior. Method 2 requires twice as many games to be played by the successful versions, and there is always a risk of regression, but each change will be accepted or rejected without the effect of "survivor bias" having helped the previous one.
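To get a feel for how large the "lucky version" effect can be, here is a rough Monte-Carlo sketch (Python). All the numbers in it are invented — the per-gauntlet noise, the distribution of true gains per change, and the promotion margin — so it only illustrates the selection bias of Method 1 versus the retest-with-margin idea of Method 2 under those assumptions, nothing more.

```python
import random

# All numbers below are invented for illustration: how many changes get tried,
# how much each one is really worth, and how noisy one gauntlet run is.
CANDIDATES = 50           # changes tried
TRUE_GAIN  = (-2.0, 3.0)  # true Elo effect of a change, drawn uniformly
NOISE      = 5.0          # 1-sigma Elo noise of a single gauntlet run
MARGIN     = 3.0          # promotion margin used by Method 2
TRIALS     = 2000

def run_trial():
    # Method 1: a candidate must beat the best *measured* rating so far.
    true1, meas1 = 0.0, random.gauss(0.0, NOISE)
    # Method 2: candidate and current best are re-measured together each time.
    true2 = 0.0
    for _ in range(CANDIDATES):
        gain = random.uniform(*TRUE_GAIN)

        cand1 = true1 + gain
        m = cand1 + random.gauss(0.0, NOISE)
        if m > meas1:                      # promote on any margin vs. the lucky record
            true1, meas1 = cand1, m

        cand2 = true2 + gain
        diff = (cand2 + random.gauss(0.0, NOISE)) - (true2 + random.gauss(0.0, NOISE))
        if diff > MARGIN:                  # promote only on a significant margin
            true2 = cand2
    return true1, true2

gain1 = gain2 = 0.0
for _ in range(TRIALS):
    g1, g2 = run_trial()
    gain1 += g1
    gain2 += g2
print("Method 1: average true gain %.1f Elo" % (gain1 / TRIALS))
print("Method 2: average true gain %.1f Elo" % (gain2 / TRIALS))
```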
Comments?
hgm
Posts: 27787
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Another testing question

Post by hgm »

I would not be too quick to discard versions that are not clearly better. The effects of improvements are often highly additive. So if I made a change to my version O which produced A, and it is not really conclusive whether it is an improvement or not, then when I want to test a change B, I would actually make it both in O and in A, to get new programs B and AB. I would then test B and AB in the gauntlet, each with half the number of games. This tests change B with the full number of games, but at the same time adds more games to the evaluation of change A, since you play games with versions that have A as well as versions that don't. This way you can both keep A in and leave it out in tests for many subsequent new changes, increasing the accuracy of the value of A basically for free. After evaluating 10 other changes, you will have 10 times as many games for the A evaluation, making it much easier to decide whether you should keep or discard it.
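A minimal sketch of the bookkeeping this implies (Python, with invented version names, game counts and scores): pool every gauntlet game played by versions that contain change A against every game played by versions that don't, and convert the pooled scores into an Elo difference.

```python
import math

def elo_from_score(score):
    """Elo advantage implied by a score fraction against fixed opposition."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def pooled_score(results):
    games  = sum(g for g, _ in results.values())
    points = sum(p for _, p in results.values())
    return points / games

# (games, points) per version against the same gauntlet -- purely hypothetical numbers.
with_A    = {"A": (500, 262), "AB": (250, 131), "AC": (250, 133)}   # versions containing A
without_A = {"O": (500, 250), "B":  (250, 126), "C":  (250, 127)}   # versions without A

s_with, s_without = pooled_score(with_A), pooled_score(without_A)
print("with A:    %.3f" % s_with)
print("without A: %.3f" % s_without)
print("estimated value of change A: %+.1f Elo"
      % (elo_from_score(s_with) - elo_from_score(s_without)))
```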
Daniel White
Posts: 33
Joined: Wed Mar 07, 2012 4:15 pm
Location: England

Re: Another testing question

Post by Daniel White »

I'm not sure I understand method 2 exactly, so I can't comment on which is better. However, if I want to test a change I often make several different versions and test each one with my usual method. So if I was adding a new eval term I would have one version which simply has the weight for that term at some sensible value, then perhaps two other versions, one with the weight set to 0 and another with it set to some larger value. The version with the weight at 0 should choose the same moves as the base version but take longer to do so, and is useful as a control*.

Also I'll add this even though I'm sure you're aware of it. If your new version turns out to be better you can simply keep the pgn of its games against the gauntlet. Then when you come to test a new version you only need to play the new engine against the same opponents and compare to the saved pgn.

I also like to do a round-robin of the opponent engines in the gauntlet before I do any testing. I then pass this pgn into the elo calculator along with the other two mentioned above. I have no idea if this affects the ratings but it seems to me that it cannot harm.

* If the elo of the 'sensible' version is negative, but not as negative as that of the 'control' version, we know there might be some benefit to be had if only we could speed up the pattern recognition.
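As a rough illustration of that footnote (Python, with invented Elo figures): the weight = 0 'control' isolates the slowdown from computing the term, so the gap between it and the 'sensible' version is what the knowledge itself is worth.

```python
# Hypothetical measured Elo deltas, both relative to the unmodified base version.
elo_control  = -6.0   # weight = 0: pays the speed cost, gains no knowledge
elo_sensible = -2.0   # sensible weight: net effect of speed cost plus knowledge

speed_cost      = -elo_control                 # Elo lost to computing the term
knowledge_value = elo_sensible - elo_control   # Elo the term would gain if it were free

print("slowdown cost:   %.1f Elo" % speed_cost)       # -> 6.0
print("knowledge value: %.1f Elo" % knowledge_value)  # -> 4.0: worth speeding up
```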
lkaufman
Posts: 5960
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA

Re: Another testing question

Post by lkaufman »

Daniel White wrote:I'm not sure I understand method 2 exactly, so I can't comment on which is better. However, if I want to test a change I often make several different versions and test each one with my usual method. So if I was adding a new eval term I would have one version which simply has the weight for that term at some sensible value, then perhaps two other versions, one with the weight set to 0 and another with it set to some larger value. The version with the weight at 0 should choose the same moves as the base version but take longer to do so, and is useful as a control*.

Also I'll add this even though I'm sure you're aware of it. If your new version turns out to be better you can simply keep the pgn of its games against the gauntlet. Then when you come to test a new version you only need to play the new engine against the same opponents and compare to the saved pgn.

I also like to do a round-robin of the opponent engines in the gauntlet before I do any testing. I then pass this pgn into the elo calculator along with the other two mentioned above. I have no idea if this affects the ratings but it seems to me that it cannot harm.

* If the elo of the 'sensible' version is negative, but not as negative as that of the 'control' version, we know there might be some benefit to be had if only we could speed up the pattern recognition.
If you keep the pgn as you propose, you are basically choosing method 1. The downside is that each later version is being compared to the winner of the previous match. Suppose two identical copies of your engine are tested in a gauntlet, with one getting an elo of 3020 and the other 3030, just due to sampling error, but you don't know they are identical. The best estimate of the true strength is obviously 3025, but you keep the version that got 3030. Now you make a small change that improves the play by one elo. Let's assume it gets the expected result of 3026. This looks like a four elo step backwards, just because you are comparing to the lucky version. That is the argument for method 2, which requires that the "best" version be retested each time along with the new candidate. But of course this seems like a waste of resources.
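To put a number on that example (Python sketch, assuming the noise of one gauntlet run really is about 5 Elo at one sigma): the chance that the genuinely improved version still measures below the lucky 3030 baseline is just a normal tail probability.

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

lucky_baseline = 3030.0   # retained rating of the lucky identical copy
true_new       = 3026.0   # true strength of the genuinely improved candidate
sigma          = 5.0      # assumed 1-sigma Elo noise of one gauntlet run

p_fail = norm_cdf((lucky_baseline - true_new) / sigma)
print("chance the better version still measures below 3030: %.0f%%" % (100 * p_fail))
# -> about 79% with these assumptions
```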
So, which way is better? I don't know.
lkaufman
Posts: 5960
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA

Re: Another testing question

Post by lkaufman »

hgm wrote:I would not be too quick to discard versions that are not clearly better. The effects of improvements are often highly additive. So if I made a change to my version O which produced A, and it is not really conclusive whether it is an improvement or not, then when I want to test a change B, I would actually make it both in O and in A, to get new programs B and AB. I would then test B and AB in the gauntlet, each with half the number of games. This tests change B with the full number of games, but at the same time adds more games to the evaluation of change A, since you play games with versions that have A as well as versions that don't. This way you can both keep A in and leave it out in tests for many subsequent new changes, increasing the accuracy of the value of A basically for free. After evaluating 10 other changes, you will have 10 times as many games for the A evaluation, making it much easier to decide whether you should keep or discard it.
That is a very interesting idea, one which we have occasionally tried in the past, but perhaps we should do so more often. The problem is that when most of your changes are within the margin of error, it quickly becomes unwieldy to make so many versions and keep track of all the permutations, etc. It's a tradeoff of engine time vs. human time.
But this does not address the basic question, which is whether it is advisable to constantly compare to the best result ever obtained against the gauntlet, or whether a new test should be started each time, discarding the data that led to the promotion of the new candidate engine.
hgm
Posts: 27787
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Another testing question

Post by hgm »

What I once did is make an engine that took its eval parameters from the command line, and used them in the name it presents (which will show up in the PGN). Then I held a big round-robin between dozens of differently configured versions.

In particular I did this for 5x5 mini-Shogi, where I had no idea about the piece values, and turnover of material is so fast that no imbalance survives for more than two or three moves. What I did was make versions that permuted the 4 basic piece types in all possible orders, value-wise, and combined that with a high and a low Pawn value. That produced 48 engines. Each engine played only 94 games, but for each piece pair (A,B), half the engines in the round robin valued A above B and the other half valued B above A. So I compared the scores of those two sets of games (47*48 games in each set). That worked like a charm.
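A sketch of how such a configuration sweep can be generated (Python). The piece names, the value set and the option format here are made up, not any actual engine's options; the point is just the 24 orderings of four values times two Pawn values, with the configuration encoded in the engine name so it ends up in the PGN.

```python
from itertools import permutations

PIECES      = ["silver", "gold", "bishop", "rook"]  # the 4 basic piece types (illustrative)
VALUE_SLOTS = [300, 400, 500, 600]                  # invented values to permute over them
PAWN_VALUES = [80, 120]                             # a low and a high Pawn value

configs = []
for order in permutations(VALUE_SLOTS):      # 24 orderings of the four values
    for pawn in PAWN_VALUES:                 # times 2 Pawn values -> 48 engines
        values = dict(zip(PIECES, order))
        # Encode the configuration in the engine name so it shows up in the PGN.
        name = "mini-" + "-".join("%s%d" % (p[0], v) for p, v in values.items()) + "-P%d" % pawn
        args = ["--pawn=%d" % pawn] + ["--%s=%d" % (p, v) for p, v in values.items()]
        configs.append((name, args))

print(len(configs), "engine configurations")  # -> 48
```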
kbhearn
Posts: 411
Joined: Thu Dec 30, 2010 4:48 am

Re: Another testing question

Post by kbhearn »

If you can avoid changing your reference version too often, you could have its elo tested down to a relatively low error bar. Assuming most of your changes are minor and relatively orthogonal to each other, you could then test reference vs reference + one change without rerunning the reference through the gauntlet, until you have enough successful changes that you feel it's time to roll the changes together and make a new reference version.
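A back-of-the-envelope sketch of how that error bar shrinks with the number of games (Python; the 35% draw rate is an assumed figure, and the formula is the usual normal approximation for a score near 50%).

```python
import math

def elo_error(games, draw_rate=0.35, sigmas=2.0):
    """Approximate +/- Elo error bar for a score near 50%.

    Per-game score variance is (1 - draw_rate) / 4, and near 50% one unit of
    score fraction is worth 400 / (ln 10 * 0.25) ~= 695 Elo.
    """
    se_score = math.sqrt((1.0 - draw_rate) / (4.0 * games))
    return sigmas * (400.0 / math.log(10)) / 0.25 * se_score

for n in (1000, 4000, 16000, 64000):
    print("%6d games: +/- %.1f Elo (95%% confidence)" % (n, elo_error(n)))
```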