Variance reports for testing engine improvements

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

nczempin

Re: Variance reports for testing engine improvements

Post by nczempin »

hgm wrote: The theoretical estimate that uMax would vary about one in 40 moves seems quite realistic. The large number of identical games is simply a consequence of the fact that so many games were shorter than 40 moves. As to the game results, many games are of course already decided before move 40, even if they drag on for 60 more moves. Therefore games that diverge only after 30-40 moves produce highly correlated results, even though the position might not be biased in itself.
Just a quick note lest anyone be confused:
The low number of moves is of course an artifact of the fact that we are starting with middlegame positions that would be reached after perhaps move 15-20 in a real game.
nczempin

Re: Variance reports for testing engine improvements

Post by nczempin »

nczempin wrote: I am still open to suggestions as to what engines I should use. I will choose one engine to run against, and then choose the next one, so as long as the matches with the first engine have not finished yet, I am still free to choose the others.

My first opponent shall be Pooky 2.7.
Okay, so choosing Pooky as an opponent for this test was a disaster. It seems to have a bug where, after about 6 moves in the UCI "position fen ... moves" command, it thinks it is white when it is actually black, and thus plays an illegal move.
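
For reference, the side to move after a UCI "position fen ... moves ..." command follows from the FEN's side-to-move field plus the parity of the appended move list. A minimal Python sketch of the bookkeeping an engine has to get right (the function and names are mine, purely illustrative):

    def side_to_move(fen, moves):
        # The second space-separated field of a FEN is the side to move
        stm = fen.split()[1]                  # 'w' or 'b'
        # Every half-move appended after 'moves' flips the turn
        if len(moves) % 2 == 1:
            stm = "b" if stm == "w" else "w"
        return stm

An engine that gets this parity step wrong will, as described above, think it is white when it is actually black.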

So the next one I chose was Yawce (again, based on the observed variance in the normal gauntlet). That one doesn't even support the "edit" command according to Arena, so you can't start from any position other than the standard initial one.

So now I am running against ALChess 1.5, an engine that is still in the top range of engines I would play against (actually, it would normally have been declared too strong by my previous methodology), but one that the new version could reasonably be expected to gain against. So far the score is 14.5-4.5, which of course does not tell you anything, because the second half, with reversed colours, hasn't been played yet.
hgm

Re: Variance reports for testing engine improvements

Post by hgm »

Yes, I think so. WinBoard seems to ignore the move number in the FEN (if there is one), and starts counting at 1. I never played from FENs before; the Nunn positions I usually play from are given as a PGN file with the moves leading to those positions from the opening. In principle I like that better, as it fills in the full game history, which might be important for rep-draws.

I could send you the PGN files if you like, but I don't think there is really much remarkable in the games themselves. The result of the match confirms my earlier observation (from playing both engines in round robins with many other engines, at various time controls) that Eden and uMax 1.6 are extremely closely matched. This is why I wondered if they were using the same algorithm. It is all the more funny that they don't. I watched many of the games, and it seems that they indeed have different strong and weak points. uMax seems to be tactically better (perhaps due to a recapture extension), but it bungles many endgames, where Eden is obviously much more adept at pushing its passers. And Eden is much better at avoiding unwanted rep-draws (or, more aptly put, uMax is an idiot in this respect).

(Note that I had to repair a bug in the edit command of the uMax that was on my website to play this match: to implement minor promotion cheaply in uMax, I had swapped the internal codes for King and Knight compared to the other micro-Maxes, but I forgot to do the same in the edit command, so that it was setting up positions with 2 Kings and one Knight! :lol: :lol: :lol: I will upload the corrected version to my website soon.)
nczempin

Re: Variance reports for testing engine improvements

Post by nczempin »

hgm wrote:
bob wrote: Your program vs one opponent, using the 40 Silver positions, playing black and white in each, for a total of 80 games. Then run the _same_ test again and report both results to see if you see the kind of non-deterministic results I see with the programs I am using.
OK, test results are in, so some real data.

I played the two 80-game matches from the Silver positions between micro-Max 1.6 and Eden 0.0.13_win32_jet, at 40/1'.

Of the 80 game pairs, 64 were identical, move for move, until the very end.

Of the other 16, it was Eden that deviated 10 times, on moves 11, 38, 27, 10, 42, 42, 6, 28, 16 and 7. The other 6 deviations were due to uMax choosing a different move, on moves 32, 46, 24, 15, 37 and 25.

Note that some of the games were short (shortest game: 10 moves), increasing the likelihood that they would be identical, but there was also one 125-move game, and two 112-move games, that were completely identical.

It can be concluded that the fraction of identically repeated games is very high (around 80%), so that one would waste a factor of 5 in testing time by running multiple mini-matches between these engines, duplicating games that have already been played before. Clearly a different testing methodology is needed (e.g. more starting positions).
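
To spell out the arithmetic behind that factor, a trivial Python sketch of the numbers above:

    games = 80
    identical = 64
    dup_frac = identical / games        # 0.8: four out of five games are replays
    waste = 1 / (1 - dup_frac)          # 5.0: five games played per genuinely new game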

Btw, Eden won both mini-matches 41.5-38.5, although about half of the 16 games that diverged also had a different result.

The theoretical estimate that uMax would vary about one in 40 moves seems quite realistic. The large number of identical games is simply a consequence of the fact that so many games were shorter than 40 moves. As to the game results, many games are of course already decided before move 40, even if they drag on for 60 more moves. Therefore games that diverge only after 30-40 moves produce highly correlated results, even though the position might not be biased in itself.
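
To make that estimate concrete: if each engine independently deviates with a small probability p at each of its own moves, a game of n full moves repeats move for move with probability (1-p)^(2n). A toy Python sketch (assuming the 1-in-40 rate quoted above, and ignoring that a real deviation also needs a near-tie in the search):

    def p_identical(n_moves, p_dev=1/40):
        # Chance a game of n_moves full moves repeats exactly, when each
        # of its 2 * n_moves half-moves deviates with probability p_dev
        return (1 - p_dev) ** (2 * n_moves)

    # p_identical(10) ~ 0.60 while p_identical(40) ~ 0.13, so short
    # games are far more likely to repeat exactly than long ones
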
I think that this experiment (it's not finished yet) is showing one extreme end of our discussion, namely what the results will be when playing highly deterministic engines against each other.

We need to find an opponent for Eden from the other extreme end, one which varies its play considerably. I could nominate Atlanchess for this honor, because it is very resilient, always making it into my next test tournament, but always finishing near the bottom (you'd expect it to be dominated after some more improvement, but that hasn't happened yet).

However, I believe most of this variation is due to its big own opening book, and perhaps it will decrease for this particular engine when using the 80-game test.

I also believe that for more deterministic engines like Eden and uMax, the variance will actually increase when using the 80-game test. But we are working on finding out whether there is any evidence for that.

Certainly I find it likely that the variance per game would go up (i.e. require more games before significant results could be observed), but perhaps even the overall variance will do so. One of the things we should compare (or rather, that I would be interested in) would be to see whether, out of, e.g., a 320-game budget, I could spend it more effectively playing my way or playing Bob's way. (Which, I'd like to emphasise, would not tell us anything about whether the same result would hold for Crafty, especially considering the source-available restriction that he has; he wouldn't be able to make the choice between more engines vs. more games.)
nczempin

Re: Variance reports for testing engine improvements

Post by nczempin »

hgm wrote: Yes, I think so. WinBoard seems to ignore the move number in the FEN (if there is one), and starts counting at 1. I never played from FENs before; the Nunn positions I usually play from are given as a PGN file with the moves leading to those positions from the opening. In principle I like that better, as it fills in the full game history, which might be important for rep-draws.

I could send you the PGN files if you like, but I don't think there is really much remarkable in the games themselves. The result of the match confirms my earlier observation (from playing both engines in round robins with many other engines, at various time controls) that Eden and uMax 1.6 are extremely closely matched. This is why I wondered if they were using the same algorithm. It is all the more funny that they don't. I watched many of the games, and it seems that they indeed have different strong and weak points. uMax seems to be tactically better (perhaps due to a recapture extension), but it bungles many endgames, where Eden is obviously much more adept at pushing its passers. And Eden is much better at avoiding unwanted rep-draws (or, more aptly put, uMax is an idiot in this respect).
Please do send me the games; I like to collect any games that any version of Eden plays anywhere. I'll think about the 8-million problem once I get there :-)
hgm

Re: Variance reports for testing engine improvements

Post by hgm »

Given the fact that we have engines with a very low variability, I would try to profit from that. So the last thing I would do is try to play against more variable opponents, or increase the variability of my own engine by introducing random components into the evaluation.

Just use more different deterministic opponents. That is better anyway, as you will test against more playing styles. If you run out of suitable opponents (and around Eden's level that is not so easy either, as many engines there are buggy or don't implement needed features, such as force or edit, or their time management goes haywire when you use these features...), just use more different starting positions.

It is easy to generate thousands of starting positions: just let two engines with good and varied books play a long match at a very fast time control, and then delete all moves beyond move 12 from the PGN. You can then use that PGN for starting positions. The advantage is that you would play both versions you are comparing from the same starting position, with an 80% probability that they play the same game unless the difference you want to test 'kicks in'. That would strongly reduce the sampling variance when testing small changes.
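
A minimal sketch of the truncation step, assuming the python-chess library and hypothetical file names (fast_match.pgn in, openings.pgn out; 12 full moves = 24 plies):

    import chess.pgn

    with open("fast_match.pgn") as fin, open("openings.pgn", "w") as fout:
        while True:
            game = chess.pgn.read_game(fin)
            if game is None:
                break                         # end of the input PGN
            stub = chess.pgn.Game()
            stub.headers.update(game.headers)
            stub.headers["Result"] = "*"      # the stub is an unfinished game
            node = stub
            for ply, move in enumerate(game.mainline_moves()):
                if ply >= 24:                 # keep only the first 12 full moves
                    break
                node = node.add_variation(move)
            print(stub, file=fout, end="\n\n")

Each test game is then played out from the end of such a stub, so both versions inherit an identical, fully specified game history (which also keeps repetition detection honest).
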
Alessandro Scotti

Re: Variance reports for testing engine improvements

Post by Alessandro Scotti »

nczempin wrote:
Alessandro Scotti wrote:
nczempin wrote: Okay, I will try to follow these specifications as closely as possible. I will still run them at 2/6 however, and stop complaining about resources.
IMO you don't have to worry too much about test matches until the engine is reasonably free of bugs. Such an engine will be stronger than 2000 Elo with just a minimal set of features. Until that point, I think it's better to spend the time on unit tests and maybe on reviewing the code and algorithms.
After that, I'm still convinced that 1+1 is preferable to 2+6, as it allows more games to be played.
Given the goal I have for releases of Eden, which I have stated here sufficiently often, this is not an option. Yes, you can question the goal, but I will stick with it.
I must have missed those posts, but I think I can live without knowing your goal, let alone questioning it. Have fun with your project!
hgm

Re: Variance reports for testing engine improvements

Post by hgm »

nczempin wrote:
nczempin wrote: I am still open to suggestions as to what engines I should use. I will choose one engine to run against, and then choose the next one, so as long as the matches with the first engine have not finished yet, I am still free to choose the others.

My first opponent shall be Pooky 2.7.
Okay, so choosing Pooky as an opponent for this test was a disaster. It seems to have a bug where, after about 6 moves in the UCI "position fen ... moves" command, it thinks it is white when it is actually black, and thus plays an illegal move.

So the next one I chose was Yawce (again, based on the observed variance in the normal gauntlet). That one doesn't even support the "edit" command according to Arena, so you can't start from any position other than the standard initial one.
OK, I fixed the time-control problems in all uMax versions. The problem with the 2'+6" time control you are using is that most of the time comes from the increment. uMax nominally reckons with 40 moves of remaining game duration, so the 2' contribute only 3"/move in the beginning, and even less towards the end.

And the formula for time/move that uMax was using was basically:

Increment + TimeLeft/(NrMovesLeft+4)

For Increment = 0 this is OK, and leaves 5 times the nominal move time for the last move before the time control, which is sufficient to accommodate worst-case fluctuations in the time needed to finish an iteration. In incremental time controls, NrMovesLeft is taken to be 40.

When the contribution from the second term gets very low, because TimeLeft runs out, this formula starts to use a nominal time equal to the increment, meaning that it pretty much plans to spend on a single move all the time remaining before the axe falls, without any safety margin. As the nominal time is supposed to be the average time a move takes, a time forfeit follows very quickly. I fixed this by adding a line that caps the time given by the formula at TimeLeft/5 whenever the formula would give more. This seems to work; I have observed no time losses at 1'+3" so far.
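
A minimal sketch of that allocation with the cap, in Python (names and units are mine; uMax itself is C, but the logic is the same):

    def move_budget(time_left, increment, moves_left=40):
        # Nominal budget: the formula described above
        t = increment + time_left / (moves_left + 4)
        # Safety cap: never plan to spend more than a fifth of the
        # remaining time, so the increment term cannot push the engine
        # past the flag once time_left runs low
        return min(t, time_left / 5)

At 2'+6" this gives about 8.7" for the first move; near the flag the TimeLeft/5 cap takes over, so each move gets only a shrinking fraction of what is left.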

This would make uMax a very suitable test opponent for you: it does support the 'force' and 'edit' commands, allowing you to play from both FEN and PGN initial positions, it is about as deterministic as they come (perhaps even more so than Eden), and it is very stable. An advantage is that a whole family of uMax engines is available (five of them as compiled downloads from my website), varying in strength from the current level of Eden (uMax 1.6) to the bottom of the CCRL rating list (uMax 4.8). So you could upgrade as you go, and you don't have to immediately start with uMax 4.8, which might be a bit too strong and thus tell you little about the progress you are making.
nczempin

Re: Variance reports for testing engine improvements

Post by nczempin »

hgm wrote: This would make uMax a very suitable test opponent for you: it does support the 'force' and 'edit' commands, allowing you to play from both FEN and PGN initial positions, it is about as deterministic as they come (perhaps even more so than Eden), and it is very stable. An advantage is that a whole family of uMax engines is available (five of them as compiled downloads from my website), varying in strength from the current level of Eden (uMax 1.6) to the bottom of the CCRL rating list (uMax 4.8). So you could upgrade as you go, and you don't have to immediately start with uMax 4.8, which might be a bit too strong and thus tell you little about the progress you are making.
Well, uMax 4.8 is interesting enough that I actually downloaded it. But using earlier versions violates my (probably irrational) policy of only ever using the latest versions of engines in my tests.

Actually, this policy comes from another goal I have with Eden: I want to see its progress in those tournaments it participates in, such as the UCI Engines Ligue, ChessWar/OpenWar and WBEC. And that means I should play mainly against engines that also take part in these events and are close in strength to Eden.

Since those tournaments usually use the latest version, that's what I use too.

But don't worry, my guess is that I'll be on your back sooner than you will like :-)
hgm

Re: Variance reports for testing engine improvements

Post by hgm »

Well, you should not really see the different micro-Maxes as versions of the same engine. Some are more like parallel developments, optimized according to different criteria. I made uMax 1.6 after uMax 4.4, in the development line aimed at making the smallest engine (source-code wise), while the uMax 4 series aims at maximal Elo/character. It is very possible that there will be a uMax 1.7 and no uMax 4.9, so that the far weaker uMax 1.7 would be the latest version...

So far uMax 4.8 leads Eden 17.5-0.5 in the Silver match at 1'+3".