more on engine testing

Discussion of chess software programming and technical issues.

Moderator: Ras

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

more on engine testing

Post by bob »

A while back I mentioned how difficult it is to draw conclusions about relatively modest changes in a chess program, and how it takes a ton of games to get usable comparisons. Here is a sample that shows this in a way that is pretty easy to understand.

First, I played crafty vs 5 other opponents, including an older 21.7 version. The version I am testing here is not particularly good yet, since it has some significant "removals" from the evaluation, so the results are not particularly interesting from that perspective. Each of the 5 opponents was played from 40 starting positions, 4 rounds per position with colors alternated, which is 160 games per opponent and 800 games per match in total. I am giving 4 consecutive match results, all against the same opponents, all played at a time control of 5 + 5 (5 minutes on the clock, 5 seconds increment added per move). I lost a game here and there due to data corruption on our big storage system (a separate issue), so some of the matches show 799 rather than 800 games because once in a while the PGN for the last game was corrupted.
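
As a rough back-of-the-envelope check on how big the error bars for matches of this size have to be, here is an illustrative sketch (not the actual test harness; the 0.4 per-game standard deviation and the roughly 7-Elo-per-percent conversion near 50% are rule-of-thumb assumptions for fairly even opponents with a modest draw rate). It lands close to the +/-40-ish and +/-18-ish bounds BayesElo prints below.

Code: Select all

import math

# Match structure: 5 opponents x 40 positions x 4 rounds (colors alternated).
opponents, positions, rounds = 5, 40, 4
games_per_opponent = positions * rounds           # 160
games_per_match = opponents * games_per_opponent  # 800

def elo_error(games, sigma_per_game=0.4, z=2.0):
    # Approximate z-sigma error bar in Elo for a score near 50%.
    # Near 50%, 1% of score is worth roughly 7 Elo points.
    score_error = z * sigma_per_game / math.sqrt(games)
    return 700.0 * score_error

print("+/- %.0f Elo per opponent (%d games)" % (elo_error(games_per_opponent), games_per_opponent))
print("+/- %.0f Elo per match    (%d games)" % (elo_error(games_per_match), games_per_match))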

I ran these 800-game matches through Remi's BayesElo. You can look at the four sets of results, but imagine that in each of those tests crafty-22.2 was a slightly different version with a tweak or two added. Which of the four looks the best? Then realize that the programs are identical across all 4 matches. How would one reliably draw any conclusion from a match of only 800 games, when the error bar is significant and the variability is even more significant? First the data:

Code: Select all

Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   121   42   41   160   68%   -18   17%
   2 Glaurung 1.1 SMP        61   42   41   160   60%   -18   13%
   3 Fruit 2.1               49   41   40   160   59%   -18   15%
   4 opponent-21.7           13   38   38   159   55%   -18   33%
   5 Crafty-22.2            -18   18   18   799   47%     4   19%
   6 Arasan 10.0           -226   42   45   160   23%   -18   18%
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5    81   42   41   160   63%   -17   16%
   2 opponent-21.7           61   38   38   159   62%   -17   33%
   3 Glaurung 1.1 SMP        46   42   41   160   58%   -17   13%
   4 Fruit 2.1               35   40   40   160   57%   -17   19%
   5 Crafty-22.2            -17   18   18   799   47%     3   19%
   6 Arasan 10.0           -205   42   45   160   26%   -17   16%
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   113   43   41   160   66%   -12   12%
   2 opponent-21.7           73   39   38   159   63%   -12   32%
   3 Fruit 2.1               21   41   40   160   54%   -12   15%
   4 Crafty-22.2            -12   18   18   799   48%     2   18%
   5 Glaurung 1.1 SMP       -35   41   41   160   47%   -12   11%
   6 Arasan 10.0           -161   41   43   160   30%   -12   18%
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   131   45   42   160   70%   -33   10%
   2 Fruit 2.1               64   41   40   160   63%   -33   19%
   3 Glaurung 1.1 SMP        25   41   40   160   58%   -33   15%
   4 opponent-21.7           13   37   37   160   57%   -33   36%
   5 Crafty-22.2            -33   18   18   800   45%     7   19%
   6 Arasan 10.0           -199   42   44   160   29%   -33   15%
Notice first that _everybody_ in the test gets significantly different results from match to match. The overall order (with the exception of Glaurung 2, which stays at the top) shuffles around significantly.
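
To get a feel for how much of that shuffling pure chance alone can produce, here is a small Monte Carlo sketch (illustrative only; the win/draw probabilities are made-up round numbers for an evenly matched pair with about 19% draws, and each game is treated as an independent trial). Two engines of truly fixed and equal strength still "measure" quite different Elo values over 160-game samples.

Code: Select all

import math, random

def measured_elo(games=160, p_win=0.405, p_draw=0.19):
    # Play 'games' independent games with fixed result probabilities
    # (expected score exactly 50%) and return the Elo measured from
    # this particular sample.
    score = 0.0
    for _ in range(games):
        r = random.random()
        if r < p_win:
            score += 1.0
        elif r < p_win + p_draw:
            score += 0.5
    s = min(max(score / games, 1e-6), 1.0 - 1e-6)
    return -400.0 * math.log10(1.0 / s - 1.0)

random.seed(1)
print(["%+.0f" % measured_elo() for _ in range(20)])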

Now does anyone _really_ believe that 800 games are enough? Later I will show some _much_ bigger matches as well, showing the same kind of variability. Here are two quick ones, roughly 25,000 games per match, just for starters (same time control):

Code: Select all

Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   123    8    8  5120   66%     2   15%
   2 Fruit 2.1               38    8    7  5119   55%     2   19%
   3 opponent-21.7           28    7    7  5119   54%     2   34%
   4 Crafty-22.2              2    4    4 25597   50%     0   19%
   5 Glaurung 1.1 SMP         2    8    8  5120   50%     2   14%
   6 Arasan 10.0           -193    8    9  5119   26%     2   15%
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   118    8    8  5120   67%   -19   13%
   2 Fruit 2.1               42    8    8  5120   58%   -19   17%
   3 opponent-21.7           32    7    7  5115   58%   -19   36%
   4 Glaurung 1.1 SMP        20    8    8  5120   55%   -19   12%
   5 Crafty-22.2            -19    4    4 25595   47%     4   19%
   6 Arasan 10.0           -193    8    8  5120   28%   -19   16%
The question you want to answer from the above is this: crafty-22.2 in the first run was slightly modified for the second run. Was the change good or bad? How sure are you? Then I will add that crafty-22.2 for _both_ runs was identical. Now which one is better? :) There is a 21 Elo difference between the two. The first result says 2 +/- 4, while the second says -19 +/- 4. The ranges don't even overlap. This points out that this kind of statistic is good for the sample under observation, but not necessarily representative of the total population of potential games unless you play a _lot_ more games. Some would say the second match shows crafty is somewhere between -15 and -23, which is OK. But then what does the first big match say? :)
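
For anyone who wants to redo the arithmetic, here is a simple sketch of turning a score over N games into an Elo estimate with an approximate 95% interval. Illustrative only: it uses a naive logistic conversion and an assumed 0.4 per-game standard deviation, so the numbers will not match BayesElo's exactly (BayesElo fits a draw model and adjusts for opponent strength), but it shows why the two intervals cannot overlap.

Code: Select all

import math

def elo_and_bounds(score_pct, games, sigma_per_game=0.4):
    # Elo estimate with an approximate 95% (2-sigma) interval,
    # assuming independent games and ~7 Elo per 1% of score.
    s = score_pct / 100.0
    elo = -400.0 * math.log10(1.0 / s - 1.0)
    half_width = 2.0 * 700.0 * sigma_per_game / math.sqrt(games)
    return elo, half_width

# The two big Crafty runs quoted above (identical program both times):
for score_pct, games in ((50, 25597), (47, 25595)):
    elo, hw = elo_and_bounds(score_pct, games)
    print("%d%% over %d games -> %+.0f Elo, +/- %.0f" % (score_pct, games, elo, hw))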

"things that make you go hmmm......."
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: more on engine testing

Post by Michael Sherwin »

Even though you are correct (people just do not understand the severity of the random factor), it is not much of a consolation to know the truth of this matter. All the chess engine raters out there do a fair to good job rating the engines, and their work, taken as a whole and averaged, is rather accurate. Most of us engine authors are restricted to minimal testing schemes because of minimal (often outdated) hardware, and we end up releasing a new version based on our gut feelings as much as on testing. How can knowing the truth help us? If all the raters out there took heed of the truth, they might just quit. Where would we be then?
Dirt
Posts: 2851
Joined: Wed Mar 08, 2006 10:01 pm
Location: Irvine, CA, USA

Re: more on engine testing

Post by Dirt »

Unless you changed the default, BayesElo shows a 95% confidence range, or two sigma. So your 21-point change in Crafty (which was +/-4 in both runs) was a ten-sigma change. Either BayesElo is calculating the ranges wrong, or there was something different about the two runs. Or you have some freakish luck that keeps happening over and over and ... do you play the lottery much?
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: more on engine testing

Post by Michael Sherwin »

Dirt wrote:Unless you changed the default, BayesElo shows a 95% confidence range, or two sigma. So your 21-point change in Crafty (which was +/-4 in both runs) was a ten-sigma change. Either BayesElo is calculating the ranges wrong, or there was something different about the two runs. Or you have some freakish luck that keeps happening over and over and ... do you play the lottery much?
The random effect inside a computer is not like the randomness in nature or that of a roulette wheel. It is not really random at all. There are all kinds of cycles of 'randomness' going on inside the computer that cause 'clumping of data' far in excess of what statistical math was designed to measure. An unrigged roulette wheel will hit 32 reds in a row every so many hundred years. A computer's random effect, put in terms of roulette, will hit 320 reds in a row every year.
hgm
Posts: 28356
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: more on engine testing

Post by hgm »

I don't see any problem here. The results of the 800-game matches are all the same within the quoted statistical error, so of course one cannot conclude that one version is better than the other. I am not sure whether the intervals quoted by BayesElo are 68% or 95% confidence (i.e. 1 sigma or 2 sigma), but when you repeat the test often enough you are of course bound to also find results that deviate by more than the quoted statistical uncertainty. In 200 tries, about 10 results should lie outside 2 sigma. This is why it is called a 95% confidence interval, and not a 100% one.

So 800 games are not enough to show one of the versions to be better. In fact no number of games would be enough, as they are indeed equal. But if they were not equal, and one would score 90% on average against these opponents where the other would score only 10%, a dozen games each would already be enough to point out the better one with >99.9% confidence (40%*sqrt(2/12) ~ 16%, so that would be a 5-sigma deviation). Surprise! Small differences are more difficult to measure than large differences...
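
The same rule of thumb can be turned around to estimate how many games a given Elo difference needs before it sticks out as a clear 2-sigma effect. This is only a sketch under the usual assumptions (independent games, ~0.4 per-game standard deviation, ~7 Elo per percent near 50%), but it shows the 1/d^2 scaling:

Code: Select all

import math

def games_needed(elo_diff, sigma_per_game=0.4, z=2.0):
    # Games needed before an Elo difference of 'elo_diff' sticks out
    # as a z-sigma effect (independent games, ~7 Elo per 1% of score).
    score_diff = elo_diff / 700.0
    return int(math.ceil((z * sigma_per_game / score_diff) ** 2))

for d in (100, 50, 20, 10, 5):
    print("%4d Elo difference: about %6d games" % (d, games_needed(d)))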

In the 25,000-game tests you know at least that it is extremely unlikely that the version with Elo = +2 would not be better than the one with -19. The uncertainty in the difference is sqrt(4*4+4*4) = 6, and the difference is 21. I guess the 6 is even a 2-sigma interval, after all (25,000 games give a standard deviation in the score of 40%/sqrt(25,000) = 0.25%, which translates to roughly 1.8 Elo points, while 4 was indicated). So you deviate by 6 sigma, which makes it 99.99...% certain that one of the versions is better.
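
The error-propagation step, spelled out as a sketch (it treats the quoted +/-4 bounds as 2-sigma intervals and assumes a normal distribution, so the exact sigma count depends on how you read BayesElo's bounds, but the conclusion is the same):

Code: Select all

import math

def fluke_probability(z_sigma):
    # Two-sided probability that a normally distributed estimate lands
    # at least z_sigma standard deviations away from its true value.
    return math.erfc(z_sigma / math.sqrt(2.0))

# If the quoted +/-4 bounds are 2-sigma, each run has sigma ~ 2 Elo and
# the difference of two independent runs has sigma ~ 2.8 Elo.
sigma_diff = math.sqrt(2.0 ** 2 + 2.0 ** 2)
print("sigma of the difference: %.1f Elo" % sigma_diff)
for z in (2, 4, 6):
    print("a %d-sigma deviation happens by chance with p ~ %.1e" % (z, fluke_probability(z)))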

But that was statistics. The question of course should be: "better at what?". And the answer is of course: "better at beating these particular 5 opponents, starting from the Silver positions".

As not all starting positions give the same score expectation value (i.e. the score after asymptotically many games) between the same opponents, there is a statistical error associated with sampling the positions as well. Analyze your data by starting position, and you will get a pretty good idea how large it is (40 positions are enough to give a good impression of the standard deviation of the entire population). Say it is 5% (optimistic, IMO...). Then, averaged over 40 positions, a 1-sigma uncertainty of 0.8% in score expectancy will remain. Converted to units comparable to the error bounds BayesElo gives, this would show up as 11 Elo points.
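
In code, that estimate looks like this (a sketch only; the 5% per-position spread is the assumed figure from above, not measured data):

Code: Select all

import math

positions = 40
per_position_spread = 0.05   # assumed std of score expectancy across positions

# Averaging over 40 sampled positions leaves this much uncertainty in score:
sampling_sigma = per_position_spread / math.sqrt(positions)
print("score uncertainty from position sampling: %.2f%%" % (100.0 * sampling_sigma))

# Converted to Elo (~7 per 1%) and doubled to be comparable with the
# 2-sigma style bounds BayesElo prints:
print("roughly +/- %.0f Elo" % (2.0 * 700.0 * sampling_sigma))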

So the likelihood that the 'better' Crafty version is also better at scoring points against opponents picked from the entire population of chess-playing entities, starting from arbitrary positions, is not nearly as large as you might think from the bounds given by BayesElo, as these bounds only apply to the statistical error in the game results, and not to the errors due to position selection and opponent selection. (Note that the latter also contributes a sampling error, as the score of one particular program against two different opponents will in general not be the same. And a sample of 5 is very, very small...)

So in short: no matter if you have 10,000, 25,000, 1,000,000 or 100,000,000 games, as long as you use only 5 opponents and 40 starting positions, the results will be pretty useless for determining which version would be better in general.
krazyken

Re: more on engine testing

Post by krazyken »

Nice statistics! It's the kind of thing my stats prof would give the class: take this data and build a case for X; now take the same data and build a case against X.

Your two 25k matches are nice. Of course there is a possibility that the results could come out that way with no changes.

It is more likely that something did change, even if the software side was controlled (learning off, same openings, opponents played in the same order, etc.). SMP algorithms are frequently non-deterministic, aren't they? It was the 2 SMP engines that varied the most.

Then, of course, one should question whether the environmental variables were the same. Possibilities include CPU temperature, background processes, available memory, and things of that sort.
hgm
Posts: 28356
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: more on engine testing

Post by hgm »

Well, he said they were changed between the runs, didn't he?

If these are the same, I would not trust it: then there would be very strong evidence that you are not doing what you think you are doing (whatever the reason may be), as the probability of a 6-sigma fluke is virtually nil. Unless, of course, this was not a typical result, but the most extreme fluctuation selected from a million tries. Then it could become quite likely again. It is very easy to fool yourself with statistics by selecting samples instead of sampling randomly. It will always be possible to quote examples of incredibly unlikely events, after enough tries.
mathmoi
Posts: 290
Joined: Mon Mar 13, 2006 5:23 pm
Location: Québec
Full name: Mathieu Pagé

Re: more on engine testing

Post by mathmoi »

hgm wrote:Well, he said they were changed between the runs, didn't he?
No, I don't think he did. All the opponents for all the tests are the same. The point of the experiment is that if you run a long match twice with the same opponents, you can get pretty different results.

Here is the relevant quote:
bob wrote:The question you want to answer from the above is this: crafty-22.2 in the first run was slightly modified for the second run. Was the change good or bad? How sure are you? Then I will add that crafty-22.2 for _both_ runs was identical. Now which one is better?
hgm
Posts: 28356
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: more on engine testing

Post by hgm »

Well, if the Crafty version was the same, and the other engines were as well, it just shows one of two things:

1) the test-method is completely broken.
2) the test result was selected as the most a-typical case from a huge data-base of results.

Case 1 could occur, for example, because one of the machines you had been using for running the test had a defective memory bit, which reduced the playing strength of the engine using that memory, and Crafty was suffering from this for a large number of games in a row (because it stayed loaded between games), while in another test one of the opponents was similarly suffering. If the distribution of 25,000-game results does have a standard deviation large enough that results like the reported one are likely, it proves that the results of individual games are not independent. No matter how unlikely any mechanism you could imagine to cause this is (supposing you made an effort to exclude the likely causes), it is still astronomically more likely that this is the case than that the reported result would be due to chance.

Case 2 I already discussed. If you try something a million times, you will almost certainly (well, with 1-1/e probability, actually) get at least one 'one-in-a-million' fluke. This is actually why they call it a one-in-a-million fluke. So it would be silly to start bragging about how lucky / unlucky you were.
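
A quick check of that figure (just a sketch of the arithmetic):

Code: Select all

# Probability of seeing at least one 'one-in-a-million' fluke
# in a million independent tries:
p, n = 1.0e-6, 10 ** 6
print(1.0 - (1.0 - p) ** n)   # ~0.632, i.e. about 1 - 1/e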

So the moral lesson is this: believe the result distributions you see. If they do not match the normal distribution that is predicted for independent games, you know your test setup and/or method is flawed, and the width of the distribution might give you a pointer as to what caused the problem. (The effective number of independent games will tell you how big the correlated clusters of results were, and that might for example correspond to the number of games played on each machine, or against each opponent.)
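
As an illustration of that last remark, here is a sketch of estimating the effective number of independent games from how much wider the observed spread of match scores is than the binomial prediction. The 800-game match size and the 3% observed spread are hypothetical numbers, not taken from the reported data.

Code: Select all

import math

def effective_games(n_games, observed_score_std, sigma_per_game=0.4):
    # If games were independent, the standard deviation of the match
    # score would be sigma_per_game / sqrt(n_games); a wider observed
    # spread implies fewer effectively independent games.
    expected_std = sigma_per_game / math.sqrt(n_games)
    return n_games * (expected_std / observed_score_std) ** 2

# Hypothetical example: 800-game matches whose scores spread by 3%
# instead of the ~1.4% expected for independent games.
n_eff = effective_games(800, 0.03)
print("effective independent games: %.0f of 800" % n_eff)
print("implied cluster size: about %.0f games" % (800 / n_eff))
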
Last edited by hgm on Fri Aug 01, 2008 4:06 pm, edited 1 time in total.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more on engine testing

Post by bob »

Michael Sherwin wrote:Even though you are correct (people just do not understand the severity of the random factor), it is not much of a consolation to know the truth of this matter. All the chess engine raters out there do a fair to good job rating the engines, and their work, taken as a whole and averaged, is rather accurate. Most of us engine authors are restricted to minimal testing schemes because of minimal (often outdated) hardware, and we end up releasing a new version based on our gut feelings as much as on testing. How can knowing the truth help us? If all the raters out there took heed of the truth, they might just quit. Where would we be then?
I'm not worried about the "raters". I'm worried about the "programmers", those who make changes and then test to see whether the changes are good or bad. Doing so with just 100 games or whatever is hopeless... Even the 160-game tests I posted are useless unless one finds a _major_ enhancement that overwhelms the wide error margin.