more on engine testing

hgm · Post by **hgm** » Fri Aug 01, 2008 9:13 am

I don't see any problem here. The results of the 800-game matches are all the same within the quoted statistical error. So of course one cannot conclude that one version is better than the other. I am not sure if the intervals quoted by BayesElo are 68% confidence or 95% confidence (i.e. 1 sigma or 2 sigma), but, when repeated often enough, of course you are bound to also find results deviating more than the quoted statistical uncerrtainty. In 200 tries, 10 results should ly outside 2 sigma. This is why it is called a 95% confidence interval, and not a 100%.

So 800 games are not enough to show one of the versions to better. In fact no number of games would be enough, as they are indeed equal. But if they were not equal, and one would score 90% on average against the the opponents where the other would score only 10%, a dozen games each would already be enough to point out the better one with >99.9% confidence. (40%*sqrt(2/12) ~ 16%, so that would be a 5-sigma deviation.) Surprise! small differences are more difficult to measure than large differences...

In the 25,000-games test you know at least that it is extremely unlikely that the version with Elo=+2 would not be better than that with -19. The uncertainty in the difference is sqrt(4*4+4*4) = 6, the difference is 21. I guess the 6 is even a 2-sigma interval, after all (25,000 games give a standard deviation in the score of 40%/sqrt(25,000) = 0.25%, which translate to roughtly 1.8 Elo point, while 4 was indicated). So you deviate by 6 sigma, which makes it 99.99...% certain that one of the versions is better.

But that was statistics. The question of course should be: "better at what?". And the answer is of course: "better at beating these particular 5 opponents starting from the Sliver positions".

As not all starting positions give the same score expectation value (i.e. score after asymptotically many games) between the same opponents, there is a statistical error associated with sampling the positions as well. Analyze your data by starting position, and you will have a pretty good idea how large it is (40 positions are enough to give a good impression of the standard deviation of the entire population). Say it is 5% (optimistic, IMO...). Then, averaged over 40 positions, a 1-sigma uncertainty of 0.8% in score expectancy will remain. This would show up as 11 Elo points in the Elo bounds, when converted to units comparable to the BayesElo-given error bounds.

So the likelihood that the 'better' Crafty version is also better at scoring pints against opponents picked from the entire population of Chess-playing entities, starting from arbitrary positions, is not nearly as large as you might think from the bounds given by BayesElo. As these bounds only apply to the statistical error in the game results, and not to the errors due to position selection and opponent selection. (Note that the latter also contributes a sampling error, as the score of one particular program against two different opponents will in general not be the same. And a sample of 5 is very, very small...)

So in short: no matter if you have 10,000, 25,000, 1,000,000 or 100,000,000 games, as long as you use only 5 opponents and 40 starting positions, the results will be pretty useless for determining which version would be better in general.

krazyken · Post by **krazyken** » Fri Aug 01, 2008 9:47 am

Nice statistics! It's the kind of thing my stats prof would give the class. Take this data and build a case for X. Now take the same data and build a case against X.

Your 2 25k matches are nice. Of course there is a possibility that the results could happen that way with no changes.

It is more likely that something did change. If software was controlled, learning off, same openings and opponents played in the same order, etc. SMP algorithms are frequently non-deterministic aren't they? It was 2 SMP engines that varied the most.

Then, of course, one should question whether the environmental variables were the same. Possibilities include CPU temp, background processes, available memory and things of that sort.

hgm · Post by **hgm** » Fri Aug 01, 2008 11:32 am

Well, he says they were changed between the runs, didn't he?

If these are the same, I would not trust it: then there would be very strong evidence that you are not doing what you think you are doing, (whatever the reason may be), as the probability of a 6-sigma fluke is virtually nil. Unless, of course, this was not a typical result, but the most extreme fluctuation selected from a million tries. Then it could become quite likely, again. It is very easy to fool yourself with statistics, by selecting samples in stead of sampling randomly. It will always be possible to quote examples of incredibly unlikely events, after enough tries.

mathmoi · Post by **mathmoi** » Fri Aug 01, 2008 3:43 pm

hgm wrote:Well, he says they were changed between the runs, didn't he?

No, I don't think he did. All the opponents for all the test are the sames. The point of the experiment is that if you run a long match two times with the sames opponents you can get pretty differents results.

Here is the relevent quote:

bob wrote:The question you want to answer from the above is this: crafty-22.2 in the first run was slightly modified for the second run. Was the change good or bad? How sure are you? Then I will add that crafty-22.2 for _both_ runs was identical. Now which one is better?

hgm · Post by **hgm** » Fri Aug 01, 2008 4:00 pm

Well, if the Crafty version was the same, and the other engines are as well, it just shows one of two things:

1) the test-method is completely broken.
2) the test result was selected as the most a-typical case from a huge data-base of results.

Case 1 could occur, for example, because one of the machines you had been using forrunning the test had a defective memory bit, which reduced playing strength of the engine using that memory, and that Crafty was suffering from this for a large number of games in a row (because it stayed loaded between games) while in another test one of the opponents was similarly suffering. If the distributionof 25,000-game results does have a standard deviation large enough that results like the reported one are likely, it proves that the results of individual game are not independent. No matter how unlikely any mechanism you could imagine to cause this is (supposing you made an effort to exclude the likely causes), it is stiil astronomically more likely that this is the case than that the reported result would be due to chance.

Case 2 I already discussed. If you try something a million times, you will almost certainly (well, with 1-1/e probability, actually) get at least one 'one-in-a-million' fluke. This is actually why they call it a one-in-a-million fluke. So it would be silly to start bragging about how lucky / unlucky you were.

So the moral lesson is this: believe the result distributions you see. If they do not match the normal distribution that is predicted for independent games, you know your test setup and/or method is flawed, and the width of the distribution might give you a pointer as to what cused the problem. (The effectve number of independent games will tell you ho big the correlated clusters of results were, and that might for example correspond to the number of games played on each machine, or against each opponent.)

bob · Post by **bob** » Fri Aug 01, 2008 4:00 pm

Michael Sherwin wrote:Even though you are correct (people just do not understand the severity of the random factor) it is not much of a consolation to know the truth of this matter. All the chess engine raters out there do a fair to good job rating the engines and their work taken as a whole and averaged is rather accurate. Most of us engine authors are restricted to minimal testing schemes , because of minimal hardware (often outdated) and we end up releasing a new version based on our gut feelings as much as on testing. How can knowing the truth, help us? If all the raters out there took heed of the truth then they might just quit. Where would we be then?

I'm not worried about the "raters". I'm worried about the "programmers". Those that make changes and then test to see if the changes are good or bad. Doing so with just 100 games or whatever is hopeless... Even the 160 game tests I posted are useless unless one finds a _major_ enhancement that overruns the wide error margin.

bob · Post by **bob** » Fri Aug 01, 2008 4:03 pm

Dirt wrote:Unless you changed the default, BayesElo shows a 95% confidence range, or two sigma. So your 21 point change in Crafty (which was +/-4 in both runs) was a ten sigma change. Either BayesElo is calculating the ranges wrong or there was something different about the two runs. Or you have some freakish luck, which keeps happening over and over and ... play the lottery much?

This is stock BayesElo. The only thing different about the two runs is that one started at time T, the next started at time T+N where N is how long it took the first run to occur. I've been describing this problem for quite a while. Note that the four runs I posted were four consecutive runs. I had queued up some 5000+ game matches, but our A/C in the cluster room had a problem during the night and the cluster was powered down to avoid overheating.

bob · Post by **bob** » Fri Aug 01, 2008 4:18 pm

hgm wrote:I don't see any problem here. The results of the 800-game matches are all the same within the quoted statistical error. So of course one cannot conclude that one version is better than the other. I am not sure if the intervals quoted by BayesElo are 68% confidence or 95% confidence (i.e. 1 sigma or 2 sigma), but, when repeated often enough, of course you are bound to also find results deviating more than the quoted statistical uncerrtainty. In 200 tries, 10 results should ly outside 2 sigma. This is why it is called a 95% confidence interval, and not a 100%.

So 800 games are not enough to show one of the versions to better. In fact no number of games would be enough, as they are indeed equal. But if they were not equal, and one would score 90% on average against the the opponents where the other would score only 10%, a dozen games each would already be enough to point out the better one with >99.9% confidence. (40%*sqrt(2/12) ~ 16%, so that would be a 5-sigma deviation.) Surprise! small differences are more difficult to measure than large differences...

In the 25,000-games test you know at least that it is extremely unlikely that the version with Elo=+2 would not be better than that with -19. The uncertainty in the difference is sqrt(4*4+4*4) = 6, the difference is 21. I guess the 6 is even a 2-sigma interval, after all (25,000 games give a standard deviation in the score of 40%/sqrt(25,000) = 0.25%, which translate to roughtly 1.8 Elo point, while 4 was indicated). So you deviate by 6 sigma, which makes it 99.99...% certain that one of the versions is better.

But that was statistics. The question of course should be: "better at what?". And the answer is of course: "better at beating these particular 5 opponents starting from the Sliver positions".

As not all starting positions give the same score expectation value (i.e. score after asymptotically many games) between the same opponents, there is a statistical error associated with sampling the positions as well. Analyze your data by starting position, and you will have a pretty good idea how large it is (40 positions are enough to give a good impression of the standard deviation of the entire population). Say it is 5% (optimistic, IMO...). Then, averaged over 40 positions, a 1-sigma uncertainty of 0.8% in score expectancy will remain. This would show up as 11 Elo points in the Elo bounds, when converted to units comparable to the BayesElo-given error bounds.

So the likelihood that the 'better' Crafty version is also better at scoring pints against opponents picked from the entire population of Chess-playing entities, starting from arbitrary positions, is not nearly as large as you might think from the bounds given by BayesElo. As these bounds only apply to the statistical error in the game results, and not to the errors due to position selection and opponent selection. (Note that the latter also contributes a sampling error, as the score of one particular program against two different opponents will in general not be the same. And a sample of 5 is very, very small...)

So in short: no matter if you have 10,000, 25,000, 1,000,000 or 100,000,000 games, as long as you use only 5 opponents and 40 starting positions, the results will be pretty useless for determining which version would be better in general.

Here's the point, that continually gets overlooked. If 800 games gives two rating ranges that do _not_ overlap, for two identical programs, then it is impossible to use 800 game matches to determine whether A' is better than A, when that change will, by definition, be small. The Rybka testing using 80K games to "measure Elo changes as small as 1" (Larry K's words, not mine) is, as I said when it was posted, incorrect. I don't believe you can actually measure Elo changes of +/- 1 with any degree of confidence at all, without playing hundreds of thousands of games at a minimum.

The BayesElo confidence interval is 95%. Which, as I have said previously, doesn't mean a lot when programs exhibit so much natural non-determinism.

Once more, the runs were consecutive runs. I did not cherry-pick the wildest 4 our of hundreds of matches. I just ran 4 matches and cut/pasted the results. These results are _typical_. And most "testers" are using a similar number of opponents, playing _far_ fewer games, and then making go/no-go decisions about changes based on the results. And much of that is pure random noise.

BTW the stuff about the 40 positions, or the 5 opponents is way beside the point. No matter what 40 positions I choose, and it would seem to me that the smaller number the more stable the results, I ought to be able to produce a stable answer about those 40 positions, whether or not that carries over to other positions depends on how "general" those positions are (and the silver positions are pretty generic/representative of opening positions). My point is this: If I can't play a long match against _one_ opponent, using 5 starting positions, and get similar results for each run (in terms of Elo) then adding more programs and more positions only increases the noise, it certainly doesn't reduce it.

Uri Blass · Post by **Uri Blass** » Fri Aug 01, 2008 4:28 pm

bob wrote:
hgm wrote:I don't see any problem here. The results of the 800-game matches are all the same within the quoted statistical error. So of course one cannot conclude that one version is better than the other. I am not sure if the intervals quoted by BayesElo are 68% confidence or 95% confidence (i.e. 1 sigma or 2 sigma), but, when repeated often enough, of course you are bound to also find results deviating more than the quoted statistical uncerrtainty. In 200 tries, 10 results should ly outside 2 sigma. This is why it is called a 95% confidence interval, and not a 100%.

So 800 games are not enough to show one of the versions to better. In fact no number of games would be enough, as they are indeed equal. But if they were not equal, and one would score 90% on average against the the opponents where the other would score only 10%, a dozen games each would already be enough to point out the better one with >99.9% confidence. (40%*sqrt(2/12) ~ 16%, so that would be a 5-sigma deviation.) Surprise! small differences are more difficult to measure than large differences...

In the 25,000-games test you know at least that it is extremely unlikely that the version with Elo=+2 would not be better than that with -19. The uncertainty in the difference is sqrt(4*4+4*4) = 6, the difference is 21. I guess the 6 is even a 2-sigma interval, after all (25,000 games give a standard deviation in the score of 40%/sqrt(25,000) = 0.25%, which translate to roughtly 1.8 Elo point, while 4 was indicated). So you deviate by 6 sigma, which makes it 99.99...% certain that one of the versions is better.

But that was statistics. The question of course should be: "better at what?". And the answer is of course: "better at beating these particular 5 opponents starting from the Sliver positions".

As not all starting positions give the same score expectation value (i.e. score after asymptotically many games) between the same opponents, there is a statistical error associated with sampling the positions as well. Analyze your data by starting position, and you will have a pretty good idea how large it is (40 positions are enough to give a good impression of the standard deviation of the entire population). Say it is 5% (optimistic, IMO...). Then, averaged over 40 positions, a 1-sigma uncertainty of 0.8% in score expectancy will remain. This would show up as 11 Elo points in the Elo bounds, when converted to units comparable to the BayesElo-given error bounds.

So the likelihood that the 'better' Crafty version is also better at scoring pints against opponents picked from the entire population of Chess-playing entities, starting from arbitrary positions, is not nearly as large as you might think from the bounds given by BayesElo. As these bounds only apply to the statistical error in the game results, and not to the errors due to position selection and opponent selection. (Note that the latter also contributes a sampling error, as the score of one particular program against two different opponents will in general not be the same. And a sample of 5 is very, very small...)

So in short: no matter if you have 10,000, 25,000, 1,000,000 or 100,000,000 games, as long as you use only 5 opponents and 40 starting positions, the results will be pretty useless for determining which version would be better in general.
Here's the point, that continually gets overlooked. If 800 games gives two rating ranges that do _not_ overlap, for two identical programs, then it is impossible to use 800 game matches to determine whether A' is better than A, when that change will, by definition, be small. The Rybka testing using 80K games to "measure Elo changes as small as 1" (Larry K's words, not mine) is, as I said when it was posted, incorrect. I don't believe you can actually measure Elo changes of +/- 1 with any degree of confidence at all, without playing hundreds of thousands of games at a minimum.

The BayesElo confidence interval is 95%. Which, as I have said previously, doesn't mean a lot when programs exhibit so much natural non-determinism.

Once more, the runs were consecutive runs. I did not cherry-pick the wildest 4 our of hundreds of matches. I just ran 4 matches and cut/pasted the results. These results are _typical_. And most "testers" are using a similar number of opponents, playing _far_ fewer games, and then making go/no-go decisions about changes based on the results. And much of that is pure random noise.

If the results are typical the only conclusion is that the results are not independent.

If you write the number of nodes per second in every move then I suggest to look for the smallest numbers and the biggest number.

You may discover that out of 1600 games the 100 smallest numbers all happened at games 200-500.

Another possibility is that Crafty does not leave the memory of the computer after every game and in that case some bug like an uninitialized variable can cause random change in the behaviour of Crafty during the match so games are dependent.

Uri

bob · Post by **bob** » Fri Aug 01, 2008 4:30 pm

krazyken wrote:Nice statistics! It's the kind of thing my stats prof would give the class. Take this data and build a case for X. Now take the same data and build a case against X.

Your 2 25k matches are nice. Of course there is a possibility that the results could happen that way with no changes.

It is more likely that something did change. If software was controlled, learning off, same openings and opponents played in the same order, etc. SMP algorithms are frequently non-deterministic aren't they? It was 2 SMP engines that varied the most.

Except that all programs use one cpu here. This makes the testing go faster and also lets me test against programs that don't have SMP, without exaggerating the difference due to a speed advantage. All my normal testing is just one cpu for Crafty vs one cpu for the opponent.

I am going to try, before long, to write a paper describing the SMP decisions I made in Crafty, and then run some matches on our bigger cluster where Crafty uses 1, then 2, then 4 and then 8 processors, to the opponent's one, to get an idea of how Elo is affected by SMP speedup. But not in these tests.

Then, of course, one should question whether the environmental variables were the same. Possibilities include CPU temp, background processes, available memory and things of that sort.

Our cluster is consistent in that regard. No processes on a node except for mine, as once I am given a node it is mine and the queue scheduler won't run anything else on that node at all. There is always the issue of network traffic, although the nodes themselves are behind a "front end node" and do not see "outside world" network traffic, but they do see "internal traffic" although those runs were with me using the entire cluster so the only network traffic was writing result files to a humongous file storage sub-system that sits on the infiniband network. But this is as constant as it can be, from that perspective. Memory also is constant at 12 gigs per node although I use nowhere near 1/3 of that for these fairly fast games. There is just a tremendous amount of variability in measuring time on these systems, which lets the searches from the same position search different numbers of nodes on successive runs. And I have discovered that just varying by 100 nodes in each search is way more than enough to produce completely different games, even though the idea seems impossible.

more on engine testing

Re: more on engine testing

Re: more on engine testing

Re: more on engine testing

Re: more on engine testing

Re: more on engine testing

Re: more on engine testing

Re: more on engine testing

Re: more on engine testing

Re: more on engine testing

Re: more on engine testing