bob wrote:
A while back I mentioned how difficult it is to draw conclusions about relatively modest changes in a chess program, and how it takes a ton of games to get usable comparisons. Here is a sample that shows this in a way that is pretty easy to understand.
First, I played crafty against 5 opponents, including an older 21.7 version. The version being tested here is not particularly good yet, since it represents some significant "removals" from the evaluation, so the results are not very interesting from that perspective. Each opponent was played from 40 starting positions, 4 games per position with colors alternating, which gives 160 games per opponent and 800 games per match. I am giving 4 consecutive match results, all against the same opponents, all played at a time control of 5+5 (5 minutes on the clock, 5 seconds increment added per move). I lost a game here and there to data corruption on our big storage system; once in a while the PGN for the last game would somehow be corrupted (a separate issue), so some matches show 799 games rather than 800.
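To get a feel for what the error bars on an 800-game match mean, here is a minimal back-of-the-envelope sketch (my own, not part of the test setup or of BayesElo): assuming every game is independent, it converts the standard error of a match score into an Elo interval via the usual logistic model. The function name, the z=1.96 normal approximation, and the draw-ratio handling are all my choices.

Code:
import math

# ~95% error bar (in Elo) for a match of n_games independent games.
# "score" is the expected score in [0,1], "draw_ratio" the fraction of draws.
def elo_error_bar(n_games, score=0.5, draw_ratio=0.0, z=1.96):
    p_win = score - 0.5 * draw_ratio
    # Per-game variance of the score: E[x^2] - E[x]^2 with x in {1, 0.5, 0}.
    var = p_win + 0.25 * draw_ratio - score ** 2
    se = math.sqrt(var / n_games)
    # Logistic Elo: Elo(p) = -400*log10(1/p - 1); slope d(Elo)/dp at p=score.
    slope = 400.0 / (math.log(10) * score * (1.0 - score))
    return z * se * slope

print(elo_error_bar(800, score=0.47, draw_ratio=0.19))    # ~ +/- 21 Elo
print(elo_error_bar(25600, score=0.47, draw_ratio=0.19))  # ~ +/- 4 Elo

This crude estimate lands in the same ballpark as the +/- 18 and +/- 4 figures BayesElo prints below; the exact numbers differ because BayesElo fits a full model rather than using a normal approximation.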
I ran these 800-game matches through Rémi Coulom's BayesElo. Look at the four sets of results, and imagine that in each of those tests crafty-22.2 was a slightly different version with a tweak or two added. Which of the four looks the best? Then realize that every program was identical across all 4 matches. How could one reliably draw any conclusion from a match of only 800 games, when the error bar is significant and the run-to-run variability is even more so? First the data:
Code:
Rank Name                   Elo    +    -  games score oppo. draws
   1 Glaurung 2-epsilon/5   121   42   41    160   68%   -18   17%
   2 Glaurung 1.1 SMP        61   42   41    160   60%   -18   13%
   3 Fruit 2.1               49   41   40    160   59%   -18   15%
   4 opponent-21.7           13   38   38    159   55%   -18   33%
   5 Crafty-22.2            -18   18   18    799   47%     4   19%
   6 Arasan 10.0           -226   42   45    160   23%   -18   18%

Rank Name                   Elo    +    -  games score oppo. draws
   1 Glaurung 2-epsilon/5    81   42   41    160   63%   -17   16%
   2 opponent-21.7           61   38   38    159   62%   -17   33%
   3 Glaurung 1.1 SMP        46   42   41    160   58%   -17   13%
   4 Fruit 2.1               35   40   40    160   57%   -17   19%
   5 Crafty-22.2            -17   18   18    799   47%     3   19%
   6 Arasan 10.0           -205   42   45    160   26%   -17   16%

Rank Name                   Elo    +    -  games score oppo. draws
   1 Glaurung 2-epsilon/5   113   43   41    160   66%   -12   12%
   2 opponent-21.7           73   39   38    159   63%   -12   32%
   3 Fruit 2.1               21   41   40    160   54%   -12   15%
   4 Crafty-22.2            -12   18   18    799   48%     2   18%
   5 Glaurung 1.1 SMP       -35   41   41    160   47%   -12   11%
   6 Arasan 10.0           -161   41   43    160   30%   -12   18%

Rank Name                   Elo    +    -  games score oppo. draws
   1 Glaurung 2-epsilon/5   131   45   42    160   70%   -33   10%
   2 Fruit 2.1               64   41   40    160   63%   -33   19%
   3 Glaurung 1.1 SMP        25   41   40    160   58%   -33   15%
   4 opponent-21.7           13   37   37    160   57%   -33   36%
   5 Crafty-22.2            -33   18   18    800   45%     7   19%
   6 Arasan 10.0           -199   42   44    160   29%   -33   15%
Notice first that _everybody_ in the test gets significantly different results from one match to the next. The overall order (with the exception of Glaurung 2, which stays at the top) flips around substantially.
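To see that this kind of spread is exactly what chance alone produces, here is a toy Monte Carlo (my sketch, not the actual test harness): one program with a fixed "true" strength, replayed in twenty independent 800-game matches. The win/draw probabilities are picked to roughly match Crafty's ~47% score and ~19% draw rate above.

Code:
import math
import random

# Play n_games independent games and return the Elo implied by the score.
def measured_elo(n_games, p_win, p_draw, rng):
    score = 0.0
    for _ in range(n_games):
        r = rng.random()
        if r < p_win:
            score += 1.0
        elif r < p_win + p_draw:
            score += 0.5
    p = score / n_games
    return -400.0 * math.log10(1.0 / p - 1.0)  # logistic Elo of the sample

rng = random.Random(1)
# Fixed strength: 37.5% wins + 19% draws = 47% expected score (~ -21 Elo).
elos = sorted(measured_elo(800, 0.375, 0.19, rng) for _ in range(20))
print([round(e) for e in elos])  # one program, twenty different "ratings"

A typical run scatters the measured rating over several tens of Elo, the same order of bouncing around visible in the four tables.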
Now, does anyone _really_ believe that 800 games are enough? Later I will show some _much_ bigger matches as well, showing the same kind of variability. Here are two quickies of roughly 25,600 games per match (5,120 per opponent, same time control), just for starters:
Code:
Rank Name                   Elo    +    -  games score oppo. draws
   1 Glaurung 2-epsilon/5   123    8    8   5120   66%     2   15%
   2 Fruit 2.1               38    8    7   5119   55%     2   19%
   3 opponent-21.7           28    7    7   5119   54%     2   34%
   4 Crafty-22.2              2    4    4  25597   50%     0   19%
   5 Glaurung 1.1 SMP         2    8    8   5120   50%     2   14%
   6 Arasan 10.0           -193    8    9   5119   26%     2   15%

Rank Name                   Elo    +    -  games score oppo. draws
   1 Glaurung 2-epsilon/5   118    8    8   5120   67%   -19   13%
   2 Fruit 2.1               42    8    8   5120   58%   -19   17%
   3 opponent-21.7           32    7    7   5115   58%   -19   36%
   4 Glaurung 1.1 SMP        20    8    8   5120   55%   -19   12%
   5 Crafty-22.2            -19    4    4  25595   47%     4   19%
   6 Arasan 10.0           -193    8    8   5120   28%   -19   16%
The question you want to answer from the above is this: suppose crafty-22.2 from the first run was slightly modified for the second run. Was the change good or bad? How sure are you? Then I will add that crafty-22.2 was _identical_ in both runs. Now which one is better?

There is a 21 Elo difference between the two. The first result says 2 +/- 4, while the second says -19 +/- 4. The ranges don't even overlap. That points out that this kind of statistic describes the sample under observation, but it is not necessarily representative of the total population of potential games unless you play a _lot_ more games. Some would say that the second match puts crafty somewhere between -15 and -23, which is fine. But then what does the first, equally big match say?
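To put a number on "a _lot_ more games": inverting the same crude independence approximation used in the sketches above (again, my arithmetic, not what BayesElo computes), you can ask how many games it takes before the ~95% bar shrinks to a given half-width. Resolving a handful of Elo costs tens of thousands of games.

Code:
import math

# Games needed so the ~95% error bar is within +/- elo_half_width.
# Same independence assumption as before; defaults mimic ~19% draws.
def games_needed(elo_half_width, score=0.5, draw_ratio=0.19, z=1.96):
    p_win = score - 0.5 * draw_ratio
    var = p_win + 0.25 * draw_ratio - score ** 2
    slope = 400.0 / (math.log(10) * score * (1.0 - score))
    return math.ceil((z * slope) ** 2 * var / elo_half_width ** 2)

for target in (20, 10, 5, 2):
    print(target, games_needed(target))
# roughly: +/-20 ~ 900 games, +/-10 ~ 3,800, +/-5 ~ 15,000, +/-2 ~ 94,000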
"things that make you go hmmm......."