First, I played Crafty against 5 opponents, including an older 21.7 version. The version I am testing here is not particularly good yet, since it represents some significant "removals" from the evaluation, so the results are not particularly interesting from that perspective. Each opponent was played on 40 starting positions, 4 rounds per position with colors alternating, giving 160 games per opponent and a total of 800 games per match (5 x 40 x 4). I am giving 4 consecutive match results below, all against the same opponents, all played at a time control of 5+5 (5 minutes on the clock, 5 seconds increment added per move). I lost a game here and there to data corruption on our big storage system, so some of the matches show 799 rather than 800 games: once in a while the PGN for the last game would be corrupted (a different issue).
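To make the arithmetic concrete, here is a throwaway Python sketch of that schedule. The pairing and color-alternation details are illustrative only, not the actual referee script used for these matches:

Code:

# Illustrative sketch of the schedule described above -- not the
# actual referee script used for these matches.
opponents = ["Glaurung 2-epsilon/5", "Glaurung 1.1 SMP", "Fruit 2.1",
             "opponent-21.7", "Arasan 10.0"]
schedule = [(opp, pos, "white" if rnd % 2 == 0 else "black")
            for opp in opponents            # 5 opponents
            for pos in range(40)            # 40 starting positions
            for rnd in range(4)]            # 4 rounds, alternating colors
print(len(schedule))                        # 5 x 40 x 4 = 800 games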
I ran these 800-game matches through Remi's BayesElo. You can look at the four sets of results, but imagine that in each of those tests, Crafty-22.2 was a slightly different version with a tweak or two added. Which of the four looks best? And then realize that the programs are identical across all 4 matches. How would one reliably draw any conclusion from a match of only 800 games, when the error bar is significant and the variability is even more so? First, the data:
Code:
Rank Name                  Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5  121   42   41   160   68%   -18   17%
   2 Glaurung 1.1 SMP       61   42   41   160   60%   -18   13%
   3 Fruit 2.1              49   41   40   160   59%   -18   15%
   4 opponent-21.7          13   38   38   159   55%   -18   33%
   5 Crafty-22.2           -18   18   18   799   47%     4   19%
   6 Arasan 10.0          -226   42   45   160   23%   -18   18%

Rank Name                  Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   81   42   41   160   63%   -17   16%
   2 opponent-21.7          61   38   38   159   62%   -17   33%
   3 Glaurung 1.1 SMP       46   42   41   160   58%   -17   13%
   4 Fruit 2.1              35   40   40   160   57%   -17   19%
   5 Crafty-22.2           -17   18   18   799   47%     3   19%
   6 Arasan 10.0          -205   42   45   160   26%   -17   16%

Rank Name                  Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5  113   43   41   160   66%   -12   12%
   2 opponent-21.7          73   39   38   159   63%   -12   32%
   3 Fruit 2.1              21   41   40   160   54%   -12   15%
   4 Crafty-22.2           -12   18   18   799   48%     2   18%
   5 Glaurung 1.1 SMP      -35   41   41   160   47%   -12   11%
   6 Arasan 10.0          -161   41   43   160   30%   -12   18%

Rank Name                  Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5  131   45   42   160   70%   -33   10%
   2 Fruit 2.1              64   41   40   160   63%   -33   19%
   3 Glaurung 1.1 SMP       25   41   40   160   58%   -33   15%
   4 opponent-21.7          13   37   37   160   57%   -33   36%
   5 Crafty-22.2           -33   18   18   800   45%     7   19%
   6 Arasan 10.0          -199   42   44   160   29%   -33   15%
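To put those +/- columns in perspective, here is a back-of-the-envelope sanity check in Python. It is a plain normal approximation of the score's standard error converted to Elo, not BayesElo's actual Bayesian computation, and the 19% draw rate is simply borrowed from the Crafty rows above:

Code:

import math

def elo_error_bar(games, score=0.50, draw_ratio=0.19, z=1.96):
    """Rough 95% error bar in Elo for a match scoring near 50%.

    A normal approximation, not BayesElo's Bayesian computation.
    """
    # Win/loss fractions implied by the score and the draw ratio.
    win = score - draw_ratio / 2
    loss = 1.0 - win - draw_ratio
    # Per-game variance of the result (win=1, draw=0.5, loss=0).
    var = (win * (1.0 - score) ** 2 +
           draw_ratio * (0.5 - score) ** 2 +
           loss * score ** 2)
    se = z * math.sqrt(var / games)         # error bar on the score fraction
    to_elo = lambda s: -400.0 * math.log10(1.0 / s - 1.0)
    return to_elo(score + se) - to_elo(score)

print(elo_error_bar(160))   # ~49 Elo -- same ballpark as the +/-41 above
print(elo_error_bar(800))   # ~22 Elo -- same ballpark as the +/-18 above

Even taking those rough numbers at face value, a +/-18 window means an identical binary can easily land 20 Elo apart in two 800-game matches, which is exactly the -12 to -33 spread the four tables show.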
Now does anyone _really_ believe that 800 games are enough? Later I will show some _much_ bigger matches as well, showing the same kind of variability. For starters, here are two quick matches of roughly 25,600 games each (same time control):
Code:
Rank Name                  Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5  123    8    8  5120   66%     2   15%
   2 Fruit 2.1              38    8    7  5119   55%     2   19%
   3 opponent-21.7          28    7    7  5119   54%     2   34%
   4 Crafty-22.2             2    4    4 25597   50%     0   19%
   5 Glaurung 1.1 SMP        2    8    8  5120   50%     2   14%
   6 Arasan 10.0          -193    8    9  5119   26%     2   15%

Rank Name                  Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5  118    8    8  5120   67%   -19   13%
   2 Fruit 2.1              42    8    8  5120   58%   -19   17%
   3 opponent-21.7          32    7    7  5115   58%   -19   36%
   4 Glaurung 1.1 SMP       20    8    8  5120   55%   -19   12%
   5 Crafty-22.2           -19    4    4 25595   47%     4   19%
   6 Arasan 10.0          -193    8    8  5120   28%   -19   16%
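And the error bar only narrows as 1/sqrt(games), so resolution gets expensive fast: halving the error bar costs four times the games. Continuing the same rough approximation, with the same caveats:

Code:

import math

def games_needed(target_elo, draw_ratio=0.19, z=1.96):
    """Games needed before the rough 95% error bar drops below target_elo."""
    # Score offset from 50% corresponding to target_elo (logistic model).
    ds = 1.0 / (1.0 + 10.0 ** (-target_elo / 400.0)) - 0.5
    sd = math.sqrt(0.25 * (1.0 - draw_ratio))   # per-game sd of the score
    return math.ceil((z * sd / ds) ** 2)

for elo in (40, 20, 10, 5):
    print(elo, games_needed(elo))   # roughly 240, 950, 3800, 15000 games

At 25,600 games the same approximation gives about +/-4 Elo, matching the Crafty rows above, so even matches this big only resolve differences down to a handful of Elo, and a typical one-tweak change is smaller than that.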


"things that make you go hmmm......."