Code:
Tue Aug 12 00:49:44 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    -   games  score  oppo.  draws
   1 Glaurung 2-epsilon/5   108    7    7    7782    67%    -21    20%
   2 Fruit 2.1               62    7    6    7782    61%    -21    23%
   3 opponent-21.7           25    6    6    7780    57%    -21    33%
   4 Glaurung 1.1 SMP        10    6    6    7782    54%    -21    20%
   5 Crafty-22.2            -21    4    4   38908    46%      4    23%
   6 Arasan 10.0           -185    7    7    7782    29%    -21    19%
Tue Aug 12 11:36:10 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    -   games  score  oppo.  draws
   1 Glaurung 2-epsilon/5   110    6    7    7782    67%    -19    21%
   2 Fruit 2.1               63    6    7    7782    61%    -19    23%
   3 opponent-21.7           26    6    6    7782    57%    -19    33%
   4 Glaurung 1.1 SMP         7    6    7    7782    54%    -19    20%
   5 Crafty-22.2            -19    4    3   38910    47%      4    23%
   6 Arasan 10.0           -187    6    7    7782    28%    -19    19%
Those are the first two runs, and they seem to be more consistent than the last two big runs that used just 40 positions. Run 3 is in progress and will finish tonight; by noon tomorrow the entire 160,000 games should be done. If, as Karl has suggested, these runs stay within the expected variability limits, then we can start a discussion on reducing this computational load to something more palatable.
I'm just hoping that we see stable Elo numbers. But then the worry may well be that most sensible changes do not affect a program's Elo enough for this test to measure, which would be a completely different problem to deal with.
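To put a rough number on that worry, here is a minimal back-of-envelope sketch (my own illustration in C, not part of the actual test setup) of how many independent games it takes before a small Elo improvement stands out from noise at roughly the 95% level. The 5-Elo figure, the draw rate, and the independence assumption are all mine; BayesElo's actual model is more involved, so its error bars will not match this exactly.

Code:
/* Back-of-envelope sketch (illustration only, not the cluster's statistics):
 * how many games are needed before a given Elo improvement stands out from
 * noise at roughly the 95% level, assuming independent games and a fixed
 * draw rate. */
#include <math.h>
#include <stdio.h>

int main(void) {
    double elo_diff  = 5.0;   /* hypothetical improvement of A' over A */
    double draw_rate = 0.23;  /* roughly the draw rate seen in the tables */

    /* expected score for a player elo_diff stronger than its opponent */
    double s = 1.0 / (1.0 + pow(10.0, -elo_diff / 400.0));

    /* per-game score variance around 0.5 with the given draw rate:
       wins and losses each contribute 0.25, draws contribute 0.0 */
    double var = (1.0 - draw_rate) * 0.25;

    /* require 1.96 * sqrt(var / n) < (s - 0.5), i.e.
       n > 1.96^2 * var / (s - 0.5)^2 */
    double n = 1.96 * 1.96 * var / ((s - 0.5) * (s - 0.5));

    printf("~%.0f games to resolve %.1f Elo at ~95%% (draw rate %.0f%%)\n",
           n, elo_diff, draw_rate * 100.0);
    return 0;
}

With those assumed numbers this works out to somewhere around 14,000 games for a 5-Elo difference, which is why the per-run game counts above are as large as they are.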
Note that this is not a full round-robin, although I could run one after the current test finishes if anyone wants to see how that would collapse the overall rating differences into a smaller range.
Remember, my goal is to compare A to A'. I don't care about absolute Elo values, or exactly how much better or worse A is than A'; I only want to see whether A' (which represents a slightly modified version of Crafty, AKA program A) is better or worse. I don't give a hoot about how much better or worse.
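As a minimal sketch of what "better or worse" could mean statistically, the following compares two runs' mean scores with a simple z-test. The numbers, the helper function, and the test itself are my own illustration, assuming independent games against identical opponents; the real comparison would presumably go through BayesElo or a similar tool rather than this by-hand check.

Code:
/* Rough sketch (illustration only): decide whether A' scored significantly
 * better or worse than A against the same opponents, using a simple z-test
 * on the two mean scores.  Variances are estimated from each run's score
 * and draw rate. */
#include <math.h>
#include <stdio.h>

/* per-game score variance given the mean score and draw rate */
static double score_var(double score, double draws) {
    double win  = score - draws / 2.0;
    double loss = 1.0 - win - draws;
    return win   * (1.0 - score) * (1.0 - score)
         + draws * (0.5 - score) * (0.5 - score)
         + loss  * score * score;
}

int main(void) {
    /* hypothetical results for A and A' over comparable game sets */
    double n_a = 38908, score_a = 0.460, draws_a = 0.23;
    double n_b = 38910, score_b = 0.470, draws_b = 0.23;

    double se = sqrt(score_var(score_a, draws_a) / n_a +
                     score_var(score_b, draws_b) / n_b);
    double z  = (score_b - score_a) / se;

    printf("z = %.2f  (|z| > 1.96 => ~95%% confident A' differs from A)\n", z);
    return 0;
}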