bob wrote: And, in fact, I am going to take the PGN from the last 3 runs and produce three sets of data. I am going to rename crafty-22.2 in one of the sets, combine it with another set, and run Bayeselo, as now both 22.2 versions will have exactly the same number of games. I will repeat so that I get three results out of this and see how closely _they_ match, which is the comparison I really care about. They ought to turn out _very_ close, since they will be comparing the same two programs, exactly.
As I pointed out at the same time you were posting this, you can get six test results. They won't be independent, but I see no reason not to look at all of them. And, as Rémi said, you should look at the LOS tables.
There is one reason I can't get 6. The first PGN set is gone. Due to a human error (on my part), I created 4 sub-directories to hold the separate runs, but somehow only used three directory names when saving the output. So far I have two complete sets, with a third "in progress" as the cluster load drops off. So I can use A+B, A+C and B+C to get three results, but no more...
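For anyone wanting to repeat the rename-and-combine step, here is a minimal sketch of how it could be scripted. It assumes each run is available as one PGN string and that the engine name appears in the standard `[White "..."]` / `[Black "..."]` tag pairs; `rename_player`, `paired_sets` and the new name `Crafty-22.2b` are hypothetical, not what was actually used for these runs.

```python
import re
from itertools import combinations


def rename_player(pgn_text: str, old: str, new: str) -> str:
    """Rename a player in the White/Black tag pairs of a PGN string."""
    pattern = re.compile(
        r'^\[(White|Black) "' + re.escape(old) + r'"\]', re.MULTILINE
    )
    # A function replacement avoids backslash surprises if `new` is unusual.
    return pattern.sub(lambda m: f'[{m.group(1)} "{new}"]', pgn_text)


def paired_sets(runs: dict[str, str], old: str, new: str) -> dict[str, str]:
    """For every pair of runs, rename `old` in the second run and concatenate.

    With runs A, B and C this yields the three combinations A+B, A+C, B+C.
    """
    combined = {}
    for (name_a, pgn_a), (name_b, pgn_b) in combinations(sorted(runs.items()), 2):
        combined[f"{name_a}+{name_b}"] = pgn_a + "\n" + rename_player(pgn_b, old, new)
    return combined
```

Each combined string can then be written to a file and rated as usual; the two names for the same engine should come out very close to each other if the runs are consistent.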
Tue Aug 12 00:49:44 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                    Elo    +    -  games score oppo. draws
   1 Glaurung 2-epsilon/5    108    7    7   7782   67%   -21   20%
   2 Fruit 2.1                62    7    6   7782   61%   -21   23%
   3 opponent-21.7            25    6    6   7780   57%   -21   33%
   4 Glaurung 1.1 SMP         10    6    6   7782   54%   -21   20%
   5 Crafty-22.2             -21    4    4  38908   46%     4   23%
   6 Arasan 10.0            -185    7    7   7782   29%   -21   19%

Tue Aug 12 11:36:10 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                    Elo    +    -  games score oppo. draws
   1 Glaurung 2-epsilon/5    110    6    7   7782   67%   -19   21%
   2 Fruit 2.1                63    6    7   7782   61%   -19   23%
   3 opponent-21.7            26    6    6   7782   57%   -19   33%
   4 Glaurung 1.1 SMP          7    6    7   7782   54%   -19   20%
   5 Crafty-22.2             -19    4    3  38910   47%     4   23%
   6 Arasan 10.0            -187    6    7   7782   28%   -19   19%

Wed Aug 13 00:53:43 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                    Elo    +    -  games score oppo. draws
   1 Glaurung 2-epsilon/5    109    7    6   7782   67%   -16   20%
   2 Fruit 2.1                63    6    7   7782   61%   -16   24%
   3 opponent-21.7            23    6    6   7781   56%   -16   32%
   4 Glaurung 1.1 SMP          3    6    7   7782   53%   -16   21%
   5 Crafty-22.2             -16    4    3  38909   47%     3   23%
   6 Arasan 10.0            -182    7    7   7782   28%   -16   19%
Third run is complete and given above. I also have a "snapshot" program running that will grab the Elo data every 15 minutes so that I can see if the numbers stabilize before the test finishes, in any sort of usable way...
More tomorrow when test 4 is done. But this certainly does look better, with one minor issue...
It appears to be a difficult task to measure small changes. Crafty's rating varies from -16 to -21, which is more noise than what I am actually hoping to measure... might be a hopeless idea to try to measure very small changes in strength.
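To put a rough number on that noise, here is a back-of-the-envelope sketch of the 95% error margin for a given number of games. It is only an approximation under stated assumptions: games are treated as independent, the score uses a normal approximation, and the usual logistic Elo curve converts score to rating. This is not how Bayeselo computes its intervals, and `elo_margin_95` is a hypothetical name.

```python
import math


def elo_margin_95(games: int, score: float, draw_rate: float) -> float:
    """Approximate 95% half-width, in Elo, of a rating from `games` games.

    score     : average score per game (0..1)
    draw_rate : fraction of games drawn (0..1)
    """
    wins = score - draw_rate / 2.0
    losses = 1.0 - wins - draw_rate
    # Variance of a single game result (1, 0.5 or 0) around the mean score.
    variance = (
        wins * (1.0 - score) ** 2
        + draw_rate * (0.5 - score) ** 2
        + losses * (0.0 - score) ** 2
    )
    std_err = math.sqrt(variance / games)
    # Slope of the logistic Elo curve: d(Elo)/d(score).
    slope = 400.0 / (math.log(10) * score * (1.0 - score))
    return 1.96 * std_err * slope


# Crafty-22.2 row of the first run: 38908 games, 46% score, 23% draws.
margin = elo_margin_95(38908, 0.46, 0.23)
print(f"+/- {margin:.1f} Elo")
```

For the Crafty-22.2 row this comes out at roughly 3 Elo, and the difference between two such independent runs would carry about sqrt(2) times that, a little over 4 Elo. Under these assumptions, a spread of -16 to -21 between runs is about what should be expected.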
Since this last run is going to take a bit longer than usual due to cluster load, here are the results so far, for comparison:
I get a crash in the final link (re-optimization) phase with gcc 4.x. gcc 3.4 or so worked with PGO but I don't support that any longer because of its deficiencies with C++ library support. Will try icc again soon. As you may have guessed, Linux is not my primary development platform.
This seems to indicate that Crafty 22.2 is about as strong as Glaurung {within 30 Elo or so}. If so, then that is a very impressive result and means also that the current Crafty is about as strong as the current crop of commercial programs (other than one notable exception).
The comparison is only with an old version of Glaurung (Glaurung 1.1).
Crafty is clearly weaker than Glaurung.
Crafty is clearly weaker than commercial programs like Zappa, Naum, Hiarcs, Shredder or Fritz, and the difference is more than 200 Elo.
Uri
Hmmm... do you not see Glaurung 2 in that list anywhere?
I see it, but it is not the version that Dann Corbit meant, because the difference from Crafty is clearly bigger than 30 Elo, and he wrote:
"This seems to indicate that Crafty 22.2 is about as strong as Glaurung {within 30 Elo or so}. "
Glaurung 2-epsilon/5 is also not the latest Glaurung (Glaurung 2.1).
Uri
I will soon add that to the list of opponents, keeping both that I am now using, and see how it stacks up... But I don't want to change anything just yet so that all the data is apples-to-apples.
OK, so say you have one run that gives a score of 20 +/-5 with 95% confidence.
And then you have a second run that gives a score of 10 +/-5 with 95% confidence.
Would the maths approach allow us to conclude that the score is exactly 15 with 95% confidence? Or is that not how it works?
No, that is not how it works. How you process the data would depend on if the confidence intervals were externally supplied, or derived from the spread in your data itself.
When externally supplied, the variances add, meaning the SD gets sqrt(2) larger in the sum of the ratings, and sqrt(2) smaller in the average. So you would get 15 +/- 3.5. And you should become very suspicious, as the two runs were each so distant from that value that it should happen only once in 20 cases (each is just on the edge of its 95% confidence interval), so the chance that it happens in both of them is a 1 in 400 fluke. Sometimes that happens, but it is a good reason to redo the measurement.
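The arithmetic for the externally supplied case can be checked with a few lines. This is just a sketch assuming the two runs are independent and that the quoted +/- values are 95% half-widths that scale like standard deviations; `average_independent` is a made-up helper name.

```python
import math


def average_independent(values, half_widths):
    """Average independent measurements and combine their error bars.

    Variances add in the sum, so the error of the sum is the root of the
    summed squares; dividing by n gives the error of the average.
    """
    n = len(values)
    mean = sum(values) / n
    half_width = math.sqrt(sum(h * h for h in half_widths)) / n
    return mean, half_width


mean, half_width = average_independent([20.0, 10.0], [5.0, 5.0])
print(f"{mean:.0f} +/- {half_width:.1f}")

# Chance that both runs independently land outside their 95% intervals.
p_both = 0.05 * 0.05
print(f"1 in {round(1 / p_both)}")
```

With two equal error bars of 5 the combined half-width is 5 / sqrt(2), about 3.5, and 0.05 squared is the 1 in 400 mentioned above.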
If you derived the confidence intervals from the spread of the data itself, your data set would now have a spread of 5 between the two runs, and a spread of 2.5 within each run (half the 95% confidence interval), and you would have to add those quadratically. That would give you an SD of about 5.6, so the new 95% confidence interval would be about 11 (i.e. your overall result would be 15 +/- 11).
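The quadratic addition in this second case is simple enough to verify directly. A small sketch, taking the two spreads exactly as given above (5 between runs, 2.5 within a run) and the 95% interval as two SDs; `combined_sd` is a hypothetical name.

```python
import math


def combined_sd(between_runs: float, within_run: float) -> float:
    """Add two independent spreads in quadrature: sqrt(a^2 + b^2)."""
    return math.hypot(between_runs, within_run)


sd = combined_sd(5.0, 2.5)
interval_95 = 2.0 * sd  # 95% interval taken as two standard deviations
print(f"SD = {sd:.2f}, 95% interval = +/- {interval_95:.1f}")
```

The exact value is sqrt(31.25), a shade under 5.6, which is where the overall 15 +/- 11 comes from.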