testing, 2 complete runs now...

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: testing, 3[code] complete runs now...

Post by bob »

Dirt wrote:
bob wrote:
And, in fact, I am going to take the PGN from the last 3 runs and produce three sets of data. I am going to rename crafty-22.2 in one of the sets, and combine it with another set and run Bayeselo, as now both 22.2 versions will have exactly the same number of games. I will repeat so that I get three results out of this and see how closely _they_ match, which is the comparison I really care about. They ought to turn out _very_ close, since they will be comparing the same two programs, exactly.
As I pointed out at the same time you were posting this, you can get six test results. They won't be independent, but I see no reason not to look at all of them. And, as Rémi said, you should look at the LOS tables.

There is one reason I can't get 6. The first PGN set is gone. Due to a human error (on my part), I created 4 sub-directories to hold the separate runs, but somehow only used three directory names when saving the output. So far I have two complete sets, with a third "in progress" as the cluster load drops off. So I can use A+B, A+C and B+C to get three results, but no more...
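
The renaming-and-pairing step described above could be sketched roughly as follows. This is only an illustration, not the script actually used: the helper names `rename_player` and `paired_sets`, the `-copy` suffix, and the idea of holding each run's PGN as a string in a dict are all assumptions. The combined text for each pair would then be fed to the rating tool as usual.

```python
import re
from itertools import combinations


def rename_player(pgn_text, old_name, new_name):
    """Rewrite [White "..."] / [Black "..."] tags that name old_name."""
    pattern = re.compile(
        r'^\[(White|Black) "' + re.escape(old_name) + r'"\]$',
        re.MULTILINE,
    )
    return pattern.sub(
        lambda m: '[%s "%s"]' % (m.group(1), new_name), pgn_text
    )


def paired_sets(runs, engine, suffix="-copy"):
    """Yield (label, combined_pgn) for every pair of runs.

    runs maps a run label (e.g. "A") to that run's PGN text.
    In the second run of each pair the engine is renamed, so the
    rating tool sees two "different" players that are really the
    same program with the same number of games.
    """
    for (a, pgn_a), (b, pgn_b) in combinations(sorted(runs.items()), 2):
        renamed = rename_player(pgn_b, engine, engine + suffix)
        yield a + "+" + b, pgn_a + "\n" + renamed


if __name__ == "__main__":
    game = '[White "Crafty-22.2"]\n[Black "Fruit 2.1"]\n[Result "1-0"]\n\n1. e4 1-0\n'
    runs = {"A": game, "B": game, "C": game}
    for label, _ in paired_sets(runs, "Crafty-22.2"):
        print(label)
```

With three runs A, B and C this yields the A+B, A+C and B+C combinations mentioned above.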
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: testing, 3[code] complete runs now...

Post by bob »

Just a late-night ill-conceived post. ;) I will post some better C vs C numbers once I get the third test done...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: testing, 3[code] complete runs now...

Post by bob »

bob wrote:

Code: Select all

Tue Aug 12 00:49:44 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   108    7    7  7782   67%   -21   20%
   2 Fruit 2.1               62    7    6  7782   61%   -21   23%
   3 opponent-21.7           25    6    6  7780   57%   -21   33%
   4 Glaurung 1.1 SMP        10    6    6  7782   54%   -21   20%
   5 Crafty-22.2            -21    4    4 38908   46%     4   23%
   6 Arasan 10.0           -185    7    7  7782   29%   -21   19%
Tue Aug 12 11:36:10 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   110    6    7  7782   67%   -19   21%
   2 Fruit 2.1               63    6    7  7782   61%   -19   23%
   3 opponent-21.7           26    6    6  7782   57%   -19   33%
   4 Glaurung 1.1 SMP         7    6    7  7782   54%   -19   20%
   5 Crafty-22.2            -19    4    3 38910   47%     4   23%
   6 Arasan 10.0           -187    6    7  7782   28%   -19   19%
Wed Aug 13 00:53:43 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   109    7    6  7782   67%   -16   20%
   2 Fruit 2.1               63    6    7  7782   61%   -16   24%
   3 opponent-21.7           23    6    6  7781   56%   -16   32%
   4 Glaurung 1.1 SMP         3    6    7  7782   53%   -16   21%
   5 Crafty-22.2            -16    4    3 38909   47%     3   23%
   6 Arasan 10.0           -182    7    7  7782   28%   -16   19%
Third run is complete and given above. I also have a 15-minute "snapshot" program running that will grab the elo data every 15 minutes so that I can see if the numbers stabilize before the test finishes, in any sort of usable way...

More tomorrow when test 4 is done. But this certainly does look better, with one minor issue...

It appears to be a difficult task to measure small changes. Crafty's rating varies from -16 to -21, which is more noise than what I am actually hoping to measure... might be a hopeless idea to try to measure very small changes in strength.
Since this last run is going to take a bit longer than usual due to cluster load, here are the results so far, for comparison:

Code: Select all

33583 game(s) loaded, 0 game(s) with unknown result ignored.
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   113    7    7  6712   68%   -21   21% 
   2 Fruit 2.1               70    7    7  6711   63%   -21   23% 
   3 opponent-21.7           19    6    7  6716   56%   -21   33% 
   4 Glaurung 1.1 SMP        11    7    7  6719   54%   -21   20% 
   5 Crafty-22.2            -21    4    4 33583   46%     4   23% 
   6 Arasan 10.0           -192    7    7  6725   28%   -21   18% 
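
As a rough sanity check on how much noise to expect at these game counts, the error bar can be approximated from the score, the draw rate and the number of games. The sketch below assumes independent games and a plain logistic Elo curve; it is not the model the rating tool uses, so its numbers will not match the +/- columns exactly, and the function names are made up for illustration.

```python
import math


def elo_from_score(p):
    """Elo difference implied by an expected score p (logistic model)."""
    return -400.0 * math.log10(1.0 / p - 1.0)


def elo_error_bar(games, score, draw_rate, z=1.96):
    """Approximate half-width of the Elo confidence interval.

    Assumes independent games. Per-game results are 1, 0.5 or 0,
    so the per-game variance follows from the win and draw rates.
    z=1.96 corresponds to a 95% interval.
    """
    wins = score - draw_rate / 2.0
    mean_sq = wins + draw_rate / 4.0          # E[x^2] with x in {1, 0.5, 0}
    variance = mean_sq - score ** 2
    se_score = math.sqrt(variance / games)    # standard error of the score
    # Slope of the Elo curve at this score (delta method).
    slope = 400.0 / (math.log(10) * score * (1.0 - score))
    return z * slope * se_score


if __name__ == "__main__":
    # Figures taken from the Crafty-22.2 row of the first run above.
    print(round(elo_error_bar(38908, 0.46, 0.23), 2))
```

For the Crafty-22.2 row of the first run (38908 games, 46% score, 23% draws) this comes out at about 3 Elo, so a 5 Elo spread between runs is not far outside what sampling noise alone would produce under these assumptions.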
jdart
Posts: 4375
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: testing, 3[code] complete runs now...

Post by jdart »

I get a crash in the final link (re-optimization) phase with gcc 4.x. gcc 3.4 or so worked with PGO but I don't support that any longer because of its deficiencies with C++ library support. Will try icc again soon. As you may have guessed, Linux is not my primary development platform.
Uri Blass
Posts: 10460
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: testing, 3[code] complete runs now...

Post by Uri Blass »

Dann Corbit wrote:
This seems to indicate that Crafty 22.2 is about as strong as Glaurung {within 30 Elo or so}. If so, then that is a very impressive result and means also that the current Crafty is about as strong as the current crop of commercial programs (other than one notable exception).
;-)

Uri Blass wrote:
The comparison is only with old version of Glaurung(Glaurung1.1)

Crafty is clearly weaker than Glaurung.

Crafty is clearly weaker than commercial programs like Zappa,Naum,Hiarcs,Shredder or Fritz and the difference is more than 200 elo.

Uri

bob wrote:
Hmmm... do you not see Glaurung 2 in that list anywhere? :)

I see but it is not the version that Dann Corbit meant because the difference from Crafty is clearly bigger than 30 elo and he wrote:
"This seems to indicate that Crafty 22.2 is about as strong as Glaurung {within 30 Elo or so}. "

Glaurung 2-epsilon/5 is also not latest glaurung(Glaurung2.1)

Uri
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: testing, 3[code] complete runs now...

Post by bob »

I will soon add that to the list of opponents, keeping both that I am now using, and see how it stacks up... But I don't want to change anything just yet so that all the data is apples-to-apples.
MartinBryant

Re: testing, 3[code] complete runs now... A stats question..

Post by MartinBryant »

OK, so say you have one run that gives a score of 20 +/-5 with 95% confidence.
And then you have a second run that gives a score of 10 +/-5 with 95% confidence.
Would the maths approach allow us to conclude that the score is exactly 15 with 95% confidence? Or is that not how it works?
hgm
Posts: 28123
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: testing, 3[code] complete runs now... A stats question..

Post by hgm »

No, that is not how it works. How you process the data would depend on if the confidence intervals were externally supplied, or derived from the spread in your data itself.

When externally supplied, the variances add, meaning the SD gets sqrt(2) larger in the sum of the ratings, and sqrt(2) smaller in the average. So you would get 15 +/- 3.5. And you should become very suspicious, as each of the two runs was so distant from that average that it should have happened only once in 20 cases (each is just on the edge of its 95% confidence interval), so the chance that this happens in both of them is a 1 in 400 fluke. Sometimes that happens, but it is a good reason to redo the measurement.

If you derived the confidence intervals from the spread of the data itself, your data set would now have a spread of 5 between the two runs, and a spread of 2.5 within each run (half the 95% confidence interval), and you would have to add those quadratically. That would give you an SD of about 5.6, so the new 95% confidence interval would be about 11 (i.e. your overall result would be 15 +/- 11).
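
The arithmetic in both cases can be written out as a short sketch. The function names are invented for illustration, and the second function simply follows the steps described above (spread between runs combined quadratically with the spread within a run); it is not offered as a general-purpose statistical method.

```python
import math
import statistics


def combine_external(means, half_width):
    """Average of n runs whose error bars were supplied externally.

    Variances add, so the half-width of the average shrinks by sqrt(n).
    """
    n = len(means)
    return sum(means) / n, half_width / math.sqrt(n)


def combine_from_spread(means, half_width):
    """Error bar when the intervals came from the spread of the data.

    Between-run spread (SD of the run results around their mean) and
    within-run spread (half of the 95% half-width, i.e. one SD) are
    added quadratically, then doubled to give a 95% half-width.
    """
    mean = sum(means) / len(means)
    between = statistics.pstdev(means)
    within = half_width / 2.0
    sd = math.sqrt(between ** 2 + within ** 2)
    return mean, 2.0 * sd


if __name__ == "__main__":
    runs = [20.0, 10.0]
    print(combine_external(runs, 5.0))     # mean 15, half-width about 3.5
    print(combine_from_spread(runs, 5.0))  # mean 15, half-width about 11
    print(1.0 / (0.05 * 0.05))             # the "1 in 400" fluke
```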