testing, 2 complete runs now...

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: testing, 3 complete runs now...

Post by bob »

Tony wrote:
bob wrote:
bob wrote:

Code:

Tue Aug 12 00:49:44 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   108    7    7  7782   67%   -21   20%
   2 Fruit 2.1               62    7    6  7782   61%   -21   23%
   3 opponent-21.7           25    6    6  7780   57%   -21   33%
   4 Glaurung 1.1 SMP        10    6    6  7782   54%   -21   20%
   5 Crafty-22.2            -21    4    4 38908   46%     4   23%
   6 Arasan 10.0           -185    7    7  7782   29%   -21   19%
Tue Aug 12 11:36:10 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   110    6    7  7782   67%   -19   21%
   2 Fruit 2.1               63    6    7  7782   61%   -19   23%
   3 opponent-21.7           26    6    6  7782   57%   -19   33%
   4 Glaurung 1.1 SMP         7    6    7  7782   54%   -19   20%
   5 Crafty-22.2            -19    4    3 38910   47%     4   23%
   6 Arasan 10.0           -187    6    7  7782   28%   -19   19%
Wed Aug 13 00:53:43 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   109    7    6  7782   67%   -16   20%
   2 Fruit 2.1               63    6    7  7782   61%   -16   24%
   3 opponent-21.7           23    6    6  7781   56%   -16   32%
   4 Glaurung 1.1 SMP         3    6    7  7782   53%   -16   21%
   5 Crafty-22.2            -16    4    3 38909   47%     3   23%
   6 Arasan 10.0           -182    7    7  7782   28%   -16   19%
Third run is complete and given above. I also have a 15-minute "snapshot" program running that grabs the Elo data every 15 minutes, so I can see whether the numbers stabilize in any usable way before the test finishes...

More tomorrow when test 4 is done. But this certainly does look better, with one minor issue...

It appears to be a difficult task to measure small changes. Crafty's rating varies from -16 to -21, which is more noise than the effect I am actually hoping to measure... it might be a hopeless idea to try to measure very small changes in strength.
Stopped typing too soon. My main issue with the above is that each run was attempting to determine whether 21.7 is better than 22.2 (which we knew already, since 22.2 has parts of the evaluation completely missing). But the numbers came out -46, -45 and -39; a 7 Elo spread (not to mention the error bars, of course) for three versions that are exactly the same would make it very difficult to detect a small improvement, it would seem...

comments??

More games at a shorter time control? Can certainly do that...
That could be interesting: more games, shorter time control. Question is, would that need more starting positions?

Can you do a 12 hour run with 1 second games? Maybe 5M games gives us some information :)

Tony
Based on the current approach, each pair of games needs a new position. I can certainly create enough starting positions with the current approach, and using the enormous.pgn I can produce far more. I could even go to the non-alternating-color approach of one game per position, where on even-numbered positions Crafty gets white and on odd-numbered ones it gets black.
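The even/odd color scheme described above can be sketched in a few lines (the position and engine names here are made up for illustration):

```python
# Hypothetical sketch of the one-game-per-position scheme: Crafty gets
# white on even-numbered positions and black on odd-numbered ones, so
# color bias averages out across the whole position set.
def schedule(positions, engine="Crafty", opponent="Fruit"):
    games = []  # each entry: (position, white player, black player)
    for i, fen in enumerate(positions):
        if i % 2 == 0:
            games.append((fen, engine, opponent))   # Crafty plays white
        else:
            games.append((fen, opponent, engine))   # Crafty plays black
    return games

pairings = schedule(["fen-a", "fen-b", "fen-c", "fen-d"])
```

Each position is used exactly once, at the cost of no longer playing both colors from the same position.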

The 4th run is going to take longer than expected. For the past two days the cluster has been idle, but when I checked this morning there were several hundred jobs queued up in addition to mine, so I won't have the 4th run complete for a while yet. I think the first experiment I'd like to run is exactly the same as the current one, except at very short time controls, to see if any results change due to speed issues...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: testing, 3 complete runs now...

Post by bob »

krazyken wrote:
bob wrote:
bob wrote:

Code:

Tue Aug 12 00:49:44 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   108    7    7  7782   67%   -21   20%
   2 Fruit 2.1               62    7    6  7782   61%   -21   23%
   3 opponent-21.7           25    6    6  7780   57%   -21   33%
   4 Glaurung 1.1 SMP        10    6    6  7782   54%   -21   20%
   5 Crafty-22.2            -21    4    4 38908   46%     4   23%
   6 Arasan 10.0           -185    7    7  7782   29%   -21   19%
Tue Aug 12 11:36:10 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   110    6    7  7782   67%   -19   21%
   2 Fruit 2.1               63    6    7  7782   61%   -19   23%
   3 opponent-21.7           26    6    6  7782   57%   -19   33%
   4 Glaurung 1.1 SMP         7    6    7  7782   54%   -19   20%
   5 Crafty-22.2            -19    4    3 38910   47%     4   23%
   6 Arasan 10.0           -187    6    7  7782   28%   -19   19%
Wed Aug 13 00:53:43 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   109    7    6  7782   67%   -16   20%
   2 Fruit 2.1               63    6    7  7782   61%   -16   24%
   3 opponent-21.7           23    6    6  7781   56%   -16   32%
   4 Glaurung 1.1 SMP         3    6    7  7782   53%   -16   21%
   5 Crafty-22.2            -16    4    3 38909   47%     3   23%
   6 Arasan 10.0           -182    7    7  7782   28%   -16   19%
Third run is complete and given above. I also have a 15-minute "snapshot" program running that grabs the Elo data every 15 minutes, so I can see whether the numbers stabilize in any usable way before the test finishes...

More tomorrow when test 4 is done. But this certainly does look better, with one minor issue...

It appears to be a difficult task to measure small changes. Crafty's rating varies from -16 to -21, which is more noise than the effect I am actually hoping to measure... it might be a hopeless idea to try to measure very small changes in strength.
Stopped typing too soon. My main issue with the above is that each run was attempting to determine whether 21.7 is better than 22.2 (which we knew already, since 22.2 has parts of the evaluation completely missing). But the numbers came out -46, -45 and -39; a 7 Elo spread (not to mention the error bars, of course) for three versions that are exactly the same would make it very difficult to detect a small improvement, it would seem...

comments??

More games at a shorter time control? Can certainly do that...
Well, I think the setup in this test is wrong for telling any difference between 21.7 and 22.2. The only games 21.7 has are the games against 22.2? I think what you want to do to tell the difference between the two is 21.7 vs. the world compared with 22.2 vs. the world. The way this is set up, you may only be able to tell whether 22.2 in the first run is different from 22.2 in the second run.
Good point, and I had forgotten that. The way I want to run this test eventually is 22.x against all opponents including the old 21.7, then 22.y against all opponents. Then combine all the PGN into one wad and feed it to BayesElo. So yes, the current approach is not a very good test for 21.7 vs 22.2...
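Combining two runs of the same engine into one wad for BayesElo requires renaming the player in one run first, so BayesElo sees two distinct players. A minimal sketch (the file contents and the "-runB" suffix are invented for illustration):

```python
# Rename one run's "Crafty-22.2" tags so the combined PGN contains two
# distinct player names, then concatenate the runs into one file.
import re

def rename_player(pgn_text, old="Crafty-22.2", new="Crafty-22.2-runB"):
    # PGN tag pairs look like [White "Crafty-22.2"] or [Black "Crafty-22.2"]
    return re.sub(r'\[(White|Black) "%s"\]' % re.escape(old),
                  r'[\1 "%s"]' % new, pgn_text)

run_a = '[White "Crafty-22.2"]\n[Black "Fruit 2.1"]\n1. e4 e5 *\n'
run_b = rename_player(run_a)   # same games, distinct player name
combined = run_a + run_b       # one "wad" to feed to BayesElo
```

Since both renamed copies contain the same number of games, the two "versions" get equal weight in the rating calculation.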

That's what happens when you are posting at 2am. :)
Dirt
Posts: 2851
Joined: Wed Mar 08, 2006 10:01 pm
Location: Irvine, CA, USA

Re: testing, 3 complete runs now...

Post by Dirt »

Sven Schüle wrote:Yes. Please don't forget the important advice of Rémi that's still pending: instead of comparing relative ratings between completely different test runs, you should first combine all test runs into one PGN file, thereby editing the Crafty-22.2 names appropriately as already suggested, and only then pass this to BayesElo.
To see concrete examples of how this test would work with real data: after renaming Crafty-22.2 in the PGN files, the files can be concatenated pairwise to give us six combined PGNs that look like what we would get when testing different versions.

On each of the combined files, run BayesElo and get the ratings and the LOS tables. Normally we would be looking for a minimum[*] LOS between the versions to verify an improvement. If that minimum were less than the largest LOS from the combined files, there would be a false positive, so the Elo differences shown for those combined files would probably be just below what you would accept when testing different versions.

[*]Or since changes are designed to be improvements, perhaps they should be assumed to be good but too small to measure unless shown to be bad by some minimal likelihood of superiority of the old version.
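The pairwise combination above can be sketched quickly (the run labels are invented): with three runs of the identical engine, the ordered pairs give exactly the six combined files mentioned.

```python
# From three identical runs, form every ordered pair: the first member
# keeps its names, the second gets Crafty-22.2 renamed before the two
# are concatenated. That yields six combined PGNs that mimic testing
# a "new version" against an "old version".
from itertools import permutations

runs = ["run1", "run2", "run3"]        # three PGN files of the same engine
pairs = list(permutations(runs, 2))    # 6 ordered (keep, rename) pairs
```

Running BayesElo on each combined file and collecting the LOS between the two "versions" then calibrates how often two identical engines look different by chance.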
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: testing, 3 complete runs now...

Post by bob »

hgm wrote:
bob wrote:Stopped typing too soon. My main issue with the above is that each run was attempting to determine whether 21.7 is better than 22.2 (which we knew already, since 22.2 has parts of the evaluation completely missing). But the numbers came out -46, -45 and -39; a 7 Elo spread (not to mention the error bars, of course) for three versions that are exactly the same would make it very difficult to detect a small improvement, it would seem...
As you could have known in advance. The SD of a run of N games (with ~20% draws) is 45%/sqrt(N), so a run of 39,000 games (as for Crafty 22.2) will have an SD of 0.22%, corresponding to ~1.6 Elo (at 7 Elo per %). The 7,800-game runs (as for Crafty 21.7) will have an SD of 0.5%, or ~3.5 Elo. So their difference will have SD = 3.9 Elo (adding the variances).

So repeating the experiment will produce results that will typically deviate 3.9 Elo from the truth, and thus sqrt(2)*3.9 = 5.5 Elo from each other. The three pairs of results you have now (-46, -45, -39) deviate by 1, 6 and 7 from each other. That is on average sqrt((1+36+49)/3) = sqrt(28.7) = 5.35 Elo. As expected / predicted.
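hgm's back-of-envelope arithmetic can be checked directly. A quick sketch, using only the constants from his post (score SD of 45%/sqrt(N), ~7 Elo per percentage point, and the game counts from the tables):

```python
# Reproduce the SD estimates: a 38,908-game side vs a 7,782-game side,
# the SD of their rating difference, the expected run-to-run deviation,
# and the RMS deviation actually observed between -46, -45 and -39.
from math import sqrt

ELO_PER_PCT = 7.0
sd_big   = 45.0 / sqrt(38908) * ELO_PER_PCT   # ~1.6 Elo (Crafty 22.2 side)
sd_small = 45.0 / sqrt(7782)  * ELO_PER_PCT   # ~3.6 Elo (Crafty 21.7 side)
sd_diff  = sqrt(sd_big**2 + sd_small**2)      # ~3.9 Elo for the difference
sd_runs  = sqrt(2) * sd_diff                  # ~5.5 Elo between two runs
rms_obs  = sqrt((1**2 + 6**2 + 7**2) / 3)     # ~5.35 Elo observed
```

The observed run-to-run scatter (~5.35 Elo) lands almost exactly on the predicted ~5.5 Elo, which is the point of the post: the three runs disagree by just the amount the statistics say they must.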

If you had expected to be able to measure differences smaller than this, your enterprise was doomed from the start.
comments??
For the purpose of measuring the difference between the two Crafties, the runs you do now are flawed: 21.7 plays fewer games and, worse, it plays only a single opponent. This causes a large systematic error even if you were to reduce the variability to zero by playing infinitely many games in the same pairing scheme.

You could get (a little) less variability in the rating difference of the Crafties with fewer games, if you let them play an equal number of games. In the runs above you spend games on reducing the fluctuations in one Crafty, while the errors in the difference are already dominated by the other. This is basically wasted time.
More games at a shorter time control? Can certainly do that...
Useful knowledge is how scores typically vary with opponent and with position. Determining this requires a moderate number of position+opponent combinations to be played in such a way that each position gets at least 10,000 games against the same set of opponents. (And each opponent plays at least 10,000 games from the same set of positions.)
This has been pointed out more than once, and was a mistake I made late last night. I explained how I plan on testing in another post. And, in fact, I am going to take the PGN from the last 3 runs and produce three sets of data. I am going to rename crafty-22.2 in one of the sets, combine it with another set, and run BayesElo, as now both 22.2 versions will have exactly the same number of games. I will repeat this so that I get three results out of it and see how closely _they_ match, which is the comparison I really care about. They ought to turn out _very_ close, since they will be comparing exactly the same two programs.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: testing, 3 complete runs now...

Post by bob »

jdart wrote:I would add, Arasan is doing rather poorly here, but one contributing factor may be that the tests are on Linux, and it is much slower running on Linux, partly because I can't use PGO for compiles there (and maybe for other reasons). I think it's getting about 50% more NPS on Windows with VC++/PGO compiles. Investigating this further is on my todo list.
What kinds of problems? I use icc to compile all the programs on the cluster, and all are PGO'ed with no problems. Only thing is, I PGO Crafty with a set of positions, but not everyone uses the same kind of test format, so for the others I compile with PGO enabled, play a couple of quick games, then re-compile using that info... I've not noticed any issues at all, although I do not use gcc for this, as it has always been beyond flaky with PGO for me.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: testing, 3 complete runs now...

Post by bob »

Uri Blass wrote:
Dann Corbit wrote:
bob wrote:

Code:

Tue Aug 12 00:49:44 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   108    7    7  7782   67%   -21   20%
   2 Fruit 2.1               62    7    6  7782   61%   -21   23%
   3 opponent-21.7           25    6    6  7780   57%   -21   33%
   4 Glaurung 1.1 SMP        10    6    6  7782   54%   -21   20%
   5 Crafty-22.2            -21    4    4 38908   46%     4   23%
   6 Arasan 10.0           -185    7    7  7782   29%   -21   19%
Tue Aug 12 11:36:10 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   110    6    7  7782   67%   -19   21%
   2 Fruit 2.1               63    6    7  7782   61%   -19   23%
   3 opponent-21.7           26    6    6  7782   57%   -19   33%
   4 Glaurung 1.1 SMP         7    6    7  7782   54%   -19   20%
   5 Crafty-22.2            -19    4    3 38910   47%     4   23%
   6 Arasan 10.0           -187    6    7  7782   28%   -19   19%
Wed Aug 13 00:53:43 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   109    7    6  7782   67%   -16   20%
   2 Fruit 2.1               63    6    7  7782   61%   -16   24%
   3 opponent-21.7           23    6    6  7781   56%   -16   32%
   4 Glaurung 1.1 SMP         3    6    7  7782   53%   -16   21%
   5 Crafty-22.2            -16    4    3 38909   47%     3   23%
   6 Arasan 10.0           -182    7    7  7782   28%   -16   19%
Third run is complete and given above. I also have a 15-minute "snapshot" program running that grabs the Elo data every 15 minutes, so I can see whether the numbers stabilize in any usable way before the test finishes...

More tomorrow when test 4 is done. But this certainly does look better, with one minor issue...

It appears to be a difficult task to measure small changes. Crafty's rating varies from -16 to -21, which is more noise than the effect I am actually hoping to measure... it might be a hopeless idea to try to measure very small changes in strength.
This seems to indicate that Crafty 22.2 is about as strong as Glaurung (within 30 Elo or so). If so, then that is a very impressive result, and it also means that the current Crafty is about as strong as the current crop of commercial programs (other than one notable exception).
;-)
The comparison is only with an old version of Glaurung (Glaurung 1.1).

Crafty is clearly weaker than Glaurung.

Crafty is clearly weaker than commercial programs like Zappa, Naum, Hiarcs, Shredder or Fritz, and the difference is more than 200 Elo.

Uri
Hmmm... do you not see Glaurung 2 in that list anywhere? :)
Dirt
Posts: 2851
Joined: Wed Mar 08, 2006 10:01 pm
Location: Irvine, CA, USA

Re: testing, 3 complete runs now...

Post by Dirt »

bob wrote:And, in fact, I am going to take the PGN from the last 3 runs and produce three sets of data. I am going to rename crafty-22.2 in one of the sets, combine it with another set, and run BayesElo, as now both 22.2 versions will have exactly the same number of games. I will repeat this so that I get three results out of it and see how closely _they_ match, which is the comparison I really care about. They ought to turn out _very_ close, since they will be comparing exactly the same two programs.
As I pointed out at the same time you were posting this, you can get six test results. They won't be independent, but I see no reason not to look at all of them. And, as Rémi said, you should look at the LOS tables.
jdart
Posts: 4368
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: testing, 3 complete runs now...

Post by jdart »

Doesn't work with GCC, as you mentioned.

I haven't tried icc for a while - I should do it again. The Makefile doesn't support icc + PGO but I can fix that.
User avatar
mhull
Posts: 13447
Joined: Wed Mar 08, 2006 9:02 pm
Location: Dallas, Texas
Full name: Matthew Hull

Re: testing, 2 complete runs now...

Post by mhull »

hgm wrote:
Fritzlein wrote: That's not quite correct. If there are two test runs that both diverge from a truth that is presumed to be constant, then making both test runs measure closer to the truth will necessarily also make the test runs measure closer to each other.
Well, I think that it is very easy to prove that this statement is wrong, with a simple counterexample:

Suppose that I want to measure the average age of the population, and my sampling procedure is to ask Bob his age 10 times and take the average. If I now do many 'runs' like this, I will always get answers that are very close (especially if it is not Bob's birthday).

Now I change my sampling procedure: I go out in the street, ask the first 10 persons I encounter their age, and take the average of that. The result of such a 'run' will be far closer to the truth. But when I repeat such a 'run' several times, the resulting run averages will fluctuate. So they are farther from each other than the various Bob-only 'averages' were from each other, despite the fact that they were all much closer to the truth...
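The counterexample is easy to simulate. A toy sketch (all numbers, including Bob's age, are invented for illustration):

```python
# Compare two sampling procedures for estimating a population's mean age:
# (a) average one fixed person's answer 10 times per run (low variance,
#     high bias), vs (b) average 10 random people per run (higher
#     run-to-run variance, but unbiased).
import random

random.seed(1)
population = [random.randint(1, 90) for _ in range(10000)]
true_mean = sum(population) / len(population)
bob_age = 62                              # Bob's (hypothetical) age

bob_runs = [bob_age] * 20                 # every Bob-only run averages 62
street_runs = [sum(random.sample(population, 10)) / 10 for _ in range(20)]

def spread(xs):
    return max(xs) - min(xs)
# Bob-only runs: zero spread between runs, but a fixed bias from the truth.
# Street runs: visible spread between runs, yet centered on true_mean.
```

The Bob-only runs agree perfectly with each other while all missing the true mean by the same amount, which is exactly the distinction between agreement and accuracy the post is making.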
This seems like a false analogy, since multiple Bob-only sample runs wouldn't be far apart from each other to begin with. The context of the discussion was going from multiple sample runs that were far apart, to multiple sample runs that are closer together.
Matthew Hull
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: testing, 3 complete runs now...

Post by bob »

jdart wrote:Doesn't work with GCC, as you mentioned.

I haven't tried icc for a while - I should do it again. The Makefile doesn't support icc + PGO but I can fix that.
Don't know what kind of problems you run into, but for me gcc usually just crashes during the profile run, or else complains that the profile output is corrupted and crashes during the re-compile... I gave up on it, although I did find a version or two that worked when I was running on the AMD developer lab machines...

Latest ICCs have all worked perfectly for me, both 32 bit and 64 bit versions.