testing, 2 complete runs now...

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: testing, 3 complete runs now...

Post by bob »

bob wrote:

Code: Select all

Tue Aug 12 00:49:44 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   108    7    7  7782   67%   -21   20%
   2 Fruit 2.1               62    7    6  7782   61%   -21   23%
   3 opponent-21.7           25    6    6  7780   57%   -21   33%
   4 Glaurung 1.1 SMP        10    6    6  7782   54%   -21   20%
   5 Crafty-22.2            -21    4    4 38908   46%     4   23%
   6 Arasan 10.0           -185    7    7  7782   29%   -21   19%
Tue Aug 12 11:36:10 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   110    6    7  7782   67%   -19   21%
   2 Fruit 2.1               63    6    7  7782   61%   -19   23%
   3 opponent-21.7           26    6    6  7782   57%   -19   33%
   4 Glaurung 1.1 SMP         7    6    7  7782   54%   -19   20%
   5 Crafty-22.2            -19    4    3 38910   47%     4   23%
   6 Arasan 10.0           -187    6    7  7782   28%   -19   19%
Wed Aug 13 00:53:43 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   109    7    6  7782   67%   -16   20%
   2 Fruit 2.1               63    6    7  7782   61%   -16   24%
   3 opponent-21.7           23    6    6  7781   56%   -16   32%
   4 Glaurung 1.1 SMP         3    6    7  7782   53%   -16   21%
   5 Crafty-22.2            -16    4    3 38909   47%     3   23%
   6 Arasan 10.0           -182    7    7  7782   28%   -16   19%
Third run is complete and given above. I also have a 15-minute "snapshot" program running that will grab the elo data every 15 minutes so that I can see if the numbers stabilize before the test finishes, in any sort of usable way...

More tomorrow when test 4 is done. But this certainly does look better, with one minor issue...

It appears to be a difficult task to measure small changes. Crafty's rating varies from -16 to -21, which is more noise than the effect I am actually hoping to measure... it might be a hopeless idea to try to measure very small changes in strength.
Stopped typing too soon. My main issue with the above is that each run was attempting to determine whether 21.7 is better than 22.2 (which we knew already, since 22.2 has parts of the evaluation completely missing). But the numbers came out -46, -45 and -39: a 7 Elo spread (not to mention the error bars, of course) across three runs of exactly the same versions, which would make it very difficult to detect a small improvement, it would seem...

comments??

More games at shorter time control? can certainly do that...
Dann Corbit
Posts: 12662
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: testing, 3 complete runs now...

Post by Dann Corbit »

bob wrote:

Code: Select all

Tue Aug 12 00:49:44 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   108    7    7  7782   67%   -21   20%
   2 Fruit 2.1               62    7    6  7782   61%   -21   23%
   3 opponent-21.7           25    6    6  7780   57%   -21   33%
   4 Glaurung 1.1 SMP        10    6    6  7782   54%   -21   20%
   5 Crafty-22.2            -21    4    4 38908   46%     4   23%
   6 Arasan 10.0           -185    7    7  7782   29%   -21   19%
Tue Aug 12 11:36:10 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   110    6    7  7782   67%   -19   21%
   2 Fruit 2.1               63    6    7  7782   61%   -19   23%
   3 opponent-21.7           26    6    6  7782   57%   -19   33%
   4 Glaurung 1.1 SMP         7    6    7  7782   54%   -19   20%
   5 Crafty-22.2            -19    4    3 38910   47%     4   23%
   6 Arasan 10.0           -187    6    7  7782   28%   -19   19%
Wed Aug 13 00:53:43 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   109    7    6  7782   67%   -16   20%
   2 Fruit 2.1               63    6    7  7782   61%   -16   24%
   3 opponent-21.7           23    6    6  7781   56%   -16   32%
   4 Glaurung 1.1 SMP         3    6    7  7782   53%   -16   21%
   5 Crafty-22.2            -16    4    3 38909   47%     3   23%
   6 Arasan 10.0           -182    7    7  7782   28%   -16   19%
Third run is complete and given above. I also have a 15-minute "snapshot" program running that will grab the elo data every 15 minutes so that I can see if the numbers stabilize before the test finishes, in any sort of usable way...

More tomorrow when test 4 is done. But this certainly does look better, with one minor issue...

It appears to be a difficult task to measure small changes. Crafty's rating varies from -16 to -21, which is more noise than the effect I am actually hoping to measure... it might be a hopeless idea to try to measure very small changes in strength.
This seems to indicate that Crafty 22.2 is about as strong as Glaurung {within 30 Elo or so}. If so, then that is a very impressive result, and it also means that the current Crafty is about as strong as the current crop of commercial programs (other than one notable exception).
;-)
User avatar
hgm
Posts: 28123
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: testing, 2 complete runs now...

Post by hgm »

Fritzlein wrote: That's not quite correct. If there are two test runs that both diverge from a truth that is presumed to be constant, then making both test runs measure closer to the truth will necessarily also make the test runs measure closer to each other.
Well, I think it is very easy to prove that this statement is wrong, with a simple counterexample:

Suppose that I want to measure the average age of the population, and my sampling procedure is to ask Bob his age 10 times and take the average. If I now do many 'runs' like this, I will always get answers that are very close (especially if it is not Bob's birthday).

Now I change my sampling procedure: I go out in the street, ask the first 10 people I encounter their age, and take the average of that. The result of such a 'run' will be far closer to the truth. But when I repeat such a 'run' several times, the resulting run averages will fluctuate. So they are farther from each other than the various Bob-only 'averages' were from each other, despite the fact that they are all much closer to the truth...
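
For those who prefer to see it numerically, here is a minimal simulation sketch of the two sampling procedures (not from the original post; the population, ages and run counts are made up for illustration, only the qualitative contrast matters):

Code: Select all

import random

random.seed(1)

# Hypothetical population: ages uniform between 0 and 80, true mean ~40.
population = [random.uniform(0, 80) for _ in range(100000)]
true_mean = sum(population) / len(population)

bob_age = population[0]  # "Bob" is one fixed member of the population

def bob_run(n=10):
    # Ask Bob his age n times and average: identical between runs,
    # but possibly far from the truth.
    return sum(bob_age for _ in range(n)) / n

def street_run(n=10):
    # Ask n random people and average: each run lands near the truth,
    # but different runs disagree with each other.
    return sum(random.choice(population) for _ in range(n)) / n

def spread(xs):
    m = sum(xs) / len(xs)
    return (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5

bob_runs = [bob_run() for _ in range(20)]
street_runs = [street_run() for _ in range(20)]

print("true mean age : %.1f" % true_mean)
print("Bob-only runs : mean %.1f, spread between runs %.2f"
      % (sum(bob_runs) / 20, spread(bob_runs)))
print("street runs   : mean %.1f, spread between runs %.2f"
      % (sum(street_runs) / 20, spread(street_runs)))

The Bob-only runs agree perfectly with each other while being (in general) far from the true mean; the street runs scatter around the true mean but differ from run to run.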
Tony

Re: testing, 3 complete runs now...

Post by Tony »

bob wrote:
bob wrote:

Code: Select all

Tue Aug 12 00:49:44 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   108    7    7  7782   67%   -21   20%
   2 Fruit 2.1               62    7    6  7782   61%   -21   23%
   3 opponent-21.7           25    6    6  7780   57%   -21   33%
   4 Glaurung 1.1 SMP        10    6    6  7782   54%   -21   20%
   5 Crafty-22.2            -21    4    4 38908   46%     4   23%
   6 Arasan 10.0           -185    7    7  7782   29%   -21   19%
Tue Aug 12 11:36:10 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   110    6    7  7782   67%   -19   21%
   2 Fruit 2.1               63    6    7  7782   61%   -19   23%
   3 opponent-21.7           26    6    6  7782   57%   -19   33%
   4 Glaurung 1.1 SMP         7    6    7  7782   54%   -19   20%
   5 Crafty-22.2            -19    4    3 38910   47%     4   23%
   6 Arasan 10.0           -187    6    7  7782   28%   -19   19%
Wed Aug 13 00:53:43 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   109    7    6  7782   67%   -16   20%
   2 Fruit 2.1               63    6    7  7782   61%   -16   24%
   3 opponent-21.7           23    6    6  7781   56%   -16   32%
   4 Glaurung 1.1 SMP         3    6    7  7782   53%   -16   21%
   5 Crafty-22.2            -16    4    3 38909   47%     3   23%
   6 Arasan 10.0           -182    7    7  7782   28%   -16   19%
Third run is complete and given above. I also have a 15-minute "snapshot" program running that will grab the elo data every 15 minutes so that I can see if the numbers stabilize before the test finishes, in any sort of usable way...

More tomorrow when test 4 is done. But this certainly does look better, with one minor issue...

It appears to be a difficult task to measure small changes. Crafty's rating varies from -16 to -21, which is more noise than the effect I am actually hoping to measure... it might be a hopeless idea to try to measure very small changes in strength.
Stopped typing too soon. My main issue with the above is that each run was attempting to determine whether 21.7 is better than 22.2 (which we knew already, since 22.2 has parts of the evaluation completely missing). But the numbers came out -46, -45 and -39: a 7 Elo spread (not to mention the error bars, of course) across three runs of exactly the same versions, which would make it very difficult to detect a small improvement, it would seem...

comments??

More games at shorter time control? can certainly do that...
That could be interesting: more games, shorter time control. Question is, would that need more starting positions?

Can you do a 12-hour run with 1-second games? Maybe 5M games would give us some information :)

Tony
krazyken

Re: testing, 3 complete runs now...

Post by krazyken »

bob wrote:
bob wrote:

Code: Select all

Tue Aug 12 00:49:44 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   108    7    7  7782   67%   -21   20%
   2 Fruit 2.1               62    7    6  7782   61%   -21   23%
   3 opponent-21.7           25    6    6  7780   57%   -21   33%
   4 Glaurung 1.1 SMP        10    6    6  7782   54%   -21   20%
   5 Crafty-22.2            -21    4    4 38908   46%     4   23%
   6 Arasan 10.0           -185    7    7  7782   29%   -21   19%
Tue Aug 12 11:36:10 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   110    6    7  7782   67%   -19   21%
   2 Fruit 2.1               63    6    7  7782   61%   -19   23%
   3 opponent-21.7           26    6    6  7782   57%   -19   33%
   4 Glaurung 1.1 SMP         7    6    7  7782   54%   -19   20%
   5 Crafty-22.2            -19    4    3 38910   47%     4   23%
   6 Arasan 10.0           -187    6    7  7782   28%   -19   19%
Wed Aug 13 00:53:43 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   109    7    6  7782   67%   -16   20%
   2 Fruit 2.1               63    6    7  7782   61%   -16   24%
   3 opponent-21.7           23    6    6  7781   56%   -16   32%
   4 Glaurung 1.1 SMP         3    6    7  7782   53%   -16   21%
   5 Crafty-22.2            -16    4    3 38909   47%     3   23%
   6 Arasan 10.0           -182    7    7  7782   28%   -16   19%
Third run is complete and given above. I also have a 15-minute "snapshot" program running that will grab the elo data every 15 minutes so that I can see if the numbers stabilize before the test finishes, in any sort of usable way...

More tomorrow when test 4 is done. But this certainly does look better, with one minor issue...

It appears to be a difficult task to measure small changes. Crafty's rating varies from -16 to -21, which is more noise than the effect I am actually hoping to measure... it might be a hopeless idea to try to measure very small changes in strength.
Stopped typing too soon. My main issue with the above is that each run was attempting to determine whether 21.7 is better than 22.2 (which we knew already, since 22.2 has parts of the evaluation completely missing). But the numbers came out -46, -45 and -39: a 7 Elo spread (not to mention the error bars, of course) across three runs of exactly the same versions, which would make it very difficult to detect a small improvement, it would seem...

comments??

More games at shorter time control? can certainly do that...
Well, I think the setup of this test is wrong for telling any difference between 21.7 and 22.2. The only games 21.7 has are the games against 22.2, right? I think what you want to do to tell the difference between the two is compare 21.7 vs. the world with 22.2 vs. the world. The way this is set up, you may only be able to tell whether 22.2 in the first run is different from 22.2 in the second run.
User avatar
hgm
Posts: 28123
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: testing, 3 complete runs now...

Post by hgm »

bob wrote:Stopped typing too soon. My main issue with the above is that each run was attempting to determine whether 21.7 is better than 22.2 (which we knew already, since 22.2 has parts of the evaluation completely missing). But the numbers came out -46, -45 and -39: a 7 Elo spread (not to mention the error bars, of course) across three runs of exactly the same versions, which would make it very difficult to detect a small improvement, it would seem...
As you could have known in advance. The SD of a run of N games (with ~20% draws) is 45%/sqrt(N), so a run of 39,000 games (as for Crafty 22.2) will have an SD of 0.22%, corresponding to ~1.6 Elo points (at ~7 Elo per %). The result of a 7,800-game run, as for Crafty 21.7, will have an SD of 0.5%, or ~3.5 Elo. So their difference will have SD = 3.9 Elo (adding the variances).

So repeating the experiment will produce results that will typically deviate 3.9 Elo from the truth, and thus sqrt(2)*3.9 = 5.5 Elo from each other. The three results you have now, (-46, -45, -39), deviate by 1, 6 and 7 from each other. That is on average sqrt((1+36+49)/3) = 5.35 Elo. As expected / predicted.
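
As a sanity check, here is a small sketch that reproduces the arithmetic above (the 45%-per-game figure and the ~7 Elo-per-percent conversion are taken as given from this post, not derived here):

Code: Select all

from math import sqrt

SD_PER_GAME = 0.45       # ~45% score SD per game with ~20% draws
ELO_PER_PERCENT = 7.0    # rough slope of the rating curve near 50%

def run_sd_elo(n_games):
    # SD of a run's score: 45%/sqrt(N), converted at ~7 Elo per percent.
    return SD_PER_GAME / sqrt(n_games) * 100 * ELO_PER_PERCENT

sd_22 = run_sd_elo(39000)            # Crafty 22.2, ~39,000 games -> ~1.6 Elo
sd_21 = run_sd_elo(7800)             # Crafty 21.7, ~7,800 games  -> ~3.6 Elo
sd_diff = sqrt(sd_22**2 + sd_21**2)  # SD of the difference       -> ~3.9 Elo
sd_runs = sqrt(2) * sd_diff          # expected run-to-run spread -> ~5.5 Elo

print("SD(22.2 run)   = %.1f Elo" % sd_22)
print("SD(21.7 run)   = %.1f Elo" % sd_21)
print("SD(difference) = %.1f Elo" % sd_diff)
print("SD(run vs run) = %.1f Elo" % sd_runs)

# Observed pairwise deviations of the three results (-46, -45, -39): 1, 6, 7.
print("observed RMS   = %.2f Elo" % sqrt((1**2 + 6**2 + 7**2) / 3.0))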

If you had expected to be able to measure differences smaller than this, your enterprise was doomed from the start.
comments??
For the purpose of measuring the difference between the two Crafties, the runs you do now are flawed: 21.7 plays fewer games, and worse, it plays only a single opponent. This causes a large systematic error even if you were to reduce the variability to zero by playing infinitely many games in the same pairing scheme.

You could get (a little) less variability in the rating difference of the Crafties with fewer games, if you let them play an equal number of games. In the runs above you spend games on reducing the fluctuations in one Crafty, while the errors in the difference are already dominated by the other. This is basically wasted time.
More games at shorter time control? can certainly do that...
Useful knowledge is how scores typically vary with opponent and with position. Determining this requires a moderate number of position+opponent combinations to be played in such a way that each position gets at least 10,000 games against the same set of opponents. (And each opponent plays at least 10,000 games from the same set of positions.)
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: testing, 3 complete runs now...

Post by Sven »

bob wrote:
bob wrote:It appears to be a difficult task to measure small changes. Crafty's rating varies from -16 to -21, which is more noise than the effect I am actually hoping to measure... it might be a hopeless idea to try to measure very small changes in strength.
Stopped typing too soon. My main issue with the above is that each run was attempting to determine whether 21.7 is better than 22.2 (which we knew already, since 22.2 has parts of the evaluation completely missing). But the numbers came out -46, -45 and -39: a 7 Elo spread (not to mention the error bars, of course) across three runs of exactly the same versions, which would make it very difficult to detect a small improvement, it would seem...

comments??
Yes. Please don't forget the important advice of Rémi that's still pending: instead of comparing relative ratings between completely different test runs, you should first combine all test runs into one PGN file, editing the Crafty-22.2 names appropriately as already suggested, and only then pass this to BayesElo. Rémi has explained why; it had to do with the rating offset that is calculated each time to get relative ratings. I do not consider this to be "just some improvement" but a "must", since the author of the tool claims that otherwise the ratings are not comparable, and thus any conclusions from such a comparison remain highly questionable.
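
As an illustration of that step, a minimal sketch of combining the runs and renaming the tested engine per run could look like this (the file names and the run-suffix scheme are invented for the example and would need to match the actual PGN tags):

Code: Select all

# Combine several test runs into one PGN, giving the tested engine a
# distinct name in each run, so that BayesElo rates everything in one
# common pool with a single rating offset.
import re

runs = ["run1.pgn", "run2.pgn", "run3.pgn"]   # hypothetical file names

with open("combined.pgn", "w") as out:
    for i, path in enumerate(runs, start=1):
        with open(path) as f:
            pgn = f.read()
        # Rename the tested engine's tag per run (tag value assumed to be
        # exactly "Crafty-22.2"; adjust the pattern to the real tags).
        pgn = re.sub(r'"Crafty-22\.2"', '"Crafty-22.2-run%d"' % i, pgn)
        out.write(pgn)
        out.write("\n")

The combined file is then fed to BayesElo in a single session, so all ratings share the same offset and the per-run Crafty entries become directly comparable.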

Btw I expect even smaller error bars after doing this, but let's see.

Regarding possible test setups, I can imagine that you might also be successful with more opponents (say, 10) but many fewer games (say, 200 test positions x both colors = 400 games per match, although this is just a guess and should be validated thoroughly). This might already give you error bars that are as good as the current ones.

Currently I don't insist on the round robin of "world" as an initial preparation step, since your current results seem to be quite good already. But I still like the idea, and I'm pretty sure everyone will be surprised by the results, provided you apply the additional "covariance" command of BayesElo in this case after having added the games of all candidate versions to the single PGN file.

Also, if I were doing this test, I would keep the time-control-based setup. With fixed node counts I think I would be measuring abilities of the engine that I'm not really interested in. I would like to compare the strength of versions of my engine when playing "real games". For me, performing better when always given a fixed node count for each search does not necessarily mean also performing better in "real games" with a time control.

Of course "real games" would imply playing with opening books instead of test positions, but that's another issue that has already been checked and discussed, I guess, and I assume that currently there is wide acceptance of not using opening books for such tests.

Sven
Uri Blass
Posts: 10460
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: testing, 3 complete runs now...

Post by Uri Blass »

Dann Corbit wrote:
bob wrote:

Code: Select all

Tue Aug 12 00:49:44 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   108    7    7  7782   67%   -21   20%
   2 Fruit 2.1               62    7    6  7782   61%   -21   23%
   3 opponent-21.7           25    6    6  7780   57%   -21   33%
   4 Glaurung 1.1 SMP        10    6    6  7782   54%   -21   20%
   5 Crafty-22.2            -21    4    4 38908   46%     4   23%
   6 Arasan 10.0           -185    7    7  7782   29%   -21   19%
Tue Aug 12 11:36:10 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   110    6    7  7782   67%   -19   21%
   2 Fruit 2.1               63    6    7  7782   61%   -19   23%
   3 opponent-21.7           26    6    6  7782   57%   -19   33%
   4 Glaurung 1.1 SMP         7    6    7  7782   54%   -19   20%
   5 Crafty-22.2            -19    4    3 38910   47%     4   23%
   6 Arasan 10.0           -187    6    7  7782   28%   -19   19%
Wed Aug 13 00:53:43 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   109    7    6  7782   67%   -16   20%
   2 Fruit 2.1               63    6    7  7782   61%   -16   24%
   3 opponent-21.7           23    6    6  7781   56%   -16   32%
   4 Glaurung 1.1 SMP         3    6    7  7782   53%   -16   21%
   5 Crafty-22.2            -16    4    3 38909   47%     3   23%
   6 Arasan 10.0           -182    7    7  7782   28%   -16   19%
Third run is complete and given above. I also have a 15-minute "snapshot" program running that will grab the elo data every 15 minutes so that I can see if the numbers stabilize before the test finishes, in any sort of usable way...

More tomorrow when test 4 is done. But this certainly does look better, with one minor issue...

It appears to be a difficult task to measure small changes. Crafty's rating varies from -16 to -21, which is more noise than the effect I am actually hoping to measure... it might be a hopeless idea to try to measure very small changes in strength.
This seems to indicate that Crafty 22.2 is about as strong as Glaurung {within 30 Elo or so}. If so, then that is a very impressive result, and it also means that the current Crafty is about as strong as the current crop of commercial programs (other than one notable exception).
;-)
The comparison is only with an old version of Glaurung (Glaurung 1.1).

Crafty is clearly weaker than Glaurung.

Crafty is clearly weaker than commercial programs like Zappa, Naum, Hiarcs, Shredder or Fritz, and the difference is more than 200 Elo.

Uri
mathmoi
Posts: 290
Joined: Mon Mar 13, 2006 5:23 pm
Location: Québec

Re: testing, 3 complete runs now...

Post by mathmoi »

bob wrote:
bob wrote:

Code: Select all

Tue Aug 12 00:49:44 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   108    7    7  7782   67%   -21   20%
   2 Fruit 2.1               62    7    6  7782   61%   -21   23%
   3 opponent-21.7           25    6    6  7780   57%   -21   33%
   4 Glaurung 1.1 SMP        10    6    6  7782   54%   -21   20%
   5 Crafty-22.2            -21    4    4 38908   46%     4   23%
   6 Arasan 10.0           -185    7    7  7782   29%   -21   19%
Tue Aug 12 11:36:10 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   110    6    7  7782   67%   -19   21%
   2 Fruit 2.1               63    6    7  7782   61%   -19   23%
   3 opponent-21.7           26    6    6  7782   57%   -19   33%
   4 Glaurung 1.1 SMP         7    6    7  7782   54%   -19   20%
   5 Crafty-22.2            -19    4    3 38910   47%     4   23%
   6 Arasan 10.0           -187    6    7  7782   28%   -19   19%
Wed Aug 13 00:53:43 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   109    7    6  7782   67%   -16   20%
   2 Fruit 2.1               63    6    7  7782   61%   -16   24%
   3 opponent-21.7           23    6    6  7781   56%   -16   32%
   4 Glaurung 1.1 SMP         3    6    7  7782   53%   -16   21%
   5 Crafty-22.2            -16    4    3 38909   47%     3   23%
   6 Arasan 10.0           -182    7    7  7782   28%   -16   19%
Third run is complete and given above. I also have a 15-minute "snapshot" program running that will grab the elo data every 15 minutes so that I can see if the numbers stabilize before the test finishes, in any sort of usable way...

More tomorrow when test 4 is done. But this certainly does look better, with one minor issue...

It appears to be a difficult task to measure small changes. Crafty's rating varies from -16 to -21, which is more noise than the effect I am actually hoping to measure... it might be a hopeless idea to try to measure very small changes in strength.
Stopped typing too soon. My main issue with the above is that each run was attempting to determine whether 21.7 is better than 22.2 (which we knew already, since 22.2 has parts of the evaluation completely missing). But the numbers came out -46, -45 and -39: a 7 Elo spread (not to mention the error bars, of course) across three runs of exactly the same versions, which would make it very difficult to detect a small improvement, it would seem...

comments??

More games at shorter time control? can certainly do that...
Hi Bob,

If you want to compare 21.7 (opponent-21.7) against 22.2 (Crafty-22.2), it doesn't seem logical to me to use a 22.2 vs. World run to do this.

I think what you should do (and I think it's what Rémi was proposing) is run 22.2 vs. World and 21.7 vs. World, then combine both PGNs and feed them to BayesElo.

Wouldn't this produce a more accurate rating for 21.7 and thus provide a better comparison of 21.7 and 22.2?

Also note that in a real-world experiment, the 22.2 vs. World run could be reused later to compare 22.2 with 22.3 (or whatever the next version number is). Eventually you would have a PGN file composed of multiple versions of Crafty vs. the world.
jdart
Posts: 4375
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: testing, 3 complete runs now...

Post by jdart »

I would add that Arasan is doing rather poorly here, but one contributing factor may be that the tests are on Linux, and it is much slower running on Linux, partly because I can't use PGO for compiles there (and maybe for other reasons). I think it gets about 50% more NPS on Windows with VC++/PGO compiles. Investigating this further is on my todo list.