testing, 2 complete runs now...

Discussion of chess software programming and technical issues.

Moderator: Ras

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

testing, 2 complete runs now...

Post by bob »

Code: Select all

Tue Aug 12 00:49:44 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   108    7    7  7782   67%   -21   20%
   2 Fruit 2.1               62    7    6  7782   61%   -21   23%
   3 opponent-21.7           25    6    6  7780   57%   -21   33%
   4 Glaurung 1.1 SMP        10    6    6  7782   54%   -21   20%
   5 Crafty-22.2            -21    4    4 38908   46%     4   23%
   6 Arasan 10.0           -185    7    7  7782   29%   -21   19%
Tue Aug 12 11:36:10 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   110    6    7  7782   67%   -19   21%
   2 Fruit 2.1               63    6    7  7782   61%   -19   23%
   3 opponent-21.7           26    6    6  7782   57%   -19   33%
   4 Glaurung 1.1 SMP         7    6    7  7782   54%   -19   20%
   5 Crafty-22.2            -19    4    3 38910   47%     4   23%
   6 Arasan 10.0           -187    6    7  7782   28%   -19   19%

Those are the first two runs, and they seem to be more consistent than the last two big runs with just 40 positions. Run 3 is in progress and will finish tonight; by noon tomorrow the entire 160,000 games should be done. If, as Karl has suggested, these runs stay within the expected variability limits, then we can start a discussion on reducing this computational load to something more palatable.

I'm just hoping that we see stable Elo numbers. But then the worry may well be that most sensible changes do not affect a program's Elo enough for this test to measure, which would be a completely different problem to deal with.

Note that this is not a full round-robin, although I could run one after the current test finishes if anyone wants to see how that would collapse the overall rating differences into a smaller range.

Remember, my goal is to compare A to A'. I don't care about absolute Elo values, or exactly how much better or worse A is than A'; I only want to see whether A' (which represents a slightly modified version of Crafty, AKA program A) is better or worse. I don't give a hoot about how much better or worse.
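As a rough illustration of that A-versus-A' comparison (not the actual test harness; the game counts below are made-up placeholders), this is one way to judge whether a score difference between the two versions is larger than the sampling noise:

Code: Select all

import math

def score_stats(wins, draws, losses):
    """Mean score and standard error of the mean for one version's games."""
    n = wins + draws + losses
    mean = (wins + 0.5 * draws) / n
    # sample variance of the per-game score (values 1, 0.5, 0)
    var = (wins * (1 - mean) ** 2 + draws * (0.5 - mean) ** 2
           + losses * (0 - mean) ** 2) / (n - 1)
    return mean, math.sqrt(var / n)

# Made-up counts for version A and a slightly modified A'.
mean_a,  se_a  = score_stats(wins=15000, draws=9000, losses=14908)
mean_a2, se_a2 = score_stats(wins=15200, draws=9000, losses=14710)

diff = mean_a2 - mean_a
noise = 2 * math.sqrt(se_a ** 2 + se_a2 ** 2)   # ~2-sigma noise on the difference
verdict = "probably real" if abs(diff) > noise else "indistinguishable from noise"
print(f"score difference {diff:+.4f} vs noise {noise:.4f}: {verdict}")
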
xsadar
Posts: 147
Joined: Wed Jun 06, 2007 10:01 am
Location: United States
Full name: Mike Leany

Re: testing, 2 complete runs now...

Post by xsadar »

Looking good so far. Even the opponents are all within 3 Elo between the two runs -- well within the expected margins.
Fritzlein

Re: testing, 2 complete runs now...

Post by Fritzlein »

bob wrote:Those are the first two runs, and they seem to be more consistent than the last two big runs with just 40 positions.
Indeed, a difference of 2 Elo, when we are given that 4 Elo is two standard deviations, looks reasonable. But the results are even less varied than they appear at first glance. If I correctly understood what Remi Coulom said, there is an additional source of uncertainty from running BayesElo twice rather than once on a combined dataset. This additional uncertainty is on the same order as the given standard deviation, I believe he said, so a difference of 2 Elo between the two runs is actually like half a standard deviation.

Of course, to put all the results in one file and run BayesElo once, you have to have different names for Crafty in each run, e.g. "Crafty-22.2-Test1" and "Crafty-22.2-Test2", while keeping all the opponent names the same. That stabilizes the scale (since Elo has no absolute anchor) and makes the confidence interval mean what it looks like it means.
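A minimal sketch of that renaming trick, assuming the runs exist as ordinary PGN files with standard White/Black tags (the file names and the exact engine string are placeholders, not necessarily the real setup):

Code: Select all

import re

def retag(in_path, out_path, old_name, new_name):
    """Rewrite [White "..."] / [Black "..."] headers that match old_name."""
    pattern = re.compile(r'\[(White|Black) "' + re.escape(old_name) + r'"\]')
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            fout.write(pattern.sub(r'[\1 "' + new_name + '"]', line))

retag("run1.pgn", "run1_tagged.pgn", "Crafty-22.2", "Crafty-22.2-Test1")
retag("run2.pgn", "run2_tagged.pgn", "Crafty-22.2", "Crafty-22.2-Test2")

# Concatenate the tagged files and feed the result to BayesElo in a single
# pass; the two Crafty entries and all opponents then share one scale.
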

Anyway, it's way cool to see these results and to not have been proven wrong yet. From my perspective, we had better cancel the last two runs, because that will just be two more chances for me to be wrong. [/kidding]
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: testing, 2 complete runs now...

Post by bob »

Fritzlein wrote:
bob wrote:Those are the first two runs, and they seem to be more consistent than the last two big runs with just 40 positions.
Indeed, a difference of 2 Elo, when we are given that 4 Elo is two standard deviations, looks reasonable. But the results are even less varied than they appear at first glance. If I correctly understood what Remi Coulom said, there is an additional source of uncertainty from running BayesElo twice rather than once on a combined dataset. This additional uncertainty is on the same order as the given standard deviation, I believe he said, so a difference of 2 Elo between the two runs is actually like half a standard deviation.

Of course, to put all the results in one file and run BayesElo once, you have to have different names for Crafty in each run, e.g. "Crafty-22.2-Test1" and "Crafty-22.2-Test2", while keeping all the opponent names the same. That stabilizes the scale (since Elo has no absolute anchor) and makes the confidence interval mean what it looks like it means.

Anyway, it's way cool to see these results and to not have been proven wrong yet. From my perspective, we had better cancel the last two runs, because that will just be two more chances for me to be wrong. [/kidding]
When all is done, I can easily edit three of the sets and change the names to see what happens... But also, remember that my goal is to accurately determine the difference between (in this case) crafty-21.7 and crafty-22.2, and both are included, which should help...

So far things do look more reasonable, but as the past has shown, any two runs can produce lots of variability... so anything can still happen. :)
hgm
Posts: 28395
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: testing, 2 complete runs now...

Post by hgm »

Fritzlein wrote:Anyway, it's way cool to see these results and to not have been proven wrong yet. From my perspective, we had better cancel the last two runs, because that will just be two more chances for me to be wrong. [/kidding]
Let me get this clear: what exactly was not proven wrong yet?

I understand that there is a difference between the runs we are seeing now and the two 25,000-game runs that started all this, namely that these new runs are using many more different starting positions. Your original statement was that more different positions would be needed to get the results of runs closer to the true strength of the engine (defined as the average over all positions and all opponents), but that it would not do anything toward getting the results of two runs closer to each other. Even when starting from a single position, the results of two runs would be as close to each other as when you take many different positions, or typically even closer (provided, of course, that all runs use the same single initial position). It is just that in that case they would both be close to the wrong value.

So how does the current result make you 'right'? I would say, if anything, it proves you wrong: you claimed that using 40 positions would be too few for the result to converge to the true value. But it seems that all the runs now converge to the same value as the second run, despite the fact that the latter used only 40 positions.

To show whether you are right (and how many positions are needed to reach a certain accuracy), one would have to know how much the results taken on one position differ from the results taken on another. If the results were always the same within the sampling error of 0.5/sqrt(N), no matter which position you took, there would obviously be no need to take more than one position at all. (Obviously you would have to pair each position played with black and with white, in order to eliminate any bias in the position itself, to be able to do this.)
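A rough sketch of that check, with made-up numbers standing in for the per-position paired results (the input format here is purely an assumption for illustration):

Code: Select all

import random

random.seed(1)

# Fake data: position id -> list of paired (White game + Black game averaged)
# scores from repeated playouts of that position.
N_PAIRS = 4
positions = {p: [random.choice([0.0, 0.5, 1.0]) for _ in range(N_PAIRS)]
             for p in range(100)}

per_pos_means = [sum(v) / len(v) for v in positions.values()]
grand_mean = sum(per_pos_means) / len(per_pos_means)
between_var = (sum((m - grand_mean) ** 2 for m in per_pos_means)
               / (len(per_pos_means) - 1))

# Spread the per-position means would show from sampling noise alone:
# pooled single-game variance divided by the pairs per position
# (bounded above by 0.25 / N_PAIRS, i.e. the 0.5/sqrt(N) figure).
all_scores = [s for v in positions.values() for s in v]
pooled_mean = sum(all_scores) / len(all_scores)
game_var = sum((s - pooled_mean) ** 2 for s in all_scores) / (len(all_scores) - 1)
sampling_var = game_var / N_PAIRS

print(f"variance of per-position means:        {between_var:.4f}")
print(f"variance expected from sampling alone: {sampling_var:.4f}")
# A much larger observed variance would mean positions genuinely differ and many
# are needed; comparable numbers would mean one position could suffice.
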

I doubt if the current runs play each position often enough to make any statements about that, though.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: testing, 2 complete runs now...

Post by bob »

hgm wrote:
Fritzlein wrote:Anyway, it's way cool to see these results and to not have been proven wrong yet. From my perspective, we had better cancel the last two runs, because that will just be two more chances for me to be wrong. [/kidding]
Let me get this clear: what exactly was not proven wrong yet?

I understand that there is a difference between the runs we are seeing now and the two 25,000-game runs that started all this, namely that these new runs are using many more different starting positions. Your original statement was that more different positions would be needed to get the results of runs closer to the true strength of the engine (defined as the average over all positions and all opponents), but that it would not do anything toward getting the results of two runs closer to each other. Even when starting from a single position, the results of two runs would be as close to each other as when you take many different positions, or typically even closer (provided, of course, that all runs use the same single initial position). It is just that in that case they would both be close to the wrong value.

So how does the current result make you 'right'? I would say, if anything, it proves you wrong: you claimed that using 40 positions would be too few for the result to converge to the true value. But it seems that all the runs now converge to the same value as the second run, despite the fact that the latter used only 40 positions.

To show whether you are right (and how many positions are needed to reach a certain accuracy), one would have to know how much the results taken on one position differ from the results taken on another. If the results were always the same within the sampling error of 0.5/sqrt(N), no matter which position you took, there would obviously be no need to take more than one position at all. (Obviously you would have to pair each position played with black and with white, in order to eliminate any bias in the position itself, to be able to do this.)

I doubt if the current runs play each position often enough to make any statements about that, though.
He simply said that there was correlation between many of the games, caused by the fact that the same two opponents are playing the same positions repeatedly, and that many positions will be highly correlated while some will not. So he proposed playing about the same number of games, but without repeating the positions. He even suggested not playing both black and white from each position, and I am going to test that later. All the positions are WTM, so I can just run half with Crafty as White and half as Black and see what happens, cutting the total games in half to start with.

By the time this finishes tomorrow morning, each of the almost 4,000 positions will have been played 8 times, 4 with white, 4 with black. Not a lot of data to draw conclusions about individual positions. And certainly unimportant for what I am trying to measure, as one position is never going to be enough.

But in any case, he predicted much less of the variability seen in the other runs if we used more positions and fewer repeats to produce the same number of games. That is what we are currently testing. So far two tests look reasonable, but that has happened in the past as well.

I will also try to run some 800-game matches using these positions (or a subset) to see how the stability of those compares to the relative instability of the old 40-position matches...
Fritzlein

Re: testing, 2 complete runs now...

Post by Fritzlein »

hgm wrote:Let me get this clear: what exactly was not proven wrong yet?
What has not yet been proven wrong is my intuition that the results of repeated playouts of the same position with the same two opponents playing the same colors at the same time control are highly correlated. You, contradicting me, have espoused the theory that clock jitter makes those repeated playouts independent, except in the case where the starting position was very unbalanced to begin with. You haven't been proven wrong yet either. Since you and I have opposing intuitions, and neither of us has been proven wrong yet, it follows that neither of us has been proven right. But I intentionally did not say that I had been proven right; you put those words in my mouth.
hgm wrote:Your original statement was that more different positions would be needed to get the results of runs closer to the true strength of the engine (defined as the average over all positions and all opponents), but that it would not do anything toward getting the results of two runs closer to each other.
That's not quite correct. If there are two test runs that both diverge from a truth that is presumed to be constant, then making both test runs measure closer to the truth will necessarily also make them measure closer to each other. Yes, I understand that there are other ways to get the two runs to be close to each other without making them both close to the truth. Yes, I understand that if the results now happen to be all close together, it doesn't prove that they are all close to the truth.

If I am correct about the correlation between repeated playouts, then it explains how Bob's original test runs could each be farther from the truth than the bounds given by BayesElo. I have not explained why the first test run would fail to be close to the second run. Yes, I understand that correlated playouts should make the two runs closer to each other (at an equal distance from the truth) unless something changed between the two runs. Therefore I believe that something did change between the two runs that Bob used to kick off this discussion. As I have said before, I don't know what that change might have been. However, I don't believe it was a change that systematically helped Crafty or hurt Crafty.
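A toy simulation of that effect (my own construction with made-up parameters, not anything taken from these runs): if playouts within a run move together and whatever drives them shifts between runs, the run-to-run spread of the final score is far wider than an independent-games error bar suggests.

Code: Select all

import math
import random

random.seed(42)
N_POSITIONS, REPEATS, RUNS = 40, 80, 200   # 40 positions x 80 repeats = 3200 games/run

def run_score(correlated):
    """Average score of one simulated run of N_POSITIONS * REPEATS games."""
    total = 0.0
    for _ in range(N_POSITIONS):
        # In the correlated case this probability is fixed for the whole run
        # (repeated playouts move together) but redrawn for the next run.
        p = random.random()
        for _ in range(REPEATS):
            if not correlated:
                p = random.random()    # redraw every game -> independent results
            total += 1.0 if random.random() < p else 0.0
    return total / (N_POSITIONS * REPEATS)

for label, corr in (("independent", False), ("correlated", True)):
    scores = [run_score(corr) for _ in range(RUNS)]
    mean = sum(scores) / RUNS
    sd = math.sqrt(sum((s - mean) ** 2 for s in scores) / (RUNS - 1))
    print(f"{label:11s} run-to-run standard deviation of the score: {sd:.4f}")

# The independent case lands near 0.5/sqrt(3200) ~ 0.009; the correlated case is
# several times wider, which is how a run can fall outside confidence bounds
# computed as if every game were an independent sample.
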

Now if we remove the correlation from repeated playouts, and we still get out of bounds results, I won't know what to think. Maybe there is another source of correlation I didn't think of. (You have made more than one suggestion along these lines.) Maybe something is changing that helps/hurts Crafty more than other engines. I don't know; I will definitely be at a loss. So that's why I'm hoping all four of Bob's present runs are near each other. If they are, my theory is still alive and kicking.

Admittedly, even if we don't get out of bounds results now, it doesn't distinguish between the two cases (A) that whatever changed between runs the first time didn't change this time, and (B) that whatever changes between runs no longer messes up our results, because it is a random change, and we are now using lots of independent positions.

I expect you will still think that I am wrong, but with all those caveats in place, might you agree that I haven't yet been proven wrong?
xsadar
Posts: 147
Joined: Wed Jun 06, 2007 10:01 am
Location: United States
Full name: Mike Leany

Re: testing, 2 complete runs now...

Post by xsadar »

hgm wrote:
Fritzlein wrote:Anyway, it's way cool to see these results and to not have been proven wrong yet. From my perspective, we had better cancel the last two runs, because that will just be two more chances for me to be wrong. [/kidding]
Let me get this clear: what exactly was not proven wrong yet?

I understand that there is a difference between the runs we are seeing now and the two 25,000-game runs that started all this, namely that these new runs are using many more different starting positions. Your original statement was that more different positions would be needed to get the results of runs closer to the true strength of the engine (defined as the average over all positions and all opponents), but that it would not do anything toward getting the results of two runs closer to each other. Even when starting from a single position, the results of two runs would be as close to each other as when you take many different positions, or typically even closer (provided, of course, that all runs use the same single initial position). It is just that in that case they would both be close to the wrong value.

So how does the current result make you 'right'? I would say, if anything, it proves you wrong: you claimed that using 40 positions would be too few for the result to converge to the true value. But it seems that all the runs now converge to the same value as the second run, despite the fact that the latter used only 40 positions.
Am I missing something here? You're looking at the value for Crafty 22.2 in the second run with 40 positions, but shouldn't we be looking at the values for the opponents too? Unless I'm confused about how to read the data, they're mostly just outside the expected ranges (with Fruit's values being way off), which suggests to me (but I'm no statistician) that the apparent accuracy of the Crafty 22.2 value was at least as likely due to luck as to the accuracy of the test, and could just as easily have been just outside the range like the others.
To show whether you are right (and how many positions are needed to reach a certain accuracy), one would have to know how much the results taken on one position differ from the results taken on another. If the results were always the same within the sampling error of 0.5/sqrt(N), no matter which position you took, there would obviously be no need to take more than one position at all. (Obviously you would have to pair each position played with black and with white, in order to eliminate any bias in the position itself, to be able to do this.)

I doubt if the current runs play each position often enough to make any statements about that, though.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: testing, 3 complete runs now...

Post by bob »

Code: Select all

Tue Aug 12 00:49:44 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   108    7    7  7782   67%   -21   20%
   2 Fruit 2.1               62    7    6  7782   61%   -21   23%
   3 opponent-21.7           25    6    6  7780   57%   -21   33%
   4 Glaurung 1.1 SMP        10    6    6  7782   54%   -21   20%
   5 Crafty-22.2            -21    4    4 38908   46%     4   23%
   6 Arasan 10.0           -185    7    7  7782   29%   -21   19%
Tue Aug 12 11:36:10 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   110    6    7  7782   67%   -19   21%
   2 Fruit 2.1               63    6    7  7782   61%   -19   23%
   3 opponent-21.7           26    6    6  7782   57%   -19   33%
   4 Glaurung 1.1 SMP         7    6    7  7782   54%   -19   20%
   5 Crafty-22.2            -19    4    3 38910   47%     4   23%
   6 Arasan 10.0           -187    6    7  7782   28%   -19   19%
Wed Aug 13 00:53:43 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   109    7    6  7782   67%   -16   20%
   2 Fruit 2.1               63    6    7  7782   61%   -16   24%
   3 opponent-21.7           23    6    6  7781   56%   -16   32%
   4 Glaurung 1.1 SMP         3    6    7  7782   53%   -16   21%
   5 Crafty-22.2            -16    4    3 38909   47%     3   23%
   6 Arasan 10.0           -182    7    7  7782   28%   -16   19%
Third run is complete and given above. I also have a 15-minute "snapshot" program running that grabs the Elo data every 15 minutes, so that I can see whether the numbers stabilize in any usable way before the test finishes...

More tomorrow when test 4 is done. But this certainly does look better, with one minor issue...

It appears to be a difficult task to measure small changes. Crafty's rating varies from -16 to -21 across the runs, which is more noise than the changes I am actually hoping to measure... it might be a hopeless idea to try to measure very small changes in strength.
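As a back-of-the-envelope illustration of why that is hard (my own arithmetic, not output from these runs), the number of games needed before the 2-sigma error bar on a single version's Elo shrinks to a given width grows quickly as the target shrinks:

Code: Select all

import math

def games_for_elo_resolution(target_elo, per_game_sd=0.5):
    """Games needed so that 2 standard errors of one version's Elo ~= target_elo.

    per_game_sd = 0.5 is the worst case (all decisive games); draws reduce it.
    Near equal strength the Elo curve has slope ln(10)/1600 score per Elo point.
    """
    slope = math.log(10) / 1600             # d(score)/d(Elo) at equal strength
    se_score_allowed = target_elo * slope / 2
    return math.ceil((per_game_sd / se_score_allowed) ** 2)

for elo in (10, 5, 3, 1):
    print(f"+/-{elo:2d} Elo at 2 sigma: about {games_for_elo_resolution(elo):,} games")

# Comparing A against A' doubles the variance of the difference, so reliably
# resolving a 1-2 Elo change takes on the order of a million games or more.
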
Uri Blass
Posts: 10905
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: testing, 3 complete runs now...

Post by Uri Blass »

bob wrote:

Code: Select all

Tue Aug 12 00:49:44 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   108    7    7  7782   67%   -21   20%
   2 Fruit 2.1               62    7    6  7782   61%   -21   23%
   3 opponent-21.7           25    6    6  7780   57%   -21   33%
   4 Glaurung 1.1 SMP        10    6    6  7782   54%   -21   20%
   5 Crafty-22.2            -21    4    4 38908   46%     4   23%
   6 Arasan 10.0           -185    7    7  7782   29%   -21   19%
Tue Aug 12 11:36:10 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   110    6    7  7782   67%   -19   21%
   2 Fruit 2.1               63    6    7  7782   61%   -19   23%
   3 opponent-21.7           26    6    6  7782   57%   -19   33%
   4 Glaurung 1.1 SMP         7    6    7  7782   54%   -19   20%
   5 Crafty-22.2            -19    4    3 38910   47%     4   23%
   6 Arasan 10.0           -187    6    7  7782   28%   -19   19%
Wed Aug 13 00:53:43 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   109    7    6  7782   67%   -16   20%
   2 Fruit 2.1               63    6    7  7782   61%   -16   24%
   3 opponent-21.7           23    6    6  7781   56%   -16   32%
   4 Glaurung 1.1 SMP         3    6    7  7782   53%   -16   21%
   5 Crafty-22.2            -16    4    3 38909   47%     3   23%
   6 Arasan 10.0           -182    7    7  7782   28%   -16   19%
Third run is complete and given above. I also have a 15-minute "snapshot" program running that grabs the Elo data every 15 minutes, so that I can see whether the numbers stabilize in any usable way before the test finishes...

More tomorrow when test 4 is done. But this certainly does look better, with one minor issue...

It appears to be a difficult task to measure small changes. Crafty's rating varies from -16 to -21 across the runs, which is more noise than the changes I am actually hoping to measure... it might be a hopeless idea to try to measure very small changes in strength.
If Crafty's real rating is -18.5, then no result is out of bounds.

Nothing seems strange in the results so far, and one result that is slightly out of bounds is to be expected after enough tries, purely on statistical grounds.

Uri