New testing thread

Discussion of chess software programming and technical issues.

Moderator: Ras

Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: 4 sets of data

Post by Sven »

Uri Blass wrote:
hgm wrote:
bob wrote:Got anything useful to offer? That's not what this was about. We were testing to see if round-robin offered more reliable ratings than just C vs world.
Why would you want to test such a thing, which trivially follows from first principles? Would you also play a game of Chess twice, and then use BayesElo to 'test' if 1+1 equals 2???

I suggest not to insult Bob.
This is a good idea, of course ...
I noticed that other people commented nonsense [...]
... but I think it was a bad idea, Uri, to write this.

I assume that your intent was not to insult those (nameless?) "other people", although they might be tempted to think it was. (I hope you do remember the names of those "other people".)
Note that I did not study the exact way the rating program works, but it is obvious that a smaller error for Crafty after games between non-Crafty versions suggests some bug in the rating program, because I see no way this information can help to get a better estimate of Crafty's rating.
1. If you see no way then I suggest that you try to understand what other people are writing, and why they do it. To suspect some software bug instead appears counter-productive to me.

2. What is obvious for you may be non-obvious for others, and they may have good reasons for it.

3. I don't know whether Bob used the "covariance" command of BayesElo that Rémi described, which might give a better calculation since, as far as I understood, it involves the opponents' error margins when determining the error margin of one player. I expect he didn't, and in that case I would be interested to see the results of applying "covariance" to the same sets of games.

4. My point was that for me it was (and still is) "obvious" not only
a) that the opponent's ratings become more reliable by a round robin compared to everyone playing only one opponent, but also
b) that the error margins for the testing candidate (Crafty) decrease significantly when playing against opponents whose error margins decrease, too.

4.a) is a fact that is already confirmed by Bob's new data and also by Rémi.

4.b) is still open for me, I wait for 3. first.

Again, for me 4.b) denotes a very basic and important principle of the Elo system, if not of most chess rating systems. When I play rated games, results against opponents with a stable rating are more meaningful than those against inexperienced opponents with a non-established rating due to few games (edit: or due to few opponents). So if I get a rating based on many games against unstable opponents, my rating is worth less than if I had played against more stable opponents. This is most natural for me. Do you agree here, Uri?

Transferring this to our computer chess situation is easy since here we have the special case of "rating a group of unrated players" by evaluating all relevant games at once. But the reliability issue remains the same: more games and more opponents = more rating stability, more opponents with stable ratings = even more own rating stability.

It is, of course, another point whether rating software really works this way. At the moment I think that you can achieve this goal with the "covariance" command of BayesElo. When I originally made my proposal to do that round robin test (in fact the proposal came from Richard Allbert, btw) I was not aware that there are different possible levels of rating quality that can be achieved by different BayesElo commands, so I thought that we would see the better error margins for Crafty immediately.
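As a sanity check of 4.b) that does not depend on BayesElo internals, here is a minimal Monte Carlo sketch. Everything in it is an assumption made for illustration: five opponents with made-up true ratings, 40 games per pairing, a draw-free logistic model, and a crude gradient-ascent maximum-likelihood fit standing in for BayesElo. It measures the run-to-run spread of the candidate's estimated rating (relative to the opponents' average) with and without the opponents' internal games:

import numpy as np

rng = np.random.default_rng(1)

def expected(d):
    # expected score for a rating advantage of d Elo (logistic curve, no draws)
    return 1.0 / (1.0 + 10.0 ** (-d / 400.0))

def fit_elo(i_idx, j_idx, s, n_players, iters=1500, lr=20.0):
    # crude gradient ascent on the log-likelihood; the opponents are anchored to
    # average 0, so r[0] is the candidate's rating relative to that average
    r = np.zeros(n_players)
    for _ in range(iters):
        e = expected(r[i_idx] - r[j_idx])
        g = (np.bincount(i_idx, weights=s - e, minlength=n_players)
             - np.bincount(j_idx, weights=s - e, minlength=n_players))
        r += lr * g / len(s)
        r -= r[1:].mean()
    return r

true_r = np.array([20.0, -40.0, -20.0, 0.0, 20.0, 40.0])   # index 0 = the candidate
N = 40                                                      # games per pairing

def one_run(round_robin):
    pairs = [(0, j) for j in range(1, 6)]                   # candidate vs. world
    if round_robin:
        pairs += [(b, c) for b in range(1, 6) for c in range(b + 1, 6)]
    i_idx = np.repeat([p[0] for p in pairs], N)
    j_idx = np.repeat([p[1] for p in pairs], N)
    s = (rng.random(len(i_idx)) < expected(true_r[i_idx] - true_r[j_idx])).astype(float)
    return fit_elo(i_idx, j_idx, s, len(true_r))[0]

for rr in (False, True):
    est = [one_run(rr) for _ in range(200)]
    print("%-22s mean %6.1f Elo   run-to-run sd %4.1f Elo"
          % ("with round robin" if rr else "candidate vs. world", np.mean(est), np.std(est)))

The two printed spreads are exactly the quantity that 4.a) and 4.b) argue about, so a toy run like this is a cheap way to check the point empirically before burning cluster time.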

I now propose to return to interesting and important parts of the discussion, and to stop the personal parts.

I also propose to be very careful when using words like "nonsense" in the future.

Sven
Uri Blass
Posts: 10892
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: 4 sets of data

Post by Uri Blass »

Sven Schüle wrote:
snipped
I understand what you say and I can accept that saying nonsense is not a good idea.

I still do not think that games between the opponents of Crafty help to get better numbers, and more games of Crafty against the opponents are clearly more productive.

I will respond to points 4)a) and 4)b)

1) I agree with 4)a).
2) I do not agree, at least with the "significantly" part of 4)b), because Crafty played the same number of games against every opponent, and the average rating of the opponents does not change through games between themselves; so the rating of Crafty did not change, or almost did not change (looking at the data, I see that the rating of Crafty did not change by more than one point).

If you know from the beginning that the rating of Crafty is not going to change, then it is clear that the error cannot change either.

Maybe the rating can change slightly, because the rating is not based on the average of the opponents: a 2000 player may score about half a point out of two games against opponents rated 3000 and 2000, while the same player scores close to 0 points against two opponents rated 2500. But this is an extreme case, and in practice, when the differences are not very big, you can predict results based on the average of the opponents.
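The numbers in this example can be checked with the standard logistic expectation (the ratings are just the ones from the example above):

def expected(me, opp):
    # logistic Elo expectation: expected score of "me" against "opp"
    return 1.0 / (1.0 + 10.0 ** ((opp - me) / 400.0))

print("2000 vs {3000, 2000}: %.1f%%" % (50 * (expected(2000, 3000) + expected(2000, 2000))))
print("2000 vs {2500, 2500}: %.1f%%" % (50 * (expected(2000, 2500) + expected(2000, 2500))))

Both opponent pairs average 2500, yet the expected scores come out near 25% and 5%, which is the nonlinearity being described here.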

I could not think of an example where you know the rating of Crafty more accurately based on games between non-Crafty opponents.

If you can show me a case where Crafty's rating relative to the opponents becomes significantly higher or significantly lower after adding non-Crafty games, it may convince me that I was wrong in my assumption.

Suppose that the average rating of all opponents is fixed at 2500.
Suppose that A scores 40% against B in 10 games
and 60% against C in 10 games.

The question is whether the result of B against C can change A's rating to something different from 2500.
It seems clear to me that the answer is negative, because scoring 50% against 2 opponents with an average rating of 2500 does not change the performance.

Maybe things are different for scores that are significantly different from 50%, but this was not the case for Crafty.
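This thought experiment can also be run numerically. The sketch below assumes a draw-free logistic Elo model and a plain gradient-ascent maximum-likelihood fit (a stand-in for BayesElo, not the real thing); the A-B and A-C scores are the ones from the example, and the B-C results are made up. The opponents' average is pinned to 0, so the printed number is A's rating relative to that average:

def expected(d):
    # expected score for a rating advantage of d Elo (logistic model, no draws)
    return 1.0 / (1.0 + 10.0 ** (-d / 400.0))

def fit(games, steps=5000, lr=10.0):
    # plain gradient ascent on the log-likelihood; B and C are forced to
    # average 0, so r["A"] is A's rating relative to the opponents' average
    r = {"A": 0.0, "B": 0.0, "C": 0.0}
    for _ in range(steps):
        g = {"A": 0.0, "B": 0.0, "C": 0.0}
        for i, j, wins, n in games:
            e = expected(r[i] - r[j])
            g[i] += wins - n * e
            g[j] -= wins - n * e
        for k in r:
            r[k] += lr * g[k] / len(games)
        shift = (r["B"] + r["C"]) / 2.0          # re-anchor the opponents' average to 0
        for k in r:
            r[k] -= shift
    return r

base = [("A", "B", 4, 10), ("A", "C", 6, 10)]    # A: 40% vs B, 60% vs C

for bc in (None, 2, 5, 8):                        # B's wins against C in 10 games
    games = base if bc is None else base + [("B", "C", bc, 10)]
    r = fit(games)
    print("B vs C:", "none" if bc is None else "%d/10" % bc,
          "  A relative to opponent average: %+6.1f Elo" % r["A"])

Because A's overall score in this example is exactly 50%, the logistic fit keeps A glued to the opponents' average in this two-opponent setup no matter what B and C do against each other; the picture can only start to shift when A's overall score is far from 50%, which is precisely the caveat above.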

Uri
hgm
Posts: 28388
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Correlated data discussion

Post by hgm »

bob wrote:
hgm wrote:Seems to me the point made above is not relevant. Playing at 10,000,000 nodes is not the same experiment as playing at 10,010,000 nodes.

However, it _does_ represent what happens when you use a fixed time, since the nodes vary because of time jitter. That was his point.
Indeed. So his point was not relevant.
hgm wrote: Each match is a sample of a different process, each process having its own standard deviation. Which is actually zero for totally deterministic node-count-based TC games. So the results actually lie an infinite number of SDs apart, making it 100% certain that Crafty has a better performance against these opponents, with these starting positions, at the slightly different node count. Great! But of course meaningless, as the sample of opponents and positions was too small for this result to have any correlation with the performance on all positions against all possible opponents.

But none of it has any bearing on the reported 6-sigma deviation between the 2x25,000 games: there the conditions were supposed to be identical.
All conditions except for time measurements which are _never_ identical.
They are identical in the statistical sense that in both runs you had the same probability of a certain amount of jitter. If you flip a coin 10,000 times, and then you flip the same coin another 10,000 times, the fact that they land differently in the second series does not make the conditions of the experiment different. Only using another coin, one that is loaded differently, would.
Bullshit. That two different experiments give a different answer can never be used to explain why repeating the same experiment gives a different answer.
So let me get this right. In your statistical definition, if I run three tests with three different node counts, the test is of no use. But if I run three tests with different time measurements, which also leads to different node counts, that is useful.
'Use' has not been defined in this context. But repeating the same game 10,000 times at the same node count is of no use, except for those who want to wear down their computer and burn electricity. Repeating an experiment at the same time control can be useful, if the time jitter is large enough to cause enough variability that the games are not all the same. If most of the games are identical, or identical for, on average, 30 moves before they deviate (as they were in the case of micro-Max 1.6 vs Eden 0.0.11 discussed on this forum some time ago), it becomes a lot less useful, but if you are prepared to go on long enough and eliminate the duplicate games, you might still get the result you are after (in a very wasteful way). Playing at a large number of different node counts amounts to the same thing.
And the two tests have nothing in common whatsoever??? The number of different node counts is not that great (I could run a hundred 3-second searches to see how many different node counts I get, if you want), and then factor that into N moves in the game... So his suggested experiment is just a small subset of what might happen, but it _is_ a subset.
Yes, so? Looking twice at the same subset does not give you any more information than looking once, and will not tell you how much the result might change if you would look at another randomly chosen subset. The 0.5/sqrt(N) upper bound on standard deviation only applies to a sample of N independent games. By repeating the same test 80 times, 79 incarnations of each game become fully dependent on the first one, and you cannot use the total number of games to calculate the standard deviation anymore.
hgm
Posts: 28388
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: 4 sets of data

Post by hgm »

bob wrote:Are you following the discussion? Someone posed the question: "If you run a round robin, how will that compare to the results of just C vs World, since it will not (according to Remi) affect the rating of Crafty itself?" And I simply ran a test to answer that question. A later question was "would this stabilize the numbers better and reduce the variation?" So I ran the test four times to see. Answer is there is still a ton of variation.
OK, I see. If someone suggests you waste your time, you are happy to oblige. If someone suggests you do serious data collection and make the result available so they can analyze what went wrong, of course you do not. My mistake for missing something so obvious...

As to the "ton of variation": define "ton". If I look at the same data, I would say the results are pretty stable, well within their statistical bounds (as BayesElo so usefully quotes). Note that not only the rating of Crafty remains unaffected by the games between the others, but also the error-bar on this rating. And that actual results ly well within the error bars, distributed as they should (in so far you can deduce a distribution from 4 data points).

So you got what you asked for. Why complain?

You can of course try similar runs with more different positions, more different opponents, take all those measures to create more variation between the games of a single run, and guess what? The variation in the score of the runs will not even go down an ounce by increasing the variation in games...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Correlated data discussion

Post by bob »

Karl sent me a further explanation late last night and asked me to post it here. He has registered, but the registration request has not been handled yet (I have no idea why it takes 4-5 days to handle this either, but that is a topic for a discussion in the moderator's forum which I will start in a few minutes).

In any case, here are the contents of his email, posted at his request:

================quote on=================================
Robert Hyatt wrote:I received an email from a mathematician overnight that had some interesting insight into our discussions. Since they
could be taken as a bit insulting to some posters in the discussion, I have invited him to come and participate directly.
I was unguarded in what I wrote to you by e-mail, Dr. Hyatt, but here in the public forum I hope to be a model of civility. Shame on me
if I raise the emotional temperature of the discussion instead of discussing calmly and rationally.

It seems that some of what you posted from my letter was not sufficiently clear, based on the immediate responses it drew:
Uri Blass wrote:Correlation between 2 variables that get 1 with probability of 1 is not defined
and
H. G. Muller wrote:Note that statistically speaking, such games are not dependent or correlated at all. A sampling that returns always
exactly the same value obeys all the laws for independent sampling, with respect to the standard deviation and the central limit
theorem.
To clarify, consider the situation where the same starting positions are used repeatedly with fixed opponents and fixed node counts,
guaranteeing the outcomes are the same in every run. The results of the first trial run are indeed correlated to the results of every
subsequent trial run. For simplicity let me assume a position set of 40, only one opponent, only one color, and no draws. The results
of the first trial run might be

X = 1110010001000011011111000010011111111101.

Then the result of the second trial run will be the same

Y = 1110010001000011011111000010011111111101.

The sample coefficient of correlation is well-defined, and may be calculated (using the formula on Wikipedia) as

(N * sum(x_i * y_i) - sum(x_i)*sum(y_i)) / (sqrt(N * sum(x_i^2) - sum(x_i)^2) * sqrt(N * sum(y_i^2) - sum(y_i)^2))
= (40*23 - 23*23) / (sqrt(40*23 - 23*23) * sqrt(40*23 - 23*23))
= (40*23 - 23*23) / (40*23 - 23*23)
= 1

Thus the coefficient of correlation between the first run and the second is unity, corresponding to the intuitive understanding that the
two trials are perfectly correlated.
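Karl's hand calculation is easy to verify numerically (numpy is assumed available; the bit string is the one given above):

import numpy as np

bits = "1110010001000011011111000010011111111101"
x = np.array([int(c) for c in bits], dtype=float)
y = x.copy()                      # the second run reproduces the first exactly

print("wins:", int(x.sum()), "of", len(x))             # 23 of 40
print("sample correlation:", np.corrcoef(x, y)[0, 1])  # 1.0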
These repeated trials do not, in fact, obey all the laws of independent sampling. In particular, there is no guarantee that the sample
mean will converge to the true mean of the random variable X as the sample size goes to infinity. I took these numbers from a random
number generator which was supposed to be 50% zeros and 50% ones. Intuitively 23 of 40 is not an alarming deviation from the true mean,
but if we repeat the trial a thousand times, 23000 of 40000 would be statistically highly improbable for independent samples. For the mathematically inclined, we can
make precise the calculation that repeated trials provide "no new information".

For our random variable X taking values 1 and 0 with assumed mean 0.5, each trial has variance (1 - 0.5)^2 = (0 - 0.5)^2 = 0.25. The
variance of the sum of the first forty trials, since they are independent with no covariance, is simply the sum of the variances, i.e.
40*0.25=10. The variance of the mean is 10/(40^2) = 1/160. The standard deviation of the mean is sqrt(1/160) ~ 0.079

Now we add in random variable Y. The covariance of X and Y is E(XY) - E(X)E(Y) = 0.5 - 0.5*0.5 = 0.25. When adding two random
variables, the variance of the sum is the sum of the variances plus twice the sum of the covariance. Thus the variance of the sum of our
eighty scores will be 40 * 0.25 + 2 * 40 * 0.25 + 40 * 0.25 = 40. The variance of the mean will be 40/(80^2) = 1/160. The standard
deviation of the mean is sqrt(1/160) ~ 0.079.

If we have M perfectly correlated trials, there will be covariance between each pair of trials. Since there are M(M-1)/2 pairs of
trials, there will be this many covariance terms in the formula for the variance of the sum. Thus the variance of the sum will be M * 40
* 0.25 + 2 * 40 * 0.25 * M(M-1)/2 = 10M + 10(M^2 - M) = 10M^2. The variance of the mean will be 10(M^2)/((40M)^2) = (1/160). The
standard deviation of the mean is sqrt(1/160) ~ 0.079.
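The same claim can be checked by brute force: draw one genuine 40-game run, replicate it M times, and look at how the mean spreads over many repetitions (fair-coin games and numpy are assumed here):

import numpy as np

rng = np.random.default_rng(0)
print("theory:", (1 / 160) ** 0.5)        # ~ 0.079

for M in (1, 2, 80):
    means = []
    for _ in range(20000):
        run = rng.integers(0, 2, size=40)  # one genuine 40-game run of a fair coin
        sample = np.tile(run, M)           # M perfectly correlated copies of it
        means.append(sample.mean())        # mean of the copies = mean of the single run
    print("M =", M, " sd of mean:", np.std(means))

The spread stays at about 0.079 for every M, because the mean of M identical copies is simply the mean of the single run.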

No matter how large the test size, we should expect results of 50% plus or minus 7.9%. The central limit theorem implies (among other things) that the standard deviation of the sample mean goes to zero as the sample grows, which isn't happening here. I apologize if the detail seems pedantic; it is in service of
clarity which my original letter obviously did not provide.

The second point I would like to make regards the "randomness" of our testing. Randomness has been presented as both the friend and the
foe of accurate measurement. In fact, both intuitions are correct in different contexts.

If we want to measure how well engine A plays relative to how well engine A' plays, then we want to give them exactly the same test.
Random changes between the first test and the second one will only add noise to our signal. In particular, if A is playing against
opponents limited by one node count and A' is playing opponents limited by a different node count, then they are not taking the same
test. By the same token, if both A and A' play opponents at the same time control, but small clock fluctuations change the node counts
from the opponent A played to the opponent A' played, we can expect that to add noise to our signal. The fact that the two engines are
playing slightly different opposition makes the difference between A and A' slightly harder to detect.


If we want to measure how well engine A plays in the absolute, however, then "randomness" (or more precisely independence between
measurements) is a good thing. We want to do everything possible to kill off correlations between games that A plays and other games
that A plays. This can include having the opponent's node count set to a random number, so that there is less correlation between games
that reuse the same opponent. That said, if we randomize the opposing node count we should save the node count we used, so we can use
exactly the same node count for the same game when A' plays in our comparison game.

I think, therefore, that test suite results will be most significant if the time control is taken out of the picture completely. If the
bots are limited by node count rather than time control, we can control the randomness so that we get the "good randomness" (achieving
less correlation among the games of one engine) and can simultaneously eliminate the "bad randomness" (removing noise from the comparison
between engines).
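One possible shape for the bookkeeping this implies is sketched below; the file name, node-count range, opponent names and position count are all placeholders, not anything Bob or Karl actually uses. The idea is simply to draw each opponent's node limit once from a seeded generator, save the schedule, and replay the identical schedule when A' takes the test:

import json, os, random

OPPONENTS = ["opp1", "opp2", "opp3", "opp4", "opp5"]   # placeholder engine names
POSITIONS = range(40)                                   # 40 opening positions
SCHEDULE_FILE = "node_schedule.json"                    # made-up file name

def make_schedule(seed=12345, lo=9_500_000, hi=10_500_000):
    rng = random.Random(seed)               # seeded, so the draw is reproducible
    return {f"{opp}/{pos}/{color}": rng.randint(lo, hi)
            for opp in OPPONENTS
            for pos in POSITIONS
            for color in ("w", "b")}

if os.path.exists(SCHEDULE_FILE):            # later run (engine A'): reuse the saved schedule
    with open(SCHEDULE_FILE) as f:
        schedule = json.load(f)
else:                                        # first run (engine A): create and save it
    schedule = make_schedule()
    with open(SCHEDULE_FILE, "w") as f:
        json.dump(schedule, f)

# schedule["opp3/17/b"] is the node limit that opponent gets in that game,
# identical for the A run and the A' run.
print(len(schedule), "games, example limit:", schedule["opp3/17/b"])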

I've rambled quite a bit already, but I want to make two final points before wrapping up. First, the only way I see to achieve "good
randomness" is by not re-using positions at all, as several people have suggested. Even playing the same position against five different
opponents twice each will introduce correlations between the ten games of each position, and keep the standard deviation in measured
performance higher than it would be for perfectly independent results.

Second, although I am not intimately familiar with BayesElo, I am quite confident that all the error bounds it gives are calculated on
the assumption that the input games are independent. If the inputs to BayesElo are dependent, it will necessarily fool BayesElo into
giving confidence intervals that are too narrow.

Thank you for allowing me to participate in this discussion.
--Karl Juhnke
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: 4 sets of data

Post by bob »

hgm wrote:
bob wrote:Are you following the discussion? Someone posed the question: "If you run a round robin, how will that compare to the results of just C vs World, since it will not (according to Remi) affect the rating of Crafty itself?" And I simply ran a test to answer that question. A later question was "would this stabilize the numbers better and reduce the variation?" So I ran the test four times to see. Answer is there is still a ton of variation.
OK, I see. If someone suggests you waste your time, you are happy to oblige. If someone suggests you do serious data collection and make the result available so they can analyze what went wrong, of course you do not. My mistake for missing something so obvious...

As to the "ton of variation": define "ton". If I look at the same data, I would say the results are pretty stable, well within their statistical bounds (as BayesElo so usefully quotes). Note that not only the rating of Crafty remains unaffected by the games between the others, but also the error-bar on this rating. And that actual results ly well within the error bars, distributed as they should (in so far you can deduce a distribution from 4 data points).

So you got what you asked for. Why complain?

You can of course try similar runs with more different positions, more different opponents, take all those measures to create more variation between the games of a single run, and guess what? The variation in the score of the runs will not even go down an ounce by increasing the variation in games...
At times, discussing things with you is like discussing the _same_ technical ideas with a 6-year-old. "Ton" == "Too much to be useful". That has been what this discussion has been about from post #1. Never changed. Never will. Do you think you can follow that simple idea for a while? I want to be able to run the fastest possible test and measure minor improvements in a program. That's all I have been trying to discover how to do since my very first post. And while this data might satisfy your "well within statistical bounds", I can't use it for the purpose I defined.

So is it _possible_ that we stay on _that_ train of thought for a while, and stop going off into the twilight zone most of the time? I want to measure small improvements. That data won't let me do so.
xsadar
Posts: 147
Joined: Wed Jun 06, 2007 10:01 am
Location: United States
Full name: Mike Leany

Re: Correlated data discussion

Post by xsadar »

hgm wrote:
bob wrote:
hgm wrote:Seems to me the point made above is not relevant. Playing at 10,000,000 nodes is not the same experiment as playing at 10,010,000 nodes.

However, it _does_ represent what happens when you use a fixed time, since the nodes vary because of time jitter. That was his point.
Indeed. So his point was not relevant.
hgm wrote: Each match is a sample of a different process, each process having its own standard deviation. Which is actually zero for totally deterministic node-count-based TC games. So the results actually lie an infinite number of SDs apart, making it 100% certain that Crafty has a better performance against these opponents, with these starting positions, at the slightly different node count. Great! But of course meaningless, as the sample of opponents and positions was too small for this result to have any correlation with the performance on all positions against all possible opponents.

But none of it has any bearing on the reported 6-sigma deviation between the 2x25,000 games: there the conditions were supposed to be identical.
All conditions except for time measurements which are _never_ identical.
They are identical in the statistical sense that in both runs you had the same probability of a certain amount of jitter. If you flip a coin 10,000 times, and then you flip the same coin another 10,000 times, the fact that they land differently in the second series does not make the conditions of the experiment different. Only using another coin, one that is loaded differently, would.
Bullshit. That two different experiments give a different answer can never be used to explain why repeating the same experiment gives a different answer.
So let me get this right. In your statistical definition, if I run three tests with three different node counts, the test is of no use. But if I run three tests with different time measurements, which also leads to different node counts, that is useful.
'Use' has not been defined in this context. But repeating the same game 10,000 times at the same node count is of no use, except for those who want to wear down their computer and burn electricity. Repeating an experiment at the same time control can be useful, if the time jitter is large enough to cause enough variability that the games are not all the same. If most of the games are identical, or identical for, on average, 30 moves before they deviate (as they were in the case of micro-Max 1.6 vs Eden 0.0.11 discussed on this forum some time ago), it becomes a lot less useful, but if you are prepared to go on long enough and eliminate the duplicate games, you might still get the result you are after (in a very wasteful way). Playing at a large number of different node counts amounts to the same thing.
And the two tests have nothing in common whatsoever??? The number of different node counts is not that great (I could run a hundred 3-second searches to see how many different node counts I get, if you want), and then factor that into N moves in the game... So his suggested experiment is just a small subset of what might happen, but it _is_ a subset.
Yes, so? Looking twice at the same subset does not give you any more information than looking once, and will not tell you how much the result might change if you would look at another randomly chosen subset. The 0.5/sqrt(N) upper bound on standard deviation only applies to a sample of N independent games. By repeating the same test 80 times, 79 incarnations of each game become fully dependent on the first one, and you cannot use the total number of games to calculate the standard deviation anymore.
If I'm understanding you both correctly, I think you and Bob agree more than you realize. It sounds to me like you both agree with and are in fact trying to convince each other of the following two points:

1) If you run a test with many identical games, as in the node count example, the results aren't useful.
2) Likewise, if you run a test where many of the games are related somehow (but not identical), for whatever reason the results still may not be as useful as you'd like.

Also, if I'm understanding you both correctly (and if I'm not mixing up who has said what) this is where you disagree:

You are saying that the tests were set up wrong perhaps causing the results of one game to influence the results of subsequent games, or causing sets of games where one engine has an advantage or disadvantage over others.

Bob says that the nature of the testing method itself (which others are using too) is flawed and might be resulting in related games.


Am I understanding you both correctly? These threads are really hard to follow.
xsadar
Posts: 147
Joined: Wed Jun 06, 2007 10:01 am
Location: United States
Full name: Mike Leany

Re: Correlated data discussion

Post by xsadar »

snipped
The second point I would like to make regards the "randomness" of our testing. Randomness has been presented as both the friend and the
foe of accurate measurement. In fact, both intuitions are correct in different contexts.

If we want to measure how well engine A plays relative to how well engine A' plays, then we want to give them exactly the same test.
Random changes between the first test and the second one will only add noise to our signal. In particular, if A is playing against
opponents limited by one node count and A' is playing opponents limited by a different node count, then they are not taking the same
test. By the same token, if both A and A' play opponents at the same time control, but small clock fluctuations change the node counts
from the opponent A played to the opponent A' played, we can expect that to add noise to our signal. The fact that the two engines are
playing slightly different opposition makes the difference between A and A' slightly harder to detect.


If we want to measure how well engine A plays in the absolute, however, then "randomness" (or more precisely independence between
measurements) is a good thing. We want to do everything possible to kill off correlations between games that A plays and other games
that A plays. This can include having the opponent's node count set to a random number, so that there is less correlation between games
that reuse the same opponent. That said, if we randomize the opposing node count we should save the node count we used, so we can use
exactly the same node count for the same game when A' plays in our comparison game.

I think, therefore, that test suite results will be most significant if the time control is taken out of the picture completely. If the
bots are limited by node count rather than time control, we can control the randomness so that we get the "good randomness" (achieving
less correlation among the games of one engine) and can simultaneously eliminate the "bad randomness" (removing noise from the comparison
between engines).

I've rambled quite a bit already, but I want to make two final points before wrapping up. First, the only way I see to achieve "good
randomness" is by not re-using positions at all, as several people have suggested. Even playing the same position against five different
opponents twice each will introduce correlations between the ten games of each position, and keep the standard deviation in measured
performance higher than it would be for perfectly independent results.
This point makes sense, but I don't like it. That means we would need 10 times as many positions as I originally thought for an ideal test. Also, if we don't play the positions as both white and black, it seems (to me at least) to make it even more important that the positions be about equal for white and black. I hope you're kind enough, Bob, to make the positions you finally settle on (however many that may be) available to the rest of us.

At first I thought it wouldn't work so well if the different opponents all played from different positions, or if the positions weren't played from each color's perspective, but if you're only testing to see if engine A' is better or worse than engine A, I don't think it matters as long as A is tested identically to A'.
Of course, if you played A against A' to see which is better, then they would each need to play white and black for each position that they play against each other, but then count each position as a data point rather than each game.
Second, although I am not intimately familiar with BayesElo, I am quite confident that all the error bounds it gives are calculated on
the assumption that the input games are independent. If the inputs to BayesElo are dependent, it will necessarily fool BayesElo into
giving confidence intervals that are too narrow.

Thank you for allowing me to participate in this discussion.
--Karl Juhnke
Uri Blass
Posts: 10892
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Correlated data discussion

Post by Uri Blass »

I am going to read more and respond later but I understand the source of the misunderstanding.

I thought about correlation between the result of a game from position 1 and the result of a game from position 2, while Karl looked at different variables, namely results of the same position in different matches.

I agree that correlation of results from the same position may reduce the variance, but if this is the only correlation and the correlation coefficient is positive, then it can only reduce the total variance, and I was looking for an explanation of the big variance in the results.

Edit: From a quick look at Karl's post, he explains why repeating the same experiment is a bad idea for finding small changes, but it does not explain the results.

There are basically 2 questions:
1) Is it good to test in the way Hyatt tests?
2) Do you expect to get the same rating in different tests?

In the case that you always get 23-17 between equal programs, you may get a wrong rating, but your rating is always going to be the same.

In the case of Crafty the rating was not always the same, and the only logical reason is simply that Bob Hyatt did not repeat the same experiment.

Even if the cluster became on average 0.1% slower after many games, it means that the experiment is different.

Uri
Last edited by Uri Blass on Sat Aug 09, 2008 7:11 pm, edited 2 times in total.
hgm
Posts: 28388
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: 4 sets of data

Post by hgm »

bob wrote:At times, discussing things with you is like discussing the _same_ technical ideas with a 6-year-old.
I am not surprised. You seem to learn as much from any discussion as from talking to a brick wall...
"Ton" == "Too much to be useful". That has been what this discussion has been about from post #1. Never changed. Never will. Do you think you can follow that simple idea for a while? I want to be able to run the fastest possible test and measure minor improvements in a program.
Well, if you were able to compute a square root, you would have known that 800 games gives you a standard error of 1.7% ~ 12 Elo, so if your idea of a small change is <20 Elo, that is too much to be useful. I would have thought a 6-year-old could understand the difference between smaller than 20 and larger than 20.
That's all I have been trying to discover how to do since my very first post. And while this data might satisfy your "well within statistical bounds" but I can't use it for the purpose I defined.
Yes, that was obvious from the beginning. So why do you persistently try? Doesn't seem very useful to me...
So is it _possible_ that we stay on _that_ train of thought for a while, and stop going off into the twilight zone most of the time. I want to measure small improvements. That data won't do so.
Play enough independent games, then. If you want to reliably see a difference of 7 Elo, 800 will never be enough, as computing a square root will tell you. And no number of games will ever be enough if you use only 40 positions or 5 opponents. The sum of the statistical noise caused by opponent sampling, position sampling and game sampling will have to be well below 1% (score-wise). As I told you ages ago.
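For reference, the numbers in this post follow from the 0.5/sqrt(N) worst-case per-game standard deviation and the usual logistic Elo curve; here is a small back-of-the-envelope script (the 2-sigma criterion is just one possible choice of threshold):

import math

def elo_per_score_point():
    # slope of Elo vs. score at 50%: d/ds of 400*log10(s/(1-s)) at s = 0.5
    return 400.0 / math.log(10) * 4.0         # ~695 Elo per unit score, ~7 per %

def std_error_elo(n_games):
    # worst-case (no draws, 50% score) standard error of N independent games, in Elo
    return 0.5 / math.sqrt(n_games) * elo_per_score_point()

print("800 games -> 1 sigma = %.1f Elo" % std_error_elo(800))   # ~12 Elo

target = 7.0                                  # Elo difference we want to resolve
n = 100
while 2 * std_error_elo(n) > target:          # require the difference to be 2 sigma
    n += 100
print("about %d independent games for a 2-sigma %g Elo resolution" % (n, target))

With that criterion you land near 10,000 independent games for a 7 Elo difference, and that is before adding the opponent- and position-sampling noise mentioned above.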