Karl sent me a further explanation late last night and asked me to post it here. He has registered, but the registration request has not been handled yet (I have no idea why it takes 4-5 days to handle this either, but that is a topic for a discussion in the moderator's forum which I will start in a few minutes).
In any case, here are the contents of his email, posted at his request:
================quote on=================================
Robert Hyatt wrote: I received an email from a mathematician overnight that had some interesting insight into our discussions. Since his comments could be taken as a bit insulting to some posters in the discussion, I have invited him to come and participate directly.
I was unguarded in what I wrote to you by e-mail, Dr. Hyatt, but here in the public forum I hope to be a model of civility. Shame on me
if I raise the emotional temperature of the discussion instead of discussing calmly and rationally.
It seems that some of what you posted from my letter was not sufficiently clear, based on the immediate responses it drew:
Uri Blass wrote: Correlation between 2 variables that get 1 with probability of 1 is not defined
and
H. G. Muller wrote: Note that statistically speaking, such games are not dependent or correlated at all. A sampling that returns always
exactly the same value obeys all the laws for independent sampling, with respect to the standard deviation and the central limit
theorem.
To clarify, consider the situation where the same starting positions are used repeatedly with fixed opponents and fixed node counts,
guaranteeing that the outcomes are the same in every run. The results of the first trial run are indeed correlated with the results of every
subsequent trial run. For simplicity, let me assume a position set of 40, only one opponent, only one color, and no draws. The results of the first trial run might be:
X = 1110010001000011011111000010011111111101.
Then the results of the second trial run will be the same:
Y = 1110010001000011011111000010011111111101.
The sample coefficient of correlation is well-defined, and may be calculated (using the formula on Wikipedia) as
(N * sum(x_i * y_i) - sum(x_i)*sum(y_i)) / (sqrt(N * sum(x_i^2) - sum(x_i)^2) * sqrt(N * sum(y_i^2) - sum(y_i)^2))
= (40*23 - 23*23) / (sqrt(40*23 - 23*23) * sqrt(40*23 - 23*23))
= (40*23 - 23*23) / (40*23 - 23*23)
= 1
Thus the coefficient of correlation between the first run and the second is unity, corresponding to the intuitive understanding that the
two trials are perfectly correlated.
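For anyone who would rather verify this by machine than by hand, here is a short Python sketch of the same calculation (my illustration only; the bit string is just the example above):

import math

# The identical 40-game result strings X and Y from the example above.
x = [int(c) for c in "1110010001000011011111000010011111111101"]
y = list(x)  # the second trial run is guaranteed to repeat the first

n = len(x)
sx, sy = sum(x), sum(y)
sxy = sum(a * b for a, b in zip(x, y))
sxx = sum(a * a for a in x)
syy = sum(b * b for b in y)

# The same sample correlation formula quoted from Wikipedia above.
r = (n * sxy - sx * sy) / (math.sqrt(n * sxx - sx * sx) * math.sqrt(n * syy - sy * sy))
print(r)  # prints 1.0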
These repeated trials do not, in fact, obey all the laws of independent sampling. In particular, there is no guarantee that the sample mean will converge to the true mean of the random variable X as the sample size goes to infinity. I took these numbers from a random number generator that was supposed to produce 50% zeros and 50% ones. Intuitively 23 of 40 is not an alarming deviation from the true mean, but if we repeat the trial a thousand times, 23000 of 40000 is statistically highly improbable. For the mathematically inclined, we can make precise the claim that repeated trials provide "no new information".
For our random variable X taking values 1 and 0 with assumed mean 0.5, each trial has variance 0.5*(1 - 0.5)^2 + 0.5*(0 - 0.5)^2 = 0.25. The variance of the sum of the first forty trials, since they are independent with no covariance, is simply the sum of the variances, i.e. 40*0.25 = 10. The variance of the mean is 10/(40^2) = 1/160. The standard deviation of the mean is sqrt(1/160) ~ 0.079.
Now we add in the random variable Y. The covariance of X and Y is E(XY) - E(X)E(Y) = 0.5 - 0.5*0.5 = 0.25. When adding two random variables, the variance of the sum is the sum of the variances plus twice their covariance. Thus the variance of the sum of our eighty scores will be 40*0.25 + 2*40*0.25 + 40*0.25 = 40. The variance of the mean will be 40/(80^2) = 1/160. The standard deviation of the mean is sqrt(1/160) ~ 0.079.
If we have M perfectly correlated trials, there will be covariance between each pair of trials. Since there are M(M-1)/2 pairs of trials, there will be this many covariance terms in the formula for the variance of the sum. Thus the variance of the sum will be M*40*0.25 + 2*40*0.25*M(M-1)/2 = 10M + 10(M^2 - M) = 10M^2. The variance of the mean will be 10(M^2)/((40M)^2) = 1/160. The standard deviation of the mean is sqrt(1/160) ~ 0.079.
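These back-of-the-envelope variances are easy to confirm by simulation. The following Python sketch (again my own illustration; the constants 40 and 0.5 come from the example above) replicates one batch of 40 coin flips M times and estimates the standard deviation of the overall mean. The correlated case stays near 0.079 no matter how large M grows, while genuinely independent games shrink like 1/sqrt(M):

import random
import statistics

def std_of_mean(m, independent, runs=2000):
    # Estimate the standard deviation of the sample mean over many
    # simulated experiments of 40*m games each.
    means = []
    for _ in range(runs):
        if independent:
            games = [random.randint(0, 1) for _ in range(40 * m)]
        else:
            batch = [random.randint(0, 1) for _ in range(40)]
            games = batch * m  # M perfectly correlated repeats of one run
        means.append(sum(games) / len(games))
    return statistics.pstdev(means)

for m in (1, 10, 100):
    print(m, std_of_mean(m, False), std_of_mean(m, True))
# correlated: ~0.079, ~0.079, ~0.079; independent: ~0.079, ~0.025, ~0.0079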
No matter how large the test size, we should expect results of 50% plus or minus 7.9%. The central limit theorem implies (among other things) that the standard deviation of the sample mean goes to zero as the number of independent samples grows, which isn't happening here. I apologize if the detail seems pedantic; it is in service of the clarity which my original letter evidently did not provide.
The second point I would like to make regards the "randomness" of our testing. Randomness has been presented as both the friend and the
foe of accurate measurement. In fact, both intuitions are correct in different contexts.
If we want to measure how well engine A plays relative to how well engine A' plays, then we want to give them exactly the same test.
Random changes between the first test and the second one will only add noise to our signal. In particular, if A is playing against
opponents limited by one node count and A' is playing opponents limited by a different node count, then they are not taking the same
test. By the same token, if both A and A' play opponents at the same time control, but small clock fluctuations change the node counts
from the opponent A played to the opponent A' played, we can expect that to add noise to our signal. The fact that the two engines are playing slightly different opposition makes the difference between A and A' slightly harder to detect.
If we want to measure how well engine A plays in the absolute, however, then "randomness" (or more precisely independence between
measurements) is a good thing. We want to do everything possible to kill off correlations between games that A plays and other games
that A plays. This can include having the opponent's node count set to a random number, so that there is less correlation between games
that reuse the same opponent. That said, if we randomize the opposing node count we should save the node count we used, so we can use
exactly the same node count for the same game when A' plays in our comparison game.
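A sketch of the bookkeeping I have in mind, in Python (the function name and the node-count range are hypothetical, purely to illustrate):

import random

def make_schedule(num_games, lo=800000, hi=1200000, seed=12345):
    # A fixed seed makes the schedule reproducible, so A' can later be
    # given exactly the same randomized opponent node counts as A.
    rng = random.Random(seed)
    return [rng.randint(lo, hi) for _ in range(num_games)]

schedule = make_schedule(40)
# play_game() stands in for however the tester actually launches a game:
# for position, nodes in enumerate(schedule):
#     play_game(engine="A", opponent_nodes=nodes, position=position)
# ...and later A' retakes the identical test:
#     play_game(engine="A'", opponent_nodes=nodes, position=position)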
I think, therefore, that test suite results will be most significant if the time control is taken out of the picture completely. If the
engines are limited by node count rather than time control, we can control the randomness so that we get the "good randomness" (achieving
less correlation among the games of one engine) and can simultaneously eliminate the "bad randomness" (removing noise from the comparison
between engines).
I've rambled quite a bit already, but I want to make two final points before wrapping up. First, the only way I see to achieve "good
randomness" is by not re-using positions at all, as several people have suggested. Even playing the same position against five different
opponents twice each will introduce correlations between the ten games of each position, and keep the standard deviation in measured
performance higher than it would be for perfectly independent results.
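To see the effect, here is one more Python sketch under a toy model of my own devising (the shared per-position bias is an assumption, not a measurement): each position carries a hidden winnability p, and every game played from that position shares it. Four hundred games from 400 distinct positions then give a tighter measurement than 400 games squeezed out of 40 positions:

import random
import statistics

def std_of_mean(positions, games_per_position, runs=4000):
    # Each position gets a shared hidden bias p; games played from the
    # same position are therefore positively correlated.
    means = []
    for _ in range(runs):
        results = []
        for _ in range(positions):
            p = min(1.0, max(0.0, random.gauss(0.5, 0.15)))
            results.extend(1 if random.random() < p else 0
                           for _ in range(games_per_position))
        means.append(sum(results) / len(results))
    return statistics.pstdev(means)

print(std_of_mean(400, 1))   # 400 games, all fresh positions: ~0.025
print(std_of_mean(40, 10))   # 400 games, ten per position: ~0.034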
Second, although I am not intimately familiar with BayesElo, I am quite confident that all the error bounds it gives are calculated on
the assumption that the input games are independent. If the inputs to BayesElo are dependent, that dependence will necessarily fool BayesElo into giving confidence intervals that are too narrow.
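I cannot speak to BayesElo's internals, so as a stand-in here is the ordinary binomial error bar, which makes the same independence assumption (a toy illustration in Python, not BayesElo itself):

import math

def binomial_error(num_games, score=0.5):
    # Standard error of the mean score, valid only for independent games.
    return math.sqrt(score * (1 - score) / num_games)

print(binomial_error(40))     # ~0.079: honest for 40 independent games
print(binomial_error(40000))  # ~0.0025: what the formula claims if the same
                              # 40 games are fed in 1000 times over, though
                              # the true uncertainty is still ~0.079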
Thank you for allowing me to participate in this discussion.
--Karl Juhnke