New testing thread

Discussion of chess software programming and technical issues.


hgm
Posts: 28390
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: 4 sets of data

Post by hgm »

bob wrote:Believe I got that 55 years ago or so. The point being addressed was how something with an SD of 12 (or less) can still produce two runs where the results are well outside that range.
Get that?
Not only that: even answered it about 30 pages of posts ago.
I just reduced the number of positions so that I could run the various tests being suggested, and not make the discussion drag out over months.
Reducing the number of games is not compatible with insisting on being able to measure the same tiny Elo differences you wanted to measure with 25,000 games. No use sulking about it. If you want to create a model problem that you can test quickly, you will have to scale the requirement for accuracy accordingly.
I fully understand that 25,000 games produces a smaller SD than 800.
Apparently not, as you continue complaining that the 800 games show a larger deviation than would make it a 'useful' replacement for your 25,000-game run.
But, unlike yourself, I also fully understand that running 800 games takes a lot less time,
Where did you get that silly idea from?
and if you recognize that the SD goes up and the size goes down, the discussion can _still_ reach some sort of conclusion that could be verified on the bigger run if necessary.

Because _everybody_ is using some sort of a vs b testing to either measure rating differences to see who is better, or whether a change was good. No idea how you do your testing and draw your conclusions, and I don't really care. But I am addressing what _most_ are doing. And things are progressing in spite of your many non-contributions, thank you.
You're welcome. :lol:
OK, we are playing with number of positions. I am up to almost 4,000 and am testing this. How many opponents? 4,000 needed there too? If so, I can actually play 16,000,000 games. But can anyone else? Didn't think so, so we need something that is both (a) useful and (b) doable. So far you are providing _neither_ while at least others are making suggestions that can be tested.
Yes, life is difficult, isn't it, for those relying on blind guesses in the face of infinities. The scientific approach would of course be to calculate how many you need. You have tried 40 different positions, and played a good deal more than 40 games on each of those, so you are in a position to calculate the standard deviation of the result over the variation between positions. Calculate by which factor that falls short of the accuracy you want, square it, and, magical trick, you have the number of needed positions.
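For concreteness, a minimal Python sketch of the calculation just described; the score array is a hypothetical stand-in for the per-position results of the existing 40-position run:

```python
# Sketch: estimate how many positions are needed, given per-position scores
# from a run that already plays many games on each position (data illustrative).
import math

per_position_score = [0.55, 0.40, 0.62, 0.48]   # one entry per position, 40 in practice
target_sd = 0.005                                # the accuracy you want, e.g. 0.5%

mean = sum(per_position_score) / len(per_position_score)
var_between = sum((s - mean) ** 2 for s in per_position_score) / len(per_position_score)
sd_single_position = math.sqrt(var_between)      # typical error from relying on one position

# The SD of an average over N positions falls as 1/sqrt(N), so square the shortfall factor.
needed_positions = math.ceil((sd_single_position / target_sd) ** 2)
print(sd_single_position, needed_positions)
```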

Oh, sorry, too difficult. Another useless non-contribution. And of course you no longer have the results of the 25,000-game matches broken down by position...
BTW your "most frequent suggestion, by far" has not been to use more positions or more opponents. 99% of your posts are "stampee feet, testing flawed, stampee feet, cluster is broken, stampee feet, there are dependencies between the games, stampee feet, stampee feet." None of which is useful. I had already pointed out that the _only_ dependencies present were tied to same opponents and same positions.
In your dreams, yes.

But as you used the same positions and opponents in both runs, these dependencies (if any) should decrease the variability of the runs (and hence the typical difference between two runs), and thus cannot explain your result, which is too large a difference, not too small. That these effects drive up the difference between what your runs produce and what you really wanted to know is irrelevant for explaining the 6-sigma deviation, as you remarked yourself in the very beginning of the previous thread.
But not where one game influences another in any possible way. But we don't seem to be able to get away from that.
I guess this is because of the unfortunate coincidence that I continue to overlook your explanation of how you excluded slow time-dependence of the involved engines as an artifact spoiling your experiment. Can you give me the link back to the post where you did that? :roll:
Meanwhile, in spite of all the noise, there is a small and steadily helpful signal buried in here that others are contributing. And I am willing to test 'em all without dismissing _anything_ outright. Unlike yourself.
Well, so at least you are following my earlier advice, then:
Muddle on! :lol: :lol: :lol:
Dirt
Posts: 2851
Joined: Wed Mar 08, 2006 10:01 pm
Location: Irvine, CA, USA

More readable

Post by Dirt »

bob wrote:Karl sent me a further explanation late last night and asked me to post it here. He has registered, but the registration request has not been handled yet (I have no idea why it takes 4-5 days to handle this either, but that is a topic for a discussion in the moderator's forum which I will start in a few minutes).

In any case, here are the contents of his email, posted at his request:

================quote on=================================
Robert Hyatt wrote:I received an email from a mathematician overnight that had some interesting insight into our discussions. Since they could be taken as a bit insulting to some posters in the discussion, I have invited him to come and participate directly.
I was unguarded in what I wrote to you by e-mail, Dr. Hyatt, but here in the public forum I hope to be a model of civility. Shame on me if I raise the emotional temperature of the discussion instead of discussing calmly and rationally.

It seems that some of what you posted from my letter was not sufficiently clear, based on the immediate responses it drew:
Uri Blass wrote:Correlation between 2 variables that get 1 with probability of 1 is not defined
and
H. G. Muller wrote:Note that statistically speaking, such games are not dependent or correlated at all. A sampling that returns always exactly the same value obeys all the laws for independent sampling, with respect to the standard deviation and the central limit theorem.
To clarify, consider the situation where the same starting positions are used repeatedly with fixed opponents and fixed node counts, guaranteeing the outcomes are the same in every run. The results of the first trial run are indeed correlated with the results of every subsequent trial run. For simplicity let me assume a position set of 40, only one opponent, only one color, and no draws. The results of the first trial run might be

X = 110010001000011011111000010011111111101.

Then the result of the second trial run will be the same

Y = 110010001000011011111000010011111111101.

The sample coefficient of correlation is well-defined, and may be calculated (using the formula on Wikipedia) as

(N * sum(x_i * y_i) - sum(x_i)*sum(y_i)) / (sqrt(N * sum(x_i^2) - sum(x_i)^2) * sqrt(N * sum(y_i^2) - sum(y_i)^2))
= (40*23 - 23*23) / (sqrt(40*23 - 23*23) * sqrt(40*23 - 23*23))
= (40*23 - 23*23) / (40*23 - 23*23)
= 1

Thus the coefficient of correlation between the first run and the second is unity, corresponding to the intuitive understanding that the two trials are perfectly correlated. These repeated trials do not, in fact, obey all the laws of independent sampling. In particular, there is no guarantee that the sample mean will converge to the true mean of the random variable X as the sample size goes to infinity. I took these numbers from a random number generator which was supposed to be 50% zeros and 50% ones. Intuitively 23 of 40 is not an alarming deviation from the true mean, but if we repeat the trial a thousand times, 23000 of 40000 is statistically highly improbable. For the mathematically inclined, we can make precise the calculation that repeated trials provide "no new information".
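A minimal Python sketch of this computation (the result string is the one quoted above); for two identical 0/1 result vectors the formula returns exactly 1:

```python
# Sketch: the sample correlation of two identical runs is 1.
import math

x = [int(c) for c in "110010001000011011111000010011111111101"]
y = list(x)   # the second run reproduces the first exactly

def sample_correlation(x, y):
    n = len(x)
    sxy = sum(a * b for a, b in zip(x, y))
    sx, sy = sum(x), sum(y)
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    return (n * sxy - sx * sy) / (math.sqrt(n * sxx - sx * sx) * math.sqrt(n * syy - sy * sy))

print(sample_correlation(x, y))   # 1.0
```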

For our random variable X taking values 1 and 0 with assumed mean 0.5, each trial has variance (1 - 0.5)^2 = (0 - 0.5)^2 = 0.25. The variance of the sum of the first forty trials, since they are independent with no covariance, is simply the sum of the variances, i.e. 40*0.25=10. The variance of the mean is 10/(40^2) = 1/160. The standard deviation of the mean is sqrt(1/160) ~ 0.079

Now we add in random variable Y. The covariance of X and Y is E(XY) - E(X)E(Y) = 0.5 - 0.5*0.5 = 0.25. When adding two random variables, the variance of the sum is the sum of the variances plus twice the sum of the covariance. Thus the variance of the sum of our eighty scores will be 40 * 0.25 + 2 * 40 * 0.25 + 40 * 0.25 = 40. The variance of the mean will be 40/(80^2) = 1/160. The standard deviation of the mean is sqrt(1/160) ~ 0.079.

If we have M perfectly correlated trials, there will be covariance between each pair of trials. Since there are M(M-1)/2 pairs of trials, there will be this many covariance terms in the formula for the variance of the sum. Thus the variance of the sum will be M * 40 * 0.25 + 2 * 40 * 0.25 * M(M-1)/2 = 10M + 10(M^2 - M) = 10M^2. The variance of the mean will be 10(M^2)/((40M)^2) = 1/160. The standard deviation of the mean is sqrt(1/160) ~ 0.079.

No matter how large the test size, we should expect results of 50% plus or minus 7.9%. The central limit theorem for independent sampling implies (among other things) that the standard deviation of the sample mean goes to zero as the sample grows, which isn't happening here. I apologize if the detail seems pedantic; it is in service of the clarity which my original letter obviously did not provide.
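A minimal Python sketch of the algebra above (per-game variance 0.25, pairwise covariance 0.25 between perfectly correlated repeats); the standard deviation of the mean stays at roughly 0.079 no matter how many repeats M are played:

```python
# Sketch: SD of the mean of M perfectly correlated repeats of a 40-game run.
import math

def sd_of_mean(M, n=40, var=0.25, cov=0.25):
    pairs = M * (M - 1) // 2                      # pairs of runs; each pair adds n covariance terms
    var_sum = M * n * var + 2 * pairs * n * cov   # variance of the total score
    return math.sqrt(var_sum) / (n * M)           # SD of the mean over n*M games

for M in (1, 2, 10, 1000):
    print(M, sd_of_mean(M))   # ~0.079 every time
```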

The second point I would like to make regards the "randomness" of our testing. Randomness has been presented as both the friend and the foe of accurate measurement. In fact, both intuitions are correct in different contexts.

If we want to measure how well engine A plays relative to how well engine A' plays, then we want to give them exactly the same test. Random changes between the first test and the second one will only add noise to our signal. In particular, if A is playing against opponents limited by one node count and A' is playing opponents limited by a different node count, then they are not taking the same test. By the same token, if both A and A' play opponents at the same time control, but small clock fluctuations change the node counts from the opponent A played to the opponent A' played, we can expect that to add noise to our signal. The fact that the two engines are playing slightly different opposition makes the difference between A and A' slightly harder to detect.


If we want to measure how well engine A plays in the absolute, however, then "randomness" (or more precisely independence between measurements) is a good thing. We want to do everything possible to kill off correlations between games that A plays and other games that A plays. This can include having the opponent's node count set to a random number, so that there is less correlation between games that reuse the same opponent. That said, if we randomize the opposing node count we should save the node count we used, so we can use exactly the same node count for the same game when A' plays in our comparison game.

I think, therefore, that test suite results will be most significant if the time control is taken out of the picture completely. If the bots are limited by node count rather than time control, we can control the randomness so that we get the "good randomness" (achieving less correlation among the games of one engine) and can simultaneously eliminate the "bad randomness" (removing noise from the comparison between engines).
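As an illustration only (this is not from Dr. Hyatt's actual harness), here is one way the "randomize the node count but save it for the re-run" idea could be sketched: derive the opponent's node budget deterministically from the game's identity, so the A' test replays exactly the budgets the A test used.

```python
# Sketch (hypothetical helper, not part of any existing test harness):
# a reproducible "random" node budget per game, so that A and A' face
# identical opposition while different games remain decorrelated.
import random

def opponent_node_budget(position_id, opponent, game_index,
                         base_nodes=2_000_000, jitter=0.10, seed="testrun-1"):
    rng = random.Random(f"{seed}:{position_id}:{opponent}:{game_index}")
    return int(base_nodes * (1.0 + rng.uniform(-jitter, jitter)))

print(opponent_node_budget(17, "Fruit", 0))
print(opponent_node_budget(17, "Fruit", 0))   # same value: the A' run can replay it exactly
```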

I've rambled quite a bit already, but I want to make two final points before wrapping up. First, the only way I see to achieve "good randomness" is by not re-using positions at all, as several people have suggested. Even playing the same position against five different opponents twice each will introduce correlations between the ten games of each position, and keep the standard deviation in measured performance higher than it would be for perfectly independent results.

Second, although I am not intimately familiar with BayesElo, I am quite confident that all the error bounds it gives are calculated on the assumption that the input games are independent. If the inputs to BayesElo are dependent, that dependence will necessarily fool BayesElo into giving confidence intervals that are too narrow.

Thank you for allowing me to participate in this discussion.
--Karl Juhnke
No insults? How boring! Plus, when you concentrate on how to improve the testing rather than on identifying why it went wrong, you don't leave much to argue about.
Fritzlein

Re: Correlated data discussion

Post by Fritzlein »

Zach Wegner wrote:So the results may look random, but they are heavily dependent, as your mathematician friend says. The difference is that using a random value, you can draw statistical conclusions from the results, such as Elo, error margins, etc.
You have gotten my point. I would like to expand on it for precision and clarity.

First, the exact matter of concern in using time jitter to introduce randomness is not how much variation there is among the node counts, but rather how much correlation there is among repeated test runs using the same position. It could be that node counts vary more than the move choices, and the move choices vary more than the game results. Our desire is not so much to control the distribution of node counts as to eliminate correlation between repeated tests of the same engine.

Second, we do not need to rely on our intuitions that time jitter produces results that are very random as opposed to heavily dependent. We don't have to trust our gut feeling or eyeball the data; correlation can be measured in practice.

Continuing the calculations from my earlier contribution that Dr. Hyatt kindly posted for me, suppose the results of a trial run on 40 positions are

X = 1110010001000011011111000010011111111101.

Then suppose a second trial is run on the same 40 positions with nothing changing except clock jitter, and the results are

Y = 0011011001000110011101000010011001111100.

Now are those results dependent or independent? Studies have shown that humans do a poor job of saying a set of results "looks random" or not, so we had better rely on formulas rather than how it looks. The sample coefficient of correlation is

(N * sum(x_i * y_i) - sum(x_i)*sum(y_i)) / (sqrt(N * sum(x_i^2) - sum(x_i)^2) * sqrt(N * sum(y_i^2) - sum(y_i)^2))
= (40*16 - 23*19) / (sqrt(40*23 - 23*23) * sqrt(40*19 - 19*19))
= 203 / (sqrt(391) * sqrt(399))
~ 0.51
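For anyone checking the arithmetic, a minimal Python sketch of the same computation with the two result strings quoted above (for 0/1 data the sums of squares equal the plain sums):

```python
# Sketch: sample correlation of the two time-jittered runs.
import math

x = [int(c) for c in "1110010001000011011111000010011111111101"]
y = [int(c) for c in "0011011001000110011101000010011001111100"]

n = len(x)
sxy = sum(a * b for a, b in zip(x, y))
sx, sy = sum(x), sum(y)
sxx, syy = sum(a * a for a in x), sum(b * b for b in y)
r = (n * sxy - sx * sy) / (math.sqrt(n * sxx - sx * sx) * math.sqrt(n * syy - sy * sy))
print(r)   # ~0.51
```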

So there appears to be some correlation. Of course, this is a small sample, so perhaps it happened by chance. To be more sure, we would need a larger sample.

Even assuming that the correlation in this sample is representative, one still might have a feeling that the clock jitter produced _some_ variation, so if we produce enough repeated trials, we can get an arbitrarily accurate answer. The feeling might be that we would maybe have to run 100,000 games instead of just 25,000 games, but eventually we would zoom in on the true win percentage in spite of a little bit of correlation sneaking in. Unfortunately, this intuition that we can make up for insufficient randomness by virtue of more measurements is very badly wrong, as I will try to show by formula.

Let's assume that our true winning percentage is 0.5, and the true correlation coefficient between repeated trials is as represented above. Each game has variance (1 - 0.5)^2 = (0 - 0.5)^2 = 0.25. The covariance between games using the same position is E(XY) - E(X)E(Y) = 16/40 - 23/40 * 19/40 = 203/1600 ~ 0.127, which we will take to be representative of the covariance of all further runs as well.

The variance of the sum of the first forty trials, since they are independent with no covariance, is simply the sum of the variances, i.e. 40*0.25 = 10. The variance of the mean is 10/(40^2) = 1/160. The standard deviation of the mean is sqrt(1/160) ~ 0.079.

If we have M correlated trials, there will be covariance between each pair of trials. Since there are M(M-1)/2 pairs of trials, there will be this many covariance terms in the formula for the variance of the sum. Thus the variance of the sum will be M * 40 * 0.25 + 2 * 40 * 203/1600 * M(M-1)/2 = 10M + (M^2 - M)*203/40 = 5.075*M^2 + 4.925*M. The variance of the mean will be (5.075*M^2 + 4.925*M)/((40M)^2).

As our number of repeat runs M goes to infinity, the M^2 term dominates, leaving the limit of the variance of the mean as 5.075/40^2. The limit of the standard deviation of the mean is sqrt(5.075/40^2) ~ 0.056

In other words, although the coefficient of correlation between the two trials was only about 0.51, it limits the amount of information we can ever get in a million billion trials to plus or minus 5.6% from the true mean. This is barely better than the plus or minus 7.9% from the true mean we got from our first forty playouts. If we could get a mere 100 independent playouts, we would measure to within plus or minus 5%, which is better than we will ever get from re-using the first forty correlated ones.
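A minimal Python sketch of this limit, using the per-game variance 0.25 and covariance 203/1600 assumed above; the standard deviation of the mean falls from ~0.079 toward ~0.056 and no number of correlated repeats gets below that floor:

```python
# Sketch: SD of the mean of M correlated repeats of the same 40 positions.
import math

def sd_of_mean(M, n=40, var=0.25, cov=203 / 1600):
    pairs = M * (M - 1) // 2
    var_sum = M * n * var + 2 * pairs * n * cov
    return math.sqrt(var_sum) / (n * M)

for M in (1, 10, 1000, 10**6):
    print(M, sd_of_mean(M))

limit = math.sqrt(40 * (203 / 1600)) / 40
print(limit)   # ~0.056, the floor that no amount of re-running can beat
```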

In conclusion, I think it is very relevant to run correlation tests if anything is re-used. In the old tests, we can confirm intuition by measuring the correlation from one 400-game run to the next. I expect it to be high, because I expect the randomness introduced by time jitter to be low.

In the latest tests, Dr. Hyatt is using 3000+ positions ten times each; we can check whether the ten playout results are correlated. I suspect that they are at least somewhat, because if Crafty knows how to win a position against Fruit, it probably knows how to win that same position against Glaurung. I don't expect perfect correlation, of course, but potentially enough correlation that we are wasting time by playing 27,000 games from recycled positions when we could be playing an extra, say, 5,000 unique positions to get the same amount of information.
Last edited by Fritzlein on Sat Aug 09, 2008 9:56 pm, edited 1 time in total.
hgm
Posts: 28390
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Correlated data discussion

Post by hgm »

Fritzlein wrote:I was talking about the standard deviation of the mean of test results in comparison to the true mean winning percentage. This seems to me the relevant number: we want to know what the expected measurement error of our test is, and we want to drive this measurement error as near to zero as possible.
Yes, I noticed that. But the fuss is all about the fact that two of Bob's samples differed by more than 6 sigma from each other. About how much they differ from this true quantity, which you and I agree is so relevant, no one has any clue. Except, of course, that if two samples differ by a certain amount, at least one of them must differ from the true winning percentage by at least half that. But it could be arbitrarily much more.
I was not talking about the standard deviation between a test run and an identical or nearly-identical test run.
Clear. But Bob was. So that makes your remarks not very relevant to his problem. Which is what I pointed out when Bob first quoted them.
We can make that variation exactly zero if we want to. Big deal. Who wants to get the wrong answer again and again with high precision?
Well, it seems like Bob wants that very much. You might say it is his principal drive in life. Otherwise he would not be using only 40 positions and 5 opponents some 8 months after I pointed out this problem to him. This was one of the first useless non-contributions to this discussion, you see... :lol: :lol: :lol:
(Well maybe it is important to know somehow whether or not _every_ variable has been controlled, but that's a practical question, not a math question.)

I have not explained why the original two 25600-game test runs failed to get the same wrong answer twice with high precision. I'll let the computer scientists duke it out; to me it is frankly less interesting because the mathematical mystery is gone. What I have explained (to my own satisfaction at least) is why each of the two 25600-game test runs can be expected to have a large error relative to Crafty's true strength, and thus I only need any old random change between the two big runs to complete the less interesting (to me) explanation as well.
Well, that is something we definitely agree on.

Tell me now if we also agree on this:

The magnitude of the correlations you point out can be measured empirically by taking the marginal scores of the big run, projected onto the different positions, different opponents, different playing times, and calculating the score variance over those marginal distributions. (Provided that the number of games played from each position (or against each opponent) is so large that the sampling noise in the result for that position (opponent) is small compared to the typical difference in score between the different positions.) That will tell you how much the result from a single position will typically deviate from the result averaged over all positions, and thus how many positions you minimally need to bring this source of sampling noise below the required accuracy bound.
Dirt
Posts: 2851
Joined: Wed Mar 08, 2006 10:01 pm
Location: Irvine, CA, USA

Re: Correlated data discussion

Post by Dirt »

bob wrote:To continue the discussion above, the idea of using a different position for each opponent seems reasonable. Not playing black and white from each, however, becomes more problematic, because then we have to be sure that the positions are all relatively balanced, which is not exactly an easy task when we start talking about 10K or more positions.
I don't think there is any need to be sure the positions are balanced. That is, you get more information from balanced positions whether you used them for both black and white or not. What you lose by testing only one way is accuracy about how well Crafty plays against the opponents. Theoretically you don't care about that, only about how the different versions of Crafty compare.

In practice I don't think it makes much difference, and the convenience of having half the number of positions probably outweighs any loss of testing efficiency.
Dirt
Posts: 2851
Joined: Wed Mar 08, 2006 10:01 pm
Location: Irvine, CA, USA

Roughly

Post by Dirt »

Dirt wrote:That is, you get more information from balanced positions...
Yes, I know this isn't strictly correct.
Last edited by Dirt on Sat Aug 09, 2008 10:45 pm, edited 1 time in total.
Fritzlein

Re: Correlated data discussion

Post by Fritzlein »

hgm wrote:Tell me now if we also agree on this:

The magnitude of the correlations you point out can be measured empirically by taking the marginal scores of the big run, projected onto the different positions, different opponents, different playing times, and calculating the score variance over those marginal distributions. (Provided that the number of games played from each position (or against each opponent) is so large that the sampling noise in the result for that position (opponent) is small compared to the typical difference in score between the different positions.) That will tell you how much the result from a single position will typically deviate from the result averaged over all positions, and thus how many positions you minimally need to bring this source of sampling noise below the required accuracy bound.
I do agree that one can and should measure various types of correlation in playout results. I'm not quite sure, however, what you mean about the result for a single position deviating from the average of all positions, and how that relates to a calculation of the minimal number of tests needed.

My conception of the measurement problem is as follows. If Crafty is put into the wild to play, it will have a certain percentage chance to win, let's call it X. That percentage will obviously depend on the opponent, depend on the position, depend on the time control, etc. These dependencies are not things we are trying to eliminate; on the contrary we want to simulate them because they inherently exist in the variable we are trying to measure.

If Crafty changes, it will have a new winning percentage X'. We want a test that reliably detects whether X or X' is greater, even when the difference between the two is as low as one or two percent.

If we could take N independent measurements of X, the greatest the variation could possibly be occurs when X is half wins, half losses, and no draws, for a variance of 0.25 per trial. The standard deviation of the mean of N trials is sqrt(0.25/N), so to measure X plus or minus one percent with 95% confidence (two standard deviations) we can set 2 * sqrt(0.25/N) = 0.01, which implies N = 10,000. That's the biggest number of independent tests we will need; lesser variation can only mean fewer tests will discriminate X within 1%.

We could then perform an independent measurement of X'. However, knowing both X and X' within 1% only determines (X - X') to within 1.414%. We can get more accuracy for free by running on X' the identical test that we ran on X. Here we want correlations: how A does on a position should be as related as possible to how A' does on that position. If the test suites are identical, then 10,000 tests of each will determine the difference to essentially 1% rather than 1.414%.

If we double the allowed error bound, we only need a quarter of the tests, i.e. 2,500. If we can only do 400 tests (1/25 the original) our error bound is five times as great, i.e. plus or minus 5%.
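A minimal Python sketch of these worst-case sample sizes (per-game variance at most 0.25, 95% confidence taken as two standard deviations):

```python
# Sketch: independent games needed for a given error bound at ~95% confidence.
import math

def games_needed(half_width, sigmas=2, per_game_var=0.25):
    return math.ceil(per_game_var * (sigmas / half_width) ** 2)

for hw in (0.01, 0.02, 0.05):
    print(hw, games_needed(hw))   # 10000, 2500, 400
```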

So what is left to calculate? Are you suggesting that we can get away with fewer tests by showing that typical situations in which Crafty will find itself are correlated with each other, and therefore don't all need to be tested?

The correlations I want to calculate are the ones that could say our 10,000 tests are not in fact independent, and thus don't give us the error bound we want. I'm focused on finding correlations we don't want so as to eliminate them.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Correlated data discussion

Post by bob »

Dirt wrote:
bob wrote:To continue the discussion above, the idea of using a different position for each opponent seems reasonable. not playing black and white from each, however, becomes more problematic because then we have to be sure that the positions are all relatively balanced, which is not exactly an easy tasks, when we start talking about 10K+ (or more) positions.
I don't think there is any need to be sure the positions are balanced. That is, you get more information from balanced positions whether you used them for both black and white or not. What you lose by testing only one way is accuracy about how well Crafty plays against the opponents. Theoretically you don't care about that, only about how the different versions of Crafty compare.

In practice I don't think it makes much difference, and the convenience of having half the number of positions probably outweighs any loss of testing efficiency.
Take an extreme point... all positions are such that white is winning. The test would not show a thing, since both A and A' will win every game with white and lose every game with black. Could that happen? Hard to say, but with so many gambit lines available, and so many lines that are _very_ deeply analyzed (the Sicilian, for example), it would be within the realm of reasonable probability to choose at least a large number of unbalanced positions. If you chose 1/2 of those, then 1/2 of the testing effort won't show anything. The other issue is that if you don't get a good representative sample of the openings you typically play, and you do get a lot of games from openings you avoid because you don't handle them very well, then the results get biased.
hgm
Posts: 28390
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Correlated data discussion

Post by hgm »

Fritzlein wrote:So what is left to calculate?
This is in relation to Bob's complaint that 10,000 positions might be too much, and 10,000 opponents will certainly be too much.

What I meant is this: suppose you have 10 positions, 10 opponents, and play each opponent from each position 100 times at slightly different numbers of nodes. So you play 10,000 games, and each position is played 1000 times. Suppose from one position the engine under test scores 60%, from 2 others 55%, from 4 others 50%, from 2 others 45%, and from the last one 40%. So the total test score is 50%.

The individual positions deviate from this average score by 10% (2x) or 5% (4x). This makes the standard deviation in the score of the individual positions sqrt(30)% ~ 5.5%. Very little of this is due to sampling noise in the percentages themselves, as the variance in 1000 games should be at most 2500/1000 (square percent) = 2.5 << 30. So the observed variance in the individual position scores is almost exclusively due to variation of the position. And the standard deviation the score suffers by selecting a single position, and using it for all games, is thus ~5%.

Now if you average over N independently chosen positions, the SD of the average will be 5%/sqrt(N). So if you want to achieve an accuracy of 1% (at 95% confidence, so 2-sigma), and you want no more than one quarter of the corresponding variance to be contributed by the position sampling, you need an SD of 0.25% (1-sigma). Thus N has to be at least 400.

There is not much benefit in doing the experiment with more than 400 positions: with 10,000 games the SD of the average score of the entire sample will be 50%/sqrt(10,000) = 0.5%, even with 10,000 different positions. Adding an extra 0.25% SD to that because of position-sampling noise averaged over 400 positions will only drive up the overall SD to 0.55%.
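A minimal Python sketch of this worked example, using the per-position scores given above; it reproduces the ~5.5% spread and a minimum position count in the 400-500 range (the post rounds the 5.5% down to 5%, which gives 400):

```python
# Sketch: position-count estimate from the 10-position example above.
import math

scores = [0.60] + [0.55] * 2 + [0.50] * 4 + [0.45] * 2 + [0.40]   # 10 positions
mean = sum(scores) / len(scores)
var = sum((s - mean) ** 2 for s in scores) / len(scores)
sd_single_position = math.sqrt(var)            # ~0.055, i.e. 5.5%

target_sd = 0.0025                             # 0.25% (1-sigma) position-sampling budget
needed = math.ceil((sd_single_position / target_sd) ** 2)
print(sd_single_position, needed)              # ~0.0548, 480
```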
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Correlated data discussion

Post by bob »

There is simply something going _badly_ wrong with communication here.

(1) if we play with X nodes per search we get one result. If we play with X+N nodes per search, we get another result.

(2) if we play with time limit for search, each run gets a different result.

I ran several tests using (2) and got different results. I then showed that the _only_ thing that could possibly produce these changes is the difference in node counts caused by searching on time.

Why is that so hard to explain? Every search of "exactly" 1 second is not exactly one second; it is 1 second +/- N, where N is not precisely known because of hardware differences. Before you can fix something, you have to understand what is happening. I wanted to prove, beyond any doubt, that this variability was purely a result of timing fluctuation, which I quite clearly did. Otherwise I didn't care about the varying node count searches at all with respect to testing. It was just a method of validating a hypothesis.