YATT.... (Yet Another Testing Thread)

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: YATT.... (Yet Another Testing Thread)

Post by bob »

Tony wrote:
bob wrote:
Tony wrote:
bob wrote:
hgm wrote:
bob wrote:It now appears that there was a correlation issue, but not one anyone seemed to grasp until Karl came along.
This is still absolute bullshit. Karl stated that the results would be farther from the truth when you used fewer positions. But they would have been closer to each other, as they used the same small set of positions. Karl's remark that being closer to the truth necessarily implies that they were closer to each other was even plain wrong, as my counter-example shows.
OK. Here you go. First a direct quote from karl:

============================================================
can lead to different moves and even different game outcomes. However, we
are doing _almost_ the same thing in each repetition, so although the
results of the 64 repetitions are not perfectly correlated, they are highly
correlated, and far from mathematically independent.

When we do the calculation of the standard deviation, we will not be
understating it by a full factor of 8 as we did in the case of Trials C & D,
but we will still be understating it by almost that much, enough to explain
away the supposed mathematical impossibility. Note that I am specifically
not assuming that whatever changed between Trials E & F gave a systematic
disadvantage to Crafty. I am allowing that the change had a random effect
that sometimes helped and sometimes hurt. My assumption is merely that the
random effect didn't apply to each playout independently, but rather
affected each block of 64 playouts in coordinated fashion.
============================================================

Now, based on that, either (a) "bullshit" is simply the first idea you get whenever you read a post here or (b) you wouldn't recognize bullshit if you stepped in it.

He said _exactly_ what I said he said. Notice the "enough to explain away..." This quote followed the first one I posted from him last week when we started this discussion.


And based on the results so far, his idea of eliminating the black/white pairs may also be a good one, since a pair of games, same players, same position, is going to produce a significant correlation for positions that are not absolutely equal, or that are not equal with respect to the two opponents.
This is also wrong. Unbalanced positions are bad whether you pair them or not. It becomes more difficult to express a small improvement in a game that you are almost certainly going to lose anyway. The improvement then usually only means you can delay the inevitable somewhat longer.
Again, don't buy it at all. If a position is that unbalanced, the two outcomes will be perfectly correlated and cancel out. A single game per position gives twice as many positions for the same number of games, hopefully twice as many that are not too unbalanced.
Is this true ?

With equal strength (50% winchance)

1 unbalanced position, played twice : => 1 - 1

1 unbalanced, 1 balanced => 1.5 - 0.5

perfect world result 1 - 1

With unequal strength (100% winchance for 1):

1 unbalanced position, played twice : => 1 - 1

1 unbalanced, 1 balanced 2 possibilities
stronger gets winning position => 2 - 0
weaker gets winning position => 1 - 1

perfect world result 2-0

Tony
The issue was "independent or non-correlated results." In a 2-game match on an unbalanced position, the two results are correlated. Think about the extreme point. 100 positions, all unbalanced. So you get 100 wins and 100 losses whatever you change. Now take 200 positions, 100 unbalanced, 100 pretty even. Changes you make are not going to affect the unbalanced results, but will affect the other 100 games. Which set will give the most useful information???
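
For illustration only (this is not from the thread): a minimal sketch of the expected-score arithmetic in Tony's example and the extreme case above, under the crude assumption that an unbalanced position is always won by whichever engine has the advantaged color, while a balanced position is decided purely by relative strength. The model, function names, and numbers below are mine.

```python
# Minimal sketch (my own illustration, not code from the thread).
# Assumption: an "unbalanced" position is won by whichever engine has the
# advantaged color, regardless of strength; in a "balanced" position engine A
# scores p_strength on average (0.5 = equal engines, 1.0 = A always wins).

def unbalanced_pair(p_strength):
    # Same unbalanced position played twice with colors reversed:
    # each engine wins once with the advantaged color, so always 1-1,
    # no matter how strong engine A is.
    return (1.0, 1.0)

def unbalanced_plus_balanced(p_strength, a_gets_advantage):
    # One game from an unbalanced position plus one from a balanced position.
    score_a = (1.0 if a_gets_advantage else 0.0) + p_strength
    return (score_a, 2.0 - score_a)

print(unbalanced_pair(0.5))                   # (1.0, 1.0)
print(unbalanced_plus_balanced(0.5, True))    # (1.5, 0.5)  Tony's first case
print(unbalanced_pair(1.0))                   # (1.0, 1.0)  strength difference hidden
print(unbalanced_plus_balanced(1.0, True))    # (2.0, 0.0)
print(unbalanced_plus_balanced(1.0, False))   # (1.0, 1.0)
```
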
Ah, yes.

Playing 2 matches per position might bring us closer to the "real" elo, but we're only interested in the relative results.

New proposal: How about varying the starting positions?

Put enormous.pgn in a game database, and play with the 10000 positions that scored closest to 50% (as black and white). Add these games, take the positions closest to 50%, etc.

That way, we improve the chance for "random" positions (i.e. we filter out unbalanced positions).

We could even do this on a per-opponent basis, to make sure that certain kinds of positions that a certain opponent handles badly don't get overvalued.

Tony
That is probably a bit better approach than the one I am using at the moment. However, you probably want to factor in both the win/lose ratio and the percentage of total games where the position was played, so that you don't get off into oddball opening systems which might have a number of draws because of weak opponents or whatever. Another issue is the type of correlation you would see between positions that are different, but only barely so. For example, one where white has played a3 and one where he has not. In thinking about your idea, I think there are three components that have to factor in.

1. some sort of "chess hamming-distance" which is a measure of how different the positions are, favoring positions with more significant differences.

2. some sort of popularity measure so that you test on mainstream rather than oddball openings.

3. result of the game, so that you pick openings that are pretty balanced rather than one that leads to a quick win or loss every time.

How to do that is a completely different question. I obviously have the w/l/d data for each game in PGN form; by counting I could discover how many times it was played compared to other openings, and a basic hamming-distance approach would work, although the degree of "difference" is not so easy to think about. Some positions with slight differences in piece placement might be extremely different in terms of how they are played. e4/e5 or d4/d5 are two minor changes but major in terms of the ensuing game.

Needs some thought and discussion. And there is also the issue of should you include all openings, or just the ones you are going to actually play in games?
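
A rough, hypothetical sketch of the three filters just described, for discussion. The data layout, thresholds, and the crude square-counting distance are my own assumptions for illustration; nothing here is Bob's actual tooling.

```python
# Hypothetical sketch of the three filters: popularity, balance, and a crude
# "chess hamming distance".  All names, thresholds, and the data layout are
# assumptions made for illustration only.

def board_distance(fen_a, fen_b):
    # Crude "chess hamming distance": count board squares that differ between
    # the two FEN piece-placement fields.  (This ignores the point above that
    # small placement differences can matter a lot, e.g. e4/e5 vs d4/d5.)
    def expand(row):
        out = []
        for ch in row:
            out.extend(['.'] * int(ch) if ch.isdigit() else [ch])
        return out
    squares_a = sum((expand(r) for r in fen_a.split()[0].split('/')), [])
    squares_b = sum((expand(r) for r in fen_b.split()[0].split('/')), [])
    return sum(1 for a, b in zip(squares_a, squares_b) if a != b)

def select_positions(stats, chosen=(), min_games=50, max_imbalance=0.10, min_distance=4):
    """stats: {fen: (wins, losses, draws)} gathered from a big PGN collection.
    Keep positions that are popular enough, near 50% score, and not too close
    to a position already selected."""
    selected = list(chosen)
    for fen, (w, l, d) in sorted(stats.items(), key=lambda kv: -sum(kv[1])):
        games = w + l + d
        if games < min_games:
            continue                                  # popularity filter
        score = (w + 0.5 * d) / games
        if abs(score - 0.5) > max_imbalance:
            continue                                  # balance filter
        if any(board_distance(fen, s) < min_distance for s in selected):
            continue                                  # "hamming distance" filter
        selected.append(fen)
    return selected
```
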
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: YATT.... (Yet Another Testing Thread)

Post by bob »

To further clarify, here is the complete original email from Karl. I tried to post it in two parts so that each could be discussed separately. But if you read it _carefully_ you (or at least everyone else) will find his explanation of why the randomness I was seeing was _not_ necessarily a "6-sigma event". Here goes. Note that he gives a formula midway down that is not quite correct. He corrected it, but after posting the original, and before hand-editing it to match his update, I deleted the email, so I don't have the fix. But if you ignore the formula and read the text, all will be clear.

=============================================================
Subject: Designing a statistically significant test suite.

Hello, Dr. Hyatt. I was referred to your recent talkchess.com thread from a
different forum that I read, and I found the discussion fascinating due to
my interest in mathematics, rating systems, and computer game engines.
Perhaps I can shed some light on that discussion.

The central point of miscommunication seems to have been confusion between
the everyday meaning of dependent (causally connected) and the mathematical
meaning of dependent (correlated). I am astonished that self-styled
mathematical experts at talkchess.com who were criticizing you didn't make
this distinction. The difference in the two meanings is stark if one
considers two engines playing each other twice from a given position with
fixed node counts, because the results of the two playouts will surely be
the same. Neither playout affects the other causally, so they are not
dependent at all in the everyday sense, but the winner is always the same,
which is to say the outputs are perfectly correlated, and therefore as
mathematically dependent as it gets.

Let's consider a series of hypothetical trial runs. I assume that you are
as capable as anyone in the industry of preventing any causal dependence
between various games in the trials, so causal dependence will not factor in
my calculations at all. I believe you when you say that you have solved
that problem.

Trial A: Crafty plays forty positions against each of five opponents with
colors each way for a total of 400 games. The engines are each limited to a
node count of 10,000,000. Crafty wins 198 games.

Trial B: Same as Trial A, except the node count limit is changed to
10,010,000. Crafty wins 190 games.

Now we compare these two results to see if anything extraordinary has
happened. In 400 games, the standard deviation is 10, and the difference in
results was only 8, so we are well within expected bounds. There's nothing
to get excited about, and we move on to the next experiment.

Trial C: Same as Trial A, except that each position-opponent-color
combination is played out 64 times. Yes, this is a silly experiment,
because we know that repeated playouts with a fixed node count give
identical results, but bear with me. Crafty wins (as expected) exactly
12672 games.

Trial D: Same as Trial B, except that each position-opponent-color
combination is played out 64 times. Crafty wins 12160, as we knew it would.

Now we compare the latter two trials. In 25,600 games the standard
deviation is 80, and our difference in result was 512, so we are more than
six sigmas out. Holy cow! Run out and buy lottery tickets! :)
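
(Aside, not part of Karl's email: the sigma figures above follow from the usual binomial model, treating each game as an independent coin flip with a 50% win probability and ignoring draws. A quick check:)

```python
# Sketch of where the "standard deviation is 10 / is 80" figures come from,
# assuming each game is an independent coin flip (p = 0.5, draws ignored).
import math

def sigma_wins(n_games, p=0.5):
    return math.sqrt(n_games * p * (1 - p))   # binomial standard deviation

print(sigma_wins(400))     # 10.0 -> Trials A/B: a difference of 8 wins is under 1 sigma
print(sigma_wins(25600))   # 80.0 -> Trials C/D: a difference of 512 wins is 6.4 sigma
```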

In this deterministic case it is easy to see what happened. The perfect
correlation of the sixty-four repeats of each combination meant that we were
gaining no new information by expanding the trial. The calculation of
standard deviation, however, assumes no correlation whatsoever, i.e. perfect
mathematical independence. Since the statistical assumption was not met,
the statistical result is absurd.

Let's continue our trials:

Trial E: Same as Trial C, but instead of limiting by nodes, we limit by
time.

Trial F: Same as Trial E, including the same time control, except that the
room temperature is two degrees cooler, and because of that or some other
factor, the engines are able to search at 1.001 times the speed they were
searching before.

In these last two trials, the 64 repetitions of each position-opponent-color
combination will not necessarily be identical. Minuscule time variations
can lead to different moves and even different game outcomes. However, we
are doing _almost_ the same thing in each repetition, so although the
results of the 64 repetitions are not perfectly correlated, they are highly
correlated, and far from mathematically independent.

When we do the calculation of the standard deviation, we will not be
understating it by a full factor of 8 as we did in the case of Trials C & D,
but we will still be understating it by almost that much, enough to explain
away the supposed mathematical impossibility. Note that I am specifically
not assuming that whatever changed between Trials E & F gave a systematic
disadvantage to Crafty. I am allowing that the change had a random effect
that sometimes helped and sometimes hurt. My assumption is merely that the
random effect didn't apply to each playout independently, but rather
affected each block of 64 playouts in coordinated fashion.
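
(Another aside, not from Karl's email: a small simulation of the effect he describes, under an assumed size for the shared per-block effect that is my choice, not his. Each block of 64 playouts shares one random shift in Crafty's win probability; the spread of the total score across repeated runs then comes out well above the naive sigma of 80 computed as if all 25,600 games were independent.)

```python
# Illustration (my own, under assumptions Karl does not spell out): when the
# 64 repetitions of each position/opponent/color combination share a common
# random effect, the spread of total scores across runs is much larger than
# the naive binomial sigma computed as if all 25,600 games were independent.
import random, statistics, math

POSITIONS, REPEATS = 400, 64      # 400 combinations x 64 playouts = 25,600 games
BLOCK_EFFECT = 0.10               # assumed size of the shared per-block shift

def one_run():
    wins = 0
    for _ in range(POSITIONS):
        p = 0.5 + random.uniform(-BLOCK_EFFECT, BLOCK_EFFECT)  # shared by the block
        wins += sum(random.random() < p for _ in range(REPEATS))
    return wins

runs = [one_run() for _ in range(200)]
naive_sigma = math.sqrt(POSITIONS * REPEATS * 0.25)            # 80, assumes independence
print(f"naive sigma: {naive_sigma:.1f}")
print(f"observed sigma across runs: {statistics.stdev(runs):.1f}")  # noticeably larger
```
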

Now let me pass from trying to give a plausible explanation of your posted
results to trying to solve the practical problem of detecting whether a code
change makes an engine stronger or weaker. I am entirely persuaded of your
opening thesis, namely that our testing appears to find significance where
there is none. We think we see a trend or pattern when it is only random
fluctuation. We need to re-examine our methodology and assumptions so that
we don't jump to conclusions too quickly.

The bugbear is correlation. We are wasting our time if we run sets of
trials that _tend_ to have the same result, even if they don't always have
the same result. Yes, we want code A and code A' to run against exactly the
same test suite, but we don't want code A to run against the same test
position more than once.

The bedrock of the test suite is a good selection of positions. If the
positions are representative of actual game situations, then they will give
us information about how the engine will perform in the wild. They can't be
too heavy on any particular strategic theme that would bias the test results
and induce us to over-fit the engine to do well on that one strategy.

Assuming you have a good way to choose test positions, I think it is a
mistake to re-use them in any way, because that creates correlations. If A'
as white can outplay A as white from a certain position, then probably A' as
black can outplay A as black from the same position. The same strategic
understanding will apply. Re-running the same test with different colors is
not giving us independent information, it is giving us information
correlated to what we already know. Similarly it is a mistake to re-use a
position against different opponents. If A' can play the position better
than A against Fruit, then A' can probably play the position better than A
against Glaurung. The correlation won't be perfect, but neither will the
tests be independent.

In other words, I am saying that if you want to run 25,600 playouts, then
you should have a set of 25,600 unique starting positions that are
representative of the positions you want Crafty to do well on. If you want
to remove color bias, good, have Crafty play white in the even-numbered
positions and black in the odd-numbered positions, but don't re-use
positions. If you want to avoid tuning for a specific opponent, good, have
Crafty play against Fruit in positions numbered 1 mod 5, against Glaurung in
positions numbered 2 mod 5, etc., but don't re-use positions. Come to think
of it, re-using opponents creates a different source of correlation that
also minimizes the usefulness of your results. One hundred opponents will
be better than five, and ideally you wouldn't re-use anything at all. If
nothing else, vary the opponents by making them marginally stronger or
weaker via the time control to kill some of the correlation.

If it seems like too much trouble to build up a suite of so many test
positions, consider that correlation puts an absolute bound on what you can
learn from a single position no matter how many trials you run on it. Even
if the results of repeated playouts are only 50% correlated, what you learn
from thousands of repetitions (let's go ahead and say infinitely many
repetitions) on a single position is equivalent to what you learn from a
mere four independent positions. Asymptotically you learn nothing more from
more repetitions if there is _any_ correlation, no matter how small. The
existence of correlation bounds the significance of your results regardless
of how much hardware you throw at it. Your primary focus should therefore
be eliminating as many types of correlations as you can think of.
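
(A last aside, again not Karl's text: one simple way to see the asymptotic bound he mentions is an equicorrelation model, which is my assumption here rather than anything Karl specifies. If every pair of playouts from the same position has correlation rho, the variance of their average never falls below rho times the single-playout variance, so beyond some point further repetitions stop adding information.)

```python
# Sketch under an assumed equicorrelation model (my assumption, not necessarily
# the model Karl has in mind): the variance of the mean of n playouts whose
# pairwise correlation is rho never drops below rho * sigma^2.
def variance_of_mean(n, rho, sigma2=0.25):
    return sigma2 * (rho + (1.0 - rho) / n)

for n in (1, 4, 16, 64, 1000, 10**6):
    print(n, variance_of_mean(n, rho=0.5))   # plateaus at 0.125 = rho * sigma2
```
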

I hope what I am saying sounds reasonable and useful, but if not, I'm happy
to hear why not. Thanks for listening.
=========================================================

So in reading that, it appears to me that he _does_ explain exactly why the original results (two big runs) were not corrupted by anything other than using too few positions, something you never pointed out in the context being discussed.

Now, do some more lol's and such...
hgm
Posts: 28123
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: YATT.... (Yet Another Testing Thread)

Post by hgm »

bob wrote:So you still refuse to read _all words_. "not explained... with high precision." You apparently can't parse that
Well, if you think that "with high precision" refers to "explain", it is obviously you who can't parse the sentence... "High precision" refers to "getting the same result twice". And indeed, the results of the two runs were not the same with high precision (namely 1-sigma, where sigma was very small due to the large number of games), but significantly different (6-sigma apart). This was the whole issue ("why a 6-sigma deviation, rather than 1-2 sigma"), and the only issue. And Karl says he cannot explain it. Get it now? :roll:
and grasp the basic idea "I have explained how this might happen, and now it is not an issue because the 'mystery' is gone"?
Is this gobbledegook supposed to mean anything? What is "this" supposed to refer to? It seems you are putting words in Karl's mouth he never uttered... But everyone taking the trouble to read Karl's post will see that he very unambiguously states that he explains why your results might lie farther than necessary from the truth, and that he neither can explain, nor cares, that your systematically flawed runs also lie far from each other. So if you want to demonstrate your intellectual level by continuing to deny it, well, go ahead...
No wonder these conversations are hopeless, but thankfully we have people like Karl who are not in this for the sake of being combative; they simply want to help.
In general my postings are far more helpful than Karl's. It is not my fault that you are beyond help...
Yes, I suppose I can get confused when _you_ write things.
Indeed. And of course when Karl writes things. And Uri. And Nicolai. And ...... The list goes on and on! But it is all our fault of course. Especially since we have no problem whatsoever understanding each other... :lol: :lol: :lol:
Because you say A, then someone comes along and says B, and once it becomes apparent that B is pretty accurate, you come back and say "I said A, but I obviously meant B as well, you just were too stupid to realize that..."
Well, you certainly got that last thing right. And even worse, you cannot even grasp it when it is then explained to you in excruciating detail, like above.
If you want credit for pointing out that there was correlation that was caused by _the positions_, and _the opponents_ then feel free to take it.
Well, that is what I wrote 11 months ago. But, as I also wrote 11 months ago, and uncountable times since, that is not the kind of correlation that you observed, the kind of correlation that drives up the difference between the runs. So on the 6-sigma difference nothing is explained, neither by me nor by Karl, and the fact remains that you posted crap results without having any explanation for them.
But if you read back over all your posts on the topic, you have been claiming all along that this correlation was introduced in some other way
Well, it was certainly not introduced in this way, that's for sure.
and was intrinsic to the cluster testing being done. You have said that dozens of times.
I said no such thing. I always maintained they must be due to incompetent data taking. I never used the word 'intrinsic' as an explanation for the variability you observed. On the contrary, I always maintained that this variability is an artifact, and that it should disappear from your tests when they are conducted properly.
Suddenly we see that it is _not_ related to the cluster at all,
We see no such thing! Neither suddenly nor gradually. (But of course that will not stop you from imagining it...) The only thing we see is that the artifact that corrupted your first data run did not manifest itself. You have offered no proof whatsoever that the hypervariability correlates with using few different positions as opposed to many. Nor will you ever, if you continue to only do runs with many positions. The only way to show this correlation is to also repeat runs with few positions, and wait until you have a statistically significant number of occurrences of the artifact, to determine whether it occurs more often when you use few positions than when you use many.

Your statement that you have repeatedly posted hypervariable data is not very credible, as almost every time I have seen you post such allegedly hypervariable data here, a closer look reveals that there is actually nothing hypervariable in the data whatsoever, and that it was in perfect agreement with normal statistical expectation. (Oh yes, and of course not sufficient for the purpose you wanted to use it for. But that is no criterion for hypervariability.) This creates a strong impression that you are not able to see the difference between normal data and exceptional flukes. You keep on showing us flies, claiming they are elephants. The only other case where you have shown us a single example of data outside normal statistical expectation was one year ago.
which I have also said dozens of times. So if you want the credit, feel free to take it. But so far as I am concerned, the explanation came from Karl, along with a suggestion on how to directly address the explanation and eliminate the problem.
Well, as you apparently do not understand a syllable of what either of us wrote, it is not very relevant who you give the credit to, don't you think? Better to embarrass him than me by giving him credit for your nonsensical distortion of what he said...
hgm
Posts: 28123
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: YATT.... (Yet Another Testing Thread)

Post by hgm »

bob wrote:But if you read it _carefully_ you (or at least everyone else) will find his explanation of why the randomness I was seeing was _not_ necessarily a "6-sigma event".
And note that the post quoted by me, where he states he did not explain why the runs differed so much from each other, occurred after this, in reaction to my criticism of some points of this message that IMO were unclear and needed clarification. Well, Karl certainly did give this clarification in unambiguous terms:
Karl wrote:I have not explained why the original two 25600-game test runs failed to get the same wrong answer twice with high precision.
So of course Bob thinks that Karl's message explains why the runs failed to get the same wrong answer. :lol:
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: YATT.... (Yet Another Testing Thread)

Post by bob »

hgm wrote:
bob wrote:But if you read it _carefully_ you (or at least everyone else) will find his explanation of why the randomness I was seeing was _not_ necessarily a "6-sigma event".
And note that the post quoted by me, where he states he did not explain why the runs differed so much from each other, occurred after this, in reaction to my criticism of some points of this message that IMO were unclear and needed clarification. Well, Karl certainly did give this clarification in unambiguous terms:
Karl wrote:I have not explained why the original two 25600-game test runs failed to get the same wrong answer twice with high precision.
So of course Bob thinks that Karl's message explains why the runs failed to get the same wrong answer. :lol:
You just can't parse and understand what appears to me to be plain English.

"I have not explained... with high preision". Does not mean "I have not explained anything at all". He explained how it _could_ have happened. He was (at that time) not so certain that this _was_ the cause. The runs made recently have increased the probability that his explanation was correct.

You can have the last word. You really don't want to see this lead to a resolution, just more arguments. There's plenty left to do without endless and circular arguments with someone who simply either cannot, or will not, read what is being written. That is not _my_ problem. But if you re-read, particularly the paragraph that ends with "lottery", I think the _normal_ person will "get it".
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Karl, input please...

Post by bob »

OK, I will leave it up to Karl to explain his sentence quoted here:

=======================================================
I have not explained why the original two 25600-game test runs failed to get the same wrong answer twice with high precision. I'll let the computer scientists duke it out, and to me it is frankly less interesting because the mathematical mystery is gone.
=======================================================

I believe you are saying "I have not explained precisely why the two original runs failed to get the same wrong answer twice", although you did give a good "guess" that needed testing.

So, in summary, do you still believe your original hypothesis, namely that playing the same positions multiple times was introducing correlation, and hence blowing the "6-sigma SD event" out of the water because the SD was invalid due to the repeated games?

Somehow it is impossible to make progress with HGM, even though I thought I had understood what you were saying from the beginning. I thought it better to let you explain your own words rather than continuing an endless debate that is producing no useful product.
hgm
Posts: 28123
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Karl, input please...

Post by hgm »

Well, fair enough. Now that I dug up my original post, I am also curious what Karl's comment on that would be, and how he judges the similarity of it to what he has said (as opposed to what you think he has said...).

In fact I still have an issue with something Karl said before, which I had no time to address yet. So while we await his arrival, let me elaborate on it, so we can solve it at the same time.

In an earlier post Karl said that he and I had a different intuition about the effect of repeatedly using the same position, and that neither of us was proven wrong yet. I beg to differ: what I said had nothing to do with intuition, only with rock-solid math:

Suppose you sample a population that is subdivided into classes (labeled by i). Each class has a distribution of the result j, P(j|i), with an associated mean a_i = SUM(j) j P(j|i) and a variance v_i = SUM(j) (j - a_i)^2 P(j|i). Now the "shifting rule" for variance says:

SUM(j) (j-M)^2 P(j|i) = v_i + (M-a_i)^2

Now suppose I sample not only within class i, but do a "super-sampling" of the total population, by first picking a class with probability P(i), and then sampling that class as usual. The expectation value of such a sample
will be

M = SUM(i) a_i P(i) = SUM(i,j) j P(i) P(j|i) = SUM(i,j) j P(i,j),

i.e. the grand average of the entire population. Using this M in the shifting rule for each i, multiplying those rules by P(i), and adding over all i, we get

SUM(i) { P(i) SUM(j) (j-M)^2 P(j|i) } = SUM(i) { P(i) v_i } + SUM(i) { P(i) (a_i-M)^2 }.

Making use of P(i) P(j|i) = P(i,j), the left-hand side reduces to

SUM(i,j) (j - M)^2 P(i,j)

This is by definition the total variance of the population, as M was the population average. So we see that this total variance equals the average of the class variances (weighted with their probability of being selected), plus the variance of the distribution of the class averages (the second term on the RHS).

Now if the classes are defined by the starting position of the game, we see that the variance of a sampling process of match results can be written as the sum of the variance of the expected match results over the set of positions, plus the average of the variances of the matches using a single starting position. If all these matches contain approximately equal numbers of games (i.e. each position is played equally often), and are approximately balanced, all the latter variances should be nearly equal. So their average is simply a constant, and in any case independent of how many positions, and of which positions exactly, you average over.

So the variance in the super-sampling process is the variance of a match from a single, typical position, plus a variance caused by the sampling of the position. No intuition is involved in this.
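
(A quick numerical check of the decomposition above, added for illustration; the simulated per-position expected scores are arbitrary choices of mine, not data from the thread.)

```python
# Numerical check (my own sketch) of the decomposition above:
# total variance = average of within-class variances + variance of class means.
import random, statistics

random.seed(1)
class_means = [random.uniform(0.2, 0.8) for _ in range(50)]   # per-position expected scores
samples = []
for m in class_means:
    for _ in range(1000):                      # sample each class ("position") equally often
        samples.append(1.0 if random.random() < m else 0.0)

total_var = statistics.pvariance(samples)
within = statistics.mean(m * (1 - m) for m in class_means)    # average class variance
between = statistics.pvariance(class_means)                   # variance of class means
print(total_var, within + between)             # the two numbers agree up to sampling noise
```
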
hgm
Posts: 28123
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Karl, input please...

Post by hgm »

Well, it seems we won't see Karl here again.
Fritzlein wrote:Unfortunately, the ongoing discussion of who is an idiot, although it has given me grounds to formulate my own opinions, does not draw me back to contribute either those opinions or something more substantive. I wanted to jump into the fray because I thought I had something to say, but I'm not going to be able to stand this forum. I don't know what the solution is for the collective, but the solution for me personally is to stick to friendlier discussions.
So I guess the conclusions must remain as they are now:

1) Sampling over a set limited to a small number of games produces results that are far from the truth

2) Sampling from a set of a large number of positions produces results closer to the truth

3) Both sampling methods produce the same variability in the sample results

4) The variance of the result w.r.t. the truth due to position sampling simply adds to the average variance of the sampling using a single position

5) Points 1-4, and more, were pointed out by me 11 months ago in the quoted post.

6) Karl re-iterated 1-3

7) When explicitly asked by me, he again confirmed (3)

8) Despite all this, Bob still is in denial about (3)
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Karl, input please...

Post by bob »

hgm wrote:Well, it seems we won't see Karl here again.
Fritzlein wrote:Unfortunately, the ongoing discussion of who is an idiot, although it has given me grounds to formulate my own opinions, does not draw me back to contribute either those opinions or something more substantive. I wanted to jump into the fray because I thought I had something to say, but I'm not going to be able to stand this forum. I don't know what the solution is for the collective, but the solution for me personally is to stick to friendlier discussions.
So I guess the conclusions must remain as they are now:

1) Sampling over a set limited to a small number of games produces results that are far from the truth

2) Sampling from a set of a large number of positions produces results closer to the truth

3) Both sampling methods produce the same variability in the sample results

4) The variance of the result w.r.t. the truth due to position sampling simply adds to the average variance of the sampling using a single position

5) Points 1-4, and more, were pointed out by me 11 months ago in the quoted post.

6) Karl re-iterated 1-3

7) When explicitly asked by me, he again confirmed (3)

8) Despite all this, Bob still is in denial about (3)
You are correct about "in denial" because that is not what we have been discussing. My _smallest_ number of test games is larger than anyone else's _largest_ number of test games. But even that is not the issue at hand.

We have been discussing the difference between a small number of positions played many times each, vs a large number of positions played one time each. I was doing the former, I am now doing the latter. But I have _never_ played "a small number of games". So perhaps once we get on the same page, using the same vocabulary, we might agree on something.

What you explained above has not been the issue over the past month.

BTW I wonder where the "hostility" he mentioned came from?
hgm
Posts: 28123
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Karl, input please...

Post by hgm »

Oops! You got me there! :oops:

What I intended to write was 'positions', not 'games'. That a small number of games gives results that are typically far from the truth is of course a no-brainer.

The point I intended to summarize in (1) was that even with an infinite number of games, the results would be far from the truth if these games were only played from a small number of positions.

The point you deny is (3), btw, not (1).

The point we have been discussing the past month seems a moving target. At the very beginning of this discussion I already brought up the fact that the results were far from the truth, due to the small number of games and opponents. But at the time I accepted your dismissal of that, when you said we were not discussing the difference of your results from the truth, but their difference from each other.

But then Karl showed up, and he was only interested in the difference with the truth, and did not want to offer anything on the original problem. So then playing more positions suddenly became the hype of the day...

But, elaborating on (3), the fact remains that:

3a) Playing from more positions could only help to get the results closer to the truth, but does nothing for their variability.

3b) I said this

3c) Karl said this

3d) I said it again

3e) You keep denying it.