YATT.... (Yet Another Testing Thread)

Discussion of chess software programming and technical issues.

Moderator: Ras

jwes
Posts: 778
Joined: Sat Jul 01, 2006 7:11 am

Re: YATT.... (Yet Another Testing Thread)

Post by jwes »

hgm wrote:What Karl actually said, referring to the long story from which you have taken this snippet, was:
Karl wrote:I was talking about the standard deviation of the mean of test results in comparison to the true mean winning percentage. This seems to me the relevant number: we want to know what the expected measurement error of our test is, and we want to drive this measurement error as near to zero as possible.

I was not talking about the standard deviation between a test run and an identical or nearly-identical test run. We can make that variation exactly zero if we want to. Big deal. Who wants to get the wrong answer again and again with high precision? (Well maybe it is important to know somehow whether or not _every_ variable has been controlled, but that's a practical question, not a math question.)

I have not explained why the original two 25600-game test runs failed to get the same wrong answer twice with high precision.
bob wrote:Karl made a suggestion, Uri made a suggestion on where to get the positions, and I tried it. Had a _reasonable_ suggestion been made 8 months ago (where that comes from is unknown as these testing threads have been going on for almost 2 years now) I'd certainly have tried it. Just as I did this time.
The problem is that statistics suggested very strongly that there was a problem in your testing process, but gave little information on what that problem might be. Increasing the number of positions is an obvious (in hindsight) way to improve your testing, but I have no idea how it could have fixed the particular problem we were seeing. I (and apparently other people) could not come up with good suggestions for what was causing this problem, so we remained silent (or made bad suggestions ;). HGM suggested you were manipulating the data, which seemed very unlikely. But since the data showed something extremely unlikely was occurring, it could not be rejected out of hand.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more data... six runs done

Post by bob »

I now have 6 complete runs, with the PGN from the last 2 saved. Once the other two are finished (and perhaps even late tonight, after the 3rd one finishes), I will try the combination test we were discussing.

Code: Select all

Tue Aug 12 00:49:44 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   108    7    7  7782   67%   -21   20% 
   2 Fruit 2.1               62    7    6  7782   61%   -21   23% 
   3 opponent-21.7           25    6    6  7780   57%   -21   33% 
   4 Glaurung 1.1 SMP        10    6    6  7782   54%   -21   20% 
   5 Crafty-22.2            -21    4    4 38908   46%     4   23% 
   6 Arasan 10.0           -185    7    7  7782   29%   -21   19% 
Tue Aug 12 11:36:10 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   110    6    7  7782   67%   -19   21% 
   2 Fruit 2.1               63    6    7  7782   61%   -19   23% 
   3 opponent-21.7           26    6    6  7782   57%   -19   33% 
   4 Glaurung 1.1 SMP         7    6    7  7782   54%   -19   20% 
   5 Crafty-22.2            -19    4    3 38910   47%     4   23% 
   6 Arasan 10.0           -187    6    7  7782   28%   -19   19% 
Wed Aug 13 00:53:43 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   109    7    6  7782   67%   -16   20% 
   2 Fruit 2.1               63    6    7  7782   61%   -16   24% 
   3 opponent-21.7           23    6    6  7781   56%   -16   32% 
   4 Glaurung 1.1 SMP         3    6    7  7782   53%   -16   21% 
   5 Crafty-22.2            -16    4    3 38909   47%     3   23% 
   6 Arasan 10.0           -182    7    7  7782   28%   -16   19% 
Wed Aug 13 14:19:47 CDT 2008
time control = 1+1
crafty-22.2R4
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   111    7    7  7782   68%   -20   21% 
   2 Fruit 2.1               71    6    7  7782   62%   -20   23% 
   3 opponent-21.7           17    6    6  7780   56%   -20   34% 
   4 Glaurung 1.1 SMP        11    6    7  7782   54%   -20   20% 
   5 Crafty-22.2            -20    3    4 38908   47%     4   23% 
   6 Arasan 10.0           -191    7    7  7782   28%   -20   18% 
Fri Aug 15 00:22:40 CDT 2008
time control = 1+1
crafty-22.2R4a
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   105    7    7  7782   67%   -19   21% 
   2 Fruit 2.1               62    7    6  7782   61%   -19   24% 
   3 opponent-21.7           23    6    6  7782   56%   -19   33% 
   4 Glaurung 1.1 SMP        11    6    7  7782   54%   -19   20% 
   5 Crafty-22.2            -19    4    3 38910   47%     4   23% 
   6 Arasan 10.0           -181    7    7  7782   29%   -19   19% 
Fri Aug 15 13:34:12 CDT 2008
time control = 1+1
crafty-22.2R4b
Rank Name                   Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5   111    7    7  7782   68%   -19   21% 
   2 Fruit 2.1               64    7    6  7782   62%   -19   24% 
   3 opponent-21.7           23    6    6  7782   56%   -19   33% 
   4 Glaurung 1.1 SMP         9    7    6  7782   54%   -19   20% 
   5 Crafty-22.2            -19    4    4 38910   47%     4   24% 
   6 Arasan 10.0           -188    7    7  7782   28%   -19   20% 
olympus% 
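
For anyone who wants to sanity-check these tables: the Elo column comes from the rating tool's maximum-likelihood fit (BayesElo-style), but a rough check against the score and oppo. columns is possible with the plain logistic model, ignoring the draw model. A minimal sketch, where the 0.46 score and +4 opposition average are read off the first run above:

Code: Select all

# Rough sanity check of the Elo column: invert the logistic expectation
#   s = 1 / (1 + 10**(-d/400))
# to get the Elo difference d implied by a score fraction s.  BayesElo
# fits a maximum-likelihood model with a draw parameter, so this simple
# inversion only approximates its output.
import math

def elo_diff(score):
    """Elo difference implied by a score fraction (0 < score < 1)."""
    return 400.0 * math.log10(score / (1.0 - score))

# Crafty scored 46% against opposition averaging +4 Elo in the first run:
print(round(elo_diff(0.46) + 4))   # about -24, near the reported -21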
User avatar
hgm
Posts: 28353
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: YATT.... (Yet Another Testing Thread)

Post by hgm »

bob wrote:So exactly where is any argument suggesting _anything_ related to what Karl pointed out? You are talking about errors with accepting/rejecting improvements because of the poor choice of positions. Which has exactly _zero_ to do with the current discussion.

try again...
This says exactly what Karl said: that using too few positions (and thus re-using positions, or you would never get enough games) causes the results of your runs, although close to each other, to lie far from the truth. Look, if you cannot understand what Karl wrote, that is your problem. Do not expect me to solve it for you...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: YATT.... (Yet Another Testing Thread)

Post by bob »

hgm wrote:
bob wrote:So exactly where is any argument suggesting _anything_ related to what Karl pointed out? You are talking about errors with accepting/rejecting improvements because of the poor choice of positions. Which has exactly _zero_ to do with the current discussion.

try again...
This says exactly what Karl said: that using too few positions (and thus re-using positions, or you would never get enough games) causes the results of your runs, although close to each other, to lie far from the truth. Look, if you cannot understand what Karl wrote, that is your problem. Do not expect me to solve it for you...
No it doesn't. Karl said using too few positions and repeating the games to get significant accuracy adds an unwanted correlation to the games. You said using too few positions was simply training a program to do well in those positions, but it might have nothing to do with how it does in other positions.

If you can't understand the difference, then that is something you are going to have to come to grips with. Just because you said "80 positions is not enough" (where 80 comes from is beyond me), you now want to claim that you were on this bandwagon from the beginning. Yet your post is quite clear about why you were saying "80 is not enough", because you explained that.

You have been harping on the idea that there is correlation between the games. Karl explained where it most likely comes from. Show me any reference to "correlation" in the post you cited. I looked at what I snipped above and could not find the word anywhere. So the fault you are talking about now is correlation; the fault you were talking about then was not. So which is it? Or is it neither? Who can tell. But if you have read Karl's posts, he first explained why those two runs were not really "6 sigma puzzles", and then gave a concrete suggestion on how to test his hypothesis. Which so far, seems to be holding up. All I have gotten from you are multiple pairs of worn-out shoes...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: YATT.... (Yet Another Testing Thread)

Post by bob »

jwes wrote:
hgm wrote:What Karl actually said, referring to the long story from which you have taken this snippet, was:
Karl wrote:I was talking about the standard deviation of the mean of test results in comparison to the true mean winning percentage. This seems to me the relevant number: we want to know what the expected measurement error of our test is, and we want to drive this measurement error as near to zero as possible.

I was not talking about the standard deviation between a test run and an identical or nearly-identical test run. We can make that variation exactly zero if we want to. Big deal. Who wants to get the wrong answer again and again with high precision? (Well maybe it is important to know somehow whether or not _every_ variable has been controlled, but that's a practical question, not a math question.)

I have not explained why the original two 25600-game test runs failed to get the same wrong answer twice with high precision.
bob wrote:Karl made a suggestion, Uri made a suggestion on where to get the positions, and I tried it. Had a _reasonable_ suggestion been made 8 months ago (where that comes from is unknown as these testing threads have been going on for almost 2 years now) I'd certainly have tried it. Just as I did this time.
The problem is that statistics suggested very strongly that there was a problem in your testing process, but gave little information on what that problem might be. Increasing the number of positions is an obvious (in hindsight) way to improve your testing, but I have no idea how it could have fixed the particular problem we were seeing. I (and apparently other people) could not come with good suggestions for what was causing this problem, so we remained silent (or made bad suggestions ;). HGM suggested you were manipulating the data, which seemed very unlikely. But since the data showed somthing extremely unlikely was occuring, it could not be rejected out of hand.
No argument, except for the definition of "problem". We have been through months of "your cluster is broken", or "your referee is broken", or now even the suggestion that I should have verified that each opponent played at the same strength no matter what time of day or day of the month, etc. The "problem" was not with the testing methodology or opponents; it was in the use of repeated games on the same positions, introducing correlation that was not so obvious (at least to me, and nobody else mentioned it until Karl came along). Karl pointed this out quite clearly and described how it could easily explain the rather odd results that had come up over and over in the past. The current results are still a bit unsatisfactory, as the Elo jumping around by 5-6 makes it difficult to evaluate changes that are most likely going to be smaller than that. So there is still a ways to go before the final page of this is written, because of my original goal of finding a way to measure small changes. It might well be an intractable problem even for the almost 1,000 processors that I have here. If so, it will be an intractable problem for _everybody_. That will be an interesting state of affairs, if we actually get there.
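
To put numbers on that last point, here is a back-of-the-envelope sketch (assuming independent games, a draw rate near the 23% in the tables above, and scores near 50%) of how many games it takes before the error bars shrink below a given Elo margin:

Code: Select all

import math

def games_needed(elo_margin, draw_rate=0.23, score=0.50, sigmas=2.0):
    """Games required for sigmas * (standard error in Elo) to fall
    below elo_margin.  Assumes independent game results."""
    win = score - draw_rate / 2.0            # implied win fraction
    var = win + draw_rate / 4.0 - score**2   # per-game variance of the result
    # slope of the logistic Elo curve at this score (Elo per unit of score)
    slope = 400.0 / (math.log(10.0) * score * (1.0 - score))
    se_target = elo_margin / sigmas          # required standard error in Elo
    return math.ceil(var * (slope / se_target) ** 2)

print(games_needed(5.0))   # ~15,000 games to pin down 5 Elo at 2 sigma
print(games_needed(2.0))   # ~93,000 games for a 2 Elo margin

And that is the optimistic, fully-independent case; any correlation between games pushes the requirement up further.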
Dirt
Posts: 2851
Joined: Wed Mar 08, 2006 10:01 pm
Location: Irvine, CA, USA

Re: YATT.... (Yet Another Testing Thread)

Post by Dirt »

Fritzlein wrote:
Dirt wrote:
bob wrote:So I have a random bug, that happens infrequently, that somehow biases the results where the games take a day or two to play?
That's the most likely situation, although I wouldn't personally call it a bug, as I don't think there is anything wrong with the hardware or software.
I think we might agree, but I want to be careful with the words to make sure of that. My guess is that this "random bug" (actually not a bug but an unknown change in the conditions of the testing system) doesn't in fact "bias the results" in the sense of systematically helping or hurting any bot. Rather it provides a random jolt to the results that is equally likely to hurt or help each bot in each position. People had pointed out before I joined the discussion that such a random jolt would not produce statistically out of bounds results, but it could indeed produce out of bounds results in the following case: replays within test run 1 are internally correlated, replays within test run 2 are internally correlated, but test run 1 and test run 2 (because of the "random bug") are not correlated. Now that Martin has provided hard evidence of internal correlation based on replayed positions, this scenario becomes more likely.

Although I am hard pressed to think of what might make the two runs different at all, it seems to be an easier stretch for the imagination than thinking what might make the two runs biased in different ways.
I think that is more or less what I was trying to say here, but my lack of eloquence makes that hard to see.
Fritzlein wrote:
Dirt wrote:Almost anything you could do to randomize the playing conditions should have helped. Random time controls, random evaluation offsets, random cache sizes, even files or processes left over from the previous use of the node.
Again, I think we might agree, but I want to be careful about the words. The purpose of randomizing playing conditions is to kill correlations between repeated measurements. Things like using different starting positions and different opponents will reduce correlation, so they are good changes to make.

But some kind of randomizing introduces correlations, and thus is bad. For example, randomizing the positions so much that we use unbalanced positions is bad, because that makes the result of Crafty vs. Glaurung more correlated to the result of Crafty vs. Fruit from the same position. (Unless we never re-use the position under any circumstances.)

Also some kind of randomizing just introduces noise, so it is also bad. For example, randomizing the time control so that Crafty gets 1 to 10 seconds to think and independently Fruit gets 1 to 10 seconds to think will certainly kill off correlations between Crafty vs. Fruit results on replays of the same position, but it will also conceal exactly what we are trying to measure, because it makes the winner less correlated to the true playing strength of each engine. Indeed, for statistically beautifully behaved results, we could randomize so much that each game was essentially a fair coin flip, which would take care of all statistical anomalies, but would also prevent us from measuring anything at all.

I'm sure you were not suggesting such outrageous testing procedures, but I just wanted to add caveats to the idea that introducing any kind of randomness is going to improve test design.
By "help" at that point I meant "almost anything" would reduce the "statistical anomalies". Small perturbations should have increased the accuracy of the test as well, but not as much as the increased starting positions.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more data... six runs done

Post by bob »

yak...

last two runs will take about 24 hours each rather than the normal 12. Computer room temp was acting a bit unstable, so we made a decision to power down 1/2 the nodes to reduce the A/C load until it can be looked at next week. So the last two runs are going, the 3rd is almost done but will be slower now, and the last one will take another 24 hours. I'll probably have something to report by Sunday if nothing else goes wrong...
User avatar
hgm
Posts: 28353
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: YATT.... (Yet Another Testing Thread)

Post by hgm »

bob wrote:You have been harping on the idea that there is correlation between the games. Karl explained where it most likely comes from.
bob wrote:Karl pointed this out quite clearly and described how it could easily explain the rather odd results that had come up over and over in the past.
Again, wrong, wrong, wrong! Why don't you read the following sentence a couple of hundred times, until it sinks in? (Well, perhaps make that 25,000 times.... Whatever it takes! :lol: :lol: :lol: )
Fritzlein wrote:I have not explained why the original two 25600-game test runs failed to get the same wrong answer twice with high precision. I'll let the computer scientists duke it out, and to me it is frankly less interesting because the mathematical mystery is gone.
I have always been claiming that there is more correlation for games within the same run than there is for games between the runs. For only that could drive up the variability of the runs. Karl does nothing to explain that. As should be obvious to anyone who knows statistics, and as he very clearly writes (see the highlighted sentence in my quote above).

If it was perhaps not obvious to you, because I did not explicitly make the distinction within runs / between runs, then I apologize. But adding that qualification has only become necessary since Karl altered the definition of correlation. At the time I was writing it, correlation implied 'correlation of the result of games from the same position within the runs', because if games would correlate equally between runs and within runs, in my definition there would be zero correlation between all games.

By looking at correlations in different sets of games (not relevant for explaining run variability, but relevant for explaining run error), one gets of course a different value for the correlations. No mystery there.

It appears that you are not only incapable of understanding what Karl wrote (not even after I highlight it for you...), but also what I wrote (no surprise there, I guess...). You keep focussing on the part of my ancient post where I explain what the consequence is of using results that systematically deviate from the truth. If one uses an erroneous signal as feedback to 'improve' the engine, a feedback signal that contains large noise due to the small number of positions (frozen in, because you use that same small set consistently), the effect you get (namely that the engine in fact gets weaker, but does better in your flawed test) is known as training.

Hint: you should focus on the part that explains why the measure you use for feedback (i.e. which changes to keep, and which to discard) contained an error. Perhaps you will understand it then. (Who says I am not an incurable optimist? :lol: )
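
The within-run / between-run distinction is easy to put in a toy simulation. In the sketch below (all numbers made up for illustration), every starting position carries a random per-run "jolt" (node state, cache, timing...) that is frozen for all replays inside one run but redrawn for the next run; the observed run-to-run spread then comes out about double what the naive all-games-independent estimate predicts:

Code: Select all

import math, random

random.seed(2)
POSITIONS, REPLAYS, RUNS = 40, 64, 200
BASE = 0.50    # true winning probability in every position
JOLT = 0.20    # per-position, per-run shift in the winning probability

def one_run():
    score = 0
    for _ in range(POSITIONS):
        p = BASE + random.uniform(-JOLT, JOLT)   # frozen within this run
        score += sum(random.random() < p for _ in range(REPLAYS))
    return score / (POSITIONS * REPLAYS)

runs = [one_run() for _ in range(RUNS)]
mean = sum(runs) / RUNS
sd = math.sqrt(sum((r - mean) ** 2 for r in runs) / (RUNS - 1))
naive = math.sqrt(BASE * (1.0 - BASE) / (POSITIONS * REPLAYS))
print(f"run-to-run SD {sd:.4f} vs naive i.i.d. SD {naive:.4f}")
# Replays inside a run are correlated through the frozen jolt, so the
# observed SD comes out roughly twice the naive estimate here.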
Tony

Re: YATT.... (Yet Another Testing Thread)

Post by Tony »

bob wrote:
Tony wrote:
bob wrote:
hgm wrote:
bob wrote:It now appears that there was a correlation issue, but not one anyone seemed to grasp until Karl came along.
This is still absolute bullshit. Karl stated that the results would be farther from the truth when you used fewer positions. But they would have been closer to each other, as they used the same small set of positions. Karl's remark that being closer to the truth necessarily implies that they were closer to each other was even plain wrong, as my counter-example shows.
OK. Here you go. First a direct quote from karl:

============================================================
can lead to different moves and even different game outcomes. However, we
are doing _almost_ the same thing in each repetition, so although the
results of the 64 repetitions are not perfectly correlated, they are highly
correlated, and far from mathematically independent.

When we do the calculation of the standard deviation, we will not be
understating it by a full factor of 8 as we did in the case of Trials C & D,
but we will still be understating it by almost that much, enough to explain
away the supposed mathematical impossibility. Note that I am specifically
not assuming that whatever changed between Trials E & F gave a systematic
disadvantage to Crafty. I am allowing that the change had a random effect
that sometimes helped and sometimes hurt. My assumption is merely that the
random effect didn't apply to each playout independently, but rather
affected each block of 64 playouts in coordinated fashion.
============================================================
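
(For concreteness: the near-factor-of-8 Karl mentions is just the square root of the replay count. A quick sketch, where the 400 x 64 split of the 25,600 games is an assumption for illustration:)

Code: Select all

import math

# If each of N starting positions is replayed m times and the replays are
# (nearly) perfectly correlated, the effective sample size is N, not N*m,
# so the naive standard error is too small by about sqrt(m).
N, m, p = 400, 64, 0.5      # assumed split: 25,600 games = 400 x 64
naive_se = math.sqrt(p * (1 - p) / (N * m))  # treats every game as independent
true_se = math.sqrt(p * (1 - p) / N)         # perfectly correlated replays
print(true_se / naive_se)                    # sqrt(64) = 8.0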

Now, based on that, either (a) "bullshit" is simply the first idea you get whenever you read a post here or (b) you wouldn't recognize bullshit if you stepped in it.

He said _exactly_ what I said he said. Notice the "enough to explain away..." This quote followed the first one I posted from him last week when we started this discussion.


And based on the results so far, his idea of eliminating the black/white pairs may also be a good one, since a pair of games, same players, same position, is going to produce a significant correlation for positions that are not absolutely equal, or which are not equal with respect to the two opponents.
This is also wrong. Unbalanced positions are bad whether you pair them or not. It becomes more difficult to express a small improvement in a game that you are almost certainly going to lose anyway. The improvement then usually only means you can delay the inevitable somewhat longer.
Again, I don't buy it at all. If a position is that unbalanced, the two outcomes will be perfectly correlated and cancel out. A single game per position gives twice as many positions for the same number of games, hopefully twice as many that are not too unbalanced.
Is this true?

With equal strength (50% win chance):

1 unbalanced position, played twice => 1 - 1
1 unbalanced, 1 balanced => 1.5 - 0.5

perfect world result: 1 - 1

With unequal strength (100% win chance for one side):

1 unbalanced position, played twice => 1 - 1
1 unbalanced, 1 balanced => 2 possibilities:
stronger gets winning position => 2 - 0
weaker gets winning position => 1 - 1

perfect world result: 2 - 0

Tony
The issue was "independent or non-correlated results." In a 2-game match on an unbalanced position, the two results are correlated. Think about the extreme case. 100 positions, all unbalanced. So you get 100 wins and 100 losses whatever you change. Now take 200 positions, 100 unbalanced, 100 pretty even. Changes you make are not going to affect the unbalanced results, but will affect the other 100 games. Which set will give the most useful information???
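
That extreme case is small enough to simulate directly. A toy sketch (assumed numbers, draws ignored) showing that color-reversed pairs on decided positions are pinned at 50% no matter how big the strength edge is, while balanced pairs move with it:

Code: Select all

import random

random.seed(3)

def pair_score(balanced, elo_edge):
    """Engine A's wins from one position played once with each color."""
    if not balanced:
        return 1    # the winning side always wins: every pair splits 1-1
    p = 1.0 / (1.0 + 10.0 ** (-elo_edge / 400.0))  # logistic win probability
    return sum(random.random() < p for _ in range(2))

for edge in (0, 50):
    unbalanced = sum(pair_score(False, edge) for _ in range(100))
    balanced = sum(pair_score(True, edge) for _ in range(100))
    print(f"edge {edge:3} Elo: unbalanced block {unbalanced}/200, "
          f"balanced block {balanced}/200")
# The 100 unbalanced pairs score 100/200 for any edge; only the
# balanced pairs respond when the engine improves.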
Ah, yes.

Playing 2 games per position might bring us closer to the "real" Elo, but we're only interested in the relative results.

New proposal: how about varying the starting positions?

Put enormous.pgn in a game database, and play with the 10000 positions that scored closest to 50% (as black and white). Add these games, take the positions closest to 50%, etc.

That way, we improve the chance of "random" positions (i.e. we filter out unbalanced positions).

We could even do this on a per-opponent basis, to make sure that certain kinds of positions a certain opponent handles badly don't get overvalued.

Tony
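
A sketch of how Tony's 50% filter might look in practice, assuming the python-chess library is available; enormous.pgn is Tony's hypothetical database, and keying positions by the first 16 plies is an arbitrary choice for illustration:

Code: Select all

# Score each opening line in a PGN collection and keep the lines whose
# historical score is closest to 50%.  enormous.pgn, PLIES and KEEP
# are assumptions for illustration.
import chess.pgn

PLIES = 16     # identify a "starting position" by the first 16 plies
KEEP = 10000

RESULT = {"1-0": 1.0, "0-1": 0.0, "1/2-1/2": 0.5}

stats = {}     # FEN -> [total score for White, number of games]
with open("enormous.pgn") as f:
    while True:
        game = chess.pgn.read_game(f)
        if game is None:
            break
        score = RESULT.get(game.headers.get("Result"))
        if score is None:
            continue
        board = game.board()
        for ply, move in enumerate(game.mainline_moves()):
            if ply >= PLIES:
                break
            board.push(move)
        entry = stats.setdefault(board.fen(), [0.0, 0])
        entry[0] += score
        entry[1] += 1

# rank positions by the distance of their mean score from 50%
ranked = sorted(stats.items(),
                key=lambda kv: abs(kv[1][0] / kv[1][1] - 0.5))
for fen, (total, games) in ranked[:KEEP]:
    print(fen)

Doing this per opponent, as Tony suggests, would just mean keeping one stats table per opponent.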
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: YATT.... (Yet Another Testing Thread)

Post by bob »

hgm wrote:
bob wrote:You have been harping on the idea that there is correlation between the games. Karl explained where it most likely comes from.
bob wrote:Karl pointed this out quite clearly and described how it could easily explain the rather odd results that had come up over and over in the past.
Again, wrong, wrong, wrong! Why don't you read the following sentence a couple of hundred times, until it sinks in? (Well, perhaps make that 25,000 times.... Whatever it takes! :lol: :lol: :lol: )
Fritzlein wrote:I have not explained why the original two 25600-game test runs failed to get the same wrong answer twice with high precision. I'll let the computer scientists duke it out, and to me it is frankly less interesting because the mathematical mystery is gone.

So you still refuse to read _all_ the words: "not explained... with high precision." You apparently can't parse that and grasp the basic idea: "I have explained how this might happen, and now it is not an issue because the 'mystery' is gone."

No wonder these conversations are hopeless. But thankfully we have people like Karl who are not in this for the sake of being combative; they simply want to help.

I have always been claiming that there is more correlation for games within the same run than there is for games between the runs. For only that could drive up the variability of the runs. Karl does nothing to explain that. As should be obvious to anyone who knows statistics, and as he very clearly writes (see the highlighted sentence in my quote above).

If it was perhaps not obvious to you, because I did not explicitly make the distinction within runs / between runs, then I apologize. But adding that qualification has only become necessary since Karl altered the definition of correlation. At the time I was writing it, correlation implied 'correlation of the result of games from the same position within the runs', because if games would correlate equally between runs and within runs, in my definition there would be zero correlation between all games.

By looking at correlations in different sets of games (not relevant for explaining run variability, but relevant for explaining run error), one gets of course a different value for the correlations. No mystery there.

It appears that you are not only incapable of understanding what Karl wrote (not even after I highlight it for you...), but also what I wrote (no surprise there, I guess...). You keep focussing on the part of my ancient post where I explain what the consequence is of using results that systematically deviate from the truth. If one uses an erroneous signal as feedback to 'improve' the engine, a feedback signal that contains large noise due to the small number of positions (frozen in, because you use that same small set consistently), the effect you get (namely that the engine in fact gets weaker, but does better in your flawed test) is known as training.
Yes, I suppose I can get confused when _you_ write things. Because you say A, then someone comes along and says B, and once it becomes apparent that B is pretty accurate, you come back and say "I said A, but I obviously meant B as well, you were just too stupid to realize that..."

If you want credit for pointing out that there was correlation caused by _the positions_ and _the opponents_, then feel free to take it. But if you read back over all your posts on the topic, you have been claiming all along that this correlation was introduced in some other way and was intrinsic to the cluster testing being done. You have said that dozens of times. Suddenly we see that it is _not_ related to the cluster at all, which I have also said dozens of times. So if you want the credit, feel free to take it. But as far as I am concerned, the explanation came from Karl, along with a suggestion on how to directly address the explanation and eliminate the problem.

Hint: you should focus on the part that explains why the measure you use for feedback (i.e. which changes to keep, and which to discard) contained an error. Perhaps you will understand it then. (Who says I am not an incurable optimist? :lol: )
You are an incurable "something", that is for sure...