bob wrote: I'm hardly what one would call "mathematically shy", but I have discovered over the years that one has to learn to avoid the "to the man who has a hammer, everything looks like a nail" trap. It is tough to apply statistical analysis to something that has underlying properties that are not well understood, if those properties might violate certain key assumptions (independence of samples, for just one).
But the mathematical properties of the system you are studying are totally understood. They should be independent repeats of the same experiment (position and opponent combination), with fixed (in time) probabilities of ending in a win, draw or loss. That is all that is necessary to do 100% accurate statistical analysis. And if the results in practice do not conform to that analysis in any credible way, it means that the assumptions are apparently not valid. And there were only three: that the scores of an individual game were 0, 1/2 or 1 (and it stretches even my imagination that you would bungle that), that the probabilities were fixed in time, and that the games were independent.
That games starting from a different position might have a different WDL probability distribution, as Karl remarked, does not have the slightest impact on the analysis for the variability.
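To make that concrete, here is a minimal sketch of the prediction those three assumptions buy you. The numbers are made up purely for illustration: an 800-game match with assumed 35% win, 35% draw, 30% loss probabilities.

[code]
import math

def match_score_sd(n_games, p_win, p_draw):
    # expected score and standard deviation of the total score of n_games
    # independent games with fixed (time-constant) win/draw probabilities
    mu = p_win + 0.5 * p_draw                 # expected score per game
    var = p_win + 0.25 * p_draw - mu * mu     # E[s^2] - E[s]^2, s in {0, 1/2, 1}
    return n_games * mu, math.sqrt(n_games * var)

# illustration only: an 800-game match at assumed 35% win, 35% draw, 30% loss
mean, sd = match_score_sd(800, 0.35, 0.35)
print(f"expected score {mean:.1f} +/- {sd:.1f} (1 sigma)")
[/code]

Any real run can then be compared against that mean and sigma; mixing positions with different WDL distributions changes only the numbers you plug in, not the recipe.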
[quote]There have been so many new characteristics discovered during these tests, and there are most likely more things yet to be discovered.[/quote]
Well, so far I have not seen anything being discovered...
[quote]But your story has _one_ fatal flaw. What happens if you bump into someone over 8' tall every few days? Do you still hide your head in the sand and say "can't be"?[/quote]
Of course not. I would say: "obviously the assumption that male adults are 6' +/- 8" tall is not valid". Like I say to your 'frequent' observations: "obviously the assumption of independent games or time-constant WDL probability is not fulfilled". It is you who hide your head in the sand, insisting: "Can't be! My cluster is independent!" <bang head> "Can't be!" <bang head>. (Painful! Although computers are made out of silicon, they are not quite as soft as sand!)
Because this is the kind of results I have gotten on _multiple_ occasions (and multiple does not mean just more than 1)... that is, if I used normal Elo numbers +/- error, the min for A is larger than the max for B. You seem to want to imply that I only post such numbers here when I encounter them.
Well, that would be a natural thing to do. Why post data that is absolutely unremarkable? But sorry, "more than 1" won't cut it. That is not science. To make such a claim without actually knowing exactly how often each deviation occurs is simply not credible. Because, like you remarked, no result is absolutely impossible, just astronomically improbable. You brought this skepticism on our part upon yourself by very often (multiple times, where in this case multiple actually does not mean <= 1) posting data here complaining about the hypervariability, while in fact the variability was lower than expected.
The only reason I posted the last set was that it was the _first_ trial that produced BayesElo numbers for each run. It was the _first_ time I had posted results from Elo computations as opposed to just raw win/lose/draw results.
Be that as it may, that does not mean this is a non-issue. Would you also have posted it here if, this first time, the BayesElo ratings of Crafty had been exactly the same? If you switch on an internal filter to only post something that is remarkable, it would still be "cherry picking" even if it happened the first time. But let us not quarrel over that, as it is counter-productive: only you can know if you would have posted a non-remarkable result.
I posted several results when we started. I increased the number of games and posted more. And have not posted anything since because nothing changed. We were still getting significant variability,
But the problem is that you (based on all these posts) seem to call totally normal, unavoidable sampling noise "significant variability". And if that is your criterion, it does not mean a thing that you were still getting "significant" variability. You have cried "8' giant" too often when a perfectly normal 6' individual left your premises. So now people doubt your eyesight...
So now your remarks about variability would only impress us if we knew your standards, which can only be achieved by telling us exactly what percentage of your data showed 3-sigma deviations, how much 4-sigma, etc. But you fail to do it, with a stubbornness that creates a very strong impression that you don't know this yourself...
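For what it is worth, the kind of tally I mean is trivial to produce. A sketch, with entirely hypothetical run scores, assuming each run is 800 games with the same made-up 35% win / 35% draw split as above:

[code]
import math

def tally_deviations(run_scores, n_games, p_win, p_draw):
    # count runs whose total score is 2, 3 or 4 sigma away from the value
    # expected under fixed probabilities and independent games
    mu = p_win + 0.5 * p_draw
    var = p_win + 0.25 * p_draw - mu * mu
    sd = math.sqrt(n_games * var)
    for k in (2, 3, 4):
        hits = sum(abs(s - n_games * mu) >= k * sd for s in run_scores)
        expect = math.erfc(k / math.sqrt(2))   # two-sided normal tail
        print(f">= {k} sigma: {hits}/{len(run_scores)} runs "
              f"(theory: about {expect:.2%} of runs)")

# entirely hypothetical scores of ten 800-game runs, 35% win / 35% draw assumed
tally_deviations([415, 423, 409, 431, 402, 418, 427, 411, 420, 399],
                 800, 0.35, 0.35)
[/code]

The erfc term is just the two-sided normal tail, so the printout shows directly whether the observed exceedances are in line with theory.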
I had proven that it was a direct result of time jitter, or at least that removing timing jitter eliminated the problem completely. And there was nothing new to say. We were still having the same problems within our "group", where the same big run said "A is better" and a repeat said "B is better". And nothing new was happening. I then ran four "reasonable" runs and two _big_ runs, ran the results thru BayesElo, and thought that the results were again "interesting", and posted here, primarily to see what Remi thought and whether he thought I might be overlooking something or doing something wrong.
But I run into these 8' people way too regularly to accept "your ruler is wrong, or you are cherry-picking, or whatever". I wrote when I started this that there is something unusual going on. It appears to be related to the small number of positions and the inherent correlation when using the same positions and same opponents over and over.
Well, here we obviously continue to disagree, on both counts. It was not related to that, and nothing 'appears' so far.
OK, that way of testing was wrong. It wasn't obviously wrong, as nobody suggested what Karl suggested previously.
Well, this becomes a bit of a sticky subject now. Because we can no longer know if you mean that no one suggested what you think Karl suggested (which is of course true, as what you ascribe to Karl is simply untrue and nonsensical), or if you mean that no one said what Karl actually said (which would be false, as I said the same thing 11 months ago).
Theron mentioned that he has _also_ been using that method of testing and then discovered (whether from Karl's post or his own evaluation was not clear, however) that this approach led to unreliable results.
Of course it led to unreliable results. But that is not the same as hyper-variability. As I wrote 11 months ago, the results would even be unreliable if you had zero variability (by playing a billion games).
So, to recap, these odd results have been _far_ more common than once a decade,
That depends on your standards of 'odd', which apparently differ from the usual...
... I, apparently like CT, had thought: "OK, N positions is not enough, but since the results are so random, playing more games with the same positions will address the randomness." But it doesn't. And until Karl's discussion, the problem was not obvious to me.
And the only thing that is obvious to us, is that what you consider obvious is obviously wrong...
I have been trying to confirm that his "fix" actually works. A waste of time, you say. But _only_ if it confirms his hypothesis was correct. There was no guarantee of that initially; the cause could have been something else, from the simple (random PGN errors) to the complex (programs varying their strength based on day or time). I still want to test with one game per position rather than 2, to see whether the "noise" mentioned by some will be a factor or not. I still want to test with fewer positions to see if the "correlation effect" returns in some unexpected way.
I would applaud the latter. I think this is very important, because if it does not return, and the 38,000-game runs with 80 positions (40, black and white) show similar variability to the 'many-positions' runs you did recently, the problem might recur just as easily in any scheme.
But note that it does not _always_ work that way. My 40-position testing is a case in point. The math did _not_ predict outcomes anywhere near correctly.
Well, math is absolute truth. If you get a result that deviates from the mathematical prediction, it can only mean that the assumptions on which that prediction was based cannot have been satisfied. And the only assumptions going into the prediction were that the results of games from the same position were drawn independently from the same probability distribution in both runs.
The positions were not even chosen by me. _Many_ are using those positions to test their programs. Hopefully no longer. But then, they are going to be stuck, because using my positions requires a ton of computing power. So what is left? If they add search extensions, small test sets might detect the improvement since it will be significant. Null-move? Less significant. Reductions? Even less significant. Eval changes? Very difficult to measure. Yet everyone wants to and needs to do so. Unless someone had run these tests and posted the results for large numbers, this would _still_ be an unknown issue and most would be making decisions that are almost random. Unless someone verifies that the new positions are usable, the same issue could be waiting in hiding again. Far too often, theory and reality are different, due to an unknown feature that is present.
Well, in general this way of testing is doomed to failure, even if you can eliminate the hypervariability. The normal statistical fluctuation, even on uncorrupted data, is simply too large to be useful unless you play billions of games.
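To put a rough number on that: near equality, one Elo point is worth about ln(10)/1600, roughly 0.0014 of score per game, so the number of games needed grows with the inverse square of the difference you want to resolve. A sketch, with an assumed 35% draw rate and a 2-sigma criterion (both numbers only illustrative):

[code]
import math

def games_needed(elo_diff, p_draw=0.35, sigmas=2.0):
    # rough count of independent games before a difference of elo_diff
    # stands out by `sigmas` standard deviations (assumed draw rate p_draw)
    dscore = elo_diff * math.log(10) / 1600    # score gain per game near 50%
    var = 0.25 * (1.0 - p_draw)                # per-game score variance at ~50%
    return int(sigmas ** 2 * var / dscore ** 2)

for d in (10, 2, 0.5):
    print(f"{d:>4} Elo: about {games_needed(d):,} games")
[/code]

Sub-Elo differences, the kind a typical eval tweak makes, quickly push these counts into the millions and beyond.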
This is why I designed the tree-game comparison method, which eliminates the sampling noise.
And I believe that I had concluded from day 1 that the 40 positions were providing unusual data. But if you do not understand why, you can continue to pick positions from now on and still end up with the same problem.
Well, if I needed to do accurate measurements I would use tree-games, and I would not have this problem, as the initial position makes up only a negligible fraction of the test positions there.
Karl gave a simple explanation about why the SD is wrong here.
So you seem to think. Wrongly so.
It should shrink as the number of games increases, assuming the games are independent. But if the 40 positions produce the same results each time, the statistics will say the SD is smaller if you play more games, when in fact it does not change at all.
It does not change because it is already zero. Would you have expected it to become negative with more games, or what?
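A toy simulation of that degenerate case (numbers invented, the per-position results simply frozen) shows exactly this: the error bar computed under the independence assumption keeps shrinking as you add games, while the actual run-to-run spread is already zero.

[code]
import math, random

random.seed(1)
N_POS, RUNS = 40, 20

# the fully correlated extreme: every position/opponent combination always
# produces the same result, no matter how often it is repeated
frozen = [random.choice((0.0, 0.5, 1.0)) for _ in range(N_POS)]

for games_per_pos in (1, 10, 100):
    n_games = N_POS * games_per_pos
    # every run replays identical games, so every run ends on the same fraction
    fracs = [sum(frozen) * games_per_pos / n_games for _ in range(RUNS)]
    mean = sum(fracs) / RUNS
    observed = math.sqrt(sum((f - mean) ** 2 for f in fracs) / RUNS)
    claimed = math.sqrt(0.25 / n_games)   # 1/sqrt(N) error bar from assuming
                                          # independent games (max per-game variance)
    print(f"{n_games:5d} games: run-to-run SD {observed:.4f}, "
          f"independence formula {claimed:.4f}")
[/code]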
I think it has been an interesting bit of research. I've learned, and posted some heretofore _unknown_ knowledge. Examples:
1. I thought I needed a wide book, and then "position learning", to avoid playing the same losing game over and over. Not true. It is almost impossible to play the same game twice, even if everything is set up the same way for each attempt.
Not generally valid, as we discussed before. Most people have exactly the opposite problem...
2. I thought programs were _highly_ deterministic in how they play, except when you throw in the issue of parallel search. Wrong.
Not wrong for most programs. To be sure of sufficiently random behavior, it is better not to rely on any uncontrollable jitter, but to randomize explicitly, as the Rybka team does.
3. If someone had told me that allowing a program to search _one_ more node per position would change the outcome of a game, I would have laughed. It is true. In fact, one poster searched N and N+2 nodes in 40 positions and got (IIRC) 11 different games out of 80 or 100. Nobody knew that prior to my starting this discussion. He played the same two programs from the same position for 100 games and got just one duplicate game. Unexpected? Yes, until I started this testing and saw just how pronounced this effect was.
4. I (and many others) thought that hash collisions are bad, and that they would wreck a program. Cozzie and I tested this and published the results. And even with an error rate of one bad hash collision for every 1000 nodes, it has no measurable effect on the search quality.
5. Beal even found that random evaluation was enough to let a full-width search play reasonably. Nobody had thought that prior to his experiment.
I have said previously that the computer chess programs of today have some highly unusual (and unexpected, and possibly even undiscovered) properties that manifest themselves in ways nobody knows about nor understands. I had also said previously that standard statistical analysis (which includes Elo) might not apply equally to computer chess programs because of unknown things that are happening but which have not been observed specifically. Some of the above fall right into that category.
It has been interesting, and it isn't over.