hgm wrote:Perhaps it is better then to cast the message into an anecdotal form, so that the mathematically shy have a better chance of following the discussion.
I'm hardly what one would call "mathematically shy," but I have learned over the years to avoid the trap of "to the man who has a hammer, everything looks like a nail." It is tough to apply statistical analysis to something whose underlying properties are not well understood, if those properties might violate certain key assumptions (independence of samples, for just one). So many new characteristics have been discovered during these tests, and there are most likely more yet to be discovered.
hgm wrote:The average healthy adult male is 6 feet tall, give or take 8 inches. Yet if I take a walk through a crowded shopping mall, occasionally I encounter people who measure 6'9". Once I even encountered a giant of 6'10", and I thought: "Wow, this must be the tallest guy in the county. But since I shop here every day, I was bound to bump into him some time!"
But when I meet someone who is 30 feet tall, I would exclude the idea that he actually consisted of a single part. Much more likely it would be someone riding a giraffe, despite the fact that giraffes are not very common in shopping malls. If I were to think: "Well, if that tall guy I met the other day can be 6'10", his older brother can just as well be 30' tall, and this must be him!"... well, I guess even the mathematically challenged would not have much difficulty finishing that sentence.
So indeed it can be both: some people of above-average height are merely very tall, while other heights simply do not occur even in a population of 6 billion. And even the most mathematically illiterate can usually see the difference between an excess of 2 inches and an excess of 24 feet.
But your story has _one_ fatal flaw. What happens if you bump into someone over 8' tall every few days? Do you still bury your head in the sand and say "can't be"? Because this is the kind of result I have gotten on _multiple_ occasions (and by multiple I mean more than just one or two); that is, using normal Elo numbers with +/- error bars, the minimum for A is larger than the maximum for B. You seem to want to imply that I only post such numbers here when I encounter them. The only reason I posted the last set was that it was the _first_ trial that produced BayesElo numbers for each run. It was the _first_ time I had posted results from Elo computations as opposed to just raw win/lose/draw results.
I posted several results when we started. I increased the number of games and posted more. I have not posted anything since, because nothing changed. We were still getting significant variability, and I had proven that it was a direct result of timing jitter, or at least that removing timing jitter eliminated the problem completely. There was nothing new to say; we were still having the same problems within our "group," where the same big run said "A is better" and a repeat said "B is better." I then ran four "reasonable" runs and two _big_ runs, ran the results through BayesElo, and thought the results were again "interesting," so I posted them here, primarily to see what Remi thought and whether he thought I might be overlooking something or doing something wrong.
But I run into these 8' people far too regularly to accept "your ruler is wrong," or "you are cherry-picking," or whatever. I wrote when I started this that there is something unusual going on. It appears to be related to the small number of positions and the inherent correlation from using the same positions and the same opponents over and over. OK, that way of testing was wrong. It wasn't obviously wrong, as nobody had suggested what Karl suggested previously. Theron mentioned that he had _also_ been using that method of testing and then discovered (whether from Karl's post or from his own evaluation was not clear) that this approach led to unreliable results.
So, to recap, these odd results have been _far_ more common than once a decade. I, apparently like CT, had thought: "OK, N positions is not enough, but since the results are so random, playing more games with the same positions will address the randomness." But it doesn't. And until Karl's discussion, the problem was not obvious to me.
I have been trying to confirm that his "fix" actually works. A waste of time, you say. But only if it merely confirms that his hypothesis was correct. There was no guarantee of that initially; the cause could have been something else, from the simple (random PGN errors) to the complex (programs varying their strength by day or time). I still want to test with one game per position rather than two, to see whether the "noise" mentioned by some will be a factor or not. I still want to test with fewer positions, to see whether the "correlation effect" returns in some unexpected way.
hgm wrote:So the mystery is all hidden in the numbers. Non-overlapping 95% confidence intervals occur once every 178 pairs. (And remember the birthday paradox: if you have 10 people, you already have 45 pairs!) But to have them disjoint by so much that another 95% confidence interval could be fit in between... that is much rarer. Just as being 10" taller than average is rare, but appreciably more common than being 24' taller than average. And the beauty of math is that it enables you to calculate exactly how often a deviation of any given magnitude will occur. Sometimes that is once every 178 tries. And sometimes that is once in a million.
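(For concreteness, here is where figures of that order come from: a minimal Python sketch, assuming the two runs estimate the same true strength with equal, normally distributed errors, so the only question is how often the two estimates land far apart. The 1.96 multiplier for a 95% interval is the only input.)

```python
import math

def upper_tail(z):
    """P(Z > z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2))

z95 = 1.96  # half-width of a 95% interval, in units of the standard error s

# The difference of two independent estimates of the same value has
# standard deviation s*sqrt(2).  The two intervals are disjoint when the
# estimates differ by more than 2*1.96*s.
p_disjoint = 2 * upper_tail(2 * z95 / math.sqrt(2))
print(f"disjoint 95% intervals: p = {p_disjoint:.5f} (about 1 in {1/p_disjoint:.0f})")

# Disjoint with room for a third 95% interval in between: the estimates
# must differ by more than 4*1.96*s.
p_wide = 2 * upper_tail(4 * z95 / math.sqrt(2))
print(f"gap fits another interval: p = {p_wide:.1e} (about 1 in {1/p_wide:.0f})")
```

The first figure comes out at roughly 1 in 180, matching the "once every 178 pairs" above; the second lands in the one-in-tens-of-millions range, the kind of deviation that should essentially never be seen.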
But note that it does not _always_ work that way. My 40-position testing is a case in point: the math did _not_ predict the outcomes anywhere near correctly. The positions were not even chosen by me; _many_ are using those positions to test their programs. Hopefully no longer. But then they are going to be stuck, because using my positions requires a ton of computing power. So what is left? If they add search extensions, small test sets might detect the improvement, since it will be significant. Null move? Less significant. Reductions? Even less significant. Eval changes? Very difficult to measure. Yet everyone wants to, and needs to, do so. Unless someone had run these tests and posted the results for large numbers of games, this would _still_ be an unknown issue, and most would be making decisions that are almost random. And unless someone verifies that the new positions are usable, the same issue could be lying in wait again. Far too often, theory and reality differ because of an unknown feature that is present.
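To put rough numbers on "very difficult to measure," here is a back-of-the-envelope sketch (Python; the per-game score standard deviation of 0.4 is an assumed typical value, and this deliberately ignores the position-correlation problem discussed in this thread, so the figures are best-case counts of fully independent games):

```python
import math

def games_needed(elo_diff, per_game_sd=0.4, z=1.96):
    """Rough count of independent games needed so that an Elo edge of
    elo_diff shows up as a ~2-sigma deviation from a 50% score."""
    expected_score = 1.0 / (1.0 + 10 ** (-elo_diff / 400.0))  # Elo logistic model
    edge = expected_score - 0.5
    return math.ceil((z * per_game_sd / edge) ** 2)

for diff in (50, 20, 10, 5):  # big extension change ... small eval tweak
    print(f"{diff:3d} Elo -> about {games_needed(diff):6d} games")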
hgm wrote:This is why serious scientists do quantitative statistical analysis on their data, recording the frequency of every deviation, to see whether the data can be trusted or not, and to what extent.
And I believe that I had concluded from day 1 that the 40 positions were providing unusual data. But if you do not understand why, you can continue to pick positions from now on and still end up with the same problem. Karl gave a simple explanation of why the SD is wrong here: it should shrink as the number of games increases, assuming the games are independent. But if the 40 positions produce the same results each time, the statistics will say the SD gets smaller as you play more games, when in fact it does not change at all (a small simulation of this effect follows the list below). I think it has been an interesting bit of research. I've learned, and posted, some heretofore _unknown_ knowledge. Examples:
1. I thought I needed a wide book, and then "position learning," to avoid playing the same losing game over and over. Not true. It is almost impossible to play the same game twice, even if everything is set up identically for each attempt.
2. I thought programs were _highly_ deterministic in how they play, except when you throw in the issue of parallel search. Wrong.
3. If someone had told me that allowing a program to search _one_ more node per position would change the outcome of a game, I would have laughed. It is true. In fact, one poster searched N and N+2 nodes in 40 positions and got (IIRC) 11 different games out of 80 or 100. Nobody knew that prior to my starting this discussion. He played the same two programs from the same position for 100 games and got just one duplicate game. Unexpected? Yes, until I started this testing and saw just how pronounced this effect was.
4. I (and many others) thought that hash collisions are bad, and that they would wreck a program. Cozzie and I tested this and published the results. Even with an error rate of one bad hash collision for every 1000 nodes, it has no measurable effect on search quality.
5. Beal even found that random evaluation was enough to let a full-width search play reasonably. Nobody had thought that prior to his experiment.
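Here is the small simulation promised above, a minimal sketch (Python; the 55% scoring rate is an arbitrary assumption, and the "each position always replays the same game" extreme is deliberately exaggerated to make the correlation point visible). It shows the reported standard error shrinking as the same 40 positions are replayed, while the real spread of the match score does not move:

```python
import numpy as np

rng = np.random.default_rng(0)

TRUE_P = 0.55   # assumed true scoring rate of A vs. B (illustrative only)
N_POS = 40      # number of starting positions
TRIALS = 20000  # Monte Carlo repetitions of the whole experiment

for repeats in (1, 5, 25):  # how often each position is replayed
    scores = []
    for _ in range(TRIALS):
        # One Bernoulli outcome per position; replaying the position just
        # repeats that outcome (the fully correlated, deterministic case).
        per_pos = rng.random(N_POS) < TRUE_P
        games = np.repeat(per_pos, repeats)
        scores.append(games.mean())
    n = N_POS * repeats
    nominal_se = np.sqrt(TRUE_P * (1 - TRUE_P) / n)  # what naive statistics reports
    actual_sd = float(np.std(scores))                # what actually happens
    print(f"{n:4d} games: nominal SE {nominal_se:.4f}  actual SD {actual_sd:.4f}")
```

The reported error bar falls like 1/sqrt(games), but the real uncertainty stays pinned at the 40-independent-samples level, which is exactly the mismatch described above.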
As I have said previously, the computer chess programs of today have some highly unusual (and unexpected, and possibly even undiscovered) properties that manifest themselves in ways nobody knows about or understands. I had also said previously that standard statistical analysis (which includes Elo) might not apply as cleanly to computer chess programs, because of unknown things that are happening but that have not been observed specifically. Some of the above fall right into that category.
It has been interesting, and it isn't over.