hgm wrote: The problem is that you are so uninterested in doing statistical analysis on your data that you don't seem able to distinguish a one-in-a-million event from a one-in-a-hundred event.
What "once in a million event have you seen?" Let's back up just a minute. I reported a while back that when playing 80 game matches, I discovered that it was impossible to use them to detect whether a change to my program produced an improvement or not. In fact, I posted that it was impossible to use these 80 game matches to predict whether A was better than B or vice-versa, within a pool of players that contained my program plus several well-known program that are all quite strong.
You seem to be worrying about the fact that four consecutive matches have a wild variance when the next 10 do not. I don't care about that at the moment. I care about the wild variance, _period_, because when I run a test I can get any one of those random match samples as the result, and that doesn't help me in any way.
Statistically I don't care about how much these samples do or do not vary. I will analyze this at some point. But for now, I want a test that says "better" or "worse". And the variability I am seeing in 80-game matches makes that fail. I then went to 4-match sets and discovered that was not enough either. I explained this in the past during these discussions as well. I took a pretty simple change that I knew made a difference (I changed the null-move reduction depth, I eliminated an important search extension, etc.), and running 1-match samples randomly told me "better" or "worse" for any of those cases. 320-game matches did exactly the same thing. Even though I was not talking about very small changes, I still was unable to determine whether the change was good or bad.
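To put a rough number on how often a single 80-game match points the wrong way, here is a small Monte Carlo sketch. The 20-Elo edge, the 32% draw rate, and the logistic Elo-to-score model are illustrative assumptions on my part, not figures taken from the actual tests.

[code]
# Sketch: how often does one 80-game match pick the wrong winner?
# The 20-Elo edge, 32% draw rate, and logistic Elo model are assumptions.
import random

def match_score(games=80, elo_edge=20.0, draw_rate=0.32):
    """Return the stronger side's score, assuming independent games."""
    expected = 1.0 / (1.0 + 10.0 ** (-elo_edge / 400.0))  # expected score per game
    win_prob = expected - draw_rate / 2.0                  # P(win a single game)
    score = 0.0
    for _ in range(games):
        r = random.random()
        if r < win_prob:
            score += 1.0        # win
        elif r < win_prob + draw_rate:
            score += 0.5        # draw
    return score

trials = 10_000
wrong = sum(1 for _ in range(trials) if match_score() < 40.0)
print(f"match points the wrong way in ~{100.0 * wrong / trials:.0f}% of trials")
[/code]

With those assumed numbers the weaker side still wins such a match a nontrivial fraction of the time, which is exactly the "better or worse" failure described above.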
That was what I reported initially: that the results were so random it was surprising. "How" random was (and still is) not so important, because I just wanted to know, "OK, the programs all have a strong non-deterministic aspect to their play, so how many games should I play to be able to measure this effect accurately?"
That is a hard question to answer. The more significant the change, the fewer games I should need, in theory. But I made some significant changes and found that the number of games needed to determine the effect with high accuracy was much larger than I had ever heard anyone mention in the past.
I first decided to try to play enough games to drive the influence of this non-determinism on the result down to zero, if possible. It wasn't possible, but I drove it down to a very low level by playing 64 matches against an opponent. And I discovered that I could then determine whether a change had any significant effect on performance.
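For scale, here is a back-of-the-envelope sketch of how the resolution improves with more games; the 0.4 per-game standard deviation and the 2-sigma criterion are assumed values chosen only for illustration.

[code]
# Sketch: Elo resolution vs. number of games (the 0.4 per-game SD and the
# 2-sigma criterion are assumed values, not measured ones).
import math

def elo_resolution(games, per_game_sd=0.4, sigmas=2.0):
    """Smallest Elo edge distinguishable from zero at about `sigmas` sigma."""
    se = per_game_sd / math.sqrt(games)     # standard error of the mean score
    score = 0.5 + sigmas * se               # score needed to clear the noise
    return -400.0 * math.log10(1.0 / score - 1.0)

for n in (80, 320, 5120):                   # 1 match, 4 matches, 64 matches
    print(f"{n:5d} games: roughly {elo_resolution(n):5.1f} Elo resolution")
[/code]

Under those assumptions a single 80-game match only resolves differences of several dozen Elo, while thousands of games are needed to get down to single digits, which is in line with the experience described above.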
Somehow we get neck-deep in "is that atypical?" and such, which at the time I was not interested in analyzing. I could _see_ the randomness, and I realized that I could not use that kind of random observation to make any sort of meaningful decision.
So, to assist us in _our_ problem of trying to evaluate changes that I make, or that Tracy or Mike make, I started playing longer and longer tests, and I tried longer time controls to see if that would help (it didn't). And I arrived at what we do today, which is reasonably accurate.
I do plan, when I have time, to let my stat friend loose to determine how many games we need to be playing to get reasonable results, and see if that number is smaller than what I am using (which I doubt, based on the simple experimental tests we have run, and we have run a _bunch_ of them already, as I have mentioned).
"once in a million" is not something I see often since I have only run a paltry 100,000 80 game matches so far. And nothing I have given here represents any such rare event, taken in context. In a 32 match test, whether the first 4 are wildly random, or whether the wildly random matches are distributed uniformly makes no difference _to me_. The fact that they exist at all is enough to influence how I test. You want to latch onto a tiny sample and say "that is way too rare to possibly happen." You are wrong. It _did_ happen as given. And I continue to get significant randomness. I don't care whether I should not get 4 wild results in a row. The fact that I get 4 wild results at all is enough to show that I can't reliably depend on one single match to predict anything. And the more such wild results I get, the more forceful that becomes.
So, in summary, can we stop trying to take individual runs apart and focus on the big picture? I will post a long statistical analysis when we get the time to do it at some point in the future. Right now, all I care about is trying to determine whether a change is good or not. Nothing more, nothing less.
If you believe the samples I posted are a one-in-a-million event, then I just flipped heads 20 times in a row. It happens. Again, I could not tell you the circumstances under which that original 4-game sample was produced. I did explain exactly how that last 10-12 game sample was produced. And I have posted others with less (but still highly significant) variance, as those were what I had at the time. There's nothing to be gained by trying to make this data up. I'm just showing a "work in progress" result here and there as the experimental testing progresses.
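Just to spell out the arithmetic behind the coin-flip comparison: twenty fair flips all landing heads has probability 2^-20, which is roughly one in a million.

[code]
# 20 fair coin flips all coming up heads is about a one-in-a-million event.
p = 0.5 ** 20
print(f"P(20 heads in a row) = {p:.2e}  (about 1 in {round(1 / p):,})")
[/code]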
hgm wrote: A deviation of 2.5*sigma is one-in-a-hundred, so if you post 64 mini-matches it is not at all significant if it occurs once. A deviation of 4.5*sigma is a one-in-a-million event. You seem to think "Oh well, it is not even twice as large as 2.5*sigma, so if 2.5*sigma occurs frequently, 4.5*sigma cannot be considered unusual". Well, that is as wrong and as naive as thinking that a billion is only 50% more than a million because it has 9 zeros instead of 6. Logic at the level of "I have seen two birds, so anything flies".
I don't believe any such thing. But I do believe this:
Whether it is a 4.5*sigma or a 1.0*sigma event, if sigma is big enough (and it is here) it makes the results unusable. I've not said anything else since starting this discussion.
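For reference, here is a quick check of how rare deviations of various sizes are under a plain normal approximation (one-sided tail probabilities; this is my own sketch, not either poster's calculation):

[code]
# Sketch: one-sided tail probability of a z-sigma deviation for a
# standard normal, via the complementary error function.
import math

def upper_tail(z):
    """P(Z > z) for a standard normal."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

for z in (1.0, 2.5, 4.5):
    p = upper_tail(z)
    print(f"{z:.1f} sigma: P = {p:.2e}  (about 1 in {round(1 / p):,})")
[/code]

The exact "once in how many matches" figure depends on whether you count one or both tails and on how many matches you look at, but either way the per-match sigma itself is large unless the match is long.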
Where did I say it was a "goof"? I reported _exactly_ what it was: 4 consecutive results obtained in a much larger test. I claimed it was the first _four_ test results in that series of matches. I didn't claim anything else. The question about "how random" came up, and I took what I had. Again, if I flipped heads 20 times in a row to get that sample, so be it: I flipped heads 20 times in a row. Whether you take that first group of 4, or the first group of 4 from the last data I posted, you get the same result. There is enough variability that it takes a _lot_ of games to smooth it out. All the other arguments, tangents, etc. don't change that one iota.
hgm wrote: No one contests your right to remain in ignorant bliss of statistics, and to concentrate on things that interest you more. But then don't mingle in discussions about statistics, as your uninformed and irrelevant comments only serve to confuse people.
Funny. It was a topic _I_ started. So who is "mingling"???
Wow. Talk about inaccurate statements. What about the set of programs I have used? On my cluster, I use fruit, glaurung 1/2, arasan 9/10, and gnuchess 4 and 5, plus a couple of others I won't mention since the authors asked me to play games and send them back. I use shredder and junior manually, and they do the same.
hgm wrote: The matter of randomness in move choice was already discussed ad nauseam in another thread, and is not really of interest to anyone, as it is fully understood, and most of us do not seem to suffer from it as much as you do, if at all.
So somehow "most of us" represents a couple of samples, and you complain that I "jump to conclusions"???
Complete-iteration searches are a primitive idea that almost everyone stopped using in the late 70s and early 80s. And the minute that goes away, randomness pops up, and there is nothing that can be done about it if you use any sort of time limit to control the search, because timing intrinsically varies on computers.
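A toy illustration of that last point, with made-up numbers for the branching factor, per-iteration cost, and timing jitter (this has nothing to do with any real engine's code): a fixed wall-clock budget plus tiny timing noise changes which iteration is the last one to complete, and therefore potentially which move gets played.

[code]
# Toy model: iterative deepening cut off by wall-clock time.  All the
# constants here are made up; the point is only that small timing noise
# changes the final completed depth from run to run.
import random

def depth_reached(time_budget=1.0, base_cost=0.002, branching=2.0, jitter=0.10):
    """Deepest fully completed iteration within `time_budget` (simulated)."""
    elapsed, depth = 0.0, 0
    while True:
        cost = base_cost * branching ** (depth + 1)       # cost of next iteration
        cost *= 1.0 + random.uniform(-jitter, jitter)     # scheduler / cache noise
        if elapsed + cost > time_budget:
            return depth
        elapsed += cost
        depth += 1

print("final depth over 20 identical runs:", [depth_reached() for _ in range(20)])
[/code]

With a fixed node or depth budget the run is reproducible, but with a clock the cutoff point drifts with whatever else the machine is doing.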
hgm wrote: The fact that you cannot tell apart a one-in-15,000 fluke from a 1-in-4 run-of-the-mill data set really says it all: this discussion is completely over your head. Note that I did not say that your gang-of-four was a "fake" (if you insist on calling a hypothetical case that), but that I only _hoped_ it was a fake, and not a cheat or a goof. OK, so you argue that it was a goof, and that you are not to blame for it because you are not intellectually equipped to know what you are doing ("statistics doesn't interest me"). Well, if that suits you better, fine. But if you want to masquerade as a scientist, it isn't really good advertisement...