hgm wrote: bob wrote: What about the case where the results are nearly perfectly random. So now the mean is well-defined but the standard deviation is at its max?
This is the case we have been talking about from the very beginning.
Totally random results cannot have a variance larger than sqrt(80). That _is_ the max. And only if there can't be any draws. Here there are many draws, so the variance _must_ be smaller. But sqrt(80) is the max. Unless the games are correlated.
The word games you try to play now are just plain silly. Something that occurs only once every 15,000 cases cannot be a-typical, because you have seen more than 15,000 cases, and what you have seen once or twice cannot be a-typical? Ridiculous!
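For readers following the sqrt(80) figure in the quoted post: it matches the standard deviation of the net score (wins minus losses) over 80 independent, evenly matched games with no draws. The sketch below reproduces that number and shows how draws shrink it. The function name `score_sd`, the +1/0/-1 scoring, and the 30% draw rate are illustrative assumptions, not figures from the thread.

```python
import math

def score_sd(n_games, p_win, p_draw):
    """Standard deviation of (wins - losses) over n independent games.

    Assumes each game scores +1 (win), 0 (draw) or -1 (loss) and that
    games are independent; correlated games can exceed this value.
    """
    p_loss = 1.0 - p_win - p_draw
    mean = p_win - p_loss                      # expected score per game
    var_per_game = (p_win + p_loss) - mean**2  # E[x^2] - E[x]^2
    return math.sqrt(n_games * var_per_game)

# Evenly matched, no draws: the maximum for independent games.
no_draws = score_sd(80, 0.5, 0.0)

# Hypothetical 30% draw rate: every draw removes variance.
with_draws = score_sd(80, 0.35, 0.30)

print(f"no draws:   {no_draws:.2f}")
print(f"with draws: {with_draws:.2f}")
```

Under these assumptions the no-draw case equals sqrt(80), and any nonzero draw rate gives a smaller value, which is the point being made in the quote.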
No, not ridiculous at all. Just some sort of word game. I see _significant_ randomness in every group of matches I play. Some far worse than others. And on rare occasion, some that are very close to each other. I already explained, more than once, that I have never been interested in trying to statistically analyze this data for what I am working on, which is (was) an attempt to eliminate or explain every bit of the randomness I was seeing.
I was not interested in how much variance I was seeing, I wanted to understand _why_ I was / am seeing it. I wanted to make sure there were no bugs. No unexpected randomness in either crafty or the opponents I am using. For example, an occasional loop that hangs until the search times out and a move is made after almost no searching, an easy way to lose a game (and yes, I have seen that bug in the past).
So all I cared about here was why this is happening, and whether there is a viable way to stop it from happening so that a small sample size will produce a reliable result. The basic answer is "no" to the small sample size question. At least for the programs I am using, since we all time the search in a similar non-whole-iteration way. I reported here that the sample sizes being quoted to either (a) compare X to Y or (b) determine if X is better than X' are insufficient, because I ran over 100K 80-game samples and easily discovered that 80 games is not enough to prove anything with any reasonable standard error.
At some point, I do plan on going the stat route to discover what is the "optimal" sample size. Right now, I can run enough tests to smooth out the randomness easily enough. But I am also interested in how far I can back off of my current sample size and still produce acceptable error. But I am not there yet. I will get there at some point. All I know right now is that to discover whether A' is better than A requires _way_ more than 320 games. And that 20K games provides a stable result. Since 20K is not intractable at present, I am using that and getting accurate results. As I have time, I have a stat friend that wants to help see if we can back down from that number a bit without compromising accuracy. At present, cutting it by a factor of 2 introduces more randomness. I can already run 20K games twice and get slightly different results each time. The fewer games I play, the greater the variance, and it _quickly_ loses the ability to pick up on small changes in skill.
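As a rough illustration of why small samples struggle to detect small changes, here is a minimal sketch of how the standard error of the score fraction shrinks with the number of games. It assumes independent games scored 1, 0.5 or 0, a worst-case per-game standard deviation of 0.5 (no draws, evenly matched), and the usual logistic Elo formula; the helper names `standard_error` and `elo_diff` are made up for this example, and none of this is the actual analysis used for the Crafty tests.

```python
import math

def standard_error(n_games, sd_per_game=0.5):
    """Standard error of the mean score per game over n independent games.

    sd_per_game=0.5 is the worst case (evenly matched, no draws);
    draws make it smaller, correlated games can make it larger.
    """
    return sd_per_game / math.sqrt(n_games)

def elo_diff(score):
    """Elo difference implied by an expected score fraction (logistic model)."""
    return -400.0 * math.log10(1.0 / score - 1.0)

for n in (80, 320, 10_000, 20_000):
    se = standard_error(n)
    # Elo shift corresponding to a one-standard-error change around 50%.
    print(f"{n:>6} games: SE = {se:.4f} points per game, "
          f"about {elo_diff(0.5 + se):.1f} Elo")
```

Because the error falls only as 1/sqrt(N), halving the number of games multiplies the error by about 1.41, which is consistent with the observation that cutting 20K games by a factor of 2 visibly adds noise.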
That's where I am. Along the way you ask questions and I try to answer them. Is one of those wild variations "atypical"? Depends on your definition of atypical. I have seen them enough that they don't stand out to me at all. So I don't call them atypical. If you use some definition of (say) once in 10000 trials is atypical, or once in 100 trials is atypical, that's fine. I don't have a precise number I use in that comparison. I just think of "rare" and those results are not rare to me.
And totally irrelevant, as we are not discussing if the description 'a-typical' is a proper one, but the fact that you post a 1-in-15,000 fluke like it is the most common thing in the world, and then throw a fit if you get caught at it... So even if this falls under your definition of 'typical', the 15,000 still stands, and that is all that matters.
That is pretty strange. The last test I posted had 3 or 4 matches of significant variance followed by 7-8 of pretty stable results. And you said "that is nothing special." Yet if I posted just those first 4, you say "that is a one in 15,000 event and it is fake." Since I have played over 8,000,000 games, in 80-game matches, that is 100,000 of those matches. I could _easily_ pick one that was rare. But I didn't pick it because it _was_ rare. I just picked it because it happened to be the data I had at hand. I can't even post the rest of the run I started to post about this morning. Our IBRIX filesystem died, and the run terminated and cleaned up after itself, leaving no data to look at. I am running the test again. If you want, I will post the first 4 matches from it as well, so that we can again get into the catcalls about "that can't be."
As I have said before, "it is what it is." And nothing more. I am trying to supply information. Sometimes I wonder if it is worth the trouble. There is always someone waiting around to poke holes in it rather than take it for what it is.
So keep poking. I'm not sure I'm interested in continuing the dance, however. I want to get back to making my program better, playing what you consider to be an excessive number of games, to get what I consider to be very high accuracy...
that's good enough for "team Crafty"...