bob wrote: No, I'm not confusing anything.  Your "variance" is based on a huge number of games. 
My variance is based not on a number of games, but on the probability distribution.
And I agree, with N=infinity that variance (and hence standard deviation) is very small.  But remember, _I_ am the one arguing for large N.  You (and others) are arguing for small N.  
I have not been arguing for any number of games. I just calculated the confidence intervals for the number of games you or Nicolai had been using.
So don't quote sigma^2 (or sigma) for large N to justify a small number of games.  
I still consider that nonsense. Variance is a property of the probability distribution from which you draw the games. It has _nothing_ to do with the number of games. Your actual results will vary per draw, and of course 32% will have a deviation from the mean larger than sigma, and 5% larger than 2 sigma. That doesn't mean the variance of the match results is any different from what is calculated. In fact it confirms it. 
Given the variance of the distribution one can calculate the probability for a certain deviation, and this is what I did.
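To make that calculation concrete, here is a minimal sketch, assuming independent games scored +1/0/-1 and a normal approximation for the tails. The function names and the win/draw/loss probabilities in the usage note are illustrative assumptions, not figures from the actual test runs.

```python
import math

def per_game_variance(p_win, p_draw, p_loss):
    """Variance of one game result scored +1 (win), 0 (draw), -1 (loss)."""
    if abs(p_win + p_draw + p_loss - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    mean = p_win - p_loss
    second_moment = p_win + p_loss  # (+1)^2 and (-1)^2 are both 1
    return second_moment - mean ** 2

def match_sd(p_win, p_draw, p_loss, n_games):
    """SD of the total score of n_games independent games.

    n_games only scales the per-game variance; the per-game variance
    itself is fixed by the probability distribution.
    """
    return math.sqrt(n_games * per_game_variance(p_win, p_draw, p_loss))

def two_sided_tail(k):
    """Normal-approximation probability of deviating from the mean by more than k sigma."""
    return math.erfc(k / math.sqrt(2))
```

With an assumed 30% wins, 40% draws and 30% losses, `match_sd(0.3, 0.4, 0.3, 80)` comes out at about 6.9 points, and `two_sided_tail(1)` and `two_sided_tail(2)` give roughly 32% and 5%, the fractions mentioned above.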
That is exactly my point.  To reduce the variance to something that is reliable, we need a large number of games because the results of the 80-game mini-matches I play are so widely distributed.  
I did not think they were widely distributed at all. The observed variance of the mini-match results was lower than one would expect for the score, and the game results were actually highly correlated between matches. This shows that either the initial positions are not very well selected for equality, or the non-determinism you champion is not nearly as big as you claim.
I am not bothered by the necessity of playing 20K games to make a decision.  Since I can play 256 at a time, that is the same as playing less than 100 games one at a time.  With a fast time control, that doesn't take that long.
Well, good for you. So availability of excessive computer power causes atrophy of the brain. Nothing new about that. But I only have one dual-core machine, so I still have to _think_ how to do things efficiently. And actually the thinking is what is all the fun. 
 
 
The alternative is to use small N, and make poor decisions because of the inherent randomness (large sigma/sigma^2).
The alternative for me is to eliminate all unnecessary variability, so that I can base decisions with as much confidence as you, on 256 times fewer games.
BTW you could make sigma much smaller by just taking each 2-game position as a match.  The results have to lie between -2 and +2.  But I am not sure how you can use that to accept/reject a change to the program since the before/after results will be identical to several decimal places.
So what? Like anyone cares about sigma... What matters is the probability that you will exceed a certain preset score threshold where you are confident that the engine is better. For 2-game matches and +1/-1 scoring the SD per game is ~0.8, so per 2-game match it is about 1.2. For 2-sigma confidence you would have to score 2.5 wins out of 2 games. Well, good luck with it. How many hours would that take on your cluster to succeed once? 
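The arithmetic behind that can be sketched as follows, assuming the two games of a match are independent and taking the approximate per-game SD of 0.8 quoted above. The constant names are made up for illustration. With exactly 0.8 the match SD comes out slightly below the rounded 1.2, but the conclusion does not change: the 2-sigma threshold lies above the maximum possible score.

```python
import math

SD_PER_GAME = 0.8      # approximate per-game SD quoted in the post (+1/0/-1 scoring)
GAMES_PER_MATCH = 2
N_SIGMA = 2            # confidence threshold, in standard deviations

# For independent games the variances add, so the SD scales with sqrt(n).
sd_match = SD_PER_GAME * math.sqrt(GAMES_PER_MATCH)

# Score needed to be N_SIGMA above an expected score of 0 (equal engines).
threshold = N_SIGMA * sd_match

# Best possible result: win every game, +1 each.
max_score = GAMES_PER_MATCH * 1

reachable = threshold <= max_score
print(f"SD per match: {sd_match:.2f}")
print(f"{N_SIGMA}-sigma threshold: {threshold:.2f} (max possible score: {max_score})")
print("threshold reachable:", reachable)
```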
 
 
Given two equal humans, who play 100 games, what would you expect to be the outcome? 45-45-10?  If you repeat the test, what would you expect?  I would _not_ expect the two equal computers to produce that same level of consistency.  That is what I am trying to explain.  Comp vs Comp is nowhere close to human vs human.
I don't really follow Human Chess, and from my relative ignorance I would more expect something like 25-25-50 (+,-,=). I don't see what you are driving at at all. What do you mean by 'level of consistency'? I don't see anything consistent in 45-45-10 or 25-25-50.
For an infinite number of trials, I agree.  But I am the one arguing for large N.  So how can you argue based on large N values, when you want to use small N results??   The standard deviation can be much larger than the above, for small sample sizes that you propose...
Again, what you say seems based on the idea that the variance of a stochastic variable is dependent on the number of draws. But it is not.
For individual outcomes ('mini-matches') anything is _possible_. But not with equal _probability_. If the first mini-match you play has 80 losses, then it is possible that the next one, with the same opponents and the same initial positions, will have 80 wins. But you will NOT see it happen in the lifetime of the universe, even if you used a million of your clusters. At least, if the games within the mini-match are independent. (E.g. if the first game that crashes the engine would cause forfeit on time of all subsequent games, as the engine could not recover enough to recognize the next 'new' command, this would of course be different.)
The standard deviation of the score distribution of an 80-game match (in the +1/0/-1 scoring system) can never be larger than sqrt(80), about 8.9, as long as the games are independent; equivalently, the variance can never be larger than 80. That is a hard fact.
The variance of this probability distribution will tell you how large the probability is that a certain deviation from the mean will occur in a mini-match, or in a given number of mini-matches.
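A quick numerical check of that bound, as a sketch only: it assumes independent games scored +1/0/-1 and simply scans a grid of win/loss probabilities (the helper name and the grid resolution are arbitrary choices). The per-game variance peaks at 1, for 50% wins, 50% losses and no draws, so the SD of an 80-game total peaks at sqrt(80).

```python
import math

def per_game_variance(p_win, p_loss):
    """Variance of one +1/0/-1 game; the draw probability is 1 - p_win - p_loss."""
    mean = p_win - p_loss
    return (p_win + p_loss) - mean ** 2

STEPS = 100      # grid resolution for the scan (arbitrary)
N_GAMES = 80

# Scan every (p_win, p_loss) pair on the grid with p_win + p_loss <= 1.
worst_variance = max(
    per_game_variance(w / STEPS, l / STEPS)
    for w in range(STEPS + 1)
    for l in range(STEPS + 1 - w)
)

# Variances of independent games add, so the match SD is sqrt(n * variance).
max_match_sd = math.sqrt(N_GAMES * worst_variance)

print(f"largest per-game variance on the grid: {worst_variance:.3f}")
print(f"largest SD of an {N_GAMES}-game total: {max_match_sd:.2f}")
```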