hgm wrote:
bob wrote: the "standard error" might be 2 points. The variance in such a match is _far_ higher. Just play 26-game matches several times. You will _not_ get just a 2-game variance. I've already posted results from several programs including Fruit, Glaurung, Arasan, GNU Chess and Crafty. So talking about standard error is really meaningless here, as is the +/- elostat output. It just is not applicable based on a _huge_ number of small matches...
Wait a minute! Are we talking here about 'variance' in the usual statistical meaning, as the square of the standard deviation?
Then it is safe to say that what you claim is just nonsense. The addition law for variances of combined events (and the resulting central limit theorem) can be mathematically _proven_ under very weak conditions. The only thing that is required is that the variances be finite (which in Chess they necessarily are, as all individual scores are finite), and that the events are independent.
So what you claim could only be true if the results of games within one match were dependent on each other, in a way that winning one would give you a larger probability of winning the next. As we typically restart our engines between games, that doesn't seem possible without violating causality...
If you play a large number of 26-game matches, the results will be distributed with an SD of at most 0.5*sqrt(26) ~= 2.55, and you can only reach that if the engines cannot play draws and score near 50% on average. With normal draw percentages it will be 2 points (i.e. the variance will be 4 square points).
No way the variance is ever going to be any higher.
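hgm's figure can be checked with a quick Monte Carlo sketch. The 40% draw rate and the equal-strength assumption below are mine, chosen only as plausible defaults, not numbers from the thread:

```python
import random
import statistics

# Sketch of the claim above. Assumptions (mine): equal engines, independent
# games, a 40% draw rate. Game scores: win = 1, draw = 0.5, loss = 0.
P_WIN, P_DRAW = 0.3, 0.4  # implies P_LOSS = 0.3

def match_score(games=26):
    """Total score of one match of independent games."""
    total = 0.0
    for _ in range(games):
        r = random.random()
        if r < P_WIN:
            total += 1.0
        elif r < P_WIN + P_DRAW:
            total += 0.5
    return total

scores = [match_score() for _ in range(100_000)]
# Theory: SD = 0.5 * sqrt(26 * (1 - P_DRAW)) ~= 1.97 points.
print(statistics.stdev(scores))
```

With a 40% draw rate the per-game score variance is (1 - 0.4)/4 = 0.15, so 26 independent games give SD = sqrt(26 * 0.15) ~= 2 points, matching the claim.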
I was originally talking about variance in the engineering sense, but I have been using the usual statistical term recently. When we started testing on our cluster, I could run 26-game matches and get results of 2-24, 24-2 and 13-13.
So the three samples, as win-loss differentials, are -22, +22 and 0. The mean of the squares is (22^2 + 22^2 + 0^2) / 3 = 968 / 3 ~= 323. Divide by 2 to compute the variance (~161). Is that the variance you are talking about: square the difference of each pair of observations, compute the mean of that, and divide by 2??? I claim that number is _very_ high for computer vs computer chess games. I claim it goes down as the number of games goes up. I claim that until the number of games is far larger than one might originally suspect, the variance is extremely high, making the results highly random and unusable. I can't deal with variance that large and conclude anything useful.
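Those three match results can be run through the standard estimators directly; this sketch just restates the arithmetic above (the differential/score conversion is my framing):

```python
import statistics

# The three 26-game results quoted above, as win-loss differentials:
# 2-24 -> -22, 24-2 -> +22, 13-13 -> 0.
diffs = [-22, 22, 0]

# Population variance of the differentials (their mean is 0).
var_diff = statistics.pvariance(diffs)
print(var_diff)  # 968/3 ~= 322.7

# A decisive game moves the differential by 2 points but the match score
# by 1, so in score points the variance is var_diff / 4 ~= 80.7 -- still
# far above the ~4 square points predicted for independent games.
print(var_diff / 4)
```

Whatever estimator one prefers, the observed spread of these three matches is an order of magnitude above the independent-games prediction, which is the crux of the disagreement.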
So I agree with your "if you play a large number". But I have been arguing for a large number of games to reduce variance. You have been arguing for a small number of games, which absolutely increases the variance. We just don't seem to agree on how much this increase actually is...
So maybe I am now confused, since I am arguing for large N and small sigma^2. So back to the beginning: do you advocate large N or small N? I will again remind you, I am using 40 positions so that I get a representative cross-section of games: tactical, positional, middlegame attacks, endgame finesses, etc. That requires at least 80 games, since you must alternate colors to cancel unbalanced positions. What N do you advocate?

I am finding it necessary to run about 5K games per opponent to get N up and sigma^2 down enough to make evaluation accurate. I also choose to use multiple opponents, as I don't want to tune to beat X only to lose to Y and Z because X has problems with certain types of positions that skew those results. I am not sure whether my 20K total games (5K x 4 opponents) is enough or overkill. But I am absolutely certain that 500 is not enough, because of the variance I have measured.
Also, with my cluster I can test parallel searches, and I would really like (and plan) to test against other parallel search engines to compare their parallel search effectiveness. But that introduces even more non-determinism and will drive N even larger. It would also be nice to test a book in the same way, again further increasing N. Not to mention pondering. So far I have tried to whittle away every source of non-determinism except the timing of the moves. Interfering with that produces stability, but the match results generally do not come close to those produced with more games using normal time limits.