Gerd Isenberg wrote:
Why do you expect those games not independent from each other?
1: =--===++--=-=-++=-=-++++---+=+-+--=+-=-+-+-=-++-+++--=+=++=+---+-+++-==+---++=-- (-2)
2: =---+---+-=+=-+++---++------+=-+===+---=--=---+--+=-=-=-++-+--+--+---=-=+-+++++- (-18)
3: +---=-=-+++-+=+===--+++----=-+-=-+++-+----=-=+==+-=+--+--+=+--+=+=+-+++-+-=+--=- (-6)
4: =-=-==--+---+-=+=----+=---+===---=-=---=--====+------=---+-+--+--=+--++=+--+--=- (-31)
I see this remains unanswered. You must realize that it is not easy to condense a semester's worth of statistics for math students into a single post...
One other crucial point is that the data above was given together with the information that this data was part of a set of 32 that had an average score of -2.
The variance of a quantity is the average of the squared deviation from its average. To not complicate it too much, assume that the grand average of all results is zero (+ and - occur equally often). Then the deviation from the average is just the result itself, and to get the variance you square those results and take the average of the squares.
Now each mini-match result is the sum of 80 game results. Squaring a sum of N terms gives you the square of each of the N terms, but also the double products of pairs of terms (N*(N-1)/2 of them). The point is that with independent games the double products always average to zero, because the ++ and -- combinations (where the product is positive) cancel the equally probable +- and -+ combinations (where the product is negative). Remember we assumed + and - were equally probable. If they are not, the result still holds, but the math to show it is much more cumbersome.
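To make the cancellation concrete, here is a small sketch that simply enumerates every equally likely sequence of + and - for a handful of games and averages the pieces of the squared sum. The function name and the choice of 6 games are mine, picked only to keep the enumeration small; draws are left out, as in the simplified argument above.

```python
from itertools import combinations, product

def average_square_terms(n):
    """Average the parts of (x_1 + ... + x_n)^2 over all 2^n sign sequences.

    Every game is +1 or -1 with equal probability and independent of the
    others, so all 2^n sequences are equally likely.
    """
    outcomes = list(product((1, -1), repeat=n))
    count = len(outcomes)
    # Contribution of the squared individual terms x_i^2.
    squares = sum(sum(x * x for x in seq) for seq in outcomes) / count
    # Contribution of the double products 2*x_i*x_j over all pairs i < j.
    cross = sum(
        sum(2 * seq[i] * seq[j] for i, j in combinations(range(n), 2))
        for seq in outcomes
    ) / count
    # Average of the squared sum itself, which equals squares + cross.
    total = sum(sum(seq) ** 2 for seq in outcomes) / count
    return squares, cross, total

print(average_square_terms(6))  # (6.0, 0.0, 6.0)
```

The double products average to exactly zero, so the average squared sum equals the number of games.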
So in the average of the squared deviations, only the squares of the individual terms contribute (as they are never negative), and that makes the variance of the sum (= mini-match result) equal to the sum of the variances of the individual games. The variance of a single game is at most 1 (the result is +1, 0 or -1, so its square is at most 1), which means the variance of the result of 80 independent games can be at most 80, and thus the SD at most sqrt(80) = ~9. Because draws are reasonably abundant, the variance in practice will be lower (0 squared equals 0, and thus lowers the average), giving an SD more like 7 or 8.
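As a rough check of those numbers, the sketch below computes the SD of an 80-game mini-match for a given draw fraction. The function name and the 25% draw fraction are assumptions of mine for illustration, not measured values; wins and losses are taken as equally probable so the average is zero.

```python
import math

def mini_match_sd(n_games, p_draw):
    """SD of the summed score of n_games independent games.

    Assumes each game is a draw (0) with probability p_draw and otherwise
    +1 or -1 with equal probability, so the mean is zero and the variance
    of one game is just the probability of a decisive result.
    """
    var_game = 1.0 - p_draw
    # Independent games: the variances add up.
    return math.sqrt(n_games * var_game)

print(round(mini_match_sd(80, 0.0), 2))   # no draws: about 8.94
print(round(mini_match_sd(80, 0.25), 2))  # hypothetical 25% draws: about 7.75
```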
Now one of the deviations shown is many times larger than that: mini-match 4 scored -31, which is -29 relative to the average of -2. For sums of independent events the probability distribution is nearly universal (the so-called "normal" distribution) once you scale it to the SD. This deviation is about 4 times the SD, and the probability for that can be looked up in a table for the normal distribution to be only about 1 in 15,000 (counting both directions, as +29 would have disturbed us as much as -29).
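Instead of a table you can get the tail probability from the complementary error function. This is only a sketch of the lookup, using exactly 4 SD and a helper name of my own choosing.

```python
import math

def two_sided_tail(z):
    """Probability that a standard normal variable is at least z SDs
    away from its mean in either direction."""
    return math.erfc(z / math.sqrt(2.0))

p = two_sided_tail(4.0)
print(p)      # roughly 6.3e-05
print(1 / p)  # roughly 1 in 16,000, the same order as the table value
```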
If in practice such extreme deviations occur much more frequently than this, it either means the distribution of the mini-match results is not normal, or that it is normal but with a larger variance. Both of these can only happen if the individual games somehow conspire to cause extreme deviations, i.e. a large number of them tend to all produce a +, or all produce a -, and rarely produce results where some are + and others are -.
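To illustrate what such "conspiring" does to the spread, here is a toy simulation. The dependence model is purely hypothetical and mine: games come in blocks that all share the same result. It is not a claim about how the real games are correlated; it only shows that dependence inflates the SD of the mini-match score.

```python
import random
import statistics

def simulate_sd(n_games=80, block=1, trials=20000, seed=1):
    """Estimate the SD of a mini-match score by simulation.

    Hypothetical model: games are grouped in blocks of `block` games that
    all get the same result (+1 or -1, equally likely). block=1 means
    fully independent games. n_games should be a multiple of block.
    """
    rng = random.Random(seed)
    totals = []
    for _ in range(trials):
        total = 0
        for _ in range(n_games // block):
            # One coin flip decides every game in this block.
            total += block * rng.choice((1, -1))
        totals.append(total)
    return statistics.pstdev(totals)

print(simulate_sd(block=1))  # close to sqrt(80), about 9
print(simulate_sd(block=4))  # close to sqrt(320), about 18
```

With blocks of 4 identical results the SD doubles, so a score around -29 would no longer be a 4-SD event.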
So this is what is troubling us. One of those 4 mini-match results shows a deviation so big that it should almost never occur in a sum of 80 independently chosen +1 and -1, even if you chose each individual +1 or -1 by a coin flip (= totally random). That the results of a chess game between the same engines from the same starting position might not be totally random can only make things worse: the variance of an individual 'sample position' would go down even further by such an effect, making the observed deviation even less likely. So the behavior of chess engines is not relevant at all for addressing this problem.