An objective test process for the rest of us?

Discussion of chess software programming and technical issues.

Moderator: Ras

hgm
Posts: 28395
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: An objective test process for the rest of us?

Post by hgm »

bob wrote: I responded because you suggested that the data was made up. The data was _not_ "atypical". I just posted some more data that started off pretty wildly then settled down. Whether it will "wilden back up" or not is unknown. Seeing drastic differences is not that uncommon. Doesn't happen every time. Doesn't happen once per millennium. Sort of the nature of random observations...
Well, this is where we still differ, then.

If this data is not atypical, your measurements must be corrupted by effects that you are not aware of, and perhaps cannot even imagine. But they must be there, if the data show them to be there.

If the games are independent (as you claim, and as would be expected) the mini-match results are distributed according to a normal distribution with a SD bounded by sqrt(80). That puts a hard upper limit to the _probability_ for observing a deviation of a certain size (in particular, once every 15,000 mini-matches for a deviation of 4*sigma).

If you see it _significantly_ more often, you will have significant evidence that your measurement is invalid. That you can't imagine what caused the screw up can at best tell something about you, but never anything about the data.

If it does _not_ occur significantly more often than once every 15,000 mini-matches, then it _is_ very atypical.

Take your pick...
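
The once-every-15,000 figure follows directly from the normal tail and is easy to verify. A minimal sketch (Python, standard library only):

Code: Select all

# Two-sided tail probability of a 4-sigma deviation under a normal
# distribution: the "once every 15,000 mini-matches" figure above.
from math import erfc, sqrt

p = erfc(4 / sqrt(2))  # P(|Z| >= 4) for a standard normal Z
print(f"P(|deviation| >= 4*sigma) = {p:.2e} (about 1 in {1/p:,.0f})")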
hgm
Posts: 28395
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: An objective test process for the rest of us?

Post by hgm »

Gerd Isenberg wrote: Why do you expect those games not to be independent of each other?

Code: Select all

1: =--===++--=-=-++=-=-++++---+=+-+--=+-=-+-+-=-++-+++--=+=++=+---+-+++-==+---++=-- (-2)
2: =---+---+-=+=-+++---++------+=-+===+---=--=---+--+=-=-=-++-+--+--+---=-=+-+++++- (-18)
3: +---=-=-+++-+=+===--+++----=-+-=-+++-+----=-=+==+-=+--+--+=+--+=+=+-+++-+-=+--=- (-6)
4: =-=-==--+---+-=+=----+=---+===---=-=---=--====+------=---+-+--+--=+--++=+--+--=- (-31)
I see this remains unanswered. You must realize that it is not easy to compress a semester's worth of statistics for math students into a single post... :?

One other crucial point is that the data above was given together with the information that it was part of a set of 32 mini-matches with an average score of -2.

The variance of a quantity is the average of the squared deviation from the average. To keep things simple, assume that the grand average of all results is zero (equal numbers of + and - occur). Then the deviation from the average is just the result itself, and to get the variance you square those deviations and then take the average of the squares.

Now the mini-match results are the sum of 80 game results. Squaring a sum of N terms gives you the squares of each of the N terms, but also the double products of pairs of terms (N*(N-1)/2 of them). The point now is that with independent games, the double products average to zero, because the ++ and -- combinations (where the product is positive) cancel the equally probable +- and -+ combinations (where the product is negative). Remember we assumed + and - were equally probable. If they were not, the result would still hold, but the math to show it would be much more cumbersome.

So in the average of the squared deviations only the squares of the individual terms contribute, and that makes the variance of the sum (= the mini-match result) equal to the sum of the variances of the individual games. As the variance of a single game is limited to 1 (the result can be at most +1 or -1, both of which have a square of 1), the variance of the result of 80 independent games can be at most 80, and thus the SD at most sqrt(80) = ~9. Because draws are reasonably abundant (0 squared equals 0, which lowers the average), the SD in practice will be lower, more like 7 or 8.
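
This additivity is easy to check numerically. A sketch (the win/draw/loss probabilities are invented for illustration, not measured engine data):

Code: Select all

# Variance of an 80-game sum vs. the sum of the per-game variances,
# for independent games. Probabilities are illustrative assumptions.
import random
import statistics

GAMES = 80
P_WIN, P_DRAW = 0.35, 0.30  # P(loss) = 0.35, so the mean score is 0

def game():
    """One independent game result: +1 win, 0 draw, -1 loss."""
    r = random.random()
    return 1 if r < P_WIN else (0 if r < P_WIN + P_DRAW else -1)

per_game_var = 2 * P_WIN  # E[x^2] = P(+1) + P(-1), and the mean is 0
matches = [sum(game() for _ in range(GAMES)) for _ in range(50_000)]
print("predicted variance:", GAMES * per_game_var)  # 56.0, so SD ~ 7.5
print("observed variance: ", statistics.pvariance(matches))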

Now one of the deviations shown is many times larger than that (-29). For sums of independent events the probability distribution is nearly universal (the so-called "normal" distribution) once you scale it to the SD. This deviation is about 4 times the SD, and the probability for that can be looked up in a table of the normal distribution: only 1 in 15,000 (counting both directions, as +29 would have disturbed us as much as -29).

If in practice such extreme deviations occur much more frequently than this, it means either that the distribution of the mini-match results is not normal, or that it is normal but with a larger variance. Both of these can only happen if the individual games somehow conspire to cause extreme deviations, i.e. a large number of them would have to all produce a +, or all produce a -, but rarely a mixture of + and -.

So this is what is troubling us. One of those 4 mini-match results shows a deviation so big that it should almost never occur in a sum of 80 independently chosen values of +1 and -1, even if you chose each +1 and -1 by a coin flip (= totally random). That the results of games between the same engines from the same starting position might not be totally random can only make things worse: the variance of an individual 'sample position' would go down even further by such an effect, making the observed deviation even less likely. So the behavior of chess engines is not relevant at all for addressing this problem.
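
The coin-flip worst case mentioned here can even be computed exactly from the binomial distribution; draws only shrink the variance, making a deviation of 29 rarer still. A sketch:

Code: Select all

# Exact probability of a deviation of 29 or more in 80 games for the
# maximum-variance case: every game +1 or -1 with probability 1/2.
from math import comb

GAMES = 80
# score = wins - losses = 2*wins - GAMES, so |score| >= 29 means
# wins <= 25 or wins >= 55 (the score of 80 +/-1 games is even)
tail = sum(comb(GAMES, w) for w in range(26))  # wins <= 25
p = 2 * tail / 2**GAMES                        # symmetric two-sided
print(f"P(|deviation| >= 29) = {p:.2e} (about 1 in {1/p:,.0f})")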
Tony

Re: An objective test process for the rest of us?

Post by Tony »

hgm wrote:
bob wrote: I responded because you suggested that the data was made up. The data was _not_ "atypical". I just posted some more data that started off pretty wildly then settled down. Whether it will "wilden back up" or not is unknown. Seeing drastic differences is not that uncommon. Doesn't happen every time. Doesn't happen once per millennium. Sort of the nature of random observations...
Well, this is where we still differ, then.

If this data is not atypical, your measurements must be corrupted by effects that you are not aware of, and perhaps cannot even imagine. But they must be there, if the data show them to be there.

If the games are independent (as you claim, and as would be expected) the mini-match results are distributed according to a normal distribution with a SD bounded by sqrt(80). That puts a hard upper limit to the _probability_ for observing a deviation of a certain size (in particular, once every 15,000 mini-matches for a deviation of 4*sigma).

If you see it _significantly_ more often, you will have significant evidence that your measurement is invalid. That you can't imagine what caused the screw up can at best tell something about you, but never anything about the data.

If it does _not_ occur significantly more often than once every 15,000 mini-matches, then it _is_ very atypical.

Take your pick...
Hi HG,

it's not atypical. It just means the distribution is not normal.

It is only atypical if the distribution is supposed to be normal. But are you sure it should be?

Tony
Uri Blass
Posts: 10905
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: An objective test process for the rest of us?

Post by Uri Blass »

Tony wrote:
hgm wrote:
bob wrote: I responded because you suggested that the data was made up. The data was _not_ "atypical". I just posted some more data that started off pretty wildly then settled down. Whether it will "wilden back up" or not is unknown. Seeing drastic differences is not that uncommon. Doesn't happen every time. Doesn't happen once per millennium. Sort of the nature of random observations...
Well, this is where we still differ, then.

If this data is not atypical, your measurements must be corrupted by effects that you are not aware of, and perhaps cannot even imagine. But they must be there, if the data show them to be there.

If the games are independent (as you claim, and as would be expected) the mini-match results are distributed according to a normal distribution with a SD bounded by sqrt(80). That puts a hard upper limit to the _probability_ for observing a deviation of a certain size (in particular, once every 15,000 mini-matches for a deviation of 4*sigma).

If you see it _significantly_ more often, you will have significant evidence that your measurement is invalid. That you can't imagine what caused the screw up can at best tell something about you, but never anything about the data.

If it does _not_ occur significantly more often than once every 15,000 mini-matches, then it _is_ very atypical.

Take your pick...
Hi HG,

it's not atypical. It just means the distribution is not normal.

It is only atypical if the distribution is supposed to be normal. But are you sure it should be?

Tony
The distribution of the match result should be close to normal by the central limit theorem, so the error in the probability estimate is not large.
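
How close to normal the 80-game sum actually gets can be checked without any sampling at all, by building its exact distribution through repeated convolution (a sketch; the win/draw/loss probabilities are illustrative assumptions):

Code: Select all

# Exact distribution of a sum of 80 independent game results, built by
# repeated convolution, compared with the normal tail the central
# limit theorem predicts. Probabilities are illustrative assumptions.
from math import erfc, sqrt

P = {1: 0.35, 0: 0.30, -1: 0.35}  # one game: win / draw / loss
GAMES = 80

dist = {0: 1.0}  # probability distribution of the running score
for _ in range(GAMES):
    new = {}
    for s, ps in dist.items():
        for r, pr in P.items():
            new[s + r] = new.get(s + r, 0.0) + ps * pr
    dist = new

sd = sqrt(sum(p * s * s for s, p in dist.items()))  # mean is 0 here
exact = sum(p for s, p in dist.items() if abs(s) >= 4 * sd)
print(f"SD = {sd:.2f}")
print(f"exact 4-sigma tail:  {exact:.2e}")
print(f"normal 4-sigma tail: {erfc(4 / sqrt(2)):.2e}")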

Uri
Uri Blass
Posts: 10905
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: An objective test process for the rest of us?

Post by Uri Blass »

hgm wrote:
bob wrote: I responded because you suggested that the data was made up. The data was _not_ "atypical". I just posted some more data that started off pretty wildly then settled down. Whether it will "wilden back up" or not is unknown. Seeing drastic differences is not that uncommon. Doesn't happen every time. Doesn't happen once per millennium. Sort of the nature of random observations...
Well, this is where we still differ, then.

If this data is not atypical, your measurements must be corrupted by effects that you are not aware of, and perhaps cannot even imagine. But they must be there, if the data show them to be there.

If the games are independent (as you claim, and as would be expected) the mini-match results are distributed according to a normal distribution with a SD bounded by sqrt(80). That puts a hard upper limit to the _probability_ for observing a deviation of a certain size (in particular, once every 15,000 mini-matches for a deviation of 4*sigma).

If you see it _significantly_ more often, you will have significant evidence that your measurement is invalid. That you can't imagine what caused the screw up can at best tell something about you, but never anything about the data.

If it does _not_ occur significantly more often than once every 15,000 mini-matches, then it _is_ very atypical.

Take your pick...
My guess is that Bob simply did not compute any statistics in order to decide whether the data is typical.

There are two possibilities:
1) He often saw data with smaller variance and thought this data was of the same type.

2) He often sees data with similar variance, in which case it is clear that the results are not independent.

Uri
hgm
Posts: 28395
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: An objective test process for the rest of us?

Post by hgm »

Tony wrote:Hi HG,

it's not atypical. It just means the distribution is not normal.

It is only atypical if the distribution is supposed to be normal. But are you sure it should be?
Well, that is the entire point of this discussion. The distribution can only be other than normal if there is correlation between the games. For a sum of 80 independent results, the distribution should be indistinguishable from normal.
Tony

Re: An objective test process for the rest of us?

Post by Tony »

hgm wrote:
Tony wrote:Hi HG,

it's not atypical. It just means the distribution is not normal.

It is only atypical if the distribution is supposed to be normal. But are you sure it should be?
Well, that is the entire point of this discussion. The distribution can only be other than normal if there is correlation between the games. For a sum of 80 independent results, the distribution should be indistinguishable from normal.
But for a lot of engines there is a correlation: opening-book statistics that get adjusted, or even worse, learning files.

They can stop an engine from playing a crappy opening line and lower its losing percentage, or start beating a slower-learning engine.

(Maybe this has been discussed, but I really can't/don't remember every post)
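
A toy model shows how strong this effect can be (a sketch; the "learning" rule, half the games in a match copying the outcome of the first game, is invented purely for illustration):

Code: Select all

# How correlation between games inflates the mini-match SD beyond the
# independent-games bound of sqrt(80). The learning rule is a made-up
# toy: half the games copy the first game's outcome, as if a learned
# book line kept being repeated.
import random
import statistics

GAMES = 80

def independent_match():
    return sum(random.choice((1, -1)) for _ in range(GAMES))

def correlated_match():
    first = random.choice((1, -1))  # shared state, e.g. a book line
    return sum(first if g < GAMES // 2 else random.choice((1, -1))
               for g in range(GAMES))

ind = [independent_match() for _ in range(20_000)]
cor = [correlated_match() for _ in range(20_000)]
print("independent SD:", statistics.pstdev(ind))  # close to sqrt(80)
print("correlated SD: ", statistics.pstdev(cor))  # far above sqrt(80)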

Tony
hgm
Posts: 28395
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: An objective test process for the rest of us?

Post by hgm »

Bob plays without book and learning, and takes every conceivable measure to prevent games influencing each other.

But indeed, that is what I say above, "take your pick":

either the games are dependent, _or_ the data is atypical.

You can't get rid of one of those without getting the other.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

hgm wrote:
bob wrote: I responded because you suggested that the data was made up. The data was _not_ "atypical". I just posted some more data that started off pretty wildly then settled down. Whether it will "wilden back up" or not is unknown. Seeing drastic differences is not that uncommon. Doesn't happen every time. Doesn't happen once per millennium. Sort of the nature of random observations...
Well, this is where we still differ, then.

If this data is not atypical, your measurements must be corrupted by effects that you are not aware of, and perhaps cannot even imagine. But they must be there, if the data show them to be there.

If the games are independent (as you claim, and as would be expected) the mini-match results are distributed according to a normal distribution with a SD bounded by sqrt(80). That puts a hard upper limit to the _probability_ for observing a deviation of a certain size (in particular, once every 15,000 mini-matches for a deviation of 4*sigma).

If you see it _significantly_ more often, you will have significant evidence that your measurement is invalid. That you can't imagine what caused the screw up can at best tell something about you, but never anything about the data.

If it does _not_ occur significantly more often than once every 15,000 mini-matches, then it _is_ very atypical.

Take your pick...
OK, let me define "atypical" here.

Atypical: adj. A word that means "not typical", or, "unusual".

In my usage, atypical means "unusual". I have played too many of these matches to call any result I have posted "unusual". Does that mean these results happen every 2 tests? No. In a set of 80 tests? Sometimes.

There is enough randomness in the results, and the positions used are "interesting" enough, that randomness can hurt more in some positions than it helps in others, or vice versa. I suspect a much larger set of positions might well decrease these fluctuations, but that is going in the wrong direction for fast turnaround. I did look at one particular position when we started trying to figure out this behavior, and there was a place where several moves were ranked by Crafty as very close to each other, yet most of them actually lost further into the game. Whenever the opponent varied in that part of the game, it was generally varying into fairly easy losses for itself (or wins for Crafty). A "lucky" (or "unlucky") type of position, depending on which side you are playing.

So 40 starting positions is a _very_ small sample of the possible starting positions, which already gives us a small sample size and a large variance. And certain types of positions simply have more variance than others, because there is a precise path to the win, and any variation can turn it into a loss or a draw; at the time control used, that can't be determined at the point the move is chosen.

So none of these results surprise me. I have played so many of these 80-game matches and seen such varying results that nothing is surprising at all, and none strike me as "atypical", probably because after playing 100,000 such matches I have seen nearly every result that is possible...
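
Whether position-to-position differences alone can produce the observed spread is testable with a toy model (a sketch; the per-position win probabilities are invented, games are independent, and draws are left out to maximize the variance):

Code: Select all

# Can 40 starting positions with wildly different win chances push the
# mini-match SD above the independent-games bound of sqrt(80)?
# Per-position win probabilities are invented for illustration.
import random
import statistics

POSITIONS = 40  # each played twice = 80 games per mini-match
probs = [random.random() for _ in range(POSITIONS)]  # very heterogeneous

def match():
    score = 0
    for p in probs:
        for _ in range(2):  # two games from each starting position
            score += 1 if random.random() < p else -1  # no draws
    return score

results = [match() for _ in range(20_000)]
print(f"observed SD = {statistics.pstdev(results):.2f}"
      f"  (bound: sqrt(80) = {80 ** 0.5:.2f})")

In this toy model the spread around the match's own mean stays at or below sqrt(80) no matter how heterogeneous the positions are, as long as the games are independent.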
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

Uri Blass wrote:
Tony wrote:
hgm wrote:
bob wrote: I responded because you suggested that the data was made up. The data was _not_ "atypical". I just posted some more data that started off pretty wildly then settled down. Whether it will "wilden back up" or not is unknown. Seeing drastic differences is not that uncommon. Doesn't happen every time. Doesn't happen once per millennium. Sort of the nature of random observations...
Well, this is where we still differ, then.

If this data is not atypical, your measurements must be corrupted by effects that you are not aware of, and perhaps cannot even imagine. But they must be there, if the data show them to be there.

If the games are independent (as you claim, and as would be expected) the mini-match results are distributed according to a normal distribution with a SD bounded by sqrt(80). That puts a hard upper limit to the _probability_ for observing a deviation of a certain size (in particular, once every 15,000 mini-matches for a deviation of 4*sigma).

If you see it _significantly_ more often, you will have significant evidence that your measurement is invalid. That you can't imagine what caused the screw up can at best tell something about you, but never anything about the data.

If it does _not_ occur significantly more often than once every 15,000 mini-matches, then it _is_ very atypical.

Take your pick...
Hi HG,

it's not atypical. It just means the distribution is not normal.

It is only atypical if the distribution is supposed to be normal. But are you sure it should be?

Tony
The distribution of the match result should be close to normal by the central limit theorem, so the error in the probability estimate is not large.

Uri
What about the case where the results are nearly perfectly random? Then the mean is still well-defined, but the standard deviation is at its maximum.

We are taking very small samples (80 games) out of a potential game space that has to be up at 10^40 games or beyond. If the total set of games has such a random characteristic, what do you expect to find in the very small samples you obtain? You take tiny atmospheric samples in just a few places on the planet, and you can see heavy pollution, pure air, or air full of dust. Only when you take enough samples to get a representative average can you draw any conclusion about air quality.

But these sample sizes are _tiny_ compared to the total population of games that could be sampled. And getting apparently biased results from tiny samples is not exactly unusual...
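
How much the astronomical size of the game space matters for an 80-game sample can also be checked directly (a sketch; the population sizes and the 50/50 win/loss makeup are arbitrary assumptions):

Code: Select all

# Does the size of the population being sampled change the spread of
# an 80-game sample total? Population sizes and the 50/50 win/loss
# makeup are arbitrary assumptions for illustration.
import random
import statistics

SAMPLE = 80

for pop_size in (1_000, 1_000_000):
    population = [1] * (pop_size // 2) + [-1] * (pop_size // 2)
    totals = [sum(random.choice(population) for _ in range(SAMPLE))
              for _ in range(20_000)]
    print(f"population {pop_size:>9,}: SD of sample total = "
          f"{statistics.pstdev(totals):.2f}  (sqrt(80) = 8.94)")

The spread of the sample total is governed by the sample size alone; drawing from a vastly larger population does not widen it.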