An objective test process for the rest of us?

Discussion of chess software programming and technical issues.

Moderator: Ras

hgm
Posts: 28359
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: An objective test process for the rest of us?

Post by hgm »

bob wrote:
hgm wrote:
bob wrote:Personally, I find that particular comment absolutely beyond stupid. Asinine, ignorant, incompetent all come to mind to describe it, in fact. Feel free to make up what you want. In fact, based on that comment, feel free to continue this discussion with yourself. I certainly have better things to do than deal with that level of ignorance...
Oh well, everyone is entitled to his personal opinion... Usually, personal opinions are very revealing.

This one, for instance, reveals a lot about your scientific abilities and statistical skills. You give an example in a scientific discussion that represents a one-in-a-million event (4 sigma, plus a second rather large deviation), so it would certainly be very important to know if this is a hypothetical case, selected actual data or randomly chosen actual data.
Do you ever read? I believe I answered that question _exactly_. I don't recall why I picked that 32-match set, but the four 80-game matches I gave were the first 4 played.
That was not the question. The question was if you selected that 32-match set. _Now_ you say you don't know why you selected that. I can't recall having seen that before, but I am sure you will be kind enough to point it out to me. And in any case, it is not an answer to the question, just an announcement that no answer will be forthcoming.
As I said. Now as to whether the complete set of 32 matches (I run dozens of such 32-match tests) was one that started off particularly bad or not, I don't remember. But I didn't just choose the 4 worst cases. If you think about it you could tell that from the data. How could the average be close to zero with those big negative results thrown in? Unless there were some positive scores as well?

Just stop and think for a moment before making ridiculous statements...
If I think about it, it seems that it is possible to compensate a -31 result by 31 +1 results in 31 other matches. The average doesn't tell you a thing about the variance in the missing data that pulled the average up. There can just as easily be 10 +27 results and 9 other -31 results to get the average at -2, as 16 zeros. I am sure that if _you_ would think for a moment, you would come to the same conclusion.
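
To make that concrete: a minimal Python sketch, using made-up numbers in the spirit of the example above (not anyone's actual match data), shows two result sets with nearly the same mean but wildly different variance.

Code:

# Illustrative only: two hypothetical sets of 20 mini-match results
# with (nearly) the same mean but wildly different variance.
flat = [0] * 19 + [-2]                 # 19 zeros and one -2
wild = [27] * 10 + [-31] * 9 + [-2]    # large swings that almost cancel

for name, data in (("flat", flat), ("wild", wild)):
    n = len(data)
    mean = sum(data) / n
    var = sum((x - mean) ** 2 for x in data) / n
    print(f"{name}: mean = {mean:+.2f}, variance = {var:.1f}")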

So the 'ridiculous' question still stands: how often do you observe once-in-a-million events in your data? If it is more than once in a million, I would say you have a big problem. If you would stop to think about it, that is...
You didn't ask if I used a hypothetical example. And the question was not necessary since I had already stated where the data came from. You asked "are you making this stuff up?" which is a _far_ different question. And, in fact, is more accusation than question. I am not interested in wasting time in that kind of conversation. Maybe one day you will produce your own data... Rather than telling me how mine should look.
hgm wrote: "When"? Is this a hypothetical case? How do you get these traces? Do you just make them up, or do you select the worst you ever encountered amongst the millions of such mini-matches that you claim to have played?
Well, seems to me that it clearly says "hypothetical case" here. Or do you never read? A hypothetical case is one for which you make up the data, and from the context it seems clear that I refer to that.
hgm wrote: Normally, since the data is so different in character from the 32 traces you posted before, I would assume this is highly selected data, and that you could not repeat such an extreme deviation or anything near it in the next 100 mini-matches, despite the fact that your reaction sets me thinking. That would make it a very _misleading_ example.
That I can't address, other than to say specifically that those were the first four 80-game matches played in a set of either 32 or 64. I don't keep the data because it would be overwhelming to keep up with. I have hundreds of summaries that look like this:

Code:


=============== glaurung results ->
64 distinct runs (5120 games) found
  win/ draw/ lose (score)
 1:  30/ 25/ 25 (  5)
 2:  28/ 23/ 29 ( -1)   29/ 24/ 27 (  2)
 3:  30/ 13/ 37 ( -7)
 4:  29/ 19/ 32 ( -3)   29/ 16/ 34 ( -5)   29/ 20/ 30 ( -1)
 5:  37/ 17/ 26 ( 11)
 6:  26/ 20/ 34 ( -8)   31/ 18/ 30 (  1)
 7:  32/ 18/ 30 (  2)
 8:  30/ 20/ 30 (  0)   31/ 19/ 30 (  1)   31/ 18/ 30 (  1)   30/ 19/ 30 (  0)
 9:  27/ 17/ 36 ( -9)
10:  28/ 20/ 32 ( -4)   27/ 18/ 34 ( -6)
11:  41/ 15/ 24 ( 17)
12:  33/ 21/ 26 (  7)   37/ 18/ 25 ( 12)   32/ 18/ 29 (  2)
13:  34/ 14/ 32 (  2)
14:  25/ 22/ 33 ( -8)   29/ 18/ 32 ( -3)
15:  39/ 10/ 31 (  8)
16:  32/ 19/ 29 (  3)   35/ 14/ 30 (  5)   32/ 16/ 31 (  1)   32/ 17/ 30 (  2)   31/ 18/ 30 (  0)
17:  22/ 20/ 38 (-16)
18:  34/ 20/ 26 (  8)   28/ 20/ 32 ( -4)
19:  34/ 20/ 26 (  8)
20:  34/ 10/ 36 ( -2)   34/ 15/ 31 (  3)   31/ 17/ 31 (  0)
21:  24/ 23/ 33 ( -9)
22:  37/ 18/ 25 ( 12)   30/ 20/ 29 (  1)
23:  32/ 17/ 31 (  1)
24:  29/ 24/ 27 (  2)   30/ 20/ 29 (  1)   30/ 20/ 29 (  1)   30/ 19/ 30 (  0)
25:  26/ 14/ 40 (-14)
26:  35/ 16/ 29 (  6)   30/ 15/ 34 ( -4)
27:  37/ 16/ 27 ( 10)
28:  32/ 19/ 29 (  3)   34/ 17/ 28 (  6)   32/ 16/ 31 (  1)
29:  40/ 18/ 22 ( 18)
30:  30/ 20/ 30 (  0)   35/ 19/ 26 (  9)
31:  30/ 18/ 32 ( -2)
32:  31/ 21/ 28 (  3)   30/ 19/ 30 (  0)   32/ 19/ 28 (  4)   32/ 17/ 29 (  3)   31/ 18/ 29 (  1)   31/ 18/ 30 (  1)
33:  33/ 23/ 24 (  9)
34:  38/ 14/ 28 ( 10)   35/ 18/ 26 (  9)
35:  32/ 18/ 30 (  2)
36:  25/ 18/ 37 (-12)   28/ 18/ 33 ( -5)   32/ 18/ 29 (  2)
37:  25/ 23/ 32 ( -7)
38:  32/ 18/ 30 (  2)   28/ 20/ 31 ( -2)
39:  36/ 15/ 29 (  7)
40:  36/ 14/ 30 (  6)   36/ 14/ 29 (  6)   32/ 17/ 30 (  2)   32/ 17/ 30 (  2)
41:  35/ 16/ 29 (  6)
42:  34/ 10/ 36 ( -2)   34/ 13/ 32 (  2)
43:  23/ 28/ 29 ( -6)
44:  33/ 15/ 32 (  1)   28/ 21/ 30 ( -2)   31/ 17/ 31 (  0)
45:  30/ 19/ 31 ( -1)
46:  35/ 11/ 34 (  1)   32/ 15/ 32 (  0)
47:  28/ 21/ 31 ( -3)
48:  30/ 21/ 29 (  1)   29/ 21/ 30 ( -1)   30/ 18/ 31 (  0)   31/ 17/ 31 (  0)   31/ 17/ 30 (  0)
49:  29/ 17/ 34 ( -5)
50:  30/ 21/ 29 (  1)   29/ 19/ 31 ( -2)
51:  36/ 15/ 29 (  7)
52:  36/ 14/ 30 (  6)   36/ 14/ 29 (  6)   32/ 16/ 30 (  2)
53:  28/ 22/ 30 ( -2)
54:  30/ 23/ 27 (  3)   29/ 22/ 28 (  0)
55:  27/ 16/ 37 (-10)
56:  29/ 21/ 30 ( -1)   28/ 18/ 33 ( -5)   28/ 20/ 31 ( -2)   30/ 18/ 30 (  0)
57:  26/ 21/ 33 ( -7)
58:  29/ 20/ 31 ( -2)   27/ 20/ 32 ( -4)
59:  32/ 17/ 31 (  1)
60:  27/ 17/ 36 ( -9)   29/ 17/ 33 ( -4)   28/ 18/ 32 ( -4)
61:  29/ 18/ 33 ( -4)
62:  26/ 19/ 35 ( -9)   27/ 18/ 34 ( -6)
63:  29/ 19/ 32 ( -3)
64:  27/ 14/ 39 (-12)   28/ 16/ 35 ( -7)   27/ 17/ 34 ( -7)   28/ 18/ 33 ( -5)   29/ 18/ 32 ( -2)   30/ 18/ 31 ( -1)
what you are looking at are the results of 64 matches, 80 games per match. The first column is the individual match result. The second column is the average of the two preceding results. The third is the average of the two preceding averages, and so on... Somehow the final column is missing, probably due to an error I made when I ran the summary and told it to average only 32 matches rather than all 64.

That is a real run, comparing Crafty to Glaurung, nothing edited out, nothing added in. Do you see the same kind of variability I have been talking about all along? Results from +18 to -16? I have others that are worse. This was the _first_ in my file, so since you seem to think I "pick and choose" I just chose the first one that I saved. And no, I don't save them all, as there is just too much data.
And you know what? This unselected set has exactly the statistics that one would expect. The variance of the mini-match results is 51.7, corresponding to an SD of 7.2. Almost exactly equal to the theoretical prediction of 0.8*sqrt(80) for equal opponents. The largest deviation is the +18 (the average is almost exactly zero), or 2.5 sigma. That is a one-in-a-hundred event. Not very exceptional, for 64 samples.
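
That computation is easy to verify. A minimal Python sketch, with the 64 first-column scores transcribed from the summary above, reproduces these numbers:

Code:

import math

# First-column scores of the 64 mini-matches in the summary above.
scores = [5, -1, -7, -3, 11, -8, 2, 0, -9, -4, 17, 7, 2, -8, 8, 3,
          -16, 8, 8, -2, -9, 12, 1, 2, -14, 6, 10, 3, 18, 0, -2, 3,
          9, 10, 2, -12, -7, 2, 7, 6, 6, -2, -6, 1, -1, 1, -3, 1,
          -5, 1, 7, 6, -2, 3, -10, -1, -7, -2, 1, -9, -4, -9, -3, -12]

n = len(scores)                          # 64 mini-matches
mean = sum(scores) / n
var = sum((s - mean) ** 2 for s in scores) / n
sd = math.sqrt(var)

print(f"mean = {mean:+.2f}, variance = {var:.1f}, SD = {sd:.2f}")
print(f"theoretical SD for independent games: {0.8 * math.sqrt(80):.2f}")
print(f"largest deviation: {max(scores, key=abs):+d} "
      f"= {max(abs(s - mean) for s in scores) / sd:.1f} sigma")
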
That is actually pretty funny. Did I accuse _you_ of making things up? Did I accuse _you_ of exaggerating? And _I_ can't engage in polite discussion? :)

You are _way_ out there, let me tell you. _WAY_ out there. With that kind of attitude, I doubt you can learn anything from anybody...
Well, as I see it, I merely asked a relevant scientific _question_ about data you presented, which, based on the new data you present above, was indeed very untypical. That you perceive a critical and very much to-the-point question as an accusation is, well, let's say remarkable. But that is not my problem.
Uri Blass
Posts: 10823
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: An objective test process for the rest of us?

Post by Uri Blass »

hgm wrote:
[...] And you know what? This unselected set has exactly the statistics that one would expect. The variance of the mini-match results is 51.7, corresponding to an SD of 7.2, almost exactly equal to the theoretical prediction of 0.8*sqrt(80) for equal opponents. The largest deviation is the +18 (the average is almost exactly zero), or 2.5 sigma. That is a one-in-a-hundred event. Not very exceptional, for 64 samples. [...]
I have the following comments:

1) The data is possible but still surprising, because I expect some correlation between results of the same game against the same opponent, so I expect the variance to be at least slightly smaller than the variance of independent events.

At least the variance here is no bigger than in the case that you use a random position in every game, but I would expect it to be smaller than in that case.

2) I know from experience that changing the number of nodes can cause significant changes in the result.

In spite of that, common sense tells me that the change should be smaller than with randomly chosen positions, and the only question is how much smaller. Yet even the Crafty results here do not suggest that the variance is smaller than in the case that every position has a fixed probability of loss, draw or win.

I think that an analysis of the variance of the results of game number 1, game number 2, ..., game number 80 may be interesting (I expect it to have an average that is smaller than 0.8, assuming a loss is 0, a draw is 1 and a win is 2).


If you usually find that the standard deviation of the match result is near 0.8*sqrt(80) when the average variance of a single game is clearly smaller than 0.8 (because there are positions where you almost always lose or almost always win), then it suggests that the results within a single match are not independent.
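
A sketch in Python of the check Uri proposes, assuming one had kept the per-game results as a 64x80 array (per Bob, the individual games are not saved, so the input here is hypothetical):

Code:

def independence_check(results):
    """Check Uri's independence criterion.

    results: hypothetical 64x80 list of lists; results[m][g] is the
    outcome of game slot g in mini-match m, scored 0 = loss, 1 = draw,
    2 = win (Uri's scale).
    """
    n_matches = len(results)
    n_games = len(results[0])

    # Sum over game slots of the variance of that slot across matches.
    sum_slot_var = 0.0
    for g in range(n_games):
        col = [row[g] for row in results]
        mu = sum(col) / n_matches
        sum_slot_var += sum((x - mu) ** 2 for x in col) / n_matches

    # Variance of the mini-match totals.
    totals = [sum(row) for row in results]
    mu = sum(totals) / n_matches
    total_var = sum((t - mu) ** 2 for t in totals) / n_matches

    # For independent games, Var(sum) = sum of per-game variances, so
    # total_var ~ sum_slot_var up to sampling noise; total_var clearly
    # larger would indicate positive correlation between the games.
    return total_var, sum_slot_var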

Uri
hgm
Posts: 28359
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: An objective test process for the rest of us?

Post by hgm »

bob wrote:as I have mentioned repeatedly, you can see what causes the variability by running this test:

...
You still don't get it, do you?

You continue dwelling on mechanisms that would make games different, while in fact everyone is asking you to answer for your claim that the results of games tend to be the _same_: games that started from different positions, and still somehow the result of one game determines the outcome of the other.

For that is what having a variance larger than the theoretical maximum for independent games means: the games must be correlated. So stop babbling about what causes variability, and explain to us how one game in your 80-game mini-matches can affect the outcome of other games in that same 80-game mini-match. And not just affect it, but affect it in such a way that the result is steered in a very particular direction, namely to be the same as that of the earlier game.

I guess you will be hard pressed to come up with such an explanation.

And that is of course because the claim is not valid in the first place: so far the longer data runs you showed do _not_ back up the claim at all, and the snippet of 4 mini-matches you showed is either an extreme fluke, not typical for most of your data at all, or selected. (And with selected I do _not_ mean that you collected them from different runs, for which there would be no need at all, as there is only one truly extreme deviation amongst those 4. Selected means that you noticed a very extreme fluctuation, much larger than you _ever_ saw before, and decided to keep that 80-game run as an example.) Whatever the reason, the fact is that that 4-match snippet is quite different in character from all the other data that you showed here.
hgm
Posts: 28359
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: An objective test process for the rest of us?

Post by hgm »

Uri Blass wrote:I think that an analysis of the variance of the results of game number 1, game number 2, ..., game number 80 may be interesting (I expect it to have an average that is smaller than 0.8, assuming a loss is 0, a draw is 1 and a win is 2).
I agree. Perhaps all games are close to 50% scoring; you really have to have an extremely biased game to make the variance significantly lower. Note furthermore that the draw percentage here is rather low. For 31/18/31 (+/=/-) the per-game variance is 62/80 * 1^2 ~ 0.78, which makes the SD ~ 0.88*sqrt(80). So there is room for some bias in individual games, and of course an estimate for the variance based on only 64 samples is in itself not very precise.
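
How imprecise a 64-sample estimate is can be illustrated with a small Monte-Carlo sketch in Python, assuming independent games with the 31/18/31 frequencies quoted above:

Code:

import math, random

random.seed(1)
TRIALS, N, GAMES = 1000, 64, 80
P_WIN = P_LOSS = 31 / 80           # win/loss frequencies from the run above

def minimatch_score():
    # One mini-match: 80 independent games scored +1/0/-1.
    s = 0
    for _ in range(GAMES):
        r = random.random()
        s += 1 if r < P_WIN else (-1 if r < P_WIN + P_LOSS else 0)
    return s

sd_estimates = []
for _ in range(TRIALS):
    sample = [minimatch_score() for _ in range(N)]
    mu = sum(sample) / N
    sd_estimates.append(math.sqrt(sum((x - mu) ** 2 for x in sample) / N))

sd_estimates.sort()
print(f"true SD: {math.sqrt(62 / 80 * GAMES):.2f}")
print(f"median 64-sample estimate: {sd_estimates[TRIALS // 2]:.2f}")
print(f"90% of estimates fall in [{sd_estimates[TRIALS // 20]:.2f}, "
      f"{sd_estimates[-(TRIALS // 20)]:.2f}]")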

So there is really nothing in this data that can be described as statistically remarkable.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

hgm wrote:
[...]
That was not the question. The question was if you selected that 32-match set. _Now_ you say you don't know why you selected that. I can't recall having seen that before, but I am sure you will be kind enough to point it out to me. And in any case, it is not an answer to the question, just an announcement that no answer will be forthcoming.
It is an unanswerable question. Did you see the "summary" I posted, where I gave win/lose/draw scores for each 80-game match, then the averages of pairs, then quads, etc.? That is what I save. I don't save the individual games. 8 million PGN files is simply unmanageable.

The results I posted were from the test I had in progress at that instant in time. While one of the 20K-game tests is in progress, all the PGN files are present. I can therefore run a simple program to scan them, pick one specific opponent and one specific 80-game match, and then produce one of those 80-character result strings. I can only do that when I have the PGN handy. The only PGN I ever have handy is from the test being run at that moment, or, if none is being run, possibly the PGN from the last test that was run.

So there is no "selection criteria" whatsoever, it is a matter of "what is available". Does that answer your question. Some runs don't have wild variance. Some do. Some are somewhere in between. I just get what I get.

As I said. Now as to whether the complete set of 32 matches (I run dozens of such 32-match tests) was one that started off particularly bad or not, I don't remember. But I didn't just choose the 4 worst cases. If you think about it you could tell that from the data. How could the average be close to zero with those big negative results thrown in? Unless there were some positive scores as well?

Just stop and think for a moment before making ridiculous statements...
If I think about it, it seems that it is possible to compensate a -31 result by 31 +1 results in 31 other matches. The average doesn't tell you a thing about the variance in the missing data that pulled the average up. There can just as easily be 10 +27 results and 9 other -31 results to get the average at -2, as 16 zeros. I am sure that if _you_ would think for a moment, you would come to the same conclusion.
Again, think first, write second. I gave you _four_ of the 32 matches. Two were well off from the expected value, so 31 +1's could not possibly happen. Others could. I just picked the first four because it is easy to do this:

./results wld match1 | head -4

and then paste that into the post. By the time the question came up, that data was gone. I then posted the entire result set for the last test (at that time) that had completed. A fairly random sampling methodology, if I do say so, unless you can somehow figure out when to ask a question so that I pick a less favorable or more favorable sample.


So the 'ridiculous' question still stands: how often do you observe once-in-a-million events in your data? If it is more than once in a million, I would say you have a big problem. If you would stop to think about it, that is...
Nothing you have said to date convinces me that is a "one in a million sample". Suppose the results are absolutely random, uniformly distributed between 80 (all wins) and 0 (all losses). Is it now _that_ rare to get four results like 80, 60, 35, 45? Obviously the mean will be 40. But the numbers are uniformly distributed which means any small number of samples could be anywhere. That is sort of what I am seeing. Look at the last results I posted, a complete 64 match set of results. Are you surprised in the variance between the individual 80 game matches? Would you randomly pick one of those and use it to say go/no-go for a recent change???
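
Both hypotheses in the preceding paragraph are easy to quantify. A small Python sketch, taking the stated assumption of results uniform over 0..80 wins at face value and comparing it with independent games:

Code:

import math

N = 80
# Bob's hypothetical: match results uniform over 0..80 wins (no draws).
uniform_sd = math.sqrt(((N + 1) ** 2 - 1) / 12)   # discrete uniform 0..N
# Independent games, each an even-odds win/loss: binomial(80, 0.5).
binom_sd = math.sqrt(N * 0.5 * 0.5)

print(f"SD of wins if results uniform on 0..{N}: {uniform_sd:.1f}")   # ~23.4
print(f"SD of wins if games are independent:     {binom_sd:.1f}")     # ~4.5
# The observed SD of the score (wins - losses) was ~7.2, i.e. ~3.6 in
# win units, since score = 2*wins - 80 when there are no draws.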

You didn't ask if I used a hypothetical example. And the question was not necessary since I had already stated where the data came from. You asked "are you making this stuff up?" which is a _far_ different question. And, in fact, is more accusation than question. I am not interested in wasting time in that kind of conversation. Maybe one day you will produce your own data... Rather than telling me how mine should look.
hgm wrote: "When"? Is this a hypothetical case? How do you get these traces? Do you just make them up, or do you select the worst you ever encountered amongst the millions of such mini-matches that you claim to have played?
Well, seems to me that it clearly says "hypothetical case" here. Or do you never read? A hypothetical case is one for which you make up the data, and from the context it seems clear that I refer to that.
Sorry, but no wiggle room there. "do you make this stuff up" is an accusation delivered in the form of a question.
[...]
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

hgm wrote:
Uri Blass wrote:I think that an analysis of the variance of the results of game number 1, game number 2, ..., game number 80 may be interesting (I expect it to have an average that is smaller than 0.8, assuming a loss is 0, a draw is 1 and a win is 2).
I agree. Perhaps all games are close to 50% scoring; you really have to have an extremely biased game to make the variance significantly lower. Note furthermore that the draw percentage here is rather low. For 31/18/31 (+/=/-) the per-game variance is 62/80 * 1^2 ~ 0.78, which makes the SD ~ 0.88*sqrt(80). So there is room for some bias in individual games, and of course an estimate for the variance based on only 64 samples is in itself not very precise.

So there is really nothing in this data that can be described as statistically remarkable.
I never said there was. What _is_ in the data is the clear indication that using _one_ of those 80-game matches to make a go/no-go decision about a change is not going to be accurate in any way. Even using 2 or 4 of them is not good enough.

That is what I have said all along. All the discussion about distribution, variance, etc notwithstanding...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

hgm wrote:
bob wrote:as I have mentioned repeatedly, you can see what causes the variability by running this test:

...
You still don't get it, do you?

You continue dwelling on mechanisms that would make games different, while in fact everyone is asking you to answer for your claim that the results of games tend to be the _same_: games that started from different positions, and still somehow the result of one game determines the outcome of the other.
What is that based on? Reading tea leaves? I have been running these kinds of tests for _many_ years. Many is >= 30. So I have some clue about how to run game tests where things don't get biased by the testing itself. Each game is played separately. Games are played on different systems. I monitor the load carefully while a game is in progress to make sure that nothing unusual happens (it rarely does, but it is not a zero-probability event).

There is absolutely no way that game X influences the result of game Y. It is physically impossible. So we can get off that bandwagon before it leaves the station.

My original belief was that N games played from the same starting position with the same time control would produce N duplicates, with perhaps an _occasional_ outlier due to a timing issue. That is not the case. A single starting position can produce an equal number of wins and losses (and I am talking about when both sides play the same color repeatedly), which was a _big_ surprise to me. I then set about trying to determine where this came from, and I found the issue where tiny node-count variance produces big result differences.

I don't see anything that suggests that games influence each other. There is no learning. There is no shared hashing. Programs are restarted cleanly after each round, starting with the _same_ operating system memory image. I have confirmed with thousands of games that the NPS for Crafty barely varies and that the same pages are allocated each time due to the uniform starting point.

There is just nothing that could possibly carry over from one game to the next, much less from one position to the next. Two games are played on the same computer. Different positions are scattered over different cluster nodes that do _not_ share memory or anything other than the home directory file system.

This is really not worth discussing. We could just as easily discuss how cosmic rays might produce this influence.

For that is what having a variance larger than the theoretical maximum for independent games means: the games must be correlated. So stop babbling about what causes variability, and explain to us how one game in your 80-game mini-matches can affect the outcome of other games in that same 80-game mini-match. And not just affect it, but affect it in such a way that the result is steered in a very particular direction, namely to be the same as that of the earlier game.
I'm not babbling, you are. There is absolutely no way this can happen in this experimental setup. I have a program that farms out two-game matches to each node I am using. Each node is initialized to the same starting point each time, the programs are started, and a game is played. This is repeated for the second game with reversed colors. With 128 nodes, I play 128 2-game matches at a time, until I complete the entire set of 64 matches X 40 positions X 4 opponents, something over 20K games.
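
For scale, a small Python sketch of the job arithmetic just described (the opponent names apart from glaurung are placeholders, not the actual test suite):

Code:

from itertools import product

# Sketch of the schedule described above: 4 opponents x 64 mini-matches
# x 40 positions, each position played twice with colors reversed and
# farmed out as one two-game job per cluster node. Opponent names other
# than glaurung are placeholders.
OPPONENTS = ["glaurung", "opponent2", "opponent3", "opponent4"]

jobs = list(product(OPPONENTS, range(64), range(40)))
games = 2 * len(jobs)                  # two games (color-reversed) per job
print(len(jobs), "two-game jobs =", games, "games")  # 10240 jobs, 20480 games

# With 128 cluster nodes each running one job at a time, the whole set
# takes len(jobs) / 128 = 80 scheduling rounds.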

I guess you will be hard pressed to come up with such an explanation.


Yes, since it doesn't happen, I certainly will have a hard time explaining it. Perhaps you are overlooking another solution I have already mentioned? The data is more random than you give it credit for being.
And that is of course because the claim is not valid in the first place: so far the longer data runs you showed do _not_ back up the claim at all, and the snippet of 4 mini-matches you showed is either an extreme fluke, not typical for most of your data at all, or selected. (And with selected I do _not_ mean that you collected them from different runs, for which there would be no need at all, as there is only one truly extreme deviation amongst those 4. Selected means that you noticed a very extreme fluctuation, much larger than you _ever_ saw before, and decided to keep that 80-game run as an example.) Whatever the reason, the fact is that that 4-match snippet is quite different in character from all the other data that you showed here.
Again, look at the 64 matches given and quoted here. If you pick one at random, which is what you do when you just run 80 games and stop, what is the probability you pick one that is misleading?

Suppose I ran two tests, unknowingly using the same program because I forgot to copy the source over. And then take the stretch a bit further and assume that the two tests happen to produce the exact same 64 matches, although they would probably be in a different order. With the complete set, I conclude "no change", since the two scores after over 20K games each are identical. But if I just take the first 80-game match from each sample, I would get a different conclusion. 80 games just doesn't eliminate the inherent randomness in computer vs computer games. It really is just that simple.
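
That point can be quantified with a normal approximation. A short Python sketch, assuming the ~7.2 SD per 80-game score discussed earlier, gives the chance that two such matches of the _same_ program land a given distance apart:

Code:

import math

sd_match = 0.8 * math.sqrt(80)        # ~7.2, SD of a single 80-game score
sd_diff = sd_match * math.sqrt(2)     # ~10.1, SD of the difference of two

def p_diff_at_least(d):
    # Two-sided normal tail: P(|score1 - score2| >= d).
    z = d / sd_diff
    return math.erfc(z / math.sqrt(2))

for d in (5, 10, 15):
    print(f"P(|score1 - score2| >= {d:2d}) ~ {p_diff_at_least(d):.2f}")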

All the other speculating, hypothesizing, guessing, etc... are just not productive here. The data is what it is. And what it is is far more random than one would expect IMHO. Otherwise nobody would be running 80 game matches of any kind and drawing any conclusion from them...
hgm
Posts: 28359
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: An objective test process for the rest of us?

Post by hgm »

bob wrote:What is that based on? Reading tea leaves?
No, on mathematical calculation / proof. Something that, by now, I am beginning to fear is an utterly alien concept to you.
I have been running these kinds of tests for _many_ years. Many is >= 30. So I have some clue about how to run game tests where things don't get biased by the testing itself. Each game is played separately. Games are played on different systems. I monitor the load carefully while a game is in progress to make sure that nothing unusual happens (it rarely does, but it is not a zero-probability event).

There is absolutely no way that game X influences the result of game Y. It is physically impossible. So we can get off that bandwagon before it leaves the station.
So explain to us then how the results of games X and Y become correlated...

For if they are not correlated, the SD of the mini-match results would be limited to sqrt(80) ~ 9, and would even be ~7 if you have draws, so results >= 30 are 4-sigma fluctuations, and should not occur more often than once every 15,000 mini-matches or so. Not "many times".

The only way you can have a probability distribution of the mini-match results other than a normal one with an SD of ~0.8*sqrt(80) = ~7 is to have correlation between the 80 games. That is a hard mathematical fact, as certain as 1+1=2.
So you should either drop your claim that extreme deviations occur much more frequently than such a normal distribution would predict, or you would have to admit that there is correlation between games. (Or you could of course start claiming that all mathematics that has been done since Euclid is just nonsense...)
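
For reference, the frequency claim is a one-line computation (normal approximation, equal opponents assumed):

Code:

import math

sd = 0.8 * math.sqrt(80)                 # ~7.2 for equal opponents
z = 30 / sd                              # a +/-30 result in sigmas

p = math.erfc(z / math.sqrt(2))          # two-sided normal tail
print(f"z = {z:.2f}, P = {p:.1e}, i.e. about 1 in {1 / p:,.0f}")

p4 = math.erfc(4 / math.sqrt(2))         # a round 4-sigma event
print(f"4 sigma: about 1 in {1 / p4:,.0f} mini-matches")   # ~1 in 15,800
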
Gerd Isenberg
Posts: 2251
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: An objective test process for the rest of us?

Post by Gerd Isenberg »

hgm wrote:
[...]
You sound like a theoretical wisenheimer to me ;-)
I have no clue about statistics, but somehow Bob's results sound more plausible to me, and your deterministic ones suspect.

Assuming programs terminate their search prematurely, before they finish an iteration - based on polling the time every N nodes - I would have expected such a random result from a set of balanced-position matches between opponents of quite equal strength. Keeping the hash between searches amplifies very minor changes. IMHO a simple application of chaos theory.

There is a multithreaded OS environment: context-switch granularity, a shifting phase between polling the time after N nodes and the clock counters, processor-thread affinity, page and cache issues, huge (chaotic) processor heuristics like the TLB and BTB, other running processes/threads (even if sleeping most of the time), etc. The number of instructions per unit of time may vary a lot per process/thread. Searches may vary by up to some +-1E5 nodes (+-0.1 sec) - leaving different hash footprints.

In quiet positions with a lot of equally good moves, even very minor move-sorting changes from slightly different hash footprints may now and then result in another move at the root, especially in selective programs that are not pure alpha-beta. One such move is enough to get a different game with another outcome, from either side. "Complexity" of evaluation, i.e. the number of terms, and "noise" may amplify that non-determinism as well, as may subtle "constructive" bugs in eval due to initialization issues or aggressive compiler optimization ;-)

Weaker programs with unbalanced knowledge and mutually exclusive holes may somehow be inclined to play those positions more deterministically.

Running multiple programs on multiple cores seems to be another amplifier. If one thread takes advantage of some lucky resource mappings, other processes are likely more unlucky.

Better suited to playing more significant matches with fewer games would be two single-core computers, each running a single thread under a DOS-like (real-time) OS. And/or both programs could terminate on a number of nodes estimated from their target time and NPS - a persistent NPS[matkey], "learned" beforehand in different runs over all kinds of material constellations.
hgm
Posts: 28359
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: An objective test process for the rest of us?

Post by hgm »

Well, computers are designed to be deterministic machines; if you ignore time, there wouldn't be any variability at all, ever. And many simple engines don't even read the clock, and those that do hardly act on it. That is just as much an empirical fact as Bob's variability for his engine. In the games between uMax and Eden that Nicolai was running in the other thread, all white games (4 so far) are the same, and all black games (3 so far) are the same, move for move up to the very end.

But that is not really an issue, as it is all trivially understood from how the engines are programmed to function, and the magnitude of the timing noise. I did the calculation in an earlier post, and it matched the observations on uMax and Eden quite well. And, considering the different time management of Crafty, it is understandable that this engine is nearly two orders of magnitude more sensitive to timing fluctuations. So all of that is "old hat".