An objective test process for the rest of us?

Discussion of chess software programming and technical issues.

Moderator: Ras

hgm
Posts: 28395
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: An objective test process for the rest of us?

Post by hgm »

bob wrote:What about the case where the results are nearly perfectly random? So now the mean is well-defined but the standard deviation is at its max?
This is the case we have been talking about from the very beginning.

Totally random results cannot have a standard deviation larger than sqrt(80). That _is_ the max, and only if there can't be any draws. Here there are many draws, so the standard deviation _must_ be smaller. But sqrt(80) is the max. Unless the games are correlated.
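As an illustration of that bound, here is a minimal Python sketch (the 35% draw rate is an assumed figure, not one taken from this thread) that computes the standard deviation of the total score of 80 independent games scored +1/0/-1:

```python
import math

def minimatch_sigma(games=80, p_win=0.5, p_loss=0.5):
    """Standard deviation of the total score of `games` independent games,
    each scored +1 (win), 0 (draw) or -1 (loss); the draw probability is
    implicitly 1 - p_win - p_loss."""
    mean = p_win - p_loss                 # expected score of one game
    var = p_win + p_loss - mean * mean    # E[X^2] - E[X]^2
    return math.sqrt(games * var)

# No draws, 50/50 results: the maximum, sqrt(80) ~ 8.94
print(minimatch_sigma())
# With an assumed 35% draw rate the spread must be smaller, about 7.2
print(minimatch_sigma(p_win=0.325, p_loss=0.325))
```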

The word games you try to play now are just plain silly. Something that occurs only once every 15,000 cases cannot be atypical because you have seen more than 15,000 cases, and what you have seen once or twice cannot be atypical? Ridiculous!

And totally irrelevant, as we are not discussing whether the description 'atypical' is a proper one, but the fact that you post a 1-in-15,000 fluke as if it were the most common thing in the world, and then throw a fit when you get caught at it... So even if this falls under your definition of 'typical', the 15,000 still stands, and that is all that matters.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

hgm wrote:
bob wrote:What about the case where the results are nearly perfectly random? So now the mean is well-defined but the standard deviation is at its max?
This is the case we have been talking about from the very beginning.

Totally random results cannot have a standard deviation larger than sqrt(80). That _is_ the max, and only if there can't be any draws. Here there are many draws, so the standard deviation _must_ be smaller. But sqrt(80) is the max. Unless the games are correlated.

The word games you try to play now are just plain silly. Something that occurs only once every 15,000 cases cannot be atypical because you have seen more than 15,000 cases, and what you have seen once or twice cannot be atypical? Ridiculous!
No, not ridiculous at all. Just some sort of word game. I see _significant_ randomness in every group of matches I play. Some far worse than others. And on rare occasion, some that are very close to each other. I already explained, more than once, that I have never been interested in trying to statistically analyze this data for what I am working on, which is (was) an attempt to eliminate or explain every bit of the randomness I was seeing.

I was not interested in how much variance I was seeing; I wanted to understand _why_ I was (and am) seeing it. I wanted to make sure there were no bugs, no unexpected randomness in either Crafty or the opponents I am using. For example, an occasional loop that hangs until the search times out and a move is made after almost no searching, which is an easy way to lose a game (and yes, I have seen that bug in the past).

So all I cared about here was why this is happening, and whether there is a viable way to stop it from happening so that a small sample size will produce a reliable result. The basic answer to the small-sample-size question is "no", at least for the programs I am using, since we all time the search in a similar non-whole-iteration way. I reported here that the sample sizes being quoted to either (a) compare X to Y or (b) determine whether X is better than X' are not sufficient, because I ran over 100K 80-game samples and easily discovered that 80 games is not enough to prove anything with any reasonable standard error.

At some point, I do plan on going the statistical route to discover what the "optimal" sample size is. Right now, I can run enough tests to smooth out the randomness easily enough. But I am also interested in how far I can back off from my current sample size and still produce acceptable error. I am not there yet, but I will get there at some point. All I know right now is that to discover whether A' is better than A requires _way_ more than 320 games, and that 20K games provides a stable result. Since 20K is not intractable at present, I am using that and getting accurate results. As I have time, I have a statistician friend who wants to help see if we can back down from that number a bit without compromising accuracy. At present, cutting it by a factor of 2 introduces more randomness. I can already run 20K games twice and get slightly different results each time. The fewer games I play, the greater the variance, and it _quickly_ loses the ability to pick up on small changes in skill.
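As a rough back-of-the-envelope illustration of why small samples cannot resolve small strength differences, here is a Python sketch (the Elo targets, the 35% draw rate and the 2-sigma criterion are assumptions, not numbers from this thread) that estimates how many independent games are needed before the statistical error drops below the score shift being measured:

```python
import math

def score_from_elo(diff):
    """Expected score (0..1) for an Elo advantage of `diff` points."""
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

def games_needed(elo_diff, draw_rate=0.35, sigmas=2.0):
    """Games required until `sigmas` standard errors of the mean score
    (1 / 0.5 / 0 scoring) are smaller than the shift caused by `elo_diff`."""
    shift = score_from_elo(elo_diff) - 0.5
    var = 0.25 * (1.0 - draw_rate)   # per-game score variance near a 50% score
    return math.ceil((sigmas / shift) ** 2 * var)

for d in (20, 10, 5):
    print(f"{d:3d} Elo difference: about {games_needed(d):6d} games")
```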

That's where I am. Along the way you ask questions and I try to answer them. Is one of those wild variations "atypical"? Depends on your definition of atypical. I have seen them often enough that they don't stand out to me at all, so I don't call them atypical. If you use some definition like (say) once in 10,000 trials is atypical, or once in 100 trials is atypical, that's fine. I don't have a precise number I use in that comparison. I just think of "rare", and those results are not rare to me.


And totally irrelevant, as we are not discussing whether the description 'atypical' is a proper one, but the fact that you post a 1-in-15,000 fluke as if it were the most common thing in the world, and then throw a fit when you get caught at it... So even if this falls under your definition of 'typical', the 15,000 still stands, and that is all that matters.
That is pretty strange. The last test I posted had 3 or 4 matches of significant variance followed by 7-8 of pretty stable results, and you said "that is nothing special." Yet if I had posted just those first 4, you would say "that is a one-in-15,000 event and it is fake." Since I have played over 8,000,000 games, in 80-game matches, that is 100,000 of those matches. I could _easily_ pick one that was rare. But I didn't pick it because it _was_ rare; I just picked it because it happened to be the data I had at hand. I can't even post the rest of the run I started to post about this morning. Our IBRIX filesystem died, and the run terminated and cleaned up after itself, leaving no data to look at. I am running the test again. If you want, I will post the first 4 matches from it as well, so that we can again get into the catcalls about "that can't be."

As I have said before, "it is what it is." And nothing more. I am trying to supply information. Sometimes I wonder if it is worth the trouble. There is always someone waiting around to poke holes in it rather than take it for what it is.

So keep poking. I'm not sure I'm interested in continuing the dance, however. I want to get back to making my program better, playing what you consider to be an excessive number of games, to get what I consider to be very high accuracy...

that's good enough for "team Crafty"...
hgm
Posts: 28395
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: An objective test process for the rest of us?

Post by hgm »

The problem is that you are so uninterested in doing statistical analysis on your data that you don't seem to be able to distinguish a one-in-a-million event from a one-in-a-hundred event. A deviation of 2.5*sigma is one-in-a-hundred, so if you post 64 mini-matches it is not at all significant if it occurs once. A deviation of 4.5*sigma is a one-in-a-million event. You seem to think "Oh well, it is not even twice as large as 2.5*sigma, so if 2.5*sigma occurs frequently, 4.5*sigma cannot be considered unusual". Well, that is as wrong and as naive as thinking that a billion is only 50% more than a million because it has 9 zeros instead of 6. Logic at the level of "I have seen two birds, so anything flies".
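To make that difference concrete, here is a tiny Python sketch (standard library only) that converts a deviation in sigmas into a one-sided normal tail probability; the exact figures differ a bit from the round numbers above, but the orders of magnitude are the point:

```python
import math

def tail_prob(sigmas):
    """One-sided probability that a normal variable exceeds `sigmas`
    standard deviations above its mean."""
    return 0.5 * math.erfc(sigmas / math.sqrt(2.0))

for k in (2.0, 2.5, 4.5):
    p = tail_prob(k)
    print(f"{k} sigma: p = {p:.2e}  (about 1 in {round(1 / p):,})")
```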

No one contests your right to remain in blissful ignorance of statistics and concentrate on things that interest you more. But then don't mingle in discussions about statistics, as your uninformed and irrelevant comments only serve to confuse people.

The matter of randomness in move choice was already discussed ad nauseam in another thread, and not really of interest to anyone, as it is fully understood, and most of us do not seem to suffer from this as much as you do, if at all.

The fact that you cannot tell apart a one-in-15,000 fluke from a 1-in-4 run-of-the-mill data set really says it all: this discussion is completely over your head. Note that I did not say that your gang-of-four was a "fake" (if you insist on calling a hypothetical case that), but that I only _hoped_ it was a fake, and not a cheat or a goof. OK, so you argue that it was a goof, and that you are not to blame for it because you are not intellectually equipped to know what you are doing ("statistics doesn't interest me"). Well, if that suits you better, fine. But if you want to masquerade as a scientist, it isn't really good advertising...
mhull
Posts: 13447
Joined: Wed Mar 08, 2006 9:02 pm
Location: Dallas, Texas
Full name: Matthew Hull

Re: An objective test process for the rest of us?

Post by mhull »

hgm wrote: A deviation of 2.5*sigma is one-in-a-hundred, so if you post 64 mini-matches it is not at all significant if it occurs once. A deviation of 4.5*sigma is a one-in-a-million event.
Could the apparent correlation in the data be related to the favorable/unfavorable memory and cache issues at program startup that Bob mentioned? Could this introduce a perturbed periodicity in the quality of play by one engine or another that undermines consistency of play?
Matthew Hull
mhull
Posts: 13447
Joined: Wed Mar 08, 2006 9:02 pm
Location: Dallas, Texas
Full name: Matthew Hull

Re: An objective test process for the rest of us?

Post by mhull »

mhull wrote:
hgm wrote: A deviation of 2.5*sigma is one-in-a-hundred, so if you post 64 mini-matches it is not at all significant if it occurs once. A deviation of 4.5*sigma is a one-in-a-million event.
Could the apparent correlation in the data be related to the favorable/unfavorable memory and cache issues at program startup that Bob mentioned? Could this introduce a perturbed periodicity in the quality of play by one engine or another that undermines consistency of play?
Or maybe the cluster is not entirely "uniform", with some nodes containing inconsistently performing hardware.
Matthew Hull
hgm
Posts: 28395
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: An objective test process for the rest of us?

Post by hgm »

Well, this is what I proposed as well, but Bob denies that it can happen.

The problem is that it is not really apparent anymore what the data looks like at all, as we have only Bob's vague statements that "large" deviations occur "often". And as he obviously is not able to see or appreciate the difference between a 2-sigma and a 4-sigma deviation, and seems to consider something "typical" when he has seen it before while sifting through 8 million samples, it becomes rather hard to know what such statements mean.

So I now tend to dismiss all these claims as completely meaningless, and just go on the data that has actually been posted here. And that data does not contain any hint that the distribution and spread of the mini-match results are different from what one would expect for 80 independent games (i.e. normally distributed, with a standard deviation of 7 to 8). All except the infamous gang of four, of course, the origin of which is unclear, but which conveniently and magically happened to "be around".

So I don't think it is worth speculating on anything before we have seen a histogram showing the observed frequency of the results of a few thousand mini-matches. Otherwise we will likely only be chasing ghosts.
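Such a histogram is cheap to produce once the per-game results are available. A minimal Python sketch (the flat list of scores and the made-up data are assumptions about how the results might be stored, not Bob's actual format):

```python
import random
from collections import Counter

def minimatch_histogram(game_results, match_len=80):
    """Bucket a flat list of per-game scores (+1 win, 0 draw, -1 loss,
    from one engine's point of view) into consecutive mini-matches of
    `match_len` games and count how often each total score occurs."""
    totals = [
        sum(game_results[i:i + match_len])
        for i in range(0, len(game_results) - match_len + 1, match_len)
    ]
    return Counter(totals)

# Made-up data: 4000 games with roughly 50% draws -> 50 mini-matches
random.seed(1)
fake = [random.choice((1, 0, 0, -1)) for _ in range(4000)]
for score, count in sorted(minimatch_histogram(fake).items()):
    print(f"{score:+4d}: {'#' * count}")
```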

The best bet currently seems to be that this whole business of "large" variance is just a red herring, caused by a lack of understanding of statistical matters on Bob's part.
Uri Blass
Posts: 10905
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: An objective test process for the rest of us?

Post by Uri Blass »

hgm wrote:The problem is that you are so uninterested in doing statistical analysis on your data that you don't seem to be able to distinguish a one-in-a-million event from a one-in-a-hundred event. A deviation of 2.5*sigma is one-in-a-hundred, so if you post 64 mini-matches it is not at all significant if it occurs once. A deviation of 4.5*sigma is a one-in-a-million event. You seem to think "Oh well, it is not even twice as large as 2.5*sigma, so if 2.5*sigma occurs frequently, 4.5*sigma cannot be considered unusual". Well, that is as wrong and as naive as thinking that a billion is only 50% more than a million because it has 9 zeros instead of 6. Logic at the level of "I have seen two birds, so anything flies".

No one contests your right to remain in blissful ignorance of statistics and concentrate on things that interest you more. But then don't mingle in discussions about statistics, as your uninformed and irrelevant comments only serve to confuse people.

The matter of randomness in move choice was already discussed ad nauseam in another thread, and not really of interest to anyone, as it is fully understood, and most of us do not seem to suffer from this as much as you do, if at all.

The fact that you cannot tell apart a one-in-15,000 fluke from a 1-in-4 run-of-the-mill data set really says it all: this discussion is completely over your head. Note that I did not say that your gang-of-four was a "fake" (if you insist on calling a hypothetical case that), but that I only _hoped_ it was a fake, and not a cheat or a goof. OK, so you argue that it was a goof, and that you are not to blame for it because you are not intellectually equipped to know what you are doing ("statistics doesn't interest me"). Well, if that suits you better, fine. But if you want to masquerade as a scientist, it isn't really good advertising...
I agree with you.
Note that I am not sure what sigma actually is, and you used an upper bound for it.
It is possible that what you describe as 2.5*sigma is in practice more than that (so you could also describe it as atypical data), and in order to have a better opinion we need all the data, not only the match results, to get an estimate for sigma.

The estimate for sigma is the square root of the sum of 80 estimates of the variance of the result of a single game from the same position with the same colors.

I suspect that some of these variance numbers are small because one program wins most of the games from that position.
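In code, that estimate might look like the following Python sketch (the per-position win/draw/loss probabilities are hypothetical; in practice they would come from replaying each start position many times):

```python
import math

def game_variance(p_win, p_draw, p_loss):
    """Variance of one game's score (+1/0/-1) from a fixed position and colors."""
    mean = p_win - p_loss
    return p_win + p_loss - mean * mean

def match_sigma(position_probs):
    """Sigma of a match = sqrt of the sum of the per-position game variances."""
    return math.sqrt(sum(game_variance(*p) for p in position_probs))

balanced = [(0.325, 0.35, 0.325)] * 80   # 80 near-even positions, 35% draws
one_sided = [(0.90, 0.05, 0.05)] * 80    # one side wins almost every game
print(match_sigma(balanced))    # close to the unbiased maximum (about 7.2)
print(match_sigma(one_sided))   # much smaller, as suspected above
```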

Uri
Gerd Isenberg
Posts: 2251
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: An objective test process for the rest of us?

Post by Gerd Isenberg »

Let's say we randomly vary the thinking time of one contestant between 10 and 20 seconds per move, while the second one always gets 15 seconds (or an equivalent number of projected nodes). How does that affect the expected variance (assuming balanced starting positions)?

Thanks,
Gerd
hgm
Posts: 28395
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: An objective test process for the rest of us?

Post by hgm »

Indeed, totally correct. If some of the starting positions have a bias, the spread can be lower. In the first 64-match data set (posted by Bob some months ago) this bias was clearly visible, and the standard deviation turned out to be 5 instead of the maximum for that draw percentage, 7.2. The 64-match data set posted recently does not seem to have such biased games. The measured standard deviation is close to the expected one, so my guess is that sigma is indeed what we expect and see, so that we are really dealing with a 2-sigma deviation.

The gang of four doesn't contain enough games to recognize any bias in individual start positions.
hgm
Posts: 28395
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: An objective test process for the rest of us?

Post by hgm »

Gerd Isenberg wrote:Let's say we randomly vary the thinking time of one contestant between 10 and 20 seconds per move, while the second one always gets 15 seconds (or an equivalent number of projected nodes). How does that affect the expected variance (assuming balanced starting positions)?

Thanks,
Gerd
Do you mean on a per-move basis? Or on a per-game or per-match basis? Usually engines gain about 70 Elo points for doubling the time. That means that at 20 sec per move they are 29 Elo stronger than at 15, and at 10 sec/move they are about 40 Elo weaker. What exactly the effect will be if you alternate (or randomly mix) +29-Elo moves with -40-Elo moves is not completely clear. If the result of a game is decided by adding up a small probability of a fatal mistake over all the moves, that probability would be approximately linear over such a small Elo range, and you could simply take the average Elo. So the engine being modulated would lose about 5 Elo.
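The arithmetic in that argument is easy to check; here is a tiny Python sketch (the 70-Elo-per-doubling figure is the one quoted above, and the simple average of the two extremes stands in for the full mixing calculation):

```python
import math

ELO_PER_DOUBLING = 70.0   # rule-of-thumb gain for doubling the thinking time

def elo_offset(time_s, reference_s=15.0):
    """Elo gained (or lost) relative to thinking `reference_s` per move."""
    return ELO_PER_DOUBLING * math.log2(time_s / reference_s)

fast, slow = elo_offset(10.0), elo_offset(20.0)
print(f"10 s/move: {fast:+.0f} Elo, 20 s/move: {slow:+.0f} Elo")
print(f"simple average of the extremes: {(fast + slow) / 2:+.1f} Elo")
```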

If an engine has to play the entire game with 15 sec instead of 20, it simply is 29 Elo weaker. But the Elo describes the effect on the average score; on the variance such things would hardly have any effect. Unless you had one engine use a different time during the entire mini-match: then the variance could go up (and the game results within one mini-match become correlated, through all using the same time control, which could be different in other mini-matches).