An objective test process for the rest of us?

Discussion of chess software programming and technical issues.

Moderator: Ras

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

hgm wrote:
bob wrote:When 4 sequential 80-game runs produce these kinds of results:

1: =--===++--=-=-++=-=-++++---+=+-+--=+-=-+-+-=-++-+++--=+=++=+---+-+++-==+---++=-- (-2)
2: =---+---+-=+=-+++---++------+=-+===+---=--=---+--+=-=-=-++-+--+--+---=-=+-+++++- (-18)
3: +---=-=-+++-+=+===--+++----=-+-=-+++-+----=-=+==+-=+--+--+=+--+=+=+-+++-+-=+--=- (-6)
4: =-=-==--+---+-=+=----+=---+===---=-=---=--====+------=---+-+--+--=+--++=+--+--=- (-31)

It is hard to draw any conclusions about which represents the real world.
"When"? Is this a hypothetical case? How do you get these traces? Do you just make them up, or do you select the worst you ever encountered amongst the millions of such mini-matches that you claim to have played?
Have you paid any attention to the test data I have been reporting? See those numbers on the left-hand side (1:, 2:, etc.)? Those represent the "match number". The first was match 1 of N (I am not sure whether N was 32 or 64 in this case, but it isn't that important). So those are the first 4, not the worst 4. Unless the first 4 were the worst 4 by some strange twist. I posted another complete set of 32 matches as well.

What would be the purpose of making them up? I simply reported what I found, after being surprised.

Personally, I find that particular comment absolutely beyond stupid. Asinine, ignorant, incompetent all come to mind to describe it, in fact. Feel free to make up what you want. In fact, based on that comment, feel free to continue this discussion with yourself. I certainly have better things to do than deal with that level of ignorance...



You are giving 4 samples drawn from a distribution that (when properly done) cannot have an SD of more than 9 (and, considering the number of draws in all traces, more likely has SD ~ 7). You claim to know the true average to be -2 from a much longer measurement. So I see a deviation of 29, about 4 sigma. That is an event at the 0.01% level (3.2e-5 for a 4-sigma event, but with 4 tries).

So what you show here cannot be typical for the first 4 mini-matches you play in any evaluation. If it is, there must be an error. If it is selected as the worst case out of several thousand, it is quite normal that you would find something like this, of course.
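
(To make the arithmetic behind this concrete: a minimal sketch, assuming independent games scored +1/0/-1, of where the "no more than 9" ceiling and the 4-sigma figure come from. The win/draw/loss probabilities below are illustrative, not taken from the actual runs.)

Code: Select all

import math

def match_sd(p_win, p_draw, p_loss, n_games=80):
    """Standard deviation of the total score of n independent games."""
    mean = p_win - p_loss
    var = (p_win * (1 - mean) ** 2
           + p_draw * (0 - mean) ** 2
           + p_loss * (-1 - mean) ** 2)
    return math.sqrt(n_games * var)

# No draws maximizes the per-game variance (1), giving the hard ceiling:
print(match_sd(0.5, 0.0, 0.5))      # ~8.94, i.e. "no more than 9"

# With roughly a third draws, as in the posted traces:
sd = match_sd(0.33, 0.33, 0.34)     # ~7.3
z = 29 / sd                         # the -31 trace against a true mean of -2
p = 0.5 * math.erfc(z / math.sqrt(2))   # one-sided normal tail probability
print(sd, z, p, 1 - (1 - p) ** 4)   # chance of seeing it in any of 4 tries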
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

nczempin wrote:Regarding the use of multiple positions:

This will reduce variation of the overall results, because any bias in just one particular position will be reduced. And there may very well be changes where the result in 39 positions is unchanged, but in only one it is much improved.

Unfortunately, even those 40 positions could be biased, and it could take 400 positions. The same principle applies here as does for the number of games: yes, more is better, but if you have fewer, you simply need higher values to conclude with a certain confidence that they are not merely random results.

So, without loss of generality, we can for our analysis just use one position that has so far turned out to be a fairly good discriminator. Yes, there are numerous pitfalls you can fall into before you can actually draw a valid conclusion, but since we are not interested in determining whether a specific change that Bob has made is significant, but in general concepts, we can use just the one.
That is simply wrong. You can play a complete game without ever using certain parts of the evaluation, and more specifically, the part you changed. The game of chess is incredibly broad in scope, such that no single game covers more than a tiny fraction of the potential features that are important to overall play. If you choose one position, you are not going to learn very much in the context I am working in... Even more important, you need to play enough games to make sure that you also pass through the positions where your new eval term might hurt. Picking such positions in advance is impossible, so you have to play enough games to feel fairly confident you encountered them. Hence the large number of starting positions.

None of this testing methodology was developed overnight. It has received a lot of attention, thought and effort from a significant number of people: grandmasters, computer-chess-savvy people like Kaufman, and programmers including the commercial guys. If you keep trying to simplify the process, you can simplify it to the point where you get absolutely no useful information from the test, so you may as well simplify even further and just flip a coin...

Sometimes there really are no short-cuts. Certainly one position is pointless... Even a small number is pointless. The more the merrier within the confines of available hardware. But more is better...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

nczempin wrote:
bob wrote:
nczempin wrote:Just another illustration:

If I teach my non-tablebase-using program some knowledge of how to handle KPK endgames, a wild guess (which could be measured in a better way, of course, but that's not the concern here) would be that perhaps I'd need 20,000 games before the effect would become noticeable [this is a somewhat contrived example, because I would be assuming that I wasn't sure the change would improve the program; we could redefine the test as checking whether that code is bug-free].

So if I don't have a significant result after 100, 1,000 or 10,000 games, I wouldn't be concerned at all. (In practice, I would test with endgame position suites and then assume that scoring better in them would be enough for me.) My engine is extremely weak tactically, and the opening book leans somewhat towards getting quick results (IMHO good advice for any player <2000), so it will get into endgames very rarely, and into those where KPK is significant even more rarely.

But if I double my NPS and do not see an improvement (remember, NPS is my limiting factor right now) after, say, 10 games (also completely arbitrary), I would conclude that I must have introduced a bug somewhere.
The problem with your statement above is "if I don't have a significant result."

Suppose you add endgame tables and your first 100 games come out 80-20. Is that significant, so that you can stop the experiment there? Many of us have reported that the first 100 games often show that kind of good or bad result because of the randomness I have now isolated. But after more games, things settle down to "the truth".

So just how do you decide whether the first 100 is significant or not? Some sort of crystal ball? I don't have one myself. So, I have to rely on the observation that the only way to make sure the first 100 results are significant is to play another 19,900 and have those confirm the first 100 games.

This is the old progressive-betting argument for gamblers. You can recognize patterns in how the cards fall and how you could vary your bet to win, but you can't recognize the pattern until _after_ it has happened. And then it is too late. No such progressive betting pattern will win (except for the classic martingale, which can't be played in practice because of table maximums and the effectively infinite bankroll needed to survive a long losing streak before the win that puts you ahead).

So, please explain how you decide something is significant after 100 games, having seen the kind of variability I have shown in 80-game matches played by the same players, with the same starting positions and the same time limit per move.
Well, the decision of whether 80-20 is significant would depend on the variance that I have measured and consider to be an approximation of the real variance. At any given confidence level, and for a certain distribution, you can determine how many sigmas from the mean the result is, or, simply, how likely it is that the 80-20 result came from mere chance.

There's no crystal ball involved.
And suppose the first 100 games end up 80-20, and the second 100 (which you choose not to play) end up 20-80? Then what? This is not horribly uncommon. Ed has mentioned it. C. Theron has mentioned it. Everyone who discusses this kind of testing has seen that exact thing happen. So what says your first 100 is better than the second 100 that you chose not to play, other than a hunch? Ah, but you say it is unlikely for that to happen. Unfortunately it isn't unlikely enough. And when you run as many tests as I run, it becomes commonplace...
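
(For concreteness, a minimal sketch of the kind of significance computation nczempin describes, assuming independent games and ignoring draws for simplicity; whether the games really are independent is of course exactly what is in dispute here.)

Code: Select all

from math import comb

def binom_tail(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p): chance of k or more wins."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# How likely is 80-20 or better in 100 games between equal engines?
print(binom_tail(80, 100))   # ~5.6e-10 -- wildly significant, *if* the
                             # games really are independent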
jwes
Posts: 778
Joined: Sat Jul 01, 2006 7:11 am

Re: An objective test process for the rest of us?

Post by jwes »

bob wrote:
And suppose the first 100 games end up 80-20, and the second 100 (which you choose not to play) end up 20-80? Then what?
What they are saying is that the variances you are quoting are much higher than you would get from a stochastic process. E.g., if the probabilities of program A against Crafty are 40% wins, 30% draws, and 30% losses, and you wrote a program that randomly generated sequences of 100 trials with those probabilities, you would not see nearly the differences between sequences that you have been getting. This strongly suggests problems with the experimental design.
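
(A minimal sketch of exactly the check jwes proposes, using his illustrative 40/30/30 probabilities and counting a match score as wins minus losses.)

Code: Select all

import random
import statistics

def simulate_match(n_games=100, p_win=0.4, p_draw=0.3):
    score = 0
    for _ in range(n_games):
        r = random.random()
        if r < p_win:
            score += 1                  # win
        elif r >= p_win + p_draw:
            score -= 1                  # loss (draws score 0)
    return score

scores = [simulate_match() for _ in range(1000)]
print(statistics.mean(scores))   # ~10 (the true mean: 40% - 30% of 100)
print(statistics.stdev(scores))  # ~8.3; swings of 25+ points should be rare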
hgm
Posts: 28359
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: An objective test process for the rest of us?

Post by hgm »

bob wrote:Personally, I find that particular comment absolutely beyond stupid. Asinine, ignorant, incompetent all come to mind to describe it, in fact. Feel free to make up what you want. In fact, based on that comment, feel free to continue this discussion with yourself. I certainly have better things to do that deal with that level of ignorance...
Oh well, everyone is entitled to his personal opinion... Usually, personal opinions are very revealing.

This one, for instance, reveals a lot about your scientific abilities and statistical skills. You give an example in a scientific discussion that represents a one-in-a-million event (4 sigma, plus a second rather large deviation), so it would certainly be very important to know if this is a hypothetical case, selected actual data, or randomly chosen actual data.

That you 'crash through the ice' when someone asks whether you used a hypothetical example is remarkable. I used a hypothetical example before, about the children's heights. I don't consider that a crime of any sort in a scientific discussion, and your wording certainly did not exclude it. You must have severe psychological problems on this subject to react like this.

Normally, since the data is so different in character from the 32 traces you posted before, I would assume this is highly selected data, and that you could not repeat such an extreme deviation or anything near it in the next 100 mini-matches, although your reaction sets me thinking. That would make it a very _misleading_ example.

So we can end this discussion on the conclusion that you not only failed as a scientist, by presenting obviously faulty data or misleading people by presenting highly selected data as if it were typical, without warning, but also as a human being, by being unable to engage in polite discussion. Too bad, I had hoped I could learn something from you...
:cry:
nczempin

Re: An objective test process for the rest of us?

Post by nczempin »

bob wrote: When 4 sequential 80-game runs produce these kinds of results:

1: =--===++--=-=-++=-=-++++---+=+-+--=+-=-+-+-=-++-+++--=+=++=+---+-+++-==+---++=-- (-2)
2: =---+---+-=+=-+++---++------+=-+===+---=--=---+--+=-=-=-++-+--+--+---=-=+-+++++- (-18)
3: +---=-=-+++-+=+===--+++----=-+-=-+++-+----=-=+==+-=+--+--+=+--+=+=+-+++-+-=+--=- (-6)
4: =-=-==--+---+-=+=----+=---+===---=-=---=--====+------=---+-+--+--=+--++=+--+--=- (-31)

It is hard to draw any conclusions about which represents the real world. I can tell you that after 32 such matches, the average score was -5. So 2 are pretty close, and 2 are so far off it isn't funny. Add 'em all up and they are still well away from the truth (-14 or so). But two very close, two way off. If you just do 80 games, which one do you get? For those runs, it's 50-50 whether you conclude the change is a lemon or conclude it is OK, assuming the original version was scoring, say, -8. Even 320 games doesn't tell me what I need to know. It is off on the wrong side and would convince me to toss a change that actually works. I can run the same version twice and conclude that the second version is worse, even though they are the same program. :)

So that's where this is coming from. I've not said that statistics do not work. They do. They just carry a large error term for small sample sizes. I didn't expect it. In fact, we tested like this (but using just 160-game matches, 4 games per position) for a good while before it became obvious it was wrong.
Can we agree on this: one match playing both sides of 40 positions is one match, one sample. So 320 games would be 4 samples. They are not 320 samples.

Also: You didn't expect that from 4 samples you would get such a high variance. Well, it would not have surprised me. Especially given the situation that Crafty is in. But it is entirely possible that other engines will show lower variability, and that given experiments will show significance more quickly.

The whole point of all those statistical methodologies is to let you decide if you need more samples or not, whether you can make a decision with a certain confidence despite the low number of samples or not.

What you're saying is that your tests showed that you needed more samples. Fine. But the thing you are claiming after that, that everybody needs more samples, is not a valid conclusion, because not everybody is getting the same test results.

Also (again), remember that the variance you should be interested in is the theoretical variance of the underlying distribution. That can only be estimated, and, yes, the more games you use, the more accurate this estimate will get. And the number of samples you need before deciding that the estimate is good enough is not a magic number; it depends on the actual situation.
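
(A minimal sketch of such an estimate, using the four posted mini-match scores as the samples; an SD estimated from only 4 samples is itself extremely uncertain, which is rather the point.)

Code: Select all

import math
import statistics

scores = [-2, -18, -6, -31]            # one sample per 80-game mini-match

mean = statistics.mean(scores)         # -14.25
sd = statistics.stdev(scores)          # ~13.1, sample SD (n-1 denominator)
sem = sd / math.sqrt(len(scores))      # ~6.5, standard error of the mean

# sqrt(80) ~ 8.9 is the largest SD independent games allow, so an
# estimate well above it hints at dependence -- but with only 4 samples
# the estimate is far too noisy to conclude anything yet.
print(mean, sd, sem, sd > math.sqrt(80))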


Look: I agree that many times in computer chess tournaments, people tend to ascribe more significance to the results than is appropriate. But this fact seems to have turned you towards the other extreme; not everybody makes this mistake, certainly not hgm, and I hope I don't make it either.

I agree that short events don't prove conclusively which engine is the best in the world. But neither do the Olympic Games, the Super Bowl, or the World Series of Poker. But shouting this fact around merely makes you sound like a sore loser.
nczempin

Re: An objective test process for the rest of us?

Post by nczempin »

bob wrote:
hgm wrote:
bob wrote:When 4 sequential 80-game runs produce these kinds of results:

1: =--===++--=-=-++=-=-++++---+=+-+--=+-=-+-+-=-++-+++--=+=++=+---+-+++-==+---++=-- (-2)
2: =---+---+-=+=-+++---++------+=-+===+---=--=---+--+=-=-=-++-+--+--+---=-=+-+++++- (-18)
3: +---=-=-+++-+=+===--+++----=-+-=-+++-+----=-=+==+-=+--+--+=+--+=+=+-+++-+-=+--=- (-6)
4: =-=-==--+---+-=+=----+=---+===---=-=---=--====+------=---+-+--+--=+--++=+--+--=- (-31)

It is hard to draw any conclusions about which represents the real world.
"When"? Is this a hypothetical case? How do you get these traces? Do you just make them up, or do you select the worst you ever encountered amongst the millions of such mini-matches that you claim to have played?
Have you paid any attention to the test data I have been reporting? See those numbers on the left-hand side (1:, 2:, etc.)? Those represent the "match number". The first was match 1 of N (I am not sure whether N was 32 or 64 in this case, but it isn't that important). So those are the first 4, not the worst 4. Unless the first 4 were the worst 4 by some strange twist. I posted another complete set of 32 matches as well.

What would be the purpose of making them up? I simply reported what I found, after being surprised.

Personally, I find that particular comment absolutely beyond stupid. Asinine, ignorant, incompetent all come to mind to describe it, in fact. Feel free to make up what you want. In fact, based on that comment, feel free to continue this discussion with yourself. I certainly have better things to do than deal with that level of ignorance...
Well, I think that this comment is a little impolite.

I do think that hgm's hypothesis that you made them up was perhaps a little uncalled for; it should have been clear that you were quoting from your results.

Asking how or whether you selected them seems legitimate, although assuming that you properly selected them, as a typical result rather than as an extreme example, would not have invalidated the discussion IMHO.

But I don't think hgm had any of the malicious intentions that you read into his comments.

I still think you could have held back on the flak a little.
Uri Blass
Posts: 10823
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: An objective test process for the rest of us?

Post by Uri Blass »

hgm wrote:
bob wrote:Personally, I find that particular comment absolutely beyond stupid. Asinine, ignorant, incompetent all come to mind to describe it, in fact. Feel free to make up what you want. In fact, based on that comment, feel free to continue this discussion with yourself. I certainly have better things to do that deal with that level of ignorance...
Oh well, everyone is entitled to his personal opinion... Usually, personal opinions are very revealing.

This one, for instance, reveals a lot about your scientific abilities and statistical skills. You give an example in a scientific discussion that represents a one-in-a-million event (4 sigma, plus a second rather large deviation), so it would certainly be very important to know if this is a hypothetical case, selected actual data, or randomly chosen actual data.

That you 'crash through the ice' when someone asks whether you used a hypothetical example is remarkable. I used a hypothetical example before, about the children's heights. I don't consider that a crime of any sort in a scientific discussion, and your wording certainly did not exclude it. You must have severe psychological problems on this subject to react like this.

Normally, since the data is so different in character from the 32 traces you posted before, I would assume this is highly selected data, and that you could not repeat such an extreme deviation or anything near it in the next 100 mini-matches, although your reaction sets me thinking. That would make it a very _misleading_ example.

So we can end this discussion on the conclusion that you not only failed as a scientist, by presenting obviously faulty data or misleading people by presenting highly selected data as if it were typical, without warning, but also as a human being, by being unable to engage in polite discussion. Too bad, I had hoped I could learn something from you...
:cry:
Ignoring Bob's insulting comments:
Bob explained in his last post that he used real data, and that he used the first matches rather than looking for the worst case.

I think that the most logical explanation for Bob's results is that the games are not independent (for example, maybe one engine was slowed down by a significant factor during a match of 80 games, and this happened not in just one game but in many games).

Uri
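
(A minimal sketch of how such a dependence would show up, with entirely made-up numbers: if in some fraction of the matches one engine is handicapped for the whole match, the match-to-match SD climbs well above what independent games allow.)

Code: Select all

import math
import random
import statistics

def match_score(n_games=80):
    # Hypothetical perturbation: in 10% of matches one engine runs
    # slowed for the entire match, shifting its win probability.
    if random.random() < 0.10:
        p_win, p_draw = 0.15, 0.30     # perturbed match
    else:
        p_win, p_draw = 0.35, 0.30     # normal match
    score = 0
    for _ in range(n_games):
        r = random.random()
        score += 1 if r < p_win else (-1 if r >= p_win + p_draw else 0)
    return score

scores = [match_score() for _ in range(1000)]
print(statistics.stdev(scores), math.sqrt(80))   # ~12 vs. the ~8.9 ceiling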
hgm
Posts: 28359
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: An objective test process for the rest of us?

Post by hgm »

Uri Blass wrote:Ignoring Bob's insulting comments:
Bob explained in his last post that he used real data, and that he used the first matches rather than looking for the worst case.

I think that the most logical explanation for Bob's results is that the games are not independent (for example, maybe one engine was slowed down by a significant factor during a match of 80 games, and this happened not in just one game but in many games).
Yes, that could certainly be an explanation (falling under the header "faulty measurement").

But I don't think Bob's reply was completely unambiguous concerning the amount of selection involved: these were the first 4 mini-matches of a longer run, sure. But was the run itself selected, or was it the first run he ever did, or was it randomly selected from all the runs he ever did?

Note that in the other thread he gave another run that also started with mini-matches numbered 1, 2, 3, ..., and that data actually looked unsuspicious from a statistical point of view. This still prompts the question of why he posted these four particular initial traces, rather than the first four of the data-set in the other thread.

If the claim is that variance like this is typical, i.e. if randomly selected mini-matches between the same engine versions would actually have a result _distribution_ (given as a histogram) whose variance puts most of the events outside the theoretical maximum range that independent games (within the mini-match) allow, it would suggest that effects like the one you mention interfered with the measurements all the time. Or it could be that, say, 90% of the mini-matches are distributed normally, but the average is spoiled by 10% of perturbed measurements that fall in a very wide tail.

Either way, to diagnose the problem, it would be necessary to see the complete result distribution for a typical run of 5K mini-matches. And even then it would be important to know if the extreme samples occur randomly in the sequence, or typically cluster near the start of the run. In the absence of this, we will not be able to speculate very accurately as to what exactly causes the problem.
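
(A minimal sketch of that diagnostic, for whoever has the data; `scores` is a placeholder for the real run of mini-match results.)

Code: Select all

import math
import statistics
from collections import Counter

def diagnose(scores, n_games=80):
    sd_cap = math.sqrt(n_games)            # max SD if games are independent
    mean = statistics.mean(scores)
    # Histogram of match scores in 5-point bins:
    hist = Counter((s // 5) * 5 for s in scores)
    for lo in sorted(hist):
        print(f"{lo:4d} .. {lo + 4:4d}: {hist[lo]}")
    # Do the >3-sigma outliers cluster near the start of the run?
    outliers = [i for i, s in enumerate(scores) if abs(s - mean) > 3 * sd_cap]
    print("outlier positions:", outliers)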
hgm
Posts: 28359
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: An objective test process for the rest of us?

Post by hgm »

nczempin wrote:I do think that hgm's hypothesis that you made them up was perhaps a little uncalled for; it should have been clear that you were quoting from your results.
Well, from the phrasing "When 4 sequential 80-game runs produce" it was not clear to me whether this was actual or hypothetical data. For real data I would have expected something more like "This morning's results, for instance, started with the 4 following traces". Taking the unlikeliness of the presented data into account, I consider this a quite normal question, and I would ask it again of anyone who reported once-in-a-million events. People who cannot handle critical questions emotionally do not belong in the scientific arena!

Actually, the explanation that this had been merely a hypothetical example would have been the least worrisome of all the possibilities I mentioned. There is nothing wrong with using hypothetical examples to illustrate a scientific point; I just did so with the children's-height sampling example, where I completely made up the quoted population variances and averages. But presenting highly selected data without cautioning the reader is so misleading that I would consider it a scientific crime, as would be the uncritical publishing of obviously erroneous data.
Last edited by hgm on Tue Sep 18, 2007 11:19 am, edited 1 time in total.