An objective test process for the rest of us?

Discussion of chess software programming and technical issues.

Moderator: Ras

nczempin

Re: An objective test process for the rest of us?

Post by nczempin »

bob wrote: That's all old news. And it has already been discussed/dismissed. Anything that purely speeds up a program, without changing the size/shape of the tree in any way, has to be good. There are no known examples where going faster is worse (in chess) except for an occasional pathological case where going one ply deeper leads you to change to a move that is actually worse. But everyone believes that faster is better all else remaining constant, so that's not what we are talking about here...
But this is only an illustration; changes can run the gamut from those that are clearly better to those that, say, turn half the wins into losses and half the losses into wins, or that lead to one additional win in 1,000,000 games.
nczempin

Re: An objective test process for the rest of us?

Post by nczempin »

Bob, you work at a University. Surely you have some Statistics colleagues who could settle the issue?

You claim that Hgm is wrong, that I am wrong and that Uri is wrong, and you are right.

Perhaps someone neutral who you would presumably accept as an expert in the field we are discussing can convince you?

Perhaps he could even convince me that I am wrong, but so far my impression keeps growing that you are somehow saying that basic statistics doesn't apply to computer chess, and I just can't see how or why that would be the case.
hgm
Posts: 28359
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: An objective test process for the rest of us?

Post by hgm »

bob wrote:You are completely missing the point. Hint: The point is not that you have to buy a big cluster to properly test. The point is that taking small samples of games leads to gross errors. If you can't play enough to shrink the error to an acceptable level, you must realize that the small sample size you do use has an error large enough to make using it impractical.

Just look at all the discussions here about "my 16-game round-robin shows that program X has had no improvement over the previous version." 16 games doesn't show much of anything.
The point is whether or not you can apply standard statistical theory. Obviously standard statistical theory tells us that a 16-game RR is not enough, except in very extreme cases. (16 losses for A vs 16 wins for A' would certainly be significant!)
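
As a quick sanity check on that last remark, the significance of an extreme small-sample result can be computed directly. A minimal sketch in Python (assuming, purely for illustration, that draws are ignored and that under the null hypothesis of equal strength every decisive game is a 50/50 coin flip):

Code:

from math import comb

def two_sided_p(wins, games):
    """P-value for a result at least this lopsided, assuming two equally
    strong versions (each game an independent 50/50 coin flip, draws ignored)."""
    extreme = max(wins, games - wins)
    tail = sum(comb(games, k) for k in range(extreme, games + 1)) / 2 ** games
    return min(1.0, 2 * tail)

print(two_sided_p(16, 16))  # ~3e-5: a 16-0 sweep really is significant
print(two_sided_p(11, 16))  # ~0.21: an 11-5 result says almost nothing

So a 16-0 sweep is roughly a one-in-thirty-thousand event between equal programs, while something like 11-5 is entirely unremarkable.
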
You ignored the _second_ result. In both cases, the old version scored better. I could show you the second pair of runs from that test where the second version scored far better. The point being that using either of those two results leads you to the _wrong_ conclusion, when you run a much longer test to eliminate most of the randomness.
You are speaking out of turn: the second result was discussed directly after this:
hgm wrote: I suppose you repeat the test with the changed version. So it is now 40-40. Well, that is 5 points different from the first one, or 1 sigma. Not something to start doubting the validity of the test (e.g. through virus infection of the computer), which I would certainly do if the second result had been 0-80. In fact the new test (weakly) confirms the old one, as 40-40 is still better than 35-45. But the only meaningful quantity is the sum of the tests, which should have a variance twice that of the original test (because it is now a 160-game test), but in the _average score_ of the two tests the variance would be only half the original one, as averaging entails division by 2. The difference between A and A' would then have a standard deviation of sqrt(1+0.5)*3.6 ≈ 4.4. But the average score difference would now be only 7.5 points. This means you are off by 1.7*sigma now. So your confidence has in fact decreased a little. But not enough to reject it.
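
For reference, the numbers in the quoted paragraph can be reproduced with a few lines of arithmetic. A sketch that assumes, as quoted, a 3.6-point standard deviation for an 80-game match score, and a baseline run in which A scored 45 (consistent with the 7.5-point difference mentioned):

Code:

from math import sqrt

sigma80 = 3.6                    # quoted SD of an 80-game match score, in points

# Two 80-game runs for A': their average has half the variance of a single run.
sigma_avg = sigma80 / sqrt(2)    # ~2.55 points

# Difference between A's single run and the average of A''s two runs.
sigma_diff = sqrt(sigma80**2 + sigma_avg**2)   # = sqrt(1 + 0.5) * 3.6, ~4.4 points

score_diff = 45 - (35 + 40) / 2  # 7.5 points
print(score_diff / sigma_diff)   # ~1.7 sigma, matching the text
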
bob wrote: apples and oranges. Computer vs computer games have _inherent_ randomness. By trying to artificially eliminate that due to poor time allocation or decisions about when to start/not-start a new iteration, you just pick a small sample. But suppose the 100 children _you_ pick come from an area where all the buildings are painted using lead paint? Is your sample more valid than my sample (that is larger) and which represents a more average cross-section of the population rather than just the kids suffering from ultra-high levels of lead in their system?
Of course picking 100 children from the same neighborhood cannot be described as random sampling. Everyone knows that. Basically you claim that large populations cannot be representatively sampled by a small set of samples. That is total nonsense. How many samples you need for an accurate representation depends on the heterogeneity of the population, and has absolutely nothing to do with the _size_ of the population.
Unfortunately, I do not believe it is possible to choose a small sample of games from a population that is very large and random in nature, and do so in an intelligent way. How does one intelligently choose a small sample from random values? I've never seen such a methodology.
No idea what you mean by that. Populations are not random in nature, they are just there. What you do is sample them randomly. There are techniques for that, it is actually a science (look under "Monte-Carlo simulations").
Or perhaps the data is a bit more random than you give it credit for being???

That's certainly what I am seeing.
Remarks like this show that you understand zilch about statistics. :shock:

More random than totally random? Ludicrous!

What you would need to produce such variance is the opposite of randomness. You can only get it by highly correlating the data, and making results of games that can go either way dependent (positively correlated) on the result of earlier games that could go either way. And that is what I call faulty measurement.

There is no other way to get such variance.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

nczempin wrote:
bob wrote: Just look at all the discussions here about "my 16-game round-robin shows that program X has had no improvement over the previous version." 16 games doesn't show much of anything.
If the discussions do indeed make such a claim, they would be wrong. All they could say would be "after my 16 game RR an improvement has not been shown".
It depends on your level of accuracy. If your previous event ended 8-8, and this one ended 0-16, you could, with a modest level of confidence, conclude that your change is worse.

An interesting experiment: play a 1,000-game match and record the results as a string of +, =, and - characters. Now examine that string to see the longest run of +'s you can find. Or the longest run of -'s. What if you had run an evaluation test on a new change, and just happened to start at the beginning of one of those long strings of wins or losses?

That's the danger in testing. Anyone who has done significant amounts of testing has noticed that it is very common to play 200 games, and lose the first 20 in a row, before bouncing back to win or equalize in the match. Or you win the first 20 in a row before it settles back down to equality (or worse).

You just have to play enough games to get outside of those windows or you make a bad step backward.
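
The run-length experiment described above is easy to simulate. A minimal sketch under made-up assumptions (two exactly equal engines, 30% wins, 40% draws, 30% losses per game):

Code:

import random

def longest_run(results, symbol):
    """Length of the longest consecutive run of `symbol` in the result string."""
    best = cur = 0
    for r in results:
        cur = cur + 1 if r == symbol else 0
        best = max(best, cur)
    return best

random.seed(1)
games = [random.choices("+=-", weights=[30, 40, 30])[0] for _ in range(1000)]
print("longest win streak :", longest_run(games, "+"))
print("longest loss streak:", longest_run(games, "-"))
# Even between perfectly equal opponents, streaks of five or more wins (or
# losses) in a row turn up routinely in 1,000 games.

Long streaks appear even when nothing has changed, which is exactly the trap of starting or stopping a test inside one of them.
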


Which neither proves that there was an improvement, nor that there wasn't. All you can conclude is that you can't say yet.
Something I already knew before playing the 20 game RR event, in fact. Many of us have pointed out for years that the WCCC events do not identify the strongest chess program in the world. They identify the program that played the best in that particular event, which might or might not be the best in the world. It takes way more games than that.


However, it is entirely possible after 16 games to show such an improvement: If the first version went 0-16, and the other version went 16-0, there is significant evidence that the new version is stronger against the same set of opponents and under the same circumstances.
Suppose, since you use a book, that your opponent just happened to pick bad lines and you won those first 16 games. But give it time and enough games, and it will play better openings and blow you out again. 16 games is just not enough to conclude _anything_ with any reasonable level of confidence.

Again, the test I suggested above will highlight that. If you play 1000 games, you can pick a 40-80 game group to prove almost anything you want, contrary to what the entire match shows.


Whether any conclusions can be drawn as to whether this means that the engine has improved within the larger context, is a question that is orthogonal to the basics of such a result.

And it is entirely possible to draw the same kind of conclusions, with slightly lower confidence, when the result is not 16-0 but, say, 15-1 or whatever. And exactly what you can conclude, at what confidence, is what basic statistical principles allow you to find out.

I haven't seen you explain convincingly why these basic statistics should be treated differently from determining whether dice are loaded, or whether any seemingly random events are really random or not.
OK, how many test rolls do you need to determine if one die is loaded such that it gives an extra 6 once in 60 rolls??? Once in 30 rolls? With enough certainty that you are willing to accept the consequence that I am going to shoot you in the head if you tell me the die is loaded when it isn't, or that it isn't when it is. So you need a pretty high confidence level here, just like I want when evaluating changes in my chess program.

Can you do that with 20 rolls?? 100??
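
For what it's worth, the die question has a standard answer via a sample-size calculation. A sketch using the normal approximation, treating "an extra 6 once in 60 rolls" as a six-probability of 1/6 + 1/60, and asking for roughly a 5% false-positive rate with 90% power (the z-values are conventional choices, not anything from this thread):

Code:

from math import sqrt, ceil

def rolls_needed(p0, p1, z_alpha=1.96, z_beta=1.28):
    """Approximate rolls needed to tell six-probability p1 from p0
    (normal approximation, ~5% false-positive rate, ~90% power)."""
    num = z_alpha * sqrt(p0 * (1 - p0)) + z_beta * sqrt(p1 * (1 - p1))
    return ceil((num / (p1 - p0)) ** 2)

fair = 1 / 6
print(rolls_needed(fair, fair + 1 / 60))  # ~5,400 rolls for the 1-in-60 bias
print(rolls_needed(fair, fair + 1 / 30))  # ~1,400 rolls for the 1-in-30 bias
# 20 or even 100 rolls are hopeless for biases this small, which is the same
# reason small game samples cannot resolve small strength differences.
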
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

hgm wrote:
bob wrote:You are completely missing the point. Hint: The point is not that you have to buy a big cluster to properly test. The point is that taking small samples of games leads to gross errors. If you can't play enough to shrink the error to an acceptable level, you must realize that the small sample size you do use has an error large enough to make using it impractical.

Just look at all the discussions here about "my 16-game round-robin shows that program X has had no improvement over the previous version." 16 games doesn't show much of anything.
The point is whether or not you can apply standard statistical theory. Obviously standard statistical theory tells us that a 16-game RR is not enough, except in very extreme cases. (16 losses for A vs 16 wins for A' would certainly be significant!)
Or not. I've started matches that began with 20-0 results and thought "wow, this is good". Come back a day later and it is now 100-120 or worse. Doesn't happen every time obviously. But it happens often enough to make trusting those first 20 games very difficult.
You ignored the _second_ result. In both cases, the old version scored better. I could show you the second pair of runs from that test where the second version scored far better. The point being that using either of those two results leads you to the _wrong_ conclusion, when you run a much longer test to eliminate most of the randomness.
You are speaking out of turn: the second result was discussed directly after this:
hgm wrote: I suppose you repeat the test with the changed version. So it is now 40-40. Well, that is 5 points different from the first one, or 1 sigma. Not something to start doubting the validity of the test (e.g. through virus infection of the computer), which I would certainly do if the second result had been 0-80. In fact the new test (weakly) confirms the old one, as 40-40 is still better than 35-45. But the only meaningful quantity is the sum of the tests, which should have a variance twice that of the original test (because it is now a 160-game test), but in the _average score_ of the two tests the variance would be only half the original one, as averaging entails division by 2. The difference between A and A' would then have a standard deviation of sqrt(1+0.5)*3.6 ≈ 4.4. But the average score difference would now be only 7.5 points. This means you are off by 1.7*sigma now. So your confidence has in fact decreased a little. But not enough to reject it.
bob wrote: apples and oranges. Computer vs computer games have _inherent_ randomness. By trying to artificially eliminate that due to poor time allocation or decisions about when to start/not-start a new iteration, you just pick a small sample. But suppose the 100 children _you_ pick come from an area where all the buildings are painted using lead paint? Is your sample more valid than my sample (that is larger) and which represents a more average cross-section of the population rather than just the kids suffering from ultra-high levels of lead in their system?
Of course picking 100 children from the same neighborhood cannot be described as random sampling. Everyone knows that. Basically you claim that large populations cannot be representatively sampled by a small set of samples.
No, I am not claiming that at all. You are grossly over-simplifying the issue. You are measuring a small sample in one characteristic only. I claim that a chess program has more than enough variability in that it can produce lots of different outcomes from the same starting position. If you take your small sample, and also want to know about the effect of race on the growth rate, now you have to modify your sampling to get 100 kids, but 50-50 between the two races you are interested in. You just lost accuracy. And we want to go farther, since a chess program is comprised of many different components and I want to see how they all interact with a new change I introduced. So I pick a sample of 100 games, one that reached a KPPK endgame, one that reached a KRPKR endgame, etc. And suddenly I have a sample size of one for each sub-class I am interested in.

So let's get back to the game of chess, which is _far_ more complicated than just estimating the 1-year growth rate of an 8-year-old. Or at least mine is. I can pick one hundred games out of a sample of 1000 and not even see an endgame at all. Or a king-side attack. But I want a cross-section of games to see how this outside passed pawn code affects play in them all. And 20-40-100 games is _not_ enough. My point all along. Tangential issues aside. Small sample sizes aside. Simple test cases compared to computer chess aside.

That is total nonsense. How many samples you need for an accurate representation depends on the heterogeneity of the population, and has absolutely nothing to do with the _size_ of the population.
Unfortunately, I do not believe it is possible to choose a small sample of games from a population that is very large and random in nature, and do so in an intelligent way. How does one intelligently choose a small sample from random values? I've never seen such a methodology.
No idea what you mean by that. Populations are not random in nature, they are just there. What you do is sample them randomly. There are techniques for that, it is actually a science (look under "Monte-Carlo simulations").
Or perhaps the data is a bit more random than you give it credit for being???

That's certainly what I am seeing.
Remarks like this show that you understand zilch about statistics. :shock:

More random than totally random? Ludicrous!



Please come back to reality. Chess games are not _completely_ random. But they have a high-level random component. I am not talking about random samples. Never was, never will. I am talking about samples taken from a pair of chess programs where the outcome of each individual game is weighted by the skill of each opponent, and then a random component is factored in that can outweigh (at times) the skill component, and at other times it does not.

Why can't we just stick with chess? growth rates and the rest are not anything I am worried about. Just choosing/playing enough games to get a reliable indicator of whether a change is good or bad or neutral. If you can do that with a small number of games, go for it. I don't believe it is possible. I'm not going to continue going round and round the mulberry bush here.

Test as you like, I'll test as I am testing now. If your methodology is superior, then by definition you will catch and pass me and my inferior/time-consuming test methodology.




What you would need to produce such variance is the opposite of randomness. You can only get it by highly correlating the data, and making results of games that can go either way dependent (positively correlated) on the result of earlier games that could go either way. And that is what I call faulty measurement.
No argument there. Which is one reason I don't use an opening book, where learning can cause that very effect. But I am using a large set of starting positions and playing a much larger set of games so that the inherent randomness of the results is overwhelmed by the skill levels of the two players ultimately proving which is best.



There is no other way to get such variance.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

nczempin wrote:
bob wrote: That's all old news. And it has already been discussed/dismissed. Anything that purely speeds up a program, without changing the size/shape of the tree in any way, has to be good. There are no known examples where going faster is worse (in chess) except for an occasional pathological case where going one ply deeper leads you to change to a move that is actually worse. But everyone believes that faster is better all else remaining constant, so that's not what we are talking about here...
But this is only an illustration; changes can run the gamut from those that are clearly better to those that, say, turn half the wins into losses and half the losses into wins, or that lead to one additional win in 1,000,000 games.
I've never seen a change that was "clearly better" before testing. Even adding null-move can absolutely kill a program's performance. You can look at the comments in main.c for Crafty to see that null-move R=2 was very bad back in the early days of Crafty and the slow hardware available (PC) at the time. Now R=2 to 3 works extremely well.

We have, in the past year, made changes (Tracy/Mike/Myself) that we were absolutely certain were better. They weren't. Even though it looked obvious.

There's no replacement for testing.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

nczempin wrote:Bob, you work at a University. Surely you have some Statistics colleagues who could settle the issue?

You claim that Hgm is wrong, that I am wrong and that Uri is wrong, and you are right.

Perhaps someone neutral who you would presumably accept as an expert in the field we are discussing can convince you?

Perhaps he could even convince me that I am wrong, but so far my impression keeps growing that you are somehow saying that basic statistics doesn't apply to computer chess, and I just can't see how or why that would be the case.
I am not saying basic statistics don't apply. I have simply said that the randomness of the results is so pronounced, that a large sample size is necessary to figure out what is really going on.

But it is interesting you should say that. One of the people looking at the data _is_ a statistician. He was the one that originally said "your 80 game matches really are producing some strange results."

He was looking at a series of eleven 80-game matches I ran back when I was playing with the history value in Crafty. I started out with 0, then 10, all the way up to 100. 0 said reduce everything, 100 said reduce nothing, and values in between scaled the aggressiveness of the history reductions between those extremes. His first comment was "Hmmm... I expected a trend, as the value went from 0 to 100 in steps of 10, I expected 0 to be worse, and expected that the match results would improve up to a point, where you reach the optimal value, and then start to drop off again as you go past that point. But your data looks too random. Do you have a bug?" So I tried the same test on Fruit. And we got the same results: there was no clear "best value". And we would see things like 20 is good, 30 is bad, 40 is good again.

So he suggested more games. And as we did this, we began to see a better pattern. We originally settled on 2,560-game matches with each opponent, but I ran several of them and found that the results still had just a bit more randomness (on average) than I wanted. The current scheme might produce a result of 30+ 40= 30- in one run, and 31+ 39= 30- in the next, so they rarely vary by more than 1 point when normalized (averaged) to 80 points total.
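
The consistency described here is just the usual square-root shrinkage of the error with more games. A rough sketch, assuming a per-game standard deviation of about 0.4 points (the figure implied by the 3.6-point sigma for an 80-game match quoted earlier in the thread):

Code:

from math import sqrt

per_game_sd = 0.4   # points per game, an illustrative assumption

for games in (80, 160, 320, 2560):
    sd_total = per_game_sd * sqrt(games)    # SD of the raw match score
    sd_scaled = sd_total * 80 / games       # after normalizing to an 80-game scale
    print(f"{games:5d} games: SD {sd_total:4.1f} pts raw, {sd_scaled:4.2f} pts per 80")
# 80 games:   ~3.6 points either way
# 2560 games: ~20 points raw, but only ~0.63 when scaled back to 80, which is
# consistent with results that rarely vary by more than a point.
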

This is a person that plays chess, that was on my Ph.D. committee 20 years ago, and was just as surprised as I was about the "deterministic myth" we had all talked about for years... He had read my stuff on opening book learning, as well as that of others talking about how to prevent a program from playing the same game over and over, and he commented "hey, it really looks like a daunting challenge to do what you have been trying so hard not to do..."

That's where all of this came from. Yes, I could reduce the number of games somewhat, but then interpreting the results becomes cloudier. At present a simple comparison makes it easy to say good or bad. You saw the kind of variance I was getting in 80-game matches. One match says "good change", the next one says "bad change". I can't rely on something that inconsistent. I kept increasing the number of games until the results were as consistent as I could get them.

When 4 sequential 80 game runs produce these kinds of results:

1: =--===++--=-=-++=-=-++++---+=+-+--=+-=-+-+-=-++-+++--=+=++=+---+-+++-==+---++=-- (-2)
2: =---+---+-=+=-+++---++------+=-+===+---=--=---+--+=-=-=-++-+--+--+---=-=+-+++++- (-18)
3: +---=-=-+++-+=+===--+++----=-+-=-+++-+----=-=+==+-=+--+--+=+--+=+=+-+++-+-=+--=- (-6)
4: =-=-==--+---+-=+=----+=---+===---=-=---=--====+------=---+-+--+--=+--++=+--+--=- (-31)

It is hard to draw any conclusions about which represents the real world. I can tell you that after 32 such matches, the average score was -5. So 2 are pretty close, 2 are so far off it isn't funny. Add 'em all up and they are still well away from the truth (-14 or so). But two very close, two way off. If you just do 80, which one do you get? For those runs, 50-50 you think the change is a lemon, or you think it is ok, assuming the original version was scoring say -8. Even 320 games doesn't tell me what I need to know. It is off on the wrong side and would convince me to toss the change that actually works. I can run the same version twice and conclude that the second version is worse, even though they are the same program. :)

So that's where this is coming from. I've not said that statistics do not work. They do. They just present a large error term for small sample sizes. I didn't expect it. In fact, we tested like this (but using just 160-game matches, 4 games per position) for a good while before it became obvious it was wrong.

Now, feel free to test as you want. I see the kind of data I have been getting and am absolutely convinced that 80 games is worthless against a single opponent, using 40 standard starting positions, no opening book, no learning of any kind, no pondering, no parallel search, equal hardware. Even moving to 160 games against each of 4 opponents did not reduce the potential for error enough to help.

What more can I say? I only claim this is true for the programs I have identified. And for a couple I have not identified. I don't know that _everybody_ gets this sort of random result, but I know these programs certainly do, and from a lot of analysis to understand why, I believe that any program that uses a rational way of limiting time is going to exhibit the same thing. And even if you go to the extreme point of deciding whether or not to start the next iteration based on time left, you will eventually see the randomness since eventually you need to test with pondering on, and now the opponent introduces enough randomness to change the games regardless of what you do. And then you want to test your parallel search... and then... and each "and then" increases the randomness and the number of games needed to make rational decisions.
hgm
Posts: 28359
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: An objective test process for the rest of us?

Post by hgm »

bob wrote: No, I am not claiming that at all. You are grossly over-simplifying the issue. You are measuring a small sample in one characteristic only. I claim that a chess program has more than enough variability in that it can produce lots of different outcomes from the same starting position. If you take your small sample, and also want to know about the effect of race on the growth rate, now you have to modify your sampling to get 100 kids, but 50-50 between the two races you are interested in. You just lost accuracy. And we want to go farther, since a chess program is comprised of many different components and I want to see how they all interact with a new change I introduced. So I pick a sample of 100 games, one that reached a KPPK endgame, one that reached a KRPKR endgame, etc. And suddenly I have a sample size of one for each sub-class I am interested in.

So let's get back to the game of chess, which is _far_ more complicated than just estimating the 1-year growth rate of an 8-year-old. Or at least mine is. I can pick one hundred games out of a sample of 1000 and not even see an endgame at all. Or a king-side attack. But I want a cross-section of games to see how this outside passed pawn code affects play in them all. And 20-40-100 games is _not_ enough. My point all along. Tangential issues aside. Small sample sizes aside. Simple test cases compared to computer chess aside.
Well, it depends what you are testing, then. I am mainly testing very general capabilities of my engine, like how to search and how to sort moves, the sort of things that affect every single game in almost every position: whether LMR pays off, and recapture extensions. And I suspect that Nicolai is considering even more basic things. That means we are not dependent for a visible result on very rare circumstances, like waiting for a KRRNKRR ending, or checkmating through an e.p. capture. So as far as homogeneity is concerned, we could do with quite small samples. One game is already ~60 positions. Our only concern, then, is mainly with statistical error.

For the other problem, there are methods too. Basically one biases the sampling, increasing the occurrence of the rare positions that are relevant for our change (like pawn endings) by a known amount.

Actually I don't even consider it relevant to present the engine with such rare circumstances in exactly the frequency with which they occur in its games. I expect a certain universality of my engine. I don't want it to be ignorant in end-games where it has Knights, just because it has a knack for swapping its Knights early in the game. Even if improving end-game play with Knights offers 0% score improvement in games, because they just don't occur, I would consider it a worthwhile improvement to my engine. Because I don't want to bar the road to learning that not swapping off the Knights might be better. So I would make sure that my test set forces positions in which the capabilities I consider important are exercised; whether they occur at their natural frequency would concern me less. Unless I would need very expensive and specialized evaluation terms for that, of course. But at the current stage I am mainly interested in how general changes in search and evaluation would affect such capabilities.
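
A sketch of the biased-sampling idea mentioned above, with made-up frequencies and scores: oversample the rare class of starting positions, measure it accurately, then weight it back to its natural frequency when combining the estimates.

Code:

# Hypothetical example: pawn endings arise in only 5% of normal games, but we
# start half of our test games from pawn-ending positions so that class is
# measured accurately, then reweight to the natural frequency afterwards.

natural_freq = {"middlegame": 0.95, "pawn_ending": 0.05}   # assumed true frequencies
test_freq    = {"middlegame": 0.50, "pawn_ending": 0.50}   # deliberately biased sampling plan

# Observed score (fraction of points) of the new version in each class of test games.
observed = {"middlegame": 0.52, "pawn_ending": 0.58}       # made-up results

# Reweighting: each class counts with its natural weight, even though it was
# sampled at a different rate during testing.
overall = sum(observed[c] * natural_freq[c] for c in observed)
print(f"estimated overall score: {overall:.3f}")           # 0.52*0.95 + 0.58*0.05 = 0.523
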
Please come back to reality. Chess games are not _completely_ random.
But the upper limit to the variance I quoted _was_ for a completely random, uncorrelated process. And you claim that your observations beat that. (Which I don't believe either. It seems you don't even know your own data.)
But they have a high-level random component. I am not talking about random samples. Never was, never will. I am talking about samples taken from a pair of chess programs where the outcome of each individual game is weighted by the skill of each opponent, and then a random component is factored in that can outweigh (at times) the skill component, and at other times it does not.
So the variance _must_ be smaller than that of a totally random process. If it is higher, you made a mistake somewhere.
Why can't we just stick with chess? growth rates and the rest are not anything I am worried about. Just choosing/playing enough games to get a reliable indicator of whether a change is good or bad or neutral. If you can do that with a small number of games, go for it. I don't believe it is possible. I'm not going to continue going round and round the mulberry bush here.
The point is that paired sampling of populations with a large intrinsic variance is far more efficient than taking independent samples. As long as that is of no interest to you, there is indeed no reason to engage in this discussion.
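
A small simulation of why paired sampling helps, with purely illustrative numbers: both versions play the same set of starting positions, and each position carries a quality component common to both, which cancels when you compare results position by position.

Code:

import random
from statistics import mean, pstdev

random.seed(42)
N = 2000
skill_gap = 0.02                 # assumed true advantage of version A', in points per game

# Toy model: a game's score = per-position effect (shared when both versions
# play the same position) + engine skill + independent noise.
positions = [random.gauss(0, 0.3) for _ in range(N)]
score_A   = [p + random.gauss(0, 0.2) for p in positions]
score_A2  = [p + skill_gap + random.gauss(0, 0.2) for p in positions]

paired   = [a2 - a for a, a2 in zip(score_A, score_A2)]                    # same positions
unpaired = [a2 - a for a, a2 in zip(random.sample(score_A, N), score_A2)]  # pairing broken

print("SD of unpaired differences:", round(pstdev(unpaired), 2))  # ~0.5
print("SD of paired differences  :", round(pstdev(paired), 2))    # ~0.28
print("estimated skill gap       :", round(mean(paired), 3))
# The shared per-position spread drops out of the paired comparison, so the
# same accuracy is reached with far fewer games.
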
Test as you like, I'll test as I am testing now. If your methodology is superior, then by definition you will catch and pass me and my inferior/time-consuming test methodology.
Not necessarily, as you have 256 times as many computers, and thus can afford to waste 99.5% of your time and still do better than the rest of us. But it helps, of course. :lol: :lol: :lol:

But the point is there are others than you and me, and I think that those have a right to know the truth too.
No argument there.
Well, that is new, then. As so far you have been giving nothing but argument on this!
Which is one reason I don't use an opening book, where learning can cause that very effect.
Well, learning is obviously the last thing one should tolerate, while testing.
But I am using a large set of starting positions and playing a much larger set of games so that the inherent randomness of the results is overwhelmed by the skill levels of the two players ultimately proving which is best.
Yes, that always works, if you can afford the number of games. But that doesn't mean that superior sampling strategies couldn't achieve better accuracy with fewer games.



There is no other way to get such variance.
hgm
Posts: 28359
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: An objective test process for the rest of us?

Post by hgm »

bob wrote:When 4 sequential 80 game runs produce these kinds of results:

1: =--===++--=-=-++=-=-++++---+=+-+--=+-=-+-+-=-++-+++--=+=++=+---+-+++-==+---++=-- (-2)
2: =---+---+-=+=-+++---++------+=-+===+---=--=---+--+=-=-=-++-+--+--+---=-=+-+++++- (-18)
3: +---=-=-+++-+=+===--+++----=-+-=-+++-+----=-=+==+-=+--+--+=+--+=+=+-+++-+-=+--=- (-6)
4: =-=-==--+---+-=+=----+=---+===---=-=---=--====+------=---+-+--+--=+--++=+--+--=- (-31)

It is hard to draw any conclusions about which represents the real world.
"When"? Is this a hypothetical case? How do you get these traces? Do you just make them up, or do you select the worst you ever encountered amongst the millions of such mini-matches that you claim to have played?

You are giving 4 draws from a distribution that (when properly done) cannot have a SD of more than 9 (and, considering the number of draws in all traces, more likely has SD ~ 7). You claim to know the true average to be -2 from a much longer measurement. So I see a deviation of 29 = 4 sigma. That is an event at the 0.01% level (3.2e-5 for a 4-sigma event, but out of 4 tries).

So what you show here cannot be typical for the first 4 mini-matches you play in any evaluation. If it is, there must be an error. If it is selected as the worst case out of several thousand, it is quite normal that you would find something like this, of course.
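
The arithmetic behind that bound, as a sketch: score each run as net points W - L over 80 games, assume the roughly 40% draw rate visible in the traces, and take the -2 quoted above as the long-run mean.

Code:

from math import sqrt, erf

games = 80
draw_rate = 0.4                 # roughly what the +/=/- traces show

# Net score W - L: each game contributes +1, 0 or -1. With equal win and loss
# probability, the per-game variance is (1 - draw_rate).
sd_max = sqrt(games)                    # no draws at all: ~8.9 ("no more than 9")
sd_est = sqrt(games * (1 - draw_rate))  # ~6.9 ("more likely ~7")

deviation = 29                          # run 4 scored -31 against a long-run mean of -2
z = deviation / sd_est                  # ~4.2 sigma

tail = 0.5 * (1 - erf(z / sqrt(2)))     # one-sided normal tail probability
print(round(sd_max, 1), round(sd_est, 1), round(z, 1), f"{tail:.1e}")
# Roughly 9, 7, 4.2 sigma and a probability around 1e-5: a trace that far out
# should essentially never appear among the first four runs unless it was
# selected after the fact, or something else is wrong.
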
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

hgm wrote:
bob wrote: No, I am not claiming that at all. You are grossly over-simplifying the issue. You are measuring a small sample in one characteristic only. I claim that a chess program has more than enough variability in that it can produce lots of different outcomes from the same starting position. If you take your small sample, and also want to know about the effect of race on the growth rate, now you have to modify your sampling to get 100 kids, but 50-50 between the two races you are interested in. You just lost accuracy. And we want to go farther, since a chess program is comprised of many different components and I want to see how they all interact with a new change I introduced. So I pick a sample of 100 games, one that reached a KPPK endgame, one that reached a KRPKR endgame, etc. And suddenly I have a sample size of one for each sub-class I am interested in.

So let's get back to the game of chess, which is _far_ more complicated than just estimating the 1-year growth rate of an 8-year-old. Or at least mine is. I can pick one hundred games out of a sample of 1000 and not even see an endgame at all. Or a king-side attack. But I want a cross-section of games to see how this outside passed pawn code affects play in them all. And 20-40-100 games is _not_ enough. My point all along. Tangential issues aside. Small sample sizes aside. Simple test cases compared to computer chess aside.
Well, it depends what you are testing, then. I am mainly testing very general capabilities of my engine, like how to search and how to sort moves, the sort of things that affect every single game in almost every position: whether LMR pays off, and recapture extensions. And I suspect that Nicolai is considering even more basic things. That means we are not dependent for a visible result on very rare circumstances, like waiting for a KRRNKRR ending, or checkmating through an e.p. capture. So as far as homogeneity is concerned, we could do with quite small samples. One game is already ~60 positions. Our only concern, then, is mainly with statistical error.

For the other problem, there are methods too. Basically one biases the sampling, increasing the occurrence of the rare positions that are relevant for our change (like pawn endings) by a known amount.

Actually I don't even consider it relevant to present the engine with such rare circumstances in exactly the frequency with which they occur in its games. I expect a certain universality of my engine. I don't want it to be ignorant in end-games where it has Knights, just because it has a knack for swapping its Knights early in the game. Even if improving end-game play with Knights offers 0% score improvement in games, because they just don't occur, I would consider it a worthwhile improvement to my engine. Because I don't want to bar the road to learning that not swapping off the Knights might be better. So I would make sure that my test set forces positions in which the capabilities I consider important are exercised; whether they occur at their natural frequency would concern me less. Unless I would need very expensive and specialized evaluation terms for that, of course. But at the current stage I am mainly interested in how general changes in search and evaluation would affect such capabilities.
Please come back to reality. Chess games are not _completely_ random.
But the upper limit to the variance I quoted _was_ for a completely random, uncorrelated process. And you claim that your observations beat that. (Which I don't believe either. It seems you don't even know your own data.)
But they have a high-level random component. I am not talking about random samples. Never was, never will. I am talking about samples taken from a pair of chess programs where the outcome of each individual game is weighted by the skill of each opponent, and then a random component is factored in that can outweigh (at times) the skill component, and at other times it does not.
So the variance _must_ be smaller than that of a totally random process. If it is higher, you made a mistake somewhere.
Why can't we just stick with chess? growth rates and the rest are not anything I am worried about. Just choosing/playing enough games to get a reliable indicator of whether a change is good or bad or neutral. If you can do that with a small number of games, go for it. I don't believe it is possible. I'm not going to continue going round and round the mulberry bush here.
The point is that paired sampling of populations with a large intrinsic variance is far more efficient than taking independent samples. As long as that is of no interest to you, there is indeed no reason to engage in this discussion.
Test as you like, I'll test as I am testing now. If your methodology is superior, then by definition you will catch and pass me and my inferior/time-consuming test methodology.
Not necessarily, as you have 256 times as many computers, and thus can afford to waste 99.5% of your time and still do better than the rest of us. But it helps, of course. :lol: :lol: :lol:

But the point is there are others than you and me, and I think that those have a right to know the truth too.
That is all I have been giving. Nearly _everyone_ assumes that 100 games is enough. It is not even close. My 80 game matches show that beyond any kind of doubt.

No argument there.
Well, that is new, then. As so far you have been giving nothing but argument on this!
Then I suggest you re-read my statements. I have not changed sides on any issue since this started.
Which is one reason I don't use an opening book, where learning can cause that very effect.
Well, learning is obviously the last thing one should tolerate, while testing.
Too broad a statement. Suppose _learning_ is what I am currently working on. What better way to test if it works? I did this when I originally started the book learning stuff in the middle 90's...

But I am using a large set of starting positions and playing a much larger set of games so that the inherent randomness of the results is overwhelmed by the skill levels of the two players ultimately proving which is best.
Yes, that always works, if you can afford the number of games. But that doesn't mean that superior sampling strategies couldn't achieve better accuracy with fewer games.
I will say this one more time... Somehow it keeps getting overlooked. How _exactly_ can you do some sort of "superior sampling strategy" when playing chess games? Weaken your engine by making it completely deterministic? Play each move to a fixed number of nodes? I don't believe any of those work unless the timing randomness is eliminated. But since time allocation is a unique part of each engine, removing that doesn't make any sense to me, any more than turning off parts of the evaluations or search extensions. I really want to test as I am going to run, and ultimately I will be testing with pondering on, books on, SMP search on, learning on, etc. But for now I am trying to reduce the number of games to zero in on one particular part of what we are working on, as best we can. And rather than trying to completely eliminate the timing randomness, which I believe would cripple my tests somewhat, I prefer instead to play enough games so that the randomness becomes less influential.

There is no other way to get such variance.

The problem is that for matches of a hundred or two hundred games, the variance is more than big enough to make interpretation very unreliable. That's what I have said all along. That's what my data has shown all along.