An objective test process for the rest of us?

Discussion of chess software programming and technical issues.


bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

nczempin wrote:
bob wrote: I am not sure my 20K total games (5K x 4 opponents) is either enough or overkill. But I am absolutely certain that 500 is not enough because of the variance I have measured.
But the variance you have measured can only apply to your particular situation.

What you are doing is claiming that it is representative of everybody else's situation. And whether it is overkill or not can be determined; you don't have to guess like this.
Please...

I used 6 different programs. How different is that from your using 6 different programs? No, they are not _exactly_ the same. But do you claim that you can pick 5 other opponents that don't have any randomness at all, so that 100 games gives you "the truth"? Are those programs representative of the real computer chess world? I named mine: Fruit, Glaurung 1/2, Arasan, GNU Chess, and Crafty. I didn't pick them for any reason other than strength...

So exactly how is _my_ experiment so different from anyone else's that my results don't apply to theirs? Every tournament result I see reported here draws conclusions that are absolutely meaningless due to the small sample size. And now some are using that _same_ methodology to determine whether a change is good or bad.

I have simply said "that is bad science". And provided data (not just from my program) to show that...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

nczempin wrote:Just another illustration:

If I teach my non-tablebase-using program some knowledge on how to handle KPK endgames, a wild guess (which could be measured in a better way of course, but that's not the concern here) would be that perhaps I'd need 20,000 games before the effect would become noticeable [this is a somewhat contrived example, because I would be assuming that I wasn't sure that the change would improve the program; we could redefine the test to check if that code is bug-free].

So if I don't have a significant result after 100, 1000 or 10,000 games, I wouldn't be concerned at all. (In practice, I would test endgame position suites and then assume that if I score better in them that would be enough for me). My engine is extremely weak tactically, and the opening book leans somewhat towards getting quick results (IMHO good advice for any player <2000), so it will get into endgames very rarely; those where KPK is significant even more rarely.

But if I double my NPS and do not see an improvement (remember, NPS is my limiting factor right now) after, say, 10 games (also completely arbitrary), I would conclude that I must have introduced a bug somewhere.
The problem with your statement above is "if I don't have a significant result."

Suppose you add endgame tables and your first 100 games come out 80-20. Is that significant? So that you can stop the experiment there? Many of us have reported that the first 100 games often show that kind of good or bad result because of the randomness I have now isolated. But after more games, things settle down to "the truth".

So just how do you decide whether the first 100 is significant or not? Some sort of crystal ball? I don't have one myself. So, I have to rely on the observation that the only way to make sure the first 100 results are significant is to play another 19,900 and have those confirm the first 100 games.

This is the old progressive-betting argument for gamblers. You can recognize patterns in how the cards fall and how you could vary your bet to win, but you can't recognize the pattern until _after_ it has happened. And then it is too late. No such progressive betting pattern will win (except for the classic martingale, which can't be played in practice because of table maximums and because you would need an infinite bankroll to survive an arbitrarily long losing streak before the win that puts you ahead).

So, please explain how you decide something is significant after 100 games, having seen the kind of variability I have shown in 80 game matches played by the same players, same starting positions, same time limit per move.
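To put a number on that variability, here is a minimal simulation sketch in Python. The 30% draw rate and the assumption of two exactly equal engines are illustrative choices, not measurements from anyone's data in this thread.

[code]
# Rough sketch: how much an 80-game match score between two *equal* engines
# swings by chance alone. The 30% draw rate is an assumed, illustrative value.
import random

GAMES = 80
DRAW_RATE = 0.30
TRIALS = 10_000

def match_score(games, draw_rate):
    """Engine A's score (wins + draws/2) in one simulated match."""
    score = 0.0
    for _ in range(games):
        r = random.random()
        if r < draw_rate:
            score += 0.5                              # draw
        elif r < draw_rate + (1 - draw_rate) / 2:
            score += 1.0                              # A wins (remaining mass split evenly)
    return score

results = [match_score(GAMES, DRAW_RATE) for _ in range(TRIALS)]
mean = sum(results) / TRIALS
sd = (sum((x - mean) ** 2 for x in results) / TRIALS) ** 0.5
print(f"mean {mean:.1f}/{GAMES}, standard deviation about {sd:.1f} points")
# Typical run: a standard deviation of ~3.7 points, so scores like 36-44 and
# 44-36 between *identical* engines are entirely routine in 80-game matches.
[/code]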
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

nczempin wrote:
bob wrote:
So the three samples are -22, +22 and 0. The mean of the squares is (22^2 + 22^2 + 0^2) / 3 = 968 / 3 ≈ 323. Divide by 2 to compute the variance (≈161). Is that the variance you are talking about? Square the difference of each pair of observations, then compute the mean of that and divide by 2???



I claim that number is _very_ high for computer vs computer chess games.
This claim is essentially meaningless unless you define at least:
1. What exactly you mean by "very high"
2. Under which conditions this variance is defined. Claiming that games at Crafty's level are representative for "computer vs computer chess games" (in general) is very bold, and I don't see how you could support it other than with your huge experience (which we all acknowledge); unfortunately that kind of reasoning would not be (I'm not saying you're using it) very scientific.

Would you please re-read my posts on this topic. I am not just "using crafty's level". I have used programs above and below that level and found _exactly_ the same variance. No I don't test 1600 programs as they are too weak to provide any useful data. But I do have at least one 2000-level program in the mix.

I am not relying on experience. I am relying on a _ton_ of data produced on my cluster. I was originally surprised, because I tried a test to figure out the best history value for Fruit, and the results almost suggested that the history parameter was unimportant, because of the variability at 80-160 games per trial.

I claim it goes down as the number of games goes up.
Well, we apparently need to define exactly which variance we are talking about: The theoretical underlying variance we would get if we were able to play an infinite number of games between all engines there are and will be. Anything we do to measure it can only be an approximation.

I agree that the approximation gets better the more measurements you take, but (assuming the engines all stay the same while we're measuring) the underlying variance is constant; it doesn't increase or decrease.
I claim that until the number of games is far larger than one might originally suspect, the variance is extremely high making the results highly random and unusable. I can't deal with variance that large and conclude anything useful.
Perhaps you cannot deal with such high variance, but there are methods that can take it into account. No matter how big your variance (well, of course unless the actual underlying variance is the maximum, which means you're throwing perfect dice etc.), you can always get results that lie outside of the variance for a given confidence level.
That is what Elostat does, and it produces values of X +/- Y.

But what do you do when you run two trials and get X1 +/- Y1, and X2 +/- Y2, where X1-Y1 is much greater than X2+Y2? I call that useless as the two ranges don't come close to overlapping. Which is right?
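The non-overlapping-ranges situation is easy to reproduce on paper. Here is a rough sketch of the usual score-to-Elo conversion (logistic model, normal approximation); the two 100-game scores are hypothetical, and Elostat's internal method differs in its details.

[code]
# Sketch only: convert a match score into an Elo estimate with a 95% range.
# The two 100-game trial scores below are hypothetical examples.
import math

def elo_interval(score, games, z=1.96):
    """Return (elo, low, high) for `score` points out of `games` games."""
    p = score / games
    se = math.sqrt(p * (1 - p) / games)        # draws would shrink this a little
    clamp = lambda q: min(max(q, 1e-6), 1 - 1e-6)
    to_elo = lambda q: -400 * math.log10(1 / clamp(q) - 1)
    return to_elo(p), to_elo(p - z * se), to_elo(p + z * se)

for label, score in (("trial 1", 65), ("trial 2", 40)):
    elo, lo, hi = elo_interval(score, 100)
    print(f"{label}: {score}/100 -> {elo:+4.0f} Elo, 95% range [{lo:+4.0f}, {hi:+4.0f}]")
# trial 1 comes out around [+39, +185] and trial 2 around [-144, -3]: two
# 100-game trials of the same matchup whose ranges do not even touch.
[/code]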
nczempin

Re: An objective test process for the rest of us?

Post by nczempin »

bob wrote: In any case, 100 games is not enough if you play the same 100-game match 2 times and get a different indication each time. For example, you play 2 matches before a change and get 50-50 and 45-55. You play 2 matches after the change and get 30-70 and 70-30.

Well, for that particular set of data (intuitively, without having measured it), you wouldn't be able to draw any conclusion.

Actually, to me this particular set would indicate that I haven't run enough tests before the change. If you are seeing that kind of behaviour, you would be well-advised to run more tests. I never claimed that your observations are wrong for your particular situation. I only questioned whether your situation can necessarily be applied to mine, or perhaps even others (although that part I'm not all that concerned about), which you seem to be claiming, again, correct me if I'm wrong.

[Incidentally, because the debate is getting a little heated, please let me assure you that if I ever slip and get personal or anything, that it was not intentional. With some of my posts I'm not so sure if they can be misunderstood in that way].

And 90-10 would also be enough.

I keep saying this all over, but you seem to simply ignore it.
I seem to be having that same problem. I have not seen you quote a single set of 100 game matches and give the results of each.

This seems a little unfair, as I don't have your resources. I don't have a cluster or anything. My laptop is running a gauntlet against >50 other engines day and night (working on the engine is on hold for now). I am currently in round 4 of the gauntlet, at game 193 out of 530. I will post the intermediate results of two matches once I have them.

Also please don't compare your 100 games to my 100 games. You could choose a meaningful starting position out of your 40, and then the results are more comparable. Or you could just run them from the normal starting position, with their own books enabled. I know that that is not the way you normally test, but given the fact that you have much more horsepower available, perhaps you could spare a few cycles on this particular analysis, just like I am sparing 100% of my cycles to do something I normally don't.
I did that. My results showed that just running 100 games, which could produce any one of those results I posted (or many other possible results as well) could lead you to the wrong conclusion.

Where is your data to show that 100 games is enough? It is easy enough to run my test and post the results to prove they are stable enough that 100 is enough.

There is an accepted way of dealing with limited sample sizes, but you're just saying that 100 games can never be enough.
Yes I am. I posted samples where A beat B and lost to B, yet A is provably stronger than B. I posted sample games where A beat B and lost to B where they are equal. And I posted games where A beat B and lost to B and A was far stronger than B.

Please provide some data to contradict mine. And don't just rely on your program if it is pretty deterministic. Try others as I did, to convince yourself things are not as stable as you think, which means the number of games needed (N) is much larger than you think.
But I have never made any claims about other programs. I have only ever been talking about my program, and asking for advice on how to handle its particular situation. I am not looking for the underlying theoretical variance.

And I am saying that 100 games can be enough, and statistics provides some methods to find out exactly when this is the case. And please don't give me that "95% confidence means 1 out of 20 is wrong". Okay, so you require what confidence level? 99%? Wow, you'll be wrong 1 time out of 100.
With 95% you will be wrong on one out of every 20 changes. Too high. Being wrong once can negate 5 previous rights...
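As a back-of-the-envelope illustration of that arithmetic (the per-change Elo gain and loss below are invented numbers, only meant to show the scale of the effect):

[code]
# Invented numbers, purely to illustrate the 95%-confidence arithmetic.
alpha = 0.05                       # wrong-decision rate at 95% confidence
changes = 20
print(f"chance all {changes} decisions are right: {(1 - alpha) ** changes:.2f}")  # ~0.36

gain_if_right = 5                  # assumed Elo from a genuinely good change
loss_if_wrong = 25                 # assumed Elo cost of accepting one bad change
print(f"one wrong accept undoes {loss_if_wrong // gain_if_right} good changes")
[/code]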

To me, this whole discussion seems to be: You were surprised that the variability was much higher than you expected, and now you're on a crusade to tell everybody "100 games is not enough" without qualification (as in the situation under which it applies, I am not questioning your credentials). And it is this missing qualification that concerns me.

[Well, let me tell you: 20000 games is not enough. I changed one line in the code and so far I haven't noticed any significant results. What do you have to say about that?]

20000 might not be enough. However, I have no idea what you mean by "without qualification". I believe, if you re-read my post, I qualified what I found very precisely. I tried games from a pool of 6 opponents: two significantly better than the rest, two pretty even in the middle of the pack, and two that were much worse. No matter who played whom, variance was unbearable over 80-160-320 games. Yes, you can get a quick idea of whether A is better than B or not with fewer games. But I am not doing that. I want to know if A' is better than A, where the difference is not that large. Somehow the discussion keeps drifting away from that point.
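To attach a number to "where the difference is not that large", here is a minimal sample-size sketch (logistic Elo model, normal approximation, an assumed 40% draw rate; none of these figures come from the thread, and it is a rough estimate rather than a full power calculation).

[code]
# Rough estimate of the games needed before a small Elo improvement shows up.
# Assumptions: logistic Elo model, normal approximation, 40% draws, and a
# two-sided 95% criterion. This finds the point where the *expected* score
# shift just reaches 1.96 standard errors; a proper power analysis asks for more.
import math

def games_needed(elo_diff, draw_rate=0.40, z=1.96):
    p = 1 / (1 + 10 ** (-elo_diff / 400))    # expected score with an elo_diff edge
    shift = p - 0.5
    var = p * (1 - p) - draw_rate / 4        # per-game score variance with draws
    return math.ceil(var * (z / shift) ** 2)

for diff in (50, 20, 10, 5):
    print(f"{diff:>2} Elo edge: about {games_needed(diff):>6} games")
# Roughly 100 games suffice for a 50-Elo jump, but a 5-10 Elo tweak needs
# thousands to tens of thousands -- the region small eval/search changes live in.
[/code]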


Well, the heading of _this_ thread is "...for the rest of us", which means that I am explicitly taking Crafty out of the picture, because from my point of view, Crafty's situation is special (just as probably from your point of view, Eden's situation is special). Now, I never claimed that any of the conclusions you made about Crafty is wrong. What I did claim was that your results do not necessarily apply to "the rest of us", which it seemed to me (but I may of course have been wrong) you have been implying, saying things like "100 games is not enough".

Without qualification means "100 games is not enough".
With qualification would mean "100 games is not enough for me to prove that Crafty version x.1 is stronger than x.2".


If you want to discuss something else, fine. But _my_ discussion was about this specific point alone. Nothing more, nothing less. Somehow the subject keeps getting shifted around however.
Yes, and I am happy that you're still listening and that we are starting to clear up some of our misunderstandings.
Back to the main idea, once again. I have 6 programs that I play against each other, one of which is mine. Generally I only play mine against 4 of that set of 5 (I weed out one that is significantly worse, as one of those is enough). I want to run Crafty against the set of 4 opponents, then run Crafty' against the same set, and then be able to conclude with a high degree of accuracy whether Crafty or Crafty' is better. Nothing more, nothing less. 80 or 160 or 320 games is absolutely worthless to make that determination. I have the data to prove this convincingly.
I have said elsewhere that intuitively (basically, just 4 samples), and without having done the maths, your result is most likely right. I never said it wasn't.

Any other discussion is not about my premise at all, and I don't care about them. I don't care which of my pool of programs is stronger, unless you count Crafty and Crafty'. I don't care about really large N as it becomes intractable. I don't care about really small N (N <=100) as the data is worthless. So I am somewhere in the middle, currently using 20K games, 5K against each of 4 opponents, and still see a small bit of variance there as expected. But not a lot.

So can we discuss _that_ issue alone? Comparing an old to a new version to decide whether the new version is stronger. And get off of all the tangential issues that keep creeping in?
But for that issue, in the context of Crafty, there is no discussion. I am saying that you are probably right for that context. I have been trying to talk about something else.

And while I would prefer if you could help with my particular issue, I will accept it if it doesn't interest you.

I have stated my case concisely above. I have offered data to support it. Feel free to offer data that refutes it, and to have that scrutinized as I did my own when I started this testing last year...
My claim so far regarding Crafty is that my engine is seeing lower variance in its results than Crafty is. I am working on the data to test this hypothesis. Unfortunately it is currently infeasible for me to do the exact same test that you have done, although I can find another computer and leave it running for a week or so.

So, for now I am only doing that gauntlet against >50 opponents. I will do that 40-position test at some stage, I just can't do it right now. Perhaps someone else could chip in? In addition, perhaps you would be willing to do just one small test where the conditions are closer to what I'm doing right now.
nczempin

Re: An objective test process for the rest of us?

Post by nczempin »

bob wrote: Would you please re-read my posts on this topic. I am not just "using crafty's level". I have used programs above and below that level and found _exactly_ the same variance.

Okay, perhaps I have mis-quoted you or even mis-read you in this aspect. All I know is that I am talking about Eden's level, which is a long way away from 2000.

No I don't test 1600 programs as they are too weak to provide any useful data. But I do have at least one 2000-level program in the mix.
This is unfortunate, because 1600 programs are just about the level I'm finding myself at. Again, I'm not trying to draw any generally applicable conclusions, just trying to improve my engine.

And for that purpose, using engines at my level seems to me to provide a more accurate picture.


Regarding the test data at my level, all I can ask for is some patience. It may very well be that my data confirms your hypothesis. All the better, because then we can concentrate on the issue that is really important to me, and I can actually use some of your results with more confidence.
nczempin

Re: An objective test process for the rest of us?

Post by nczempin »

bob wrote:
nczempin wrote:Just another illustration:

If I teach my non-tablebase-using program some knowledge on how to handle KPK endgames, a wild guess (which could be measured in a better way of course, but that's not the concern here) would be that perhaps I'd need 20,000 games before the effect would become noticeable [this is a somewhat contrived example, because I would be assuming that I wasn't sure that the change would improve the program; we could redefine the test to check if that code is bug-free].

So if I don't have a significant result after 100, 1000 or 10,000 games, I wouldn't be concerned at all. (In practice, I would test endgame position suites and then assume that if I score better in them that would be enough for me). My engine is extremely weak tactically, and the opening book leans somewhat towards getting quick results (IMHO good advice for any player <2000), so it will get into endgames very rarely; those where KPK is significant even more rarely.

But if I double my NPS and do not see an improvement (remember, NPS is my limiting factor right now) after, say, 10 games (also completely arbitrary), I would conclude that I must have introduced a bug somewhere.
The problem with your statement above is "if I don't have a significant result."

Suppose you add endgame tables and your first 100 games come out 80-20. Is that significant? So that you can stop the experiment there? Many of us have reported that the first 100 games often show that kind of good or bad result because of the randomness I have now isolated. But after more games, things settle down to "the truth".

So just how do you decide whether the first 100 is significant or not? Some sort of crystal ball? I don't have one myself. So, I have to rely on the observation that the only way to make sure the first 100 results are significant is to play another 19,900 and have those confirm the first 100 games.

This is the old progressive-betting argument for gamblers. You can recognize patterns in how the cards fall and how you could vary your bet to win, but you can't recognize the pattern until _after_ it has happened. And then it is too late. No such progressive betting pattern will win (except for the classic martingale, which can't be played in practice because of table maximums and because you would need an infinite bankroll to survive an arbitrarily long losing streak before the win that puts you ahead).

So, please explain how you decide something is significant after 100 games, having seen the kind of variability I have shown in 80 game matches played by the same players, same starting positions, same time limit per move.
Well, the decision of whether 80-20 is significant would depend on the variance that I have measured and consider to be an approximation of the real variance. At any given confidence level, and for a certain distribution, you can determine how many sigmas from the mean the result is, or, simply, how likely it is that the 80-20 result arose from mere chance.

There's no crystal ball involved.
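Applied to the hypothetical 80-20 example from earlier in the thread, that sigma calculation looks like this (normal approximation, draws ignored for simplicity; a sketch, not a full treatment):

[code]
# "How many sigmas is this result from an even match?" -- normal approximation.
import math

games = 100
se = math.sqrt(0.5 * 0.5 / games)     # standard error of the score fraction
                                      # under the null hypothesis of equal strength
print(f"80-20: {(0.80 - 0.5) / se:.1f} sigma")   # 6.0 -- overwhelmingly significant
print(f"55-45: {(0.55 - 0.5) / se:.1f} sigma")   # 1.0 -- indistinguishable from noise
# An 80-20 result really would justify stopping early; the catch is that
# realistic tuning changes produce 55-45-sized effects, where 100 games decide nothing.
[/code]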
nczempin

Re: An objective test process for the rest of us?

Post by nczempin »

bob wrote: That is what Elostat does, and it produces values of X +/- Y.

But what do you do when you run two trials and get X1 +/- Y1, and X2 +/- Y2, where X1-Y1 is much greater than X2+Y2? I call that useless as the two ranges don't come close to overlapping. Which is right?
Well, I am not sure what the specific meanings of those terms are, but I am sure they can easily be misunderstood to mean something which they don't.

All I get from those values is, e.g., that if I have n engines in the list, the ones with the highest Y are the ones most in need of more games. And if mere position in the list would suggest that an engine could be used as a test opponent for Eden, I try to make sure that its values are among the lowest ones.

Perhaps they signify one sigma or whatever, in which case you would need the ranges not to overlap at 3 times that width before making an acceptable statement about actual strengths.

There are so many assumptions built into those Elo ratings that you have to be very careful in interpreting those numbers. And, yes, the more games the better :-)
nczempin

Re: An objective test process for the rest of us?

Post by nczempin »

bob wrote: I tried games from a pool of 6 opponents: two significantly better than the rest, two pretty even in the middle of the pack, and two that were much worse. No matter who played whom, variance was unbearable over 80-160-320 games. Yes, you can get a quick idea of whether A is better than B or not with fewer games. But I am not doing that. I want to know if A' is better than A, where the difference is not that large.
As far as I have understood, using 80 games from those 40 positions means that each set of 80 is one sample. So 320 games would mean 4 samples.

I would expect the variance to be quite high for 4 samples.

I think I'm repeating myself, but I haven't received an answer to this idea (which may be wrong).
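In numbers (the per-game score standard deviation of ~0.42 is an illustrative figure for evenly matched engines with a moderate draw rate, not a measured value):

[code]
# Treat each 80-game mini-match as a single observation, as described above.
import math

per_game_sd = 0.42          # assumed std dev of one game's score (0 / 0.5 / 1)
games_per_match = 80
samples = 4                 # four 80-game mini-matches = 320 games

sd_one_match = per_game_sd * math.sqrt(games_per_match)     # ~3.8 points out of 80
se_mean = sd_one_match / math.sqrt(samples)                 # ~1.9 points
print(f"one 80-game match: about ±{sd_one_match:.1f} points of pure noise")
print(f"mean of {samples} such matches: still about ±{se_mean:.1f} points")
# Two honest 80-game results that differ by 6-8 points are therefore entirely
# consistent with the two engines being identical.
[/code]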
User avatar
hgm
Posts: 28395
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: An objective test process for the rest of us?

Post by hgm »

bob wrote:Apparently nothing I have said has registered.
It has been registered and dismissed, as being not to the point or simply false.
If you do a simple node count search limit, you will produce perfect repeatability. The question is, is the set of games you happen to produce actually representative of how the thing plays? It is just a small random sample, which means "NO" is the answer.
If I take a sample of positions from actual games that is large enough, it will indeed be representative of how the thing plays. There is no need to play from the same position twice, except for the opening, of course, and a few positions very close to it that you have to duplicate to get the required number of games. Starting from selected positions (e.g. Nunn) would even avoid that.

Note that I am not dependent on variability from A to generate a large set of games. I try many different opponents, and they all produce different moves on their turn. Furthermore, I have different versions of A (A', A"), that, exactly because they are different, produce different moves on A's turn, even when restricted to same number of nodes or iterations. So I will have plenty of different games, from which I can select positions that form a large and representative set.
Each time you encounter the _same_ position in a game, your program can play a different move due to the timing issues. Since there is that random component, a sample of the potential games from that position is far more informative than a sample size of one.
Well, if I somehow feel that my set of test positions is not large enough, or not representative enough, I can always try to expand the number of possible games by trying my luck with this timing stuff, and then add both possible moves (and the games resulting from it) to my set. But I don't think this will ever be needed. Anyway, it is not central to the method how I obtain the test positions. The crux is to play all versions in those same positions, to eliminate the noise associated with choosing those positions.
If you believe that methodology is OK, then go for it. I once believed it as well until I obtained the facilities to delve into this issue in extreme detail, where I learned things I had no idea were there.
Well, I don't think you tried anything of the sort that I have in mind. If you did I would be curious to know the result.
I'm not going to argue the case. I've already run millions of games,
Exactly. You run games. Bad idea! Very noisy...
our "team" has looked at the ridiculous level of non-determinism we have been producing on the cluster.
You keep saying that, and fail to substantiate it with evidence. There was nothing I would consider ridiculous in the variability you showed in the 80-game mini-matches, it was in fact lower than expected from sound statistical treatment.
and we've spent hundreds of hours going over the results to understand what is going on in them.

The bottom line is that large sample size == high level of confidence in the results, small sample size == very low level of confidence in the results. It really is that simple. And trying to finagle ways to somehow make large sample size close to small sample size isn't going to work.
Well, sorry to burst your bubble, but there are actually standard techniques for this in statistics. If I want to know how much on average children grow between their 8th and 9th birthday, the most stupid thing to do is to select N children at age 8 and measure their average height, then independently select N children at age 9 to make a new average, and then take the difference. Due to the large variance in the height of children in the population, you would need N to be many millions before the standard error would give you any significance on the difference. What one does instead is select N children and measure these same children both at age 8 and 9. Then you can reach the same level of accuracy with just N=100 or so.
I would think that everyone sees the importance of an _accurate_ assessment approach to decide whether a change is good or bad. It is _the_ way to make progress. Trying to short-cut the process is just a "random walk" experiment. Eventually you will "get there" but after a whole lot more false steps than the methodical approach.
Accuracy can come from using smart methodology just as much, if not more, as from sample size.
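The children analogy maps onto engine testing as a paired comparison: play A and A' from the same positions against the same opponents, then analyse the per-position score differences. A minimal sketch of that arithmetic follows; the score lists are invented placeholders, and how much the pairing really helps depends on how strongly results from the same position correlate.

[code]
# Paired-comparison sketch: same positions, same opponents, two engine versions.
# The score lists are invented placeholders (0 = loss, 0.5 = draw, 1 = win).
import math

a_scores  = [1.0, 0.5, 0.0, 1.0, 0.5, 0.5, 1.0, 0.0, 0.5, 1.0]   # version A
a2_scores = [1.0, 1.0, 0.5, 1.0, 0.5, 0.5, 1.0, 0.0, 1.0, 1.0]   # version A'

diffs = [b - a for a, b in zip(a_scores, a2_scores)]              # per-position gain
n = len(diffs)
mean_diff = sum(diffs) / n
sd_diff = math.sqrt(sum((d - mean_diff) ** 2 for d in diffs) / (n - 1))
se = sd_diff / math.sqrt(n)
t = mean_diff / se if se > 0 else float("inf")
print(f"mean gain per position {mean_diff:+.2f}, t = {t:.2f} on {n - 1} d.f.")
# Each difference cancels the position-to-position (and opponent) variation that
# both versions share, so the standard error of the *difference* can be much
# smaller than the error on either version's raw score.
[/code]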
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

hgm wrote:
bob wrote:Apparently nothing I have said has registered.
It has been registered and dismissed, as being not to the point or simply false.
If you do a simple node count search limit, you will produce perfect repeatability. The question is, is the set of games you happen to produce actually representative of how the thing plays? It is just a small random sample, which means "NO" is the answer.
If I take a sample of positions from actual games that is large enough, it will indeed be representative of how the thing plays. There is no need to play from the same position twice, except for the opening, of course, and a few positions very close to it that you have to duplicate to get the required number of games. Starting from selected positions (e.g. Nunn) would even avoid that.

Note that I am not dependent on variability from A to generate a large set of games. I try many different opponents, and they all produce different moves on their turn. Furthermore, I have different versions of A (A', A"), that, exactly because they are different, produce different moves on A's turn, even when restricted to same number of nodes or iterations. So I will have plenty of different games, from which I can select positions that form a large and representative set.

If you believe that, then we should stop the discussion. Your most telling comment above is "if I take a sample size that is large enough..." That is where we will simply have to disagree. Because I understand the idea that for _any_ position, it is possible for a program to play different moves if the search is based on a time limit. And since those small per-move probabilities accumulate over the course of a game, the probability of producing the same game twice is extremely low. So we are simply going to disagree on how large the sample size has to be.

I'm not going to be overly concerned. I have enough data to have a good idea of what is needed and what won't cut the mustard. Time will eventually prove one of us wrong.

Each time you encounter the _same_ position in a game, your program can play a different move due to the timing issues. Since there is that random component, a sample of the potential games from that position is far more informative than a sample size of one.
Well, if I somehow feel that my set of test positions is not large enough, or not representative enough, I can always try to expand the number of possible games by trying my luck with this timing stuff, and then add both possible moves (and the games resulting from it) to my set. But I don't think this will ever be needed. Anyway, it is not central to the method how I obtain the test positions. The crux is to play all versions in those same positions, to eliminate the noise associated with choosing those positions.


I don't see how testing over _positions_ is going to produce a qualitative answer as to "better or worse", whereas playing a complete _game_ answers that question pretty succinctly. I want a test methodology that I can execute and get a "go/no-go" answer with no effort on my part. No results to look at, no positions to compare answers to, just "go" or "no-go". I have that test paradigm functioning, and that is all I want. Minimum effort on my part to evaluate a change. Just compile, submit, and wait for the results to come back summarized...



If you believe that methodology is OK, then go for it. I once believed it as well until I obtained the facilities to delve into this issue in extreme detail, where I learned things I had no idea were there.
Well, I don't think you tried anything of the sort that I have in mind. If you did I would be curious to know the result.
I'm not going to argue the case. I've already run millions of games,
Exactly. You run games. Bad idea! Very noisy...
Then this is a hopeless discussion. I don't have any "noise" to speak of because I play enough games to drown the noise out. That is the point. And I don't have to spend any massive amount of time going over the results to decide good or bad.

Did you ever stop to think why almost _everybody_ relies on games to make these decisions? The commercial authors have discussed this here from time to time. CT once mentioned the exact same thing I am discussing, that testing this way is the most critical aspect of developing a chess program. We develop them to play games. They ought to be evaluated doing the thing they were designed to do.

our "team" has looked at the ridiculous level of non-determinism we have been producing on the cluster.
You keep saying that, and fail to substantiate it with evidence. There was nothing I would consider ridiculous in the variability you showed in the 80-game mini-matches, it was in fact lower than expected from sound statistical treatment.
Then answer me this. Old version produces a result of 45-35. New version produces a result of 35-45. Do you keep the change or not? You run the test again, and get 40-40. Now what? I have explained that multiple times. You never address that question, instead changing the topic to something that is not directly related to the original point I raised.

Again, I want something I can execute which comes back with "go / no-go" with respect to accepting/rejecting a new change to Crafty. I don't want any extra subjective effort required. I want the test to measure Crafty as closely as possible to the way it will play in a tournament. Etc.

Somehow that is getting overlooked...

and we've spent hundreds of hours going over the results to understand what is going on in them.

The bottom line is that large sample size == high level of confidence in the results, small sample size == very low level of confidence in the results. It really is that simple. And trying to finagle ways to somehow make large sample size close to small sample size isn't going to work.
Well, sorry to burst your bubble, but there are actually standard techniques for this in statistics. If I want to know how much on average children grow between their 8th and 9th birthday, the most stupid thing to do is to select N children at age 8 and measure their average height, then independently select N children at age 9 to make a new average, and then take the difference.
Aha. You get my point. That is _exactly_ what _you_ are doing with a small sample size. I am playing enough games so that I take most of the available children in the 8-year-old age group, so that I _do_ get better answers. How can you misunderstand that?

I'm not for small sample sizes at all. Never have been throughout this discussion... You have been the one advocating the smaller samples, not me. I _know_ I need to sample enough of the population to get a good estimate of the actual mean.

Due to the large variance in the height of children in the population, you would need N to be many millions before the standard error would give you any significance on the difference. What one does instead is select N children and measure these same children both at age 8 and 9. Then you can reach the same level of accuracy with just N=100 or so.
I would think that everyone sees the importance of an _accurate_ assessment approach to decide whether a change is good or bad. It is _the_ way to make progress. Trying to short-cut the process is just a "random walk" experiment. Eventually you will "get there" but after a whole lot more false steps than the methodical approach.
Accuracy can come from using smart methodology just as much, if not more, as from sample size.
Baloney. If the sample size is too small, all the t-tests and chi-square tests in the world are not going to help because of the standard error... I'm going for an error estimate small enough to give me confidence in my decision on whether to keep changes or not. I am not going to play 32 games in a mini-round-robin event to make that decision; the data is no good.
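For the record, the standard error being referred to here, at a few sample sizes (a 40% draw rate and a score near 50% are assumed; near equality one percentage point of score is worth roughly 7 Elo):

[code]
# Sketch: 95% error bar on the measured strength difference, in Elo, versus
# the number of games. The 40% draw rate is an illustrative assumption.
import math

def elo_error_95(games, draw_rate=0.40):
    var = 0.25 - draw_rate / 4          # per-game score variance near a 50% score
    se = math.sqrt(var / games)         # standard error of the score fraction
    return 1.96 * se * 100 * 7          # ~7 Elo per percentage point near 50%

for n in (32, 100, 320, 1000, 20000):
    print(f"{n:>6} games: ±{elo_error_95(n):4.0f} Elo at 95%")
# 32 games cannot resolve anything finer than roughly ±90 Elo; it takes on the
# order of 20,000 games to bring the error bar down to a few Elo.
[/code]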