An objective test process for the rest of us?

Discussion of chess software programming and technical issues.


bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

nczempin wrote:
bob wrote: In any case, 100 games is not enough if you play the same 100 game match 2 times and get a different indication each time. For example, you play 2 matches before a change and get 50-50 and 45-65. You play 2 matches after the change and get 30-70 and 70-30.

Well, for that particular set of data (intuitively, without having measured it), you wouldn't be able to draw any conclusion.

Actually, to me this particular set would indicate that I haven't run enough tests before the change.
I feel like I am stuck in some endless loop in an episode of either "The Twilight Zone" or "Groundhog Day."

If you don't get those kinds of results, something is wrong with your testing. Or else Fruit, Glaurung 1/2, Crafty, Arasan, GnuchessX (and even a few others I won't mention) have serious problems, because _they_ can't produce deterministic results for the 40-position / 80-game matches I am running. So the point is, those _are_ the kinds of results at least most of us are seeing. Whether you or HGM do or not, I have no idea, and do not care. I am reporting what I see with a broad group of programs, not just my own.

Again, just run the test and report back, rather than continuing the discussion forever. Let's see what kind of variability you get over an 80 game match played at least 2-3 times...

If you are seeing that kind of behaviour, you would be well-advised to run more tests. I never claimed that your observations are wrong for your particular situation. I only questioned whether your situation can necessarily be applied to mine, or perhaps even to others (although that part I'm not all that concerned about), which you seem to be claiming (again, correct me if I'm wrong).

[Incidentally, because the debate is getting a little heated, please let me assure you that if I ever slip and get personal or anything, it was not intentional. With some of my posts I'm not so sure whether they could be misunderstood in that way.]

And 90-10 would also be enough.

I keep saying this all over, but you seem to simply ignore it.
I seem to be having that same problem. I have not seen you quote a single set of 100 game matches and give the results of each.

This seems a little unfair, as I don't have your resources. I don't have a cluster or anything. My laptop is running day and night on a gauntlet against >50 other engines (working on the engine is on hold for now). I am currently in round 4 of the gauntlet, at game 193 out of 530. I will post the intermediate results of two matches once I have them.

Also, please don't compare your 100 games to my 100 games. You could choose a meaningful starting position out of your 40, and then the results would be more comparable. Or you could just run them from the normal starting position, with each engine's own book enabled. I know that that is not the way you normally test, but given the fact that you have much more horsepower available, perhaps you could spare a few cycles on this particular analysis, just like I am sparing 100% of my cycles to do something I normally don't.
To be a reasonable test, I want to play enough games to give the new change a chance to be used in many types of positions: tactical attacks, positional games, simple endgames, complex endgames. Otherwise I can't be sure whether a simple "outside passed pawn" term is good or bad. I have seen such a term work very well in some positions, and cause the program to trade B for 3 pawns to create a distant passed pawn, only to lose because the opponent has an extra piece. If you test a change only in positions where the change is obviously important, you are overlooking a critical part of the testing methodology you need. Hence my decision to use Albert's 40 positions even though I know that Crafty will likely not play into some of those positions. But the positions that occur are still important with respect to understanding how your evaluation performs over a large cross-section of potential positions you might see in tournaments.
I did that. My results showed that just running 100 games, which could produce any one of those results I posted (or many other possible results as well) could lead you to the wrong conclusion.

Where is your data to show that 100 games is enough? It is easy enough to run my test and post the results to prove they are stable enough that 100 is enough.

There is an accepted way of dealing with limited sample sizes, but you're just saying that 100 games can never be enough.
Yes I am. I posted samples where A beat B and lost to B, yet A is provably stronger than B. I posted sample games where A beat B and lost to B where they are equal. And I posted games where A beat B and lost to B and A was far stronger than B.

Please provide some data to contradict mine. And don't just rely on your program if it is pretty deterministic. Try others as I did, to convince yourself things are not as stable as you think, which means the number of games needed (N) is much larger than you think.
But I have never made any claims about other programs. I have only ever been talking about my program, and asking for advice on how to handle its particular situation. I am not looking for the underlying theoretical variance.
Then why are we having this discussion? I _know_ the concept applies to typical commercial and amateur engines that play at a very high level. Do you ever intend to reach that level? If you don't have the non-determinism I see in these other programs, I suspect you will as you get better. Maybe then you will see the need for a better testing regime, or else you are going to do as I have done _many_ times in the past and make some serious false steps that are going to cause some embarrassing losses. I need to point no further than the first two games of the 1986 WCCC event to show how an insufficient number of test games can lead to a horrible decision and result.


And I am saying that 100 games can be enough, and statistics provides some methods to find out exactly when this is the case. And please don't give me that "95% confidence means 1 out of 20 is wrong". Okay, so you require what confidence level? 99%? Wow, you'll be wrong 1 time out of 100.
With 95% you will be wrong on one of every 20 changes. That is too high. Being wrong once can negate 5 previous rights...

To me, this whole discussion seems to be: You were surprised that the variability was much higher than you expected, and now you're on a crusade to tell everybody "100 games is not enough" without qualification (as in the situation under which it applies, I am not questioning your credentials). And it is this missing qualification that concerns me.

[Well, let me tell you: 20000 games is not enough. I changed one line in the code and so far I haven't noticed any significant results. What do you have to say about that?]
20000 might not be enough. However, I have no idea what you mean by "without qualification". I believe, if you re-read my post, I qualified what I found very precisely. I tried games from a pool of 6 opponents: two significantly better than the rest, two pretty even in the middle of the pack, and two that were much worse. No matter who played whom, variance was unbearable over 80-160-320 games. Yes, you can get a quick idea of whether A is better than B or not with fewer games. But I am not doing that. I want to know if A' is better than A, where the difference is not that large. Somehow the discussion keeps drifting away from that point.


Well, the heading of _this_ thread is "...for the rest of us", which means that I am explicitly taking Crafty out of the picture, because from my point of view, Crafty's situation is special (just as probably from your point of view, Eden's situation is special). Now, I never claimed that any of the conclusions you made about Crafty is wrong. What I did claim was that your results do not necessarily apply to "the rest of us", which it seemed to me (but I may of course have been wrong) you have been implying, saying things like "100 games is not enough".

Without qualification means "100 games is not enough".
With qualification would mean "100 games is not enough for me to prove that Crafty version x.1 is stronger than x.2".
I certainly can't claim my observations apply to "all of the rest of you". But based on the opponents I _have_ tested, I would claim that they apply to _most_ of the rest of you. And I will also bet that as time goes by, they will apply to you more and more as well...




If you want to discuss something else, fine. But _my_ discussion was about this specific point alone. Nothing more, nothing less. Somehow the subject keeps getting shifted around however.
Yes, and I am happy that you're still listening and that we are starting to clear up some of our misunderstandings.
Back to the main idea, once again. I have 6 programs that I play against each other, one of which is mine. Generally I only play mine against 4 of that set of 5 (I weed out one that is significantly worse, as one of those is enough). I want to run Crafty against the set of 4 opponents, then run Crafty' against the same set, and then be able to conclude with a high degree of accuracy whether Crafty or Crafty' is better. Nothing more, nothing less. 80 or 160 or 320 games is absolutely worthless to make that determination. I have the data to prove this convincingly.
I have said elsewhere that intuitively (basically, just 4 samples), and without having done the maths, your result is most likely right. I never said it wasn't.

Any other discussion is not about my premise at all, and I don't care about them. I don't care which of my pool of programs is stronger, unless you count Crafty and Crafty'. I don't care about really large N as it becomes intractable. I don't care about really small N (N <=100) as the data is worthless. So I am somewhere in the middle, currently using 20K games, 5K against each of 4 opponents, and still see a small bit of variance there as expected. But not a lot.

So can we discuss _that_ issue alone? Comparing an old to a new version to decide whether the new version is stronger. And get off all of the tangential issues that keep creeping in?
But for that issue, in the context of Crafty, there is no discussion. I am saying that you are probably right for that context. I have been trying to talk about something else.

And while I would prefer if you could help with my particular issue, I will accept it if it doesn't interest you.

I have stated my case concisely above. I have offered data to support it. Feel free to offer data that refutes it, and to have that scrutinized as I did my own when I started this testing last year...
My claim so far regarding Crafty is that my engine is seeing lower variance in its results than Crafty is. I am working on the data to test this hypothesis. Unfortunately it is currently infeasible for me to do the exact same test that you have done, although I can find another computer and leave it running for a week or so.

So, for now I am only doing that gauntlet against >50 opponents. I will do that 40-position test at some stage, I just can't do it right now. Perhaps someone else could chip in? In addition, perhaps you would be willing to do just one small test where the conditions are closer to what I'm doing right now.

You can do this test pretty easily. You don't need to run long games. I have run 1+0, 1+1, 2+1, 1+2, 5+5, 10+10, 30+30 and 60+60, and the variability doesn't change very much. A time control of 1+0 lets you play about 30 games per hour, actually a bit more. You could produce a 100-game match in 3 hours, two in 6 hours, and that's all you need to test this hypothesis...
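(To see what kind of spread is normal even with _no_ change at all, here is a minimal sketch, purely illustrative: it assumes two exactly equal engines and a 30% draw rate, both of which are assumptions, not measurements.)

[code]
import random

def simulate_match(games=100, p_win=0.35, p_draw=0.30):
    """Score of engine A (1 per win, 0.5 per draw) in one simulated match."""
    score = 0.0
    for _ in range(games):
        r = random.random()
        if r < p_win:
            score += 1.0          # A wins
        elif r < p_win + p_draw:
            score += 0.5          # draw
        # otherwise A loses and scores nothing
    return score

if __name__ == "__main__":
    random.seed(1)
    print([simulate_match() for _ in range(10)])
[/code]

With these assumptions the per-match standard deviation is a little over 4 points, so two runs of the very same 100-game match routinely land 5-10 points apart, which is exactly the kind of swing under discussion.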
nczempin

Re: An objective test process for the rest of us?

Post by nczempin »

bob wrote:
nczempin wrote: But I have never made any claims about other programs. I have only ever been talking about my program, and asking for advice on how to handle its particular situation. I am not looking for the underlying theoretical variance.
Then why are we having this discussion?
Because you gave some figures for the situation at the very high level, and I wanted to find out which principles, if any, apply at a level that is just slightly above that of a potato playing chess.

If that level does not interest you, fine. But it does interest me, because my engine is currently not more than a smart potato.

I _know_ the concept applies to typical commercial and amateur engines that play at a very high level.
Because the world does not revolve around Crafty, or other engines that play at a very high level.

I explicitly said "for the rest of us", which in most contexts means average kinds of people as opposed to superstars, and I don't see how it can be misunderstood to mean anything to do with a high level.

Do you ever intend to reach that level?
I do eventually, but don't hold your breath. Certainly not with a Java-based engine, and not based on my principle that each new version should be only slightly, but significantly, better than the previous one. Based on that, it'll take at least 20 more releases of Eden. That is quite a bit if you consider that so far I'm at release number 13, and I don't even have a major version number yet.



If you don't have the non-determinism I see in these other programs, I suspect you will as you get better.
I am almost certain that I will. But I am only talking about the here and now.

Maybe then you will see the need for a better testing regime, or else you are going to do as I have done _many_ times in the past and make some serious false steps that are going to cause some embarrassing losses. I need to point no further than the first two games of the 1986 WCCC event to show how an insufficient number of test games can lead to a horrible decision and result.
Umm, yeah, but two games out of how many over more than 20 years is exactly how significant? Sure, you remember it vividly, but the human mind is very capable of giving more significance to individual events than is appropriate. That's why we have statistics.

And no, playing more games will not solve the problem of the human mind.
hgm
Posts: 28359
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: An objective test process for the rest of us?

Post by hgm »

bob wrote:If you believe that, then we should stop the discussion. Your most telling comment above is "if I take a sample size that is large enough..." That is where we will simply have to disagree. Because I understand the idea that for _any_ position, it is possible for a program to play different moves if the search is based on a time limit. And since probabilities are additive, the probability of producing the same game twice is extremely low. So we are simply going to disagree on how large the sample size has to be.
So actually you don't disagree at all. You agree: you admit that there _exists_ a sample size that would do the job. As this is what "large-enough" by definition means...
I'm not going to be overly concerned. I have enough data to have a good idea of what is needed and what won't cut the mustard. Time will eventually prove one of us wrong.

I don't see how testing over _positions_ is going to produce a qualitative answer as to "better or worse" where playing a complete _game_ answers that idea pretty succinctly. I want a test methodology that I can execute and get a "go/no-go" answer with no effort on my part. No results to look at, no positions to compare answers to, just "go" or "no-go". I have that test paradigm functioning, and that is all I want. Minimum effort on my part to evaluate a change. Just compile, submit, and wait for the results to come back summarized...
Well, like I said, good for you. But not really of any interest to readers of this forum, as most of them can't or don't want to come up with a million dollars to buy 256 high-end PCs.

For us it is important to use more advanced testing methodology, that does not waste that much computer time, as for us computer time is a limited resource.
Then this is a hopeless discussion. I don't have any "noise" to speak of because I play enough games to drown the noise out. That is the point. And I don't have to spend any massive amount of time going over the results to decide good or bad.
I would say that you have exactly as much noise as anyone else, but you have developed a (very expensive) method to live with it. By averaging it out of the data. And we both know that that is _quadratically_ expensive...
Did you ever stop to think why almost _everybody_ relies on games to make these decisions? The commercial authors have discussed this here from time to time. CT once mentioned the exact same thing I am discussing, that testing this way is the most critical aspect of developing a chess program. We develop them to play games. They ought to be evaluated doing the thing they were designed to do.
I am not overly concerned about that. Man is a creature of habit, and not many people have imaginative ideas. People that are blessed to have such ideas in one area very often fail to have them in other areas. Good programmers, or good Chess players, are not necessarily schooled in statistics and data analysis.

Your last remark basically dismisses 99% of all scientific research ever done. One almost never gets any valuable data from measurements under real-life conditions, as these conditions involve way too many uncontrolled, unpredictable and unknown factors influencing the process of interest. So one always tries to isolate the system from its usual environment, and replaces the latter by a controlled environment where some properties are varied, others kept constant, but all are precisely known. Standard practice...

The thing you are advocating is not science, but alchemy! :lol:
Then answer me this. Old version produces a result of 45-35. New version produces a result of 35-45. Do you keep the change or not? You run the test again, and get 40-40. Now what? I have explained that multiple times. You never address that question, instead changing the topic to something that is not directly related to the original point I raised.
Well, easy enough. Without a conspicuous absence of draws I would assume an upper limit to the SD of the result of 3.6. The difference in score is 10 pts, or 2.77*sigma (or more, if sigma is unexpectedly small). But in a difference of two independent measurements, the variances are added, so sigma is a factor sqrt(2) larger, and 10 points is only 1.96*sigma. So I _know_ that there is only a probability of 2.5% that an equal program could achieve this much score improvement by sheer luck. That would be enough for me to keep the change and start working on the next one, as with the test equipment I have, requiring even more taxing standards of confidence would be very counterproductive, as it would slow me down enormously (the "paranoia effect"). I would only be wrong once in every 40 times, and I would not even take a step back then, as this is the probability of accepting an _equal_ change. The probability that I would accept a change that was actually detrimental is much lower. And even if I would take one step backward for every 40 steps forward, I would progress at a rate of 38 improvements per 40 successful test runs. If I could beef up the confidence from 97.5% to 100% by testing twice as long, I would improve by 40 steps in 80 test runs, only about half as fast...

I suppose you repeat the test with the changed version. So it is now 40-40. Well, that is 5 points different from the first one, or 1 sigma. Not something to start doubting the validity of the test (e.g. through virus infection of the computer), which I would certainly do if the second result had been 0-80. In fact the new test (weakly) confirms the old one, as 40-40 is still better than 35-45. But the only meaningful quantity is the sum of the tests, which should have a variance twice that of the original test (because it is now a 160-game match), but in the _average score_ of the two tests the variance would be only half the original one, as averaging entails division by 2. The difference between A and A' would then have an SD of sqrt(1+0.5)*3.6. But the average score difference would now be only 7.5 points. This means you are off by 1.7*sigma now. So your confidence has in fact decreased a little. But not enough to reject it.

Note that if I would be worried after the second test, I would not play more games with the changed version, but with the original one, as the SD in the difference starts to be dominated by that in the measurement of the old version.

I hope that answers your question.
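For concreteness, the arithmetic above fits in a few lines of Python; the 3.6 is the assumed upper bound on the SD of a single 80-game score, taken from the text rather than measured:

[code]
import math

sigma_match = 3.6                              # assumed SD of one 80-game score

# First comparison: old version 45, new version 35 (out of 80)
diff1 = 45 - 35
z_diff = diff1 / (sigma_match * math.sqrt(2))  # two independent runs: ~1.96

# After the 40-40 re-run, average the two results of the changed version
avg_new = (35 + 40) / 2                        # 37.5
diff2 = 45 - avg_new                           # 7.5 points
sigma_comb = sigma_match * math.sqrt(1 + 0.5)  # one run vs. the average of two
z_comb = diff2 / sigma_comb                    # ~1.70

print(round(z_diff, 2), round(z_comb, 2))      # 1.96 1.7
[/code]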

As far as I am concerned the "point you originally raised" is that we could not say with 97.5% confidence that an engine scoring 15 out of a 20-game gauntlet is better than one scoring 5 out of 20. If you are discussing something different now, it must be you who somehow changed topic.
Again, I want something I can execute which comes back with "go / no-go" with respect to accepting/rejecting a new change to Crafty. I don't want any extra subjective effort required. I want the test to measure Crafty as closely as possible to the way it will play in a tournament. Etc.

Somehow that is getting overlooked...
Not at all. If I were to adopt a better test methodology, I would of course also fully automate it. There is nothing in my scheme that precludes that, or even makes it difficult.
hgm wrote:If I want to know how much on average children grow between their 8th and 9th birthday, the most stupid thing to do is select N children at 8 and measure their average length, and then independently select N children at age 9 to make a new average, and then take the difference.
Aha. You get my point. That is _exactly_ what _you_ are doing with a small sample size. I am playing enough games so that I take most of the available children in the 8-year-old age group, so that I _do_ get better answers. How can you misunderstand that?
Again, it is you who seems to misunderstand something here: N was not specified in my example, so _if_ I were doing this with small N, and you were doing this with large N, we would still _both_ be doing exactly what the example said. But I am not doing that at all, of course. Like I said, it is the most stupid way to do it, so why should I do it? It is what _you_ do, and the fact that you can finally get a precise answer by taking an excessively large N does not make it any less stupid. For the quality of the methodology is not measured by the precision that you finally get (which in every methodology can always be made arbitrarily small), but by the value of N needed to obtain this accuracy.
I'm not for small sample sizes at all. Never have been throughout this discussion... You have been the one advocating the smaller samples, not me. I _know_ I need to sample enough of the population to get a good estimate of the actual mean.
So where does the notion come from that N was supposed to be small? Do you actually react to what I post, or does your mind create a virtual me that claims quite different things than what I write, and are you actually replying to those phantom claims? :shock:
hgm wrote: Due to the large variance in the length of children in the population, you would need N to be many millions before the standard error would give you any significance on the difference. What one does instead is select N children, and measure these same children both at age 8 and at age 9. Then you can reach the same level of accuracy with just N=100 or so.

Accuracy can come from using smart methodology just as much as, if not more than, from sample size.
Baloney. If the sample size is too small, all the T-tests and chi-square tests in the world are not going to help because of the standard error... I'm going for an error estimate small enough to give me confidence in my decision on whether to keep changes or not. I am not going to play 32 games in a mini-round-robin event to make that decision, the data is no good.
It seems you still don't grasp the example I have given, or you would not write such nonsense. Let me make it more concrete, then:

Suppose the length of children in the population of the Netherlands had a SD of 20cm. Suppose they grow on average 3 cm between their 8th and 9th birthday, with a SD of 1cm.

To recover these numbers by sampling, the methodology described directly above ("_my_" methodology) would take 100 8-yr olds, measure them, measure them again one year later, and take the average. That average would then have an SD of 1cm/sqrt(100) = 0.1 cm.
On the other hand, _your_ methodology of just taking large-enough samples of independently chosen 8- and 9-year olds would require, to get the same accuracy of 0.1cm, an SD in the individual length averages of 0.071 cm, which is 1/283 of the SD of an individual length. So your sample size would have to be 283^2 ≈ 80,000 for both age groups. In our 16M population that means pretty much the entire population of 8 and 9-year olds. (So it would take a year anyway, to wait for their birthdays...) Not do-able, or horrendously expensive. While I only had to take 100 children, and get the same result. Or take 1000, if I wanted to do 3 times better...

So indeed, it is very possible that a 1000-element sample produces much better accuracy than a 160,000-element sample. Provided the sample is chosen in a smart, rather than a dumb way.
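To make the comparison concrete, here is a small simulation of both estimators with the same assumed numbers (population SD 20 cm, growth 3 cm with SD 1 cm, N = 100). The mean height of 130 cm is an arbitrary placeholder, and the exact printed values will vary with the seed:

[code]
import random
import statistics

random.seed(0)

def paired_estimate(n=100):
    """Measure the same n children at age 8 and at age 9; average the growth."""
    growth = [random.gauss(3.0, 1.0) for _ in range(n)]
    return statistics.mean(growth)

def independent_estimate(n=100):
    """Independently sample n 8-year-olds and n 9-year-olds."""
    age8 = [random.gauss(130.0, 20.0) for _ in range(n)]   # 130 cm is arbitrary
    age9 = [random.gauss(133.0, 20.0) for _ in range(n)]
    return statistics.mean(age9) - statistics.mean(age8)

paired = [paired_estimate() for _ in range(200)]
indep = [independent_estimate() for _ in range(200)]
print("SD of paired estimate:     ", round(statistics.stdev(paired), 2))  # ~0.1 cm
print("SD of independent estimate:", round(statistics.stdev(indep), 2))   # ~2.8 cm
[/code]

The paired estimator pins the growth down to about a millimetre with only 100 children, while the independent one is nearly 30 times noisier, exactly the sqrt(2)*20/sqrt(100) = 2.8 cm predicted above.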
Uri Blass
Posts: 10822
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: An objective test process for the rest of us?

Post by Uri Blass »

Note that evaluation changes, like changing weights or adding an evaluation factor that is relevant in all parts of the game, are clearly changes for which you need many games to make sure that they are better, but not all changes in chess programs are changes of that type.

There are still changes that are simply about making the program faster (even if it is not exactly x times faster with the same analysis).

For example, I search checks in the first ply of the qsearch, but I still have no function to generate only checks and captures, so I simply generate all moves and find the checks and captures among them.

Having a function to generate only captures and checks should be an obvious improvement.

I cannot say that it makes the program exactly x times faster, because the order of generating moves may be different, but I think that in this case I do not need thousands of games to verify that there is an improvement.

The only way not to get an improvement from that type of change is if I have a bug in my code, and I have other tests to verify that there are no bugs.

Games are also needed because maybe there is some bug that happens only in games, but my assumption is that there is an improvement if I see better results in games.

Note that my evaluation is relatively expensive, so I do not expect this change to give a big speed improvement, but I do expect some speed improvement (I will see how much improvement I get when I compare analysis by the new version, which is not finished yet, with analysis by the old version, and see how much faster the new version is).

Uri
nczempin

Re: An objective test process for the rest of us?

Post by nczempin »

Uri Blass wrote:Note that evaluation changes, like changing weights or adding an evaluation factor that is relevant in all parts of the game, are clearly changes for which you need many games to make sure that they are better, but not all changes in chess programs are changes of that type.

There are still changes that are simply about making the program faster (even if it is not exactly x times faster with the same analysis).

For example, I search checks in the first ply of the qsearch, but I still have no function to generate only checks and captures, so I simply generate all moves and find the checks and captures among them.

Having a function to generate only captures and checks should be an obvious improvement.

I cannot say that it makes the program exactly x times faster, because the order of generating moves may be different, but I think that in this case I do not need thousands of games to verify that there is an improvement.

The only way not to get an improvement from that type of change is if I have a bug in my code, and I have other tests to verify that there are no bugs.

Games are also needed because maybe there is some bug that happens only in games, but my assumption is that there is an improvement if I see better results in games.

Note that my evaluation is relatively expensive, so I do not expect this change to give a big speed improvement, but I do expect some speed improvement (I will see how much improvement I get when I compare analysis by the new version, which is not finished yet, with analysis by the old version, and see how much faster the new version is).

Uri
Well, the nice thing about using real statistical methods is that independent of whether the change was big or small, they will give you an indication of whether the change was significant.

With tiny changes, it would automatically take more games to decide that they are significant.

And as for my methodology, because my engine is still so immature, I simply ignore those changes that would only be significant after a huge number of games. Therefore I can conclude after a much smaller number of games whether I have or have not reached the significance level.

I am not interested in going further for a small change; I simply decide that it is too small. I.e., small == no change for me.
hgm
Posts: 28359
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: An objective test process for the rest of us?

Post by hgm »

Oops, I see that I have overlooked this post, because it was already "snowed under" by Nicolai's posts by the time I looked:
bob wrote:I was originally talking about variance in the engineering sense. But I have been using the usual statistical term recently. In that, when we started testing on our cluster, I could run 26-game matches and get 2-24, 24-2 and 13-13.

So the three samples are -22, +22 and 0. The mean of the squares is (22^2 + 22^2 + 0^2) / 3 = 968 / 3 ≈ 323. Divide by 2 to compute the variance (161). Is that the variance you are talking about? Square the difference of each pair of observations, then compute the mean of that and divide by 2??? I claim that number is _very_ high for computer vs computer chess games. I claim it goes down as the number of games goes up. I claim that until the number of games is far larger than one might originally suspect, the variance is extremely high, making the results highly random and unusable. I can't deal with variance that large and conclude anything useful.

So I agree with your "if you play a large number". But I have been arguing for a large number of games to reduce variance. You have been arguing for a small number of games, which absolutely increases the variance. We just don't seem to agree on how much this increase actually is...

So maybe I am now confused, since I am arguing for large N small sigma squared. So back to the beginning. Do you advocate large N or small N? I will again remind you, I am using 40 positions so that I get a representative cross-section of games, tactical, positional, middlegame attacks, endgame finesses, etc. That requires at least 80 games since you must alternate colors to cancel unbalanced positions. What N do you advocate? I am finding it necessary to run about 5K games per opponent to get N up and sigma^2 down enough to make evaluation accurate. I also choose to use multiple opponents as I don't want to tune to beat X only to lose to Y and Z because X has problems with certain types of positions that skew those results. I am not sure my 20K total games (5K x 4 opponents) is either enough or overkill. But I am absolutely certain that 500 is not enough because of the variance I have measured.

Also, with my cluster, I can test parallel searches, and I would really like (and plan on doing so) to test against parallel search engines to compare the parallel search effectiveness of the parallel engines. But that introduces even more non-determinism and will drive N even larger. And it would be nice to test a book in the same way, again further increasing N. Not to mention pondering. So far I have tried to whittle away every source of non-determinism except for timing for the moves. Interfering with that produces stability but the match results generally do not come close to the match results produced with more games using time limits.
The variance I talk about is the average of the square deviation from the mean, like the 323 you calculate, except that I would calculate it in the limit of infinitely many games (where the result frequencies would be equal to the probabilities, so that I would have calculated the variance of the probability distribution from which the mini-match results were drawn). The variance of a finite sample would be an approximation to this. (If one did not take the average, but N/(N-1) times the average, it would be an even more accurate approximation.) It will tend to a (finite, nonzero) _constant_ as N tends to infinity.

You seem to divide by an extra factor N. That means you don't calculate the variance of the distribution (or sample), but the variance in the sample _average_. Of course that tends to zero when you increase N.

Now your claim is interesting, for I claim that this number should _never_ exceed 0.5*sqrt(M), with M the number of games in a mini-match. (The limit for infinite N, without the extra division by N.) So if by "very large" you mean larger than 0.5*sqrt(M), we definitely are in disagreement on this.

Note that if you really would obtain -22, +22 and 0 in a 26-game match, it would be virtually certain that you made a mistake, and are not measuring what you think you are measuring. (You probably mixed up some executables between the tests.) As the most probable way to bring about such a result would be for a single-game result-distribution of 50-50-0 (+,-,=). As the SD of a single mini-match result would be 5.1 in that case, it means that you experienced a 4.3 sigma deviation (which in itself has a probability of 8.7e-6) not once, but _twice_. The probability for this would be 75e-12, (times 12 for the different permutations of the three results), i.e. you would have witnessed a one-in-a-billion event. More likely that some of your students screwed up. :wink: (Or that your precious cluster would spontaneously levitate, and float through the campus, for that matter! :lol: )
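The same estimate as a quick sketch, under the stated 50-50-0 assumption (every game an independent win or loss with probability 0.5, no draws); the exact binomial tail is shown next to the normal approximation used above:

[code]
import math

games = 26
sigma = math.sqrt(games)                 # SD of the +1/-1 score sum, ~5.1
z = 22 / sigma                           # a -22 (or +22) result is ~4.3 sigma out

def normal_tail(z):
    """One-sided tail probability of a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2))

print("normal approximation:", normal_tail(z))                     # ~8e-6
print("exact binomial tail: ",
      sum(math.comb(games, k) for k in range(3)) / 2 ** games)      # ~5.2e-6
[/code]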

To the beginning:

I don't advocate any N other than what standard statistics tells that one should use. So you might say I advocate that standard statistical theory applies even to computer Chess. Nothing more, nothing less.

Of course standard statistical theory tells you that certain methodologies, such as paired sampling, can have orders of magnitude higher resolving power than two independent samples for determining differences between two probability populations.

The SD of the total score (+1,-1,0 counting) in a 20K-game result will be limited to ~0.8*sqrt(20,000) ≈ 113. (In fact it is significantly less, as some of your 40 positions seem strongly unbalanced.) For the difference of two such results (for your changed and unchanged version) the SD would be 160. In terms of score fraction (0-100%), that would be 0.4%. That corresponds to SD ~ 3 Elo, so with 99.9% confidence you could resolve differences of 10 Elo. If that is what you are after, the number is sufficient. If you test changes worth about 100 Elo, and you would just be interested in whether they gave an improvement, but not in precisely how much (i.e. you don't care if it is 120 or 110 Elo), you could do with 100 times fewer games.
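A sketch of that back-of-envelope calculation, using only the assumptions already stated (a per-game SD of at most 0.8 in +1/-1/0 counting, and the usual ~7 Elo per percentage point near a 50% score):

[code]
import math

games = 20_000
sd_total = 0.8 * math.sqrt(games)        # ~113 points in +1/-1/0 counting
sd_diff = math.sqrt(2) * sd_total        # difference of two independent runs, ~160
sd_fraction = (sd_diff / 2) / games      # as a 0..1 score fraction, ~0.004
sd_elo = sd_fraction * 100 * 7           # ~7 Elo per percentage point near 50%

print(round(sd_total), round(sd_diff), round(sd_fraction, 4), round(sd_elo, 1))
# prints: 113 160 0.004 2.8  ->  ~3 Elo, so ~3 sigma resolves a ~10 Elo difference
[/code]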

I don't think that any of the effects you mention can drive up the needed number of games compared to this calculation, as the calculation already assumed the maximum theoretically possible variance. In practice the variance will be lower than that, because of the unbalanced positions. Book, parallelism and timing variability could also not cause an increase over this smaller variance set by the (unknown) imbalance.

If you see more variance than that, your testing is faulty.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

hgm wrote:
bob wrote:If you believe that, then we should stop the discussion. Your most telling comment above is "if I take a sample size that is large enough..." That is where we will simply have to disagree. Because I understand the idea that for _any_ position, it is possible for a program to play different moves if the search is based on a time limit. And since probabilities are additive, the probability of producing the same game twice is extremely low. So we are simply going to disagree on how large the sample size has to be.
So actually you don't disagree at all. You agree: you admit that there _exists_ a sample size that would do the job. As this is what "large-enough" by definition means...
When have I ever said anything different from that. I have _consistently_ said that a large sample size is needed. We have been arguing about that point for days. Have we somehow changed positions, supposedly? I haven't.

I'm not going to be overly concerned. I have enough data to have a good idea of what is needed and what won't cut the mustard. Time will eventually prove one of us wrong.

I don't see how testing over _positions_ is going to produce a qualitative answer as to "better or worse" where playing a complete _game_ answers that idea pretty succinctly. I want a test methodology that I can execute and get a "go/no-go" answer with no effort on my part. No results to look at, no positions to compare answers to, just "go" or "no-go". I have that test paradigm functioning, and that is all I want. Minimum effort on my part to evaluate a change. Just compile, submit, and wait for the results to come back summarized...
Well, like I said, good for you. But not really of any interest to readers of this forum, as most of them can't or don't want to come up with a million dollars to buy 256 high-end PCs.
You are completely missing the point. Hint: The point is not that you have to buy a big cluster to properly test. The point is that taking small samples of games leads to gross errors. If you can't play enough to shrink the error to an acceptable level, you must realize that the small sample size you do use has an error large enough to make using it impractical.

Just look at all the discussions here about "my 16 game round-robin shows that program X has had no improvement over the previous version." 16 games doesn't show much of anything.


For us it is important to use more advanced testing methodology, that does not waste that much computer time, as for us computer time is a limited resource.
Then this is a hopeless discussion. I don't have any "noise" to speak of because I play enough games to drown the noise out. That is the point. And I don't have to spend any massive amount of time going over the results to decide good or bad.
I would say that you have exactly as much noise as anyone else, but you have developed a (very expensive) method to live with it. By averaging it out of the data. And we both know that that is _quadratically_ expensive...
Did you ever stop to think why almost _everybody_ relies on games to make these decisions? The commercial authors have discussed this here from time to time. CT once mentioned the exact same thing I am discussing, that testing this way is the most critical aspect of developing a chess program. We develop them to play games. They ought to be evaluated doing the thing they were designed to do.
I am not overly concerned about that. Man is a creature of habit, and not many people have imaginative ideas. People that are blessed to have such ideas in one area very often fail to have them in other areas. Good programmers, or good Chess players, are not necessarily schooled in statistics and data analysis.

Your last remark basically dismisses 99% of all scientific research ever done. One almost never gets any valuable data from measurements under real-life conditions, as these conditions involve way too many uncontrolled, unpredictable and unknown factors influencing the process of interest. So one always tries to isolate the system from its usual environment, and replaces the latter by a controlled environment where some properties are varied, others kept constant, but all are precisely known. Standard practice...

The thing you are advocating is not science, but alchemy! :lol:
Then answer me this. Old version produces a result of 45-35. New version produces a result of 35-45. Do you keep the change or not? You run the test again, and get 40-40. Now what? I have explained that multiple times. You never address that question, instead changing the topic to something that is not directly related to the original point I raised.
Well, easy enough. Without a conspicuous absence of draws I would assume an upper limit to the SD of the result of 3.6. The difference in score is 10 pts, or 2.77*sigma (or more, if sigma is unexpectedly small). But in a difference of two independent measurements, the variances are added, so sigma is a factor sqrt(2) larger, and 10 points is only 1.96*sigma. So I _know_ that there is only a probability of 2.5% that an equal program could achieve this much score improvement by sheer luck. That would be enough for me to keep the change and start working on the next one, as with the test equipment I have,

You ignored the _second_ result. In both cases, the old version scored better. I could show you the second pair of runs from that test where the second version scored far better. The point being that using either of those two results leads you to the _wrong_ conclusion, when you run a much longer test to eliminate most of the randomness.


requiring even more taxing standards of confidence would be very counterproductive, as it would slow me down enormously (the "paranoia effect"). I would only be wrong once in every 40 times, and I would not even take a step back then, as this is the probability of accepting an _equal_ change. The probability that I would accept a change that was actually detrimental is much lower. And even if I would take one step backward for every 40 steps forward, I would progress at a rate of 38 improvements per 40 successful test runs. If I could beef up the confidence from 97.5% to 100% by testing twice as long, I would improve by 40 steps in 80 test runs, only about half as fast...

I suppose you repeat the test with the changed version. So it is now 40-40. Well, that is 5 points different from the first one, or 1 sigma. Not something to start doubting the validity of the test (e.g. through virus infection of the computer), which I would certainly do if the second result had been 0-80. In fact the new test (weakly) confirms the old one, as 40-40 is still better than 35-45. But the only meaningful quantity is the sum of the tests, which should have a variance twice that of the original test (because it is now a 160-game match), but in the _average score_ of the two tests the variance would be only half the original one, as averaging entails division by 2. The difference between A and A' would then have an SD of sqrt(1+0.5)*3.6. But the average score difference would now be only 7.5 points. This means you are off by 1.7*sigma now. So your confidence has in fact decreased a little. But not enough to reject it.

Note that if I would be worried after the second test, I would not play more games with the changed version, but with the original one, as the SD in the difference starts to be dominated by that in the measurement of the old version.

I hope that answers your question.

As far as I am concerned the "point you originally raised" is that we could not say with 97.5% confidence that an engine scoring 15 out of a 20-game gauntlet is better than one scoring 5 out of 20. If you are discussing something different now, it must be you who somehow changed topic.
Again, I want something I can execute which comes back with "go / no-go" with respect to accepting/rejecting a new change to Crafty. I don't want any extra subjective effort required. I want the test to measure Crafty as closely as possible to the way it will play in a tournament. Etc.

Somehow that is getting overlooked...
Not at all. If I were to adopt a better test methodology, I would of course also fully automate it. There is nothing in my scheme that precludes that, or even makes it difficult.
hgm wrote:If I want to know how much on average children grow between their 8th and 9th birthday, the most stupid thing to do is select N children at 8 and measure their average length, and then independently select N children at age 9 to make a new average, and then take the difference.
Aha. You get my point. That is _exactly_ what _you_ are doing with a small sample size. I am playing enough games so that I take most of the available children in the 8-year-old age group, so that I _do_ get better answers. How can you misunderstand that?
Again, it is you who seems to misunderstand something here: N was not specified in my example, so _if_ I were doing this with small N, and you were doing this with large N, we would still _both_ be doing exactly what the example said. But I am not doing that at all, of course. Like I said, it is the most stupid way to do it, so why should I do it? It is what _you_ do, and the fact that you can finally get a precise answer by taking an excessively large N does not make it any less stupid. For the quality of the methodology is not measured by the precision that you finally get (which in every methodology can always be made arbitrarily small), but by the value of N needed to obtain this accuracy.
I'm not for small sample sizes at all. Never have been throughout this discussion... You have been the one advocating the smaller samples, not me. I _know_ I need to sample enough of the population to get a good estimate of the actual mean.
So where does the notion come from that N was supposed to be small? Do you actually react to what I post, or does your mind create a virtual me that claims quite different things than what I write, and are you actually replying to those phantom claims? :shock:
I sometimes wonder if you are reading what _you_ write. I've been advocating large N since the start of the original thread. You have been advocating small N. Repeatedly. So that's where I get the idea that N was supposed to be small according to you. Because you have repeatedly said that.
hgm wrote: Due to the large variance in the length of children in the population, you would need N to be many millions before the standard error would give you any significance on the difference. What one does instead is select N children, and measure these same children both at age 8 and at age 9. Then you can reach the same level of accuracy with just N=100 or so.

Accuracy can come from using smart methodology just as much as, if not more than, from sample size.
Baloney. If the sample size is too small, all the T-tests and chi-square tests in the world are not going to help because of the standard error... I'm going for an error estimate small enough to give me confidence in my decision on whether to keep changes or not. I am not going to play 32 games in a mini-round-robin event to make that decision, the data is no good.
It seems you still don't grasp the example I have given, or you would not write such nonsense. Let me make it more concrete, then:

Suppose the length of children in the population of the Netherlands had a SD of 20cm. Suppose they grow on average 3 cm between their 8th and 9th birthday, with a SD of 1cm.

To recover these numbers by sampling, the methodology described directly above ("_my_" methodology) would take 100 8-yr olds, measure them, measure them again one year later, and take the average. That average would then have an SD of 1cm/sqrt(100) = 0.1 cm.
On the other hand, _your_ methodology of just taking large-enough samples of independently chosen 8- and 9-year olds would require, to get the same accuracy of 0.1cm, an SD in the individual length averages of 0.071 cm, which is 1/283 of the SD of an individual length. So your sample size would have to be 283^2 ≈ 80,000 for both age groups. In our 16M population that means pretty much the entire population of 8 and 9-year olds. (So it would take a year anyway, to wait for their birthdays...) Not do-able, or horrendously expensive. While I only had to take 100 children, and get the same result. Or take 1000, if I wanted to do 3 times better...
Apples and oranges. Computer vs computer games have _inherent_ randomness. By trying to artificially eliminate the randomness that comes from time allocation, or from decisions about when to start or not start a new iteration, you just pick a small sample. But suppose the 100 children _you_ pick come from an area where all the buildings are painted using lead paint? Is your sample more valid than mine, which is larger and represents a more average cross-section of the population rather than just the kids suffering from ultra-high levels of lead in their system?




So indeed, it is very possible that a 1000-element sample produces much better accuracy than a 160,000-element sample. Provided the sample is chosen in a smart, rather than a dumb way.


Unfortunately, I do not believe it is possible to choose a small sample of games from a population that is very large and random in nature, and do so in an intelligent way. How does one intelligently choose a small sample from random values? I've never seen such a methodology.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

Uri Blass wrote:Note that evaluation changes, like changing weights or adding an evaluation factor that is relevant in all parts of the game, are clearly changes for which you need many games to make sure that they are better, but not all changes in chess programs are changes of that type.

There are still changes that are simply about making the program faster (even if it is not exactly x times faster with the same analysis).

For example, I search checks in the first ply of the qsearch, but I still have no function to generate only checks and captures, so I simply generate all moves and find the checks and captures among them.

Having a function to generate only captures and checks should be an obvious improvement.

I cannot say that it makes the program exactly x times faster, because the order of generating moves may be different, but I think that in this case I do not need thousands of games to verify that there is an improvement.

The only way not to get an improvement from that type of change is if I have a bug in my code, and I have other tests to verify that there are no bugs.

Games are also needed because maybe there is some bug that happens only in games, but my assumption is that there is an improvement if I see better results in games.

Note that my evaluation is relatively expensive, so I do not expect this change to give a big speed improvement, but I do expect some speed improvement (I will see how much improvement I get when I compare analysis by the new version, which is not finished yet, with analysis by the old version, and see how much faster the new version is).

Uri
That's all old news. And it has already been discussed/dismissed. Anything that purely speeds up a program, without changing the size/shape of the tree in any way, has to be good. There are no known examples where going faster is worse (in chess) except for an occasional pathological case where going one ply deeper leads you to change to a move that is actually worse. But everyone believes that faster is better, all else remaining constant, so that's not what we are talking about here...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

hgm wrote:Oops, I see that I have overlooked this post, because it was already "snowed under" by Nicolai's posts by the time I looked:
bob wrote:I was originally talking about variance in the engineering sense. But I have been using the usual statistical term recently. In that, when we started testing on our cluster, I could run 26-game matches and get 2-24, 24-2 and 13-13.

So the three samples are -22, +22 and 0. The mean of the squares is (22^2 + 22^2 + 0^2) / 3 = 968 / 3 ≈ 323. Divide by 2 to compute the variance (161). Is that the variance you are talking about? Square the difference of each pair of observations, then compute the mean of that and divide by 2??? I claim that number is _very_ high for computer vs computer chess games. I claim it goes down as the number of games goes up. I claim that until the number of games is far larger than one might originally suspect, the variance is extremely high, making the results highly random and unusable. I can't deal with variance that large and conclude anything useful.

So I agree with your "if you play a large number". But I have been arguing for a large number of games to reduce variance. You have been arguing for a small number of games, which absolutely increases the variance. We just don't seem to agree on how much this increase actually is...

So maybe I am now confused, since I am arguing for large N small sigma squared. So back to the beginning. Do you advocate large N or small N? I will again remind you, I am using 40 positions so that I get a representative cross-section of games, tactical, positional, middlegame attacks, endgame finesses, etc. That requires at least 80 games since you must alternate colors to cancel unbalanced positions. What N do you advocate? I am finding it necessary to run about 5K games per opponent to get N up and sigma^2 down enough to make evaluation accurate. I also choose to use multiple opponents as I don't want to tune to beat X only to lose to Y and Z because X has problems with certain types of positions that skew those results. I am not sure my 20K total games (5K x 4 opponents) is either enough or overkill. But I am absolutely certain that 500 is not enough because of the variance I have measured.

Also, with my cluster, I can test parallel searches, and I would really like (and plan on doing so) to test against parallel search engines to compare the parallel search effectiveness of the parallel engines. But that introduces even more non-determinism and will drive N even larger. And it would be nice to test a book in the same way, again further increasing N. Not to mention pondering. So far I have tried to whittle away every source of non-determinism except for timing for the moves. Interfering with that produces stability but the match results generally do not come close to the match results produced with more games using time limits.
The variance I talk about is the average of the square deviation from the mean, like the 323 you calculate, except that I would calculate it in the limit of infinitely many games (where the result frequencies would be equal to the probabilities, so that I would have calculated the variance of the probability distribution from which the mini-match results were drawn). The variance of a finite sample would be an approximation to this. (If one did not take the average, but N/(N-1) times the average, it would be an even more accurate approximation.) It will tend to a (finite, nonzero) _constant_ as N tends to infinity.

You seem to divide by an extra factor N. That means you don't calculate the variance of the distribution (or sample), but the variance in the sample _average_. Of course that tends to zero when you increase N.

Now your claim is interesting, for I claim that this number should _never_ exceed 0.5*sqrt(M), with M the number of games in a mini-match. (The limit for infinite N, without the extra division by N.) So if by "very large" you mean larger than 0.5*sqrt(M), we definitely are in disagreement on this.

Note that if you really would obtain -22, +22 and 0 in a 26-game match, it would be virtually certain that you made a mistake, and are not measuring what you think you are measuring. (You probably mixed up some executables between the tests.) As the most probable way to bring about such a result would be for a single-game result-distribution of 50-50-0 (+,-,=). As the SD of a single mini-match result would be 5.1 in that case, it means that you experienced a 4.3 sigma deviation (which in itself has a probability of 8.7e-6) not once, but _twice_. The probability for this would be 75e-12, (times 12 for the different permutations of the three results), i.e. you would have witnessed a one-in-a-billion event. More likely that some of your students screwed up. :wink: (Or that your precious cluster would spontaneously levitate, and float through the campus, for that matter! :lol: )

To the beginning:

I don't advocate any N other than what standard statistics tells that one should use. So you might say I advocate that standard statistical theory applies even to computer Chess. Nothing more, nothing less.

Of course standard statistical theory tells you that certain methodologies, such as paired sampling, can have orders of magnitude higher resolving power than two independent samples for determining differences between two probability populations.

The SD of the total score (+1,-1,0 counting) in a 20K-game result will be limited to ~0.8*sqrt(20,000) ≈ 113. (In fact it is significantly less, as some of your 40 positions seem strongly unbalanced.) For the difference of two such results (for your changed and unchanged version) the SD would be 160. In terms of score fraction (0-100%), that would be 0.4%. That corresponds to SD ~ 3 Elo, so with 99.9% confidence you could resolve differences of 10 Elo. If that is what you are after, the number is sufficient. If you test changes worth about 100 Elo, and you would just be interested in whether they gave an improvement, but not in precisely how much (i.e. you don't care if it is 120 or 110 Elo), you could do with 100 times fewer games.

I don't think that any of the effects you mention can drive up the needed number of games compared to this calculation, as the calculation already assumed the maximum theoretically possible variance. In practice the variance will be lower than that, because of the unbalanced positions. Book, parallelism and timing variability could also not cause an increase over this smaller variance set by the (unknown) imbalance.

If you see more variance than that, your testing is faulty.
Or perhaps the data is a bit more random than you give it credit for being???

That's certainly what I am seeing.
nczempin

Re: An objective test process for the rest of us?

Post by nczempin »

bob wrote: Just look at all the discussions here about "my 16 game round-robin shows that program X has had no improvement over the previous version." 16 games doesn't show much of anything.
If the discussions do indeed make such a claim, they would be wrong. All they could say would be "after my 16 game RR an improvement has not been shown".

Which neither proves that there was an improvement, nor that there wasn't. All you can conclude is that you can't say yet.

However, it is entirely possible after 16 games to show such an improvement: If the first version went 0-16, and the other version went 16-0, there is significant evidence that the new version is stronger against the same set of opponents and under the same circumstances.

Whether any conclusions can be drawn as to whether this means that the engine has improved within the larger context, is a question that is orthogonal to the basics of such a result.

And it is entirely possible to draw the same kind of conclusions, with slightly lower confidence, when the result is not 16-0 but, say, 15-1 or whatever. And exactly what you can conclude at what confidence level is what basic statistical principles allow you to find out.
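As an illustration of that last point (my own sketch, not anyone's actual test harness): if you ignore draws and treat every game as an independent trial with the same unknown win probability for both versions, conditioning on the total number of wins turns this into a Fisher-style exact test, computable by hand from the hypergeometric distribution:

[code]
import math

def fisher_one_sided(wins_new, wins_old, games=16):
    """P(the new version collects at least wins_new of the total wins by chance)."""
    total = wins_new + wins_old
    denom = math.comb(2 * games, total)
    return sum(math.comb(games, k) * math.comb(games, total - k)
               for k in range(wins_new, min(games, total) + 1)) / denom

print(fisher_one_sided(16, 0))   # ~1.7e-9 for 16-0 vs 0-16
print(fisher_one_sided(15, 1))   # ~4.3e-7 for 15-1 vs 1-15
[/code]

Even the 15-1 case is far beyond any reasonable significance threshold, which is the point: 16 games can be enough when the observed difference is that extreme, and the same machinery tells you when it is not.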

I haven't seen you explain convincingly why these basic statistics should be treated differently from testing whether dice are loaded, or whether any seemingly random events are really random or not.