how to do a proper statistical test

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: how to do a proper statistical test

Post by hgm »

bob wrote:It makes testing far more intractable than if you just accurately test each change the first time. Because now you are faced with rerunning past tests by removing one feature at a time. And hoping your inaccuracies won't cover up the problem a second time.
Well, performance usually doesn't come for free. It sounds like a small price to pay for highly accelerating the development rate. Like I said, one would have to re-test accepted changes after some time anyway, if the engine changes so much that non-linear interaction between the features becomes important.

In uMax 4.0 the recapture extension gave an improvement. After having added an all-capture QS, instead of a recapture-only QS, the recapture extension was suddenly counter-productive. One can never take anything for granted.

Anyway, I just point out that methods that take backward steps can be very viable, and for many problems superior to methods that at any cost try to prevent such steps. There is a continuous range of confidence, and there is an optimum along the confidence scale. Being satisfied with too low a confidence is counterproductive because there are too many backward steps. Asking for too much confidence is counterproductive because individual steps take too long. The optimum is a compromise, and where exactly that compromise lies is both a function of the relative cost of testing and implementing new ideas (very different for a 256-Xeon cluster and a single laptop), and of the steepness of the learning curve to which it is applied (very different for a 1600-Elo engine than for a 2700-Elo engine).

For people on one end of the scale, 100-game tests can easily be the best solution, provided you are aware of the risks and adopt a strategy to live with them. A 100-Elo improvement scores nominally 64%, which is 2.5 sigma for the difference between two 100-game gauntlets. That means you will pick it out with 99% confidence. You are not interested in 1-Elo improvements anyway; they won't bring you from 1600 to 2700...
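
To spell out that arithmetic, here is a quick sanity check (my own throwaway sketch; the per-game standard deviation of 0.4 points is an assumption that is roughly right when a normal fraction of the games are drawn):

Code: Select all

/* Quick sanity check of the numbers above. The per-game standard deviation
 * of 0.4 points is an assumption (roughly right with a normal draw rate). */
#include <math.h>
#include <stdio.h>

int main(void) {
    double elo_gain   = 100.0;
    double expected   = 1.0 / (1.0 + pow(10.0, -elo_gain / 400.0)); /* ~0.64 */
    double sigma_game = 0.4;   /* assumed per-game standard deviation */
    int    games      = 100;

    /* Standard error of the difference between two independent
     * 100-game gauntlet scores, in points per game. */
    double se_diff = sqrt(2.0 * sigma_game * sigma_game / games);
    double z       = (expected - 0.5) / se_diff;   /* ~2.5 sigma */

    printf("expected score           : %.3f\n", expected);
    printf("SE of gauntlet difference: %.3f\n", se_diff);
    printf("z-score                  : %.2f sigma\n", z);
    return 0;
}

With these assumptions the 14% score difference comes out at about 2.5 standard deviations, as quoted.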

Always optimize your methodology to the problem at hand.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: how to do a proper statistical test

Post by bob »

hgm wrote:
bob wrote:It makes testing far more intractable than if you just accurately test each change the first time. Because now you are faced with rerunning past tests by removing one feature at a time. And hoping your inaccuracies won't cover up the problem a second time.
Well, performance usually doesn't come for free. It sounds like a small price to pay for highly accelerating the development rate. Like I said, one would have to re-test accepted changes after some time anyway, if the engine changes so much that non-linear interaction between the features becomes important.
We are not talking the same language. If small runs lead to the necessity of more runs to figure out where the errors were made, then this is not an accelerated development rate. It is _slower_. From a software engineering perspective it is far better to weed out the problems as they are added, rather than having to go back and weed them out later, because it is harder to do that...

I would much rather weed out bad changes as I test them, rather than letting them creep into the code undetected, and then try to figure out what is broken later on. At the point of insertion, an accurate go/no-go prevents almost all such changes from getting in. If you wait for a month, how do you figure out which of the supposedly good changes is really bad without orders of magnitude more tests as you randomly remove things which were recently added? Sounds ugly. I'd prefer to do it once and move on. At a faster pace than trying to decipher bad changes well after the fact...


In uMax 4.0 the recapture extension gave an improvement. After having added an all-capture QS, instead of a recapture-only QS, the recapture extension was suddenly counter-productive. One can never take anything for granted.
No argument there. I do that kind of testing all the time. But not for individual evaluation parameters.

Anyway, I just point out that methods that take backward steps can be very viable, and for many problems superior to methods that at any cost try to prevent such steps. There is a continuous range of confidence, and there is an optimum along the confidence scale. Being satisfied with too low a confidence is counterproductive because there are too many backward steps. Asking for too much confidence is counterproductive because individual steps take too long. The optimum is a compromise, and where exactly that compromise lies is both a function of the relative cost of testing and implementing new ideas (very different for a 256-Xeon cluster and a single laptop), and of the steepness of the learning curve to which it is applied (very different for a 1600-Elo engine than for a 2700-Elo engine).

For people on one end of the scale, 100-game tests can easily be the best solution, provided you are aware of the risks and adopt a strategy to live with them. A 100-Elo improvement scores nominally 64%, which is 2.5 sigma for the difference between two 100-game gauntlets. That means you will pick it out with 99% confidence. You are not interested in 1-Elo improvements anyway; they won't bring you from 1600 to 2700...
I won't argue about recognizing 100 Elo improvements.

However, I am pretty certain I am way beyond finding that kind of improvement in Crafty. My changes for several years now have been slowly evolutionary, not revolutionary. Even the reduction stuff was a modest improvement, nowhere near 100 Elo.

So I (and most of us) are stuck with recognizing much smaller improvements, which is necessarily much more difficult to accurately discover.
Always optimize your methodology to the problem at hand.
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: how to do a proper statistical test

Post by hgm »

bob wrote:We are not talking the same language. If small runs lead to the necessity of more runs to figure out where the errors were made, then this is not an accelerated development rate. It is _slower_. From a software engineering perspective it is far better to weed out the problems as they are added, rather than having to go back and weed them out later, because it is harder to do that...
With such qualitative arguments we could argue forever.

But one can actually _calculate_ how fast it is. When you want to evaluate N small changes (i.e. each giving a small change in playing strength), on these N different aspects you can choose either the better or the worse version. With the "high-confidence testing" that you champion, it would take N tests to figure out which is which (assuming the changes are additive).

If, instead, we do it through the random-walk approach, randomly selecting one of the N features in every step to test the opposite variant, we have a probability p to accept the worse one. If the current quality of our engine is Q, meaning we have included Q of the N improvements, the probability that we select a good feature for deletion is then Q/N*p, while the probability that we select a currently bad feature for improvement is (N-Q)/N*(1-p). These are transition probabilities in a "Markov chain", and the development process will be a biased random walk through this chain. The average progress per attempted step is (1-Q/N)*(1-p) - Q/N*p = 1 - p - Q/N. The equilibrium point will occur when Q/N = 1-p (i.e. with 95%-confidence testing, p = 0.05, and improvement would stop after incorporation of 95% of the beneficial changes).

From this we can see that a significant slowdown of the development (through the effect you mention) only occurs when Q/N starts to approach one, i.e. when a large part of the possible improvements has already been incorporated.

And this is exactly the difference between an engine at the bottom of the learning curve, like Eden, and an engine at the top, like Crafty. For Eden Q/N << 1, and 1-p is a good approximation for the improvement per test.
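
For anyone who wants to watch that equilibrium appear, here is a throwaway simulation of the random walk (the values of N, p and the number of steps are arbitrary choices, purely for illustration):

Code: Select all

/* A throwaway simulation of the random walk described above: N independent
 * features, and every test accepts the worse variant with probability p.
 * Q/N should level off near 1-p. */
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const int    N     = 100;    /* number of small, additive features      */
    const double p     = 0.05;   /* probability of accepting the worse side */
    const int    steps = 20000;  /* attempted changes (i.e. tests)          */
    int good[100] = {0};         /* 1 = the better variant is currently in  */
    int Q = 0;                   /* how many improvements are incorporated  */

    srand(12345);
    for (int s = 0; s < steps; s++) {
        int    f = rand() % N;                   /* feature to test          */
        double r = (double)rand() / RAND_MAX;
        if (good[f]) {                           /* testing the worse side   */
            if (r < p)       { good[f] = 0; Q--; }   /* wrongly accepted     */
        } else {                                 /* testing the better side  */
            if (r < 1.0 - p) { good[f] = 1; Q++; }   /* correctly accepted   */
        }
        if ((s + 1) % 5000 == 0)
            printf("after %5d tests: Q/N = %.2f (equilibrium ~ %.2f)\n",
                   s + 1, (double)Q / N, 1.0 - p);
    }
    return 0;
}

With p = 0.05 the simulated Q/N settles close to 0.95, as the formula predicts.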

I would much rather weed out bad changes as I test them, rather than letting them creep into the code undetected, and then try to figure out what is broken later on. At the point of insertion, an accurate go/no-go prevents almost all such changes from getting in. If you wait for a month, how do you figure out which of the supposedly good changes is really bad without orders of magnitude more tests as you randomly remove things which were recently added? Sounds ugly. I'd prefer to do it once and move on. At a faster pace than trying to decipher bad changes well after the fact...
Considering what I said above, it is quite understandable that you would prefer that, given the Q/N you have to deal with in Crafty. But for those of us for whom Q/N is still very small, the situation is totally reversed. As the strength of your engine goes up, the confidence level needed for optimally fast improvement goes up as well.
So I (and most of us) are stuck with recognizing much smaller improvements, which is necessarily much more difficult to accurately discover.
Well, the 'most of us' is debatable here. Most engines actually have ratings (far) below 2000, so we still have many hundreds of Elo to go. That won't happen through steps of 10.
hristo

Re: how to do a proper statistical test

Post by hristo »

hgm wrote: But one can actually _calculate_ how fast it is. When you want to evaluate N small changes (i.e. each giving a small change in playing strength), on these N different aspects you can choose either the better or the worse version. With the "high-confidence testing" that you champion, it would take N tests to figure out which is which (assuming the changes are additive).
This gets a bit more difficult when the changes we introduce are not orthogonal to one another with respect to playing strength. In this sense, if by 'additive' you also understand (imply) that the changes have no inherent connection with each other, then the process to weed out the 'bad' changes and leave only the 'good' ones is well described in your post. Thank you.
hgm wrote: If, instead, we do it through the random-walk approach, randomly selecting one of the N features in every step to test the opposite variant, we have a probability p to accept the worse one. If the current quality of our engine is Q, meaning we have included Q of the N improvements, the probability that we select a good feature for deletion is then Q/N*p, while the probability that we select a currently bad feature for improvement is (N-Q)/N*(1-p). These are transition probabilities in a "Markov chain", and the development process will be a biased random walk through this chain. The average progress per attempted step is (1-Q/N)*(1-p) - Q/N*p = 1 - p - Q/N. The equilibrium point will occur when Q/N = 1-p (i.e. with 95%-confidence testing, p = 0.05, and improvement would stop after incorporation of 95% of the beneficial changes).
How would this work, mathematically speaking, if some set of Q -> Q' is dependent on some other set Q'' where those sets are not unique?

The above (Markov) model seems to be tailored for optimization in the context of a given function (state). However, it is possible to introduce multiple changes in the chess engine that, due to a lack of proper and timely testing and the consequent selection for or against them, become codependent, and that eventually can only be removed as a group (rather than individually) in order to evaluate playing strength.

Regards,
Hristo
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: how to do a proper statistical test

Post by bob »

nczempin wrote:
bob wrote: I don't see why it is hard to understand why 1 in 20 is going to cause lots of trouble. 1 in 100 will occasionally cause trouble. It burned me in 1986.
I don't even have 20 releases of my software yet; why should I be worried about that small an error?
Easy. Think toward the future, or live and die in the past. I have, in the past 10 years, produced a new version every 2 days, or every 2 weeks.


And surely one tournament out of how many is not such a bad result? I think you are wrong in attributing all of whatever burned you in 1986 to your result. It could just as well have been that one thing in 20 that was wrong, but all the other things were right, surely more than 20.
Sorry, but there you don't know a thing about what you are talking about. We tracked the "problem" back to one change that showed up as good over a 110-game test prior to the WCCC. But it was really bad. Once we had positions where you look and think "why in the world did it not push the pawn there??", then you can start debugging, and we came back to the 4 lines of code we added for pawn holes that looked good, but were actually bad.


How can, over the long run, such a small thing make such a large difference? Especially since you are aware of it and have been able, after more testing, to correct the mistake.

Again, I don't know what you are talking about. We ran right at 110 "test games" before I left Hattiesburg on the way to UAB to work on my PhD. About 6 weeks later we played in the Denver ACM event and did poorly, but with just 5 games in the sample we thought "OK, several are catching up to our speed, so we are not going to do as well as we were expecting." Then we went to Cologne with the same program and the first two rounds went badly (although we won the first, we played some horrible moves). Then we started looking at "why?" and found the answer.


It is not like I test each change by itself. I always test the engine holistically. But you already know that.
I understand that. It's a poor way of testing, according to any software principle you care to quote, but it is a method one can choose to use.



Case in point: my engine still uses straightforward alpha-beta, no PVS. I tried including PVS; it got too complicated for that particular version of the source code in relation to the test results, which didn't show a significant improvement. How significant would going back from PVS to alpha-beta be for Crafty?
PVS is a very modest improvement over normal alpha-beta. The effect would not be that significant, but it would be worse.

We just live in different worlds.

I have received enough information from people that seem to be able to imagine themselves in my shoes that I don't need your approval. You seem to be unable to lower yourself down to my level, and I will just live with that from now on.
Up to you. You can either listen to people that have done it far longer than yourself, and who have far more experience, or you can re-invent the wheel multiple times and waste the effort in doing so...

Some prefer to follow that path. It will work. Just _far_ slower...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: how to do a proper statistical test

Post by bob »

nczempin wrote:
bob wrote: If your main concern is to get stronger as quickly as possible, with as few backward steps as possible, then you need a pretty accurate measurement to avoid those giant steps back that you don't even realize you took.
For the hundredth time, my main concern is not to get stronger as quickly as possible, but to get stronger gradually, and actually learn something along the way, always still knowing what I'm doing. This isn't a full-time job for me.
Then pardon my bluntness but that is a stupid way of thinking. Did you take one university course a year? Why not? That would be a "gradual" way of learning. Why would you take all the required courses over a 4-5 year span when you could take them over 25 years? Time is apparently not important, based on your comment above.

Most want to improve as quickly as possible for as long as possible. If that is not your intent, no wonder we can't agree on testing procedures. Nobody I know of would want to optimize a testing procedure that works as you are describing...




Incidentally, why are those 1 in 20 steps back suddenly "giant"?


Because they are _very_ difficult to uncover after they are combined with 19 other changes. It is far easier to carefully verify "good"/"bad" when the change is done, rather than after it and 19 other changes are done and we figure out one is bad but we have no idea which one.


I am not doing computer chess research; I am just implementing stuff that has been known for 30 years. What giant steps backwards are you envisioning that I will take?
I can think of about 10,000 that I took, and discovered before going on...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: how to do a proper statistical test

Post by bob »

hgm wrote:
nczempin wrote:Incidentally, why are those 1 in 20 steps back suddenly "giant"?
The point is that 1 in 20 is the acceptance probability for a _neutral_ change. A change for the worse has a lot lower acceptance probability. A "giant step" backward has as much hope as a snowball in Hell...
Nope. You misunderstood. A "giant step backward" is a step backward that requires a _lot_ of effort to undo. And if you do 19 good steps, and 1 bad one (and it doesn't have to be a really bad one, just a bad one, that you didn't catch), it takes more effort later to find it than it did at the instant it was introduced.

I'm not talking about a -200 elo screw-up. Never said that. Never implied that. If it takes me a day to make a baby step forward or backward, and recognize it, then when it takes a week or more for me to discover and fix a mistake, that mistake was a giant step backward in terms of effort expended for what I got in return, which was primarily a return to an older version.

Again, given the choice, a little more effort to avoid making mistakes as frequently is the better option if the cost of those occasional mistakes ends up being larger than the effort of weeding them out as they happen.

It's a basic premise of software engineering...
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: how to do a proper statistical test

Post by hgm »

Somehow I can't really relate to that. When I add new evaluation terms, they are usually pretty much independent of the terms that were already there. And I usually add them in #ifdef ... #endif brackets, so that I only have to flip a compiler switch to remove them again. Same thing if I replace a term, except that there then also is an #else. Doesn't really strike me as "a _lot_ of work to undo".
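
The pattern is as simple as this (a contrived sketch; the "rook on open file" term and its numbers are made up purely to show the idea):

Code: Select all

/* A contrived sketch of the switch; the "rook on open file" term and its
 * numbers are made up, purely to show the pattern. */
#include <stdio.h>

#define NEW_ROOK_TERM        /* comment this out to get the old eval back */

int evaluate_rooks(int rooks_on_open_files) {
    int score = 0;
#ifdef NEW_ROOK_TERM
    score += 25 * rooks_on_open_files;   /* experimental bonus */
#else
    score += 10 * rooks_on_open_files;   /* old, conservative bonus */
#endif
    return score;
}

int main(void) {
    printf("rook eval with 2 rooks on open files: %d\n", evaluate_rooks(2));
    return 0;
}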

Furthermore, even if undoing a bad step takes 7 times the amount of testing as accepting it, if it happened only once every 20 times, you would need 27 tests to take 20 forward steps. That would still be faster than making 20 "guaranteed" forward steps per 20 test runs, if the test runs to achieve the higher confidence took twice as long.

Such details only shift the optimum. They don't alter the fact that there _is_ an optimum, and that both testing less accurately than this optimum and testing more accurately than it will be counter-productive.
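
Just to put numbers on that, a back-of-the-envelope comparison, taking the figures above as assumptions (one bad acceptance per 20 tests, 7 extra tests to undo it, and the high-confidence test costing twice as much):

Code: Select all

/* Back-of-the-envelope comparison, using the figures above as assumptions:
 * the low-confidence test wrongly accepts 1 step in 20, undoing it costs 7
 * extra tests, and the high-confidence test takes twice as long to run. */
#include <stdio.h>

int main(void) {
    double p    = 1.0 / 20.0;  /* chance of accepting a bad step            */
    double undo = 7.0;         /* extra tests needed to undo a bad step     */
    double slow = 2.0;         /* relative cost of one high-confidence test */

    /* Low confidence: each attempt costs 1 test plus p*undo of cleanup, and
     * keeps (1-p) forward steps on average. */
    double fast_cost_per_step = (1.0 + p * undo) / (1.0 - p);

    /* High confidence: one kept forward step per test, at 'slow' units each. */
    double slow_cost_per_step = slow;

    printf("low-confidence testing : %.2f test-units per kept step\n",
           fast_cost_per_step);
    printf("high-confidence testing: %.2f test-units per kept step\n",
           slow_cost_per_step);
    return 0;
}

With these numbers the low-confidence regime comes out at about 1.4 test-units per kept forward step, versus 2.0 for the high-confidence one.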
hristo

Re: how to do a proper statistical test

Post by hristo »

hgm wrote:Somehow I can't really relate to that. When I add new evaluation terms, they are usually pretty much independent of the terms that were already there.
This is the part that I question. It is very possible and IMO very likely that as you add more and more terms some of them will become implicitly dependent on others. This dependency will be exhibited in the solution space even when there is no explicit implementation dependence.

If this is actually the case then performing exhaustive tests at each step would seem to be a better approach, rather than having to untangle the mess at a later time.

Earlier you said that engines that implement a small set of features (terms) Q -- where Q/N << 1 -- would not require as much testing as those where Q/N approaches 1-p. This is probably true; however, due to unavoidable interdependencies that will be introduced with additional features, the threshold where "full testing" should commence is probably much lower than Q/N == 1.

Where that threshold 'is' is not clear to me, so I wonder if there is some rough way to estimate it.

Regards and Good morning,
Hristo
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: how to do a proper statistical test

Post by bob »

hgm wrote:Somehow I can't really relate to that. When I add new evaluation terms, they are usually pretty much independent of the terms that were already there. And I usually add them in #ifdef ... #endif brackets, so that I only have to flip a compiler switch to remove them again. Same thing if I replace a term, except that there then also is an #else. Doesn't really strike me as "a _lot_ of work to undo".

Furthermore, even if undoing a bad step takes 7 times the amount of testing as accepting it, if it happened only once every 20 times, you would need 27 tests to take 20 forward steps. That would still be faster than making 20 "guaranteed" forward steps per 20 test runs, if the test runs to achieve the higher confidence took twice as long.

Such details only shift the optimum. They don't alter the fact that there _is_ an optimum, and that both testing less accurately than this optimum and testing more accurately than it will be counter-productive.
Here is the most basic example that bites everyone. You add a simple passed pawn evaluation term. And now you see a position here and there where you (a) trade a knight for 2 or 3 pawns and create that passer (if the passer score is too big, a knight for 2 pawns is a real lemon, and a knight for 3 pawns is generally bad unless you can quickly trade down to an endgame with 3 pawns vs. a knight); (b) trade a bishop and knight for a rook and a pawn and create a passed pawn, again not a good idea; (c) create a passed pawn while letting your opponent break up (but not win anything from) your kingside pawn structure; or (d) create a passed pawn while letting your opponent get a pair of rooks to your 7th rank.
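
To make case (a) concrete, a toy sketch with made-up centipawn values (not Crafty's actual numbers):

Code: Select all

/* Toy illustration of case (a), with made-up centipawn values: an oversized
 * passed-pawn bonus can make a knight-for-two-pawns trade look like a gain. */
#include <stdio.h>

int main(void) {
    int knight       = 320;   /* assumed material values, in centipawns */
    int pawn         = 100;
    int passer_bonus = 150;   /* deliberately too large */

    /* Give up the knight for two pawns, creating a passed pawn: */
    int delta = -knight + 2 * pawn + passer_bonus;

    printf("eval swing for N x 2P plus passer: %+d centipawns\n", delta);
    /* Prints +30: the evaluation thinks it came out ahead, even though the
     * trade usually just loses a piece for two pawns. */
    return 0;
}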

That one term can influence scoring all over the board. And if your testing happens to avoid those positions for a while, by playing a small number of test games, then you won't realize how that can happen.

Other examples include things like a "king has not castled" penalty, which can interact with other scoring ideas in bizarre ways. You reduce score term X, and now you start winning because you will no longer try to produce "X" as an advantage while getting into a position where your king gets stuck in the center and attacked.

I find that scoring terms interact in ways I never intended. And it becomes difficult to figure this out well after the term was added or changed, since I didn't happen to play any games where it was important.

Hence my requirement of a broad set of starting positions, and enough games played with each one to make sure I am not getting false positives or false negatives just because of the variability of the results.

I agree there is an "optimum". I also believe it is very difficult to define this value. Which means I have the choice of going over or under without knowing. I choose to go over, since the cost (for me) is minimal. This way I get no steps backward that I don't recognize. Or at least I get _very_ few, since I am not sure I can play enough games on each test to be 100% certain there is no error.