how to do a proper statistical test

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

hgm
Posts: 27787
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: how to do a proper statistical test

Post by hgm »

Rein Halbersma wrote: He indeed calculates things correctly, the difference is that I standardize everything to a single game to look at the scaling behaviour for multiple games.
Well, glad you put that right...

But this might be a good point to reiterate an issue that I brought up before: how desirable is it actually to have very good confidence and resolving power? Your own example shows that you need an excessively large number of games to achieve such confidence levels; 27,000 games represents a lifetime of testing to most of us. So after that we can be really sure that the change was an improvement. Well, much good will it do, as we are dead and buried by that time, and in the meantime all development has stopped because our computer was busy running test matches. 7 Elo in 10 years, is that good progress?
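A rough sketch of where numbers of that size come from, assuming a per-game standard deviation of about 0.4 score points (typical with a normal draw rate) and the usual logistic Elo model; the exact count depends on the draw rate and the test design:

Code:

from statistics import NormalDist

def games_needed(elo_diff, confidence=0.95, sigma_per_game=0.4):
    # One-sided z-value for the requested confidence level.
    z = NormalDist().inv_cdf(confidence)
    # Expected score advantage of an engine that is elo_diff Elo stronger.
    score_diff = 1 / (1 + 10 ** (-elo_diff / 400)) - 0.5
    # Games needed so that z standard errors fit inside that advantage.
    return (z * sigma_per_game / score_diff) ** 2

print(round(games_needed(7, 0.95)))   # over 4000 games just to resolve 7 Elo at 95%
print(round(games_needed(7, 0.99)))   # and roughly twice that at 99%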

Bob says that going with smaller confidence makes your engine development like a random walk, as you will take many steps in the wrong direction, discarding improvements and adopting changes for the worse.

But I say: "So what?" A random walk can bring you to your destination faster than walking in a straight line, if you take a step every second while the straight walker takes only one step every week. An angry hornet will get to the honey sooner than a snail. So the picture is not complete if you don't consider progress per unit of time.

So taking one backward step in every 20 (as you would with 95% confidence testing) still brings you 18 steps forward net, 90% of what you could have achieved through perfect testing. If testing twice as long beefed up the confidence to 99%, you would accept one backward step in every 100, so 98% progress. But at twice the effort, that is only 49% progress in the time the lower-confidence testing made 90%. Going from 95% to 99% confidence testing was a losing strategy!

So in fact one could do the opposite, and reduce confidence. Only do 1.15-sigma testing instead of 1.65-sigma, for only 87.5% confidence. One of every 8 steps is backwards. But you only need half the number of games, so you make 75% progress in half the time, or 150% in the time in which you otherwise advanced 90%. Another factor of 2 fewer games brings you to the 79% confidence level (0.83 sigma, one-sided): 58% progress in one quarter of the time, which is 232% in the time in which you originally achieved 90%.
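Putting the arithmetic of the last two paragraphs in one place, a minimal sketch; it uses exact one-sided z-values, so the numbers come out slightly different from the rounded ones above:

Code:

from statistics import NormalDist

def progress_rate(confidence):
    # Net progress per accepted change: a fraction `confidence` of the
    # accepted steps go forward, the rest backward, so 2p - 1 on balance.
    net_per_step = 2 * confidence - 1
    # Games per test scale with the square of the one-sided z-value.
    games_per_step = NormalDist().inv_cdf(confidence) ** 2
    return net_per_step / games_per_step

base = progress_rate(0.95)
for p in (0.99, 0.95, 0.875, 0.79):
    print(f"{p:.3f}: {progress_rate(p) / base:.2f}x the progress of 95% testing")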

You see that low-confidence testing is extremely competitive, and leads to much faster improvement of your engine than high-confidence testing. At some point it stops, of course. And this approach assumes that testing is the limiting factor. The picture changes when you are running out of ideas to test. But in this method you should not shy away from retrying ideas that were thrown out earlier, as they might just have been unlucky. You would have to do that anyway if you wanted to test combinations of ideas. What might not work initially might work after you have changed many other things. So retesting rejected ideas, and continuing to test whether the accepted ideas are still useful, is a necessity anyway. Also in the high-confidence approach.

Things might also become different when you get really close to the optimum, where almost no change gives an improvement and almost every change is detrimental. Then the math above doesn't apply, and you would have to take into account the natural bias against improvements. But for engines around 1600 Elo we are very far from such an optimum, and almost any change can have an effect in either direction. It is just a completely different game from improving engines of 2700 Elo. And as a consequence, it needs a different type of testing.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: how to do a proper statistical test

Post by bob »

nczempin wrote:
Rein Halbersma wrote:
nczempin wrote: Bob also seems to have a problem with 95 % confidence, saying "1 in 20 will be wrong". What is your answer to that?
Bob's right: 5% of the time you make a Type I error and accept a program change that wasn't an actual improvement. In 95% of repeated tests it wouldn't show as an improvement; you were just unlucky that it did in your particular match. The only way to avoid this is to tighten the significance level to, say, 1%. That would mean that you need twice as many games.
Well, of course he's right, it is exactly what 95% confidence means: you will be wrong 1 time in 20. But his conclusion is that this is not acceptable.
If you do as we do, and test every change of any significance (we do not tweak one scoring term and run a test, for example, although I have certainly done that for adjusting extensions and reductions, since a tiny change there has a very measurable effect), then 1 in 20 is not acceptable. One of every 20 tests sees you make a wrong decision. And even worse, it biases the next test you run: the bad change you wrongly kept now shows its more realistic effect and offsets the next (good) change you try, so you conclude that one is no good. So now we went from one error in 20 tests to two errors, because we kept a bad change and then threw out a good one. How long before you figure out what is going on?

I'd love 0% error. But I can't play all possible games.

I don't see why it is hard to understand why 1 in 20 is going to cause lots of trouble. 1 in 100 will occasionally cause trouble. It burned me in 1986.


But both the effect size and the significance level are up to the experimenter, and to some degree are a matter of taste and available resources. Just make sure that you understand what you are trying to measure and with what accuracy your measurements are made.
Yes, and what Bob is saying is that his required confidence level is higher and his effect sizes are smaller. Therefore, for his situation, a lot more games are required. And I have no clue why he doesn't acknowledge that the reverse is also true: if your effect sizes are larger and your required confidence lower, you will need fewer games than he does.

For me with Eden, even 95% confidence seems overkill. After all, as long as the new version is not significantly _weaker_, no-one would complain (actually, no-one will complain except a certain L., and I will be disappointed that my goal for each new Eden version has been missed. But no-one will nail me to a tree for that.).
If your main concern is to avoid being nailed to a tree, you can just play one game and make decisions. Nobody is going to do that. If your main concern is to get stronger as quickly as possible, with as few backward steps as possible, then you need a pretty accurate measurement to avoid those giant steps back that you don't even realize you took.

If you accept larger error and the resulting penalty, that's fine. You can smoke, drink, eat poorly, etc. too. But I think it wise to inform the general population of the potential downside of any of those actions. Ditto for testing. Remember, this thread is _not_ between you and me. It's public and read by many others. If nothing else, I hope it explains why the various rating lists should be taken with a big grain of salt, at best.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: how to do a proper statistical test

Post by bob »

hgm wrote:
Rein Halbersma wrote: He indeed calculates things correctly, the difference is that I standardize everything to a single game to look at the scaling behaviour for multiple games.
Well, glad you put that right...

But this might be a good point to reiterate an issue that I brought up before: how desirable is it actually to have very good confidence and resolving power? Your own example shows that you need an excessively large number of games to achieve such confidence levels; 27,000 games represents a lifetime of testing to most of us. So after that we can be really sure that the change was an improvement. Well, much good will it do, as we are dead and buried by that time, and in the meantime all development has stopped because our computer was busy running test matches. 7 Elo in 10 years, is that good progress?

Bob says that going with smaller confidence makes your engine development like a random walk, as you will take many steps in the wrong direction, discarding improvements and adopting changes for the worse.

But I say: "So what?" A random walk can bring you to your destination faster than walking in a straight line, if you take a step every second while the straight walker takes only one step every week. An angry hornet will get to the honey sooner than a snail. So the picture is not complete if you don't consider progress per unit of time.

So taking one backward step in every 20 (as you would with 95% confidence testing) still brings you 18 steps forward net, 90% of what you could have achieved through perfect testing. If testing twice as long beefed up the confidence to 99%, you would accept one backward step in every 100, so 98% progress. But at twice the effort, that is only 49% progress in the time the lower-confidence testing made 90%. Going from 95% to 99% confidence testing was a losing strategy!

So in fact one could do the opposite, and reduce confidence. Only do 1.15-sigma testing instead of 1.65-sigma, for only 87.5% confidence. One of every 8 steps is backwards. But you only need half the number of games, so you make 75% progress in half the time, or 150% in the time in which you otherwise advanced 90%. Another factor of 2 fewer games brings you to the 79% confidence level (0.83 sigma, one-sided): 58% progress in one quarter of the time, which is 232% in the time in which you originally achieved 90%.

You see that low-confidence testing is extremely competitive, and leads to much faster improvement of your engine than high-confidence testing. At some point it stops, of course. And this approach assumes that testing is the limiting factor. The picture changes when you are running out of ideas to test. But in this method you should not shy away from retrying ideas that were thrown out earlier, as they might just have been unlucky. You would have to do that anyway if you wanted to test combinations of ideas. What might not work initially might work after you have changed many other things. So retesting rejected ideas, and continuing to test whether the accepted ideas are still useful, is a necessity anyway. Also in the high-confidence approach.

Things might also become different when you get really close to the optimum, where almost no change gives an improvement and almost every change is detrimental. Then the math above doesn't apply, and you would have to take into account the natural bias against improvements. But for engines around 1600 Elo we are very far from such an optimum, and almost any change can have an effect in either direction. It is just a completely different game from improving engines of 2700 Elo. And as a consequence, it needs a different type of testing.
You overlook the effect of that 1 in 20 backward step. You test a bad idea, but one in twenty times it shows up as good or indifferent. And you keep it. And on the next test, a good change is counteracted by the previous bad change you didn't discard because of testing error, and now you have a second error. How long will it take you to figure out that something is bad and has been screwing up your tests over and over, so that once you get that 1 in 20 error, you then start getting 20 of 20 errors?

We have had that happen even with our larger number of games. Tracy frequently sends me changes that test well in small matches, only to look significantly worse in our real test matches. Fortunately, we all use the cluster runs for the final go/no-go, but by the same token, how many of these "short tests" show bad results when the change is really good? No idea, as they don't make it past his first screening, except for the cases where he says "this tests worse, but I really believe it should be better..." And testing often shows his intuition to be correct...
hgm
Posts: 27787
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: how to do a proper statistical test

Post by hgm »

Incorporating bad changes is not a problem, as long as you occasionally test already incorporated ideas for removal.

Theory for this process is quite well developed, as people working in statistical thermodynamics use it all the time. This type of Monte Carlo calculation is known as "simulated annealing", where you intentionally take steps against the gradient now and then, but with a lower probability. The forward/backward stepping ratio is controlled by an exponential of the difference in the quantity to optimize (usually the energy of a collection of model molecules) divided by the temperature (the so-called Boltzmann factor exp(-E/kT)).
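A minimal sketch of that acceptance rule (the Metropolis criterion); the energy difference and temperature here are generic placeholders, not anything specific to engine testing:

Code:

import math
import random

def metropolis_accept(delta_e, temperature):
    # Downhill moves (delta_e <= 0) are always accepted; uphill moves are
    # accepted with the Boltzmann probability exp(-delta_e / T), so the
    # backward-step rate is controlled by the temperature.
    if delta_e <= 0:
        return True
    return random.random() < math.exp(-delta_e / temperature)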

One reason for taking backward steps is to avoid getting trapped in local optima. The nice thing about developing an engine is that one automatically does SA, as the probability of accepting a backward step is always non-zero. The confidence level of the acceptance criterion effectively determines the 'temperature' at which the process is conducted. Low temperatures do not always lead to the fastest results. Often the best results are obtained by cooling the system during the process, starting at a high temperature to speed things up, and lowering the temperature as the target is approached to gain precision. In our case this would mean increasing the sample size during the development process.

For truly hard problems one works at multiple temperatures simultaneously ("parallel tempering"). In our case that would mean having parallel developments, one with small sample sizes, the other with large sample sizes, and occasionally swapping versions from one process to the other. Lots of formalized theory has been generated to describe this problem.
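The standard replica-exchange swap rule, for reference; a generic sketch rather than anything tailored to engine development:

Code:

import math
import random

def swap_replicas(e1, t1, e2, t2):
    # Replica-exchange criterion: accept the swap with probability
    # min(1, exp((1/T1 - 1/T2) * (E1 - E2))), so both temperature levels
    # keep sampling their own equilibrium distribution.
    x = (1.0 / t1 - 1.0 / t2) * (e1 - e2)
    return x >= 0 or random.random() < math.exp(x)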
nczempin

Re: how to do a proper statistical test

Post by nczempin »

bob wrote: I don't see why it is hard to understand why 1 in 20 is going to cause lots of trouble. 1 in 100 will occasionally cause trouble. It burned me in 1986.
I don't even have 20 releases of my software yet; why should I be worried about that small an error?

And surely one tournament out of how many is not such a bad result? I think you are wrong in attributing all of whatever burned you in 1986 to that one result. It could just as well have been that one thing in 20 that was wrong, while all the other things were right, surely more than 20 of them.

How can, over the long run, such a small thing make such a large difference? Especially since you are aware of it and have been able, after more testing, to correct the mistake.

It is not like I test each change by itself. I always test the engine holistically. But you already know that.

Case in point: my engine still uses straightforward alpha-beta, no PVS. I tried including PVS, but it made that particular version of the source code too complicated relative to the test results, which didn't show a significant improvement. How significant would going back from PVS to alpha-beta be for Crafty?

We just live in different worlds.

I have received enough information from people that seem to be able to imagine themselves in my shoes that I don't need your approval. You seem to be unable to lower yourself down to my level, and I will just live with that from now on.
nczempin

Re: how to do a proper statistical test

Post by nczempin »

bob wrote: If your main concern is to get stronger as quickly as possible, with as few backward steps as possible, then you need a pretty accurate measurement to avoid those giant steps back that you don't even realize you took.
For the hundredth time, my main concern is not to get stronger as quickly as possible, but to get stronger gradually, and to actually learn something along the way, while still knowing what I'm doing. This isn't a full-time job for me.



Incidentally, why are those 1 in 20 steps back suddenly "giant"?


I am not doing computer chess research; I am just implementing stuff that has been known for 30 years. What giant steps backwards are you envisioning that I will take?
hgm
Posts: 27787
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: how to do a proper statistical test

Post by hgm »

nczempin wrote:Incidentally, why are those 1 in 20 steps back suddenly "giant"?
The point is that 1 in 20 is the acceptance probability for a _neutral_ change. A change for the worse has a much lower acceptance probability. A "giant step" backward has as much hope as a snowball in Hell...
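A quick way to see why, sketched under the assumption of roughly 0.4 score points of standard deviation per game; the 1000-game match is just an illustrative size:

Code:

from statistics import NormalDist

def accept_probability(elo_change, n_games, confidence=0.95, sigma=0.4):
    # Probability that a change whose true effect is elo_change Elo still
    # clears a one-sided significance test at the given confidence.
    z = NormalDist().inv_cdf(confidence)
    se = sigma / n_games ** 0.5
    true_score_diff = 1 / (1 + 10 ** (-elo_change / 400)) - 0.5
    return NormalDist().cdf(true_score_diff / se - z)

print(accept_probability(0, 1000))     # neutral change: ~0.05 by construction
print(accept_probability(-10, 1000))   # 10 Elo worse: well under 1%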
nczempin

Re: how to do a proper statistical test

Post by nczempin »

bob wrote: If you accept larger error and the resulting penalty, that's fine. You can smoke, drink, eat poorly, etc. too. But I think it wise to inform the general population of the potential downside of any of those actions. Ditto for testing. Remember, this thread is _not_ between you and me. It's public and read by many others. If nothing else, I hope it explains why the various rating lists should be taken with a big grain of salt, at best.
This sounds like you are claiming a monopoly on being right.

Look, we are aware of the downsides, but we manage them differently.

Yes, as I have said numerous times, most get it wrong. I have just pointed out an example of it in the general forum (regarding the strongest Fruit). But here you are preaching to the choir, and taking a step too far.


Note also that the general discussion will not change dramatically if we agree on the same level of confidence. None of the results will change in principle if we also take 99% (or which confidence level are you using again, given that you got burned in 1986?). But again, you are not on some kind of higher moral ground in claiming that your number is better than ours. There are plenty of real numbers I could quote other than your apparently universally valid confidence levels. Does that make me more right than you?
nczempin

Re: how to do a proper statistical test

Post by nczempin »

I have one thing I just have to get off my chest; please don't take it personally:

I think at some stage one must realize whether one is the Windmill or Don Quixote.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: how to do a proper statistical test

Post by bob »

hgm wrote:Incorporating bad changes is not a problem, as long as you occasionally test already incorporated ideas for removal.
It makes testing far more intractable than if you just accurately test each change the first time, because now you are faced with rerunning past tests, removing one feature at a time, and hoping your inaccuracies won't cover up the problem a second time.

Theory for this process is quite well developed, as people working in statistical thermodynamics use it all the time. This type of Monte Carlo calculation is known as "simulated annealing", where you intentionally take steps against the gradient now and then, but with a lower probability. The forward/backward stepping ratio is controlled by an exponential of the difference in the quantity to optimize (usually the energy of a collection of model molecules) divided by the temperature (the so-called Boltzmann factor exp(-E/kT)).

One reason for taking backward steps is to avoid getting trapped in local optima. The nice thing about developing an engine is that one automatically does SA, as the probability of accepting a backward step is always non-zero. The confidence level of the acceptance criterion effectively determines the 'temperature' at which the process is conducted. Low temperatures do not always lead to the fastest results. Often the best results are obtained by cooling the system during the process, starting at a high temperature to speed things up, and lowering the temperature as the target is approached to gain precision. In our case this would mean increasing the sample size during the development process.


Nothing new there at all. But there at least you are taking a "backward step" in an attempt to find a path away from that local maximum. That's not what we are talking about here. We are talking about unintentional (and, indeed, unknown) steps backward. They are quite difficult to find and undo later.

For truly hard problems one works at multiple temperatures simultaneously ("parallel tempering"). In our case that would mean having parallel developments, one with small sample sizes, the other with large sample sizes, and occasionally swapping versions from one process to the other. Lots of formalized theory has been generated to describe this problem.
And very few of those approaches utilize a _small_ number of iterations. They iterate until the rate of change is below some threshold. Which somehow sounds quite familiar to me.