how to do a proper statistical test

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: how to do a proper statistical test

Post by bob »

I think there are multiple cases.

Some eval terms are obviously interdependent, by design. Passed pawns are one example. You have to advance them, blockade them, create them, support them, etc. Each of those contributes to the total value of that pawn. And often any one of those terms is sub-optimal, but added together they work well.

King safety is another obvious case that is far larger and more complex: pawn structure around the king, pieces attacking and defending the king, weak squares around the king and the pieces that can potentially exploit them, etc.

And then there are the cases where the dependency is far from obvious but still there. Trying to stop your program from trading knight for 2-3 pawns, or knight and bishop for rook and pawn. But then other scores can add enough to the equation to overpower the "bad trade" score and you do it anyway. A change to king safety or passed pawns (both typically can produce large score ranges) can easily overwhelm the bad trade penalty and there you go.

As the number of terms goes up, the complexity appears to go up exponentially, or geometrically at best. If I make one change and find that it screws things up, I can fix or get rid of it easily enough. If I let it sneak in silently, thinking it is good when it is really bad, then I am going to have a tough debugging session to discover this.

We have, over the past year, violated my golden rule of "one change at a time" and have occasionally paid dearly for it. Three changes, two good, one bad, looks like an improvement. It would be a bigger improvement if that one bad change were left out. Finding those is very difficult, and it has often led us to go back and test the changes one at a time, as we originally should have. Of course, we still don't know which change was bad, so we have to test them all, and the changes made since that point have to be left out and retested as well...

I had this problem with "blitz" in the 70's, with Cray Blitz in the 80's and early 90's, and with Crafty since 1995. It is still "alive and well" today...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: how to do a proper statistical test

Post by bob »

Rein Halbersma wrote:The discussion in the various threads on testing engine improvements so far has been rather ad hoc as far as the statistical methodology is concerned.

H.G. Muller seems to think that the variance determines the confidence intervals around a match outcome. That's not true. What matters is the standard error on the mean, which is equal to the square root of the variance divided by the square root of the number of games. So more games in a match let you test more precisely, just as Robert Hyatt is claiming.

E.g. (building on an example that was given earlier), if you play a 100 game match with 51 wins and 49 losses, then the mean result is that you score 51 points, and the per-game variance is 0.2499 (= (51*(1-0.51)^2 + 49*(0-0.51)^2)/100). The average result per game is 0.51, with a standard error on the mean of 0.04999 = sqrt(0.2499/100). If we assume that games are statistically independent, then the average score of a 100 game match between two opponents that score 51-49 against each other is normally distributed around a mean of 51% with a standard deviation of 0.04999.

If you want to test whether the 51-49 match was a significant improvement over a 50-50 match (again with no draws), then you have to calculate the one-sided 95% critical value under the null hypothesis of a 50-50 result. That turns out to be a score of 58.2% or more *in a 100 game match*. Hence, a 51-49 result in a 100 game match is not a significant improvement.

How many games do you have to play for a 51% score to be a significant improvement over 50-50? From the scaling behaviour of the standard error on the mean, it turns out that you would need to score 51% over roughly 6,700 games in order to show a significant improvement!

Now, that was only the significance, i.e. the control of Type I errors (a 95% chance that you do not accept an improvement when there is none). But what do you do when the result of a 100 game match is not significant? You can't conclude that there is no improvement either, since you might be committing a Type II error (falsely rejecting a genuine improvement).

So besides significance, you also want a test with high statistical power, so that you give yourself a shot at finding, let's say, 95% of all genuine improvements. If you want to detect 95% of the genuine improvements while keeping the 95% significance level, then you need even more games. For the example at hand, it means that only after roughly 27,000 games would a 51% score be sufficient to detect a genuine 51-49 improvement with 95% probability.
I didn't notice this until someone pointed it out to me, but your 27,000 games is awfully close to the 5120 * 4 games (20,480) we are currently using. And I have noticed that there is a bit of variance left, but it is very small when I average everything down to a single 80 game match result...
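
A minimal Python sketch, assuming the normal approximation, one-sided 5% tests, and no draws as in the quoted post, reproduces these figures; it is only a cross-check of the arithmetic, not anyone's actual test harness:

```python
import math

# Assumptions (as in the quoted post): no draws, normal approximation,
# one-sided tests at the 5% level.
z = 1.645                    # one-sided 95% quantile of the standard normal
p0, p1 = 0.50, 0.51          # null (equal strength) and observed per-game score
sigma = 0.5                  # per-game standard deviation under the 50-50 null

# Standard error of the mean score in a 100-game 51-49 match:
wins, losses, n = 51, 49, 100
var = (wins * (1 - p1) ** 2 + losses * (0 - p1) ** 2) / n   # 0.2499
se = math.sqrt(var / n)                                     # 0.04999
print(f"per-game variance {var:.4f}, standard error {se:.5f}")

# Critical score under the 50-50 null in a 100-game match (~58.2%):
crit = p0 + z * sigma / math.sqrt(n)
print(f"critical score for significance: {crit:.1%}")

# Games needed for a 51% score to be significant (~6,765):
n_sig = (z * sigma / (p1 - p0)) ** 2
print(f"games for 5% significance: {n_sig:.0f}")

# Games needed for 95% significance *and* 95% power (~27,060):
n_pow = ((z + z) * sigma / (p1 - p0)) ** 2
print(f"games for 95% power as well: {n_pow:.0f}")
```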
hgm
Posts: 27808
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: how to do a proper statistical test

Post by hgm »

bob wrote:Here is the most basic example that bites everyone. You add a simple passed pawn evaluation term. And now you see a position here and there where you (a) trade a knight for 2 or 3 pawns + create that passer. If the passer score is too big, a knight for 2 pawns is a real lemon. A knight for 3 pawns is generally bad unless you can quickly trade to the endgame with 3 pawns vs a knight... (b) trade a bishop and knight for rook and pawn + create a passed pawn again. Not a good idea. (c) create a passed pawn while letting your opponent break up (but not win anything) your kingside pawn structure. (d) create a passed pawn while letting your opponent get a pair of rooks to your 7th rank.

...
That doesn't strike me as relevant to the problem at hand. What you mention is an interaction of terms, and that would force you to tune the terms simultaneously in any method. What do you gain if your very precise testing does recognize the N vs 2P problem, and therefore tells you that the passer bonus is bad? You would reject it! But the problem was not that the passer bonus was bad, but that after including it, other parameters need retuning to make it good. So rejecting it was actually the wrong decision. And if the precise test tells you that the disadvantage does not outweigh the advantage, and you accept it anyway, you might as well not have done the test.

This is a classic example of getting trapped in a local maximum (within the search-space topology defined by the mutation prescriptions). SA at finite temperature is far more robust against this than the gradient climbing you advocate.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: how to do a proper statistical test

Post by bob »

hgm wrote:
bob wrote:Here is the most basic example that bites everyone. You add a simple passed pawn evaluation term. And now you see a position here and there where you (a) trade a knight for 2 or 3 pawns + create that passer. If the passer score is too big, a knight for 2 pawns is a real lemon. A knight for 3 pawns is generally bad unless you can quickly trade to the endgame with 3 pawns vs a knight... (b) trade a bishop and knight for rook and pawn + create a passed pawn again. Not a good idea. (c) create a passed pawn while letting your opponent break up (but not win anything) your kingside pawn structure. (d) create a passed pawn while letting your opponent get a pair of rooks to your 7th rank.

...
That doesn't strike me as relevant to the problem at hand. What you mention is an interaction of terms, and that would force you to tune the terms simultaneously in any method. What do you gain if your very precise testing does recognize the N vs 2P problem, and therefore tells you that the passer bonus is bad? You would reject it! But the problem was not that the passer bonus was bad, but that after including it, other parameters need retuning to make it good. So rejecting it was actually the wrong decision. And if the precise test tells you that the disadvantage does not outweigh the advantage, and you accept it anyway, you might as well not have done the test.

This is a classic example of getting trapped in a local maximum (within the search-space topology defined by the mutation prescriptions). SA at finite temperature is far more robust against this than the gradient climbing you advocate.
It depends on how you test. If you play enough games with enough variability, you will see the passed pawn code cause you to make an error you had previously prevented. If you play too few games, but make sure you play games dealing with passed pawns since that is what you are working on, you miss it. Often it takes a large number of games to recognize an unexpected interaction and fix it; other times it takes a large number of games to deal with an expected interaction (disrupting the opponent's king-side pawns vs long-term pawn structure, or king attacks vs centralization of pieces, which are often orthogonal in nature).

However, SA doesn't fit computer chess testing, so I am not sure what that has to do with testing changes.

BTW, I don't reject changes whose results are worse without some thought about why. If a change is good, I keep it, but I often tweak it further to see if I can gain even more. If something doesn't work, we go back and try to figure out why. Tracy has had a couple of these counter-intuitive "it is bad" events that drove us crazy trying to understand why. Usually there is a reason, but it is not always easy to pick out.

So there are planned interactions, and unplanned interactions. We have to deal with both. I have several thousand lines of code in our current evaluation. That leads to plenty of issues when making changes... There is hardly any case any longer where a change is just a single-effect / simple change. The current eval modifications are spread over several pieces of code to produce a single effect that is hopefully better.
hgm
Posts: 27808
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: how to do a proper statistical test

Post by hgm »

SA fits testing Chess engines, as the evaluation of the changes is a stochastic process. So there is always a non-zero probability of making an evaluation error, and as a consequence accepting a backward step. So engine development can never be done at zero temperature. It is only a question of how low you want the temperature to be. And from SA we know that lower (i.e. more precise testing) is not necessarily better. It all depends on the topology of the fitness landscape. And as your examples above illustrate, that is not always trivial.
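
As a rough illustration of what such a finite-temperature acceptance rule might look like, here is a sketch only; play_match is a hypothetical helper, and the numbers are made up, not anyone's actual code:

```python
import math
import random

def accept(delta, temperature):
    """Metropolis rule: always keep an apparent improvement, and keep an
    apparent step backward with probability exp(delta / temperature)."""
    if delta >= 0:
        return True
    return random.random() < math.exp(delta / temperature)

# Hypothetical usage: measure the (noisy) score difference of a candidate
# parameter set against the current one over a short match, then decide.
#
#   delta = play_match(candidate) - play_match(current)   # stochastic estimate
#   if accept(delta, temperature):
#       current = candidate
#   temperature *= 0.999   # cooling: lower temperature means a stricter test
```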

If one does not reject or accept based only on the test results, but starts to analyze individual games, it becomes a different game. But I am not sure how you could do that if you play 100,000 games per night. If the passer code caused 1000 faulty N vs 2P sacs, how would you know? Do you have a filter to select games with unequal pawns? Having to look at the games by hand seems to violate your wish for automated decisions, and it would argue for a small number of games rather than a large one. That is how I often do it: just watch 100 games and look for strange strategic behavior. It only has to occur once; you are not really interested in knowing precisely how often it occurs, since it seems an avoidable error and you want to suppress it first.
BubbaTough
Posts: 1154
Joined: Fri Jun 23, 2006 5:18 am

Re: how to do a proper statistical test

Post by BubbaTough »

It seems pretty clear that tweaking chess parameters is a classic hill-climbing problem, which could be addressed through any of the classic methods, including SA, genetic algorithms, etc. It also seems clear that the testing method of changing one parameter at a time and testing it with lots of games is equivalent to the greedy algorithms so often criticized for getting stuck in local optima (which would possibly explain why so many programmers quickly reach a point where no change they make seems to help).

The best method of testing parameter changes is of course an interesting and important question worthy of debate, and I do think there is an important role for a good testbed. But I find it surprising that so many bright computer scientists seem so devoted to the greedy-algorithm approach as the only worthy method of tuning parameters, in the name of the "scientific method".

Most likely this is not because the very smart people working in this area are unaware that greedy algorithms are pretty bad at optimizing these kinds of things. More likely, in the middle of working on a project like this, the line between testing and optimizing gets blurry, and if you think what you're doing is testing, then a big testbed is a great choice. Also, greedy algorithms are very satisfying to use, since each version gets better and better. But in the long run, doing optimization by changing a single parameter and testing it thoroughly is a BAD way to optimize, in my opinion. And in the optimization community, that opinion is by no means in the minority.
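
For contrast, the "greedy" procedure being criticized amounts to something like the following sketch; measure_strength is a hypothetical, expensive fitness function (e.g. a many-thousand-game match), and the whole thing is an illustration, not anyone's actual tuner:

```python
# One-parameter-at-a-time greedy tuning: each candidate value is verified
# with a thorough, costly test before it is kept.
def greedy_tune(params, candidate_values, measure_strength):
    best_score = measure_strength(params)
    for name in params:                         # one parameter at a time
        for value in candidate_values[name]:
            trial = dict(params, **{name: value})
            score = measure_strength(trial)     # expensive: lots of games
            if score > best_score:              # keep only clear improvements
                params, best_score = trial, score
    return params                               # can stall at a local optimum
```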

-Sam
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: how to do a proper statistical test

Post by bob »

My comment on SA concerns the huge number of parameters that can be adjusted. The typical annealer has problems with that. Anthony and I tried it a couple of years back to tune chess engines, and we found it to be (a) a daunting computational task and (b) a daunting algorithmic task. After the long annealing runs we did, we never produced a version that was better than the original Crafty...
hgm
Posts: 27808
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: how to do a proper statistical test

Post by hgm »

Well, no one forces you to change all parameters all the time. SA can also be done on a small subset of the parameters that you consider not definitively settled yet, i.e. where you expect dependencies: when giving bonuses specific to a piece type, for example, unlock that piece's value but keep the others fixed.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: how to do a proper statistical test

Post by bob »

hgm wrote:Well, no one forces you to change all parameters all the time. SA can also be done on a small subset of the parameters that you consider not definitively settled yet, i.e. where you expect dependencies: when giving bonuses specific to a piece type, for example, unlock that piece's value but keep the others fixed.
Yes. But how do you "anneal"? Play games? This is a random-walk approach of stumbling through lots of changes that are no good in order to find (hopefully) some that are.

The idea works, but it looks to be _very_ difficult to apply to chess with any effectiveness, based on the results we produced a couple of years ago (Cozzie and myself).
BubbaTough
Posts: 1154
Joined: Fri Jun 23, 2006 5:18 am

Re: how to do a proper statistical test

Post by BubbaTough »

[quote="bob"]my comment on SA involves the huge number of parameters that can be adjusted. The typical annealer has problems with that. Anthony and I tried that a couple of years back to tune chess engines, and we found it to be (a) a daunting computational task; (b) a daunting algorithmic task. After the long annealing runs we ran, we did not produce any versions that were better than the original Crafty...[/quote]

I can well believe that. The first time I tried it was a miserable failure, and the second time the success was only modest. As you have pointed out a couple of times, if you use games against opponents as your fitness function, it's even hard to get enough games to do a greedy approach (which generally requires far fewer games).

Anyway, the purpose of my post was not to champion SA or genetic algorithms, or even to say the greedy approach is wrong. My main point is that there is a difference between testing and optimizing, and the best way to test is not necessarily the best way to optimize. Both are important, but the best way to do each is different, and each should be considered separately. I think the discussion here has been very helpful regarding building effective testbeds. And the fact that someone like you, one of the most experienced and effective testbed builders in the world, is sharing opinions and data is extremely valuable to the community. But it seemed that optimization and testing were getting intermixed in the discussion too much for my taste (not necessarily by you, but by the community in general), so I thought I would pipe up and suggest separating the discussion of how to determine whether a new version is better than an old version (the testbed), something most of us cannot do very frequently due to the amount of time and resources required, from the discussion of how to come up with a new version worthy of undergoing formal testing (optimization).

Anyway, regarding optimization using things like SA/GA, I have a few findings to share from my own work (a rough sketch follows the list):

1. do not allow all features to change in a session. Pick a related subset of features (Bob gave some examples of these earlier). This helps reduce the amount of time it takes to find decent places in the solution space.
2. put limits on the range that values can change. In most cases, a 0 weight should be included in the range. This also limits the space a bit.
3. do not use as time-consuming a fitness function as you would in a greedy algorithm. More generations are often more important than accuracy, since these systems can recover from a wrong step better than greedy approaches, which optimize one feature at a time and then move on.
4. do not expect fast results. If you do this, it is going to run in the background for a long, long time before it starts helping. If you are not the kind of person who wants to spend a lot of time carefully setting things up, set it running, and check back in a month, this may not be your cup of tea.
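
A rough sketch of how these guidelines might be wired together follows; the function name anneal_subset, the cheap_match fitness function, and the ranges and schedule are all made up for illustration, not a real tuner:

```python
import math
import random

def anneal_subset(params, ranges, cheap_match, steps=10000, temp=0.05):
    """Anneal only the parameters listed in `ranges` (point 1), each bounded
    and allowed to reach 0 (point 2), using a cheap, noisy fitness function
    (point 3) over a long run (point 4)."""
    current = dict(params)
    score = cheap_match(current)                  # e.g. a short, fast gauntlet
    for step in range(steps):
        name = random.choice(list(ranges))        # mutate one unlocked term
        lo, hi = ranges[name]                     # bounds that include 0
        candidate = dict(current, **{name: random.randint(lo, hi)})
        cand_score = cheap_match(candidate)
        t = max(temp * (1 - step / steps), 1e-9)  # slow cooling over the run
        if cand_score >= score or random.random() < math.exp((cand_score - score) / t):
            current, score = candidate, cand_score
    return current
```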

-Sam