nczempin wrote:
Could you give me the correct way to analyse my example statistically?

bob wrote:
I don't think we have such a methodology yet. At least I have not found one after all the games I have played using our cluster. This is really a difficult problem to address. I'm simply trying to point out to everyone that a few dozen games are a poor indicator. Again, the easiest way to see why is to run the same "thing" more than once and look at how unstable the results are.

nczempin wrote:
But how unstable can it get?
Suppose you play a match of two games (one as White, one as Black) and your engine wins both. Then you play another match, and your engine loses both.
That is the maximum variance you will ever be able to get. Correct me if I'm wrong.

bob wrote:
Isn't that bad enough? One test says "new feature is great". Second test says "new feature is horrible." How bad is that? How could it get any worse, in fact???
Well, that is exactly what I'm saying: You are seeing the maximum variance (it can't get any worse), which means that from the observations so far, you cannot draw the conclusion that it is stronger or that it is weaker.
If you play one more match after these two, your variance is necessarily lower; it cannot become higher (and it cannot even stay at the maximum).
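To make the instability concrete, here is a quick Python sketch that plays repeated two-game matches between two engines of *identical* strength (the 32% draw rate is just an invented figure):

[code]
import random

# Minimal sketch: two engines of identical strength. Each game is a draw
# with some probability (the 32% here is an invented figure), otherwise
# a coin flip. Scores per game are 1 / 0.5 / 0, as in chess.
DRAW_RATE = 0.32

def play_game():
    if random.random() < DRAW_RATE:
        return 0.5
    return random.choice([0.0, 1.0])

def play_match(games):
    return sum(play_game() for _ in range(games))

random.seed(1)
# Ten independent two-game matches between the *same* two engines:
for i in range(10):
    print(f"match {i + 1}: {play_match(2):.1f} / 2")
[/code]

Even with zero real strength difference, individual two-game matches regularly come out 2-0 either way; that is exactly the maximum-variance situation described above.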
I am not proposing to draw the conclusion "new feature is x" after a two-game match. But 20000 games is far more than you need to conclude, at a given confidence level, that a change is not significant.
There are certainly changes that will only turn out to be significant after 20000 games. And it may well be that you have reached the level of maturity where the changes are so small.
Once you have implemented all the changes you have found that are significant after 20000 games, your changes will need to be tested at an even higher number of games. Will you then tell everybody that 20000 games are not enough, that everybody needs 1000000 games?
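To put rough numbers on "how many games for a given confidence", here is a back-of-envelope sketch using a normal approximation (0.25 is the worst-case per-game variance for a score bounded between 0 and 1; the Elo-to-score conversion is the standard logistic formula):

[code]
import math

# Rough sample-size sketch under a normal approximation. sigma2 = 0.25
# is the worst-case per-game variance for a score in [0, 1]; real data
# with draws has less. z = 1.96 is the usual 95% two-sided cut-off.
def games_needed(elo_diff, z=1.96, sigma2=0.25):
    # Expected score of the stronger side under the standard Elo model.
    p = 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))
    delta = p - 0.5  # the score edge that must be resolved
    return math.ceil(z * z * sigma2 / (delta * delta))

for elo in (5, 10, 20, 50):
    print(f"{elo:>3} Elo edge: ~{games_needed(elo):>6} games")
[/code]

On these assumptions a ~5 Elo change indeed needs on the order of 20000 games, while larger changes can be confirmed (or rejected) with far fewer.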
For me, the situation is easier: because I require significance at a much higher level, I can reject changes more quickly, including changes that would perhaps turn out significant after 20000 games. I simply conclude "this change is not significant at my required level" and define that to be insufficient for my purpose.
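As a sketch of what I mean by a stricter threshold (the 535 points out of 1000 games and both cut-offs are example numbers only):

[code]
import math

# Plain z-test of a match score against the 50% null hypothesis.
# sigma2 = 0.25 is again the worst-case per-game variance.
def z_score(points, games, sigma2=0.25):
    return (points / games - 0.5) / math.sqrt(sigma2 / games)

z = z_score(535, 1000)
print(f"z = {z:.2f}")                      # ~2.21
print("passes z > 1.96 (95%)?", z > 1.96)  # True
print("passes z > 2.58 (99%)?", z > 2.58)  # False
[/code]

A result like this would count as significant at the 95% level, but at my stricter level I reject it and move on.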
It is not meaningful to claim that everybody needs the same minimum number of games.
Your example with Blackjack is not appropriate either, because its variance is much higher than Chess variance can ever be. The same goes for Poker (the only other casino game that actually allows skilled players to win in the long run).
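A back-of-envelope variance check makes the gap visible (illustrative numbers only; the Blackjack figure of roughly 1.3 squared betting units per hand is from memory, so treat it as an assumption until I can check the books):

[code]
# Variance of a discrete outcome distribution.
def variance(outcomes, probs):
    mean = sum(o * p for o, p in zip(outcomes, probs))
    return sum(p * (o - mean) ** 2 for o, p in zip(outcomes, probs))

# Worst case for a chess game scored 0 / 0.5 / 1: even match, no draws.
print("chess, no draws :", variance([0.0, 1.0], [0.5, 0.5]))  # 0.25
# Draws only shrink it further (the 32% draw rate is invented):
print("chess, 32% draws:", variance([0.0, 0.5, 1.0], [0.34, 0.32, 0.34]))
# A blackjack hand pays out over a wider range (-2 .. +2.5 units with
# doubles, splits and 3:2 naturals), so its per-hand variance is far
# larger; around 1.3 is the commonly quoted ballpark.
[/code]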
I have some figures for BJ and Poker back in Berlin in some Malmuth or Sklansky book; I could bring them in about a week.
Or we could post our question on the two-plus-two forums (where there are a lot of Statistics experts, and I am sure they would give their left arm to have the variance you are seeing in their Poker careers).
What I'm trying to do in this thread is to find out exactly what variance you, Bob, are seeing, and what variance I'm seeing.
Claiming that Basic Statistics is inappropriate is not exactly constructive unless you can put forward reasoning why more Advanced Statistics is needed. So far I haven't seen this reasoning.