An objective test process for the rest of us?

Discussion of chess software programming and technical issues.

Moderator: Ras

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

hgm wrote:
bob wrote:That is _exactly_ the kind of comment that makes me cringe. ~5 elo? Based on what?
Based on theoretical considerations and general Elo vs search time, as in the derivation I gave in the other thread. In absence of actual data I see no reason to disbelieve plausible theoretical predictions. I do not have the equipment to reliably measure such small Elo differences. You yourself told me that you never measured this. So the theoretical prediction stands until it is convincingly falsified by experimental data.


So somehow modify all the other engines to use something provided by the first? Are we trying to test/debug or introduce bugs to find? I'm not going to modify other programs and then debug that...
Of course not! :shock: The opponents don't have to be modified at all, as they play only once in every position, as I described above. Their moves go into the database, and if another engine of my A family brings them into a position they have already seen, the move comes from the database. That guarantees 100% reproducible play by the opponents, without having to modify them in any way.
Ugh. So your way of eliminating the random factor is to take the _first_ result for each position and use it over and over?

Doing that, my coin either throws nothing but heads, or nothing but tails...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

nczempin wrote:
bob wrote:
nczempin wrote: Could you give me the correct way to analyse my example statistically?
I don't think we have such a methodology yet. At least I have not found one after all the games I have played using our cluster. This is really a difficult problem to address. I'm simply trying to point out to everyone that a few dozen games are a poor indicator. Again the easiest way to see why is to run the same "thing" more than once and look at how unstable the results are.
But how unstable can it get?

Suppose you play a match of two games (one as White and one as Black) and your engine wins both games. Then you play another match, and your engine loses both games.

That is the maximum variance you will ever be able to get. Correct me if I'm wrong.
Isn't that bad enough? One test says "new feature is great". Second test says "new feature is horrible." How bad is that? How could it get any worse, in fact???
User avatar
hgm
Posts: 28356
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: An objective test process for the rest of us?

Post by hgm »

bob wrote:Ugh. So your way of eliminating the random factor is to take the _first_ result for each position and use it over and over?

Doing that, my coin either throws nothing but heads, or nothing but tails...
Yes, so what? I am not running these tests to test my opponents. Very nice of them that they _can_ generate multiple moves from the same position. (Not that it would hinder me much if they couldn't; there are plenty of other opponents that would generate other moves.) But I as a tester will decide when I want to present my engines with the same opponent moves, or with different ones. In particular I want A and A' to see the same moves. Great if I can also use a different second, third and fourth move of that same opponent from that same position. But then I would still want A, A' and A" all to see all those moves. Not A one subset, A' another subset, and A" yet another subset. Because that is exactly what causes unnecessary noise. And if I can reduce the noise by a factor N, I can reduce the number of games by a factor N^2...
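
As a quick sanity check on that last claim, here is a minimal Python sketch, assuming only that the standard error of the mean match score behaves like sigma divided by the square root of the number of games (the target error bar and the per-game sigma below are made-up values): holding the error bar fixed, the required number of games scales with the square of the noise, so halving the noise quarters the games.

    import math

    def games_needed(sigma_per_game, target_se):
        # Standard error of the mean score after n games is sigma / sqrt(n);
        # solve sigma / sqrt(n) <= target_se for n.
        return math.ceil((sigma_per_game / target_se) ** 2)

    target = 0.01   # desired error bar on the score, in points per game (made-up target)
    sigma = 0.5     # assumed per-game standard deviation of the result (made-up value)

    for reduction in (1, 2, 4):
        print(f"noise reduced by factor {reduction}: "
              f"~{games_needed(sigma / reduction, target)} games")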
nczempin

Re: An objective test process for the rest of us?

Post by nczempin »

bob wrote:
nczempin wrote:
bob wrote:
nczempin wrote: Could you give me the correct way to analyse my example statistically?
I don't think we have such a methodology yet. At least I have not found one after all the games I have played using our cluster. This is really a difficult problem to address. I'm simply trying to point out to everyone that a few dozen games are a poor indicator. Again the easiest way to see why is to run the same "thing" more than once and look at how unstable the results are.
But how unstable can it get?

Suppose you play a match of two games (one as White and one as Black) and your engine wins both games. Then you play another match, and your engine loses both games.

That is the maximum variance you will ever be able to get. Correct me if I'm wrong.
Isn't that bad enough? One test says "new feature is great". Second test says "new feature is horrible." How bad is that? How could it get any worse, in fact???
Well, that is exactly what I'm saying: You are seeing the maximum variance (it can't get any worse), which means that from the observations so far, you cannot draw the conclusion that it is stronger or that it is weaker.

If you play one more match after these two, your variance is necessarily lower; it cannot become higher (and it cannot stay at the maximum).

I am not proposing to draw the conclusion "new feature is x" after a two-game match. But 20000 is way over the number of games you need to say, for a given confidence factor, that your change is not significant.

There are certainly changes that will only turn out to be significant after 20000 games. And it may well be that you have reached the level of maturity where the changes are so small.

Once you have implemented all the changes you have found that are significant after 20000 games, your changes will need to be tested at an even higher number of games. Will you then tell everybody that 20000 games are not enough, that everybody needs 1000000 games?

For me, the situation is easier: Because I require significance at a much higher level, I can reject changes more quickly (or simply conclude "this change is not significant at my required level, so I define it to be insufficient for my purpose"), even changes that would perhaps turn out to be significant after 20000 games.

It is not meaningful to claim that everybody needs the same minimum number of games.

Your example with Black Jack is not appropriate either, because the variance is much higher than it can ever be for Chess. Same for Poker (the only other Casino game that in the long run actually allows skilled players to win).

I have some figures for BJ and Poker back in Berlin in some Malmuth or Sklansky book; I could bring them in about a week.

Or we could post our question on the two-plus-two forums (where there are a lot of Statistics experts, and I am sure they would give their left arm if you could give the variance you are seeing to their Poker careers).

What I'm trying to do in this thread is to find out exactly what variance you, Bob, are seeing, and what variance I'm seeing.

Claiming that Basic Statistics is inappropriate is not exactly constructive unless you can put forward reasoning why more Advanced Statistics is needed. So far I haven't seen this reasoning.
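
To put rough numbers on "for a given confidence factor", here is a small sketch under a normal approximation, with a made-up per-game standard deviation of 0.45 and the usual logistic slope near a 50% score for the Elo conversion: it prints the 95% error bar on the match score, and its approximate Elo equivalent, for various match lengths.

    import math

    def score_error_95(sigma_per_game, games):
        # 95% error bar on the mean score per game (normal approximation).
        return 1.96 * sigma_per_game / math.sqrt(games)

    def score_to_elo(delta_score):
        # Slope of the logistic Elo curve near a 50% score: 1 Elo ~ ln(10)/1600 score points.
        return delta_score * 1600.0 / math.log(10)

    sigma = 0.45   # hypothetical per-game standard deviation of the result (0, 0.5 or 1)
    for games in (20, 80, 320, 1000, 20000):
        err = score_error_95(sigma, games)
        print(f"{games:6d} games: 95% error bar ~ +/-{err:.3f} score "
              f"(~ +/-{score_to_elo(err):.0f} Elo)")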
nczempin

Re: An objective test process for the rest of us?

Post by nczempin »

Regarding the use of multiple positions:

This will reduce variation of the overall results, because any bias in just one particular position will be reduced. And there may very well be changes where the result in 39 positions is unchanged, but in only one it is much improved.

Unfortunately, even those 40 positions could be biased, and it could take 400 positions. The same principle applies here as for the number of games: yes, more is better, but if you have fewer, you simply need a larger observed difference to conclude with a certain confidence that the results are not merely random.

So without loss of generality we can, for our analysis, just use one position that so far has turned out to be a fairly good discriminator. Yes, there are numerous pitfalls you can fall into before you can actually draw a valid conclusion, but since we are not interested in determining whether a specific change Bob has made is significant, but rather in the general concepts, we can use just the one.
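
A rough way to see the effect of the number of start positions, assuming each position contributes its own independent bias to the score (all numbers here are invented for illustration): the bias left over after averaging shrinks roughly with the square root of the number of positions.

    import random
    import statistics

    random.seed(1)

    def residual_bias_sd(num_positions, bias_sd=0.1, trials=2000):
        # Assume each start position carries an independent bias (in score points)
        # drawn from N(0, bias_sd); return the spread of the bias that is left
        # after averaging over all positions.
        averages = []
        for _ in range(trials):
            biases = [random.gauss(0.0, bias_sd) for _ in range(num_positions)]
            averages.append(sum(biases) / num_positions)
        return statistics.pstdev(averages)

    for positions in (1, 40, 400):
        print(f"{positions:4d} positions: residual bias sd ~ {residual_bias_sd(positions):.4f}")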
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

hgm wrote:
bob wrote:Ugh. So your way of eliminating the random factor is to take the _first_ result for each position and use it over and over?

Doing that, my coin either throws nothing but heads, or nothing but tails...
Yes, so what? I am not running these tests to test my opponents. Very nice of them that they _can_ generate multiple moves from the same position. (Not that it would hinder me much if they couldn't; there are plenty of other opponents that would generate other moves.) But I as a tester will decide when I want to present my engines with the same opponent moves, or with different ones. In particular I want A and A' to see the same moves. Great if I can also use a different second, third and fourth move of that same opponent from that same position. But then I would still want A, A' and A" all to see all those moves. Not A one subset, A' another subset, and A" yet another subset. Because that is exactly what causes unnecessary noise. And if I can reduce the noise by a factor N, I can reduce the number of games by a factor N^2...
Apparently nothing I have said has registered. If you do a simple node-count search limit, you will produce perfect repeatability. The question is, is the set of games you happen to produce actually representative of how the thing plays? It is just a small random sample, which means "NO" is the answer.

Each time you encounter the _same_ position in a game, your program can play a different move due to the timing issues. Since there is that random component, a sample of the potential games from that position is far more informative than a sample size of one.

If you believe that methodology is OK, then go for it. I once believed it as well until I obtained the facilities to delve into this issue in extreme detail, where I learned things I had no idea were there.

I'm not going to argue the case. I've already run millions of games, our "team" has looked at the ridiculous level of non-determinism we have been producing on the cluster, and we've spent hundreds of hours going over the results to understand what is going on in them.

The bottom line is that large sample size == high level of confidence in the results, small sample size == very low level of confidence in the results. It really is that simple. And trying to finagle ways to somehow get large-sample confidence out of a small sample isn't going to work.

I would think that everyone sees the importance of an _accurate_ assessment approach to decide whether a change is good or bad. It is _the_ way to make progress. Trying to short-cut the process is just a "random walk" experiment. Eventually you will "get there" but after a whole lot more false steps than the methodical approach.
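
The run-to-run instability can be illustrated with a toy simulation (the win/draw probabilities below are invented, and both sides are fixed; the point is only the spread, not the absolute numbers): repeat the same match at several lengths and look at how far the observed scores scatter.

    import random
    import statistics

    random.seed(7)

    def match_score(games, p_win=0.30, p_draw=0.40):
        # Hypothetical per-game probabilities for a roughly equal pairing.
        score = 0.0
        for _ in range(games):
            r = random.random()
            if r < p_win:
                score += 1.0
            elif r < p_win + p_draw:
                score += 0.5
        return score / games

    for games in (80, 320, 20000):
        runs = [match_score(games) for _ in range(10)]
        print(f"{games:6d} games: scores from 10 repeated runs "
              f"{min(runs):.3f}..{max(runs):.3f} (sd {statistics.pstdev(runs):.3f})")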
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

nczempin wrote:
bob wrote:
nczempin wrote:
bob wrote:
nczempin wrote: Could you give me the correct way to analyse my example statistically?
I don't think we have such a methodology yet. At least I have not found one after all the games I have played using our cluster. This is really a difficult problem to address. I'm simply trying to point out to everyone that a few dozen games are a poor indicator. Again the easiest way to see why is to run the same "thing" more than once and look at how unstable the results are.
But how unstable can it get?

Suppose you play a match of two games (one as White and one as Black) and your engine wins both games. Then you play another match, and your engine loses both games.

That is the maximum variance you will ever be able to get. Correct me if I'm wrong.
Isn't that bad enough? One test says "new feature is great". Second test says "new feature is horrible." How bad is that? How could it get any worse, in fact???
Well, that is exactly what I'm saying: You are seeing the maximum variance (it can't get any worse), which means that from the observations so far, you cannot draw the conclusion that it is stronger or that it is weaker.
Now we are getting somewhere. Note that those 4 samples represent _320_ games. And you agree we can't draw any conclusions from them. That has been my point all along. 320 games is worthless for this.



If you play one more match after these two, your variance is necessarily lower; it cannot become higher (and it cannot stay at the maximum).
Sorry, but in _my_ statistical world, another sample _can_ increase the variance. Samples 1, 2, 3, 4 are 5, 7, 5, 2; sample 5 is 20. You are telling me the variance didn't increase???



I am not proposing to draw the conclusion "new feature is x" after a two-game match. But 20000 is way over the number of games you need to say, for a given confidence factor, that your change is not significant.

There are certainly changes that will only turn out to be significant after 20000 games. And it may well be that you have reached the level of maturity where the changes are so small.

Once you have implemented all the changes you have found that are significant after 20000 games, your changes will need to be tested at an even higher number of games. Will you then tell everybody that 20000 games are not enough, that everybody needs 1000000 games?
I am simply telling everybody that 100 games is _not_ enough. Not how many "is" enough.

For me, the situation is easier: Because I require significance at a much higher level, I can reject changes more quickly (or simply conclude "this change is not significant at my required level, so I define it to be insufficient for my purpose"), even changes that would perhaps turn out to be significant after 20000 games.
You do realize that with 95% confidence, you will make a false step 1 out of 20 times? And sometimes more frequently, when the samples are so small that the variance can be very high due to the inherent non-determinism in the results.
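
A quick simulation of that 1-in-20 point, assuming two versions of identical strength, 100-game matches, and a two-sided test on the score at the 95% level (the win/draw probabilities are made up):

    import random
    import math

    random.seed(3)

    P_WIN, P_DRAW = 0.30, 0.40   # hypothetical per-game probabilities, equal strength
    SD = math.sqrt(P_WIN * 1.0 + P_DRAW * 0.25 - 0.5 ** 2)   # per-game sd of the result

    def looks_significant(games=100):
        # Both versions are identical by construction, so any "significant"
        # result at the 95% level is a false step.
        score = 0.0
        for _ in range(games):
            r = random.random()
            score += 1.0 if r < P_WIN else (0.5 if r < P_WIN + P_DRAW else 0.0)
        mean = score / games
        z = (mean - 0.5) / (SD / math.sqrt(games))
        return abs(z) > 1.96

    trials = 10000
    false_steps = sum(looks_significant() for _ in range(trials))
    print(f"{false_steps / trials:.1%} of {trials} tests between identical versions "
          "reported a significant difference")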


It is not meaningful to claim that everybody needs the same minimum number of games.


Perhaps not, but it is _FAR_ worse to claim that most can get by with a hundred games or less. Which of the two cases do you think is more harmful to development? Using pure random results, or using overkill to provide a high level of accuracy?


Your example with Black Jack is not appropriate either, because the variance is much higher than it can ever be for Chess. Same for Poker (the only other Casino game that in the long run actually allows skilled players to win).
Sorry, but that is wrong. The variance for chess is far higher than I would have believed. You saw just some of the data at the top of this thread. Variance in human chess is not nearly so high, I agree. But we aren't talking about human chess.


I have some figures for BJ and Poker back in Berlin in some Malmuth or Sklansky book; I could bring them in about a week.
I understand the game of blackjack, having played for _many_ years. I have Peter Griffin's "Theory of Blackjack" - the bible of blackjack players that are serious about the game.

Or we could post our question on the two-plus-two forums (where there are a lot of Statistics experts, and I am sure they would give their left arm if you could give the variance you are seeing to their Poker careers).

What I'm trying to do in this thread is to find out exactly what variance you, Bob, are seeing, and what variance I'm seeing.

Claiming that Basic Statistics is inappropriate is not exactly constructive unless you can put forward reasoning why more Advanced Statistics is needed. So far I haven't seen this reasoning.
Elo's entire premise was based on human observation, analyzed to death. I have stated that human performance is far more "reproducible" than computer vs computer performance. How would you explain this:

I ran 320 games thru elostat months ago, and got a rating + an error bar. I took the next 320 games, ran them through, and got a far different Elo, with an error bar. The lower rating + error bar was _far_ removed from the higher rating - error bar. What does that mean? Neither number means a thing, and if you look at the wrong one you draw the wrong conclusion.

Humans don't produce that kind of variability.
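
One way to quantify that kind of disagreement (not elostat itself, just a back-of-the-envelope sketch assuming independent games, a normal approximation, and made-up scores for the two 320-game blocks): convert each block into an Elo estimate with a 95% error bar and check whether the intervals overlap. If non-overlapping intervals turn up much more often than the well-under-1% rate that independence predicts, the games are not behaving like independent samples.

    import math

    def elo(p):
        # Exact logistic conversion from expected score to an Elo difference.
        return -400.0 * math.log10(1.0 / p - 1.0)

    def elo_interval(score, games, sd_per_game=0.45):
        # 95% interval on the per-game score (normal approximation with a
        # hypothetical per-game standard deviation), then converted to Elo.
        se = 1.96 * sd_per_game / math.sqrt(games)
        return elo(score), elo(score - se), elo(score + se)

    # Two hypothetical 320-game blocks of the same engine against the same field.
    for label, score in (("first 320 games", 0.58), ("next 320 games", 0.45)):
        point, lo, hi = elo_interval(score, 320)
        print(f"{label}: {point:+6.1f} Elo  (95% interval {lo:+6.1f} .. {hi:+6.1f})")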
nczempin

Re: An objective test process for the rest of us?

Post by nczempin »

bob wrote:
nczempin wrote: Claiming that Basic Statistics is inappropriate is not exactly constructive unless you can put forward reasoning why more Advanced Statistics is needed. So far I haven't seen this reasoning.
Elo's entire premise was based on human observation, analyzed to death. I have stated that human performance is far more "reproducible" than computer vs computer performance. How would you explain this:

I ran 320 games thru elostat months ago, and got a rating + an error bar. I took the next 320 games, ran them through, and got a far different Elo, with an error bar. The lower rating + error bar was _far_ removed from the higher rating - error bar. What does that mean? Neither number means a thing, and if you look at the wrong one you draw the wrong conclusion.

Humans don't produce that kind of variability.
I don't think I mentioned Elo in any way. Perhaps we should split this thread. I am also not interested in those other issues regarding depth etc.; I would appreciate it if that subject were split off, too.

I'm also concerned that you seem to simply ignore some of my most important points. Or perhaps I should give you more time.

You make such a big thing out of 320 games, when I have already asked to simplify the discussion by separating out the 40-base-position issue.
nczempin

Re: An objective test process for the rest of us?

Post by nczempin »

bob wrote:
nczempin wrote:
If you play one more match after these two, your variance is necessarily lower; it cannot become higher (and it cannot stay at the maximum).
Sorry, but in _my_ statistical world, another sample _can_ increase the variance. Samples 1, 2, 3, 4 are 5, 7, 5, 2; sample 5 is 20. You are telling me the variance didn't increase???
No, I didn't say variance can never increase. I only said that the variance cannot increase in the example I gave, which was 2 for the first result, and 0 for the second, where 2 is the maximum any sample can have, and 0 is the minimum.
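
Working both sets of numbers through the ordinary population-variance formula makes the two positions concrete (a quick sketch; the values are the ones quoted above, with a third match score of 1 added purely as an example of any value a bounded score could take):

    import statistics

    print(statistics.pvariance([5, 7, 5, 2]))       # ~3.19
    print(statistics.pvariance([5, 7, 5, 2, 20]))   # ~39.76 -> another sample CAN raise the variance

    print(statistics.pvariance([2, 0]))             # 1.0, the maximum for match scores bounded by 0 and 2
    print(statistics.pvariance([2, 0, 1]))          # ~0.67 -> a third bounded score can only pull it down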
nczempin

Re: An objective test process for the rest of us?

Post by nczempin »

bob wrote:
nczempin wrote: I have some figures for BJ and Poker back in Berlin in some Malmuth or Sklansky book; I cculd bring them in about a week.
I understand the game of blackjack, having played for _many_ years. I have Peter Griffin's "Theory of Blackjack" - the bible of blackjack players that are serious about the game.
Yes, that book is quoted in many texts, but it is not the only good book on Blackjack (presumably; I am not all that interested in BJ right now), nor the only good book on Gambling or Statistics.