An objective test process for the rest of us?

Discussion of chess software programming and technical issues.

Moderator: Ras

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

That book is _the_ book on blackjack, from a theoretical perspective. Absolutely nothing else comes close to it.

Go to any advantage-play web site and ask.
nczempin

Re: An objective test process for the rest of us?

Post by nczempin »

bob wrote:

So maybe I am now confused, since I am arguing for large N and small sigma squared. So back to the beginning. Do you advocate large N or small N? I will again remind you, I am using 40 positions so that I get a representative cross-section of games: tactical, positional, middlegame attacks, endgame finesses, etc. That requires at least 80 games, since you must alternate colors to cancel unbalanced positions. What N do you advocate?
Well, I can't speak for hgm, but if you insist on using those 40 positions (which is probably a good idea in your situation), then you basically have to take each 80-game match as 1 sample (which means 320 games is only 4 samples; very likely not sufficient despite the lower overall variance). And measure the variance of those samples; the variance of the individual positions is not a very good predictor of the overall variance. If it were, you shouldn't be using them.

I.e., it is actually a good sign that you are seeing high variance over each individual position; you could possibly reduce the number of positions by finding out whether there are any strong correlations between any two of them. Of course, to really draw any conclusions from those correlations, your sample sizes must be large enough (and no, this is not a contradiction of my other statements).
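A minimal sketch of the "each match is one sample" idea (Python; the match scores here are invented for illustration, not data from this thread):

[code]
# Treat each complete 80-game match as ONE sample and look at the spread
# across matches, rather than at the per-position variance.
match_scores = [44.5, 39.0, 47.5, 41.0]   # hypothetical points out of 80 per match

n = len(match_scores)
mean = sum(match_scores) / n
# unbiased sample variance across the (here only 4) match samples
variance = sum((x - mean) ** 2 for x in match_scores) / (n - 1)

print(mean, variance)   # with only 4 samples this estimate is itself very noisy
[/code]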
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

nczempin wrote:
bob wrote:
nczempin wrote: Your example with Black Jack is not appropriate either, because the variance is much higher than it can ever be for Chess. Same for Poker (the only other Casino game that in the long run actually allows skilled players to win).
Sorry, but that is wrong. The variance for chess is far higher than I would have believed. You saw just some of the data at the top of this thread. Variance in human chess is not nearly so high, I agree. But we aren't talking about human chess.
Umm...

What is wrong? My claim that the variance for a zero-sum deterministic game cannot be higher than for a card game?

I never said how high you would have believed the variance was. I also never said how high I believe it is. I'm only saying that it is highly likely to be higher in Black Jack than it is in chess, regardless of whether humans play or computers.
And that is the point I disagree with. Whether humans or computers play is a critical aspect of this. Computers are _far_ more random in their behavior, regardless of the idea that they should be very deterministic. Due to the timing issue that I did not think about to start with, the outcomes of games starting from the same position are far more variable than what you would observe with two humans playing from the same starting position.

And I mean variance in the mathematical sense; I hope you mean the same, and "your statistical world" is the same one that mathematicians have agreed on.
Mine is the sum of the squared differences between any two observations, divided by twice the number of observations.


Of course one has to take care of what exactly one is measuring; if I measured, say, how many times the white queen went to d3, I would find much more variance than in the end-of-year bankroll delta of a professional blackjack player.
Actually I doubt that is true. I know pro players who have been up or down 250,000 dollars at the end of a year, yet they almost always end up ahead once enough rounds are played to get the variance down to a reasonable level.
nczempin

Re: An objective test process for the rest of us?

Post by nczempin »

bob wrote:
nczempin wrote:
bob wrote:
nczempin wrote:
If you play one more match after these two, your variance is necessarily lower; it cannot become higher (and it cannot stay at the maximum).
Sorry, but in _my_ statistical world, another sample _can_ increase the variance. Samples 1,2,3,4 are 5, 7, 5, 2, sample 5 is 20. You are telling me the variance didn't increase???
No, I didn't say variance can never increase. I only said that the variance cannot increase in the example I gave, which was 2 for the first result, and 0 for the second, where 2 is the maximum any sample can have, and 0 is the minimum.
I don't disagree. But how is that useful? Are you going to play lots of 2-game matches, get the variance down to the theoretical minimum, and then discover that every test reaches the same result, change or no change? You have to play enough different games so that an improvement gets a chance to be exercised and influence the result. I have settled on 40 different positions, played twice to alternate colors. Is that enough? I'd like more. But then I also want to be able to play enough games to get stable results. 2x more positions would probably require 4x the number of games to get a similar stability level.

I will make the same comment I made to H.G. You can't quote variance for large N to justify playing small N. I've argued for large N all along. Completely consistently...
Well, there are really two sides to the equation:
1. Your estimate of the theoretical variance of matches between engines, be it just for your engine or the theoretical all-engines situation. The more matches you run, the more accurate this estimate will be. How many you need depends only on your required level of confidence (plus the "severity" of the changes you can make; if all of them fall within the variance, you cannot distinguish the results from randomness. I suspect that your situation falls into this area, because Crafty is a very mature engine). In essence, this can never be high enough, and it is only for practical reasons that we could say: okay, 1,000,000 samples and a confidence level of 99.99 % (or perhaps "six sigma"), both values grabbed from thin air, are enough _for me_ (and someone else could disagree and require more, although I somehow doubt it :-)). One could also check what confidence levels are required to have the FDA approve a new drug, or what confidence levels are required for AIDS tests, or anything that you would deem adequate in a more close-to-life situation. (A rough sketch of this kind of sample-size arithmetic follows after point 2 below.)

2. The variance of the test of the new, potentially improved version.
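A rough sketch of that sample-size arithmetic (Python; the 2 % resolution target and the worst-case per-game standard deviation of 0.5 are assumptions for illustration, not numbers from the thread):

[code]
# Rough sample-size estimate under a normal approximation.
# A game scores 0, 1/2 or 1, so the per-game standard deviation
# can never exceed 0.5 (draws only make it smaller).
from math import sqrt, ceil

sigma = 0.5          # worst-case per-game standard deviation
z = 1.96             # ~95 % two-sided confidence
delta = 0.02         # smallest score difference we want to resolve (2 %)

# want z * sigma / sqrt(N) <= delta  =>  N >= (z * sigma / delta) ** 2
n_games = ceil((z * sigma / delta) ** 2)
print(n_games)       # about 2400 games to resolve a 2 % difference at 95 %
[/code]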
nczempin

Re: An objective test process for the rest of us?

Post by nczempin »

bob wrote:
nczempin wrote: I never said how high you would have believed the variance was. I also never said how high I believe it is. I'm only saying that it is highly likely to be higher in Black Jack than it is in chess, regardless of whether humans play or computers.
And that is the point I disagree with. Whether humans or computers play is a critical aspect of this. Computers are _far_ more random in their behavior, regardless of the idea that they should be very deterministic. Due to the timing issue that I did not think about to start with, the outcomes of games starting from the same position are far more variable than what you would observe with two humans playing from the same starting position.
Well, I have no opinion on whether humans are more variable or computers are (thinking about it right now, I'd say it would be expected that computers are more predictable); it is not the issue at hand. It is an interesting thought, and one could find a way to measure it.

The opinion I do have, and I believe someone, perhaps not I, can actually prove this, is that a meaningful comparison (which we haven't actually defined yet) between BJ and chess would show that there is more variance in BJ than in chess. I am sure you will agree once we have agreed on where the meaningful comparison would lie, though I'm not sure that is even possible.
nczempin

Re: An objective test process for the rest of us?

Post by nczempin »

bob wrote:
nczempin wrote: And I mean variance in the mathematical sense; I hope you mean the same, and "your statistical world" is the same one that mathematicians have agreed on.
Mine is the sum of the squared differences between any two observations, divided by twice the number of observations.
Well, mine is this (Wikipedia). Under "Elementary description" it gives this algorithm: "compute the difference between each possible pair of numbers; square the differences; compute the mean of these squares; divide this by 2. The resulting value is the variance."

As far as I can see, this is the same as your definition.
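For anyone who wants to check the equivalence, a minimal sketch (Python), reusing the 5, 7, 5, 2, 20 example from earlier in the thread:

[code]
# Check that the "pairwise differences" description of variance matches
# the textbook sample-variance formula.
from itertools import combinations

data = [5, 7, 5, 2, 20]

# Wikipedia's elementary description: mean of the squared differences of
# every (unordered) pair of observations, divided by 2.
pair_sq = [(a - b) ** 2 for a, b in combinations(data, 2)]
var_pairwise = (sum(pair_sq) / len(pair_sq)) / 2

# Textbook unbiased sample variance.
mean = sum(data) / len(data)
var_textbook = sum((x - mean) ** 2 for x in data) / (len(data) - 1)

print(var_pairwise, var_textbook)   # both are approximately 49.7
[/code]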
nczempin

Re: An objective test process for the rest of us?

Post by nczempin »

bob wrote:
nczempin wrote:
Of course one has to take care of what exactly one is measuring; if I measured, say, how many times the white queen went to d3, I would find much more variance than in the end-of-year bankroll delta of a professional blackjack player.
Actually I doubt that is true. I know pro players who have been up or down 250,000 dollars at the end of a year, yet they almost always end up ahead once enough rounds are played to get the variance down to a reasonable level.
I knew you'd say that; it was a trap :-)

You are well aware, of course, that I was only trying to illustrate a point; I have no idea if it is actually true.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

nczempin wrote:
bob wrote:
nczempin wrote: Once you have implemented all the changes you have found that are significant after 20000 games, your changes will need to be tested at an even higher number of games. Will you then tell everybody that 20000 games are not enough, that everybody needs 1000000 games?
I am simply telling everybody that 100 games is _not_ enough. Not how many "is" enough.
But 100 games could be enough if one side were to win all 100 of them.
Yes, but can we get back to practical answers??? If you win all 100 games, how can you tell if your change is good or bad when you still win all 100? So that's worthless and not worth discussing in this context.

In any case, 100 games is not enough if you play the same 100 game match 2 times and get a different indication each time. For example, you play 2 matches before a change and get 50-50 and 45-35. You play 2 matches after the change and get 30-70 and 70-30.

And 90-10 would also be enough.

I keep saying this all over, but you seem to simply ignore it.
I seem to be having that same problem. I have not seen you quote a single set of 100-game matches and give the results of each. I did that. My results showed that just running 100 games, which could produce any one of those results I posted (or many other possible results as well), could lead you to the wrong conclusion.

Where is your data to show that 100 games is enough? It is easy enough to run my test and post the results to prove they are stable enough that 100 is enough.

There is an accepted way of dealing with limited sample sizes, but you're just saying that 100 games can never be enough.
Yes I am. I posted samples where A beat B and lost to B, yet A is provably stronger than B. I posted sample games where A beat B and lost to B where they are equal. And I posted games where A beat B and lost to B and A was far stronger than B.

Please provide some data to contradict mine. And don't just rely on your program if it is pretty deterministic. Try others as I did, to convince yourself things are not as stable as you think, which means the number of games needed (N) is much larger than you think.


And I am saying that 100 games can be enough, and statistics provides some methods to find out exactly when this is the case. And please don't give me that "95 % confidence means 1 out of 20 is wrong". Okay, so you require what confidence level? 99 %? Wow, you'll be wrong 1 time out of 100.
With 95% you will be wrong on one of every 20 changes. Too high. Being wrong once can negate 5 previous rights...


To me, this whole discussion seems to be: you were surprised that the variability was much higher than you expected, and now you're on a crusade to tell everybody "100 games is not enough" without qualification (as in the situation under which it applies; I am not questioning your credentials). And it is this missing qualification that concerns me.

[Well, let me tell you: 20000 games is not enough. I changed one line in the code and so far I haven't noticed any significant results. What do you have to say about that?]

20000 might not be enough. However, I have no idea what you mean by "without qualification". I believe, if you re-read my post, I qualified what I found very precisely. I tried games from a pool of 6 opponents: two significantly better than the rest, two pretty even in the middle of the pack, and two that were much worse. No matter who played whom, the variance was unbearable over 80-160-320 games. Yes, you can get a quick idea of whether A is better than B or not with fewer games. But I am not doing that. I want to know if A' is better than A, where the difference is not that large. Somehow the discussion keeps drifting away from that point.

If you want to discuss something else, fine. But _my_ discussion was about this specific point alone. Nothing more, nothing less. Somehow the subject keeps getting shifted around however.

Back to the main idea, once again. I have 6 programs that I play against each other, one of which is mine. Generally I only play mine against 4 of that set of 5 (I weed out one that is significantly worse, as one of those is enough). I want to run Crafty against the set of 4 opponents, then run Crafty' against the same set, and then be able to conclude with a high degree of accuracy whether Crafty or Crafty' is better. Nothing more, nothing less. 80 or 160 or 320 games is absolutely worthless to make that determination. I have the data to prove this convincingly.

Any other discussion is not about my premise at all, and I don't care about them. I don't care which of my pool of programs is stronger, unless you count Crafty and Crafty'. I don't care about really large N as it becomes intractable. I don't care about really small N (N <=100) as the data is worthless. So I am somewhere in the middle, currently using 20K games, 5K against each of 4 opponents, and still see a small bit of variance there as expected. But not a lot.

So can we discuss _that_ issue alone? Comparing an old to a new version to decide whether the new version is stronger. And get off of all the tangential issues that keep creeping in?

I have stated my case concisely above. I have offered data to support it. Feel free to offer data that refutes it, and to have that scrutinized as I did my own when I started this testing last year...
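To put rough numbers on why 80-320 games cannot separate small differences while 20K can, a back-of-the-envelope sketch (Python; the per-game standard deviation of 0.5 is a worst-case assumption, and real values are lower because of draws):

[code]
# Approximate 95 % margin of error on the match score (in percentage
# points) for a few match lengths, assuming independent games.
from math import sqrt

sigma_per_game = 0.5                 # upper bound for 0, 1/2, 1 scoring

for n_games in (100, 320, 20000):
    se = sigma_per_game / sqrt(n_games)        # standard error of the mean score
    print(n_games, round(2 * se * 100, 1))     # ~95 % margin in percentage points
# roughly: 100 -> 10.0, 320 -> 5.6, 20000 -> 0.7
[/code]

So a 100-game match can only separate versions whose scores differ by on the order of ten percentage points, far more than a typical small change is worth.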
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

nczempin wrote:
bob wrote:
hgm wrote: Yes, variance does not always have to be finite. But to have infinite variance, a quantity should be able to attain arbitrarily large values (implying that it should also be able to attain infinitely many different values).

For chess scores, a game can result in only 3 scores: 0, 1/2 or 1. That means that the variance can be at most 1/2 squared = 1/4. The standard deviation is always limited to half the range between the maximum and the minimum outcome, and pathological cases can only occur when this range is infinite. It can then even happen that the expectation value does not exist. But no such thing happens with chess scores.
The problem with that statement is that you are basing it on an infinite number of games, while only playing a small number of games in the argument.
I don't think infinity in the number of games has anything to do with infinity in the possible scores.

And the infinity in the number of games (samples) is taken care of by the Statistics; all natural sciences are based on it. Yes, we do not know for sure if the Sun will still rise each day in 100 years, but so far the statistical evidence has been pretty good.
But if you start from scratch, and watch the sun for only 100 days, and the last observation ends in an eclipse, what do you conclude?

Of course discussing events with zero variance is not very useful. But a small number of observations in a real situation might be different. Suppose every day is cloudy. Do you conclude there is no sun?

Silly arguments...
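For what it's worth, the 1/4 bound quoted from hgm above checks out; with standard textbook reasoning, for a per-game score S in [0, 1]:

\[
\operatorname{Var}(S) \;=\; E\!\left[(S-\tfrac{1}{2})^{2}\right] - \left(E[S]-\tfrac{1}{2}\right)^{2}
\;\le\; E\!\left[(S-\tfrac{1}{2})^{2}\right] \;\le\; \left(\tfrac{1}{2}\right)^{2} \;=\; \tfrac{1}{4},
\]

with equality exactly when half the games are wins and half are losses.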
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

nczempin wrote:
bob wrote: I was originally talking about variance in the engineering sense. But I have been using the usual statistical term recently. In that, when we started testing on our cluster, I could run 26-game matches and get 2-24, 24-2 and 13-13.
I don't get it. What is variance in the engineering sense, and how does it differ from variance in the Mathematical/Statistical sense? I always thought engineers use results and principles such as variance and don't redefine the term.
Variance in the engineering sense is based on the same variance we use in statistics, except without the squared term. How much does the speed of an object vary as it moves? They are related, but I am really interested in "how many different results do I get, and how different are they from the truth?" Statistical variance is just a more precise formulation of that idea. Perhaps "variability" is a better word than "variance". But either one works just fine in this context, as I showed.
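Reading "the same variance ... without the squared term" as the mean absolute deviation (an interpretation, not a definition given in the thread), a minimal sketch comparing the two on the 26-game match scores mentioned above:

[code]
# Spread of the same engine's scores (points out of 26) in three
# 26-game matches that finished 2-24, 24-2 and 13-13.
scores = [2, 24, 13]

mean = sum(scores) / len(scores)                                  # 13.0
mad = sum(abs(x - mean) for x in scores) / len(scores)            # ~7.33 ("engineering" spread)
var = sum((x - mean) ** 2 for x in scores) / (len(scores) - 1)    # 121.0 (statistical variance)

print(mean, mad, var)
[/code]

Either number makes the same practical point: identical 26-game matches can swing from a near-whitewash in one direction to a near-whitewash in the other.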