An objective test process for the rest of us?

Discussion of chess software programming and technical issues.


nczempin

Re: An objective test process for the rest of us?

Post by nczempin »

bob wrote:
nczempin wrote: Your example with Black Jack is not appropriate either, because the variance is much higher than it can ever be for Chess. Same for Poker (the only other Casino game that in the long run actually allows skilled players to win).
Sorry, but that is wrong. The variance for chess is far higher than I would have believed. You saw just some of the data at the top of this thread. Variance in human chess is not nearly so high, I agree. But we aren't talking about human chess.
Umm...

What is wrong? My claim that the variance of a zero-sum deterministic game cannot be higher than that of a card game?

I never said anything about how high you believed the variance to be, and I never said how high I believe it is. I am only saying that it is highly likely to be higher in blackjack than in chess, regardless of whether humans or computers are playing.

And I mean variance in the mathematical sense; I hope you mean the same, and "your statistical world" is the same one that mathematicians have agreed on.

Of course one has to be careful about what exactly one is measuring; if I measured, say, how many times the white queen went to d3, I would find much more variance than in the end-of-year bankroll delta of a professional blackjack player.
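
As a toy illustration of this point (the blackjack payoff distribution below is made up, not real casino odds), here is a short Python sketch comparing the per-game variance of a chess score, which lies in [0, 1], with the per-hand variance of a blackjack payoff, which spans a wider range:

```python
# Toy comparison (hypothetical probabilities, not real casino odds) of the
# per-game variance of a chess score with the per-hand variance of a
# blackjack payoff.
def variance(dist):
    """Variance of a discrete distribution given as {outcome: probability}."""
    mean = sum(p * x for x, p in dist.items())
    return sum(p * (x - mean) ** 2 for x, p in dist.items())

chess = {0.0: 0.35, 0.5: 0.30, 1.0: 0.35}        # hypothetical win/draw/loss rates
blackjack = {                                     # made-up payoff distribution, in bet units
    -2.0: 0.05, -1.0: 0.44, 0.0: 0.09, 1.0: 0.35, 1.5: 0.045, 2.0: 0.025,
}
print("chess per-game variance:    ", round(variance(chess), 3))      # 0.175
print("blackjack per-hand variance:", round(variance(blackjack), 3))  # ~1.19
```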
bob

Re: An objective test process for the rest of us?

Post by bob »

hgm wrote:
bob wrote:the "standard error" might be 2 points. The variance in such a match is _far_ higher. Just play 26 game matches several times. You will _not_ get just a 2 game variance. I've already posted results from several programs including fruit, glaurung, arasan, gnuchess and crafty. So talking about standard error is really meaningless here, as is the +/- elostat output. It just is not applicable based on a _huge_ number of small matches...
Wait a minute! Are we talking here about 'variance' in the usual statistical meaning, as the square of the standard deviation?

Then it is safe to say that what you claim is just nonsense. The addition law for variances of combined events (and the resulting central limit theorem) can be mathematically _proven_ under very weak conditions. The only thing that is required is that the variances be finite (which in Chess they necessarily are, as all individual scores are finite), and that the events are independent.

So what you claim could only be true if the results of games within one match were dependent on each other, in a way that winning one would give you a larger probability of winning the next. As we typically restart our engines, that doesn't seem possible without violating causality...

If you play a large number of 26-game matches, the results will be distributed with an SD of at most 0.5*sqrt(26), which is about 2.5, and you can only reach that if the engines cannot play draws and score near 50% on average. With normal draw percentages it will be about 2 points (i.e. the variance will be about 4 square points).

No way the variance is ever going to be any higher.
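
As a side calculation, a minimal Python sketch of the figures quoted above, assuming independent games, a 50% expected score, and a hypothetical 30% draw rate for the with-draws case:

```python
# SD of the total score of a 26-game match between equal opponents,
# with and without draws; the 30% draw rate is an illustration, not a
# measured figure.
import math

N = 26
var_no_draws = 0.5 * 0.5                 # win/loss only, 50% expected score
p_draw = 0.30                            # hypothetical draw rate
p_win = (1 - p_draw) / 2                 # still a 50% expected score
var_draws = p_win * 1.0 + p_draw * 0.25 - 0.5 ** 2   # E[s^2] - E[s]^2 per game

print("SD of 26-game total, no draws: ", round(math.sqrt(N * var_no_draws), 2))  # ~2.55
print("SD of 26-game total, 30% draws:", round(math.sqrt(N * var_draws), 2))     # ~2.13
```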
I was originally talking about variance in the engineering sense, but I have been using the usual statistical term recently. In that sense: when we started testing on our cluster, I could run 26-game matches and get 2-24, 24-2 and 13-13.

So the three samples are -22, +22 and 0. The mean of the squares is (22^2 + 22^2 + 0^2) / 3 = 968 / 3, which is about 323. Divide by 2 to compute the variance (about 161). Is that the variance you are talking about? Square the difference of each pair of observations, then compute the mean of that and divide by 2? I claim that number is _very_ high for computer vs computer chess games. I claim it goes down as the number of games goes up. I claim that until the number of games is far larger than one might originally suspect, the variance is extremely high, making the results highly random and unusable. I can't deal with variance that large and conclude anything useful.
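
For reference, a minimal Python check of those three match results under the textbook definitions (population variance and the Bessel-corrected sample variance):

```python
# The three 26-game match results (2-24, 24-2, 13-13) as score differentials.
import statistics

results = [-22, 22, 0]
print("mean:               ", statistics.mean(results))        # 0
print("population variance:", statistics.pvariance(results))   # 968 / 3, about 322.7
print("sample variance:    ", statistics.variance(results))    # 968 / 2 = 484
```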

So I agree with your "if you play a large number". But I have been arguing for a large number of games to reduce variance. You have been arguing for a small number of games, which absolutely increases the variance. We just don't seem to agree on how much this increase actually is...

So maybe I am now confused, since I am arguing for large N and small sigma^2. So back to the beginning: do you advocate large N or small N? I will again remind you that I am using 40 positions so that I get a representative cross-section of games: tactical, positional, middlegame attacks, endgame finesses, etc. That requires at least 80 games, since you must alternate colors to cancel out unbalanced positions. What N do you advocate? I am finding it necessary to run about 5K games per opponent to get N up and sigma^2 down enough to make the evaluation accurate. I also choose to use multiple opponents, as I don't want to tune to beat X only to lose to Y and Z because X has problems with certain types of positions that skew those results. I am not sure whether my 20K total games (5K x 4 opponents) is enough or overkill. But I am absolutely certain that 500 is not enough, because of the variance I have measured.
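
To put rough numbers on that trade-off, here is a minimal Python sketch (not taken from the thread, and not the poster's actual methodology) of how the error bar on a measured score shrinks with the number of games N, assuming independent games and the worst-case per-game standard deviation of 0.5; the Elo conversion uses the standard logistic formula:

```python
# How the error bar on a measured score shrinks with the number of games N,
# assuming independent games and the worst-case per-game SD of 0.5.
import math

def elo_diff(score):
    """Elo difference corresponding to an expected score strictly between 0 and 1."""
    return -400.0 * math.log10(1.0 / score - 1.0)

for n in (100, 500, 5000, 20000):
    se = 0.5 / math.sqrt(n)           # standard error of the score fraction
    elo = elo_diff(0.5 + 2 * se)      # ~95% interval is about two standard errors
    print(f"N = {n:6d}: 50% +/- {2 * se:.1%} score, roughly +/- {elo:.0f} Elo")
```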

Also, with my cluster I can test parallel searches, and I would really like (and plan) to test against other parallel-search engines to compare their parallel search effectiveness. But that introduces even more non-determinism and will drive N even larger. It would also be nice to test a book in the same way, again further increasing N. Not to mention pondering. So far I have tried to whittle away every source of non-determinism except for the timing of the moves. Interfering with that produces stability, but the match results then generally do not come close to the results produced with real time limits over more games.
bob

Re: An objective test process for the rest of us?

Post by bob »

hgm wrote:Yes, variance does not always have to be finite. But to have infinite variance, a quantity must be able to attain arbitrarily large values (implying that it must also be able to attain infinitely many different values).

For chess scores a game can result in only 3 scores: 0, 1/2 or 1. That means that the variance can be at most (1/2)^2 = 1/4. The standard deviation is always limited to half the range between the maximum and the minimum outcome, and pathological cases can only occur when this range is infinite. It can then even happen that the expectation value does not exist. But no such thing happens with chess scores.
The problem with that statement is that you are basing it on an infinite number of games, while in this argument only a small number of games is actually played.
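
A minimal sketch of the bound being discussed, using a few hypothetical win/draw/loss distributions; whatever the probabilities, the per-game variance never exceeds 1/4:

```python
# Per-game variance of a chess score (0, 1/2 or 1) for a few hypothetical
# win/draw/loss distributions; it never exceeds 1/4.
def game_variance(p_win, p_draw, p_loss):
    mean = 1.0 * p_win + 0.5 * p_draw + 0.0 * p_loss
    return 1.0 * p_win + 0.25 * p_draw - mean ** 2   # E[s^2] - E[s]^2

for probs in [(0.5, 0.0, 0.5), (0.35, 0.30, 0.35), (0.2, 0.6, 0.2), (0.9, 0.05, 0.05)]:
    print(probs, round(game_variance(*probs), 4))    # 0.25, 0.175, 0.1, 0.0569
```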
nczempin

Re: An objective test process for the rest of us?

Post by nczempin »

bob wrote:
nczempin wrote: Once you have implemented all the changes you have found that are significant after 20000 games, your changes will need to be tested at an even higher number of games. Will you then tell everybody that 20000 games are not enough, that everybody needs 1000000 games?
I am simply telling everybody that 100 games is _not_ enough. Not how many "is" enough.
But 100 games could be enough if one side were to win all 100 of them.

And 90-10 would also be enough.

I keep saying this all over, but you seem to simply ignore it.

There is an accepted way of dealing with limited sample sizes, but you're just saying that 100 games can never be enough.

And I am saying that 100 games can be enough, and statistics provides methods to find out exactly when this is the case. And please don't give me that "95 % confidence means 1 out of 20 is wrong". Okay, so what confidence level do you require? 99 %? Wow, you'll be wrong 1 time out of 100.
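
As an illustration of the kind of calculation meant here, a minimal Python sketch, assuming 100 independent games with no draws and the hypothetical 90-10 result mentioned above:

```python
# How surprising would a 90-10 result be, assuming 100 independent games,
# no draws, and two engines of truly equal strength?
import math

n, wins = 100, 90
se = math.sqrt(0.5 * 0.5 / n)                # standard error of the score fraction
z = (wins / n - 0.5) / se                    # standard errors away from 50%
tail = sum(math.comb(n, k) for k in range(wins, n + 1)) / 2 ** n   # exact P(X >= 90)
print(f"z = {z:.1f} standard errors, exact tail probability = {tail:.2e}")
```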


To me, this whole discussion seems to be: you were surprised that the variability was much higher than you expected, and now you're on a crusade to tell everybody "100 games is not enough" without qualification, i.e. without stating the situations in which it applies (I am not questioning your credentials). And it is this missing qualification that concerns me.

[Well, let me tell you: 20000 games is not enough. I changed one line in the code and so far I haven't noticed any significant results. What do you have to say about that?]
nczempin

Re: An objective test process for the rest of us?

Post by nczempin »

bob wrote: I was originally talking about variance in the engineering sense. But I have been using the usual stastical term recently. In that when we started testing on our cluster, I could run 26 game matches, and get 2-24, 24-2 and 13-13.
I don't get it. What is variance in the engineering sense, and how does it differ from variance in the mathematical/statistical sense? I always thought engineers use results and principles such as variance and don't redefine the terms.
nczempin

Re: An objective test process for the rest of us?

Post by nczempin »

bob wrote: I am not sure my 20K total games (5K x 4 opponents) is either enough or overkill. But I am absolutely certain that 500 is not enough because of the variance I have measured.
But the variance you have measured can only apply to your particular situation.

What you are doing is claiming that it is representative of everybody else's situation. And whether it is overkill or not can actually be determined; you don't have to guess like this.
nczempin

Re: An objective test process for the rest of us?

Post by nczempin »

Just another illustration:

If I teach my non-tablebase-using program some knowledge about how to handle KPK endgames, a wild guess (which could of course be measured in a better way, but that's not the concern here) would be that I'd need perhaps 20,000 games before the effect became noticeable [this is a somewhat contrived example, because I would be assuming that I wasn't sure the change improved the program; we could redefine the test to check whether that code is bug-free].

So if I don't have a significant result after 100, 1,000 or 10,000 games, I wouldn't be concerned at all. (In practice, I would test endgame position suites and then assume that scoring better on them is enough for me.) My engine is extremely weak tactically, and the opening book leans somewhat towards getting quick results (IMHO good advice for any player <2000), so it will reach endgames very rarely, and endgames where KPK is significant even more rarely.

But if I double my NPS and do not see an improvement (remember, NPS is my limiting factor right now) after, say, 10 games (also a completely arbitrary number), I would conclude that I must have introduced a bug somewhere.
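
A rough sketch of that reasoning, with purely hypothetical effect sizes (none of the percentages below are measurements):

```python
# Rough number of games needed before an expected score gain stands out from
# the noise, i.e. before the gain exceeds two standard errors of the measured
# score (worst-case per-game SD of 0.5, independent games).
import math

for gain in (0.005, 0.02, 0.05, 0.20):       # hypothetical score gains: 0.5% .. 20%
    n_needed = math.ceil((2 * 0.5 / gain) ** 2)
    print(f"expected gain of {gain:.1%} -> roughly {n_needed} games")
```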
nczempin

Re: An objective test process for the rest of us?

Post by nczempin »

bob wrote:
hgm wrote:Yes, variance does not always have to be finite. But to have infinite variance, a quantity must be able to attain arbitrarily large values (implying that it must also be able to attain infinitely many different values).

For chess scores a game can result in only 3 scores: 0, 1/2 or 1. That means that the variance can be at most (1/2)^2 = 1/4. The standard deviation is always limited to half the range between the maximum and the minimum outcome, and pathological cases can only occur when this range is infinite. It can then even happen that the expectation value does not exist. But no such thing happens with chess scores.
The problem with that statement is that you are basing it on an infinite number of games, while in this argument only a small number of games is actually played.
I don't think infinity in the number of games has anything to do with infinity in the possible scores.

And the infinity in the number of games (samples) is taken care of by statistics; all natural sciences are based on it. Yes, we do not know for sure whether the Sun will still rise each day in 100 years, but so far the statistical evidence has been pretty good.
nczempin

Re: An objective test process for the rest of us?

Post by nczempin »

bob wrote:
So the three samples are -22, +22 and 0. The mean of the squares is (22^2 + 22^2 + 0^2) / 3 = 968 / 3, which is about 323. Divide by 2 to compute the variance (about 161). Is that the variance you are talking about? Square the difference of each pair of observations, then compute the mean of that and divide by 2?



I claim that number is _very_ high for computer vs computer chess games.
This claim is essentially meaningless unless you define at least:
1. What exactly you mean by "very high"
2. Under which conditions this variance is defined. Claiming that games at Crafty's level are representative of "computer vs computer chess games" in general is very bold, and I don't see how you could support it other than with your huge experience (which we all acknowledge); unfortunately that kind of reasoning (and I'm not saying you're using it) would not be very scientific.
I claim it goes down as the number of games goes up.
Well, we apparently need to define exactly which variance we are talking about: the theoretical underlying variance we would get if we were able to play an infinite number of games between all engines there are and ever will be. Anything we do to measure it can only be an approximation.

I agree that the approximation gets better the more measurements you take, but (assuming the engines all stay the same while we're measuring) the underlying variance is constant; it doesn't increase or decrease.
I claim that until the number of games is far larger than one might originally suspect, the variance is extremely high, making the results highly random and unusable. I can't deal with variance that large and conclude anything useful.
Perhaps you cannot deal with such high variance, but there are methods that can take it into account. No matter how big your variance (well, unless the actual underlying variance is at its theoretical maximum, which would mean you're effectively throwing perfect dice), you can still obtain results that fall outside the error margin for a given confidence level.
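
A minimal simulation of that distinction, using made-up win/draw/loss probabilities rather than measured data:

```python
# The underlying per-game variance is fixed by the two engines; playing more
# games only estimates it better, while the uncertainty of the measured score
# is what shrinks with N.
import math
import random

random.seed(1)
P_WIN, P_DRAW = 0.35, 0.30                   # hypothetical true result probabilities
TRUE_VAR = P_WIN + 0.25 * P_DRAW - (P_WIN + 0.5 * P_DRAW) ** 2

def play_game():
    r = random.random()
    return 1.0 if r < P_WIN else (0.5 if r < P_WIN + P_DRAW else 0.0)

for n in (100, 1000, 10000):
    scores = [play_game() for _ in range(n)]
    mean = sum(scores) / n
    est_var = sum((s - mean) ** 2 for s in scores) / (n - 1)    # sample variance
    se_mean = math.sqrt(est_var / n)                            # SE of the score
    print(f"N={n:6d}: per-game variance {est_var:.3f} (true {TRUE_VAR:.3f}), "
          f"score {mean:.3f} +/- {se_mean:.3f}")
```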
bob

Re: An objective test process for the rest of us?

Post by bob »

nczempin wrote:
bob wrote:
nczempin wrote:
If you play one more match after these two, your variance is necessarily lower; it cannot become higher (and it cannot stay at the maximum).
Sorry, but in _my_ statistical world, another sample _can_ increase the variance. Samples 1,2,3,4 are 5, 7, 5, 2, sample 5 is 20. You are telling me the variance didn't increase???
No, I didn't say variance can never increase. I only said that the variance cannot increase in the example I gave, which was 2 for the first result, and 0 for the second, where 2 is the maximum any sample can have, and 0 is the minimum.
I don't disagree. But how is that useful? Are you going to play lots of 2-game matches, get the variance down to the theoretical minimum, and then discover that every test reaches the same result, change or no change? You have to play enough different games that an improvement gets a chance to be exercised and to influence the result. I have settled on 40 different positions, each played twice with colors alternated. Is that enough? I'd like more. But then I also want to be able to play enough games to get stable results. 2x more positions would probably require 4x the number of games to get a similar stability level.

I will make the same comment I made to H.G. You can't quote variance for large N to justify playing small N. I've argued for large N all along. Completely consistently...