Variance reports for testing engine improvements

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

nczempin

Variance reports for testing engine improvements

Post by nczempin »

bob wrote:
nczempin wrote: I'll go and look up the values you posted unless someone beats me to it.
If you don't find them, we can start a new thread and I can post some data. The only problem is that my results are in chunks of 80 games against the same opponent: 40 starting positions, each position played with alternating colors. Probably the best would be to post four 80-game matches between several programs in my test gauntlet...
Okay, please post them if you have them nearby.
User avatar
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Variance reports for testing engine improvements

Post by hgm »

You mean stuff like this?

Code:

01:  -+-=-++=-+=++-=-+-+=+-+--=-+=--++=-++--=-+=-===+-+===+=++-=++--+-+--+--+=--++--- 
02:  -+=+=++-=+--+-=-+-+==-+----+=-=++--+=--+--+---==-+-==+=+---++--+--+-=--+-=--+--- 
03:  -+--=++==+-++-+-+-+-+-==-+-++--+=---+--+-++--+-+===+=+-=+-==+==+---=---++--++=+- 
04:  -+=-=+=--=-++-+-+-+=+-+--+-+-==++=-+==-+--+=-+-+-+=+=--+---==--+-++----++-=++--- 
05:  -+=--+=--+=-+-+---=-+-+----++=-++=-+--=+=-+--+-+-+=+=+-+===++--+-+-==--+----+-+- 
06:  -+---=+=-+-+====+-+-+-+-=+-+==-=+--++--==++-=+---+=+=+-+--=++--+=-+-=--+=-----=- 
07:  -+-+--+--+=-+==-=-+-+-+--+=-=-=++-=++=-+-=+--+-+-+=+-+-++-=++-=+--=-=--++---+=+- 
08:  -+-+-=+--+-==-=-+=+=+=+--+-==-=+=--++----++--+=--+=--+-=+-----=+-++==--++=--+--= 
09:  -+-+-=+--+-+==--+-+-+-+--==----++==++=-+=++--+=+-=-=-+-+=--++-=+--=-+=---=-+--+- 
10:  -+=+-++--+-++-==+-+-+-==-+-+=-=++-=++-=+==+--+-=-+-==+==+--++--+=++==--+----+-+= 
11:  -+---====+-++=--==+---+--+-+----+--+=-=+-=+--+=+-+-+-+-==--=+--+=++-+--++=-++-== 
12:  -+=+=++--+-++-+-=-+-+-+=-+-+-=--+--+=--+=-+==+-=-+=--+-=+-==+-=+-++-=--++---+--- 
13:  -+-+--+==+--==+-+-+-=-+=-+-+=--+-=-++-=--++--+===--+=+=+---++-=+-=+-----+---+--- 
14:  -+-=-++--+-+==+-=-+=+-+=-+=+=--++=-++--+==+--==-===+-+=-+-=-+-=+--+-=-=++--+=-+- 
15:  -+-+-++--+-=+==-+-+-+-+--+-+=---+--++-=+-++---=+-+==-+-+---==-=+--=-+--+=--++-+= 
16:  -+=+=++=-+=++=+=--+-+=+--+=+=---+--++--+=-+==+==-+-+=+-++---+-=+-=+-==-++===+-+- 
17:  -=-+-=+--+-++==-+-+-+-+--+==+-=++--++--+==+-=+-==+=--+-++--++-=+--+----++---+-+- 
18:  -+==-++--+=====-+=+-+-+--+-++--++=-+=--+-++--+=+-+=+-+=++=-++-=+==+==--+--=++--- 
19:  =+-+=+-=-=-=+-+---+-+-+--+-++--++=-++-==-++-=+-+++=-=+-++--=+--+-++-==-++--++-=- 
20:  -+---++-=+=-+-+=+-=-+-+--+-+=--++=-+=-=+-++--+=+=+--=+-++--++--+-=+-+--++--++-+- 
21:  -+=-=++--+=+=-+=+-+-+-+----=--=++=-=---+-++---=+-+=+-+=+--=-+--+--+=+--=+---=--- 
22:  -+-=-=+-==--+=+-=-+-+=+----+=-=++-=++====++--+---+===+=++--++--+--+---=++---+-=- 
23:  -+-=-++=-+-+=---=-=-+--=-+=+-===-=-+=--==++--+=+=+=+=+=++--++-=+--+-----+-----+- 
24:  -+-+-+-=-+-++-+-+---+-+--+-=----+--++--+-=+--=-====+==-++--++-=--=+--=-++--++--- 
25:  =+---+==-+-+==+-=-+-+-+=++-+=--++--=+--+==+--+==-+-===-++--++=-+-==--==++=--+-+- 
26:  -+=+-++--+-=+=+=+-=-+=+--+-+-=-=+-=++--=-++--+-+-+=-=+-++---+-=+-=+-=--=+=-++==- 
27:  -+-+-++--=-+====+-=-+-==-+=-+---+--+=--=-=+===-+-=-+-+-++--=+--+-++-=--++=-++-+= 
28:  =+---+---+-++-+-+-+-+----=-+===++--++-=+==+--+---+-+-+-+=-==+-=+--+==-=+=---+-+- 
29:  =+-+-++-===+--+=+-+-+-=--+=+---++--+=-==-++-=+---+=+=+--+-=++--+-++==--++-=++--= 
30:  -+=+-++-=+-=+-+=+-=-+-+--+-+---++--++-=+=++-==-+=--+=+=++--++----+=-+--+---++--= 
31:  -+=+==+--+-++=+-+-+=+-+=-+-++=-++--+=--+=-+--+-+=--+=+-+--=++--+--+-=--++=--+--- 
32:  -+-=--+-=+-+-=-=+-+-+-+--+-+=--++=-+=-=+-++=-==--+-+=--++--++--+-++-+-=++=-++-+- 
32 distinct runs (2560 games) found 

That particular snippet of data has an average of 36.6 (out of 80, = 45.7%), and a variance of 6.3, which means SD = 2.51.

Predicted for 80 games would be 0.4*sqrt(80) ~ 3.6. So the observed SD is actually on the low side, but not so much that there is reason to worry for such a small sample of 32 runs.
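For reference, a minimal sketch (mine, not hgm's actual script) of how such +/=/- traces can be scored and their spread measured, assuming '+', '=', '-' denote win, draw, loss for the engine under test:

```python
def score(trace: str) -> float:
    """Match score of one run: 1 point per win '+', half per draw '='."""
    return sum({'+': 1.0, '=': 0.5, '-': 0.0}[c] for c in trace)

def mean_and_sd(traces):
    """Mean and population SD of the match scores across runs."""
    scores = [score(t) for t in traces]
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / n
    return mean, var ** 0.5

# Feeding the 32 traces above into mean_and_sd should reproduce the
# quoted mean of 36.6 and SD of 2.51; the binomial prediction with a
# per-game SD of ~0.4 (draws included) is 0.4 * 80 ** 0.5, about 3.6.
```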
Last edited by hgm on Sat Sep 15, 2007 4:39 pm, edited 2 times in total.
nczempin

Re: Variance reports for testing engine improvements

Post by nczempin »

hgm wrote:You mean stuff like this?
Well, I meant not the raw data, but the computed results, and specifically those results that Bob had that he mentioned, with Crafty, Fruit, Glaurung, etc.
User avatar
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Variance reports for testing engine improvements

Post by hgm »

Of course the smaller-than-expected variance of the 80-game result could be a systematic effect of the machines not being able to play some positions as well as others. The expected variance was calculated assuming each individual game had a scoring probability around 45% (which is the total scoring percentage), and a corresponding variance. The variance of the match is then simply the sum of variance of the individual games, i.e. 80 times that of one game if they all have equal variance.

But if some of the games have win probabilities significantly different from 50%, the variance goes down. A 10% win probability gives a variance of only 0.9*0.1 = 0.09 (and even lower if you include draws), instead of 0.45*0.55 ≈ 0.25. So such games contribute less to the variance, although one 10% game and one 90% game contribute the same to the score as two 50% games.

As you can see, some of the columns are quite consistently plusses or minuses, so it is a safe bet that these games are why the match variance is lower. All of this is predicted by standard statistical analysis. In particular, there is no way the variance could be higher than calculated.
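As a sketch of that arithmetic (the probabilities here are illustrative, not measured from the traces above): the score variance of a single game with given win/draw probabilities is

```python
def game_variance(p_win: float, p_draw: float) -> float:
    """Variance of one game's score (win = 1, draw = 0.5, loss = 0)."""
    p_loss = 1.0 - p_win - p_draw
    mean = p_win + 0.5 * p_draw
    return (p_win  * (1.0 - mean) ** 2 +
            p_draw * (0.5 - mean) ** 2 +
            p_loss * (0.0 - mean) ** 2)

even = game_variance(0.45, 0.10)      # roughly even game -> 0.225
lopsided = game_variance(0.10, 0.00)  # 10% win chance    -> 0.09
# The match variance is the sum over the games, so lopsided games
# drag the total down even though a 10%/90% pair scores like two 50% games.
```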
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Variance reports for testing engine improvements

Post by bob »

nczempin wrote:
bob wrote:
nczempin wrote: I'll go and look up the values you posted unless someone beats me to it.
If you don't find them, we can start a new thread and I can post some data. The only problem is that my results are in chunks of 80 games against the same opponent: 40 starting positions, each position played with alternating colors. Probably the best would be to post four 80-game matches between several programs in my test gauntlet...
Okay, please post them if you have them nearby.
OK. Here are three 80-game matches against an opponent that Crafty plays very evenly against:
B: -=+--++-=+-==-==-====---+--+====--+-==+-+---+==+-+--++==---=+-+===+--=-+++++-=++ ( -6)
B: -++==-=+-+-==++-+=-=+-------+-=-----=-+=+=-----+++++=--=+--+==+=-=+==+===+-=+--- (-11)
B: -----+=-=+--==-=+=--+-=-++--+--=----+=+-+-+-=-+-=++=-+--+----+=--=-+-----=-+==-- (-25)

The values at the end are the final result for each: 80 games, varying from -6 to -25. The time control was 60+60, so these are not noisy 1-second games...
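If I read the traces right, the trailing number is simply wins minus losses (my reading, but it is consistent with draws not affecting the net):

```python
def net(trace: str) -> int:
    """Net result of a trace: wins minus losses; draws cancel out."""
    return trace.count('+') - trace.count('-')

# Applied to the first B trace above this should give -6, and so on.
```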

Here are three similar matches against a program Crafty usually drubs pretty soundly:

C: =+=++==+=-++=-+===+++++=-+++==-+=+++--++--+---+-==+++++=+-=+++=++++--==-=+===+-= ( 21)
C: ++=+++-+-+++++=+===++-++-+++=+++++++=+++-+++-++=++++-+=-=+==-+=+-=+==-+-=-=++=+= ( 34)
C: ++=+-++-++-++=++=+=++=+==+++==++++++=+===-+-+=++=+=+=+===-++++=++=-++++++++-+-++ ( 40)

And finally, three matches between an opponent that usually beats up on Crafty pretty badly in this test and an opponent that Crafty usually beats pretty badly. Neither engine is Crafty in this case, to show that the randomness is not from Crafty but is intrinsic to playing computer vs. computer games (again, all games are 60+60 in the data I am providing):

A: +--=+-==++=-=+-+--=+=-+-===-+=-+==---+-=-+--+++--+-===---=-=-=+=+-+-+=-+=+-+---= (-10)
A: =--+==-=+--+-+-=-+--=-=--+-==+=-++--=----+=-=+-==+=+===--+=-==+---+--=-----+++-+ (-17)
A: =---=-=+=--=-=-==------------=-==----=-+-==-=----+-=+=---+==-=+---+--+-----=-=-- (-42)

I can provide way more data if you want, but those matches, which were played in succession, give some idea of the variability that is common.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Variance reports for testing engine improvements

Post by bob »

hgm wrote:Of course the smaller-than-expected variance of the 80-game result could be a systematic effect of the machines not being able to play some positions as well as others. The expected variance was calculated assuming each individual game had a scoring probability around 45% (which is the total scoring percentage), and a corresponding variance. The variance of the match is then simply the sum of variance of the individual games, i.e. 80 times that of one game if they all have equal variance.

But if some of the games have win probabilities significantly different from 50%, the variance goes down. A 10% win probability gives a variance of only 0.9*0.1 = 0.09 (and even lower if you include draws), instead of 0.45*0.55 ≈ 0.25. So such games contribute less to the variance, although one 10% game and one 90% game contribute the same to the score as two 50% games.

As you can see, some of the columns are quite consistently plusses or minuses, so it is a safe bet that these games are why the match variance is lower. All of this is predicted by standard statistical analysis. In particular, there is no way the variance could be higher than calculated.
That is my point. The 45% is not just wrong. It is _badly_ wrong. And using that as a basis for any sort of statistical analysis produces meaningless data.

BTW, for "no way" I suggest you go back to basic statistics and the central limit theorem... There is most definitely a "way", just as a card counter can suffer horrible losses in a casino while playing with a +1% advantage, or you can flip a coin 100 times and get 75 heads... This is a normal curve, but for the programs I do not believe we know anything about the probabilities to date. The "random factor" has never been analyzed to any extent until I started producing such a ridiculous number of games...
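As a quick check of how rare that coin-flip outcome actually is (my calculation, not from the thread):

```python
import math

# Probability of exactly 75 heads in 100 fair coin flips.
p75 = math.comb(100, 75) / 2 ** 100
# Comes out below one in a million: possible, but far out in the tail,
# which is what confidence intervals are meant to quantify.
```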
User avatar
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Variance reports for testing engine improvements

Post by hgm »

bob wrote:BTW, for "no way" I suggest you go back to basic statistics and the central limit theorem... There is most definitely a "way", just as a card counter can suffer horrible losses in a casino while playing with a +1% advantage, or you can flip a coin 100 times and get 75 heads...
That is not the variance. A single outcome has no variance.

You are confusing individual outcomes with the distribution from which they are drawn. Of course some individual outcomes in your sample will have a deviation from the mean (much) larger than the SD. That is what SD means: the root-mean-square deviation. Not every deviation can be smaller than the average...

This effect is fully included in the determination of the uncertainty intervals. It is why an interval of 2 sigma, or any interval beyond sigma, is not automatically a 100% confidence interval.

It doesn't seem that it is me who needs a refresher course in statistics.

I don't see what you mean that 45% is badly wrong. The data-traces show a score of 45%. That is a fact, and there is nothing wrong with that observation. Now you might have some interpretation of this fact, unknown to us, that is badly wrong. I cannot comment on that before I know what that interpretation is.

Fact is that the standard deviation of the match results can never be larger than 0.5*sqrt(80), no matter what the true win probabilities are. It can only be smaller.
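That bound can be verified numerically (a sketch of mine, not hgm's): a single game's score variance is at most p*(1-p) with decisive results (draws only lower it), which peaks at p = 0.5, so summing 80 independent games caps the match SD.

```python
import math

# p*(1-p) peaks at p = 0.5 with value 0.25; per-game variance <= 0.25,
# so the 80-game match variance <= 80 * 0.25 and SD <= 0.5*sqrt(80).
peak = max(p / 1000 * (1 - p / 1000) for p in range(1001))
cap = math.sqrt(80 * peak)   # ~4.47, regardless of the true probabilities
```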
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Variance reports for testing engine improvements

Post by bob »

hgm wrote:
bob wrote:BTW, for "no way" I suggest you go back to basic statistics and the central limit theorem... There is most definitely a "way", just as a card counter can suffer horrible losses in a casino while playing with a +1% advantage, or you can flip a coin 100 times and get 75 heads...
That is not the variance. A single outcome has no variance.

You are confusing individual outcomes with the distribution from which they are drawn. Of course some individual outcomes in your sample will have a deviation from the mean (much) larger than the SD. That is what SD means: the root-mean-square deviation. Not every deviation can be smaller than the average...

No, I'm not confusing anything. Your "variance" is based on a huge number of games. And I agree, with N=infinity that variance (and hence standard deviation) is very small. But remember, _I_ am the one arguing for large N. You (and others) are arguing for small N. So don't quote sigma^2 (or sigma) for large N to justify a small number of games. That is exactly my point. To reduce the variance to something that is reliable, we need a large number of games, because the results of the 80-game mini-matches I play are so widely distributed. I am not bothered by the necessity of playing 20K games to make a decision. Since I can play 256 at a time, that is the same as playing fewer than 100 games one at a time. With a fast time control, that doesn't take that long.

The alternative is to use small N, and make poor decisions because of the inherent randomness (large sigma/sigma^2).

BTW, you could make sigma much smaller by just taking each 2-game position pair as a match; the results have to lie between -2 and +2. But I am not sure how you could use that to accept/reject a change to the program, since the before/after results will be identical to several decimal places.

This effect is fully included in the determination of the uncertainty intervals. It is why an interval of 2 sigma, or any interval beyond sigma, is not automatically a 100% confidence interval.

It doesn't seem that it is me who needs a refresher course in statistics.

I don't see what you mean that 45% is badly wrong. The data-traces show a score of 45%. That is a fact, and there is nothing wrong with that observation. Now you might have some interpretation of this fact, unknown to us, that is badly wrong. I cannot comment on that before I know what that interpretation is.

Given two equal humans who play 100 games, what would you expect to be the outcome? 45-45-10? If you repeat the test, what would you expect? I would _not_ expect two equal computers to produce that same level of consistency. That is what I am trying to explain. Comp vs. comp is nowhere close to human vs. human...

Fact is that the standard deviation of the match results can never be larger than 0.5*sqrt(80), no matter what the true win probabilities are. It can only be smaller.
For an infinite number of trials, I agree. But I am the one arguing for large N. So how can you argue based on large-N values when you want to use small-N results?? The standard deviation can be much larger than the above for the small sample sizes that you propose...
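A quick Monte Carlo sketch (my own, with an assumed 45% win probability and no draws) shows what spread standard theory actually predicts for repeated 80-game matches:

```python
import random

random.seed(1)

def match_score(p: float = 0.45, n: int = 80) -> int:
    """Wins out of n independent games at win probability p (no draws)."""
    return sum(random.random() < p for _ in range(n))

runs = [match_score() for _ in range(1000)]
mean = sum(runs) / len(runs)
sd = (sum((r - mean) ** 2 for r in runs) / len(runs)) ** 0.5
# sd lands near sqrt(80 * 0.45 * 0.55) ~= 4.45, below the 0.5*sqrt(80)
# cap, yet individual runs still scatter over roughly +/- 2 sd.
```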
nczempin

Re: Variance reports for testing engine improvements

Post by nczempin »

bob wrote:
hgm wrote:
bob wrote:BTW, for "no way" I suggest you go back to basic statistics and the central limit theorem... There is most definitely a "way", just as a card counter can suffer horrible losses in a casino while playing with a +1% advantage, or you can flip a coin 100 times and get 75 heads...
That is not the variance. A single outcome has no variance.

You are confusing individual outcomes with the distribution from which they are drawn. Of course some individual outcomes in your sample will have a deviation from the mean (much) larger than the SD. That is what SD means: the root-mean-square deviation. Not every deviation can be smaller than the average...

No, I'm not confusing anything. Your "variance" is based on a huge number of games. And I agree, with N=infinity that variance (and hence standard deviation) is very small. But remember, _I_ am the one arguing for large N. You (and others) are arguing for small N. So don't quote sigma^2 (or sigma) for large N to justify a small number of games. That is exactly my point. To reduce the variance to something that is reliable, we need a large number of games, because the results of the 80-game mini-matches I play are so widely distributed. I am not bothered by the necessity of playing 20K games to make a decision. Since I can play 256 at a time, that is the same as playing fewer than 100 games one at a time. With a fast time control, that doesn't take that long.

The alternative is to use small N, and make poor decisions because of the inherent randomness (large sigma/sigma^2).

BTW, you could make sigma much smaller by just taking each 2-game position pair as a match; the results have to lie between -2 and +2. But I am not sure how you could use that to accept/reject a change to the program, since the before/after results will be identical to several decimal places.

This effect is fully included in the determination of the uncertainty intervals. It is why an interval of 2 sigma, or any interval beyond sigma, is not automatically a 100% confidence interval.

It doesn't seem that it is me who needs a refresher course in statistics.

I don't see what you mean that 45% is badly wrong. The data-traces show a score of 45%. That is a fact, and there is nothing wrong with that observation. Now you might have some interpretation of this fact, unknown to us, that is badly wrong. I cannot comment on that before I know what that interpretation is.

Given two equal humans who play 100 games, what would you expect to be the outcome? 45-45-10? If you repeat the test, what would you expect? I would _not_ expect two equal computers to produce that same level of consistency. That is what I am trying to explain. Comp vs. comp is nowhere close to human vs. human...

Fact is that the standard deviation of the match results can never be larger than 0.5*sqrt(80), no matter what the true win probabilities are. It can only be smaller.
For an infinite number of trials, I agree. But I am the one arguing for large N. So how can you argue based on large-N values when you want to use small-N results?? The standard deviation can be much larger than the above for the small sample sizes that you propose...
I (one of the "others") am not advocating small N, as such.

There are two Ns: One is the base test to see what variance your old version has in its matches. The larger the number of games, the closer the observed variance will get to the real variance.

But even when you have a small number here, if the observed variance is already small (and as we all know by now, for you it isn't), then you don't need as many games for part 2:

The second N, which is the number of games that are played with the new version, to find out if it is stronger than the old version.

I am not advocating a small number of games (you are surely aware that, logically, the opposite of "we need many games" is not "we need few games"); I am saying that, depending on the situation, fewer games may be needed than the numbers you seem to claim apply everywhere.
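To make the second N concrete, here is a back-of-the-envelope sketch (my own numbers, using the worst-case per-game SD of 0.5): a per-game score improvement d stands out at roughly 2 sigma once d*N >= 2*0.5*sqrt(N).

```python
import math

def games_needed(delta_per_game: float, sigma_game: float = 0.5) -> int:
    """Games needed so a per-game score gain delta stands ~2 SDs above
    the noise: delta * N >= 2 * sigma_game * sqrt(N)."""
    return math.ceil((2 * sigma_game / delta_per_game) ** 2)

# A ~20-Elo improvement is roughly a 0.03 per-game score gain, so this
# suggests on the order of a thousand games; smaller gains need far more,
# while a smaller observed sigma lowers the requirement quadratically.
```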
nczempin

Re: Variance reports for testing engine improvements

Post by nczempin »

A temporary snapshot of my current 10-game gauntlet:

Code:

    Engine             Score      Ed
01: Eden_0.0.13        61,5/195 ···· 
02: Alf109             4,0/4    1111 
02: BikJump            4,0/4    1111 
02: Cheetah            4,0/4    1111 
02: FAUCE              4,0/4    1111 
02: Flux               4,0/4    1111 
02: Heracles           4,0/4    1111 
02: Jsbam              4,0/4    1111 
02: Kingsout           4,0/4    1111 
02: Mediocre           4,0/4    1111 
02: Murderhole         4,0/4    1111 
02: Needle             4,0/4    1111 
13: Enigma             3,5/4    11=1 
13: Mizar              3,5/4    111= 
13: Olithink           3,5/4    =111 
13: PolarEngine        3,5/4    1=11 
17: Gedeone1620        3,0/4    0111 
17: Milady             3,0/4    ==11 
17: ALChess1.5b        3,0/4    1101 
17: Hoplite            3,0/4    1101 
17: IQ23               3,0/4    1=1= 
17: Beaches            3,0/4    1011 
17: DChess1_0_2        3,0/4    11== 
17: Roce036            3,0/3    111  
17: Rotor              3,0/3    111  
17: TJchess078R2       3,0/3    111  
27: Blikskottel        2,5/4    0=11 
27: Tamerlane02        2,5/3    1=1  
27: Aldebaran          2,5/4    01=1 
30: Piranha            2,0/4    0101 
30: Minimax            2,0/4    0101 
30: Pooky27            2,0/3    110  
30: Atlanchess         2,0/4    0101 
30: Exacto             2,0/4    0101 
30: SharpChess2        2,0/3    101  
30: Storm              2,0/3    011  
30: Eden_0.0.11        2,0/4    1010 
30: Eden_0.0.12_server 2,0/4    1010 
30: Tscp181            2,0/3    ==1  
30: Umax4_8w           2,0/3    101  
30: Vanillachess       2,0/3    011  
30: Yawce016           2,0/3    110  
30: Z2k3               2,0/3    1==  
44: Awe170             1,5/4    =010 
44: Roque              1,5/3    =01  
44: Cefap              1,5/4    010= 
44: Rainman            1,5/3    0=1  
48: APILchess          1,0/4    0=0= 
48: Vicki              1,0/3    010  
48: Philemon           1,0/4    =0=0 
48: Eden0.0.12 JA      1,0/4    0010 
52: Dimitri Chess      0,0/4    0000 
52: Sdbc               0,0/3    000  
52: RobinEngine        0,0/3    000  
I was going to wait until I had at least 4 games for each match, but perhaps something useful can be drawn from this result already.

Perhaps one should remove the engines that have only 3 results so far; since which game is missing is arbitrary, it shouldn't make any difference compared to the situation I would have had if I hadn't included those engines from the start.
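A sketch of that filtering idea (a few rows copied from the table above; I'm assuming each opponent row lists that engine's score against Eden, so Eden's score against it is games minus that number):

```python
# (opponent's score vs Eden, games played), taken from the table above.
results = {
    "Alf109": (4.0, 4), "Enigma": (3.5, 4), "Roce036": (3.0, 3),
    "Blikskottel": (2.5, 4), "Tamerlane02": (2.5, 3),
}

# Keep only complete 4-game matches, then recompute Eden's score share.
full = {name: (s, g) for name, (s, g) in results.items() if g == 4}
eden_pct = (sum(g - s for s, g in full.values())
            / sum(g for _, g in full.values()))
```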