nczempin wrote: I'll go and look up the values you posted unless someone beats me to it.
bob wrote: If you don't find them, we can start a new thread and I can post some data. The only problem is that my results are in chunks of 80 games against the same opponent, 40 starting positions, playing each position with alternating colors. Probably the best would be to post 4 80-game matches between several programs in my test gauntlet...
Okay, please post them if you have them nearby.
Variance reports for testing engine improvements
Moderators: bob, hgm, Harvey Williamson
- hgm
- Posts: 24619
- Joined: Fri Mar 10, 2006 9:06 am
- Location: Amsterdam
- Full name: H G Muller
- Contact:
Re: Variance reports for testing engine improvements
You mean stuff like this?
That particular snippet of data has an average of 36.6 (out of 80, = 45.7%), and a variance of 6.3, which means SD = 2.51.
Predicted for 80 games would be 0.4*sqrt(80) ~ 3.6. So actually the variance is on the low side, but not so much that there is reason to worry for such a small sample of 32 runs.
Code: Select all
01: -+-=-++=-+=++-=-+-+=+-+--=-+=--++=-++--=-+=-===+-+===+=++-=++--+-+--+--+=--++---
02: -+=+=++-=+--+-=-+-+==-+----+=-=++--+=--+--+---==-+-==+=+---++--+--+-=--+-=--+---
03: -+--=++==+-++-+-+-+-+-==-+-++--+=---+--+-++--+-+===+=+-=+-==+==+---=---++--++=+-
04: -+=-=+=--=-++-+-+-+=+-+--+-+-==++=-+==-+--+=-+-+-+=+=--+---==--+-++----++-=++---
05: -+=--+=--+=-+-+---=-+-+----++=-++=-+--=+=-+--+-+-+=+=+-+===++--+-+-==--+----+-+-
06: -+---=+=-+-+====+-+-+-+-=+-+==-=+--++--==++-=+---+=+=+-+--=++--+=-+-=--+=-----=-
07: -+-+--+--+=-+==-=-+-+-+--+=-=-=++-=++=-+-=+--+-+-+=+-+-++-=++-=+--=-=--++---+=+-
08: -+-+-=+--+-==-=-+=+=+=+--+-==-=+=--++----++--+=--+=--+-=+-----=+-++==--++=--+--=
09: -+-+-=+--+-+==--+-+-+-+--==----++==++=-+=++--+=+-=-=-+-+=--++-=+--=-+=---=-+--+-
10: -+=+-++--+-++-==+-+-+-==-+-+=-=++-=++-=+==+--+-=-+-==+==+--++--+=++==--+----+-+=
11: -+---====+-++=--==+---+--+-+----+--+=-=+-=+--+=+-+-+-+-==--=+--+=++-+--++=-++-==
12: -+=+=++--+-++-+-=-+-+-+=-+-+-=--+--+=--+=-+==+-=-+=--+-=+-==+-=+-++-=--++---+---
13: -+-+--+==+--==+-+-+-=-+=-+-+=--+-=-++-=--++--+===--+=+=+---++-=+-=+-----+---+---
14: -+-=-++--+-+==+-=-+=+-+=-+=+=--++=-++--+==+--==-===+-+=-+-=-+-=+--+-=-=++--+=-+-
15: -+-+-++--+-=+==-+-+-+-+--+-+=---+--++-=+-++---=+-+==-+-+---==-=+--=-+--+=--++-+=
16: -+=+=++=-+=++=+=--+-+=+--+=+=---+--++--+=-+==+==-+-+=+-++---+-=+-=+-==-++===+-+-
17: -=-+-=+--+-++==-+-+-+-+--+==+-=++--++--+==+-=+-==+=--+-++--++-=+--+----++---+-+-
18: -+==-++--+=====-+=+-+-+--+-++--++=-+=--+-++--+=+-+=+-+=++=-++-=+==+==--+--=++---
19: =+-+=+-=-=-=+-+---+-+-+--+-++--++=-++-==-++-=+-+++=-=+-++--=+--+-++-==-++--++-=-
20: -+---++-=+=-+-+=+-=-+-+--+-+=--++=-+=-=+-++--+=+=+--=+-++--++--+-=+-+--++--++-+-
21: -+=-=++--+=+=-+=+-+-+-+----=--=++=-=---+-++---=+-+=+-+=+--=-+--+--+=+--=+---=---
22: -+-=-=+-==--+=+-=-+-+=+----+=-=++-=++====++--+---+===+=++--++--+--+---=++---+-=-
23: -+-=-++=-+-+=---=-=-+--=-+=+-===-=-+=--==++--+=+=+=+=+=++--++-=+--+-----+-----+-
24: -+-+-+-=-+-++-+-+---+-+--+-=----+--++--+-=+--=-====+==-++--++-=--=+--=-++--++---
25: =+---+==-+-+==+-=-+-+-+=++-+=--++--=+--+==+--+==-+-===-++--++=-+-==--==++=--+-+-
26: -+=+-++--+-=+=+=+-=-+=+--+-+-=-=+-=++--=-++--+-+-+=-=+-++---+-=+-=+-=--=+=-++==-
27: -+-+-++--=-+====+-=-+-==-+=-+---+--+=--=-=+===-+-=-+-+-++--=+--+-++-=--++=-++-+=
28: =+---+---+-++-+-+-+-+----=-+===++--++-=+==+--+---+-+-+-+=-==+-=+--+==-=+=---+-+-
29: =+-+-++-===+--+=+-+-+-=--+=+---++--+=-==-++-=+---+=+=+--+-=++--+-++==--++-=++--=
30: -+=+-++-=+-=+-+=+-=-+-+--+-+---++--++-=+=++-==-+=--+=+=++--++----+=-+--+---++--=
31: -+=+==+--+-++=+-+-+=+-+=-+-++=-++--+=--+=-+--+-+=--+=+-+--=++--+--+-=--++=--+---
32: -+-=--+-=+-+-=-=+-+-+-+--+-+=--++=-+=-=+-++=-==--+-+=--++--++--+-++-+-=++=-++-+-
32 distinct runs (2560 games) found
Last edited by hgm on Sat Sep 15, 2007 2:39 pm, edited 2 times in total.
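For anyone who wants to reproduce these numbers, here is a rough sketch of my own (not code from the thread) of how the per-run score and SD fall out of the +/=/- traces; the two rows are just the first two runs above, and the 0.5 per-game SD is the draw-free ceiling (hgm's 0.4 presumably folds in the draw rate):

```python
import math

# '+' = win (1), '=' = draw (0.5), '-' = loss (0); each row is one 80-game run.
runs = [
    "-+-=-++=-+=++-=-+-+=+-+--=-+=--++=-++--=-+=-===+-+===+=++-=++--+-+--+--+=--++---",
    "-+=+=++-=+--+-=-+-+==-+----+=-=++--+=--+--+---==-+-==+=+---++--+--+-=--+-=--+---",
]

def score(trace):
    """Score of one run in points (wins plus half the draws)."""
    return sum({'+': 1.0, '=': 0.5, '-': 0.0}[c] for c in trace)

scores = [score(r) for r in runs]
mean = sum(scores) / len(scores)
sd = math.sqrt(sum((s - mean) ** 2 for s in scores) / len(scores))
print(f"mean = {mean}, SD = {sd:.2f} over {len(scores)} runs")

# Prediction if every game were an independent trial at p ~ 0.457:
# per-game SD is at most sqrt(p*(1-p)) ~ 0.5 (lower once draws are counted),
# so the 80-game SD should be at most about 0.5*sqrt(80) ~ 4.5.
print(f"upper bound on predicted SD: {0.5 * math.sqrt(80):.2f}")
```

With all 32 runs loaded instead of two, this reproduces the 36.6 mean and 2.51 SD quoted above.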
Re: Variance reports for testing engine improvements
hgm wrote: You mean stuff like this?
Well, I meant not the raw data, but the computed results, and specifically those results Bob mentioned having, with Crafty, Fruit, Glaurung, etc.
- hgm
Re: Variance reports for testing engine improvements
Of course the smaller-than-expected variance of the 80-game result could be a systematic effect of the machines not being able to play some positions as well as others. The expected variance was calculated assuming each individual game had a scoring probability around 45% (which is the total scoring percentage), and a corresponding variance. The variance of the match is then simply the sum of variance of the individual games, i.e. 80 times that of one game if they all have equal variance.
But if some of the games have win probabilities significantly different from 50%, the variance goes down. A 10% win probability gives a variance of only 0.9*0.1 = 0.09 (and even lower if you include draws) instead of 0.45*0.55 ≈ 0.25. So such games contribute less to the variance, although one 10% game and one 90% game contribute the same to the score as two 50% games.
Since some of the columns are quite consistent plusses or minuses, it is a safe bet that the match variance is lower due to these games. All of this is predicted by standard statistical analysis. In particular, there is no way the variance could be higher than calculated.
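The arithmetic behind this point fits in a few lines (a sketch of my own, not from the thread): the match variance is the sum of the per-game variances p*(1-p), so a lopsided pair adds the same expected score as two even games but much less spread:

```python
# A 10%/90% pair contributes the same expected score as two 50% games,
# but much less variance (sum of p*(1-p) per game).
pair_5050 = [0.5, 0.5]
pair_1090 = [0.1, 0.9]

for label, pair in [("two 50% games", pair_5050), ("10% + 90% games", pair_1090)]:
    expected = sum(pair)                       # expected score in points
    variance = sum(p * (1 - p) for p in pair)  # sum of per-game variances
    print(f"{label}: expected score {expected}, variance {variance:.2f}")
```

Both pairs give an expected score of 1.0, but the lopsided pair's variance is 0.18 versus 0.50 for the even pair.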
Re: Variance reports for testing engine improvements
nczempin wrote: I'll go and look up the values you posted unless someone beats me to it.
bob wrote: If you don't find them, we can start a new thread and I can post some data. The only problem is that my results are in chunks of 80 games against the same opponent, 40 starting positions, playing each position with alternating colors. Probably the best would be to post 4 80-game matches between several programs in my test gauntlet...
nczempin wrote: Okay, please post them if you have them nearby.
OK. Here are 3 80-game matches against an opponent that Crafty plays very evenly against:
B: -=+--++-=+-==-==-====---+--+====--+-==+-+---+==+-+--++==---=+-+===+--=-+++++-=++ ( -6)
B: -++==-=+-+-==++-+=-=+-------+-=-----=-+=+=-----+++++=--=+--+==+=-=+==+===+-=+--- (-11)
B: -----+=-=+--==-=+=--+-=-++--+--=----+=+-+-+-=-+-=++=-+--+----+=--=-+-----=-+==-- (-25)
The values at the end are the final result for each match: 80 games, varying from -6 to -25. The time control was 60+60, so these are not noisy 1-second games...
Here's three similar matches against a program Crafty usually drubs pretty well:
C: =+=++==+=-++=-+===+++++=-+++==-+=+++--++--+---+-==+++++=+-=+++=++++--==-=+===+-= ( 21)
C: ++=+++-+-+++++=+===++-++-+++=+++++++=+++-+++-++=++++-+=-=+==-+=+-=+==-+-=-=++=+= ( 34)
C: ++=+-++-++-++=++=+=++=+==+++==++++++=+===-+-+=++=+=+=+===-++++=++=-++++++++-+-++ ( 40)
And finally, three matches between an opponent that usually beats up on Crafty pretty badly in this test and an opponent that Crafty usually beats pretty badly (neither is Crafty in this case, to show that the randomness is not from Crafty but is intrinsic to playing computer vs computer games). Again, all games are 60+60 in the data I am providing...
A: +--=+-==++=-=+-+--=+=-+-===-+=-+==---+-=-+--+++--+-===---=-=-=+=+-+-+=-+=+-+---= (-10)
A: =--+==-=+--+-+-=-+--=-=--+-==+=-++--=----+=-=+-==+=+===--+=-==+---+--=-----+++-+ (-17)
A: =---=-=+=--=-=-==------------=-==----=-+-==-=----+-=+=---+==-=+---+--+-----=-=-- (-42)
I can provide way more data if you want, but those matches, which were played in succession, give some idea of the variability that is common.
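To put numbers on this spread, here is a quick sketch of mine using the final results quoted above. The win-minus-loss differential changes by at most ±1 per game, so under game independence its 80-game SD is at most sqrt(80) ≈ 8.9; keep in mind that a sample SD from only 3 matches is itself a very noisy estimate:

```python
import math
import statistics

# Final results (wins minus losses) of the three 80-game matches quoted above.
matches = {"B": [-6, -11, -25], "C": [21, 34, 40], "A": [-10, -17, -42]}

bound = math.sqrt(80)   # per-game differential SD is at most 1
for label, results in matches.items():
    sd = statistics.stdev(results)   # sample SD from only 3 matches: very noisy
    print(f"{label}: mean {statistics.mean(results):+.1f}, "
          f"sample SD {sd:.1f} (independence bound {bound:.1f})")
```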
Re: Variance reports for testing engine improvements
hgm wrote: Of course the smaller-than-expected variance of the 80-game result could be a systematic effect of the machines not being able to play some positions as well as others. The expected variance was calculated assuming each individual game had a scoring probability around 45% (which is the total scoring percentage), and a corresponding variance. The variance of the match is then simply the sum of the variances of the individual games, i.e. 80 times that of one game if they all have equal variance.
But if some of the games have win probabilities significantly different from 50%, the variance goes down. A 10% win probability gives a variance of only 0.9*0.1 = 0.09 (and even lower if you include draws) instead of 0.45*0.55 ≈ 0.25. So such games contribute less to the variance, although one 10% game and one 90% game contribute the same to the score as two 50% games.
Since some of the columns are quite consistent plusses or minuses, it is a safe bet that the match variance is lower due to these games. All of this is predicted by standard statistical analysis. In particular, there is no way the variance could be higher than calculated.
That is my point. The 45% is not just wrong. It is _badly_ wrong. And using that as a basis for any sort of statistical analysis produces meaningless data.
BTW, for "no way" I suggest you go back to basic statistics and the central limit theorem... There is most definitely a "way", just as a card counter can suffer horrible losses in a casino playing with a +1% advantage, or you can flip a coin 100 times and get 75 heads. This is a normal curve, but for the programs I do not believe we know anything about the probabilities to date. The "random factor" has never been analyzed to any extent until I started producing such a ridiculous number of games...
- hgm
Re: Variance reports for testing engine improvements
bob wrote: BTW, for "no way" I suggest you go back to basic statistics and the central limit theorem... There is most definitely a "way", just as a card counter can suffer horrible losses in a casino playing with a +1% advantage, or you can flip a coin 100 times and get 75 heads...
That is not the variance. A single outcome has no variance.
You are confusing individual outcomes with the distribution from which they are drawn. Of course some individual outcomes of your sample will have a deviation from the mean (much) larger than the SD. That is what SD means: the rms deviation from the mean. Not everything can be smaller than the average...
This effect is fully included in the determination of the uncertainty intervals. (It is why an interval of 2 sigma, or any interval beyond sigma, is not automatically a 100% confidence interval.)
It doesn't seem that it is me who needs a refresher course in statistics.
I don't see what you mean that 45% is badly wrong. The data-traces show a score of 45%. That is a fact, and there is nothing wrong with that observation. Now you might have some interpretation of this fact, unknown to us, that is badly wrong. I cannot comment on that before I know what that interpretation is.
Fact is that the standard deviation of the match results can never be larger than 0.5*sqrt(80), no matter what the true win probabilities are. It can only be smaller.
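This bound is easy to check numerically (my own sketch, not anything from the thread): p*(1-p) peaks at p = 0.5, so whatever the true per-game probabilities, the analytic SD of the match score can never exceed 0.5*sqrt(80):

```python
import math
import random

random.seed(1)

def match_sd(probs):
    """Analytic SD of the match score for independent win/loss games."""
    return math.sqrt(sum(p * (1 - p) for p in probs))

bound = 0.5 * math.sqrt(80)
trials = [[0.5] * 80, [0.45] * 80, [random.random() for _ in range(80)]]
for probs in trials:
    sd = match_sd(probs)
    print(f"SD = {sd:.2f}, bound = {bound:.2f}, within bound: {sd <= bound + 1e-9}")
```

Only the all-50% match attains the bound; any other probability profile falls below it.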
Re: Variance reports for testing engine improvements
hgm wrote: bob wrote: BTW, for "no way" I suggest you go back to basic statistics and the central limit theorem... There is most definitely a "way", just as a card counter can suffer horrible losses in a casino playing with a +1% advantage, or you can flip a coin 100 times and get 75 heads...
That is not the variance. A single outcome has no variance. You are confusing individual outcomes with the distribution from which they are drawn. Of course some individual outcomes of your sample will have a deviation from the mean (much) larger than the SD. That is what SD means: the rms deviation from the mean. Not everything can be smaller than the average...
No, I'm not confusing anything. Your "variance" is based on a huge number of games. And I agree, with N = infinity that variance (and hence standard deviation) is very small. But remember, _I_ am the one arguing for large N. You (and others) are arguing for small N. So don't quote sigma^2 (or sigma) for large N to justify a small number of games. That is exactly my point. To reduce the variance to something that is reliable, we need a large number of games, because the results of the 80-game mini-matches I play are so widely distributed. I am not bothered by the necessity of playing 20K games to make a decision. Since I can play 256 at a time, that is the same as playing fewer than 100 games one at a time. With a fast time control, that doesn't take that long.
The alternative is to use small N, and make poor decisions because of the inherent randomness (large sigma/sigma^2).
BTW, you could make sigma much smaller by just taking each 2-game position as a match; the results have to lie between -2 and +2. But I am not sure how you can use that to accept/reject a change to the program, since the before/after results will be identical to several decimal places.
For an infinite number of trials, I agree. But I am the one arguing for large N. So how can you argue based on large-N values when you want to use small-N results? The standard deviation can be much larger than the above for the small sample sizes that you propose...
This effect is fully included in the determination of the uncertainty intervals. (It is why an interval of 2 sigma, or any interval beyond sigma, is not automatically a 100% confidence interval.)
It doesn't seem that it is me who needs a refresher course in statistics.
I don't see what you mean that 45% is badly wrong. The data-traces show a score of 45%. That is a fact, and there is nothing wrong with that observation. Now you might have some interpretation of this fact, unknown to us, that is badly wrong. I cannot comment on that before I know what that interpretation is.
Fact is that the standard deviation of the match results can never be larger than 0.5*sqrt(80), no matter what the true win probabilities are. It can only be smaller.
Given two equal humans, who play 100 games, what would you expect to be the outcome? 45-45-10? If you repeat the test, what would you expect? I would _not_ expect the two equal computers to produce that same level of consistency. That is what I am trying to explain. Comp vs Comp is nowhere close to human vs human..
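Bob's "large N" point can be made quantitative with a back-of-the-envelope sketch (mine, with illustrative numbers, and `games_needed` is my own helper, not an established formula): the SD of the score fraction shrinks like 0.5/sqrt(N), so halving the edge you want to detect quadruples the games required:

```python
import math

def games_needed(diff, sigmas=2.0, per_game_sd=0.5):
    """Games required so that a true edge of `diff` (in score fraction)
    stands out at `sigmas` standard deviations, using the worst-case
    per-game SD of 0.5."""
    return math.ceil((sigmas * per_game_sd / diff) ** 2)

for diff in (0.05, 0.02, 0.01):   # 5%, 2%, 1% score edges
    print(f"{diff:.0%} edge: ~{games_needed(diff)} games")
```

A 5% edge needs roughly 400 games at 2 sigma; a 1% edge already needs about 10,000, which is why tiny improvements call for huge test runs.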
Re: Variance reports for testing engine improvements
bob wrote: [...] Given two equal humans, who play 100 games, what would you expect to be the outcome? 45-45-10? If you repeat the test, what would you expect? I would _not_ expect the two equal computers to produce that same level of consistency. That is what I am trying to explain. Comp vs Comp is nowhere close to human vs human...
I (one of the "others") am not advocating small N, as such.
There are two Ns: One is the base test to see what variance your old version has in its matches. The larger the number of games, the closer the observed variance will get to the real variance.
But even when you have a small number here, if the observed variance is already small (and as we all know by now, for you it isn't), then you don't need as many games for part 2:
The second N, which is the number of games that are played with the new version, to find out if it is stronger than the old version.
I am not advocating a small number of games (you are surely aware that, logically, the opposite of "we need many games" is not "we need few games"); I am saying that, depending on the situation, fewer games may be needed than the use-everywhere numbers you seem to be claiming.
Re: Variance reports for testing engine improvements
A temporary snapshot of my current 10-game gauntlet:
I was going to wait until I had at least 4 games for each match, but perhaps something useful can be drawn from this result already.
Perhaps one could remove those engines that have only 3 results so far (since which games are missing is arbitrary, dropping them shouldn't make any difference compared to the situation I would have had if I hadn't included those engines from the start).
Code: Select all
Engine Score Ed
01: Eden_0.0.13 61,5/195 ····
02: Alf109 4,0/4 1111
02: BikJump 4,0/4 1111
02: Cheetah 4,0/4 1111
02: FAUCE 4,0/4 1111
02: Flux 4,0/4 1111
02: Heracles 4,0/4 1111
02: Jsbam 4,0/4 1111
02: Kingsout 4,0/4 1111
02: Mediocre 4,0/4 1111
02: Murderhole 4,0/4 1111
02: Needle 4,0/4 1111
13: Enigma 3,5/4 11=1
13: Mizar 3,5/4 111=
13: Olithink 3,5/4 =111
13: PolarEngine 3,5/4 1=11
17: Gedeone1620 3,0/4 0111
17: Milady 3,0/4 ==11
17: ALChess1.5b 3,0/4 1101
17: Hoplite 3,0/4 1101
17: IQ23 3,0/4 1=1=
17: Beaches 3,0/4 1011
17: DChess1_0_2 3,0/4 11==
17: Roce036 3,0/3 111
17: Rotor 3,0/3 111
17: TJchess078R2 3,0/3 111
27: Blikskottel 2,5/4 0=11
27: Tamerlane02 2,5/3 1=1
27: Aldebaran 2,5/4 01=1
30: Piranha 2,0/4 0101
30: Minimax 2,0/4 0101
30: Pooky27 2,0/3 110
30: Atlanchess 2,0/4 0101
30: Exacto 2,0/4 0101
30: SharpChess2 2,0/3 101
30: Storm 2,0/3 011
30: Eden_0.0.11 2,0/4 1010
30: Eden_0.0.12_server 2,0/4 1010
30: Tscp181 2,0/3 ==1
30: Umax4_8w 2,0/3 101
30: Vanillachess 2,0/3 011
30: Yawce016 2,0/3 110
30: Z2k3 2,0/3 1==
44: Awe170 1,5/4 =010
44: Roque 1,5/3 =01
44: Cefap 1,5/4 010=
44: Rainman 1,5/3 0=1
48: APILchess 1,0/4 0=0=
48: Vicki 1,0/3 010
48: Philemon 1,0/4 =0=0
48: Eden0.0.12 JA 1,0/4 0010
52: Dimitri Chess 0,0/4 0000
52: Sdbc 0,0/3 000
52: RobinEngine 0,0/3 000