Variance reports for testing engine improvements

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

nczempin

Re: Variance reports for testing engine improvements

Post by nczempin »

nczempin wrote:A temporary snapshot of my current 10-game gauntlet:

...
Perhaps one should remove those engines that have only 3 results (since their inclusion was arbitrary to begin with, removing them shouldn't make any difference compared to the situation I would have had if I hadn't included them from the start).
Perhaps I should stop this particular experiment and choose 5 opponents for the 40-position test. However, I strongly believe that whether or not the 40-position test is used is the dominant factor in the conclusions that can be drawn about variance.

I.e. I expect that in that test, the variance will be much higher for Eden than it is in the current test.
hgm
Posts: 27808
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Variance reports for testing engine improvements

Post by hgm »

bob wrote:No, I'm not confusing anything. Your "variance" is based on a huge number of games.
My variance is based not on a number of games, but on the probability distribution.
And I agree, with N=infinity that variance (and hence standard deviation) is very small. But remember, _I_ am the one arguing for large N. You (and others) are arguing for small N.
I have not been arguing for any number of games. I just calculated the confidence intervals for the number of games you or Nicolai had been using.
So don't quote sigma^2 (or sigma) for large N to justify a small number of games.
I still consider that nonsense. Variance is a property of the probability distribution from which you draw the games. It has _nothing_ to do with the number of games. Your actual results will vary per draw, and of course 32% will have a deviation from the mean larger than sigma, and 5% larger than 2 sigma. That doesn't mean the variance of the match results is any different from what is calculated. In fact it confirms it.

Given the variance of the distribution one can calculate the probability for a certain deviation, and this is what I did.
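To make this concrete, here is a minimal Python sketch (the win/draw/loss split is illustrative, not measured data): it draws 80-game mini-matches from a fixed per-game distribution and shows that the observed spread of match scores settles on the same theoretical sigma whether you sample 10 matches or 10,000.

Code:

import random, statistics

# Illustrative per-game outcome distribution (win/draw/loss for one side),
# scored +1 / 0 / -1 as in the discussion above.
P_WIN, P_DRAW = 0.40, 0.25
OUTCOMES = (1, 0, -1)
WEIGHTS = (P_WIN, P_DRAW, 1.0 - P_WIN - P_DRAW)

def mini_match(games=80):
    # Total score of one mini-match with independent games.
    return sum(random.choices(OUTCOMES, WEIGHTS, k=games))

# Theoretical per-game variance: E[x^2] - E[x]^2.
mean = sum(w * x for x, w in zip(OUTCOMES, WEIGHTS))
var = sum(w * x * x for x, w in zip(OUTCOMES, WEIGHTS)) - mean ** 2
print("theoretical SD of an 80-game match:", (80 * var) ** 0.5)

# Sampling more matches sharpens the estimate; it does not change the quantity:
for n in (10, 100, 10000):
    scores = [mini_match() for _ in range(n)]
    print(n, "matches -> observed SD:", round(statistics.stdev(scores), 2))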
That is exactly my point. To reduce the variance to something that is reliable, we need a large number of games because the results of the 80-game mini-matches I play are so widely distributed.
I did not think they were widely distributed at all. The observed variance of the mini-match results was lower than one would expect for the score, and the game results were actually highly correlated between matches. This shows that either the initial positions are not very well selected for equality, or the non-determinism you champion is not nearly as big as you claim.
I am not bothered by the necessity of playing 20K games to make a decision. Since I can play 256 at a time, that is the same as playing less than 100 games one at a time. With a fast time control, that doesn't take that long.
Well, good for you. So availability of excessive computer power causes atrophy of the brain. Nothing new about that. But I only have one dual-core machine, so I still have to _think_ about how to do things efficiently. And actually, the thinking is where all the fun is. :P
The alternative is to use small N, and make poor decisions because of the inherent randomness (large sigma/sigma^2).
The alternative for me is to eliminate all unnecessary variability, so that I can base decisions with as much confidence as you, on 256 times fewer games.
BTW you could make sigma much smaller by just taking each 2-game position as a match. The results have to lie between -2 and +2. But I am not sure how you can use that to accept/reject a change to the program, since the before/after results will be identical to several decimal places.
So what? Like anyone cares about sigma... What matters is the probability that you will exceed a certain preset score threshold at which you are confident that the engine is better. With +1/0/-1 scoring the SD per game is ~0.8, so for a 2-game match it is about 1.2. For 2-sigma confidence you would have to score 2.5 points out of 2 games. Well, good luck with that. How many hours would that take on your cluster to succeed once? :lol:
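For the record, the arithmetic behind those numbers as a small sketch, assuming evenly matched engines and an illustrative 35% draw rate (which is where the ~0.8 comes from):

Code:

# Per-game score x is +1, 0 or -1; for equal engines E[x] = 0, so the
# per-game variance is E[x^2] = p_win + p_loss = 1 - p_draw.
p_draw = 0.35                      # assumed, not measured
sd_game = (1.0 - p_draw) ** 0.5    # ~0.81
sd_match = sd_game * 2 ** 0.5      # two independent games: ~1.14
print(sd_game, sd_match)
print("2-sigma threshold:", 2 * sd_match)   # ~2.3, above the maximum score of +2

The 2-sigma threshold lands above the highest score a 2-game match can even produce, which is the point of the "good luck" above.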
Given two equal humans, who play 100 games, what would you expect to be the outcome? 45-45-10? If you repeat the test, what would you expect? I would _not_ expect the two equal computers to produce that same level of consistency. That is what I am trying to explain. Comp vs Comp is nowhere close to human vs human..
I don't really follow human chess, and from my relative ignorance I would rather expect something like 25-25-50 (+,-,=). I don't see what you are driving at at all. What do you mean by 'level of consistency'? I don't see anything consistent in 45-45-10 or 25-25-50.
For an infinite number of trials, I agree. But I am the one arguing for large N. So how can you argue based on large N values, when you want to use small N results?? The standard deviation can be much larger than the above, for the small sample sizes that you propose...
Again, what you say seems based on the idea that the variance of a stochastic variable is dependent on the number of draws. But it is not.

For individual outcomes ('mini-matches') anything is _possible_. But not with equal _probability_. If the first mini-match you play has 80 losses, then it is possible that the next one, with the same opponents and the same initial positions, will have 80 wins. But you will NOT see it happen in the lifetime of the universe, even if you used a million of your clusters. At least, not as long as the games within the mini-match are independent. (E.g. if the first game that crashes the engine caused forfeit on time of all subsequent games, because the engine could not recover enough to recognize the next 'new' command, this would of course be different.)
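A back-of-the-envelope sketch of just how improbable that is, assuming independent games and a generous 50% per-game win probability:

Code:

# Probability that a given 80-game mini-match is won 80-0, with
# independent games and an assumed per-game win probability of 0.5.
p_win = 0.5
p_all_wins = p_win ** 80
print(p_all_wins)              # ~8.3e-25

# Expected waiting time if a million clusters each finished one
# 80-game mini-match per second, in units of the age of the universe:
seconds = 1.0 / (p_all_wins * 1e6)
print(seconds / 4.3e17)        # ~2.8 universe lifetimes

And that is for the all-win follow-up match alone; demanding the 80-loss match first squares the improbability.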

The standard deviation of the score distribution of an 80-game match (in the +1/0/-1 scoring system) can never be larger than sqrt(80). That is a hard fact.

The variance of this probability distribution will tell you how large the probability is that a certain deviation from the mean will occur in a mini-match, or in a given number of mini-matches.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Variance reports for testing engine improvements

Post by bob »

nczempin wrote:
bob wrote:
hgm wrote:
bob wrote:BTW, for "no way" I suggest you go back to basic statistics and the central limit theorem. There is most definitely a "way", just as a card counter can suffer horrible losses in a casino while playing with a +1% advantage, or you can flip a coin 100 times and get 75 heads...
That is not the variance. A single outcome has no variance.

You are confusing individual outcomes with the distribution from which they are drawn. Of course some individual outcomes of your sample will have a deviation from the mean (much) larger than the SD. That is what SD means: the average rms deviation. Not everything can be smaller than the average...

No, I'm not confusing anything. Your "variance" is based on a huge number of games. And I agree, with N=infinity that variance (and hence standard deviation) is very small. But remember, _I_ am the one arguing for large N. You (and others) are arguing for small N. So don't quote sigma^2 (or sigma) for large N to justify a small number of games. That is exactly my point. To reduce the variance to something that is reliable, we need a large number of games because the results of the 80-game mini-matches I play are so widely distributed. I am not bothered by the necessity of playing 20K games to make a decision. Since I can play 256 at a time, that is the same as playing less than 100 games one at a time. With a fast time control, that doesn't take that long.

The alternative is to use small N, and make poor decisions because of the inherent randomness (large sigma/sigma^2).

BTW you could make sigma much smaller by just taking each 2-game position as a match. The results have to lie between -2 and +2. But I am not sure how you can use that to accept/reject a change to the program, since the before/after results will be identical to several decimal places.

This effect is fully included in the determination of the uncertainty intervals. It is why an interval of 2 sigma, or any interval beyond sigma, is not automatically a 100% confidence interval.

It doesn't seem that it is me who needs a refresher course in statistics.

I don't see what you mean that 45% is badly wrong. The data-traces show a score of 45%. That is a fact, and there is nothing wrong with that observation. Now you might have some interpretation of this fact, unknown to us, that is badly wrong. I cannot comment on that before I know what that interpretation is.

Given two equal humans, who play 100 games, what would you expect to be the outcome? 45-45-10? If you repeat the test, what would you expect? I would _not_ expect the two equal computers to produce that same level of consistency. That is what I am trying to explain. Comp vs Comp is nowhere close to human vs human..

Fact is that the standard deviation of the match results can never be larger than 0.5*sqrt(80), no matter what the true win probabilities are. It can only be smaller.
For an infinite number of trials, I agree. But I am the one arguing for large N. So how can you argue based on large N values, when you want to use small N results?? The standard deviation can be much larger than the above, for the small sample sizes that you propose...
I (one of the "others") am not advocating small N, as such.

There are two Ns: One is the base test to see what variance your old version has in its matches. The larger the number of games, the closer the observed variance will get to the real variance.

But even when you have a small number here, if the observed variance is already small (and as we all know by now, for you it isn't), then you don't need as many games for part 2:

The second N is the number of games that are played with the new version, to find out if it is stronger than the old version.

I am not advocating a small number of games (you are surely aware that the logical opposite of "we need many games" is not "we need few games"); I am saying that, depending on the situation, fewer games may be needed than the use-everywhere numbers you seem to be claiming.
Hmm... the opposite of "we need many games" is "we do not need many games". So far as I can parse that, not needing many means needing something less. And I tend to assume that needed != many - 2 or many - 4.

Again, we digress. I am not talking about testing Crafty and Crafty', where Crafty' has some huge change (reductions/null-move vs no reductions/null-move). For changes like that, yes, it doesn't take many games. I am talking about the kind of evolutionary changes we make daily to the programs. The improvement is _not_ going to be huge. Just significant. Even adding something significant like outside-passed-pawn or trapped-bishop code is not going to make a huge difference, since it might not be important in every game/position. And most changes are less significant than that. So it is already a given that a few games won't pick up the slight improvement or worsening of the program's play.

I've never claimed anything else...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Variance reports for testing engine improvements

Post by bob »

hgm wrote:
bob wrote:No, I'm not confusing anything. Your "variance" is based on a huge number of games.
My variance is based not on a number of games, but on the probability distribution.
What distribution? How did you arrive at it? Is it different for every pair of programs? I know of no accurate distribution that indicates how chess games are distributed over the outcomes {0, .5, 1} when we are talking about computers vs computers.
And I agree, with N=infinity that variance (and hence standard deviation) is very small. But remember, _I_ am the one arguing for large N. You (and others) are arguing for small N.
I have not been arguing for any number of games. I just calculated the confidence intervals for the number of games you or Nicolai had been using.
So don't quote sigma^2 (or sigma) for large N to justify a small number of games.
I still consider that nonsense. Variance is a property of the probability distribution from which you draw the games. It has _nothing_ to do with the number of games. Your actual results will vary per draw, and of course 32% will have a deviation from the mean larger than sigma, and 5% larger than 2 sigma. That doesn't mean the variance of the match results is any different from what is calculated. In fact it confirms it.
Do you actually have enough time to play every possible chess game so that you cover them all? I don't. So I have to rely on a sample of games. If I do 2 trials, I get two samples, and based on current results, the variance is sky-high across 80 games, compared to what I originally expected (total determinism). But variance is based on the samples you use to compute it. We can compute the theoretical variance if we know a specific probability distribution. But neither I nor anybody else I know of knows such a distribution when it applies to computer vs computer chess games. At least not yet...

Each sample, composed of several games, has its own mean, standard deviation, etc. As does the complete population. The larger the sample size, the closer the sample's numbers approach the population's numbers. I am not quite sure what you are talking about above, in that context.


Given the variance of the distribution one can calculate the probability for a certain deviation, and this is what I did.
That is exactly my point. To reduce the variance to something that is reliable, we need a large number of games because the results of the 80-game mini-matches I play are so widely distributed.
I did not think they were widely distributed at all. The observed variance of the mini-match results was lower than one would expect for the score, and the game results were actually highly correlated between matches. This shows that either the initial positions are not very well selected for equality, or the non-determinism you champion is not nearly as big as you claim.
I am not bothered by the necessity of playing 20K games to make a decision. Since I can play 256 at a time, that is the same as playing less than 100 games one at a time. With a fast time control, that doesn't take that long.
Well, good for you. So availability of excessive computer power causes atrophy of the brain. Nothing new about that. But I only have one dual-core machine, so I still have to _think_ about how to do things efficiently. And actually, the thinking is where all the fun is. :P
The alternative is to use small N, and make poor decisions because of the inherent randomness (large sigma/sigma^2).
The alternative for me is to eliminate all unnecessary variability, so that I can base decisions with as much confidence as you, on 256 times fewer games.

Then I await such an amazing test paradigm. So far, such has not been discovered and published or discussed. If you eliminate the variability, you are eliminating one thing that is an absolute part of playing real games, and therefore drawing conclusions from what you think is a small but adequate set of data, when in reality it is just a tiny subset of a much bigger set that you are completely ignoring in testing, but which you will encounter in real games...

BTW you could make sigma much smaller by just taking each 2-game position as a match. The results have to lie between -2 and +2. But I am not sure how you can use that to accept/reject a change to the program, since the before/after results will be identical to several decimal places.
So what? Like anyone cares about sigma... What matters is the probability that you will exceed a certain preset score threshold at which you are confident that the engine is better. With +1/0/-1 scoring the SD per game is ~0.8, so for a 2-game match it is about 1.2. For 2-sigma confidence you would have to score 2.5 points out of 2 games. Well, good luck with that. How many hours would that take on your cluster to succeed once? :lol:
Given two equal humans, who play 100 games, what would you expect to be the outcome? 45-45-10? If you repeat the test, what would you expect? I would _not_ expect the two equal computers to produce that same level of consistency. That is what I am trying to explain. Comp vs Comp is nowhere close to human vs human..
I don't really follow human chess, and from my relative ignorance I would rather expect something like 25-25-50 (+,-,=). I don't see what you are driving at at all. What do you mean by 'level of consistency'? I don't see anything consistent in 45-45-10 or 25-25-50.
I have played chess for 50 years now. In our local club events, with the same participants, the results have been consistent from meeting/tournament to meeting/tournament. Some variability, but in general a person rated 200 points above the next highest-rated player wins the events consistently. Not with computers. Not anywhere near that level of consistency.
For an infinite number of trials, I agree. But I am the one arguing for large N. So how can you argue based on large N values, when you want to use small N results?? The standard deviation can be much larger than the above, for the small sample sizes that you propose...
Again, what you say seems based on the idea that the variance of a stochastic variable is dependent on the number of draws. But it is not.

For individual outcomes ('mini-matches') anything is _possible_. But not with equal _probability_. If the first mini-match you play has 80 losses, then it is possible that the next one, with the same opponents and the same initial positions, will have 80 wins. But you will NOT see it happen in the lifetime of the universe, even if you used a million of your clusters. At least, not as long as the games within the mini-match are independent. (E.g. if the first game that crashes the engine caused forfeit on time of all subsequent games, because the engine could not recover enough to recognize the next 'new' command, this would of course be different.)

The standard deviation of the score distribution of an 80-game match (in the +1/0/-1 scoring system) can never be larger than sqrt(80). That is a hard fact.

The variance of this probability distribution will tell you how large the probability is that a certain deviation from the mean will occur in a mini-match, or in a given number of mini-matches.
I understand. But when I look at 80-game matches, I see results that vary enough that any single 80-game match is worthless for telling me "good/bad" with respect to a new change. 4 of those matches added together _still_ have too much variance to be useful, as one says "better" and the next says "worse".
hgm
Posts: 27808
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Variance reports for testing engine improvements

Post by hgm »

bob wrote:What distribution? How did you arrive at it? Is it different for every pair of programs? I know of no accurate distribution that indicates how chess games are distributed over the interval {0, .5, 1} when we are talking about computers vs computers.
The nice thing is that you don't have to know that. The standard deviation of a distribution can never be larger than half the range that the results can attain. So in this case, that is (1-0)/2 = 0.5. So if you use this upper limit in calculating your confidence intervals, you will always err on the safe side. The probability that an equal or worse version will still pass the test for a better one can only be lower if the variability is in practice lower than the worst case.

Now in practice, the variance is not very sensitive to the exact distribution. SD=0.5 is only achieved for a 50-50-0 distribution, but add 30% draws, and it is still ~0.4. For 10-90-0 it is still 0.3, and such extreme scores are immediately obvious from the data. So you are seldom off by a factor of 2.
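A quick check of those numbers on the 1/½/0 scale (a minimal sketch; the three distributions are the ones just mentioned):

Code:

# Per-game SD on the 1 / 0.5 / 0 scoring scale for a few win-loss-draw splits.
def per_game_sd(p_win, p_loss):
    p_draw = 1.0 - p_win - p_loss
    mean = p_win + 0.5 * p_draw
    ex2 = p_win + 0.25 * p_draw
    return (ex2 - mean * mean) ** 0.5

for w, l in ((0.50, 0.50), (0.35, 0.35), (0.10, 0.90)):
    print(f"{w:.0%}-{l:.0%}-{1 - w - l:.0%} gives SD {per_game_sd(w, l):.2f}")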
Do you actually have enough time to play every possible chess game so that you cover them all? I don't. So I have to rely on a sample of games. If I do 2 trials, I get two samples, and based on current results, the variance is sky-high across 80 games, compared to what I originally expected (total determinism).
Well, I cannot take any responsibility for what you expect, of course. But the spread is not sky-high compared to the statistically predicted SD of 0.4*sqrt(80) ≈ 3.6. In fact it was significantly lower.
But variance is based on the samples you use to compute it. We can compute the theoretical variance if we know a specific probability distribution. But neither I nor anybody else I know of knows such a distribution when it applies to computer vs computer chess games. At least not yet...

Each sample, composed of several games, has its own mean, standard deviation, etc. As does the complete population. The larger the sample size, the closer the sample's numbers approach the population's numbers. I am not quite sure what you are talking about above, in that context.
In the calculations I presented, the variance was not computed from any samples, but derived from its theoretical upper limit.
Then I await such an amazing test paradigm. So far, such has not been discovered and published or discussed.
Somehow I don't see that as a good reason for not designing one...
If you eliminate the variability, you are eliminating one thing that is an absolute part of playing real games, and therefore drawing conclusions from what you think is a small but adequate set of data, when in reality it is just a tiny subset of a much bigger set that you are completely ignoring in testing, but which you will encounter in real games...
This is intrinsic in sampling. As you said, it is not possible to play every possible game, so there is always a much bigger set of positions that you ignore which _could_ occur in games.

How big your sample will have to be depends on what change you are testing. Obviously adding knowledge about KPK will require a much larger sample of games to get a realistic number of KPK endings, than deciding if the value of a Bishop is smaller or larger than 2 Pawns (which almost affects every game).
I have played chess for 50 years now. In our local club events, with the same participants, the results have been consistent from meeting/tournament to meeting/tournament. Some variability, but in general a person rated 200 points above the next highest-rated player wins the events consistently. Not with computers. Not anywhere near that level of consistency.
I am still not sure what this has to do with the measurement problem. You seem to imply that the rating model (expected score percentage vs rating difference) should be different for computers than for humans, in particular that it should have longer tails. That could of course be easily fixed. But so far all we are discussing is win probabilities, and how they are affected by changes. Ratings, or how you could compute them from such win probabilities, simply have no bearing on that discussion at all.
I understand. But when I look at 80-game matches, I see results that vary enough that any single 80-game match is worthless for telling me "good/bad" with respect to a new change. 4 of those matches added together _still_ have too much variance to be useful, as one says "better" and the next says "worse".
Well, that depends on how big the change is. To resolve small strength differences, you need more games. That is what standard statistical theory tells you in the first place, so I don't see how that could have any bearing on the applicability of standard statistical theory.
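To put numbers on "more games", here is a standard sample-size sketch, assuming a per-game SD of 0.4 (as above) and the usual logistic Elo model; the Elo differences are chosen purely as examples. It asks how many games are needed before the 2-sigma error bar on the measured score is as small as the difference being resolved.

Code:

import math

SIGMA = 0.4                                # assumed per-game SD of the score
for elo in (5, 10, 20, 50):
    # Slope of the logistic rating curve at equality: ln(10)/1600 per Elo point.
    d_score = elo * math.log(10) / 1600    # expected score difference per game
    n = (2 * SIGMA / d_score) ** 2         # from 2*SIGMA/sqrt(n) == d_score
    print(elo, "Elo ->", round(n), "games")

That is consistent with the thousands of games discussed in this thread for few-Elo changes, and only a few hundred for large ones.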
nczempin

Re: Variance reports for testing engine improvements

Post by nczempin »

nczempin wrote:
nczempin wrote:A temporary snapshot of my current 10-game gauntlet:

...
Perhaps one should remove those engines that have only 3 results (since their inclusion was arbitrary to begin with, removing them shouldn't make any difference compared to the situation I would have had if I hadn't included them from the start).
Perhaps I should stop this particular experiment and choose 5 opponents for the 40-position test. However, I strongly believe that whether or not the 40-position test is used is the dominant factor in the conclusions that can be drawn about variance.

I.e. I expect that in that test, the variance will be much higher for Eden than it is in the current test.

Here's the current data after 5 games. Yes, 5 is not a natural stopping point, but it just happens to be where I am. You can ignore the 5th result and just take the completed 4-game matches.

Code:

    Engine             Score       Ed
01: Eden_0.0.13       86,5/265 ····· 
02: BikJump            5,0/5    11111 
02: Cheetah            5,0/5    11111 
02: Flux               5,0/5    11111 
02: Heracles           5,0/5    11111 
02: Kingsout           5,0/5    11111 
02: Mediocre           5,0/5    11111 
02: Needle             5,0/5    11111 
02: Roce036            5,0/5    11111 
02: Rotor              5,0/5    11111 
02: TJchess078R2       5,0/5    11111 
12: Alf109             4,5/5    1111= 
12: Jsbam              4,5/5    1111= 
12: Mizar              4,5/5    111=1 
12: Murderhole         4,5/5    1111= 
16: Beaches            4,0/5    10111 
16: Milady             4,0/5    ==111 
16: Gedeone1620        4,0/5    01111 
16: DChess1_0_2        4,0/5    11==1 
16: Hoplite            4,0/5    11011 
16: PolarEngine        4,0/5    1=11= 
16: IQ23               4,0/5    1=1=1 
16: Enigma             4,0/5    11=1= 
16: Storm              4,0/5    01111 
16: Tamerlane02        4,0/5    1=1=1 
16: FAUCE              4,0/5    11110 
16: Z2k3               4,0/5    1==11 
28: Olithink           3,5/5    =1110 
28: Tscp181            3,5/5    ==11= 
28: Blikskottel        3,5/5    0=111 
31: Pooky27            3,0/5    11001 
31: Rainman            3,0/5    0=1=1 
31: Minimax            3,0/5    01011 
31: Aldebaran          3,0/5    01=1= 
31: SharpChess2        3,0/5    10101 
31: Eden_0.0.11        3,0/5    10101 
31: Eden_0.0.12_server 3,0/5    10101 
31: ALChess1.5b        3,0/5    11010 
31: Piranha            3,0/5    01011 
31: Umax4_8w           3,0/5    10101 
31: Yawce016           3,0/5    11001 
31: Exacto             3,0/5    01011 
43: Awe170             2,5/5    =0101 
43: Vanillachess       2,5/5    0110= 
45: Vicki              2,0/5    01010 
45: Atlanchess         2,0/5    01010 
45: Roque              2,0/5    =010= 
48: Cefap              1,5/5    010=0 
49: Philemon           1,0/5    =0=00 
49: Eden0.0.12 JA      1,0/5    00100 
49: APILchess          1,0/5    0=0=0 
52: Sdbc               0,5/5    000=0 
53: Dimitri Chess      0,0/5    00000 
53: RobinEngine        0,0/5    00000 
I will suspend this test once I have reached 6, and proceed with the 40 positions.

Bob, would you care to choose 5 or 6 opponents that I should use in that test?

And could you please point me to where I will find them? I can of course search, but it would be nice if we had a reference in this thread.
nczempin

Re: Variance reports for testing engine improvements

Post by nczempin »

nczempin wrote: And could you please point me to where I will find them? I can of course search, but it would be nice if we had a reference in this thread.
Okay, I have tried my best to find those positions; so far I haven't found them. Would you please post where I can find them in FEN form?

Much obliged.
nczempin

Re: Variance reports for testing engine improvements

Post by nczempin »

Okay, here's my current status, only 12 games to go before the 3rd match is finished. I can run those overnight, and before that I can use my computer for other purposes.

I would really really appreciate if someone could post those 40 positions that we were talking about, so that I can carry out the other test.


Any suggestions as to what engines I should use as opponents would be appreciated, too. Pooky immediately comes to mind; so far it seems to give the highest variance. In the absence of other information, this seems to be a good choice. But what about the others?

Code:


    Engine             Score         Ed
01: Eden_0.0.13dev03   101,5/306 ······ 
02: BikJump            6,0/6     111111 
02: Cheetah            6,0/6     111111 
02: Flux               6,0/6     111111 
02: Kingsout           6,0/6     111111 
02: Mediocre           6,0/6     111111 
02: Needle             6,0/6     111111 
02: Roce036            6,0/6     111111 
09: Heracles           5,5/6     11111= 
09: Mizar              5,5/6     111=11 
09: Jsbam              5,5/6     1111=1 
09: Alf109             5,5/6     1111=1 
13: Enigma             5,0/6     11=1=1 
13: FAUCE              5,0/6     111101 
13: Beaches            5,0/6     101111 
13: Milady             5,0/6     ==1111 
13: Gedeone1620        5,0/6     011111 
13: DChess1_0_2        5,0/6     11==11 
13: Hoplite            5,0/6     110111 
13: Rotor              5,0/5     11111  
13: TJchess078R2       5,0/5     11111  
22: PolarEngine        4,5/6     1=11== 
22: Murderhole         4,5/6     1111=0 
22: IQ23               4,5/6     1=1=1= 
22: Olithink           4,5/6     =11101 
26: Pooky27            4,0/6     110011 
26: Blikskottel        4,0/6     0=111= 
26: Aldebaran          4,0/6     01=1=1 
26: Storm              4,0/5     01111  
26: Tamerlane02        4,0/5     1=1=1  
26: Exacto             4,0/6     010111 
26: Z2k3               4,0/5     1==11  
33: ALChess1.5b        3,5/6     11010= 
33: Tscp181            3,5/5     ==11=  
33: Rainman            3,5/6     0=1=1= 
36: Minimax            3,0/6     010110 
36: Eden_0.0.11        3,0/6     101010 
36: Piranha            3,0/6     010110 
36: Eden_0.0.12_server 3,0/6     101010 
36: Umax4_8w           3,0/5     10101  
36: Yawce016           3,0/5     11001  
36: SharpChess2        3,0/5     10101  
43: Awe170             2,5/6     =01010 
43: Vanillachess       2,5/5     0110=  
43: Cefap              2,5/6     010=01 
43: Roque              2,5/6     =010== 
47: Atlanchess         2,0/6     010100 
47: Vicki              2,0/5     01010  
49: APILchess          1,5/6     0=0=0= 
50: Philemon           1,0/6     =0=000 
50: Eden0.0.12 JA      1,0/6     001000 
52: Sdbc               0,5/5     000=0  
53: Dimitri Chess      0,0/6     000000 
53: RobinEngine        0,0/6     000000 

306 of 318 games played
Name of the tournament: Eden 13 Gauntlet
Site/ Country: ATZE, Germany
Level: Blitz 2/6
Hardware: with 1,022 MB of memory
Operating system: Microsoft Windows NT Home Edition (Build 6000)
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Variance reports for testing engine improvements

Post by bob »

Here they are:

1rbq1rk1/1pp2pbp/p1np1np1/4p3/2PPP3/2N1BP2/PP1Q2PP/R1N1KB1R w KQ e6 fmvn 10; id "Silver Suite - King's Indian, Saemisch : ECO E84";
r1b1k2r/ppppnppp/2n2q2/2b5/3NP3/2P1B3/PP3PPP/RN1QKB1R w KQkq - fmvn 7; id "Silver Suite - Scotch : ECO C45";
r1b1k2r/ppppqppp/2n1pn2/8/1bPP4/5NP1/PP1BPP1P/RN1QKB1R w KQkq - fmvn 6; id "Silver Suite - Bogo-Indian : ECO E11";
r1b1kbnr/1pp2ppp/p1p5/8/3NP3/8/PPP2PPP/RNB1K2R b KQkq - fmvn 7; id "Silver Suite - Spanish, Exchange : ECO C68";
r1b1kbnr/1pqp1ppp/p1n1p3/8/3NP3/2N5/PPP1BPPP/R1BQK2R w KQkq - fmvn 7; id "Silver Suite - Sicilian, Paulsen : ECO B47";
r1b1kbnr/pp3ppp/1qn1p3/3pP3/2pP4/P1P2N2/1P3PPP/RNBQKB1R w KQkq - fmvn 7; id "Silver Suite - French, Advance : ECO C02";
r1bq1rk1/1p2bppp/p1n1pn2/8/P1BP4/2N2N2/1P2QPPP/R1BR2K1 b - - fmvn 11; id "Silver Suite - Queen's Gambit accepted : ECO D27";
r1bq1rk1/pp2bpp1/2n2n1p/3p4/3N4/2N1B1P1/PP2PPBP/R2Q1RK1 b - - fmvn 11; id "Silver Suite - Queen's Gambit, Tarrash : ECO D34";
r1bq1rk1/pp2ppbp/2n3p1/2p5/2BPP3/2P1B3/P3NPPP/R2Q1RK1 b - - fmvn 10; id "Silver Suite - Gruenfeld : ECO D87";
r1bq1rk1/pp3ppp/2n1pn2/2pp4/2PP4/P1PBPN2/5PPP/R1BQ1RK1 b - - fmvn 9; id "Silver Suite - Nimzo-Indian, Rubinstein : ECO E58";
r1bq1rk1/ppp1npbp/3p2p1/3Pp2n/1PP1P3/2N2N2/P3BPPP/R1BQR1K1 b - - fmvn 10; id "Silver Suite - King's Indian, Classical : ECO E99";
r1bq1rk1/pppn1pbp/3p1np1/4p3/2PPP3/2N2NP1/PP3PBP/R1BQ1RK1 b - e3 fmvn 8; id "Silver Suite - King's Indian, Fianchetto : ECO E66";
r1bqk1nr/pp3pbp/2npp1p1/2p5/4PP2/2NP2P1/PPP3BP/R1BQK1NR w KQkq - fmvn 7; id "Silver Suite - Sicilian, Closed : ECO B25";
r1bqk2r/1pp2ppp/pbnp1n2/4p3/PPB1P3/2PP1N2/5PPP/RNBQK2R w KQkq - fmvn 8; id "Silver Suite - Italian : ECO C54";
r1bqk2r/5ppp/p1np1b2/1p1Np3/4P3/N1P5/PP3PPP/R2QKB1R b KQkq - fmvn 11; id "Silver Suite - Sicilian, Sveshnikov : ECO B33";
r1bqkb1r/pp1n1ppp/2n1p3/2ppP3/3P4/2PB4/PP1N1PPP/R1BQK1NR w KQkq - fmvn 7; id "Silver Suite - French, Tarrasch : ECO C05";
r1bqkb1r/pppp1ppp/2n2n2/4p3/2P5/2N2NP1/PP1PPP1P/R1BQKB1R b KQkq - fmvn 4; id "Silver Suite - English, Asymmetric : ECO A29";
r1r3k1/pp1bppbp/3p1np1/q3n3/3NP2P/1BN1BP2/PPPQ2P1/2KR3R w - - fmvn 13; id "Silver Suite - Sicilian, Dragon : ECO B79";
r2q1rk1/1b1nbppp/pp1ppn2/8/2PQP3/1PN2NP1/PB3PBP/R2R2K1 b - e3 fmvn 12; id "Silver Suite - Queen's Indian, hedgehog : ECO A30";
r2qk2r/pp1n1ppp/2p1pn2/5b2/PbBP4/2N1PN2/1P3PPP/R1BQ1RK1 w kq - fmvn 9; id "Silver Suite - Queen's Gambit, Slav : ECO D18";
r2qkb1r/2p2ppp/p1n1b3/1p1pP3/4n3/1B3N2/PPP2PPP/RNBQ1RK1 w kq - fmvn 9; id "Silver Suite - Spanish, Open : ECO C80";
r2qr1k1/1bp1bppp/p1np1n2/1p2p3/3PP3/1BP2N1P/PP1N1PP1/R1BQR1K1 b - - fmvn 11; id "Silver Suite - Spanish, Closed - Zaitsev : ECO C92";
r3k2r/pp2bppp/2nqpn2/7b/3P4/2N1BN1P/PP2BPP1/R2Q1RK1 w kq - fmvn 12; id "Silver Suite - Sicilian, 2.c3 : ECO B22";
rn1q1rk1/pbp2pp1/1p2pn1p/3p4/2PP3B/P1Q2P2/1P2P1PP/R3KBNR w KQ d6 fmvn 10; id "Silver Suite - Nimzo-Indian, 4.Qc2 : ECO E32";
rn1q1rk1/ppp1ppbp/3p1np1/8/3PP1b1/2N2N2/PPP1BPPP/R1BQ1RK1 w - - fmvn 7; id "Silver Suite - Pirc, 4.Nf3 : ECO B08";
rn1qk2r/p3bppp/bpp1pn2/3p4/2PP4/1PB2NP1/P3PPBP/RN1QK2R w KQkq d6 fmvn 9; id "Silver Suite - Queen's Indian : ECO E15";
rn1qk2r/ppp1bppp/1n1pp3/4P2b/2PP4/5N1P/PP2BPP1/RNBQ1RK1 w kq - fmvn 9; id "Silver Suite - Alekhine, Modern : ECO B05";
rn1qkb1r/3ppp1p/b4np1/2pP4/8/2N5/PP2PPPP/R1BQKBNR w KQkq - fmvn 7; id "Silver Suite - Benko Gambit : ECO A58";
rn1qkbnr/pp3ppp/2p1p3/3pPb2/3P4/5N2/PPP1BPPP/RNBQK2R b KQkq - fmvn 5; id "Silver Suite - Caro-Kann, Advance : ECO B12";
rn2kb1r/pp2pppp/2p2n2/q4b2/2BP4/2N2N2/PPP2PPP/R1BQK2R w KQkq - fmvn 7; id "Silver Suite - Scandinavian : ECO B01";
rnb1qrk1/ppp1p1bp/3p1np1/3P1p2/2P5/2N2NP1/PP2PPBP/R1BQ1RK1 b - - fmvn 8; id "Silver Suite - Dutch, Leningrad : ECO A87";
rnb2rk1/1pq1bppp/p2ppn2/8/3NPP2/2N1B3/PPP1B1PP/R2Q1RK1 w - - fmvn 10; id "Silver Suite - Sicilian, Scheveningen : ECO B84";
rnbq1rk1/p1p1bpp1/1p2pn1p/3p4/2PP3B/2N1PN2/PP3PPP/R2QKB1R w KQ - fmvn 8; id "Silver Suite - Queen's Gambit declined, Tartakower : ECO D58";
rnbq1rk1/pp3pbp/3p1np1/2pP4/4P3/2N2N2/PP2BPPP/R1BQ1RK1 b - - fmvn 9; id "Silver Suite - Benoni, Modern : ECO A73";
rnbq1rk1/ppp1ppbp/3p1np1/8/3PPP2/2NB1N2/PPP3PP/R1BQK2R b KQ - fmvn 6; id "Silver Suite - Pirc, Austrian attack : ECO B09";
rnbq1rk1/ppp1ppbp/3p1np1/8/8/3P1NP1/PPP1PPBP/RNBQ1RK1 w - - fmvn 6; id "Silver Suite - Reti Opening : ECO A05";
rnbqk2r/pp2nppp/4p3/2ppP3/3P4/P1P2N2/2P2PPP/R1BQKB1R b KQkq - fmvn 7; id "Silver Suite - French, Winawer : ECO C19";
rnbqk2r/ppp1bppp/3p1n2/8/3NP3/2N5/PPP2PPP/R1BQKB1R w KQkq - fmvn 6; id "Silver Suite - Philidor's Defense : ECO C41";
rnbqkb1r/pp3ppp/4pn2/2pp4/3P4/2PBPN2/PP3PPP/RNBQK2R b KQkq - fmvn 5; id "Silver Suite - Colle system : ECO D05";
rnbqkb1r/ppp1pppp/5n2/3p4/5P2/1P3N2/P1PPP1PP/RNBQKB1R b KQkq - fmvn 3; id "Silver Suite - Bird's Opening : ECO A03";
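For anyone wanting to feed these to a match driver: the records are EPD-style, the four FEN fields followed by an "fmvn" (fullmove number) opcode and an "id". Here is a minimal Python sketch to turn them into plain FEN, assuming a placeholder file name of silver.epd and substituting 0 for the unrecorded halfmove clock:

Code:

# Convert the EPD records above to plain FEN, one per line.
def epd_to_fen(record):
    first_op = record.strip().split(";")[0].split()
    placement, side, castling, ep = first_op[:4]
    fullmove = first_op[5] if len(first_op) > 5 and first_op[4] == "fmvn" else "1"
    return f"{placement} {side} {castling} {ep} 0 {fullmove}"

with open("silver.epd") as f:      # placeholder name for the list above
    for line in f:
        if line.strip():
            print(epd_to_fen(line))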
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Variance reports for testing engine improvements

Post by bob »

I am not sure what test you are going to run. The test I suggested you run was this:

Your program vs one opponent, using the 40 Silver positions, playing black and white in each, for a total of 80 games. Then run the _same_ test again and report both results to see if you see the kind of non-deterministic results I see with the programs I am using.

You don't need to use long time controls either; I have found very little difference between very fast and very slow with respect to the non-determinism. You might score differently at 1+1 than at 60+60, but for me both produce the same level of randomness.
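Once both runs are finished, the comparison itself is trivial. A sketch of the bookkeeping, where the two result strings are placeholders to be replaced by the 80 results of each run, in the same game order and in the 1/=/0 notation of the crosstables above:

Code:

# Compare two runs of the same test: total scores, plus how many games
# came out identical. Placeholder strings; only 10 games shown.
run1 = "1=01101011"
run2 = "10=1101101"

value = {"1": 1.0, "=": 0.5, "0": 0.0}
s1 = sum(value[c] for c in run1)
s2 = sum(value[c] for c in run2)
same = sum(a == b for a, b in zip(run1, run2))
print("run 1:", s1, " run 2:", s2)
print("identical game outcomes:", same, "of", len(run1))
# Fully deterministic play would reproduce every game; similar totals with
# many differing individual games is exactly the non-determinism at issue.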