Variance reports for testing engine improvements

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

nczempin

Re: Variance reports for testing engine improvements

Post by nczempin »

bob wrote:I am not sure what test you are going to run. The test I suggested you run was this:

Your program vs one opponent, using the 40 Silver positions, playing black and white in each, for a total of 80 games. Then run the _same_ test again and report both results to see if you see the kind of non-deterministic results I see with the programs I am using.

You don't need to use long time controls either, I have found very little difference between very fast and very slow with respect to the non-determinism. You might score differently at 1+1 than at 60+60, but for me both produce the same randomness level.
Okay, I will try to follow these specifications as closely as possible. I will still run them at 2/6 however, and stop complaining about resources.
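
Just to make sure we mean the same thing by that, here is a rough sketch (Python, with a made-up file name for the position set; this is not how my setup actually drives the games) of the schedule I read into it:

Code: Select all

# Rough sketch of the schedule only; "silver40.epd" is a made-up file name for
# the 40 Silver starting positions, one per line.
def schedule(position_file="silver40.epd", repeats=2):
    positions = [line.strip() for line in open(position_file) if line.strip()]
    games = []
    for rep in range(1, repeats + 1):                   # the match is run twice
        for num, pos in enumerate(positions, start=1):  # 40 positions
            games.append((rep, num, "my engine moves first", pos))
            games.append((rep, num, "opponent moves first", pos))
    return games                                        # 2 x 80 = 160 games

# for game in schedule():
#     print(game[:3])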
nczempin

Re: Variance reports for testing engine improvements

Post by nczempin »

nczempin wrote:
bob wrote:I am not sure what test you are going to run. The test I suggested you run was this:

Your program vs one opponent, using the 40 Silver positions, playing black and white in each, for a total of 80 games. Then run the _same_ test again and report both results to see if you see the kind of non-deterministic results I see with the programs I am using.

You don't need to use long time controls either, I have found very little difference between very fast and very slow with respect to the non-determinism. You might score differently at 1+1 than at 60+60, but for me both produce the same randomness level.
Okay, I will try to follow these specifications as closely as possible. I will still run them at 2/6 however, and stop complaining about resources.
I am still open to suggestions as to which engines I should use. I will choose one engine to run against and then choose the next one, so as long as the matches against the first engine have not finished yet, I am still free to choose the others.

My first opponent shall be Pooky 2.7.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Variance reports for testing engine improvements

Post by bob »

nczempin wrote:
bob wrote:Here they are:
...
Thank you.

I am proceeding to use this in my test.

Since no one has come up with any suggestions for suitable opponents, I will choose them off-line, looking for evidence of high variance. That is not necessarily a good predictor of high variance in the test I will run, but if all engines are otherwise equivalent it doesn't hurt to choose those, and I could make the weak claim that the variance should be higher for the ones chosen. This means that umax 4_8 certainly won't be in the test :-)

Although perhaps it would be interesting to see how an engine that shows such low variance against my engine behaves in that other test.
From my testing, if even one of the two programs is nondeterministic, the results will show significant variance. If both are, it of course goes up. But I cannot say for certain that one being random gives half the variance of both being random; I didn't measure that. I did run tests with one program varying and the other not, and with both varying, but I didn't do any real analysis, as both cases showed more than enough randomness to make the point...
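
To make that concrete with a toy model (made-up numbers, not my test data or my actual framework): draw each game's outcome from a random swing contributed by whichever programs are nondeterministic, and look at the spread of repeated 80-game match scores. In this toy model one noisy program already produces plenty of spread, and adding a second does not simply double the variance:

Code: Select all

# Toy model only -- made-up numbers, not measured data.  Each game's outcome is
# driven by a random "swing" contributed by whichever programs are nondeterministic.
import random
import statistics

def match_score(noisy_a, noisy_b, games=80, draw_band=0.5):
    score = 0.0
    for _ in range(games):
        swing = 0.0
        if noisy_a:
            swing += random.gauss(0, 1)
        if noisy_b:
            swing += random.gauss(0, 1)
        if swing > draw_band:
            score += 1.0          # win
        elif swing >= -draw_band:
            score += 0.5          # draw
    return score

def spread(noisy_a, noisy_b, repeats=1000):
    return statistics.pstdev([match_score(noisy_a, noisy_b) for _ in range(repeats)])

print("one program nondeterministic  :", spread(True, False))
print("both programs nondeterministic:", spread(True, True))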
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Variance reports for testing engine improvements

Post by bob »

Uri Blass wrote:
nczempin wrote:
hgm wrote:OK, send them as a PM. I guess they are only of relevance for this discussion if they are actually the same.
Three times the exact same game, three times a time forfeit for uMax in a winning position. I'll post them later.
Interesting that you get the same game, while Bob seems not only not to get repeated games but also to get a big variance.

I suspect that something is wrong in his testing.

Here are some statistical thoughts of mine about it:

I think it may be interesting to test the hypothesis H0 that the results of the games are independent against the hypothesis H1 that they are not.

I simply suspect, based on your results, that the results may not be independent, and I can think of two possible factors that could explain such a situation (learning, or one program being slowed down by a significant factor).

You can do it with the following definitions.



You have 80 games in the silver suite test.

You play 80n games, where n is the number of times you repeat the Silver suite against the same opponent.

Denote by Gi,j the result of game i in Silver match number j, where 1 <= i <= 80 and 1 <= j <= n (game i means a specific position with a specific color for Crafty).

Denote the result of match j by Nj, where Nj = G1,j + G2,j + ... + G80,j.

Now look at the following variances (S^2 denotes variance):

X1 = S^2(G1,1, G1,2, ..., G1,n)
X2 = S^2(G2,1, G2,2, ..., G2,n)
...
X80 = S^2(G80,1, G80,2, ..., G80,n)

Denote X = S^2(N1, N2, ..., Nn).

The question is whether E(X) = E(X1) + E(X2) + ... + E(X80).

I think it may be interesting to test

H0: E(X) = E(X1) + E(X2) + ... + E(X80)
H1: E(X) > E(X1) + E(X2) + ... + E(X80)

We can assume that X, X1, ..., X80 are approximately normally distributed, based on the central limit theorem, and I will think later about how to continue from there if I get no reply suggesting what to do.

Uri
The data I posted was produced exactly like that. While I can't tell you who played which color, in the string from match 1 the first two games are position 1: the first game is with Crafty on move first, the second with the opponent on move first. That pairing is repeated consistently for every pair of results, so the first two results are position 1, the second two are position 2, ..., and the last two are position 40. Each line is always in that exact same order.

The ones I find interesting are where you pick any specific column (not easy using HTML and proportional spacing) and find lots of +'s and -'s: that is the same position, the same program playing the same color, winning some and losing some. I do that kind of analysis looking for positions where I lose most games, playing either black or white, because that shows some kind of serious misunderstanding of the position by Crafty. That's why I sometimes display the data in the +/-/= format. But I don't use that for evaluating small changes, just when I am ready to figure out "what do we look at next"...
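
If anyone wants to run the comparison Uri describes on their own data, here is a rough, untested Python sketch (the data layout and the names are made up, not what my referee actually outputs). It compares the variance of the match totals with the sum of the per-game variances, and also flags the kind of +/- columns I described above:

Code: Select all

# Rough sketch only (not my actual tooling).  results[j][i] is the score of
# game i (0, 0.5 or 1, from my engine's side) in repetition j of the 80-game run.
from statistics import pvariance

def variance_report(results):
    n = len(results)             # number of repeated 80-game matches
    games = len(results[0])      # 80 = 40 positions x 2 colors

    # X: variance of the match totals N1..Nn
    match_totals = [sum(match) for match in results]
    x_total = pvariance(match_totals)

    # X1..X80: variance of each individual game across the n repetitions
    per_game = [pvariance([results[j][i] for j in range(n)])
                for i in range(games)]

    print("variance of match totals :", x_total)
    print("sum of per-game variances:", sum(per_game))

    # Flag the "+/- columns": games 2k and 2k+1 belong to position k+1,
    # with my engine on move first in the even-indexed game of each pair.
    for i, v in enumerate(per_game):
        if v > 0:
            side = "engine on move first" if i % 2 == 0 else "opponent on move first"
            print("position %2d (%s): results differ between runs" % (i // 2 + 1, side))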
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Variance reports for testing engine improvements

Post by bob »

nczempin wrote:
nczempin wrote:
bob wrote:I am not sure what test you are going to run. The test I suggested you run was this:

Your program vs one opponent, using the 40 Silver positions, playing black and white in each, for a total of 80 games. Then run the _same_ test again and report both results to see if you see the kind of non-deterministic results I see with the programs I am using.

You don't need to use long time controls either, I have found very little difference between very fast and very slow with respect to the non-determinism. You might score differently at 1+1 than at 60+60, but for me both produce the same randomness level.
Okay, I will try to follow these specifications as closely as possible. I will still run them at 2/6 however, and stop complaining about resources.
I am still open to suggestions as to which engines I should use. I will choose one engine to run against and then choose the next one, so as long as the matches against the first engine have not finished yet, I am still free to choose the others.

My first opponent shall be Pooky 2.7.
I can't offer any advice there. I picked my opponents based on several things.

1. I need source so I can compile executables on an experimental cluster, running bleeding-edge libs and kernel, because it is unlikely anyone could compile something that would work for me.

2. I need programs that are either polyglot (UCI) or xboard compatible. My automated referee can use polyglot just fine, and it handles native xboard programs as well. Programs that only work in other GUIs I can't deal with, so I exclude them.

3. I want programs that are competitive with mine or better. I don't find much useful information in drubbing a low-rated opponent, and since the games take time to play, I want to make the results as useful as possible. But I don't want to use just one, as that test is not as informative in a global sense as playing multiple opponents and combining the results.
Alessandro Scotti

Re: Variance reports for testing engine improvements

Post by Alessandro Scotti »

nczempin wrote:Okay, I will try to follow these specifications as closely as possible. I will still run them at 2/6 however, and stop complaining about resources.
IMO you don't have to worry too much about test matches until the engine is reasonably free of bugs. Such an engine will be stronger than 2000 Elo with just a minimal set of features. Until that point, I think it's better to spend the time on unit tests and maybe on reviewing the code and algorithms.
After that, I'm still convinced that 1+1 is preferable to 2+6, as it allows more games to be played.
hgm
Posts: 27795
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Variance reports for testing engine improvements

Post by hgm »

nczempin wrote:
hgm wrote:A bit off topic: could you post the Eden - uMax games from your above test? It is kind of unusual for uMax 4.8 to lose to Eden (although Eden 0.0.13 might of course be much better than the version of Eden I am used to, which is about the same level as uMax 1.6), and I see that in your test it has already lost 2 games. So I wonder if these are the same games, and if the losses are somehow related to the time-control trouble you notified me about earlier.
Well, I can send them to you in private unless someone else really wants to see them. I was also surprised at those results, and indeed when running a tournament, umax 4_8 usually finishes way above any Eden version so far. But I haven't looked at the actual games yet.
OK, thanks for the games.

I think they are actually relevant for this discussion:

Of the 7 games played so far, all 4 Eden-uMax games are the same, and all the uMax-Eden games are the same, move for move, until the very end. uMax forfeits its White game on time after 32 moves (and therefore all its White games, since they are identical). Apparently it cannot handle this time control; I really should do something about that.

I compared them stroboscopically, and the only differences in the PGN are the last number in the comments, which differs by 1 unit in 4 or 5 places in each game. What does that number represent? Is it the time spent on the move, in seconds?

No variability for these two engines...
hgm
Posts: 27795
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Variance reports for testing engine improvements

Post by hgm »

bob wrote:Your program vs one opponent, using the 40 Silver positions, playing black and white in each, for a total of 80 games. Then run the _same_ test again and report both results to see if you see the kind of non-deterministic results I see with the programs I am using.
OK, test results are in, so some real data.

I played the two 80-game matches from the Silver positions between micro-Max 1.6 and Eden 0.0.13_win32_jet, at 40/1'.

Of the 80 games, 64 were repeated identically, move for move, until the very end.

In the other 16, it was Eden that deviated first in 10 cases, on moves 11, 38, 27, 10, 42, 42, 6, 28, 16 and 7. The other 6 deviations were due to uMax choosing a different move, on moves 32, 46, 24, 15, 37 and 25.

Note that some of the games were short (shortest game: 10 moves), increasing the likelihood that they are repeated identically, but there was also one 125-move game, and two 112-move games, that were completely identical.

It can be concluded that the fraction of identically repeated games is very high (64 of 80, i.e. 80%), so one would waste a factor of 5 in testing time by running repeated mini-matches between these engines: only one game in five produces new information, the rest merely duplicate games that have already been played. Clearly a different testing methodology is needed (e.g. more starting positions).

Btw, Eden won both mini-matches by 41.5 - 38.5, although about half of the 16 games that diverged also had a different result.

The theoretical estimate that uMax would deviate about once in 40 moves seems quite realistic. The large number of identical games is simply a consequence of the fact that so many games were shorter than 40 moves. As to the game results, many games are of course already decided before move 40, even if they drag on for 60 more moves. Therefore games that diverge only after 30-40 moves produce highly correlated results, even if the starting position is not biased in itself.
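
For anyone who wants to repeat such a comparison on his own PGN files, here is a rough Python sketch of how it could be automated (the file names are made up, and it is not the script I actually used): it strips the tags and comments and reports, for each game pair, whether the two runs are identical or at which half-move they first differ:

Code: Select all

# Rough sketch, made-up file names.  run1.pgn and run2.pgn are assumed to hold
# the same 80 games in the same order and to come from the same GUI, so that
# the PGN formatting is identical apart from the moves themselves.
import re

def movetext(game_text):
    # Drop tag pairs, {comments} (clock times etc.), move numbers and results.
    body = " ".join(l for l in game_text.splitlines() if not l.startswith("["))
    body = re.sub(r"\{[^}]*\}", " ", body)
    return [t for t in body.split()
            if not re.match(r"^\d+\.+$", t)
            and t not in ("1-0", "0-1", "1/2-1/2", "*")]

def split_games(text):
    # Very crude: a new game starts at every [Event tag.
    return [g for g in re.split(r"(?=\[Event )", text) if g.strip()]

def compare_runs(file1, file2):
    games1 = split_games(open(file1).read())
    games2 = split_games(open(file2).read())
    for k, (g1, g2) in enumerate(zip(games1, games2), start=1):
        m1, m2 = movetext(g1), movetext(g2)
        if m1 == m2:
            print("game %2d: identical (%d half-moves)" % (k, len(m1)))
        else:
            ply = next((i for i, (a, b) in enumerate(zip(m1, m2)) if a != b),
                       min(len(m1), len(m2)))
            print("game %2d: first difference at half-move %d" % (k, ply + 1))

# compare_runs("run1.pgn", "run2.pgn")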
nczempin

Re: Variance reports for testing engine improvements

Post by nczempin »

bob wrote:
nczempin wrote:
nczempin wrote:
bob wrote:I am not sure what test you are going to run. The test I suggested you run was this:

Your program vs one opponent, using the 40 Silver positions, playing black and white in each, for a total of 80 games. Then run the _same_ test again and report both results to see if you see the kind of non-deterministic results I see with the programs I am using.

You don't need to use long time controls either, I have found very little difference between very fast and very slow with respect to the non-determinism. You might score differently at 1+1 than at 60+60, but for me both produce the same randomness level.
Okay, I will try to follow these specifications as closely as possible. I will still run them at 2/6 however, and stop complaining about resources.
I am still open to suggestions as to which engines I should use. I will choose one engine to run against and then choose the next one, so as long as the matches against the first engine have not finished yet, I am still free to choose the others.

My first opponent shall be Pooky 2.7.
I can't offer any advice there. I picked my opponents based on several things.

1. I need source so I can compile executables on an experimental cluster, running bleeding-edge libs and kernel, because it is unlikely anyone could compile something that would work for me.

2. I need programs that are either polyglot (UCI) or xboard compatible. My automated referee can use polyglot just fine, and it handles native xboard programs as well. Programs that only work in other GUIs I can't deal with, so I exclude them.

3. I want programs that are competitive with mine or better. I don't find much useful information in drubbing a low-rated opponent, and since the games take time to play, I want to make the results as useful as possible. But I don't want to use just one, as that test is not as informative in a global sense as playing multiple opponents and combining the results.
I agree with these three principles; number 1 of course severely limits your choice, and for me it is not a requirement.

In particular I agree with 3: for the same reason that we both eliminate opponents that are too weak, I also eliminate those that are so strong that it is highly unlikely the new version will make any progress against them.
nczempin

Re: Variance reports for testing engine improvements

Post by nczempin »

Alessandro Scotti wrote:
nczempin wrote:Okay, I will try to follow these specifications as closely as possible. I will still run them at 2/6 however, and stop complaining about resources.
IMO you don't have to worry too much about test matches until the engine is reasonably free of bugs. Such an engine will be stronger than 2000 Elo with just a minimal set of features. Until that point, I think it's better to spend the time on unit tests and maybe on reviewing the code and algorithms.
After that, I'm still convinced that 1+1 is preferable to 2+6, as it allows more games to be played.
Given the goal I have for Eden releases, which I have stated here often enough, this is not an option. Yes, you can question the goal, but I will stick with it.

Interestingly, Eden 0.0.13 has a huge known bug: it does not recognize diagonal saves from back-rank mates. This is costing it a fairly high number of games, and it was trivial to fix.

Yet I still want to find out: if I release Eden 0.0.14 with just this fix, will that be sufficient to claim that it is significantly stronger?

So I think test matches are still useful even if your engine has bugs, even big ones.


Note also that I have an extensive set of automated (J)Unit tests covering the basics of the protocols and commands, the basics of move legality, and performance, and for each bug I add at least one regression test.

So my engine can be considered reasonably bug-free; I believe I am doing more internal testing than most at my level, and probably more than many at higher levels, although probably not more than those at the highest levels.

I also asked for more test suites and test discipline a very long time ago, and got no response. Not one.