Variance reports for testing engine improvements

Discussion of chess software programming and technical issues.

Moderators: bob, hgm, Harvey Williamson

nczempin

Re: Variance reports for testing engine improvements

Post by nczempin » Tue Sep 18, 2007 7:45 am

bob wrote:I am not sure what test you are going to run. The test I suggested you run was this:

Your program vs one opponent, using the 40 Silver positions, playing black and white in each, for a total of 80 games. Then run the _same_ test again and report both results to see if you see the kind of non-deterministic results I see with the programs I am using.

You don't need to use long time controls either, I have found very little difference between very fast and very slow with respect to the non-determinism. You might score differently at 1+1 than at 60+60, but for me both produce the same randomness level.
I'll test at 2/6, because that is what I always use. For my particular situation, it seems to be a good balance between:
a) still being able to watch the games occasionally
b) for the engines at Eden's level, giving them much less than about 10 seconds per move will decrease their strength dramatically; they will then reach ply 4 or so instead of ply 6. I consider ply 6 a critical depth. The level of play is bad enough at this level, I will try not to decrease it even further. I also cannot decide with confidence whether the results for this level are significant as a predictor for long time controls. For engines that reach a depth of 12 in the blink of an eye, I would consider the situation to be different.
c) Ideally I would like to test at longer time controls, but I do have to finish them eventually.

So I'll just run that test at 2/6, and stop whining about you having more resources :-)
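[Editorial aside: under the assumption that game results are independent, the run-to-run spread one should expect in an 80-game match can be estimated from the per-game variance. A minimal sketch; the win/draw probabilities and the 1/0.5/0 scoring are illustrative assumptions, not numbers from this thread:]

```python
import math

def expected_match_sd(games=80, p_win=0.4, p_draw=0.2):
    """Standard deviation of the total score of a match, assuming each
    game is an independent draw from the same result distribution.
    Scoring: win = 1, draw = 0.5, loss = 0. Probabilities are assumed."""
    p_loss = 1.0 - p_win - p_draw
    mean = p_win * 1.0 + p_draw * 0.5
    var_game = (p_win * (1.0 - mean) ** 2
                + p_draw * (0.5 - mean) ** 2
                + p_loss * (0.0 - mean) ** 2)
    return math.sqrt(games * var_game)

# With these assumed probabilities, two runs of the same 80-game match
# can differ by several points purely by chance.
print(round(expected_match_sd(), 2))  # → 4.0
```

So a swing of roughly +/- 4 points between two identical 80-game runs would not by itself indicate anything wrong.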

hgm
Posts: 24666
Joined: Fri Mar 10, 2006 9:06 am
Location: Amsterdam
Full name: H G Muller
Contact:

Re: Variance reports for testing engine improvements

Post by hgm » Tue Sep 18, 2007 11:10 am

A bit off topic: could you post the Eden - uMax games from your above test? It is kind of unusual for uMax 4.8 to lose to Eden (although Eden 0.0.13 might of course be much better than the version of Eden I am used to, which is about the same level as uMax 1.6), and I see that in your test it has already lost 2 games. So I wonder if these are the same games, and whether the losses are somehow related to the time-control trouble you notified me of earlier.

nczempin

Re: Variance reports for testing engine improvements

Post by nczempin » Tue Sep 18, 2007 2:25 pm

hgm wrote:A bit off topic: could you post the Eden - uMax games from your above test? It is kind of unusual for uMax 4.8 to lose to Eden (although Eden 0.0.13 might of course be much better than the version of Eden I am used to, which is about the same level as uMax 1.6), and I see that in your test it has already lost 2 games. So I wonder if these are the same games, and whether the losses are somehow related to the time-control trouble you notified me of earlier.
Well, I can send them to you in private unless someone else really wants to see them. I was also surprised at those results, and indeed when running a tournament, umax 4_8 usually finishes way above any Eden version so far. But I haven't looked at the actual games yet.

hgm
Posts: 24666
Joined: Fri Mar 10, 2006 9:06 am
Location: Amsterdam
Full name: H G Muller
Contact:

Re: Variance reports for testing engine improvements

Post by hgm » Tue Sep 18, 2007 3:06 pm

OK, send them as a PM. I guess they are only of relevance for this discussion if they are actually the same.

bob
Posts: 20916
Joined: Mon Feb 27, 2006 6:30 pm
Location: Birmingham, AL

Re: Variance reports for testing engine improvements

Post by bob » Tue Sep 18, 2007 3:20 pm

nczempin wrote:
bob wrote:I am not sure what test you are going to run. The test I suggested you run was this:

Your program vs one opponent, using the 40 Silver positions, playing black and white in each, for a total of 80 games. Then run the _same_ test again and report both results to see if you see the kind of non-deterministic results I see with the programs I am using.

You don't need to use long time controls either, I have found very little difference between very fast and very slow with respect to the non-determinism. You might score differently at 1+1 than at 60+60, but for me both produce the same randomness level.
I'll test at 2/6, because that is what I always use. For my particular situation, it seems to be a good balance between:
a) still being able to watch the games occasionally
b) for the engines at Eden's level, giving them much less than about 10 seconds per move will decrease their strength dramatically; they will then reach ply 4 or so instead of ply 6. I consider ply 6 a critical depth. The level of play is bad enough at this level, I will try not to decrease it even further. I also cannot decide with confidence whether the results for this level are significant as a predictor for long time controls. For engines that reach a depth of 12 in the blink of an eye, I would consider the situation to be different.
c) Ideally I would like to test at longer time controls, but I do have to finish them eventually.

So I'll just run that test at 2/6, and stop whining about you having more resources :-)
Remember the main point: "is a change better?" Short time controls are fine for that, for the most part. If the shorter time control hurts you more than your opponent, so what? You should still see an improved score if you make a change that is good.

nczempin

Re: Variance reports for testing engine improvements

Post by nczempin » Tue Sep 18, 2007 3:24 pm

hgm wrote:OK, send them as a PM. I guess they are only of relevance for this discussion if they are actually the same.
Three times the exact same game, three times a time forfeit for umax in winning position. I'll post it later.

Uri Blass
Posts: 8774
Joined: Wed Mar 08, 2006 11:37 pm
Location: Tel-Aviv Israel

Re: Variance reports for testing engine improvements

Post by Uri Blass » Tue Sep 18, 2007 3:39 pm

nczempin wrote:
hgm wrote:OK, send them as a PM. I guess they are only of relevance for this discussion if they are actually the same.
Three times the exact same game, three times a time forfeit for umax in winning position. I'll post it later.
Interesting that you get the same game every time, when Bob seems not only not to get repeated games but even to get a big variance.

I suspect that something is wrong in his testing.

Here are some statistical thoughts of mine about it:

I think it may be interesting to test the hypothesis H0 that the results of the games are independent against the alternative H1 that they are not.

Based on your results, I simply suspect that the results may not be independent, and I can think of two possible factors that could make them dependent (learning, or one program being slowed down by a significant factor).

You can do it with the following definitions.

You have 80 games in the Silver suite test.

You play 80n games, where n is the number of times that you repeat the Silver suite against the same opponent.

Denote by G(i,j) the result of game i in Silver match number j, where 1 <= i <= 80 and 1 <= j <= n (game i means the exact colour of Crafty and the exact position).

Denote the result of match j by N(j), where
N(j) = G(1,j) + G(2,j) + ... + G(80,j)

Look at the variances of the following numbers (S^2 denotes the sample variance):

X1 = S^2(G(1,1), G(1,2), ..., G(1,n))
X2 = S^2(G(2,1), G(2,2), ..., G(2,n))
...
X80 = S^2(G(80,1), G(80,2), ..., G(80,n))

Denote X = S^2(N(1), N(2), ..., N(n)).

The question is whether
E(X) = E(X1) + E(X2) + ... + E(X80).

I think it may be interesting to test:

H0: E(X) = E(X1) + E(X2) + ... + E(X80)
H1: E(X) > E(X1) + E(X2) + ... + E(X80)

We can assume that X, X1, ..., X80 have a normal distribution based on the central limit theorem, and I will think later about how to continue from there if I get no reply suggesting what to do.
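[Editorial aside: this check can be simulated directly. Under H0 (independent game results) the variance of the match totals should agree with the sum of the per-position variances; dependence between games (learning, machine load) would inflate the left-hand side. A rough sketch, where the result probabilities and the number of repetitions are made-up illustrative values:]

```python
import random
import statistics

random.seed(42)

def play_game():
    """One game result: win = 1, draw = 0.5, loss = 0 (assumed probabilities)."""
    r = random.random()
    return 1.0 if r < 0.4 else (0.5 if r < 0.6 else 0.0)

n_games, n_matches = 80, 500  # 80 Silver positions, repeated 500 times

# G[i][j] = result of position i in match repetition j, all independent (H0).
G = [[play_game() for _ in range(n_matches)] for _ in range(n_games)]

# Match totals N(j) and their sample variance X.
N = [sum(G[i][j] for i in range(n_games)) for j in range(n_matches)]
X = statistics.variance(N)

# Sum of the per-position variances X1 + ... + X80.
sum_xi = sum(statistics.variance(G[i]) for i in range(n_games))

# Under independence the two agree up to sampling noise; correlated
# results would make X noticeably larger than sum_xi.
print(round(X, 2), round(sum_xi, 2))
```

In a real test one would replace `play_game` with the observed results and compare the two quantities with an appropriate significance test.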

Uri

nczempin

Re: Variance reports for testing engine improvements

Post by nczempin » Tue Sep 18, 2007 3:47 pm

bob wrote:
nczempin wrote:
bob wrote:I am not sure what test you are going to run. The test I suggested you run was this:

Your program vs one opponent, using the 40 Silver positions, playing black and white in each, for a total of 80 games. Then run the _same_ test again and report both results to see if you see the kind of non-deterministic results I see with the programs I am using.

You don't need to use long time controls either, I have found very little difference between very fast and very slow with respect to the non-determinism. You might score differently at 1+1 than at 60+60, but for me both produce the same randomness level.
I'll test at 2/6, because that is what I always use. For my particular situation, it seems to be a good balance between:
a) still being able to watch the games occasionally
b) for the engines at Eden's level, giving them much less than about 10 seconds per move will decrease their strength dramatically; they will then reach ply 4 or so instead of ply 6. I consider ply 6 a critical depth. The level of play is bad enough at this level, I will try not to decrease it even further. I also cannot decide with confidence whether the results for this level are significant as a predictor for long time controls. For engines that reach a depth of 12 in the blink of an eye, I would consider the situation to be different.
c) Ideally I would like to test at longer time controls, but I do have to finish them eventually.

So I'll just run that test at 2/6, and stop whining about you having more resources :-)
Remember the main point: "is a change better?" Short time controls are fine for that, for the most part. If the shorter time control hurts you more than your opponent, so what? You should still see an improved score if you make a change that is good.
Well, my theory, based on observation back when I decided to use this time format, is that the changes I am working on now will be more significant at the level I am testing at. This theory has not been thoroughly tested, but in addition to the other properties I described, I find it a workable compromise.

Incidentally, for me, the main point is not "is a change better", but only "is the new version better". The engine is so immature and its development still so dynamic that trying to conclude anything about individual changes is not very useful IMHO.

Once I have built up my knowledge of chess engines, started from scratch with a version written in C, and have an engine rated above 2300, I may treat this issue more the way you do.

nczempin

Re: Variance reports for testing engine improvements

Post by nczempin » Tue Sep 18, 2007 4:10 pm

Uri Blass wrote:
nczempin wrote:
hgm wrote:OK, send them as a PM. I guess they are only of relevance for this discussion if they are actually the same.
Three times the exact same game, three times a time forfeit for umax in winning position. I'll post it later.
Interesting that you get the same game when Bob seems not only not to get it but also to get a big variance.

I suspect that something is wrong in his testing
No, I think we have discussed sufficiently why this can happen at the lower level.

I have no data to show that anything is wrong in Bob's testing, and I don't believe you do either. I do believe, however, that he is overextending his conclusions. But we are in active discussion about that, and I have yet to run an important test.

It could very well be a complete coincidence, an artifact of this particular engine combination, that we keep getting the same game. The data I have posted certainly shows counterexamples. Without having analysed any of it, I would say that the engines showing big variance are those with large opening books of their own that are still fairly close to Eden in strength.

nczempin

Re: Variance reports for testing engine improvements

Post by nczempin » Tue Sep 18, 2007 4:15 pm

bob wrote:Here they are:
...
Thank you.

I am proceeding to use this in my test.

Since no one has come up with any suggestions for suitable opponents, I will choose them offline, looking for evidence of high variance. That is not necessarily a good predictor of high variance in the test I will run, but if all engines are equivalent it doesn't hurt to choose those, and I can make the weak claim that the variance could be higher for the ones chosen. This means that uMax 4_8 certainly won't be in the test :-)

Although perhaps it would be interesting to see how an engine that shows such low variance against my engine will behave in that other test.
