An objective test process for the rest of us?

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

nczempin wrote:Okay, here's a first shot at getting to a more formalized description; we can always switch to more mathematical language once we have some agreement on the content:

I have an engine E and another engine E'. I want to determine whether E' is (statistically) significantly stronger than E.

What does "stronger" mean in this context (some assumptions that I am making for my personal situation, YMMV)?

Stronger means that E' has a better result than E, one which is unlikely to be due to random factors alone (at some given confidence level; I think 95% is a good starting point).

For me, the goal is to test the engines "out of the box", which means their own book (or none, if they don't have one) is included. I'm looking to evaluate the whole package, not just the "thinking" part.
IMHO that is the wrong way to test. In any scientific experiment, the goal is to just change one thing at a time. If you use an opening book, now the program has choices and those choices introduce a second degree of freedom into the experiment. If the program has learning, there's a third degree of freedom.

Yes, at times it is interesting to know whether A or A' is better than C,D,E and F. But if you are trying to compare A to A', making multiple changes to A' makes it impossible to attribute a different result to the actual cause of that result.

Actually I think using random or Nunn positions could skew the results, because they would show you how your engine plays in those positions. There may very well be positions that your engine doesn't "like", and that a sensible opening book would avoid. Of course, someone whose goal is the performance under those positions (or under "all likely positions", of which such a selected group is then assumed to be a good proxy) would need to use a lot more games. So in the discussion I would like this factor to be separated.
If Eden A achieves a certain result with a certain book, I would like to know if Eden B would achieve a better result with the same book.
You miss the key point. So your program does badly in some Nunn positions. So what? You don't care about how well you play against B,C and D. You only care if the new version plays _better_ against those programs. That tells you your changes are for the better.

Don't confuse A being better than C with trying to decide whether A or A' is better. The two questions are completely different. And for me, I am asking the former, "is my change good or bad". Not "does my change make me better than B or not?"



I have already discussed why I consider opponents of far higher and of far lower strength to be inadequate indicators.
Just playing E against E' is also not sufficient. Even just playing both against some other engine D is not sufficient, because, as has been mentioned elsewhere, engine strength is not a transitive property (i.e. it may well be that A beats B consistently, B beats C, and C beats A; so to find out which is the strongest within a theoretical universe where only these three engines exist, a complete tournament would have to be played).

So it would make sense to play a sufficient number of games against a sufficient number of opponents of approximately the same strength.

Thus we have a set of opponents O, and E should play a number of games against each opponent, and E' should play the same number of games against the same opponents. For practical reasons the "other version" could be included in each set of opponents. This presumably does skew the results, but arguably it is an improvement over just playing E against E'.

So what we have now is a round-robin (RR) tournament which includes E, E' and the set O.

What I'd like to know is: assuming E' achieves more points than E, what result should make me confident that E' is actually stronger and that this is not merely a random result?

Without loss of generality, let's set the size of O to 8 (which would make the tourney have 10 contestants).

My first conjecture is that there is an inverse relationship between the number of games per match and the points differential (relative to the number of games) that E' needs over E for significance.

Opinions?

(Feel free to improve on my pseudo-mathematical terminology, and of course on other things).
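(A minimal Python sketch of the significance check being asked for here, under the assumption of independent game results and a per-game standard deviation of about 0.4 points, the rule of thumb hgm uses later in the thread; the scores in the example are made up.)

[code]
import math

# Did E' do significantly better than E, given that both played the same
# n_games against the same set of opponents O?
# Assumptions: independent results, per-game standard deviation ~0.4 points.

def z_score(points_old, points_new, n_games, sigma_per_game=0.4):
    """How many standard errors separate the two totals?"""
    sd_diff = sigma_per_game * math.sqrt(2 * n_games)  # sd of the difference of two totals
    return (points_new - points_old) / sd_diff

def required_margin(n_games, sigma_per_game=0.4, z_crit=1.645):
    """Points E' must score above E for ~95% one-sided confidence.
    The absolute margin grows like sqrt(n_games), so as a fraction of the games
    played it shrinks like 1/sqrt(n_games) - the inverse relationship conjectured above."""
    return z_crit * sigma_per_game * math.sqrt(2 * n_games)

# Made-up example: 8 opponents, 20 games against each, so 160 games per version.
print(z_score(points_old=70.0, points_new=85.0, n_games=160))  # ~2.1, significant at 95%
print(required_margin(160))                                    # ~11.8 points
[/code]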
nczempin

Re: An objective test process for the rest of us?

Post by nczempin »

bob wrote:
nczempin wrote:Okay, here's a first shot at getting to a more formalized description; we can always switch to more mathematical language once we have some agreement on the content:

I have an engine E and another engine E'. I want to determine whether E' is (statistically) significantly stronger than E.

What does "stronger" mean in this context (some assumptions that I am making for my personal situation, YMMV)?

Stronger means that E' has a better result than E, one which is unlikely to be due to random factors alone (at some given confidence level; I think 95% is a good starting point).

For me, the goal is to test the engines "out of the box", which means their own book (or none, if they don't have one) is included. I'm looking to evaluate the whole package, not just the "thinking" part.
IMHO that is the wrong way to test. In any scientific experiment, the goal is to just change one thing at a time. If you use an opening book, now the program has choices and those choices introduce a second degree of freedom into the experiment. If the program has learning, there's a third degree of freedom.
My program does not have any choices within the opening book. It'll always play the same line.

Yes, at times it is interesting to know whether A or A' is better than C,D,E and F. But if you are trying to compare A to A', making multiple changes to A' makes it impossible to attribute a different result to the actual cause of that result.
My goal is not to find out which of the (multiple or not) changes caused the improvement, my goal is only to find out whether the new version with that "black box of changes" is better than the old version.
Actually I think using random or Nunn positions could skew the results, because they would show you how your engine plays in those positions. There may very well be positions that your engine doesn't "like", and that a sensible opening book would avoid. Of course, someone whose goal is the performance under those positions (or under "all likely positions", of which such a selected group is then assumed to be a good proxy) would need to use a lot more games. So in the discussion I would like this factor to be separated.
If Eden A achieves a certain result with a certain book, I would like to know if Eden B would achieve a better result with the same book.
You miss the key point. So your program does badly in some Nunn positions. So what? You don't care about how well you play against B,C and D. You only care if the new version plays _better_ against those programs. That tells you your changes are for the better.
Because my opening book will not lead to those positions, I don't care if it performs better or worse in those positions, because it will never reach them. So having to test with them will be a waste of time.

Don't confuse A being better than C with trying to decide whether A or A' is better. The two questions are completely different. And for me, I am asking the former, "is my change good or bad". Not "does my change make me better than B or not?"
I am not confusing the two, I am not interested in knowing if A is better than C. I am only interested in knowing whether A' performs better than A against the conglomerate of B, C and D, without knowing the individual results.

We can simply agree that our goals are different. I guess mine is not the search for scientific knowledge but just to improve my engine in a regular way.

Incidentally, if you really are looking for the answer to the question "is my change good or bad", IMHO your approach is still inadequate (unless I'm missing something here):
All you could safely say after your significant number of games would be "Making change x to Crafty version a.0 makes that version a.0 significantly stronger"

It could very well be that in a later version of Crafty, say a.24, that change turns out to be insignificant, and that Crafty a.25 would actually be stronger than a.24 if you removed the change you added from a.0 to a.1 (or at least that by then it would make no difference).

For example, when the depth an engine can reach is fairly low, certain pieces of knowledge would make it stronger, so they could be made part of the eval. When the depth gets much larger, those may become irrelevant or even counter-productive (because now they are taken into account twice).

And I don't think any number of games will prevent this; I think the goal of "keeping everything else constant" is basically impossible at a certain level of complexity.

I have already discussed why I consider opponents of far higher and of far lower strength to be inadequate indicators.
Just playing E against E' is also not sufficient. Even just playing both against some other engine D is not sufficient, because, as has been mentioned elsewhere, engine strength is not a transitive property (i.e. it may well be that A beats B consistently, B beats C, and C beats A; so to find out which is the strongest within a theoretical universe where only these three engines exist, a complete tournament would have to be played).

So it would make sense to play a sufficient number of games against a sufficient number of opponents of approximately the same strength.

Thus we have a set of opponents O, and E should play a number of games against each opponent, and E' should play the same number of games against the same opponents. For practical reasons the "other version" could be included in each set of opponents. This presumably does skew the results, but arguably it is an improvement over just playing E against E'.

So what we have now is a round-robin (RR) tournament which includes E, E' and the set O.

What I'd like to know is: assuming E' achieves more points than E, what result should make me confident that E' is actually stronger and that this is not merely a random result?

Without loss of generality, let's set the size of O to 8 (which would make the tourney have 10 contestants).

My first conjecture is that there is an inverse relationship between the number of games per match and the points differential (relative to the number of games) that E' needs over E for significance.

Opinions?

(Feel free to improve on my pseudo-mathematical terminology, and of course on other things).
hgm
Posts: 27796
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: An objective test process for the rest of us?

Post by hgm »

bob wrote:Uri's approach doesn't eliminate anything. If you make one line change, the nodes you search are different, and that moves you into a different sample where the results can change. So testing for a fixed number of nodes is no better than testing for a fixed amount of time, unless all you are doing is speeding things up. And in that case a fixed-node search is pointless as well.
With "Uri's approach" I don't mean using fixed number of nodes per se, but the attempt to eliminate as much noise as you can. Personnally, I would use a predetermined number of complete iterations, (one version just copying the number of iterations of the other, as described above), so that I could compare two search trees with that are guaranteed to result in the same moves if the eval terms I changed / added do not change the scores such that a different move would be chosen.
Uri Blass
Posts: 10282
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: An objective test process for the rest of us?

Post by Uri Blass »

hgm wrote:
bob wrote:Uri's approach doesn't eliminate anything. If you make one line change, the nodes you search are different, and that moves you into a different sample where the results can change. So testing for a fixed number of nodes is no better than testing for a fixed amount of time, unless all you are doing is speeding things up. And in that case a fixed-node search is pointless as well.
With "Uri's approach" I don't mean using fixed number of nodes per se, but the attempt to eliminate as much noise as you can. Personnally, I would use a predetermined number of complete iterations, (one version just copying the number of iterations of the other, as described above), so that I could compare two search trees with that are guaranteed to result in the same moves if the eval terms I changed / added do not change the scores such that a different move would be chosen.
In my case a fixed-depth match may be misleading, because I use pruning based on evaluation factors, and if I multiply my evaluation by 2 I may get a different tree.

Uri
nczempin

Re: An objective test process for the rest of us?

Post by nczempin »

hgm wrote:
bob wrote:Uri's approach doesn't eliminate anything. If you make one line change, the nodes you search are different, and that moves you into a different sample where the results can change. So testing for a fixed number of nodes is no better than testing for a fixed amount of time, unless all you are doing is speeding things up. And in that case a fixed-node search is pointless as well.
With "Uri's approach" I don't mean using fixed number of nodes per se, but the attempt to eliminate as much noise as you can. Personnally, I would use a predetermined number of complete iterations, (one version just copying the number of iterations of the other, as described above), so that I could compare two search trees with that are guaranteed to result in the same moves if the eval terms I changed / added do not change the scores such that a different move would be chosen.
In the end, I don't see how testing under such conditions will be a better approach than testing under real-life conditions, if improvement under real-life conditions is what is wanted. (And for my engine, this includes using its own book).

To use such stand-ins, you would first have to argue (and/or show) that they are a good predictor of real-life performance. I would consider that problem alone to be extremely hard for any non-trivial difference.
nczempin

Re: An objective test process for the rest of us?

Post by nczempin »

nczempin wrote:
hgm wrote:
bob wrote:Uri's approach doesn't eliminate anything. If you make one line change, the nodes you search are different, and that moves you into a different sample where the results can change. So testing for a fixed number of nodes is no better than testing for a fixed amount of time, unless all you are doing is speeding things up. And in that case a fixed-node search is pointless as well.
With "Uri's approach" I don't mean using fixed number of nodes per se, but the attempt to eliminate as much noise as you can. Personnally, I would use a predetermined number of complete iterations, (one version just copying the number of iterations of the other, as described above), so that I could compare two search trees with that are guaranteed to result in the same moves if the eval terms I changed / added do not change the scores such that a different move would be chosen.
In the end, I don't see how testing under such conditions will be a better approach than testing under real-life conditions, if improvement under real-life conditions is what is wanted. (And for my engine, this includes using its own book).

To use such stand-ins, you would first have to argue (and/or show) that they are a good predictor of real-life performance. I would consider that problem alone to be extremely hard for any non-trivial difference.
This reminds me of the way significance tests are done for e.g. medical treatments. Surely the human body is a far more complex structure than a chess engine, yet clinical trials do not require (AFAIK) 20,000 tests before a treatment is considered effective. (Plus you don't always have to know why things work; apparently we still have no idea why anaesthesia works.)
hgm
Posts: 27796
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: An objective test process for the rest of us?

Post by hgm »

nczempin wrote:In the end, I don't see how testing under such conditions will be a better approach than testing under real-life conditions, if improvement under real-life conditions is what is wanted.
Well, if you would be able to recognize the same level of improvement with 1,000 games instead of a million, I think it would be quite obvious that this is a better approach.

What you say borders on paranoia. Normally, the burden of proof lies the other way around: if you make minor changes, you can assume that they will not totally invalidate the value of all parameters you tuned before. If I have tuned my engine on a 1.3 GHz Pentium M, I will most definitely not assume that I will have to completely retune it before I can run it on a 1.6 GHz Pentium M. If I play with 64MB hash rather than 128MB, I will not assume that I will need other piece values. I would not even believe that exact-depth hashing would have a measurable impact on playing strength until someone showed really significant evidence for it. And even if it did cost 15 Elo, it would be a true miracle if, say, extending recaptures had a positive effect with exact-depth hashing and a negative effect without it.

As for medical testing: those trials are almost always done under conditions as controlled as possible. This starts with the selection of the participants. If you want to measure whether a certain medicine reduces the risk of cardiovascular problems, you would take care to select two groups (one using the medicine, the other as a control) of 10,000 people that have the same age distribution, equal numbers of smokers, equal geographic distribution, equal distribution over ethnic groups, etc. If you just picked them from the telephone directory, you would be more likely to measure your (bad) luck in picking different numbers of members of obviously susceptible groups (e.g. smokers) than to see an effect of the medicine.
nczempin

Re: An objective test process for the rest of us?

Post by nczempin »

hgm wrote:
nczempin wrote:In the end, I don't see how testing under such conditions will be a better approach than testing under real-life conditions, if improvement under real-life conditions is what is wanted.
Well, if you would be able to recognize the same level of improvement with 1,000 games instead of a million, I think it would be quite obvious that this is a better approach.

What you say borders on paranoia. Normally, the burden of proof lies the other way around: if you make minor changes, you can assume that they will not totally invalidate the value of all parameters you tuned before. If I have tuned my engine on a 1.3 GHz Pentium M, I will most definitely not assume that I will have to completely retune it before I can run it on a 1.6 GHz Pentium M. If I play with 64MB hash rather than 128MB, I will not assume that I will need other piece values. I would not even believe that exact-depth hashing would have a measurable impact on playing strength until someone showed really significant evidence for it. And even if it did cost 15 Elo, it would be a true miracle if, say, extending recaptures had a positive effect with exact-depth hashing and a negative effect without it.
Well, what you have just done is explain a few factors where the assumption can reasonably be made.

As for medical testing: those trials are almost always done under conditions as controlled as possible. This starts with the selection of the participants. If you want to measure whether a certain medicine reduces the risk of cardiovascular problems, you would take care to select two groups (one using the medicine, the other as a control) of 10,000 people that have the same age distribution, equal numbers of smokers, equal geographic distribution, equal distribution over ethnic groups, etc. If you just picked them from the telephone directory, you would be more likely to measure your (bad) luck in picking different numbers of members of obviously susceptible groups (e.g. smokers) than to see an effect of the medicine.
Okay, so why is this approach infeasible for engine testing? So far I've heard that the only reliable approach is to play a lot of games.

I'm not sure why no one has agreed with me (or perhaps you have and I missed it) that if, e.g., A used to get 5 points out of 20 against B etc. and now A' gets 15 points out of 20, that would be as significant as getting, say, 600 points out of 2000 vs. 500 out of 2000.

So instead of having to play those 4,000 games, you could simply raise the requirement, i.e. the margin by which the score would have to increase.

What I would like to find out in this thread is what the margin would have to be under the conditions I have chosen, rather than discuss whether the conditions are valid or not.
hgm
Posts: 27796
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: An objective test process for the rest of us?

Post by hgm »

nczempin wrote:
hgm wrote:
nczempin wrote:In the end, I don't see how testing under such conditions will be a better approach than testing under real-life conditions, if improvement under real-life conditions is what is wanted.
Well, if you would be able to recognize the same level of improvement with 1,000 games instead of a million, I think it would be quite obvious that this is a better approach.

What you say borders on paranoia. Normally, the burden of proof lies the other way around: if you make minor changes, you can assume that they will not totally invalidate the value of all parameters you tuned before. If I have tuned my engine on a 1.3 GHz Pentium M, I will most definitely not assume that I will have to completely retune it before I can run it on a 1.6 GHz Pentium M. If I play with 64MB hash rather than 128MB, I will not assume that I will need other piece values. I would not even believe that exact-depth hashing would have a measurable impact on playing strength until someone showed really significant evidence for it. And even if it did cost 15 Elo, it would be a true miracle if, say, extending recaptures had a positive effect with exact-depth hashing and a negative effect without it.
Well, what you have just done is explain a few factors where the assumption can reasonably be made.

As for medical testing: those trials are almost always done under conditions as controlled as possible. This starts with the selection of the participants. If you want to measure whether a certain medicine reduces the risk of cardiovascular problems, you would take care to select two groups (one using the medicine, the other as a control) of 10,000 people that have the same age distribution, equal numbers of smokers, equal geographic distribution, equal distribution over ethnic groups, etc. If you just picked them from the telephone directory, you would be more likely to measure your (bad) luck in picking different numbers of members of obviously susceptible groups (e.g. smokers) than to see an effect of the medicine.
Okay, so why is this approach infeasible for engine testing? So far I've heard that the only reliable approach is to play a lot of games.

I'm not sure why no one has agreed with me (or perhaps you have and I missed it) that if, e.g., A used to get 5 points out of 20 against B etc. and now A' gets 15 points out of 20, that would be as significant as getting, say, 600 points out of 2000 vs. 500 out of 2000.

So instead of having to play those 4,000 games, you could simply raise the requirement, i.e. the margin by which the score would have to increase.

What I would like to find out in this thread is what the margin would have to be under the conditions I have chosen, rather than discuss whether the conditions are valid or not.
Yes, a difference of 10 out of 20 is as significant as 100 out of 2000 (100 times as many games, sqrt(100)=10 times as large a difference).

You would need an improvement of ~350 Elo to have a good chance to pass that threshold.

You have to be careful, though, that you don't start fooling yourself by testing many "improvements" that don't qualify before you finally find one that does. Because if your baseline version reliably gets 5 out of 20, then an improvement of 175 Elo would give you on average 10 out of 20. But the standard error in 20 games is 0.4*sqrt(20) ≈ 1.8, so 5 points is just under 3 standard errors, which has roughly a 1% probability of occurring. So if you test 100 versions that are only halfway towards what you require, one of them will usually get lucky and pass the test for a 350 Elo improvement.

But I guess the point is that you would hardly ever be able to do a new release if you require 350-Elo improvements. That would mean Eden 0.018 would have to be stronger than Rybka. Even at Eden's level, improvements are not easily that big. And testing for 3 times smaller improvements requires 9 times as many games.
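(For anyone who wants to reproduce these numbers, here is a rough Python check using the same 0.4-points-per-game rule of thumb and the usual logistic score-to-Elo mapping; the exact tail probability depends on the draw rate, so it only matches the figures above to within the order of magnitude.)

[code]
import math

def elo_from_score(s):
    """Elo difference implied by an expected per-game score s (logistic model)."""
    return -400.0 * math.log10(1.0 / s - 1.0)

def score_from_elo(d):
    """Expected per-game score for an Elo advantage d."""
    return 1.0 / (1.0 + 10.0 ** (-d / 400.0))

n = 20
sigma = 0.4 * math.sqrt(n)            # standard error of a 20-game total, ~1.8 points
baseline, target = 5.0, 15.0          # 5/20 before, 15/20 demanded afterwards

# Elo swing implied by moving the expected score from 5/20 to 15/20:
print(elo_from_score(target / n) - elo_from_score(baseline / n))   # ~382

# A ~175 Elo improvement on a 5/20 baseline gives roughly an even score:
print(n * score_from_elo(elo_from_score(baseline / n) + 175.0))    # ~9.5 out of 20

# Chance that such a halfway-improved version (expecting ~10/20) still reaches
# 15/20 by luck, in the normal approximation; with fewer draws the per-game
# sigma, and hence this probability, would be somewhat larger:
z = (target - 10.0) / sigma
print(0.5 * math.erfc(z / math.sqrt(2)))                           # a fraction of a percent
[/code]

The exact figures shift a little with the assumptions, but the multiple-testing warning is unchanged: test enough mediocre changes against a noisy 20-game threshold and one of them will eventually pass by luck.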
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An objective test process for the rest of us?

Post by bob »

hgm wrote:
bob wrote:Uri's approach doesn't eliminate anything. If you make one line change, the nodes you search are different, and that moves you into a different sample where the results can change. So testing for a fixed number of nodes is no better than testing for a fixed amount of time, unless all you are doing is speeding things up. And in that case a fixed-node search is pointless as well.
With "Uri's approach" I don't mean using fixed number of nodes per se, but the attempt to eliminate as much noise as you can. Personnally, I would use a predetermined number of complete iterations, (one version just copying the number of iterations of the other, as described above), so that I could compare two search trees with that are guaranteed to result in the same moves if the eval terms I changed / added do not change the scores such that a different move would be chosen.
Using complete iterations is also bad. How do you balance the two programs so that their searches are equal? N plies for one can be equivalent to M plies for another, where M != N.

There is some randomness that can't be weeded out. But certainly using books is a bad start...