But what about your opponents?

nczempin wrote:
My program does not have any choices within the opening book. It'll always play the same line.

bob wrote:
IMHO that is the wrong way to test. In any scientific experiment, the goal is to just change one thing at a time. If you use an opening book, now the program has choices and those choices introduce a second degree of freedom into the experiment. If the program has learning, there's a third degree of freedom.

nczempin wrote:
Okay, here's a first shot at getting to a more formalized description; we can always use a more mathematical language once we have some agreement on the content:
I have an engine E and another engine E'. I want to determine whether E' is (statistically) significantly stronger than E.
What does "stronger" mean in this context (some assumptions that I am making for my personal situation, YMMV)?
Stronger means that E' has a better result than E, one which is unlikely to be due to random factors alone (at some given confidence factor, I think 95 % is a good starting point).
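As a rough illustration of that criterion, here is a minimal sketch in Python, assuming a simple normal approximation to the match score; the function name, the z = 1.96 cutoff for roughly 95% confidence, and the example numbers are illustrative, not anything agreed in this thread.

import math

def significantly_stronger(score_new, score_old, games, z=1.96):
    """Rough significance check at about the 95% level (z = 1.96).

    score_new, score_old: points (win = 1, draw = 0.5) scored by E' and E
    against the same opposition over the same number of games.
    """
    p_new = score_new / games
    p_old = score_old / games
    # Standard error of the difference of the two score fractions,
    # treating each game as a Bernoulli trial (draws make this a
    # slightly conservative overestimate of the variance).
    se = math.sqrt(p_new * (1 - p_new) / games + p_old * (1 - p_old) / games)
    return (p_new - p_old) > z * se

# Illustration: E' scores 560/1000, E scores 520/1000 -> not yet significant.
print(significantly_stronger(560, 520, 1000))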
For me, the goal is to test the engines "out of the box", which means their own book (or none, if they don't have one) is included. I'm looking to evaluate the whole package, not just the "thinking" part.
But by using books, you introduce another level of randomness into the games as well. What are you trying to test? Book improvements? Search improvements? Evaluation improvements? Learning improvements? The idea is to test exactly what you are working on, eliminating all the other noise, to reduce the number of test games needed to evaluate the change. (Note I said change, not _changes_.)

My goal is not to find out which of the (multiple or not) changes caused the improvement; my goal is only to find out whether the new version with that "black box of changes" is better than the old version.
Yes, at times it is interesting to know whether A or A' is better than C, D, E and F. But if you are trying to compare A to A', making multiple changes to A' makes it impossible to attribute a different result to the actual cause of that result.
So you would be happy with three somewhat decent improvements in your ideas, and one horrible one, just because the three decent ones make the overall thing play a bit better? That's a dangerous way of developing and testing, because that very thing is not that uncommon. How much better would it play without the three bad ideas to go with the one good one? You will never know. And crap creeps into your code without your knowing...
Because my opening book will not lead to those positions, I don't care if it performs better or worse in those positions, because it will never reach them. So having to test with them will be a waste of time.

You miss the key point. So your program does badly in some Nunn positions. So what? You don't care about how well you play against B, C and D. You only care if the new version plays _better_ against those programs. That tells you your changes are for the better.

Actually I think using random or Nunn positions could skew the results, because they would show you how your engine plays in those positions. There may very well be positions that your engine doesn't "like", and that a sensible opening book would avoid. Of course, someone whose goal is the performance in those positions (or in "all likely positions", of which such a selected group is then assumed to be a good proxy) would need to use a lot more games. So in the discussion I would like this factor to be separated.
If Eden A achieves a certain result with a certain book, I would like to know if Eden B would achieve a better result with the same book.
That is _exactly_ what I am trying to verify... I am not confusing the two; I am not interested in knowing if A is better than C. I am only interested in knowing whether A' performs better than A against the conglomerate of B, C and D, without knowing the individual results.
Don't confuse A being better than C with trying to decide whether A or A' is better. The two questions are completely different. And for me, I am asking the former, "is my change good or bad". Not "does my change make me better than B or not?"
We can simply agree that our goals are different. I guess mine is not the search for scientific knowledge but just to improve my engine in a regular way.
Incidentally, if you really are looking for the answer to the question "is my change good or bad", IMHO your approach is still inadequate (unless I'm missing something here):
All you could safely say after your significant number of games would be "making change x to Crafty version a.0 makes version a.0 significantly stronger."
Hold on. Above you are talking about making _multiple_ changes and then testing. I don't do that for the very reason you give above. I am making one change at a time and then determining if the change is worthwhile or not.
It could very well be that in a later version of Crafty, say a.24, that change turns out to be insignificant; that Crafty a.25 would actually be stronger than a.24 if you removed the change you added going from a.0 to a.1 (or at least that by then it would make no difference).
No doubt about that. And again, the only reasonable way to recognize that you have walked into that is to test at the faster speed to see what happens. If your scores drop significantly, then you have to play detective to figure out why, and then test the resulting changes to see if the effect is gone.
For example, when the depth an engine can reach is fairly low, certain pieces of knowledge would make it stronger, so they could be made part of the eval. When the depth gets much larger, those may become irrelevant or even counter-productive (because now they are taken into account twice).
I totally disagree there, but that is up to you if you want to follow that path. Just be aware that eventually you will pay the piper however..
And I don't think any number of games will prevent this; I think the goal of "keeping everything else constant" is basically impossible at a certain level of complexity.
That is why I don't test A against A'. I test A and A' against a gauntlet of opponents, over a variety of opening positions, playing enough games to smooth out the randomness that is always present in timed games.
I have already discussed why I consider opponents of far higher and of far lower strength to be inadequate indicators.
Just playing E against E' is also not sufficient. Even just playing both against some other engine D is not sufficient, because, as has been mentioned elsewhere, engine strength is not a transitive property (i.e. it may well be that A would beat B consistently, B would beat C, and C would beat A; so to find out which is the strongest within a theoretical universe where only these three engines exist, a complete tournament would have to be played).
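A toy numerical illustration of the non-transitivity point (the pairwise expected scores below are invented purely so that A beats B, B beats C, and C beats A): every head-to-head match has a clear favourite, yet the complete round robin leaves all three level.

# pairwise expected scores per game, from the row player's point of view;
# the numbers are invented just to produce the A > B > C > A cycle
expected = {
    ("A", "B"): 0.60, ("B", "C"): 0.60, ("C", "A"): 0.60,
    ("B", "A"): 0.40, ("C", "B"): 0.40, ("A", "C"): 0.40,
}

players = ["A", "B", "C"]
# only the complete round robin shows that all three finish level,
# even though every individual pairing has a clear favourite
totals = {p: sum(expected[(p, q)] for q in players if q != p) for p in players}
print(totals)  # {'A': 1.0, 'B': 1.0, 'C': 1.0}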
The answer is simple. Play enough games. How many? Play N game matches and see how much they vary. If the results are too random, play 2N game matches and check again. Keep increasing the number of games until the result settles down with very little randomness, and now you have reached "the truth".
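A sketch of that doubling procedure, assuming a hypothetical play_match(k) helper that actually runs a k-game match and returns the points scored; the starting N, the tolerance and the cap are arbitrary choices, not anything prescribed above.

def settled_score(play_match, n=100, tolerance=0.01, max_games=100_000):
    # play_match(k) is a stand-in for running a k-game match and
    # returning the points scored by the engine under test
    prev = play_match(n) / n              # score fraction over the first N games
    while n * 2 <= max_games:
        n *= 2
        cur = play_match(n) / n
        if abs(cur - prev) < tolerance:   # result has "settled down"
            return cur, n
        prev = cur
    return prev, n                        # still too noisy; gave up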
So it would make sense to play a sufficient number of games against a sufficient number of opponents of approximately the same strength.
Thus we have a set of opponents O, and E should play a number of games against each opponent, and E' should play the same number of games against the same opponents. For practical reasons the "other version" could be included in each set of opponents. This presumably does skew the results, but arguably it is an improvement over just playing E against E'.
So what we have now is a RR tournament which includes E, E' and the set O.
What I'd like to know is, assuming E' achieves more points than E, what result should make me confident that E' is actually stronger, and this is not merely a random result?
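One possible way to answer that question for a finished tournament is a resampling check (a suggestion, not something proposed in the thread): take the per-game results of E and E' against the opponent set O, resample them with replacement many times, and see how often the resampled E' total still beats the resampled E total. A figure around 95% or higher would line up with the confidence level mentioned earlier.

import random

def confidence_e_prime_stronger(results_e, results_e_prime, trials=10_000):
    """results_*: lists of per-game scores (1.0, 0.5, 0.0) against the set O."""
    better = 0
    for _ in range(trials):
        total_e = sum(random.choices(results_e, k=len(results_e)))
        total_ep = sum(random.choices(results_e_prime, k=len(results_e_prime)))
        if total_ep > total_e:
            better += 1
    return better / trials  # fraction of resamples in which E' still comes out ahead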
Why would you believe more games will result in a smaller difference between E and E'? Does that mean that after an infinite number of games the two are exactly equal? I don't follow the inverse reasoning at all.
Without loss of generality, let's set the size of O to 8 (which would make the tourney have 10 contestants).
My first conjecture is that there is an inverse relationship between the number of games per match and the points differential between E and E'.
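A quick Monte Carlo can make the conjecture concrete and separate the two readings of "points differential" that the exchange above turns on (the 50% win probability, the absence of draws and the match lengths are arbitrary assumptions): for two engines of identical strength, the gap in score percentage shrinks as the match gets longer, while the gap in raw points does not.

import random

def average_gap(games_per_match, matches=2000):
    gap_points = gap_fraction = 0.0
    for _ in range(matches):
        # two equally strong engines, win/loss only, no draws
        score_e = sum(random.random() < 0.5 for _ in range(games_per_match))
        score_ep = sum(random.random() < 0.5 for _ in range(games_per_match))
        gap_points += abs(score_ep - score_e)
        gap_fraction += abs(score_ep - score_e) / games_per_match
    return gap_points / matches, gap_fraction / matches

for n in (50, 200, 800):
    print(n, average_gap(n))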
Opinions?
(Feel free to improve on my pseudo-mathematical terminology, and of course on other things).