different kinds of testing

Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

different kinds of testing

Post by Don »

A lot of things have been said on the forum about how to test chess programs. So this is an attempt to clarify the confusion and of course it would be good to hear many other opinions.

One of the ways people test program changes is by using problem tests. I think there is close to a consensus that this is very difficult; the correlation between test set results and actual playing strength is pretty weak. There is always the hope that an excellent test set can be developed that correlates really well with program strength, but so far nobody has been able to produce such a set. Nevertheless, problem sets can be pretty useful tools if used correctly.

Another way to test programs is by actually playing games. This falls into 2 different categories, self-testing and variety testing. Self-testing means that you play 1 version of your program against another version of your program in order to ascertain the value of some specific change or group of changes.

Variety testing is based on the notion that there is something "incestuous" about testing the same program against itself and that the results will be invalid. Different people take this to different extremes, some believing it makes little or no difference and others believing the results of self-testing are completely invalid.

The other issue is how to set the levels. The 3 broad categories are time control games, fixed node testing and fixed depth testing. Each has unique advantages and disadvantages which will be discussed shortly.

But first we need to consider that every change to a program impacts it in 2 basic ways. A good engineer wants to understand this impact, whether it's good or bad. The change can impact the CPU performance, or it can affect the "raw" quality of the program. Every change will have these 2 effects in some ratio.

For instance, any change that makes the program faster has a positive impact on the playing strength. An example of this is just using a better compiler, or better compiler flags, that improve the CPU and/or memory performance.

The "raw" quality of the program might be represented by an evaluation improvement or an improved search algorithm. Of course an evaluation improvement or search modification can also impact the CPU performance of the program and usually will. To summarize, the actual ELO impact of a change is dependent on some combination of these 2 things.

Now we get to the 3 main testing procedures based on playing games.

Time control based testing has the advantage that "real" games are played at these levels, and it is the best tool for understanding the ELO impact of the change. Its primary disadvantage is that it tells you nothing about why the change is good or bad, only that it IS good or bad.

Fixed node testing is an attempt to standardize the testing conditions - games could be played on any hardware and directly compared, for instance. The main disadvantage of fixed node testing is that it is not very good at isolating the 2 kinds of program change impacts mentioned above. For example, adding evaluation to the program will slow the program down, but fixed node testing does not reveal this. This kind of testing will only tell you if some change had an impact on the NUMBER OF NODES you need to consider to reach a certain level of play. By itself that is not very useful to know.

Fixed depth testing is a way to isolate CPU performance from raw program performance. For example, you may have a good idea for an evaluation change that is expensive to compute. Time control games may reveal that the program is weaker, but a good engineer will want to know WHY. Fixed depth testing might reveal that the idea is strong, but the implementation is too slow. The primary weakness of fixed depth testing is that it does not measure the total impact of the change. If the change makes the program faster but weaker, you cannot tell if the speed makes up for the weakening or vice versa.

There is an additional weakness of fixed depth testing that the other 2 methods do not share. In real games some moves are searched more deeply than others. In the endgame it is common to search an extra 2 or 3 ply or more, and even in the middlegame the depth can vary significantly from one move to the next. Fixed depth testing therefore does not allocate resources in a realistic way, compared to time control or fixed node testing.

A final big issue in testing is how much resource to allocate to each game. If your program is being designed to compete in chess tournaments, you obviously want it to perform well at those time controls. That would indicate that you should test at levels that are closer to the levels you expect to compete in, or do well at.

Unfortunately, for most people that is impractical unless you have enormous computing resources to apply to the problem. If you have a program that is strong, it's unrealistic to expect anything more than minor improvements from each change. Accurately determining whether a change is an actual improvement will generally require THOUSANDS of games unless the change affects the strength by tens of ELO points. If you accept only hundreds of games, you will end up keeping many changes that have actually harmed your program. This is so important that a compromise must be struck: in order to get an adequate number of games on modest hardware you must either be incredibly patient, or test at much faster time controls (or depths, or nodes) than you are building for.
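As a back-of-envelope illustration of the THOUSANDS claim (my numbers, using a normal approximation that ignores draws and opening correlations):

Code:

import math

def elo_to_score(elo):
    # Expected score from an Elo difference (standard logistic model).
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

def games_needed(elo_diff, z=1.96, sigma_per_game=0.5):
    # Games for a two-sided 95% result; sigma_per_game = 0.5 is the
    # worst case (all decisive games - draws reduce it).
    delta = elo_to_score(elo_diff) - 0.5
    return math.ceil((z * sigma_per_game / delta) ** 2)

for d in (2, 5, 10, 30):
    print(d, games_needed(d))   # ~116000, ~18600, ~4600, ~520

A couple of Elo really does take tens of thousands of games to resolve, while a 30 Elo change shows up in a few hundred.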

So now let's try to answer the questions, "how fast should I test?" and "which testing methods should I use?"

This is where we get into the realm of opinion, fanaticism, and superstition. This is a question for each developer to determine for himself. I can only give you my opinion here, and I expect this will generate a lot of additional opinions - which I hope will stimulate everyone's imagination.

In my opinion, fixed node testing is the least useful of the 3 game testing methods. In order of usefulness, I personally put it like this:

1. time control games.
2. fixed depth
3. fixed nodes.

In my own software development process I usually start with fixed depth games and graduate to time control games. I use problem testing as a sanity check and I don't use fixed nodes at all, as I personally believe it is the least useful method. The fixed depth testing tries to give me an upper bound on what I can expect from the "new idea" or thing I am testing. It also tells me if the fundamental idea is sound. With fixed depth testing I have isolated the IDEA from the IMPLEMENTATION. This is a general principle that every good engineer is aware of and uses to good advantage. You will also find that it may give you an "early cutoff": you may be able to quickly determine that an idea is useless, so there is no need to proceed to the next stage of testing (real time control games).

Which levels should you use for testing? This one is quite tricky, but the short answer is to use the fastest levels that you can get away with. I could have answered, "use the longest level you can get away with", but I think that is bad advice! The emphasis should be placed on sample size because long tests are almost completely meaningless unless you are testing huge changes or are patient enough to wait for weeks or months to get the required number of samples for each individual change!

Let's assume that you have an enormous amount of CPU resources to test with, say you have a rack of 64 quad core machines for instance. How does this change things? Should you now test at much longer levels? Previously you had 1 quad, now you have 64. Should you increase the time control 64X now?

I think the answer to that is NO. You are already testing at the fastest level you can "get away with", so you should not substantially increase this. If you increase the length of the test by 64X, you are NOT going to get 64X more benefit. All you will get is slightly more valid data, and you will still have to wait a long time for it. It is better to use this extra resource to improve your turnaround time. You now have the ability to get answers in a few minutes that used to take a few hours. That is how the bulk of this additional resource should be used. Notice that I said the "bulk" of this additional resource - it may very well be sensible to increase the test time per game by some modest amount. That is a decision you should make with a dose of common sense, realizing that whatever you do is a judicious compromise of some sort.
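As a sketch of what "spend the hardware on turnaround" means in practice (play_one_game is a stand-in that just flips a biased coin - a real harness would launch one fast game between the two versions):

Code:

import random
from concurrent.futures import ProcessPoolExecutor

def play_one_game(seed):
    # Placeholder for "run one fast game, return 1 / 0.5 / 0".
    return random.Random(seed).choices((1.0, 0.5, 0.0), (0.36, 0.33, 0.31))[0]

def run_match(n_games, workers):
    # Many workers playing fast games in parallel: the same sample size
    # in a fraction of the wall-clock time, instead of longer games.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(play_one_game, range(n_games))) / n_games

if __name__ == "__main__":
    print(run_match(10_000, workers=256))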

I must leave a lot unsaid. For instance, some changes require more time to test properly than others. Some tests lend themselves more to fixed depth testing than others; sometimes fixed depth testing is a waste of time and you need to jump right to time control testing (such as when testing the time control algorithm :-)

Oh yes, one last thing about fixed nodes and fixed depth testing. An advantage I forgot to mention is that most testers do not deal with fast time controls well at all. Imagine testing at game in 15 seconds on a slow auto-tester that is doing garbage collection while a program is thinking.

My own tester can play ridiculously fast games, such as game in 5 seconds + 0.01 increment, and it seems to work just great, but I don't like to trust time controls that fast because I assume external processes could be having strange effects on the results. What happens when a java program starts up just when one of the programs is starting to think? I'm sure it averages out, but I would think it gives an advantage to the weaker program. Anything that randomizes the results benefits the weaker program.
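Here is a toy simulation of that last point (the model is invented - per-game performance as strength plus Gaussian noise - but it shows the direction of the effect):

Code:

import random

def expected_score(edge, noise_sd, games=200_000):
    # Each game, both sides draw performance = strength + noise; higher wins.
    rng = random.Random(1)
    wins = sum(edge + rng.gauss(0, noise_sd) > rng.gauss(0, noise_sd)
               for _ in range(games))
    return wins / games

for sd in (1.0, 2.0, 4.0):
    print(sd, expected_score(1.0, sd))   # ~0.76, ~0.64, ~0.57

More noise pulls the stronger side's expected score toward 50%.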
Jan Brouwer
Posts: 201
Joined: Thu Mar 22, 2007 7:12 pm
Location: Netherlands

Re: different kinds of testing

Post by Jan Brouwer »

Hi Don,

I agree with most of what you said.
Don wrote: [snip]

In my opinion, fixed node testing is the least useful of the 3 game testing methods. In order of usefulness, I personally put it like this:

1. time control games.
2. fixed depth
3. fixed nodes.
I would expect fixed nodes to be more useful than fixed depth; they are very similar, except fixed nodes doesn't have the resource allocation problem you mentioned above.
[snip]

My own tester can play ridiculously fast games, such as game in 5 second + 0.01 increment, and it seems to work just great, but I don't like to trust time controls that fast because I assume external processes could be having strange effects on the results. What happens when a java program starts up just when one of the programs is starting to think? I'm sure it averages out, but it would give an advantage to the weaker program I would think. Anything that randomizes the results benefits the weaker program.
Do your opponent programs handle those very fast time controls well?
I can imagine that most programs are not designed or tested at those time controls.
There, fixed node testing could be an advantage.

I am interested to hear what specific testing regimes people are using.
I use 20 opening positions x 2 (white/black) x 6 opponents = 240 games at 10 s + 1 s / move. This takes about a night...
Since the last two versions of my program, I use a kind of verification test which repeats the above 4 times.
I realise that this means the uncertainty in the results is quite large, but the process seems to work so far - my program has reached 2669 on the CCRL list.
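To put a rough number on "quite large" (my own arithmetic, worst case with no draws - draws shrink it a bit):

Code:

import math

n = 240
half_width = 1.96 * 0.5 / math.sqrt(n)   # ~ +/-0.063 in match score
elo_slope = 400.0 / math.log(10) / 0.25  # Elo per unit of score near 50%
print(half_width * elo_slope)            # ~ +/-44 Elo at 95%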
But I wonder how much further I can take this as the improvements become smaller and smaller.
Probably I also need to start testing at shorter time controls, which means finding an alternative to Arena.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: different kinds of testing

Post by bob »

Don wrote: [snip - full post quoted above]

I will add a couple of notes.

1. Time control. Is it possible to choose a "bad one" that will provide misleading results? Absolutely. Is it probable? Based on my testing results, no. This means that very fast time controls are pretty reliable in predicting performance at longer time controls, when you are trying to determine if A' is better than A (A' is just a modified A). Have I seen cases where this is false? Yes. For example, it is not uncommon to see a ramped-up king safety score better at fast time controls, where it gets the opponent into trouble, while at longer time controls the opponent can see through the superficiality of some moves and avoid the problems. But that only requires a bit of common sense when testing. If you do that sort of change, you probably do want to try some slower games to verify or contradict the fast game results. If you change anything at all related to time usage, you have to test it at multiple time controls. And finally, I always test with an increment, except when testing new time control ideas (for games without increments), as I don't want last-second time scrambles to have a major impact on the game. I want to measure the skill I am adding, not the existing skill at out-scrambling my opponent under time pressure.

For the past 2 years, I have been carefully running tests and frequently trying different time controls to see if there is a difference. There is absolutely a difference in playing (say) Crafty vs program X at a fast time control and at a slow time control. But I am not trying to measure the difference between Crafty and program X; I am trying to measure the difference between two versions of Crafty against a group of opponents. And so far, avoiding the specific cases mentioned previously, when I play C vs a group and C' vs a group at time control X (very fast), and then C and C' vs the same group at time control Y (slower), the distance between C and C' remains pretty constant. The distance between C/C' and the opponents might change (some programs are very bad at very fast games, for example, and get into time trouble and get mashed because of it, increment or no increment). And that happens to be exactly what I want to see. I much prefer to run a 40,000 game test that takes an hour, so that I can quickly determine whether the new version is better or worse, provided the measurement itself is pretty accurate. And with the exception (so far) of changes to time usage, I have not found any case where this is an issue.

2. Number of games. I see lots of conclusions based on 200-300 games. If a change is outstanding, that might be enough. Or if a change is horrible, ditto. But we rarely test such a version, as while we are working on a change, we are running test positions and such to be reasonably sure the change is doing what we expected. So our changes are not "wow" or "awww", but more often "hmmm". 2-3-4-5 Elo up or down is about the most we see. The error bar for 40,000 games is between +/-4 and +/-5. 200,000 games barely gets down to +/-2 or 3.
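For reference, a naive i.i.d. sketch of those error bars (my arithmetic; real error bars come out somewhat wider once shared openings and other correlations between games are accounted for):

Code:

import math

def elo_error_bar(n_games, draw_rate=0.3, z=1.96):
    # Per-game variance of the score is 0.25 * (1 - draw_rate),
    # so draws shrink the error bar.
    sigma = 0.5 * math.sqrt(1.0 - draw_rate) / math.sqrt(n_games)
    elo_slope = 400.0 / math.log(10) / 0.25   # Elo per unit score near 50%
    return z * sigma * elo_slope

print(elo_error_bar(40_000))    # ~ +/-2.9 Elo
print(elo_error_bar(200_000))   # ~ +/-1.3 Elo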

If your program is new, and you are adding significant features (say null-move that might be a +100 change) then a few games is enough to convince yourself the new version is better. But for the more common couple of elo up or down type changes, 200-400 games is essentially a random experiment with a random result that can't be trusted.

As far as the mode of testing (nodes/depth/time), all of my qualitative testing (to determine if a feature is good or bad) is done using time. That's the way we have to play the games. If you only use depth, then every search extension you add will look good, because you increase the size of the tree and there is no associated cost for doing so. Fixed nodes helps with that a bit, since the size of the tree is now a function of nodes allowed rather than depth allowed, but it still hides important characteristics that are difficult to recover. For example, a program (mine, for instance) might be slow in the opening because of developmental analysis (which gets turned off later) and much faster in the endgame, because missing pieces make move generation and other things faster. If you use a fixed number of nodes, a change can influence the game in ways besides skill level. For example, suppose you add a term that reduces the tendency to trade pieces. This will likely look good, even though avoiding those endgames should cost you the speed advantage you would have there - except that under a fixed-node search you never had that advantage, so there is nothing to lose. But with fixed nodes, you won't realize this, because all nodes are equal, the fast ones and the slow ones. So you skew the result by steering the game into an area where you are better (or worse) solely because of the way you are testing, which will not be the same result you would see when using real timing. If your nodes are "equal" (your NPS remains fairly constant through the course of a game), this won't be an issue for you, although you may well make a change that steers the game into an unfavorable area for your opponent (same logic, reversed), which makes the change look good if you force your opponent into an area of the tree where the fixed node count penalizes him, or look bad if you steer him into an area where the fixed node count benefits him.
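A tiny numeric illustration of the NPS point (the numbers are invented):

Code:

# Wall-clock cost of the same 10M-node search at phase-dependent speeds.
NODES = 10_000_000
for phase, nps in (("opening", 1_000_000), ("endgame", 2_000_000)):
    print(phase, NODES / nps, "seconds")   # opening 10.0, endgame 5.0

Real timing rewards the faster endgame; a fixed node count charges both phases identically, which is exactly the hidden benefit (or penalty) described above.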

To me, the above is highly problematic and requires repeated testing with time controls to verify a result. I'd just as soon start with time control testing and avoid the mess completely.

As I have mentioned previously, fixed depth or fixed node searches are great for debugging. If a program crashes in a real game, and you can't reproduce it because of pondering and variable time usage, play games using a fixed depth. If you can create the crash, you can then repeat it as many times as you want while debugging, which is a huge help in a program as complex as these things. Of course if you throw in a parallel search, this becomes a completely moot point, as you can't reproduce anything anyway, fixed nodes or depth or not.

Measuring small changes requires many games. There is no way to "cheat the statistical gods" at all. Which, I realize, represents a huge problem for most. We are currently nearing +100 Elo for the current version over 23.0. When 23.0 was released, it was well over +120 Elo better than the previous best version(s). All without major rewrites or changes. Just a ton of small tweaks here and there, a little code here and there. And a ton of testing to recognize what was good and what was not. There's no doubt that testing using time controls is the right way to do this, but it does require a ton of CPU time to make it accurate.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: different kinds of testing

Post by bob »

Jan Brouwer wrote: [snip]

Do your opponent programs handle those very fast time controls well?
I can imagine that most programs are not designed or tested at those time controls.
There, fixed node testing could be an advantage.
There are quite a few that are very good at fast games. In 40,000 games (with 0.1s increments) I see _zero_ games lost on time for Crafty or any opponent. At zero-increment games, there will be losses, but not very many. Most programs are good at conserving time which causes the games to end for the usual reasons (resigns, draws, or mates). I've had a problem with a few programs here and there. Arasan still doesn't like very fast games, with or without increments, as an example. But there are opponents that can deal with it well enough.
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: different kinds of testing

Post by Don »

Hi Jan,

To answer your questions, fixed node testing is probably not that bad. My primary problem with it is that some evaluation terms have an impact on the number of nodes looked at, and I think it's somewhat less efficient because you arbitrarily stop in the middle of an iteration. If I were to implement this kind of level for testing, I would implement it as a normal time control where instead of seconds I would use millions of nodes, or something like that.
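Something like this is what I have in mind (a sketch only - the names and the moves-to-go heuristic are illustrative, not any engine's actual code):

Code:

# "Game in 100M nodes + 0.5M/move": allocate a per-move node budget the
# way a time manager allocates seconds.
def nodes_for_this_move(nodes_left, nodes_inc, moves_to_go_est=30):
    budget = nodes_left // moves_to_go_est + nodes_inc
    return min(budget, nodes_left)   # never overdraw the "clock"

The search would poll its node counter the same way it polls the clock, and unused budget from easy moves carries over to later ones, just as unused time does.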

I feel like the number of nodes is like time - so I would probably just go right to time control testing instead of doing this at all. An evaluation change can affect the number of nodes looked at, but it won't turn a 7 ply search into something else. So perhaps I'm being too arbitrary.

To answer your other question, when I run hyper-bullet time controls like game in 1 second, it has a different effect on different programs. Rybka can give my program 4 to 1 time odds, but not at fast time controls. I suspect Rybka has a startup time issue, or perhaps the magic of Rybka requires some minimal depth to really kick in with a vengeance.

Toga is another program that does not do well at all at fast time controls. I'm pretty sure something happens before the search begins that requires a bit of time, perhaps a very sophisticated initialization of piece square tables or something like that.

My testing regime was laid out pretty well in this post, but I'll mention that I have almost 4000 openings, so if I'm only testing 2 players (which is rare for me) I can get only 8000 games in. But of course if I have more players, those same openings are played against each player. I generally don't like having test samples of less than 10,000 games.

I am especially paranoid about inadvertently tuning against the first few openings in my opening set, so for any 2 players the openings are played in a randomized order. Otherwise, if I had 20 players and I always played the openings in the same order, the first few hundred games would all be played with the same opening. And each player would play his first 19 games with the same opening, then play the opposite side of that same opening! Since I rarely run each test to completion (I usually stop after a few thousand games per player), the openings near the end would never get exercised if I didn't do this randomization. My autotester manages the openings so that they never repeat for a specific pair of players.
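The scheme is roughly this (a sketch with my own names, not the actual tester code):

Code:

import random

def openings_for_pair(openings, player_a, player_b):
    # Each pair of players gets its own deterministic shuffle of the
    # book, so openings never repeat within a pair and an early stop
    # does not over-sample the front of the list.
    key = ":".join(sorted((player_a, player_b)))   # order-independent seed
    order = list(range(len(openings)))
    random.Random(key).shuffle(order)              # string seeds are stable
    for i in order:
        yield openings[i]   # each opening played once per color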


Jan Brouwer wrote:Hi Don,

I agree with most what you said.
Don wrote:A lot of things have been said on the forum about how to test chess programs. So this is an attempt to clarify the confusion and of course it would be good to hear many other opinions.

One of the ways people test program changes is by using problem tests. I think there is close to being a consensus that this is very difficult, the correlation between test set results and actual playing strength are pretty weak. There is always the hope that some excellent test sets can be developed that correlates really well with program strength, but so far nobody has been able to produce such a set. Nevertheless, problem sets can be pretty useful tools if used correctly.

Another way to test programs is by actually playing games. This falls into 2 different categories, self-testing and variety testing. Self-testing means that you play 1 version of your program against another version of your program in order to ascertain the value of some specific change or group of changes.

Variety testing is based on the notion that there is something "incestuous" about testing the same program against itself and that the results will be invalid. Different people take this to different extremes, some believing it makes little or no difference and other believing results of self-testing are completely invalid.

The other issue is how to set the levels. The 3 broad categories are time control games, fixed node testing and fixed depth testing. Each has unique advantages and disadvantages which will be discussed shortly.

But first we need to consider that every change to a program impacts it in 2 basic ways. A good engineer wants to understand this impact, whether it's good or bad. The change can impact the CPU performance, or it can affect the "raw" quality of the program. Every change will have these 2 effects in some ratio.

For instance any change that makes the program faster, has a positive impact on the playing strength. An example of this is just using a better compiler or better complier flags that improves the CPU and/or memory performance.

The "raw" quality of the program might be represented by an evaluation improvement or an improved search algorithm. Of course an evaluation improvement or search modification can also impact the CPU performance of the program and usually will. To summarize, the actual ELO impact of a change is dependent on some combination of these 2 things.

Now we get to the 3 main testing procedures based on playing games.

Time control based testing has the advantage that "real" games are played with these levels and is the best tool for understanding the ELO impact of the change. It's primary disadvantage is that tells you nothing about why the change is good or bad, only that it IS good or bad.

Fix node testing is attempt to standardize the testing conditions - games could be played on any hardware and directly compared for instance. The main disadvantage of fixed node testing is that it is not very good at isolating the 2 kinds of program change impacts mentioned above. For example adding evaluation to the program will slow the program down, but fixed node testing does not reveal this. This kind of testing will only tell you if some change had an impact on the NUMBER OF NODES you need to consider to reach a certain level of play. By itself that is not very useful to know.

Fixed depth testing is a way to isolate CPU performance from raw program performance. For example you may have a good idea for an evaluation change that is expensive to compute. Time control games may reveal that the program is weaker, but a good engineer will want to know WHY. Fixed depth testing might reveal that the idea is strong, but the implementation is too slow. The primary weakness of fixed depth testing is that it does not measure the total impact of the change. If the change makes the program faster but weaker, you cannot tell if the speed makes up for the weakening or visa versa.

There is an additional weakness of fixed depth testing that the other 2 methods do not share. In real games some moves are searched more deeply than others. In the endgame it is common to search an extra 2 or 3 ply or more and even in the middlegame the depth can vary significantly from one move to the next. Fixed depth testing therefore does not allocate resources in a realistic way compared to fixed node testing or fixed depth testing.

A final big issue in testing is how much resource to allocate to each game. If your program is being designed to compete in chess tournaments, you obviously want it to perform well at those time controls. That would indicate that you should test at levels that are closer to the levels you expect to compete in, or do well at.

Unfortunately, for most people that is impractical unless you have enormous computing resources to apply to the problem. If you have a program that is strong, it's unrealistic to expect anything more than minor improvements from each change. In order to accurately determine if a change is an actual improvement will generally require THOUSANDS of games unless the change affects the strength by tens of ELO points. If you accept only hundreds of games, you will end up keeping many changes that have actually harmed your program. This is such an important concept that a compromise must be struck. In order to get an adequate number of games on modest hardware you must either be incredibly patient, or test at much faster time controls (or depths, or nodes) than you are building for.

So now let's try to answer the questions, "how fast should I test?" and "which testing methods should I use?"

This is where we get into the realm of opinion, fanaticism, and superstition. This is a question for each developer to determine for himself. I can only give you my opinion here, and I expect this will generate a lot of additional opinions - which I hope will stimulate everyone's imagination.

In my opinion, fixed node testing is the least useful of the 3 game testing methods. In order of usefulness, I personally put it like this:

1. time control games
2. fixed depth
3. fixed nodes
I would expect fixed nodes to be more useful than fixed depth; they are very similar, except fixed nodes doesn't have the resource allocation problem you mentioned above.
In my own software development process I usually start with fixed depth games and graduate to time control games. I use problem testing as a sanity check, and I don't use fixed nodes at all, as I personally believe it is the least useful method. The fixed depth testing tries to give me an upper bound on what I can expect from the "new idea" or thing I am testing. It also tells me if the fundamental idea is sound. With fixed depth testing I have isolated the IDEA from the IMPLEMENTATION. This is a general principle that every good engineer is aware of and uses to good advantage. You will also find that it may give you an "early cutoff": you may be able to quickly determine that an idea is useless, so there is no need to proceed to the next stage of testing (real time control games).
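The "early cutoff" can even be made mechanical. A crude sketch (the thresholds are arbitrary assumptions, and because peeking repeatedly at a running test inflates the error rate, the z value is kept generous):

```python
import math

def early_cutoff(wins, losses, draws, z=2.0):
    """Stop a test early when even an optimistic read of the running
    score says the idea is losing. The minimum-game threshold is
    arbitrary, and z is kept generous because repeated peeking at a
    running test inflates the error rate."""
    n = wins + losses + draws
    if n < 100:
        return False
    score = (wins + 0.5 * draws) / n
    se = math.sqrt(score * (1 - score) / n)  # conservative upper bound
    return score + z * se < 0.5              # still below 50% when optimistic

print(early_cutoff(120, 180, 100))  # True: abandon this idea early
```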

Which levels should you use for testing? This one is quite tricky, but the short answer is to use the fastest levels that you can get away with. I could have answered, "use the longest level you can get away with", but I think that is bad advice! The emphasis should be placed on sample size because long tests are almost completely meaningless unless you are testing huge changes or are patient enough to wait for weeks or months to get the required number of samples for each individual change!

Let's assume that you have an enormous amount of CPU resources to test with - say, a rack of 64 quad-core machines. How does this change things? Should you now test at much longer levels? Previously you had 1 quad; now you have 64. Should you increase the time control 64X now?

I think the answer to that is NO. You are already testing at the fastest level you can "get away with", so you should not substantially increase this. If you increase the length of the test by 64X, you are NOT going to get 64X more benefit. All you will get is slightly more valid data, and you will still have to wait a long time for it. It is better to use this extra resource to reduce your turnaround time. You now have the ability to get answers in a few minutes that used to take a few hours. That is how the bulk of this additional resource should be used. Notice that I said the "bulk" of this additional resource - it may very well be sensible to increase the test time per game by some modest amount; it is a decision that you should try to make with a dose of common sense, realizing that whatever you do is a judicious compromise of some sort.

I must leave a lot unsaid. For instance, some tests require more time to properly test than others. Some tests lend themselves more to fixed depth testing than others, and sometimes fixed depth testing is a waste of time and you need to jump right to time control testing (such as when testing the time control algorithm :-)

Oh yes, one last thing about fixed nodes and fixed depth testing. An advantage I forgot to mention is that most testers do not deal with fast time controls well at all. Imagine testing at game in 15 seconds on a slow auto-tester that is doing garbage collection while a program is thinking.

My own tester can play ridiculously fast games, such as game in 5 seconds + 0.01 second increment, and it seems to work just great, but I don't like to trust time controls that fast because I assume external processes could be having strange effects on the results. What happens when a java program starts up just when one of the programs is starting to think? I'm sure it averages out, but I would think it gives an advantage to the weaker program. Anything that randomizes the results benefits the weaker program.
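That intuition is easy to check with a toy simulation (the 60% baseline and the hiccup model are invented numbers):

```python
import random

# The stronger program normally scores 60%; in some fraction of games
# a hypothetical external hiccup replaces the result with a coin flip.
def expected_score(noise_fraction, games=200_000):
    total = 0.0
    for _ in range(games):
        if random.random() < noise_fraction:
            total += random.random() < 0.5   # garbage game: 50/50
        else:
            total += random.random() < 0.60  # normal game
    return total / games

print(expected_score(0.0))   # ~0.600
print(expected_score(0.25))  # ~0.575: noise drags the score toward 50%
```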
Do your opponent programs handle those very fast time controls well?
I can imagine that most programs are not designed or tested at those time controls.
There, fixed node testing could be an advantage.

I am interested to hear what specific testing regimes people are using.
I use 20 opening positions x 2 (white/black) x 6 opponents = 240 games at 10 s + 1 s / move. This takes about a night...
Since the last two versions of my program, I use a kind of verification test which repeats the above 4 times.
I realise that this means that the uncertainty in the results is quite large, but the process seems to work so far; my program has reached 2669 on the CCRL list.
But I wonder how much further I can take this as the improvements become smaller and smaller.
Probably I also need to start testing at shorter time controls, which means finding an alternative to Arena.
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: different kinds of testing

Post by Don »

I actually wanted to make some of the same points you made, but my forum post was already getting pretty lengthy. I have some more comments.
bob wrote:
I will add a couple of notes.

1. time control. Is it possible to choose a "bad one" that will provide misleading results? Absolutely. Is it probable? Based on my testing results, no. This means that very fast time controls are pretty reliable in predicting performance at longer time controls, when you are trying to determine if A' is better than A (A' is just a modified A). Have I seen cases where this is false? Yes. For example, it is not uncommon to see a ramped-up king safety score better at fast time controls, where it gets the opponent into trouble, while at longer time controls the opponent can see through the superficiality of some moves and avoid the problems. But that only requires a bit of common sense when testing. If you do that sort of change, you probably do want to try some slower games to verify or contradict fast game results. If you change anything at all related to time usage, you have to test that at multiple time controls. And finally, I always test with an increment, except when testing new time control ideas (for games without increments), as I don't want those last-second time scrambles to have a major impact on the game; I want to measure the skill I am adding, not the existing skill in out-scrambling my opponent under time pressure.
I actually find myself in 100% agreement with you for a change on this one. Generally you can get away with pretty fast testing, but as you point out there are exceptions and you have to use some common sense. I found the same issue with king safety - this is one evaluation feature that gets better with depth, and with only 2 or 3 ply searches it actually tests worse! Larry Kaufman revealed to me that with Rybka, king safety was a big gain, but if you test fast enough it makes Rybka play worse!

I have more or less settled on 7 ply searches for evaluation tuning. I can run those games very quickly and get thousands in on my quad, and so far it has always translated to an improvement at long time controls, even with king safety. I could probably get away with 5 ply searches, but I like the depth to be not too ridiculous. Some evaluation improvements have been expensive, and my adjustment factor tells me when I should be concerned, although I do not use this in the final decision. I still believe in real time control games.

For the past 2 years, I have been carefully running tests and frequently trying different time controls to see if there is a difference. There is absolutely a difference in playing (say) Crafty vs program X at a fast time control, and at a slow time control. But I am not trying to measure the difference between Crafty and program X, I am trying to measure the difference in two different versions of Crafty against a group of opponents. And so far, avoiding the specific cases mentioned previously, when I play C vs a group and C' vs a group at time control X (very fast), and then C and C' vs the same group at time control Y (slower), the distance between C and C' remains pretty constant. The distance between C/C' and the opponents might change (some programs are very bad at very fast games, for example, and get into time trouble and get mashed because of it, increment or no increment). And that happens to be exactly what I want to see. I much prefer to run a 40,000 game test that takes an hour, so that I can determine quickly if the new version is better or worse, provided the measurement itself is pretty accurate. And with the exception (so far) of changing time usage, I have not found any case where this is an issue.

2. Number of games. I see lots of conclusions based on 200-300 games. If a change is outstanding, that might be enough. Or if a change is horrible, ditto. But we rarely test such a version, as while we are working on a change, we are running test positions and such to be reasonably sure the change is doing what we expected. So our changes are not "wow" or "awww", but more often "hmmm". 2-3-4-5 Elo up or down is about the most we see. The error bar for 40,000 games is between +/-4 and +/-5. 200,000 games barely gets down to +/-2 or 3.
Yes, there is a point of diminishing returns that tests your patience. It's pretty rare that I can make a decision based on 300 games and if I do it's a decision to not use the change, never to keep it. Basically, the keepers are the ones that get tested like crazy.

There have been a few times that I was ready to stop a test as a failure, where the scientist in me said that I should let it run for a few hundred more games - and it turned out to be a non-trivial improvement. But I'm still much more stingy about what I keep than what I throw out.

If your program is new, and you are adding significant features (say null-move that might be a +100 change) then a few games is enough to convince yourself the new version is better. But for the more common couple of elo up or down type changes, 200-400 games is essentially a random experiment with a random result that can't be trusted.

As far as the mode of testing (nodes/depth/time) all of my qualitative testing (to determine if a feature is good or bad) is done using time. That's the way we have to play the games. If you only use depth, then every search extension you add will look good, because you increase the size of the tree, and it doesn't have any associated cost of doing so.
This is not really an issue if you do what I do. I use a time adjustment formula that has proved to be very reliable, so I know at a glance whether I am in the ballpark or not. For instance I recently made a big improvement to my mobility function but it was more expensive to calculate. I saw the fixed depth score go way up, but I also saw the time go up. My depth adjusted formula told me at a glance that it was still a big improvement.
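For illustration, one plausible shape for such an adjustment - not necessarily the formula used here - with the usual rule-of-thumb assumption of 60 ELO per doubling of time, and invented example numbers:

```python
import math

def adjusted_gain(fixed_depth_elo_gain, old_secs_per_game, new_secs_per_game,
                  elo_per_doubling=60):
    """Fixed-depth gain minus a rule-of-thumb penalty for the slowdown."""
    penalty = elo_per_doubling * math.log2(new_secs_per_game / old_secs_per_game)
    return fixed_depth_elo_gain - penalty

# Invented numbers: +25 ELO at fixed depth, but games now take 46s
# instead of 40s. The estimate says the idea still wins on the clock:
print(adjusted_gain(25.0, 40.0, 46.0))  # roughly +12.9 ELO
```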

In our previous discussion I probably didn't communicate how I use this very well. The time adjustment is very useful, but I still require it to be backed up with a "real" test. I don't think you have an issue with that, other than the idea that this is a waste of time when I could have immediately jumped to the time control test. And yes, that is another way to do it which is perfectly valid.

However, I really do like to have the additional data even if the implementation is slow and cannot be salvaged. I like to know out of scientific curiosity if I had the right idea or not. But even more than that, this has often caused me to modify the idea in such a way that it COULD be salvaged. This happened to me with king safety for instance, and it turned out that the simpler and faster thing was actually the better way, even at fixed depth! I would never have had a clue about any of this if I did not do fixed depth testing. So I even consider it a fundamental requirement in my case for program development, to try to understand the entire nature of the change, not just how fast the implementation is. Anyway, you are a different person with a different personality, and this may work against the habits that you found most useful for program development, but to me it's fundamental for success. I don't just implement ideas; I have to explore them, as I don't have your natural ability.

Fixed nodes helps with that a bit, since the size of the tree is now a function of nodes allowed, rather than depth allowed, but it still hides important characteristics that are difficult to recover. For example, a program (mine, for instance) might be slow in the opening because of developmental analysis (that gets turned off later) and much faster in the endgame, because missing pieces make move generation and other things faster. If you use a fixed number of nodes, a change can influence the game in ways besides skill level. For example, you add a term that reduces the tendency to trade pieces. This will likely look good under fixed nodes: in real games, avoiding those endgames would mean giving up the extra speed you have there over your opponent, but in a fixed-node search that benefit never existed, so avoiding the trades costs nothing. With fixed nodes you won't realize this, because all nodes are equal, the fast ones and the slow ones. So you skew the result by steering the game into an area where you are better (or worse) solely because of the way you are testing, which will not be the same result you would see when using real timing.
That is one problem I have with fixed nodes; I really believe more in fixed depth testing, and only as a tool. No matter what kind of testing you do, there are issues and scenarios you can construct, and corner cases. One could be critical of time control games too. Sure it plays great at game in 5, but what about game in 6 or 7? And what about the ELO system itself, which is known to be flawed? One has to make some simplifying assumptions. As I mentioned before, I am paranoid about the order of the openings and believe they should not be played in the same order unless you are committed to running every single game to the bitter end of the test, but as you said most tests are stopped when you have enough confidence based on the error margins. It's a matter of judgment, but I think your paranoia about fixed node testing is misplaced and you should be more concerned about the order you play the openings - but you probably feel that I'm the one that has it backwards. We can both make points pro and con for each, I'm sure.


If your nodes are "equal" (your NPS remains fairly constant through the course of a game), this won't be an issue for you - although you may well make a change that steers the game into an unfavorable area for your opponent (same logic, reversed), which makes the change look good if you force your opponent into an area of the tree where the fixed node count penalizes him, or makes the change look bad if you steer the opponent into an area of the tree where the fixed node count benefits him.

To me, the above is highly problematic and requires repeated testing with time controls to verify a result. I'd just as soon start with time control testing and avoid the mess completely.
I avoid the mess by not worrying about it. I can think of a million ways I am inadvertently skewing the results, but it's probably the one I didn't think about that is doing me in. I just try not to lose any sleep over it. I think we are just paranoid about different things - and I just hope I am paranoid about the right ones. I honestly don't think this is a huge issue, but since you do, it's certainly worth giving some consideration.

As I have mentioned previously, fixed depth or fixed node searches are great for debugging. If a program crashes in a real game, and you can't reproduce it because of pondering, and variable time usage, play games using a fixed depth. If you can create the crash, you can then repeat it as many times as you want while debugging, which is a huge help in debugging a program as complex as these things. Of course if you throw in a parallel search, this becomes a completely moot point as you can't reproduce anything anyway, fixed nodes or depth or not.
I agree, fixed depth really is great for debugging. I think I'm more in debugging mode than you are when I try an idea: I don't really feel like I am in implement-and-test mode, but rather in debug-the-idea-first, then-test-it mode.

Measuring small changes requires many games. There is no way to "cheat the statistical gods" at all.
I really liked HG's idea of orthogonal multi-testing. That feels a bit like you are cheating and getting away with it. But I guess you are not really.
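The idea, roughly: build every on/off combination of several candidate changes, test each version, and recover each change's effect from the same set of games. A toy sketch with invented effect sizes (each "run" below stands in for what would really be thousands of games):

```python
from itertools import product
from random import gauss

# Three candidate changes with invented true effects (+4, -2, +1 ELO).
# Build all 2^3 on/off versions, "test" each once, then recover every
# change's effect from the same results. Works only if the changes are
# additive and independent - the method's weak spot.
TRUE_EFFECTS = [4.0, -2.0, 1.0]

def measured_elo(config):                       # config like (1, 0, 1)
    return sum(on * e for on, e in zip(config, TRUE_EFFECTS)) + gauss(0, 1.0)

configs = list(product([0, 1], repeat=3))       # 8 versions
results = {c: measured_elo(c) for c in configs} # 8 test runs

for change in range(3):
    on  = [results[c] for c in configs if c[change] == 1]
    off = [results[c] for c in configs if c[change] == 0]
    print(f"change {change}: {sum(on)/len(on) - sum(off)/len(off):+.1f} ELO")
```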
Which, I realize, represents a huge problem for most. We are currently nearing +100 Elo for the current version over 23.0.
I don't know how you can possibly write a strong program without this kind of massive testing.

I always tried to do this kind of testing in the past, for instance with RexChess, but we didn't have the kind of resources we have now. I remember Larry and I testing RexChess on several PC-XT machines (the ones with 4.77 MHz processors) and we thought we were doing well to get 200 games in. We were not stupid, but we just didn't have the ability to muster more than this, and even if that is all you can do, it's still much better than doing nothing.
When 23.0 was released, it was well over +120 elo better than the previous best version(s). All without major rewrites or changes. Just a ton of small tweaks here and there, a little code here and there. And a ton of testing to recognize what was good and what was not. There's no doubt that testing using time controls is the right way to do this, but it does require a ton of CPU time to make it accurate.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: different kinds of testing

Post by bob »

Don wrote:
This is not really an issue if you do what I do. I use a time adjustment formula that has proved to be very reliable, so I know at a glance whether I am in the ballpark or not. [...] The time adjustment is very useful, but I still require it to be backed up with a "real" test.
The question I would have to ask is this: how can you determine what kind of "timing adjustment" to use? You play a bunch of games, and use those results compared to the same test using time, and come up with some sort of fitting algorithm to convert one to the other? But what happens when your change has that odd effect of pushing more of the games into a place where you would really slow down, but you don't due to fixed nodes, or it pushes you into a part where you would really speed up, but you don't due to fixed nodes? So how is it possible to calibrate based on this variable speed issue?

This is an aspect of computer chess that I really don't believe many at all really grasp. A full-width search is a _very_ strange animal. You can write an eval term that is right 99% of the time. And make a mental note "that 1% error is not going to be significant." But then the games start, and the two programs start pushing each other on many different fronts. And in the area where you have that "hole" your opponent doesn't push back. So the games will quite naturally walk right into that kind of position since your opponent doesn't think it is bad and doesn't resist, while you think it is good in 1% of the cases (and are wrong of course). Any odd behaviour will be exploited with a deep full-width search, as it continues to advance where it finds the least resistance.

And those tiny impossible-to-imagine holes suddenly become large enough to drive an 18-wheeler through, and the programmer is sitting there asking, "How did that happen? That is such a rare occurrence the eval error should not affect the game at all!" - while someone with some experience has the completely opposite view: "any tiny hole is actually far larger than it appears, because of the magnification effect produced by a very large full-width search."

I _really_ think that very few understand the ramifications of that point. It is not just an "interesting observation". Overlooking it is a recipe for disaster.


Don wrote:
That is one problem I have with fixed nodes, I really believe more in fixed depth testing and only as a tool. [...] I think your paranoia about fixed node testing is misplaced and you should be more concerned about the order you play the openings.
First step anyone should try is to take two versions of their program and play them against a common set of opponents in a controlled way, but vary the time controls. It is almost guaranteed that your two versions will produce different Elo values at different time controls, because nearly all programs behave differently at different time controls. But if you look at the difference between the two versions, it will most likely remain constant no matter what the time control, assuming you are not modifying time usage code. I've run this test from my 10s + 0.1s fast games to 40 moves in 60 minutes, which takes forever even at over 750 games at a time using both clusters. An occasional validation is good, and we might occasionally miss a case where slower makes a change look worse. But I hope we have enough experience to anticipate those kinds of changes, which particularly include things with large scoring bonuses.



Don wrote:
I really liked HG's idea of orthogonal multi-testing. That feels a bit like you are cheating and getting away with it. But I guess you are not really.
It is a neat idea, but it is fraught with problems. It is difficult to guarantee your changes are completely orthogonal. If there is any correlation, the test breaks down.
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: different kinds of testing

Post by Don »

Jan Brouwer wrote: I am interested to hear what specific testing regimes people are using. [...] I realise that this means that the uncertainty in the results is quite large, but the process seems to work so far; my program has reached 2669 on the CCRL list.
Which program is yours?
But I wonder how much further I can take this as the improvements become smaller and smaller.
Probably I also need to start testing at shorter time controls, which means finding an alternative to Arena.
I think you can still get good testing with modest means. But you are pretty much forced into much faster testing if you want to get a statistically viable sample. You can also run carefully constructed tests during periods of time when you know you will be doing other things and cannot do chess.

But you cannot do this very well with Arena. Do what I did: build your own tester that is not graphical. (The graphics kills the speed when you are talking about hyper-bullet time controls.)
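A minimal non-graphical runner really is small. A sketch using the python-chess library (engine paths and the opening list are placeholders; a real tester would add adjudication, crash handling, and PGN output):

```python
import chess
import chess.engine

# Minimal non-graphical match runner: two UCI engines, fixed depth,
# each opening played from both sides. A sketch only: engine paths and
# the opening list are placeholders.
ENGINES = ["./dev_version", "./base_version"]   # hypothetical paths
OPENINGS = [chess.STARTING_FEN]                 # plug in your positions

def play_game(white_path, black_path, fen, depth=7):
    white = chess.engine.SimpleEngine.popen_uci(white_path)
    black = chess.engine.SimpleEngine.popen_uci(black_path)
    board = chess.Board(fen)
    while not board.is_game_over():
        side = white if board.turn == chess.WHITE else black
        move = side.play(board, chess.engine.Limit(depth=depth)).move
        board.push(move)
    white.quit()
    black.quit()
    return board.result()                       # "1-0", "0-1", "1/2-1/2"

score = 0.0
for fen in OPENINGS:
    for dev_is_white in (True, False):
        paths = ENGINES if dev_is_white else ENGINES[::-1]
        outcome = play_game(paths[0], paths[1], fen)
        if outcome == "1/2-1/2":
            score += 0.5
        elif (outcome == "1-0") == dev_is_white:
            score += 1.0
print(f"dev scored {score} / {2 * len(OPENINGS)}")
```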
User avatar
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: different kinds of testing

Post by hgm »

Note that XBoard / WinBoard has the option -noGUI to achieve this. It can also help to run the engines at lower priority than the GUI (-niceEngines 20).
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: different kinds of testing

Post by Don »

hgm wrote:Note that XBoard / WinBoard has the option -noGUI to acheive this. It can also help to run the engines at lower priority than the GUI, (-niceEngines 20).
xboard is actually pretty fast even with the GUI compared to Arena; however, I'm sure the graphics still slows it down considerably.

Can you run fractional time controls with fractional increments? Like game in 6 seconds + 0.1 increment?

Since I have my own tester and run on linux, I really like using xboard for watching a few games at slower time controls.

On my wish list for xboard is to support UCI in addition to the xboard protocol, so that matches could be played without the annoying polyglot adaptor. That would be simple to do except for the support for changing things in the program - which requires a bunch of widget programming.

- Don