One of the ways people test program changes is by using problem tests. I think there is close to a consensus that this is very difficult: the correlation between test set results and actual playing strength is pretty weak. There is always the hope that some excellent test set can be developed that correlates really well with program strength, but so far nobody has been able to produce such a set. Nevertheless, problem sets can be pretty useful tools if used correctly.
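To make this concrete, here is a minimal sketch of how a problem-set run is typically scored: walk an EPD file, search each position, and count how often the engine's move matches the "bm" (best move) key. It assumes the python-chess library; the engine binary "./myengine", the "wac.epd" file name, and the 5 seconds per position are only illustrative.

```python
import chess
import chess.engine

def run_epd_suite(epd_path, engine_path, seconds_per_position=5.0):
    # Score a UCI engine against an EPD problem set: one point per position
    # where the engine's chosen move matches one of the "bm" moves.
    engine = chess.engine.SimpleEngine.popen_uci(engine_path)
    solved = total = 0
    try:
        with open(epd_path) as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                board, ops = chess.Board.from_epd(line)
                if "bm" not in ops:          # only score positions with a "best move" key
                    continue
                total += 1
                played = engine.play(board, chess.engine.Limit(time=seconds_per_position))
                if played.move in ops["bm"]:
                    solved += 1
    finally:
        engine.quit()
    return solved, total

solved, total = run_epd_suite("wac.epd", "./myengine")
print(f"solved {solved} of {total} positions")
```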
Another way to test programs is by actually playing games. This falls into 2 different categories, self-testing and variety testing. Self-testing means that you play 1 version of your program against another version of your program in order to ascertain the value of some specific change or group of changes.
Variety testing is based on the notion that there is something "incestuous" about testing a program against itself and that the results will be invalid. Different people take this to different extremes, some believing it makes little or no difference and others believing the results of self-testing are completely invalid.
The other issue is how to set the levels. The 3 broad categories are time control games, fixed node testing and fixed depth testing. Each has unique advantages and disadvantages which will be discussed shortly.
But first we need to consider that every change to a program impacts it in 2 basic ways. A good engineer wants to understand this impact, whether it's good or bad. The change can impact the CPU performance, or it can affect the "raw" quality of the program. Every change will have these 2 effects in some ratio.
For instance, any change that makes the program faster has a positive impact on the playing strength. An example of this is simply using a better compiler, or better compiler flags, that improves the CPU and/or memory performance.
The "raw" quality of the program might be represented by an evaluation improvement or an improved search algorithm. Of course an evaluation improvement or search modification can also impact the CPU performance of the program and usually will. To summarize, the actual ELO impact of a change is dependent on some combination of these 2 things.
Now we get to the 3 main testing procedures based on playing games.
Time control based testing has the advantage that "real" games are played with these levels, and it is the best tool for understanding the ELO impact of the change. Its primary disadvantage is that it tells you nothing about why the change is good or bad, only that it IS good or bad.
Fixed node testing is an attempt to standardize the testing conditions - games could be played on any hardware and directly compared, for instance. The main disadvantage of fixed node testing is that it is not very good at isolating the 2 kinds of program change impacts mentioned above. For example, adding evaluation to the program will slow the program down, but fixed node testing does not reveal this. This kind of testing will only tell you whether some change had an impact on the NUMBER OF NODES you need to consider to reach a certain level of play. By itself that is not very useful to know.
Fixed depth testing is a way to isolate CPU performance from raw program performance. For example, you may have a good idea for an evaluation change that is expensive to compute. Time control games may reveal that the program is weaker, but a good engineer will want to know WHY. Fixed depth testing might reveal that the idea is strong, but the implementation is too slow. The primary weakness of fixed depth testing is that it does not measure the total impact of the change. If the change makes the program faster but weaker, you cannot tell whether the extra speed makes up for the weakening or vice versa.
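To see the other half of the picture - the implementation cost that fixed depth games hide - one simple check is to time both builds to the same depth over a handful of benchmark positions. This is only a sketch, again assuming python-chess; the engine paths, the depth of 12, and the benchmark positions are illustrative.

```python
import time
import chess
import chess.engine

# A couple of benchmark positions; a real benchmark would use many more.
BENCH_FENS = [
    chess.STARTING_FEN,
    "r1bqkbnr/pppp1ppp/2n5/4p3/4P3/5N2/PPPP1PPP/RNBQKB1R w KQkq - 2 3",
]

def time_to_depth(engine_path, depth=12):
    # Total wall-clock time for one build to reach the given depth on each position.
    engine = chess.engine.SimpleEngine.popen_uci(engine_path)
    try:
        total = 0.0
        for fen in BENCH_FENS:
            board = chess.Board(fen)
            start = time.perf_counter()
            engine.play(board, chess.engine.Limit(depth=depth))
            total += time.perf_counter() - start
        return total
    finally:
        engine.quit()

old = time_to_depth("./engine_old")
new = time_to_depth("./engine_new")
print(f"new build takes {new / old:.2f}x the time of the old build to reach the same depth")
```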
There is an additional weakness of fixed depth testing that the other 2 methods do not share. In real games some moves are searched more deeply than others. In the endgame it is common to search an extra 2 or 3 ply or more, and even in the middlegame the depth can vary significantly from one move to the next. Fixed depth testing therefore does not allocate resources in a realistic way compared to fixed node testing or time control testing.
A final big issue in testing is how much resource to allocate to each game. If your program is being designed to compete in chess tournaments, you obviously want it to perform well at those time controls. That would indicate that you should test at levels that are closer to the levels you expect to compete in, or do well at.
Unfortunately, for most people that is impractical unless you have enormous computing resources to apply to the problem. If you have a program that is strong, it's unrealistic to expect anything more than minor improvements from each change. Accurately determining whether a change is an actual improvement will generally require THOUSANDS of games, unless the change affects the strength by tens of ELO points. If you accept only hundreds of games, you will end up keeping many changes that have actually harmed your program. This is such an important point that a compromise must be struck: in order to get an adequate number of games on modest hardware you must either be incredibly patient, or test at much faster time controls (or depths, or nodes) than you are building for.
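To put rough numbers on this, here is a small sketch that computes the approximate 95% error bar on a measured ELO difference, using the usual logistic conversion and a normal approximation; the 40% draw rate is just an illustrative assumption. Under these assumptions, 100 games leave you roughly +/- 50 ELO of uncertainty, and even 1600 games still leave around +/- 13.

```python
import math

def elo_error_bar(games, score=0.5, draw_rate=0.40):
    # Per-game variance of the score (1, 0.5 or 0 points) given the win/draw/loss split.
    win = score - draw_rate / 2
    loss = 1 - win - draw_rate
    variance = win * 1.0 + draw_rate * 0.25 + loss * 0.0 - score ** 2
    se = math.sqrt(variance / games)              # standard error of the mean score

    def to_elo(s):
        # Standard logistic model: Elo difference implied by an expected score.
        s = min(max(s, 1e-6), 1 - 1e-6)
        return -400 * math.log10(1 / s - 1)

    lo = to_elo(score - 1.96 * se)
    hi = to_elo(score + 1.96 * se)
    return (hi - lo) / 2                          # half-width of the 95% interval in Elo

for n in (100, 400, 1600, 6400, 25600):
    print(f"{n:6d} games: about +/- {elo_error_bar(n):.0f} Elo")
```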
So now let's try to answer the questions, "how fast should I test?" and "which testing methods should I use?"
This is where we get into the realm of opinion, fanaticism, and superstition. This is a question for each developer to determine for himself. I can only give you my opinion here, and I expect this will generate a lot of additional opinions - which I hope will stimulate everyone's imagination.
In my opinion, fixed node testing is the least useful of the 3 game testing methods. In order of usefulness, I personally put it like this:
1. time control games
2. fixed depth
3. fixed nodes
In my own software development process I usually start with fixed depth games and graduate to time control games. I use problem testing as a sanity check, and I don't use fixed nodes at all as I personally believe it is the least useful method. The fixed depth testing tries to give me an upper bound on what I can expect from the "new idea" or thing I am testing. It also tells me whether the fundamental idea is sound. With fixed depth testing I have isolated the IDEA from the IMPLEMENTATION. This is a general principle that every good engineer is aware of and uses to good advantage. You will also find that it may give you an "early cutoff": you may be able to quickly determine that an idea is useless, so there is no need to proceed to the next stage of testing (real time control games).
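As an illustration of that first stage, here is a minimal sketch of a fixed depth self-play match, assuming python-chess as the harness; the engine paths, the depth of 10, and the 20-game match length are only illustrative, and opening books and adjudication are omitted for brevity.

```python
import chess
import chess.engine

def play_fixed_depth_game(white_path, black_path, depth=10):
    # Play one game where both sides search to the same fixed depth on every move.
    board = chess.Board()
    engines = {
        chess.WHITE: chess.engine.SimpleEngine.popen_uci(white_path),
        chess.BLACK: chess.engine.SimpleEngine.popen_uci(black_path),
    }
    try:
        while not board.is_game_over(claim_draw=True):
            played = engines[board.turn].play(board, chess.engine.Limit(depth=depth))
            board.push(played.move)
        return board.result(claim_draw=True)      # "1-0", "0-1" or "1/2-1/2"
    finally:
        for engine in engines.values():
            engine.quit()

if __name__ == "__main__":
    score = 0.0
    games = 20
    for game in range(games):
        new_is_white = game % 2 == 0              # alternate colours each game
        paths = ("./engine_new", "./engine_old") if new_is_white else ("./engine_old", "./engine_new")
        result = play_fixed_depth_game(*paths)
        if result == "1/2-1/2":
            score += 0.5
        elif (result == "1-0") == new_is_white:
            score += 1.0
    print(f"new build scored {score}/{games}")
```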
Which levels should you use for testing? This one is quite tricky, but the short answer is to use the fastest levels that you can get away with. I could have answered, "use the longest level you can get away with", but I think that is bad advice! The emphasis should be placed on sample size because long tests are almost completely meaningless unless you are testing huge changes or are patient enough to wait for weeks or months to get the required number of samples for each individual change!
Let's assume that you have an enormous amount of CPU resources to test with, say you have a rack of 64 quad core machines for instance. How does this change things? Should you now test at much longer levels? Previously you had 1 quad, now you have 64. Should you increase the time control 64X now?
I think the answer to that is NO. You are already testing at the fastest level you can "get away with", so you should not substantially increase this. If you increase the length of the test by 64, you are NOT going to get 64X more benefit. All you will get is slightly more valid data, and you will still have to wait a long time for it. It is better to use this extra resource to improve your turnaround time. You now have the ability to get answers in a few minutes that used to take a few hours. That is how the bulk of this additional resource should be used. Notice that I said the "bulk" of this additional resource - it may very well be sensible to increase the test time per game by some modest amount; it is a decision that you should try to make with a dose of common sense, realizing that whatever you do is a judicious compromise of some sort.
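As a sketch of what spending the resource on turnaround can look like, the games can simply be farmed out across worker processes rather than made longer. This reuses the hypothetical play_fixed_depth_game() from the earlier sketch; the module name, worker count, and game count are all illustrative.

```python
from concurrent.futures import ProcessPoolExecutor

# Assumes the earlier fixed depth sketch was saved as fixed_depth_match.py.
from fixed_depth_match import play_fixed_depth_game

def one_game(game_index):
    # Alternate colours and score each game from the new build's point of view.
    if game_index % 2 == 0:
        result = play_fixed_depth_game("./engine_new", "./engine_old")
        return {"1-0": 1.0, "0-1": 0.0}.get(result, 0.5)
    result = play_fixed_depth_game("./engine_old", "./engine_new")
    return {"0-1": 1.0, "1-0": 0.0}.get(result, 0.5)

if __name__ == "__main__":
    games = 1000
    # Each game runs two engine processes, so size the pool to the cores you can spare.
    with ProcessPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(one_game, range(games)))
    print(f"new build scored {sum(results)}/{games}")
```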
I must leave a lot unsaid. For instance, some tests require more time to properly test than others. Some tests lend themselves more to fixed depth testing than others, and sometimes fixed depth testing is a waste of time and you need to jump right to time control testing (such as when testing the time control algorithm).

Oh yes, one last thing about fixed nodes and fixed depth testing. An advantage I forgot to mention is that most automated testers do not handle fast time controls well at all. Imagine testing at game in 15 seconds on a slow auto-tester that is doing garbage collection while a program is thinking.
My own tester can play ridiculously fast games, such as game in 5 seconds + 0.01 second increment, and it seems to work just great, but I don't like to trust time controls that fast because I assume external processes could be having strange effects on the results. What happens when a Java program starts up just when one of the programs is starting to think? I'm sure it averages out, but I would think it gives an advantage to the weaker program. Anything that randomizes the results benefits the weaker program.