I use a combination of fixed-depth testing and Fischer time control; it depends on what I want to know. Fischer time control is very sane compared to sudden death. Because I have limited resources I sometimes test as fast as a game in 6 seconds + 0.1 second increment. Typically I have 4 games being played simultaneously on a quad-core machine, managed by my own home-brewed, high-quality tester.
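For anyone unfamiliar with Fischer timing, the bookkeeping is simple: each side's clock runs down while it thinks, and a fixed increment is added back after every completed move. A minimal sketch (my own illustration, not the author's tester):

```python
def update_clock(remaining, elapsed, increment=0.1):
    """Return the seconds left on a player's clock after one move
    under Fischer timing.

    remaining: seconds on the clock before the move
    elapsed:   seconds spent thinking on the move
    increment: seconds added back after each completed move
    """
    remaining -= elapsed
    if remaining <= 0.0:
        # The flag fell before the move was completed: loss on time.
        raise TimeoutError("flag fell: loss on time")
    return remaining + increment


# A 6s + 0.1s game starts with 6.0 seconds on the clock.
t = 6.0
for move_time in (0.2, 0.15, 0.1):
    t = update_clock(t, move_time)
```

Note that a move taking exactly the increment leaves the clock unchanged, which is why these very fast levels are still playable: a program that moves quickly enough can never flag.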
When I run fixed-depth testing, I also report a "time adjusted" ELO rating. It uses a formula based on the expected ELO gain per doubling of time. This has to be calibrated for each depth, since a doubling at deep levels is worth less than a doubling at shallow levels. This is a very nice feature because it accounts for algorithms I may be testing that slow the program down but also boost the ELO rating - so it gives me a rough and ready calculation that I have found to be quite reliable.
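The idea can be sketched as below. The function names and the 70-ELO-per-doubling figure are purely illustrative assumptions of mine, not the author's calibrated values; the point is only the shape of the adjustment:

```python
import math


def time_adjusted_elo(measured_elo, time_new, time_old, elo_per_doubling=70.0):
    """Adjust a fixed-depth ELO measurement for a change in speed.

    measured_elo:     ELO gain measured at fixed depth
    time_new:         average game time of the new version
    time_old:         average game time of the old version
    elo_per_doubling: calibrated ELO value of one doubling of time
                      at this depth (70.0 here is an invented example)
    """
    # A slowdown costs log2(time_new / time_old) "doublings" of time;
    # a speedup is a negative number of doublings, i.e. a credit.
    doublings = math.log2(time_new / time_old)
    return measured_elo - doublings * elo_per_doubling


# A change worth +10 ELO at fixed depth, but 20% slower:
adjusted = time_adjusted_elo(10.0, time_new=1.2, time_old=1.0)
```

With these made-up numbers the 20% slowdown costs about 18 ELO, so a +10 fixed-depth gain comes out as a net loss, which is exactly the kind of trap this adjustment is meant to catch.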
I don't test based on fixed nodes, but the same principle would apply as long as you measure the time of the games. That is the important point: you must still measure time, because for chess programs it is one of the important variables. Evaluation or even search modifications impact the nodes per second and thus the strength of the program. One huge advantage of node-based levels is that you can test on any hardware and compare the results directly, as long as you don't care about time. Unfortunately, with chess programs, time is a very important variable for chess strength.
Fixed-depth testing is useful for certain things because you can get the impact of any change on the strength AND the time. With Fischer tests I only have the ELO to go by, which of course is the most useful thing of all, but sometimes you want to know how some heuristic breaks down. If I add some expensive calculation to the evaluation, I want to know how much it slows the program down and how much it gives the program in "raw" ELO (at some given fixed-depth level).
So for most evaluation changes I prefer testing first with fixed depth and then moving on to the Fischer tests. I have found that it always seems to work well this way: if something helps the fixed-depth (time adjusted) strength of the program, it will also play better at "real" time controls.
I will be the first to admit that all of this is imperfect. There are clearly some evaluation improvements that do not "kick in" much at short depths, the primary example being king safety. In fact, king safety may even hurt the program at really shallow depths, such as 3 or 4 ply. In my program it helps some at 7 ply, but at deeper levels it helps enormously. But heuristics like this seem to be the exception rather than the rule.
I don't think there is a real substitute for testing at real time controls similar to the levels you want your program to be good at. Unfortunately, you cannot develop at any kind of serious pace if you cannot produce thousands of test games, so compromise is, for most of us, a fact of life! It's way more important to get data that is statistically significant than to be overly anal about testing at long time controls, where the results are more or less random unless you made an enormous improvement.
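To put a number on "statistically significant": the ELO difference implied by a match score, together with a rough 95% confidence interval, can be estimated as below. This is a standard normal-approximation sketch of my own (the function names are mine), not the author's tester:

```python
import math


def elo_from_score(score):
    """ELO difference implied by a match score in (0, 1)."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def match_elo(wins, losses, draws):
    """Return (elo, (lo, hi)): the ELO estimate for a match result and
    a rough 95% confidence interval via the normal approximation."""
    n = wins + losses + draws
    score = (wins + 0.5 * draws) / n
    # Sample variance of the per-game result (1, 0, or 0.5).
    var = (wins * (1.0 - score) ** 2 + losses * score ** 2
           + draws * (0.5 - score) ** 2) / n
    se = math.sqrt(var / n)  # standard error of the mean score
    lo = elo_from_score(max(score - 1.96 * se, 1e-6))
    hi = elo_from_score(min(score + 1.96 * se, 1.0 - 1e-6))
    return elo_from_score(score), (lo, hi)


# 1000 games at 55%: about +35 ELO, but still a wide interval.
elo, (lo, hi) = match_elo(550, 450, 0)
```

Even 1000 games leaves an interval of roughly ±20 ELO around the estimate here, which is why small improvements demand so many games before you can trust them.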
jwes wrote: When normal tests show a particular change is not an improvement, there are three possibilities.
1. The change does not make the program better.
2. The change makes the program better but slows the program more than the change is worth.
3. The change makes the program better but some optimization weirdness causes the program to run more slowly.
Tests at a fixed number of nodes can separate cases 2 and 3 from case 1, and those cases can be saved and tried again later or possibly optimized better.
A related test that would be of interest is to pit a version built with PGO against the same version built without PGO, to get an estimate of the ELO difference that optimization weirdness alone can cause.
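For reference, a typical GCC profile-guided-optimization cycle looks like the sketch below. The `-fprofile-generate`/`-fprofile-use` flags are real GCC options, but the engine name, source file, and `bench` command are placeholders for whatever your build actually uses:

```shell
# 1. Build an instrumented binary that records profile data as it runs.
gcc -O3 -fprofile-generate -o engine_pgo engine.c

# 2. Exercise it on a representative workload (a fixed benchmark search,
#    for example) so the recorded profile reflects real search behavior.
./engine_pgo bench

# 3. Rebuild using the recorded profile: this is the "with PGO" version.
gcc -O3 -fprofile-use -o engine_with_pgo engine.c

# The "without PGO" baseline is simply an ordinary optimized build:
gcc -O3 -o engine_no_pgo engine.c
```

Playing `engine_with_pgo` against `engine_no_pgo` over a few thousand games would then bound how much ELO the optimizer's whims are worth, which is exactly the noise floor case 3 above worries about.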