A reason for testing at fixed number of nodes.

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

jwes
Posts: 778
Joined: Sat Jul 01, 2006 7:11 am

A reason for testing at fixed number of nodes.

Post by jwes »

When normal tests show a particular change is not an improvement, there are three possibilities.
1. The change does not make the program better.
2. The change makes the program better but slows the program more than the change is worth.
3. The change makes the program better but some optimization weirdness causes the program to run more slowly.

Tests at a fixed number of nodes can separate cases 2 and 3 from case 1, and those cases can be saved and tried again later, or possibly optimized better.

A related test of interest would be to play a version compiled with PGO against the same version without PGO, to estimate the Elo difference that optimization weirdness could cause.
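
These kinds of comparisons are usually quantified by mapping the match score to an Elo difference. As a reference point, here is a minimal Python sketch of the standard logistic Elo model; the sample score is invented:

```python
import math

def elo_diff(wins, losses, draws):
    """Estimate the Elo difference implied by a match score,
    using the standard logistic rating model."""
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games
    # Guard against perfect scores, where the model diverges.
    score = min(max(score, 1e-6), 1 - 1e-6)
    return -400.0 * math.log10(1.0 / score - 1.0)

# A fixed-node match scoring 520-480 over 1000 games:
print(round(elo_diff(wins=520, losses=480, draws=0), 1))  # ~13.9 Elo
```

Note how small the implied Elo difference is even for a 52% score, which is why the number of games matters so much.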
Ferdy
Posts: 4833
Joined: Sun Aug 10, 2008 3:15 pm
Location: Philippines

Re: A reason for testing at fixed number of nodes.

Post by Ferdy »

jwes wrote: When normal tests show a particular change is not an improvement, there are three possibilities.
1. The change does not make the program better.
2. The change makes the program better but slows the program more than the change is worth.
3. The change makes the program better but some optimization weirdness causes the program to run more slowly.

Tests at a fixed number of nodes can separate cases 2 and 3 from case 1, and those cases can be saved and tried again later, or possibly optimized better.

A related test of interest would be to play a version compiled with PGO against the same version without PGO, to estimate the Elo difference that optimization weirdness could cause.
Item 1 expansion: the change may not make the program better because of an existing bug. If you are confident the idea should work, just save it for now and retest later, once you happen to find and fix one or all of those bugs.

Comparing PGO and non-PGO compiles is interesting. I make the comparison by running a set of positions to a fixed depth only and checking the total time. Of course, it is better to test them in real games.
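
A fixed-depth timing comparison like this can be turned into a rough Elo estimate, if you assume some Elo gain per doubling of speed. The sketch below uses a rule-of-thumb 70 Elo per doubling; both that constant and the timing totals are assumptions for illustration, not measured values:

```python
import math

# Assumed Elo gain per doubling of search speed; the true value
# depends on the engine and time control, so treat it as a knob.
ELO_PER_DOUBLING = 70.0

def speedup_elo(time_plain, time_pgo, elo_per_doubling=ELO_PER_DOUBLING):
    """Convert a fixed-depth timing comparison (total seconds over
    the same position set) into a rough Elo estimate."""
    speedup = time_plain / time_pgo  # > 1.0 means the PGO build is faster
    return elo_per_doubling * math.log2(speedup)

# Hypothetical totals: 600 s without PGO, 540 s with PGO (~11% faster).
print(round(speedup_elo(600.0, 540.0), 1))  # ~10.6 Elo
```

This is only a back-of-the-envelope figure; as noted above, real games are the better test.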
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: A reason for testing at fixed number of nodes.

Post by bob »

jwes wrote:When normal tests show a particular change is not an improvement, there are three possibilities.
1. The change does not make the program better.
2. The change makes the program better but slows the program more than the change is worth.
3. The change makes the program better but some optimization weirdness causes the program to run more slowly.

Tests at a fixed number of nodes can separate cases 2 and 3 from case 1, and those cases can be saved and tried again later, or possibly optimized better.

A related test of interest would be to play a version compiled with PGO against the same version without PGO, to estimate the Elo difference that optimization weirdness could cause.
The inverse is true. If you slow the program down but test with fixed nodes, you will have no idea that you have actually _hurt_ performance overall until you start to test with a clock.

Personally, to me, it makes no sense to test in a different way from the way the program is used. Would you test a drag racer on a chassis dyno only? More horsepower means more speed, right? What about traction? Does that extra weight on the front hurt weight transfer?

I want to test like I am going to run. And currently, all chess games use time, not nodes.

I don't follow PGO at all. That only affects speed, never the nodes searched for a given depth. So how will using PGO or not influence the final game results since you will be searching the same number of nodes, where any optimizations are absolutely irrelevant since time doesn't count?
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: A reason for testing at fixed number of nodes.

Post by Don »

I use a combination of fixed depth testing and fischer time control. It depends on what I want to know. Fischer time control is very sane compared to sudden death. Because I have limited resources I sometimes test as fast as game in 6 seconds + 0.1 second increment. Typically I have 4 games being played simultaneously on a quad core machine and my own home brewed high quality tester that manages all of this.

When I run fixed depth testing, I also report a "time adjusted" ELO rating. It uses a formula based on how much expected ELO gain per doubling you should get. This has to be calibrated for each depth as you don't get as much for a doubling at deep levels as you would for a doubling at shallow levels. This is a very nice feature because it takes into consideration algorithms I may be testing that slow the program down but also boost the ELO rating - so this gives me a rough and ready calculation that I have found to be quite reliable.
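
Don does not give his formula, but a plausible reconstruction under the assumptions he describes (a per-depth Elo-per-doubling calibration, plus a log2 penalty for extra time) might look like the sketch below; the calibration table, values, and function name are all hypothetical:

```python
import math

# Hypothetical calibration: Elo gained per doubling of thinking time
# at a given fixed depth. Values are invented for illustration; a real
# table would come from self-play calibration runs at each depth.
ELO_PER_DOUBLING_AT_DEPTH = {6: 110.0, 8: 90.0, 10: 75.0, 12: 60.0}

def time_adjusted_elo(raw_elo, depth, time_new, time_base):
    """Penalize (or credit) a fixed-depth Elo result for the change
    in average time per game relative to the baseline version."""
    per_doubling = ELO_PER_DOUBLING_AT_DEPTH[depth]
    # A slower version pays log2(time_new / time_base) doublings.
    return raw_elo - per_doubling * math.log2(time_new / time_base)

# +12 raw Elo at depth 10, but the change made games 15% slower:
print(round(time_adjusted_elo(12.0, 10, time_new=1.15, time_base=1.0), 1))
```

Under these assumed numbers the 15% slowdown costs about 15 Elo, turning a +12 raw result into a small net loss, which is exactly the kind of trade-off the adjustment is meant to surface.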

I don't test based on fixed nodes, but the same principle would apply, as long as you measure the time of the games. But that is the important point: you must still measure time, because for chess programs it is one of the important variables. Evaluation or even search modifications impact the nodes per second and thus the strength of the program. One huge advantage of node-based levels is that you can test on any hardware and compare the results directly, as long as you don't care about time. Unfortunately, with chess programs, time is a very important variable for chess strength.

Fixed depth testing is useful for certain things because you can get the impact of any change on the strength AND the time. With Fischer tests I only have the ELO to go by, which of course is the most useful thing of all but sometimes you want to know how some heuristic breaks down. If I add some expensive calculation to the evaluation I want to know how much it slows the program down and how much it gives the program in "raw" elo (at some given fixed depth level.)

So for most evaluation things I prefer testing first with fixed depth and then moving on to the fischer tests. I have found that it always seems to work well this way, if something helps the fixed depth (time adjusted) strength of the program it will also play better at "real" time controls.

I will be the first to admit that all of this is imperfect. There are clearly some evaluation improvements that do not "kick in" much at short depths, the primary example being king safety. In fact, king safety may even hurt the program at really shallow depths, such as 3 or 4 ply. In my program it helps some at 7 ply, but at deeper levels it helps enormously. But heuristics like this seem to be the exception rather than the rule.

I don't think there is a real substitute for testing at real time control levels that are similar to levels you want your program to be good at. Unfortunately, you cannot develop at any kind of serious pace if you cannot produce thousands of test games, so compromise for most of us is a reality of life! It's way more important to get data that is statistically significant than to be overly anal about testing at long time controls, where the results are more or less random unless you made an enormous improvement.
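
To make "statistically significant" concrete: a quick way to see whether a result over thousands of games means anything is to put an error bar on it. This sketch uses a normal approximation to the score distribution (a common simplification); the game counts are invented:

```python
import math

def score_to_elo(score):
    """Standard logistic Elo model, guarded against perfect scores."""
    score = min(max(score, 1e-6), 1 - 1e-6)
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_error_bar(wins, losses, draws, z=1.96):
    """Approximate 95% confidence interval (in Elo) on a match result,
    using a normal approximation to the score distribution."""
    n = wins + losses + draws
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the score; draws reduce the variance.
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0.0 - score) ** 2) / n
    margin = z * math.sqrt(var / n)
    return score_to_elo(score - margin), score_to_elo(score + margin)

# 1000 games with 40% draws and a +2% score for the new version:
low, high = elo_error_bar(wins=310, losses=290, draws=400)
print(round(low, 1), round(high, 1))  # the interval still straddles zero
```

Even at 1000 games, a +2% score is not clearly distinguishable from no change, which is why so many games are needed before accepting a small improvement.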

jwes wrote:When normal tests show a particular change is not an improvement, there are three possibilities.
1. The change does not make the program better.
2. The change makes the program better but slows the program more than the change is worth.
3. The change makes the program better but some optimization weirdness causes the program to run more slowly.

Tests at a fixed number of nodes can separate cases 2 and 3 from case 1, and those cases can be saved and tried again later, or possibly optimized better.

A related test of interest would be to play a version compiled with PGO against the same version without PGO, to estimate the Elo difference that optimization weirdness could cause.
hgm
Posts: 27787
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: A reason for testing at fixed number of nodes.

Post by hgm »

Of course in real life almost every industrial product is tested under conditions that are not even remotely close to how it is intended to be used. Accelerated aging at elevated temperatures and under intense radiation, flexing components at a high rate with unusually high force to test for fatigue, etc. We would still live pretty much in the stone age if it took 80 years to figure out whether a new product was more durable than an old one with a lifetime of 60 years... But fortunately most people are more clever than that!
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: A reason for testing at fixed number of nodes.

Post by bob »

hgm wrote: Of course in real life almost every industrial product is tested under conditions that are not even remotely close to how it is intended to be used. Accelerated aging at elevated temperatures and under intense radiation, flexing components at a high rate with unusually high force to test for fatigue, etc. We would still live pretty much in the stone age if it took 80 years to figure out whether a new product was more durable than an old one with a lifetime of 60 years... But fortunately most people are more clever than that!
That's fine. But "fixed node testing" has nothing to do with the various forms of accelerated wear testing that are done. They test with real-world types of wear, at much higher than normal rates. Fixed nodes removes a major consideration (speed) from the equation. In that light, I have no idea what your "real world testing" has to do with testing using fixed nodes. Fixed nodes would be more like a retarded wear rate rather than an accelerated one. And it does not even test the most significant part of the engine, the speed.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: A reason for testing at fixed number of nodes.

Post by bob »

Don wrote:I use a combination of fixed depth testing and fischer time control. It depends on what I want to know. Fischer time control is very sane compared to sudden death. Because I have limited resources I sometimes test as fast as game in 6 seconds + 0.1 second increment. Typically I have 4 games being played simultaneously on a quad core machine and my own home brewed high quality tester that manages all of this.

When I run fixed depth testing, I also report a "time adjusted" ELO rating. It uses a formula based on how much expected ELO gain per doubling you should get. This has to be calibrated for each depth as you don't get as much for a doubling at deep levels as you would for a doubling at shallow levels. This is a very nice feature because it takes into consideration algorithms I may be testing that slow the program down but also boost the ELO rating - so this gives me a rough and ready calculation that I have found to be quite reliable.

I don't test based on fixed nodes, but the same principle would apply, as long as you measure the time of the games. But that is the important point: you must still measure time, because for chess programs it is one of the important variables. Evaluation or even search modifications impact the nodes per second and thus the strength of the program. One huge advantage of node-based levels is that you can test on any hardware and compare the results directly, as long as you don't care about time. Unfortunately, with chess programs, time is a very important variable for chess strength.

Fixed depth testing is useful for certain things because you can get the impact of any change on the strength AND the time. With Fischer tests I only have the ELO to go by, which of course is the most useful thing of all but sometimes you want to know how some heuristic breaks down. If I add some expensive calculation to the evaluation I want to know how much it slows the program down and how much it gives the program in "raw" elo (at some given fixed depth level.)

So for most evaluation things I prefer testing first with fixed depth and then moving on to the fischer tests. I have found that it always seems to work well this way, if something helps the fixed depth (time adjusted) strength of the program it will also play better at "real" time controls.
"time adjusted" being the operative words. And I am not sure this is so easy, since a program does not search at a constant speed throughout a game. My NPS can vary by a factor of 3 over the course of a game. Ferret (Bruce Moreland) varied much more, getting _way_ faster in endgames. Fixed nodes can never account for this, which always leaves distortions in the results that are difficult to impossible to quantify.

I will be the first to admit that all of this is imperfect. There are clearly some evaluation improvements that do not "kick in" much at short depths, the primary example being king safety. In fact, king safety may even hurt the program at really shallow depths, such as 3 or 4 ply. In my program it helps some at 7 ply, but at deeper levels it helps enormously. But heuristics like this seem to be the exception rather than the rule.

I don't think there is a real substitute for testing at real time control levels that are similar to levels you want your program to be good at. Unfortunately, you cannot develop at any kind of serious pace if you cannot produce thousands of test games, so compromise for most of us is a reality of life! It's way more important to get data that is statistically significant than to be overly anal about testing at long time controls, where the results are more or less random unless you made an enormous improvement.

jwes wrote:When normal tests show a particular change is not an improvement, there are three possibilities.
1. The change does not make the program better.
2. The change makes the program better but slows the program more than the change is worth.
3. The change makes the program better but some optimization weirdness causes the program to run more slowly.

Tests at a fixed number of nodes can separate cases 2 and 3 from case 1, and those cases can be saved and tried again later, or possibly optimized better.

A related test of interest would be to play a version compiled with PGO against the same version without PGO, to estimate the Elo difference that optimization weirdness could cause.
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: A reason for testing at fixed number of nodes.

Post by Don »

bob wrote:
jwes wrote:When normal tests show a particular change is not an improvement, there are three possibilities.
1. The change does not make the program better.
2. The change makes the program better but slows the program more than the change is worth.
3. The change makes the program better but some optimization weirdness causes the program to run more slowly.

Tests at a fixed number of nodes can separate cases 2 and 3 from case 1, and those cases can be saved and tried again later, or possibly optimized better.

A related test of interest would be to play a version compiled with PGO against the same version without PGO, to estimate the Elo difference that optimization weirdness could cause.
The inverse is true. If you slow the program down but test with fixed nodes, you will have no idea that you have actually _hurt_ performance overall until you start to test with a clock.

Personally, to me, it makes no sense to test in a different way from the way the program is used. Would you test a drag racer on a chassis dyno only? More horsepower means more speed, right? What about traction? Does that extra weight on the front hurt weight transfer?

I want to test like I am going to run. And currently, all chess games use time, not nodes.

I don't follow PGO at all. That only affects speed, never the nodes searched for a given depth. So how will using PGO or not influence the final game results since you will be searching the same number of nodes, where any optimizations are absolutely irrelevant since time doesn't count?
Bob,

A chassis dyno is still an important tool, and if you take all the instrumentation tools away from a racing team and tell them they must evaluate all changes by quarter-mile test runs, it would put them out of business.

You really need to know what is going on and only 1 kind of testing leaves you in the dark. I'll give you an example. Suppose I do implement some evaluation improvement and then test it on 40/2 games. I find that it is slightly weaker, so I throw it out? I would consider that pretty stupid because I would want to know WHY it tested weaker. Maybe the idea is a good one and the implementation is just too slow. I would want to know that before I simply discard the idea.

With my drag racer I want to know if some engine modification boosts the horsepower. The way to find that out is probably not by doing a quarter mile run. Maybe the chassis cannot handle the additional power or the power curve has changed and this changes the ideal setup. But I would want all the information I could get about this and that could not be provided by a single test.

So please tell me you do more than just time control games at one level.

- Don
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: A reason for testing at fixed number of nodes.

Post by Don »

bob wrote: "Time adjusted" being the operative words. And I am not sure this is so easy, since a program does not search at a constant speed throughout a game. My NPS can vary by a factor of 3 over the course of a game. Ferret (Bruce Moreland) varied much more, getting _way_ faster in endgames. Fixed nodes can never account for this, which always leaves distortions in the results that are difficult or impossible to quantify.
I'm not anal about this. I take every test I run with a grain of salt as there is no completely dependable way to test - and I am a huge advocate of mixing up a lot of different kinds of testing. I think you have to do this in order to understand your program.

Having said that, I have to also say that fixed depth testing has indeed proved to be a surprisingly reliable indicator of whether a change helped or not. It has not been as good at telling me how much it helps, but I run time control games to determine that (and to prove that it actually does help.) Maybe more to the point is that I can often eliminate a change from consideration without wasting enormous resources.

Don't forget that I don't have your resources. I don't have the luxury of running tens of thousands of long time control games, but even if I did I would run the dyno test first, then put her on the track.
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: A reason for testing at fixed number of nodes.

Post by Don »

hgm wrote: Of course in real life almost every industrial product is tested under conditions that are not even remotely close to how it is intended to be used. Accelerated aging at elevated temperatures and under intense radiation, flexing components at a high rate with unusually high force to test for fatigue, etc. We would still live pretty much in the stone age if it took 80 years to figure out whether a new product was more durable than an old one with a lifetime of 60 years... But fortunately most people are more clever than that!
I agree with this, even for chess programs. I sometimes run problem tests even though I learned years ago that performance on problem suites is somewhat, but not strongly, correlated with actual results. But it's still a very useful test that reveals performance bugs and other problems that DO affect actual results.