A reason for testing at fixed number of nodes.

jwes · Post by **jwes** » Fri Nov 06, 2009 10:52 pm

bob wrote:
jwes wrote:When normal tests show a particular change is not an improvement, there are three possibilities.
1. The change does not make the program better.
2. The change makes the program better but slows the program more than the change is worth.
3. The change makes the program better but some optimization weirdness causes the program to run more slowly.

Tests at fixed number of nodes can separate cases 2 and 3 from case 1 and those case can be saved and tried again later or possibly optimized better.

A related test that would be of interest would be to test a version with PGO against the same version without PGO to get an estimate of the ELO difference that could be caused by optimization weirdness.
The inverse is true. If you slow the program down, but test with fixed nodes, you will have no idea that you have actually _hurt_ performance overall, when you start to test with a clock.

Personally, to me, it makes no sense to test in a different way from the way the program is used. Would you test a drag racer on a chassis dyno only? More horsepower means more speed, right? What about traction? Does that extra weight on the front hurt weight transfer?

I want to test like I am going to run. And currently, all chess games use time, not nodes.

I don't follow PGO at all. That only affects speed, never the nodes searched for a given depth. So how will using PGO or not influence the final game results since you will be searching the same number of nodes, where any optimizations are absolutely irrelevant since time doesn't count?

I was not suggesting testing only with fixed number of nodes, but as a check on results from normal testing. The PGO was to get some estimate of how much results could be influenced by optimization effects, e.g if the difference were 10 ELO, then if the difference between two versions is 5 ELO, it could be explained by optimization differences and not any intrinsic difference between the strength of the versions. It makes me wonder if the problem of testing improvements to chess programs is NP complete.

bob · Post by **bob** » Sat Nov 07, 2009 12:53 am

Don wrote:
bob wrote:
jwes wrote:When normal tests show a particular change is not an improvement, there are three possibilities.
1. The change does not make the program better.
2. The change makes the program better but slows the program more than the change is worth.
3. The change makes the program better but some optimization weirdness causes the program to run more slowly.

Tests at fixed number of nodes can separate cases 2 and 3 from case 1 and those case can be saved and tried again later or possibly optimized better.

A related test that would be of interest would be to test a version with PGO against the same version without PGO to get an estimate of the ELO difference that could be caused by optimization weirdness.
The inverse is true. If you slow the program down, but test with fixed nodes, you will have no idea that you have actually _hurt_ performance overall, when you start to test with a clock.

Personally, to me, it makes no sense to test in a different way from the way the program is used. Would you test a drag racer on a chassis dyno only? More horsepower means more speed, right? What about traction? Does that extra weight on the front hurt weight transfer?

I want to test like I am going to run. And currently, all chess games use time, not nodes.

I don't follow PGO at all. That only affects speed, never the nodes searched for a given depth. So how will using PGO or not influence the final game results since you will be searching the same number of nodes, where any optimizations are absolutely irrelevant since time doesn't count?
Bob,

A chassis dyno is still an important tool and if you take all the instrumentation tools away from a racing team and tell them that they must evaluate all changes by quarter mile test runs, it would put them out of business.

news to me and a lot of my drag-racing friends. None of us visit a chassis dyno. Too expensive. Too much time. And we _still_ have to make many many trips down the strip.

You really need to know what is going on and only 1 kind of testing leaves you in the dark. I'll give you an example. Suppose I do implement some evaluation improvement and then test it on 40/2 games. I find that it is slightly weaker, so I throw it out? I would consider that pretty stupid because I would want to know WHY it tested weaker. Maybe the idea is a good one and the implementation is just too slow. I would want to know that before I simply discard the idea.

With my drag racer I want to know if some engine modification boosts the horsepower. The way to find that out is probably not by doing a quarter mile run. Maybe the chassis cannot handle the additional power or the power curve has changed and this changes the ideal setup. But I would want all the information I could get about this and that could not be provided by a single test.

Actually, you can learn _everything_ about the motor by running down the strip. There is well-known physics behind all of this. Visit your local drag strip for info. Computerization is amazingly into this activity.

So please tell me you do more than just time control games at one level.

- Don

I have already explained how I test. Most is very fast, so weed out things that hurt. But I also do confirmation tests to verify that we didn't break things at slower time controls. I test with and without increment if we fiddle with time allocation to make sure we didn't help one and break the other, etc...

bob · Post by **bob** » Sat Nov 07, 2009 12:59 am

Don wrote:
bob wrote: "time adjusted" being the operative words. And I am not sure this is so easy, since a program does not search at a constant speed throughout a game. My NPS can vary by a factor of 3 over the course of a game. Ferret (Bruce Moreland) varied much more, getting _way_ faster in endgames. Fixed nodes can never account for this, which always leaves distortions in the results that are difficult to impossible to quantify.
I'm not anal about this. I take every test I run with a grain of salt as there is no completely dependable way to test - and I am a huge advocate of mixing up a lot of different kinds of testing. I think you have to do this in order to understand your program.

Having said that, I have to also say that fixed depth testing has indeed proved to be a surprisingly reliable indicator of whether a change helped or not. It has not been as good at telling me how much it helps, but I run time control games to determine that (and to prove that it actually does help.) Maybe more to the point is that I can often eliminate a change from consideration without wasting enormous resources.

Don't forget that I don't have your resources. I don't have the luxury of running tens of thousands of long time control games, but even if I did I would run the dyno test first, then put her on the track.

Most of us put 'em on the track _first_. Horsepower is the least of the issues. One can make arbitrarily large horsepower, within a given class set of rules, but it doesn't help a bit if you can't get it to the ground... Everybody makes tons of horsepower, making it get you down the track is a completely different matter. Very much like you can add all sorts of stuff to the eval, but if it slows you down, it is a net loss...

bob · Post by **bob** » Sat Nov 07, 2009 1:00 am

jwes wrote:
bob wrote:
jwes wrote:When normal tests show a particular change is not an improvement, there are three possibilities.
1. The change does not make the program better.
2. The change makes the program better but slows the program more than the change is worth.
3. The change makes the program better but some optimization weirdness causes the program to run more slowly.

Tests at fixed number of nodes can separate cases 2 and 3 from case 1 and those case can be saved and tried again later or possibly optimized better.

A related test that would be of interest would be to test a version with PGO against the same version without PGO to get an estimate of the ELO difference that could be caused by optimization weirdness.
The inverse is true. If you slow the program down, but test with fixed nodes, you will have no idea that you have actually _hurt_ performance overall, when you start to test with a clock.

Personally, to me, it makes no sense to test in a different way from the way the program is used. Would you test a drag racer on a chassis dyno only? More horsepower means more speed, right? What about traction? Does that extra weight on the front hurt weight transfer?

I want to test like I am going to run. And currently, all chess games use time, not nodes.

I don't follow PGO at all. That only affects speed, never the nodes searched for a given depth. So how will using PGO or not influence the final game results since you will be searching the same number of nodes, where any optimizations are absolutely irrelevant since time doesn't count?
I was not suggesting testing only with fixed number of nodes, but as a check on results from normal testing. The PGO was to get some estimate of how much results could be influenced by optimization effects, e.g if the difference were 10 ELO, then if the difference between two versions is 5 ELO, it could be explained by optimization differences and not any intrinsic difference between the strength of the versions. It makes me wonder if the problem of testing improvements to chess programs is NP complete.

if you test at fixed nodes, optimizations are meaningless. I am not quite sure what you mean...

jwes · Post by **jwes** » Sat Nov 07, 2009 6:51 am

bob wrote:
jwes wrote:
bob wrote:
jwes wrote:When normal tests show a particular change is not an improvement, there are three possibilities.
1. The change does not make the program better.
2. The change makes the program better but slows the program more than the change is worth.
3. The change makes the program better but some optimization weirdness causes the program to run more slowly.

Tests at fixed number of nodes can separate cases 2 and 3 from case 1 and those case can be saved and tried again later or possibly optimized better.

A related test that would be of interest would be to test a version with PGO against the same version without PGO to get an estimate of the ELO difference that could be caused by optimization weirdness.
The inverse is true. If you slow the program down, but test with fixed nodes, you will have no idea that you have actually _hurt_ performance overall, when you start to test with a clock.

Personally, to me, it makes no sense to test in a different way from the way the program is used. Would you test a drag racer on a chassis dyno only? More horsepower means more speed, right? What about traction? Does that extra weight on the front hurt weight transfer?

I want to test like I am going to run. And currently, all chess games use time, not nodes.

I don't follow PGO at all. That only affects speed, never the nodes searched for a given depth. So how will using PGO or not influence the final game results since you will be searching the same number of nodes, where any optimizations are absolutely irrelevant since time doesn't count?
I was not suggesting testing only with fixed number of nodes, but as a check on results from normal testing. The PGO was to get some estimate of how much results could be influenced by optimization effects, e.g if the difference were 10 ELO, then if the difference between two versions is 5 ELO, it could be explained by optimization differences and not any intrinsic difference between the strength of the versions. It makes me wonder if the problem of testing improvements to chess programs is NP complete.
if you test at fixed nodes, optimizations are meaningless. I am not quite sure what you mean...

I am not explaining this well. I was proposing two related but different ideas.
1. Testing different versions with fixed nodes to get some abstract idea of strength which you can contrast with the real strength returned by normal testing.
2. Testing the same version with and without PGO using normal time controls to get some estimate of how much variability could be introduced by optimization oddities.

Don · Post by **Don** » Sat Nov 07, 2009 7:40 am

bob wrote:
Don wrote:
bob wrote: "time adjusted" being the operative words. And I am not sure this is so easy, since a program does not search at a constant speed throughout a game. My NPS can vary by a factor of 3 over the course of a game. Ferret (Bruce Moreland) varied much more, getting _way_ faster in endgames. Fixed nodes can never account for this, which always leaves distortions in the results that are difficult to impossible to quantify.
I'm not anal about this. I take every test I run with a grain of salt as there is no completely dependable way to test - and I am a huge advocate of mixing up a lot of different kinds of testing. I think you have to do this in order to understand your program.

Having said that, I have to also say that fixed depth testing has indeed proved to be a surprisingly reliable indicator of whether a change helped or not. It has not been as good at telling me how much it helps, but I run time control games to determine that (and to prove that it actually does help.) Maybe more to the point is that I can often eliminate a change from consideration without wasting enormous resources.

Don't forget that I don't have your resources. I don't have the luxury of running tens of thousands of long time control games, but even if I did I would run the dyno test first, then put her on the track.
Most of us put 'em on the track _first_. Horsepower is the least of the issues. One can make arbitrarily large horsepower, within a given class set of rules, but it doesn't help a bit if you can't get it to the ground... Everybody makes tons of horsepower, making it get you down the track is a completely different matter. Very much like you can add all sorts of stuff to the eval, but if it slows you down, it is a net loss...

I think you are putting more emphasis on the implementation than the idea itself if you only do time control testing. I use fixed depth testing as a tool to isolate the idea from the implementation. if the idea is good I can usually find a good implementation. If the idea is bad then I don't have to waste a lot of time with time control games.

In your example of adding all sort of stuff to the evaluation, you only know that it doesn't work - you don't know why. If it slows you down it is a net loss as you say, but you don't even know that much with your kind of testing because it doesn't reveal if it slowed you down or whether the evaluation itself was the problem.

wgarvin · Post by **wgarvin** » Sat Nov 07, 2009 8:23 am

jwes wrote:
bob wrote: if you test at fixed nodes, optimizations are meaningless. I am not quite sure what you mean...
I am not explaining this well. I was proposing two related but different ideas.
1. Testing different versions with fixed nodes to get some abstract idea of strength which you can contrast with the real strength returned by normal testing.
2. Testing the same version with and without PGO using normal time controls to get some estimate of how much variability could be introduced by optimization oddities.

I don't think that playing PGO vs. non-PGO is going to tell you anything meaningful about the number of ELO your engine gains because you are using PGO. If you do a full set of tests with the PGO and non-PGO builds each playing against the same set of opponents (under normal conditions, not with fixed nodes) then it might show you something about it.

Still, I would rather try and design an engine that performs well without any PGO. I have an innate distrust of profile-guided optimization, because the performance you get depends on accurate profile data, and also because it is the most likely way I've ever heard of to expose compiler optimizer bugs. And while they are rare, it does happen. Example: we just shipped a AAA console game in which we found an optimizer bug that only occurred in our optimized LTCG builds. The non-LTCG build was very stable, but the LTCG build slowly leaked memory until after 6 or 7 hours of play it would run out of memory and crash. This bug was discovered only a few days before the game shipped, but fortunately the cause of it was isolated quickly and shown to be a compiler bug. One of our programmers used an in-house tool to find the leak within a couple of hours, and he then examined the generated code and discovered that for a particular inlined function that manipulated a reference count, the generated code was bogus (instead of loading the proper field into a register and adjusting it there, it simply loaded a constant into the register--the constant was not zero, so the "delete if the count is now zero" code could never be executed). Other very similar functions throughout the codebase were compiled correctly, and several other types of build with optimizations on (but not LTCG) did not have the miscompiled code in them. Imagine how scary it is to find a bug like the above in the generated code of a large program (in our case, several millions of lines of C++). I am still fearful that we got bit somewhere else by the same bug and just didn't find it. However, extensive testing did find the crash above, in time for us to work around it.

Anyway, with other kinds of build--even LTCG--you get a somewhat reproducible result, but with PGO all bets are off. A slight difference in your profile data might cause the compiler to rearrange generated code in unexpected ways, and the optimizer might make different decisions, so if there are bugs they might manifest themselves in one build but not in another, or they might occur in different places but not be very reproducible from build to build. So extensive testing of a PGO build from yesterday (or last week) does not give much confidence about the behaviour of a new PGO build made today with today's profiling data.

In shipping games, we rely on that kind of testing to build up confidence that we have a good build, and towards the end, we change as little as possible in the code. With PGO we would never have got the kind of reproducibility needed to feel safe about shipping the final build after working around that particular case of bad codegen. Unless you practically treat the profile data as part of your source code, and refuse to change it beyond a certain point. I seem to encounter about one compiler bug a year now, and I don't think I am ever going to trust PGO. Still, with a chess engine I guess it is small enough for it to be practical to do extensive testing on a single build and convince yourself that the generated code of that particular build wasn't buggy.

jwes · Post by **jwes** » Sat Nov 07, 2009 10:05 am

wgarvin wrote:
jwes wrote:
bob wrote: if you test at fixed nodes, optimizations are meaningless. I am not quite sure what you mean...
I am not explaining this well. I was proposing two related but different ideas.
1. Testing different versions with fixed nodes to get some abstract idea of strength which you can contrast with the real strength returned by normal testing.
2. Testing the same version with and without PGO using normal time controls to get some estimate of how much variability could be introduced by optimization oddities.
I don't think that playing PGO vs. non-PGO is going to tell you anything meaningful about the number of ELO your engine gains because you are using PGO. If you do a full set of tests with the PGO and non-PGO builds each playing against the same set of opponents (under normal conditions, not with fixed nodes) then it might show you something about it.

I assumed that by now everyone knew not to test versions directly against each other. The problem I was trying to get a handle on is that changing code can noticeably change the speed of the program by altering the location of code and data in the processor even if that code is not executed. This test was my idea on how to estimate how large that change could be. If someone has a better way, I would be glad to read it.

Martin Brown · Post by **Martin Brown** » Sat Nov 07, 2009 11:38 am

bob wrote:
jwes wrote:When normal tests show a particular change is not an improvement, there are three possibilities.
1. The change does not make the program better.
2. The change makes the program better but slows the program more than the change is worth.
3. The change makes the program better but some optimization weirdness causes the program to run more slowly.

Tests at fixed number of nodes can separate cases 2 and 3 from case 1 and those case can be saved and tried again later or possibly optimized better.

A related test that would be of interest would be to test a version with PGO against the same version without PGO to get an estimate of the ELO difference that could be caused by optimization weirdness.
The inverse is true. If you slow the program down, but test with fixed nodes, you will have no idea that you have actually _hurt_ performance overall, when you start to test with a clock.

Personally, to me, it makes no sense to test in a different way from the way the program is used. Would you test a drag racer on a chassis dyno only? More horsepower means more speed, right? What about traction? Does that extra weight on the front hurt weight transfer?

I want to test like I am going to run. And currently, all chess games use time, not nodes.

I don't follow PGO at all. That only affects speed, never the nodes searched for a given depth. So how will using PGO or not influence the final game results since you will be searching the same number of nodes, where any optimizations are absolutely irrelevant since time doesn't count?

Comment on the OPs possibilities should also include that the test set does not adequately test the improvement made. And I would place far more credibility on profiling for figuring out introduced bottlenecks than fixed number of nodes testing - which seems extremely artificial. For the beginners here is there a list of common chess acronyms like eg PGO?

For the purposes of testing the basics. And my program is *very* basic I find fixed depth for a range of depths helpful for analysing test positions and then comparing time taken to solve a complete test set against score obtained. This makes it easy to see if an "improvement" works or not over an ensemble of tricky game positions.

BTW What is the Bratko-Kopek (sp?) test as applied to chess engines?

mjlef · Post by **mjlef** » Sat Nov 07, 2009 5:07 pm

My basic scheme on testing depends on the nature of the change.

If the change is purely an evaluation change, a fast set of fixed depth games can quickly tell you if the idea is worthwhile. Of course, it will not reveal if the change slowed the program down so much that it will hurt overall strength, so a followup run using timed searches is needed.

If a change is search related, running fixed depth searches and comparing node counts can hint if the idea at least saves nodes. If more programs supported fixed node testing I would use that to get an idea if the search change improved things or not. Again, a followup timed match is needed to confirm the idea.

I never have enough computer resources to test most of the ideas I come up with. I have cluster envy.

Mark

A reason for testing at fixed number of nodes.

Re: A reason for testing at fixed number of nodes.

Re: A reason for testing at fixed number of nodes.

Re: A reason for testing at fixed number of nodes.

Re: A reason for testing at fixed number of nodes.

Re: A reason for testing at fixed number of nodes.

Re: A reason for testing at fixed number of nodes.

Re: A reason for testing at fixed number of nodes.

Re: A reason for testing at fixed number of nodes.

Re: A reason for testing at fixed number of nodes.

Re: A reason for testing at fixed number of nodes.