A reason for testing at fixed number of nodes.

bob · Post by **bob** » Sat Nov 07, 2009 8:34 pm

jwes wrote:
bob wrote:
jwes wrote:
bob wrote:
jwes wrote:When normal tests show a particular change is not an improvement, there are three possibilities.
1. The change does not make the program better.
2. The change makes the program better but slows the program more than the change is worth.
3. The change makes the program better but some optimization weirdness causes the program to run more slowly.

Tests at fixed number of nodes can separate cases 2 and 3 from case 1 and those case can be saved and tried again later or possibly optimized better.

A related test that would be of interest would be to test a version with PGO against the same version without PGO to get an estimate of the ELO difference that could be caused by optimization weirdness.
The inverse is true. If you slow the program down, but test with fixed nodes, you will have no idea that you have actually _hurt_ performance overall, when you start to test with a clock.

Personally, to me, it makes no sense to test in a different way from the way the program is used. Would you test a drag racer on a chassis dyno only? More horsepower means more speed, right? What about traction? Does that extra weight on the front hurt weight transfer?

I want to test like I am going to run. And currently, all chess games use time, not nodes.

I don't follow PGO at all. That only affects speed, never the nodes searched for a given depth. So how will using PGO or not influence the final game results since you will be searching the same number of nodes, where any optimizations are absolutely irrelevant since time doesn't count?
I was not suggesting testing only with fixed number of nodes, but as a check on results from normal testing. The PGO was to get some estimate of how much results could be influenced by optimization effects, e.g if the difference were 10 ELO, then if the difference between two versions is 5 ELO, it could be explained by optimization differences and not any intrinsic difference between the strength of the versions. It makes me wonder if the problem of testing improvements to chess programs is NP complete.
if you test at fixed nodes, optimizations are meaningless. I am not quite sure what you mean...
I am not explaining this well. I was proposing two related but different ideas.
1. Testing different versions with fixed nodes to get some abstract idea of strength which you can contrast with the real strength returned by normal testing.
2. Testing the same version with and without PGO using normal time controls to get some estimate of how much variability could be introduced by optimization oddities.

Aha, makes more sense. Typically I see 10%-15% speed improvements when using PGO. That produces a small but measurable Elo improvement. I might try to test this this weekend and post the results, just for interest.

bob · Post by **bob** » Sat Nov 07, 2009 8:36 pm

Martin Brown wrote:
bob wrote:
jwes wrote:When normal tests show a particular change is not an improvement, there are three possibilities.
1. The change does not make the program better.
2. The change makes the program better but slows the program more than the change is worth.
3. The change makes the program better but some optimization weirdness causes the program to run more slowly.

Tests at fixed number of nodes can separate cases 2 and 3 from case 1 and those case can be saved and tried again later or possibly optimized better.

A related test that would be of interest would be to test a version with PGO against the same version without PGO to get an estimate of the ELO difference that could be caused by optimization weirdness.
The inverse is true. If you slow the program down, but test with fixed nodes, you will have no idea that you have actually _hurt_ performance overall, when you start to test with a clock.

Personally, to me, it makes no sense to test in a different way from the way the program is used. Would you test a drag racer on a chassis dyno only? More horsepower means more speed, right? What about traction? Does that extra weight on the front hurt weight transfer?

I want to test like I am going to run. And currently, all chess games use time, not nodes.

I don't follow PGO at all. That only affects speed, never the nodes searched for a given depth. So how will using PGO or not influence the final game results since you will be searching the same number of nodes, where any optimizations are absolutely irrelevant since time doesn't count?
Comment on the OPs possibilities should also include that the test set does not adequately test the improvement made. And I would place far more credibility on profiling for figuring out introduced bottlenecks than fixed number of nodes testing - which seems extremely artificial. For the beginners here is there a list of common chess acronyms like eg PGO?

For the purposes of testing the basics. And my program is *very* basic I find fixed depth for a range of depths helpful for analysing test positions and then comparing time taken to solve a complete test set against score obtained. This makes it easy to see if an "improvement" works or not over an ensemble of tricky game positions.

BTW What is the Bratko-Kopek (sp?) test as applied to chess engines?

PGO (Profile Guided Optimizations) is not a chess term. It is a common compiler term.

bob · Post by **bob** » Sat Nov 07, 2009 8:42 pm

Don wrote:
bob wrote:
Don wrote:
bob wrote: "time adjusted" being the operative words. And I am not sure this is so easy, since a program does not search at a constant speed throughout a game. My NPS can vary by a factor of 3 over the course of a game. Ferret (Bruce Moreland) varied much more, getting _way_ faster in endgames. Fixed nodes can never account for this, which always leaves distortions in the results that are difficult to impossible to quantify.
I'm not anal about this. I take every test I run with a grain of salt as there is no completely dependable way to test - and I am a huge advocate of mixing up a lot of different kinds of testing. I think you have to do this in order to understand your program.

Having said that, I have to also say that fixed depth testing has indeed proved to be a surprisingly reliable indicator of whether a change helped or not. It has not been as good at telling me how much it helps, but I run time control games to determine that (and to prove that it actually does help.) Maybe more to the point is that I can often eliminate a change from consideration without wasting enormous resources.

Don't forget that I don't have your resources. I don't have the luxury of running tens of thousands of long time control games, but even if I did I would run the dyno test first, then put her on the track.
Most of us put 'em on the track _first_. Horsepower is the least of the issues. One can make arbitrarily large horsepower, within a given class set of rules, but it doesn't help a bit if you can't get it to the ground... Everybody makes tons of horsepower, making it get you down the track is a completely different matter. Very much like you can add all sorts of stuff to the eval, but if it slows you down, it is a net loss...
I think you are putting more emphasis on the implementation than the idea itself if you only do time control testing. I use fixed depth testing as a tool to isolate the idea from the implementation. if the idea is good I can usually find a good implementation. If the idea is bad then I don't have to waste a lot of time with time control games.

In your example of adding all sort of stuff to the evaluation, you only know that it doesn't work - you don't know why. If it slows you down it is a net loss as you say, but you don't even know that much with your kind of testing because it doesn't reveal if it slowed you down or whether the evaluation itself was the problem.

I simply don't try to write "sloppy implementations" to get a quick test done. I(AKA rapid prototyping in the software world). I start with the best I can come up with, as a good idea that can't be made fast is not going to help. So I want to make sure that the idea is good, and that a good implementation of that does not hurt speed enough to offset the original idea's gain. Every idea has a pair of offsetting features. It hopefully improves the skill level of the program, but it has a cost in terms of speed. What good does it do to find a good idea (say a very complex mobility calculation that works great in fixed nodes where speed is irrelevant) but which can't be implemented in any way that doesn't hurt Elo more because of the speed loss???

bob · Post by **bob** » Sat Nov 07, 2009 8:45 pm

mjlef wrote:My basic scheme on testing depends on the nature of the change.

If the change is purely an evaluation change, a fast set of fixed depth games can quickly tell you if the idea is worthwhile. Of course, it will not reveal if the change slowed the program down so much that it will hurt overall strength, so a followup run using timed searches is needed.

If a change is search related, running fixed depth searches and comparing node counts can hint if the idea at least saves nodes. If more programs supported fixed node testing I would use that to get an idea if the search change improved things or not. Again, a followup timed match is needed to confirm the idea.

I never have enough computer resources to test most of the ideas I come up with. I have cluster envy.

Mark

Crafty has always had this ability. But I have only used it for test sets and such, not for games, since nodes means different things for different programs.

Uri Blass · Post by **Uri Blass** » Sat Nov 07, 2009 9:14 pm

bob wrote:
Don wrote:
bob wrote:
Don wrote:
bob wrote: "time adjusted" being the operative words. And I am not sure this is so easy, since a program does not search at a constant speed throughout a game. My NPS can vary by a factor of 3 over the course of a game. Ferret (Bruce Moreland) varied much more, getting _way_ faster in endgames. Fixed nodes can never account for this, which always leaves distortions in the results that are difficult to impossible to quantify.
I'm not anal about this. I take every test I run with a grain of salt as there is no completely dependable way to test - and I am a huge advocate of mixing up a lot of different kinds of testing. I think you have to do this in order to understand your program.

Having said that, I have to also say that fixed depth testing has indeed proved to be a surprisingly reliable indicator of whether a change helped or not. It has not been as good at telling me how much it helps, but I run time control games to determine that (and to prove that it actually does help.) Maybe more to the point is that I can often eliminate a change from consideration without wasting enormous resources.

Don't forget that I don't have your resources. I don't have the luxury of running tens of thousands of long time control games, but even if I did I would run the dyno test first, then put her on the track.
Most of us put 'em on the track _first_. Horsepower is the least of the issues. One can make arbitrarily large horsepower, within a given class set of rules, but it doesn't help a bit if you can't get it to the ground... Everybody makes tons of horsepower, making it get you down the track is a completely different matter. Very much like you can add all sorts of stuff to the eval, but if it slows you down, it is a net loss...
I think you are putting more emphasis on the implementation than the idea itself if you only do time control testing. I use fixed depth testing as a tool to isolate the idea from the implementation. if the idea is good I can usually find a good implementation. If the idea is bad then I don't have to waste a lot of time with time control games.

In your example of adding all sort of stuff to the evaluation, you only know that it doesn't work - you don't know why. If it slows you down it is a net loss as you say, but you don't even know that much with your kind of testing because it doesn't reveal if it slowed you down or whether the evaluation itself was the problem.
I simply don't try to write "sloppy implementations" to get a quick test done. I(AKA rapid prototyping in the software world). I start with the best I can come up with, as a good idea that can't be made fast is not going to help. So I want to make sure that the idea is good, and that a good implementation of that does not hurt speed enough to offset the original idea's gain. Every idea has a pair of offsetting features. It hopefully improves the skill level of the program, but it has a cost in terms of speed. What good does it do to find a good idea (say a very complex mobility calculation that works great in fixed nodes where speed is irrelevant) but which can't be implemented in any way that doesn't hurt Elo more because of the speed loss???

Maybe some change in the evaluation that help at fixed number of nodes
but lose few elo at fast time control can help at longer time control.

The question is if the slow down is the same at longer time control.

I think that at longer time control you may more often get the evaluation from the hash tables so the speed difference in nodes per second may be smaller.

Don · Post by **Don** » Sat Nov 07, 2009 11:08 pm

bob wrote: What good does it do to find a good idea (say a very complex mobility calculation that works great in fixed nodes where speed is irrelevant) but which can't be implemented in any way that doesn't hurt Elo more because of the speed loss???

It's rather annoying that you present that scenario as if you have any clue about why it doesn't work. You don't even know if it's the idea that is flawed or the implementation so you have no basis for presenting that scenario.

However, I could present it to you! Because I can TELL you whether the idea is good and then I can tell you if it was difficult finding an efficient implementation or not.

I would think any good engineer would want to know WHY it worked or not and would not be satisfied with being in ignorance about this.

- Don

bob · Post by **bob** » Sun Nov 08, 2009 4:09 am

Uri Blass wrote:
bob wrote:
Don wrote:
bob wrote:
Don wrote:
bob wrote: "time adjusted" being the operative words. And I am not sure this is so easy, since a program does not search at a constant speed throughout a game. My NPS can vary by a factor of 3 over the course of a game. Ferret (Bruce Moreland) varied much more, getting _way_ faster in endgames. Fixed nodes can never account for this, which always leaves distortions in the results that are difficult to impossible to quantify.
I'm not anal about this. I take every test I run with a grain of salt as there is no completely dependable way to test - and I am a huge advocate of mixing up a lot of different kinds of testing. I think you have to do this in order to understand your program.

Having said that, I have to also say that fixed depth testing has indeed proved to be a surprisingly reliable indicator of whether a change helped or not. It has not been as good at telling me how much it helps, but I run time control games to determine that (and to prove that it actually does help.) Maybe more to the point is that I can often eliminate a change from consideration without wasting enormous resources.

Don't forget that I don't have your resources. I don't have the luxury of running tens of thousands of long time control games, but even if I did I would run the dyno test first, then put her on the track.
Most of us put 'em on the track _first_. Horsepower is the least of the issues. One can make arbitrarily large horsepower, within a given class set of rules, but it doesn't help a bit if you can't get it to the ground... Everybody makes tons of horsepower, making it get you down the track is a completely different matter. Very much like you can add all sorts of stuff to the eval, but if it slows you down, it is a net loss...
I think you are putting more emphasis on the implementation than the idea itself if you only do time control testing. I use fixed depth testing as a tool to isolate the idea from the implementation. if the idea is good I can usually find a good implementation. If the idea is bad then I don't have to waste a lot of time with time control games.

In your example of adding all sort of stuff to the evaluation, you only know that it doesn't work - you don't know why. If it slows you down it is a net loss as you say, but you don't even know that much with your kind of testing because it doesn't reveal if it slowed you down or whether the evaluation itself was the problem.
I simply don't try to write "sloppy implementations" to get a quick test done. I(AKA rapid prototyping in the software world). I start with the best I can come up with, as a good idea that can't be made fast is not going to help. So I want to make sure that the idea is good, and that a good implementation of that does not hurt speed enough to offset the original idea's gain. Every idea has a pair of offsetting features. It hopefully improves the skill level of the program, but it has a cost in terms of speed. What good does it do to find a good idea (say a very complex mobility calculation that works great in fixed nodes where speed is irrelevant) but which can't be implemented in any way that doesn't hurt Elo more because of the speed loss???
Maybe some change in the evaluation that help at fixed number of nodes
but lose few elo at fast time control can help at longer time control.

The question is if the slow down is the same at longer time control.

I think that at longer time control you may more often get the evaluation from the hash tables so the speed difference in nodes per second may be smaller.

How can eval changes speed up or slow down at different time controls? For every 100 nodes you search you do N evaluations. Time control doesn't change the proportion significantly. So speed won't change. If you do a change that works at fast games but not at slow, this is not going to be caught by a fixed node search. This is a different issue completely and can only be seen by playing the games at different time controls. I notice no difference in hash hits at longer time controls. In fact, it probably goes down since a hash table is rarely big enough given today's hardware speeds, and overwriting is going to occur more at deeper depths.

bob · Post by **bob** » Sun Nov 08, 2009 4:21 am

Don wrote:
bob wrote: What good does it do to find a good idea (say a very complex mobility calculation that works great in fixed nodes where speed is irrelevant) but which can't be implemented in any way that doesn't hurt Elo more because of the speed loss???
It's rather annoying that you present that scenario as if you have any clue about why it doesn't work. You don't even know if it's the idea that is flawed or the implementation so you have no basis for presenting that scenario.

However, I could present it to you! Because I can TELL you whether the idea is good and then I can tell you if it was difficult finding an efficient implementation or not.

I would think any good engineer would want to know WHY it worked or not and would not be satisfied with being in ignorance about this.

- Don

Take whatever "feelings" you have, and do whatever you want with them. I do know how to test a program. And I understand quite clearly the many deficiencies of fixed-node searches. They are great for debugging. SO that you have perfect reproducibility. They are terrible for actually measuring improvements, because they completely hide the speed issue. If you don't get that, that's fine by me. I think the point is recognized by most. And that's all I wanted to see happen. Other than debugging, I see absolutely no benefit to testing like this. Yes it can be convenient on non-dedicated hardware so that system load doesn't change the shape of the game tree. But it distorts so much due to hiding the time cost...

I'm not interested in having to test the same idea several times. Once with a sloppy implementation, using fixed nodes. Then moving to timed matches. And if things are worse, rewriting to make it more efficient (if possible) and then testing a 3rd time. I go directly to #3 and do the test once and accept it or throw it out and move on. Seems much more efficient to me, even though I could run those 3-test approaches faster than most anyone by using our cluster(s).

As for the "I have no idea ..." You are correct. I have no idea what you are talking about. I pretty well know what I'm doing. "implementation flaws" are rare. They do happen. But fixed node searches are not going to help me find that any quicker than a timed test. And a timed test is also going to highlight performance losses as well. I don't have to be told about difficulty of implementations. Because when one of us hatches an idea to test, we spend some time actually thinking about the most efficient way to implement that particular idea. And if it is infeasible to do so, it doesn't take much experience at all to figure out that the idea is a no-go due to computational costs, without doing any testing at all. I could give examples, such as the Cray Blitz square-by-square mobility evaluation that simply can't be done on a PC without murdering performance. There are lots of others.

If you are happy with fixed-node tests, fine. I consider them just as useless as fixed-depth searches when trying to evaluate a new idea. But both approaches (depth or nodes) is useful when chasing a bug that is only rarely seen. But IMHO not very useful for evaluating search/evaluation/etc changes themselves.

I don't like the idea of trying to do fixed node searches, then trying to somehow finagle the numbers to reflect real time usage. Too much fudging going on there. Much easier to just test as it has to play in real tournaments and go directly to the final answer with no guesswork.

IMHO of course.

Dirt · Post by **Dirt** » Sun Nov 08, 2009 9:25 am

jwes wrote:It makes me wonder if the problem of testing improvements to chess programs is NP complete.

So you think that finding a method to tell which of two programs is better will lead to a polynomial time algorithm for 3SAT? I'm very skeptical of that.

It's not clear to me how you would formalize the problem so that it's even in NP.

Uri Blass · Post by **Uri Blass** » Sun Nov 08, 2009 11:33 am

bob wrote:
Don wrote:
bob wrote: What good does it do to find a good idea (say a very complex mobility calculation that works great in fixed nodes where speed is irrelevant) but which can't be implemented in any way that doesn't hurt Elo more because of the speed loss???
It's rather annoying that you present that scenario as if you have any clue about why it doesn't work. You don't even know if it's the idea that is flawed or the implementation so you have no basis for presenting that scenario.

However, I could present it to you! Because I can TELL you whether the idea is good and then I can tell you if it was difficult finding an efficient implementation or not.

I would think any good engineer would want to know WHY it worked or not and would not be satisfied with being in ignorance about this.

- Don
And if it is infeasible to do so, it doesn't take much experience at all to figure out that the idea is a no-go due to computational costs, without doing any testing at all. I could give examples, such as the Cray Blitz square-by-square mobility evaluation that simply can't be done on a PC without murdering performance. There are lots of others.

The question is if it is not possible to have some improvement by combination of some counter productive ideas that are counter productive only because of speed

The question is if it is not possible that there are 5 ideas that everyone of them make the program 20% slower in nodes per seconds when using all of them make the program only 30% slower in nodes per second.

If a situation like that is possible then knowing that ideas A,B,C,D,E work at fixed number of nodes but do not work because of speed may help because you may try later to test a combination of all of these ideas.

Uri

A reason for testing at fixed number of nodes.

Re: A reason for testing at fixed number of nodes.

Re: A reason for testing at fixed number of nodes.

Re: A reason for testing at fixed number of nodes.

Re: A reason for testing at fixed number of nodes.

Re: A reason for testing at fixed number of nodes.

Re: A reason for testing at fixed number of nodes.

Re: A reason for testing at fixed number of nodes.

Re: A reason for testing at fixed number of nodes.

Re: A reason for testing at fixed number of nodes.

Re: A reason for testing at fixed number of nodes.