A reason for testing at fixed number of nodes.

Discussion of chess software programming and technical issues.


Uri Blass
Posts: 10268
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: A reason for testing at fixed number of nodes.

Post by Uri Blass »

bob wrote:
Uri Blass wrote:
bob wrote:
Don wrote:
bob wrote:
Don wrote:
bob wrote: "time adjusted" being the operative words. And I am not sure this is so easy, since a program does not search at a constant speed throughout a game. My NPS can vary by a factor of 3 over the course of a game. Ferret (Bruce Moreland) varied much more, getting _way_ faster in endgames. Fixed nodes can never account for this, which always leaves distortions in the results that are difficult to impossible to quantify.
I'm not anal about this. I take every test I run with a grain of salt as there is no completely dependable way to test - and I am a huge advocate of mixing up a lot of different kinds of testing. I think you have to do this in order to understand your program.

Having said that, I have to also say that fixed depth testing has indeed proved to be a surprisingly reliable indicator of whether a change helped or not. It has not been as good at telling me how much it helps, but I run time control games to determine that (and to prove that it actually does help.) Maybe more to the point is that I can often eliminate a change from consideration without wasting enormous resources.

Don't forget that I don't have your resources. I don't have the luxury of running tens of thousands of long time control games, but even if I did I would run the dyno test first, then put her on the track.
Most of us put 'em on the track _first_. Horsepower is the least of the issues. One can make arbitrarily large horsepower, within a given class set of rules, but it doesn't help a bit if you can't get it to the ground... Everybody makes tons of horsepower, making it get you down the track is a completely different matter. Very much like you can add all sorts of stuff to the eval, but if it slows you down, it is a net loss...
I think you are putting more emphasis on the implementation than the idea itself if you only do time control testing. I use fixed depth testing as a tool to isolate the idea from the implementation. If the idea is good I can usually find a good implementation. If the idea is bad then I don't have to waste a lot of time with time control games.

In your example of adding all sorts of stuff to the evaluation, you only know that it doesn't work - you don't know why. If it slows you down it is a net loss as you say, but you don't even know that much with your kind of testing because it doesn't reveal whether it slowed you down or whether the evaluation itself was the problem.
I simply don't try to write "sloppy implementations" to get a quick test done (AKA rapid prototyping in the software world). I start with the best I can come up with, as a good idea that can't be made fast is not going to help. So I want to make sure that the idea is good, and that a good implementation of it does not hurt speed enough to offset the original idea's gain. Every idea has a pair of offsetting features: it hopefully improves the skill level of the program, but it has a cost in terms of speed. What good does it do to find a good idea (say a very complex mobility calculation that works great at fixed nodes, where speed is irrelevant) but which can't be implemented in any way that doesn't hurt Elo more because of the speed loss???
Maybe some change in the evaluation that helps at a fixed number of nodes but loses a few Elo at fast time controls can help at longer time controls.

The question is whether the slowdown is the same at longer time controls.

I think that at longer time controls you may more often get the evaluation from the hash tables, so the speed difference in nodes per second may be smaller.
How can eval changes speed up or slow down at different time controls? For every 100 nodes you search you do N evaluations. Time control doesn't change the proportion significantly. So speed won't change. If you do a change that works at fast games but not at slow, this is not going to be caught by a fixed node search. This is a different issue completely and can only be seen by playing the games at different time controls. I notice no difference in hash hits at longer time controls. In fact, it probably goes down since a hash table is rarely big enough given today's hardware speeds, and overwriting is going to occur more at deeper depths.
I thought that you might do a smaller number of evaluations per 100 nodes at a slower time control.

The reason is that at the beginning of the search you get different positions, and that is not the case later.

Let's take an extreme case: imagine that you search only 1 ply or 2 plies forward. The moves are different, so the positions are different.

The only way to get the same position twice is by searching to a bigger depth.

If you search 3 plies forward then 1.e4 d6 2.d4 reaches the same position as 1.d4 d6 2.e4, so you do not need to evaluate it again.

I may be wrong, but I thought that you may find more positions in the hash for every 100 positions that you search at a long time control.

If you have some simple endgame position with blocked pawns (like KBBP vs KRP) then the number of future positions may be something like 10^8, and after enough time you get no new positions to evaluate, or almost none.
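
As an illustration of the mechanism Uri describes, here is a minimal sketch, assuming hypothetical names and structures (this is not Crafty's or any other engine's actual code), of caching the static evaluation by hash key so that transposed positions reached later in the search skip the evaluation call:

Code:

/* Minimal sketch: cache the static evaluation by Zobrist key so that
   transpositions hit the cache instead of calling the full evaluation.
   All names are hypothetical. */
#include <stdint.h>

#define EVAL_CACHE_SIZE (1 << 20)          /* entries, power of two      */

typedef struct {
    uint64_t key;                          /* Zobrist hash of position   */
    int      score;                        /* cached static evaluation   */
} EvalEntry;

static EvalEntry eval_cache[EVAL_CACHE_SIZE];
static long evals_done, eval_hits;         /* counters for the ratio     */

/* stand-in for the real (expensive) evaluation function */
static int full_evaluate(const void *pos, uint64_t key) { (void)pos; return (int)(key & 0xff) - 128; }

int evaluate_cached(const void *pos, uint64_t key)
{
    EvalEntry *e = &eval_cache[key & (EVAL_CACHE_SIZE - 1)];
    if (e->key == key) {                   /* position already seen:     */
        eval_hits++;                       /* no new evaluation needed   */
        return e->score;
    }
    evals_done++;                          /* new position: evaluate it  */
    e->key   = key;
    e->score = full_evaluate(pos, key);
    return e->score;
}

If Uri's argument holds, eval_hits per 100 nodes would grow as the search gets deeper, and the effective cost of a heavier evaluation would shrink at longer time controls.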

Uri
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: A reason for testing at fixed number of nodes.

Post by bob »

jwes wrote:
wgarvin wrote:
jwes wrote:
bob wrote: if you test at fixed nodes, optimizations are meaningless. I am not quite sure what you mean...
I am not explaining this well. I was proposing two related but different ideas.
1. Testing different versions with fixed nodes to get some abstract idea of strength which you can contrast with the real strength returned by normal testing.
2. Testing the same version with and without PGO using normal time controls to get some estimate of how much variability could be introduced by optimization oddities.
I don't think that playing PGO vs. non-PGO is going to tell you anything meaningful about the number of ELO your engine gains because you are using PGO. If you do a full set of tests with the PGO and non-PGO builds each playing against the same set of opponents (under normal conditions, not with fixed nodes) then it might show you something about it.
I assumed that by now everyone knew not to test versions directly against each other. The problem I was trying to get a handle on is that changing code can noticeably change the speed of the program by altering the location of code and data in the processor even if that code is not executed. This test was my idea on how to estimate how large that change could be. If someone has a better way, I would be glad to read it.
This is a problem I don't think you should worry about. It is impossible to know what a compiler will produce until it produces it. And compilers change regularly. I optimize things that will make a difference when I know that the compiler can't make some of the assumptions I know are true. But I don't spend any time thinking about how the optimizer works. I simply compile different ways, take the fastest code and go with that. I don't think that speed is a variable (with respect to optimization) over the course of a game.
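
For concreteness, "testing with fixed nodes" here just means the engines are told to stop on a node counter instead of the clock; a minimal sketch of that check, using hypothetical globals rather than any particular engine's code, looks like this:

Code:

/* Minimal sketch of a node-based search limit (hypothetical names).
   With a node limit, the clock and the hardware speed drop out of the
   test: both versions get exactly the same number of nodes per move. */
#include <stdint.h>

extern uint64_t nodes_searched;   /* incremented once per node visited    */
extern uint64_t node_limit;       /* e.g. taken from UCI "go nodes <n>"   */
extern int      stop_search;      /* flag polled by the recursive search  */

void check_limits(void)
{
    if (node_limit != 0 && nodes_searched >= node_limit)
        stop_search = 1;          /* abort exactly as a time-out would    */
}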
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: A reason for testing at fixed number of nodes.

Post by bob »

Uri Blass wrote:
bob wrote:
Uri Blass wrote:
bob wrote:
Don wrote:
bob wrote:
Don wrote:
bob wrote: "time adjusted" being the operative words. And I am not sure this is so easy, since a program does not search at a constant speed throughout a game. My NPS can vary by a factor of 3 over the course of a game. Ferret (Bruce Moreland) varied much more, getting _way_ faster in endgames. Fixed nodes can never account for this, which always leaves distortions in the results that are difficult to impossible to quantify.
I'm not anal about this. I take every test I run with a grain of salt as there is no completely dependable way to test - and I am a huge advocate of mixing up a lot of different kinds of testing. I think you have to do this in order to understand your program.

Having said that, I have to also say that fixed depth testing has indeed proved to be a surprisingly reliable indicator of whether a change helped or not. It has not been as good at telling me how much it helps, but I run time control games to determine that (and to prove that it actually does help.) Maybe more to the point is that I can often eliminate a change from consideration without wasting enormous resources.

Don't forget that I don't have your resources. I don't have the luxury of running tens of thousands of long time control games, but even if I did I would run the dyno test first, then put her on the track.
Most of us put 'em on the track _first_. Horsepower is the least of the issues. One can make arbitrarily large horsepower, within a given class set of rules, but it doesn't help a bit if you can't get it to the ground... Everybody makes tons of horsepower, making it get you down the track is a completely different matter. Very much like you can add all sorts of stuff to the eval, but if it slows you down, it is a net loss...
I think you are putting more emphasis on the implementation than the idea itself if you only do time control testing. I use fixed depth testing as a tool to isolate the idea from the implementation. If the idea is good I can usually find a good implementation. If the idea is bad then I don't have to waste a lot of time with time control games.

In your example of adding all sorts of stuff to the evaluation, you only know that it doesn't work - you don't know why. If it slows you down it is a net loss as you say, but you don't even know that much with your kind of testing because it doesn't reveal whether it slowed you down or whether the evaluation itself was the problem.
I simply don't try to write "sloppy implementations" to get a quick test done (AKA rapid prototyping in the software world). I start with the best I can come up with, as a good idea that can't be made fast is not going to help. So I want to make sure that the idea is good, and that a good implementation of it does not hurt speed enough to offset the original idea's gain. Every idea has a pair of offsetting features: it hopefully improves the skill level of the program, but it has a cost in terms of speed. What good does it do to find a good idea (say a very complex mobility calculation that works great at fixed nodes, where speed is irrelevant) but which can't be implemented in any way that doesn't hurt Elo more because of the speed loss???
Maybe some change in the evaluation that helps at a fixed number of nodes but loses a few Elo at fast time controls can help at longer time controls.

The question is whether the slowdown is the same at longer time controls.

I think that at longer time controls you may more often get the evaluation from the hash tables, so the speed difference in nodes per second may be smaller.
How can eval changes speed up or slow down at different time controls? For every 100 nodes you search you do N evaluations. Time control doesn't change the proportion significantly. So speed won't change. If you do a change that works at fast games but not at slow, this is not going to be caught by a fixed node search. This is a different issue completely and can only be seen by playing the games at different time controls. I notice no difference in hash hits at longer time controls. In fact, it probably goes down since a hash table is rarely big enough given today's hardware speeds, and overwriting is going to occur more at deeper depths.
I thought that you might do a smaller number of evaluations per 100 nodes at a slower time control.

The reason is that at the beginning of the search you get different positions, and that is not the case later.

Let's take an extreme case: imagine that you search only 1 ply or 2 plies forward. The moves are different, so the positions are different.

The only way to get the same position twice is by searching to a bigger depth.

If you search 3 plies forward then 1.e4 d6 2.d4 reaches the same position as 1.d4 d6 2.e4, so you do not need to evaluate it again.

I may be wrong, but I thought that you may find more positions in the hash for every 100 positions that you search at a long time control.

If you have some simple endgame position with blocked pawns (like KBBP vs KRP) then the number of future positions may be something like 10^8, and after enough time you get no new positions to evaluate, or almost none.

Uri
Something tells me that beyond a few plies, this is not the case, but it would be easy enough to measure. In fact, in Crafty, I report the number of evaluations done and the total nodes searched. I'll run some quick tests and post the results later tonight...
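
The measurement Bob describes needs only two counters. A sketch, with hypothetical counter names rather than Crafty's actual output:

Code:

/* Sketch of the evals-per-node measurement (hypothetical names). */
#include <stdio.h>
#include <stdint.h>

extern uint64_t total_nodes;   /* bumped on every node entered            */
extern uint64_t total_evals;   /* bumped only when the full eval runs,    */
                               /* i.e. not when a cached eval is reused   */

void report_eval_ratio(void)
{
    /* If Uri is right, this ratio should drop as the search gets deeper;
       if it stays roughly constant, Bob's argument holds. */
    double per_100 = total_nodes
                   ? 100.0 * (double)total_evals / (double)total_nodes
                   : 0.0;
    printf("nodes=%llu  evals=%llu  evals per 100 nodes=%.1f\n",
           (unsigned long long)total_nodes,
           (unsigned long long)total_evals, per_100);
}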
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: A reason for testing at fixed number of nodes.

Post by bob »

Uri Blass wrote:
bob wrote:
Don wrote:
bob wrote: What good does it do to find a good idea (say a very complex mobility calculation that works great in fixed nodes where speed is irrelevant) but which can't be implemented in any way that doesn't hurt Elo more because of the speed loss???
It's rather annoying that you present that scenario as if you have any clue about why it doesn't work. You don't even know if it's the idea that is flawed or the implementation so you have no basis for presenting that scenario.

However, I could present it to you! Because I can TELL you whether the idea is good and then I can tell you if it was difficult finding an efficient implementation or not.

I would think any good engineer would want to know WHY it worked or not and would not be satisfied with being in ignorance about this.



- Don
And if it is infeasible to do so, it doesn't take much experience at all to figure out that the idea is a no-go due to computational costs, without doing any testing at all. I could give examples, such as the Cray Blitz square-by-square mobility evaluation that simply can't be done on a PC without murdering performance. There are lots of others.
The question is whether it is possible to get some improvement from a combination of ideas that are counter-productive only because of speed.

For example, is it possible that there are 5 ideas, each of which makes the program 20% slower in nodes per second, but using all of them together makes the program only 30% slower?

If a situation like that is possible, then knowing that ideas A, B, C, D, E work at a fixed number of nodes but fail because of speed may help, because you may later try to test a combination of all of these ideas.

Uri
I simply don't believe such ideas exist where you can't recognize the cost just by knowing how the code will look. For example, if you do the old Slate/Atkin attacked-from bitboards, you can use them for move generation, mobility calculation, etc., but you would not (at least someone experienced in bitboards would not) jump into this, knowing the significant incremental update costs for those attacked-from values, without having lots of ideas about how that information can be used elsewhere to take advantage of it. I can only speak for myself, but major changes are not made on a whim. I give a lot of thought to what it will cost, where it might be used, and whether it appears to be worthwhile or not. If it is a "yes" it gets implemented; if it is a "no" it does not. Sometimes a "maybe" happens, and we do test those. But the idea of learning something useful from fixed-node tests doesn't make sense to me, because it hides an important property of a chess engine (speed), which can quite easily lead to the wrong conclusion about the goodness or badness of a change.
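
For readers unfamiliar with the trade-off being discussed: once attack bitboards exist, each extra consumer of them is cheap, which is the kind of shared cost Uri's "5 ideas" question is getting at. A generic sketch (not Cray Blitz or Crafty code; knight_attacks[] is an assumed precomputed table):

Code:

/* Generic bitboard mobility sketch. Once an attack bitboard exists,
   mobility is just a popcount, and the same bitboard can be reused for
   move generation, king safety, etc., so several individually "too
   slow" ideas can end up sharing one cost. */
#include <stdint.h>

extern uint64_t knight_attacks[64];     /* assumed precomputed attack sets */

static int popcount64(uint64_t b)
{
    int n = 0;
    while (b) { b &= b - 1; n++; }      /* clear the lowest set bit        */
    return n;
}

int knight_mobility(uint64_t knights, uint64_t own_pieces)
{
    int mob = 0;
    while (knights) {
        int sq = __builtin_ctzll(knights);   /* lowest set square (gcc/clang) */
        knights &= knights - 1;
        mob += popcount64(knight_attacks[sq] & ~own_pieces);
    }
    return mob;
}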

As I said, I use this kind of testing, but only for debugging, where something breaks or crashes unexpectedly. I then run thousands of such games, knowing that when it crashes I can reproduce it easily enough to make debugging much less painful. But that is _all_ I use fixed-node testing for.
jwes
Posts: 778
Joined: Sat Jul 01, 2006 7:11 am

Re: A reason for testing at fixed number of nodes.

Post by jwes »

bob wrote:
jwes wrote:
wgarvin wrote:
jwes wrote:
bob wrote: if you test at fixed nodes, optimizations are meaningless. I am not quite sure what you mean...
I am not explaining this well. I was proposing two related but different ideas.
1. Testing different versions with fixed nodes to get some abstract idea of strength which you can contrast with the real strength returned by normal testing.
2. Testing the same version with and without PGO using normal time controls to get some estimate of how much variability could be introduced by optimization oddities.
I don't think that playing PGO vs. non-PGO is going to tell you anything meaningful about the number of ELO your engine gains because you are using PGO. If you do a full set of tests with the PGO and non-PGO builds each playing against the same set of opponents (under normal conditions, not with fixed nodes) then it might show you something about it.
I assumed that by now everyone knew not to test versions directly against each other. The problem I was trying to get a handle on is that changing code can noticeably change the speed of the program by altering the location of code and data in the processor even if that code is not executed. This test was my idea on how to estimate how large that change could be. If someone has a better way, I would be glad to read it.
This is a problem I don't think you should worry about. It is impossible to know what a compiler will produce until it produces it. And compilers change regularly. I optimize things that will make a difference when I know that the compiler can't make some of the assumptions I know are true. But I don't spend any time thinking about how the optimizer works. I simply compile different ways, take the fastest code and go with that. I don't think that speed is a variable (with respect to optimization) over the course of a game.
The problem that I am worrying about is that an optimization fluke between two versions could give false test results, e.g. if a change would be 3 ELO weaker but an optimization fluke adds 7 ELO, you would accept a change that makes your program weaker. If you also tested this with fixed nodes, you would have a strong indication that this change is not an improvement. A more likely scenario is that a change would cause the minimum improvement that you accept as useful, but optimization differences would make the version slightly weaker, and you would reject the change.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: A reason for testing at fixed number of nodes.

Post by bob »

jwes wrote:
bob wrote:
jwes wrote:
wgarvin wrote:
jwes wrote:
bob wrote: if you test at fixed nodes, optimizations are meaningless. I am not quite sure what you mean...
I am not explaining this well. I was proposing two related but different ideas.
1. Testing different versions with fixed nodes to get some abstract idea of strength which you can contrast with the real strength returned by normal testing.
2. Testing the same version with and without PGO using normal time controls to get some estimate of how much variability could be introduced by optimization oddities.
I don't think that playing PGO vs. non-PGO is going to tell you anything meaningful about the number of ELO your engine gains because you are using PGO. If you do a full set of tests with the PGO and non-PGO builds each playing against the same set of opponents (under normal conditions, not with fixed nodes) then it might show you something about it.
I assumed that by now everyone knew not to test versions directly against each other. The problem I was trying to get a handle on is that changing code can noticeably change the speed of the program by altering the location of code and data in the processor even if that code is not executed. This test was my idea on how to estimate how large that change could be. If someone has a better way, I would be glad to read it.
This is a problem I don't think you should worry about. It is impossible to know what a compiler will produce until it produces it. And compilers change regularly. I optimize things that will make a difference when I know that the compiler can't make some of the assumptions I know are true. But I don't spend any time thinking about how the optimizer works. I simply compile different ways, take the fastest code and go with that. I don't think that speed is a variable (with respect to optimization) over the course of a game.
The problem that I am worrying about is that an optimization fluke between two versions could give false test results, e.g. if a change would be 3 ELO weaker but an optimization fluke adds 7 ELO, you would accept a change that makes your program weaker. If you also tested this with fixed nodes, you would have a strong indication that this change is not an improvement. A more likely scenario is that a change would cause the minimum improvement that you accept as useful, but optimization differences would make the version slightly weaker, and you would reject the change.
I simply ignore that case. Otherwise every time you make a single change to your program you have to test the change, and make some attempt to check the optimizer. And the optimizer could fail in completely unexpected ways. I have seen this, but we are talking 2-3 times in the past 15 years of Crafty development, and those optimizer screwups were related to long long math on 32 bit machines.
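
One cheap way to separate a code-layout or optimizer speed fluke from an evaluation-quality change is a raw-speed benchmark run with each build before any games are played, which is essentially the "compile different ways, take the fastest" approach Bob describes. A sketch, with hypothetical engine functions:

Code:

/* Sketch of a raw-speed benchmark (hypothetical functions, not any
   particular engine's API): run the same fixed-depth searches with each
   build and compare NPS, so a layout/optimizer fluke shows up as a
   speed delta before any Elo testing. */
#include <stdio.h>
#include <stdint.h>
#include <time.h>

extern void     set_position(const char *fen);
extern uint64_t search_fixed_depth(int depth);   /* returns nodes searched */

static const char *bench_fens[] = {
    "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1",
    /* add more test positions here */
};

void bench(int depth)
{
    uint64_t nodes = 0;
    clock_t start = clock();
    for (size_t i = 0; i < sizeof bench_fens / sizeof *bench_fens; i++) {
        set_position(bench_fens[i]);
        nodes += search_fixed_depth(depth);
    }
    double secs = (double)(clock() - start) / CLOCKS_PER_SEC;
    printf("bench: %llu nodes in %.2fs = %.0f NPS\n",
           (unsigned long long)nodes, secs, secs > 0 ? nodes / secs : 0.0);
}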
hgm
Posts: 27788
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: A reason for testing at fixed number of nodes.

Post by hgm »

Decoupling the measurement of strength and speed is very useful. To know whether a change improves my engine in time-based play, I would be obliged to implement the change in the maximally optimized way, with the most clever algorithm. That would require a lot of effort, and it might all be wasted, because the idea might even be a bust in node-based play.

Testing first in node-based play allows me to use the quickest and dirtiest solution I can imagine; I just hack it in without paying any attention to efficiency at all. Then the node-based play tells me how much the idea is worth, independent of the quality of the implementation.

And from that info I can then estimate how much of a speed hit the implementation of that idea could afford. That gives me a pretty clear impression of whether it is feasible or not. But most of the time you don't even get to that stage, so it saves tons of time.
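
The break-even estimate HGM describes can be written down directly. The Elo value of a speed doubling is an assumption you have to measure for your own engine and time control (figures around 50-80 are often quoted); the numbers below are illustrative only:

Code:

/* Sketch of the break-even estimate: an idea worth E Elo at fixed nodes
   breaks even against a slowdown factor f when elo_per_doubling * log2(f)
   equals E. elo_per_doubling is an assumed, engine-specific constant. */
#include <math.h>
#include <stdio.h>

double affordable_slowdown(double elo_gain_fixed_nodes,
                           double elo_per_doubling)
{
    return pow(2.0, elo_gain_fixed_nodes / elo_per_doubling);
}

int main(void)
{
    /* e.g. +10 Elo at fixed nodes, 70 Elo per doubling assumed:
       any implementation less than ~10% slower should still be a win. */
    printf("max slowdown factor: %.2f\n", affordable_slowdown(10.0, 70.0));
    return 0;
}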
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: A reason for testing at fixed number of nodes.

Post by bob »

hgm wrote: Decoupling the measurement of strength and speed is very useful. To know whether a change improves my engine in time-based play, I would be obliged to implement the change in the maximally optimized way, with the most clever algorithm. That would require a lot of effort, and it might all be wasted, because the idea might even be a bust in node-based play.

Testing first in node-based play allows me to use the quickest and dirtiest solution I can imagine; I just hack it in without paying any attention to efficiency at all. Then the node-based play tells me how much the idea is worth, independent of the quality of the implementation.
Or not, as I have previously explained, because programs are not constant in their NPS over the course of a game. If your eval change pushes the game toward positions where you are slower, or where your opponent is faster, you get a time advantage you did not think about, which makes the change look good when it might actually be better or worse in real timed games. Vice-versa as well.

And from that info I can then estimate how much of a speed hit the implementation of that idea could afford. That gives me a pretty clear impression of whether it is feasible or not. But most of the time you don't even get to that stage, so it saves tons of time.
hgm
Posts: 27788
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: A reason for testing at fixed number of nodes.

Post by hgm »

My program does not suffer from that problem. And there is no need to play the opponents at a node-based TC, as I do not alter them.
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: A reason for testing at fixed number of nodes.

Post by michiguel »

bob wrote:
hgm wrote: Decoupling the measurement of strength and speed is very useful. To know whether a change improves my engine in time-based play, I would be obliged to implement the change in the maximally optimized way, with the most clever algorithm. That would require a lot of effort, and it might all be wasted, because the idea might even be a bust in node-based play.

Testing first in node-based play allows me to use the quickest and dirtiest solution I can imagine; I just hack it in without paying any attention to efficiency at all. Then the node-based play tells me how much the idea is worth, independent of the quality of the implementation.
Or not, as I have previously explained, because programs are not constant in their NPS over the course of a game. If your eval change pushes the game toward positions where you are slower, or where your opponent is faster, you get a time advantage you did not think about, which makes the change look good when it might actually be better or worse in real timed games. Vice-versa as well.
In experimental science, many preliminary experiments are performed not to collect data, but to get an idea of what other experiments (if any) should follow.

Testing with fixed nodes may fall into that category. You keep pointing out the defects of preliminary experiments, when they are not necessarily supposed to be perfect (if there is such a thing in experimental science...).

Miguel

And from that info I can then estimate how much of a speed hit the implementation of that idea could afford. That gives me a pretty clear impression of whether it is feasible or not. But most of the time you don't even get to that stage, so it saves tons of time.