A reason for testing at fixed number of nodes.

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: A reason for testing at fixed number of nodes.

Post by bob »

michiguel wrote:
This is not the point...
The idea is to gather information, not necessarily the same information you get with time-based testing.

Miguel

I will ask again: what useful information do you get from fixed-node searching? This kind of testing has a built-in bias, because now all nodes are created equal and are assumed to take the same amount of time to process. Unfortunately, that is not an accurate assumption. If a program normally slows down at some point in the game, because its NPS drops while it does some sort of extra work, it doesn't slow down in a fixed-node search, and it effectively runs faster than it should. Since many programs show a 2x-3x NPS variance over the course of a game, a fixed-node test can hand a program a 2x-3x time handicap or advantage, which is a big Elo change, and it isn't accounted for. That is more than enough to make the results suspect, so you have to test again to see whether you got burned by the NPS-variance issue. With timed testing, you discover this immediately, because the handicap/advantage goes away when everyone has the same amount of _time_ to use.
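To put rough numbers on that (a minimal sketch; the node budget and NPS figures below are invented for illustration, not measured from any engine): under fixed nodes, the wall-clock cost of a move scales inversely with the NPS of the current game phase, so a phase-dependent NPS profile turns straight into an effective time handicap or subsidy.

Code: Select all

# Sketch: what a fixed-node move "costs" in wall-clock time when NPS
# varies by game phase. All figures are invented for illustration.

FIXED_NODES = 10_000_000  # same node budget for every move

nps_by_phase = {           # hypothetical NPS profile for one engine
    "opening": 1_000_000,
    "middlegame": 2_000_000,
    "endgame": 3_000_000,
}

for phase, nps in nps_by_phase.items():
    seconds = FIXED_NODES / nps
    print(f"{phase:10s}: {seconds:5.1f} s per move")

# At 5 s/move, timed play would reach 5M nodes in the opening but 15M
# in the endgame; fixed nodes erases that 3x spread, which is the same
# as handing the engine a 3x time subsidy in its slow phases.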

If you can learn from fixed-node searching, you can learn the same thing from fixed-depth searching. Yet we all know that a ply is not a constant unit between programs, and the time to complete a ply is not even constant within a single program. Again, bias works its way in.

If you can figure out a way to factor the bias out of the equation, then I'd agree it will work. But if your solution is to also use a timed match, then why not start there in the first place?
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: A reason for testing at fixed number of nodes.

Post by Don »

bob wrote:
I will ask again: what useful information do you get from fixed-node searching?
To put it really simply, it comes down to what you say next. I'm going to exaggerate your words a little, but you are basically implying that fixed-depth testing is totally ridiculous, has no possible correlation to reality, is biased beyond belief, etc.

My facts just do not bear out what you believe. In fact, I invariably get the same results with fixed-depth testing as I do with timed games, if I factor in my very reliable time adjustment and if I'm not stupid about it. I pick and choose the type of tests I apply this to.

I actually do not believe in testing at fixed depth if one program is too much faster than another - I know the error and bias increase with the difference. When I test other programs against mine, I use the level that is closest in time.

Some of the bias that you say is there probably IS THERE, but it is not a ridiculous amount of bias like you claim. It's still an excellent approximation, which is all almost any experiment is.

Please note that fixed-depth testing is not the only hammer in my toolbox; it is used judiciously.

You and I are not the only ones doing this either. I happen to know that there are teams much stronger than yours and mine that use fixed-depth testing to great advantage, and in fact do the majority of their testing with it or some variation of it. So either it's not that horrible, or it is as flawed as you say, in which case you should be willing to admit they are even farther ahead of you in other areas. They seem to be doing ok despite their blundering along.


bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: A reason for testing at fixed number of nodes.

Post by bob »

Don wrote:
I will ask again: what useful information do you get from fixed-node searching?
To put it really simply, it comes down to what you say next. I'm going to exaggerate your words a little, but you are basically implying that fixed-depth testing is totally ridiculous, has no possible correlation to reality, is biased beyond belief, etc.

My facts just do not bear out what you believe. In fact, I invariably get the same results with fixed-depth testing as I do with timed games, if I factor in my very reliable time adjustment and if I'm not stupid about it. I pick and choose the type of tests I apply this to.

I actually do not believe in testing at fixed depth if one program is too much faster than another - I know the error and bias increase with the difference. When I test other programs against mine, I use the level that is closest in time.

OK, so you agree there is an issue. My first question: does your program search at a fairly constant NPS over the course of a game? Mine is significantly slower in the opening due to extra development-related evaluation, and it is slower in the endgame, since a single move generation produces far fewer moves there than in the middlegame. If your NPS varies a lot, then the timing issue I mentioned is a headache. It is compounded by a program like Ferret, whose NPS multiplies by 3-4x in the endgame. So fixing the number of nodes becomes a compromise that affects each program differently as the game progresses.

As far as my opponents go, I do not consider NPS a factor when choosing them. I want them to be (a) completely reliable at all time controls, and (b) as strong as, or stronger than, Crafty. I don't consider search, speed, evaluation, etc.



Some of the bias that you say is there probably IS THERE, but it is not a ridiculous amount of bias like you claim. It's still an excellent approximation, which is all almost any experiment is.
Again, let me be absolutely clear about what I am measuring. I have no interest in knowing whether my program is better or worse than the opponents it plays. I am interested in measuring _small_ changes in my program's rating as I/we fiddle with new or existing evaluation terms or search ideas. Keyword: "small changes". Small changes can be greatly affected by any sort of bias. If you only care whether A is better than B (A and B being different programs), fixed nodes or fixed depth might work pretty well, assuming you can calibrate the test for the difference in speeds between two programs searching to the same depth or node count. But I am interested in much finer resolution than that, which means that what might be a tiny bias to you is a deal-breaking bias for me, making the test results unreliable.

All of my comments are in the context of "small changes", and measuring small changes accurately requires high precision from the test methodology. I don't see how fixed-node or fixed-depth testing can deliver that kind of precision.
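To see just how much precision is at stake (a minimal sketch; the match score below is made up), the Elo difference implied by a match result, and its error bar, follow directly from the win/draw/loss counts, and the error bar shrinks only with the square root of the number of games:

Code: Select all

import math

def elo_diff(p):
    """Elo difference implied by score fraction p (logistic model)."""
    return -400.0 * math.log10(1.0 / p - 1.0)

def elo_with_error(wins, draws, losses):
    """Elo estimate and approximate 1-sigma error from a match result."""
    n = wins + draws + losses
    p = (wins + 0.5 * draws) / n                     # score fraction
    var = (wins * (1.0 - p) ** 2 + draws * (0.5 - p) ** 2
           + losses * p ** 2) / n                    # per-game score variance
    se_p = math.sqrt(var / n)                        # std. error of p
    delo_dp = 400.0 / (math.log(10.0) * p * (1.0 - p))
    return elo_diff(p), se_p * delo_dp

# Hypothetical 20,000-game match scoring 52%: about +14 Elo, with an
# error bar of roughly +/-2 Elo even at that volume. A 3-5 Elo change
# is near the noise floor; a phase-dependent 2x speed bias dwarfs it.
print(elo_with_error(7400, 6000, 6600))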


Please note that fixed-depth testing is not the only hammer in my toolbox; it is used judiciously.

You and I are not the only ones doing this either. I happen to know that there are teams much stronger than yours and mine that use fixed-depth testing to great advantage, and in fact do the majority of their testing with it or some variation of it. So either it's not that horrible, or it is as flawed as you say, in which case you should be willing to admit they are even farther ahead of you in other areas. They seem to be doing ok despite their blundering along.
Who are they? I know Vas and Larry use ultra-fast _timed_ games to tune Rybka, I suspect for the very reasons I am discussing.



User avatar
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: A reason for testing at fixed number of nodes.

Post by michiguel »

bob wrote:
michiguel wrote:
This is not the point...
The idea is to gather information, not necessarily the same information you get with time-based testing.

Miguel

I will ask again: what useful information do you get from fixed-node searching?
You use it as a "proof of concept" for whatever complex change you want to introduce. You will know the maximum benefit, or ceiling, that you can achieve. Based on this, you can make a decision about the direction of the project, not necessarily about whether you keep a change or not, as you seem to imply, building a straw man. Don explained this already.

For instance, suppose you have an idea about move ordering, but coding it efficiently would take a long time or require a major rewrite. You code a straightforward version: bug-free, as Don said, readable, and flexible. You run the tests and see what improvement you get. That is your reasonable ceiling. If that ceiling is tempting, you decide to go ahead and make the needed changes. This lets you experiment with ideas and develop hypotheses that you can later try to prove. In fact, knowing how much speedup you need to make the idea worthwhile tells you what you actually have to accomplish: you may conclude, for instance, that the idea is not workable unless there is an assembly-level instruction that does this or that, or unless you can build a cache with an X% hit rate, etc.
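That last point can be made concrete with a back-of-the-envelope calculation (a sketch; the ~70 Elo per doubling of search speed is a commonly quoted rule of thumb, not a measured constant, and the gain and slowdown numbers are hypothetical): if the straightforward version gains G Elo at fixed nodes but runs s times slower, the speed cost is roughly K*log2(s) Elo, so the slowdown the final version can afford is about 2^(G/K).

Code: Select all

import math

K = 70.0  # assumed Elo value of doubling search speed (rule of thumb)

def net_gain(fixed_node_gain, slowdown):
    """Estimated timed-play gain: fixed-node ceiling minus speed cost."""
    return fixed_node_gain - K * math.log2(slowdown)

def break_even_slowdown(fixed_node_gain):
    """Largest slowdown the idea tolerates before the gain is eaten up."""
    return 2.0 ** (fixed_node_gain / K)

# Hypothetical: a +20 Elo ceiling at fixed nodes, prototype 30% slower.
print(net_gain(20.0, 1.3))        # ~ -6.5 Elo: loses in timed play as-is
print(break_even_slowdown(20.0))  # ~ 1.22: overhead must drop below ~22%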

Miguel


bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: A reason for testing at fixed number of nodes.

Post by bob »

michiguel wrote:
bob wrote:
michiguel wrote:
This is not the point...
The idea is to gather information, not necessarily the same information you get with time-based testing.

Miguel

I will ask again: what useful information do you get from fixed-node searching?
You use it as a "proof of concept" for whatever complex change you want to introduce. You will know the maximum benefit, or ceiling, that you can achieve. Based on this, you can make a decision about the direction of the project, not necessarily about whether you keep a change or not, as you seem to imply, building a straw man. Don explained this already.
I understand that "assumption", but I do not believe it is a correct one. I gave the example of how a change can push the game into a phase where your program appears to be faster or slower than normal. Imagine what would happen if you (or your opponent) could tweak the CPU clock frequency mid-game. Crafty's speed varies by 2.5x from opening to endgame, as the data I provided shows. An evaluation change can drive me toward positions where my NPS would normally be higher, but under fixed nodes that extra speed buys me nothing, so effectively I am slower. Or vice-versa. Does that _really_ give you any sort of useful upper or lower bound on the gain or loss? No, this doesn't happen for every eval change, but I don't want to have to figure out which ones it happens with.
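A minimal sketch of that effect (the phase mix and NPS figures are invented for illustration): compare the nodes per move each version would get under a fixed time budget, given how the eval change shifts the mix of game phases, against the flat budget a fixed-node test hands both versions.

Code: Select all

# Sketch: a phase shift that fixed-node testing cannot see.
# All numbers are invented for illustration.

NPS = {"opening": 1_000_000, "middlegame": 2_000_000, "endgame": 2_500_000}
TIME_PER_MOVE = 5.0        # seconds/move a timed test would allow
FIXED_NODES = 7_500_000    # flat budget a fixed-node test would allow

def timed_nodes_per_move(phase_mix):
    """Average nodes/move under timed play for a given mix of phases."""
    return sum(w * NPS[ph] * TIME_PER_MOVE for ph, w in phase_mix.items())

before = {"opening": 0.3, "middlegame": 0.5, "endgame": 0.2}
after  = {"opening": 0.2, "middlegame": 0.4, "endgame": 0.4}  # eval change
                                                              # steers toward
                                                              # fast endgames

for label, mix in (("before", before), ("after", after)):
    print(f"{label}: ~{timed_nodes_per_move(mix):,.0f} nodes/move timed, "
          f"{FIXED_NODES:,} fixed")

# Timed play would give the new version ~10.0M nodes/move vs ~9.0M for
# the old one; fixed nodes pins both at 7.5M, so the test never sees
# the speed consequence of the change.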

For instance, suppose you have an idea about move ordering, but coding it efficiently would take a long time or require a major rewrite. You code a straightforward version: bug-free, as Don said, readable, and flexible. You run the tests and see what improvement you get. That is your reasonable ceiling. If that ceiling is tempting, you decide to go ahead and make the needed changes. This lets you experiment with ideas and develop hypotheses that you can later try to prove. In fact, knowing how much speedup you need to make the idea worthwhile tells you what you actually have to accomplish: you may conclude, for instance, that the idea is not workable unless there is an assembly-level instruction that does this or that, or unless you can build a cache with an X% hit rate, etc.

Miguel

