more on fixed nodes
Posted: Tue Nov 10, 2009 11:06 pm
Here is one point to ponder (I started a new thread as the others quickly get too cumbersome to follow).
Has anyone thought about _why_ I raised this issue originally? The issue where a change can push the game toward or away from an area where you lose or gain speed, which distorts the results?
(1) Do you believe that I spent weeks trying to find a reason why this was a bad idea?
(2) Did any of you read my _original_ discussion on cluster testing where we were trying to understand the really wild variability, even when playing the same starting position and same two opponents? Did you read the stuff about how playing the same two opponents, same starting position, 100 times, each time allowing just one opponent to search one more node than in the previous run, generally produces 100 _different_ games? And did anyone notice where I mentioned that for some odd reason, playing a fixed number of nodes, while producing repeatable results, produced results that were _significantly_ different from timed matches?
I actually spent a lot of time trying to figure out why, and I found the speedup/slowdown issue was the culprit. Fixed-node games give a bias to the program that searches at a lower overall NPS than its opponent, because fixed nodes treat all nodes as equal, even though one program expends more effort per node than the other. I tried the "adjustment" approach, which I explained way back when: I did a few test runs and came up with an average NPS for each program in the test, and the fixed node counts were adjusted so that each program took about the same amount of time. That changed the results in unexpected ways, because my "average NPS" ignored endgames, and the program that speeds up the most gets penalized: in fixed-node testing, all nodes are treated equally, and there is no adjustment to the node count as the game progresses.
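The adjustment idea and its flaw can be sketched in a few lines. This is a hypothetical illustration, not code from any real engine or tester: the engine names, NPS figures, and per-phase speedup are made-up numbers chosen to show how scaling a fixed node budget by one average NPS still misallocates wall-clock time once per-phase speed diverges from that average.

```python
def adjusted_node_budget(avg_nps, target_seconds):
    """Fixed-node limit that should cost ~target_seconds at avg_nps."""
    return int(avg_nps * target_seconds)

# Two imaginary engines; engine_B searches fewer nodes/sec on average,
# so its budget is scaled down to equalize (average) thinking time.
avg_nps = {"engine_A": 2_000_000, "engine_B": 1_000_000}
budget = {e: adjusted_node_budget(n, 10.0) for e, n in avg_nps.items()}

# The flaw: a single average hides phase-dependent speed.  Suppose
# engine_A doubles its NPS in the endgame while engine_B stays flat.
phase_nps = {
    "engine_A": {"middlegame": 2_000_000, "endgame": 4_000_000},
    "engine_B": {"middlegame": 1_000_000, "endgame": 1_000_000},
}

def actual_time(engine, phase):
    """Wall-clock seconds the fixed budget costs in a given game phase."""
    return budget[engine] / phase_nps[engine][phase]

# In the middlegame both engines spend ~10 s on their budget, but in
# the endgame engine_A burns its budget in half the time -- the fixed
# node count denies it the extra depth a real clock would grant.
for engine in ("engine_A", "engine_B"):
    for phase in ("middlegame", "endgame"):
        print(f"{engine} {phase}: {actual_time(engine, phase):.1f} s")
```

Under a real clock the faster endgame searcher would convert its speedup into deeper searches; under fixed nodes it simply finishes early, which is exactly the penalty described above.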
None of this is really new, and it wasn't something I dreamed up; it was something that took weeks to figure out. And after seeing the skewing, consider how eval changes can screw this up: just add a trade bonus to reach endgames quicker, and the program that speeds up the most in the endgame gets penalized and drops in overall score, only because it can't take advantage of the endgame speedup it would normally see.
None of this is made up. It was quite apparent. And I didn't like it, because I don't want something affecting the results that can't be easily quantified and factored out after the results are in.
Hope that helps as to where I am coming from on this. Fixed nodes are still nice because of repeatability, which makes debugging much simpler. But that is the _only_ advantage I can see for them when using real-world engines with significant NPS variation over the course of a game.