I have played Crafty in a few human events over the past 5 years. I DO enter the moves by hand, and play in "console mode".

Don wrote:
I agree with you in theory but not in practice. In real tournaments a human operates the machine, I'm sure you don't test your program by manually entering the moves do you? Of course not because it's not a workable thing in practice. In theory that is how you play but in practice you would never get 100,000 games that way.

bob wrote:
I subscribe to the philosophy of "test like you plan on running". If you are testing yourself and only want to measure playing skill improvements, then ponder=off is perfectly OK. Might not give you the same final Elo number as with ponder=on, but if something helps with PON, it should help with POFF, unless you are changing the basic pondering code (or time allocation is different with PON and POFF).
I'm not going to twiddle and tune my wife's car, then on Saturday night my son and I take his Mustang to the drag strip. Nor would I "practice" with a nitrous system turned off, then race with it on.
So I'm afraid that you have to pick and choose which concessions you make for the sake of practicality. We try to pick them in order of how much we believe they are relevant.
Here is a list of concessions that most of us make - probably a few exceptions such as in your case when you have a major hardware testing infrastructure but you probably make some of the same concessions too:
1. Time control A
2. Time control B
3. Ponder vs No ponder
4. Book
5. Hardware
6. Opponents
One at a time:
1. There are 2 issues with time control. The first is playing with the same style of time control, with the same ratio of time to increment (if used) or to moves. For example 40/2 classic should scale to 20/2 if you want to speed up the test. If you want to play in 5 minutes + 5 seconds then you should test at 1 minute + 1 second, preserving the same ratio.
2. The other time control issue is actually playing the exact time control of the tournament you are playing in. If you want Crafty to play well at 40/2 then do you test only at 40/2 ???
3. Ponder vs No ponder. I test with a 6 core i7-980x and it's not much, it's a huge bottleneck for us. Larry has a bit more than I do but it's still a huge bottleneck. We test with ponder off. If we tested with ponder ON we would have to reduce our samples by half, or increase our testing time by 2X to get the same number of games. We cannot afford to do this just to be anal retentive about this issue.
See: http://en.wikipedia.org/wiki/Anal_retentiveness
4. Book. Does Crafty use the same book that it will compete with? You would have to in order to follow your principle of testing the same as you will play.
5. Hardware. Does Crafty use exactly the same hardware and configuration for your big 20,000 game samples that you intend to compete with? I doubt it.
6. Opponents. When you test Crafty I'm sure you don't test against the same players and versions you will compete with in tournaments. This is not possible anyway since you don't know who will be there and what they will bring and what hardware they will use.
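The ratio-preserving scaling from item 1 can be sketched in a few lines of Python. This is only an illustration: the function name `scale_time_control` and its `speedup` parameter are made up for this example and do not come from any engine or testing tool.

```python
def scale_time_control(base_seconds, increment_seconds, speedup):
    """Divide base time and increment by the same factor.

    Because both parts shrink by the same factor, the ratio of
    base time to increment is unchanged.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return base_seconds / speedup, increment_seconds / speedup


# Tournament control: 5 minutes + 5 seconds, tested 5x faster.
base, inc = scale_time_control(300, 5, 5)
print(f"test at {base:g}s + {inc:g}s per move")
```

With a speedup of 5 this reproduces the 1 minute + 1 second example given in item 1.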
As you see, it is not even CLOSE to possible to "test like you plan on running." I don't mean to be critical about this but I don't understand why people latch on to what is probably the LEAST important factor in the list above and make it seem like a major blunder, as if there is absolutely no correlation between how a program will do with ponder vs not pondering - when all major testing is done with an opening book that does not resemble in any way, shape, or form what a program will use in a serious competition. Which do you think is the greater issue?
Larry and I decided long ago that testing with Ponder, although better in some idealist sense, is a major trade-off in the wrong direction, where sample size means so much more.
So if the principle is "test like you plan on running" how would you justify most of these concessions? Do you think ponder is more important than time control or using your tournament book or running the same exact hardware?
When it comes to things like this a good engineer knows the difference between the lower order bits and the higher order bits. The truth of the matter is that hardly anyone has the luxury of "testing like we plan on running" but we have a very good sense of what the tradeoffs are. We know that if the program improves, the improvement will probably still show up at a different time control if it's not ridiculously different.
Let me ask you this: If we make an improvement to the evaluation function which shows a definite improvement with no ponder, do you think the result is invalid because we did not test with ponder on? I don't think so .....
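To put a rough number on why halving the sample hurts, here is a minimal Python sketch using the standard normal approximation for a match score and the usual logistic Elo formula. The win/draw/loss counts are invented purely for illustration, and the helper names are not from any testing tool.

```python
import math


def score_standard_error(wins, draws, losses):
    """Standard error of the mean score per game (win=1, draw=0.5, loss=0)."""
    n = wins + draws + losses
    s = (wins + 0.5 * draws) / n
    var = (wins * (1 - s) ** 2 + draws * (0.5 - s) ** 2 + losses * s ** 2) / n
    return math.sqrt(var / n)


def elo_from_score(s):
    """Elo difference implied by an expected score s, with 0 < s < 1."""
    return -400 * math.log10(1 / s - 1)


def elo_margin_95(wins, draws, losses):
    """Approximate 95% half-width of the Elo estimate."""
    n = wins + draws + losses
    s = (wins + 0.5 * draws) / n
    se = score_standard_error(wins, draws, losses)
    return (elo_from_score(s + 1.96 * se) - elo_from_score(s - 1.96 * se)) / 2


# Hypothetical results: same proportions, full sample vs. half sample.
full = (3000, 4000, 3000)
half = (1500, 2000, 1500)
print("full sample margin:", round(elo_margin_95(*full), 2), "Elo")
print("half sample margin:", round(elo_margin_95(*half), 2), "Elo")
```

Under these assumptions the standard error of the score grows by a factor of sqrt(2) when the number of games is cut in half, so the error bar on the measured Elo widens by roughly the same factor.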
You might find this interesting, but I'm a lot more anal retentive about stuff like this than Larry is, but compared to you I'm not disciplined at all!
As far as your last question, you are missing the point. The rating lists produce RATINGS. It is not about whether version A is better than version B. It is about the RATING of each program... If you change your eval, testing either way, if done consistently, should tell you reliably whether the change was good or not. But it might not reliably tell you the Elo of your program so that it can be compared to others.
And there, there IS a difference between ponder=on and ponder=off. Simple example. Broken pondering so you always ponder the wrong move. In a ponder=off match, you get rating X. In a ponder=on match, you get rating of X-70 or something similar, because your opponent always predicts you right, saves time, and searches deeper, you never get a ponder right, and never save time. Time == Elo. I even allocate time differently with ponder=on vs off, because I know I will save time by pondering, and I want to use some of that time "before" it is saved, during an important part of the game (early middlegame).
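The "Time == Elo" argument can be sketched as a toy model in Python. Everything here is an assumption made for illustration: the model credits the engine with the opponent's full thinking time on every correct ponder guess, and `elo_per_doubling` is a placeholder value, not a measurement of Crafty or any other engine.

```python
import math


def effective_time(own_time, opp_time, ponder_hit_rate):
    """Average search time per move under a simplified ponder model.

    On a correct guess the engine has already searched the right
    position while the opponent was thinking, so that time is added.
    """
    return own_time + ponder_hit_rate * opp_time


def elo_from_time_ratio(ratio, elo_per_doubling):
    """Elo change for a given search-time ratio, assuming a fixed gain per doubling."""
    return elo_per_doubling * math.log2(ratio)


# Hypothetical numbers: 10 seconds per move for both sides.
working = effective_time(10, 10, ponder_hit_rate=0.5)  # guesses right half the time
broken = effective_time(10, 10, ponder_hit_rate=0.0)   # always ponders the wrong move
gap = elo_from_time_ratio(working / broken, elo_per_doubling=60)  # placeholder value
print(f"modelled Elo cost of broken pondering: {gap:.1f}")
```

In a ponder=off match both engines sit at the `broken` figure, so the defect is invisible; it only shows up in the rating when the opponent is allowed to ponder and benefits from it.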