Don wrote:
bob wrote:
hgm wrote:
What you propose is basically a special case of orthogonal multi-testing: instead of just testing a single change, vary other parameters at the same time. In this case the TC. Get the extra info for free. If you do N games, in principle you can simultaneously test the effect of log₂ N changes that way, all with the same accuracy that goes with N games.
So you believe it is reasonable to ignore any differences in how a program behaves at different time controls, as if that has no effect on the final result? That running 10K games at a very fast TC and 10K games at a very slow one, then combining them to get 20K games, is just as good as running 20K games fast and then 20K games slow? I see too much change as time controls are varied. It is quite easy to find a program that varies by well over +/- 100 Elo against Crafty depending on the time control used. Changing the time control changes too many things (all the opponents, plus Crafty itself, will behave differently).
Bob,
Orthogonal testing makes the assumption that the individual things being tested have little interaction, that they are independent. As you observe, that is not always the case.
However, so does the kind of testing you do. If you test 20 different things over the course of a month, did you test each change in combination with every other change you ever made? Of course you didn't. You assume that if you can get an improvement over the PREVIOUS version it is good in general. If it tests badly, do you back out previous changes to see if there is a bad interaction? Probably not very often.
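For concreteness, here is a toy sketch in Python of the scheme hgm is describing (the effect sizes are invented, and interactions are assumed to be exactly zero -- which is exactly the assumption at issue):

```python
import random

# Toy sketch of orthogonal multi-testing (effect sizes invented;
# interactions assumed to be exactly zero). With N games you can vary
# K = log2(N) binary changes at once: change j is "on" in game i iff
# bit j of i is set, so each change is on in exactly half the games
# and every pair of changes is balanced against each other.
random.seed(0)
N = 65536
K = 16                                            # log2(65536)
true_effect = [0.02, -0.01, 0.005] + [0.0] * (K - 3)

def play_game(settings):
    # expected score = 50% plus the summed effects of the enabled
    # changes -- no interaction terms, by assumption
    p = 0.5 + sum(e for e, on in zip(true_effect, settings) if on)
    return 1 if random.random() < p else 0

results = []
for i in range(N):
    settings = [((i >> j) & 1) == 1 for j in range(K)]
    results.append((settings, play_game(settings)))

# each change's effect is estimated from ALL N games:
# mean score with it on, minus mean score with it off
for j in range(K):
    on  = [r for s, r in results if s[j]]
    off = [r for s, r in results if not s[j]]
    est = sum(on) / len(on) - sum(off) / len(off)
    print(f"change {j:2d}: estimated {est:+.4f}  (true {true_effect[j]:+.3f})")
```

Every one of the K estimates uses all N games -- that is where the "free" extra accuracy comes from, and also where the independence assumption sneaks in.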
This kind of testing that we all do is pretty ugly, but we have no choice: we make huge simplifying assumptions and just try to apply a little common sense to each of them.
I have noticed that you have the most incredible imagination. You can completely ignore something with a hand wave or it can become a major issue with you, depending on the point you want to make.
This all started with a simple observation that is a special case of orthogonal multi-testing, a very keen observation that I would have never thought of but seems self-evident when pointed out. That you even challenged it seemed incredible to me, like you are looking for some excuse to prove yourself. I think everyone here already has your respect so what gives?
If my "challenge" of the idea seems incredible, perhaps you might think about it a bit more. I play 40K matches for each different test. To suggest that I get the same accuracy by playing 20K fast games and 20K slow games is not exactly a reasonable assumption. If you believe it is, more power to you.
I believe it is _far_ easier to simply do the tests I want to do, in the most accurate way possible. Everyone wants to "cheat the sample sizes" and find ways to get equivalent results with fewer games. I agree that orthogonal testing works _if_ the individual tests are really independent. In my case, which you cited, your comments make _zero_ sense. Yes, it is possible that there are interactions between changes. But I only test one change at a time. I am not testing two at once and _hoping_ that they are orthogonal. I am testing the full 40K games _knowing_ there are no interactions, because I am only testing one thing at a time. It really is that simple, and it really does work. Doing the orthogonal stuff is a neat idea, but it has assumptions built in that are difficult, if not impossible, to verify. My way has no assumptions at all.
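Here is a toy demonstration of what an interaction does to the orthogonal estimate (all numbers invented):

```python
import random

# Toy demonstration (all numbers invented): two changes that each help
# alone but clash when combined. A and B are varied orthogonally, each
# on in exactly half of the N games.
random.seed(1)
N = 100_000

# invented "true" expected scores for each on/off combination
TRUE_SCORE = {(0, 0): 0.50, (1, 0): 0.52, (0, 1): 0.52, (1, 1): 0.50}

def play(a, b):
    return 1 if random.random() < TRUE_SCORE[(a, b)] else 0

results = []
for i in range(N):
    a, b = i & 1, (i >> 1) & 1   # balanced, orthogonal on/off columns
    results.append((a, b, play(a, b)))

mean_on  = sum(r for a, _, r in results if a) / (N // 2)
mean_off = sum(r for a, _, r in results if not a) / (N // 2)
print(f"orthogonal estimate of A's effect: {mean_on - mean_off:+.4f}")
print("A's actual effect with B enabled:  -0.0200 (from the table)")
```

The orthogonal estimate for A comes out near zero -- the average of +0.02 with B off and -0.02 with B on -- even though A actually hurts in the combined version you would ship. Testing one change at a time against the current version never has this problem.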
As far as "what gives?" I simply want to see some factual information provided when dealing with testing. Ideas that are suggested (fixed nodes) have already been tried, for millions of games, and I found a problem that I pointed out. If you want to ignore that problem, that's OK. I simply intended to point out that there _is_ a problem. Just because some don't seem to grasp/understand the problem does not mean it doesn't exist. And there is no doubt it exists because I spent quite a bit of time trying to figure out why I was producing results that seemed to be counter-intuitive when we were playing with the fixed node test after discovering that timed tests had so incredibly much variation in them.
I almost think it would be better to simply remain silent, let everyone test the way they want, and not try to share the things I discover, because many don't like the truth. I'll keep that in mind. We've clearly identified a way to improve Crafty, and the improvement has been verified over time by independent testers and rating lists. I'll just leave it at that and let everyone figure out how to do this stuff on their own, and re-invent the wheel a few hundred times, as that will certainly lead to a lot less aggravation, IMHO. So: do your mixed testing. Different time controls. Orthogonal (supposedly) changes. Etc. And base your decisions on the results. After all, it really isn't going to hurt +me+. I have nothing to gain by sharing information, and in a way I have more to lose, since it would be better for me to let everyone test in a broken way. (How long did the commercial guys use 40 starting positions before my test results showed how bad that was... ask Theron about it, for more information.)
So, for me, "mum is the word" with respect to testing. Let the "I believe," "I think," "it seems," "it must be," "that can't possibly be right," and such run their course. After all, it won't hurt +me+. And it might even help by improving our rate of progress compared to those that use sub-optimal (or flawed) testing approaches...
ciao.