But what about your opponents?

nczempin wrote:
My program does not have any choices within the opening book. It'll always play the same line.

bob wrote:
IMHO that is the wrong way to test. In any scientific experiment, the goal is to just change one thing at a time. If you use an opening book, now the program has choices and those choices introduce a second degree of freedom into the experiment. If the program has learning, there's a third degree of freedom.

nczempin wrote:
Okay, here's a first shot at getting to a more formalized description; we can always use a more mathematical language once we have some agreement on the content:
I have an engine E and another engine E'. I want to determine whether E' is (statistically) significantly stronger than E.
What does "stronger" mean in this context (some assumptions that I am making for my personal situation, YMMV)?
Stronger means that E' has a better result than E, one which is unlikely to be due to random factors alone (at some given confidence factor, I think 95 % is a good starting point).
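As a rough illustration of that criterion, here is a minimal sketch in Python, assuming a simple normal approximation to the match score; the function name, the z = 1.96 cutoff for roughly 95% confidence, and the example numbers are illustrative, not anything agreed in this thread.

import math

def significantly_stronger(score_new, score_old, games, z=1.96):
    """Rough significance check at about the 95% level (z = 1.96).

    score_new, score_old: points (win = 1, draw = 0.5) scored by E' and E
    against the same opposition over the same number of games.
    """
    p_new = score_new / games
    p_old = score_old / games
    # Standard error of the difference of the two score fractions,
    # treating each game as a Bernoulli trial (draws make this a
    # slightly conservative overestimate of the variance).
    se = math.sqrt(p_new * (1 - p_new) / games + p_old * (1 - p_old) / games)
    return (p_new - p_old) > z * se

# Illustration: E' scores 560/1000, E scores 520/1000 -> not yet significant.
print(significantly_stronger(560, 520, 1000))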
For me, the goal is to test the engines "out of the box", which means their own book (or none, if they don't have one) is included. I'm looking to evaluate the whole package, not just the "thinking" part.
But by using books, you introduce another level of randomness into the games as well. What are you trying to test? Book improvements? Search improvements? Evaluation improvements? Learning improvements? The idea is to test exactly what you are working on, eliminating all the other noise, to reduce the number of test games needed to evaluate the change. (Note I said change, not _changes_.)

My goal is not to find out which of the (multiple or not) changes caused the improvement; my goal is only to find out whether the new version with that "black box of changes" is better than the old version.
Yes, at times it is interesting to know whether A or A' is better than C, D, E and F. But if you are trying to compare A to A', making multiple changes to A' makes it impossible to attribute a different result to the actual cause of that result.
So you would be happy with three somewhat decent improvements in your ideas, and one horrible one, just because the three decent ones make the overall thing play a bit better? That's a dangerous way of developing and testing, because that very thing is not that uncommon. How much better would it play without the three bad ideas to go with the one good one? You will never know. And crap creeps into your code without your knowing...
Because my opening book will not lead to those positions, I don't care if it performs better or worse in those positions, because it will never reach them. So having to test with them will be a waste of time.

You miss the key point. So your program does badly in some Nunn positions. So what? You don't care about how well you play against B, C and D. You only care if the new version plays _better_ against those programs. That tells you your changes are for the better.

Actually I think using random or Nunn positions could skew the results, because they would show you how your engine plays in those positions. There may very well be positions that your engine doesn't "like", and that a sensible opening book would avoid. Of course, someone whose goal is the performance in those positions (or in "all likely positions", of which such a selected group is then assumed to be a good proxy) would need to use a lot more games. So in the discussion I would like this factor to be separated.
If Eden A achieves a certain result with a certain book, I would like to know if Eden B would achieve a better result with the same book.
That is _exactly_ what I am trying to verify... I am not confusing the two; I am not interested in knowing if A is better than C. I am only interested in knowing whether A' performs better than A against the conglomerate of B, C and D, without knowing the individual results.
Don't confuse A being better than C with trying to decide whether A or A' is better. The two questions are completely different. And for me, I am asking the former, "is my change good or bad". Not "does my change make me better than B or not?"
We can simply agree that our goals are different. I guess mine is not the search for scientific knowledge but just to improve my engine in a regular way.
Incidentally, if you really are looking for the answer to the question "is my change good or bad", IMHO your approach is still inadequate (unless I'm missing something here):
All you could safely say after your significant number of games would be "making change x to Crafty version a.0 makes version a.0 significantly stronger."
Hold on. Above you are talking about making _multiple_ changes and then testing. I don't do that for the very reason you give above. I am making one change at a time and then determining if the change is worthwhile or not.
It could very well be that in a later version of Crafty, say a.24, that change turns out to be insignificant; that Crafty a.25 would actually be stronger than a.24 if you removed the change you added going from a.0 to a.1 (or at least that by then it would make no difference).
No doubt about that. And again, the only reasonable way to recognize that you have walked into that is to test at the faster speed to see what happens. If your scores drop significantly, then you have to play detective to figure out why, and then test the resulting changes to see if the effect is gone.
For example, when the depth an engine can reach is fairly low, certain pieces of knowledge would make it stronger, so they could be made part of the eval. When the depth gets much larger, those may become irrelevant or even counter-productive (because now they are taken into account twice).
I totally disagree there, but that is up to you if you want to follow that path. Just be aware that eventually you will pay the piper however..
And I don't think any number of games will prevent this; I think the goal of "keeping everything else constant" is basically impossible at a certain level of complexity.
That is why I don't test A against A'. I test A and A' against a gauntlet of opponents, over a variety of opening positions, playing enough games to smooth out the randomness that is always present in timed games.
I have already discussed why I consider opponents of far higher and of far lower strength to be inadequate indicators.
Just playing E against E' is also not sufficient. Even just playing both against some other engine D is not sufficient, because, as has been mentioned elsewhere, engine strength is not a transitive property (i.e. it may well be that A would beat B consistently, B would beat C, and C would beat A; so to find out which is the strongest within a theoretical universe where only these three engines exist, a complete tournament would have to be played).
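A toy numerical illustration of the non-transitivity point (the pairwise expected scores below are invented purely so that A beats B, B beats C, and C beats A): every head-to-head match has a clear favourite, yet the complete round robin leaves all three level.

# pairwise expected scores per game, from the row player's point of view;
# the numbers are invented just to produce the A > B > C > A cycle
expected = {
    ("A", "B"): 0.60, ("B", "C"): 0.60, ("C", "A"): 0.60,
    ("B", "A"): 0.40, ("C", "B"): 0.40, ("A", "C"): 0.40,
}

players = ["A", "B", "C"]
# only the complete round robin shows that all three finish level,
# even though every individual pairing has a clear favourite
totals = {p: sum(expected[(p, q)] for q in players if q != p) for p in players}
print(totals)  # {'A': 1.0, 'B': 1.0, 'C': 1.0}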
The answer is simple. Play enough games. How many? Play N game matches and see how much they vary. If the results are too random, play 2N game matches and check again. Keep increasing the number of games until the result settles down with very little randomness, and now you have reached "the truth".
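A sketch of that doubling procedure, assuming a hypothetical play_match(k) helper that actually runs a k-game match and returns the points scored; the starting N, the tolerance and the cap are arbitrary choices, not anything prescribed above.

def settled_score(play_match, n=100, tolerance=0.01, max_games=100_000):
    # play_match(k) is a stand-in for running a k-game match and
    # returning the points scored by the engine under test
    prev = play_match(n) / n              # score fraction over the first N games
    while n * 2 <= max_games:
        n *= 2
        cur = play_match(n) / n
        if abs(cur - prev) < tolerance:   # result has "settled down"
            return cur, n
        prev = cur
    return prev, n                        # still too noisy; gave up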
So it would make sense to play a sufficient number of games against a sufficient number of opponents of approximately the same strength.
Thus we have a set of opponents O, and E should play a number of games against each opponent, and E' should play the same number of games against the same opponents. For practical reasons the "other version" could be included in each set of opponents. This presumably does skew the results, but arguably it is an improvement over just playing E against E'.
So what we have now is a RR tournament which includes E, E' and the set O.
What I'd like to know is, assuming E' achieves more points than E, what result should make me confident that E' is actually stronger, and this is not merely a random result?
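One possible way to answer that question for a finished tournament is a resampling check (a suggestion, not something proposed in the thread): take the per-game results of E and E' against the opponent set O, resample them with replacement many times, and see how often the resampled E' total still beats the resampled E total. A figure around 95% or higher would line up with the confidence level mentioned earlier.

import random

def confidence_e_prime_stronger(results_e, results_e_prime, trials=10_000):
    """results_*: lists of per-game scores (1.0, 0.5, 0.0) against the set O."""
    better = 0
    for _ in range(trials):
        total_e = sum(random.choices(results_e, k=len(results_e)))
        total_ep = sum(random.choices(results_e_prime, k=len(results_e_prime)))
        if total_ep > total_e:
            better += 1
    return better / trials  # fraction of resamples in which E' still comes out ahead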
Why would you believe more games will result in a smaller difference between E and E'? Does that mean that after an infinite number of games the two are exactly equal? I don't follow the inverse reasoning at all.
Without loss of generality, let's set the size of O to 8 (which would make the tourney have 10 contestants).
My first conjecture is that there is an inverse relationship between the number of games per match and the points differential between E and E'.
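A quick Monte Carlo can make the conjecture concrete and separate the two readings of "points differential" that the exchange above turns on (the 50% win probability, the absence of draws and the match lengths are arbitrary assumptions): for two engines of identical strength, the gap in score percentage shrinks as the match gets longer, while the gap in raw points does not.

import random

def average_gap(games_per_match, matches=2000):
    gap_points = gap_fraction = 0.0
    for _ in range(matches):
        # two equally strong engines, win/loss only, no draws
        score_e = sum(random.random() < 0.5 for _ in range(games_per_match))
        score_ep = sum(random.random() < 0.5 for _ in range(games_per_match))
        gap_points += abs(score_ep - score_e)
        gap_fraction += abs(score_ep - score_e) / games_per_match
    return gap_points / matches, gap_fraction / matches

for n in (50, 200, 800):
    print(n, average_gap(n))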
Opinions?
(Feel free to improve on my pseudo-mathematical terminology, and of course on other things).