IMHO that is the wrong way to test. In any scientific experiment, the goal is to just change one thing at a time. If you use an opening book, now the program has choices and those choices introduce a second degree of freedom into the experiment. If the program has learning, there's a third degree of freedom.nczempin wrote:Okay, here's a first shot at getting to a more formalized description, we can always use a more mathematical language once we have some agreement on the content:
I have an engine E and another engine E'. I want to determine whether E' is (staticstically) significantly stronger than E.
What does stronger mean in this context (some assumptions that I am making for my personal situation, YMWV)?
Stronger means that E' has a better result than E, one which is unlikely to be due to random factors alone (at some given confidence factor, I think 95 % is a good starting point).
For me, the goal is to test the engines "out of the box", which means their own book (or none, if they don't have one) is included. I'm looking to evaluate the whole package, not just the "thinking" part.
Yes, at times it is interesting to know whether A or A' is better than C,D,E and F. But if you are trying to compare A to A', making multiple changes to A' makes it impossible to attribute a different result to the actual cause of that result.
You miss the key point. So your program does badly in some Nunn positions. So what? You don't care about how well you play against B,C and D. You only care if the new version plays _better_ against those programs. That tells you your changes are for the better.Actually I think using random or Nunn positions could skew the results, because they would show you how your engine plays in those positions. There may very well be positions that your engine doesn't "like", and that a sensible opening book would avoid. Of course, someone whose goal is the performance under those positions (or under "all likely positions", of which such a selected group is then assumed to be a good proxy) would need to use a lot more games. So in the discussion I would like this factor to be separated.
If Eden A achieves a certain result with a certain book, I would like to know if Eden B would achieve a better result with the same book.
Don't confuse A being better than C with trying to decide whether A or A' is better. The two questions are completely different. And for me, I am asking the former, "is my change good or bad". Not "does my change make me better than B or not?"
I have already discussed why I consider opponents of far higher and of far lower strength to be inadequate indicators.
Just playing E against E' is also not sufficient. Even just playing both against some other engine D is not sufficient, because as has been mentioned elsewhere engine strength is not a transitive property (i. e. it may well be that A would beat B consistently, B would beat C and C would beat A. So to find out which is the strongest within a theoretical universe where only these three engines exist, a complete tournament would have to be played).
So it would make sense to play a sufficient number of games against a sufficient number of opponents of approximately the same strength.
Thus we have a set of opponents O, and E should play a number of games against each opponent, and E' should play the same number of games against the same opponents. For practical reasons the "other version" could be included in each set of opponents. This presumably does skew the results, but arguably it is an improvement over just playing E against E'.
So what we have now is a RR tournament which includes E, E' and the set O.
What I'd like to know is, assuming E' achieves more points than E, what result should make me confident that E' is actually stronger, and this is not merely a random result?
Without loss of generality, let's set the size of O to 8 (which would make the tourney have 10 contestants).
My first conjecture is that there is an inverse relationship between the number of games per match and the points differential between E and E'.
Opinions?
(Feel free to improve on my pseudo-mathematical terminology, and of course on other things).