Why do I sometimes get the feeling that you're reading only half of what I wrote?
bob wrote:
OK, it's becoming clearer. If you are testing with books, you are not testing _program improvements_ at all. This means that your test results are not a reliable indicator of whether or not your recent changes are good. Because random book lines will produce different results.
nczempin wrote:
No, quite the opposite. I test against programs that are not much stronger, but roughly the same strength. And my decision to release is based on the tournament results (and not any individual results).
bob wrote:
OK, if you are testing against other programs, and they are much stronger than yours, then I agree that the results will be more deterministic, particularly with respect to the game results, although there should be plenty of different moves being played. But against a stronger opponent, different moves will likely just result in a different loss. Against more equal opponents, the non-determinism should be more apparent, because rather than just different moves, same result, you should see more of different moves, different results...
If you are seeing the same exact moves, that would be an interesting issue...
This means there may very well be different results (some of the weak engines at Eden's level have quite substantial books, so that by itself is a factor).
But I'm judging the result of the overall tournament, where my hope is that some of the non-deterministic results even out. And in the end, it seems that my approach has been successful. But, again, I think Eden's situation is not representative.
I may run a test of Eden against another engine which is likely deterministic, and publish the results (the easiest would be Eden against itself, but perhaps that's not so meaningful).
Another interesting experiment would be to find out what would be needed to get Crafty to play the same game against itself over and over again. Which configuration changes would be needed, and which code changes?
Again, if you want to test _program_ improvements, you have to eliminate as many random contributors as possible. Starting with books, any kind of learning, even pondering, parallel search, etc. There will still be a random component left caused by timing differences, and since that can't be eliminated, you just have to play enough games to produce repeatable results.
If you can run the same test twice and get different results, the test is not very useful in deciding "better/worse/same". The closer your opponents are to your program's strength, the _MORE_ games you need to play to produce reliable results. And we are talking thousands, not dozens....
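To put a rough number on "thousands, not dozens", here is a minimal Python sketch of the error margin on a match result. It is only an illustration under stated assumptions: games are treated as independent, the usual logistic Elo formula is used, and the function names (`elo_from_score`, `elo_margin`) and the `draw_ratio` parameter are made up for this example, not taken from any testing tool.

```python
import math

def elo_from_score(p):
    """Elo difference implied by a score fraction p (0 < p < 1), logistic model."""
    return -400.0 * math.log10(1.0 / p - 1.0)

def elo_margin(n_games, score=0.5, draw_ratio=0.0, z=1.96):
    """Approximate 95% half-width, in Elo, of a result over n_games.

    Assumption: games are independent, with outcomes 1 / 0.5 / 0.
    draw_ratio is the assumed fraction of drawn games (hypothetical parameter).
    """
    win = score - draw_ratio / 2.0                  # probability of a win
    var = win + 0.25 * draw_ratio - score ** 2      # per-game variance of the score
    se = math.sqrt(var / n_games)                   # standard error of the mean score
    lo = elo_from_score(max(score - z * se, 1e-9))  # clamp to keep log10 defined
    hi = elo_from_score(min(score + z * se, 1.0 - 1e-9))
    return (hi - lo) / 2.0

if __name__ == "__main__":
    for n in (40, 400, 4000):
        print(n, "games: about +/-", round(elo_margin(n)), "Elo")
```

Under these assumptions, 40 games between equal opponents leave a margin of over 100 Elo, while 4000 games bring it down to roughly 11. Draws reduce the variance somewhat, so a non-zero `draw_ratio` tightens the margin, but not by enough to make dozens of games sufficient for detecting a small improvement.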
Don't get me wrong, I feel flattered that you would even enter this discussion with me, but I feel that I have to repeat myself.
(Edit: I think I should work on my communication if people are not getting what I'm trying to say. Please keep in mind that English is not my native language.)
I'm very well aware that using a book which has random factors will dramatically lower the statistical significance of any one game. Indeed, I have criticised (or, let's say, commented on) the UCI engines ligue for using a bundle of strong opponents to get my considerably weaker engine's "initial rating". As a result, it has happened numerous times that a drawish line was entered, and Eden managed to hold on against opposition more than 500 points stronger.
At the level Eden is playing, only a small minority of engines use their own book. In fact, those are the engines that are primarily responsible for the increased randomness. I still like to include them, in part because I like to improve my book and find good lines against their extensive books. _Again_, I change my book only _after_ I have determined that my new version is significantly improved, so that I am sure that it's not merely the new book that's better.
Of those factors you've mentioned, books are the only one actually present in my tournaments. Engines at this level don't ponder (or I switch pondering off), they don't have learning and they don't do any parallel search. I would agree that these would significantly add to the randomness. However, you'll have to give me better ones than those to convince me that I have to change my testing methods any time soon. Remember, my engine is currently rated around 1650!