bob wrote:
nczempin wrote:
A little perspective on weak engines; I think the situation is different once the strength exceeds about 2000 (on any semi-realistic scale).
My engine is pretty weak, and at this level the variability depends on a few factors; I'll list the ones that come immediately to mind.
1. Engines that have no book, or (like mine) always choose the same line given the same previous moves, will play exactly the same game each time they meet. So the variability will be zero if you play two games with colours swapped.
This is wrong. I can produce thousands of games that start from Albert Silver's 40 starting positions. I can play the same position over and over and not get the same results every time. In fact, I can show you games played from the same starting position where I win all 4 and then lose 2 of them the second time around. You should test this and you will see what I mean.
Well I did test this, and so far I am usually right in this assessment:
When two weak engines that have no book (or have a "deterministic book", so to speak, always playing the same moves if the other side does too) play each other, the results are overall the same. Being weak usually also means there are no non-deterministic factors involved; I can't speak for stronger engines, and most surely not for any SMP ones. I am only talking about my conditions, which involve Arena; I can't speak for e.g. winboard, polyglot and all that other stuff that might introduce non-determinism (oh, and I use deterministic seeds for the Zobrist hashes too, btw).
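Just to illustrate what I mean by deterministic Zobrist seeds (a minimal sketch, not Eden's actual code; the class and constant names are made up): seeding the key generator with a fixed constant makes the hash keys, and therefore the hash-table behaviour, identical from run to run.

    import java.util.Random;

    // Sketch: Zobrist keys generated from a fixed seed, so every run of the
    // engine produces exactly the same hash values (and thus the same
    // transposition-table behaviour and, ultimately, the same moves).
    // Castling and en-passant keys are omitted for brevity.
    public final class ZobristKeys {
        // 12 piece types (6 per colour) x 64 squares, plus a side-to-move key.
        public static final long[][] PIECE_SQUARE = new long[12][64];
        public static final long SIDE_TO_MOVE;

        static {
            Random rng = new Random(20070101L); // fixed seed -> deterministic keys
            for (int piece = 0; piece < 12; piece++) {
                for (int square = 0; square < 64; square++) {
                    PIECE_SQUARE[piece][square] = rng.nextLong();
                }
            }
            SIDE_TO_MOVE = rng.nextLong();
        }

        private ZobristKeys() { }
    }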
At higher levels this behaviour is less desirable; in fact, non-determinism is good because your opponents have a harder time preparing against you. But at these low levels the competition is not so strong, and e.g. having reproducible behaviour to help find bugs is much more important.
If you play a weak engine against a strong engine, what you say might actually happen, but then the weaker engine is going to lose no matter what. And there is no way to measure improvement if you can't win some games.
But your basic premise about an opening book being needed to avoid repeated games is simply wrong. If you want data, I can send you results from a million games or so run on my cluster here. The variability (non-deterministic behavior) is absolutely remarkable.
I don't think I said that a book is needed to avoid repetitions. After all, Eden has a book (albeit a deterministic one, so to speak). If anything, I said that a non-deterministic book will lead to more variability. Your statement is not logically equivalent to mine.
2. Most engines at this weak level are comparatively stronger in calculation than in evaluation. They regularly play 1. Nh3 or just make more or less random pawn moves as long as there's no material to be gained. If your engine (or, let's say, my engine) plays those engines (which also have no book), I can easily "prepare" my book against them. My engine will seem stronger than it really is if it can get into positions where their superior tactics are no longer sufficient.
A good reason for not using a book. It introduces yet another random aspect into your games that is bad for testing/evaluation purposes. Yes, you need a good book for tournaments. No, you don't need any kind of a book to evaluate changes. Just start the games at the same point and see if your changes produce better results. A book adds too much random noise, and when you factor in learning, it becomes useless for evaluating changes to the program itself.
Again, remember what kind of book Eden currently has: For any position there's always just one move, or none. I will change this in the future, and I am certain that my variability will skyrocket. This is one reason why I haven't done it yet.
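To make the "one move or none" point concrete, here is a minimal sketch of what such a deterministic book amounts to (illustrative only, not Eden's actual data structure):

    import java.util.HashMap;
    import java.util.Map;

    // Sketch of a "deterministic book": at most one reply per position, so the
    // same sequence of opponent moves always produces the same book move.
    public final class DeterministicBook {
        // Keyed here by the move sequence so far ("e2e4 e7e5 ...");
        // a FEN string would work just as well.
        private final Map<String, String> bookMoves = new HashMap<>();

        public void add(String movesSoFar, String reply) {
            bookMoves.put(movesSoFar, reply);
        }

        // Returns the single book move, or null when we are out of book.
        public String lookup(String movesSoFar) {
            return bookMoves.get(movesSoFar);
        }
    }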
I also haven't mentioned this (or perhaps I have; in that case I'd like to repeat/stress it): when I update Eden's book, I always do so _after_ I am satisfied that the new version is stronger. So in effect I'm already doing what you're suggesting: starting the games at the same point (at least with those opponents that are deterministic; still the majority, but it seems the percentage is decreasing as Eden moves up the ladder).
3. At this weak level, improvements are likely to be fairly significant. We are usually concerned with huge bugs, plus techniques that have been proven.
So for my engine, where I explicitly have the goal that the new version should be _significantly_ stronger than the previous one, but not exceedingly so, this is the technique I use (and so far it has been pretty consistent):
I find a group of engines that are reasonably close in strength (this selection gets better with time) and play a double-round tournament among them, my old version, and the new version that I consider worth testing for sufficient strength.
Usually I then decide that my version is sufficiently strong if it has at least one other engine placed between it and the old version. Sometimes it places much higher.
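As a rough sketch of that acceptance criterion (the class name and the standings list are made up for illustration, not how my tournament manager actually works): take the final standings of the double-round tournament, best placement first, and accept the new version only if at least one other engine finished between it and the old one.

    import java.util.List;

    // Sketch of the acceptance test: given final standings (best first), the
    // new version passes only if it placed above the old one with at least
    // one other engine in between.
    public final class AcceptanceTest {
        public static boolean newVersionAccepted(List<String> standings,
                                                 String newVersion,
                                                 String oldVersion) {
            int newPlace = standings.indexOf(newVersion);
            int oldPlace = standings.indexOf(oldVersion);
            // Lower index = better placement; require a gap of at least 2.
            return newPlace >= 0 && oldPlace >= 0 && oldPlace - newPlace >= 2;
        }
    }

For example, with standings ["EngineA", "Eden-new", "EngineB", "Eden-old"] the new version would be accepted; with ["EngineA", "Eden-new", "Eden-old", "EngineB"] it would not.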
And if you don't play thousands of games, sometimes the results are going to be exactly the opposite of what is really going on...
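For perspective on the sample-size point, here is a rough back-of-the-envelope sketch (standard formulas; the numbers are made up purely for illustration): the standard error of a match score only shrinks with the square root of the number of games, so small samples leave very wide error bars.

    // Rough sketch: score and standard error of a match result. With few
    // games the error bars dwarf typical version-to-version improvements.
    public final class MatchStats {
        public static void main(String[] args) {
            int wins = 9, draws = 4, losses = 7;          // e.g. a 20-game match
            int games = wins + draws + losses;
            double score = (wins + 0.5 * draws) / games;  // 0.55 here

            // Sample standard deviation of the per-game score, then the
            // standard error of the mean.
            double variance = (wins * Math.pow(1.0 - score, 2)
                             + draws * Math.pow(0.5 - score, 2)
                             + losses * Math.pow(0.0 - score, 2)) / (games - 1);
            double stdError = Math.sqrt(variance / games);

            System.out.printf("score %.3f +/- %.3f (1 sigma)%n", score, stdError);
            // Prints roughly "score 0.550 +/- 0.102": a 55% result over 20
            // games is statistically indistinguishable from 50%.
        }
    }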
Well, you're invited to test each version of Eden against the older versions (and against other engines too, of course, because each new version's opening book is likely to have an improved "main line" when it plays against itself) and see if your results are different from mine, i.e. whether you find an older version that is stronger than some newer version at the significance level you've chosen.
This may actually be the case for the change from 0.0.11 to 0.0.12, where I explicitly mentioned that the newer version may be only "slightly" better than the older one. Some people have told me that 11 is better, or at least that 12 is not stronger. I was willing to risk that, because there was a lot of time and bad things between 11 and 12, and I just had to get something out again. I was actually surprised that the results did show it to be better.
So far I have had no reason to take back my claim that for me at least, my method is working.
BTW you should also note that there are many different versions of Eden even for just one release:
1. The basic Java version
2. The "server" version, which uses the JIT VM, which even fewer people have installed than just a JRE for 1)
3. The "JA" version, compiled by Jim Ablett. Although it does have really weird behaviour regarding the nps etc. reporting (that I can't explain for now; it's "lying") it is significantly stronger than the basic version. There are theoretically sound reasons for that, plus my few experiments have confirmed the theory.
And for each of these versions there is a WB version and a UCI version. They actually play differently (the UCI version may lose a little bit of time when having to go through an adapter; the WB version doesn't claim draws, and this has actually led to losing a drawn game because I threw an IllegalStateException when the position was repeated for the fourth time).
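For illustration only (a sketch of the kind of fix involved, not Eden's real code): detect the repetition from the game's hash history and return a draw score, or claim the draw in WB mode, instead of throwing.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Sketch: count how often each position (hashed with the Zobrist keys)
    // has occurred; treat the third occurrence as a draw instead of throwing
    // an IllegalStateException.
    public final class RepetitionCheck {
        public static final int DRAW_SCORE = 0; // score to return instead of crashing

        public static boolean isDraw(List<Long> positionHashHistory) {
            Map<Long, Integer> counts = new HashMap<>();
            for (long hash : positionHashHistory) {
                int seen = counts.merge(hash, 1, Integer::sum);
                if (seen >= 3) {
                    return true; // threefold repetition: a draw, not an error
                }
            }
            return false;
        }
    }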
But for my tests, I always keep the conditions as much the same as before, precisely to reduce variability.
And even when I have received a result that superficially indicates the strength has increased, I check the games to see if there are any anomalies. When there's e.g. just one engine in between the old version's placement in the tourney and the new one's, I am more doubtful than when version A came last out of 15 engines and the new one takes first or even third place.