Tord Romstad wrote:
bob wrote:
A large number of games, using a large number of starting positions (playing each twice with alternating colors to eliminate bias from unbalanced positions) and with a significant number of different opponents is the way to go.
It would be, in principle, but sometimes it seems like you don't realize that this just isn't possible for the vast majority of us. I use my iMac for all sorts of other things besides computer chess, and I can't run computer chess matches all the time. When I have a new experimental version I want to test, I rarely have enough CPU time for more than 100--200 games.
Fortunately, being able to verify tiny improvements isn't really necessary. Intuition is an imperfect but adequate replacement for statistically significant tests. No doubt some of the "improvements" in my program actually make it play worse, but as long as I am right more often than I am wrong, the net result in the long term is that the program keeps improving.
Tord
There's a place where we simply disagree, significantly. I can't count the number of ideas Tracy and I have tried. A recent example: the rook on the 7th rank. Older versions required that the enemy king be on the 8th rank for a rook on the 7th to be an issue, as well as there being enemy pawns on the 7th. Both of these conditions are mentioned in more than one chess book. At some point I simplified a lot of things just to get the latest version running, and then we started to go back and "correct" the oversights. Adding either the pawn test or the king-on-the-8th test _weakened_ the program. Not by a lot, but by enough that when you add up these small changes, they become a big change.
The only place intuition works well is that, all else being equal, faster is better. A 5% speed improvement is always a plus, assuming you don't give up something (an eval term removed, for example) to get that speed. But everywhere else, I find intuition is about as accurate as flipping a coin and using that to make decisions. Another example: I have been working on passed pawns and decided to try a Fruit-like approach for passer mobility. The overhead to determine whether a pawn can safely advance (when it is not blockaded) is not very much. Yet adding this produced worse results by a small amount. And yes, I tried testing the square in front of the pawn, the two squares in front, etc. Each played slightly worse than the normal version that just scores passers based on their rank of advance and whether the square in front is empty or occupied.
I'm very skeptical of "intuition" since no evaluation term is free. I would bet that 75% of our changes the past year, each of which sounded perfectly reasonable in discussions, turned out to be either "no change" or "slightly weaker" in real testing.
I realize playing large numbers of games is difficult, if not impossible, for most. But the alternative is to accept bad changes and reject good changes based purely on a coin flip, which is essentially what a 200-game match is. You only have to run a 200-game match between A and B twice and compare the results to see what I mean, with no changes to A or B between the two matches.
I've previously posted a ton of 80-game matches between two identical opponents to show the randomness in such a small sample. It really is there...