Rebel wrote: I'd like to present a page for starters on how to test a chess engine with limited hardware. I am interested in some feedback for further improvement.
http://www.top-5000.nl/tuning.htm
Hi Ed,
All your pages are valuable resources to the computer chess community. Another good page.
I have a comment on one thing you said:
Code:
Understanding LOS
LOS means: Likelihood Of Superiority.
A LOS of 95% means that the match result statistically gives you 95% certainty that the new version is superior. It says nothing about the Elo gain, just that you have 95% certainty the version is better by more than 0.00001 Elo.
Among programmers there is by now, I think, a kind of consensus that 95% is the norm for keeping a change, provided enough games are played. "Enough" is still undefined, but 5000 games seems to be the lower limit currently, and I tend to agree.
The latter automatically brings us to the final and important chapter: when do you terminate a running match? Some advice:
Don't give up too early on a version. A 40-60 score after 100 games statistically means nothing. If after 500 games the result is still 40%, it's high time to terminate.
Don't accept changes, even with a LOS of 95%, before 1000 games have been played.
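To make Ed's LOS definition concrete: under the usual normal approximation, LOS can be computed directly from the win and loss counts (draws carry no information about which version is stronger, so they are ignored). A minimal Python sketch, with the function name and example numbers my own, not from Ed's page:

Code:

import math

def los(wins, losses):
    """Likelihood of superiority via the normal approximation.
    Draws say nothing about which version is stronger, so only
    decisive games enter the formula."""
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * (wins + losses))))

# The same 52% score becomes far more convincing with more games:
print(f"{los(260, 240):.3f}")     # 52% over 500 games  -> ~0.814
print(f"{los(2600, 2400):.3f}")   # 52% over 5000 games -> ~0.998

This also shows why "enough games" matters for the 95% norm: the same winning percentage that is inconclusive after 500 games is nearly decisive after 5000.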
Larry and I have come to appreciate that the biggest factor in making steady progress is more than just the error bars. If you test 20 changes and honor the error margins, you can still easily produce a regression when comparing the initial version to the version after the 20 changes. We assume you kept some of those changes and rejected others. I don't know if this is addressed anywhere in scientific testing methodology, but chaining "improvements" together like this is not a very reliable way to test. I know this goes way beyond the scope of your article.
The reason is that, given some number of candidate versions to test, your success will have a lot more to do with the percentage of those candidates that happen to represent good changes. And it's not simply because you make faster progress when you have more good changes, although that of course is always good.
Suppose you have 20 versions and 10 of them are improvements. Somebody else starts with the same program and makes 50 versions that also contain 10 improvements. We assume that the 10 improvements in both cases are of equal magnitude. You test your 20, one version at a time, and are very careful to honor the error margins and use good testing methodologies. The "other guy" tests his 50 versions one at a time, and he is very careful to obey the error margins and use good testing methods. In theory both of you reject all the bad versions, keep the 10 good versions, and in the end come up with the same total improvement. Right?
In PRACTICE it doesn't work that way. In PRACTICE you will both conclude that some of the regressions are actually improvements, because that is just the nature of the beast - we are operating within a certain noise threshold, and it becomes impractical to run hundreds of thousands of games to resolve tiny differences. The guy who had to test 50 versions will have kept more regressions and will have made less total progress.
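Don's argument is easy to check with a toy Monte Carlo simulation. The sketch below is my own construction, not his and Larry's actual setup: every candidate change is worth exactly +3 or -3 Elo, each is tested in a simulated 1000-game match without draws, and a change is kept when LOS exceeds 95%.

Code:

import math, random

def los(wins, losses):
    """LOS via the normal approximation, as in the earlier sketch."""
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * (wins + losses))))

def test_batch(true_elos, games=1000):
    """Test each candidate in a simulated match, keeping it when LOS
    exceeds 95%.  Returns the summed TRUE Elo of everything kept,
    which is what the engine actually gains."""
    gain = 0.0
    for elo in true_elos:
        p = 1.0 / (1.0 + 10.0 ** (-elo / 400.0))  # expected score from Elo
        wins = sum(random.random() < p for _ in range(games))
        if los(wins, games - wins) > 0.95:
            gain += elo
    return gain

random.seed(42)
trials = 300
for total in (20, 50):
    candidates = [+3.0] * 10 + [-3.0] * (total - 10)  # 10 genuine improvements
    avg = sum(test_batch(candidates) for _ in range(trials)) / trials
    print(f"{total} candidates (10 good): average true gain {avg:+.2f} Elo")

With these invented numbers the 20-candidate tester nets a small positive gain on average, while the 50-candidate tester banks enough false positives to actually regress - exactly the effect Don describes.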
What is the solution? One solution is to make sure that at least half of your ideas are good.
Other than that, I have no solution, but you can minimize the problem, and as a result of these findings Larry and I have modified our testing methodology. We now often run over 100,000 games and rarely fewer than 50,000. We don't keep anything that is a close call unless we have strong reason to believe that it cannot be bad. For example, if we find a way to gain a slight speedup without risk, we know it is a good thing even if the test is really close. The test in this case is more of a sanity check against bugs.
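To see why the game counts have to be that large, here is a rough back-of-the-envelope calculation of the 95% error bar on a measured Elo difference between near-equal engines. The draw ratio and the normal approximation are my assumptions, not figures from Don's post:

Code:

import math

def elo_error_bar(games, draw_ratio=0.4, z=1.96):
    """Approximate 95% error bar in Elo for a match between
    near-equal engines, assuming the given draw ratio."""
    win = lose = (1.0 - draw_ratio) / 2.0
    variance = win * 1.0 + draw_ratio * 0.25 - 0.25   # per-game score variance
    sigma = math.sqrt(variance / games)                # std dev of match score
    elo_per_score = 400.0 / (math.log(10.0) * 0.25)    # d(Elo)/d(score) at 50%
    return z * elo_per_score * sigma

for n in (1000, 5000, 50000, 100000):
    print(f"{n:>7} games: +/- {elo_error_bar(n):.1f} Elo")

# roughly: 1000 -> +/-17, 5000 -> +/-7.5, 50000 -> +/-2.4, 100000 -> +/-1.7

A one or two Elo difference simply cannot be resolved with a few thousand games, which is why close calls are only kept when there is an independent reason to trust them.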
A good percentage of our ideas ARE in fact good ideas and that has enabled us to make remarkable progress - but you cannot assume it will always be that way.
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.