Running matches to determine which program is strongest when they are pretty close really takes a lot of games; 500 is something like 2 orders of magnitude too few. You need 50,000 games if the difference is something like 5 ELO or less.

Jouni wrote:
I have a feeling 200 games was overkill. In the first 100 games SF won 52-48 and in the last 100 it won 50.5-49.5. Did it give us any new information? I doubt it.

Don wrote:
I expected Stockfish to win this match because it was a longer time control. A faster match Ivanhoe would have won, and a really fast match Ivanhoe would have won big.

PawnStormZ wrote:
You know Don, lately I have been wondering what, if anything, these "tests" actually tell us.
I had become interested in all the Ivanhoe versions and wanted to see which was the strongest. I ran a tourney of 2,800 games at 30 seconds + 1 second increment among 8 of the versions that I had. I took the top 4, re-ran them (15-second games), and got 2 to "test" further (B52aF and B46fC, if anyone cares).
I decided to try playing with the many parameter settings to see if I could make the engine stronger. I made a copy of B52aF and started changing settings and playing matches against a default version of the same engine. These were 400-game matches at 5 seconds + 0.25.
Most of the changes resulted in a close loss: 5 or 6 games. Then I found one that won by 18 games. I thought I was "on to something" and made further changes, which fell back to losing. I wanted to get back to my "winner" and try something else, but something told me to first try the exact same changes that won, to see whether they would win by a similar margin a second time.
Let me say that all the matches were run under exactly the same conditions, and the openings were even played in the same order. Guess what? NO improvement by my modified version the 2nd time; it even lost by 4 or 5 games! I know you "statistics guys" are probably laughing at me, but I was truly surprised (and disappointed by what this means for test results).
So, if a 400-game match could end so differently under the exact same conditions, then my 200-game match between Stockfish and IvanHoe, which ended only +5 for SF, probably does not mean anything at all! If the same match were run again, it might end up with Ivan winning by 8. Some of the changes I made to Ivan that "failed" may just as well have shown as "better" if the matches were re-run, so what did I "learn"?
Does anyone have any "facts" on how many games are really needed to get "meaningful" results? I know Bob Hyatt runs 30,000 games to test things, but even if each game lasted only 1 minute, that would take 3 weeks of 24/7 play to test just one change (lacking a cluster here :) ). Some changes (like split-depth?) probably need longer games to show their full effect.
Is there even any point in publishing results from 400- or 500-game matches, let alone the 20- or 50-game ones that are common here? Does it give us any "information" at all? :(
I shake my head when I see 10- or even 100-game matches reported here. Of course, if a program is 50 or 100 ELO stronger, you don't need many games to prove that.
But that is why programs like elostat and bayeselo report error margins. In simple terms, these try to give you a sense of how confident you can be in the results. Typically the error margin is based on a confidence of 95% - in other words, if bayeselo reports -15 +15 for the error margins and reports an ELO rating of 2850, it means that with 95% confidence the "true" rating is in the range 2850-15 to 2850+15, or 2835 to 2865. Another way to see this is that there is still a 5% chance the rating is more than 15 points different from the reported rating - it might be at least 15 ELO weaker or 15 ELO stronger.
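For anyone who wants to see where numbers like that come from, here is a rough sketch in Python using a plain normal approximation (bayeselo's actual Bayesian model is more involved, and the 60/55/85 win/loss/draw breakdown below is made up for illustration, not taken from the actual match):

```python
import math

def elo_interval(wins, losses, draws, z=1.96):
    """Approximate 95% confidence interval for the ELO difference
    implied by a single head-to-head match (normal approximation;
    a rough sketch, not bayeselo's actual model)."""
    n = wins + losses + draws
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the score (win=1, draw=0.5, loss=0).
    var = (wins * (1 - score) ** 2
           + losses * (0 - score) ** 2
           + draws * (0.5 - score) ** 2) / n
    se = math.sqrt(var / n)  # standard error of the mean score

    def to_elo(p):
        # Standard logistic mapping from expected score to ELO difference.
        return -400 * math.log10(1 / p - 1)

    return (to_elo(max(score - z * se, 1e-6)),
            to_elo(score),
            to_elo(min(score + z * se, 1 - 1e-6)))

# A hypothetical 200-game match that ends +5 for one side:
# 60 wins, 55 losses, 85 draws (invented numbers for illustration).
lo, mid, hi = elo_interval(60, 55, 85)
print(f"ELO diff ~ {mid:+.0f}, 95% interval [{lo:+.0f}, {hi:+.0f}]")
```

For a 200-game match that ends +5, the interval comes out something like 70 ELO wide, which is why +5 by itself proves very little. And since the error shrinks only with the square root of the number of games, quadrupling the games merely halves it.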
To complicate this further, you have to realize that each program in the test has an error margin too, so it's actually worse than that! Since both programs have a 5% chance of being off by 15 ELO, there is MORE than a 5% chance that the DIFFERENCE is greater than 15 ELO.
What makes people tend not to believe the error margins is the fact that a result is more likely to be off by a little than by a lot - which fools people into thinking they don't need big samples. So even though the error margin might be +/- 30 ELO, there is a pretty good chance it's off by less than 10 ELO in any given test. So when these low-sample tests return a more or less correct result, people tend to think this error margin business is just a lot of nonsense, but they are making a big mistake. It's sort of like thinking Russian Roulette is safe because you played twice and nothing bad happened.
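To put some numbers on how noisy a 400-game match between near-equal engines is (like the B52aF experiment described above), here's a quick simulation sketch. The 30% draw rate is an assumption I picked for illustration, not anything measured:

```python
import random

random.seed(1)  # fixed seed so the run is repeatable

def match_margin(games=400, draw_rate=0.30):
    """Simulate one match between two truly EQUAL engines and
    return the win margin (engine A wins minus engine B wins)."""
    margin = 0
    for _ in range(games):
        r = random.random()
        if r < draw_rate:
            continue                          # draw: margin unchanged
        elif r < draw_rate + (1 - draw_rate) / 2:
            margin += 1                       # engine A wins
        else:
            margin -= 1                       # engine B wins
    return margin

# Re-run the same "400-game match" 1000 times under identical conditions.
margins = [match_margin() for _ in range(1000)]
big_swings = sum(1 for m in margins if abs(m) >= 18)
print("largest margin seen:", max(abs(m) for m in margins))
print("matches decided by 18+ games:", big_swings, "out of 1000")
```

Even though the two simulated "engines" are exactly equal, a sizable fraction of the 400-game matches are decided by 18 games or more purely by luck - exactly the trap described in the story above.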
This makes improving a program very difficult because it takes tens of thousands of games to resolve a small improvement. That is why we have to test at very fast levels; we really have no choice unless we want to take a week or two to thoroughly test each change. Fortunately, most fast tests correlate well with long time controls - if that were not the case, you would not see such rapid progress in chess programs. If these tests did not correlate well, we would be forced to test at long time controls and progress would slow to a crawl.
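As a rough back-of-the-envelope check on the "tens of thousands" figure, here's a sketch that asks how many games it takes before a real edge pokes out past the 95% noise band. The normal approximation and the 35% draw rate are assumptions of mine; note also that this only gets the error bar down to the size of the edge itself - detecting a change reliably, rather than on a lucky run, takes several times more games, which is how you land at numbers like 30,000-50,000 for a handful of ELO:

```python
import math

def games_needed(elo_diff, draw_rate=0.35, z=1.96):
    """Roughly how many games before a real edge of `elo_diff` ELO
    sticks out past the 95% noise band (normal approximation;
    the 35% draw rate is an assumed figure, not measured)."""
    p = 1 / (1 + 10 ** (-elo_diff / 400))   # expected score per game
    # Choose win/loss probabilities consistent with the draw rate
    # so that the expected score comes out to p.
    win = p - draw_rate / 2
    loss = 1 - win - draw_rate
    # Per-game variance of the score (win=1, draw=0.5, loss=0).
    var = (win * (1 - p) ** 2
           + loss * (0 - p) ** 2
           + draw_rate * (0.5 - p) ** 2)
    # Require z * sqrt(var / n) < (p - 0.5), i.e. the 95% band
    # must be narrower than the edge itself.
    return math.ceil((z / (p - 0.5)) ** 2 * var)

for diff in (5, 10, 20, 50):
    print(f"{diff:>3} ELO: ~{games_needed(diff):,} games")
```

The numbers fall off a cliff as the gap grows: a 50-ELO edge shows up in a couple hundred games, while a 5-ELO edge needs four-digit-to-five-digit game counts before the error bars even touch it - consistent with the point above that big gaps are cheap to prove and small ones are very expensive.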