Laskos wrote: Wow, seems a serious improvement. After 400 games at 1s + 0.1s:
Code:
   Program                   Score        %     Elo    +    -   Draws
1  Komodo64 2.01 64 bit  :   262.5/400   65.6   3256   29   29  31.2 %
2  Komodo64 1.3 JA       :   137.5/400   34.4   3144   29   29  31.2 %
112 +/- 29 Elo points (95% confidence) improvement in self-play, probably a little less in a gauntlet, but the new Reptilian seems to be at the level of SF 2.01. Will leave it for more games, then a gauntlet.
Kai
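For anyone who wants to check the arithmetic behind the quoted figures, here is a minimal sketch of how an Elo difference and a 95% margin can be derived from a match score. It assumes the usual logistic Elo formula and a normal approximation for the error, which may not be exactly what the rating tool used. The 200 wins / 125 draws / 75 losses breakdown is inferred from 262.5/400 and the 31.2 % draw rate; it is not stated in the post.

```python
import math

def elo_from_score(score):
    """Logistic Elo difference implied by a score fraction in (0, 1)."""
    return 400.0 * math.log10(score / (1.0 - score))

def elo_with_margin(wins, draws, losses, z=1.96):
    """Return (elo, low, high) using a normal approximation on the score."""
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result (1, 0.5 or 0 points).
    var = (wins + 0.25 * draws) / n - score ** 2
    half_width = z * math.sqrt(var / n)
    low = elo_from_score(score - half_width)
    high = elo_from_score(score + half_width)
    return elo_from_score(score), low, high

# Assumed breakdown (inferred, not given in the post):
# 200 wins + 125 draws + 75 losses = 262.5/400 with about 31.2 % draws.
elo, low, high = elo_with_margin(200, 125, 75)
print(f"{elo:.0f} Elo, 95% interval [{low:.0f}, {high:.0f}]")
```

Under these assumptions the result is about 112 Elo with an interval of roughly 84 to 142, which is close to the reported 112 +/- 29.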
We believe it is over 100 Elo, but it does not always come out the way we think.
Komodo 1.3 was supposed to be about a 50 Elo gain in our private testing, but when it came out the gain was much less.
I think the Stockfish team also expected a lot more with version 2.1. It could be because we are all forced to test pretty fast to get statistical confidence in any changes.
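To give a rough feel for why fast testing is so tempting, here is a sketch of how many games it takes before a given Elo gain can be told apart from zero at 95% confidence. It assumes the logistic Elo formula, a normal approximation, and a fixed draw rate; the 0.312 default is simply borrowed from the match quoted above. The function names are illustrative only and not taken from any testing tool.

```python
import math

def expected_score(elo_diff):
    """Expected score fraction for a given Elo advantage (logistic model)."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def games_needed(elo_diff, draw_rate=0.312, z=1.96):
    """Games needed so the confidence interval on the score excludes 50%.

    elo_diff must be positive; draw_rate is an assumed constant.
    """
    s = expected_score(elo_diff)
    win_rate = s - draw_rate / 2.0
    # Per-game variance of the result (1, 0.5 or 0 points).
    var = win_rate + 0.25 * draw_rate - s ** 2
    return math.ceil((z / (s - 0.5)) ** 2 * var)

for gain in (100, 50, 10, 5):
    print(f"{gain:>3} Elo gain: about {games_needed(gain)} games")
```

The required number of games grows roughly with the inverse square of the gain, so a 10 Elo change needs thousands of games where a 100 Elo change needs only a few dozen.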
We now test more against foreign opponents than we used to (although we have always done some of that), and we have gradually increased our time control to give us a better picture of how we will do at "real" time controls.
Because Komodo is now so much stronger, we can no longer handicap the programs we play against, which also means it takes a lot longer to test. But I think increasing the time and using foreign opponents more has been a benefit and gives us a better picture of the actual Elo gain, despite the fact that it increases our resource burden.
Some things we observed: Critter is exceptionally strong at time controls of less than 1 minute per game, and so is Robbolito. Relative to these programs, Stockfish does not look so good until the time control increases; then it looks better. Critter appears to be the least scalable of the programs we test against, but it's difficult to say that for sure, since we are extrapolating from our own tests and from results we see on rating lists. We believe our program scales very well, and we also think Stockfish is exceptional in this area.
We would like to test against Houdini too, and it would be the last program we could reasonably handicap to advantage. However, we develop and test on Linux because it has much better behavior for massive testing, and there is no Linux version of Houdini. We can run the 32-bit version using Wine, but that throws away a lot of the benefit (although the 32-bit version is still stronger than Komodo and Stockfish, at least for now). Even so, we get crashes and time losses for Houdini.
We have observed that as the programs get stronger, fast time control tests are becoming less reliable predictors of strength. The correlation is still quite high most of the time, but not always. There are some things that clearly test well at game in 3 seconds but test poorly at game in 1 minute and it never used to be that way for us.
Don