Generally, I refer to the lists as ranking lists instead of rating lists, simply because the numbers aren't real ratings (the lists all disagree on the ratings, but the rankings are close). Also, the conditions of the games are such that the ratings are inaccurate relative to human playing conditions. Some of the conditions forced by the lists have impacted the development of computer chess in a negative way. On top of that, the conditions favor some engines and disfavor others, which adds further inaccuracy. In short, the lists are not good at deciding which engine is best, or at producing accurate ratings. They are good at saying which programs are in the top 50, or the second 100, and so on. They certainly don't say how strong an engine is going to be when installed on your machine at home, or which engine is best for analyzing and helping you prepare for chess tournaments.
Now, I'll explain each point and then some.
1) Inaccurate ratings.
The computers used may not be the strongest, and this is worked around by adjusting the TC relative to some old machine. Yes, this has merit for testing and creating rankings, but not ratings. We all know that if you speed up the program and/or the computer, you are likely to get a stronger program.
2) Inaccurate relative to human playing conditions:
Human tournaments typically use long time controls where you can spend 3 minutes per move, and sometimes more. Compare that to some groups that run TCs of 40 moves in 20 minutes, which averages 30 seconds per move, easily a 6x faster TC. Increased time to think on each move improves an engine's strength of play. Add to that the fact that some engines are tuned to play better at fast time controls, which pushes them higher up the lists than normal.
Yes, it is possible to tune an engine so that it plays 50 to 200 Elo stronger at fast time controls than at slow time controls. In fact, it is fairly easy to do, but it reduces the playing strength at long time controls.
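To make the time-control arithmetic above concrete, here is a minimal Python sketch; the 3-minutes-per-move figure for human play is just the rough estimate used above.
Code:
# Rough thinking time per move: human tournament vs. a fast testing TC.
human_seconds_per_move = 3 * 60        # roughly 3 minutes per move under human conditions
list_seconds_per_move = 20 * 60 / 40   # 40 moves in 20 minutes = 30 seconds per move

print(list_seconds_per_move)                           # 30.0
print(human_seconds_per_move / list_seconds_per_move)  # 6.0, i.e. about a 6x faster TC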
3) Some lists force a generic opening book on the engine:
Many of us have developed programs to play with certain playing styles, so a mismatched book weakens the program by 100 Elo or more. If you want to test opening book theory, then don't do it with 100 or more different programs while calculating engine ratings. The two practices conflict with each other.
Some of the testers have complained that many of the engines are starting to play alike. Of course they are: if you force generic books, then we have to tune the engines to some generic standard.
4) Some lists adjudicate games early to speed up testing:
This practice inflates the ratings of engines that are weak in the endgame by not allowing the games to continue. I've seen programs that couldn't finish off a mate due to bugs. Adjudicating the game early, because one program is way ahead, will not catch that type of bug, thus awarding a win to what would have been a drawn game.
I've seen many bugs from engines in live computer chess tournaments that don't show up for the rating/ranking groups.
5) How strong is a program when I get it installed on my machine?
The key to that is speed. We all know that a 2x speed-up in the engine will produce a 70 to 100 Elo gain in performance, depending on the engine. The same applies to the hardware: a 2x speed-up in hardware will give the same Elo gain for many engines, but not all. Also, a 2x increase in the TC, from 40 in 20 to 40 in 40, will produce increased playing strength for most engines.
Here is a formula for approximating the performance under human conditions with your computer.
Code:
Approximate Elo = CCRL Elo
+ 80 Elo for each 2x speed up in your hardware over theirs
- 80 Elo if you are forced to use 32 bit if they are testing 64 bit
+ 80 Elo for each 2x increase in the time control
+ 0 to 100 Elo for a book matching the engine
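For what it's worth, here is the same approximation as a small Python sketch. The 80 Elo per doubling and the 0 to 100 Elo book bonus are just the estimates above, and log2 handles speed-ups that aren't exact powers of two.
Code:
from math import log2

def approximate_elo(ccrl_elo, hardware_speedup=1.0, tc_factor=1.0,
                    forced_32_bit=False, book_bonus=0):
    # hardware_speedup: your hardware speed divided by the testers' hardware speed
    # tc_factor: your time control divided by the list's time control
    # book_bonus: 0 to 100 Elo for an opening book matched to the engine
    elo = ccrl_elo
    elo += 80 * log2(hardware_speedup)   # +80 Elo per 2x hardware speed-up
    elo += 80 * log2(tc_factor)          # +80 Elo per 2x longer time control
    if forced_32_bit:
        elo -= 80                        # penalty for a 32-bit build when the list tested 64-bit
    elo += book_bonus
    return elo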
How strong is Sloppy 0.2.3?
If your computer is the same as the average of the CCRL list, it can run a 64-bit program, and you are using a human TC of 40 moves in 120 minutes, then
Code:
Approximate Elo = 2714 + 0 - 0 + 200 = 2914.
Now let's say you have a computer 2x faster than, say, an i5 750. Add another 80 Elo, which makes Sloppy a 2994 program. If Sloppy could use a more specialized book, then it could be at 3100 Elo. Again, I am making an estimate relative to typical human tournament conditions.
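Using that sketch, the Sloppy numbers work out about the same; the output differs slightly because log2(6) gives roughly 207 Elo for the TC term instead of the rounded 200.
Code:
# Sloppy 0.2.3 at CCRL 2714, same hardware as the testers, 64-bit,
# 40/120 instead of 40/20 (a 6x longer time control), no book bonus:
print(round(approximate_elo(2714, tc_factor=6)))                                       # 2921
# With hardware 2x faster and a book matched to the engine (+100 Elo):
print(round(approximate_elo(2714, hardware_speedup=2, tc_factor=6, book_bonus=100)))   # 3101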
Wouldn't the improvement be the same for all the top engines? Not really. We've been noticing (for some time now) something that was predicted decades ago: diminishing returns on performance as speed is increased. I think this is seen more in roughly the top 20 engines, so increases in speed and time controls help the lower-ranked engines more than the upper-ranked ones. Thus, Sloppy could be nearly as good as Crafty running 4 CPUs.
Interestingly, much of this may be due to the much heavier pruning being done today than 10 years ago. The more pruning an engine's search does, the less a speed-up seems to help, and the less extra processors seem to help.
Crafty 23.3 64-bit has a 2868 CCRL rating, and it only gains 83 Elo when going to 4 processors, which is about a 3.1x speed-up. In the 90's, a 3.1x speed-up would get you 300 Elo.
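Put in per-doubling terms, those numbers look like this; a back-of-the-envelope sketch assuming 4 processors give roughly the 3.1x effective speed-up stated above.
Code:
from math import log2

doublings = log2(3.1)          # about 1.63 doublings of effective speed

print(round(83 / doublings))   # about 51 Elo per doubling for Crafty 23.3 today
print(round(300 / doublings))  # about 184 Elo per doubling from the same speed-up in the 90's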
6) What is the best engine for analysis for me?
That depends heavily on your playing style and the openings that work best for you. An engine that prefers the openings you play will be better than others for helping you navigate the middlegame, assuming it isn't significantly weaker than the other engines. Of course, this impediment can be overcome by using one program that fits your playing style and another that is high on the rankings. That trick is probably best if you only have access to slow computers; if you can access super fast ones, then the need is reduced.
7) Which engine is the best?
NOBODY KNOWS!
Look at this tournament: http://www.harald-faber.de/thueringen/2 ... n2011.html
Deep Junior performed better than Houdini. Junior had a 2x hardware speed advantage, but that is not enough to overcome the nearly 200 Elo rating disadvantage that the CCRL list shows. When paired against each other, Junior wins. Not only that, Houdini draws nearly all the other engines.
Sure, this could be partly due to the statistical noise from a small number of rounds, but the tournament conditions had a 5x longer time control than is used on some lists, great hardware, and no standardized opening books.
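The rough arithmetic behind that claim, using the 80 Elo per hardware doubling estimate from above:
Code:
junior_hardware_edge = 80 * 1            # one doubling of hardware for Junior, roughly +80 Elo
ccrl_gap = 200                           # approximate CCRL rating gap in Houdini's favor
print(ccrl_gap - junior_hardware_edge)   # about 120 Elo that the hardware edge does not explain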
8) Creating an accurate computer chess engine RATING list for software programs is an IMPOSSIBLE task and should not be represented as possible.
The only group that has a chance of creating a reasonable computer rating list is the guys who rate the dedicated chess computers. Why? Because nothing changes over time. Assuming the machine doesn't break, it plays the same year after year after year.
There is merit in what the guys behind the lists are doing, so thank you fellows.
My main point here is how to properly use the rating lists. Looking at the top engine on a list and claiming that one is the best is nowhere near correct.


