Ranking lists vs Rating lists vs what Engine is the best

CRoberson · Post by **CRoberson** » Tue Nov 29, 2011 9:49 am

Before I start, I am not bashing the lists (well maybe a little). I am mostly saying the way people are viewing them is wrong. Also, I use CCRL as an example, but the issues pertain to all of the lists. I am not picking on CCRL specifically, I just look at their list more than the rest.

Generally, I refer to the lists as ranking lists instead of rating lists, simply because they aren't real ratings (the lists all disagree on the ratings, but their rankings are close). Also, the conditions of the games are such that the ratings are inaccurate relative to human playing conditions. Some of the lists' forced conditions have impacted the development of computer chess in a negative way. On top of that, the conditions favor some engines and disfavor others, creating increased inaccuracy. This means the lists are not good at deciding which engine is best, or at providing ratings! They are good at saying which programs are in the top 50 or the second 100 ... . Certainly, they don't say how strong an engine is going to be when installed on your machine at home, or which engine is best for analyzing and helping you prepare for chess tournaments.

Now, I'll explain each point and then some.

1) Inaccurate ratings.

The computers used may not be the strongest and this is worked around by adjusting the TC relative to some old machine. Yes, this has merit for testing and creating rankings, but not ratings. We all know if you speed the program and/or the computer up, you are likely to get a stronger program.

2) Inaccurate relative to human playing conditions:

Human tournaments typically use long time controls where you can use 3 minutes per move and sometimes more. Compare that to some groups which run TCs of 40 moves in 20 minutes, which is an average of 30 seconds per move: easily a 6x faster TC. Increased time to think on each move improves the strength of play for engines. Add to that the fact that some engines are tuned to play better at fast time controls, which pushes them higher up the lists than normal.

Yes, it is possible to tune an engine so that it plays 50 to 200 Elo stronger at fast time controls than at slow time controls. In fact, it is fairly easy to do, but it reduces the playing strength at long time controls.

3) Some lists force a generic opening book on the engine:

Many of us have developed programs to play with certain playing styles, so a mismatched book weakens the programs by 100 Elo or more. If you want to test opening book theory, then don't do it with 100 or more different programs while calculating engine ratings. The two practices conflict with each other.

Some of the testers have complained that many of the engines are starting to play alike. Of course they are: if you force generic books, then we have to tune the engine to some generic standard.

4) Some lists adjudicate games early to speed up testing

This practice increases the ratings of engines that are weak in the endgame by not allowing the games to continue. I've seen programs that couldn't finish off a mate due to bugs. Adjudicating the game early because one program is way ahead will not catch that type of bug, thus awarding a win in what would have been a drawn game.

I've seen many bugs from engines in live computer chess tournaments that don't show up for the rating/ranking groups.

5) How strong is a program when I get it installed on my machine?

The key to that is speed. We all know that a 2x speed up in the engine will produce 70 to 100 Elo gain in performance depending on the engine. The same applies to the hardware: a 2x speed up in hardware will make the same Elo gain for many engines, but not all. Also, a 2x increase in the TC from 40 in 20 to 40 in 40 will produce increased playing strength for most engines.

Here is a formula for approximating the performance under human conditions with your computer.

   Approximate Elo = CCRL Elo 
           + 80 Elo for each 2x speed up in your hardware over theirs
           - 80 Elo if you are forced to use 32 bit if they are testing 64 bit
           + 80 Elo for each 2x increase in the time control
           + 0 to 100 Elo for a book matching the engine

So, let's look at an example.

How strong is Sloppy 0.2.3?
If your computer is the same as the average of the CCRL list, it can run a 64 bit program, and you are using a human TC of 40 moves in 120 minutes, then

    Approximate Elo = 2714 + 0 - 0 + 200 = 2914.

The 200 comes from about 2.5 doublings (roughly 6x) of the TC at 80 Elo each.

Now let's say you have a computer 2x faster than, say, an i5 750. Add another 80 Elo, which makes Sloppy a 2994 program. If Sloppy could use a more specialized book, then it could be at 3100 Elo. Again, I am making an estimate relative to typical human tournament conditions.
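
As a rough sketch, the formula above can be written as a small Python function. The function name `approximate_elo`, its parameter names, and the use of a base-2 logarithm to count doublings are assumptions made for illustration; the 80 Elo per doubling and the 0 to 100 Elo book bonus are the figures from this post.

```python
import math

ELO_PER_DOUBLING = 80  # figure used in the post's formula (assumed constant here)


def approximate_elo(list_elo, hw_speedup=1.0, tc_ratio=1.0,
                    forced_32bit=False, book_bonus=0):
    """Estimate Elo under your own conditions from a rating-list Elo.

    hw_speedup: your hardware speed divided by the list's hardware speed.
    tc_ratio: your time per move divided by the list's time per move.
    forced_32bit: True if you must run 32 bit while the list tested 64 bit.
    book_bonus: 0 to 100 Elo for an opening book matched to the engine.
    """
    if hw_speedup <= 0 or tc_ratio <= 0:
        raise ValueError("speed and time-control ratios must be positive")
    if not 0 <= book_bonus <= 100:
        raise ValueError("book_bonus is expected to be between 0 and 100")
    elo = list_elo
    elo += ELO_PER_DOUBLING * math.log2(hw_speedup)
    elo += ELO_PER_DOUBLING * math.log2(tc_ratio)
    if forced_32bit:
        elo -= ELO_PER_DOUBLING
    return elo + book_bonus


# Sloppy 0.2.3 example: 2.5 doublings of the time control, same hardware.
print(round(approximate_elo(2714, tc_ratio=2 ** 2.5)))  # 2914
# Same example on hardware twice as fast.
print(round(approximate_elo(2714, hw_speedup=2, tc_ratio=2 ** 2.5)))  # 2994
```

Passing an exact ratio of 6 instead of 2.5 doublings gives about 207 Elo for the time control rather than the rounded 200, so the estimate is only as precise as the 80 Elo per doubling figure itself.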

Wouldn't the improvement be the same for all the top engines? Not really. We've been noticing (for some time now) something that was predicted decades ago: the diminishing returns on performance as speed is increased. I think this is seen more in roughly the top 20 engines. Thus, increases in speed and time controls help the lower ranked engines more than the upper ranked ones, and Sloppy could be nearly as good as Crafty running 4 CPUs.

Interestingly, much of this may be due to the much heavier pruning being done today than 10 years ago. The more pruning an engine's search is capable of, the less a speed up seems to help, and the less additional processors seem to help.

Crafty 23.3 64-bit has a 2868 CCRL rating and it only gains 83 Elo when going to 4 processors, which is about a 3.1x speed up. In the 90s, a 3.1x speed up would get you 300 Elo.
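
To put those two figures on the same scale as the "Elo per 2x speed up" rule of thumb, a small sketch can divide the gain by the number of doublings. The helper name `elo_per_doubling` is an assumption for illustration, and it assumes the gain is spread evenly over each doubling.

```python
import math


def elo_per_doubling(elo_gain, speedup):
    """Elo gained per 2x speed up, given a total gain and a speed-up factor."""
    if speedup <= 1:
        raise ValueError("speedup must be greater than 1")
    return elo_gain / math.log2(speedup)


# Crafty 23.3 figures quoted above: 83 Elo for roughly a 3.1x speed up.
print(round(elo_per_doubling(83, 3.1), 1))
# The 1990s figure quoted above: 300 Elo for the same 3.1x speed up.
print(round(elo_per_doubling(300, 3.1), 1))
```

With these inputs the result is about 51 Elo per doubling for Crafty today against about 184 Elo per doubling for the 1990s figure, which is the diminishing return described above.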

6) What is the best engine for analysis for me?

That heavily depends on your playing style and the openings that work best for you. An engine that prefers the openings you play will be better than others for helping you navigate the middle game, assuming it isn't significantly weaker than other engines. Of course, this impediment can be overcome by using one program that fits your playing style and another that is high on the rankings. That trick is probably best if you only have access to slow computers. If you can access super fast ones, then the need is reduced.

7) Which engine is the best?

NOBODY KNOWS!

Look at this tournament: http://www.harald-faber.de/thueringen/2 ... n2011.html

Deep Junior performed better than Houdini. Junior had a 2x hardware speed advantage, but that is not enough to overcome the nearly 200 Elo rating disadvantage that the CCRL list shows. When paired against each other, Junior wins. Not only that, Houdini draws nearly all the other engines.

Sure, this could be partly due to the statistical issues of a small number of rounds, but the tournament conditions had a 5x longer time control than is used on some lists, great hardware and no standardized opening books.

8) Creating an accurate computer chess engine RATING list for software programs is an IMPOSSIBLE task and should not be represented as possible.

The only group that has a chance of creating a reasonable computer rating list are the guys that rate the dedicated chess computers. Why? Because nothing changes over time. Assuming the machine doesn't break, it plays the same year after year after year.

There is merit in what the guys behind the lists are doing, so thank you fellows.

My main point here is how to properly use the rating lists. Looking at the top engine on a list and claiming that one is the best is nowhere near correct.

rvida · Post by **rvida** » Tue Nov 29, 2011 10:04 am

CRoberson wrote: ...
Yes, it is possible to tune an engine so that it plays 50 to 200 Elo stronger at fast time controls than at slow time controls. In fact, it is fairly easy to do,
...

Fairly easy??? Well, download an open-source engine (for example Stockfish) and tune it to be 50-200 ELO stronger at fast time controls then show us your results. Good luck.

Werewolf · Post by **Werewolf** » Tue Nov 29, 2011 11:30 am

Hi Richard,

Can you give an update on how your work is progressing?

Uri Blass · Post by **Uri Blass** » Tue Nov 29, 2011 12:19 pm

I read and I totally disagree with Charles Roberson.

I believe that pruning does not make engines weaker at long time controls.
The only reason that weak engines may earn more from more time is the simple fact that the level of play increases and there are more draws.

If you already play perfectly in blitz then you cannot improve, and the best engines are already almost perfect at blitz, so they do not have much room to improve at long time controls.

Practically, the strong engines earn more from time if you start at the same playing strength, and
I think that games with a time handicap are going to prove the point.

If the CCRL tests some weak engines when they get more time (200/40 against 20/40 for the other engines, or 20/40 against 2/40 for the other engines when we talk about the blitz list), then the weak engines are going to score worse at the longer time control.

Unfortunately, I am afraid that the CCRL is not interested in this type of testing, but I remember that I tested Movei against Rybka2.3.2a 32 bit under Arena when I gave Movei a 10:1 time advantage, and Rybka2.3.2a did better when the time control was longer.

Note that Movei, based on the CCRL, is an engine that earns more from time (2769 in the 40/40 rating and 2731 in the 40/4 rating).
It means that if it is the case for Movei, it is certainly the case for other engines.

Milos · Post by **Milos** » Tue Nov 29, 2011 2:48 pm

Sorry Charles, but most of the things you wrote are just utter nonsense.

Sylwy · Post by **Sylwy** » Tue Nov 29, 2011 3:08 pm

CRoberson wrote: Also, I use CCRL as an example, but the issues pertain to all of the lists. I am not picking on CCRL specifically, I just look at their list more than the rest.

Why not GEGT ?
Or IPON ?
Or SSDF ?
Or Sedat Canbaz ?
Or Ruxy NSCEERL ( nice & sexy chess engines eastern rating list) ?
Or.....................?

"I was dying of boredom in the discotheque ... Yes Georges / I am so weak in your arms / Oh Georges ..."

Silvi(Vart)anR

Question: do you know who is "the femme fatale of CCRL"?

Milos · Post by **Milos** » Tue Nov 29, 2011 3:45 pm

SzG wrote: 4. On adjudicating:

My current practice is to adjudicate games as draw after 200 (full) moves. This is based upon my knowledge that the longest computer game that was decided was 193 moves long. Not much risk here...

The other way of adjudicating is to decide a game if a certain threshold is overstepped for a certain time. Currently I use 1000 centipawns for 6 moves. Please note that both engines must agree that the threshold was overstepped!

Now let us suppose I made a mistake by adjudicating and an engine was awarded a result it did not deserve. What is the amount of my error? If the mistake occurs once in every 100 games (which is probably not the case), I distort the outcome by 0.5 %. In Elo it translates to certainly not more than 1 or 2, and the error margins you find on the lists are around 20 Elo or even more.

So adjudicating is more than justified, and the CPU time gained is enormous.

100cp is usually more than 1 sigma certainty of a win for almost all strong programs. However, due to some evaluation bugs, the tails of the distribution (of winning percentage vs. cp difference) are not normal, so using more than 400cp in practice has no meaning. Also, using just 2 consecutive moves is more than sufficient to avoid any possible fluke. These two conditions alone pretty much guarantee an error far smaller than 0.1%. Room temperature (where your machines are stored) has much more influence on the tournament outcome.
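
As a rough check on the error estimates in this exchange, the sketch below converts a small shift in score percentage into Elo. It assumes the standard logistic Elo formula and two evenly matched engines (a 50% baseline score); both are assumptions rather than something stated in the thread, and the function names are illustrative.

```python
import math


def elo_from_score(score):
    """Elo difference implied by an expected score under the logistic Elo model."""
    if not 0 < score < 1:
        raise ValueError("score must be strictly between 0 and 1")
    return -400 * math.log10(1 / score - 1)


def elo_shift(score_error, base_score=0.5):
    """Elo change caused by shifting the score by score_error from base_score."""
    return elo_from_score(base_score + score_error) - elo_from_score(base_score)


print(round(elo_shift(0.005), 2))  # 0.5 % distortion of the outcome
print(round(elo_shift(0.001), 2))  # 0.1 % distortion of the outcome
```

Under those assumptions a 0.5% distortion is worth about 3.5 Elo and a 0.1% distortion under 1 Elo, both well inside the error margins of around 20 Elo mentioned in the quote.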

Evert · Post by **Evert** » Tue Nov 29, 2011 5:49 pm

SzG wrote: I do not have the mathematical knowledge to verify your claim. If you are serious, I gather you suggest that 400 cp for 2 moves is more than enough to adjudicate a win.

I still have doubts based on what I saw several times (mostly with average or worse engines). Many times both engines think the score is above 5, still the outcome is an inevitable draw. Typical example is a KBPK endgame with an a or h pawn and bad bishop. That is why I have chosen so high a margin.

It is of course easy to test this by running a tournament with either condition and comparing the outcome...
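
The two settings being compared here (1000 centipawns for 6 moves versus 400 centipawns for 2 moves, with both engines agreeing) can be expressed as one parametrised rule. This is only a minimal sketch: the function name, the score lists, and the convention that both engines' scores are given from engine A's point of view are assumptions, not how any particular tournament manager implements adjudication.

```python
def adjudicate_win(scores_a, scores_b, threshold_cp, moves_required):
    """Return 'A', 'B', or None under a both-engines-agree score rule.

    scores_a and scores_b hold each engine's evaluation after every full move,
    in centipawns from engine A's point of view (an assumed convention).
    A win is adjudicated once both engines have reported a score beyond the
    threshold, in the same direction, for moves_required consecutive moves.
    """
    streak_a = streak_b = 0
    for sa, sb in zip(scores_a, scores_b):
        # Both engines must agree, so the less extreme score decides.
        streak_a = streak_a + 1 if min(sa, sb) >= threshold_cp else 0
        streak_b = streak_b + 1 if max(sa, sb) <= -threshold_cp else 0
        if streak_a >= moves_required:
            return "A"
        if streak_b >= moves_required:
            return "B"
    return None


# Hypothetical drawn endgame where both engines report about +5 pawns.
a = [520] * 10
b = [510] * 10
print(adjudicate_win(a, b, threshold_cp=400, moves_required=2))   # A
print(adjudicate_win(a, b, threshold_cp=1000, moves_required=6))  # None
```

In this made-up example the 400cp rule awards a win in a position that both engines misjudge at about +5, while the 1000cp rule lets the game continue, which is the trade-off under discussion.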

Milos · Post by **Milos** » Tue Nov 29, 2011 5:55 pm

SzG wrote: I still have doubts based on what I saw several times (mostly with average or worse engines). Many times both engines think the score is above 5, still the outcome is an inevitable draw. Typical example is a KBPK endgame with an a or h pawn and bad bishop. That is why I have chosen so high a margin.

(Especially weaker) engines make those kinds of errors where even 600cp is not enough to claim a win. However, setting adjudication to 1000cp is enough to avoid these kinds of errors, but from the practical point of view it is equivalent to not having any adjudication at all.
That one is even simpler to test. Take a fast time control tournament of 10k games with 1000cp adjudication. Record the total duration of the tournament. Then play the same tournament without adjudication. Compare durations. If the difference is less than 2% of total time, then what's the point in using adjudication at all?
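
The comparison proposed above reduces to one ratio. A minimal sketch, with hypothetical function names and made-up durations, and with the 2% figure taken from this post as the default cut-off:

```python
def adjudication_saving(duration_with, duration_without):
    """Fraction of total tournament time saved by adjudicating games."""
    if duration_without <= 0:
        raise ValueError("duration_without must be positive")
    return (duration_without - duration_with) / duration_without


def adjudication_worthwhile(duration_with, duration_without, min_saving=0.02):
    """True if adjudication saves at least min_saving of the total time."""
    return adjudication_saving(duration_with, duration_without) >= min_saving


# Hypothetical durations in hours, for illustration only.
print(adjudication_worthwhile(99.0, 100.0))  # 1 % saved  -> False
print(adjudication_worthwhile(90.0, 100.0))  # 10 % saved -> True
```

Both runs need the same engines, openings and time control for the two durations to be comparable.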

CRoberson · Post by **CRoberson** » Tue Nov 29, 2011 6:17 pm

rvida wrote:
CRoberson wrote: ...
Yes, it is possible to tune an engine so that it plays 50 to 200 Elo stronger at fast time controls than at slow time controls. In fact, it is fairly easy to do,
...
Fairly easy??? Well, download an open-source engine (for example Stockfish) and tune it to be 50-200 ELO stronger at fast time controls then show us your results. Good luck.

Luck is not needed. I've been there and done that with other engines to prove the point several years ago before you started Critter. The discussion was on the older CCC server. I believe there is a backup of it somewhere. You could look it up.
