How the time adjustment factor works.

Discussion of chess software programming and technical issues.

Moderator: Ras

Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

How the time adjustment factor works.

Post by Don »

I have received private requests asking how the time adjustment factor in my tester works.

So I'm going to try to answer them here with this post. This is for the benefit of those who believe it may be of some use, not to generate more controversy. I'm going to give some details on how I implemented it.

For years it has been known that chess skill in computer chess programs grows almost linearly with the log of the time. For instance, each doubling in speed gives an (almost) constant ELO improvement.

A few months ago, H. G. Muller posted the basic formula which is ELO = K * ln(time). He used 100 as his example constant, which is like saying each doubling is worth almost 70 ELO of improvement.

So I incorporated this formula into my autotester. When running fixed depth games, one would like to know at a glance which program is really stronger and by how much. Of course as Bob Hyatt points out the most direct way is to run time control games. I have no argument with this and it works perfectly. But many testers are interested in isolating the algorithm from the implementation. For instance if you improve mobility with an expensive calculation, one might want to understand the relationship between the quality improvement and the extra time involved.

In my tester, I can configure this constant. But to make the constant represent the amount of ELO improvement per doubling (which I find more convenient to think about) you have to multiply it by 1/ln(2), or about 1.44269504088896. So if you believe each doubling is worth 100 ELO points, then K should be 100 * 1.442695...
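As a small sketch of that conversion (Python, with function names of my own choosing), here is the arithmetic for turning a desired "ELO per doubling" into the K used in ELO = K * ln(time):

```python
import math

def k_from_elo_per_doubling(elo_per_doubling):
    # ELO = K * ln(time), so one doubling of time adds K * ln(2) ELO.
    # To make each doubling worth the requested amount, divide by ln(2).
    return elo_per_doubling / math.log(2)

print(round(k_from_elo_per_doubling(100), 4))  # 144.2695
```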

So in my tester, the games are rated using the standard ELO formula, but I will also publish a time-adjusted ELO rating. Let's say program A has a rating of 2200 ELO and takes 10 seconds per move, and program B has an ELO of 2150 but only takes 7 seconds per move. Which is stronger?

A = 144.2695 * ln(10) = 332.19

B = 144.2695 * ln( 7) = 280.76

This tells you that you should gain about 332 - 281 = 51 ELO points going from 7 to 10 seconds (using 100 ELO per doubling). So program B is slightly stronger, because it is fast enough to make up for its 50 ELO rating deficit.
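The comparison above can be sketched in a few lines of Python (the ratings and times are the ones from the example; the helper name is my own):

```python
import math

K = 100 / math.log(2)  # ~144.2695, i.e. 100 ELO per doubling

def time_credit(seconds_per_move):
    # ELO attributable to thinking time alone: K * ln(time)
    return K * math.log(seconds_per_move)

# Program A: 2200 ELO at 10 s/move; program B: 2150 ELO at 7 s/move.
a_adjusted = 2200 - time_credit(10)  # rating with the time advantage removed
b_adjusted = 2150 - time_credit(7)

print(round(time_credit(10) - time_credit(7), 1))  # 51.5 ELO for 7 -> 10 s
print(b_adjusted > a_adjusted)  # True: B comes out slightly ahead
```

Note that only the *difference* of the two time credits matters, so the choice of time unit (seconds vs. minutes) cancels out.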

With fixed depth testing, you should realize that each doubling is worth less and less as the time increases. This also varies based on how scalable your program is: if you have terrible move ordering, you might get much less from a doubling. I think H. G. Muller also mentioned these things in his original forum post.

So my tester lets you configure what a doubling is worth. I have found that if I am doing 9 ply searches, 145 is very close to the right value; with 7 ply searches I use 155. In very rough terms, it works pretty well for my program to adjust by 5 ELO for each ply of depth. This also works extremely well with the programs I test against; they seem to follow the same ELO curve my program does.

The nice thing is that for small variations the error is really small. For instance, if I am 5 ELO off in my calibration, it's not going to produce much error for a program that is 20% faster or slower. It's less accurate if one program takes a LOT more time than another. You would not want to compare a program that averages 1 second a move with one that averages 15 seconds a move, for instance.

I calibrate it by running a test where one version of the program searches an extra ply. I might run one at 8 ply and one at 9 ply, then fiddle the "elo doubling factor" until these two versions come out exactly the same. You need to run a large sample, of course, if you want this to be accurate, but you don't have to do it very often.
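The calibration step above amounts to solving the formula backwards: given the ELO gap the d+1-ply version scored over the d-ply version and the ratio of their average move times, solve for K. A sketch (helper name and example numbers are mine, purely illustrative):

```python
import math

def doubling_value_from_match(observed_elo_gap, time_deep, time_shallow):
    # If the deeper version beat the shallower one by observed_elo_gap,
    # while averaging time_deep vs. time_shallow seconds per move, the K
    # that makes them "come out the same" after time adjustment satisfies
    #     observed_elo_gap = K * ln(time_deep / time_shallow)
    k = observed_elo_gap / math.log(time_deep / time_shallow)
    # Convert K back to "ELO per doubling" (K * ln 2):
    return k * math.log(2)

# e.g. if the 9-ply version won by 200 ELO while taking 2.6x as long per move:
print(round(doubling_value_from_match(200, 2.6, 1.0)))  # 145
```

In practice the observed gap comes from a large match sample, as noted above, since the ELO estimate itself carries statistical error.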

How much can you trust these values? In practice I have found these numbers extremely reliable; the correlation between them and actual time control games is very high. I have always been able to predict the winner of long time control matches from these fixed depth games, within the statistical error. But it is up to each tester, of course, to use some common sense and decide for himself whether there is any value in this for him.