Elo versus speed

petero2 · Post by **petero2** » Mon Apr 02, 2012 9:50 pm

Background

I recently converted my Java chess engine CuckooChess to C++ (and renamed it Texel). As a result the engine became approximately twice as fast. Initial tests suggest that the Elo gain is much larger than would be expected from the old rule that you gain 50-70 Elo points for each doubling of speed.

In an attempt to understand what is going on I played a time odds match using three different engines.

Match conditions

* Three engines tested: stockfish, crafty, texel
* Each engine plays at three different time controls:
1: 60 moves in 6 seconds
2: 60 moves in 12 seconds
3: 60 moves in 24 seconds
* Ponder off
* One CPU
* 128MB transposition table
* Openings selected randomly from a set of 7316 book lines, each line 10 ply deep.
* Each opening line played twice, with colors reversed the second time.
* Three test computers used, all running 64 bit linux.
- Core2 Quad 2.4GHz
- Core i7 870, 2.93GHz
- Core2 Duo, 2.6GHz
* Tournament where each engine plays all other engines.
* Total number of games: 14976
* Total number of games per engine: 3328
* Total number of games between each pair of engines: 416
* Total number of time losses: 1

Results

Rating computations were performed using bayeselo. All 14976 games together give the following rating table:

Rank Name         Elo    +    - games score oppo. draws 
   1 stockfish3   372   14   14  3328   91%   -47   14% 
   2 stockfish2   228   12   11  3328   78%   -28   21% 
   3 stockfish1    46   11   10  3328   56%    -6   23% 
   4 texel3        32   11   11  3328   55%    -4   21% 
   5 crafty3       30   11   10  3328   55%    -4   23% 
   6 crafty2      -76   11   10  3328   41%     9   23% 
   7 texel2      -116   10   11  3328   36%    14   21% 
   8 crafty1     -222   11   12  3328   23%    28   19% 
   9 texel1      -294   12   13  3328   16%    37   14%

From the table the elo increase when doubling the thinking time can be calculated:

            1->2  2->3
stockfish   182   144
crafty      146   106
texel       178   148
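The subtraction behind this small table can be sketched in a few lines of Python. The Elo values are copied from the bayeselo rating table above; the names `ratings` and `gain_per_doubling` are just illustrative.

```python
# Elo values copied from the bayeselo rating table above.
# Keys 1, 2, 3 are the time controls (6, 12 and 24 seconds per 60 moves).
ratings = {
    "stockfish": {1: 46, 2: 228, 3: 372},
    "crafty": {1: -222, 2: -76, 3: 30},
    "texel": {1: -294, 2: -116, 3: 32},
}


def gain_per_doubling(levels):
    """Return the Elo gain for each doubling of thinking time."""
    return {
        "1->2": levels[2] - levels[1],
        "2->3": levels[3] - levels[2],
    }


gains = {engine: gain_per_doubling(levels) for engine, levels in ratings.items()}

print(f"{'':<10} {'1->2':>5} {'2->3':>5}")
for engine, g in gains.items():
    print(f"{engine:<10} {g['1->2']:>5} {g['2->3']:>5}")
```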

To see how the individual engines performed against each other, all games involving a given pair of engines were extracted and bayeselo was used to calculate the rating difference given those games. This was repeated for all pairs of engines. The result was:

                 s3   s2   s1   c3   c2   c1   t3   t2   t1
stockfish3        0  134  344  330  448  576  334  500  624 
stockfish2     -134    0  178  224  336  396  168  342  470 
stockfish1     -344 -178    0   16  108  268   50  160  308 
crafty3        -330 -224  -16    0  114  270   -2  136  324 
crafty2        -448 -336 -108 -114    0  174  -98   42  190 
crafty1        -576 -396 -268 -270 -174    0 -228 -104   76 
texel3         -334 -168  -50    2   98  228    0  162  372 
texel2         -500 -342 -160 -136  -42  104 -162    0  196 
texel1         -624 -470 -308 -324 -190  -76 -372 -196    0

Taking differences in the above table gives:

                 s3   s2   s1   c3   c2   c1   t3   t2   t1
stockfish 2->3  134  134  166  106  112  180  166  158  154
stockfish 1->2  210  178  178  208  228  128  118  182  162
crafty    2->3  118  112   92  114  114   96   96   94  134
crafty    1->2  128   60  160  156  174  174  130  146  114
texel     2->3  166  174  110  138  140  124  162  162  176
texel     1->2  124  128  148  188  148  180  210  196  196
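For anyone who wants to reproduce the differences, here is a minimal Python sketch. It assumes each row of the pairwise table holds one player's rating difference against every column opponent, so the gain from a doubling is the faster version's row minus the slower version's row, column by column. The names `pairwise` and `doubling_row` are illustrative only.

```python
# Pairwise rating differences copied from the table above.
# Each row: that player's Elo difference versus s3, s2, s1, c3, c2, c1, t3, t2, t1.
labels = ["s3", "s2", "s1", "c3", "c2", "c1", "t3", "t2", "t1"]
pairwise = {
    "s3": [0, 134, 344, 330, 448, 576, 334, 500, 624],
    "s2": [-134, 0, 178, 224, 336, 396, 168, 342, 470],
    "s1": [-344, -178, 0, 16, 108, 268, 50, 160, 308],
    "c3": [-330, -224, -16, 0, 114, 270, -2, 136, 324],
    "c2": [-448, -336, -108, -114, 0, 174, -98, 42, 190],
    "c1": [-576, -396, -268, -270, -174, 0, -228, -104, 76],
    "t3": [-334, -168, -50, 2, 98, 228, 0, 162, 372],
    "t2": [-500, -342, -160, -136, -42, 104, -162, 0, 196],
    "t1": [-624, -470, -308, -324, -190, -76, -372, -196, 0],
}


def doubling_row(slower, faster):
    """Elo gained by `faster` over `slower`, measured against each opponent."""
    return [f - s for f, s in zip(pairwise[faster], pairwise[slower])]


print(" " * 16 + " ".join(f"{name:>4}" for name in labels))
for engine, key in (("stockfish", "s"), ("crafty", "c"), ("texel", "t")):
    for slow, fast in ((2, 3), (1, 2)):
        row = doubling_row(f"{key}{slow}", f"{key}{fast}")
        print(f"{engine:<9} {slow}->{fast} " + " ".join(f"{v:>4}" for v in row))
```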

Discussion

I believe the "50-70 elo increase per speed doubling" estimate originally comes from "How Computers Play Chess" by David Levy and Monty Newborn in 1991. I don't have that book, so I don't know what data they based the estimate on. I guess they used longer time controls than I did, but given 20 years of hardware improvement, it is not unlikely that my tests generated larger search trees anyway.

In my last table, most values are significantly larger than 50-70. In fact only one value (crafty 1->2 vs stockfish2) falls within the 50-70 range. It is worth noting that each value in that table is calculated from only 416 games, so the error margin according to bayeselo is about 30. However, if the true values were within 50-70, getting the result I got would be extremely unlikely.
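The "only one value" claim is easy to check mechanically. The sketch below simply counts how many of the 54 entries of the last table fall inside the 50-70 range; it says nothing about statistical significance, and the name `diffs` is illustrative.

```python
# Per-doubling Elo gains copied from the last table above (6 rows x 9 opponents).
diffs = {
    "stockfish 2->3": [134, 134, 166, 106, 112, 180, 166, 158, 154],
    "stockfish 1->2": [210, 178, 178, 208, 228, 128, 118, 182, 162],
    "crafty 2->3": [118, 112, 92, 114, 114, 96, 96, 94, 134],
    "crafty 1->2": [128, 60, 160, 156, 174, 174, 130, 146, 114],
    "texel 2->3": [166, 174, 110, 138, 140, 124, 162, 162, 176],
    "texel 1->2": [124, 128, 148, 188, 148, 180, 210, 196, 196],
}

values = [v for row in diffs.values() for v in row]
in_range = [v for v in values if 50 <= v <= 70]

print("number of values:", len(values))
print("values inside 50-70:", in_range)
print("smallest value above 70:", min(v for v in values if v > 70))
print("largest value:", max(values))
```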

Are there any newer estimates of Elo versus speed? I would not be surprised if today's engines with recursive null moves, late move reductions and other pruning techniques behave quite differently from the ones from 20 years ago.

lkaufman · Post by **lkaufman** » Mon Apr 02, 2012 10:07 pm

The elo value per doubling drops off rather sharply with each additional doubling, as your own data shows. Probably the 50-70 value you cite is still applicable at time limits somewhere between normal blitz and 40 moves in two hours. This means that the effective depth where this doubling value applies is vastly higher than 20+ years ago; the curve has been shifted substantially. The numbers you cite are pretty much in line with what we observe in Komodo at the very fast time limits you use. For hyperspeed testing (faster than bullet chess), a good rule of thumb is that each percentage point of speed is worth two elo points, whereas at real time limits like maybe the IPON 5'+3" it's roughly one for one.
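The per-percent rule of thumb can be translated into a per-doubling figure with a little arithmetic. The sketch below assumes the rule holds for every successive 1% speed step up to a full doubling, which is only an approximation, since the value per doubling itself shrinks as the time control grows. The function name `elo_per_doubling` is illustrative.

```python
import math

# Number of compounded 1% speed steps needed to double the speed:
# 1.01 ** n == 2  =>  n = ln(2) / ln(1.01), roughly 69.7 steps.
steps_per_doubling = math.log(2) / math.log(1.01)


def elo_per_doubling(elo_per_percent):
    """Elo per speed doubling, assuming a constant Elo gain per 1% speed step."""
    return elo_per_percent * steps_per_doubling


# Hyperspeed rule of thumb: 2 Elo per percent of speed.
print(f"2 Elo per 1%: about {elo_per_doubling(2):.0f} Elo per doubling")
# Longer time controls: roughly 1 Elo per percent of speed.
print(f"1 Elo per 1%: about {elo_per_doubling(1):.0f} Elo per doubling")
```

Under that assumption, two Elo per percent comes out close to the per-doubling gains in the tables above, and one Elo per percent lands at the top of the classic 50-70 range.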

Uri Blass · Post by **Uri Blass** » Mon Apr 02, 2012 10:28 pm

I can add that with bigger opening books (world championship conditions, or SSDF conditions where every program uses its own book) you gain less from doubling the speed at the same time control.

diep · Post by **diep** » Tue Apr 03, 2012 5:34 am

Uri Blass wrote: I can add that with bigger opening books (world championship conditions, or SSDF conditions where every program uses its own book) you gain less from doubling the speed at the same time control.

The book is one factor indeed.

Uri, I had expected you'd also note that he's basically just testing the same engine against itself. Stockfish versus Stockfish.

To quote Johan de Koning that's incesttesting.

Futility pruning in the last 3 plies doesn't work in Diep, even though it searches 2 plies deeper. If I test Diep versus Diep, the futility version does win against the Diep version without futility, yet against non-Diep opponents the version without futility has a SIGNIFICANTLY higher Elo rating. Worth about 20% in score or so (on average - huge differences there between different engines).

For such doubling-of-speed contests you need quite a large number of opponents, and your results should not include games against the same engine.

I do realize how tough this is in todays computerchess - because of the many clones that are so so similar in evaluation function - basically cut'n pasting each other and if not cut'n pasting then implementing the same thing in a different manner, tuned within 0.01 to the same value - and not because their own tests indicated that was best to do so.

This removes a lot of noise.

But even then -realize the largest effect is the full playing strength of the engine. It's 3400 elo or so?

What were engines optimistically rated back in Bronstein days, 2200 elo or so? I could beat 'em all. The first engine i had really problems with was nimzo98, as it was so accurately tuned (we know that nowadays - i didn't know that back then - values are not far away from the values Stockfish uses to give an example).

So if you start at superbullet and then double a few times, it has to get to that 3400 quickly, within a few doublings. So it has to get roughly 1200 Elo higher than back then. That's a lot of Elo points. With 70 Elo per doubling, if you start at superbullet you ain't gonna get 3200.

The 70 elo rule is not interesting underneath the elo of an engine. What's interesting is how much stronger the engine gets when you start at reasonable time controls at todays hardware and extrapolate that to futuristic hardware.

Please note that the standard hardware of today is a sixcore intel gulftown.

Most ratinglists are basically not even close to playing at 3 minutes a move at that hardware. I saw one Turkish guy test at 25 10 there, but nearly all others including the 'rating lists' they test at a time control where the world champs 1999 hardware still is the same speed or in some cases even considerable faster. That doesn't show the progress of the software+hardware combination very well of course.

That Gulftown is by now 2-year-old hardware, released March 2010. There are 8-core Xeons nowadays and 2-socket machines are very common... so comparing it all against a six-core Gulftown is not exactly 'luxury'.

Back at the 1999 world champs, besides a bunch of supercomputers, there were like 4 engines playing at 3 minutes a move on a quad Xeon, and Fritz and Junior were, if I remember well, clocked at 500MHz.

That's 4 * 0.5 * 3 = 6 GHz-minutes a move.

Only Sedat's rating list, which comes down to a time control slightly faster than 30 seconds a move, gets to 6 * 4 * 0.5 = 12 GHz-minutes a move. Sure, the processor is better, but engines today also do a lot more, and most engines back then were in assembler (though at the 1999 world champs no longer in the majority).

So the question is: in a fair contest, sure with a book, and sure with a bunch of opponents, how much Elo would a 100GHz quad-core potentially add to Stockfish, to give an example...

As that's 4 doublings, the 70 Elo rule would add 280 Elo there, and I seriously doubt that.

Vincent

Daniel Shawul · Post by **Daniel Shawul** » Tue Apr 03, 2012 11:07 am

It would also be interesting to compare ratings of parallel speedups for 1,2 and 4 processors. Incidentally this seems to roughly match your tests if you assume a 1,1.8 and 3 speedups. That is generally expected efficiency for YBW implementation. The increase in elo is much lower as expected. For example I see for stockfish 2.2.1 2952 2997 3011. So with this the +70 elo per doubling estimate looks a rather good one but this of course depends on the efficiency of parallel implementation as I already mentioned.

Adam Hair · Post by **Adam Hair** » Tue Apr 03, 2012 12:29 pm

petero2 wrote:I believe the "50-70 elo increase per speed doubling" estimate originally comes from "How Computers Play Chess" by David Levy and Monty Newborn in 1991. I don't have that book, so I don't know what data they based the estimate on. I guess they used longer time controls than I did, but given 20 years of hardware improvement, it is not unlikely that my tests generated larger search trees anyway.

In my last table, most values are significantly larger than 50-70. In fact only one value (crafty 1->2 vs stockfish2) falls within the 50-70 range. It is worth noting that each value in that table is calculated from only 416 games, so the error margin according to bayeselo is about 30. However, if the true values were within 50-70, getting the result I got would be extremely unlikely.

Are there any newer estimates of Elo versus speed? I would not be surprised if today's engines with recursive null moves, late move reductions and other pruning techniques behave quite differently from the ones from 20 years ago.

I did something similar ( http://talkchess.com/forum/viewtopic.ph ... 68&t=42553 ) and my data is similar to your data. Likewise, I believe that Don Dailey has also done some testing that is related and found results that are similar, IIRC.

I started a new test at longer time controls. I put it to the side for awhile, but I may start it again.

Thanks for sharing your data,
Adam

Rebel · Post by **Rebel** » Tue Apr 03, 2012 3:14 pm

petero2 wrote: I would not be surprised if today's engines with recursive null moves, late move reductions and other pruning techniques behave quite differently from the ones from 20 years ago.

Very likely. Remember that the 50-70 Elo increase was based on a typical branching factor of 4 in those days. Nowadays, with an average branching factor between 1.5 and 2.0, the Elo gain tends to go up.

Rebel · Post by **Rebel** » Tue Apr 03, 2012 3:22 pm

Daniel Shawul wrote:It would also be interesting to compare ratings of parallel speedups for 1,2 and 4 processors. Incidentally this seems to roughly match your tests if you assume a 1,1.8 and 3 speedups. That is generally expected efficiency for YBW implementation. The increase in elo is much lower as expected. For example I see for stockfish 2.2.1 2952 2997 3011. So with this the +70 elo per doubling estimate looks a rather good one but this of course depends on the efficiency of parallel implementation as I already mentioned.

Perhaps the hash-table-size plays a role in this? Branch factor goes up when the hash table becomes full. What happens if you triple the hash-table size for a quad?

Daniel Shawul · Post by **Daniel Shawul** » Tue Apr 03, 2012 4:23 pm

Rebel wrote:
Daniel Shawul wrote:It would also be interesting to compare ratings of parallel speedups for 1,2 and 4 processors. Incidentally this seems to roughly match your tests if you assume a 1,1.8 and 3 speedups. That is generally expected efficiency for YBW implementation. The increase in elo is much lower as expected. For example I see for stockfish 2.2.1 2952 2997 3011. So with this the +70 elo per doubling estimate looks a rather good one but this of course depends on the efficiency of parallel implementation as I already mentioned.
Perhaps the hash-table-size plays a role in this? Branch factor goes up when the hash table becomes full. What happens if you triple the hash-table size for a quad?

That could be another reason. From looking at CEGT test conditions I can not be sure if they use 4x hash table size for 4 processors tests, but they seem to double the hash size for 2 processor test.

Given the different hardware from testers we agreed to adapt to AMD64X2 4200+ for 40/120 and 40/20 and 2 GHz Pentium CPU for 40/4. Hash given is usually 256 MB for each engine. Very few testers who have less RAM available are allowed to give 128 MB.
Deep versions: Deep Shredder 9. Deep Fritz 8, Deep Junior 9 and others are tested on dual machines using 2 CPU´s and 512 MB hash. There is an exception for Junior 9.003 using only 256 MB, because there seem to occur bugs when giving 512 MB to this one.

The 40/4 result is better but still <= 70. Stockfish 2.01 2926 2988 3026. The hyperbullet testing condition used for this test probably contributes more than one would expect.
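To compare these parallel numbers with time-doubling results, one can normalise the Elo gain by the effective speed doublings. This is only a rough sketch: it assumes the three ratings quoted for each version correspond to 1, 2 and 4 processors, and it uses the 1, 1.8 and 3 speedups assumed earlier in the thread rather than measured ones. The names `speedups`, `rating_lists` and `elo_per_effective_doubling` are illustrative.

```python
import math

# Assumed effective speedups for 1, 2 and 4 processors (from the thread, not measured).
speedups = [1.0, 1.8, 3.0]

# Ratings quoted in the thread, assumed to be ordered as 1, 2 and 4 processors.
rating_lists = {
    "Stockfish 2.2.1": [2952, 2997, 3011],
    "Stockfish 2.01 (40/4)": [2926, 2988, 3026],
}


def elo_per_effective_doubling(ratings, speedups):
    """Elo gain of each step divided by that step's number of speed doublings."""
    result = []
    for i in range(1, len(ratings)):
        doublings = math.log2(speedups[i] / speedups[i - 1])
        result.append((ratings[i] - ratings[i - 1]) / doublings)
    return result


for name, ratings in rating_lists.items():
    per_step = elo_per_effective_doubling(ratings, speedups)
    print(name, [round(x, 1) for x in per_step])
```

The result is quite sensitive to the assumed speedups, so it should be read as an order-of-magnitude check only.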

diep · Post by **diep** » Tue Apr 03, 2012 6:03 pm

Rebel wrote:
petero2 wrote: I would not be surprised if today's engines with recursive null moves, late move reductions and other pruning techniques behave quite differently from the ones from 20 years ago.
Very likely. Remember that the 50-70 Elo increase was based on a typical branching factor of 4 in those days. Nowadays, with an average branching factor between 1.5 and 2.0, the Elo gain tends to go up.

Well, Jonny is rated around 2950 at fast time controls; Jonny is significantly stronger at slower time controls on, say, 1 to 4 cores, and the world champs were played at 2 minutes a move.

Typically it gets just over 20 ply on a PC.

Last world champs it ran on a 800+ core supercomputer.

It systematically reached 40 ply, with a minimum search depth of 35 ply, during the world champs 2011.

800 cores versus 1 core is more than 9 doublings and it reached 15 ply deeper worst case.

70 * 9 = +630 elo.

Yet it still castled long against a Rybka twin, losing a game because of that.
Castling like that was also a big problem for Diep in the past. It was fixed years ago in the evaluation function. A 40 ply search doesn't correct this seemingly trivial problem.

It obviously didn't play like 3600 Elo. That doubling story is just a story.

People test at specific hardware and when you get above that hardware, programs tend to not scale very well elowise, that's the truth.
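The doubling count used in this argument can be checked quickly. The sketch assumes perfectly linear scaling with core count, which is optimistic for any real parallel search, so it is an upper bound on the number of effective doublings. The variable names are illustrative.

```python
import math

cores = 800
# Doublings from 1 core to 800 cores, assuming perfect linear speedup.
doublings = math.log2(cores)
print(f"{cores} cores = {doublings:.2f} doublings over 1 core")

# Elo predicted by a fixed per-doubling rule, using whole doublings as in the post.
whole = math.floor(doublings)
for elo in (50, 70):
    print(f"{elo} Elo per doubling x {whole} doublings = +{elo * whole} Elo")
```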
