Another attempt at comparing Evals ELO-wise

Discussion of anything and everything relating to chess playing software and machines.

Laskos
Posts: 9446
Joined: Wed Jul 26, 2006 8:21 pm
Full name: Kai Laskos

Another attempt at comparing Evals ELO-wise

Post by Laskos » Mon May 22, 2017 11:06 am

I had a hunch to test top engines not at low fixed nodes ("nodes" mean rather different things in different engines), nor at fixed low depth (again engine-relative), but at a fixed, very short time. The problem with this is that in Cutechess-Cli on Windows, engines overstep their allotted very short time by several milliseconds (for example, using ST=0.001 in Cutechess-Cli). I inferred the overstepping in ms from their behaviour under a doubling of the time control. Komodo 10.4 clearly sees the doubling already at 0.002s vs 0.001s, while Stockfish sees it only at 0.008s vs 0.004s or higher. Thus the inferred latency of Komodo is 0.000s, and of Stockfish 0.004s.
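The arithmetic behind this inference can be sketched as follows (a minimal illustration with my own function name, not Laskos's actual procedure; it only reproduces the effective-time ratios, not the match runs):

```cpp
#include <cmath>

// Sketch of the latency inference: the engine's effective thinking time is
// the nominal time control plus a fixed per-move latency, so a nominal
// doubling from t to 2t is only "seen" by the engine when the ratio of
// effective times is close to 2.
double effective_ratio(double nominal_ms, double latency_ms) {
    return (2.0 * nominal_ms + latency_ms) / (nominal_ms + latency_ms);
}
```

With zero latency, effective_ratio(1, 0) is exactly 2, so Komodo already gains at 0.002s vs 0.001s; with a 4 ms latency, 2 ms vs 1 ms is effectively only 6/5 = 1.2x, while 8 ms vs 4 ms is 12/8 = 1.5x, which is roughly where Stockfish starts showing a gain.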

Here is the table of latencies in milliseconds (Windows 8.1 and Cutechess-Cli):

Code: Select all

Komodo 10.4:      0
Stockfish 8:      4 
Houdini 5:        4 
Deep Shredder 13: 0
Andscacs 0.91:    0
Fruit 2.1:       12
Doubling the time at these short controls is worth about 250 ELO points, so with a time control of 0.005s/move I can infer ratings adjusted for the effective time used. Unadjusted (games at 5 ms/move):

Code: Select all

   # PLAYER               : TIME   RATING  ERROR    POINTS  PLAYED     (%)   CFS(next)

   1 Stockfish 8          : 9 ms   1664.8   13.5    1470.0    2000    73.5     100    
   2 Houdini 5            : 9 ms   1607.5   12.3    1310.5    2000    65.5     100    
   3 Deep Shredder 13     : 5 ms   1517.0   12.5    1040.0    2000    52.0      70    
   4 Komodo 10.4          : 5 ms   1512.0   12.2    1025.0    2000    51.3     100    
   5 Andscacs 0.91        : 5 ms   1446.3   12.7     827.0    2000    41.4     100    
   6 Fruit 2.1            :17 ms   1252.5   15.2     327.5    2000    16.4     ---    
Adjusted for time used:

EVAL RATING:

Code: Select all

   # PLAYER                 : RATING  

   1 Deep Shredder 13       :  687    
   2 Komodo 10.4            :  682
   3 Stockfish 8            :  625      
   4 Andscacs 0.91          :  616  
   5 Houdini 5              :  567   
   6 Fruit 2.1              :    0     
It seems Deep Shredder and Komodo have the best evals among the top engines, while Houdini has the weakest. Also, the progress since Fruit 2.1's basic eval is remarkable. Andscacs seems on par with Stockfish, and only its search is hampering it. The same goes for Shredder.
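For the record, the adjustment itself is just the doubling rule applied in reverse; a sketch, assuming exactly 250 points per doubling (the posted EVAL RATING list is also rebased so that Fruit 2.1 = 0, so the digits won't match exactly):

```cpp
#include <cmath>

// Penalize a measured rating for the extra effective time an engine used,
// at ~250 points per time doubling (the assumption from the test above).
// used_ms = nominal control + latency; base_ms is the nominal 5 ms/move.
double time_adjusted_rating(double rating, double used_ms, double base_ms = 5.0) {
    return rating - 250.0 * std::log2(used_ms / base_ms);
}
```

Komodo used the nominal 5 ms, so it keeps its 1512; Stockfish's effective 9 ms costs it about 250 * log2(9/5) ≈ 212 points, which is how its raw 57-point lead over Komodo turns into a deficit in the eval list.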

cdani
Posts: 2104
Joined: Sat Jan 18, 2014 9:24 am
Location: Andorra

Re: Another attempt at comparing Evals ELO-wise

Post by cdani » Mon May 22, 2017 1:03 pm

Nice test. Thanks!

I use QueryPerformanceCounter to obtain the current time. This gives very fine-grained time. I also query the time taking the remaining thinking time into account, so more often when the allotted time is much lower. This allows very exact control of the time, so I don't use a separate thread for this, like many other engines do.
Laskos wrote:Andscacs seems on par with Stockfish, and only its search is hampering it.
The Andscacs eval is more complete than the Stockfish one, but it is clearly less well tuned, so it is compensating precision with quantity. It should also be tuned for long time controls. I think I can clearly grow it more, and overcoming Stockfish's relatively simple eval is not very complicated.

About search, well, it is not easy at all :-)

Also, Andscacs is probably losing some 30-50 Elo just due to speed, as it was written as my first serious engine, so many things in it are less than optimally written, even though I have rewritten most parts of it several times.

The Houdini eval is surprising. Maybe it is too simple, who knows.

sandermvdb
Posts: 157
Joined: Sat Jan 28, 2017 12:29 pm
Location: The Netherlands

Re: Another attempt at comparing Evals ELO-wise

Post by sandermvdb » Mon May 22, 2017 4:00 pm

cdani wrote:Nice test. Thanks!

I use QueryPerformanceCounter to obtain the current time. This gives very fine-grained time. I also query the time taking the remaining thinking time into account, so more often when the allotted time is much lower. This allows very exact control of the time, so I don't use a separate thread for this, like many other engines do.
Maybe a stupid question, but why is it important to have fine-grained time? Is this what is used in the UCI output (time), and is it in some way used by the GUI (or Cutechess-Cli)?

cdani
Posts: 2104
Joined: Sat Jan 18, 2014 9:24 am
Location: Andorra

Re: Another attempt at comparing Evals ELO-wise

Post by cdani » Mon May 22, 2017 4:53 pm

sandermvdb wrote: Maybe a stupid question, but why is it important to have fine-grained time? Is this what is used in the UCI output (time), and is it in some way used by the GUI (or Cutechess-Cli)?
I tried other ways, but this one allowed running at faster time controls than the others without losing on time. As you can see, this approach should have no overhead.

Of course it does not work on Linux, for which I used chrono::steady_clock.
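As a portable illustration of that polling approach (a minimal sketch with my own names, not Andscacs source), the check can be as cheap as one clock read whenever the search cares to ask:

```cpp
#include <chrono>

// Poll-based time control in the style cdani describes: no timer thread,
// just compare a monotonic clock against the allotted budget from inside
// the search loop (polling more often when the budget is small).
struct TimeManager {
    std::chrono::steady_clock::time_point start;
    std::chrono::milliseconds budget{0};

    void begin(int ms) {
        start = std::chrono::steady_clock::now();
        budget = std::chrono::milliseconds(ms);
    }
    bool out_of_time() const {
        return std::chrono::steady_clock::now() - start >= budget;
    }
};
```

On Windows with MSVC, std::chrono::steady_clock is typically implemented on top of QueryPerformanceCounter anyway, so this corresponds to the same underlying mechanism on both platforms.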

Laskos
Posts: 9446
Joined: Wed Jul 26, 2006 8:21 pm
Full name: Kai Laskos

Re: Another attempt at comparing Evals ELO-wise

Post by Laskos » Mon May 22, 2017 7:54 pm

SzG wrote:Please! Elo!
Well, I understand that Arpad Elo was Hungarian, but "ELO" has long since transcended the proper name "Elo" as a unit of relative strength. Also, ELO as used in computer chess is not FIDE Elo and is not what Arpad Elo did (he used the normal distribution, for one). I use ELO because it suits me better when others are reading my posts quickly, as happens on forums.

Sven
Posts: 3826
Joined: Thu May 15, 2008 7:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: Another attempt at comparing Evals ELO-wise

Post by Sven » Mon May 22, 2017 8:31 pm

Laskos wrote:
SzG wrote:Please! Elo!
Well, I understand that Arpad Elo was Hungarian, but "ELO" has long since transcended the proper name "Elo" as a unit of relative strength.
But everyone writes Watt, Kelvin or Newton and not "WATT", "KELVIN" or "NEWTON". So why "ELO"? Many people using it wrongly doesn't make it right ...
Laskos wrote:Also, ELO as used in computer chess is not FIDE Elo and is not what Arpad Elo did (he used the normal distribution, for one).
That could be a reason to use a different name than Elo but not to write "ELO" instead of Elo.

Laskos
Posts: 9446
Joined: Wed Jul 26, 2006 8:21 pm
Full name: Kai Laskos

Re: Another attempt at comparing Evals ELO-wise

Post by Laskos » Mon May 22, 2017 10:00 pm

Sven Schüle wrote:
Laskos wrote:
SzG wrote:Please! Elo!
Well, I understand that Arpad Elo was Hungarian, but "ELO" has long since transcended the proper name "Elo" as a unit of relative strength.
But everyone writes Watt, Kelvin or Newton and not "WATT", "KELVIN" or "NEWTON". So why "ELO"? Many people using it wrongly doesn't make it right ...
Laskos wrote:Also, ELO as used in computer chess is not FIDE Elo and is not what Arpad Elo did (he used the normal distribution, for one).
That could be a reason to use a different name than Elo but not to write "ELO" instead of Elo.
An important issue worth mentioning.

Laskos
Posts: 9446
Joined: Wed Jul 26, 2006 8:21 pm
Full name: Kai Laskos

Re: Another attempt at comparing Evals ELO-wise

Post by Laskos » Mon May 22, 2017 10:24 pm

cdani wrote:Nice test. Thanks!

I use QueryPerformanceCounter to obtain the current time. This gives very fine-grained time. I also query the time taking the remaining thinking time into account, so more often when the allotted time is much lower. This allows very exact control of the time, so I don't use a separate thread for this, like many other engines do.
Laskos wrote:Andscacs seems on par with Stockfish, and only its search is hampering it.
The Andscacs eval is more complete than the Stockfish one, but it is clearly less well tuned, so it is compensating precision with quantity. It should also be tuned for long time controls. I think I can clearly grow it more, and overcoming Stockfish's relatively simple eval is not very complicated.

About search, well, it is not easy at all :-)

Also, Andscacs is probably losing some 30-50 Elo just due to speed, as it was written as my first serious engine, so many things in it are less than optimally written, even though I have rewritten most parts of it several times.

The Houdini eval is surprising. Maybe it is too simple, who knows.
What you say might be important for longer analysis, where LTC and eval often matter more. Do you have any idea whether a better eval could mean better scaling with time ELO-wise? This list is strangely similar in certain aspects to the scaling of engines I derived from the FGRL rating list.

lkaufman
Posts: 3724
Joined: Sun Jan 10, 2010 5:15 am
Location: Maryland USA

Re: Another attempt at comparing Evals ELO-wise

Post by lkaufman » Mon May 22, 2017 11:13 pm

Laskos wrote:
cdani wrote:Nice test. Thanks!

I use QueryPerformanceCounter to obtain the current time. This gives very fine-grained time. I also query the time taking the remaining thinking time into account, so more often when the allotted time is much lower. This allows very exact control of the time, so I don't use a separate thread for this, like many other engines do.
Laskos wrote:Andscacs seems on par with Stockfish, and only its search is hampering it.
The Andscacs eval is more complete than the Stockfish one, but it is clearly less well tuned, so it is compensating precision with quantity. It should also be tuned for long time controls. I think I can clearly grow it more, and overcoming Stockfish's relatively simple eval is not very complicated.

About search, well, it is not easy at all :-)

Also, Andscacs is probably losing some 30-50 Elo just due to speed, as it was written as my first serious engine, so many things in it are less than optimally written, even though I have rewritten most parts of it several times.

The Houdini eval is surprising. Maybe it is too simple, who knows.
What you say might be important for longer analysis, where LTC and eval often matter more. Do you have any idea whether a better eval could mean better scaling with time ELO-wise? This list is strangely similar in certain aspects to the scaling of engines I derived from the FGRL rating list.

To me it is obvious that better eval correlates with better scaling, although it is not a perfect correlation. Tactics become less important with more time, while errors in eval don't generally go away with more time, although perhaps there is a difference between static and dynamic eval features in this respect. Better eval usually takes more time, but the slowdown is probably fairly constant so the elo loss dissipates with increased depth while the elo gain from better eval may remain fairly constant or perhaps even grow. This could be tested by, for example, making a version of Stockfish with the basic material values distorted, perhaps by reducing all of them by a constant like 50 "SF" points, while giving the distorted version double time at various time limits. Or perhaps cut the total of pawn structure in half with double time.
I'm a bit unclear on why you say that super-fast play measures eval. Is it so fast that search differences like LMR mostly vanish? What is the average search depth you get at these levels?
Komodo rules!

elpapa
Posts: 203
Joined: Sun Jan 18, 2009 10:27 pm
Location: Sweden
Full name: Patrik Karlsson

Re: Another attempt at comparing Evals ELO-wise

Post by elpapa » Mon May 22, 2017 11:56 pm

Laskos wrote:
SzG wrote:Please! Elo!
Well, I understand that Arpad Elo is a Hungarian
Let's just be glad his last name wasn't Oberknezsevics.
