An attempt to measure the knowledge of engines

Discussion of anything and everything relating to chess playing software and machines.

Moderators: hgm, Rebel, chrisw

Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: An attempt to measure the knowledge of engines

Post by Laskos »

bob wrote:
Laskos wrote:
The consistency is encouraging, but it does not yet prove that the methodology is correct.
At least it suggests it is not merely random noise.
Indeed, I hope to make some sense of this. I am encouraged by several indications that something real is being measured:

1/ Andscacs-Sungorus is an abomination: an extremely weak eval bolted onto a very strong search. Although it beats SOS, Fruit, and Shredder 6 in normal games, in this list it comes last by a wide margin, as it should.
2/ Fruit was known to have a simple eval, and although it beats both Shredder 6 and SOS in normal games, it ranks below them on this list, as it should.
3/ While strength is not the determining factor in the list, the newer and stronger engines still tend to score better, indicating progress over time, as it should.

I re-tested everything, taking more care with the averages. I used a weighted geometric mean of the opening-middlegame and endgame phases, giving opening-middlegame more weight since it matters more in play; geometric rather than arithmetic averages are the appropriate means for these node counts.
So, if a = number of nodes used during the late opening and b = number of nodes used during the endgame, the mean is (a^2 * b)^(1/3) instead of the old (a + b)/2. The results are close to the old ones, suggesting these Elo numbers are robust and, hopefully, mostly reflect the engines' evals.
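The phase averaging described above can be sketched in a few lines of Python (the function name is mine):

```python
def phase_mean(a: float, b: float) -> float:
    """Weighted geometric mean of node counts: a = late-opening/middlegame
    nodes (double weight), b = endgame nodes, i.e. (a^2 * b)^(1/3)."""
    return (a * a * b) ** (1.0 / 3.0)

# Example: a = 2000, b = 250 gives (2000^2 * 250)^(1/3) = 1000 nodes,
# versus 1125 for the old arithmetic mean (a + b) / 2.
```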

The new, slightly refined list, with node counts to depth=4 included:

Code:

   # PLAYER                  : RATING    POINTS  PLAYED    (%)      NODES to depth 4
   1 Gaviota 1.0             :  153.9     706.5    1000   70.7%      1078
   2 Komodo 3                :  138.6     688.0    1000   68.8%      1852
   3 Houdini 4               :  128.9     676.0    1000   67.6%      1967
   4 Komodo 8                :  120.2     665.0    1000   66.5%      1418
   5 Houdini 1.5             :   91.4     627.5    1000   62.8%      2177
   6 RobboLito 0.085         :   62.3     588.0    1000   58.8%      1862
   7 Shredder 12 depth       :   13.0     518.5    1000   51.9%      2350
   8 Shredder 12             :    0.0    9569.0   17992   53.2%       
   9 Stockfish 21.03.2015    :   -2.1     497.0    1000   49.7%       558
  10 Andscacs 0.72           :  -20.1     469.5     996   47.1%      3351
  11 Texel 1.05              :  -29.9     457.5    1000   45.8%      1823
  12 Stockfish 2.1.1         :  -45.1     436.0    1000   43.6%      1899
  13 Crafty 24.1             :  -84.7     380.0     996   38.2%      1364
  14 Shredder 6PB            :  -92.1     371.5    1000   37.1%      3269
  15 Shredder 9.1            : -104.6     355.0    1000   35.5%      3375
  16 Strelka 2.0             : -122.9     331.5    1000   33.1%      4508
  17 SOS 5.1                 : -178.3     265.5    1000   26.6%      3079
  18 Fruit 2.1               : -179.7     264.0    1000   26.4%      7268
  19 Andscacs-Sung           : -339.4     126.0    1000   12.6%      3699
I estimate one standard deviation of these Elos (which may mean something :)) to be about 15 Elo points.
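Since each engine in the list appears to have played a gauntlet against the Shredder 12 anchor (pinned at 0.0), the ratings can be approximately reproduced from the score percentages with the standard logistic Elo formula. A minimal sketch, with a function name of my choosing; the rating tool's exact model (e.g. BayesElo) differs slightly:

```python
import math

def elo_from_score(p: float) -> float:
    """Elo difference implied by a score fraction p under the logistic model."""
    return 400.0 * math.log10(p / (1.0 - p))

# Gaviota's 70.7% score works out to roughly +153 Elo, close to the
# table's +153.9; the small gap comes from the rating tool's model.
```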
Karlo Bala
Posts: 373
Joined: Wed Mar 22, 2006 10:17 am
Location: Novi Sad, Serbia
Full name: Karlo Balla

Re: An attempt to measure the knowledge of engines

Post by Karlo Bala »

I like the approach; however, there is one thing to think about. Testing engines at small depths often favors bigger values for the queen, mobility, king safety, passed pawns, etc. I played with Toga, and some very weird material values were better than the original at small depths, while the original was much better at bigger depths. If you have time, run the test for depths 2, 3, 5, 6. I expect very different results.
Best Regards,
Karlo Balla Jr.

Re: An attempt to measure the knowledge of engines

Post by Laskos »

Karlo Bala wrote:I like the approach; however, there is one thing to think about. Testing engines at small depths often favors bigger values for the queen, mobility, king safety, passed pawns, etc. I played with Toga, and some very weird material values were better than the original at small depths, while the original was much better at bigger depths. If you have time, run the test for depths 2, 3, 5, 6. I expect very different results.
I tested some engines at depth=5, and the results look similar to depth=4. I think the search begins to kick in a bit, so Komodo 3 overtakes Gaviota 1.0 by a tiny margin. I still think depth=4 is a good balance between higher depths, where search is important, and lower depths, where the quality of the games is too low to test the eval.

depth=5:

Code:

   # PLAYER                  : RATING    POINTS  PLAYED    (%)
   1 Komodo 3                :  162.9     717.0    1000   71.7%
   2 Gaviota 1.0             :  154.4     707.0    1000   70.7%
   3 Houdini 4               :  148.9     700.5    1000   70.0%
   4 Komodo 8                :  141.9     692.0    1000   69.2%
   5 Shredder 12             :    0.0    3483.0    7000   49.8%
   6 Stockfish 21.03.2015    : -108.9     349.5    1000   35.0%
   7 Fruit 2.1               : -215.2     226.5    1000   22.6%
   8 Andscacs-Sung           : -341.8     124.5    1000   12.4%
The stability of results is encouraging.

Re: An attempt to measure the knowledge of engines

Post by Laskos »

I tried to include the Rybka engines in the list, although their node counting and reported depth are largely a matter of speculation. Rybka's node counting seems to count only White make-moves and divide by 7, so the reported figure is low by a factor of ~14, or maybe ~16 if null moves are included in the accounting. Given its offset reported depth, I chose Rybka's depth=2 instead of depth=4. Depth is the less stringent parameter: even if the "real" depth is off by one, it does not matter too much, since the node counts are taken at the depth actually used.
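The factor of ~14 follows from the speculated scheme: the counter sees only White's make-moves (about half of all make-moves) and divides by 7, so undoing both steps multiplies the reported figure by 2 * 7 = 14. A sketch under that assumption (the function name is mine, and the scheme itself is speculation, as noted above):

```python
def corrected_nodes(reported: int, include_null_moves: bool = False) -> int:
    """Rescale Rybka's reported node count to a comparable true count.

    Speculative: reported = (White make-moves) / 7, so multiply by 7 to undo
    the division and by 2 to account for Black's moves; use ~16 instead if
    null moves are also counted."""
    factor = 16 if include_null_moves else 14
    return reported * factor
```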
Rybka seems to be one of the most "knowledgeable" engines:

Code:

   # PLAYER                  : RATING    POINTS  PLAYED    (%)
   1 Gaviota 1.0             :  153.9     706.5    1000   70.7%
   2 Rybka 4.1               :  141.9     692.0    1000   69.2%
   3 Komodo 3                :  138.6     688.0    1000   68.8%
   4 Houdini 4               :  128.9     676.0    1000   67.6%
   5 Komodo 8                :  120.2     665.0    1000   66.5%
   6 Rybka 3                 :   94.8     632.0    1000   63.2%
   7 Houdini 1.5             :   91.4     627.5    1000   62.8%
   8 RobboLito 0.085         :   62.3     588.0    1000   58.8%
   9 Shredder 12 depth       :   13.0     518.5    1000   51.9%
  10 Shredder 12             :    0.0   10245.0   19992   51.2%
  11 Stockfish 21.03.2015    :   -2.1     497.0    1000   49.7%
  12 Andscacs 0.72           :  -20.1     469.5     996   47.1%
  13 Texel 1.05              :  -29.9     457.5    1000   45.8%
  14 Stockfish 2.1.1         :  -45.1     436.0    1000   43.6%
  15 Crafty 24.1             :  -84.7     380.0     996   38.2%
  16 Shredder 6PB            :  -92.1     371.5    1000   37.1%
  17 Shredder 9.1            : -104.6     355.0    1000   35.5%
  18 Strelka 2.0             : -122.9     331.5    1000   33.1%
  19 SOS 5.1                 : -178.3     265.5    1000   26.6%
  20 Fruit 2.1               : -179.7     264.0    1000   26.4%
  21 Andscacs-Sung           : -339.4     126.0    1000   12.6%