An attempt to measure the knowledge of engines

Discussion of anything and everything relating to chess playing software and machines.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An attempt to measure the knowledge of engines

Post by bob »

Laskos wrote:
bob wrote:
Laskos wrote:
bob wrote:This still leaves a hole. 4 plies is not a "constant". Some programs extend more aggressively, some less so. Some reduce more aggressively (stockfish particularly) some less so. So you are STILL measuring search differences along with eval differences. This was what I was trying to do in my simple 1 ply test. But even then I had to modify both programs so that one ply meant the same thing. And that is not even a sure thing since reducing at the root could still be different between the two programs.

One good test would be a vs b to fixed depth, then swap their evaluations and play again. But that would be beyond a royal pain, obviously.

I am not sure there is any really accurate way to compare evaluations so long as the searches are not similar. One thing I could do, if you compare stockfish to Crafty your way, is modify both to do a bare-bones 4 ply search + quiescence (no extensions, no reductions, nothing in q-search except captures), and you could run the test again to the same depth and compare the numbers. If they are close, then what you are doing is probably more than good enough. If they vary significantly, you could experiment to see if there is any simple way of getting them to produce the same results as the code I sent.

4 plies would still be instant without any pruning or reductions involved, so it would be quick.
It is not a simple 4 ply search. It is a 4 ply search of engine X against Shredder, with Shredder's node count set to the average number of nodes engine X uses during a 4 ply search. Fixed nodes versus fixed depth may not be desirable, but all engines "benefited" from the same treatment against Shredder. Still, there are many undocumented quantities, one simple one being that a single fixed depth search can easily be off from the average node count by a factor of 4.
But if you think about it, you are trying to say a 4 ply search with X is equivalent to N nodes with Y. How confident are you that, just because you see how many nodes a 4 ply search traverses, the two are equal? A very good selective searcher will be able to go far deeper if it gets to search N nodes. If you tried this with the old chess master program, you would find lots of positions where it could not finish a 4 ply search in any reasonable amount of time, due to the way it selectively searched things.

If you say 4 plies = N nodes for X, and then you let Y search for N nodes, that may or may not be very accurate. Impossible to say. What I would expect to see, most likely, is that two programs with a very similar eval, but very different searches, might produce a lop-sided result because of the way you are equating their searches.

Not saying it is right, wrong, good or bad. Saying it is "unknown". I would prefer to normalize the searches (hard to do with commercials of course; it would take months of RE study to figure out how) so that the only difference is the evaluations. That was what I did in my test: a 1 ply search + captures, which would REALLY seem to lean on the evaluation for everything. And let me tell you, there are some really ODD results when you do that. Walking into repetitions without knowing until it is too late, missing simple mate threats since you only allow the opponent to recapture. Etc. The only good thing is that playing a million games is very fast at sd=1, so you can cover up some of the noise with sheer volume of games.
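To make the idea concrete, here is what a "1 ply search + capture-only quiescence" driver looks like. This is a minimal sketch in Python on top of the python-chess library, purely for illustration; it is not Crafty's actual code, and the material-only evaluate() is just a stand-in for whatever evaluation is being tested:

Code: Select all

# Minimal sketch of a 1 ply search with a capture-only quiescence search.
# Not any engine's real code; evaluate() is a placeholder for the eval
# under test (here: material only, from the side to move's point of view).
import chess

PIECE_VALUES = {chess.PAWN: 100, chess.KNIGHT: 320, chess.BISHOP: 330,
                chess.ROOK: 500, chess.QUEEN: 900, chess.KING: 0}

def evaluate(board):
    score = 0
    for piece_type, value in PIECE_VALUES.items():
        score += value * len(board.pieces(piece_type, board.turn))
        score -= value * len(board.pieces(piece_type, not board.turn))
    return score

def qsearch(board, alpha, beta):
    # Stand pat, then captures only: no check extensions, and it will
    # happily stand pat even while in check, which is exactly the kind
    # of crudeness that makes sd=1 games so odd.
    stand_pat = evaluate(board)
    if stand_pat >= beta:
        return beta
    alpha = max(alpha, stand_pat)
    for move in board.legal_moves:
        if not board.is_capture(move):
            continue
        board.push(move)
        score = -qsearch(board, -beta, -alpha)
        board.pop()
        if score >= beta:
            return beta
        alpha = max(alpha, score)
    return alpha

def search_1ply(board):
    # One full ply of legal moves, then straight into the q-search.
    best_move, best_score = None, -10**9
    for move in board.legal_moves:
        board.push(move)
        score = -qsearch(board, -10**9, 10**9)
        board.pop()
        if score > best_score:
            best_move, best_score = move, score
    return best_move, best_score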
I performed a self-consistency check, which I should have performed to start with; luckily it was fine. I got the average node count of Shredder 12 itself to depth=4 (on average 2441 nodes):

Code: Select all

 1) Stockfish 21.03.2015  598 nodes
 2) Komodo 8             1455 nodes
 3) Houdini 4            1957 nodes
 4) Robbolito 0.085      1853 nodes
 5) Texel 1.05           1831 nodes
 6) Gaviota 1.0          1091 nodes
 7) Strelka 2.0          4485 nodes
 8) Fruit 2.1            7294 nodes
 9) Komodo 3             1865 nodes
10) Stockfish 2.1.1      1892 nodes
11) Houdini 1.5          2167 nodes
12) Crafty 24.1          1584 nodes

13) Shredder 12          2441 nodes
Then I had Shredder 12 at depth=4 play against the standard candle, Shredder 12 at fixed nodes=2441, and the result is equal within the error margins:

Code: Select all

   # PLAYER                  : RATING    POINTS  PLAYED    (%)
   1 Gaviota 1.0             :  178.8     735.0    1000   73.5%
   2 Komodo 3                :  159.9     713.5    1000   71.3%
   3 Houdini 4               :  125.3     671.5    1000   67.2%
   4 Komodo 8                :  109.3     651.0    1000   65.1%
   5 Houdini 1.5             :  101.6     641.0    1000   64.1%
   6 RobboLito 0.085         :   43.0     561.0    1000   56.1%
   7 Shredder 12 nodes       :    0.0    6213.5   13000   47.8%
   8 Shredder 12 depth       :   -7.0     490.0    1000   49.0%
   9 Stockfish 21.03.2015    :  -11.2     484.0    1000   48.4%
  10 Stockfish 2.1.1         :  -22.1     468.5    1000   46.9%
  11 Texel 1.05              :  -43.0     439.0    1000   43.9%
  12 Strelka 2.0             : -109.6     348.5    1000   34.9%
  13 Crafty 24.1             : -115.8     340.5    1000   34.0%
  14 Fruit 2.1               : -199.1     243.0    1000   24.3%
The consistency is encouraging, but it does not yet prove that the methodology is correct. At least it suggests the results are not just random noise.
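For what it is worth, this kind of depth-vs-nodes pairing is easy to script. Here is a minimal sketch of one such game using the python-chess UCI driver; that is my choice of tool, with a placeholder engine path, not the exact setup used for the table above, and a real test would add an opening set and adjudication (which cutechess-cli or the Shredder GUI handles for you):

Code: Select all

# Minimal sketch: Shredder 12 at "go depth 4" vs. Shredder 12 at
# "go nodes 2441", driven over UCI with the python-chess library.
# The engine path is a placeholder.
import chess
import chess.engine

ENGINE_PATH = "./shredder12"  # placeholder

def play_one_game(depth_side_is_white):
    white = chess.engine.SimpleEngine.popen_uci(ENGINE_PATH)
    black = chess.engine.SimpleEngine.popen_uci(ENGINE_PATH)
    limits = (chess.engine.Limit(depth=4), chess.engine.Limit(nodes=2441))
    if not depth_side_is_white:
        limits = limits[::-1]
    board = chess.Board()
    while not board.is_game_over(claim_draw=True):
        if board.turn == chess.WHITE:
            engine, limit = white, limits[0]
        else:
            engine, limit = black, limits[1]
        board.push(engine.play(board, limit).move)
    white.quit()
    black.quit()
    return board.result(claim_draw=True)

# Alternate colours so both "players" get each limit equally often.
print(play_one_game(True), play_one_game(False))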
mjlef
Posts: 1494
Joined: Thu Mar 30, 2006 2:08 pm

Re: An attempt to measure the knowledge of engines

Post by mjlef »

Although I do not have access to the Komodo 3 code (though I will look around for it), Komodo has become much more selective over the years. That selectivity lets it search deeper, but it will make a fixed depth search look weaker. Also, I think the more recent Stockfish versions have become more selective even in PV nodes (hence the very small number of nodes it shows for a 4 ply search). So a fixed depth search would hurt programs like that a lot. The selectivity is meant to make the program reach more plies of depth, which will not happen in a fixed depth search.

I like the concept of your experiment, but the high selectivity of programs means some of them will not look at a lot of the nodes, and hence underperform at low depths. Most programs I have seen do not prune moves at the first ply, so a one ply search might kinda work. Then again, in a one ply search a program that extends more moves would gain a lot of elo over one which does not. Possibly you could take the nps of an engine and its branching factor and reach some conclusion. Or modify each program to have the same nonselective search.

A few months ago I tried stripping Komodo's evaluation down to something akin to piece square tables and material eval. It cost between 500 and 600 elo (and sped the node rate up a lot). Top programs have gained a lot from both evaluation and search improvements.
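For readers who have not seen one, "piece square tables and material" means something like the sketch below. It is a generic textbook illustration in Python, with made-up numbers and only a toy knight table filled in; it is of course not Komodo's stripped-down eval:

Code: Select all

# Generic material + piece-square-table evaluation, White's point of view.
# Values and the knight table are invented for the example.
import chess

MATERIAL = {chess.PAWN: 100, chess.KNIGHT: 320, chess.BISHOP: 330,
            chess.ROOK: 500, chess.QUEEN: 900, chess.KING: 0}

KNIGHT_PST = [
    -50,-40,-30,-30,-30,-30,-40,-50,
    -40,-20,  0,  5,  5,  0,-20,-40,
    -30,  5, 10, 15, 15, 10,  5,-30,
    -30,  0, 15, 20, 20, 15,  0,-30,
    -30,  5, 15, 20, 20, 15,  5,-30,
    -30,  0, 10, 15, 15, 10,  0,-30,
    -40,-20,  0,  0,  0,  0,-20,-40,
    -50,-40,-30,-30,-30,-30,-40,-50,
]

# One 64-entry table per piece type; all but the knight table are zero here.
PST = {piece_type: [0] * 64 for piece_type in MATERIAL}
PST[chess.KNIGHT] = KNIGHT_PST

def evaluate(board):
    score = 0
    for square, piece in board.piece_map().items():
        value = MATERIAL[piece.piece_type]
        # Mirror the square for Black so one table serves both colours.
        sq = square if piece.color == chess.WHITE else chess.square_mirror(square)
        value += PST[piece.piece_type][sq]
        score += value if piece.color == chess.WHITE else -value
    return score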
cdani
Posts: 2204
Joined: Sat Jan 18, 2014 10:24 am
Location: Andorra

Re: An attempt to measure the knowledge of engines

Post by cdani »

Laskos wrote: I am curious if this abomination "Andscacs - Sungorus" has serious issues, visible with the naked eye in games.
The most obvious problem I see with it is that it quite often plays without a plan. And of course it does not see relatively easy conceptual problems coming.
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: An attempt to measure the knowledge of engines

Post by michiguel »

mjlef wrote:Although I do not have access to the Komodo 3 code (though I will look around for it), Komodo has become much more selective over the years. That selectivity lets it search deeper, but it will make a fixed depth search look weaker. Also, I think the more recent Stockfish versions have become more selective even in PV nodes (hence the very small number of nodes it shows for a 4 ply search). So a fixed depth search would hurt programs like that a lot. The selectivity is meant to make the program reach more plies of depth, which will not happen in a fixed depth search.

I like the concept of your experiment, but the high selectivity of programs means some of them will not look at a lot of the nodes, and hence underperform at low depths. Most programs I have seen do not prune moves at the first ply, so a one ply search might kinda work. Then again, in a one ply search a program that extends more moves would gain a lot of elo over one which does not. Possibly you could take the nps of an engine and its branching factor and reach some conclusion. Or modify each program to have the same nonselective search.

A few months ago I tried stripping Komodo's evaluation down to something akin to piece square tables and material eval. It cost between 500 and 600 elo (and sped the node rate up a lot). Top programs have gained a lot from both evaluation and search improvements.
If a program is very selective, it will search fewer nodes at depth 4. Kai's idea is to correct for that by making the opponent (Shredder) search fewer nodes to compensate. So this is not a typical fixed depth=4 experiment.

This is more like Shredder @ x nodes vs. Engine_A @ x nodes (average, which happens to be depth 4).

You can see which programs are more selective in the table that Kai provided, and you will see that the results do not correlate with selectivity.

Miguel
Lyudmil Tsvetkov
Posts: 6052
Joined: Tue Jun 12, 2012 12:41 pm

Re: An attempt to measure the knowledge of engines

Post by Lyudmil Tsvetkov »

mjlef wrote:Although I do not have access to the Komodo 3 code (though I will look around for it), Komodo has become much more selective over the years. That selectivity lets it search deeper, but it will make a fixed depth search look weaker. Also, I think the more recent Stockfish versions have become more selective even in PV nodes (hence the very small number of nodes it shows for a 4 ply search). So a fixed depth search would hurt programs like that a lot. The selectivity is meant to make the program reach more plies of depth, which will not happen in a fixed depth search.

I like the concept of your experiment, but the high selectivity of programs means some of them will not look at a lot of the nodes, and hence underperform at low depths. Most programs I have seen do not prune moves at the first ply, so a one ply search might kinda work. Then again, in a one ply search a program that extends more moves would gain a lot of elo over one which does not. Possibly you could take the nps of an engine and its branching factor and reach some conclusion. Or modify each program to have the same nonselective search.

A few months ago I tried stripping Komodo's evaluation down to something akin to piece square tables and material eval. It cost between 500 and 600 elo (and sped the node rate up a lot). Top programs have gained a lot from both evaluation and search improvements.
Who could have said it better?
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: An attempt to measure the knowledge of engines

Post by Laskos »

mjlef wrote:Although I do not have access to the Komodo 3 code (though I will look around for it), Komodo has become much more selective over the years. That selectivity lets it search deeper, but it will make a fixed depth search look weaker. Also, I think the more recent Stockfish versions have become more selective even in PV nodes (hence the very small number of nodes it shows for a 4 ply search). So a fixed depth search would hurt programs like that a lot. The selectivity is meant to make the program reach more plies of depth, which will not happen in a fixed depth search.

I like the concept of your experiment, but the high selectivity of programs means some of them will not look at a lot of the nodes, and hence underperform at low depths. Most programs I have seen do not prune moves at the first ply, so a one ply search might kinda work. Then again, in a one ply search a program that extends more moves would gain a lot of elo over one which does not. Possibly you could take the nps of an engine and its branching factor and reach some conclusion. Or modify each program to have the same nonselective search.

A few months ago I tried stripping Komodo's evaluation down to something akin to piece square tables and material eval. It cost between 500 and 600 elo (and sped the node rate up a lot). Top programs have gained a lot from both evaluation and search improvements.
I don't quite get this pruning argument in the context of the present test. Pruning, correctly done, does not degrade the value returned at a node, so testing at fixed nodes should be insensitive to pruning in this context. The test here mimics fixed nodes at a low node count, although most engines don't handle the "go nodes" directive very well (or at all). The node count is not as low as at depth=1, but I saw such crappy games at depth=1, and engines differ so much in what they are doing there, that integrating the eval over, say, ~2000 searched nodes (what depth=4 amounts to, varying from engine to engine) seems more representative of eval performance.

About the eval: does it happen that you have to remove or shortcut an eval component in order to gain Elo? I would imagine these sorts of things happen.
cdani
Posts: 2204
Joined: Sat Jan 18, 2014 10:24 am
Location: Andorra

Re: An attempt to measure the knowledge of engines

Post by cdani »

Laskos wrote: About the eval: does it happen that you have to remove or shortcut an eval component in order to gain Elo? I would imagine these sorts of things happen.
If I understood the question correctly, of course it happens in all engines. An example is the mg/eg separation.
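To illustrate what the mg/eg separation means: every eval term is scored twice, once for the middlegame and once for the endgame, and the two scores are blended by the game phase. A generic sketch in Python, not Andscacs code, with the usual made-up phase weights:

Code: Select all

# Tapered (mg/eg) evaluation sketch. Phase weights are the common
# textbook choice, not taken from any particular engine.
import chess

PHASE_WEIGHTS = {chess.KNIGHT: 1, chess.BISHOP: 1, chess.ROOK: 2, chess.QUEEN: 4}
MAX_PHASE = 24  # 4 minors * 1 + 4 rooks * 2 + 2 queens * 4

def game_phase(board):
    phase = 0
    for piece_type, weight in PHASE_WEIGHTS.items():
        phase += weight * len(board.pieces(piece_type, chess.WHITE))
        phase += weight * len(board.pieces(piece_type, chess.BLACK))
    return min(phase, MAX_PHASE)

def tapered(mg_score, eg_score, board):
    # Full middlegame weight with all pieces on, full endgame weight
    # when only kings and pawns are left.
    phase = game_phase(board)
    return (mg_score * phase + eg_score * (MAX_PHASE - phase)) // MAX_PHASE

# Example: a term worth 10 in the middlegame and 40 in the endgame.
# print(tapered(10, 40, chess.Board()))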
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: An attempt to measure the knowledge of engines

Post by Laskos »

cdani wrote:
Laskos wrote: About the eval: does it happen that you have to remove or shortcut an eval component in order to gain Elo? I would imagine these sorts of things happen.
If I understood the question correctly, of course it happens in all engines. An example is the mg/eg separation.
I mean from version to version, not so much during the game. I would think that eval features are often more volatile and piecemeal than, say, pruning techniques.
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: An attempt to measure the knowledge of engines

Post by lucasart »

Laskos wrote:The Shredder GUI has a nice feature of giving output statistics by ply after an EPD set of positions is fed to it for analysis. The positions are FENs without a solution, and they are balanced. Comparing the engines directly at depth=1 is tricky, and Bob Hyatt explained it well:
bob wrote: I can't say "for most reasonable programs". I was directly addressing a claim about stockfish and Crafty.

The "depth 1" can be tricky. Here's why:

(1) some programs extend depth when they give check (Crafty, for example), while others extend depth when escaping check. That is a big difference, in that with the latter you drop into the q-search while in check. Nothing wrong with that, but you might or might not recognize a mate. Which means that with a 1 ply search, some programs will recognize a mate in 1 and some won't.

(2) q-search. Some do a very simple q-search. Some escape or give check at the first ply of q-search. Some give check at the first search ply, escape at the second and give check again at the third. Those are not equal.

When I tried this with stockfish, I ran into the same problem. All 1 ply searches are not created equal.

As I mentioned, I am not sure my test is all that useful, because a program's evaluation is written around its search, and vice versa. If you limit one part, you might be limiting the other part without knowing it, skewing the results.
To make the most of "eval driven search", I adopted the following methodology:

1/ Use depth=4 to avoid the issues of threats, q-search imbalances, and the inadequacy of a depth=1 search in games.
2/ Most engines do not follow the "go nodes" UCI command well for small node counts. They do follow the "go depth" command rather well.
3/ Shredder engines (here I used Shredder 12 as the standard candle) follow the "go nodes" command literally, even for a small number of nodes.
4/ Calculate the average node count for each engine to depth=4 over many positions (150 late opening, 171 endgame); a sketch of this step follows the list.
5/ Set up the matches: engine X at depth=4 versus Shredder 12 at the number of nodes that engine uses to depth=4.
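A minimal sketch of step 4, for anyone doing it outside the Shredder GUI: run a depth=4 search on every position over UCI and average the node counts the engine reports. This uses the python-chess library; the paths are placeholders, the file is assumed to hold one full FEN per line, and the engine must report "nodes" in its UCI info:

Code: Select all

# Average node count to depth=4 over a file of FEN positions,
# measured over UCI with the python-chess library.
import chess
import chess.engine

def average_nodes(engine_path, fen_file, depth=4):
    engine = chess.engine.SimpleEngine.popen_uci(engine_path)
    counts = []
    with open(fen_file) as f:
        for line in f:
            fen = line.strip()
            if not fen:
                continue
            info = engine.analyse(chess.Board(fen), chess.engine.Limit(depth=depth))
            if "nodes" in info:
                counts.append(info["nodes"])
    engine.quit()
    return sum(counts) / len(counts) if counts else 0

# e.g. average_nodes("./gaviota", "late_opening_150.fen")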

In the Shredder GUI, as an example, Gaviota 1.0:

Code: Select all

Late opening positions:

  TotTime: 25s    SolTime: 25s
  Ply: 0   Positions:150   Avg Nodes:       0   Branching = 0.00
  Ply: 1   Positions:150   Avg Nodes:      90   Branching = 0.00
  Ply: 2   Positions:150   Avg Nodes:     255   Branching = 2.83
  Ply: 3   Positions:150   Avg Nodes:     524   Branching = 2.05
  Ply: 4   Positions:150   Avg Nodes:    1254   Branching = 2.39
Engine: Gaviota v1.0 (2048 MB)
by Miguel A. Ballicora


Endgame positions:

  TotTime: 28s    SolTime: 28s
  Ply: 0   Positions:171   Avg Nodes:       0   Branching = 0.00
  Ply: 1   Positions:171   Avg Nodes:      46   Branching = 0.00
  Ply: 2   Positions:171   Avg Nodes:     158   Branching = 3.43
  Ply: 3   Positions:171   Avg Nodes:     386   Branching = 2.44
  Ply: 4   Positions:171   Avg Nodes:     928   Branching = 2.40
Engine: Gaviota v1.0 (2048 MB)
by Miguel A. Ballicora
I took the average nodes to depth=4 for a dozen engines:

Code: Select all

 1) Stockfish 21.03.2015  598 nodes
 2) Komodo 8             1455 nodes
 3) Houdini 4            1957 nodes
 4) Robbolito 0.085      1853 nodes
 5) Texel 1.05           1831 nodes
 6) Gaviota 1.0          1091 nodes
 7) Strelka 2.0          4485 nodes
 8) Fruit 2.1            7294 nodes
 9) Komodo 3             1865 nodes
10) Stockfish 2.1.1      1892 nodes
11) Houdini 1.5          2167 nodes
12) Crafty 24.1          1584 nodes
I played all of them at fixed depth=4 against Shredder 12 at the fixed node count determined for each engine.

With Shredder 12 as the standard candle, the strength of the evals:

Code: Select all

   # PLAYER                  : RATING    POINTS  PLAYED    (%)
   1 Gaviota 1.0             :  178.8     735.0    1000   73.5%
   2 Komodo 3 (2011)         :  159.9     713.5    1000   71.3%
   3 Houdini 4               :  125.3     671.5    1000   67.2%
   4 Komodo 8                :  109.3     651.0    1000   65.1%
   5 Houdini 1.5 (2011)      :  101.6     641.0    1000   64.1%
   6 RobboLito 0.085         :   43.0     561.0    1000   56.1%
   7 Shredder 12             :    0.0    5703.5   12000   47.5%
   8 Stockfish 21.03.2015    :  -11.2     484.0    1000   48.4%
   9 Stockfish 2.1.1 (2011)  :  -22.1     468.5    1000   46.9%
  10 Texel 1.05              :  -43.0     439.0    1000   43.9%
  11 Strelka 2.0 (Rybka 1.0) : -109.6     348.5    1000   34.9%
  12 Crafty 24.1             : -115.8     340.5    1000   34.0%
  13 Fruit 2.1               : -199.1     243.0    1000   24.3%
a) This is as close as I can get to some measure of the evals as integrated into the search, without messing with the sources (often unavailable).
b) Besides the "anomaly" with Gaviota 1.0 (in fact there were earlier indications that Gaviota has a strong eval), there is another one: Komodo 3 appears to have a better (maybe larger) eval than Komodo 8. Larry or Mark could confirm or deny the regression (or thinning) of the eval.
c) The Stockfish eval seems no stronger than that of Shredder 12, but it seems significantly stronger than that of Crafty 24.1.
Interesting. But I think SF is significantly undervalued by this measure, because even at depth=4 there is a lot of pretty crazy pruning... For testing eval-only patches (with limited resources) I typically use depth=10.
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: An attempt to measure the knowledge of engines

Post by lucasart »

Laskos wrote:
Lyudmil Tsvetkov wrote: Well, Daniel just answered you very convincingly: you cannot separate eval from search, and as a rule a more sophisticated search goes hand in hand with a more sophisticated eval.

There are of course smaller or larger variances: some engines might emphasize eval and others search, or emphasize them at different periods of engine development, but as a whole eval and search should be harmonised, being close together in contribution.

That is why Mr. Zibi managed just the 2600-elo barrier. :)

And why Fruit, despite having tremendous search, was still much inferior to Rybka, Houdini and other engines having both good eval and good search.

But, on the other hand, who says Fruit had such a bad/basic eval?
Was it not Fabien who first introduced scoring passers in terms of ranks, and this proved a big winner?
Well, that Fruit had a simple eval even compared to Crafty was visible in its sources. If your only argument is that the strongest engines must have the most elaborate and largest evals, then somebody has to check this sort of hand-waving.
Indeed. Quantity != Quality...
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.