An attempt to measure the knowledge of engines

Discussion of anything and everything relating to chess playing software and machines.

Moderators: hgm, Rebel, chrisw

User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

An attempt to measure the knowledge of engines

Post by Laskos »

Shredder GUI has a nice feature: after an EPD set of positions is fed to it for analysis, it reports output statistics by ply. The positions are FENs without solutions, and they are balanced. Comparing engines directly at depth=1 is tricky, and Bob Hyatt explained it well:
bob wrote: I can't say "for most reasonable programs". I was directly addressing a claim about stockfish and Crafty.

The "depth 1" can be tricky. Here's why:

(1) some programs extend depth when they give check (Crafty, for example). While others extend depth when escaping check. That is a big difference, in that with the latter, you drop into the q-search while in check. Nothing wrong, but you might or might not recognize a mate. Which means that with a 1 ply search, some programs will recognize a mate in 1, some won't.

(2) q-search. Some do a very simple q-search. Some escape or give check at the first ply of q-search. Some give check at the first search ply, escape at the second and give check again at the third. Those are not equal.

When I tried this with stockfish, I ran into the same problem. All 1 ply searches are not created equal.

As I mentioned, I am not sure my test is all that useful, because a program's evaluation is written around its search, and vice-versa. If you limit one part, you might be limiting the other part without knowing, skewing the results.
To get the most out of an "eval-driven search", I adopted the following methodology:

1/ Use depth=4 to avoid the issues of threats, q-search imbalances, and the inadequacy of a depth=1 search in games.
2/ Most engines do not follow the "go nodes" UCI command well for small node counts, but they follow the "go depth" command rather well.
3/ Shredder engines (here I used Shredder 12 as the standard candle) follow the "go nodes" command literally, even for small node counts.
4/ Calculate the average number of nodes each engine uses to depth=4 over many positions (150 late opening, 171 endgame); a scripted sketch of this step follows the list.
5/ Set up the matches: engine X at fixed depth=4 versus Shredder 12 at the fixed node count that engine X uses to depth=4.
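
For concreteness, a minimal sketch of step 4 using python-chess; the engine path and file name are hypothetical, and the numbers in this thread actually come from the Shredder GUI statistics shown below:

Code: Select all

# Sketch only: average the node counts a UCI engine reports for depth=4
# searches over a file of FEN positions. The engine path and file name are
# hypothetical; the thread's figures come from the Shredder GUI statistics.
import chess
import chess.engine

def average_nodes(engine_path, fen_file, depth=4):
    total, count = 0, 0
    with chess.engine.SimpleEngine.popen_uci(engine_path) as engine:
        with open(fen_file) as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                board = chess.Board(line)
                info = engine.analyse(board, chess.engine.Limit(depth=depth))
                total += info.get("nodes", 0)
                count += 1
    return total / count if count else 0.0

print(average_nodes("./gaviota-1.0", "late_opening_150.fen"))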

In the Shredder GUI, with Gaviota 1.0 as an example:

Code: Select all

Late opening positions:

  TotTime: 25s    SolTime: 25s
  Ply: 0   Positions:150   Avg Nodes:       0   Branching = 0.00
  Ply: 1   Positions:150   Avg Nodes:      90   Branching = 0.00
  Ply: 2   Positions:150   Avg Nodes:     255   Branching = 2.83
  Ply: 3   Positions:150   Avg Nodes:     524   Branching = 2.05
  Ply: 4   Positions:150   Avg Nodes:    1254   Branching = 2.39
Engine: Gaviota v1.0 (2048 MB)
by Miguel A. Ballicora


Endgame positions:

  TotTime: 28s    SolTime: 28s
  Ply: 0   Positions:171   Avg Nodes:       0   Branching = 0.00
  Ply: 1   Positions:171   Avg Nodes:      46   Branching = 0.00
  Ply: 2   Positions:171   Avg Nodes:     158   Branching = 3.43
  Ply: 3   Positions:171   Avg Nodes:     386   Branching = 2.44
  Ply: 4   Positions:171   Avg Nodes:     928   Branching = 2.40
Engine: Gaviota v1.0 (2048 MB)
by Miguel A. Ballicora
I took the average nodes to depth=4 for a dozen engines:

Code: Select all

 1) Stockfish 21.03.2015  598 nodes
 2) Komodo 8             1455 nodes
 3) Houdini 4            1957 nodes
 4) Robbolito 0.085      1853 nodes
 5) Texel 1.05           1831 nodes
 6) Gaviota 1.0          1091 nodes
 7) Strelka 2.0          4485 nodes
 8) Fruit 2.1            7294 nodes
 9) Komodo 3             1865 nodes
10) Stockfish 2.1.1      1892 nodes
11) Houdini 1.5          2167 nodes
12) Crafty 24.1          1584 nodes
I played all of them at fixed depth=4 against Shredder 12 at the fixed node count determined for each engine; a sketch of one such game follows.
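
A minimal sketch of one such game, again assuming python-chess; the engine paths and the 1091-node figure (Gaviota's late-opening average above) are only illustrative, and in a real match the colors would alternate and the openings would be varied:

Code: Select all

# Sketch only: one game of engine X at fixed depth=4 versus Shredder 12 at a
# fixed node count (the average X needs to reach depth 4). Engine paths are
# hypothetical; the actual matches were run from a GUI, not a script.
import chess
import chess.engine

def play_game(x_path, shredder_path, x_avg_nodes):
    board = chess.Board()
    with chess.engine.SimpleEngine.popen_uci(x_path) as x, \
         chess.engine.SimpleEngine.popen_uci(shredder_path) as shredder:
        while not board.is_game_over():
            if board.turn == chess.WHITE:
                # engine X: "go depth 4"
                result = x.play(board, chess.engine.Limit(depth=4))
            else:
                # Shredder 12: "go nodes N", N = engine X's average to depth 4
                result = shredder.play(board, chess.engine.Limit(nodes=x_avg_nodes))
            board.push(result.move)
    return board.result()

print(play_game("./gaviota-1.0", "./shredder12", 1091))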

With Shredder 12 as the standard candle, the strength of the evals:

Code: Select all

   # PLAYER                  : RATING    POINTS  PLAYED    (%)
   1 Gaviota 1.0             :  178.8     735.0    1000   73.5%
   2 Komodo 3 (2011)         :  159.9     713.5    1000   71.3%
   3 Houdini 4               :  125.3     671.5    1000   67.2%
   4 Komodo 8                :  109.3     651.0    1000   65.1%
   5 Houdini 1.5 (2011)      :  101.6     641.0    1000   64.1%
   6 RobboLito 0.085         :   43.0     561.0    1000   56.1%
   7 Shredder 12             :    0.0    5703.5   12000   47.5%
   8 Stockfish 21.03.2015    :  -11.2     484.0    1000   48.4%
   9 Stockfish 2.1.1 (2011)  :  -22.1     468.5    1000   46.9%
  10 Texel 1.05              :  -43.0     439.0    1000   43.9%
  11 Strelka 2.0 (Rybka 1.0) : -109.6     348.5    1000   34.9%
  12 Crafty 24.1             : -115.8     340.5    1000   34.0%
  13 Fruit 2.1               : -199.1     243.0    1000   24.3%
a) This is as close as I can get to a meaningful comparison of evals integrated into their searches without messing with the sources (often unavailable).
b) Besides the "anomaly" of Gaviota 1.0 (in fact there were earlier indications that Gaviota has a strong eval), there is another: Komodo 3 appears to have a better (maybe larger) eval than Komodo 8. Larry or Mark could confirm or refute the regression (or thinning) of the eval.
c) The Stockfish eval seems no stronger than that of Shredder 12, but it seems significantly stronger than that of Crafty 24.1.
Lyudmil Tsvetkov
Posts: 6052
Joined: Tue Jun 12, 2012 12:41 pm

Re: An attempt to measure the knowledge of engines

Post by Lyudmil Tsvetkov »

Interesting, but I think you are measuring something other than evals.

Gaviota having better eval than Komodo?
Or SF?

Seems funny to me.

Obviously, eval and search are completely inseparable: specific evals work only with specific searches, and vice versa. So I really do not know what the added value of such measurements is.
Lyudmil Tsvetkov
Posts: 6052
Joined: Tue Jun 12, 2012 12:41 pm

Re: An attempt to measure the knowledge of engines

Post by Lyudmil Tsvetkov »

Another interesting question: is move ordering taken into account?
Does it belong to search or to eval?

When you order moves, you are supposed to use evaluation terms for the ordering, like PSQT, etc., and it is possible that move ordering plays a very important role in engine strength after all.

So, does move ordering belong to search or eval?

I think you simply cannot separate search and eval.
Lyudmil Tsvetkov
Posts: 6052
Joined: Tue Jun 12, 2012 12:41 pm

Re: An attempt to measure the knowledge of engines

Post by Lyudmil Tsvetkov »

Sorry for posting 3 messages in a row, but I always forget to add something.

Another hypothesis: I presume that no matter which engine you take, the contributions of its search and eval will be split roughly equally.

This is based on very simple observations:

- I follow the framework; over the past 2 years, the contributions of search and eval to the SF Elo increase have been roughly the same. I did not notice a single case where all the Elo was due entirely to eval, or entirely to search

- I presume the same is true of Komodo, at least based on the rare comments I have managed to get here from Larry, Mark, and Don

- looking at very weak engines that are just starting out: you write a move generator, create move ordering and basic search routines, add material values and mobility to the eval, and you have a simple engine; later those weak engines report on this forum that they have added some other eval features (maybe PSQT, passers, etc.) and maybe check extensions, LMR, etc., and all of this goes hand in hand

So I think real-life observations heavily support the claim that search and eval develop more or less in parallel, with no sharp distinction between the two, as engine strength increases. It is therefore reasonable to suppose that you cannot have two engines of roughly equal strength where one has an excellent eval and a very bad search, and the other an excellent search and a very bad eval.

I suppose a better eval goes hand in hand with a better search, and vice versa.

Is there a single author on this forum who would claim that, as his engine's strength increased, he consistently worked only on the eval, or only on the search?
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: An attempt to measure the knowledge of engines

Post by Laskos »

Lyudmil Tsvetkov wrote:Another interesting question: is move ordering taken into account?
Does it belong to search or to eval?

When you order moves, you are supposed to use evaluation terms for the ordering, like PSQT, etc., and it is possible that move ordering plays a very important role in engine strength after all.

So, does move ordering belong to search or eval?

I think you simply cannot separate search and eval.
Move ordering drives alpha-beta and, more generally, the pruning, so I consider it part of the search. PSQTs are part of the eval, and many eval parts are linked to the search. Still, the search here is just the small part I kept for consistency of the eval comparison. If I wanted eval and search together, I could do fixed-time or other regular testing.
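
As a toy illustration of that overlap (not any particular engine's scheme), here is a move-ordering key that mixes a pure search heuristic (MVV-LVA on captures) with an eval term (a made-up knight PSQT), assuming python-chess:

Code: Select all

# Toy illustration only (no particular engine's scheme): a move-ordering key
# that mixes a search heuristic (MVV-LVA on captures) with an eval term
# (a hypothetical knight PSQT), showing how ordering borrows from the eval.
import chess

PIECE_VALUE = {chess.PAWN: 100, chess.KNIGHT: 320, chess.BISHOP: 330,
               chess.ROOK: 500, chess.QUEEN: 900, chess.KING: 0}

# Hypothetical PSQT fragment: central knight squares get a small bonus.
KNIGHT_PSQT = [0] * 64
for sq in (chess.D4, chess.E4, chess.D5, chess.E5):
    KNIGHT_PSQT[sq] = 20

def order_key(board, move):
    score = 0
    if board.is_capture(move):
        victim = board.piece_type_at(move.to_square) or chess.PAWN  # e.p. capture
        attacker = board.piece_type_at(move.from_square)
        score += 10 * PIECE_VALUE[victim] - PIECE_VALUE[attacker]   # MVV-LVA
    if board.piece_type_at(move.from_square) == chess.KNIGHT:
        score += KNIGHT_PSQT[move.to_square] - KNIGHT_PSQT[move.from_square]
    return score

board = chess.Board()
ordered = sorted(board.legal_moves, key=lambda m: order_key(board, m), reverse=True)
print([board.san(m) for m in ordered[:5]])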

It is not inconceivable that Shredder 12 and Gaviota 1.0 have larger, more extensive evals that are less efficient Elo-wise at fixed time than Stockfish's. Likewise, that Komodo's eval became a bit slimmer from 3 to 8. IIRC Rybka 4 did so compared to Rybka 3, and generally, Elo at fixed time is a different matter from trying to compare mostly evals. It is possible that my results are meaningless, but not on the basis of someone's skepticism that a much stronger engine cannot have a slimmer eval.
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: An attempt to measure the knowledge of engines

Post by Laskos »

See the results in this two-year-old thread, for very short searches and longer ones:
http://talkchess.com/forum/viewtopic.ph ... 0&start=35

Gaviota seemed at that time to have better endgame knowledge than the much stronger Stockfish, and Stockfish improves dramatically with longer searches. Here I used 4 plies, meaning on average searches of about 1 ms.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An attempt to measure the knowledge of engines

Post by bob »

This still leaves a hole. 4 plies is not a "constant". Some programs extend more aggressively, some less so. Some reduce more aggressively (stockfish particularly) some less so. So you are STILL measuring search differences along with eval differences. This was what I was trying to do in my simple 1 ply test. But even then I had to modify both programs so that one ply meant the same thing. And that is not even a sure thing since reducing at the root could still be different between the two programs.

One good test would be a vs b to fixed depth, then swap their evaluations and play again. But that would be beyond a royal pain, obviously.

I am not sure there is any really accurate way to compare evaluations so long as the searches are not similar. One thing I could do: if you compare Stockfish to Crafty your way, I could modify both to do a bare-bones 4-ply search + quiescence (no extensions, no reductions, nothing in the q-search except captures), and you could run the test again to the same depth and compare the numbers. If they are close, then what you are doing is probably more than good enough. If they vary significantly, you could experiment to see if there is any simple way of getting them to produce the same results as the code I sent.

4 plies would still be instant without any pruning or reductions involved, so it would be quick.
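
A rough sketch of the kind of stripped-down search described above, assuming python-chess and a placeholder material-only eval (this is not Crafty's or Stockfish's actual code): fixed-depth negamax with plain alpha-beta, no extensions or reductions, and a captures-only quiescence search.

Code: Select all

# Sketch of a bare-bones fixed-depth negamax with plain alpha-beta:
# no extensions, no reductions, and a quiescence search that tries captures
# only. The material-only evaluate() is a placeholder, not a real engine's eval.
import chess

VALUE = {chess.PAWN: 100, chess.KNIGHT: 320, chess.BISHOP: 330,
         chess.ROOK: 500, chess.QUEEN: 900, chess.KING: 0}

def evaluate(board):
    # Material balance from the side to move's point of view.
    return sum(VALUE[p.piece_type] if p.color == board.turn else -VALUE[p.piece_type]
               for p in board.piece_map().values())

def qsearch(board, alpha, beta):
    stand_pat = evaluate(board)
    if stand_pat >= beta:
        return beta
    alpha = max(alpha, stand_pat)
    for move in board.legal_moves:
        if not board.is_capture(move):
            continue
        board.push(move)
        score = -qsearch(board, -beta, -alpha)
        board.pop()
        if score >= beta:
            return beta
        alpha = max(alpha, score)
    return alpha

def search(board, depth, alpha=-10**6, beta=10**6):
    if board.is_checkmate():
        return -100000   # mated; no ply-adjusted mate scores in this sketch
    if board.is_stalemate() or board.is_insufficient_material():
        return 0
    if depth == 0:
        return qsearch(board, alpha, beta)
    for move in board.legal_moves:
        board.push(move)
        score = -search(board, depth - 1, -beta, -alpha)
        board.pop()
        if score >= beta:
            return beta
        alpha = max(alpha, score)
    return alpha

print(search(chess.Board(), 4))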
Lyudmil Tsvetkov
Posts: 6052
Joined: Tue Jun 12, 2012 12:41 pm

Re: An attempt to measure the knowledge of engines

Post by Lyudmil Tsvetkov »

Laskos wrote:
Lyudmil Tsvetkov wrote:Another interesting question: is move ordering taken into account?
Does it belong to search or to eval?

When you order moves, you are supposed to use evaluation terms for the ordering, like PSQT, etc., and it is possible that move ordering plays a very important role in engine strength after all.

So, does move ordering belong to search or eval?

I think you simply cannot separate search and eval.
Move ordering drives alpha-beta and, more generally, the pruning, so I consider it part of the search. PSQTs are part of the eval, and many eval parts are linked to the search. Still, the search here is just the small part I kept for consistency of the eval comparison. If I wanted eval and search together, I could do fixed-time or other regular testing.

It is not inconceivable that Shredder 12 and Gaviota 1.0 have larger, more extensive evals that are less efficient Elo-wise at fixed time than Stockfish's. Likewise, that Komodo's eval became a bit slimmer from 3 to 8. IIRC Rybka 4 did so compared to Rybka 3, and generally, Elo at fixed time is a different matter from trying to compare mostly evals. It is possible that my results are meaningless, but not on the basis of someone's skepticism that a much stronger engine cannot have a slimmer eval.
Hi Kai.

Of course, the first thing I say is, thanks for the experiments.

And after that: I am 100% convinced that a much stronger engine cannot have a slimmer eval; its eval will always be larger.

I have all the reasons in the world to suppose that Komodo and SF have both the most refined evals and the most complicated searches.

Btw, I am sure this also applies to time management and speed optimisation: it is very likely that Komodo and SF have the two best time managements and are the two best speed-optimised engines.
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: An attempt to measure the knowledge of engines

Post by Laskos »

bob wrote:This still leaves a hole. 4 plies is not a "constant". Some programs extend more aggressively, some less so. Some reduce more aggressively (stockfish particularly) some less so. So you are STILL measuring search differences along with eval differences. This was what I was trying to do in my simple 1 ply test. But even then I had to modify both programs so that one ply meant the same thing. And that is not even a sure thing since reducing at the root could still be different between the two programs.

One good test would be a vs b to fixed depth, then swap their evaluations and play again. But that would be beyond a royal pain, obviously.

I am not sure there is any really accurate way to compare evaluations so long as the searches are not similar. One thing I could do: if you compare Stockfish to Crafty your way, I could modify both to do a bare-bones 4-ply search + quiescence (no extensions, no reductions, nothing in the q-search except captures), and you could run the test again to the same depth and compare the numbers. If they are close, then what you are doing is probably more than good enough. If they vary significantly, you could experiment to see if there is any simple way of getting them to produce the same results as the code I sent.

4 plies would still be instant without any pruning or reductions involved, so it would be quick.
It is not a simple 4-ply search. It is a 4-ply search by engine X against Shredder limited to the average number of nodes engine X uses for a 4-ply search. Fixed nodes versus fixed depth may not be ideal, but all engines "benefited" from the same treatment against Shredder. Still, there are many undocumented quantities, one simple example being that a single fixed-depth search can easily be off from the average node count by a factor of 4.
Lyudmil Tsvetkov
Posts: 6052
Joined: Tue Jun 12, 2012 12:41 pm

Re: An attempt to measure the knowledge of engines

Post by Lyudmil Tsvetkov »

Laskos wrote:See the results in this two-year-old thread, for very short searches and longer ones:
http://talkchess.com/forum/viewtopic.ph ... 0&start=35

Gaviota seemed at that time to have better endgame knowledge than the much stronger Stockfish, and Stockfish improves dramatically with longer searches. Here I used 4 plies, meaning on average searches of about 1 ms.
I did not read the whole thread, as I was referred to a single message and not the original post, but from what I understood those were very specific endgame positions, having nothing to do with real-life eval.

At very short TC, Komodo and SF performed best, based on the fact that those engines did have a lot of specific endgame eval, probably meaningless in real life. For SF that is certain: it still has a large chunk of probably useless specific endgame knowledge, and its developers are quarreling all the time over whether to remove it or not. :)

I guess the same was true of Komodo 5; only Larry could confirm that.

The fact that Rybka scored worse is probably because Rajlich preferred not to include such very specific endgame eval; maybe Larry could confirm this too.

Concerning Gaviota, congrats to Miguel, but I bet that if he reads this thread and wants to reply, he will only confirm that Gaviota was extremely specific in its endgame eval.

But again, those specific evals have nothing to do with real life; for example, tests on the framework have shown that the Elo impact of such specific endgame eval is almost unnoticeable.

So for me this is not a test that can confirm which engines have the better eval.