An attempt to measure the knowledge of engines

Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: An attempt to measure the knowledge of engines

Post by Laskos »

Lyudmil Tsvetkov wrote: Well, Daniel just answered you very convincingly: you cannot separate eval from search, and as a rule a more sophisticated search goes hand in hand with a more sophisticated eval.

There are of course smaller or larger variances: some engines might emphasize eval and others search, or emphasize each at different periods of engine development. But, as a whole, eval and search should be harmonised, being close together in contribution.

That is why Mr. Zibi managed to reach just the 2600 Elo barrier. :)

And that is why Fruit, despite having a tremendous search, was still much inferior to Rybka, Houdini and other engines having both a good eval and a good search.

But, on the other hand, who says Fruit had such a bad/basic eval?
Was it not Fabien who first introduced scoring of passers in terms of ranks, and did this not prove a big winner?
Well, that Fruit had a simple eval even compared to Crafty was visible in its sources. If your only argument is that the strongest engines must have the most elaborate and largest eval, then somebody has to check this sort of hand-waving.
Lyudmil Tsvetkov
Posts: 6052
Joined: Tue Jun 12, 2012 12:41 pm

Re: An attempt to measure the knowledge of engines

Post by Lyudmil Tsvetkov »

Laskos wrote:
Lyudmil Tsvetkov wrote: Well, Daniel just answered you very convincingly: you cannot separate eval from search, and as a rule a more sophisticated search goes hand in hand with a more sophisticated eval.

There are of course smaller or larger variances: some engines might emphasize eval and others search, or emphasize each at different periods of engine development. But, as a whole, eval and search should be harmonised, being close together in contribution.

That is why Mr. Zibi managed to reach just the 2600 Elo barrier. :)

And that is why Fruit, despite having a tremendous search, was still much inferior to Rybka, Houdini and other engines having both a good eval and a good search.

But, on the other hand, who says Fruit had such a bad/basic eval?
Was it not Fabien who first introduced scoring of passers in terms of ranks, and did this not prove a big winner?
Well, that Fruit had a simple eval even compared to Crafty was visible in its sources. If your only argument is that the strongest engines must have the most elaborate and largest eval, then somebody has to check this sort of hand-waving.
How do you know Crafty had a better eval than Fruit?

A better eval does not necessarily mean the largest eval, but a more useful one.

If Fabien had implemented a passer bonus in terms of rank, and Crafty did not quite do that at the time, then that single passer-rank bonus would easily outweigh, Elo-wise, a couple of very specific Crafty endgame rules.

Another option is that Fruit's eval was better, or much better, tuned than Crafty's, without necessarily having many terms. That is also part of eval assessment. So it is possible that Fruit's eval was in actual fact more useful/efficient, while Crafty's eval was bulkier but not as well tuned.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: An attempt to measure the knowledge of engines

Post by Laskos »

Lyudmil Tsvetkov wrote:
Laskos wrote:
Lyudmil Tsvetkov wrote: Well, Daniel just answered you very convincingly: you cannot separate eval from search, and as a rule a more sophisticated search goes hand in hand with a more sophisticated eval.

There are of course smaller or larger variances: some engines might emphasize eval and others search, or emphasize each at different periods of engine development. But, as a whole, eval and search should be harmonised, being close together in contribution.

That is why Mr. Zibi managed to reach just the 2600 Elo barrier. :)

And that is why Fruit, despite having a tremendous search, was still much inferior to Rybka, Houdini and other engines having both a good eval and a good search.

But, on the other hand, who says Fruit had such a bad/basic eval?
Was it not Fabien who first introduced scoring of passers in terms of ranks, and did this not prove a big winner?
Well, that Fruit had a simple eval even compared to Crafty was visible in its sources. If your only argument is that the strongest engines must have the most elaborate and largest eval, then somebody has to check this sort of hand-waving.
How do you know Crafty had a better eval than Fruit?

A better eval does not necessarily mean the largest eval, but a more useful one.

If Fabien had implemented a passer bonus in terms of rank, and Crafty did not quite do that at the time, then that single passer-rank bonus would easily outweigh, Elo-wise, a couple of very specific Crafty endgame rules.

Another option is that Fruit's eval was better, or much better, tuned than Crafty's, without necessarily having many terms. That is also part of eval assessment. So it is possible that Fruit's eval was in actual fact more useful/efficient, while Crafty's eval was bulkier but not as well tuned.
Take the year 2002. The strongest engine was Shredder 6PB. Along comes Daniel's "Andscacs - Sungorus" that year, and it is stronger than Shredder 6PB (checked). Would you have said in 2002 that "Andscacs - Sungorus", with its ridiculous vertically symmetric PSQT, had the strongest eval? In the "eval test" it performs 200 Elo points weaker than the Shredder 6PB eval, even though it beats Shredder in games. That's why somebody has to check hand-waving arguments about having the best of everything.
Lyudmil Tsvetkov
Posts: 6052
Joined: Tue Jun 12, 2012 12:41 pm

Re: An attempt to measure the knowledge of engines

Post by Lyudmil Tsvetkov »

Also, where do alpha-beta search windows belong?
To evaluation, or to search?

This is the most fundamental search technique, but it uses eval scores in order to function.
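To illustrate the coupling, here is a minimal, generic negamax alpha-beta skeleton (an illustrative Python sketch, not any particular engine's code; evaluate(), generate_moves() and the make/unmake methods are assumed helpers). The window (alpha, beta) lives entirely inside the search, yet every bound it tightens and every cutoff it takes ultimately traces back to scores the eval returned at the leaves:

Code: Select all

INFINITY = 10 ** 9  # larger than any score evaluate() can return

def alpha_beta(position, depth, alpha, beta):
    # Fail-soft negamax: scores are from the side to move's point of view.
    if depth == 0:
        return evaluate(position)       # leaf score comes straight from the static eval
    best = -INFINITY
    for move in generate_moves(position):
        position.make(move)
        score = -alpha_beta(position, depth - 1, -beta, -alpha)
        position.unmake(move)
        best = max(best, score)
        alpha = max(alpha, score)       # the window is tightened by eval-derived scores
        if alpha >= beta:
            break                       # cutoff decided by bounds rooted in leaf evals
    return best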

Similarly, you cannot claim that eval scores at depth 1 are fully the same as eval scores at depth 4 and at depth 20, even with perfect play.

This is simply not so: depending on the range of the score, the middlegame values tend to increase and the endgame values to decrease, so there is no exact correspondence. Eval factors thus change with increasing depth, even with perfect play, changing the score in the process.
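One standard mechanism behind such middlegame/endgame shifts is the tapered eval: each term has a middlegame and an endgame weight, blended by the game phase, so as the search reaches leaves with less and less material the blend, and hence the score, shifts. A minimal sketch (a generic formula with an assumed 0-256 phase scale, not any specific engine's values):

Code: Select all

def tapered_score(mg_score, eg_score, phase):
    # phase: 256 = full middlegame material, 0 = bare endgame (assumed scale).
    # Deeper searches tend to reach leaves with less material, so the same
    # eval terms are blended differently than in a depth-1 probe.
    return (mg_score * phase + eg_score * (256 - phase)) // 256

# e.g. a passed-pawn term worth mg=10, eg=40 contributes 10 centipawns at
# phase 256 but 40 centipawns at phase 0.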

So, if an engine has tuned its eval parameters to perform well at depth 20 and you measure their performance at ply 1 or 4, something will simply not fit quite right.

Again, I simply do not see how you could possibly separate search from eval.

Same with time management.
Can you possibly separate search and eval from time management?

Or speed.
Can you possibly separate search and eval from the speed at which the engine operates?
For example, one engine has implemented bitboards, or rotated/magic bitboards, which give additional speed, and another one has not. The use of bitboards is not quite search, but a speed optimisation. So, if two engines have exactly the same eval and search, the only difference being that the first one is twice as well optimised for speed, then when you try to measure eval performance at depth 1 or 4, you will conclude that the eval of the first engine is twice as good.
Wrong, as it is the speed here that makes the difference, and neither the eval nor the search.

Did you account for speed optimisations when you measured evals at low depths?

I think all those are a single whole.
Lyudmil Tsvetkov
Posts: 6052
Joined: Tue Jun 12, 2012 12:41 pm

Re: An attempt to measure the knowledge of engines

Post by Lyudmil Tsvetkov »

Laskos wrote:
Lyudmil Tsvetkov wrote:
Laskos wrote:
Lyudmil Tsvetkov wrote: Well, Daniel just answered you very convincingly: you cannot separate eval from search, and as a rule a more sophisticated search goes hand in hand with a more sophisticated eval.

There are of course smaller or larger variances: some engines might emphasize eval and others search, or emphasize each at different periods of engine development. But, as a whole, eval and search should be harmonised, being close together in contribution.

That is why Mr. Zibi managed to reach just the 2600 Elo barrier. :)

And that is why Fruit, despite having a tremendous search, was still much inferior to Rybka, Houdini and other engines having both a good eval and a good search.

But, on the other hand, who says Fruit had such a bad/basic eval?
Was it not Fabien who first introduced scoring of passers in terms of ranks, and did this not prove a big winner?
Well, that Fruit had a simple eval even compared to Crafty was visible in its sources. If your only argument is that the strongest engines must have the most elaborate and largest eval, then somebody has to check this sort of hand-waving.
How do you know Crafty had a better eval than Fruit?

A better eval does not necessarily mean the largest eval, but a more useful one.

If Fabien had implemented a passer bonus in terms of rank, and Crafty did not quite do that at the time, then that single passer-rank bonus would easily outweigh, Elo-wise, a couple of very specific Crafty endgame rules.

Another option is that Fruit's eval was better, or much better, tuned than Crafty's, without necessarily having many terms. That is also part of eval assessment. So it is possible that Fruit's eval was in actual fact more useful/efficient, while Crafty's eval was bulkier but not as well tuned.
Take the year 2002. The strongest engine was Shredder 6PB. Along comes Daniel's "Andscacs - Sungorus" that year, and it is stronger than Shredder 6PB (checked). Would you have said in 2002 that "Andscacs - Sungorus", with its ridiculous vertically symmetric PSQT, had the strongest eval? In the "eval test" it performs 200 Elo points weaker than the Shredder 6PB eval, even though it beats Shredder in games. That's why somebody has to check hand-waving arguments about having the best of everything.
:D :D

But that is an artificial engine, do you not understand?

This engine would never have been created in the real world, were it not for Daniel wishing to set us a nice trap. :)

The search knowledge Andscacs-Sungorus possesses is simply an artificial quantity; nobody in 2002 had such advanced searches, nobody had even thought of them.

In real life, things never happen like that.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: An attempt to measure the knowledge of engines

Post by Laskos »

Lyudmil Tsvetkov wrote:
Laskos wrote:
Lyudmil Tsvetkov wrote:
Laskos wrote:
Lyudmil Tsvetkov wrote: Well, Daniel just answered you very convincingly: you cannot separate eval from search, and as a rule a more sophisticated search goes hand in hand with a more sophisticated eval.

There are of course smaller or larger variances: some engines might emphasize eval and others search, or emphasize each at different periods of engine development. But, as a whole, eval and search should be harmonised, being close together in contribution.

That is why Mr. Zibi managed to reach just the 2600 Elo barrier. :)

And that is why Fruit, despite having a tremendous search, was still much inferior to Rybka, Houdini and other engines having both a good eval and a good search.

But, on the other hand, who says Fruit had such a bad/basic eval?
Was it not Fabien who first introduced scoring of passers in terms of ranks, and did this not prove a big winner?
Well, that Fruit had a simple eval even compared to Crafty was visible in its sources. If your only argument is that the strongest engines must have the most elaborate and largest eval, then somebody has to check this sort of hand-waving.
How do you know Crafty had a better eval than Fruit?

A better eval does not necessarily mean the largest eval, but a more useful one.

If Fabien had implemented a passer bonus in terms of rank, and Crafty did not quite do that at the time, then that single passer-rank bonus would easily outweigh, Elo-wise, a couple of very specific Crafty endgame rules.

Another option is that Fruit's eval was better, or much better, tuned than Crafty's, without necessarily having many terms. That is also part of eval assessment. So it is possible that Fruit's eval was in actual fact more useful/efficient, while Crafty's eval was bulkier but not as well tuned.
Take the year 2002. The strongest engine was Shredder 6PB. Along comes Daniel's "Andscacs - Sungorus" that year, and it is stronger than Shredder 6PB (checked). Would you have said in 2002 that "Andscacs - Sungorus", with its ridiculous vertically symmetric PSQT, had the strongest eval? In the "eval test" it performs 200 Elo points weaker than the Shredder 6PB eval, even though it beats Shredder in games. That's why somebody has to check hand-waving arguments about having the best of everything.
:D :D

But that is an artificial engine, do you not understand?

This engine would never have been created in the real world, were it not for Daniel wishing to set us a nice trap. :)

The search knowledge Andscacs-Sungorus possesses is simply an artificial quantity; nobody in 2002 had such advanced searches, nobody had even thought of them.

In real life, things never happen like that.
You are certainly a prolific poster.
cdani
Posts: 2204
Joined: Sat Jan 18, 2014 10:24 am
Location: Andorra

Re: An attempt to measure the knowledge of engines

Post by cdani »

Laskos wrote: I don't know how you implemented the foreign eval into Andscacs
I just copied it into Andscacs and renamed the variables to the Andscacs ones. I changed a few minimal things to make it work, and that's it.
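Schematically, the graft amounts to keeping the host search and swapping which evaluation function it calls; a tiny illustrative Python sketch (hypothetical names, not the actual Andscacs or Sungorus sources):

Code: Select all

def make_engine(host_search, guest_eval):
    # Pair an existing search with a foreign evaluation function.
    def best_move(position, depth):
        return host_search(position, depth, evaluate=guest_eval)
    return best_move

# Hypothetical usage: the hybrid discussed in this thread pairs the Andscacs
# search with the Sungorus eval.
# andscacs_sungorus = make_engine(andscacs_search, sungorus_eval)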

Your results are interesting!
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: An attempt to measure the knowledge of engines

Post by michiguel »

Laskos wrote:
Uri Blass wrote:
Laskos wrote:
Uri Blass wrote:
I do not see how you measure evaluation in this way.
Different programs use different trees when they search to depth 4.

I guess Komodo 8 does more pruning than Komodo 3 at depth 4, so it may be weaker at depth 4 for reasons other than a worse evaluation.
It's not simply depth 4. Each engine at depth 4 has as its opponent a Shredder calibrated to a different, matching node count. Komodo 8 does indeed prune more (1455 nodes vs 1865 nodes), but it also has a correspondingly weaker opponent compared to Komodo 3.
I understand, but there is a problem with the average number of nodes: it is affected by big values.

An engine that searches 100000 nodes for depth 4 once and 1000 nodes many times can get the same average as an engine that uses 2000 nodes every time, even though using 2000 nodes every time is usually clearly stronger.

Maybe it is better to take the average of log2(nodes) and then raise 2 to that power to get the fixed number of nodes, so that one big value does not have so much effect.

This way,
2^20 nodes for 1 position and 2^10 nodes for 9 positions at depth 4
still gives an average of only (20*1+10*9)/10 = 11, which means 2^11 nodes.
Yes, that is a concern: go depth 4 can easily be off by a factor of 3-4 in nodes with respect to the average. For consistency I checked Shredder 12 at its average depth=4 node count versus Shredder 12 at depth 4, and the result was satisfying, within error margins. There is no guarantee that other engines are so well behaved. I would also like to have a geometric average, (product of nodes per position)^(1/positions), but the Shredder GUI gives only (sum of nodes per position)/positions. Using a logarithmic scale is a long-standing idea for computing many quantities related to plies and depths in chess, like time-to-depth.

My test basically wants to reproduce a fixed-nodes result with, say, 1000 nodes per move, using engines that do not obey the go nodes command. So I have to use go depth, mimicking the nodes.

Concerning the apparent eval "regression" of Komodo, yes, Komodo 8 prunes more at depth=4, but after calibration, it becomes irrelevant (if those averages are fine). I compared two different versions of Shredder, 12 and 9. Shredder 12 prunes more too, but it still has a stronger "eval" than Shredder 9 by some 100 Elo points.

The list including it is here:

Code: Select all

   # PLAYER                  : RATING    POINTS  PLAYED    (%)
   1 Gaviota 1.0             :  178.8     735.0    1000   73.5%
   2 Komodo 3                :  159.9     713.5    1000   71.3%
   3 Houdini 4               :  125.3     671.5    1000   67.2%
   4 Komodo 8                :  109.3     651.0    1000   65.1%
   5 Houdini 1.5             :  101.6     641.0    1000   64.1%
   6 RobboLito 0.085         :   43.0     561.0    1000   56.1%
   7 Shredder 12             :    0.0    7598.0   15000   50.7%
   8 Shredder 12 depth       :   -7.0     490.0    1000   49.0%
   9 Stockfish 21.03.2015    :  -11.2     484.0    1000   48.4%
  10 Stockfish 2.1.1         :  -22.1     468.5    1000   46.9%
  11 Texel 1.05              :  -43.0     439.0    1000   43.9%
  12 Shredder 9              : -106.9     352.0    1000   35.2%
  13 Strelka 2.0             : -109.6     348.5    1000   34.9%
  14 Crafty 24.1             : -115.8     340.5    1000   34.0%
  15 SOS 5.1                 : -180.1     263.5    1000   26.4%
  16 Fruit 2.1               : -199.1     243.0    1000   24.3% 
Nice looking ranking 8-)

If I understand correctly, this is trying to mimic a ranking in which engines play at a fixed (very low) number of nodes. In other words, it is trying to measure how well the engines use those nodes at the tips of the tree. Evaluation becomes more important in these conditions than in regular ones. Is my statement accurate in capturing the spirit of these tests?

Interesting experiment. For me, this means that Gaviota search totally sucks closer to the root. :-) (not a big surprise...).

Are you sure that the granularity of Shredder is good enough not to mess up the results? In other words, it could be bad if you tell Shredder to search 4000 or 6000 nodes and it searches 4096 in both cases. That is the only artifact I can imagine.

Whether this experiment evaluates eval or not is a matter of interpretation, but it is interesting regardless.

Miguel
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: An attempt to measure the knowledge of engines

Post by Laskos »

michiguel wrote:
Nice looking ranking 8-)

If I understand correctly, this is trying to mimic a ranking in which engines play at a fixed (very low) number of nodes. In other words, it is trying to measure how well the engines use those nodes at the tips of the tree. Evaluation becomes more important in these conditions than in regular ones. Is my statement accurate in capturing the spirit of these tests?
Yes, exactly. My test is nothing more than a fixed-nodes test with nodes somewhere between 1000 and 4000, equal for the opponents. It is only the inability of engines to obey the go nodes command reasonably (or at all) that I circumvent by using go depth.
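For reference, a minimal sketch of how such a per-engine node budget could be derived from depth-4 node counts, including the log2 (geometric) averaging Uri suggested for taming outliers (illustrative Python with made-up node counts, not the actual test setup):

Code: Select all

import math

# Hypothetical node counts one engine spent on "go depth 4" over test positions.
nodes_per_position = [1455, 1320, 98000, 1610, 1500, 1390, 1705, 1450, 1380, 1520]

arithmetic_mean = sum(nodes_per_position) / len(nodes_per_position)

# Geometric mean = 2^(average of log2(nodes)); one huge outlier barely moves it.
geometric_mean = 2 ** (sum(math.log2(n) for n in nodes_per_position)
                       / len(nodes_per_position))

print(f"arithmetic mean: {arithmetic_mean:.0f} nodes")  # pulled up heavily by the outlier
print(f"geometric mean:  {geometric_mean:.0f} nodes")   # much closer to the typical count

# The calibrated opponent would then be given this budget via "go nodes <N>".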
michiguel wrote: Interesting experiment. For me, this means that Gaviota search totally sucks closer to the root. :-) (not a big surprise...).

Are you sure that the granularity of Shredder is good enough not to mess up the results? In other words, it could be bad if you tell Shredder to search 4000 or 6000 nodes and it searches 4096 in both cases. That is the only artifact I can imagine.
Yes, Shredder 12 obeys the go nodes command literally, with a precision of a few nodes. I am surprised it has such a counter, which stops and gives output after almost every node. Example:

Code: Select all

go nodes 3561
info nps 238200 nodes 3573 hashfull 2
bestmove g4d7 ponder d3b5

go nodes 67
info nodes 69 hashfull 0
bestmove f3d5

go nodes 2328
info nodes 2339 hashfull 2
bestmove f1e1 ponder f8e8
michiguel wrote: Whether this experiment evaluates eval or not is a matter of interpretation, but it is interesting regardless.

Miguel
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: An attempt to measure the knowledge of engines

Post by Laskos »

cdani wrote:
Laskos wrote: I don't know how you implemented the foreign eval into Andscacs
I just copied it into Andscacs and renamed the variables to the Andscacs ones. I changed a few minimal things to make it work, and that's it.

Your results are interesting!
I am curious whether this abomination "Andscacs - Sungorus" has serious issues visible to the naked eye in games. It beats Shredder 6PB, while having a score in this eval test at least 200 Elo points weaker than Shredder 6PB:

Code: Select all

   # PLAYER                  : RATING    POINTS  PLAYED    (%)
   1 Gaviota 1.0             :  178.8     735.0    1000   73.5%
   2 Komodo 3                :  159.9     713.5    1000   71.3%
   3 Houdini 4               :  125.3     671.5    1000   67.2%
   4 Komodo 8                :  109.3     651.0    1000   65.1%
   5 Houdini 1.5             :  101.6     641.0    1000   64.1%
   6 RobboLito 0.085         :   43.0     561.0    1000   56.1%
   7 Shredder 12             :    0.0    9781.5   18293   53.5%
   8 Shredder 12 depth       :   -7.0     490.0    1000   49.0%
   9 Stockfish 21.03.2015    :  -11.2     484.0    1000   48.4%
  10 Andscacs 0.72           :  -14.9     644.5    1346   47.9%
  11 Stockfish 2.1.1         :  -22.1     468.5    1000   46.9%
  12 Texel 1.05              :  -43.0     439.0    1000   43.9%
  13 Shredder 9              : -106.9     352.0    1000   35.2%
  14 Strelka 2.0             : -109.6     348.5    1000   34.9%
  15 Shredder 6PB            : -110.5     329.0     947   34.7%
  16 Crafty 24.1             : -115.8     340.5    1000   34.0%
  17 SOS 5.1                 : -180.1     263.5    1000   26.4%
  18 Fruit 2.1               : -199.1     243.0    1000   24.3%
  19 Ands-Sung               : -324.0     136.0    1000   13.6%
Games at 10s+100ms:

Code: Select all

   # PLAYER          : RATING    POINTS  PLAYED    (%)
   1 Ands-Sung       :   13.2     107.5     200   53.8%
   2 Shredder 6PB    :  -13.2      92.5     200   46.3%