An attempt to measure the knowledge of engines

Discussion of anything and everything relating to chess playing software and machines.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An attempt to measure the knowledge of engines

Post by bob »

Laskos wrote:
bob wrote:This still leaves a hole. 4 plies is not a "constant". Some programs extend more aggressively, some less so. Some reduce more aggressively (stockfish particularly) some less so. So you are STILL measuring search differences along with eval differences. This was what I was trying to do in my simple 1 ply test. But even then I had to modify both programs so that one ply meant the same thing. And that is not even a sure thing since reducing at the root could still be different between the two programs.

One good test would be a vs b to fixed depth, then swap their evaluations and play again. But that would be beyond a royal pain, obviously.

I am not sure there is any really accurate way to compare evaluations so long as searches are not similar. One thing I could do is if you compare stockfish to Crafty your way, I could modify both to do a bare-bones 4 ply search + quiescence, no extensions, no reductions, nothing in q-search except captures, and you could run the test again to the same depth and compare the numbers. If they are close, then what you are doing is probably more than good enough. If they vary significantly, you could experiment to see if there is any simple way of getting them to produce the same results as the code I sent.

4 plies would still be instant without any pruning or reductions involved, so it would be quick.
It is not a simple 4 ply search. It is a 4 ply search of engine X against Shredder with nodes equal to the average number of nodes engine X uses during a 4 ply search. Fixed nodes versus fixed depth may not be desirable, but all engines "benefited" from the same treatment against Shredder. Still, there are many undocumented quantities, one simple one being that a single fixed depth search may easily be off the average nodes used by a factor of 4.
But if you think about it, you are trying to say a 4 ply search with X is equivalent to N nodes with Y. How confident are you that just because you see how many nodes a 4 ply search traverses, that the two are equal? A very good selective searcher will be able to go far deeper if it is getting to search N nodes. If you tried this with the old chess master program, you would find lots of positions where it could not finish a 4 ply search in any reasonable amount of time, due to the way it selectively searched things.

If you say 4 plies = N nodes for X, and then you let Y search for N nodes, that may or may not be very accurate. Impossible to say. What I would expect to see, most likely, is that two programs with a very similar eval, but very different searches, might produce a lop-sided result because of the way you are equating their searches.

Not saying it is right, wrong, good or bad. Saying it is "unknown". I would prefer to normalize searches (hard to do with commercials of course, take months of RE study to figure out how) so that the only difference is evaluations. That was what I did in my test. 1 ply search + captures which would REALLY seem to lean on the evaluation for everything. And let me tell you there are some really ODD results when you do that. Walking into repetitions without knowing until it is too late, missing simple mate threats since you only allow the opponent to recapture. Etc. Only good thing is that playing a million games is very fast at sd=1, so you can cover up some of the noise with sheer volume of games.
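
To make this concrete, here is a minimal sketch of such a 1 ply + captures search. It is not Crafty's or Stockfish's actual code; it assumes the python-chess library and a toy material-only eval as stand-ins for a real evaluation:

Code: Select all

import chess

PIECE_VALUES = {chess.PAWN: 100, chess.KNIGHT: 320, chess.BISHOP: 330,
                chess.ROOK: 500, chess.QUEEN: 900, chess.KING: 0}

def evaluate(board):
    """Toy static eval: material only, from the side to move's point of view."""
    score = 0
    for piece_type, value in PIECE_VALUES.items():
        score += value * (len(board.pieces(piece_type, chess.WHITE)) -
                          len(board.pieces(piece_type, chess.BLACK)))
    return score if board.turn == chess.WHITE else -score

def qsearch(board, alpha, beta):
    """Captures-only quiescence: stand pat, then try captures (no checks, no evasions)."""
    stand_pat = evaluate(board)
    if stand_pat >= beta:
        return beta
    alpha = max(alpha, stand_pat)
    for move in board.generate_legal_captures():
        board.push(move)
        score = -qsearch(board, -beta, -alpha)
        board.pop()
        if score >= beta:
            return beta
        alpha = max(alpha, score)
    return alpha

def search_1ply(board):
    """One full ply at the root, then straight into the captures-only quiescence."""
    best_move, best_score = None, -10**9
    for move in board.legal_moves:
        board.push(move)
        score = -qsearch(board, -10**9, 10**9)
        board.pop()
        if score > best_score:
            best_move, best_score = move, score
    return best_move

print(search_1ply(chess.Board()))

Because nothing below the root except captures is searched, exactly the oddities described above appear: repetitions, checks and simple mate threats are invisible beyond ply one.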
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: An attempt to measure the knowledge of engines

Post by Laskos »

bob wrote:
Laskos wrote:
bob wrote:This still leaves a hole. 4 plies is not a "constant". Some programs extend more aggressively, some less so. Some reduce more aggressively (stockfish particularly) some less so. So you are STILL measuring search differences along with eval differences. This was what I was trying to do in my simple 1 ply test. But even then I had to modify both programs so that one ply meant the same thing. And that is not even a sure thing since reducing at the root could still be different between the two programs.

One good test would be a vs b to fixed depth, then swap their evaluations and play again. But that would be beyond a royal pain, obviously.

I am not sure there is any really accurate way to compare evaluations so long as searches are not similar. One thing I could do is if you compare stockfish to Crafty your way, I could modify both to do a bare-bones 4 ply search + quiescence, no extensions, no reductions, nothing in q-search except captures, and you could run the test again to the same depth and compare the numbers. If they are close, then what you are doing is probably more than good enough. If they vary significantly, you could experiment to see if there is any simple way of getting them to produce the same results as the code I sent.

4 plies would still be instant without any pruning or reductions involved, so it would be quick.
It is not a simple 4 ply search. It is a 4 ply search of engine X against Shredder with nodes equal to the average number of nodes engine X uses during a 4 ply search. Fixed nodes versus fixed depth may not be desirable, but all engines "benefited" from the same treatment against Shredder. Still, there are many undocumented quantities, one simple one being that a single fixed depth search may easily be off the average nodes used by a factor of 4.
But if you think about it, you are trying to say a 4 ply search with X is equivalent to N nodes with Y. How confident are you that just because you see how many nodes a 4 ply search traverses, that the two are equal? A very good selective searcher will be able to go far deeper if it is getting to search N nodes. If you tried this with the old chess master program, you would find lots of positions where it could not finish a 4 ply search in any reasonable amount of time, due to the way it selectively searched things.

If you say 4 plies = N nodes for X, and then you let Y search for N nodes, that may or may not be very accurate. Impossible to say. What I would expect to see, most likely, is that two programs with a very similar eval, but very different searches, might produce a lop-sided result because of the way you are equating their searches.

Not saying it is right, wrong, good or bad. Saying it is "unknown". I would prefer to normalize searches (hard to do with commercials of course, take months of RE study to figure out how) so that the only difference is evaluations. That was what I did in my test. 1 ply search + captures which would REALLY seem to lean on the evaluation for everything. And let me tell you there are some really ODD results when you do that. Walking into repetitions without knowing until it is too late, missing simple mate threats since you only allow the opponent to recapture. Etc. Only good thing is that playing a million games is very fast at sd=1, so you can cover up some of the noise with sheer volume of games.
I performed a self-consistency check, which I should have performed to start with; luckily it came out fine. I got the average nodes of Shredder 12 itself to depth=4 (on average 2441 nodes):

Code: Select all

 1) Stockfish 21.03.2015  598 nodes
 2) Komodo 8             1455 nodes
 3) Houdini 4            1957 nodes
 4) Robbolito 0.085      1853 nodes
 5) Texel 1.05           1831 nodes
 6) Gaviota 1.0          1091 nodes
 7) Strelka 2.0          4485 nodes
 8) Fruit 2.1            7294 nodes
 9) Komodo 3             1865 nodes
10) Stockfish 2.1.1      1892 nodes
11) Houdini 1.5          2167 nodes
12) Crafty 24.1          1584 nodes

13) Shredder 12          2441 nodes
Then I had Shredder 12 at depth=4 play against the standard-candle fixed-nodes Shredder 12 at nodes=2441, and the result is equal within the error margins (a minimal sketch of such a match follows the table):

Code: Select all

   # PLAYER                  : RATING    POINTS  PLAYED    (%)
   1 Gaviota 1.0             :  178.8     735.0    1000   73.5%
   2 Komodo 3                :  159.9     713.5    1000   71.3%
   3 Houdini 4               :  125.3     671.5    1000   67.2%
   4 Komodo 8                :  109.3     651.0    1000   65.1%
   5 Houdini 1.5             :  101.6     641.0    1000   64.1%
   6 RobboLito 0.085         :   43.0     561.0    1000   56.1%
   7 Shredder 12 nodes       :    0.0    6213.5   13000   47.8%
   8 Shredder 12 depth       :   -7.0     490.0    1000   49.0%
   9 Stockfish 21.03.2015    :  -11.2     484.0    1000   48.4%
  10 Stockfish 2.1.1         :  -22.1     468.5    1000   46.9%
  11 Texel 1.05              :  -43.0     439.0    1000   43.9%
  12 Strelka 2.0             : -109.6     348.5    1000   34.9%
  13 Crafty 24.1             : -115.8     340.5    1000   34.0%
  14 Fruit 2.1               : -199.1     243.0    1000   24.3%
The consistency is encouraging, but it does not yet prove that the methodology is correct.
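
For reference, a minimal sketch of such a self-consistency match using the python-chess engine module; the engine path, the node budget of 2441 and the tiny game count are placeholders, and the real runs started from balanced positions rather than the initial position:

Code: Select all

import chess
import chess.engine

ENGINE_PATH = "./shredder12"                 # placeholder path to the engine binary
DEPTH_LIMIT = chess.engine.Limit(depth=4)
NODE_LIMIT = chess.engine.Limit(nodes=2441)  # average node count Shredder needs for depth 4

def play_game(white_limit, black_limit):
    """One game of the engine against itself, each side under its own search limit."""
    board = chess.Board()
    with chess.engine.SimpleEngine.popen_uci(ENGINE_PATH) as white, \
         chess.engine.SimpleEngine.popen_uci(ENGINE_PATH) as black:
        while not board.is_game_over(claim_draw=True):
            if board.turn == chess.WHITE:
                result = white.play(board, white_limit)
            else:
                result = black.play(board, black_limit)
            board.push(result.move)
    return board.result(claim_draw=True)

score = {"depth": 0.0, "nodes": 0.0}
for game in range(10):                       # 1000 games per pairing in the real test
    # Alternate colours so that neither limit always gets White.
    if game % 2 == 0:
        outcome, white_side, black_side = play_game(DEPTH_LIMIT, NODE_LIMIT), "depth", "nodes"
    else:
        outcome, white_side, black_side = play_game(NODE_LIMIT, DEPTH_LIMIT), "nodes", "depth"
    if outcome == "1-0":
        score[white_side] += 1
    elif outcome == "0-1":
        score[black_side] += 1
    else:
        score["depth"] += 0.5
        score["nodes"] += 0.5

print(score)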
Uri Blass
Posts: 10282
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: An attempt to measure the knowledge of engines

Post by Uri Blass »

Laskos wrote:The Shredder GUI has a nice feature of giving output statistics by ply after an EPD set of positions is fed to it for analysis. The positions are FENs without a solution, and are balanced. Comparing the engines directly at depth=1 is tricky, and Bob Hyatt explained it well:
bob wrote: I can't say "for most reasonable programs". I was directly addressing a claim about stockfish and Crafty.

The "depth 1" can be tricky. Here's why:

(1) some programs extend depth when they give check (Crafty, for example). While others extend depth when escaping check. That is a big difference, in that with the latter, you drop into the q-search while in check. Nothing wrong, but you might or might not recognize a mate. Which means that with a 1 ply search, some programs will recognize a mate in 1, some won't.

(2) q-search. Some do a very simple q-search. Some escape or give check at the first ply of q-search. Some give check at the first search ply, escape at the second and give check again at the third. Those are not equal.

When I tried this with Stockfish, I ran into the same problem. All 1 ply searches are not created equal.

As I mentioned, I am not sure my test is all that useful, because a program's evaluation is written around its search, and vice-versa. If you limit one part, you might be limiting the other part without knowing, skewing the results.
To make the most of an "eval-driven search", I adopted the following methodology:

1/ Use depth=4 to avoid the issues of threats, q-search imbalances, and the inadequacy of a depth=1 search in games.
2/ Most engines do not follow the "go nodes" UCI command well for small node counts. They do follow the "go depth" command rather well.
3/ Shredder engines (here I used Shredder 12 as the standard candle) follow the "go nodes" command literally, even for small numbers of nodes.
4/ Calculate the average nodes for each engine to depth=4 on many positions (150 late opening, 171 endgame).
5/ Set the matches: engine X at depth=4 versus Shredder 12 at the number of nodes that engine uses to reach depth 4 (a minimal sketch of step 4 follows this list).
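
A minimal sketch of step 4, assuming the python-chess library and an engine that reports node counts in its UCI info lines; the engine path and position file are placeholders, since the actual measurements below come from the Shredder GUI statistics:

Code: Select all

import chess
import chess.engine

ENGINE_PATH = "./gaviota-1.0"     # placeholder: path to any UCI engine binary
FEN_FILE = "late_opening.fen"     # placeholder: one full FEN per line

def average_nodes_at_depth(path, fens, depth=4):
    """Search every position to a fixed depth and average the reported node counts."""
    total = 0
    with chess.engine.SimpleEngine.popen_uci(path) as engine:
        for fen in fens:
            info = engine.analyse(chess.Board(fen), chess.engine.Limit(depth=depth))
            total += info.get("nodes", 0)   # relies on the engine sending "info ... nodes N"
    return total / len(fens)

with open(FEN_FILE) as f:
    positions = [line.strip() for line in f if line.strip()]

avg = average_nodes_at_depth(ENGINE_PATH, positions)
print(f"average nodes to depth 4: {avg:.0f}")   # this becomes Shredder's node budget in step 5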

In Shredder GUI, as an example Gaviota 1.0:

Code: Select all

Late opening positions:

  TotTime: 25s    SolTime: 25s
  Ply: 0   Positions:150   Avg Nodes:       0   Branching = 0.00
  Ply: 1   Positions:150   Avg Nodes:      90   Branching = 0.00
  Ply: 2   Positions:150   Avg Nodes:     255   Branching = 2.83
  Ply: 3   Positions:150   Avg Nodes:     524   Branching = 2.05
  Ply: 4   Positions:150   Avg Nodes:    1254   Branching = 2.39
Engine: Gaviota v1.0 (2048 MB)
by Miguel A. Ballicora


Endgame positions:

  TotTime: 28s    SolTime: 28s
  Ply: 0   Positions:171   Avg Nodes:       0   Branching = 0.00
  Ply: 1   Positions:171   Avg Nodes:      46   Branching = 0.00
  Ply: 2   Positions:171   Avg Nodes:     158   Branching = 3.43
  Ply: 3   Positions:171   Avg Nodes:     386   Branching = 2.44
  Ply: 4   Positions:171   Avg Nodes:     928   Branching = 2.40
Engine: Gaviota v1.0 (2048 MB)
by Miguel A. Ballicora
I took the average nodes to depth=4 for a dozen engines:

Code: Select all

 1) Stockfish 21.03.2015  598 nodes
 2) Komodo 8             1455 nodes
 3) Houdini 4            1957 nodes
 4) Robbolito 0.085      1853 nodes
 5) Texel 1.05           1831 nodes
 6) Gaviota 1.0          1091 nodes
 7) Strelka 2.0          4485 nodes
 8) Fruit 2.1            7294 nodes
 9) Komodo 3             1865 nodes
10) Stockfish 2.1.1      1892 nodes
11) Houdini 1.5          2167 nodes
12) Crafty 24.1          1584 nodes
I played all of them at fixed depth=4 against Shredder 12 at the fixed node count determined for each engine.

With Shredder 12 as the standard candle, the strength of the evals:

Code: Select all

   # PLAYER                  : RATING    POINTS  PLAYED    (%)
   1 Gaviota 1.0             :  178.8     735.0    1000   73.5%
   2 Komodo 3 (2011)         :  159.9     713.5    1000   71.3%
   3 Houdini 4               :  125.3     671.5    1000   67.2%
   4 Komodo 8                :  109.3     651.0    1000   65.1%
   5 Houdini 1.5 (2011)      :  101.6     641.0    1000   64.1%
   6 RobboLito 0.085         :   43.0     561.0    1000   56.1%
   7 Shredder 12             :    0.0    5703.5   12000   47.5%
   8 Stockfish 21.03.2015    :  -11.2     484.0    1000   48.4%
   9 Stockfish 2.1.1 (2011)  :  -22.1     468.5    1000   46.9%
  10 Texel 1.05              :  -43.0     439.0    1000   43.9%
  11 Strelka 2.0 (Rybka 1.0) : -109.6     348.5    1000   34.9%
  12 Crafty 24.1             : -115.8     340.5    1000   34.0%
  13 Fruit 2.1               : -199.1     243.0    1000   24.3%
a) This is as close as I can get to a meaningful measure of the evals as integrated into the search, without touching the sources (which are often unavailable).
b) Besides the "anomaly" of Gaviota 1.0 (in fact there were earlier indications that Gaviota has a strong eval), there is another: Komodo 3 appears to have a better (maybe larger) eval than Komodo 8. Larry or Mark could confirm or deny this regression (or thinning) of the eval.
c) The Stockfish eval seems no stronger than that of Shredder 12, but it seems significantly stronger than that of Crafty 24.1.
I do not see how you measure evaluation in this way.
Different programs use different trees when they search to depth 4.

I guess Komodo 8 does more pruning than Komodo 3 at depth 4, so it may be weaker at depth 4 not because of a worse evaluation.
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: An attempt to measure the knowledge of engines

Post by Laskos »

Uri Blass wrote:
I do not see how you measure evaluation in this way.
Different programs use different trees when they search to depth 4.

I guess Komodo 8 does more pruning than Komodo 3 at depth 4, so it may be weaker at depth 4 not because of a worse evaluation.
It's not simply depth 4. Each engine at depth 4 faces a Shredder opponent calibrated to a different node count. Komodo 8 does indeed prune more (1455 nodes vs 1865 nodes), but it also gets a weaker opponent than Komodo 3 does.
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: An attempt to measure the knowledge of engines

Post by Laskos »

Lyudmil Tsvetkov wrote:
Laskos wrote:
Lyudmil Tsvetkov wrote:Another interesting question is: is move ordering taken into account?
Does it belong to search or to eval?

When you order moves, you are supposed to use evaluation terms for ordering, like PSQTs, etc., and it is possible that move ordering plays a very important role in engine strength after all.

So, does move ordering belong to search or to eval?

I think you simply cannot separate search and eval.
Move ordering drives alpha-beta and, more generally, the pruning, and I consider it part of the search. PSQTs are part of the eval, and many eval parts are linked to the search. Still, the search here is just a small part I kept for consistency of the eval. If I want eval and search together, I can do fixed-time or any other regular testing.

It is not inconceivable that Shredder 12 and Gaviota 1.0 have a larger, more extensive eval that is nevertheless less efficient Elo-wise at fixed time than Stockfish's. Likewise, that Komodo's eval went a bit slimmer from 3 to 8. IIRC Rybka 4 did so compared to Rybka 3, and generally, Elo at fixed time is a different matter from trying to compare mostly evals. It's possible that my results are meaningless, but not based on someone's skepticism that a much stronger engine cannot have a slimmer eval.
Hi Kai.

Of course, the first thing I say is, thanks for the experiments.

And after that, I am 100% convinced that a much stronger engine cannot have a slimmer eval; its eval will always be larger.

I have all the reasons in the world to suppose that Komodo and SF have both the most refined evals and the most complicated searches.

Btw., I am sure this also applies to time management and speed optimisations: it is very likely that Komodo and SF have the two best time managements and are the two engines that are best optimised speed-wise.
Sometimes it's useful to read the pertinent opinions of chess engine authors. Franck Zibi (author of Pharaon) said the following in 2005:

Most chess players tend to overestimate the influence of the chess knowledge in an engine strength. I believe that what makes a strong program is first the quality of its search (hashtables and null move management, sorting of moves, pruning, etc...), not the 'quantity of chess knowledge' that it includes.
For instance, if you look at the first versions of Fruit, you'll see a program with little chess knowledge, but a good and reliable search, and it was already playing strong!
At the same time, some kind of chess knowledge is definitively needed and can not be replaced by search, for instance king safety, passed pawn evaluation...
So, in the last year I have mainly improved the search part of Pharaon, the chess knowledge being nearly the same as in Pharaon v2.62 (released +3 years ago).


That Fruit has a weak eval was confirmed here, and it has been known for ages. To the credit of Vas, Rybka 1.0 improved the eval of Fruit by some 100 Elo points (look at Strelka 2.0), a large jump, and not by slowing Rybka down (in fact it sped up, with bitboards and 64-bit).
Uri Blass
Posts: 10282
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: An attempt to measure the knowledge of engines

Post by Uri Blass »

Laskos wrote:
Uri Blass wrote:
I do not see how you measure evaluation in this way.
Different programs use different trees when they search to depth 4.

I guess Komodo 8 does more pruning than Komodo 3 at depth 4, so it may be weaker at depth 4 not because of a worse evaluation.
It's not simply depth 4. Each engine at depth 4 faces a Shredder opponent calibrated to a different node count. Komodo 8 does indeed prune more (1455 nodes vs 1865 nodes), but it also gets a weaker opponent than Komodo 3 does.
I understand, but there is a problem with the average number of nodes: it is affected by big values.

An engine that searches 100000 nodes for depth 4 in one position and 1000 nodes in many others can get the same average as an engine that uses 2000 nodes every time, even though 2000 nodes every time is usually clearly stronger.

Maybe it is better to average log2(nodes) and then take 2 to that power to get the fixed number of nodes, so a single big value does not have so much effect.

This way, 2^20 nodes for 1 position and 2^10 nodes for 9 positions at depth 4 still averages to only (20*1 + 10*9)/10 = 11, which means 2^11 nodes.
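
A quick numeric check of this example (an illustrative Python snippet, not part of the original test):

Code: Select all

import math

# 1 position searched with 2^20 nodes, 9 positions with 2^10 nodes, all at depth 4.
nodes = [2**20] + [2**10] * 9

arithmetic_mean = sum(nodes) / len(nodes)                  # dominated by the single outlier
mean_log2 = sum(math.log2(n) for n in nodes) / len(nodes)  # (20*1 + 10*9) / 10 = 11.0
geometric_mean = 2 ** mean_log2                            # 2^11 = 2048

print(f"arithmetic mean: {arithmetic_mean:.0f} nodes")     # ~105779
print(f"mean of log2:    {mean_log2:.1f} -> geometric mean {geometric_mean:.0f} nodes")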
User avatar
cdani
Posts: 2204
Joined: Sat Jan 18, 2014 10:24 am
Location: Andorra

Re: An attempt to measure the knowledge of engines

Post by cdani »

Laskos wrote: Most chess players tend to overestimate the influence of the chess knowledge in an engine strength.
Maybe you want to play with Andscacs - Sungorus, a version of Andscacs 0.72 that is maybe 20 Elo stronger than 0.72, but it has the very simple eval of Sungorus, so it's maybe 300-350 Elo weaker; I really haven't tested it well:

http://talkchess.com/forum/viewtopic.ph ... =&start=10
Lyudmil Tsvetkov
Posts: 6052
Joined: Tue Jun 12, 2012 12:41 pm

Re: An attempt to measure the knowledge of engines

Post by Lyudmil Tsvetkov »

Laskos wrote:
Lyudmil Tsvetkov wrote:
Laskos wrote:
Lyudmil Tsvetkov wrote:Another interesting question is: is move ordering taken into account?
Does it belong to search or to eval?

When you order moves, you are supposed to use evaluation terms for ordering, like PSQTs, etc., and it is possible that move ordering plays a very important role in engine strength after all.

So, does move ordering belong to search or to eval?

I think you simply cannot separate search and eval.
Move ordering drives alpha-beta and, more generally, the pruning, and I consider it part of the search. PSQTs are part of the eval, and many eval parts are linked to the search. Still, the search here is just a small part I kept for consistency of the eval. If I want eval and search together, I can do fixed-time or any other regular testing.

It is not inconceivable that Shredder 12 and Gaviota 1.0 have a larger, more extensive eval that is nevertheless less efficient Elo-wise at fixed time than Stockfish's. Likewise, that Komodo's eval went a bit slimmer from 3 to 8. IIRC Rybka 4 did so compared to Rybka 3, and generally, Elo at fixed time is a different matter from trying to compare mostly evals. It's possible that my results are meaningless, but not based on someone's skepticism that a much stronger engine cannot have a slimmer eval.
Hi Kai.

Of course, the first thing I say is, thanks for the experiments.

And after that, I am 100% convinced that a much stronger engine cannot have a slimmer eval; its eval will always be larger.

I have all the reasons in the world to suppose that Komodo and SF have both the most refined evals and the most complicated searches.

Btw., I am sure this also applies to time management and speed optimisations: it is very likely that Komodo and SF have the two best time managements and are the two engines that are best optimised speed-wise.
Sometimes it's useful to read the pertinent opinions of chess engine authors. Franck Zibi (author of Pharaon) said the following in 2005:

Most chess players tend to overestimate the influence of the chess knowledge in an engine strength. I believe that what makes a strong program is first the quality of its search (hashtables and null move management, sorting of moves, pruning, etc...), not the 'quantity of chess knowledge' that it includes.
For instance, if you look at the first versions of Fruit, you'll see a program with little chess knowledge, but a good and reliable search, and it was already playing strong!
At the same time, some kind of chess knowledge is definitively needed and can not be replaced by search, for instance king safety, passed pawn evaluation...
So, in the last year I have mainly improved the search part of Pharaon, the chess knowledge being nearly the same as in Pharaon v2.62 (released +3 years ago).


That Fruit has a weak eval was confirmed here, and it has been known for ages. To the credit of Vas, Rybka 1.0 improved the eval of Fruit by some 100 Elo points (look at Strelka 2.0), a large jump, and not by slowing Rybka down (in fact it sped up, with bitboards and 64-bit).
Well, Daniel just answered you very convincingly: you cannot separate eval from search, and as a rule a more sophisticated search goes hand in hand with a more sophisticated eval.

There are of course smaller or larger variances: some engines might emphasize eval and others search, or emphasize one of them at different periods of engine development, but as a whole, eval and search should be harmonised, contributing roughly equally.

That is why Mr. Zibi only just managed the 2600 Elo barrier. :)

And why Fruit, despite having a tremendous search, was still much inferior to Rybka, Houdini and other engines with both a good eval and a good search.

But, on the other hand, who says Fruit had such a bad/basic eval?
Wasn't it Fabien who first introduced scoring passers by rank, and didn't that prove a big winner?
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: An attempt to measure the knowledge of engines

Post by Laskos »

Uri Blass wrote:
Laskos wrote:
Uri Blass wrote:
I do not see how you measure evaluation in this way.
Different programs use different trees when they search to depth 4.

I guess Komodo 8 does more pruning than Komodo 3 at depth 4, so it may be weaker at depth 4 not because of a worse evaluation.
It's not simply depth 4. Each engine at depth 4 faces a Shredder opponent calibrated to a different node count. Komodo 8 does indeed prune more (1455 nodes vs 1865 nodes), but it also gets a weaker opponent than Komodo 3 does.
I understand, but there is a problem with the average number of nodes: it is affected by big values.

An engine that searches 100000 nodes for depth 4 in one position and 1000 nodes in many others can get the same average as an engine that uses 2000 nodes every time, even though 2000 nodes every time is usually clearly stronger.

Maybe it is better to average log2(nodes) and then take 2 to that power to get the fixed number of nodes, so a single big value does not have so much effect.

This way, 2^20 nodes for 1 position and 2^10 nodes for 9 positions at depth 4 still averages to only (20*1 + 10*9)/10 = 11, which means 2^11 nodes.
Yes, that is a concern; "go depth 4" may easily be off by a factor of 3-4 in nodes with respect to the average. I checked for consistency: Shredder 12 at the average node count for depth=4 versus Shredder 12 at depth 4, and the result was satisfying, within the error margins. That is no guarantee that other engines behave so nicely. I would also prefer a geometric average, (product of nodes per position)^(1/positions), but the Shredder GUI gives only (sum of nodes per position)/positions. Using a logarithmic scale is a long-standing idea for computing many ply- and depth-related quantities in chess, like time-to-depth.

My test basically wants to reproduce a fixed-nodes result with, say, 1000 nodes per move, without the engines having to obey the "go nodes" command. So I have to use "go depth", mimicking the node count.

Concerning the apparent eval "regression" of Komodo, yes, Komodo 8 prunes more at depth=4, but after calibration, it becomes irrelevant (if those averages are fine). I compared two different versions of Shredder, 12 and 9. Shredder 12 prunes more too, but it still has a stronger "eval" than Shredder 9 by some 100 Elo points.

The list including it is here:

Code: Select all

   # PLAYER                  : RATING    POINTS  PLAYED    (%)
   1 Gaviota 1.0             :  178.8     735.0    1000   73.5%
   2 Komodo 3                :  159.9     713.5    1000   71.3%
   3 Houdini 4               :  125.3     671.5    1000   67.2%
   4 Komodo 8                :  109.3     651.0    1000   65.1%
   5 Houdini 1.5             :  101.6     641.0    1000   64.1%
   6 RobboLito 0.085         :   43.0     561.0    1000   56.1%
   7 Shredder 12             :    0.0    7598.0   15000   50.7%
   8 Shredder 12 depth       :   -7.0     490.0    1000   49.0%
   9 Stockfish 21.03.2015    :  -11.2     484.0    1000   48.4%
  10 Stockfish 2.1.1         :  -22.1     468.5    1000   46.9%
  11 Texel 1.05              :  -43.0     439.0    1000   43.9%
  12 Shredder 9              : -106.9     352.0    1000   35.2%
  13 Strelka 2.0             : -109.6     348.5    1000   34.9%
  14 Crafty 24.1             : -115.8     340.5    1000   34.0%
  15 SOS 5.1                 : -180.1     263.5    1000   26.4%
  16 Fruit 2.1               : -199.1     243.0    1000   24.3% 
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: An attempt to measure the knowledge of engines

Post by Laskos »

cdani wrote:
Laskos wrote: Most chess players tend to overestimate the influence of the chess knowledge in an engine strength.
Maybe you want to play with Andscacs - Sungorus, a version of Andscacs 0.72 that is maybe 20 Elo stronger than 0.72, but it has the very simple eval of Sungorus, so it's maybe 300-350 Elo weaker; I really haven't tested it well:

http://talkchess.com/forum/viewtopic.ph ... =&start=10
I don't know how you implemented the foreign eval into Andscacs, but it's surely conceivable that it may lose 300 Elo points. But the eval strength (if not the general strength) is visible in the experiment.

Code: Select all

   # PLAYER                  : RATING    POINTS  PLAYED    (%)
   1 Gaviota 1.0             :  178.8     735.0    1000   73.5%
   2 Komodo 3                :  159.9     713.5    1000   71.3%
   3 Houdini 4               :  125.3     671.5    1000   67.2%
   4 Komodo 8                :  109.3     651.0    1000   65.1%
   5 Houdini 1.5             :  101.6     641.0    1000   64.1%
   6 RobboLito 0.085         :   43.0     561.0    1000   56.1%
   7 Shredder 12             :    0.0    8989.0   17000   52.9%
   8 Shredder 12 depth       :   -7.0     490.0    1000   49.0%
   9 Stockfish 21.03.2015    :  -11.2     484.0    1000   48.4%
  10 Andscacs 0.72           :  -19.0     473.0    1000   47.3%
  11 Stockfish 2.1.1         :  -22.1     468.5    1000   46.9%
  12 Texel 1.05              :  -43.0     439.0    1000   43.9%
  13 Shredder 9              : -106.9     352.0    1000   35.2%
  14 Strelka 2.0             : -109.6     348.5    1000   34.9%
  15 Crafty 24.1             : -115.8     340.5    1000   34.0%
  16 SOS 5.1                 : -180.1     263.5    1000   26.4%
  17 Fruit 2.1               : -199.1     243.0    1000   24.3%
  18 Andscacs sung           : -324.0     136.0    1000   13.6%