At least it suggests it is not just random noise.

Laskos wrote:
I performed a self-consistency check, which I should have performed to start with; luckily it was fine. I got the average nodes of Shredder 12 itself to depth=4 (on average 2441 nodes):

bob wrote:
But if you think about it, you are trying to say a 4 ply search with X is equivalent to N nodes with Y. How confident are you that just because you see how many nodes a 4 ply search traverses, the two are equal? A very good selective searcher will be able to go far deeper if it gets to search N nodes. If you tried this with the old chess master program, you would find lots of positions where it could not finish a 4 ply search in any reasonable amount of time, due to the way it selectively searched things.

Laskos wrote:
It is not a simple 4 ply search. It is a 4 ply search of engine X against Shredder with nodes equal to the average number of nodes engine X uses during a 4 ply search. Fixed nodes versus fixed depth may not be desirable, but all engines "benefited" from the same treatment against Shredder. Still, there are many undocumented quantities, one simple one being that a single fixed-depth search may easily be off the average nodes used by a factor of 4.

bob wrote:
This still leaves a hole. 4 plies is not a "constant". Some programs extend more aggressively, some less so. Some reduce more aggressively (Stockfish particularly), some less so. So you are STILL measuring search differences along with eval differences. This was what I was trying to do in my simple 1 ply test. But even then I had to modify both programs so that one ply meant the same thing. And that is not even a sure thing, since reducing at the root could still be different between the two programs.
One good test would be a vs b to fixed depth, then swap their evaluations and play again. But that would be beyond a royal pain, obviously.
I am not sure there is any really accurate way to compare evaluations so long as the searches are not similar. One thing I could do: if you compare Stockfish to Crafty your way, I could modify both to do a bare-bones 4 ply search + quiescence, with no extensions, no reductions, and nothing in q-search except captures, and you could run the test again to the same depth and compare the numbers. If they are close, then what you are doing is probably more than good enough. If they vary significantly, you could experiment to see if there is any simple way of getting them to produce the same results as the code I sent.
4 plies would still be instant without any pruning or reductions involved, so it would be quick.
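What such a bare-bones search might look like, as a hypothetical sketch over an abstract game interface. The ToyGame class, its positions, and its scores are invented purely for illustration; a real engine would plug in its own move generator and evaluation:

```python
# Hypothetical sketch of the "bare-bones" search described above: a fixed-depth
# negamax with a captures-only quiescence search -- no extensions, no
# reductions, no pruning. ToyGame is an invented stand-in for a real engine.

def quiesce(pos, game):
    """Stand-pat quiescence: try capture moves only, nothing else."""
    best = game.evaluate(pos)  # stand-pat score, side to move
    for move in game.captures(pos):
        best = max(best, -quiesce(game.make(pos, move), game))
    return best

def negamax(pos, depth, game):
    """Plain fixed-depth negamax; drops into quiescence at the horizon."""
    if depth == 0:
        return quiesce(pos, game)
    moves = game.moves(pos)
    if not moves:  # terminal position
        return game.evaluate(pos)
    return max(-negamax(game.make(pos, m), depth - 1, game) for m in moves)

class ToyGame:
    """Tiny hand-built game tree; all scores are from the side to move."""
    tree = {  # pos -> [(move, child, is_capture)]
        "root": [("a", "A", False), ("b", "B", False)],
        "A": [("c", "A1", True)],
        "B": [],
        "A1": [],
    }
    evals = {"root": 0, "A": 3, "B": 1, "A1": -5}

    def moves(self, pos):
        return [m for m, _, _ in self.tree[pos]]

    def captures(self, pos):
        return [m for m, _, cap in self.tree[pos] if cap]

    def make(self, pos, move):
        return next(c for m, c, _ in self.tree[pos] if m == move)

    def evaluate(self, pos):
        return self.evals[pos]

print(negamax("root", 1, ToyGame()))  # prints -1
```

Note that even in this toy tree the quiescence capture changes the result: a pure 1 ply search would pick move "a", but the capture reply at "A" makes "b" the safer choice.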
If you say 4plies = N nodes for X, and then you let Y search for N nodes, that may or may not be very accurate. Impossible to say. What I would expect to see, most likely, is that two programs with a very similar eval, but very different searches, might produce a lop-sided result because of the way you are equivalencing their search.
Not saying it is right, wrong, good or bad. Saying it is "unknown". I would prefer to normalize the searches (hard to do with commercial engines, of course; it would take months of reverse-engineering study to figure out how) so that the only difference is the evaluations. That was what I did in my test: a 1 ply search + captures, which would REALLY seem to lean on the evaluation for everything. And let me tell you, there are some really ODD results when you do that: walking into repetitions without knowing until it is too late, missing simple mate threats since you only allow the opponent to recapture, etc. The only good thing is that playing a million games is very fast at sd=1, so you can cover up some of the noise with sheer volume of games.

Then I put Shredder 12 at depth=4 to play against the standard-candle fixed-nodes Shredder 12, at nodes=2441, and the result is equal within error margins:

Code: Select all
 1) Stockfish 21.03.2015    598 nodes
 2) Komodo 8               1455 nodes
 3) Houdini 4              1957 nodes
 4) Robbolito 0.085        1853 nodes
 5) Texel 1.05             1831 nodes
 6) Gaviota 1.0            1091 nodes
 7) Strelka 2.0            4485 nodes
 8) Fruit 2.1              7294 nodes
 9) Komodo 3               1865 nodes
10) Stockfish 2.1.1        1892 nodes
11) Houdini 1.5            2167 nodes
12) Crafty 24.1            1584 nodes
13) Shredder 12            2441 nodes
The consistency is encouraging, but it does not yet prove that the methodology is correct.

Code: Select all
   # PLAYER                 :  RATING  POINTS  PLAYED   (%)
   1 Gaviota 1.0            :   178.8   735.0    1000  73.5%
   2 Komodo 3               :   159.9   713.5    1000  71.3%
   3 Houdini 4              :   125.3   671.5    1000  67.2%
   4 Komodo 8               :   109.3   651.0    1000  65.1%
   5 Houdini 1.5            :   101.6   641.0    1000  64.1%
   6 RobboLito 0.085        :    43.0   561.0    1000  56.1%
   7 Shredder 12 nodes      :     0.0  6213.5   13000  47.8%
   8 Shredder 12 depth      :    -7.0   490.0    1000  49.0%
   9 Stockfish 21.03.2015   :   -11.2   484.0    1000  48.4%
  10 Stockfish 2.1.1        :   -22.1   468.5    1000  46.9%
  11 Texel 1.05             :   -43.0   439.0    1000  43.9%
  12 Strelka 2.0            :  -109.6   348.5    1000  34.9%
  13 Crafty 24.1            :  -115.8   340.5    1000  34.0%
  14 Fruit 2.1              :  -199.1   243.0    1000  24.3%
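As a sanity check on the table, the RATING column is close to what the plain logistic Elo model predicts from the score percentages alone (Ordo/BayesElo use a more refined model, so the figures differ slightly). A small sketch:

```python
import math

def elo_from_score(p):
    """Elo difference implied by an expected score p under the logistic model."""
    return -400.0 * math.log10(1.0 / p - 1.0)

# Score fractions taken from the table above; the logistic model gives
# roughly +177, +43, -197 vs. the tabulated 178.8, 43.0, -199.1.
for name, p in [("Gaviota 1.0", 0.735),
                ("RobboLito 0.085", 0.561),
                ("Fruit 2.1", 0.243)]:
    print(f"{name:18s} {elo_from_score(p):+7.1f}")
```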
An attempt to measure the knowledge of engines
Re: An attempt to measure the knowledge of engines
Re: An attempt to measure the knowledge of engines
Although I do not have access to the Komodo 3 code (though I will look around for it), Komodo has become much more selective over the years. That selectivity lets it search deeper, but will make a fixed depth search look weaker. Also, I think the more recent Stockfish programs have become more selective even in PV nodes (hence the very small number of nodes it shows for a 4 ply search). So a fixed depth search would hurt programs like that a lot. The selectivity is meant to make the program reach more plies of depth, which will not happen in a fixed depth search.
I like the concept of your experiment, but the high selectivity of programs means some of them will not look at a lot of the nodes, and hence underperform at low depths. Most programs I have seen do not prune moves at the first ply, so a one ply search might kinda work. Then again, a program that extends more moves would gain a lot of elo in a one ply search over one which does not. Possibly you could take the nps for an engine and its branching factor and reach some conclusion. Or modify each program to have the same nonselective search.
A few months ago I tried stripping Komodo's evaluation down to something akin to piece square tables and material eval. It cost between 500 and 600 elo (and sped the node rate up a lot). Top programs have gained a lot from both evaluation and search improvements.
Re: An attempt to measure the knowledge of engines
The easiest problem I see with it is that it quite often plays without a plan. And of course it does not see relatively easy conceptual problems coming.

Laskos wrote:
I am curious if this abomination "Andscacs - Sungorus" has serious issues, visible to the naked eye in games.
Daniel José - http://www.andscacs.com
Re: An attempt to measure the knowledge of engines
If a program is very selective, it will search fewer nodes at depth 4. Kai's idea is to correct for that by making the opponent (Shredder) search correspondingly fewer nodes. So this is not a typical fixed depth=4 experiment.

mjlef wrote:
Although I do not have access to the Komodo 3 code (though I will look around for it), Komodo has become much more selective over the years. That selectivity lets it search deeper, but will make a fixed depth search look weaker. Also, I think the more recent Stockfish programs have become more selective even in PV nodes (hence the very small number of nodes it shows for a 4 ply search). So a fixed depth search would hurt programs like that a lot. The selectivity is meant to make the program reach more plies of depth, which will not happen in a fixed depth search.
I like the concept of your experiment, but the high selectivity of programs means some of them will not look at a lot of the nodes, and hence underperform at low depths. Most programs I have seen do not prune moves at the first ply, so a one ply search might kinda work. Then again, a program that extends more moves would gain a lot of elo in a one ply search over one which does not. Possibly you could take the nps for an engine and its branching factor and reach some conclusion. Or modify each program to have the same nonselective search.
A few months ago I tried stripping Komodo's evaluation down to something akin to piece square tables and material eval. It cost between 500 and 600 elo (and sped the node rate up a lot). Top programs have gained a lot from both evaluation and search improvements.
This is more like Shredder @ x nodes vs. Engine_A @ x nodes (average, which happens to be depth 4).
You can see which programs are more selective in the table that Kai provided, and you will see that the experiment does not correlate with it.
Miguel
Re: An attempt to measure the knowledge of engines
Who could have said it better?

mjlef wrote:
Although I do not have access to the Komodo 3 code (though I will look around for it), Komodo has become much more selective over the years. That selectivity lets it search deeper, but will make a fixed depth search look weaker. Also, I think the more recent Stockfish programs have become more selective even in PV nodes (hence the very small number of nodes it shows for a 4 ply search). So a fixed depth search would hurt programs like that a lot. The selectivity is meant to make the program reach more plies of depth, which will not happen in a fixed depth search.
I like the concept of your experiment, but the high selectivity of programs means some of them will not look at a lot of the nodes, and hence underperform at low depths. Most programs I have seen do not prune moves at the first ply, so a one ply search might kinda work. Then again, a program that extends more moves would gain a lot of elo in a one ply search over one which does not. Possibly you could take the nps for an engine and its branching factor and reach some conclusion. Or modify each program to have the same nonselective search.
A few months ago I tried stripping Komodo's evaluation down to something akin to piece square tables and material eval. It cost between 500 and 600 elo (and sped the node rate up a lot). Top programs have gained a lot from both evaluation and search improvements.
- Full name: Kai Laskos
Re: An attempt to measure the knowledge of engines
I don't quite get this pruning argument in the context of the present test. Pruning, done correctly, does not deprecate the value of a node, so testing at fixed nodes should be insensitive to pruning in this context. The test here mimics fixed nodes at a low node count, although most engines don't handle the "go nodes" directive very well (or at all). Not as low a count as depth=1: the games I saw at depth=1 were so crappy, and engines differ so much in what they are doing there, that integrating the eval over, say, ~2000 searched nodes (depth=4 varies from engine to engine in nodes) seems more representative of eval performance.

mjlef wrote:
Although I do not have access to the Komodo 3 code (though I will look around for it), Komodo has become much more selective over the years. That selectivity lets it search deeper, but will make a fixed depth search look weaker. Also, I think the more recent Stockfish programs have become more selective even in PV nodes (hence the very small number of nodes it shows for a 4 ply search). So a fixed depth search would hurt programs like that a lot. The selectivity is meant to make the program reach more plies of depth, which will not happen in a fixed depth search.
I like the concept of your experiment, but the high selectivity of programs means some of them will not look at a lot of the nodes, and hence underperform at low depths. Most programs I have seen do not prune moves at the first ply, so a one ply search might kinda work. Then again, a program that extends more moves would gain a lot of elo in a one ply search over one which does not. Possibly you could take the nps for an engine and its branching factor and reach some conclusion. Or modify each program to have the same nonselective search.
A few months ago I tried stripping Komodo's evaluation down to something akin to piece square tables and material eval. It cost between 500 and 600 elo (and sped the node rate up a lot). Top programs have gained a lot from both evaluation and search improvements.
About the eval, does it ever happen that you have to remove or shortcut an eval component in order to gain Elo? I would imagine that these sorts of things happen.
Re: An attempt to measure the knowledge of engines
If I understood the question correctly, of course that happens to everyone. An example is the mg/eg separation.

Laskos wrote:
About the eval, does it ever happen that you have to remove or shortcut an eval component in order to gain Elo? I would imagine that these sorts of things happen.
Daniel José - http://www.andscacs.com
- Full name: Kai Laskos
Re: An attempt to measure the knowledge of engines
I mean from version to version, not so much during the game. I would think that eval features are often more volatile and piecemeal than, say, pruning techniques.

cdani wrote:
If I understood the question correctly, of course that happens to everyone. An example is the mg/eg separation.

Laskos wrote:
About the eval, does it ever happen that you have to remove or shortcut an eval component in order to gain Elo? I would imagine that these sorts of things happen.
- Full name: lucasart
Re: An attempt to measure the knowledge of engines
Interesting. But I think SF is significantly undervalued by this measure, because even at depth=4 there is a lot of pretty crazy pruning... For testing eval-only patches (with limited resources) I typically use depth=10.

Laskos wrote:
Shredder GUI has a nice feature of giving output statistics by ply after an EPD set of positions is fed to it for analysis. The positions are FENs without a solution, and are balanced. Comparing the engines directly at depth=1 is tricky, and Bob Hyatt explained it well:

bob wrote:
I can't say "for most reasonable programs". I was directly addressing a claim about stockfish and Crafty.
The "depth 1" case can be tricky. Here's why:
(1) Some programs extend depth when they give check (Crafty, for example), while others extend depth when escaping check. That is a big difference, in that with the latter you drop into the q-search while in check. Nothing wrong with that, but you might or might not recognize a mate. Which means that with a 1 ply search, some programs will recognize a mate in 1 and some won't.
(2) q-search. Some do a very simple q-search. Some escape or give check at the first ply of q-search. Some give check at the first search ply, escape at the second, and give check again at the third. Those are not equal.
When I tried this with stockfish, I ran into the same problem. All 1 ply searches are not created equal.
As I mentioned, I am not sure my test is all that useful, because a program's evaluation is written around its search, and vice versa. If you limit one part, you might be limiting the other without knowing it, skewing the results.

To make the most of the "eval driven search", I adopted the following methodology:
1/ Use depth=4 to avoid the issues of threats, q-search imbalances, and the inadequacy of a depth=1 search in games.
2/ Most engines do not follow the "go nodes" UCI command well for small numbers of nodes. They do follow the "go depth" command rather well.
3/ Shredder engines (here I used Shredder 12 as the standard candle) follow the "go nodes" command literally, even for small numbers of nodes.
4/ Calculate the average nodes to depth=4 for each engine on many positions (150 late opening, 171 endgame).
5/ Set up the matches: engine X at depth=4 versus Shredder 12 at the number of nodes that engine uses to depth=4.
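Step 4/ amounts to reading the node count an engine reports at depth 4 from its UCI "info" output. A minimal, hypothetical parser sketch (the sample lines below are invented for illustration; real engines emit many more fields, but the `depth`/`nodes` pairs are standard UCI):

```python
import re

def nodes_at_depth(uci_lines, target_depth):
    """Return the node count on the last 'info' line for target_depth,
    or None if that depth was never reported."""
    result = None
    for line in uci_lines:
        m = re.search(r"\bdepth (\d+)\b.*\bnodes (\d+)\b", line)
        if m and int(m.group(1)) == target_depth:
            result = int(m.group(2))
    return result

# Invented sample output; a real engine prints one such line per iteration.
sample = [
    "info depth 3 seldepth 7 score cp 21 nodes 524 pv e2e4",
    "info depth 4 seldepth 9 score cp 18 nodes 1254 pv e2e4 e7e5",
    "bestmove e2e4",
]
print(nodes_at_depth(sample, 4))  # prints 1254
```

Averaging this figure over the EPD positions gives the per-engine node budget used in step 5/.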
In Shredder GUI, as an example Gaviota 1.0:
I took the average nodes to depth=4 for a dozen engines:

Code: Select all
Late opening positions:
TotTime: 25s   SolTime: 25s
Ply: 0  Positions: 150  Avg Nodes:    0  Branching = 0.00
Ply: 1  Positions: 150  Avg Nodes:   90  Branching = 0.00
Ply: 2  Positions: 150  Avg Nodes:  255  Branching = 2.83
Ply: 3  Positions: 150  Avg Nodes:  524  Branching = 2.05
Ply: 4  Positions: 150  Avg Nodes: 1254  Branching = 2.39
Engine: Gaviota v1.0 (2048 MB) by Miguel A. Ballicora

Endgame positions:
TotTime: 28s   SolTime: 28s
Ply: 0  Positions: 171  Avg Nodes:    0  Branching = 0.00
Ply: 1  Positions: 171  Avg Nodes:   46  Branching = 0.00
Ply: 2  Positions: 171  Avg Nodes:  158  Branching = 3.43
Ply: 3  Positions: 171  Avg Nodes:  386  Branching = 2.44
Ply: 4  Positions: 171  Avg Nodes:  928  Branching = 2.40
Engine: Gaviota v1.0 (2048 MB) by Miguel A. Ballicora
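The "Branching" column in that output is simply the ratio of successive average node counts, which can be checked directly (figures taken from the late-opening set above):

```python
# Effective branching factor at ply d = avg_nodes(d) / avg_nodes(d-1),
# using Gaviota's late-opening averages from the Shredder GUI output.
avg_nodes = {1: 90, 2: 255, 3: 524, 4: 1254}
for d in range(2, 5):
    print(f"Ply {d}: Branching = {avg_nodes[d] / avg_nodes[d - 1]:.2f}")
# prints 2.83, 2.05, 2.39 for plies 2-4, matching the table
```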
Play all of them at fixed depth=4 against Shredder 12 at the fixed nodes determined for each engine.

Code: Select all
 1) Stockfish 21.03.2015    598 nodes
 2) Komodo 8               1455 nodes
 3) Houdini 4              1957 nodes
 4) Robbolito 0.085        1853 nodes
 5) Texel 1.05             1831 nodes
 6) Gaviota 1.0            1091 nodes
 7) Strelka 2.0            4485 nodes
 8) Fruit 2.1              7294 nodes
 9) Komodo 3               1865 nodes
10) Stockfish 2.1.1        1892 nodes
11) Houdini 1.5            2167 nodes
12) Crafty 24.1            1584 nodes
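For what it's worth, these per-engine figures look like the plain mean of the late-opening and endgame averages; Gaviota's entry is consistent with the two set averages shown earlier. Note this averaging rule is my inference, not stated in the post:

```python
# Gaviota 1.0: average nodes to depth=4 on each position set
# (from the Shredder GUI output earlier in the post).
late_opening, endgame = 1254, 928
print((late_opening + endgame) // 2)  # prints 1091, matching the table entry
```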
Shredder 12 as the standard candle; the strength of the evals:
a) This is as close as I can get to a measure of the evals as integrated into the search, without messing with sources (often unavailable).

Code: Select all
   # PLAYER                   :  RATING  POINTS  PLAYED   (%)
   1 Gaviota 1.0              :   178.8   735.0    1000  73.5%
   2 Komodo 3 (2011)          :   159.9   713.5    1000  71.3%
   3 Houdini 4                :   125.3   671.5    1000  67.2%
   4 Komodo 8                 :   109.3   651.0    1000  65.1%
   5 Houdini 1.5 (2011)       :   101.6   641.0    1000  64.1%
   6 RobboLito 0.085          :    43.0   561.0    1000  56.1%
   7 Shredder 12              :     0.0  5703.5   12000  47.5%
   8 Stockfish 21.03.2015     :   -11.2   484.0    1000  48.4%
   9 Stockfish 2.1.1 (2011)   :   -22.1   468.5    1000  46.9%
  10 Texel 1.05               :   -43.0   439.0    1000  43.9%
  11 Strelka 2.0 (Rybka 1.0)  :  -109.6   348.5    1000  34.9%
  12 Crafty 24.1              :  -115.8   340.5    1000  34.0%
  13 Fruit 2.1                :  -199.1   243.0    1000  24.3%
b) Besides the "anomaly" of Gaviota 1.0 (in fact there were earlier indications that Gaviota has a strong eval), there is another: Komodo 3 having a better (maybe larger) eval than Komodo 8. Larry or Mark could confirm or deny the regression (or thinning) of the eval.
c) Stockfish eval seems no stronger than that of Shredder 12, but it seems significantly stronger than that of Crafty 24.1.
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
- Full name: lucasart
Re: An attempt to measure the knowledge of engines
Indeed. Quantity != quality...

Laskos wrote:
Well, that Fruit had a simple eval even compared to Crafty was visible in its sources. If your only argument is that the strongest engines must have the most elaborate and largest evals, then somebody has to check that sort of hand-waving.

Lyudmil Tsvetkov wrote:
Well, Daniel just answered you very convincingly: you cannot separate eval from search, and as a rule a more sophisticated search goes hand in hand with a more sophisticated eval.
There are of course smaller or larger variances: some engines might emphasize eval and others search, or emphasize each at different periods of engine development, but as a whole eval and search should be harmonised, being close together in contribution.
That is why Mr. Zibi managed just the 2600-elo barrier.
And why Fruit, despite having tremendous search, was still much inferior to Rybka, Houdini and other engines having both good eval and good search.
But, on the other hand, who says Fruit had such a bad/basic eval?
Wasn't it Fabien who first introduced scoring passers in terms of ranks, and this proved a big winner?