But if you think about it, you are trying to say that a 4 ply search with X is equivalent to N nodes with Y. How confident are you that, just because you see how many nodes a 4 ply search traverses, the two are equal? A very good selective searcher will be able to go far deeper if it gets to search N nodes. If you tried this with the old chess master program, you would find lots of positions where it could not finish a 4 ply search in any reasonable amount of time, due to the way it selectively searched things.

Laskos wrote:
It is not a simple 4 ply search. It is a 4 ply search of engine X against Shredder, with Shredder's nodes set equal to the average number of nodes engine X uses during a 4 ply search. Fixed nodes versus fixed depth may not be desirable, but all engines "benefited" from the same treatment against Shredder. Still, there are many undocumented quantities, one simple one being that a single fixed-depth search may easily be off the average node count by a factor of 4.

bob wrote:
This still leaves a hole. 4 plies is not a "constant". Some programs extend more aggressively, some less so. Some reduce more aggressively (Stockfish particularly), some less so. So you are STILL measuring search differences along with eval differences. This is what I was trying to do in my simple 1 ply test. But even then I had to modify both programs so that one ply meant the same thing. And even that is not a sure thing, since reducing at the root could still differ between the two programs.
One good test would be A vs. B at fixed depth, then swap their evaluations and play again. But that would be beyond a royal pain, obviously.
I am not sure there is any really accurate way to compare evaluations so long as the searches are not similar. One thing I could do: if you compare Stockfish to Crafty your way, I could modify both to do a bare-bones 4 ply search plus quiescence, with no extensions, no reductions, and nothing in the q-search except captures. You could then run the test again to the same depth and compare the numbers. If they are close, then what you are doing is probably more than good enough. If they vary significantly, you could experiment to see whether there is any simple way of getting them to produce the same results as the code I sent.
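As a rough illustration of what such a bare-bones search looks like, here is a minimal negamax sketch with a captures-only quiescence. The Node class, its hand-assigned scores, and the capture flags are all invented for the example; this is not code from Crafty, Stockfish, or any real engine.

```python
# Minimal fixed-depth negamax with a captures-only quiescence search.
# Everything here operates on a toy, hand-built game tree.

INF = 10 ** 9

class Node:
    """Toy game-tree node. 'score' is the static eval from the
    side-to-move's point of view; 'is_capture' marks whether the
    move that led to this node was a capture."""
    def __init__(self, score, children=(), is_capture=False):
        self.score = score
        self.children = children
        self.is_capture = is_capture

def quiesce(node, alpha, beta):
    # Stand pat on the static eval, then consider capture moves only.
    stand_pat = node.score
    if stand_pat >= beta:
        return beta
    alpha = max(alpha, stand_pat)
    for child in node.children:
        if not child.is_capture:
            continue                      # q-search: captures only
        score = -quiesce(child, -beta, -alpha)
        if score >= beta:
            return beta
        alpha = max(alpha, score)
    return alpha

def search(node, depth, alpha=-INF, beta=INF):
    # No extensions, no reductions, no pruning beyond alpha-beta itself.
    if depth == 0 or not node.children:
        return quiesce(node, alpha, beta)
    for child in node.children:
        score = -search(child, depth - 1, -beta, -alpha)
        if score >= beta:
            return beta
        alpha = max(alpha, score)
    return alpha
```

With every extension and reduction stripped out, "depth 4" means the same thing in both programs, which is the whole point of the exercise.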
A 4 ply search would still finish almost instantly even without any pruning or reductions, so the test would be quick.
If you say 4 plies = N nodes for X, and then you let Y search N nodes, that may or may not be very accurate. Impossible to say. What I would expect to see, most likely, is that two programs with a very similar eval but very different searches might produce a lopsided result purely because of the way you are equating their searches.
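To make the worry concrete, here is a toy model with uniform branching factors. Real game trees are nothing like this uniform, and the branching factors (~35 for a full-width searcher, ~6 for a selective one) are illustrative assumptions, not measurements from any engine. The point is only that the same node budget can buy very different depths:

```python
# Toy model: uniform trees only. A full-width searcher defines the node
# budget via its 4 ply tree; a selective searcher that keeps far fewer
# moves per node spends the same budget much deeper.

def nodes_at_depth(branching, depth):
    # total nodes in a uniform tree searched to the given depth
    return sum(branching ** d for d in range(depth + 1))

def depth_for_budget(branching, budget):
    # deepest complete ply the budget pays for in a uniform tree
    depth = 0
    while nodes_at_depth(branching, depth + 1) <= budget:
        depth += 1
    return depth

budget = nodes_at_depth(35, 4)          # "N nodes": X's full-width 4 ply tree
print(depth_for_budget(6, budget))      # -> 7: the selective Y gets 7 plies
```

So in this toy setting "N nodes" buys the selective program roughly 7 plies, not 4, which is exactly why equating the two is risky.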
Not saying it is right, wrong, good, or bad. I am saying it is "unknown". I would prefer to normalize the searches (hard to do with commercial programs, of course; it would take months of reverse-engineering study to figure out how) so that the only difference is the evaluations. That is what I did in my test: a 1 ply search plus captures, which REALLY leans on the evaluation for everything. And let me tell you, there are some really ODD results when you do that: walking into repetitions without knowing until it is too late, missing simple mate threats since you only allow the opponent to recapture, etc. The only good thing is that playing a million games is very fast at sd=1, so you can cover up some of the noise with sheer volume of games.
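On that last point, the payoff from sheer volume is easy to quantify. Assuming independent games and a score near 50% (the draw ratio below is a free parameter, not a measured value), the 95% error bar in Elo shrinks with the square root of the number of games:

```python
import math

def elo_error_bar(games, draw_ratio=0.0, z=1.96):
    """Approximate 95% error bar, in Elo, for a match scoring ~50%."""
    # per-game variance of the score (win=1, draw=0.5, loss=0) at p=0.5
    variance = 0.25 * (1 - draw_ratio)
    std_err = math.sqrt(variance / games)
    # slope of the Elo curve at a 50% score: (400 / ln 10) / (p * (1 - p))
    slope = (400 / math.log(10)) / 0.25
    return z * std_err * slope

print(round(elo_error_bar(1_000), 1))       # ~ +/-21.5 Elo
print(round(elo_error_bar(1_000_000), 2))   # ~ +/-0.68 Elo
```

At a million games the error bar is well under 1 Elo, versus roughly 21 Elo for a thousand games, so volume really does drown out most of the noise (provided the games are actually independent, which opening books and adjudication rules can undermine).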