Tree compare

hgm · Post by **hgm** » Sat Apr 28, 2007 11:12 pm

I had a crazy idea to compare two versions of the same chess program:

Just let the two versions play the same game against some opponent. If they don't differ too much, most of the time they will play the same moves. (Even different engines play mostly the same moves, as the ponder hit rate of ~60% shows.) We do this at a fixed amount of thinking time per move, to prevent accumulation of speed changes leading to different moves that are not directly due to the difference between the versions. (We can record the time actually searched separately.)

If at some point they play a different move, you 'split the game', and you let both engines finish both variations. If in any position they again play a different you repeat this recursively. So in stead of a linear sequence of moves, the game will become a binary tree. Eventually each branch ends in checkmate or legal draw, but there might be hundreds of such ends.

Now to extract an evaluation of the engine performance, you order the end leaves: the best is the shortest win, the worst is the shortest loss. Now at each branch point, you can see which of the two moves led to the best result, and which of the two engines thus made the better decision. You could us the percentage of good vs bad decision in the total tree game to evaluate the two versions.

As each engine has been presented exactly the same set of positions, it seems to me that this way of testing should be intrinsically less noisy than just having it play games. By extracting the statistics in detail, you should be even able to see if changes are, say, benificial in the middle-game, but counter-productive in the end-game. All that would be hard to see from just game results, as the bare result doesn't tell you were the error or genius move was made that caused it.

Has something like this ever been tried?

Onno Garms · Post by **Onno Garms** » Sun Apr 29, 2007 11:37 am

I think that this idea is very intersting. If you implement this, let me know.

I doubt that anybody has tried this aproach, because it seems to require considerable programming efford for the program that runs the tests. Such a program would certainly be worth being published, but I don't know of such a publication.

Even a fixed amount of thinking time per move does not make all engines play determistically. For instance, an engine might make descissions other than wether to terminate or not based on the remaining time (which might vary a little due to small differences in execution speed). To avoid problems and to save time, every position should be computed only once by any engine and the result should be saved.

Also you might need to prune games. Assume that the two versions play a different move in 25% of the positions. (You say 40% for completely different engines.) I think that enigine-vs-engine-games take more moves on an average than human-vs-human, so assume 60 moves. These should be in the order of 4 * 2^(60/4+1) = 262.144 moves to calculate and for each of it one ply with the two versions in question and one ply with the oponent. At one second per move this should take 9 days per tree of games. Of course the number of moves to be examined is extremely sensitive to the percentage of positions where the engines play different moves.

I did not quite understand how you decide which is the better move in a position? You wrote

Now to extract an evaluation of the engine performance, you order the end leaves: the best is the shortest win, the worst is the shortest loss. Now at each branch point, you can see which of the two moves led to the best result, and which of the two engines thus made the better decision.

Does that mean you only consider the shortest win in both of the subtrees? What if e.g. move A only produced wins between 30 and 50 moves, while move B produed one win within 10 moves due to a blunder by the opponent and only losses otherwise?

Also, I'm not sure if it's good to include the game length in the evaluation. Going into obviously lost endgames still can make the game last longer then loosing a difficult middle game.

hgm · Post by **hgm** » Sun Apr 29, 2007 10:14 pm

OK, I see your points. There obviously is a limit to how different the engines can be without getting an unweildly large tree. I was hoping on 10% of the moves being different, so that a 100-move game would lead to ~1024 game ends. E.g. after changing the value of an edge-Pawn push.

Your point about the game length being an inaccurate measure is also a good one. Perhaps it would be better to measure how long it takes to drive the opponent's own score to -2 or -3.

Tree compare

Tree compare

Re: Tree compare

Re: Tree compare