Sven Schüle wrote:benstoker wrote:Sven Schüle wrote:Don wrote:mjlef wrote:Another idea using this test is to compare matching move percentages as depth increases. Assuming great search depth leads to better moves, shouldn't all programs eventually converge on the same best moves? How quickly? This might even measure how much various programs improve with depth.
For this test to really be valid, I'm assuming that it's not necessary to converge on a single move - in many positions there should be a choice of 2 or more moves that are all good. Of course ultimately it will probably turn out to be the case that one good move leads to checkmate in 57 moves with best play and another leads to checkmate in 58 moves.
Or, to take it to the extreme, 30 moves draw and 8 moves lose.
It is possible that move choices of engines are not significant enough in some cases to allow that kind of comparison. Could matching the whole (final) PVs be an improvement? We would need a definition for "closeness of two PVs", and we would need access to the PVs themselves somehow. The latter should be easy.
Sven
movetime: 2000
fen pos: rn3rk1/p1q2pbp/2pp1npB/1p2p3/3PP1b1/3B1N2/PPPQNPPP/R4RK1 w - - 2 11
ih pv: h6g7 g8g7 a2a4 b5a4 a1a4 b8d7 a4a6 c7b7 d2g5 g4f3 g2f3 f8e8 f1d1 ***
r3 pv: h6g7 g8g7 a2a4 b5a4 a1a4 b8d7 a4a6 f8b8 b2b3 g4f3 g2f3 c6c5 d4d5 ***
match: TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE TRUE TRUE FALSE FALSE
pv ply: 1 2 3 4 5 6 7 8 9 10 11 12 13
RSQ 0.388889
Binomial Distribution 0.95385742
Do same thing for 50 or 100 positions. Cut off pv data at 8 or so.
Then do gobs of stats.
With this method, you have 8 or more data points per position, instead of merely one, with which to do statistical wizardry.
Yes, I thought about something in that direction. However, all TRUEs after (to the right of) the first FALSE are quite meaningless, since from there on it is already a different PV; if both PVs later contain the string "c2c3", it may even be different pieces making that move.
Cutting PVs after 8 moves is possible, of course. But at today's search depths we could also say 16 or 32. A measurement to express the degree of closeness of two PVs would have to be developed. E.g. two 2-ply PVs being 100% equal have less significance IMO than two 16-ply PVs where the first 14 plies are equal. You propose RSQ, which I am not familiar with, and binomial distribution. But one could as well count all matching PV moves from the first ply until the first mismatch, always stop at a given max ply, say 16, and finally divide the number of matching PV moves by 16 to get a "PV closeness percentage". One could also say 1 match => 50%, 2 => 75%, 3 => 87.5% and so on.
Just some thoughts. Of course this introduces more complexity, I know.
Sven
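To make the two proposed measures concrete, here is a minimal Python sketch. The function names and the default max ply of 16 are just illustrative assumptions, and the "50%, 75%, 87.5%" series is read here as 1 - 0.5^n for n matching moves. The two PVs are the ones from the example position above.

```python
def matching_prefix(pv_a, pv_b, max_ply=16):
    """Count identical moves from the first ply up to the first mismatch."""
    count = 0
    for move_a, move_b in zip(pv_a[:max_ply], pv_b[:max_ply]):
        if move_a != move_b:
            break
        count += 1
    return count


def closeness_linear(pv_a, pv_b, max_ply=16):
    """Matching prefix length divided by the fixed max ply."""
    return matching_prefix(pv_a, pv_b, max_ply) / max_ply


def closeness_halving(pv_a, pv_b, max_ply=16):
    """1 match => 0.5, 2 => 0.75, 3 => 0.875, i.e. 1 - 0.5**n."""
    n = matching_prefix(pv_a, pv_b, max_ply)
    return 1.0 - 0.5 ** n


# The two PVs from the example position in this thread.
ih = "h6g7 g8g7 a2a4 b5a4 a1a4 b8d7 a4a6 c7b7 d2g5 g4f3 g2f3 f8e8 f1d1".split()
r3 = "h6g7 g8g7 a2a4 b5a4 a1a4 b8d7 a4a6 f8b8 b2b3 g4f3 g2f3 c6c5 d4d5".split()

print(matching_prefix(ih, r3))    # 7
print(closeness_linear(ih, r3))   # 0.4375
print(closeness_halving(ih, r3))  # 0.9921875
```

Note how differently the two measures rate the same pair: the linear one says 7 of 16, while the halving one is already above 99% after 7 matching plies, so it saturates quickly.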
Dang, you are obviously right about the data being meaningless after the first false! The "PV closeness" is therefore going to key off the number of TRUE's per position.
Once the data is collected, all kinds of statistics can be done. Over an adequate and sufficient position population sample, there will be a number of matches at ply 1, at ply 2, and so on. There needs to be a cutoff for practical purposes. If the pv match, for example, stops at 4, then 5 through 8 are counted as false.
Why cut off at 6? Start a UCI engine from a terminal and feed it a fen. Then try different movetimes. Sometimes the pv would only be 2 ply in r3. Don't know why. Sometimes it stops at 8 or 10. IvanHoe's pv's seem to be longer -- it just feeds more pv info. One engine's pv output may not be as long as another's. Therefore, with short time controls of 1 or 2 secs, the data generating engine control script will definitely bring in some non-comparable pv strings. If an engine only returns a 2-ply pv line, it shouldn't be compared to an 8-ply pv string because there will be at least 6 false mismatches.
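A rough Python sketch of that counting rule, under my own assumptions: the function names are made up, the cutoff is 8, and a missing move in a short PV is simply treated as a mismatch. The second PV pair is invented sample data, not real engine output.

```python
def match_row(pv_a, pv_b, cutoff=8):
    """Per-ply TRUE/FALSE flags; everything after the first mismatch is False."""
    row = []
    still_matching = True
    for ply in range(cutoff):
        same = (
            still_matching
            and ply < len(pv_a)
            and ply < len(pv_b)
            and pv_a[ply] == pv_b[ply]
        )
        if not same:
            still_matching = False
        row.append(same)
    return row


def per_ply_match_counts(pv_pairs, cutoff=8):
    """For each ply 1..cutoff, count the positions where the PVs still agree."""
    counts = [0] * cutoff
    for pv_a, pv_b in pv_pairs:
        for ply, matched in enumerate(match_row(pv_a, pv_b, cutoff)):
            counts[ply] += int(matched)
    return counts


pairs = [
    # PVs from the example position in this thread (agree for 7 plies).
    ("h6g7 g8g7 a2a4 b5a4 a1a4 b8d7 a4a6 c7b7".split(),
     "h6g7 g8g7 a2a4 b5a4 a1a4 b8d7 a4a6 f8b8".split()),
    # Invented pair: agrees for 4 plies, then the move at ply 8 is equal again
    # but must not be counted because the lines have already diverged.
    ("e2e4 e7e5 g1f3 b8c6 f1b5 a7a6 b5a4 g8f6".split(),
     "e2e4 e7e5 g1f3 b8c6 f1c4 f8c5 c2c3 g8f6".split()),
]

print(per_ply_match_counts(pairs))  # [2, 2, 2, 2, 1, 1, 1, 0]
```

The resulting list can go straight into a spreadsheet as "number of positions still matching at ply N".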
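One possible way to keep such non-comparable lines out of the statistics is to drop any position where either engine returned fewer plies than the cutoff, and to truncate the rest to exactly that length. The sketch below is only an illustration of that filter; `comparable_pairs` and `min_plies` are hypothetical names, and the short PV is invented.

```python
def comparable_pairs(pv_pairs, min_plies=8):
    """Keep only pairs where both PVs reach min_plies; truncate them to that length.

    Returns (kept_pairs, number_of_skipped_pairs).
    """
    kept = []
    skipped = 0
    for pv_a, pv_b in pv_pairs:
        if len(pv_a) < min_plies or len(pv_b) < min_plies:
            skipped += 1  # one engine gave too short a PV, so do not compare
            continue
        kept.append((pv_a[:min_plies], pv_b[:min_plies]))
    return kept, skipped


long_a = "h6g7 g8g7 a2a4 b5a4 a1a4 b8d7 a4a6 c7b7 d2g5 g4f3".split()
long_b = "h6g7 g8g7 a2a4 b5a4 a1a4 b8d7 a4a6 f8b8 b2b3 g4f3".split()
short_b = "h6g7 g8g7".split()  # invented 2-ply PV

kept, skipped = comparable_pairs([(long_a, long_b), (long_a, short_b)])
print(len(kept), skipped)  # 1 1
print(len(kept[0][0]))     # 8
```

Reporting the number of skipped positions alongside the results makes it visible how much data a short movetime is costing.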
Also, the script needs to deal with the varying info display. It seems some UCI engines have a lot of 'currmove' lines between the last 'bestmove' string and the last pv string. I'm sure you all know more about that.
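For what it's worth, a minimal Python sketch of pulling the last pv out of captured engine output could look like the following. It assumes standard UCI `info ... pv ...` and `bestmove` lines; the sample output is invented for illustration, and `last_pv` is a hypothetical helper name. Lines without a `pv` token, such as `currmove` updates, are simply ignored.

```python
import re

# Coordinate-notation move as used by UCI, e.g. e2e4 or e7e8q.
MOVE_RE = re.compile(r"^[a-h][1-8][a-h][1-8][qrbn]?$")


def last_pv(engine_output):
    """Return the moves of the last 'info ... pv ...' line seen before 'bestmove'."""
    pv = []
    for line in engine_output.splitlines():
        tokens = line.split()
        if not tokens:
            continue
        if tokens[0] == "bestmove":
            break
        if tokens[0] != "info" or "pv" not in tokens:
            continue  # skips currmove lines and other chatter
        if len(tokens) > 1 and tokens[1] == "string":
            continue  # free-text info line, not a real pv
        moves = []
        for tok in tokens[tokens.index("pv") + 1:]:
            if not MOVE_RE.match(tok):
                break  # stop at trailing junk after the moves
            moves.append(tok)
        if moves:
            pv = moves
    return pv


# Invented sample output, only for illustration.
sample = """\
info depth 10 score cp 35 nodes 120000 pv h6g7 g8g7 a2a4
info currmove h6g7 currmovenumber 1
info depth 11 multipv 1 score cp 31 nodes 250000 pv h6g7 g8g7 a2a4 b5a4
info currmove d2g5 currmovenumber 2
bestmove h6g7 ponder g8g7
"""

print(last_pv(sample))  # ['h6g7', 'g8g7', 'a2a4', 'b5a4']
```

Splitting into tokens rather than searching for the substring "pv" avoids tripping over `multipv`.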
Finally, I think that in the spirit of the times, it should be called a "PV Profiler", since the ippo/robbo/ivan cabal is not unlike terrorists in the computer chess domain.
It seems to me a perl junkie could script this in about 15 minutes. Just start the engine, feed commands to stdin, and slurp stdout with some regex, clean up the string, insert field separators and dump to file for a spreadsheet prog to run stat functions.
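Not Perl, but the "insert field separators and dump to file" step might look roughly like this in Python. It only covers writing the per-ply TRUE/FALSE table as CSV for a spreadsheet; starting the engine and capturing its output is left out. The function name, the column names and the cutoff of 8 are assumptions of mine.

```python
import csv
import io


def write_match_table(rows, out, max_ply=8):
    """Write one CSV line per position: the FEN plus a TRUE/FALSE flag per ply.

    rows is an iterable of (fen, pv_a, pv_b); out is any writable text file object.
    """
    writer = csv.writer(out)
    writer.writerow(["fen"] + [f"ply{p}" for p in range(1, max_ply + 1)])
    for fen, pv_a, pv_b in rows:
        flags = []
        for ply in range(max_ply):
            same = ply < len(pv_a) and ply < len(pv_b) and pv_a[ply] == pv_b[ply]
            flags.append("TRUE" if same else "FALSE")
        writer.writerow([fen] + flags)


# Position and PVs from the example in this thread.
fen = "rn3rk1/p1q2pbp/2pp1npB/1p2p3/3PP1b1/3B1N2/PPPQNPPP/R4RK1 w - - 2 11"
ih = "h6g7 g8g7 a2a4 b5a4 a1a4 b8d7 a4a6 c7b7".split()
r3 = "h6g7 g8g7 a2a4 b5a4 a1a4 b8d7 a4a6 f8b8".split()

buffer = io.StringIO()  # stand-in for an open file
write_match_table([(fen, ih, r3)], buffer)
print(buffer.getvalue())
```

This writes the raw per-ply comparison, as in the match row above; any "stop counting after the first FALSE" rule can then be applied in the spreadsheet or before writing.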
Any perl junkies out there?