Sven Schüle wrote: benstoker wrote: Sven Schüle wrote: Don wrote: mjlef wrote:
Another idea using this test is to compare matching move percentages as depth increases. Assuming greater search depth leads to better moves, shouldn't all programs eventually converge on the same best moves? How quickly? This might even measure how much various programs improve with depth.
For this test to really be valid, I'm assuming that it's not necessary to converge on a single move - in many positions there should be a choice of 2 or more moves that are all good. Of course, ultimately it will probably turn out that one good move leads to checkmate in 57 moves with best play and another leads to checkmate in 58 moves.
 
Or, to take it to the extreme, 30 moves draw and 8 moves lose.
It is possible that move choices of engines are not significant enough in some cases to allow that kind of comparison. Could matching the whole (final) PVs be an improvement? We would need a definition for "closeness of two PVs", and we would need access to the PVs themselves somehow. The latter should be easy.
Sven
 
Code:
movetime:	2000												
fen pos:	rn3rk1/p1q2pbp/2pp1npB/1p2p3/3PP1b1/3B1N2/PPPQNPPP/R4RK1 w - - 2 11												
ih pv:	h6g7	g8g7	a2a4	b5a4	a1a4	b8d7	a4a6	c7b7	d2g5	g4f3	g2f3	f8e8	f1d1 ***
r3 pv:	h6g7	g8g7	a2a4	b5a4	a1a4	b8d7	a4a6	f8b8	b2b3	g4f3	g2f3	c6c5	d4d5 ***
match:	TRUE	TRUE	TRUE	TRUE	TRUE	TRUE	TRUE	FALSE	FALSE	TRUE	TRUE	FALSE	FALSE
pv ply:	1	2	3	4	5	6	7	8	9	10	11	12	13
													
RSQ	0.388889												
Binomial Distribution	0.95385742		
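The post does not say how the two figures under the table were computed. As a hedged reading, both can be reproduced from the match row alone: RSQ as the squared Pearson correlation between the match row (TRUE = 1, FALSE = 0) and the ply row, and the binomial figure as the cumulative probability of at most 9 matches in 13 trials with p = 0.5. That interpretation is inferred from the numbers, not stated by the poster, and the function names below are made up for illustration.

```python
from math import comb

# Match row from the table above (TRUE = 1, FALSE = 0) and its ply numbers.
match = [1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0]
ply = list(range(1, 14))

def r_squared(xs, ys):
    """Squared Pearson correlation between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)

def binom_cdf(k, n, p):
    """Probability of at most k successes in n trials with success chance p."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))

rsq = r_squared(ply, match)
cdf = binom_cdf(sum(match), len(match), 0.5)  # assumption: p = 0.5
print(round(rsq, 6), round(cdf, 8))  # 0.388889 0.95385742
```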
Do the same thing for 50 or 100 positions. Cut off the pv data at 8 or so.
Then do gobs of stats.
With this method, you have 8 or more data points per position, instead of merely one, with which to do statistical wizardry.
 
Yes, I thought about something in that direction. However, all TRUEs after (to the right of) the first FALSE are quite meaningless, since the PVs have already diverged at that point, and if both PVs later contain the string "c2c3", it may even be different pieces making that move.
Cutting PVs after 8 moves is possible, of course. But at today's search depths we could also say 16 or 32. A measurement to express the degree of closeness of two PVs would have to be developed. E.g. two 2-ply PVs being 100% equal have less significance IMO than two 16-ply PVs where the first 14 plies are equal. You propose RSQ, which I am not familiar with, and binomial distribution. But one could as well count all matching PV moves from ply 0 until the first mismatch, always stop at a given max ply, say 16, and finally divide the number of matching PV moves by 16 to get a "PV closeness percentage". One could also say 1 match => 50%, 2 => 75%, 3 => 87.5% and so on.
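The two closeness measures described above can be sketched in a few lines. This is only an illustration, assuming PVs are lists of move strings in coordinate notation; the function names are made up, and the halving variant simply reads "1 match => 50%, 2 => 75%" as 1 - 0.5^n.

```python
def matching_prefix(pv_a, pv_b, max_ply=16):
    """Count equal moves from the start of both PVs up to the first mismatch."""
    count = 0
    for move_a, move_b in zip(pv_a[:max_ply], pv_b[:max_ply]):
        if move_a != move_b:
            break
        count += 1
    return count

def linear_closeness(pv_a, pv_b, max_ply=16):
    """Matching prefix length divided by the fixed maximum ply."""
    return matching_prefix(pv_a, pv_b, max_ply) / max_ply

def halving_closeness(pv_a, pv_b, max_ply=16):
    """1 match -> 50%, 2 -> 75%, 3 -> 87.5%, and so on."""
    return 1 - 0.5 ** matching_prefix(pv_a, pv_b, max_ply)

# The two PVs from the table quoted above.
ih = "h6g7 g8g7 a2a4 b5a4 a1a4 b8d7 a4a6 c7b7 d2g5 g4f3 g2f3 f8e8 f1d1".split()
r3 = "h6g7 g8g7 a2a4 b5a4 a1a4 b8d7 a4a6 f8b8 b2b3 g4f3 g2f3 c6c5 d4d5".split()

print(matching_prefix(ih, r3))    # 7
print(linear_closeness(ih, r3))   # 0.4375
print(halving_closeness(ih, r3))  # 0.9921875
```

The later matches at plies 10 and 11 are ignored here, which is the point of stopping at the first mismatch.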
Just some thoughts. Of course this introduces more complexity, I know.
Sven
 
Dang, you are obviously right about the data being meaningless after the first FALSE! The "PV closeness" is therefore going to key off the number of TRUEs per position.
Once the data is collected, all kinds of statistics can be done. Over an adequate and sufficient sample of positions, there will be a number of matches at ply 1, at ply 2, and so on. There needs to be a cutoff for practical purposes. If the pv match stops at 4, for example, then 5 through 8 are counted as false.
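A rough sketch of that counting rule, assuming a cutoff of 8 plies and PVs stored as lists of move strings. Everything after the first mismatch is forced to FALSE, and the per-ply match rate is then averaged over all positions. The function names are hypothetical, and the second position in the example is made up purely to have more than one row.

```python
def match_flags(pv_a, pv_b, cutoff=8):
    """TRUE/FALSE per ply up to the cutoff; all plies after the first mismatch are FALSE."""
    flags = []
    still_matching = True
    for ply in range(cutoff):
        same = ply < len(pv_a) and ply < len(pv_b) and pv_a[ply] == pv_b[ply]
        still_matching = still_matching and same
        flags.append(still_matching)
    return flags

def per_ply_match_rate(pv_pairs, cutoff=8):
    """Fraction of positions whose PVs still agree at each ply."""
    totals = [0] * cutoff
    for pv_a, pv_b in pv_pairs:
        for ply, flag in enumerate(match_flags(pv_a, pv_b, cutoff)):
            totals[ply] += flag
    return [total / len(pv_pairs) for total in totals]

# First pair: the PVs from the table in this thread. Second pair: invented data.
ih = "h6g7 g8g7 a2a4 b5a4 a1a4 b8d7 a4a6 c7b7".split()
r3 = "h6g7 g8g7 a2a4 b5a4 a1a4 b8d7 a4a6 f8b8".split()
made_up_a = "e2e4 e7e5 g1f3".split()
made_up_b = "e2e4 c7c5 g1f3".split()

rates = per_ply_match_rate([(ih, r3), (made_up_a, made_up_b)])
print(rates)
```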
Why cut off at 6?  Start a UCI engine from a terminal and feed it a fen.  Then try different movetimes.  Sometimes the pv would only be 2 ply in r3.  Don't know why.  Sometimes it stops at 8 or 10.  IvanHoe's PVs seem to be longer; it just feeds more pv info.  One engine's pv output may not be as long as another's.  Therefore, with short time controls of 1 or 2 secs, the data-generating engine control script will definitely bring in some non-comparable pv strings.  If an engine only returns a 2-ply pv line, it shouldn't be compared to an 8-ply pv string because there will be at least 6 false negatives.
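One simple way to keep those false negatives out of the statistics is to drop any position where either engine returned fewer moves than the cutoff, and report how many were dropped. This is a sketch of that idea only; the names and the minimum length of 8 are assumptions.

```python
def comparable(pv_a, pv_b, min_len=8):
    """Both PVs must be at least min_len plies long to be compared."""
    return len(pv_a) >= min_len and len(pv_b) >= min_len

def split_comparable(pv_pairs, min_len=8):
    """Return the comparable pairs and the number of pairs that were skipped."""
    kept = [pair for pair in pv_pairs if comparable(pair[0], pair[1], min_len)]
    return kept, len(pv_pairs) - len(kept)

long_pv = "h6g7 g8g7 a2a4 b5a4 a1a4 b8d7 a4a6 c7b7".split()
short_pv = long_pv[:2]  # an engine that only reported 2 plies

kept, skipped = split_comparable([(long_pv, long_pv), (short_pv, long_pv)])
print(len(kept), skipped)  # 1 1
```

Reporting the skipped count matters: if one engine is skipped far more often than the other at a given movetime, the sample is no longer the same set of positions.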
Also, the script needs to deal with the varying info display.  It seems some UCI engines have a lot of 'currmove' lines between the last 'bestmove' string and the last pv string.  I'm sure you all know more about that.  
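A possible way to handle the varying info display is to ignore every line that carries no pv and keep only the last pv seen before 'bestmove'. The sketch below works on plain text lines and assumes, as is usual in practice, that the pv moves run to the end of the info line. The sample output is invented for illustration and is not a capture from any engine.

```python
def last_pv_before_bestmove(lines):
    """Return the moves of the last 'info ... pv ...' line seen before 'bestmove'."""
    pv = []
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue
        if tokens[0] == "bestmove":
            break
        # 'currmove' lines have no 'pv' token, so they are skipped here.
        if tokens[0] == "info" and "pv" in tokens:
            pv = tokens[tokens.index("pv") + 1:]
    return pv

# Made-up engine output in UCI format.
sample = [
    "info depth 11 score cp 31 nodes 90000 pv h6g7 g8g7 a2a4",
    "info depth 12 currmove h6g7 currmovenumber 1",
    "info depth 12 score cp 35 nodes 150000 pv h6g7 g8g7 a2a4 b5a4",
    "info depth 13 currmove d4e5 currmovenumber 2",
    "bestmove h6g7 ponder g8g7",
]

print(last_pv_before_bestmove(sample))
```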
Finally, I think that in the spirit of the times, it should be called a "PV Profiler", since the ippo/robbo/ivan cabal is not unlike terrorists in the computer chess domain.
It seems to me a perl junkie could script this in about 15 minutes.  Just start the engine, feed commands to stdin, and slurp stdout with some regex, clean up the string, insert field separators and dump to file for a spreadsheet prog to run stat functions.
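For anyone who prefers Python to perl, the same recipe might look roughly like this. It is an untested-against-real-engines sketch: the engine path is a placeholder, the function names are made up, and it uses only the standard UCI commands (uci, isready, position fen, go movetime, quit).

```python
import subprocess

def tsv_row(label, pv, cutoff=8):
    """One tab-separated row: a label followed by the first `cutoff` PV moves."""
    return "\t".join([label] + list(pv[:cutoff]))

def engine_pv(engine_path, fen, movetime_ms=2000):
    """Start a UCI engine, search one position, return the last PV it printed."""
    proc = subprocess.Popen([engine_path], stdin=subprocess.PIPE,
                            stdout=subprocess.PIPE, text=True, bufsize=1)

    def send(command):
        proc.stdin.write(command + "\n")
        proc.stdin.flush()

    def wait_for(word):
        for line in proc.stdout:
            if line.strip() == word:
                return

    send("uci")
    wait_for("uciok")
    send("isready")
    wait_for("readyok")
    send("position fen " + fen)
    send("go movetime %d" % movetime_ms)

    pv = []
    for line in proc.stdout:
        tokens = line.split()
        if tokens[:1] == ["bestmove"]:
            break
        if tokens[:1] == ["info"] and "pv" in tokens:
            pv = tokens[tokens.index("pv") + 1:]

    send("quit")
    proc.wait()
    return pv

def dump_rows(path, rows):
    """Write the rows to a file that a spreadsheet program can import."""
    with open(path, "w") as handle:
        handle.write("\n".join(rows) + "\n")

# Hypothetical usage (paths are placeholders, not real files):
# fen = "rn3rk1/p1q2pbp/2pp1npB/1p2p3/3PP1b1/3B1N2/PPPQNPPP/R4RK1 w - - 2 11"
# rows = [tsv_row("ih pv:", engine_pv("./engine_a", fen)),
#         tsv_row("r3 pv:", engine_pv("./engine_b", fen))]
# dump_rows("pv_data.tsv", rows)
```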
Any perl junkies out there?