Clone detection test

Don · Post by **Don** » Thu Jan 28, 2010 8:08 pm

Graham Banks wrote:Thanks Don to yourself and any other programmers involved. If you can come up with a watertight clone detection test, it's going to save a lot of future hassles.

Cheers,
Graham.

I fear there will never be a watertight test. However I believe a strong test can be developed that has a lot of statistical signficance. This is sort of like a polygraph test in that it's not admissible as a watertight proof but is still a useful tool if used in an open-minded way.

If nothing else it will give us something else to argue about

benstoker · Post by **benstoker** » Thu Jan 28, 2010 11:31 pm

Sven Schüle wrote:
Don wrote:
mjlef wrote:Another idea using this test is to compare matching move percentages as depth increases. Assuming great search depth leads to better moves, shouldn't all programs eventually converge on the same best moves? How quickly? This might even measure how much various programs improve with depth.
For this test to really be valid, I'm assuming that it's not neccessary to converge on a single move - in many positions there should be a choice of 2 or more moves that all good. Of course ultimately it will probably turn out to be the case that one good move leads to checkmate in 57 moves with best play and another leads to checkmate in 58 moves.
Or, to take it to the extreme, 30 moves draw and 8 moves lose.

It is possible that move choices of engines are not significant enough in some cases to allow that kind of comparison. Could matching the whole (final) PVs be an improvement? We would need a definition for "closeness of two PVs", and we would need access to the PVs themselves somehow. The latter should be easy.

Sven

Code: Select all

movetime:	2000												
fen pos:	rn3rk1/p1q2pbp/2pp1npB/1p2p3/3PP1b1/3B1N2/PPPQNPPP/R4RK1 w - - 2 11												
ih pv:	h6g7	g8g7	a2a4	b5a4	a1a4	b8d7	a4a6	c7b7	d2g5	g4f3	g2f3	f8e8	f1d1 ***
r3 pv:	h6g7	g8g7	a2a4	b5a4	a1a4	b8d7	a4a6	f8b8	b2b3	g4f3	g2f3	c6c5	d4d5 ***
match:	TRUE	TRUE	TRUE	TRUE	TRUE	TRUE	TRUE	FALSE	FALSE	TRUE	TRUE	FALSE	FALSE
pv ply:	1	2	3	4	5	6	7	8	9	10	11	12	13
													
RSQ	0.388889												
Binomial Distribution	0.95385742

Do same thing for 50 or 100 positions. Cut off pv data at 8 or so.
Then do gobs of stats.
With this method, you have 8 or more data points per position, instead of merely one, with with to do statistical wizardry.

Sven · Post by **Sven** » Thu Jan 28, 2010 11:58 pm

benstoker wrote:
Sven Schüle wrote:
Don wrote:
mjlef wrote:Another idea using this test is to compare matching move percentages as depth increases. Assuming great search depth leads to better moves, shouldn't all programs eventually converge on the same best moves? How quickly? This might even measure how much various programs improve with depth.
For this test to really be valid, I'm assuming that it's not neccessary to converge on a single move - in many positions there should be a choice of 2 or more moves that all good. Of course ultimately it will probably turn out to be the case that one good move leads to checkmate in 57 moves with best play and another leads to checkmate in 58 moves.
Or, to take it to the extreme, 30 moves draw and 8 moves lose.

It is possible that move choices of engines are not significant enough in some cases to allow that kind of comparison. Could matching the whole (final) PVs be an improvement? We would need a definition for "closeness of two PVs", and we would need access to the PVs themselves somehow. The latter should be easy.

Sven
Code: Select all
movetime:	2000												
fen pos:	rn3rk1/p1q2pbp/2pp1npB/1p2p3/3PP1b1/3B1N2/PPPQNPPP/R4RK1 w - - 2 11												
ih pv:	h6g7	g8g7	a2a4	b5a4	a1a4	b8d7	a4a6	c7b7	d2g5	g4f3	g2f3	f8e8	f1d1 ***
r3 pv:	h6g7	g8g7	a2a4	b5a4	a1a4	b8d7	a4a6	f8b8	b2b3	g4f3	g2f3	c6c5	d4d5 ***
match:	TRUE	TRUE	TRUE	TRUE	TRUE	TRUE	TRUE	FALSE	FALSE	TRUE	TRUE	FALSE	FALSE
pv ply:	1	2	3	4	5	6	7	8	9	10	11	12	13
													
RSQ	0.388889												
Binomial Distribution	0.95385742		
Do same thing for 50 or 100 positions. Cut off pv data at 8 or so.
Then do gobs of stats.
With this method, you have 8 or more data points per position, instead of merely one, with with to do statistical wizardry.

Yes, I thought about something in that direction. Only all TRUE's below (right to) the first FALSE are quite meaningless since it is already a different PV, and if later on both PVs contain the string "c2c3" then it may even be different pieces for that move.

Cutting PVs after 8 moves is possible, of course. But at today's search depths we could also say 16 or 32. A measurement to express the degree of closeness of two PVs would have to be developed. E.g. two 2-ply PVs being 100% equal have less significance IMO than two 16-ply PVs where the first 14 plies are equal. You propose RSQ, which I am not familiar with, and binomial distribution. But one could as well count all matching PV moves from ply 0 until the first mismatch, and always stop at a given max ply, say 16, and finally divide number of matching PV moves by 16 to get a "PV closeness percentage". One could also say 1 match => 50%, 2 => 75%, 3 => 87,5% and so on.

Just some thoughts. Of course this introduces more complexity, I know.

Sven

benstoker · Post by **benstoker** » Fri Jan 29, 2010 1:48 am

Sven Schüle wrote:
benstoker wrote:
Sven Schüle wrote:
Don wrote:
mjlef wrote:Another idea using this test is to compare matching move percentages as depth increases. Assuming great search depth leads to better moves, shouldn't all programs eventually converge on the same best moves? How quickly? This might even measure how much various programs improve with depth.
For this test to really be valid, I'm assuming that it's not neccessary to converge on a single move - in many positions there should be a choice of 2 or more moves that all good. Of course ultimately it will probably turn out to be the case that one good move leads to checkmate in 57 moves with best play and another leads to checkmate in 58 moves.
Or, to take it to the extreme, 30 moves draw and 8 moves lose.

It is possible that move choices of engines are not significant enough in some cases to allow that kind of comparison. Could matching the whole (final) PVs be an improvement? We would need a definition for "closeness of two PVs", and we would need access to the PVs themselves somehow. The latter should be easy.

Sven
Code: Select all
movetime:	2000												
fen pos:	rn3rk1/p1q2pbp/2pp1npB/1p2p3/3PP1b1/3B1N2/PPPQNPPP/R4RK1 w - - 2 11												
ih pv:	h6g7	g8g7	a2a4	b5a4	a1a4	b8d7	a4a6	c7b7	d2g5	g4f3	g2f3	f8e8	f1d1 ***
r3 pv:	h6g7	g8g7	a2a4	b5a4	a1a4	b8d7	a4a6	f8b8	b2b3	g4f3	g2f3	c6c5	d4d5 ***
match:	TRUE	TRUE	TRUE	TRUE	TRUE	TRUE	TRUE	FALSE	FALSE	TRUE	TRUE	FALSE	FALSE
pv ply:	1	2	3	4	5	6	7	8	9	10	11	12	13
													
RSQ	0.388889												
Binomial Distribution	0.95385742		
Do same thing for 50 or 100 positions. Cut off pv data at 8 or so.
Then do gobs of stats.
With this method, you have 8 or more data points per position, instead of merely one, with with to do statistical wizardry.
Yes, I thought about something in that direction. Only all TRUE's below (right to) the first FALSE are quite meaningless since it is already a different PV, and if later on both PVs contain the string "c2c3" then it may even be different pieces for that move.

Cutting PVs after 8 moves is possible, of course. But at today's search depths we could also say 16 or 32. A measurement to express the degree of closeness of two PVs would have to be developed. E.g. two 2-ply PVs being 100% equal have less significance IMO than two 16-ply PVs where the first 14 plies are equal. You propose RSQ, which I am not familiar with, and binomial distribution. But one could as well count all matching PV moves from ply 0 until the first mismatch, and always stop at a given max ply, say 16, and finally divide number of matching PV moves by 16 to get a "PV closeness percentage". One could also say 1 match => 50%, 2 => 75%, 3 => 87,5% and so on.

Just some thoughts. Of course this introduces more complexity, I know.

Sven

Dang, you are obviously right about the data being meaningless after first false! The "PV closeness" is going therefore key off of the number TRUE's per position.

Once the data is collected, all kinds of statistics can be done. Over an adequate and sufficient position population sample, there will be number of matches @ ply 1, ply 2. There needs to be a cutoff for practical purposes. If the pv match, f.e., stops at 4, then 5 through 8 are counted as false.

Why cut off at 6? Start a UCI engine from a terminal and feed it a fen. Then try different movetimes. Sometimes the pv would only be 2 ply in r3. Don't know why. Sometimes it stops at 8 or 10. IvanHoe's pv's seem to be longer -- it just feeds more pv info. One engine's pv output may not be as long as another's. Therefore, with short time controls of 1 or 2 secs, the data generating engine control script will definitely bring in some non-comparable pv strings. If an engine only returns a 2 pv line, it shouldn't be compared to an 8 pv string b/c there will be at least 6 false negatives.

Also, the script needs to deal with the varying info display. It seems some UCI engines have a lot of 'currmove' lines between the last 'bestmove' string and the last pv string. I'm sure you all know more about that.

Finally, I think that in the spirit of the times, it should be called a "PV Profiler", since the ippo/robbo/ivan cabal is not unlike terrorists in the computer chess domain.

It seems to me a perl junkie could script this in about 15 minutes. Just start the engine, feed commands to stdin, and slurp stdout with some regex, clean up the string, insert field separators and dump to file for a spreadsheet prog to run stat functions.

Any perl junkies out there?

BubbaTough · Post by **BubbaTough** » Fri Jan 29, 2010 2:02 am

An argument could be made...I am not sure I would make it but it could be made, that only the move made and PERHAPS the ponder move should count because other aspects could be faked without hurting engine strength.

Also, I would keep the list of positions you are using to yourself. They could be useful to those trying to foil the test (maybe not, but maybe) and those disinclined to to give credence to your results would be unlikely to be swayed by having the positions themselves.

-Sam

Hart · Post by **Hart** » Sat Jan 30, 2010 8:14 am

Code: Select all

                         +-----------------------------------------Strelka_3R
                     +---6 
                     !   +-----------------------------------------Rybka_3   
              +------9 
              !      !                +----------------------------FeBird_1.0
              !      +----------------3 
              !                       +----------------------------RLito_85g3
           +-12  
           !  !                        +--------------------------Naum_4.1  
           !  !     +------------------2 
           !  !     !                  +--------------------------Naum_4.0  
        +-13  +----11  
        !  !        !                             +----------------Rybka_1_64
        !  !        +-----------------------------1 
        !  !                                      +----------------Rybka_1_32
        !  !  
     +-14  +------------------------------------------------------Komodo_1.0
     !  !  
     !  !             +------------------------------------------Sfish1.6.2
     !  !          +--7 
     !  !          !  !      +-----------------------------------Sfish1.5.1
     !  +---------10  +------4 
  +-15             !         +-----------------------------------Sfish_1.4 
  !  !             !  
  !  !             +----------------------------------------------Glrung_2.2
  !  !  
  !  !              +------------------------------------------Fruit_2.1 
-16  +--------------8 
  !                 !   +--------------------------------------Toga1.4.5c
  !                 +---5 
  !                     +--------------------------------------Toga_1.3.1
  !  
  +-----------------------------------------------------------spark_0.3a

Here is the one I ran from 1600 positions. I think it is easy to interpret which are the "closest". How do you count the branches to determine which are the "farthest"?

michiguel · Post by **michiguel** » Sat Jan 30, 2010 9:19 am

Hart wrote:

Code: Select all

                         +-----------------------------------------Strelka_3R
                     +---6 
                     !   +-----------------------------------------Rybka_3   
              +------9 
              !      !                +----------------------------FeBird_1.0
              !      +----------------3 
              !                       +----------------------------RLito_85g3
           +-12  
           !  !                        +--------------------------Naum_4.1  
           !  !     +------------------2 
           !  !     !                  +--------------------------Naum_4.0  
        +-13  +----11  
        !  !        !                             +----------------Rybka_1_64
        !  !        +-----------------------------1 
        !  !                                      +----------------Rybka_1_32
        !  !  
     +-14  +------------------------------------------------------Komodo_1.0
     !  !  
     !  !             +------------------------------------------Sfish1.6.2
     !  !          +--7 
     !  !          !  !      +-----------------------------------Sfish1.5.1
     !  +---------10  +------4 
  +-15             !         +-----------------------------------Sfish_1.4 
  !  !             !  
  !  !             +----------------------------------------------Glrung_2.2
  !  !  
  !  !              +------------------------------------------Fruit_2.1 
-16  +--------------8 
  !                 !   +--------------------------------------Toga1.4.5c
  !                 +---5 
  !                     +--------------------------------------Toga_1.3.1
  !  
  +-----------------------------------------------------------spark_0.3a

Here is the one I ran from 1600 positions. I think it is easy to interpret which are the "closest". How do you count the branches to determine which are the "farthest"?

Great job Michael.

Interestingly, with a complete different set of positions, you get the same type of results that Don got. We still have to do a bootstrap analysis, but this is a strong hint that this could be significant.

If you have to go more to the left to change branches, the difference is bigger. For instance, spark is the one that probably is most different than the rest. However, this may not be significant. I think that nodes 12 to 16 are "fuzzy". The bootstrap analysis will tell us where the fuzziness ends (coming from the left).

From this plot the following families are possible
1) Spark
2) Fruit and Togas
3) Glaurung and Stockfishes
4) Komodo
5) Rybka1 and Naum
6) Rybka3 and Strelka3, robbolitos

From this perspective, R1 style is not related to the fruits and the closest relative to the Ippos style is R3.

Miguel
PS: I want to make clear that this test does not identify clones, but detects similarity of decision making.

Hart · Post by **Hart** » Sat Jan 30, 2010 10:11 pm

Added Strelka 2B, Fruit 2.1 (with 8x more time) and Rybka 3 (depth = 1 (or 4)). The latter failed to group which is not too surprising as the average depth went from something like 8 to 1. However, I thought giving Fruit 2.1 8x more time would possibly group it with stronger engines, but it is still "closest" to plain ol' Fruit 2.1. In other words, the only engine I could make significantly diverge from their group was Rybka 3 with depth = 1, which was probably a reduction in ~1000 Elo (guesstimate).

How would you bootstrap this?

michiguel · Post by **michiguel** » Tue Feb 02, 2010 7:38 am

Hart wrote:

Added Strelka 2B, Fruit 2.1 (with 8x more time) and Rybka 3 (depth = 1 (or 4)). The latter failed to group which is not too surprising as the average depth went from something like 8 to 1. However, I thought giving Fruit 2.1 8x more time would possibly group it with stronger engines, but it is still "closest" to plain ol' Fruit 2.1. In other words, the only engine I could make significantly diverge from their group was Rybka 3 with depth = 1, which was probably a reduction in ~1000 Elo (guesstimate).

How would you bootstrap this?

This is the result of a boostrapped analysis (actually, a modification of half-block jackknifing, but it is the same idea). The concept is the following. The whole dataset is 1000 positions. Rather than taking all the data, we take 500 randomly chosen. We do the neighbor joining analysis, and ask the question: Is Rybka, Ippo, and Robbo separated as one family? (edit: the software actually ask the question with all the branches that appear)The anwer could be yes or no. We repeat the process, but this time we chose another 500 positions, again, randomly chosen. We repeat the process and ask the same question. We do this 1000 times. If the signal/ratio do determine that Rybka, Robbo, Ippo are one family, most of the time that we do this, the answer should be yes.

Look at the 1000 that is in the branch that separates Rybka, Ippo, Robbo to the rest. That means every single time they were one separated family, despite of the different re-sampling. That is super significant. Conclusion: the move decision making of Rybka is more similar to Robbo/Ippo than any of the other engines tested by Don.

Code: Select all

CONSENSUS TREE:
the numbers on the branches indicate the number
of times the partition of the species into the two sets
which are separated by that branch occurred
among the trees, out of 1000.00 trees

                                     +------doch-1.0
                              +925.0-|
                       +-1000-|      +------doch-1.2
                       |      |
                       |      +-------------komodo
                +931.0-|
                |      |             +------ippolito
                |      |      +-1000-|
                |      +-1000-|      +------robbo
                |             |
                |             +-------------rybka
         +-1000-|
         |      |                    +------sf14
         |      |             +920.0-|
         |      |             |      +------sf15
         |      |      +894.0-|
  +------|      |      |      |      +------sf16
  |      |      +-1000-|      +-1000-|
  |      |             |             +------sf strong
  |      |             |
  |      |             +--------------------glaurung
  |      |
  |      +----------------------------------fruit
  |
  +-----------------------------------------toga2

Can you show me what kind of format your solution file has? For instance, each engine has a file with the solutions like this

Code: Select all

Program successfully attached
     Level: go movetime 100
 Positions: 1 - 1000

      1  d8c7  N   0.0000 out of   1.0000 =   0.0000
      2  c8g4  N   0.0000 out of   2.0000 =   0.0000
      3  h8g8  N   0.0000 out of   3.0000 =   0.0000
      4  a3a4  Y   1.0000 out of   4.0000 =  25.0000
      5  c8f8  N   1.0000 out of   5.0000 =  20.0000

I could modify the code to accommodate that format and and send it to you. Alternatively, if you send me the solution files, I could do the analysis.

Miguel

Hart · Post by **Hart** » Tue Feb 02, 2010 8:14 am

Sure, I'll point you to my data and then I'd be interested in seeing a similar analysis. This a new set with 20 engines and 6400 positions.
The data is in polyglot epd-test output format with some changes:

Code: Select all

:bright_04a
 1    1    1 -  5   0.01     14501  +0.39 Bd3 Rac8 f4 Rfe8 Bf5 exf4 Bxd7 fxe3 Bxe8 Rxe8
 1    2    2 -  5   0.02     13608  -0.16 Qb6 Bxa8 Rxa8 Ne4 Ne8 Re3 Rad8 Rd3 Rxd3 cxd3 b4 d4 bxa3 bxa3
 0    2    3 - 10   1.22   2171691  +0.16 Na6 Re1 Nb4 Qc4 Qd6 Bc6 Nxc6 dxc6 b5 Qxb5 Rxd4 Nxd4 Qh2+ Kf1 Qh1+ Ke2 Qxh6
:strelka_2
 1    1    1 -  6   0.02     16535  +0.24 Bd3 Qc7 h5 Nc5 Bf5 Qb7 Qxb7 Nxb7 h6 g6 Be4 Rab8
 0    1    2 - 11   1.27   2181503  -0.04 Qd8 Bxa8 Qxa8 Ne4 Qd8 b4 Rd5 Rc1 g6 c4 Nxe4 Rxe4 Rd3 Re3
 1    2    3 - 10   0.64   1110013  +0.02 Bf4 Re1 Nd7 Bxf4 Rxf4 Qc6 Nf8 Qc4 Rc8 Bc6 Qd6 Qa6 Qc7

http://www.mediafire.com/?lnq3mzn1fmo

Also attached is the C++ program I use to build the table for input into "Phylip Neighbor". It should be easy enough to modify for output format of your choosing. See line 55.

Clone detection test

Re: Clone detection test

Re: Clone detection test

Re: Clone detection test

Re: Clone detection test

Re: Clone detection test

Re: Clone detection test

Re: Clone detection test

Re: Clone detection test

Re: Clone detection test

Re: Clone detection test