Clone detection test

rjgibert · Post by **rjgibert** » Wed Jan 27, 2010 9:56 pm

Don wrote: Suppose you ran 1000 random positions on many different versions of
the same program, then ran the same positions on many versions of
other programs. What could be deduced statistically from how often
the various program versions picked the same move?

Such a thing could serve as a crude clone detector. I ran such an
experiment on many different programs to get a kind of measurement of
correlation between different program "families" and different versions
within the same family of programs, and the result is surprising.

The 1000 positions are from a set of positions that Larry Kaufman and
I created long ago that are designed to compare chess programs to
humans in playing style. So few problems are blatantly tactical, and
in many of these positions the choice of move is going to be based on
preference more than raw strength.

The test compares any two programs by how often they pick the same
move, out of a sample of 1000 positions. I run each program to the
same time limit which in this case is 1/10 of a second.
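A minimal sketch of the comparison described above, assuming each engine's output has already been reduced to one chosen move per position. The function names and the toy move lists are hypothetical, not taken from Don's tester.

```python
from itertools import combinations

def agreement_score(moves_a, moves_b):
    """Count the positions where both engines chose the same move."""
    if len(moves_a) != len(moves_b):
        raise ValueError("both engines must be run on the same positions")
    return sum(a == b for a, b in zip(moves_a, moves_b))

def pairwise_table(results):
    """Return (score, engine, engine) rows, most similar pair first."""
    rows = [
        (agreement_score(results[a], results[b]), a, b)
        for a, b in combinations(results, 2)
    ]
    return sorted(rows, reverse=True)

# Hypothetical toy data: one move per position, in coordinate notation.
engine_x = ["e2e4", "g1f3", "d2d4", "c2c4", "e1g1"]
engine_y = ["e2e4", "g1f3", "c2c4", "c2c4", "f1e1"]

for score, a, b in pairwise_table({"x": engine_x, "y": engine_y}):
    print(f"{score:5d}  {a:<16}  {b:<16}")
```

With 1000 positions per engine, the score column would be directly comparable to the table in the post.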

Below is a table of the results, ordered from the most correlated to
the least correlated programs. I ran various versions of my own
program, all the stockfish versions including glaurung, and all the
so-called Rybka clones as well as Rybka herself.

What is interesting in the table is that if you assume that ippolito,
Robbolito and Rybka are in the same family of programs, and that any
pair of programs with a score above 594 is to be considered clones, then
my test gets it right in every single case. It identifies families of
programs and non-related programs accurately.
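One way to read "identifies families" is to link every pair that scores above the cutoff and collect the connected groups. The sketch below does that with a small union-find over a subset of rows copied from the table. The linking rule (single linkage) is an assumption made for illustration; the post only states the 594 cutoff.

```python
THRESHOLD = 594  # cutoff stated in the post; pairs above it count as related

# A subset of rows copied from the table: (score, engine, engine).
ROWS = [
    (846, "sf_strong", "sf16"), (758, "doch-1.2", "doch-1.0"),
    (734, "robbo", "ippolito"), (720, "komodo", "doch-1.2"),
    (706, "sf15", "sf14"),      (671, "sf16", "sf15"),
    (655, "rybka", "robbo"),    (639, "sf14", "glaurung"),
    (594, "rybka", "doch-1.2"), (581, "rybka", "komodo"),
    (571, "sf15", "robbo"),     (519, "sf16", "komodo"),
]

def families(rows, threshold):
    """Group engines that are connected by any pair scoring above threshold."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for score, a, b in rows:
        root_a, root_b = find(a), find(b)  # registers both engines
        if score > threshold and root_a != root_b:
            parent[root_a] = root_b

    groups = {}
    for name in list(parent):
        groups.setdefault(find(name), set()).add(name)
    return sorted(groups.values(), key=sorted)

for group in families(ROWS, THRESHOLD):
    print(sorted(group))
```

On these rows the grouping comes out as the doch/komodo, rybka/robbo/ippolito and stockfish/glaurung families; the 594 pair (rybka, doch-1.2) is not linked because the rule is strictly "above".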

The most correlated pair of programs that we know are not clones of each
other is rybka and doch-1.2. However, these 2 programs do have a
program author in common.

Just in case this could be interpreted as a strength tester, I added a
version of stockfish 1.6 which I call sf_strong. It is stockfish 1.6
run at 1/4 of a second instead of 1/10 of a second. This was a sanity
test to determine if stockfish would look more like Rybka if it was
run at a level where it was closer to Rybka's chess strength. As you
can see, this did not foil my test.

I'm not going to attach any special significance to this test - look at
the data and draw your own conclusions. I don't pretend it's
scientifically accurate or anything like that. It is whatever it is.
  846  sf_strong         sf16            
  758  doch-1.2          doch-1.0        
  734  robbo             ippolito        
  720  komodo            doch-1.2        
  706  sf15              sf14            
  687  komodo            doch-1.0        
  671  sf16              sf15            
  655  rybka             robbo           
  649  sf16              sf14            
  644  rybka             ippolito        
  639  sf14              glaurung        
  638  sf_strong         sf15            
  630  sf15              glaurung        
  617  sf_strong         sf14            
  600  sf_strong         glaurung        
  595  sf16              glaurung        

  594  rybka             doch-1.2        
  582  rybka             doch-1.0        
  581  rybka             komodo          
  579  ippolito          doch-1.0        
  573  robbo             komodo          
  571  sf15              robbo           
  571  ippolito          doch-1.2        
  569  sf15              rybka           
  568  robbo             doch-1.2        
  565  sf_strong         rybka           
  563  sf14              ippolito        
  560  komodo            ippolito        
  559  sf14              robbo           
  559  robbo             doch-1.0        
  558  sf14              rybka           
  557  sf_strong         robbo           
  556  sf15              ippolito        
  554  sf16              rybka           
  554  sf16              robbo           
  554  sf15              komodo          
  551  sf14              doch-1.0        
  549  sf15              doch-1.0        
  544  glaurung          doch-1.0        
  542  rybka             glaurung        
  541  sf16              doch-1.0        
  538  sf16              ippolito        
  536  sf_strong         doch-1.0        
  536  sf14              komodo          
  532  komodo            glaurung        
  531  sf_strong         ippolito        
  531  sf15              doch-1.2        
  528  glaurung          doch-1.2        
  527  sf_strong         komodo          
  527  sf16              doch-1.2        
  525  robbo             glaurung        
  524  sf_strong         doch-1.2        
  523  sf14              doch-1.2        
  521  ippolito          glaurung        
  519  sf16              komodo          

The individuals who will find your test the most useful are the cloners. With it, the easiest and most effective types of changes in their clones will become well known and the types of changes that are not worth the time and effort will also become well known. IOW, you will be making the cloners more efficient and more competent.

BTW, am I right in stating that sf_strong & sf14 are highly related? And that sf15 & robbo are highly independent? If true on both counts, then the respective scores for these pairings, 617 & 571, seem too close for comfort.
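A rough way to put a number on "too close for comfort" is to treat each of the 1000 positions as an independent yes/no trial and estimate the sampling noise in each score. That independence, and treating the two scores as independent of each other, are simplifying assumptions (both pairings share the same positions), so this is only a back-of-the-envelope sketch.

```python
import math

def std_error(matches, n):
    """Binomial standard error of a match count out of n positions."""
    p = matches / n
    return n * math.sqrt(p * (1 - p) / n)

n = 1000
related, unrelated = 617, 571  # sf_strong/sf14 and sf15/robbo from the table

gap = related - unrelated
se_gap = math.hypot(std_error(related, n), std_error(unrelated, n))
z = gap / se_gap

print(f"gap = {gap}, standard error of gap ~ {se_gap:.1f}, z ~ {z:.1f}")
```

Under these assumptions the 46-point gap is about two standard errors wide, so a sample of 1000 positions separates these two pairings only weakly.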

Another thing that bothers me about your test is that calibrating it (choosing which positions to use and which to leave out) requires making assumptions about which programs are clones and which are not. How else could it be done? A lawyer would argue that all you have done with your test is pick positions that tend to corroborate your belief and omit positions that tend not to, and that an easier way to determine your belief is to simply ask you.

Dann Corbit · Post by **Dann Corbit** » Wed Jan 27, 2010 10:00 pm

Don wrote: Suppose you ran 1000 random positions on many different versions of
the same program, then ran the same positions on many versions of
other programs. What could be deduced statistically from how often
the various program versions picked the same move?

[snip]
Here is what I suspect:
The strongest programs will pick the same moves. I guess that we probably won't see that effect among the weakest programs since they choose bad moves all the time. So this idea will label all very strong programs as probable clones. So what is the threshold we should set to say, "Yes, this is a clone?"

P.S.
If I wanted to fool this detector, I would simply tweak an eval or search parameter.

P.P.S.
I do think that a very high correlation in the ponder hit statistic is something suspicious. Be that as it may, if I were a determined cloner I am absolutely certain I could fool that statistic.

P.P.P.S.
I think that one false accusation versus 1000 correct detections renders the technique a bad idea. But I guess that I am probably alone in that odd stance.

P.P.P.P.S.
The "Look and feel" lawsuit of Lotus 1-2-3 versus Microsoft Excel showed that identical outputs are not copyrightable.

michiguel · Post by **michiguel** » Wed Jan 27, 2010 10:06 pm

rjgibert wrote:
Another thing that bothers me about your test is that calibrating it (choosing which positions to use and which to leave out) requires making assumptions about which programs are clones and which are not. How else could it be done? A lawyer would argue that all you have done with your test is pick positions that tend to corroborate your belief and omit positions that tend not to, and that an easier way to determine your belief is to simply ask you.

We use 1 million positions chosen randomly and tell the lawyer to... well...
go somewhere else

Miguel
PS: I have the strong feeling we could be onto something here.

Aaron Becker · Post by **Aaron Becker** » Wed Jan 27, 2010 10:13 pm

rjgibert wrote: Another thing that bothers me about your test is that calibrating it (choosing which positions to use and which to leave out) requires making assumptions about which programs are clones and which are not. How else could it be done? A lawyer would argue that all you have done with your test is pick positions that tend to corroborate your belief and omit positions that tend not to, and that an easier way to determine your belief is to simply ask you.

You can solve this problem by calibrating using only well-known families of programs. That is, calibration can be done by comparing results from multiple versions of the Glaurung/Stockfish family, the Fruit family, different Shredder versions, etc. Versions within each family are considered "clones" (this is a fairly liberal interpretation of cloning, since substantial effort at improving and tweaking values goes on from release to release), and versions across families are non-clones. No engines whose provenance is in question would be used for calibration.
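A sketch of that calibration, using only rows from Don's table whose family membership is not in dispute (the Glaurung/Stockfish versions and Don's own doch/komodo versions). The family labels, the function name and the "midpoint of the gap" rule are assumptions made for illustration.

```python
# Family labels for engines of undisputed provenance (assumed from the thread).
FAMILY = {
    "sf16": "stockfish", "sf15": "stockfish", "sf14": "stockfish",
    "glaurung": "stockfish",
    "doch-1.2": "doch", "doch-1.0": "doch", "komodo": "doch",
}

# Rows copied from the table, restricted to the engines above.
ROWS = [
    (758, "doch-1.2", "doch-1.0"), (687, "komodo", "doch-1.0"),
    (671, "sf16", "sf15"), (639, "sf14", "glaurung"), (595, "sf16", "glaurung"),
    (554, "sf15", "komodo"), (551, "sf14", "doch-1.0"),
    (544, "glaurung", "doch-1.0"), (519, "sf16", "komodo"),
]

def calibrate(rows, family_of):
    """Return a cutoff midway between the weakest within-family score and
    the strongest cross-family score, or None if the two ranges overlap."""
    within = [s for s, a, b in rows if family_of[a] == family_of[b]]
    across = [s for s, a, b in rows if family_of[a] != family_of[b]]
    lowest_within, highest_across = min(within), max(across)
    if lowest_within <= highest_across:
        return None  # no clean separation on this calibration set
    return (lowest_within + highest_across) / 2

print(calibrate(ROWS, FAMILY))
```

Engines of disputed origin would then be scored against a cutoff that was fixed without them, which is the point of the suggestion.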

Don · Post by **Don** » Wed Jan 27, 2010 10:15 pm

hgm wrote:That is even better. Just give a list of the moves. (E.g. in long algebraic notation, all concatenated to a very long string.)

That is basically how the test works, and a trivial tcl script processes it. In the tester it's 1 line per position with other information in the line. But if anyone wants the data I have processed I can make it available as a text file, 1 move per line.
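The two formats mentioned here convert into each other easily. The sketch below splits a concatenated move string back into one move per line, assuming coordinate-style moves such as e2e4 with an optional lowercase promotion letter; that exact notation is an assumption, since the thread only says "long algebraic". The one subtle case is a trailing "b", which is a bishop promotion unless a rank digit follows it, in which case it starts the next move.

```python
import re

# from-square, to-square, then an optional promotion piece.
# "b" counts as a promotion only when it is not followed by a rank digit.
MOVE = re.compile(r"[a-h][1-8][a-h][1-8](?:[qrn]|b(?![1-8]))?")

def split_moves(blob):
    """Split a concatenated move string into a list of individual moves."""
    moves = MOVE.findall(blob)
    if "".join(moves) != blob:
        raise ValueError("string is not a clean run of coordinate moves")
    return moves

blob = "e2e4e7e8qb1c3a7a8bg1f3"  # hypothetical sample, not real tester output
print("\n".join(split_moves(blob)))
```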

benstoker · Post by **benstoker** » Wed Jan 27, 2010 10:21 pm

Don wrote:
hgm wrote:That is even better. Just give a list of the moves. (E.g. in long algebraic notation, all concatenated to a very long string.)
That is basically how the test works, and a trivial tcl script processes it. In the tester it's 1 line per position with other information in the line. But if anyone wants the data I have processed I can make it available as a text file, 1 move per line.

Can you make available the tcl script also?

Thank you.

michiguel · Post by **michiguel** » Wed Jan 27, 2010 10:22 pm

Don wrote:
hgm wrote:That is even better. Just give a list of the moves. (E.g. in long algebraic notation, all concatenated to a very long string.)
That is basically how the test works, and a trivial tcl script processes it. In the tester it's 1 line per position with other information in the line. But if anyone wants the data I have processed I can make it available as a text file, 1 move per line.

Please do... my gmail account is mballicora

Miguel

Don · Post by **Don** » Wed Jan 27, 2010 10:24 pm

rjgibert wrote: The individuals who will find your test the most useful are the cloners. With it, the easiest and most effective types of changes in their clones will become well known and the types of changes that are not worth the time and effort will also become well known. IOW, you will be making the cloners more efficient and more competent.

BTW, am I right in stating that sf_strong & sf14 are highly related? And that sf15 & robbo are highly independent? If true on both counts, then the respective scores for these pairings, 617 & 571, seem too close for comfort.

Another thing that bothers me about your test is that calibrating it (choosing which positions to use and which to leave out) requires making assumptions about which programs are clones and which are not. How else could it be done? A lawyer would argue that all you have done with your test is pick positions that tend to corroborate your belief and omit positions that tend not to, and that an easier way to determine your belief is to simply ask you.

I did not rig the test in any way nor did I try to calibrate it. The positions I used are the first 1000 positions in a set we created a couple of years ago for a different purpose.

I don't claim anything grandiose for this particular test. I am just presenting the data as it came out of the tester and you can make anything out of it that you want to.

I'm sure if I ran 10 more families of programs with different versions within the families I would find some exceptions. I view this as a very crude test that may or may not have some value.

benstoker · Post by **benstoker** » Wed Jan 27, 2010 10:28 pm

rjgibert wrote:
Another thing that bothers me about your test is that calibrating it (choosing which positions to use and which to leave out) requires making assumptions about which programs are clones and which are not. How else could it be done? A lawyer would argue that all you have done with your test is pick positions that tend to corroborate your belief and omit positions that tend not to, and that an easier way to determine your belief is to simply ask you.

Why not just generate 1000 totally random but legal FEN positions?

rjgibert · Post by **rjgibert** » Wed Jan 27, 2010 10:36 pm

Dann Corbit wrote:
Don wrote: Suppose you ran 1000 random positions on many different versions of
the same program, then ran the same positions on many versions of
other programs. What could be deduced statistically from how often
the various program versions picked the same move?
[snip]
Here is what I suspect:
The strongest programs will pick the same moves. I guess that we probably won't see that effect among the weakest programs since they choose bad moves all the time. So this idea will label all very strong programs as probable clones. So what is the threshold we should set to say, "Yes, this is a clone?"

P.S.
If I wanted to fool this detector, I would simply tweak an eval or search parameter.

P.P.S.
I do think that a very high correlation in the ponder hit statistic is something suspicious. Be that as it may, if I were a determined cloner I am absolutely certain I could fool that statistic.

P.P.P.S.
I think that one false accusation versus 1000 correct detections renders the technique a bad idea. But I guess that I am probably alone in that odd stance.

P.P.P.P.S.
The "Look and feel" lawsuit of Lotus 1-2-3 versus Microsoft Excel showed that identical outputs are not copyrightable.

There is a way of making this test useful to a cloner in a way that nobody expects. A cloner might just decide to add a multiple-personality feature to his program and, just for fun, make sure that no two personalities correlate with each other by better than, say, a score of 600. The amusing result would be that "his" engine could easily pass the test as not being a clone of itself!
