Clone detection test

Discussion of chess software programming and technical issues.

Moderator: Ras

User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Clone detection test

Post by Don »

michiguel wrote:
Don wrote:
hgm wrote:That is even better. Just give a list of the moves. (E.g. in long algebraic notation, all concatenated to a very long string.)
That is basically how the test works, and a trivial tcl script processes it. In the tester it's 1 line per position, with other information in the line. But if anyone wants the data I have processed, I can make it available as a text file, 1 move per line.
Please do... my gmail account is mballicora

Miguel
grab it from here:

http://greencheeks.homelinux.org:8015/~drd/clone.tar.gz

The tcl script is included and it will try to process all files with a .res extension in the same directory the script runs in.

It's not polished; it's a kludge. And the ippolito data is pieced together because the program crashed 3 times and I had to restart it at the position that crashed, but it is accurate.
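If anyone would rather not use tcl, here is a minimal sketch in Python of the agreement count, assuming two of the processed text files with one move per line, in the same position order (the file names are just placeholders, not files in the archive):

Code: Select all

# Count how often two engines chose the same move. Assumes each input
# file holds one move per line (e.g. long algebraic), with the positions
# in the same order in both files.

def load_moves(path):
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

def agreement(path_a, path_b):
    moves_a = load_moves(path_a)
    moves_b = load_moves(path_b)
    if len(moves_a) != len(moves_b):
        raise ValueError("move lists differ in length")
    return sum(1 for a, b in zip(moves_a, moves_b) if a == b)

if __name__ == "__main__":
    # File names are placeholders for two of the processed result files.
    print(agreement("rybka.txt", "robbo.txt"))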

Don
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Clone detection test

Post by Don »

Dann Corbit wrote:
Don wrote:Suppose you ran 1000 random positions on many different versions of
the same program, then ran the same positions on many versions of
other programs. What could be deduced statistically from how often
the various program versions picked the same move?
[snip]
Here is what I suspect:
The strongest programs will pick the same moves. I guess that we probably won't see that effect among the weakest programs since they choose bad moves all the time. So this idea will label all very strong programs as probable clones. So what is the threshold we should set to say, "Yes, this is a clone?"
I would like to point out that I had the exact same thought. I added a "strong" stockfish, which you can see in the data, to simulate a program that was more in line with Rybka as far as strength goes, and this did not fool the test.

There is no question that really strong programs are probably more likely to play the same moves on average and this will have some impact. However there are a huge number of chess positions where the move choice is a matter of style and my hope is that this set includes enough of them. But even if it doesn't, that could be fixed.

In fact I would guess you would be more qualified than I am to build such a test. I did not set out to build a polished product but only to run a single experiment.

P.S.
If I wanted to fool this detector, I would simply tweak an eval or search parameter.
One idea I had was to make the tester and the positions used secret for this very reason.

P.P.S.
I do think that a very high correlation in the ponder hit statistic is something suspicious. Be that as it may, if I were a determined cloner I am absolutely certain I could fool that statistic.
Of course. I don't pretend this is anything foolproof or special.

P.P.P.S.
I think that one false accusation versus 1000 correct detections renders the technique a bad idea. But I guess that I am probably alone in that odd stance.
Such a test should never be considered definitive, even if it were much improved. It should be a tool, not a club. Lie detector tests are not considered foolproof either but are useful tools. And of course I know this is defeatable.

P.P.P.P.S.
The "Look and feel" lawsuit of Lotus 1-2-3 verses Microsoft Excel showed that identical outputs are not copyrightable.
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Clone detection test

Post by Don »

Aaron Becker wrote:
rjgibert wrote:Another thing that bothers me about your test is that calibrating it (deciding which positions to use and which not to) requires making assumptions about which programs are clones and which are not. How else could you do it? A lawyer would argue that all you have done with your test is pick positions that tend to corroborate your belief and omit positions that tend not to, and that an easier way to determine your belief is simply to ask you.
You can solve this problem by calibrating using only well-known families of programs. That is, calibration can be done by comparing results from multiple versions in the Glaurung/Stockfish family, the Fruit family, different Shredder versions, etc. Versions within each family are considered "clones" (this is a fairly liberal interpretation of cloning, since substantial effort at improving and tweaking values goes on from release to release), and versions across families are non-clones. No engines whose provenance is in question would be used for calibration.
There is no question that this test could be calibrated, but one must be aware that by doing so you are adding some bias to the test. In a sense you are cooking the books by making sure the test does what you have already decided it should do. For instance, if I wanted to believe that ippolito, robbolito and Rybka are in the same family, I could attempt to make sure the test returns that result. But in this case I did not do any of that. I did not pick and choose the positions.
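For what it's worth, here is a rough sketch (in Python, not part of my tester) of what calibrating a threshold from engines of known provenance only might look like; the family groupings and pair scores below are illustrative placeholders:

Code: Select all

# Sketch of threshold calibration using only engines of known provenance.
# "scores" maps a pair of engines to the number of identical move choices
# out of 1000 positions; the groupings and numbers are placeholders.

families = {
    "sf14": "stockfish", "sf15": "stockfish", "sf16": "stockfish",
    "glaurung": "stockfish",
    "doch-1.0": "doch", "doch-1.2": "doch", "komodo": "doch",
}

scores = {
    ("sf15", "sf14"): 706, ("sf16", "sf15"): 671,          # within family
    ("doch-1.2", "doch-1.0"): 758, ("komodo", "doch-1.2"): 720,
    ("sf14", "doch-1.0"): 551, ("sf15", "komodo"): 554,    # across families
}

within = [s for (a, b), s in scores.items() if families[a] == families[b]]
between = [s for (a, b), s in scores.items() if families[a] != families[b]]

# One simple rule: put the threshold halfway between the highest
# cross-family score and the lowest within-family score.
threshold = (max(between) + min(within)) / 2
print("within:", sorted(within), "between:", sorted(between))
print("threshold:", threshold)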
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Clone detection test

Post by Don »

One other consideration here. Not meant to refute you, but I suspect that it won't be that easy to defeat the test, at least without seriously weakening the program.

Building a strong evaluation function is very difficult and I believe that even making a version of komodo that would defeat this test would be very challenging. It would be easy to defeat if I were willing to make big changes but those changes would weaken the program significantly.

I could be wrong about all of this, but I think it's more than just tuning a bunch of weights, it probably has a lot to do with the specific evaluation features in the program too.

So to reiterate, I think it would be very challenging to take the 50 or so evaluation features in komodo, change them enough to defeat the test, and still end up with a really strong program.

Dann Corbit wrote:
Don wrote:Suppose you ran 1000 random positions on many different versions of
the same program, then ran the same positions on many versions of
other programs. What could be deduced statistically from how often
the various program versions picked the same move?
[snip]
Here is what I suspect:
The strongest programs will pick the same moves. I guess that we probably won't see that effect among the weakest programs since they choose bad moves all the time. So this idea will label all very strong programs as probable clones. So what is the threshold we should set to say, "Yes, this is a clone?"

P.S.
If I wanted to fool this detector, I would simply tweak an eval or search parameter.

P.P.S.
I do think that a very high correlation in the ponder hit statistic is something suspicious. Be that as it may, if I were a determined cloner I am absolutely certain I could fool that statistic.

P.P.P.S.
I think that one false accusation versus 1000 correct detections renders the technique a bad idea. But I guess that I am probably alone in that odd stance.

P.P.P.P.S.
The "Look and feel" lawsuit of Lotus 1-2-3 verses Microsoft Excel showed that identical outputs are not copyrightable.
lkaufman
Posts: 6259
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA
Full name: Larry Kaufman

Re: Clone detection test

Post by lkaufman »

I've run the test on various versions of Rybka, RobboLito, and Komodo (so far; others to follow) at both five and six ply (with 3 ply added to the stated depth of the Rybka versions). The closest correlations are, for the most part, each program with itself at the two different depths, which indicates that for this test eval similarities trump depth. Next are versions of the same program released very near each other in time. However, the correlation of the Robbo versions to Rybka 3 is closer than the correlation of either of them to Rybka 2.3.2a. In other words, this test indicates that while Robbo is certainly not the same as Rybka 3, it appears to be more of a derivative of Rybka 3 than Rybka 3 is of its own predecessor. This doesn't surprise me, because the eval was completely redone in R3.
Gian-Carlo Pascutto
Posts: 1260
Joined: Sat Dec 13, 2008 7:00 pm

Re: Clone detection test

Post by Gian-Carlo Pascutto »

Don wrote: What is interesting in the table is that if you assume that ippolito,
Robbolito and Rybka are in the same family of programs, and that any
program with a score above 594 is to be considered a clone, then my
test gets it right in every single case. It identifies families of
programs and non-related programs accurately.
So if we make some assumptions, and then pick a parameter arbitrarily to fit some preconceptions that we have, the result of our test with this parameter will confirm our preconceptions!

My stomach turns when I read something like this from an experienced scientist like you. This is a nice example of designing the experiment to fit the conclusion. You're supposed to do it the other way around.

This test might be quite useful, but not when you make the kind of statements you made above.
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: Clone detection test

Post by Milos »

Don wrote:Suppose you ran 1000 random positions on many different versions of
the same program, then ran the same positions on many versions of
other programs. What could be deduced statistically from how often
the various program versions picked the same move?

Such a thing could serve as a crude clone detector. I ran such an
experiment on many different programs to get a kind of measurement of
correlation between different program "families" and different versions
within the same family of programs, and the result is surprising.

The 1000 positions are from a set of positions that Larry Kaufman and
I created long ago that are designed to compare chess programs to
humans in playing style. As a result, few of the problems are blatantly
tactical, and in many of these positions the choice of move is going to
be based on preference more than raw strength.

The test compares any two programs by how often they pick the same
move out of a sample of 1000 positions. I run each program to the
same time limit, which in this case is 1/10 of a second.

Below is a table of the results, starting with the most correlated and
ending with the least correlated programs. I ran various versions of my
own program, all the stockfish versions including glaurung, and all the
so-called Rybka clones as well as Rybka herself.

What is interesting in the table is that if you assume that ippolito,
Robbolito and Rybka are in the same family of programs, and that any
program with a score above 594 is to be considered a clone, then my
test gets it right in every single case. It identifies families of
programs and non-related programs accurately.

The most correlated pair of programs that we know are not clones of
each other is rybka and doch-1.2. However, these two programs do have
an author in common.

Just in case this could be interpreted as a strength tester, I added a
version of stockfish 1.6 which I call sf_strong. It is stockfish 1.6
run at 1/4 of a second instead of 1/10 of a second. This was a sanity
test to determine whether stockfish would look more like Rybka if it
were run at a level where it was closer to Rybka's chess strength. As
you can see, this did not foil my test.

I'm not going to attach any special significance to this test - look at
the data and draw your own conclusions. I don't pretend it's
scientifically accurate or anything like that. It is whatever it is.

Code: Select all

  846  sf_strong         sf16            
  758  doch-1.2          doch-1.0        
  734  robbo             ippolito        
  720  komodo            doch-1.2        
  706  sf15              sf14            
  687  komodo            doch-1.0        
  671  sf16              sf15            
  655  rybka             robbo           
  649  sf16              sf14            
  644  rybka             ippolito        
  639  sf14              glaurung        
  638  sf_strong         sf15            
  630  sf15              glaurung        
  617  sf_strong         sf14            
  600  sf_strong         glaurung        
  595  sf16              glaurung        

  594  rybka             doch-1.2        
  582  rybka             doch-1.0        
  581  rybka             komodo          
  579  ippolito          doch-1.0        
  573  robbo             komodo          
  571  sf15              robbo           
  571  ippolito          doch-1.2        
  569  sf15              rybka           
  568  robbo             doch-1.2        
  565  sf_strong         rybka           
  563  sf14              ippolito        
  560  komodo            ippolito        
  559  sf14              robbo           
  559  robbo             doch-1.0        
  558  sf14              rybka           
  557  sf_strong         robbo           
  556  sf15              ippolito        
  554  sf16              rybka           
  554  sf16              robbo           
  554  sf15              komodo          
  551  sf14              doch-1.0        
  549  sf15              doch-1.0        
  544  glaurung          doch-1.0        
  542  rybka             glaurung        
  541  sf16              doch-1.0        
  538  sf16              ippolito        
  536  sf_strong         doch-1.0        
  536  sf14              komodo          
  532  komodo            glaurung        
  531  sf_strong         ippolito        
  531  sf15              doch-1.2        
  528  glaurung          doch-1.2        
  527  sf_strong         komodo          
  527  sf16              doch-1.2        
  525  robbo             glaurung        
  524  sf_strong         doch-1.2        
  523  sf14              doch-1.2        
  521  ippolito          glaurung        
  519  sf16              komodo          
The number on the left in fact represents "average ELO of both engines" - k*"actual difference between engines", and not "actual similarity between engines" as you think.
Dann Corbit
Posts: 12792
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Clone detection test

Post by Dann Corbit »

Don wrote:One other consideration here. Not meant to refute you, but I suspect that it won't be that easy to defeat the test, at least without seriously weakening the program.

Building a strong evaluation function is very difficult and I believe that even making a version of komodo that would defeat this test would be very challenging. It would be easy to defeat if I were willing to make big changes but those changes would weaken the program significantly.

I could be wrong about all of this, but I think it's more than just tuning a bunch of weights, it probably has a lot to do with the specific evaluation features in the program too.

So to reiterate, I think it would be very challenging to take the 50 or so evaluation features in komodo, change them enough to defeat the test, and still end up with a really strong program.
Suggestion:
Change your null move depth by 1/2 ply and see if the programs appear similar.

I would be interested in a test like this that is not easily defeated.
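One rough way to judge whether a tweak like that changes the agreement score by more than sampling noise would be to treat each count out of 1000 as a binomial proportion and compare the two counts with a simple two-proportion z-test; a sketch in Python (the counts are placeholders, not results from the thread):

Code: Select all

import math

# Two-proportion z-test: is the drop in agreement (out of n positions)
# after a tweak larger than sampling noise? The counts are placeholders.

def z_score(k1, k2, n=1000):
    p1, p2 = k1 / n, k2 / n
    p = (k1 + k2) / (2 * n)                 # pooled proportion
    se = math.sqrt(2 * p * (1 - p) / n)     # std error of the difference
    return (p1 - p2) / se

# e.g. 655 agreements before the tweak, 600 after
z = z_score(655, 600)
print(z, "significant at 5%" if abs(z) > 1.96 else "not significant")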
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Clone detection test

Post by Don »

Milos wrote:
Don wrote:Suppose you ran 1000 random positions on many different versions of
the same program, then ran the same positions on many versions of
other programs. What could be deduced statistically from how often
the various program versions picked the same move?

[snip]
The number on the left in fact represents "average ELO of both engines" - k*"actual difference between engines", and not "actual similarity between engines" as you think.
The number on the left is the count of positions where the two programs agreed on the best move. It's not an ELO calculation.
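For anyone curious how such a table could be produced, here is a minimal sketch in Python (not the actual tcl script); it assumes one chosen move per position for each engine, and the engine names and toy data are placeholders:

Code: Select all

from itertools import combinations

# Build the pairwise agreement table: for every pair of engines, count
# the positions on which they chose the same move, then sort descending.
# "moves" maps an engine name to its list of chosen moves, one per
# position; in practice these would be read from the processed files.

def agreement_table(moves):
    rows = []
    for a, b in combinations(sorted(moves), 2):
        count = sum(1 for x, y in zip(moves[a], moves[b]) if x == y)
        rows.append((count, a, b))
    return sorted(rows, reverse=True)

# Toy example with 4 positions per engine (placeholder data):
moves = {
    "rybka":    ["e2e4", "g1f3", "d2d4", "c2c4"],
    "robbo":    ["e2e4", "g1f3", "d2d4", "b1c3"],
    "glaurung": ["d2d4", "g1f3", "c2c4", "b1c3"],
}

for count, a, b in agreement_table(moves):
    print(f"{count:5d}  {a:<12} {b}")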
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Clone detection test

Post by Don »

Gian-Carlo Pascutto wrote:
Don wrote: What is interesting in the table is that if you assume that ippolito,
Robbolito and Rybka are in the same family of programs, and that any
program with a score above 594 is to be considered a clone, then my
test gets it right in every single case. It identifies families of
programs and non-related programs accurately.
So if we make some assumptions, and then pick a parameter arbitrarily to fit some preconceptions that we have, the result of our test with this parameter will confirm our preconceptions!

My stomach turns when I read something like this from an experienced scientist like you. This is a nice example of designing the experiment to fit the conclusion. You're supposed to do it the other way around.

This test might be quite useful, but not when you make the kind of statements you made above.
Thank you for the kind words.