Clone detection test

Discussion of chess software programming and technical issues.

Moderator: Ras

michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Clone detection test

Post by michiguel »

Don wrote:
michiguel wrote:
Don wrote:
hgm wrote:That is even better. Just give a list of the moves. (E.g. in long algebraic notation, all concatenated to a very long string.)
That is basically how the test works, and a trivial tcl script processes it. In the tester it's 1 line per position with other information in the line. But if anyone wants the data I have processed I can make it available as a text file, 1 move per line.
Please do... my gmail account is mballicora

Miguel
grab it from here:

http://greencheeks.homelinux.org:8015/~drd/clone.tar.gz

The tcl script is included and it will try to process all files with a .res extension in the same directory the script runs in.

It's not polished, it's a kludge. And the ippolito data is pieced together because the program crashed 3 times and I had to restart it at the problem that crashed, but it is accurate.

Don
Here you have a Neighbor-Joining analysis of the data. Clearly, the similarities are not related to strength; otherwise the tree you would obtain would look completely different (almost linear). How significant is this tree? It will be necessary to do a bootstrap analysis, or to repeat the same experiment with different positions. In other words, if the analysis keeps giving the same type of tree after random re-sampling of the data, it will be very significant. That would indicate that, of the engines tested here, the one whose move selection is most similar to Robbo/Ippo is Rybka.

I would be *very* cautious with this. More engines should be included in the analysis too. However, I think this could be a potentially interesting screening method if more engines are included, more sets of positions are added, and a more rigorous bootstrap analysis is performed.
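Just to make concrete what such a bootstrap would look like, here is a minimal sketch in Python. It assumes one-move-per-line files like the ones Don offered, and the scikit-bio library for the Neighbor-Joining step; the file names and the disagreement-based distance are assumptions, not the actual pipeline used for the tree below.

Code: Select all

# Sketch: build a Neighbor-Joining tree from per-engine move files and
# bootstrap it by re-sampling positions with replacement.
# Assumes one move per line per engine (same position order for all engines)
# and that numpy and scikit-bio are installed; file names are made up.
import random
import numpy as np
from skbio import DistanceMatrix
from skbio.tree import nj

def load_moves(path):
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

names = ["rybka", "ippolito", "robbo", "fruit", "komodo"]   # hypothetical files
engines = {name: load_moves(name + ".txt") for name in names}
n_pos = len(engines[names[0]])

def distance_matrix(indices):
    """Distance = fraction of the sampled positions where two engines disagree."""
    d = np.zeros((len(names), len(names)))
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            mism = sum(engines[names[i]][k] != engines[names[j]][k] for k in indices)
            d[i, j] = d[j, i] = mism / len(indices)
    return DistanceMatrix(d, names)

# Tree from the full data set
print(nj(distance_matrix(range(n_pos))).ascii_art())

# A few bootstrap replicates: getting the same tree shape over and over
# again is what would make the grouping significant.
for _ in range(5):
    sample = [random.randrange(n_pos) for _ in range(n_pos)]
    print(nj(distance_matrix(sample)).ascii_art())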

The only thing I can say for certain is that if people saw similarities in style between R3 and RL/IPPO, it may not have been a complete hallucination.

Miguel

[Image: Neighbor-Joining tree built from the move-matching data]
hgm
Posts: 28395
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Clone detection test

Post by hgm »

I think this could be a very useful test, because it really tests what is of pragmatic interest. If I organize a tournament, what I want to prevent is that one strong program would participate in multiple versions with only an infinitesimal difference between them. I would like diversity. If I had the choice between having two Fruit derivatives in a tournament that share 90% of the code, but that have the evaluation altered so that they play in a completely different style, or having Fruit together with a completely original program, which does not share a single letter of code with Fruit, has a different board representation, but uses the same extensions and reductions, the same evaluation terms, and the same tuning of those... I would in fact prefer to have the former. (It is not illegal, after all, to copy Fruit code. Of course I would try to keep out criminals.)

Having an empirical test to determine if two programs are sufficiently different _in behavior_ to be allowed in the same tournament would make life enormously easier than having to request source code, verifying it is of the program that is playing, comparing it to other source code of all known and unknown engines, etc. Just let the participant deposit a copy of their executable, and after the tournament let that run an EPD suite which contains the test positions, plus a test suite of a number of positions from its own tournament games. Obviously it has to very closely reproduce the tournament moves, or you know you are being conned. And the result of the derivative test would then be decisive for the verdict in clone accusations.

It would be very nice if this test could be made to work with randomly chosen positions (according to some criterion of not being too tactical or obvious, which could be automatically tested by having a few standard engines solve them). The participants could then be tested before the tournament with one set of positions. Actually, you would let them run the EPD suite themselves, and they would send you the resulting string of moves, so that the TD could flag matches between opponents even before the tournament starts, and there will not be any unpleasant surprises later. And then you would do the test after the tournament with a different set of positions, so that malicious participants could not rig their engine to produce false answers only on those test positions, or sneakily switch engines, as they could do on the first test to get their engines in (in the certainty they would be found out afterwards).
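As a rough sketch of how the post-tournament check could be automated: replay a list of positions with the deposited executable and count how often it reproduces the reference moves (e.g. its own tournament moves). This assumes the python-chess library and a UCI engine; the paths, the 0.1 s limit and the example position are placeholders, not a prescription.

Code: Select all

# Sketch of the TD-side check described above: run the deposited engine
# over a list of FENs and count how often it reproduces the reference moves.
# Assumes python-chess and a UCI engine; all names below are placeholders.
import chess
import chess.engine

def match_count(engine_path, positions, movetime=0.1):
    """positions: list of (fen, reference_move_in_uci_notation) pairs."""
    engine = chess.engine.SimpleEngine.popen_uci(engine_path)
    matches = 0
    try:
        for fen, ref_move in positions:
            board = chess.Board(fen)
            result = engine.play(board, chess.engine.Limit(time=movetime))
            if result.move.uci() == ref_move:
                matches += 1
    finally:
        engine.quit()
    return matches

# Example usage (hypothetical engine path and positions):
# positions = [("rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1", "e2e4")]
# print(match_count("./deposited_engine", positions))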
Gian-Carlo Pascutto
Posts: 1260
Joined: Sat Dec 13, 2008 7:00 pm

Re: Clone detection test

Post by Gian-Carlo Pascutto »

Don wrote:
Gian-Carlo Pascutto wrote:
Don wrote: What is interesting in the table is that if you assume that Ippolito,
Robbolito and Rybka are in the same family of programs, and that any
program with a score above 594 is to be considered a clone, then my
test gets it right in every single case. It identifies families of
programs and non-related programs accurately.
So if we make some assumptions, and then pick a parameter arbitrarily to fit some preconceptions that we have, the result of our test with this parameter will confirm our preconceptions!

My stomach turns when I read something like this from an experienced scientist like you. This is a nice example of designing the experiment to fit the conclusion. You're supposed to do it the other way around.

This test might be quite useful, but not when you make the kind of statements like you made above.
Thank you for the kind words.
Don't take it personally, it's not intended as such :)

Do you disagree with the contents? I.e. that this is designing the experiment to fit the conclusion, and that arbitrarily picking the value just because it confirms what you want the test to confirm is scientifically just wrong?
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Clone detection test

Post by Don »

Dann Corbit wrote:
Don wrote:One other consideration here. Not meant to refute you, but I suspect that it won't be that easy to defeat the test, at least without seriously weakening the program.

Building a strong evaluation function is very difficult and I believe that even making a version of komodo that would defeat this test would be very challenging. It would be easy to defeat if I were willing to make big changes but those changes would weaken the program significantly.

I could be wrong about all of this, but I think it's more than just tuning a bunch of weights, it probably has a lot to do with the specific evaluation features in the program too.

So to reiterate, I think it would be very challenging to take the 50 or so evaluation features in komodo, change them enough to defeat the test and still end up with a really strong program.
Suggestion:
Change your null move depth by 1/2 ply and see if the programs appear similar.

I would be interested in a test like this that is not easily defeated.
The various versions of doch have significant search differences which go way beyond a minor change to the null move depth. However, if you want me to run that test I can. I did run a test with stockfish 1.6 where I increased the level 2.5 times and it still showed up as being closely related to itself. The reason I did that particular test was to bring it in line with the strength of Robbolito, thinking I might be able to fool the test into thinking it belongs to a different "family" of programs.

What is interesting is that even changing the level substantially does not seem to defeat this test. That was a big surprise to me.

This has shattered my own conceptions about chess programs. I thought the choice of move depended primarily on the strength of the chess program, but apparently that is not the case.

There is no reason why anyone else cannot repeat this same exact experiment, and it would make me very happy if someone did and reported their findings.

The setup is simple: take 1000 (or some other high number of) positions and compare pairs of programs to see how many times their moves match. Don't use problems from a tactical set, just random positions from games.
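For anyone who wants to replicate it, here is a minimal sketch of the scoring step in Python. It assumes the one-move-per-line text files mentioned earlier in the thread, with positions in the same order for every engine; the file names are hypothetical.

Code: Select all

# Sketch of the pairwise scoring described above: for every pair of engines,
# count how many of the test positions produced the same move.
from itertools import combinations

def load_moves(path):
    # one move per line, same order of positions for every engine
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

engines = {name: load_moves(name + ".moves")            # hypothetical file names
           for name in ["komodo", "doch-1.0", "rybka", "ippolito", "fruit"]}

scores = []
for a, b in combinations(sorted(engines), 2):
    same = sum(x == y for x, y in zip(engines[a], engines[b]))
    scores.append((same, a, b))

for same, a, b in sorted(scores, reverse=True):
    print(f"{same:5d}  {a:16s}  {b:16s}")   # same layout as the tables in this thread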
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Clone detection test

Post by Don »

The number 595 is totally arbitrary; it means nothing. I did not intend that value to represent the distinction between what is a clone and what is not a clone - I picked it only because that was the point in the data set where you could draw a line that separated related programs from unrelated programs.

If I ran the test on another set of programs, the line would almost certainly show up a little higher or a little lower or you might even get overlap and mis-identification. That all needs to be checked out before the test is taken too seriously.

Think of the line not as black and white but as a big, fuzzy grey band. Anything near this line could be disputable; anything well above it would appear to be highly related, and everything well below it is not very related.
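One way to quantify how fat that grey band is: with roughly 1000 positions per pair, a match count is a binomial-style sample, so a simple error bar can be attached to each score. Below is a hedged sketch; the normal approximation and the independence assumption are simplifications, not part of the actual test.

Code: Select all

# Sketch: put a rough error bar on a match count out of N positions, using
# the normal approximation to the binomial. Purely illustrative; it treats
# the positions as independent, which they only are approximately.
import math

def match_interval(matches, n_positions=1000, z=1.96):
    p = matches / n_positions
    se = math.sqrt(p * (1 - p) / n_positions)
    return (p - z * se) * n_positions, (p + z * se) * n_positions

# A score of 595 out of 1000 comes with roughly +/- 30 positions of noise:
print(match_interval(595))   # approximately (565, 625)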

Spacious_Mind wrote:I am really sorry for perhaps asking some questions whose answers might be far too obvious to all you learned gentlemen. But perhaps it will help me understand what you are trying to achieve and enable me to try some other tests with dedicateds. If someone would kindly answer the question below, which I am asking based on your proposed test, I would appreciate it.
I have some interest in clone tests based on dedicated computers, and here is a test I made a while back on a few, which is not too dissimilar to what you are now discussing, but obviously very basic.

http://spacious-mind.com/html/gk_2100_clones_test.html

Now in my test I had several known clones and also two computers that are known to be totally different but of similar playing strength, and one previously suspected clone.

As you can see, the real clones scored on average about 95%, playing back exactly the same moves. The computer related to the clones played back 78% of all the moves, and the non-related computers 65% and 60%.

Now here is my question with regard to your proposed test. If you assume that each move was a board set-up move (game position), then for dedicateds the number 595 from your test would, applied to my test, suggest that all the computers are clones of each other. Ponder was off in my test, therefore each position was in reality a unique test position.

Therefore what I am not sure about is the number 595, and how a good threshold could be chosen for your test, because in my dedicated test that number would actually mean that every computer in the test was in fact a clone.

Or did I understand your proposed test wrong?

Best regards

Nick
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Clone detection test

Post by Don »

Gian-Carlo Pascutto wrote:
Don wrote:
Gian-Carlo Pascutto wrote:
Don wrote: What is interesting in the table is that if you assume that Ippolito,
Robbolito and Rybka are in the same family of programs, and that any
program with a score above 594 is to be considered a clone, then my
test gets it right in every single case. It identifies families of
programs and non-related programs accurately.
So if we make some assumptions, and then pick a parameter arbitrarily to fit some preconceptions that we have, the result of our test with this parameter will confirm our preconceptions!

My stomach turns when I read something like this from an experienced scientist like you. This is a nice example of designing the experiment to fit the conclusion. You're supposed to do it the other way around.

This test might be quite useful, but not when you make the kind of statements like you made above.
Thank you for the kind words.
Don't take it personally, it's not intended as such :)

Do you disagree with the contents? I.e. that this is designing the experiment to fit the conclusion, and that arbitrarily picking the value just because it confirms what you want the test to confirm is scientifically just wrong?
Yes, I disagree. I prefaced the statements with "if you assume" so that it would be clear that what I said was intended to be pure speculation.

All my other posts on this matter should indicate that I don't take the test too seriously; it's nothing more than an experiment, and you can make of the data what you want.
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Clone detection test

Post by michiguel »

hgm wrote:I think this could be a very useful test, because it really tests what is of pragmatic interest. If I organize a tournament, what I want to prevent is that one strong program would participate in multiple versions with only an infinitesimal difference between them. I would like diversity. If I had the choice between having two Fruit derivatives in a tournament that share 90% of the code, but that have the evaluation altered so that they play in a completely different style, or having Fruit together with a completely original program, which does not share a single letter of code with Fruit, has a different board representation, but uses the same extensions and reductions, the same evaluation terms, and the same tuning of those... I would in fact prefer to have the former. (It is not illegal, after all, to copy Fruit code. Of course I would try to keep out criminals.)

Having an empirical test to determine if two programs are sufficiently different _in behavior_ to be allowed in the same tournament would make life enormously easier than having to request source code, verifying it is of the program that is playing, comparing it to other source code of all known and unknown engines, etc. Just let the participant deposit a copy of their executable, and after the tournament let that run an EPD suite which contains the test positions, plus a test suite of a number of positions from its own tournament games. Obviously it has to very closely reproduce the tournament moves, or you know you are being conned. And the result of the derivative test would then be decisive for the verdict in clone accusations.

It would be very nice if this test could be made to work with randomly chosen positions (according to some criterion of not being too tactical or obvious, which could be automatically tested by having a few standard engines solve them). The participants could then be tested before the tournament with one set of positions. Actually, you would let them run the EPD suite themselves, and they would send you the resulting string of moves, so that the TD could flag matches between opponents even before the tournament starts, and there will not be any unpleasant surprises later. And then you would do the test after the tournament with a different set of positions, so that malicious participants could not rig their engine to produce false answers only on those test positions, or sneakily switch engines, as they could do on the first test to get their engines in (in the certainty they would be found out afterwards).
This could be very easily automated. Positions that are obvious for the engines are not a problem because they are mathematically discarded in the process; they just do not add new information. Choosing a big random set of positions should be OK.

My only concern is how fuzzy the base of the branches is.

BTW, I left Glaurung out by mistake. Here is the updated version.

Miguel

[Image: updated Neighbor-Joining tree, now including Glaurung]
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Clone detection test

Post by Don »

That is very cool. I don't know anything about neighbor-joining analysis, but I love the graph.
michiguel wrote:
Don wrote:
michiguel wrote:
Don wrote:
hgm wrote:That is even better. Just give a list of the moves. (E.g. in long algebraic notation, all concatenated to a very long string.)
That is basically how the test works, and a trivial tcl script processes it. In the tester it's 1 line per position with other information in the line. But if anyone wants the data I have processed I can make it available as a text file, 1 move per line.
Please do... my gmail account is mballicora

Miguel
grab it from here:

http://greencheeks.homelinux.org:8015/~drd/clone.tar.gz

The tcl script is included and it will try to process all files with a .res extension in the same directory the script runs in.

It's not polished, it's a kludge. And the ippolito data is pieced together because the program crashed 3 times and I had to restart it at the problem that crashed, but it is accurate.

Don
Here you have a Neighbor-Joining analysis of the data. Clearly, the similarities are not related to strength; otherwise the tree you would obtain would look completely different (almost linear). How significant is this tree? It will be necessary to do a bootstrap analysis, or to repeat the same experiment with different positions. In other words, if the analysis keeps giving the same type of tree after random re-sampling of the data, it will be very significant. That would indicate that, of the engines tested here, the one whose move selection is most similar to Robbo/Ippo is Rybka.

I would be *very* cautious with this. More engines should be included in the analysis too. However, I think this could be a potentially interesting screening method if more engines are included, more sets of positions are added, and a more rigorous bootstrap analysis is performed.

The only thing I can say for certain is that if people saw similarities in style between R3 and RL/IPPO, it may not have been a complete hallucination.

Miguel

[Image: Neighbor-Joining tree built from the move-matching data]
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Clone detection test

Post by Don »

Edmund wrote:
If I understand your setup correctly you replayed whole games, whilst Don Dailey just tests individual positions. The positions in this test are not very tactical in nature, as Don stated, but only represent the engine's style or preference.
I have no idea how you came to that conclusion. Larry ran the same exact test I did except he used 5-ply tests and I used 0.1 second tests. These positions came from strong correspondence games, so many of them might be sequences of moves from games. However, the thousand positions we both used are taken from a much larger set of positions, and those were scrambled so as not to appear in any particular order.

This set was created 2 or 3 years ago, I don't remember exactly. There was some attempt to remove positions that had obvious answers, but I don't remember the specifics. I am just guessing here, but it seems like I did something like remove all positions where a variety of programs played the same moves at low depths (a sketch of that kind of filter appears at the end of this post). The goal was to remove positions that would just waste time running through the tester because they were obvious moves, such as recaptures that must be played or else you lose.

Looking at your games, at most moves all the engines play the same move. That is not surprising, as in most chess positions arising in a game one move will be significantly superior to the others. Only those moves where several different good moves are possible are interesting. Maybe like in your first test game at move #61, where 9 engines played 3 different moves.

Furthermore, you only test whether or not a certain move was matched by all engines. Don, on the contrary, doesn't test whether or not a certain move was found, but whether it was matched by some other engine.

regards,
Edmund
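As a footnote to the filtering step referenced above, here is a minimal sketch of that kind of pre-filter: drop positions where a set of reference engines all agree at a shallow search, since those add no information. The data layout is an assumption, not taken from the actual tester.

Code: Select all

# Sketch of the pre-filtering step: discard positions where all reference
# engines already agree at a shallow search, since those add no information
# to the comparison. Data layout below is assumed.
def filter_obvious(fens, moves_by_engine):
    """fens: list of positions; moves_by_engine: {engine_name: [move per position]}
    obtained from low-depth searches with a few reference engines."""
    keep = []
    for i, fen in enumerate(fens):
        answers = {moves[i] for moves in moves_by_engine.values()}
        if len(answers) > 1:        # at least two engines disagree -> informative
            keep.append(fen)
    return keep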
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Clone detection test

Post by Don »

Dann Corbit wrote:
Don wrote:One other consideration here. Not meant to refute you, but I suspect that it won't be that easy to defeat the test, at least without seriously weakening the program.

Building a strong evaluation function is very difficult and I believe that even making a version of komodo that would defeat this test would be very challenging. It would be easy to defeat if I were willing to make big changes but those changes would weaken the program significantly.

I could be wrong about all of this, but I think it's more than just tuning a bunch of weights, it probably has a lot to do with the specific evaluation features in the program too.

So to reiterate, I think it would be very challenging to take the 50 or so evaluation features in komodo, change them enough to defeat the test and still end up with a really strong program.
Suggestion:
Change your null move depth by 1/2 ply and see if the programs appear similar.

I would be interested in a test like this that is not easily defeated.
Dan, I ran this against komodo version 341 (an internal version) with and without the minor change you suggested. knull does 1/2 ply more reduction in null move searching. In the interest of brevity I'm just including the additional results of these 2 versions, which you can mix in with the other results if you choose.

What is interesting to me is that such a minor change only scores 763, and yet major version differences, such as knull vs. doch-1.0, score as high as 701.

Code: Select all

  763  knull             k341            
  747  komodo            k341            
  719  komodo            knull           
  717  k341              doch-1.2        
  712  k341              doch-1.0        
  704  knull             doch-1.2        
  701  knull             doch-1.0        
  588  rybka             k341            
  573  rybka             knull           
  570  k341              fruit           
  562  knull             ippolito        
  561  sf15              k341            
  559  k341              ippolito        
  556  robbo             k341            
  553  knull             fruit           
  552  sf15              knull           
  551  robbo             knull           
  548  sf14              k341            
  546  toga2             k341            
  544  sf14              knull           
  539  toga2             knull           
  535  knull             glaurung        
  531  sf16              k341            
  529  sf_strong         k341            
  528  k341              glaurung        
  526  sf_strong         knull           
  526  sf16              knull