Similarity test

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Pio
Posts: 334
Joined: Sat Feb 25, 2012 10:42 pm
Location: Stockholm

Re: Similarity test

Post by Pio »

michiguel wrote:
Pio wrote:Hi

I think it is really dangerous to accuse someone of cheating just because the person's engine makes similar moves to other engines.

There are many reasons why an engine plays similar moves to another engine:

1) I suspect that good engines play more similar moves just because they are good.
This is not true, and it is one of the interesting aspects of testing "similarity". Strong engines play different moves in situations in which there are several reasonable options. This is a statistical test over a big number of positions.

Miguel

2) A person might have tuned his engine against another engine. If that is the case I guess the moves will be more similar.



3) A person might have tuned his weights using another engine as an oracle/reference, with the other engine acting as a sort of trainer.

4) You can get the same result in very different ways. Bats can fly and so can birds. Whales can swim and so can sharks. I have not even mentioned insects.

I do not think it is wise to assume an engine is original just because it has low similarity to all the other engines. What about smarter cheaters who steal from many engines and take ideas from wiki pages? You might call that research, but it is still not original. I also believe that you can make small changes to the evaluation and search that lead to very big differences in the similarity test.

Hola Miguel

First of all I have to say I usually like your posts and that I have followed Gaviota's games in TCEC and other competitions with great interest.

I said
1) I suspect that good engines play more similar moves just because they are good.
and I cannot see what is wrong with that. It is a logical statement, since a bad engine might make a really big mistake that no strong engine would.

What was wrong with my statement?

Also, if I am not mistaken, I believe Don used the UPGMA algorithm (I might be wrong), and I think you know it is not the best algorithm to use, even though it is very fast and easy to implement.

Hasta luego
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: Similarity test

Post by lucasart »

mar wrote:Just for fun, I ran Don's sim test on cheng4 0.36c and 0.38.
These engines are practically the same (neglecting some other minor changes in 0.38; the search is exactly the same).
The most important difference is tuned evaluation.
The result: 47.7% :)

This is in line with what I thought, that the sim test is only good for comparing the eval between engines.
What does this imply? Not much, really. Just that a high sim result is simply not enough to accuse someone of cloning
(and, consequently, that a low sim percentage doesn't really mean anything either).
Exactly. In fact, you don't even need to retune your whole eval to fool the sim test. Simply the PST is enough. Uri did an experiment by tweaking the PST of SF only, and got 55%.

This is completely obvious, and anyone with a programming background should understand that move similarity has little to do with code similarity. I've been saying that ever since I heard of this similarity test.
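For anyone who wants to try that kind of experiment themselves, here is a minimal sketch of the idea in Python (purely illustrative; Uri's actual changes and Stockfish's real PST values are not shown). It just adds small bounded offsets to one piece-square table, which shifts move preferences without touching search or the rest of the eval:

Code: Select all

import random

# Hypothetical piece-square table for one piece type (centipawns, 64 squares).
# Real engines keep one table per piece type, usually per game phase as well.
pst = [0] * 64

random.seed(1)

# Perturb each entry by a small bounded amount, e.g. +/- 10 cp. Strength stays
# roughly the same, but the preferred move can change wherever two candidates
# were close in score, which is exactly what the sim test measures.
tweaked_pst = [v + random.randint(-10, 10) for v in pst]

print(tweaked_pst[:8])

Rebuild the engine with the tweaked tables and run sim against the original to see how far the percentage drops.
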
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: Similarity test

Post by Adam Hair »

lucasart wrote:
mar wrote:Just for fun, I ran Don's sim test on cheng4 0.36c and 0.38.
These engines are practically the same (neglecting some other minor changes in 0.38; the search is exactly the same).
The most important difference is tuned evaluation.
The result: 47.7% :)

This is in line with what I thought, that the sim test is only good for comparing the eval between engines.
What does this imply? Not much, really. Just that a high sim result is simply not enough to accuse someone of cloning
(and, consequently, that a low sim percentage doesn't really mean anything either).
Exactly. In fact, you don't even need to retune your whole eval to fool the sim test. Simply the PST is enough. Uri did an experiment by tweaking the PST of SF only, and got 55%.

This is completely obvious, and anyone with a programming background should understand that move similarity has little to do with code similarity. I've been saying that ever since I heard of this similarity test.
Would it not be more accurate to state that move similarity does not necessarily imply code similarity? That is something I have kept in mind from the beginning.

I have looked hard to find examples of false positives. But what I kept finding was that every example of excessive move similarity (if we accept that 60% move similarity from my testing is excessive) involved an open source engine. That by itself is not proof, but in several of those cases the code was examined and code similarity was found. It does lend weight to the notion that move similarity probably indicates code similarity.
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Similarity test

Post by michiguel »

Pio wrote:
michiguel wrote:
Pio wrote:Hi

I think it is really dangerous to accuse someone of cheating just because the person's engine makes similar moves to other engines.

There are many reasons why an engine plays similar moves to another engine:

1) I suspect that good engines play more similar moves just because they are good.
This is not true, and it is one of the interesting aspects of testing "similarity". Strong engines play different moves in situations in which there are several reasonable options. This is a statistical test over a big number of positions.

Miguel

2) A person might have tuned his engine against another engine. If that is the case I guess the moves will be more similar.



3) A person might have tuned his weights using another engine as an oracle/reference, with the other engine acting as a sort of trainer.

4) You can get the same result in very different ways. Bats can fly and so can birds. Whales can swim and so can sharks. I have not even mentioned insects.

I do not think it is wise to assume an engine is original just because it has low similarity to all the other engines. What about smarter cheaters who steal from many engines and take ideas from wiki pages? You might call that research, but it is still not original. I also believe that you can make small changes to the evaluation and search that lead to very big differences in the similarity test.

Hola Miguel

First of all I have to say I usually like your posts and that I have followed Gaviota's games in TCEC and other competitions with great interest.
Thanks
I said
1) I suspect that good engines play more similar moves just because they are good.
and I cannot see what is wrong with that. It is a logical statement, since a bad engine might make a really big mistake that no strong engine would.

What was wrong with my statement?
It is just not what has been observed. There are plenty of opportunities for strong engines to give different outputs on many positions. In those, even if there are only two equally good options, that is enough to tell them apart if many positions are used. This is very easy to demonstrate. But most importantly, this is what has been observed experimentally. Strength is not the major reason for a higher % of similar moves.
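To put rough numbers on it: if two independent strong engines each pick one of two equally good moves more or less at random, their expected agreement on those positions is about 50%, and with many positions the noise around that figure becomes tiny. A back-of-the-envelope sketch (just an illustration, not what Don's tool actually computes):

Code: Select all

import math

n = 1000             # positions where two moves are (roughly) equally good
p_independent = 0.5  # expected agreement of two independent engines there
stderr = math.sqrt(p_independent * (1 - p_independent) / n)

print(f"expected agreement: {p_independent:.0%} +/- {2 * stderr:.1%}")
# Roughly 50% +/- 3%, so a copy that agrees on nearly all of these
# positions stands out by a huge margin.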

Also, if I am not mistaken, I believe Don used the UPGMA algorithm (I might be wrong), and I think you know it is not the best algorithm to use, even though it is very fast and easy to implement.
No, he did not. I introduced the idea of clustering the results of the similarity tool. I used neighbor joining with an ad-hoc form of bootstrap analysis. But that is a post-analysis of the matrix already obtained by Don's tool, just to see how statistically robust the results are. Of course there is noise, but the main signals are strong enough to show up regardless of what type of positions you use, as long as there are many of them.
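To make that post-analysis step concrete, here is a tiny sketch. The pairwise percentages are made up, and I use SciPy's average-linkage clustering (which is UPGMA-like) only as a simple stand-in, since neighbor joining and the bootstrap are not available in SciPy:

Code: Select all

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

engines = ["A", "B", "C", "D"]

# Made-up pairwise move-agreement percentages from a sim-style run.
sim = np.array([
    [100.0,  62.0,  48.0,  47.0],
    [ 62.0, 100.0,  49.0,  46.0],
    [ 48.0,  49.0, 100.0,  55.0],
    [ 47.0,  46.0,  55.0, 100.0],
])

# Turn agreement into a distance and cluster it. Don's tool produces the
# matrix; this is only the grouping step on top of it.
dist = 100.0 - sim
tree = linkage(squareform(dist), method="average")

# A and B cluster together and so do C and D; print the leaf order.
print(dendrogram(tree, labels=engines, no_plot=True)["ivl"])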

Miguel


Hasta luego
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: Similarity test

Post by Adam Hair »

bob wrote:
Adam Hair wrote:
Laskos wrote:
mar wrote:
But I still think that eval has the most impact on sim results.
Yes, Don himself hinted at that when releasing Sim. And it's still a good tool anyway, as most cloners are unable to modify the eval significantly.
It is easy to change the sim results by changing the parameter values. But it is damn tough to fool the sim test without screwing up the strength too much.
This is something that needs to be tested. I've seen the statement made repeatedly, but I haven't seen anyone make a concerted effort to actually try and do this. I'm skeptical myself, because I encounter so many tuning parameters where changing them makes little difference in terms of skill, but makes a lot of difference in how a program plays, style-wise. The problem is that all eval terms are general-purpose. If you think of them as a linear programming problem with an objective function, there are a BUNCH of feasible solutions that will meet the same requirements. A general-purpose piece of "knowledge" is, by definition, imperfect. I have seen cases where two different values produce no Elo change, yet they change the personality of the engine quite a bit. That's all that is needed to fool a test that is based mainly on the evaluation.
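Here is a toy example of what I mean (made-up feature names and weights, nothing from any real engine): two weight settings of similar quality that prefer different moves whenever two candidates are close, which is exactly the kind of thing a move-based similarity test keys on.

Code: Select all

# Toy illustration with made-up features and weights (not from any engine).
candidates = {
    "Nf3": {"mobility": 4, "king_safety": 1},
    "c4":  {"mobility": 3, "king_safety": 2},
}
w1 = {"mobility": 10, "king_safety": 8}   # "personality" A
w2 = {"mobility": 8,  "king_safety": 10}  # "personality" B

def score(move, weights):
    return sum(weights[f] * v for f, v in candidates[move].items())

for weights in (w1, w2):
    best = max(candidates, key=lambda m: score(m, weights))
    print(best, {m: score(m, weights) for m in candidates})

# w1 picks Nf3 (48 vs 46), w2 picks c4 (44 vs 42): scores of similar size,
# plausibly similar strength, but different chosen moves.
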
I will have to dig up some of the testing I have done. I think that Ed also did some tests to measure the change in strength when trying to fool the test. One thing that I have not seen is two different values for a parameter having little effect on Elo but a large effect on move selection.
If this test can be run on linux, tell me how. I can test a bunch of old crafty versions from the last year or so and see if they all show up as very similar or if some drop below the "magic number" that some are using to claim originality. I wouldn't waste the time myself, either, trying to disprove this, because it would take some time. But since I have lots of old versions lying around, many of which are very modest Elo improvements, if I can find a version N and N+1 that fools the similarity test, yet Elos are close, that would prove this is not as hard to do as might be expected...
Don wrote the tool in Tcl. If you have any familiarity with Tcl, then you could modify it to send CECP commands rather than UCI commands. I have used the adapter wb2uci with Wine to test Crafty, but I have no idea if that would work in your environment. I can send you what you need to work with the tool in Linux (if Tcl is supported in the version of Linux you use).

If this would not work for you, the CSVN has a similar sort of test that they use to screen for clones entering their tournament. I could dig up that link for you.
I believe this "false-positive" argument is irrelevant. What we REALLY care about is a clone that produces a false-negative and hides its origin and gets away with it. Of course there will be false positives. But they can easily be proven to be false with code examination. False negatives avoid the scrutiny and represent a real problem.
Jesse Gersenson
Posts: 593
Joined: Sat Aug 20, 2011 9:43 am

Re: Similarity test

Post by Jesse Gersenson »

bob wrote: [snip]If this test can be run on linux, tell me how.[/snip]
The Linux version of "sim version 3" includes a README:
http://komodochess.com/pub/sim03_linux64.zip

The only parameter I see is time in milliseconds:

Code: Select all

similar.exe -t myChessProgram 100
But the command line mentions:

Code: Select all

usage:
similar
similar -test INVOKE {time_in_ms}
similar -report N
similar -config FILE
similar -matrix
Anyone have a config file?
Ferdy
Posts: 4833
Joined: Sun Aug 10, 2008 3:15 pm
Location: Philippines

Re: Similarity test

Post by Ferdy »

Jesse Gersenson wrote:
bob wrote: [snip]If this test can be run on linux, tell me how.[/snip]
The Linux version of "sim version 3", includes a README:
http://komodochess.com/pub/sim03_linux64.zip

The only parameter I see is time in milliseconds:
similar.exe -t myChessProgram 100
But the command line mentions
usage:
similar
similar -test INVOKE {time_in_ms}
similar -report N
similar -config FILE
similar -matrix
Anyone have a config file?
In Windows it looks like this.
sf5.cfg

Code: Select all

exe = sf5.exe
name = Stockfish 5
Hash = 64
Usage:

Code: Select all

sim03w64.exe -c sf5.cfg 20
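On Linux it should look much the same; I have not tried it there myself, and the binary name inside the zip may differ (the usage text above calls it similar), so check the README:

sf5.cfg

Code: Select all

exe = ./stockfish
name = Stockfish 5
Hash = 64

Usage:

Code: Select all

./similar -c sf5.cfg 20
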
velmarin
Posts: 1600
Joined: Mon Feb 21, 2011 9:48 am

Re: Similarity test

Post by velmarin »

The application is very short on options, or they are not documented.
It happens that SMP engines run with all of the machine's threads,
while others, such as Stockfish, run with only one.
This may change some results.
Running the application with a very short time, for example 10 ms, shows that some engines fly through the positions while others take quite a while.
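If the config file accepts other UCI options the same way Hash is set in Ferdy's example (I have not verified that it does), pinning every engine to a single thread might make the comparison fairer, something like:

Code: Select all

exe = sf5.exe
name = Stockfish 5
Hash = 64
Threads = 1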