Clone detection test

Discussion of chess software programming and technical issues.

Moderator: Ras

Dann Corbit
Posts: 12792
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Clone detection test

Post by Dann Corbit »

Shaun wrote:
Dann Corbit wrote:
Uri Blass wrote:
Dann Corbit wrote:
swami wrote:
BubbaTough wrote:
swami wrote:
I hope STS can be used for the clone detection test, since it also comes with partial-credit moves; that maximizes the probing, and the choices of engines can be assessed more easily and comprehensively.

The next version will probably be released by the end of this month, and we will have 1000 positions.
I thought the point of STS was that there is an objectively best move (as well as possibly a 2nd or 3rd best for partial credit). If this is the case, then the better the programs are, the more they will look like each other in terms of STS results. If anything, the positions you have rejected are more likely to be good test candidates, because presumably you rejected them as not having a clear best move. Or, better yet, the positions you did not even consider using, because it is completely unclear what the best move might be.

-Sam
Yes, you have raised an interesting point. I've usually sent about 160-200 positions to Dann, of which 100 get selected because they have best moves as well as partial-credit ones. What about the rejected ones? They are rejected because there is no objectively best solution.

So, yes, the rejected positions might be good test candidates, because they assess an engine's choices in positions where there is no clear best move. Perhaps Dann has saved a list of all the rejected positions for each of the tests?
I have decided that the entire idea is a very bad one.

The idea is really just like the ponder-hit statistic of CCRL (IOW, pairs of engines that very frequently have the same PV nodes). Consider this query:
http://www.computerchess.org.uk/ccrl/40 ... es+only%29

Code:

# Pair Ponder hit Moves counted
1 Rybka 2.3.2a 64-bit 2CPU – Naum 3 64-bit 4CPU 84.1 981 
2 Pro Deo 1.1 Silver – Booot 4.11.1 83.8 439 
3 Booot 4.11.1 – Zeus 1.28 83.3 654 
4 Rybka 2.3.2a 64-bit – Naum 3 64-bit 4CPU 83.3 1386 
5 Sloppy 0.1.1 – Booot 4.11.1 83.3 460 
6 GreKo 5.5 – Deuterium 06.08.25.04 83.0 383 
7 Rybka 3 64-bit – Naum 3 64-bit 82.7 394 
8 Naum 3 64-bit – Deep Sjeng 3.0 64-bit 1CPU 82.7 306 
9 BBChess 1.3a – Cyrano 0.2f 82.6 397 
10 Dragon 4.6 – Tytan 9.3 82.3 368 
11 Matacz 1.1 – Homer 2.0 82.2 573 
12 Naum 2.0 32-bit – Delfi 5.4 82.2 1379 
13 Delfi 5.2 – Hamsters 0.6 81.6 2312 
14 Uralochka 1.1b – AliChess 4.08 81.5 536 
15 Stockfish 1.4 32-bit – Booot 4.15.0 81.5 1700 
16 Chessmaster 11 – Delfi 5.2 81.4 1996 
17 Ufim 8.02 – Rotor 0.4 81.4 2716 
18 Naum 3 64-bit – Glaurung 2.1 64-bit 81.3 578 
19 Hiarcs Paderborn 2007 – Chess Tiger 2007.1 81.3 1761 
20 Alf 1.09 – Matheus 2.3 81.2 351 
21 Naum 4 32-bit – TheMadPrune 1.1.25 81.1 1259 
22 Ufim 7.01 – Homer 2.0 81.0 596 
23 Booot 4.11.1 – Tytan 9.3 81.0 405 
24 Toga II 1.4 beta5c – Naum 2.2 64-bit 80.9 1327 
25 Movei 00.8.438 (10 10 10) – Dragon 4.6 80.9 482 
26 Rybka 2.3.2a 64-bit 2CPU – Naum 3 64-bit 2CPU 80.8 2108 
27 Chess Tiger 2007.1 – Chessmaster 11 80.8 1973 
28 Tornado 2.2 – Pupsi2 0.07 80.7 990 
29 Naum 2.2 64-bit – Hiarcs 11.2 80.7 378 
30 Arasan 10.1 – LittleThought 1.00 32-bit 80.7 409 
Are Rybka and Naum clones of each other?
How about Booot and ProDeo?
Chessmaster and Delfi?

This is an incredibly thorough analysis of engines that "think alike", and yet what it shows me is that no genealogy can be inferred from it.
I think that the number of moves is clearly too small, and the moves are from games, so they are not independent.

It is possible that a program was involved in some tactical games with many forced moves.

Taking fixed positions is clearly better than playing games and calculating ponder hits.

Uri
The number of moves counted ranges from 306 to 2716. It will include all phases of the game, since the moves are counted from games played between the opponents. If we ignore all of the pairs for which the number of moves counted is less than 1000, I don't think it changes anything:

Code:

# Pair Ponder hit Moves counted 
4 Rybka 2.3.2a 64-bit – Naum 3 64-bit 4CPU 83.3 1386 
12 Naum 2.0 32-bit – Delfi 5.4 82.2 1379 
13 Delfi 5.2 – Hamsters 0.6 81.6 2312 
15 Stockfish 1.4 32-bit – Booot 4.15.0 81.5 1700 
16 Chessmaster 11 – Delfi 5.2 81.4 1996 
17 Ufim 8.02 – Rotor 0.4 81.4 2716 
19 Hiarcs Paderborn 2007 – Chess Tiger 2007.1 81.3 1761 
21 Naum 4 32-bit – TheMadPrune 1.1.25 81.1 1259 
24 Toga II 1.4 beta5c – Naum 2.2 64-bit 80.9 1327 
26 Rybka 2.3.2a 64-bit 2CPU – Naum 3 64-bit 2CPU 80.8 2108 
27 Chess Tiger 2007.1 – Chessmaster 11 80.8 1973 
See above - perhaps we need to remove the ponder-hit stats until the scripts can be updated to provide reliable stats again...
Perhaps the parser used and the relevant equations can be given, and we could parse the source files ourselves (I guess that the information is simply derived from head-to-head matches, with ponder moves pulled from the engine PVs).
Shaun
Posts: 323
Joined: Wed Mar 08, 2006 9:55 pm
Location: Brighton - UK

Re: Clone detection test

Post by Shaun »

Dann Corbit wrote:Perhaps the parser used and the relevant equations can be given, and we could parse the source files ourselves (I guess that the information is simply derived from head-to-head matches, with ponder moves pulled from the engine PVs).
You are correct: the ponder hits are calculated from the ponder information reported in the PGN. I have not looked at Kirill's scripts in this area, so I would rather not comment, in case I give the wrong information. However, our games can be downloaded, and the PGN with comments should include all the ponder information that the GUI / engine combination provided.
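
If anyone wants to attempt such a reparse, a rough sketch in Python with the python-chess library might look like the following. It assumes each move's comment carries the engine's PV in SAN, beginning with the predicted reply (e.g. "+0.31/17 Nf3 d5 c4 ..."); the real comment format may well differ, so treat it as a starting point only.

Code:

import chess.pgn  # pip install python-chess

def ponder_hits(pgn_path):
    """Count how often a comment's predicted reply matches the
    move that was actually played next."""
    hits = counted = 0
    with open(pgn_path) as f:
        while True:
            game = chess.pgn.read_game(f)
            if game is None:
                break
            nodes = list(game.mainline())
            for prev, cur in zip(nodes, nodes[1:]):
                board = prev.board()  # position after prev's move
                predicted = None
                for token in prev.comment.split():
                    try:
                        predicted = board.parse_san(token)
                        break  # first legal SAN token = predicted reply
                    except ValueError:
                        continue  # skip score/depth tokens like +0.31/17
                if predicted is None:
                    continue
                counted += 1
                hits += predicted == cur.move
    return hits, counted

Dividing hits by counted then gives the ponder-hit rate for that file.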

Shaun
Mincho Georgiev
Posts: 454
Joined: Sat Apr 04, 2009 6:44 pm
Location: Bulgaria

Re: Clone detection test

Post by Mincho Georgiev »

Hi guys!
Wouldn't it be more correct if the test included similarities in the PV as well?
A single move could mislead, since the reason for picking it could even be a move-ordering bug. I know that a huge number of positions can suppress these kinds of possibilities, but they still remain. I think the test makes sense, but in a perfect world it would be more conclusive if combined with PV similarities and the evaluation value.
Dann Corbit
Posts: 12792
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Clone detection test

Post by Dann Corbit »

xcomponent wrote:Hi guys!
Wouldn't it be more correct if the test included similarities in the PV as well?
A single move could mislead, since the reason for picking it could even be a move-ordering bug. I know that a huge number of positions can suppress these kinds of possibilities, but they still remain. I think the test makes sense, but in a perfect world it would be more conclusive if combined with PV similarities and the evaluation value.
Kirill's data contains both the predicted move and the score.
However, it is quite easy to fudge the score (for example, multiply every score by 1.5 or by 0.75 and the engine will play the same moves).
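
To illustrate with made-up moves and scores: any positive scaling changes the reported evaluations but not which move scores highest, so the engine's play is untouched.

Code:

# Made-up root scores in centipawns. Scaling them all by a positive
# constant disguises the numbers but leaves the chosen move unchanged.
scores = {"Nf3": 31, "d4": 28, "e4": 25}
fudged = {m: s * 1.5 for m, s in scores.items()}
assert max(scores, key=scores.get) == max(fudged, key=fudged.get)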

I think that other interesting things might fall out of this sort of classification. For instance, we might discover families of engines that like to lock the position and play a slow, closed game. We might discover families of engines that like fireworks and pirates storming over the wall.
We might discover engines that can build a fortress or engines that can dismantle a fortress (and conversely those that can't).

I am beginning to think more and more that it is perhaps not a great idea to accuse someone of something very bad because his engine plays similarly to another engine. But it is also possible that some foolproof magic formula will emerge. In any case, the idea makes me very nervous.
Mincho Georgiev
Posts: 454
Joined: Sat Apr 04, 2009 6:44 pm
Location: Bulgaria

Re: Clone detection test

Post by Mincho Georgiev »

Dann Corbit wrote:
xcomponent wrote:Hi guys!
Wouldn't it be more correct if the test included similarities in the PV as well?
A single move could mislead, since the reason for picking it could even be a move-ordering bug. I know that a huge number of positions can suppress these kinds of possibilities, but they still remain. I think the test makes sense, but in a perfect world it would be more conclusive if combined with PV similarities and the evaluation value.
Kirill's data contains both the predicted move and the score.
However, it is quite easy to fudge the score (for example, multiply every score by 1.5 or by 0.75 and the engine will play the same moves).

I think that other interesting things might fall out of this sort of classification. For instance, we might discover families of engines that like to lock the position and play a slow, closed game. We might discover families of engines that like fireworks and pirates storming over the wall.
We might discover engines that can build a fortress or engines that can dismantle a fortress (and conversely those that can't).

I am beginning to think more and more that it is perhaps not a great idea to accuse someone of something very bad because his engine plays similarly to another engine. But it is also possible that some foolproof magic formula will emerge. In any case, the idea makes me very nervous.
I agree. Tests like this one are probably more valuable for analysis and style similarity than for plagiarism detection. But there is still one direction in which it may be very useful: if a new closed-source engine arrives and the test shows, let's say, about 99% similarity to an already known engine, that could be a serious argument for a cloning issue.
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Clone detection test

Post by Don »

xcomponent wrote:Hi guys!
Wouldn't it be more correct if the test included similarities in the PV as well?
A single move could mislead, since the reason for picking it could even be a move-ordering bug. I know that a huge number of positions can suppress these kinds of possibilities, but they still remain. I think the test makes sense, but in a perfect world it would be more conclusive if combined with PV similarities and the evaluation value.
I believe the only reliable measure is the actual move played because everything else can be faked. But you cannot fake the move.

But even if it were not faked, I'm not sure the PV is very reliable. In my own program the PV changes frequently, even if the first move does not. My guess is that each successive move is increasingly unreliable as a measure of similarity. I admit that I'm guessing here, but my sense is that even if it were an improvement, it would be only a very minor one. And as I mentioned, it can still be faked.

But there is no need to guess: try the experiment yourself - mine was very easy to construct - and see if you can produce a better measure using many moves of the PV.
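
For example, a bare move-match measure takes only a few lines; the {fen: best_move} dictionaries here are just an assumed storage format for each engine's choices:

Code:

def similarity(choices_a, choices_b):
    """Fraction of shared positions where both engines chose
    the same move."""
    common = choices_a.keys() & choices_b.keys()  # positions both saw
    if not common:
        return 0.0
    agree = sum(choices_a[fen] == choices_b[fen] for fen in common)
    return agree / len(common)

Running this over a large, varied position set for every pair of engines would produce the same kind of ranking the CCRL ponder-hit table shows.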
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Clone detection test

Post by Don »

Dann Corbit wrote: I am beginning to think more and more that it is perhaps not a great idea to accuse someone of something very bad because his engine plays similarly to another engine. But it is also possible that some foolproof magic formula will emerge. In any case, the idea makes me very nervous.
I think you are being overly concerned. I have seen over the years many cheaters exposed (or let's just say misunderstandings cleared up) simply because the author of a program noticed that a program in a tournament was playing like his.

For instance, several years ago I am playing at the Dutch Computer Chess Championship and get an email from John Stanback, who is watching the games from the States and asks me to check into a problem: he notices that one of the programs is playing just like Zarkov.

In another tournament, Richard Lang is sitting across from a clone of his own program (in this case a perfect clone), and he notices that the program is just too similar to his, even though it's disguised in a different housing (it's one of those hardware chess computers).

In yet another tournament Bob Hyatt notices remotely that one of the contestants is playing exactly the same moves as Crafty.

As a result of these observations, the problems in each case were investigated and brought to some kind of resolution. Just noticing the similarity was not itself considered proof, but it was good enough to start asking questions.

So just relax - if we build this tool, it will be with the understanding that it's imperfect; it's just a crude measurement. Like almost any tool, it's not a bad thing in itself: someone could grab a screwdriver and use it as a weapon to hurt someone, but that doesn't mean we should not have screwdrivers.

So I believe this is a powerful test but like you I agree that it should not be used as a weapon to hit someone over the head with.

I could build a simple UCI test harness to run my test and produce a result file - anyone interested in doing some kind of blind test?
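
The harness itself would be small. A rough Python sketch of the idea (engine path, FEN list, and time budget are all placeholders):

Code:

import subprocess

def probe(engine_path, fens, movetime_ms=1000):
    """Ask one UCI engine for its best move in each position."""
    eng = subprocess.Popen([engine_path], stdin=subprocess.PIPE,
                           stdout=subprocess.PIPE, text=True, bufsize=1)

    def send(cmd):
        eng.stdin.write(cmd + "\n")

    def wait_for(token):  # read lines until one starts with token
        for line in eng.stdout:
            if line.startswith(token):
                return line.split()

    send("uci")
    wait_for("uciok")
    send("isready")
    wait_for("readyok")
    choices = {}
    for fen in fens:
        send("position fen " + fen)
        send("go movetime %d" % movetime_ms)
        choices[fen] = wait_for("bestmove")[1]  # "bestmove e2e4 ..."
    send("quit")
    eng.wait()
    return choices

Feeding the same FEN list to two engines and comparing the two result dictionaries gives exactly the data a move-match measure needs.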

Don
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Clone detection test

Post by diep »

Don wrote:
Dann Corbit wrote: I am beginning to think more and more that it is perhaps not a great idea to accuse someone of something very bad because his engine plays similarly to another engine. But it is also possible that some foolproof magic formula will emerge. In any case, the idea makes me very nervous.
I think you are being overly concerned. I have seen over the years many cheaters exposed (or let's just say misunderstandings cleared up) simply because the author of a program noticed that a program in a tournament was playing like his.

For instance, several years ago I am playing at the Dutch Computer Chess Championship and get an email from John Stanback, who is watching the games from the States and asks me to check into a problem: he notices that one of the programs is playing just like Zarkov.

In another tournament, Richard Lang is sitting across from a clone of his own program (in this case a perfect clone), and he notices that the program is just too similar to his, even though it's disguised in a different housing (it's one of those hardware chess computers).

In yet another tournament Bob Hyatt notices remotely that one of the contestants is playing exactly the same moves as Crafty.

As a result of these observations, the problems in each case were investigated and brought to some kind of resolution. Just noticing the similarity was not itself considered proof, but it was good enough to start asking questions.

So just relax - if we build this tool, it will be with the understanding that it's imperfect; it's just a crude measurement. Like almost any tool, it's not a bad thing in itself: someone could grab a screwdriver and use it as a weapon to hurt someone, but that doesn't mean we should not have screwdrivers.

So I believe this is a powerful test but like you I agree that it should not be used as a weapon to hit someone over the head with.

I could build a simple UCI test harness to run my test and produce a result file - anyone interested in doing some kind of blind test?

Don
I didn't follow the thread, but what's all this idiocy man?

Just install IDA Pro and look at the assembly code of an engine, and you know instantly whether it's a clone.

And nothing else can prove anything, unless the guy who clones is a major idiot; note that the majority are major idiots who copy things 100%. With source code available now, it's easy to make some modifications that change the behaviour.

So just take a look at the assembly code of the engine and you know it all. Simple as that.

Thanks,
Vincent
govert
Posts: 270
Joined: Thu Jan 15, 2009 12:52 pm

Re: Clone detection test

Post by govert »

How about we take a step back and review what has been said:

We have a tool which looks at the move made, and from that, we can determine similarities and differences in play style.

Let's leave it at that.

We could go on forever arguing about whether the full PV should be analyzed, or the ponder move, etc. Let's leave that to another tool and keep this simple.

Let's not aspire to do a clone detection test. Let's make a "Play Style Proximity Detector". We already have the specification for it, and the analysis done so far has yielded a lot of interesting information.

Then, if someone wants to use the results to claim that A may be a clone of B, they can do so, and the discussion can start for that particular case in a particular thread.
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Clone detection test

Post by Don »

diep wrote:
Don wrote:
Dann Corbit wrote: I am beginning to think more and more that it is perhaps not a great idea to accuse someone of something very bad because his engine plays similarly to another engine. But it is also possible that some foolproof magic formula will emerge. In any case, the idea makes me very nervous.
I think you are being overly concerned. I have seen over the years many cheaters exposed (or let's just say misunderstandings cleared up) simply because the author of a program noticed that a program in a tournament was playing like his.

For instance, several years ago I am playing at the Dutch Computer Chess Championship and get an email from John Stanback, who is watching the games from the States and asks me to check into a problem: he notices that one of the programs is playing just like Zarkov.

In another tournament, Richard Lang is sitting across from a clone of his own program (in this case a perfect clone), and he notices that the program is just too similar to his, even though it's disguised in a different housing (it's one of those hardware chess computers).

In yet another tournament Bob Hyatt notices remotely that one of the contestants is playing exactly the same moves as Crafty.

As a result of these observations, the problems in each case were investigated and brought to some kind of resolution. Just noticing the similarity was not itself considered proof, but it was good enough to start asking questions.

So just relax - if we build this tool, it will be with the understanding that it's imperfect; it's just a crude measurement. Like almost any tool, it's not a bad thing in itself: someone could grab a screwdriver and use it as a weapon to hurt someone, but that doesn't mean we should not have screwdrivers.

So I believe this is a powerful test but like you I agree that it should not be used as a weapon to hit someone over the head with.

I could build a simple UCI test harness to run my test and produce a result file - anyone interested in doing some kind of blind test?

Don
I didn't follow the thread, but what's all this idiocy man?

Just install IDA Pro and look at the assembly code of an engine, and you know instantly whether it's a clone.

And nothing else can prove anything, unless the guy who clones is a major idiot; note that the majority are major idiots who copy things 100%. With source code available now, it's easy to make some modifications that change the behaviour.

So just take a look at the assembly code of the engine and you know it all. Simple as that.

Thanks,
Vincent
Disassembled code looks different from different compilers, and reading it requires an expert. It's easy for you and me, but not for everyone. A lot of good C programmers do not know assembly.

The similarity tester would be a tool and nothing more. It would be used in conjunction with other things, such as the disassembler.

Don