Clone detection test

Discussion of chess software programming and technical issues.

Moderator: Ras

swami
Posts: 6662
Joined: Thu Mar 09, 2006 4:21 am

Re: Clone detection test

Post by swami »

BubbaTough wrote:
swami wrote:
I hope STS can be used for the clone detection test, as it also comes with partial-credit moves, which maximize the probing, so the engines' choices can be assessed more easily and comprehensively.

The next version will probably be released by the end of this month, and it will have 1000 positions.
I thought the point of STS was that there was an objectively best move (as well as possibly a 2nd or 3rd best for partial credit). If this is the case, then the better the programs are, the more they would look like each other in terms of STS results. If anything, the positions you have rejected are more likely to be good test candidates, because presumably you rejected them for not having a clear best move. Or better yet, the positions you did not even consider using, because it is completely unclear what the best move might be.

-Sam
Yes, you have raised an interesting point. I usually send about 160-200 positions to Dann, of which 100 get selected because they have a best move as well as partial-credit moves. What about the rejected ones? They are rejected because there's no objectively best solution.

So, yes, the rejected positions might be good test candidates, because they assess the engines' choices in positions where there's no clear best move. Perhaps Dann has saved a list of all the rejected positions for each of the tests?
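BubbaTough's objection can be made concrete with an elementary probability bound. As a sketch, assume every engine is scored on the same positions, write m* for the designated best move, and let s_A = P(m_A = m*) be the fraction of positions in which engine A finds it. Then two engines must agree at least this often, whatever their genealogy:

$$
P(m_A = m_B) \;\ge\; P(m_A = m^{*} \wedge m_B = m^{*}) \;\ge\; s_A + s_B - 1
$$

For example, two engines that each find the best move in 90% of the positions must agree on at least 0.9 + 0.9 - 1 = 80% of them, so on a set built around clear best moves, high agreement mostly reflects playing strength.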
Dann Corbit
Posts: 12792
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Clone detection test

Post by Dann Corbit »

swami wrote:
So, yes, the rejected positions might be good test candidates, because they assess the engines' choices in positions where there's no clear best move. Perhaps Dann has saved a list of all the rejected positions for each of the tests?
I have decided that the entire idea is a very bad one.

The idea is really just like the CCRL ponder-hit statistic (IOW, engines that very frequently have the same PV nodes). Consider this query:
http://www.computerchess.org.uk/ccrl/40 ... es+only%29

# Pair Ponder hit (%) Moves counted
1 Rybka 2.3.2a 64-bit 2CPU – Naum 3 64-bit 4CPU 84.1 981 
2 Pro Deo 1.1 Silver – Booot 4.11.1 83.8 439 
3 Booot 4.11.1 – Zeus 1.28 83.3 654 
4 Rybka 2.3.2a 64-bit – Naum 3 64-bit 4CPU 83.3 1386 
5 Sloppy 0.1.1 – Booot 4.11.1 83.3 460 
6 GreKo 5.5 – Deuterium 06.08.25.04 83.0 383 
7 Rybka 3 64-bit – Naum 3 64-bit 82.7 394 
8 Naum 3 64-bit – Deep Sjeng 3.0 64-bit 1CPU 82.7 306 
9 BBChess 1.3a – Cyrano 0.2f 82.6 397 
10 Dragon 4.6 – Tytan 9.3 82.3 368 
11 Matacz 1.1 – Homer 2.0 82.2 573 
12 Naum 2.0 32-bit – Delfi 5.4 82.2 1379 
13 Delfi 5.2 – Hamsters 0.6 81.6 2312 
14 Uralochka 1.1b – AliChess 4.08 81.5 536 
15 Stockfish 1.4 32-bit – Booot 4.15.0 81.5 1700 
16 Chessmaster 11 – Delfi 5.2 81.4 1996 
17 Ufim 8.02 – Rotor 0.4 81.4 2716 
18 Naum 3 64-bit – Glaurung 2.1 64-bit 81.3 578 
19 Hiarcs Paderborn 2007 – Chess Tiger 2007.1 81.3 1761 
20 Alf 1.09 – Matheus 2.3 81.2 351 
21 Naum 4 32-bit – TheMadPrune 1.1.25 81.1 1259 
22 Ufim 7.01 – Homer 2.0 81.0 596 
23 Booot 4.11.1 – Tytan 9.3 81.0 405 
24 Toga II 1.4 beta5c – Naum 2.2 64-bit 80.9 1327 
25 Movei 00.8.438 (10 10 10) – Dragon 4.6 80.9 482 
26 Rybka 2.3.2a 64-bit 2CPU – Naum 3 64-bit 2CPU 80.8 2108 
27 Chess Tiger 2007.1 – Chessmaster 11 80.8 1973 
28 Tornado 2.2 – Pupsi2 0.07 80.7 990 
29 Naum 2.2 64-bit – Hiarcs 11.2 80.7 378 
30 Arasan 10.1 – LittleThought 1.00 32-bit 80.7 409 
Are Rybka and Naum clones of each other?
How about Booot and ProDeo?
Chessmaster and Delfi?

This is an incredibly thorough analysis of engines that "think alike", and yet what it shows me is that no correlation with genealogy can be assumed from this.
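For readers unfamiliar with the statistic: roughly, a ponder hit is counted when the move an engine expected its opponent to play (the second move of its PV, the one it ponders on) is the move the opponent actually played. The sketch below assumes the expected and actual moves have already been extracted from the PGN; the function name and input format are hypothetical and are not CCRL's actual scripts.

```python
def ponder_hit_rate(pairs):
    """Return the ponder-hit percentage for a list of (expected, played) moves.

    Entries with no expected move (None) are skipped, since no prediction
    was made for them and they should not count as misses.
    """
    counted = [(exp, played) for exp, played in pairs if exp is not None]
    if not counted:
        return 0.0
    hits = sum(1 for exp, played in counted if exp == played)
    return 100.0 * hits / len(counted)


# Three predictions, two of which match the move actually played.
moves = [("e7e5", "e7e5"), ("g8f6", "b8c6"), (None, "d7d5"), ("f8c5", "f8c5")]
print(round(ponder_hit_rate(moves), 1))  # 66.7
```

Skipping moves without a prediction matters in practice: if a GUI stops writing the expected move into the PGN, counting those moves as misses would silently drag the percentage down.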
Uri Blass
Posts: 10905
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Clone detection test

Post by Uri Blass »

Dann Corbit wrote:
This is an incredibly thorough analysis of engines that "think alike", and yet what it shows me is that no correlation with genealogy can be assumed from this.
I think that the number of moves is clearly too small, and they come from games, so they are not independent.

It is possible that a program was involved in some tactical games with many forced moves.

Taking fixed positions is clearly better than playing games and calculating ponder hits.

Uri
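Uri's sample-size concern can be put into numbers with the usual normal approximation to the binomial. This is a sketch: it treats every counted move as an independent trial, which, as Uri points out, moves from the same games are not, so the real uncertainty is larger than these intervals suggest. The function name is hypothetical; the rows are taken from the CCRL table quoted above.

```python
import math


def ponder_hit_interval(percent, n, z=1.96):
    """Approximate 95% interval for a hit rate given in percent over n moves.

    Assumes independent trials (normal approximation to the binomial);
    correlated moves from the same game make the true interval wider.
    """
    p = percent / 100.0
    half_width = z * math.sqrt(p * (1.0 - p) / n) * 100.0
    return percent - half_width, percent + half_width


# (ponder hit %, moves counted) for three rows of the CCRL table.
for pct, n in [(82.7, 306), (84.1, 981), (81.4, 2716)]:
    lo, hi = ponder_hit_interval(pct, n)
    print(f"{pct:.1f}% over {n} moves: {lo:.1f}% .. {hi:.1f}%")
```

With about 300 moves the half-width is roughly 4 percentage points, which is wider than the whole 80.7% to 84.1% spread of the top 30 pairs.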
Shaun
Posts: 323
Joined: Wed Mar 08, 2006 9:55 pm
Location: Brighton - UK

Re: Clone detection test

Post by Shaun »

Dann Corbit wrote:
This is an incredibly thorough analysis of engines that "think alike", and yet what it shows me is that no correlation with genealogy can be assumed from this.
Hi Dann,

we currently have problems with the ponder-hit stats: new GUIs and GUI versions have changed their PGN output, and this has broken some of the ponder-hit results.

:(

It is a lot of work to fix, and Kirill has a big list of enhancements and changes he wants to make.

Shaun
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Clone detection test

Post by michiguel »

Dann Corbit wrote:
This is an incredibly thorough analysis of engines that "think alike", and yet what it shows me is that no correlation with genealogy can be assumed from this.
There is a difference with these ponder-hit statistics: the positions are not the same for each engine pair. In other words, the comparison of engine A vs. B is done only with positions from their own games, while B vs. C uses another set of positions. This introduces a huge amount of noise that makes the whole thing impossible to handle.
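The alternative Miguel describes scores every engine on one shared set of fixed positions, so each pairwise match rate is computed on identical inputs. A minimal sketch, assuming each engine's chosen moves have already been collected in the same position order; the input format, engine names, and moves are hypothetical toy data.

```python
from itertools import combinations


def match_rates(choices):
    """Pairwise move-match rate on one shared set of positions.

    `choices` maps engine name -> list of chosen moves, one per position,
    in the same position order for every engine (hypothetical format).
    """
    engines = sorted(choices)
    n_positions = len(choices[engines[0]])
    if any(len(moves) != n_positions for moves in choices.values()):
        raise ValueError("every engine must be scored on the same positions")
    rates = {}
    for a, b in combinations(engines, 2):
        same = sum(1 for x, y in zip(choices[a], choices[b]) if x == y)
        rates[(a, b)] = same / n_positions
    return rates


# Toy data: three hypothetical engines on four positions.
choices = {
    "A": ["e2e4", "g1f3", "d2d4", "c2c4"],
    "B": ["e2e4", "g1f3", "c2c4", "c2c4"],
    "C": ["d2d4", "g1f3", "d2d4", "b1c3"],
}
print(match_rates(choices))
# {('A', 'B'): 0.75, ('A', 'C'): 0.5, ('B', 'C'): 0.25}
```

The length check enforces the point made above: a rate is only comparable across pairs if every pair was measured on the same positions.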

Again, the previous technique does not establish genealogy; it tries to measure statistical similarity in move selection with a clustering technique. Yes, it is a bad idea to extrapolate conclusions wildly beyond what it represents. It just adds an objective measure (or at least tries to) to the perception of some people when they say "this engine plays similarly to that one".

Miguel
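The post does not say which clustering method was used. As a stand-in, the sketch below uses plain agglomerative clustering with average linkage (UPGMA-style) and takes 1 minus the move-match rate as the distance between two engines; both choices, the function names, and the toy rates are assumptions for illustration.

```python
def average_linkage(names, dist):
    """Agglomerative clustering with average linkage (UPGMA-style).

    `dist(a, b)` returns the distance between two engines.
    Returns the merge history as (cluster1, cluster2, distance) tuples.
    """
    clusters = [frozenset([n]) for n in names]
    history = []

    def cluster_dist(c1, c2):
        # Mean of all engine-to-engine distances between the two clusters.
        total = sum(dist(a, b) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_dist(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        history.append((clusters[i], clusters[j], d))
        merged = clusters[i] | clusters[j]
        # Delete the higher index first so index i stays valid.
        del clusters[j]
        del clusters[i]
        clusters.append(merged)
    return history


# Hypothetical move-match rates for three engines.
rates = {("A", "B"): 0.75, ("A", "C"): 0.5, ("B", "C"): 0.25}


def dist(a, b):
    """Distance between two engines: 1 minus their move-match rate."""
    return 1.0 - rates[tuple(sorted((a, b)))]


for left, right, d in average_linkage(["A", "B", "C"], dist):
    print(sorted(left), "+", sorted(right), "at distance", d)
# ['A'] + ['B'] at distance 0.25
# ['C'] + ['A', 'B'] at distance 0.625
```

The merge history is what a dendrogram draws; engines that join late, far from the others, sit in "completely different branches".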
Shaun
Posts: 323
Joined: Wed Mar 08, 2006 9:55 pm
Location: Brighton - UK

Re: Clone detection test

Post by Shaun »

I would like to add that I really like the idea of a tool to highlight similar move selection. While IMO not proof one way or the other, it will make a useful tool for possible clone detection. It will also highlight engines with very different move selection, which is ideal for selecting analysis partners.

A public tool that also allows you to substitute your own positions and experiment with different time controls would be FANTASTIC.

Shaun
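The core of such a tool is small, because the standard UCI protocol already supports analysing an arbitrary position for a fixed time: send `position fen ...` followed by `go movetime <ms>`, then read the engine's `bestmove <move> [ponder <move>]` reply. The sketch below only builds the commands and parses the reply; launching the engine process and reading its output are left out, and the function names are hypothetical.

```python
def uci_commands(fen, movetime_ms):
    """UCI commands that ask an engine to analyse one position for a fixed time."""
    return [f"position fen {fen}", f"go movetime {movetime_ms}"]


def parse_bestmove(line):
    """Split a UCI 'bestmove' line into (best, ponder); ponder may be None."""
    tokens = line.split()
    if not tokens or tokens[0] != "bestmove":
        raise ValueError(f"not a bestmove line: {line!r}")
    best = tokens[1] if len(tokens) > 1 else None
    ponder = None
    if "ponder" in tokens:
        idx = tokens.index("ponder")
        if idx + 1 < len(tokens):
            ponder = tokens[idx + 1]
    return best, ponder


start = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
print(uci_commands(start, 1000))
print(parse_bestmove("bestmove e2e4 ponder e7e5"))  # ('e2e4', 'e7e5')
```

Substituting your own positions means swapping the FEN list, and experimenting with time controls means changing `movetime_ms`; the returned best moves feed straight into a pairwise comparison.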
Dann Corbit
Posts: 12792
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Clone detection test

Post by Dann Corbit »

Uri Blass wrote:
I think that the number of moves is clearly too small, and they come from games, so they are not independent.
The number of moves ranges from 306 to 2716. It includes all phases of the game, since the moves are counted from games played between the two opponents. Even if we ignore all pairs with fewer than 1000 moves counted, I don't think it changes anything.

# Pair Ponder hit Moves counted 
4 Rybka 2.3.2a 64-bit – Naum 3 64-bit 4CPU 83.3 1386 
12 Naum 2.0 32-bit – Delfi 5.4 82.2 1379 
13 Delfi 5.2 – Hamsters 0.6 81.6 2312 
15 Stockfish 1.4 32-bit – Booot 4.15.0 81.5 1700 
16 Chessmaster 11 – Delfi 5.2 81.4 1996 
17 Ufim 8.02 – Rotor 0.4 81.4 2716 
19 Hiarcs Paderborn 2007 – Chess Tiger 2007.1 81.3 1761 
21 Naum 4 32-bit – TheMadPrune 1.1.25 81.1 1259 
24 Toga II 1.4 beta5c – Naum 2.2 64-bit 80.9 1327 
26 Rybka 2.3.2a 64-bit 2CPU – Naum 3 64-bit 2CPU 80.8 2108 
27 Chess Tiger 2007.1 – Chessmaster 11 80.8 1973 
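The filter Dann applies is easy to reproduce. A short sketch with the (rank, ponder hit %, moves counted) values transcribed from the 30-row CCRL table quoted earlier; `filter_by_moves` is a hypothetical helper name.

```python
# (rank, ponder hit %, moves counted) for the 30 pairs in the CCRL table.
rows = [
    (1, 84.1, 981), (2, 83.8, 439), (3, 83.3, 654), (4, 83.3, 1386),
    (5, 83.3, 460), (6, 83.0, 383), (7, 82.7, 394), (8, 82.7, 306),
    (9, 82.6, 397), (10, 82.3, 368), (11, 82.2, 573), (12, 82.2, 1379),
    (13, 81.6, 2312), (14, 81.5, 536), (15, 81.5, 1700), (16, 81.4, 1996),
    (17, 81.4, 2716), (18, 81.3, 578), (19, 81.3, 1761), (20, 81.2, 351),
    (21, 81.1, 1259), (22, 81.0, 596), (23, 81.0, 405), (24, 80.9, 1327),
    (25, 80.9, 482), (26, 80.8, 2108), (27, 80.8, 1973), (28, 80.7, 990),
    (29, 80.7, 378), (30, 80.7, 409),
]


def filter_by_moves(rows, min_moves=1000):
    """Keep only the pairs with at least `min_moves` moves counted."""
    return [row for row in rows if row[2] >= min_moves]


kept = filter_by_moves(rows)
print([row[0] for row in kept])  # [4, 12, 13, 15, 16, 17, 19, 21, 24, 26, 27]
```

Rows 1 (981 moves) and 28 (990 moves) fall just under the cutoff, which is why the filtered table starts at rank 4.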
Dann Corbit
Posts: 12792
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Clone detection test

Post by Dann Corbit »

michiguel wrote:
Again, the previous technique does not establish genealogy; it tries to measure statistical similarity in move selection with a clustering technique. Yes, it is a bad idea to extrapolate conclusions wildly beyond what it represents. It just adds an objective measure (or at least tries to) to the perception of some people when they say "this engine plays similarly to that one".

Miguel
This I do agree with. I think that if we do not try to read more into it than we should, then it may be a good idea and not a bad one. But if we are using this idea alone to assume some sort of guilt, then I think it is the worst idea imaginable.
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Clone detection test

Post by michiguel »

Shaun wrote:
I would like to add that I really like the idea of a tool to highlight similar move detection. While IMO not proof one way or the other it will make a useful tool both for possible clone detection. It will also highlight engines with very different move selection, ideal for selecting analysis partners.
I think the latter is the most useful side of the technique. If you want to pick analysis partners, choose engines that are in completely different branches.

A public tool that also allows you to substitute your own positions and experiment with different time controls would be FANTASTIC.

Shaun


This is certainly doable.

Miguel
Shaun
Posts: 323
Joined: Wed Mar 08, 2006 9:55 pm
Location: Brighton - UK

Re: Clone detection test

Post by Shaun »

Dann Corbit wrote:
The number of moves ranges from 306 to 2716. It includes all phases of the game, since the moves are counted from games played between the two opponents. Even if we ignore all pairs with fewer than 1000 moves counted, I don't think it changes anything.
See above. Perhaps we need to remove the ponder-hit stats until the scripts can be updated to provide reliable stats again.