Clone detection test


Edmund
Posts: 670
Joined: Mon Dec 03, 2007 3:01 pm
Location: Barcelona, Spain

Re: Clone detection test

Post by Edmund »

Don wrote:
Edmund wrote:
If I understand your setup correctly you replayed whole games, whilst Don Dailey just tests individual positions. The positions in this test are not very tactical in nature, as Don stated, but only represent the engine's style or preference.
I have no idea how you came to that conclusion. Larry ran the exact same test I did, except he used 5-ply tests and I used 0.1-second tests. These positions came from strong correspondence games, so many of them might be sequences of moves from games. However, the thousand positions we both used are taken from a much larger set, and those were scrambled so that they do not appear in any particular order.

This set was created 2 or 3 years ago, I don't remember. There was some attempt to remove positions that had obvious answers but I don't remember the specifics. I am just guessing here, but it seems like I did something like remove all positions where a variety of programs played the same moves at low depths. The goal was to remove positions that would just waste time running through the tester because they were obvious moves, such as recaptures that must be played or else you lose.

Looking at your games, at most moves all the engines play the same move. That is not surprising, as in most chess positions arising in a game one move will be significantly superior to the others. Only those moves are interesting where several different good moves are possible, maybe like in your first test game at move #61, where 9 engines played 3 different moves.

Furthermore, you only test whether or not a certain move was matched by all engines. Don, on the contrary, doesn't test whether or not a certain move was found, but whether it was matched with some other engine.

regards,
Edmund
I was quoting and referring to Nick M. with my statement - not to Larry K.

regards,
Edmund
mjlef
Posts: 1494
Joined: Thu Mar 30, 2006 2:08 pm

Re: Clone detection test

Post by mjlef »

Another idea using this test is to compare matching move percentages as depth increases. Assuming great search depth leads to better moves, shouldn't all programs eventually converge on the same best moves? How quickly? This might even measure how much various programs improve with depth.
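For illustration, here is a rough sketch of how one might measure that (in Python with the python-chess library; the engine names, paths, depths and FEN file are just placeholders, not anyone's actual tester):

```python
# Rough sketch: how often do pairs of engines pick the same move as depth grows?
# Requires the python-chess library; ENGINES paths and positions.fen are placeholders.
import itertools
import chess
import chess.engine

ENGINES = {"engineA": "./engineA", "engineB": "./engineB"}  # hypothetical paths
DEPTHS = [2, 4, 6, 8, 10]

with open("positions.fen") as f:
    fens = [line.strip() for line in f if line.strip()]

choices = {name: {d: [] for d in DEPTHS} for name in ENGINES}
for name, path in ENGINES.items():
    engine = chess.engine.SimpleEngine.popen_uci(path)
    for fen in fens:
        board = chess.Board(fen)
        for d in DEPTHS:
            result = engine.play(board, chess.engine.Limit(depth=d))
            choices[name][d].append(result.move)
    engine.quit()

# Agreement percentage for every engine pair at every depth.
for a, b in itertools.combinations(ENGINES, 2):
    for d in DEPTHS:
        same = sum(m1 == m2 for m1, m2 in zip(choices[a][d], choices[b][d]))
        print(f"{a} vs {b} @ depth {d}: {100.0 * same / len(fens):.1f}% matched")
```

If the convergence hypothesis holds, the printed percentages should rise with depth; how fast they rise would be the interesting part.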
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Clone detection test

Post by Don »

mjlef wrote:Another idea using this test is to compare matching move percentages as depth increases. Assuming great search depth leads to better moves, shouldn't all programs eventually converge on the same best moves? How quickly? This might even measure how much various programs improve with depth.
For this test to really be valid, I'm assuming that it's not necessary to converge on a single move - in many positions there should be a choice of 2 or more moves that are all good. Of course, ultimately it will probably turn out to be the case that one good move leads to checkmate in 57 moves with best play and another leads to checkmate in 58 moves.
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: Clone detection test

Post by Sven »

Don wrote:
mjlef wrote:Another idea using this test is to compare matching move percentages as depth increases. Assuming great search depth leads to better moves, shouldn't all programs eventually converge on the same best moves? How quickly? This might even measure how much various programs improve with depth.
For this test to really be valid, I'm assuming that it's not necessary to converge on a single move - in many positions there should be a choice of 2 or more moves that are all good. Of course, ultimately it will probably turn out to be the case that one good move leads to checkmate in 57 moves with best play and another leads to checkmate in 58 moves.
Or, to take it to the extreme, 30 moves draw and 8 moves lose.

It is possible that move choices of engines are not significant enough in some cases to allow that kind of comparison. Could matching the whole (final) PVs be an improvement? We would need a definition for "closeness of two PVs", and we would need access to the PVs themselves somehow. The latter should be easy.
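Just as an illustration of what such a definition could look like (my own simple choice of metric, nothing anyone here has agreed on): score two PVs by the length of their common prefix relative to the longer one.

```python
# One possible "closeness" of two PVs (lists of moves in coordinate notation):
# length of the common prefix divided by the length of the longer PV.
def pv_closeness(pv_a, pv_b):
    if not pv_a or not pv_b:
        return 0.0  # no basis for comparison if either PV is missing
    prefix = 0
    for m1, m2 in zip(pv_a, pv_b):
        if m1 != m2:
            break
        prefix += 1
    return prefix / max(len(pv_a), len(pv_b))

# Identical first three moves, then the lines diverge:
print(pv_closeness(["e2e4", "e7e5", "g1f3", "b8c6"],
                   ["e2e4", "e7e5", "g1f3", "g8f6"]))  # 0.75
```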

Sven
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Clone detection test

Post by Don »

Sven Schüle wrote:
Don wrote:
mjlef wrote:Another idea using this test is to compare matching move percentages as depth increases. Assuming great search depth leads to better moves, shouldn't all programs eventually converge on the same best moves? How quickly? This might even measure how much various programs improve with depth.
For this test to really be valid, I'm assuming that it's not necessary to converge on a single move - in many positions there should be a choice of 2 or more moves that are all good. Of course, ultimately it will probably turn out to be the case that one good move leads to checkmate in 57 moves with best play and another leads to checkmate in 58 moves.
Or, to take it to the extreme, 30 moves draw and 8 moves lose.

It is possible that move choices of engines are not significant enough in some cases to allow that kind of comparison. Could matching the whole (final) PVs be an improvement? We would need a definition for "closeness of two PVs", and we would need access to the PVs themselves somehow. The latter should be easy.

Sven
It's a thought. But it would be messy as we cannot depend on the length of the PV. UCI does not require that a PV even be sent, but a best move must be sent for the program to function.
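For what it's worth, the parsing side would not be the hard part; a sketch like the one below relies only on the mandatory "bestmove" line and treats the "info ... pv" lines as optional (the sample output is made up):

```python
# Pull the best move and the last PV (if any) out of raw UCI output lines.
# Only "bestmove" is guaranteed by the protocol; a PV may never be printed.
def parse_uci_output(lines):
    best_move, last_pv = None, None
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue
        if tokens[0] == "bestmove" and len(tokens) > 1:
            best_move = tokens[1]
        elif tokens[0] == "info" and "pv" in tokens:
            last_pv = tokens[tokens.index("pv") + 1:]
    return best_move, last_pv

sample = ["info depth 12 score cp 31 pv e2e4 e7e5 g1f3",
          "bestmove e2e4 ponder e7e5"]
print(parse_uci_output(sample))  # ('e2e4', ['e2e4', 'e7e5', 'g1f3'])
```

The messy part is deciding what to do when one engine reports a long PV, another a short one, and a third none at all.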

Is anyone interested in helping me do this?
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Clone detection test

Post by michiguel »

Don wrote:
Sven Schüle wrote:
Don wrote:
mjlef wrote:Another idea using this test is to compare matching move percentages as depth increases. Assuming great search depth leads to better moves, shouldn't all programs eventually converge on the same best moves? How quickly? This might even measure how much various programs improve with depth.
For this test to really be valid, I'm assuming that it's not necessary to converge on a single move - in many positions there should be a choice of 2 or more moves that are all good. Of course, ultimately it will probably turn out to be the case that one good move leads to checkmate in 57 moves with best play and another leads to checkmate in 58 moves.
Or, to take it to the extreme, 30 moves draw and 8 moves lose.

It is possible that move choices of engines are not significant enough in some cases to allow that kind of comparison. Could matching the whole (final) PVs be an improvement? We would need a definition for "closeness of two PVs", and we would need access to the PVs themselves somehow. The latter should be easy.

Sven
It's a thought. But it would be messy as we cannot depend on the length of the PV. UCI does not require that a PV even be sent, but a best move must be sent for the program to function.

Is anyone interested in helping me do this?
The test design is perfect as it is. It only gets better with more positions and a further statistical analysis. You can always beat noise with more data. If you start picking and choosing you fall into other risks.

What help do you need?

If you make available the .res files of new engines you test, I could easily put them in the tree graphical representation. I already wrote a Ruby script that processes the whole thing.
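As a rough illustration of that kind of post-processing (not the actual Ruby script, and the .res layout below - one "position-id move" pair per line - is purely a guess), the pairwise match percentages could then be fed into a clustering step to draw the tree:

```python
# Turn per-engine result files into pairwise match percentages.
# The "position-id move" per-line format is a hypothetical stand-in for the
# real .res layout, which is not described in this thread.
import itertools

def load_res(path):
    moves = {}
    with open(path) as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 2:
                moves[parts[0]] = parts[1]
    return moves

def similarity(res_a, res_b):
    shared = set(res_a) & set(res_b)   # positions both engines answered
    if not shared:
        return 0.0
    return sum(res_a[p] == res_b[p] for p in shared) / len(shared)

engines = {"engineA": "engineA.res", "engineB": "engineB.res"}  # placeholders
results = {name: load_res(path) for name, path in engines.items()}
for a, b in itertools.combinations(results, 2):
    print(a, b, round(100.0 * similarity(results[a], results[b]), 1))
```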

Can you test one weak engine? I am really curious.

Can epd2wb be used to test winboard engines? I never tried. How about the wb2uci adapter?

Caveat: This does not detect clones, it detects "similarity of style". However...

Miguel
PS: This is very useful to choose which engines you should pick for long term analysis to get a bigger diversity of options.
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Clone detection test

Post by Don »

michiguel wrote:
Don wrote:
Sven Schüle wrote:
Don wrote:
mjlef wrote:Another idea using this test is to compare matching move percentages as depth increases. Assuming great search depth leads to better moves, shouldn't all programs eventually converge on the same best moves? How quickly? This might even measure how much various programs improve with depth.
For this test to really be valid, I'm assuming that it's not necessary to converge on a single move - in many positions there should be a choice of 2 or more moves that are all good. Of course, ultimately it will probably turn out to be the case that one good move leads to checkmate in 57 moves with best play and another leads to checkmate in 58 moves.
Or, to take it to the extreme, 30 moves draw and 8 moves lose.

It is possible that move choices of engines are not significant enough in some cases to allow that kind of comparison. Could matching the whole (final) PVs be an improvement? We would need a definition for "closeness of two PVs", and we would need access to the PVs themselves somehow. The latter should be easy.

Sven
It's a thought. But it would be messy as we cannot depend on the length of the PV. UCI does not require that a PV even be sent, but a best move must be sent for the program to function.

Is anyone interested in helping me do this?
The test design is perfect as it is. It only gets better with more positions and a further statistical analysis. You can always beat noise with more data. If you start picking and choosing you fall into other risks.

What help do you need?

If you make available the .res files of new engines you test, I could easily put them in the tree graphical representation. I already wrote a Ruby script that processes the whole thing.

Can you test one weak engine? I am really curious.

Can epd2wb be used to test winboard engines? I never tried. How about the wb2uci adapter?

Caveat: This does not detect clones, it detects "similarity of style". However...

Miguel
PS: This is very useful to choose which engines you should pick for long term analysis to get a bigger diversity of options.
My desire is to have a test that can run very quickly, not to have a test that requires massive amounts of time to run in order to beat back noise. There are a lot of positions where we can identify the best move with nearly 100% certainty and those should not be in the test. Those positions test the strength of the program, not the style.

If you go take a math test, you are not going to see questions given on subjects that have nothing to do with math. Playing the only move to avoid checkmate in 1 is not relevant to this test and should not be included. That is an obvious example to illustrate my point.

At least one simple thing that we can do is run 50 different programs on the positions at very high levels and just throw out the positions where they all agree on what the best move is.
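Something like the following sketch would do it, assuming the high-depth choices have already been collected into a map from position to the list of moves the reference programs played (the data here is made up):

```python
# Drop positions on which every reference program already agrees at high depth;
# those measure strength (obvious or forced moves) rather than style.
def discriminating_positions(choices_by_position):
    return [fen for fen, moves in choices_by_position.items()
            if len(set(moves)) > 1]  # keep only positions with a real split

sample = {
    "pos1": ["e2e4", "e2e4", "e2e4"],  # unanimous -> discard
    "pos2": ["d2d4", "c2c4", "d2d4"],  # split     -> keep
}
print(discriminating_positions(sample))  # ['pos2']
```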

There will be no perfect way to do this, no argument there. But if there is a way to make this similarity test more accurate without imposing long run times, that is what we want to do. After all, we all do that with our chess programs don't we? I want it to play stronger in less time. Of course I can just run it at 2 weeks per move and get much stronger play but ...
Spacious_Mind
Posts: 317
Joined: Mon Nov 02, 2009 12:05 am
Location: Alabama

Re: Clone detection test

Post by Spacious_Mind »

Don wrote:The number 595 is totally arbitrary, it means nothing. I did not intend that value to represent the distinctions between what is a clone and what is not a clone - I picked it only because that was the point in the data set where you could draw a line that separated related programs from unrelated programs.

If I ran the test on another set of programs, the line would almost certainly show up a little higher or a little lower or you might even get overlap and mis-identification. That all needs to be checked out before the test is taken too seriously.

Think of the line as not black and white but a big fat grey band which is fuzzy. Anything near this line could be disputable, anything well above it would appear to be highly related and everything below it is not very related.

Spacious_Mind wrote:I am really sorry for perhaps asking some questions for which the answers from all you learned gentlemen might be far too obvious. But perhaps it will help me understand what you are trying to achieve and enable me to try some other tests with dedicateds. If someone would kindly answer the question below, which I am asking based on your proposed test, I would appreciate it.
I have some interest in clone tests for dedicated computers, and here is a test I made a while back on a few of them, which is not too dissimilar to what you are now discussing, but obviously very basic.

http://spacious-mind.com/html/gk_2100_clones_test.html

Now in my test I had several known clones and also two computers that are known to be totally different but of similar playing strength, and one previously suspected clone.

As you can see, the real clones scored on average about 95%, playing back exactly the same moves. The computer related to the clones played back 78% of all the moves, and the unrelated computers 65% and 60%.

Now here is my question with regard to your proposed tests. If you assume that each move was a board set-up move (game position), then for dedicateds the number 595 from your test would, in my test, suggest that all the computers are clones of each other. Ponder was off in my test, so each position was in reality a unique test position.

Therefore, what I am not sure about is the number 595 and how a good number could be chosen for your test, because in my dedicated test that threshold would actually mean that each computer in the test was in fact a clone.

Or did I misunderstand your proposed test?

Best regards

Nick
Thanks a lot Don, I think now we are more in alignment. Your 595 is similar to what I would state with dedicateds if I were to say that relatives start at 70% and end at 90% of all moves replayed in a game. Since I am sure your positions would not have totally obvious or forced next-move situations that all machines would have to play, your number for a starting point of relationship suspicion would likely be lower than the 70% assumption in my test.

Thanks and best regards

Nick
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Clone detection test

Post by michiguel »

Don wrote:
michiguel wrote:
Don wrote:
Sven Schüle wrote:
Don wrote:
mjlef wrote:Another idea using this test is to compare matching move percentages as depth increases. Assuming great search depth leads to better moves, shouldn't all programs eventually converge on the same best moves? How quickly? This might even measure how much various programs improve with depth.
For this test to really be valid, I'm assuming that it's not necessary to converge on a single move - in many positions there should be a choice of 2 or more moves that are all good. Of course, ultimately it will probably turn out to be the case that one good move leads to checkmate in 57 moves with best play and another leads to checkmate in 58 moves.
Or, to take it to the extreme, 30 moves draw and 8 moves lose.

It is possible that move choices of engines are not significant enough in some cases to allow that kind of comparison. Could matching the whole (final) PVs be an improvement? We would need a definition for "closeness of two PVs", and we would need access to the PVs themselves somehow. The latter should be easy.

Sven
It's a thought. But it would be messy as we cannot depend on the length of the PV. UCI does not require that a PV even be sent, but a best move must be sent for the program to function.

Is anyone interested in helping me do this?
The test design is perfect as it is. It only gets better with more positions and a further statistical analysis. You can always beat noise with more data. If you start picking and choosing you fall into other risks.

What help do you need?

If you make available the .res files of new engines you test, I could easily put them in the tree graphical representation. I already wrote a Ruby script that processes the whole thing.

Can you test one weak engine? I am really curious.

Can epd2wb be used to test winboard engines? I never tried. How about the wb2uci adapter?

Caveat: This does not detect clones, it detects "similarity of style". However...

Miguel
PS: This is very useful to choose which engines you should pick for long term analysis to get a bigger diversity of options.
My desire is to have a test that can run very quickly, not to have a test that requires massive amounts of time to run in order to beat back noise. There are a lot of positions where we can identify the best move with nearly 100% certainty and those should not be in the test. Those positions test the strength of the program, not the style.

If you go take a math test, you are not going to see questions given on subjects that have nothing to do with math. Playing the only move to avoid checkmate in 1 is not relevant to this test and should not be included. That is an obvious example to illustrate my point.

At least one simple thing that we can do is run 50 different programs on the positions at very high levels and just throw out the positions where they all agree on what the best move is.
I agree, but to do that you have to run them first. You already have the data :-)
Seriously, I am not talking about gazillions of positions. Bumping the number up 10x would be a huge statistical improvement. Your test is already quite fast.
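To put a rough number on it, the usual binomial back-of-the-envelope (assuming the positions behave independently, which is only approximately true) says the noise shrinks with the square root of the number of positions:

```python
# Standard error of a match percentage shrinks with sqrt(n), so 10x the
# positions gives roughly sqrt(10) ~ 3.2x less noise on the same percentage.
from math import sqrt

def match_pct_stderr(p, n):
    return 100.0 * sqrt(p * (1.0 - p) / n)

for n in (1000, 10000):
    print(n, round(match_pct_stderr(0.6, n), 2))  # ~1.55 vs ~0.49 percentage points
```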

Miguel

There will be no perfect way to do this, no argument there. But if there is a way to make this similarity test more accurate without imposing long run times, that is what we want to do. After all, we all do that with our chess programs don't we? I want it to play stronger in less time. Of course I can just run it at 2 weeks per move and get much stronger play but ...
Graham Banks
Posts: 44653
Joined: Sun Feb 26, 2006 10:52 am
Location: Auckland, NZ

Re: Clone detection test

Post by Graham Banks »

Thanks, Don, to yourself and any other programmers involved. If you can come up with a watertight clone detection test, it's going to save a lot of future hassles.

Cheers,
Graham.
gbanksnz at gmail.com