World Computer Chess Championship ?

Discussion of anything and everything relating to chess playing software and machines.

Moderators: hgm, Rebel, chrisw

User avatar
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: World Computer Chess Championship ?

Post by michiguel »

Uri Blass wrote:
Rebel wrote:
mar wrote:
Rebel wrote:This is BAD NEWS for cloners who think they can take existing source code, modify all the eval values (some even use multiplication), and get away with it. Playing style is hard to remove from an engine.
I dare to disagree, Ed. Playing style is determined by the eval in the first place, so I would say it's actually good news for them.
If I took, for example, Stockfish and heavily lobotomized its eval (losing several hundred Elo), I bet it would appear "crystal clear" in the similarity test, and yet be much stronger than the vast majority of other engines.
Likely true. I had severe problems removing the traces of my own engine and only managed after a loss of 200-300 Elo points. But I fail to see how this is good news for cloners.
I wonder if you tried only changing the evaluation, because it is possible to try other changes, like changing the order of moves in the move generator.

I also wonder if there is a significant increase in similarity with a significantly longer time control (100 times slower than the time control that you usually use).

I suspect that in some positions there is only one best move that all programs find, but some of the programs need a long search to find it, so if the time control is long enough you may see a significant increase in similarity.

If that is the case, then maybe removing the relevant positions from the set can help to improve the similarity tool.
No, when you start removing things you introduce bias. It is better to keep the hand of the experimental operator out of this. You may increase the precision but make the accuracy suffer. I am not saying it cannot be done, but you have to be extremely careful with what you do and it is not worth the effort.

Positions from games, picked randomly, are just perfect.

Miguel
User avatar
Rebel
Posts: 6991
Joined: Thu Aug 18, 2011 12:04 pm

Re: World Computer Chess Championship ?

Post by Rebel »

mar wrote:
Rebel wrote:But I fail to see how this is good news for cloners.
Simply: Stockfish (or any other super-strong open-source engine) minus 300 or 400 Elo should still be far above average, which would IMHO satisfy a cloner, at least for version 0.1 or 1.0 beta or whatever :) The most important thing is that it will fool the similarity test, I am pretty sure.
Well, IMO cloners who want to compete in tournaments are not satisfied with a lower-class ranking, but they could influence the results (and spoil the fun) for those who don't participate for a top ranking, I agree.
So the general recipe for "similarity-safe" cloning is to take a strong engine, strip the eval (material only) and voilà: much weaker but already "clean". Then rewrite the eval from scratch; that should do the trick.
Agreed, a cloner can get the search for free this way. Nevertheless, ponder hits and the similarity detector should not be used to brand someone as a cloner, but only for participation permission and as a signal that something might be wrong. The classic approach (identical code check) should remain decisive. The tool should serve as a grip in a world of chaos. Perfect solutions exist only on Utopia island.
User avatar
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: World Computer Chess Championship ?

Post by michiguel »

mar wrote:
Laskos wrote: True, but one will lose several hundred Elo. Smarter guys could do even better and try to adapt several sources to their engine. False negatives are always a danger, but do you have better tools? How do you go after a lobotomized Stockfish? At least it cannot win a WCCC. This thread is already full of pretty arbitrary accusations; if such a mob decides the originality of an engine in competitions, it would be much worse than the simple 60% at CSVN. In the future they will have to use random, secret positions with a slightly different, pretty universal threshold.

Kai
No, I don't have anything better; in fact I appreciate Adam's work on similarity tests a lot. It's fully automated and is the next best thing to full RE.
The whole point is that these similarity tests are heavily influenced by eval (which is logical, as you can't run a similarity test on tactical positions with one best move, for example), so they basically compare evals (and PSQ tables). That's the problem.
Assuming you are right, you are still seeing the glass as half empty.
Even if a cloner butchers a top open-source engine and comes up with a weaker engine, this will require:
1) A huge amount of work that most cloners won't like to do.
2) Expertise to do it.
and also
3) The damage to the community will be dramatically minimized if the clone is not so strong.

If your point is that this technique is not perfect, yes, of course, but I still fail to see data showing that it goes in the wrong direction. Your argument calls for the existence of false negatives, but you will always have those.

This is nothing but a methodological way to quantify the old perception that "this engine plays like X". That subjectivity is completely gone.

This tool needs to be used carefully, and people probably need to be educated about what it really means, but that does not mean it is not a very useful tool.

Miguel

User avatar
Rebel
Posts: 6991
Joined: Thu Aug 18, 2011 12:04 pm

Re: World Computer Chess Championship ?

Post by Rebel »

mar wrote:
Laskos wrote: True, but one will lose several hundred Elo. Smarter guys could do even better, try ro adapt several sources to their engine. False negatives are always a danger, but do you have better tools? How do you go after a lobotomized Stockfish? At least it cannot win a WCCC. This thread is already full of pretty arbitrary accusations, if such a mob will decide the originality of an engine in competitions, it would be very much worse than the simple 60% at CSVN. In the future they will have to use random, secret positions with a bit different, pretty universal threshold.

Kai
No I don't have anything better, in fact I appreciate Adam's work on similarity tests a lot. It's fully automated and is next best thing to full RE.
The whole point is that these similarity tests are heavily influenced by eval (which is logical as you can't run similarity test on tactical positions with one best move for example),
so they basically compare evals (and PSQ tables). That's the problem.
AFAIK There's no oracle that you would feed with a binary and it would output either "clone" or "clean".
Ponder hits were first introduced by Kai and then used in the ChessBase article by Soren Riis. I did not pay (much) attention back then because I felt that search had too much influence, as the data was extracted from CCRL games. The similarity detector is different: with short time controls like 1/10 of a second you limit the influence of search enormously, and even more so if you run the test at low fixed depths; the lower you go, the more you measure pure eval. And yet (much to my surprise) ponder hits and the similarity detector tell the same story, and the same suspects are listed.

I think no one can yet claim expert status in this new and unexplored area, and more research is needed, but the results are too good to be ignored. I am hoping something very good may come out of it once we understand all the ins and outs better.
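
For readers who want to experiment with the underlying idea, below is a minimal sketch of such a fixed-depth move-matching test. It is not Adam Hair's actual sim tool: the engine paths, the position file, and the depth are placeholders, and it assumes a text file with one FEN per line plus the python-chess package.

import chess
import chess.engine

ENGINE_A = "./engine_a"      # hypothetical UCI engine binaries
ENGINE_B = "./engine_b"
FEN_FILE = "positions.fen"   # e.g. ~8,000 positions sampled from real games
DEPTH = 6                    # low fixed depth: less search influence, more eval

def best_moves(path, boards, depth):
    """Ask one engine for its best move in every position."""
    engine = chess.engine.SimpleEngine.popen_uci(path)
    try:
        return [engine.play(b, chess.engine.Limit(depth=depth)).move
                for b in boards]
    finally:
        engine.quit()

with open(FEN_FILE) as f:
    boards = [chess.Board(line.strip()) for line in f if line.strip()]

moves_a = best_moves(ENGINE_A, boards, DEPTH)
moves_b = best_moves(ENGINE_B, boards, DEPTH)

matches = sum(a == b for a, b in zip(moves_a, moves_b))
print(f"move match: {100.0 * matches / len(boards):.1f}% "
      f"({matches} of {len(boards)} positions)")

Replacing the fixed depth with chess.engine.Limit(time=0.1) gives the 100 ms per move variant discussed in this thread, where a match rate above roughly 60% over ~8,000 positions is treated as suspicious.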
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: World Computer Chess Championship ?

Post by bob »

Laskos wrote:
bob wrote:
Laskos wrote:
bob wrote:
So you believe there is some "magic match percentage" (such as the one chosen by CSVN) that is a safe number. Anything above that is simply a clone with no investigation needed, anything below that is not?

(Hint: CSVN's number doesn't look particularly "safe" to me)...
As of now, anything higher than 60% is suspicious at 100ms (on 1 modern core) with Don's 8,000 or so positions, although the approach is a bit simplistic. Special care must be taken that the testing positions are not publicly available. It's true that cloning must be proven by inspecting the sources, so these suspicious engines must be dealt with separately from the main body of engines in a tourney (asking for sources, etc.). There could be "false negatives" at, say, the 57% or even 55% level. Ponder hit numbers from games are very similar (and do not depend on the choice of positions).

Kai

ps I was not extremely enthusiastic about the CSVN approach, but seeing so much useless talk about what must be done, the approach now seems pretty adequate.
I dislike simple detection schemes. They always have built-in error rates that are non-zero. I'd hate to see someone branded a clone just because of a similarity test, particularly once lots of newcomers are measured...
You seem to be very scrupulous with regard to this test, but you yourself (and some others) are accusing, directly or indirectly, a lot of authors unscrupulously. Generally speaking, this test is, say, >95% correct in positive detections, while your accusations are pretty random. Besides that, the engines will only be labeled as suspicious, and inspection of the sources will establish the copying.

Kai
Who have I accused? Vas? Lots of supporting evidence. Houdart? Ditto. Beyond that, the only ones _I_ have accused have all been proven: El Chinito, Le Petite, etc... So please feel free to show me my "random accusations", which seem to be more imagination on your part than anything else. As a control experiment, I wonder what would happen with (say) the ponder hit data if human games were used. It should NOT produce suspicious behaviour, would you agree?
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: World Computer Chess Championship ?

Post by bob »

Rebel wrote:
bob wrote:
Laskos wrote:
bob wrote:
So you believe there is some "magic match percentage" (such as the one chosen by CSVN) that is a safe number. Anything above that is simply a clone with no investigation needed, anything below that is not?

(Hint: CSVN's number doesn't look particularly "safe" to me)...
As of now, anything higher than 60% is suspicious at 100ms (on 1 modern core) with Don's 8,000 or so positions, although the approach is a bit simplistic. Special care must be taken that the testing positions are not publicly available. It's true that cloning must be proven by inspecting the sources, so these suspicious engines must be dealt with separately from the main body of engines in a tourney (asking for sources, etc.). There could be "false negatives" at, say, the 57% or even 55% level. Ponder hit numbers from games are very similar (and do not depend on the choice of positions).

Kai

ps I was not extremely enthusiastic about the CSVN approach, but seeing so much useless talk about what must be done, the approach now seems pretty adequate.
I dislike simple detection schemes.
And how much of that is driven by Adam's low percentage for Rybka 1.0? Can you honestly state you are totally objective here?
My comments have absolutely nothing to do with Rybka 1.0. They are based on simple observations from past experiments that tried to show similarity or predict the rating of a program. ICC has done a TON of work on detecting computer usage. I worked with Tim over there years ago and saw just how hard it is to show that this "matching moves" stuff indicates anything other than playing good chess in the majority of cases.

The normal way to develop a model such as this is to put it together and then run it against several trial groups specifically composed to test the model thoroughly: groups that have nothing to do with each other, and groups that include a couple of related versions. I've not seen any of the "control experiments" to see what ponder hits might show in a group of GM players who are obviously not clones. The more programs that are tested in this way, the more false positives one will see due to nothing more than "the pigeonhole issue"...
They always have built-in error rates that are non-zero. I'd hate to see someone branded a clone just because of a similarity test, particularly once lots of newcomers are measured...
1. There are options to deal with false positives. In case of doubt:
1a. Run the test again, now at fixed depth, as a second opinion.
1b. Run a second set of 8,000 positions.
1c. I have made a start on a database of (odd and secret) positions that form a kind of fingerprint of the open sources and how they handle them; it measures the absence or presence of certain chess knowledge and as such can serve as extra information in case of doubt.
"Secret positions" is not very useful. They don't remain secret for long. I have seen examples of programs in the past with a "hidden built-in opening book" so that they could play reasonably even without a book file. I have seen examples of programs with built-in learning information so that they would do better on some "test positions" but which were caught when someone noticed that if you flipped the positions left-right or flipped colors, the program played completely differently. So "secret positions" don't impress me very much at all.

2. I am not very impressed by your sudden concern about false accusations, since you have mastered accusations yourself and elevated them to some kind of art.
A direct challenge: how about listing exactly WHO I have accused of copying code? Vas/Rybka and Houdart/Houdini, because there is a TON of evidence showing both convincingly. The others would include El Chinito, Le Petite, Bionic Impakt, Voyager, and several other Crafty clones that were proven beyond any doubt. Who ELSE have I accused? I'll be waiting to see whether you answer or run...


3. Besides, the tool is not meant to brand programs as clones, just to exclude programs from participation. An author just has to make sure his brainchild is original enough according to the percentage set by the TD.
"exclude from participation" is not the same as "branding them as a clone"?

:)
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: World Computer Chess Championship ?

Post by bob »

Rebel wrote:In addition I can tell the following: some programmer (who wants to remain anonymous) has done the following experiment:

1. Take the Fruit 2.1 source and modify each of Fruit's eval values to equal Rybka 1.0's.

2. The similarity detector reported only a 4% increase.

This is BAD NEWS for cloners who think they can take existing source code, modify all the eval values (some even use multiplication), and get away with it. Playing style is hard to remove from an engine.
Try drastically changing null move or the LMR threshold in Fruit, or a few very SIMPLE things like those. Those things greatly influence move selection.
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: World Computer Chess Championship ?

Post by Laskos »

bob wrote:
Laskos wrote:
You seem to be very scrupulous with regard to this test, but you yourself (and some others) are accusing, directly or indirectly, a lot of authors unscrupulously. Generally speaking, this test is, say, >95% correct in positive detections, while your accusations are pretty random. Besides that, the engines will only be labeled as suspicious, and inspection of the sources will establish the copying.

Kai
Who have I accused? Vas? Lots of supporting evidence. Houdart? Ditto. Beyond that, the only ones _I_ have accused have all been proven: El Chinito, Le Petite, etc... So please feel free to show me my "random accusations", which seem to be more imagination on your part than anything else. As a control experiment, I wonder what would happen with (say) the ponder hit data if human games were used. It should NOT produce suspicious behaviour, would you agree?
You have accused engines like IvanHoe or even Stockfish of possible cloning (or of being in need of investigation), besides Houdini (which, if it took something, took it from the open domain, and that thing in the open domain was never proven to be anything). Do you have something better than hearsay on OpenChess, three lines of code, and the mob lynching which is so dear to you? Even for a copyright violation, significant chunks of code must be found to have been copied.
How do ponder hits work for human games? It would be great to use something analogous to sim on humans; you would see that it's much more reliable on correct positives than you imply.
After reading you and others in the same vein, and knowing that you have influence in the ICGA, the CSVN 60% tourney rule seems much more adequate than your views.

Kai
mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: World Computer Chess Championship ?

Post by mar »

Rebel wrote: Ponder hits were first introduced by Kai and then used in the ChessBase article by Soren Riis. I did not pay (much) attention back then because I felt that search had too much influence, as the data was extracted from CCRL games. The similarity detector is different: with short time controls like 1/10 of a second you limit the influence of search enormously, and even more so if you run the test at low fixed depths; the lower you go, the more you measure pure eval. And yet (much to my surprise) ponder hits and the similarity detector tell the same story, and the same suspects are listed.

I think no one can yet claim expert status in this new and unexplored area, and more research is needed, but the results are too good to be ignored. I am hoping something very good may come out of it once we understand all the ins and outs better.
I agree, Ed. What bothers me about ponder hits is sequences of forced moves. I would trust the ponder hit ratio even less than similarity tests. But that's only theory, so I may be wrong.

EDIT: one of the older versions of my engine had a >50% ponder hit ratio with 15 other engines (according to CCRL), while the latest version only has 11-17%, though it's still the same breed :) So I wonder what informative value ponder hits really have... Or perhaps I'm missing something?
Last edited by mar on Tue Jun 19, 2012 8:05 pm, edited 1 time in total.
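
As a point of reference for the ponder hit discussion, here is a rough sketch of one way such a statistic can be computed, under the assumption that a "ponder hit" means the opponent actually plays the reply the engine was predicting (the ponder move from its best line). The engine paths, time limit, and move cap are placeholders; it uses python-chess and, for brevity, starts every game from the initial position, whereas a real measurement would use varied openings.

import chess
import chess.engine

def ponder_hit_rate(path_a, path_b, games=10,
                    limit=chess.engine.Limit(time=0.1)):
    """How often does engine B actually play the reply engine A predicted?"""
    a = chess.engine.SimpleEngine.popen_uci(path_a)
    b = chess.engine.SimpleEngine.popen_uci(path_b)
    hits = predictions = 0
    try:
        for _ in range(games):
            board = chess.Board()  # real tests would start from varied openings
            while not board.is_game_over() and board.fullmove_number < 80:
                res_a = a.play(board, limit)   # A moves and predicts B's reply
                board.push(res_a.move)
                if board.is_game_over():
                    break
                res_b = b.play(board, limit)   # B answers
                if res_a.ponder is not None:
                    predictions += 1
                    hits += res_b.move == res_a.ponder
                board.push(res_b.move)
    finally:
        a.quit()
        b.quit()
    return hits / predictions if predictions else 0.0

print(f"ponder hits: {100 * ponder_hit_rate('./engine_a', './engine_b'):.1f}%")

Sequences of forced moves, as mar notes, inflate this number for any pair of reasonable engines, which is one reason to treat it only as corroborating evidence.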
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: World Computer Chess Championship ?

Post by Laskos »

Rebel wrote:
mar wrote:
Laskos wrote: True, but one will lose several hundred Elo. Smarter guys could do even better and try to adapt several sources to their engine. False negatives are always a danger, but do you have better tools? How do you go after a lobotomized Stockfish? At least it cannot win a WCCC. This thread is already full of pretty arbitrary accusations; if such a mob decides the originality of an engine in competitions, it would be much worse than the simple 60% at CSVN. In the future they will have to use random, secret positions with a slightly different, pretty universal threshold.

Kai
No, I don't have anything better; in fact I appreciate Adam's work on similarity tests a lot. It's fully automated and is the next best thing to full RE.
The whole point is that these similarity tests are heavily influenced by eval (which is logical, as you can't run a similarity test on tactical positions with one best move, for example), so they basically compare evals (and PSQ tables). That's the problem.
AFAIK there's no oracle that you could feed a binary and have it output either "clone" or "clean".
Ponder hits were first introduced by Kai and then used in the ChessBase article by Soren Riis. I did not pay (much) attention back then because I felt that search had too much influence, as the data was extracted from CCRL games. The similarity detector is different: with short time controls like 1/10 of a second you limit the influence of search enormously, and even more so if you run the test at low fixed depths; the lower you go, the more you measure pure eval. And yet (much to my surprise) ponder hits and the similarity detector tell the same story, and the same suspects are listed.

I think no one can yet claim expert status in this new and unexplored area, and more research is needed, but the results are too good to be ignored. I am hoping something very good may come out of it once we understand all the ins and outs better.
Soren Riis used my Sim results with hierarchical clustering. In other plots I used ponder hit results, mostly in cases where Sim doesn't work (with ChessBase native engines and other cases). The methods are almost the same; only the percentages differ a bit, but the shape of the clusters is almost identical.

Kai
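
For completeness, here is a minimal sketch of the clustering step Kai describes, assuming a symmetric matrix of pairwise move-match (or ponder hit) percentages; the engine names and numbers below are invented. Similarity is converted to a distance and fed to SciPy's hierarchical clustering to draw a dendrogram.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

engines = ["Engine A", "Engine B", "Engine C", "Engine D"]
# hypothetical pairwise move-match percentages (symmetric, 100 on the diagonal)
sim = np.array([
    [100.0,  64.0,  43.0,  41.0],
    [ 64.0, 100.0,  45.0,  40.0],
    [ 43.0,  45.0, 100.0,  52.0],
    [ 41.0,  40.0,  52.0, 100.0],
])

dist = 100.0 - sim                      # higher similarity -> smaller distance
np.fill_diagonal(dist, 0.0)
Z = linkage(squareform(dist, checks=False), method="average")

dendrogram(Z, labels=engines)
plt.ylabel("100 - move match %")
plt.tight_layout()
plt.show()

Engines that share an eval (or a common ancestor) end up on short branches of the same cluster, which is essentially what the plots referenced in this thread show.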