Critter 1.2 SEEMS to be a member of the Ippo family

michiguel · Post by **michiguel** » Mon Jul 04, 2011 6:20 am

BubbaTough wrote:
Milos wrote:
BubbaTough wrote:Why do you say that? I certainly don't know much about it, but it looks like a 100ms search, not a static eval. And I don't see why you claim static eval = PST, and not static eval = whole eval.
It might be a good idea to first read some good book on communications and noise statistics/correlation (I suggest S. Haykin's Communication systems) and read prior threads on this topic.
A 100ms search actually has so much randomness in the actual search time (searched node count) that any common search component between two engines gets lost in the noise. Most of the randomness comes from the engine's check-time polling function.
The same is valid for the dynamic components of the evaluation, the so-called positional values (scores that depend on piece interaction: king safety, various pawn structure bonuses, mobility, etc.). There again the noise kills the correlation info.
What remains is the static evaluation - material values + PST.
You can actually test it very easily in your engine. Disable the full eval and leave only the lazy eval (material+PST) active. Then let this simplified engine play against your normal engine version.
Start from a longer fixed time per move (like 1s per move), gradually reduce it towards 50ms or even 20ms, and check the impact on the Elo difference. You'll be quite surprised.
It's fine if you want to make some noise argument. It sounds quite reasonable (except for the data that seems to indicate there are some strong correlations), but I don't understand why you are artificially separating piece squares from other things. It seems extremely unlikely that whether an engine chooses to push a passed pawn depends only on piece square tables, and not on passed pawn values. Or that moving a knight to an outpost depends only on square values and not on the value of an outpost, or that a BxN move depends only on PST and not on things like doubled pawn values or isolated pawn values. Or that pushing a pawn in front of the king is based only on PST and not on king shelter scores. The whole thing sounds ridiculous. I am not saying it's not true, I really have no idea, but it sounds ridiculous. If noise is the main factor, fine, it will decrease correlation. But to claim that PST and material values are somehow special in that they pierce the noise and nothing else does, well, that would be very unintuitive to me.

-Sam

This technique has been tried at different and longer times, at fixed seconds, with different times to compensate for strength, etc., and in all cases the results, i.e. how the engines clustered, were basically the same. It was also run by different people (Kai, Don, Michael Hart, Adam Hair) on different positions, and even if you randomly select the positions (bootstrapped), you still get the same thing most of the time (for the engines with a stronger similarity signal, that is).
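A minimal sketch of the bootstrapping idea mentioned above: resample the test positions with replacement and see how much the similarity between two engines moves around. The move lists, the function names and the 1000 rounds are assumptions for illustration, not the procedure actually used in those runs.

```python
import random

def agreement(moves_a, moves_b, indices):
    """Percentage of the selected positions where both engines chose the same move."""
    matches = sum(1 for i in indices if moves_a[i] == moves_b[i])
    return 100.0 * matches / len(indices)

def bootstrap_agreement(moves_a, moves_b, rounds=1000, seed=0):
    """Resample positions with replacement; return the sorted similarity values."""
    rng = random.Random(seed)
    n = len(moves_a)
    results = []
    for _ in range(rounds):
        sample = [rng.randrange(n) for _ in range(n)]
        results.append(agreement(moves_a, moves_b, sample))
    return sorted(results)

# Hypothetical data: 200 positions, the engines agree on 3 out of every 4.
moves_a = ["e2e4", "d2d4", "g1f3", "c2c4"] * 50
moves_b = ["e2e4", "d2d4", "b1c3", "c2c4"] * 50

dist = bootstrap_agreement(moves_a, moves_b)
low, high = dist[25], dist[974]  # rough 95% interval from 1000 resamples
print(f"similarity interval: {low:.1f}% .. {high:.1f}%")
```

If the interval for one pair of engines stays clearly apart from the interval for another pair, the clustering is unlikely to be an accident of which positions were picked.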

Miguel

Matthias Gemuh · Post by **Matthias Gemuh** » Mon Jul 04, 2011 9:17 am

Richard, keep up the good work on your nice engine.
I hope you won't soon go commercial.

Matthias.

noctiferus · Post by **noctiferus** » Mon Jul 04, 2011 9:30 am

Thx, Kai. Very kind of you.
Enrico

Laskos · Post by **Laskos** » Mon Jul 04, 2011 9:29 pm

noctiferus wrote:Hi.

2) in alternative, could you run a clustering with complete and single linkage, please (just to look at stability of similarities)?

Thx
Enrico

Hi Enrico,
I can answer your second question too.

With these notations:

1) Critter 1.2 (time: 100 ms scale: 1)
2) Houdini 1.5 (time: 100 ms scale: 1.0)
3) IvanHoe B47cB (time: 100 ms scale: 1.00)
4) Komodo 2.03 (time: 100 ms scale: 1)
5) Robbo 009 (time: 100 ms scale: 1.0)
6) Rybka 3 (time: 100 ms scale: 1.0)
7) Rybka 4 (time: 100 ms scale: 1.00)
8) Shredder 10 (time: 100 ms scale: 1.0)
9) Shredder 12 (time: 100 ms scale: 1.0)
10) Stockfish 2.1.1 (time: 100 ms scale: 1.0)

Dendrogram using Complete Linkage: [image not preserved]

Dendrogram using Single Linkage: [image not preserved]

You can see that the clustering is stable, in line with the (more accurate) Average Linkage Dendrogram presented before.
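For readers who want to check linkage stability themselves, here is a small sketch of agglomerative clustering with single, complete and average linkage. The four engine labels and the distance matrix (think of it as 100 minus the similarity percentage) are made up, and `agglomerate` is a hypothetical helper, not the software used for the dendrograms.

```python
def agglomerate(names, dist, linkage="average"):
    """Merge clusters bottom-up; return a list of (sorted member names, merge height)."""
    combine = {
        "single": min,                        # closest pair of members
        "complete": max,                      # farthest pair of members
        "average": lambda d: sum(d) / len(d)  # mean over all member pairs
    }[linkage]
    clusters = [(i,) for i in range(len(names))]
    merges = []
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                pair_d = [dist[i][j] for i in clusters[x] for j in clusters[y]]
                d = combine(pair_d)
                if best is None or d < best[0]:
                    best = (d, x, y)
        d, x, y = best
        merged = clusters[x] + clusters[y]
        merges.append((sorted(names[i] for i in merged), d))
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)] + [merged]
    return merges

# Hypothetical distances between four engines labelled A..D.
names = ["A", "B", "C", "D"]
dist = [
    [0, 30, 45, 50],
    [30, 0, 40, 48],
    [45, 40, 0, 35],
    [50, 48, 35, 0],
]

for method in ("single", "complete", "average"):
    print(method, agglomerate(names, dist, method))
```

With this toy matrix all three linkages merge A with B first and C with D second; only the height of the final merge differs, which is the kind of stability being described.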

Kai

noctiferus · Post by **noctiferus** » Mon Jul 04, 2011 10:57 pm

Thx, Kai. You did all the work

noctiferus · Post by **noctiferus** » Tue Jul 05, 2011 12:11 pm

e-mail would be fine.
Thx again: very kind of you!

Laskos · Post by **Laskos** » Tue Jul 05, 2011 6:26 pm

noctiferus wrote:e-mail would be fine.
Thx again: very kind of you!

Ok, send me a PM with your e-mail address and I will send you the "similarity.data" file (~400kB), which contains all the ~8000 moves, in algebraic notation, chosen by each engine. You can perform bootstrapping, for example, or anything you want. The file can be opened in Notepad to see what's there. I never used it for bootstrapping, but Miguel did, so you had better ask him how to deal with this file.

Kai

Don · Post by **Don** » Wed Aug 22, 2012 6:45 pm

Laskos wrote:
noctiferus wrote:e-mail would be fine.
Thx again: very kind of you!
Ok, send me a PM with your e-mail address and I will send you the "similarity.data" file (~400kB), which contains all the ~8000 moves, in algebraic notation, chosen by each engine. You can perform bootstrapping, for example, or anything you want. The file can be opened in Notepad to see what's there. I never used it for bootstrapping, but Miguel did, so you had better ask him how to deal with this file.

Kai

Here is my take on the sim tool.

It was designed to measure the similarity in move choice between 2 different programs. That's a very simple concept and all it does is simply counting. It gets a lot more complicated if you make assumptions about what it is supposed to measure.
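The counting idea can be sketched in a few lines. This is only an illustration of "same move, same position" counting under assumed inputs; the move lists and the function name are hypothetical and this is not the sim tool's actual code.

```python
def similarity(moves_a, moves_b):
    """Percentage of positions where both engines chose the same move."""
    if len(moves_a) != len(moves_b):
        raise ValueError("both engines must be tested on the same positions")
    if not moves_a:
        raise ValueError("need at least one position")
    matches = sum(1 for a, b in zip(moves_a, moves_b) if a == b)
    return 100.0 * matches / len(moves_a)

# Hypothetical move choices, one per test position.
engine_x = ["e2e4", "g1f3", "d2d4", "c2c4", "e1g1"]
engine_y = ["e2e4", "g1f3", "c2c4", "c2c4", "f1e1"]

print(similarity(engine_x, engine_y))  # 3 of 5 positions match -> 60.0
```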

The issue of piece square tables came up. The test was not designed to measure piece square tables, only move choice. If your program is heavily influenced by piece square tables then I would expect that to make a big difference.

If your program COPIES piece square tables from other programs, then you should expect to get more similarity because your program has plagiarized elements from that other chess program. That should not be a surprise to anyone. I would further expect that if you copy even more evaluation concepts and exact values for them from other programs you will likely get more move choices that are the same. I don't believe this is rocket science.

I do not believe the search has much impact on the results at all because you can run the same program at widely disparate depths and get extremely high similarity. So it's my hypothesis that search has a very small (but not zero) impact on the similarity.

I never claimed the tool proves that a program was cloned and in fact I have always made that disclaimer, so anyone critical of it on that basis is the one making claims and jumping to conclusions. The very first version of the tool was advertised as a clone detector in this forum for the impact and sensationalism to get everyone's attention but even in that first post I made the disclaimer. Several times since then I wrote that we need to gain more experience with it in order to understand how it works.

So is it a good clone detection utility? No. Naum got very high correlation with Rybka and it turns out the reason was that they used automated tuning methods to MAKE it play like Rybka - evidently they were successful. I don't think anyone believes the Naum programmer plagiarized Rybka.

I suggested one possible use of it long ago - as a tool to clear programs. I am personally more comfortable using it to clear people than to convict them and put this in the same class as a polygraph test, that it should be used as an investigation tool only - not admissible as proof of clonesmanship.

It appears that it does what it does pretty effectively, however. Richard Vida says he copied the piece square tables and the tool picked this up.

Don

Don · Post by **Don** » Thu Aug 23, 2012 7:21 pm

rvida wrote:Hi, Kai

1) Hunting for some publicity, eh?

2) Do you know what exactly is measured with sim03?

3) I will tell you a "secret". Houdini, Robbolito, Ippolit, etc. share the same piece square tables. The ones that Critter uses are not 100% identical but are very, very close (differences are just because of rounding errors - Critter uses 1/256th of a pawn instead of 1/100 - for every practical purpose they can be called identical... they are working fine, and there are more reasonable ways to spend development time than to make them different just to make someone happy). I don't know about your programming skills but let's try an experiment: Take 2 different open source programs (let's say Crafty & Fruit) and force them to use the same exact PSQ tables. Now run them through sim03 and see the shocking result (and write a sensational post on a forum of your choice about one being a clone of the other...)

4) Sources of Critter are not top-secret. Although I chose to go closed source after version 0.42, so far I have sent my sources to everyone who asked for them. Most of such requests concerned version 0.90, but a few people on this forum do have sources of v1.2 too (or the last beta before v1.2 release).

Richard

Hi Richard,

I did the experiment you suggested and I posted the results here on the general topics forum. Basically I transplanted the Ivanhoe piece square tables into Komodo. Since Komodo uses pawn = 1000 I multiplied all values x 10 - as you probably know Komodo's tables are multiplied by some specified weight.
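The x10 rescaling described above, and the 1/100 to 1/256 conversion Richard mentions, can be sketched as follows. The table values and the `rescale_table` helper are hypothetical; they are not taken from IvanHoe, Komodo or Critter.

```python
def rescale_table(table, from_pawn, to_pawn):
    """Convert piece-square values between pawn units, rounding to integers.

    Note: Python's round() sends exact .5 ties to the nearest even integer.
    """
    return [round(v * to_pawn / from_pawn) for v in table]

# Hypothetical knight values for one rank, in centipawns (pawn = 100).
knight_row = [-50, -30, -10, 0, 0, -10, -30, -50]

milli = rescale_table(knight_row, 100, 1000)  # pawn = 1000, i.e. a plain x10
fine = rescale_table(knight_row, 100, 256)    # pawn = 256, values get rounded
back = rescale_table(fine, 256, 100)          # convert back to centipawns

print(milli)
print(fine)
print(back == knight_row)
```

In the pawn = 256 version, -30 centipawns becomes -77 rather than the exact -76.8, which is the sort of small rounding difference being referred to; the two tables still encode the same preferences.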

By your reasoning I should now have a program that plays much more like Ivanhoe - but that is simply not the case. I posted all the results - including the Ivanhoe tables to make it easier for anyone else to do this experiment.

I'm not sure you can even make a general sweeping statement like this, because the tables could matter a lot for a given program if it primarily has a highly static evaluation function. But I don't think strong programs allow a simple fixed table to be the main component that controls how they play chess. That just seems stupid to me. I see them as an efficient way to implement a small subset of evaluation terms, such as "centralization", but as I explained, what the piece square tables do can be completely replaced by a small handful of terms implemented differently. So I consider this a myth.

If you copied those tables just to save yourself some time because you didn't consider them very important or interesting, you probably took a lot of other shortcuts too, to avoid having to think about them and to move on to other things you consider more interesting. That would explain why your program plays a lot like these other programs, which all play much like each other.

Something I did not include in my study is a follow up study where I set up a round robin tournament to look at the impact of these changes on the rating. I think the conclusion that I might draw from the rating study is that you are right to believe the piece square tables are not a big deal. The tables are significantly different than Komodo's (which you can verify for yourself since you regularly look at Komodo and I assume other programs) and yet the ELO difference is only about 35 ELO which is remarkable considering that only a superficial attempt was made to make the tables compatible with Komodo and no attempt was made to further tune them to work best with everything else.

Personally, I believe all this fuss about copying piece square tables is not a very important issue. It turns out that it's a dead giveaway that an author is working from a given code-base but it has never been something that bothered me much as I think a strong evaluation function (and the work involved in creating one) has almost nothing to do with the piece square table.


Rank Name            Elo      +      -    games   score   oppo.   draws 
   1 Kdev-default  3232.6   21.3   21.3     895   61.4%  3136.4   37.7% 
   2 c14           3227.3   21.4   21.4     900   60.4%  3136.7   34.8% 
   3 c16           3226.6   21.2   21.2     900   60.6%  3136.8   37.0% 
   4 Kdev-IVHx10   3197.3   21.2   21.2     894   56.6%  3140.3   37.4% 
   5 IvanHoe       3180.3   21.3   21.3     900   54.3%  3142.0   35.1% 
   6 Kdev_IVH      3143.4   21.4   21.4     895   49.6%  3146.2   33.9% 
   7 Kdev-low_mob  3132.9   21.5   21.5     894   48.2%  3147.4   33.4% 
   8 Komodo_3      3095.8   21.3   21.3     896   43.2%  3151.6   35.9% 
   9 sf-2.22       3021.6   21.8   21.8     894   34.0%  3159.8   33.3% 
  10 sf-2.0        3000.0   22.1   22.1     894   31.6%  3162.2   30.8%
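As a reading aid for the table, the standard logistic Elo model turns a rating gap into an expected score. The sketch below applies it to the Kdev-default and Kdev-IVHx10 ratings listed above; the function name is illustrative, and the error bars in the table are not taken into account.

```python
def expected_score(elo_diff):
    """Expected score of the higher-rated side under the logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

# Kdev-default minus Kdev-IVHx10, ratings copied from the table above.
gap = 3232.6 - 3197.3
print(f"gap = {gap:.1f} Elo, expected score = {expected_score(gap):.3f}")
```

A gap of about 35 Elo corresponds to an expected score of roughly 55%, so the transplanted tables cost relatively little strength.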

Laskos · Post by **Laskos** » Thu Aug 23, 2012 10:57 pm

Don wrote:
Laskos wrote:
noctiferus wrote:e-mail would be fine.
Thx again: very kind of you!
Ok, send me a PM with your e-mail address and I will send you the "similarity.data" file (~400kB), which contains all the ~8000 moves, in algebraic notation, chosen by each engine. You can perform bootstrapping, for example, or anything you want. The file can be opened in Notepad to see what's there. I never used it for bootstrapping, but Miguel did, so you had better ask him how to deal with this file.

Kai
Here is my take on the sim tool.

It was designed to measure the similarity in move choice between 2 different programs. That's a very simple concept and all it does is simply counting. It gets a lot more complicated if you make assumptions about what it is supposed to measure.

The issue of piece square tables came up. The test was not designed to measure piece square tables, only move choice. If your program is heavily influenced by piece square tables then I would expect that to make a big difference.

If your program COPIES piece square tables from other programs, then you should expect to get more similarity because your program has plagiarized elements from that other chess program. That should not be a surprise to anyone. I would further expect that if you copy even more evaluation concepts and exact values for them from other programs you will likely get more move choices that are the same. I don't believe this is rocket science.

Now that you have debunked the PST tale of Richard Vida, do you agree that to make your program behave differently under Sim, you have to change much more than PST?

I do not believe the search has much impact on the results at all because you can run the same program at widely disparate depths and get extremely high similarity. So it's my hypothesis that search has a very small (but not zero) impact on the similarity.

Search does have some impact: for a factor of x100 in time the similarity may shift by 5-7%, which is not a very small variation when the whole range from unrelated to very related is some 20%. Therefore CSVN, when setting a 60% limit, must set the time control too, say 100 ms on one modern core.

I never claimed the tool proves that a program was cloned and in fact I have always made that disclaimer, so anyone critical of it on that basis is the one making claims and jumping to conclusions. The very first version of the tool was advertised as a clone detector in this forum for the impact and sensationalism to get everyone's attention but even in that first post I made the disclaimer. Several times since then I wrote that we need to gain more experience with it in order to understand how it works.

So is it a good clone detection utility? No. Naum got very high correlation with Rybka and it turns out the reason was that they used automated tuning methods to MAKE it play like Rybka - evidently they were successful. I don't think anyone believes the Naum programmer plagiarized Rybka.

Do you still believe that? Could you tune Komodo, by whatever methods, to play like Rybka or even Shredder? It must be a humongous task. I believe it's another myth; the similarity of Naum (and of Fritz, if I remember) to Rybka (Strelka) was unusually high.

I suggested one possible use of it long ago - as a tool to clear programs. I am personally more comfortable using it to clear people than to convict them and put this in the same class as a polygraph test, that it should be used as an investigation tool only - not admissible as proof of clonesmanship.

It appears that it does what it does pretty effectively, however. Richard Vida says he copied the piece square tables and the tool picked this up.

Don

I think that the tool as used by CSVN could be well applied to erase suspicions in tournaments. If some do not like it, it can even be portrayed as a tool to measure "diversity" of engines (nobody can deny that it does measure diversity), and that the tournaments need high "diversity" of engines, while we know what it means with the Sim tester, and what kind of suspicions there really are.

Kai
