Adam Hair's article on Pairwise comparison of engines

Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: Adam Hair's article on Pairwise comparison of engines

Post by Adam Hair »

Norm Pollock wrote:Adam,

Would it be unreasonable to test a 2nd copy of each engine to establish consistency?

I think you need to have some assurance that each engine is strongly consistent and will produce the same move from the same position the great majority of the time if given a 2nd chance.

After running a second version through the 8000+ positions, the 2 versions of each engine can be compared. If the versions don't have at least 95% matched moves, then I would not consider the engine consistent and would possibly disqualify the engine from the test.

-Norm
You are correct, Norm. After I collected the data for this study, Lukas Cimiotti (IIRC) and Richard Vida pointed out how I could improve the self-similarity percentages for engines like Houdini (the "Clear Hash" and "ucinewgame" commands). What I found was that the results became more refined, but the old results were still valid. Still, a proper study should not include inconsistent engines.
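
For readers who want to try the consistency check Norm describes, here is a minimal Python sketch. It assumes each run is simply a list of the moves chosen, one per position, in the same order. The function names, the `go depth` search limit and the toy move lists are illustrative assumptions and are not taken from the similarity tool. The 95% cutoff is Norm's suggestion, and `setoption name Clear Hash` only works for engines that actually expose that UCI option.

```python
def match_percentage(first_run, second_run):
    """Return the percentage of positions where both runs picked the same move."""
    if len(first_run) != len(second_run):
        raise ValueError("both runs must cover the same positions")
    if not first_run:
        raise ValueError("no positions to compare")
    matches = sum(a == b for a, b in zip(first_run, second_run))
    return 100.0 * matches / len(first_run)


def is_consistent(first_run, second_run, threshold=95.0):
    """Apply the suggested cutoff (95% is Norm's figure, not a standard)."""
    return match_percentage(first_run, second_run) >= threshold


def reset_commands(fen, depth):
    """UCI lines to send before each position so earlier searches do not leak in.

    Assumption: the engine exposes a "Clear Hash" button option; not all do.
    The depth limit is a placeholder for whatever search limit is used.
    """
    return [
        "setoption name Clear Hash",
        "ucinewgame",
        f"position fen {fen}",
        f"go depth {depth}",
    ]


# Made-up moves for four positions, only to show the calculation.
run_1 = ["e2e4", "d2d4", "g1f3", "c2c4"]
run_2 = ["e2e4", "d2d4", "g1f3", "b1c3"]
print(match_percentage(run_1, run_2))  # 3 of 4 moves match
print(is_consistent(run_1, run_2))
```

Resetting the hash and sending "ucinewgame" before every position is what removes most of the run-to-run variation, since the engine no longer carries information over from the previous position.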

Post by Adam Hair »

One day I will do a follow up that uses Miguel's idea of comparing evaluation scores rather than the moves selected. Besides numerical analysis, the similarity of engines can be shown visually. For example, here is a plot of Toga 1.1a eval scores at depth 6 vs Gaviota 0.86 (using the 8000+ positions packaged in Don's similarity tool):

Image

Now look at Toga 1.1a versus Loop 2007:

Image

Most engine pairs' correlation of evaluation scores looks a lot like Toga 1.1a and Gaviota 0.86. Less so if there is a large difference in age and/or strength, more so if the difference is smaller. The Toga 1.1a/Loop 2007 plot is highly unusual even though Toga is open source.
Rebel
Posts: 6997
Joined: Thu Aug 18, 2011 12:04 pm
Full name: Ed Schröder

Post by Rebel »

Adam Hair wrote:
Rebel wrote: And about Adam, I think that guy is totally indifferent :wink: but you gotta love him anyway.
Maybe a bit irreverent, but not necessarily indifferent.

I know that you know all of the following, Ed. My response is for others who may read this thread:

Two people that I have a great amount of respect for have looked at the similarity data and reached two opposite conclusions. Based on phylogenetic analysis, Miguel Ballicora concluded that the similarity data does not show an unreasonable amount of similarity between Rybka 1.0 Beta and Fruit 2.1. Based on a refinement of my methods, Mark Watkins found that early Rybka versions (starting with Rybka 1.0 Beta) "show abnormally large move-matching with Fruit 2.1" (found in Move similarity analysis in chess programs - preprint)

Personally, I think that the similarity data shows that Fruit influenced Rybka, and that is all that it shows. If a person was looking for indications that Rybka was copied from Fruit, the data would make them more suspicious. If a person was looking for clones and close derivatives, then they would ignore Rybka/Fruit and look at other pairs of engines (start with Ed's discovery of Loop 2007, Toga 1.1a, and Fruit 2.2.1).
Adam, maybe my choice (and perception) of the word "indifferent" was a bad one. My sincere apologies if my attempt at humor came out wrongly and could be seen as insulting. It was not meant so.

Regarding R/F - I can agree with both you and Miguel. It's obvious Rybka was heavily influenced by Fruit, and it's not unreasonable to say Vas modelled his EVAL on Fruit.

Regarding the SYM tool - it's a brilliant tool based on a very simple (yet to the point) idea. It also has a weak point: the definition of where to draw the line(s). The lines are based on known and proven clones and derivatives, and those who have studied the tool draw the line(s) differently based on their understanding of those known clones and derivatives.

My understanding of the disagreement is about a fluctuation of 5% max.

Post by Adam Hair »

Oh no, I did not feel insulted at all. It made me chuckle :D
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Post by michiguel »

Rebel wrote:
Regarding the SYM tool - it's a brilliant tool based on a very simple (yet to the point) idea. It also has a weak point: the definition of where to draw the line(s). The lines are based on known and proven clones and derivatives, and those who have studied the tool draw the line(s) differently based on their understanding of those known clones and derivatives.

My understanding of the disagreement is about a fluctuation of 5% max.
The way I process the data (with trees and bootstrap analysis) has fewer problems with "drawing lines". In fact, it is totally independent of the set of input positions. The more random, the better. The "line" is drawn based on the propensity of an engine to cluster (or not) with certain "neighbors". It does not even matter if you include a "crazy" engine that often changes its selected move when given the same position twice, or if you include engines that pick moves at random. They will go to a different branch. But the result is not easy to show without a bit of explanation.

Miguel
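
To give a rough feel for the bootstrap idea, here is a heavily simplified Python sketch. It is not Miguel's phylogenetic pipeline and it builds no tree. It only resamples the positions with replacement and counts how often each engine comes out as the closest neighbor of a chosen target engine. The function names, the engine labels and the move lists are all made up for illustration.

```python
import random
from collections import Counter


def match_fraction(moves_a, moves_b, indices):
    """Fraction of the sampled positions where the two engines chose the same move."""
    return sum(moves_a[i] == moves_b[i] for i in indices) / len(indices)


def nearest_neighbor_support(moves, target, replicates=200, seed=0):
    """Bootstrap the positions and count how often each engine is closest to `target`.

    `moves` maps an engine label to its list of chosen moves, one per position.
    Returns, per engine, the share of replicates in which it was the nearest neighbor.
    """
    rng = random.Random(seed)
    n_positions = len(moves[target])
    others = [name for name in moves if name != target]
    wins = Counter()
    for _ in range(replicates):
        # Resample positions with replacement, as in a standard bootstrap.
        sample = [rng.randrange(n_positions) for _ in range(n_positions)]
        best = max(
            others,
            key=lambda name: match_fraction(moves[target], moves[name], sample),
        )
        wins[best] += 1
    return {name: wins[name] / replicates for name in others}


# Toy data (fabricated): 40 positions, one near copy and one unrelated engine.
engine_a = [f"a{i}" for i in range(40)]
near_copy = list(engine_a)
for i in (3, 11, 19, 27):
    near_copy[i] = f"x{i}"  # the near copy deviates in four positions
unrelated = [f"c{i}" for i in range(40)]  # never agrees with engine_a

support = nearest_neighbor_support(
    {"engine_a": engine_a, "near_copy": near_copy, "unrelated": unrelated},
    target="engine_a",
)
print(support)
```

A neighbor that wins in nearly every replicate clusters with the target no matter which positions happen to be drawn. That is the sense in which the "line" comes from the clustering itself and not from a hand-picked percentage.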

Post by Rebel »

Adam Hair wrote:One day I will do a follow up that uses Miguel's idea of comparing evaluation scores rather than the moves selected. Besides numerical analysis, the similarity of engines can be shown visually. For example, here is a plot of Toga 1.1a eval scores at depth 6 vs Gaviota 0.86 (using the 8000+ positions packaged in Don's similarity tool):

Image

Now look at Toga 1.1a versus Loop 2007:

Image

Most engine pairs' correlation of evaluation scores looks a lot like Toga 1.1a and Gaviota 0.86. Less so if there is a large difference in age and/or strength, more so if the difference is smaller. The Toga 1.1a/Loop 2007 plot is highly unusual even though Toga is open source.
Interesting. But somehow you must be able to create a percentage from the data points?

Post by Rebel »

michiguel wrote:
The way I process the data (with trees and bootstrap analysis) has fewer problems with "drawing lines". In fact, it is totally independent of the set of input positions. The more random, the better. The "line" is drawn based on the propensity of an engine to cluster (or not) with certain "neighbors". It does not even matter if you include a "crazy" engine that often changes its selected move when given the same position twice, or if you include engines that pick moves at random. They will go to a different branch. But the result is not easy to show without a bit of explanation.

Miguel
If you have the time and energy, I would appreciate it if you provided that explanation.
Dann Corbit
Posts: 12542
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Post by Dann Corbit »

Rebel wrote:
Interesting. But somehow you must be able to create a percentage from the data points?
Looks like a job for Pearson's correlation coefficient:
http://en.wikipedia.org/wiki/Pearson_pr ... oefficient
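
As a concrete illustration of that suggestion, here is a small, dependency-free Python sketch of Pearson's r for two lists of evaluation scores. The function name and the sample numbers are made up for the example and are not taken from Adam's data.

```python
import math


def pearson(xs, ys):
    """Pearson product-moment correlation of two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 scores")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations; the 1/n factors cancel in the ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)


# Fabricated centipawn scores for five positions from two hypothetical engines.
engine_x = [12, -35, 80, 5, -110]
engine_y = [20, -30, 95, 0, -90]
r = pearson(engine_x, engine_y)
print(round(r, 3))
```

The result lies between -1 and 1. Whether r, r squared, or something else is the better "percentage" to report is a judgment call that this thread does not settle.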

Post by Adam Hair »

Miguel used Spearman's rank correlation coefficient in order to compare some depth 6 evaluations that I generated:

1: Alfil_13.1_depth6_scores.txt
2: Arasan_16.1_depth6_scores.txt
3: Atlas_3.50_depth6_scores.txt
4: Bobcat_3.25_depth6_scores.txt
5: Cheng3_1.07_depth6_scores.txt
6: Cyrano_06b17_depth6_scores.txt
7: Daydreamer-1.75_depth6_scores.txt
8: DiscoCheck_4.3_depth6_scores.txt
9: Doch_09.980_depth6_scores.txt
10: Fruit_2.1_depth6_scores.txt
11: Fruit_2.2.1_depth6_scores.txt
12: Gaviota_0.86_depth6_scores.txt
13: Gaviota_0.87_a8__depth6_scores.txt
14: Glass_2.0_depth6_scores.txt
15: GNUChess-5.50_depth6_scores.txt
16: Godel_2.3.7_depth6_scores.txt
17: Hamsters_071_depth6_scores.txt
18: Hannibal_1.3_depth6_scores.txt
19: Houdini_1.00_depth6_scores.txt
20: Houdini_3_depth6_scores.txt
21: iCE_1.00_depth6_scores.txt
22: Komodo_5.1_depth6_scores.txt
23: Komodo_CCT_depth6_scores.txt
24: Loop_2007_depth6_scores.txt
25: MinkoChess_1.3_depth6_scores.txt
26: Movei00_8_438_depth6_scores.txt
27: Murka_3_depth6_scores.txt
28: Naum_4.2_depth6_scores.txt
29: Nebula_2.0_depth6_scores.txt
30: Nemo_1.01_beta_depth6_scores.txt
31: Octochess_r5190_depth6_scores.txt
32: Pawny_1.0_depth6_scores.txt
33: Quazar_0.4_depth6_scores.txt
34: RedQueen_1.1.4_depth6_scores.txt
35: RobboLito_085d3_depth6_scores.txt
36: Ruffian_210_depth6_scores.txt
37: Rybka_1.0_Beta_depth6_scores.txt
38: Shredder11_depth6_scores.txt
39: Sjeng_WC2008_depth6_scores.txt
40: SmarThink_1.20_depth6_scores.txt
41: Spark_1.0_depth6_scores.txt
42: Spike_1.4_depth6_scores.txt
43: Stockfish_4_depth6_scores.txt
44: Strelka_1.8_depth6_scores.txt
45: Strelka_2.0B_depth6_scores.txt
46: Texel_1.02_depth6_scores.txt
47: TogaII_1.0_depth6_scores.txt
Image

Each engine is represented by an endpoint on the tree. The closer 2 endpoints are to the last vertex common to their branches, the more closely related are their evaluation scores. Compare Houdini 1.00/Robbo and Fruit 2.2.1/Loop (Toga 1.1a explains how Loop is so close to the closed source Fruit 2.2.1) to the Komodo and Gaviota versions.
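
For anyone who wants to reproduce the pairwise numbers behind such a tree, here is a plain-Python sketch of Spearman's rank correlation, where tied scores get the average of their ranks. It only covers the correlation step. How the coefficients were turned into the tree is not described here, so using something like 1 - rho as a distance would be my assumption and not necessarily what Miguel did. The function names are illustrative.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the average of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 scores")
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)


print(average_ranks([10, 20, 20, 30]))
print(spearman([1, 2, 3, 4], [1, 4, 9, 16]))
```

Because only the ordering of the scores matters, two engines whose evaluations are on different scales (for example one reporting roughly double the other) can still correlate perfectly. That is a reasonable property when comparing engines with differently tuned piece values.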

Miguel and I also discussed measuring the entropy of each plot (lower entropy equates to higher correlation of scores), but this was 1 1/2 years ago and we have not had the time nor enough interest to pursue this further.
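
One possible reading of "entropy of a plot", sketched in Python: lay a grid over the scatter plot, count the points per cell, and take the Shannon entropy of those counts. This is only a guess at what such a measure could look like, since we never settled on a definition. The bin width of 25 (in whatever units the scores use, centipawns assumed) and the function name are arbitrary choices.

```python
import math
from collections import Counter


def plot_entropy(xs, ys, bin_width=25):
    """Shannon entropy (in bits) of the 2D histogram of (x, y) score pairs."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("need two non-empty lists of equal length")
    # Assign every point to a square grid cell of side bin_width.
    cells = Counter(
        (math.floor(x / bin_width), math.floor(y / bin_width))
        for x, y in zip(xs, ys)
    )
    total = len(xs)
    return -sum((c / total) * math.log2(c / total) for c in cells.values())


# Fabricated scores: points piled into one cell versus spread over four cells.
print(plot_entropy([0, 5, 10], [0, 5, 10]))
print(plot_entropy([0, 30, 60, 90], [0, 0, 0, 0]))
```

Points concentrated in few cells, as on a tight diagonal, give a low value, and a diffuse cloud gives a high one. The number depends on the bin width and on the number of positions, so it is only comparable between plots built the same way.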
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Post by bob »

Dann Corbit wrote:
Looks like a job for Pearson's correlation coefficient:
http://en.wikipedia.org/wiki/Pearson_pr ... oefficient
Looks like a job for an astrophysicist. That is CLEARLY a picture of a galaxy that is slowly rotating orthogonally to its primary direction of spin/rotation. :) Wait a while and it will look like a circle with more density in the "galactic core". :)