Pairwise Analysis of Chess Engine Move Selections
Posted: Sun Apr 17, 2011 3:25 am
I am presenting some data that I have accumulated, with the purpose of
posing some questions about some of the results.
In general, it is supposed that unrelated engines will tend to choose
different moves when given a position to analyse that has more than one
"good" move, due in part to differences in their evaluation code. By
unrelated, I mean engines from different authors, who have varying skills
in programming and in the game of chess. Of course, given the flow of
ideas in the computer chess community, one would expect some overlap of
ideas when comparing two unrelated engines. Yet the actual implementations
of those ideas would be expected to differ somewhat, and thus cause some
differences in move selection. Conversely, one would expect in most cases
that related engines would tend to choose the same moves at a higher
frequency. The data I have generated seems to indicate that these
assumptions are true in general.
However, there are some cases where seemingly unrelated engines select
the same moves at a higher rate than expected. Also, in almost every case
where a group of engines selects the same moves with higher frequency,
there is an open-source engine in that group. I am referring to Fruit,
Strelka, RobboLito/IvanHoe, and Glaurung/Stockfish. This seems to argue
against the possibility that the effect I have observed is due to
convergent ideas or to successful attempts to emulate the evaluation of
other, closed-source engines. It seems, rather, to imply a flow of ideas
from the open-source engines to the others. It does not imply any
wrongdoing, for the authors of original engines (Crafty, Fruit, and
Glaurung, to name a few) choose to open their sources to share their
ideas and help others learn. However, in most cases I could not find any
acknowledgement of the source of the ideas being used. Given the standing
in the community of some of the authors involved, I am led to ask certain
questions, but I make no assumptions or accusations.
Furthermore, the data I have generated is not proof of anything. It is
only circumstantial: patterns of data that require additional information
to interpret correctly. So, here are my questions:
Is there a discernible threshold where the fair use of ideas turns into
plagiarism?
Is it possible to use many of the ideas from another engine and yet create
a unique engine that avoids plagiarism?
Is it possible that there are explanations for my data other than the
overuse of ideas from open-source engines?
I used Don Dailey's similarity tool to create the data. The tool contains
8000+ positions for the engines to analyse. While not every position
yields useful information, there are more than enough "good" positions to
support statistical inferences. Since the tool sends UCI commands, I had
to use WB2UCI for the few WinBoard engines that I could get to run the
test properly. In one case (Delfi), I had to use InBetween to change the
"go depth 50" command to "go infinite" in order to run the test correctly.
Several engines would not run the test correctly and are not included.
I gave Houdini 1.5 10 ms per position and determined the thinking time
for the other engines by this formula: 10 ms * 2^(Elo Diff / 100), where
Elo Diff is the number of Elo points by which the engine is rated below
Houdini 1.5. I had to guesstimate the Elo difference in several cases;
this is merely an attempt to cancel, to some degree, the correlation
between depth searched and matched moves. From other data I know that the
correlation is not very strong (the number of matched moves does not
change greatly with changes in thinking time). Let me add that Houdini
searches to a depth of roughly 8 plies when given 10 ms, depending on the
position. Also, each engine used one core and 64 MB of hash (I thought
that more than enough hash was better than not enough).
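To make the time handicap concrete, here is a minimal sketch of that
formula in Python. The reference rating and the example ratings are
illustrative guesses on my part, not the values used in the test.
Code:
# Sketch of the time-allocation formula described above.
# HOUDINI_ELO and the example ratings below are placeholder guesses.
BASE_TIME_MS = 10        # time given to Houdini 1.5 per position
HOUDINI_ELO = 3200       # assumed reference rating

def thinking_time_ms(engine_elo):
    """More time for weaker engines: 10 ms * 2^(Elo deficit / 100)."""
    elo_diff = HOUDINI_ELO - engine_elo      # how far below Houdini 1.5
    return BASE_TIME_MS * 2 ** (elo_diff / 100)

print(round(thinking_time_ms(3200)))   # 10   (Houdini 1.5 itself)
print(round(thinking_time_ms(2530)))   # 1040 (an engine about 670 Elo weaker)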
Some engines obey the UCI commands more faithfully than others,
particularly the "stop" command. Also, as I noted above, I guesstimated
some Elo ratings, and in fact the Elo ratings I used are not that precise.
The end result is that the data generated has some imprecision inherent
to it, so borderline cases of "similarity" should be ignored.
Also, instead of looking at the raw percentages (which are relative to
the positions found in the tool and the conditions of the test), I
standardized the data. This involves computing the mean (average) and the
sample standard deviation, and then converting each data point with this
formula: (Percent matched - Mean) / standard deviation. The resulting
numbers convey how far from the mean the percentage of matched moves is
in each case. This also indicates how likely such a result would be if
the assumptions of the test are true. The following table gives the
probability of a result occurring:
Code:
1σ 31.7310508% 1 / 3.1514872
2σ 4.5500264% 1 / 21.977895
3σ 0.2699796% 1 / 370.398
4σ 0.006334% 1 / 15,787
5σ 0.0000573303% 1 / 1,744,278
6σ 0.0000001973% 1 / 506,800,000
7σ 0.0000000002560% 1 / 390,700,000,000
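As a sketch of the standardization step and of where the table above
comes from, the computation looks roughly like this in Python (the
54.82 / 45.71 / 2.31 figures are taken from the Aristarch 4.50 - List
pair reported further down):
Code:
import math

def standardize(percent_matched, mean, stdev):
    """Standard deviations by which a match percentage exceeds the mean."""
    return (percent_matched - mean) / stdev

def two_sided_tail(sigmas):
    """P(|Z| >= sigmas) for a standard normal variable, as in the table above."""
    return math.erfc(sigmas / math.sqrt(2))

z = standardize(54.82, 45.71, 2.31)   # Aristarch 4.50 - List pair
print(round(z, 2))                    # 3.94
print(two_sided_tail(3))              # ~0.0026998, i.e. about 1 / 370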
The underlying assumption that I am using is that engines from different
families are unrelated with regard to evaluation, unrelated in the sense
that the overlap of ideas is not great. In the case of Spark and Bright,
one would have to think that they share a lot of code, yet they act
differently when choosing moves. I chose the following 37 engines as the
basis for my comparisons. They were chosen because they are not highly
similar in their move selection.
Code:
1) Alfil 8.1.1 Opt (time: 485 ms scale: 1.0)
2) Arasan 12.3 (time: 761 ms scale: 1.0)
3) Aristarch 4.50 (time: 686 ms scale: 1.0)
4) Bison 9.11 (time: 288 ms scale: 1.0)
5) Bright 0.5c (time: 166 ms scale: 1.0)
6) Crafty 23.4 (time: 149 ms scale: 1.0)
7) Critter 0.80 (time: 39 ms scale: 1.0)
8) Cyrano 0.6b17 (time: 422 ms scale: 1.0)
9) Daydreamer 1.75 (time: 343 ms scale: 1.0)
10) Et Chess 130108 (time: 408 ms scale: 1.0)
11) Fruit 2.1 (time: 381 ms scale: 1.0)
12) Gandalf 6.0 (time: 485 ms scale: 1.0)
13) Gaviota 0.80 (time: 844 ms scale: 1.0)
14) Gull 1.2 (time: 59 ms scale: 1.0)
15) Hannibal 1.0a (time: 89 ms scale: 1.0)
16) Hermann 2.6 (time: 970 ms scale: 1.0)
17) HIARCS 12 (time: 113 ms scale: 1.0)
18) Houdini 1.5 (time: 10 ms scale: 1.0)
19) Jonny 4.00 (time: 209 ms scale: 1.0)
20) Komodo 1.3 (time: 39 ms scale: 1.0)
21) List 512 (time: 577 ms scale: 1.0)
22) Little Goliath Evolution (time: 1004 ms scale: 1.0)
23) Movei 00.8.438 (time: 453 ms scale: 1.0)
24) Naum 2.0 (time: 279 ms scale: 1.0)
25) Protector 1.4.0 (time: 77 ms scale: 1.0)
26) RedQueen 0.9.5 (time: 663 ms scale: 1.0)
27) Ruffian 2.1.0 (time: 502 ms scale: 1.0)
28) Rybka 4 (time: 18 ms scale: 1.0)
29) Shredder 11 (time: 102 ms scale: 1.0)
30) SOS 5.1 (time: 640 ms scale: 1.0)
31) Spark 1.0 (time: 61 ms scale: 1.0)
32) Spike 1.4 (time: 65 ms scale: 1.0)
33) Stockfish 2.0 (time: 20 ms scale: 1.0)
34) Thinker 5.4d Active (time: 144 ms scale: 1.0)
35) Tornado 4.4 (time: 243 ms scale: 1.0)
36) Ufim 8.02 (time: 844 ms scale: 1.0)
37) Yace 9987 (time: 1040 ms scale: 1.0)
This list produced 666 data points. Each data point represents the
percentage of matching moves for one possible pair of engines. The mean
percentage of matched moves for the pairs is 45.71% and the sample
standard deviation is 2.31. The only pairs that were more than 3 standard
deviations from the mean were:
Aristarch 4.50 - List 5.12 = 3.94 deviations
Cyrano 0.6b17 - Fruit 2.1 = 3.36 deviations
List 5.12 - Ruffian 2.1.0 = 3.44 deviations
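For concreteness, this base-list step can be sketched as follows. The
match percentages are assumed to come from the similarity tool's output;
the parsing is omitted and match_pct is a hypothetical mapping from an
engine pair to its matched-move percentage.
Code:
from itertools import combinations
from statistics import mean, stdev

def flag_outlier_pairs(match_pct, engines, threshold=3.0):
    """Standardize every pairing of the base engines and return the outliers.

    37 engines give C(37, 2) = 666 pairs; for my base list the mean came
    out to 45.71% with a sample standard deviation of 2.31."""
    pairs = list(combinations(sorted(engines), 2))
    values = [match_pct[pair] for pair in pairs]
    m, s = mean(values), stdev(values)
    flagged = [(pair, pct, round((pct - m) / s, 2))
               for pair, pct in zip(pairs, values)
               if (pct - m) / s >= threshold]
    return sorted(flagged, key=lambda item: item[2], reverse=True)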
Then I proceeded engine by engine, adding the data for one engine,
compared against these base engines, to the 666 data points. I then
recomputed the mean and deviation on this combined data and standardized
it. In cases where the engine being tested was related to an engine
included in the base list (for example, Rybka 1.0 beta), I swapped the
related base-list engine out for the engine being tested.
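A rough sketch of that step, reusing the hypothetical match_pct mapping
from the previous snippet; reducing the related-engine handling to a
single swap is my reading of the procedure, not something the tool does
by itself.
Code:
from statistics import mean, stdev

def standardized_scores(candidate, base_engines, base_values, match_pct,
                        related=None):
    """Add one candidate engine's comparisons to the base data and standardize.

    base_values holds the 666 base match percentages; if the candidate is
    related to a base engine (e.g. Rybka 4 when testing Rybka 1.0 beta),
    that engine is dropped from the comparison list first."""
    opponents = [e for e in base_engines if e != related]
    # match_pct is assumed to accept the pair in (candidate, opponent) order.
    new_values = [match_pct[(candidate, e)] for e in opponents]
    combined = base_values + new_values
    m, s = mean(combined), stdev(combined)
    return [(candidate, e, pct, round((pct - m) / s, 2))
            for e, pct in zip(opponents, new_values)]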
Here is the list of engine pairs that were at least 3 standard deviations
from their means. A threshold of 3 standard deviations may not be strict
enough in this context; perhaps 4 or more deviations would be more
appropriate. The first column of numbers is the percentage of matched
moves, and the second column is the number of standard deviations from
the mean of the group of engine pairs that the particular pair was
compared against. The numbers in the first column are comparable; the
numbers in the second column are only weakly comparable.
Code:
Houdini 1.00 RobboLito 0.085g3 71.15 9.90
Loop 2007 Fruit 2.1 71.13 9.74
Onno 1.0.4 Fruit 2.1 67.83 8.71
Rybka 1.0 beta Strelka 2.0 B 66.78 8.65
Thinker 5.4D Inert Strelka 2.0 B 62.07 6.80
Naum 4.2 Rybka 1.0 beta 61.00 6.71
RobboLito 0.085g3 Houdini 1.5 61.87 6.59
Naum 4.2 Strelka 2.0 B 61.37 6.48
Philou 3.5.1 Stockfish 2.0 61.17 6.23
Alaric 707 Fruit 2.1 60.28 6.02
Colossus 2008 Fruit 2.1 59.69 5.83
Glaurung 1.2.1 Alfil 8.1.1 Opt 59.07 5.82
Thinker 5.4D Inert Rybka 1.0 beta 58.50 5.78
Fruit 1.5 Rotor 0.6 59.09 5.76
Rybka 3 RobboLito 0.085g3 59.16 5.37
Critter 1.0 RobboLito 0.085g3 59.04 5.29
Strelka 2.0 B Fruit 2.1 58.18 5.02
Delfi 5.4 Fruit 2.1 57.50 4.85
Naraku 1.31 Fruit 2.1 57.00 4.79
TwistedLogic 20090922 Strelka 2.0 B 56.65 4.73
Rybka 2.3.2a Fruit 2.1 57.26 4.71
Thinker 5.4D Passive Fruit 2.1 56.65 4.69
Rybka 1.0 Beta Fruit 2.1 56.85 4.68
Critter 1.0 Houdini 1.5 57.40 4.64
SmarThink 1.20 Fruit 2.1 55.89 4.33
Hamsters 0.7.1 Fruit 2.1 55.38 4.07
Aristarch 4.50 List 512 54.82 3.94
RobboLito 0.085g3 Rybka 4 54.98 3.77
Rybka 3 Houdini 1.5 55.16 3.76
Pupsi 0.08 Fruit 2.1 54.36 3.71
Glaurung 2.2 Fruit 2.1 54.29 3.68
Strelka 2.0 B Rybka 4 54.07 3.62
Daydreamer 1.7 Strelka 2.0 B 54.07 3.60
Philou 3.5.1 Fruit 2.1 54.53 3.53
List 512 Ruffian 2.1.0 53.65 3.44
Cyrano 0.6b17 Fruit 2.1 53.48 3.36
Philou 3.5.1 Hannibal 1.0a 53.98 3.30
Strelka 2.0 B Hannibal 1.0a 52.72 3.04
Strelka 2.0 B Bison 9.11 52.63 3.00
All of my data and analysis can be downloaded here: http://www.mediafire.com/file/8d3pq334a ... nalysis.7z