Stylistic bias in computer chess programs

Discussion of chess software programming and technical issues.

Moderators: bob, hgm, Harvey Williamson

Don
Posts: 5106
Joined: Tue Apr 29, 2008 2:27 pm

Stylistic bias in computer chess programs

Post by Don » Fri Dec 09, 2011 7:19 pm

The similarity study I did a while back was interesting in that it
showed that programs have very particular styles of play that seem to
have very little relation to their actual playing strength. For
example, I showed that running Komodo at levels that would make it
hundreds of ELO stronger had only a small impact on how many moves
were played differently.

So I decided to do a follow-up study comparing chess programs to
actual human players, in an attempt to discover whether some programs
play significantly more like one player than another.

I picked 2000 random positions from games played by the player in
question and counted how many times each program matched the player's
move. I searched each position for 1/4 second, stopped the program,
and accepted the returned move.
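The counting procedure can be sketched as follows. This is a minimal illustration, not Don's actual script: the engine is abstracted as a callable (in practice it would wrap a UCI engine searched for 0.25 seconds), and the position names and moves below are toy stand-ins.

```python
def match_percentage(positions, played_moves, engine_move):
    """Percentage of positions where engine_move(pos) equals the move
    the human actually played (both in the same notation)."""
    hits = sum(1 for pos, human in zip(positions, played_moves)
               if engine_move(pos) == human)
    return 100.0 * hits / len(positions)

# Toy data standing in for real FENs and real engine output:
toy_engine = {"pos1": "e4", "pos2": "Nf3", "pos3": "e4", "pos4": "d4"}
positions = ["pos1", "pos2", "pos3", "pos4"]
human_moves = ["e4", "Nf3", "c4", "d4"]

print(match_percentage(positions, human_moves, toy_engine.get))  # 75.0
```

The real harness would replace `toy_engine.get` with a function that feeds the FEN to the engine and returns its best move after the fixed search time.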

First of all, let me give my disclaimer - I don't know what this
actually measures, and I would not attach too much meaning to it.
Determining whether one player plays "like" another is probably
more complicated than just counting how many moves they would play in
common. Nevertheless, I am going to use terminology such as "plays
like" or "plays more like" with the understanding that I mean from the
point of view of whatever this test actually measures (if anything of
relevance).

I was actually disappointed to find that the programs I tested
predicted the human players' moves at about the same rate. The old
Spike program showed the biggest difference: of the 4 human players I
picked, it appears to play most like Fischer and least like Karpov.

The sample of 2000 positions is probably not enough; statistical noise
could easily change things.
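For scale, here is a rough binomial estimate of that noise, under the simplifying (and not quite correct) assumption that each position is an independent trial:

```python
import math

def stderr_points(p, n):
    """Standard error of a match percentage, in percentage points,
    treating each position as an independent Bernoulli trial."""
    return 100.0 * math.sqrt(p * (1.0 - p) / n)

# A ~45% match rate over 2000 positions:
print(round(stderr_points(0.45, 2000), 2))  # 1.11
```

A standard error near 1.1 points means the roughly 2.5-point spread between Komodo's Karpov and Anand numbers is only a couple of standard errors, so the orderings reported below could indeed shuffle with a different sample.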

Another interesting thing is that the strength of the programs in
question does not seem very related to how often they choose the
human's move. Houdini is the worst program in this regard, but it's
hard to say what that means or whether it's good or bad. One could
argue that Houdini does not have a human style, but I don't know if
this really indicates that or not. One could also argue that computers
are much stronger than humans now and getting a high match with human
moves is a BAD thing, not a good thing. I'll leave any interpretation
up to you.

I had intended to test other programs and include more human players
in this study, but I have mostly lost interest, since I envisioned this
as a (non-serious) way to classify programs by which players they best
matched up with. I could still do that, but there do not seem to be
strong biases in programs to play like any particular player over
another.

Here is the data, showing the percentage of the time each program
matched the moves of the specified player:

komodo 4328.06
Karpov -> 45.10
Fischer -> 45.00
Carlsen -> 44.55
Anand -> 42.55


Spike 1.2 Turin
Fischer -> 50.00
Anand -> 48.05
Carlsen -> 47.95
Karpov -> 45.95


houdini 1.5
Carlsen -> 37.90
Karpov -> 37.60
Fischer -> 37.60
Anand -> 35.30

Don
Posts: 5106
Joined: Tue Apr 29, 2008 2:27 pm

Re: Stylistic bias in computer chess programs

Post by Don » Fri Dec 09, 2011 7:25 pm

Don wrote: Another interesting thing is that the strength of the programs in question does not seem very related to how often they choose the human's move.
Or perhaps I should have said it could even be inversely related: the strongest program matched the humans the least and the weakest matched them the most. There are not enough programs and humans here to say for sure whether this is an actual trend.

JuLieN
Posts: 2948
Joined: Mon May 05, 2008 10:16 am
Location: Nantes (France)
Contact:

Re: Stylistic bias in computer chess programs

Post by JuLieN » Fri Dec 09, 2011 10:40 pm

Thx Don, very interesting study (I actually thought about such an experiment myself but had no time for it).

A reason why programs won't display especially high correlations with human players might be that these programs are stronger.

What I mean is that, for instance, if you want to test correlations with Karpov at his peak, you should pick programs that really are around 2780 human Elo. Or 2820 for Fischer or Carlsen, etc.

If Houdini REALLY plays like a 3300 human player would, using it to test style correlations with human styles would be like searching for such a correlation between Fruit and a 2300 Elo player...

Of course I am implying that strength and style are correlated (for instance, given a position with, say, 50 legal moves, very strong players will only consider two of them, which leaves almost no room for style; an IM would consider 5 moves, strong club players maybe 8-9, and wood-pushers nearly all of them :) ).
"The only good bug is a dead bug." (Don Dailey)
[Blog: http://tinyurl.com/predateur ] [Facebook: http://tinyurl.com/fbpredateur ] [MacEngines: http://tinyurl.com/macengines ]

Don
Posts: 5106
Joined: Tue Apr 29, 2008 2:27 pm

Re: Stylistic bias in computer chess programs

Post by Don » Fri Dec 09, 2011 10:51 pm

JuLieN wrote: Of course I am implying that strength and style are correlated (for instance, given a position with, say, 50 legal moves, very strong players will only consider two of them, which leaves almost no room for style...).
I'm starting to think this study is more interesting than I had previously thought. It's probably the case that 90% of the moves are normal or natural, and any human player or computer would likely play the same move, so if those were removed we might see that some programs MUCH prefer the moves of certain players over others.

I can cull out these positions by taking a variety of programs and removing any positions where all the programs play the same move.
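That culling step might look like this in outline; the engine names and moves below are illustrative, not real data:

```python
def cull_unanimous(engine_moves):
    """engine_moves: {position: {engine_name: chosen_move}}.
    Keep only the positions where at least two engines disagree,
    i.e. the positions that can actually discriminate between styles."""
    return {pos: moves for pos, moves in engine_moves.items()
            if len(set(moves.values())) > 1}

sample = {
    "posA": {"komodo": "e4",  "spike": "e4", "houdini": "e4"},   # unanimous
    "posB": {"komodo": "Nf3", "spike": "c4", "houdini": "Nf3"},  # split
}
print(sorted(cull_unanimous(sample)))  # ['posB']
```

Match percentages would then be recomputed over the surviving positions only, which should amplify any genuine stylistic differences.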

For example:

Spike 1.2 Turin
Fischer -> 50.00
Anand -> 48.05
Carlsen -> 47.95
Karpov -> 45.95

Spike seems to like Fischer's moves more than Karpov's, but out of 2000 positions it matches only 81 more of Fischer's moves, a difference of about 4 percentage points. Out of 2000, 81 is not that many, but if 1000 of those moves were routine and not included, it would start to look a lot more significant.

Don

Don
Posts: 5106
Joined: Tue Apr 29, 2008 2:27 pm

Re: Stylistic bias in computer chess programs

Post by Don » Fri Dec 09, 2011 11:04 pm

JuLieN wrote: A reason why programs won't display especially high correlations with human players might be that these programs are stronger.
I have to disagree with you for a couple of reasons. Tell me what you think:

1. The programs are probably weaker than the top Grandmasters because I am testing them at 0.25 seconds per move on a relatively slow computer.

2. I didn't report this, but I did one run with Komodo where I increased the time by a factor of 10, and for that player the match percentage went up by only about 1.3 points. At 10x, Komodo would be hundreds of ELO stronger, but as I have seen many times already, programs are not really that sensitive to time when it comes to similarity testing or move choice. Of course it matters SOME, but it does not change the basic style of the program. Yes, programs may play 500 ELO stronger, but they still make the same moves the majority of the time.

JuLieN wrote: Of course I am implying that strength and style are correlated ...
But for chess programs the correlation is apparently quite small. I could run Spike at much longer time controls so that it's actually playing stronger than Houdini, but I'll bet the correlation will not go down. When I ran Komodo at a longer time, the correlation with the player I was testing against actually increased, but of course that might be some evidence that you are correct, because it was probably closer in strength to the GM I was testing.
JuLieN wrote: (for instance, given a position with, say, 50 legal moves, very strong players will only consider two of them, which leaves almost no room for style; an IM would consider 5 moves, strong club players maybe 8-9, and wood-pushers nearly all of them :) ).
I should probably include some much weaker players in the study, such as lower-rated IMs or something like that.

JuLieN
Posts: 2948
Joined: Mon May 05, 2008 10:16 am
Location: Nantes (France)
Contact:

Re: Stylistic bias in computer chess programs

Post by JuLieN » Fri Dec 09, 2011 11:05 pm

Don wrote: I can cull out these positions by taking a variety of programs and removing any positions where all programs play the same moves.
Brilliant, that would be a good start! (And you might want to save the remaining positions as EPD test files?)

There's a small flaw anyway: people are not machines. They can be tired, or not in their usual mood. So Karpov, for instance, won't always play like Karpov. What fraction of Karpov's moves are typically karpovian (i.e., moves that only Karpov would usually choose)? Paradoxically, if we had a program that always played the moves only Karpov would choose, it would be more karpovian than Karpov himself, and you'd never reach a 100% correlation. Still, this program would get the highest correlation anyway (that's why I called it a small flaw).

(So a 60% correlation result could actually mean 100%, because even Karpov only plays karpovian moves 60% of the time.)
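One way to make that point concrete is to rescale an engine's raw match rate by the player's own self-consistency ceiling. The 60% ceiling here is JuLieN's hypothetical figure, not measured data:

```python
def normalized_match(engine_pct, player_ceiling_pct):
    """Rescale a raw match percentage so that the player's own
    self-consistency ceiling counts as a perfect 100% score."""
    return 100.0 * engine_pct / player_ceiling_pct

# If even Karpov only plays "karpovian" moves 60% of the time,
# a raw 60% match is already a perfect score:
print(normalized_match(60.0, 60.0))            # 100.0
# Komodo's raw 45.10% against Karpov, under the same ceiling:
print(round(normalized_match(45.10, 60.0), 1)) # 75.2
```

Estimating the ceiling itself is the hard part; one would need some independent measure of how often the player's move is distinctively his own.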

JuLieN
Posts: 2948
Joined: Mon May 05, 2008 10:16 am
Location: Nantes (France)
Contact:

Re: Stylistic bias in computer chess programs

Post by JuLieN » Fri Dec 09, 2011 11:14 pm

Don wrote: 1. The programs are probably weaker than the top Grandmasters because I am testing them at 0.25 seconds on a relatively slow computer.
The problem is the same if the programs are weaker. ;)
Don wrote: 2. I didn't report this, but I did one run with Komodo where I increased the time by a factor of 10, and for that player the percentage went up by only about 1.3. ...
That's a good point. It could be called the "Dailey law": only a handful of moves in a chess game are decisive (meaning that for the rest, you can choose other moves without compromising your chances).

Which gives me another idea: maybe you could produce EPD test files with only the positions where engines typically change their mind when given greater search time?

This would be especially interesting for your study if this EPD file were itself a subset of the one generated after you culled out the positions on which all the engines agree. Put differently, all those positions would be called "critical" when it comes to differentiating engines.
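Writing such positions out as EPD records is straightforward: an EPD line is the first four FEN fields plus opcodes such as `bm` (best move). A stdlib-only sketch (a library like python-chess can also produce EPD more robustly):

```python
def to_epd(fen, best_move_san):
    """Build an EPD record from a full FEN and a best move in SAN.
    EPD drops the halfmove and fullmove counters from the FEN and
    appends opcodes; here just "bm" for the reference move."""
    fields = fen.split()[:4]  # piece placement, side, castling, ep square
    return " ".join(fields) + f" bm {best_move_san};"

start = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
print(to_epd(start, "e4"))
# rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - bm e4;
```

A "critical positions" file would then be one `to_epd` line per surviving position, with the GM's move (or the stable deep-search move) as the `bm` reference.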

Mincho Georgiev
Posts: 454
Joined: Sat Apr 04, 2009 4:44 pm
Location: Bulgaria
Contact:

Re: Stylistic bias in computer chess programs

Post by Mincho Georgiev » Sat Dec 10, 2011 8:24 am

Honestly, when you started the similarity discussion a while back, I wondered what it would look like if engines were compared to a super-GM's move choices instead of only to other engines'. I'm glad that you decided to give it a try, since I find this very interesting!

Don
Posts: 5106
Joined: Tue Apr 29, 2008 2:27 pm

Re: Stylistic bias in computer chess programs

Post by Don » Sat Dec 10, 2011 5:14 pm

Ok, I found some bugs in my scripts, so I have to redo this study. Essentially the programs were only doing very shallow searches regardless of what level was set.

I'm also going to cull out positions where all the programs play the same move and it matches the GM's, or where most of the programs play the same move and none of them matches the GM's move.

Also, I need suggestions on which players to include. I don't want very many, but I want a good variety of style and popularity. My current list is:

1. Carlsen
2. Fischer
3. Anand
4. Karpov

I would also like to add Tal and any players well known for very interesting styles or considered immensely popular. I don't want more than a small handful, perhaps 10 at the most and preferably 7 or 8.

If I find that certain players produce interesting computer scores, such as extreme matches or mismatches with programs, those would be ideal.

So I need suggestions from you.

Don

JuLieN
Posts: 2948
Joined: Mon May 05, 2008 10:16 am
Location: Nantes (France)
Contact:

Re: Stylistic bias in computer chess programs

Post by JuLieN » Sat Dec 10, 2011 6:12 pm

Well, some very typically styled players are:

- Petrossian (defense!)
- Shirov (attack!)
- Nimzovitch (logical but unnatural)
- Capablanca (natural talent: only smooth moves)
- Alekhine (ancient attacking style, very sharp but still positionally sound)
- Kasparov (like Alekhine, but even more positional).

And I don't think players like Anand or Carlsen have such a personal style... But Fischer and Karpov do.

Also, the measurement should only take place in the middlegame, not the opening or the endgame (too technical, no place for style).
