Engines performance for selected openings

Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Engines performance for selected openings

Post by Don »

Laskos wrote:
Kohflote wrote:Hi,

Firstly, thank you for the info; it is definitely helpful.

Can you please clarify the Elo and performance that you have given for each opening? Are they the Elo and performance of the engine playing Black?

What about Pirc Defence?

Best wishes,
Koh, Kah Huat
The performances for the selected openings are compared with the base performance, given in the first table, of the strength-adjusted engines over 16,000+ games. The engines use different time controls so that their strengths are comparable. These 16,000+ games were played from 5,000+ 8-move PGN opening positions by Frank Quisinsky; they are fairly balanced and not too deep, similar to a generic book. Each of the selected openings is played by all engines with both White and Black, so each engine's performance is its combined White/Black performance for that opening.

I will post the results for Pirc Defence soon.

Kai
So you are saying that how well an engine does in one particular opening is determined by its performance on both sides of that opening.

I am wondering if it would be more useful to break this down into defense and offense. In real games a program may strongly prefer defending an opening it would never choose. Humans won't play openings they think are inferior but would prefer to defend them (for that same reason).

A question I have always had is whether (in general) a program will play an opening better if it "likes" the opening. Each program seems to have openings that, due to its evaluation function, it is "happy" with compared to others. Does that mean it will also play those openings better, or could it even mean it will play them worse? Presumably there is a chance it is mis-evaluating an opening that it prefers over others, and that is why the opening is getting a higher score than the others.

Don
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Engines performance for selected openings

Post by Laskos »

Yes, on both sides of the opening. I am aware that this has its flaws; my assumption is that if an engine "understands" the opening, and it is a mainstream, balanced opening without short-term hits, then it will play it better on both sides. Also, I am not very skilled at separating the PGN file into White and Black performances for each engine; I would rather play gauntlets with fixed colours, which is a very lengthy process. Then I would have to compare against the general White and Black performances in that 16,000-game file.
I agree that an engine could even play the balanced openings it "likes" worse. I do not know whether that is more a mis-evaluation or more a feature of the engine, and it would be interesting to study (my general opinion on that is mixed).
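
Splitting a PGN file into White and Black results per engine can be done from the tag pairs alone, without replaying any moves. The following is a minimal sketch, assuming standard PGN export where each game starts with an `Event` tag and carries `White`, `Black` and `Result` tags. The function name `score_by_colour` and the two sample games are made up for illustration and are not taken from the test file discussed here.

```python
import re
from collections import defaultdict

# Matches PGN tag pairs such as [White "Komodo 5"]
TAG_RE = re.compile(r'^\[(\w+)\s+"(.*)"\]\s*$')

# (White points, Black points) for each standard PGN result string
POINTS = {"1-0": (1.0, 0.0), "0-1": (0.0, 1.0), "1/2-1/2": (0.5, 0.5)}


def score_by_colour(pgn_text):
    """Return {(engine, colour): [points, games]} using tag pairs only."""
    totals = defaultdict(lambda: [0.0, 0])
    tags = {}

    def flush():
        # Count one finished game; unfinished games ("*") are skipped.
        if tags.get("Result") in POINTS and "White" in tags and "Black" in tags:
            w_pts, b_pts = POINTS[tags["Result"]]
            for key, pts in (((tags["White"], "White"), w_pts),
                             ((tags["Black"], "Black"), b_pts)):
                totals[key][0] += pts
                totals[key][1] += 1

    for line in pgn_text.splitlines():
        m = TAG_RE.match(line.strip())
        if m:
            # Assumption: an Event tag marks the start of the next game.
            if m.group(1) == "Event" and tags:
                flush()
                tags.clear()
            tags[m.group(1)] = m.group(2)
    flush()  # last game in the file
    return dict(totals)


# Hypothetical two-game sample, for illustration only
sample = '''[Event "test"]
[White "Komodo 5"]
[Black "Houdini 3"]
[Result "1/2-1/2"]

1. d4 d5 1/2-1/2

[Event "test"]
[White "Houdini 3"]
[Black "Komodo 5"]
[Result "0-1"]

1. e4 e5 0-1
'''

for (engine, colour), (points, games) in sorted(score_by_colour(sample).items()):
    print(f"{engine:12s} {colour:5s} {points}/{games}")
```

The per-colour totals could then be compared with each engine's general White and Black scores, which is the comparison described above.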

I put 3 more openings to the test; the Slav Defence and the Catalan are to follow:

Pirc Defence
[D]rnbqkb1r/ppp1pp1p/3p1np1/8/3PP3/2N5/PPP2PPP/R1BQKBNR w KQkq - 0 4[/D]

    Program                            Score     %     Elo   Performance

  1 Junior 13                      :  65.0/118  55.1   3030     +15
  2 Houdini 3                      :  63.0/120  52.5   3015      +3
  3 Komodo 5                       :  61.5/120  51.2   3008      +3
  4 Rybka 4.1                      :  59.0/118  50.0   3000     +14
  5 Hiarcs 14                      :  58.5/120  48.8   2993     -16
  6 Critter 1.6                    :  58.0/120  48.3   2990     -16
  7 Stockfish 2.3.1                :  53.0/120  44.2   2965      -2 

Four Knights
[D]r1bqkb1r/pppp1ppp/2n2n2/4p3/4P3/2N2N2/PPPP1PPP/R1BQKB1R w KQkq - 0 4[/D]

    Program                            Score     %     Elo   Performance

  1 Critter 1.6                    :  64.0/120  53.3   3020     +14
  2 Hiarcs 14                      :  62.0/119  52.1   3013      +4
  3 Rybka 4.1                      :  58.5/115  50.9   3005     +19
  4 Komodo 5                       :  60.0/118  50.8   3005       0
  5 Houdini 3                      :  59.0/117  50.4   3002     -10
  6 Junior 13                      :  59.0/119  49.6   2998     -17
  7 Stockfish 2.3.1                :  51.5/120  42.9   2958      -9 

No engine under- or over-performs in these 2 openings beyond the error margins (50 Elo points).
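
As a rough plausibility check on a 50 Elo margin for about 120 games, the margin can be estimated from the per-game variance of the score. This is only a sketch under stated assumptions: games are treated as independent, the logistic Elo model is used, the interval is a 95% one (`z=1.96`), and the draw ratio of 0.4 is a guess, since the thread does not state it. The function names are hypothetical.

```python
import math


def elo_diff(score):
    """Elo difference implied by an expected score under the logistic model."""
    return 400.0 * math.log10(score / (1.0 - score))


def elo_margin(games, score=0.5, draw_ratio=0.4, z=1.96):
    """Approximate +/- Elo margin for `games` independent games.

    draw_ratio is an assumption; it is not given in the thread.
    """
    wins = score - draw_ratio / 2.0                    # win probability
    variance = wins + draw_ratio / 4.0 - score ** 2    # per-game score variance
    stderr = math.sqrt(variance / games)
    low = elo_diff(score - z * stderr)
    high = elo_diff(score + z * stderr)
    return (high - low) / 2.0


print(round(elo_margin(120), 1))                  # with the assumed 40% draws
print(round(elo_margin(120, draw_ratio=0.0), 1))  # with no draws at all
```

With these assumptions the margin for 120 games comes out a little under 50 Elo, and somewhat above 60 Elo if there were no draws, so a figure of about 50 Elo is plausible for samples of this size.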



Semi-Slav Defence
[D]rnbqkb1r/pp3ppp/2p1pn2/3p4/2PP4/2N2N2/PP2PPPP/R1BQKB1R w KQkq - 0 5[/D]

    Program                            Score     %     Elo   Performance

  1 Komodo 5                       :  69.5/119  58.4   3055     +50
  2 Critter 1.6                    :  66.5/119  55.9   3035     +29
  3 Junior 13                      :  61.0/120  50.8   3005     -10
  4 Rybka 4.1                      :  57.0/116  49.1   2995      +9
  5 Stockfish 2.3.1                :  55.5/118  47.0   2982     +15
  6 Houdini 3                      :  55.5/119  46.6   2980     -32
  7 Hiarcs 14                      :  50.0/119  42.0   2952     -57 
Komodo overperforms, Hiarcs underperforms.
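
The over- and under-performance call can be reproduced mechanically from the Semi-Slav table above. A minimal sketch, assuming the Performance column is the opening Elo minus the engine's base Elo (the base table itself is not shown here, so the base values are only inferred) and assuming an engine is flagged when its Performance reaches the 50 Elo margin. The helper names are hypothetical.

```python
# Semi-Slav rows from the table above: engine -> (opening Elo, Performance)
SEMI_SLAV = {
    "Komodo 5": (3055, +50),
    "Critter 1.6": (3035, +29),
    "Junior 13": (3005, -10),
    "Rybka 4.1": (2995, +9),
    "Stockfish 2.3.1": (2982, +15),
    "Houdini 3": (2980, -32),
    "Hiarcs 14": (2952, -57),
}

MARGIN = 50  # error margin in Elo points quoted in the post


def base_elo(rows):
    """Base Elo implied by each row, assuming Performance = opening Elo - base Elo."""
    return {name: elo - perf for name, (elo, perf) in rows.items()}


def outliers(rows, margin=MARGIN):
    """Engines whose Performance reaches the margin, labelled over/under."""
    flagged = {}
    for name, (_, perf) in rows.items():
        if abs(perf) >= margin:  # ">=" is an assumption, see the note below
            flagged[name] = "over" if perf > 0 else "under"
    return flagged


print(outliers(SEMI_SLAV))
print(base_elo(SEMI_SLAV)["Komodo 5"])
```

The comparison uses `>=` because Komodo's +50 sits exactly on the margin; with a strict `>` only Hiarcs would be flagged.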
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Engines performance for selected openings

Post by Don »

I don't know the answer to your question either. I think it is interesting to see whether an engine can handle both sides. I am not suggesting that we ignore that, only that we include more statistics if it's easy to do.

I am running my drawishness test and I am pretty skilled at breaking things down, so maybe I can supplement your data. My data is going to have thousands of openings, each 5 moves deep, so this will limit the extent to which I can break things down. In many cases I would have to combine lines that are forced on the machine. For example, the Ruy Lopez after Bb5 can be continued in many different ways by both sides, and until the 6th move the responses are forced on the computer. So if I have "Ruy Lopez" as one category, it may not correctly represent how each program would continue after 3. Bb5 - we can only get that answer after Black's 5th move. But it still might be useful.
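
Combining forced lines into one category such as "Ruy Lopez" amounts to pooling games that share the same first few half-moves. A minimal sketch, assuming each game is available as a list of SAN moves plus White's result; the function name `group_by_prefix` and the three sample lines with their results are invented for illustration.

```python
from collections import defaultdict


def group_by_prefix(games, plies):
    """Pool results of games that share the same first `plies` half-moves.

    games: iterable of (moves, white_points), where moves is a list of SAN
    strings and white_points is 1, 0.5 or 0.
    Returns {prefix_tuple: (white_points_total, game_count)}.
    """
    pooled = defaultdict(lambda: [0.0, 0])
    for moves, white_points in games:
        if len(moves) < plies:
            continue  # line is shorter than the requested prefix
        key = tuple(moves[:plies])
        pooled[key][0] += white_points
        pooled[key][1] += 1
    return {key: (pts, n) for key, (pts, n) in pooled.items()}


# Hypothetical 5-move lines with made-up results
games = [
    (["e4", "e5", "Nf3", "Nc6", "Bb5", "a6", "Ba4", "Nf6", "O-O", "Be7"], 1),
    (["e4", "e5", "Nf3", "Nc6", "Bb5", "Nf6", "O-O", "Nxe4", "d4", "Nd6"], 0.5),
    (["e4", "e5", "Nf3", "Nc6", "Bc4", "Bc5", "c3", "Nf6", "d3", "d6"], 0),
]

# Five half-moves reach 3. Bb5, so both Ruy Lopez lines land in one category.
for prefix, (points, count) in group_by_prefix(games, 5).items():
    print(" ".join(prefix), f"{points}/{count}")
```

A short prefix gives larger samples per category but mixes continuations that were forced by the book rather than chosen by the engines, which is the trade-off described above.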
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Engines performance for selected openings

Post by Laskos »

I think my 16,000-game PGN test file wouldn't be enough for this task: it means only about 2,300 games per engine per colour, and for some selected openings it would have some 50 games at best, though for some openings it could work. Maybe I will try to break it down by position/colour/engine and somehow compute ratings.
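
The figure of about 2,300 games per engine per colour follows from simple counting. A small sketch of the arithmetic, assuming the 7 engines listed in the tables share the 16,000 games equally; the function name is hypothetical.

```python
def games_per_engine_per_colour(total_games, engines):
    """Average number of games each engine plays with one colour.

    Every game has exactly one White and one Black player, so each colour
    accounts for total_games appearances spread over all engines.
    """
    return total_games / engines


per_colour = games_per_engine_per_colour(16000, 7)
print(round(per_colour))  # about 2,300 after rounding to the nearest hundred
```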

I am posting two more positions:

Slav Defence
[D]rnbqkbnr/pp2pppp/2p5/3p4/2PP4/8/PP2PPPP/RNBQKBNR w KQkq - 0 3[/D]

    Program                            Score     %     Elo   Performance

  1 Komodo 5                       :  66.0/119  55.5   3033     +28
  2 Stockfish 2.3.1                :  64.5/120  53.8   3023     +56
  3 Rybka 4.1                      :  62.5/119  52.5   3015     +29
  4 Hiarcs 14                      :  62.0/120  51.7   3010      +1
  5 Houdini 3                      :  60.5/120  50.4   3003      -9
  6 Junior 13                      :  56.0/120  46.7   2980     -25
  7 Critter 1.6                    :  47.5/120  39.6   2937     -69 
Stockfish overperforms and Critter underperforms here.


Catalan
[D]rnbqkb1r/ppp2ppp/4pn2/3p4/2PP4/6P1/PP2PPBP/RNBQK1NR b KQkq - 0 4[/D]

    Program                            Score     %     Elo   Performance

  1 Komodo 5                       :  70.5/119  59.2   3056     +51
  2 Houdini 3                      :  65.5/119  55.0   3030     +18
  3 Critter 1.6                    :  62.5/118  53.0   3018     +12
  4 Rybka 4.1                      :  60.5/116  52.2   3013     +27
  5 Hiarcs 14                      :  55.5/120  46.2   2978     -31
  6 Junior 13                      :  54.5/120  45.4   2972     -43
  7 Stockfish 2.3.1                :  46.0/118  39.0   2933     -34 
Komodo overperforms here.
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Engines performance for selected openings

Post by Don »

Yes, it's difficult to fix a starting position and still measure how a program does with large enough samples. If you provide variety beyond that position, you influence the results.

It is possible to get variety by running the games over and over again with the same openings; the games always seem to vary at some point. That doesn't seem very satisfying, though. You could run several matches with slightly different time controls, which would provide more variety at early points. It's tough to know how much relevance this has.