Performances of engines in the Endgame

Discussion of anything and everything relating to chess playing software and machines.

Moderator: Ras

User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Performances of engines in the Endgame

Post by Laskos »

I ran engines in two round-robins, the first from opening positions and the second from endgame positions. I then eliminated the draws, as endgame play from equal positions is drawish and hard to compare with play from opening positions:

From openings:

Code: Select all

    Program                            Score       %     Elo    +   -

  1 Houdini 3 Pro x64              : 2075.0/2699  76.9   3166   16  15
  2 Stockfish 2.3.1 JA 64bit       : 1895.0/2784  68.1   3104   14  14
  3 Critter 1.6 64-bit             : 1281.0/2711  47.3   2985   13  13
  4 Komodo 5 64-bit                : 1158.0/2744  42.2   2961   13  13
  5 Deep Rybka 4.1 x64             :  551.0/2982  18.5   2802   16  16
From endgame positions:

Code: Select all

    Program                            Score       %     Elo    +   -

  1 Stockfish 2.3.1 JA 64bit       : 1339.0/1879  71.3   3124   17  17
  2 Houdini 3 Pro x64              : 1101.0/1698  64.8   3085   17  17
  3 Komodo 5 64-bit                :  914.0/1888  48.4   2994   16  16
  4 Critter 1.6 64-bit             :  833.0/1847  45.1   2974   16  16
  5 Deep Rybka 4.1 x64             :  451.0/1964  23.0   2837   18  18
Comparing the two:

Houdini 3 underperforms in the endgame by 81 Elo points
Stockfish 2.3.1 overperforms in the endgame by 20 Elo points
Komodo 5 overperforms in the endgame by 33 Elo points
Critter 1.6 underperforms in the endgame by 11 Elo points
Rybka 4.1 overperforms in the endgame by 35 Elo points
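The over/underperformance figures are just the per-engine differences between the Elo columns of the two tables; a trivial check:

```python
# Elo from the two tables above: (opening, endgame) per engine.
ratings = {
    "Houdini 3":       (3166, 3085),
    "Stockfish 2.3.1": (3104, 3124),
    "Critter 1.6":     (2985, 2974),
    "Komodo 5":        (2961, 2994),
    "Deep Rybka 4.1":  (2802, 2837),
}
for name, (opening, endgame) in ratings.items():
    print(f"{name}: {endgame - opening:+d} Elo in the endgame")
# e.g. Houdini 3: -81 Elo in the endgame
```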
User avatar
Evert
Posts: 2929
Joined: Sat Jan 22, 2011 12:42 am
Location: NL

Re: Performances of engines in the Endgame

Post by Evert »

What end-game positions did you use?

One of the things I like to do is add endgame knowledge to my program (even though it's not very strong and there are probably other things that are more important), but it's very hard to verify that it works, apart from testing against a bunch of standard problem sets from books. In regular matches the difference in performance due to one piece of endgame knowledge is small (even for rook endings, that's still only ~10% of all games that reach the endgame) and therefore hard to measure...
User avatar
Ajedrecista
Posts: 2226
Joined: Wed Jul 13, 2011 9:04 pm
Location: Madrid, Spain.

Re: Performances of engines in the endgame.

Post by Ajedrecista »

Hello Don:
Laskos wrote:I ran engines in two round-robins, the first from opening positions and the second from endgame positions. I then eliminated the draws, as endgame play from equal positions is drawish and hard to compare with play from opening positions:

From openings:

Code: Select all

    Program                            Score       %     Elo    +   -

  1 Houdini 3 Pro x64              : 2075.0/2699  76.9   3166   16  15
  2 Stockfish 2.3.1 JA 64bit       : 1895.0/2784  68.1   3104   14  14
  3 Critter 1.6 64-bit             : 1281.0/2711  47.3   2985   13  13
  4 Komodo 5 64-bit                : 1158.0/2744  42.2   2961   13  13
  5 Deep Rybka 4.1 x64             :  551.0/2982  18.5   2802   16  16
From endgame positions:

Code: Select all

    Program                            Score       %     Elo    +   -

  1 Stockfish 2.3.1 JA 64bit       : 1339.0/1879  71.3   3124   17  17
  2 Houdini 3 Pro x64              : 1101.0/1698  64.8   3085   17  17
  3 Komodo 5 64-bit                :  914.0/1888  48.4   2994   16  16
  4 Critter 1.6 64-bit             :  833.0/1847  45.1   2974   16  16
  5 Deep Rybka 4.1 x64             :  451.0/1964  23.0   2837   18  18
Comparing the two:

Houdini 3 underperforms in the endgame by 81 Elo points
Stockfish 2.3.1 overperforms in the endgame by 20 Elo points
Komodo 5 overperforms in the endgame by 33 Elo points
Critter 1.6 underperforms in the endgame by 11 Elo points
Rybka 4.1 overperforms in the endgame by 35 Elo points
With the data you provide, there are 6960 non-drawn games from the opening and 4638 non-drawn games from endgame positions.

Just looking at the output, it seems that you used EloSTAT. I think you should also try other well-known rating programmes like BayesElo and Ordo, just for comparison purposes.

If you eliminated draws, then the error bars grow relative to the error bars with draws included, if EloSTAT works the way I think it does (which is also my method). Anyway, the error bars seem too high for the usual 95% confidence; I did some calculations with Derive 6 and it seems that the parameter z of a normal distribution is around 2.58, if I did things correctly. Knowing that 99% confidence corresponds to the interval |z| < 2.575829303, I suppose that you used a 99% confidence interval (or a LOS of 99.5%). Am I right?

I think it would be interesting if you also added the ratings with draws included, to see their impact on the ratings. You would then have twelve different rating lists [(opening, endgame)·(EloSTAT, BayesElo, Ordo)·(with draws, without draws)]. Maybe that is a little difficult to handle.
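For reference, the z-values in question can be reproduced with the Python standard library (a minimal sketch; `NormalDist` requires Python 3.8+):

```python
from statistics import NormalDist

# Two-sided z thresholds of the standard normal distribution:
# 95% confidence -> |z| < 1.96, 99% confidence -> |z| < 2.5758...
z95 = NormalDist().inv_cdf(1 - 0.05 / 2)
z99 = NormalDist().inv_cdf(1 - 0.01 / 2)
print(f"{z95:.6f}")  # 1.959964
print(f"{z99:.6f}")  # 2.575829
```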

I stay tuned for conclusions.

Regards from Spain.

Ajedrecista.
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Performances of engines in the endgame.

Post by Laskos »

Ajedrecista wrote:
With the data you provide, there are 6960 non-drawn games from the opening and 4638 non-drawn games from endgame positions.

Just looking at the output, it seems that you used EloSTAT. I think you should also try other well-known rating programmes like BayesElo and Ordo, just for comparison purposes.

If you eliminated draws, then the error bars grow relative to the error bars with draws included, if EloSTAT works the way I think it does (which is also my method). Anyway, the error bars seem too high for the usual 95% confidence; I did some calculations with Derive 6 and it seems that the parameter z of a normal distribution is around 2.58, if I did things correctly. Knowing that 99% confidence corresponds to the interval |z| < 2.575829303, I suppose that you used a 99% confidence interval (or a LOS of 99.5%). Am I right?

I think it would be interesting if you also added the ratings with draws included, to see their impact on the ratings. You would then have twelve different rating lists [(opening, endgame)·(EloSTAT, BayesElo, Ordo)·(with draws, without draws)]. Maybe that is a little difficult to handle.

I stay tuned for conclusions.

Regards from Spain.

Ajedrecista.
I think the error bars are correct for a 95% (2 SD) interval, keeping in mind that without draws the errors are ~700/sqrt(N), or larger for skewed results (say 75:25). I don't think EloStat does much harm here; I just want to see the magnitude of the effect, and the main point is visible to the naked eye.
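The ~700/sqrt(N) figure follows from the logistic Elo curve Elo(s) = 400·log10(s/(1-s)): for a draw-free match the score fraction s has binomial standard deviation sqrt(s(1-s)/N), and propagating it linearly through the curve gives a 2-SD error of about 695/sqrt(N) at a 50% score, growing for skewed results. A quick sketch:

```python
import math

def elo_2sd_error(score, n):
    """Approximate 2-SD Elo error bar for a draw-free match of n games
    with score fraction `score`, by linear error propagation through
    Elo(s) = 400*log10(s/(1-s))."""
    sd_score = math.sqrt(score * (1 - score) / n)         # binomial SD of s
    delo_ds = 400 / (math.log(10) * score * (1 - score))  # slope of Elo(s)
    return 2 * delo_ds * sd_score

n = 1000
print(round(elo_2sd_error(0.50, n), 1))  # 22.0  (~695/sqrt(N))
print(round(elo_2sd_error(0.75, n), 1))  # 25.4  (larger when skewed)
```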
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Performances of engines in the Endgame

Post by Laskos »

Evert wrote:What end-game positions did you use?

One of the things I like to do is add endgame knowledge to my program (even though it's not very strong and there are probably other things that are more important), but it's very hard to verify that it works, apart from testing against a bunch of standard problem sets from books. In regular matches the difference in performance due to one piece of endgame knowledge is small (even for rook endings, that's still only ~10% of all games that reach the endgame) and therefore hard to measure...
I have 100k neutral positions in an EPD file provided to me by Miguel Ballicora. Then I used Scid or epdutils to filter for the positions with fewer than 16 men, mostly without queens.
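The kind of filter described could be sketched as follows (a hypothetical illustration, not the actual Scid/epdutils invocation; it assumes standard EPD lines whose first field is the FEN piece placement):

```python
def is_late_endgame(epd_line, max_men=15, allow_queens=False):
    """Keep positions with at most `max_men` men and, optionally, no queens."""
    board = epd_line.split()[0]            # first EPD field: piece placement
    men = sum(c.isalpha() for c in board)  # every letter is one man
    if not allow_queens and ("q" in board or "Q" in board):
        return False
    return men <= max_men

# A rook endgame passes; the initial position does not.
print(is_late_endgame("8/5k2/8/8/3R4/8/5K2/8 w - -"))                           # True
print(is_late_endgame("rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq -"))  # False
```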
User avatar
Ajedrecista
Posts: 2226
Joined: Wed Jul 13, 2011 9:04 pm
Location: Madrid, Spain.

Re: Performances of engines in the endgame.

Post by Ajedrecista »

Hello Kai:
Ajedrecista wrote:Hello Don:
SORRY! I do not know what I was thinking to commit such an error.
Laskos wrote:
Ajedrecista wrote:
With the data you provide, there are 6960 non-drawn games from the opening and 4638 non-drawn games from endgame positions.

Just looking at the output, it seems that you used EloSTAT. I think you should also try other well-known rating programmes like BayesElo and Ordo, just for comparison purposes.

If you eliminated draws, then the error bars grow relative to the error bars with draws included, if EloSTAT works the way I think it does (which is also my method). Anyway, the error bars seem too high for the usual 95% confidence; I did some calculations with Derive 6 and it seems that the parameter z of a normal distribution is around 2.58, if I did things correctly. Knowing that 99% confidence corresponds to the interval |z| < 2.575829303, I suppose that you used a 99% confidence interval (or a LOS of 99.5%). Am I right?

I think it would be interesting if you also added the ratings with draws included, to see their impact on the ratings. You would then have twelve different rating lists [(opening, endgame)·(EloSTAT, BayesElo, Ordo)·(with draws, without draws)]. Maybe that is a little difficult to handle.

I stay tuned for conclusions.

Regards from Spain.

Ajedrecista.
I think the error bars are correct for a 95% (2 SD) interval, keeping in mind that without draws the errors are ~700/sqrt(N), or larger for skewed results (say 75:25). I don't think EloStat does much harm here; I just want to see the magnitude of the effect, and the main point is visible to the naked eye.
It is true that the error bars grow with unbalanced results (far from a 50% score). I only did the calculation with the following data:

Code: Select all

  Program                            Score       %     Elo    +   -

1 Houdini 3 Pro x64              : 2075.0/2699  76.9   3166   16  15
I should have taken more data, but I was in a hurry. Thanks for your tests. Should it be understood that Houdini 3 performs very well in the middlegame, and that this is why it cannot perform better in the endgame?

Regards from Spain.

Ajedrecista.
Jouni
Posts: 3883
Joined: Wed Mar 08, 2006 8:15 pm
Full name: Jouni Uski

Re: Performances of engines in the Endgame

Post by Jouni »

Can you please repeat with tablebases, to see which engine benefits most from them?
Jouni
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Performances of engines in the Endgame

Post by Laskos »

Jouni wrote:Can you please repeat with tablebases, to see which engine benefits most from them?
The tablebase effect is a pain to measure. Here is an example:

Code: Select all

    Program                            Score     %     Elo     +   -    Draws

  1 Houdini 3  Scorpio         : 24159.5/48300  50.0   3000    2   2   64.4 %
  2 Houdini 3                  : 24140.5/48300  50.0   3000    2   2   64.4 %
They amount to no more than 2 Elo points, maybe a bit more when playing from late endgame positions, but I am not going to try to measure the effect. Tablebases are mostly good for analysis.
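For scale, the 19-point score difference over 48300 games maps to a tiny Elo edge; converting through the usual logistic Elo model:

```python
import math

def elo_diff(score):
    """Elo difference implied by a score fraction under the logistic model."""
    return 400 * math.log10(score / (1 - score))

# Houdini 3 with Scorpio bases vs plain Houdini 3, from the table above.
score = 24159.5 / 48300
print(f"{elo_diff(score):+.2f} Elo")  # +0.14 Elo
```

Well inside the ±2 error bars, which is why the effect is so hard to measure.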
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Performances of engines in the endgame.

Post by Laskos »

Ajedrecista wrote:Hello Kai:
Ajedrecista wrote:Hello Don:
SORRY! I do not know what I was thinking to commit such an error.
Laskos wrote:
Ajedrecista wrote:
With the data you provide, there are 6960 non-drawn games from the opening and 4638 non-drawn games from endgame positions.

Just looking at the output, it seems that you used EloSTAT. I think you should also try other well-known rating programmes like BayesElo and Ordo, just for comparison purposes.

If you eliminated draws, then the error bars grow relative to the error bars with draws included, if EloSTAT works the way I think it does (which is also my method). Anyway, the error bars seem too high for the usual 95% confidence; I did some calculations with Derive 6 and it seems that the parameter z of a normal distribution is around 2.58, if I did things correctly. Knowing that 99% confidence corresponds to the interval |z| < 2.575829303, I suppose that you used a 99% confidence interval (or a LOS of 99.5%). Am I right?

I think it would be interesting if you also added the ratings with draws included, to see their impact on the ratings. You would then have twelve different rating lists [(opening, endgame)·(EloSTAT, BayesElo, Ordo)·(with draws, without draws)]. Maybe that is a little difficult to handle.

I stay tuned for conclusions.

Regards from Spain.

Ajedrecista.
I think the error bars are correct for a 95% (2 SD) interval, keeping in mind that without draws the errors are ~700/sqrt(N), or larger for skewed results (say 75:25). I don't think EloStat does much harm here; I just want to see the magnitude of the effect, and the main point is visible to the naked eye.
It is true that the error bars grow with unbalanced results (far from a 50% score). I only did the calculation with the following data:

Code: Select all

  Program                            Score       %     Elo    +   -

1 Houdini 3 Pro x64              : 2075.0/2699  76.9   3166   16  15
I should have taken more data, but I was in a hurry. Thanks for your tests. Should it be understood that Houdini 3 performs very well in the middlegame, and that this is why it cannot perform better in the endgame?

Regards from Spain.

Ajedrecista.
It seems that Houdini 3 is well above the competition in opening/middlegame, but closer to Stockfish and Komodo in endgames.
User avatar
Houdini
Posts: 1471
Joined: Tue Mar 16, 2010 12:00 am

Re: Performances of engines in the Endgame

Post by Houdini »

Laskos wrote:From openings:

Code: Select all

    Program                            Score       %     Elo    +   -

  1 Houdini 3 Pro x64              : 2075.0/2699  76.9   3166   16  15
  2 Stockfish 2.3.1 JA 64bit       : 1895.0/2784  68.1   3104   14  14
  3 Critter 1.6 64-bit             : 1281.0/2711  47.3   2985   13  13
  4 Komodo 5 64-bit                : 1158.0/2744  42.2   2961   13  13
  5 Deep Rybka 4.1 x64             :  551.0/2982  18.5   2802   16  16
Kai, I'm surprised by the result "from openings".
SF 2.3.1 appears to be significantly above Critter 1.6 in what are probably very fast games; that's not consistent with my experience.
Can you be more specific about the conditions, and can you also provide the full results including the draws?

Thanks,
Robert