Hello Don:
Laskos wrote: I put engines on two round-robins, first from opening positions, second from endgame positions. Then I eliminated the draws, as the endgame play from equal positions is drawish, and it's hard to compare it with the play from opening positions:
From openings:
Code: Select all
  Program                      Score         %    Elo    +    -
1 Houdini 3 Pro x64        : 2075.0/2699  76.9  3166   16   15
2 Stockfish 2.3.1 JA 64bit : 1895.0/2784  68.1  3104   14   14
3 Critter 1.6 64-bit       : 1281.0/2711  47.3  2985   13   13
4 Komodo 5 64-bit          : 1158.0/2744  42.2  2961   13   13
5 Deep Rybka 4.1 x64       :  551.0/2982  18.5  2802   16   16
From endgame positions:
Code: Select all
  Program                      Score         %    Elo    +    -
1 Stockfish 2.3.1 JA 64bit : 1339.0/1879  71.3  3124   17   17
2 Houdini 3 Pro x64        : 1101.0/1698  64.8  3085   17   17
3 Komodo 5 64-bit          :  914.0/1888  48.4  2994   16   16
4 Critter 1.6 64-bit       :  833.0/1847  45.1  2974   16   16
5 Deep Rybka 4.1 x64       :  451.0/1964  23.0  2837   18   18
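For reference, performance figures like the Elo column are usually derived from the score percentage via the logistic Elo formula, a rating gain of 400·log10(p/(1−p)) over the average opposition. A rough sketch of that relation (EloSTAT's actual iterative computation differs in its details):

```python
import math

def elo_gain(score_fraction):
    """Elo performance above the average opposition implied by a score fraction."""
    return 400.0 * math.log10(score_fraction / (1.0 - score_fraction))

# E.g. Houdini's 76.9% from the opening list implies roughly +209 Elo
# over its average opposition (absolute ratings depend on the pool).
print(round(elo_gain(0.769), 1))
```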
Comparing the two:
Houdini 3 underperforms in the endgame by 81 Elo points
Stockfish 2.3.1 overperforms in the endgame by 20 Elo points
Komodo 5 overperforms in the endgame by 33 Elo points
Critter 1.6 underperforms in the endgame by 11 Elo points
Rybka 4.1 overperforms in the endgame by 35 Elo points
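The gaps above follow directly from subtracting the two Elo columns; a quick sketch with the ratings copied from the tables:

```python
# Elo ratings copied from the opening and endgame tables above.
opening = {"Houdini 3": 3166, "Stockfish 2.3.1": 3104, "Critter 1.6": 2985,
           "Komodo 5": 2961, "Rybka 4.1": 2802}
endgame = {"Houdini 3": 3085, "Stockfish 2.3.1": 3124, "Critter 1.6": 2974,
           "Komodo 5": 2994, "Rybka 4.1": 2837}

for engine in opening:
    diff = endgame[engine] - opening[engine]
    verb = "overperforms" if diff > 0 else "underperforms"
    print(f"{engine} {verb} in the endgame by {abs(diff)} Elo points")
```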
With the data you provide, there are 6960 non-drawn games from the opening and 4638 non-drawn games from endgame positions.
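Those game counts can be checked from the score columns alone: with draws removed, every game awards exactly one point in total, so the summed scores equal the number of games, and the per-engine games column sums to twice that (each game is counted once for both players). Verifying with the numbers from the tables:

```python
# Scores and games-played taken from the two tables above.
opening_scores = [2075.0, 1895.0, 1281.0, 1158.0, 551.0]
opening_games  = [2699, 2784, 2711, 2744, 2982]
endgame_scores = [1339.0, 1101.0, 914.0, 833.0, 451.0]
endgame_games  = [1879, 1698, 1888, 1847, 1964]

# With no draws, total points handed out == number of games played,
# and each game appears twice in the per-engine games column.
assert sum(opening_scores) == sum(opening_games) / 2 == 6960
assert sum(endgame_scores) == sum(endgame_games) / 2 == 4638
print(sum(opening_scores), sum(endgame_scores))
```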
Just looking at the output, it seems that you used EloSTAT. I think you should also try other well-known rating programs, such as BayesElo and Ordo, for comparison purposes.
If you eliminated the draws, then the error bars grow with respect to the error bars computed with draws included, if EloSTAT works the way I think it does (which is also my method). Anyway, the error bars seem too high for the usual 95% confidence. I did some calculations with Derive 6, and it seems that the parameter z of the normal distribution is around 2.58, if I did things correctly. Knowing that 99% confidence is obtained in the interval |z| < 2.575829303, I suppose that you used a 99% confidence interval (or a LOS of 99.5%). Am I right?
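The z values for those confidence levels can be checked without Derive; a minimal Python sketch using the standard normal quantile function (the 2.575829303 figure is the two-sided 99% value):

```python
from statistics import NormalDist  # Python 3.8+

std_normal = NormalDist()  # mean 0, standard deviation 1

def z_for_confidence(c):
    # Two-sided interval: confidence c puts (1 + c)/2 of the mass below +z.
    return std_normal.inv_cdf((1.0 + c) / 2.0)

print(z_for_confidence(0.95))  # ~1.960, the usual 95% value
print(z_for_confidence(0.99))  # ~2.5758, matching |z| < 2.575829303
```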
I think it would be interesting if you also added the ratings with draws included, to see their impact on the ratings. You would then have twelve different rating lists [(opening, endgame)·(EloSTAT, BayesElo, Ordo)·(with draws, without draws)]. Maybe that is a little difficult to handle.
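The twelve lists are just the Cartesian product of the three choices; a trivial enumeration (labels as in the text):

```python
from itertools import product

phases = ["opening", "endgame"]
tools = ["EloSTAT", "BayesElo", "Ordo"]
draw_handling = ["with draws", "without draws"]

rating_lists = list(product(phases, tools, draw_handling))
print(len(rating_lists))  # 2 * 3 * 2 = 12 rating lists
for phase, tool, draws in rating_lists:
    print(f"{phase}, {tool}, {draws}")
```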
I will stay tuned for the conclusions.
Regards from Spain.
Ajedrecista.