Hello Don:
Laskos wrote: I put engines on two round-robins, first from opening positions, second from endgame positions. Then I eliminated the draws, as the endgame play from equal positions is drawish, and it's hard to compare it with the play from opening positions:
From openings:
Code: Select all
  Program                      Score         %    Elo    +    -
1 Houdini 3 Pro x64        : 2075.0/2699  76.9  3166   16   15
2 Stockfish 2.3.1 JA 64bit : 1895.0/2784  68.1  3104   14   14
3 Critter 1.6 64-bit       : 1281.0/2711  47.3  2985   13   13
4 Komodo 5 64-bit          : 1158.0/2744  42.2  2961   13   13
5 Deep Rybka 4.1 x64       :  551.0/2982  18.5  2802   16   16
From endgame positions:
Code: Select all
  Program                      Score         %    Elo    +    -
1 Stockfish 2.3.1 JA 64bit : 1339.0/1879  71.3  3124   17   17
2 Houdini 3 Pro x64        : 1101.0/1698  64.8  3085   17   17
3 Komodo 5 64-bit          :  914.0/1888  48.4  2994   16   16
4 Critter 1.6 64-bit       :  833.0/1847  45.1  2974   16   16
5 Deep Rybka 4.1 x64       :  451.0/1964  23.0  2837   18   18
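For reference, performance figures like the Elo column are usually derived from the score percentage via the logistic Elo formula, a rating gain of 400·log10(p/(1−p)) over the average opposition. A rough sketch of that relation (EloSTAT's actual iterative computation differs in its details):

```python
import math

def elo_gain(score_fraction):
    """Elo performance above the average opposition implied by a score fraction."""
    return 400.0 * math.log10(score_fraction / (1.0 - score_fraction))

# E.g. Houdini's 76.9% from the opening list implies roughly +209 Elo
# over its average opposition (absolute ratings depend on the pool).
print(round(elo_gain(0.769), 1))
```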
Comparing the two:
Houdini 3 underperforms in the endgame by 81 Elo points
Stockfish 2.3.1 overperforms in the endgame by 20 Elo points
Komodo 5 overperforms in the endgame by 33 Elo points
Critter 1.6 underperforms in the endgame by 11 Elo points
Rybka 4.1 overperforms in the endgame by 35 Elo points
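The gaps above follow directly from subtracting the two Elo columns; a quick sketch with the ratings copied from the tables:

```python
# Elo ratings copied from the opening and endgame tables above.
opening = {"Houdini 3": 3166, "Stockfish 2.3.1": 3104, "Critter 1.6": 2985,
           "Komodo 5": 2961, "Rybka 4.1": 2802}
endgame = {"Houdini 3": 3085, "Stockfish 2.3.1": 3124, "Critter 1.6": 2974,
           "Komodo 5": 2994, "Rybka 4.1": 2837}

for engine in opening:
    diff = endgame[engine] - opening[engine]
    verb = "overperforms" if diff > 0 else "underperforms"
    print(f"{engine} {verb} in the endgame by {abs(diff)} Elo points")
```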
With the data you provide, there are 6960 non-drawn games from the opening and 4638 non-drawn games from endgame positions.
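Those game counts can be checked from the score columns alone: with draws removed, every game awards exactly one point in total, so the summed scores equal the number of games, and the per-engine games column sums to twice that (each game is counted once for both players). Verifying with the numbers from the tables:

```python
# Scores and games-played taken from the two tables above.
opening_scores = [2075.0, 1895.0, 1281.0, 1158.0, 551.0]
opening_games  = [2699, 2784, 2711, 2744, 2982]
endgame_scores = [1339.0, 1101.0, 914.0, 833.0, 451.0]
endgame_games  = [1879, 1698, 1888, 1847, 1964]

# With no draws, total points handed out == number of games played,
# and each game appears twice in the per-engine games column.
assert sum(opening_scores) == sum(opening_games) / 2 == 6960
assert sum(endgame_scores) == sum(endgame_games) / 2 == 4638
print(sum(opening_scores), sum(endgame_scores))
```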
Just looking at the output, it seems that you used EloSTAT. I think you should also try other well-known rating programs, such as BayesElo and Ordo, for comparison purposes.
If you eliminated the draws, then the error bars grow with respect to the error bars computed with draws included, if EloSTAT works the way I think it does (which is also my method). Anyway, the error bars seem too high for the usual 95% confidence. I did some calculations with Derive 6, and it seems that the parameter z of the normal distribution is around 2.58, if I did things correctly. Knowing that 99% confidence is obtained in the interval |z| < 2.575829303, I suppose that you used a 99% confidence interval (or a LOS of 99.5%). Am I right?
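The z values for those confidence levels can be checked without Derive; a minimal Python sketch using the standard normal quantile function (the 2.575829303 figure is the two-sided 99% value):

```python
from statistics import NormalDist  # Python 3.8+

std_normal = NormalDist()  # mean 0, standard deviation 1

def z_for_confidence(c):
    # Two-sided interval: confidence c puts (1 + c)/2 of the mass below +z.
    return std_normal.inv_cdf((1.0 + c) / 2.0)

print(z_for_confidence(0.95))  # ~1.960, the usual 95% value
print(z_for_confidence(0.99))  # ~2.5758, matching |z| < 2.575829303
```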
I think it would be interesting if you also added the ratings with draws included, to see their impact on the ratings. You would then have twelve different rating lists [(opening, endgame)·(EloSTAT, BayesElo, Ordo)·(with draws, without draws)]. Maybe that is a little difficult to handle.
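The twelve lists are just the Cartesian product of the three choices; a trivial enumeration (labels as in the text):

```python
from itertools import product

phases = ["opening", "endgame"]
tools = ["EloSTAT", "BayesElo", "Ordo"]
draw_handling = ["with draws", "without draws"]

rating_lists = list(product(phases, tools, draw_handling))
print(len(rating_lists))  # 2 * 3 * 2 = 12 rating lists
for phase, tool, draws in rating_lists:
    print(f"{phase}, {tool}, {draws}")
```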
I will stay tuned for the conclusions.
Regards from Spain.
Ajedrecista.