Ratinglist based on positional openingpositions

Discussion of computer chess matches and engine tournaments.

Moderator: Ras

Yarget

Re: Ratinglist based on positional openingpositions

Post by Yarget »

Yesterday I presented the updated positional ratinglist which included Toga II 1.4beta5c 2 CPU:

Code:

    Program                          Elo    +   -   Games   Score   Av.Op.  Draws

  1 Rybka 2.3.2a mp 32-bit         : 2935   42  41   200    70.2 %   2786   31.5 %
  2 Toga II 1.4 beta5c             : 2847   39  39   200    57.5 %   2795   35.0 %
  3 Deep Shredder 11 UCI           : 2830   40  39   200    54.8 %   2797   33.5 %
  4 Deep Fritz 10                  : 2827   40  40   200    54.2 %   2797   31.5 %
  5 Zap!Chess Zanzibar             : 2800   39  39   200    50.0 %   2800   34.0 %
  6 LoopMP 11A.32                  : 2779   38  38   200    46.8 %   2802   37.5 %
  7 Deep Junior 10.1               : 2776   44  44   200    46.2 %   2802   18.5 %
  8 SpikeMP 1.2 Turin              : 2776   39  40   200    46.2 %   2802   33.5 %
  9 HIARCS 11.1 MP UCI             : 2773   39  39   200    45.8 %   2802   35.5 %
 10 Naum 2.2                       : 2765   37  38   200    44.5 %   2803   40.0 %
 11 Glaurung 2.0.1                 : 2693   41  41   200    33.8 %   2810   31.5 %
A good, strong performance by Toga (around 26 rating points more than expected, compared to the reference list CEGT 40/4). Today I finished the gambit games with Toga, and again this fine engine performed very well:

Code:

    Program                          Elo    +   -   Games   Score   Av.Op.  Draws

  1 Rybka 2.3.2a mp 32-bit         : 2944   44  43   200    71.2 %   2786   27.5 %
  2 Toga II 1.4 beta5c             : 2863   41  41   200    59.8 %   2794   29.5 %
  3 Deep Shredder 11 UCI           : 2841   42  41   200    56.5 %   2796   27.0 %
  4 Deep Fritz 10                  : 2833   43  43   200    55.2 %   2797   21.5 %
  5 HIARCS 11.1 MP UCI             : 2819   40  40   200    53.0 %   2798   31.0 %
  6 Naum 2.2                       : 2819   41  41   200    53.0 %   2798   28.0 %
  7 LoopMP 11A.32                  : 2805   41  41   200    50.7 %   2800   27.5 %
  8 Zap!Chess Zanzibar             : 2784   41  41   200    47.5 %   2802   28.0 %
  9 Glaurung 2.0.1                 : 2733   42  42   200    39.5 %   2807   26.0 %
 10 Deep Junior 10.1               : 2688   46  47   200    33.0 %   2811   17.0 %
 11 SpikeMP 1.2 Turin              : 2670   44  45   200    30.5 %   2813   24.0 %
This performance is even stronger than the one Toga made in the positional games. To be exact, it is 43.7 rating points better than expected (based on the reference list CEGT 40/4). Indeed a very impressive performance by the new Toga. Comparing the two rating lists also shows that Toga is "less sensitive" to the choice of openings (the difference is only 16 rating points).

The updated "all-round" rating list (the two lists combined) looks like this:

Code:

    Program                          Elo    +   -   Games   Score   Av.Op.  Draws

  1 Rybka 2.3.2a mp 32-bit         : 2939   30  30   400    70.8 %   2786   29.5 %
  2 Toga II 1.4 beta5c             : 2855   28  28   400    58.6 %   2794   32.2 %
  3 Deep Shredder 11 UCI           : 2835   29  29   400    55.6 %   2796   30.2 %
  4 Deep Fritz 10                  : 2830   29  29   400    54.8 %   2797   26.5 %
  5 HIARCS 11.1 MP UCI             : 2796   28  28   400    49.4 %   2800   33.2 %
  6 LoopMP 11A.32                  : 2792   28  28   400    48.8 %   2800   32.5 %
  7 Naum 2.2                       : 2792   28  28   400    48.8 %   2800   34.0 %
  8 Zap!Chess Zanzibar             : 2792   28  28   400    48.8 %   2800   31.0 %
  9 Deep Junior 10.1               : 2733   31  32   400    39.6 %   2806   17.8 %
 10 SpikeMP 1.2 Turin              : 2725   29  29   400    38.4 %   2807   28.8 %
 11 Glaurung 2.0.1                 : 2713   29  30   400    36.6 %   2808   28.8 %
I will now resume the test of Bright 0.2c.

Regards
Per
Yarget

Re: Ratinglist based on positional openingpositions

Post by Yarget »

I have now finished the gambit games for Bright 0.2c. Quite frankly, I had some doubts about Bright before starting the test games. Would Bright be able to compete and cope with this strong field of engines? Fortunately I was wrong: Bright 0.2c put in a fine performance and managed to beat Junior and Spike:

Code:

10 bright-0.2c               : 2715  220 (+ 56,= 51,-113), 37.0 %

Rybka 2.3.2a mp 32-bit        :  20 (+  0,=  5,- 15), 12.5 %
Deep Shredder 11 UCI          :  20 (+  4,=  2,- 14), 25.0 %
Deep Junior 10.1              :  20 (+ 11,=  3,-  6), 62.5 %
Deep Fritz 10                 :  20 (+  8,=  2,- 10), 45.0 %
HIARCS 11.1 MP UCI            :  20 (+  5,=  2,- 13), 30.0 %
Glaurung 2.0.1                :  20 (+  5,=  3,- 12), 32.5 %
LoopMP 11A.32                 :  20 (+  4,=  7,-  9), 37.5 %
SpikeMP 1.2 Turin             :  20 (+  8,=  6,-  6), 55.0 %
Naum 2.2                      :  20 (+  3,= 11,-  6), 42.5 %
Zap!Chess Zanzibar            :  20 (+  5,=  4,- 11), 35.0 %
Toga II 1.4 beta5c            :  20 (+  3,=  6,- 11), 30.0 %
Here follows the updated gambit rating list, in which Bright is ranked 10th:

Code:

     Program                          Elo    +   -   Games   Score   Av.Op.  Draws

  1 Rybka 2.3.2a mp 32-bit         : 2956   42  42   220    72.7 %   2786   27.3 %
  2 Toga II 1.4 beta5c             : 2869   39  39   220    60.7 %   2794   29.5 %
  3 Deep Shredder 11 UCI           : 2852   40  40   220    58.2 %   2795   25.5 %
  4 Deep Fritz 10                  : 2833   41  41   220    55.2 %   2797   20.5 %
  5 HIARCS 11.1 MP UCI             : 2829   39  39   220    54.5 %   2797   29.1 %
  6 Naum 2.2                       : 2822   39  38   220    53.4 %   2798   30.5 %
  7 LoopMP 11A.32                  : 2811   39  39   220    51.8 %   2799   28.2 %
  8 Zap!Chess Zanzibar             : 2794   39  39   220    49.1 %   2800   27.3 %
  9 Glaurung 2.0.1                 : 2749   40  40   220    42.0 %   2805   25.0 %
 10 bright-0.2c                    : 2715   41  42   220    37.0 %   2808   23.2 %
 11 Deep Junior 10.1               : 2690   44  44   220    33.4 %   2810   16.8 %
 12 SpikeMP 1.2 Turin              : 2679   42  42   220    31.8 %   2811   24.5 %
The good performance by Bright is also confirmed by the CEGT 40/4 rating list that I'm using as my reference list. Perhaps it's worth explaining how I compare my lists with the CEGT 40/4 rating list. It goes like this:

I compare Bright (as I do with every new engine entering the list) with all engines in my rating list, starting with Rybka 2.3.2a mp. The reference list (CEGT) shows a rating difference between Rybka and Bright of 213 points (2981 - 2768); in my list the difference is 241 points, meaning that Bright performed 28 points worse than expected. Compared with Toga it is -16, with Shredder +26, and so on. In the end you add up all the +'s and -'s, getting +301 and -59, which means that Bright has 242 rating points "too much" compared to the reference list. This number must be divided by the number of opponents: 242/11 = 22.0 rating points. In other words: Bright 0.2c has delivered a performance 22 rating points higher than expected. Indeed a fine performance.
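The arithmetic above can be sketched in a few lines; `gap_delta` is a hypothetical helper name, and the numbers are the ones quoted in the post.

```python
def gap_delta(ref_new, ref_opp, my_new, my_opp):
    """Over-performance of the new engine against one opponent:
    (gap in the reference list) minus (gap in the tested list).
    Positive = the new engine did better than the reference predicts."""
    return (ref_opp - ref_new) - (my_opp - my_new)

# Bright vs Rybka, with the ratings quoted in the post:
# CEGT gap 2981 - 2768 = 213, tested-list gap 2956 - 2715 = 241.
print(gap_delta(2768, 2981, 2715, 2956))  # -28

# Summing such deltas over all 11 opponents gives +301 and -59 in the post;
# dividing the net by the number of opponents gives the overall figure.
print((301 - 59) / 11)  # 22.0
```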

The positional games are running now, and currently Bright's score is around 35%. I don't think Bright will be able to reach 37% as in the gambit games, but the performance might be good enough to stay ahead of Glaurung, which is last in the positional rating list.

Regards
Per
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Ratinglist based on positional openingpositions

Post by Laskos »

Yarget wrote:
Perhaps it's worth explaining how I compare my lists with the CEGT 40/4 rating list. It goes like this:

I compare Bright (as I do with every new engine entering the list) with all engines in my rating list, starting with Rybka 2.3.2a mp. The reference list (CEGT) shows a rating difference between Rybka and Bright of 213 points (2981 - 2768); in my list the difference is 241 points, meaning that Bright performed 28 points worse than expected. Compared with Toga it is -16, with Shredder +26, and so on. In the end you add up all the +'s and -'s, getting +301 and -59, which means that Bright has 242 rating points "too much" compared to the reference list. This number must be divided by the number of opponents: 242/11 = 22.0 rating points. In other words: Bright 0.2c has delivered a performance 22 rating points higher than expected. Indeed a fine performance.

Regards
Per

Ah, ok, this eliminates the calibration errors.

Kai
Yarget

Re: Ratinglist based on positional openingpositions

Post by Yarget »

I have now finished testing Bright 0.2c. Two days ago I presented Bright's gambit result, which was a good one (22 rating points higher than expected; see the details in my earlier post above). As expected, Bright wasn't quite able to repeat this performance in the positional test. Here are the individual results:

Code:

12 bright-0.2c               : 2701  220 (+ 44,= 66,-110), 35.0 %

Rybka 2.3.2a mp 32-bit        :  20 (+  3,=  3,- 14), 22.5 %
Deep Shredder 11 UCI          :  20 (+  5,=  5,- 10), 37.5 %
Deep Junior 10.1              :  20 (+  5,=  4,- 11), 35.0 %
HIARCS 11.1 MP UCI            :  20 (+  5,=  6,-  9), 40.0 %
Deep Fritz 10                 :  20 (+  5,=  5,- 10), 37.5 %
LoopMP 11A.32                 :  20 (+  4,=  6,- 10), 35.0 %
SpikeMP 1.2 Turin             :  20 (+  2,=  8,- 10), 30.0 %
Naum 2.2                      :  20 (+  3,= 10,-  7), 40.0 %
Glaurung 2.0.1                :  20 (+  6,=  4,- 10), 40.0 %
Zap!Chess Zanzibar            :  20 (+  3,=  9,-  8), 37.5 %
Toga II 1.4 beta5c            :  20 (+  3,=  6,- 11), 30.0 %
The overall score for Bright was 35.0% (in the gambit games it was 37.0%). With Bright's test games completed, the updated positional rating list looks like this:

Code:

    Program                          Elo    +   -   Games   Score   Av.Op.  Draws

  1 Rybka 2.3.2a mp 32-bit         : 2942   41  40   220    70.9 %   2787   30.0 %
  2 Toga II 1.4 beta5c             : 2855   38  37   220    58.6 %   2795   34.5 %
  3 Deep Shredder 11 UCI           : 2835   38  38   220    55.5 %   2796   32.7 %
  4 Deep Fritz 10                  : 2832   38  38   220    55.0 %   2797   30.9 %
  5 Zap!Chess Zanzibar             : 2807   37  37   220    51.1 %   2799   35.0 %
  6 LoopMP 11A.32                  : 2790   37  37   220    48.4 %   2801   36.8 %
  7 SpikeMP 1.2 Turin              : 2790   37  37   220    48.4 %   2801   34.1 %
  8 Deep Junior 10.1               : 2787   42  42   220    48.0 %   2801   18.6 %
  9 HIARCS 11.1 MP UCI             : 2781   37  37   220    47.0 %   2801   35.0 %
 10 Naum 2.2                       : 2774   35  36   220    45.9 %   2802   40.9 %
 11 Glaurung 2.0.1                 : 2709   39  39   220    36.1 %   2808   30.5 %
 12 bright-0.2c                    : 2701   39  40   220    35.0 %   2809   30.0 %
Although Bright performed a bit stronger in the gambit games, it should be mentioned that the positional performance was acceptable too. Compared to the reference list (CEGT 40/4), Bright's rating is 6.33 rating points higher than expected. All in all, a fine performance by Bright 0.2c.

I have also updated the "sensitivity" list, which gives, for each engine, the rating difference between the positional and the gambit rating lists. The higher the number, the more "sensitive" the engine is to the type of test games selected, and vice versa. Starting with the most "sensitive" engines, the list looks like this:

1. SpikeMP 1.2 Turin: 111 rating points
2. Deep Junior 10.1: 97 rating points
3-4. Hiarcs 11.1 MP & Naum 2.2: 48 rating points each
5. Glaurung 2.0.1: 40 rating points
6. LoopMP 11A.32: 21 rating points
7. Deep Shredder 11: 17 rating points
8-10. Toga II 1.4 beta5c, Rybka 2.3.2a mp & Bright 0.2c: 14 rating points each
11. Zap!Chess Zanzibar: 13 rating points
12. Deep Fritz 10: 1 rating point

In other words: for Fritz it doesn't matter at all whether it plays the gambit games or the positional games, while engines like Junior and Spike are very "sensitive".
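The sensitivity list is just the per-engine difference between the two 220-game lists, sorted in descending order. A minimal sketch, using a subset of the ratings quoted above:

```python
# Ratings taken from the gambit and positional lists above (three engines only).
gambit = {"SpikeMP 1.2 Turin": 2679, "Deep Junior 10.1": 2690, "Deep Fritz 10": 2833}
positional = {"SpikeMP 1.2 Turin": 2790, "Deep Junior 10.1": 2787, "Deep Fritz 10": 2832}

# Absolute rating gap per engine, most "sensitive" first.
sensitivity = sorted(
    ((name, abs(positional[name] - gambit[name])) for name in gambit),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, diff in sensitivity:
    print(f"{name}: {diff} rating points")
# SpikeMP 1.2 Turin: 111 rating points
# Deep Junior 10.1: 97 rating points
# Deep Fritz 10: 1 rating points
```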

Finally, let me bring you the latest "all-round" rating list, which is the positional and the gambit rating lists combined:

Code:

    Program                          Elo    +   -   Games   Score   Av.Op.  Draws

  1 Rybka 2.3.2a mp 32-bit         : 2948   29  29   440    71.8 %   2786   28.6 %
  2 Toga II 1.4 beta5c             : 2862   27  27   440    59.7 %   2794   32.0 %
  3 Deep Shredder 11 UCI           : 2843   28  28   440    56.8 %   2796   29.1 %
  4 Deep Fritz 10                  : 2832   28  28   440    55.1 %   2797   25.7 %
  5 HIARCS 11.1 MP UCI             : 2805   27  27   440    50.8 %   2799   32.0 %
  6 LoopMP 11A.32                  : 2800   27  27   440    50.1 %   2799   32.5 %
  7 Zap!Chess Zanzibar             : 2800   27  27   440    50.1 %   2799   31.1 %
  8 Naum 2.2                       : 2797   26  26   440    49.7 %   2800   35.7 %
  9 Deep Junior 10.1               : 2739   30  30   440    40.7 %   2805   17.7 %
 10 SpikeMP 1.2 Turin              : 2736   28  28   440    40.1 %   2805   29.3 %
 11 Glaurung 2.0.1                 : 2729   28  28   440    39.1 %   2806   27.7 %
 12 bright-0.2c                    : 2708   28  29   440    36.0 %   2808   26.6 %
I believe that the next engine to be tested will be Naum 3.

Regards
Per
Marek Soszynski
Posts: 586
Joined: Wed May 10, 2006 7:28 pm
Location: Birmingham, England

Re: Ratinglist based on positional openingpositions

Post by Marek Soszynski »

I suggest that instead of calling engines insensitive/sensitive we call them balanced/unbalanced. Fritz then would be a relatively balanced engine.

That still leaves the terminological problem of the gambit test disfavouring what are thought to be tactical engines. For example, a quick look at the stats would suggest that Junior is a poor gambiteer. Obviously a rethink is required...

Getting engines to play with a positional book ought to be thought of as a test of their dynamism in quiet positions; whereas getting engines to play with a gambit book is a test of their soundness in wild positions.

In other words, the Gambit Rating could be renamed the Soundness Rating, while the Positional Rating could be renamed the Dynamism Rating. An engine that is further down the Soundness Rating List is probably playing too wildly; an engine that is further down the Dynamism Rating List is probably playing too quietly. An engine that is equally strong in both lists would be a balanced engine.

Finally, an engine that has a relatively low draw-score could be called an ambitious engine.
Marek Soszynski
Yarget

Re: Ratinglist based on positional openingpositions

Post by Yarget »

Thanks for this and your earlier comments, Marek. True, many names could be used for the rating lists I'm running, and you should know that I reflect a lot on this issue myself. How come a "solid positional" engine like Naum 2.2 does much better in the gambit games than in the positional games, while the opposite is true for Junior? And what is the matter with Spike in the gambit games?

As far as I know, I'm the first to run tests like this, so everything is new. For the time being I will stick to my original names (perhaps with the exception of balanced/unbalanced engines), but I may make some changes later, once more experience and knowledge has been gained. We'll see. One thing is for sure: it's very interesting running these tests.

Regards
Per
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Ratinglist based on positional openingpositions

Post by Laskos »

Marek Soszynski wrote: Obviously a rethink is required...
Not so obvious. We only had some "impressions" that some engines are more positional than others. No real test of that.

Getting engines to play with a positional book ought to be thought of as a test of their dynamism in quiet positions; whereas getting engines to play with a gambit book is a test of their soundness in wild positions.
This is more dialectics than facts. What we really see is that Junior and Spike are faring better in positional openings, and Rybka and Naum in gambit openings.

Kai
Marek Soszynski
Posts: 586
Joined: Wed May 10, 2006 7:28 pm
Location: Birmingham, England

Re: Ratinglist based on positional openingpositions

Post by Marek Soszynski »

Laskos wrote:
Marek Soszynski wrote: Obviously a rethink is required...
Not so obvious. We only had some "impressions" that some engines are more positional than others. No real test of that.

Getting engines to play with a positional book ought to be thought of as a test of their dynamism in quiet positions; whereas getting engines to play with a gambit book is a test of their soundness in wild positions.
This is more dialectics than facts. What we really see is that Junior and Spike are faring better in positional openings, and Rybka and Naum in gambit openings.

Kai
Impressions? Dialectics?

I tried to suggest a solution to the problem (it has confused more than one poster) of a gambiteer engine having a relatively low gambit rating.

If a hot room gets a low thermometer reading, then possibly the thermometer is broken, or perhaps we should investigate whether it is really a properly functioning barometer or some other instrument instead. Nothing is learnt from saying that "what we really see" is that the thermometer's red line, or whatever it is, is low.
Marek Soszynski
Uri Blass
Posts: 10892
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Ratinglist based on positional openingpositions

Post by Uri Blass »

A summary of your tables:
Gambit games

Code:

    Program                          Elo    +   -   Games   Score   Av.Op.  Draws

  1 Rybka 2.3.2a mp 32-bit         : 2956   42  42   220    72.7 %   2786   27.3 %
  2 Toga II 1.4 beta5c             : 2869   39  39   220    60.7 %   2794   29.5 % 
  3 Deep Shredder 11 UCI           : 2852   40  40   220    58.2 %   2795   25.5 % 
  4 Deep Fritz 10                  : 2833   41  41   220    55.2 %   2797   20.5 % 
  5 HIARCS 11.1 MP UCI             : 2829   39  39   220    54.5 %   2797   29.1 % 
  6 Naum 2.2                       : 2822   39  38   220    53.4 %   2798   30.5 % 
  7 LoopMP 11A.32                  : 2811   39  39   220    51.8 %   2799   28.2 % 
  8 Zap!Chess Zanzibar             : 2794   39  39   220    49.1 %   2800   27.3 % 
  9 Glaurung 2.0.1                 : 2749   40  40   220    42.0 %   2805   25.0 % 
 10 bright-0.2c                    : 2715   41  42   220    37.0 %   2808   23.2 % 
 11 Deep Junior 10.1               : 2690   44  44   220    33.4 %   2810   16.8 % 
 12 SpikeMP 1.2 Turin              : 2679   42  42   220    31.8 %   2811   24.5 % 

Positional games

Code:

    Program                          Elo    +   -   Games   Score   Av.Op.  Draws

  1 Rybka 2.3.2a mp 32-bit         : 2942   41  40   220    70.9 %   2787   30.0 %
  2 Toga II 1.4 beta5c             : 2855   38  37   220    58.6 %   2795   34.5 %
  3 Deep Shredder 11 UCI           : 2835   38  38   220    55.5 %   2796   32.7 %
  4 Deep Fritz 10                  : 2832   38  38   220    55.0 %   2797   30.9 %
  5 Zap!Chess Zanzibar             : 2807   37  37   220    51.1 %   2799   35.0 %
  6 LoopMP 11A.32                  : 2790   37  37   220    48.4 %   2801   36.8 %
  7 SpikeMP 1.2 Turin              : 2790   37  37   220    48.4 %   2801   34.1 %
  8 Deep Junior 10.1               : 2787   42  42   220    48.0 %   2801   18.6 %
  9 HIARCS 11.1 MP UCI             : 2781   37  37   220    47.0 %   2801   35.0 %
 10 Naum 2.2                       : 2774   35  36   220    45.9 %   2802   40.9 %
 11 Glaurung 2.0.1                 : 2709   39  39   220    36.1 %   2808   30.5 %
 12 bright-0.2c                    : 2701   39  40   220    35.0 %   2809   30.0 %
All games combined

Code:

    Program                          Elo    +   -   Games   Score   Av.Op.  Draws

  1 Rybka 2.3.2a mp 32-bit         : 2948   29  29   440    71.8 %   2786   28.6 %
  2 Toga II 1.4 beta5c             : 2862   27  27   440    59.7 %   2794   32.0 %
  3 Deep Shredder 11 UCI           : 2843   28  28   440    56.8 %   2796   29.1 %
  4 Deep Fritz 10                  : 2832   28  28   440    55.1 %   2797   25.7 %
  5 HIARCS 11.1 MP UCI             : 2805   27  27   440    50.8 %   2799   32.0 %
  6 LoopMP 11A.32                  : 2800   27  27   440    50.1 %   2799   32.5 %
  7 Zap!Chess Zanzibar             : 2800   27  27   440    50.1 %   2799   31.1 %
  8 Naum 2.2                       : 2797   26  26   440    49.7 %   2800   35.7 %
  9 Deep Junior 10.1               : 2739   30  30   440    40.7 %   2805   17.7 %
 10 SpikeMP 1.2 Turin              : 2736   28  28   440    40.1 %   2805   29.3 %
 11 Glaurung 2.0.1                 : 2729   28  28   440    39.1 %   2806   27.7 %
 12 bright-0.2c                    : 2708   28  29   440    36.0 %   2808   26.6 %
There is no significant difference for most engines.

It is not clear whether the advantage of Hiarcs, Naum and Glaurung in the gambit games is significant.
Spike and Junior have a significant advantage in the positional games.

Note that I think the gambit games should help the stronger player, because there are fewer draws.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Ratinglist based on positional openingpositions

Post by Laskos »

Uri Blass wrote:
It is not clear if Hiarcs and Naum and Glaurung's advantage in the gambit games is significant.

40 Elo points is about one standard deviation here, so one can infer with roughly 80% confidence that Hiarcs, Naum and Glaurung prefer the gambit games. For Junior and Spike we have something like 97-99% confidence that they prefer the positional games.

Kai
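Kai's confidence estimates can be reproduced with a normal approximation. This is only a sketch, assuming (as he does) that the rating gap between the two lists has a standard deviation of about 40 Elo; the exact percentages differ slightly from his rounded figures.

```python
from math import erf, sqrt

def confidence(elo_gap, sigma=40.0):
    """One-sided confidence that a positive rating gap is real, modelling
    the gap as normally distributed with standard deviation sigma."""
    return 0.5 * (1.0 + erf(elo_gap / (sigma * sqrt(2.0))))

# Gaps between the gambit and positional lists, from the sensitivity list:
print(f"{confidence(48):.1%}")   # HIARCS / Naum gap of 48 -> 88.5%
print(f"{confidence(97):.1%}")   # Deep Junior gap of 97  -> 99.2%
print(f"{confidence(111):.1%}")  # Spike gap of 111       -> 99.7%
```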