Why the errorbar is wrong ... simple example!

Frank Quisinsky
Posts: 4844
Joined: Wed Nov 18, 2009 6:16 pm
Location: Trier, Germany
Contact:

Why the errorbar is wrong ... simple example!

Post by Frank Quisinsky » Tue Feb 23, 2016 12:19 am

Hi there,

at the moment Fizbo 1.6 x64 is still playing against 59 opponents.

Here are the results after round 25 against two groups of 20 opponents each (Fizbo's 20 strongest and 20 weakest results):

   # Player                            :    Elo  Games  Score  Draw  Error  OppAvg   OppE   OppD
  14 Fizbo 1.6 x64 strong              : 2908.9    499   56.5  44.9   24.0  2860.2   11.5   20.0
  28 Fizbo 1.6 x64 weak                : 2785.9    499   53.3  45.3   22.4  2763.9   11.1   20.0

Error = 22.4 or 24.0
OK, taking the larger one:
24.0 x 2 = 48 Elo

Result = 123 Elo difference (2908.9 - 2785.9)
123 - 48 = 75 Elo beyond the error bar after 500 games vs. 20 opponents. This is not new to me ... on average the gap is 55 Elo with other opponents (20 opponents with 50 games).
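
The arithmetic above can be checked with a few lines of Python. This is only a sketch: the four numbers are copied from the two rating-list rows, and the variable names are made up for illustration. Besides the "2 x 24.0" allowance used in the post, it also shows two other common ways of combining two error bars (adding them, or combining them in quadrature, which assumes the two errors are independent and roughly normal). Which of these is the right yardstick depends on how the Error column is defined, which the post does not say.

```python
import math

# Figures quoted from the two rating-list rows above.
elo_strong, err_strong = 2908.9, 24.0
elo_weak, err_weak = 2785.9, 22.4

gap = elo_strong - elo_weak                      # rating difference between the two runs
doubled = 2 * max(err_strong, err_weak)          # the post's allowance: 2 x 24.0
summed = err_strong + err_weak                   # both bars placed end to end
quadrature = math.hypot(err_strong, err_weak)    # assumes independent, roughly normal errors

print(f"gap               = {gap:.1f} Elo")
print(f"gap - 2 x 24.0    = {gap - doubled:.1f} Elo")
print(f"gap - summed bars = {gap - summed:.1f} Elo")
print(f"quadrature bar    = {quadrature:.1f} Elo")
```

Whichever combination is used, the gap between the two runs stays well outside it, which is the point the post is making.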

I made some calculations and found that with 50 games per pairing and 26 opponents the error bar is correct. With more opponents the error bar is smaller, and with fewer opponents it is bigger.

Different opponents, different results.
And with a rating list with fewer opponents ... reading tea leaves is more interesting.

That is the weak point of our rating calculation programs: one factor ... the number of opponents ... is missing. We look at the number of games only, and that is absolutely wrong.

Here are the Fizbo results (stark = strong, schwach = weak):

14) Fizbo 1.6 x64 stark           2908.9 :    499 (+170,=224,-105),  56.5 %

    vs.                                  :  games (   +,   =,   -),   (%) :    Diff,    SD, CFS (%)
    Komodo 9.3 x64                       :     25 (   0,   7,  18),  14.0 :  -276.6,  15.0,    0.0
    Stockfish 7 KP BMI2 x64              :     25 (   0,   9,  16),  18.0 :  -267.3,  14.4,    0.0
    Houdini 4 STD B x64                  :     25 (   2,  11,  12),  30.0 :  -189.9,  14.3,    0.0
    Fire 4 x64                           :     25 (   0,  12,  13),  24.0 :  -145.7,  14.0,    0.0
    Equinox 3.30 x64                     :     25 (   2,  14,   9),  36.0 :   -96.9,  14.1,    0.0
    Nirvanachess 2.2 POP x64             :     25 (   4,  13,   8),  42.0 :   -34.0,  13.6,    0.6
    Texel 1.05 x64                       :     24 (   4,  16,   4),  50.0 :    +4.1,  13.9,   61.6
    Naum 4.6 x64                         :     25 (   4,  17,   4),  50.0 :   +18.9,  13.4,   92.1
    Hakkapeliitta 3.0 x64                :     25 (  10,  11,   4),  62.0 :   +72.9,  13.5,  100.0
    Shredder 12 x64                      :     25 (   7,  17,   1),  62.0 :  +110.6,  13.4,  100.0
    Junior 13.3.00 x64                   :     25 (   9,  12,   4),  60.0 :  +111.9,  13.5,  100.0
    DiscoCheck 5.2.1 x64                 :     25 (   8,  16,   1),  64.0 :  +131.7,  13.3,  100.0
    Booot 5.2.0 x64                      :     25 (  16,   7,   2),  78.0 :  +136.8,  13.6,  100.0
    Deuterium 14.3.34.130 POP x64        :     25 (   9,  15,   1),  66.0 :  +148.3,  13.3,  100.0
    Doch 1.3.4 JA x64                    :     25 (  14,  10,   1),  76.0 :  +162.9,  13.7,  100.0
    MinkoChess 1.3 JA POP x64            :     25 (  17,   7,   1),  82.0 :  +184.9,  13.3,  100.0
    Murka 3 x64                          :     25 (  14,   9,   2),  74.0 :  +201.2,  13.6,  100.0
    Nemo 1.01 Beta POP x64               :     25 (  13,  11,   1),  74.0 :  +201.2,  13.7,  100.0
    Scorpio 2.77 JA POP x64              :     25 (  18,   5,   2),  82.0 :  +233.5,  14.1,  100.0
    The Baron 3.29 x64                   :     25 (  19,   5,   1),  86.0 :  +264.3,  13.8,  100.0

28) Fizbo 1.6 x64 schwach         2785.9 :    499 (+153,=226,-120),  53.3 %

    vs.                                  :  games (   +,   =,   -),   (%) :    Diff,    SD, CFS (%)
    GullChess 3.0 BMI2 x64               :     25 (   0,   7,  18),  14.0 :  -264.3,  13.5,    0.0
    Critter 1.6a x64                     :     25 (   1,  12,  12),  28.0 :  -209.3,  13.5,    0.0
    iCE 3.0 v658 POP x64                 :     25 (   1,  15,   9),  34.0 :  -145.4,  13.0,    0.0
    Sting SF 6 x64                       :     25 (   2,   7,  16),  22.0 :  -139.3,  12.7,    0.0
    Cheng 4.39 x64                       :     25 (   7,  12,   6),  52.0 :   -16.8,  12.5,    8.9
    Quazar 0.4 x64                       :     25 (   6,  11,   8),  46.0 :   +13.7,  12.6,   86.2
    Alfil 15.04 C# Beta 24 x64           :     25 (   5,  13,   7),  46.0 :   +29.7,  12.5,   99.1
    Spark 1.0 x64                        :     25 (   7,  13,   5),  54.0 :   +36.1,  12.5,   99.8
    Crafty 25.0 DC x64                   :     25 (   8,  13,   4),  58.0 :   +49.1,  13.1,  100.0
    TogaII 280513 Intel w32              :     25 (  10,   9,   6),  58.0 :   +51.8,  12.7,  100.0
    Atlas 3.80 x64                       :     25 (  10,   9,   6),  58.0 :   +54.1,  12.9,  100.0
    Gaviota 1.0 AVX x64                  :     25 (   8,  14,   3),  60.0 :   +59.6,  12.4,  100.0
    Dirty 03NOV2015 POP x64              :     24 (   8,  12,   4),  58.3 :   +63.6,  12.8,  100.0
    Bobcat 7.1 x64                       :     25 (  10,  12,   3),  64.0 :   +72.3,  13.0,  100.0
    EXchess 7.71b x64                    :     25 (  10,  12,   3),  64.0 :   +74.3,  13.2,  100.0
    GNU Chess5 5.60 x64                  :     25 (  11,  11,   3),  66.0 :  +108.2,  12.7,  100.0
    Glaurung 2.2 JA x64                  :     25 (  13,   9,   3),  70.0 :  +126.1,  12.7,  100.0
    Rhetoric 1.4.3 POP x64               :     25 (  12,  11,   2),  70.0 :  +135.0,  12.9,  100.0
    BugChess2 1.9 POP x64                :     25 (  10,  15,   0),  70.0 :  +167.4,  13.1,  100.0
    Frenzee 3.5.19 x64                   :     25 (  14,   9,   2),  74.0 :  +176.4,  13.0,  100.0

We can carry on as we have for years.
The most interesting result is always the one we like.
Hard, but a fact.

Of course the right rating is in the middle ... around 2840 Elo. I wrote more about it in German in the CSS Forum.

Best
Frank


PS: Logical ... with 19 opponents the difference is more than 123 Elo ... with more than 20 opponents it is less than 123 Elo.
I like computer chess!

michiguel
Posts: 6386
Joined: Thu Mar 09, 2006 7:30 pm
Location: Chicago, Illinois, USA
Contact:

Re: Why the errorbar is wrong ... simple example!

Post by michiguel » Tue Feb 23, 2016 1:12 am

Hi Frank,

How did you select the first list and the second list?

Miguel

Re: Why the errorbar is wrong ... simple example!

Post by Frank Quisinsky » Tue Feb 23, 2016 1:38 am

Hi Miguel,

easy ...
I look at the round-robin results in the Shredder Classic GUI.
http://www.amateurschach.de/ftptrigger/ ... 6-x64.html

I pick out by hand the best 20 and the weakest 20 results (looking at perf.).

I create two databases and add the games behind the best and the weakest results.

Then I add the two databases to the other games and get this ...

   # Player                           :    Elo  Games  Score  Draw  Error  OppAvg   OppE   OppD
   1 Komodo 9.3 x64                   : 3185.5   2025   84.7  26.8   15.6  2856.8   11.2   40.8
   2 Stockfish 7 KP BMI2 x64          : 3176.3   2025   84.0  30.0   15.1  2858.2   11.2   40.8
   3 Houdini 4 STD B x64              : 3098.9   2075   77.9  31.7   13.9  2854.6   11.2   41.8
   4 Fire 4 x64                       : 3054.7   1775   71.6  39.9   13.9  2876.3   11.3   35.8
   5 GullChess 3.0 BMI2 x64           : 3050.2   1775   71.3  41.9   13.8  2874.7   11.3   35.8
   6 Equinox 3.30 x64                 : 3005.9   1775   66.1  45.2   13.1  2877.7   11.4   35.8
   7 Fritz 15 x64                     : 2996.2   2000   66.8  43.5   12.5  2862.0   11.1   40.0
   8 Critter 1.6a x64                 : 2995.2   1775   65.0  45.2   13.1  2876.4   11.3   35.8
   9 Protector 1.9.0 x64              : 2954.1   1950   61.6  46.3   12.1  2864.8   11.1   39.0
  10 Nirvanachess 2.2 POP x64         : 2943.0   1975   60.1  46.1   12.1  2865.6   11.3   39.8
  11 iCE 3.0 v658 POP x64             : 2931.3   2025   59.1  44.0   11.7  2862.7   11.3   40.8
  12 Sting SF 6 x64                   : 2925.2   2025   58.3  44.9   11.7  2862.8   11.3   40.8
  13 Andscacs 0.83 POP x64            : 2924.7   1950   57.9  44.6   12.1  2865.5   11.1   39.0
  14 Fizbo 1.6 x64 stark              : 2908.9    499   56.5  44.9   24.0  2860.2   11.5   20.0
  15 Hannibal 1.5 x64                 : 2906.5   2100   57.6  45.0   11.7  2848.8   11.2   42.0
  16 Chiron 2.0 x64                   : 2906.3   2400   60.0  40.3   11.0  2829.5   11.2   48.0
  17 Texel 1.05 x64                   : 2904.9   2074   56.9  42.5   11.9  2853.0   11.3   41.8
  18 Naum 4.6 x64                     : 2890.0   2575   58.9  44.5   10.4  2822.7   11.4   51.8
  19 SmarThink 1.80 AVX x64           : 2852.0   2000   48.8  41.0   11.7  2865.6   11.1   40.0
  20 Senpai 1.0 SSE42 x64             : 2840.7   2800   53.9  43.2   10.0  2814.3   11.2   56.0
  21 Hakkapeliitta 3.0 x64            : 2836.0   2625   51.9  38.5   10.2  2825.6   11.3   52.8
  22 Hiarcs 14 WCSC w32               : 2829.2   2850   52.9  41.3   10.4  2810.9   11.2   57.0
  23 Sjeng c't 2010 w32               : 2811.5   2850   50.7  40.9    9.6  2811.2   11.2   57.0
  24 Cheng 4.39 x64                   : 2802.7   2875   49.5  41.0    9.7  2811.6   11.3   57.8
  25 Shredder 12 x64                  : 2798.3   2875   48.8  43.5    9.8  2812.8   11.3   57.8
  26 Junior 13.3.00 x64               : 2797.0   2925   48.9  40.8    9.8  2810.6   11.3   58.8
  27 Vajolet2 2.0 POP x64             : 2793.0   2900   48.5  40.9    9.6  2809.8   11.2   58.0
  28 Fizbo 1.6 x64 schwach            : 2785.9    499   53.3  45.3   22.4  2763.9   11.1   20.0
  29 Spike 1.4 Leiden w32             : 2784.2   2900   47.4  42.6    9.6  2810.0   11.2   58.0
  30 DiscoCheck 5.2.1 x64             : 2777.2   2925   46.4  39.5    9.6  2810.9   11.3   58.8
  31 Quazar 0.4 x64                   : 2772.2   2875   46.1  42.7    9.4  2808.6   11.3   57.8
  32 Booot 5.2.0 x64                  : 2772.1   2475   48.1  40.0   10.4  2792.5   11.1   49.8
  33 Deuterium 14.3.34.130 POP x64    : 2760.6   2925   44.3  43.6    9.7  2811.2   11.3   58.8
  34 Alfil 15.04 C# Beta 24 x64       : 2756.2   2575   46.7  34.6   10.0  2787.9   11.1   51.8
  35 Zappa Mexico II x64              : 2751.1   2850   43.4  42.8    9.7  2809.1   11.2   57.0
  36 Spark 1.0 x64                    : 2749.8   2875   43.3  42.7    9.4  2809.0   11.3   57.8
  37 Thinker 5.4d Inert x64           : 2746.3   2900   42.6  41.5    9.8  2810.6   11.2   58.0
  38 Doch 1.3.4 JA x64                : 2746.0   2275   48.4  46.0   10.8  2761.5   10.9   45.8
  39 Crafty 25.0 DC x64               : 2736.8   2325   46.9  41.9   10.7  2763.2   10.8   46.8
  40 TogaII 280513 Intel w32          : 2734.1   2425   45.4  40.3   10.5  2774.9   11.0   48.8
  41 Atlas 3.80 x64                   : 2731.8   2725   42.4  40.1   10.0  2796.5   11.1   54.8
  42 Tornado 5.0 SSE4 x64             : 2727.0   2300   45.5  38.9   10.8  2764.0   10.7   46.0
  43 Gaviota 1.0 AVX x64              : 2726.3   2925   40.1  37.9    9.6  2810.7   11.3   58.8
  44 MinkoChess 1.3 JA POP x64        : 2724.1   2475   43.5  43.4   10.4  2777.9   11.0   49.8
  45 Arasan 18.1 POP x64              : 2723.6   2500   43.0  39.6   10.9  2784.1   10.9   50.0
  46 Dirty 03NOV2015 POP x64          : 2722.3   2074   47.9  43.2   10.9  2737.8   10.6   41.8
  47 Bobcat 7.1 x64                   : 2713.6   2075   46.7  44.4   11.1  2738.1   10.6   41.8
  48 EXchess 7.71b x64                : 2711.6   2125   44.1  41.1   11.5  2761.6   11.0   42.8
  49 Rodent 1.7 Build 1 POP x64       : 2707.9   2050   45.7  43.9   11.3  2739.4   10.5   41.0
  50 Murka 3 x64                      : 2707.8   2125   44.6  46.4   11.2  2748.4   10.7   42.8
  51 Nemo 1.01 Beta POP x64           : 2707.7   2375   42.3  42.1   10.4  2768.4   10.9   47.8
  52 Pedone 1.2 BMI2 x64              : 2705.2   2450   41.0  44.0   10.2  2778.9   10.9   49.0
  53 DisasterArea 1.54 x64            : 2682.7   1950   42.3  45.3   11.5  2740.3   10.6   39.0
  54 GNU Chess5 5.60 x64              : 2677.7   2125   41.2  41.1   11.1  2742.9   10.6   42.8
  55 Scorpio 2.77 JA POP x64          : 2675.5   2225   39.9  40.4   11.1  2751.6   10.7   44.8
  56 Glaurung 2.2 JA x64              : 2659.8   2175   38.3  42.9   11.1  2747.1   10.7   43.8
  57 Rhetoric 1.4.3 POP x64           : 2650.9   2075   38.0  41.3   11.6  2739.6   10.6   41.8
  58 The Baron 3.29 x64               : 2644.6   2225   35.8  38.0   11.2  2752.3   10.7   44.8
  59 Octochess r5190 SSE4 x64         : 2637.8   2100   36.0  41.1   11.4  2743.3   10.5   42.0
  60 BugChess2 1.9 POP x64            : 2618.5   1975   34.5  37.5   12.2  2734.2   10.6   39.8
  61 Frenzee 3.5.19 x64               : 2609.5   1925   33.7  35.0   12.4  2731.6   10.6   38.8

White advantage = 36.00 +/- 1.02
Draw rate (equal opponents) = 47.68 % +/- 0.21


I am discussing this topic with Jesus in another thread. I would like to have a GUI for such simulations.

With 59 opponents (60 engines) = 1,770 possible pairings.
It should be easy ... we need a text file with the perf results for each of these 1,770 pairings. The GUI can then look in the text file ... and if I ask for it, give me a rating list with the best and weakest Gaviota results against 20, 21, 22 opponents (whatever I wish). With such a large database I can simulate each rating list, and with such a GUI / tool we can simulate the correct error bar.

In other words, we can make the Elo calculation much stronger.
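
A rough sketch of what such a tool could compute, in Python. Everything here is an assumption made for illustration: the opponent pool, the engine rating of 2850, the fixed draw rate and the helper names are hypothetical, and the engine is simulated so that it follows the Elo curve exactly against every opponent. The sketch ranks the 59 per-opponent results by performance, then compares the performance rating of the best k with that of the weakest k results, as described in the reply to Miguel. Under these assumptions any gap comes from the selection step alone; the sketch says nothing about how much of the 123 Elo in this thread has that origin.

```python
import math
import random

def expected_score(diff):
    """Elo expectation for a rating difference `diff` (own minus opponent)."""
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

def performance_rating(results):
    """Rating at which the Elo curve reproduces the total score (bisection).

    `results` is a list of (opponent_rating, points, games) tuples.
    """
    total = sum(points for _, points, _ in results)
    lo, hi = 0.0, 6000.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if sum(g * expected_score(mid - opp) for opp, _, g in results) < total:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def simulate_match(rng, own, opp, games, draw_rate=0.45):
    """Hypothetical game model: fixed draw rate, expectation on the Elo curve."""
    e = expected_score(own - opp)
    d = min(draw_rate, 2.0 * min(e, 1.0 - e))   # keep all probabilities valid
    p_win = e - d / 2.0
    points = 0.0
    for _ in range(games):
        u = rng.random()
        points += 1.0 if u < p_win else (0.5 if u < p_win + d else 0.0)
    return points

def single_perf(result):
    """Performance against one opponent, with 0 % / 100 % scores clamped."""
    opp, points, games = result
    s = min(max(points / games, 0.01), 0.99)
    return opp + 400.0 * math.log10(s / (1.0 - s))

rng = random.Random(2016)                          # fixed seed, repeatable run
own = 2850.0                                       # hypothetical engine rating
opponents = [2650.0 + 7.0 * i for i in range(59)]  # hypothetical pool of 59
results = [(opp, simulate_match(rng, own, opp, 25), 25) for opp in opponents]
ranked = sorted(results, key=single_perf)          # weakest result first

print("pairings among 60 engines:", math.comb(60, 2))
gaps = {}
for k in (19, 20, 22, 26):
    gaps[k] = performance_rating(ranked[-k:]) - performance_rating(ranked[:k])
    print(f"k={k}: best-k minus weakest-k = {gaps[k]:.1f} Elo")
```

Replacing the simulated `results` with the real per-opponent scores from a games database would give the lists asked for above, for any engine and any k.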

Best
Frank

hgm
Posts: 23162
Joined: Fri Mar 10, 2006 9:06 am
Location: Amsterdam
Full name: H G Muller
Contact:

Re: Why the errorbar is wrong ... simple example!

Post by hgm » Tue Feb 23, 2016 7:03 am

The error bars reported by Elo-calculating programs only represent the statistical error. The Elo is defined as the performance against a variety of opponents of all strengths, stronger as well as weaker, in equal numbers. By testing only against strong or weak opponents, you introduce a systematic error. Basically, what you measure is no longer the Elo rating, and no matter how accurately you measure it by using many games, there is no reason to expect it to converge to the Elo rating.

There is nothing the rating calculator can do about this. It depends on the qualities of the program. Some players are good at defending and weak at attacking, and they would achieve good results against slightly better opponents, but disappointing results against weaker opponents. For players that are good attackers it might be just the other way around. By only looking at games against stronger opponents there is no way to predict how the player will perform against weaker opponents; this can be different for each player. If you don't play the games, you will never know. You should never trust any ratings obtained by one-sided testing.
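
The difference between statistical and systematic error can be sketched in Python. The engine below is purely hypothetical: it is assumed to score 3 percentage points above the Elo curve against stronger opponents and 3 points below it against weaker ones (the "good defender" case). The scores are exact expectations, so this corresponds to an unlimited number of games with no statistical error left. The ratings and the `bias` parameter are invented for illustration.

```python
def expected_score(diff):
    """Elo expectation for a rating difference `diff` (own minus opponent)."""
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

def true_score(own, opp, bias=0.03):
    """Hypothetical engine: above the Elo curve vs. stronger, below it vs. weaker."""
    base = expected_score(own - opp)
    if opp > own:
        return base + bias
    if opp < own:
        return base - bias
    return base

def performance_rating(opponents, scores):
    """Rating at which the Elo curve reproduces the summed scores (bisection)."""
    target = sum(scores)
    lo, hi = 0.0, 5000.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if sum(expected_score(mid - opp) for opp in opponents) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

own = 2850.0
strong = [2900.0, 2950.0, 3000.0, 3050.0]
weak = [2650.0, 2700.0, 2750.0, 2800.0]
balanced = strong + weak

perf_strong = performance_rating(strong, [true_score(own, o) for o in strong])
perf_weak = performance_rating(weak, [true_score(own, o) for o in weak])
perf_balanced = performance_rating(balanced, [true_score(own, o) for o in balanced])

print(f"strong opponents only: {perf_strong:.1f}")
print(f"weak opponents only:   {perf_weak:.1f}")
print(f"balanced pool:         {perf_balanced:.1f}")
```

In this toy model the one-sided pools land above and below 2850 even without any random noise, while the balanced pool recovers 2850 only because the assumed bias is symmetric. Playing more games would not move the one-sided numbers closer together.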

Re: Why the errorbar is wrong ... simple example!

Post by Frank Quisinsky » Tue Feb 23, 2016 8:48 am

The number of opponents is missing from the Elo calculation.
As long as the number of opponents is missing, the error is a nonsense output and totally wrong.

What you wrote is nice, good old school, but I can't do anything with it.

But here is what I like about your answer:

Fizbo has a big weak point (the endgame). In most games the final result takes shape in the transition to the endgame. In the example I gave, the difference is 75 Elo beyond the error ... normally, with 20 opponents, it is 55 Elo on average. For programs with a bigger weak point it is a bit more than 55 Elo. In other calculations I saw 80-85 Elo beyond the error.

The error is always the same, no matter how many opponents there are.

That is wrong, because the opponents themselves are important. With more opponents we cannot build an example like the one in my first post.

With fewer opponents the error is higher, with more opponents the error is smaller.

People look at the error.
People see these two different results with the same error.

Everything we find in a rating list is pure chance if we do not learn that the number of games alone means nothing. The combination of the number of games and the number of opponents will give us a correct error.

That is why I wrote ... I can do nothing with the error information from an Elo calculation, because it is pure chance if the number of opponents is not taken into account.

Re: Why the errorbar is wrong ... simple example!

Post by Frank Quisinsky » Tue Feb 23, 2016 9:53 am

Another example:

A programmer gives us a new version with the information that it is 30 Elo stronger. What can I do with that ... 30 Elo stronger?!

Vs. itself in self-play?
Vs. 10 opponents ... ?
Vs. 20 opponents ... ?
Vs. 30 opponents ... ?

Once I know that, the next question could be ... how many games do you have against x opponents?

Rating List A:
This version is 40 Elo stronger (40 games vs. 20 opponents).

Rating List B:
This version is 60 Elo stronger (100 games vs. 15 opponents).

Rating List C:
This new version is 20 Elo stronger (a different number of games vs. 10 opponents).

It is pure chance if there are not enough opponents in all of the results.
And we compare the results. We can't compare the results if we do not have many opponents for it.

The error is correct if I look at the number of games only. But nobody should be interested in the error if a completely different group of engines produces a completely different final Elo result.

What we need is a better calculation for comparing different results. Most people look at the error ... how secure the results are ... and that is nonsense (at the moment).

I am sure it is only a question of time until we have a calculation program that takes the number of opponents into account ... once we have it ... we can compare and can look at the error.

Best
Frank

Have a look at all the Stockfish / Komodo threads: which program is stronger, and everything people do ... comparing the results from different rating lists to find out more.

It's so easy to fix that.

Ozymandias
Posts: 1033
Joined: Sun Oct 25, 2009 12:30 am

Re: Why the errorbar is wrong ... simple example!

Post by Ozymandias » Tue Feb 23, 2016 11:11 am

Frank Quisinsky wrote:Quantity of opp. missed in Elo calculation.
So long quantity of opponents missed, so long is error a nonsense output and totally wrong.
Finally someone else realized there's a problem there. I wouldn't say it's nonsense, but it's mostly worthless and a waste of time. That's why I never perform the simulations under Ordo.
A couple of clarifications are in order:
  * The number of opponents also depends on the variety. You use original engines in your rating list; others include clones, derivatives and different versions. Less variety, more opponents needed.
  * A wide choice of openings is also recommended. This isn't a problem for most rating lists, which run thousands of games, but those short tests you see posted on forums are highly suspect.

Re: Why the errorbar is wrong ... simple example!

Post by Frank Quisinsky » Tue Feb 23, 2016 11:31 am

Hi Juan,

yes, different engine styles are important. If I add too many engines with the same style, the whole work is ruined when there are few opponents.

And yes ... less variety, more opponents needed.

I have wished for so many years that the number of opponents would find a place in the Elo calculation.

Error lovers can live with the information we have, but it would be nice to have a second piece of information ... the combination of the number of opponents and the number of games.

I like it very much that the programmer of Ordo has added the new Opp features. Furthermore, I can deactivate the Error information with the new parameters Ordo has. So I do not have to give my visitors information nobody needs. Be sure, I will do that with one of the next updates on my site.

Best
Frank

Honestly ...
With few opponents the Error information is no more interesting than what I am eating today (for the readers of my site).

bob
Posts: 20362
Joined: Mon Feb 27, 2006 6:30 pm
Location: Birmingham, AL

Re: Why the errorbar is wrong ... simple example!

Post by bob » Wed Feb 24, 2016 12:29 am

The number of opponents has absolutely nothing to do with Elo or error bar. Error bar is based on the size of the sample taken. Larger sample = smaller error bar. No exceptions (see central limit theorem in statistics).

Elo is also not an absolute number. It only works within a given sample of opponents and results. The more programs play each other in different combinations of groups, the better the Elo will predict expected results between any two of the opponents. If the two groups have no common opponents, the two ratings are meaningless when compared.

Might seem counter-intuitive, but it is not just a statistical guess, it is a statistical fact.
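
A minimal sketch of an error bar that depends on sample size only, in Python. It uses the textbook normal approximation on the score fraction and treats the whole opponent pool as a single opponent of fixed, exactly known rating. That is an assumption made for illustration and not how Ordo or any other rating program computes its Error column (those also account for the uncertainty in the opponents' ratings), so the numbers will not match the tables in this thread exactly. The function names are hypothetical; the win/draw/loss counts are the ones posted for "Fizbo 1.6 x64 stark".

```python
import math

def elo_from_score(s):
    """Rating difference implied by a score fraction s (0 < s < 1)."""
    return -400.0 * math.log10(1.0 / s - 1.0)

def score_and_se(wins, draws, losses):
    """Score fraction and its standard error from W/D/L counts."""
    n = wins + draws + losses
    s = (wins + 0.5 * draws) / n
    # Per-game variance of the result (1, 0.5 or 0) around the mean score.
    var = (wins * (1.0 - s) ** 2 + draws * (0.5 - s) ** 2 + losses * s ** 2) / n
    return s, math.sqrt(var / n)

def elo_interval(wins, draws, losses, z=1.96):
    """Lower bound, point estimate and upper bound of the Elo difference."""
    s, se = score_and_se(wins, draws, losses)
    return elo_from_score(s - z * se), elo_from_score(s), elo_from_score(s + z * se)

# "Fizbo 1.6 x64 stark": +170 =224 -105 in 499 games.
low, mid, high = elo_interval(170, 224, 105)
print(f"Elo difference vs. pool: {mid:+.1f} ({low:+.1f} .. {high:+.1f})")

# Four times the games with the same proportions: the standard error halves.
s, se = score_and_se(170, 224, 105)
s4, se4 = score_and_se(680, 896, 420)
print(f"standard error of the score: {se:.4f} -> {se4:.4f}")
```

Nothing in this calculation knows how many different opponents the games were played against, which is the property being argued about in this thread.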

Re: Why the errorbar is wrong ... simple example!

Post by Frank Quisinsky » Wed Feb 24, 2016 4:39 am

Hi Bob,

thanks for your message!

I know that, Bob. I have read the book by Karl-Heinz Glenz (to give one example) several times.

The error bar can't be of practical use for chess. It should have no place here. The statistical standard is useless. What can I do with it if, for every test run, I can construct examples like the one in my first post?

75 Elo beyond the error: this case is not isolated.

---

Giving information about the tolerance is nice.

Table after Prof. Elo, two examples:

Games   SD (points)   Rating   Probable error   Error in rating   Confidence
    5          1.12    126.0             0.76              85.0           57
  100          5.00     28.3             3.37              19.1         99.9
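
Most of the quoted figures can be reproduced approximately with two square-root formulas. The sketch below is a reading of the numbers, not a statement of how the original table was built: it assumes a standard deviation of 0.5 points per game, a per-game rating standard deviation of 200 x sqrt(2) (about 282.8 Elo), and a probable error of 0.6745 standard deviations. The function names are made up. For 5 games this gives about 126.5 and 85.3 where the table says 126.0 and 85.0, and the confidence column is not reproduced at all.

```python
import math

PROBABLE_ERROR = 0.6745  # half of a normal distribution lies within +/- 0.6745 sd

def sd_points(games, sd_per_game=0.5):
    """Standard deviation of the total score, assuming 0.5 points per game."""
    return sd_per_game * math.sqrt(games)

def sd_rating(games, sd_per_game=200.0 * math.sqrt(2.0)):
    """Standard deviation of the rating estimate, assuming about 282.8 Elo per game."""
    return sd_per_game / math.sqrt(games)

for n in (5, 100):
    sp, sr = sd_points(n), sd_rating(n)
    print(f"{n:>4} games: sd points {sp:.2f}, probable error {PROBABLE_ERROR * sp:.2f}, "
          f"sd rating {sr:.1f}, error in rating {PROBABLE_ERROR * sr:.1f}")
```

Both formulas take only the number of games as input, so a table that also varies the number of opponents would need a different model or a simulation on real game data.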

I believe this is where we can look for the problem. I would like to have such a table for games in combination with opponents.

100 games, 5 opponents
100 games, 10 opponents
100 games, 20 opponents
100 games, 30 opponents

Again, with a large database we can simulate it. All Prof. Elo did was collect game material. The problem he had was that he did not have enough of it between the same players.

But we have it; we can produce it in computer chess. Computer chess can help in trying to build a better rating system. Maybe a rating system for computer chess only.

I say ... tolerance information is great!
I say ... we can try to produce better information than the error bar.

Displaying "error" in a rating list of chess programs is the same as displaying a picture of my grandmother in the Elo calculation. And believe me, chess players would have much more fun seeing how my grandmother looked in the past. That is much more interesting.

We have to search for and produce such tolerance information from the combination of the number of games and the number of opponents.

Best
Frank
Frank