Why the errorbar is wrong ... simple example!

Discussion of anything and everything relating to chess playing software and machines.

Moderators: hgm, Rebel, chrisw

Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Why the errorbar is wrong ... simple example!

Post by Laskos »

Frank Quisinsky wrote:Hi there,

at the moment Fizbo 1.6 x64 is still running vs. 59 opponents.

Results after round 25 vs. 20 strongest / weakest opponents!

Code:

  14 Fizbo 1.6 x64 strong              : 2908.9    499   56.5  44.9   24.0  2860.2   11.5   20.0
  28 Fizbo 1.6 x64 weak                : 2785.9    499   53.3  45.3   22.4  2763.9   11.1   20.0
Error = 22.4 or 24.0
OK: 24.0 x 2 = 48 Elo

Result = 123 Elo difference.
123 - 48 = 75 Elo beyond the error bar after 500 games vs. 20 opponents. Not new for me ... the average is 55 Elo with other opponents (20 opponents with 50 games each).

I made some calculations and found that with 50 games per pairing and 26 opponents the error bar is correct. With more opponents the error bar is smaller, and with fewer opponents it is bigger.

Different opponents, different results.
And with a rating list based on fewer opponents ... reading tea leaves is more interesting.

That is the weak point in our rating calculation programs: the factor ... number of opponents ... is missing. We look at the number of games only, and that is absolutely wrong.

Here are the Fizbo results:

Code:

14) Fizbo 1.6 x64 strong          2908.9 :    499 (+170,=224,-105),  56.5 %

    vs.                                  :  games (   +,   =,   -),   (%) :    Diff,    SD, CFS (%)
    Komodo 9.3 x64                       :     25 (   0,   7,  18),  14.0 :  -276.6,  15.0,    0.0
    Stockfish 7 KP BMI2 x64              :     25 (   0,   9,  16),  18.0 :  -267.3,  14.4,    0.0
    Houdini 4 STD B x64                  :     25 (   2,  11,  12),  30.0 :  -189.9,  14.3,    0.0
    Fire 4 x64                           :     25 (   0,  12,  13),  24.0 :  -145.7,  14.0,    0.0
    Equinox 3.30 x64                     :     25 (   2,  14,   9),  36.0 :   -96.9,  14.1,    0.0
    Nirvanachess 2.2 POP x64             :     25 (   4,  13,   8),  42.0 :   -34.0,  13.6,    0.6
    Texel 1.05 x64                       :     24 (   4,  16,   4),  50.0 :    +4.1,  13.9,   61.6
    Naum 4.6 x64                         :     25 (   4,  17,   4),  50.0 :   +18.9,  13.4,   92.1
    Hakkapeliitta 3.0 x64                :     25 (  10,  11,   4),  62.0 :   +72.9,  13.5,  100.0
    Shredder 12 x64                      :     25 (   7,  17,   1),  62.0 :  +110.6,  13.4,  100.0
    Junior 13.3.00 x64                   :     25 (   9,  12,   4),  60.0 :  +111.9,  13.5,  100.0
    DiscoCheck 5.2.1 x64                 :     25 (   8,  16,   1),  64.0 :  +131.7,  13.3,  100.0
    Booot 5.2.0 x64                      :     25 (  16,   7,   2),  78.0 :  +136.8,  13.6,  100.0
    Deuterium 14.3.34.130 POP x64        :     25 (   9,  15,   1),  66.0 :  +148.3,  13.3,  100.0
    Doch 1.3.4 JA x64                    :     25 (  14,  10,   1),  76.0 :  +162.9,  13.7,  100.0
    MinkoChess 1.3 JA POP x64            :     25 (  17,   7,   1),  82.0 :  +184.9,  13.3,  100.0
    Murka 3 x64                          :     25 (  14,   9,   2),  74.0 :  +201.2,  13.6,  100.0
    Nemo 1.01 Beta POP x64               :     25 (  13,  11,   1),  74.0 :  +201.2,  13.7,  100.0
    Scorpio 2.77 JA POP x64              :     25 (  18,   5,   2),  82.0 :  +233.5,  14.1,  100.0
    The Baron 3.29 x64                   :     25 (  19,   5,   1),  86.0 :  +264.3,  13.8,  100.0

Code:

28) Fizbo 1.6 x64 weak            2785.9 :    499 (+153,=226,-120),  53.3 %

    vs.                                  :  games (   +,   =,   -),   (%) :    Diff,    SD, CFS (%)
    GullChess 3.0 BMI2 x64               :     25 (   0,   7,  18),  14.0 :  -264.3,  13.5,    0.0
    Critter 1.6a x64                     :     25 (   1,  12,  12),  28.0 :  -209.3,  13.5,    0.0
    iCE 3.0 v658 POP x64                 :     25 (   1,  15,   9),  34.0 :  -145.4,  13.0,    0.0
    Sting SF 6 x64                       :     25 (   2,   7,  16),  22.0 :  -139.3,  12.7,    0.0
    Cheng 4.39 x64                       :     25 (   7,  12,   6),  52.0 :   -16.8,  12.5,    8.9
    Quazar 0.4 x64                       :     25 (   6,  11,   8),  46.0 :   +13.7,  12.6,   86.2
    Alfil 15.04 C# Beta 24 x64           :     25 (   5,  13,   7),  46.0 :   +29.7,  12.5,   99.1
    Spark 1.0 x64                        :     25 (   7,  13,   5),  54.0 :   +36.1,  12.5,   99.8
    Crafty 25.0 DC x64                   :     25 (   8,  13,   4),  58.0 :   +49.1,  13.1,  100.0
    TogaII 280513 Intel w32              :     25 (  10,   9,   6),  58.0 :   +51.8,  12.7,  100.0
    Atlas 3.80 x64                       :     25 (  10,   9,   6),  58.0 :   +54.1,  12.9,  100.0
    Gaviota 1.0 AVX x64                  :     25 (   8,  14,   3),  60.0 :   +59.6,  12.4,  100.0
    Dirty 03NOV2015 POP x64              :     24 (   8,  12,   4),  58.3 :   +63.6,  12.8,  100.0
    Bobcat 7.1 x64                       :     25 (  10,  12,   3),  64.0 :   +72.3,  13.0,  100.0
    EXchess 7.71b x64                    :     25 (  10,  12,   3),  64.0 :   +74.3,  13.2,  100.0
    GNU Chess5 5.60 x64                  :     25 (  11,  11,   3),  66.0 :  +108.2,  12.7,  100.0
    Glaurung 2.2 JA x64                  :     25 (  13,   9,   3),  70.0 :  +126.1,  12.7,  100.0
    Rhetoric 1.4.3 POP x64               :     25 (  12,  11,   2),  70.0 :  +135.0,  12.9,  100.0
    BugChess2 1.9 POP x64                :     25 (  10,  15,   0),  70.0 :  +167.4,  13.1,  100.0
    Frenzee 3.5.19 x64                   :     25 (  14,   9,   2),  74.0 :  +176.4,  13.0,  100.0
We can do what we have done ... for years.
The most interesting result is the one we like.
Harsh, but a fact.

Of course the right rating is in the middle ... around 2840 Elo. I wrote more about it, in German, in the CSS Forum.

Best
Frank


PS: Logically ... with 19 opponents, more than 123 Elo ... with more than 20 opponents, less than 123 Elo.
I didn't follow this thread; if I understood you, this is a selection of performances of Fizbo against 40 engines, right? Nothing wrong with the result. Each mini-match (25 games) has 2 SD of about 110 ELO points. Assuming a normal distribution of performances in these mini-matches, if I divide the Gaussian into two equal left-hand and right-hand parts, the average distance between them is about 1.7 SD, or ~100 ELO points. You get 123 or so, but things of this nature happen with 40 engines as opponents, especially since you selected Fizbo for illustration, which shows a somewhat exaggerated distortion. Nothing is wrong, and don't separate the performances like this; you are just looking at magnified noise.

Maybe it would be instructive to do the same with only 2 engines: split a 1000-game match between them into 40 mini-matches of 25 games each, then compare the average performance in the top 20 mini-matches with that of the bottom 20 mini-matches; the difference should be close to 100 ELO points, similar to what you see here. If you do mini-matches of 6 games each (167 or however many of them), you will see a difference of 200 ELO points between the top and bottom halves.
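Laskos's suggested experiment is easy to simulate. A minimal sketch in Python (my own illustration, not data from the thread; it assumes two exactly equal engines and a ~30% draw rate): split a long match into 40 mini-matches of 25 games, convert each mini-match score to an ELO performance, and compare the top and bottom halves.

```python
import math
import random

random.seed(42)

def elo(score):
    """Convert a score fraction to an Elo performance difference."""
    s = min(max(score, 0.01), 0.99)   # clamp to avoid +/- infinity at 0% or 100%
    return -400 * math.log10(1 / s - 1)

def play_game():
    """One game between two equal engines (assumed 35% win / 30% draw / 35% loss)."""
    r = random.random()
    return 1.0 if r < 0.35 else (0.5 if r < 0.65 else 0.0)

def split_gap(n_matches=40, games=25, trials=500):
    """Average Elo gap between the top half and the bottom half of the
    sorted mini-match performances, over many simulated experiments."""
    total = 0.0
    for _ in range(trials):
        perfs = sorted(
            elo(sum(play_game() for _ in range(games)) / games)
            for _ in range(n_matches)
        )
        half = n_matches // 2
        total += (sum(perfs[half:]) - sum(perfs[:half])) / half
    return total / trials

print(round(split_gap()))   # a gap of roughly 90-100 Elo, from identical engines
```

Sorting and splitting the mini-matches manufactures a gap of roughly 100 ELO points even though the two engines are identical by construction, which is the point of the argument.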

Maybe I didn't follow your question, though.
Frank Quisinsky
Posts: 6808
Joined: Wed Nov 18, 2009 7:16 pm
Location: Gutweiler, Germany
Full name: Frank Quisinsky

Re: Why the errorbar is wrong ... simple example!

Post by Frank Quisinsky »

Hi Kai,

vs. 59 opponents.
I selected the best 20 and the weakest 20 opponents after 25 of 50 games.

The question for me is how many opponents we need for a more stable rating.

Have a look in the thread.

I think with more opponents, fewer games are necessary for a more stable rating.

With a strong database I can make some experiments.

Soon I will have a database with the TOP-60, each engine vs. each other engine with 50 games = 2,950 games per engine in the group of 60 engines (1,770 pairings). Then different experiments on Elo stability are possible.

Interesting what you wrote ...

Different experiments are possible once the database is stronger. In around 70 days I will have the database.

Maybe we can delay the discussion ...

Best
Frank
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Why the errorbar is wrong ... simple example!

Post by Laskos »

Frank Quisinsky wrote:Hi Kai,

vs. 59 opponents.
I selected the best 20 and the weakest 20 opponents after 25 of 50 games.
Ah, so the effect is even exaggerated: you leave out the middle 19 engines? Then the gap should be more like 2.4 SD, and with 2 SD for 25 games being about 110 ELO points, that is roughly 130 ELO points. You get 123, which is perfectly normal.


The question for me is how many opponents we need for a more stable rating.
The simplest is to treat all engines equally, so it doesn't matter. I don't know how to improve on that.
Have a look in the thread.

I think with more opponents, fewer games are necessary for a more stable rating.

With a strong database I can make some experiments.

Soon I will have a database with the TOP-60, each engine vs. each other engine with 50 games = 2,950 games per engine in the group of 60 engines (1,770 pairings). Then different experiments on Elo stability are possible.

Interesting what you wrote ...

Different experiments are possible once the database is stronger. In around 70 days I will have the database.

Maybe we can delay the discussion ...

Best
Frank
EDIT: for illustration I just played Komodo 9.3 against Stockfish 7: 120 games total, in 20 mini-matches of 6 games each, with the same Stockfish (Stockfish 7) in every mini-match, only labeled differently. Then I left aside the middle 6 "Stockfishes" out of 20, as you did with 19 out of 59:

Code:

    Program                            Score     %    Av.Op.  Elo    +   -    Draws

    Komodo                         :  44.0/120  36.7     46    -49   49  50   38.3 %

=======================================================

  1 Stockfish 17                   :   5.0/  6  83.3    -51    228  320 457    0.0 %
  2 Stockfish 11                   :   4.5/  6  75.0    -51    139  321 156   50.0 %
  3 Stockfish 20                   :   4.5/  6  75.0    -51    139  321 156   50.0 %
  4 Stockfish 13                   :   4.5/  6  75.0    -51    139  321 156   50.0 %
  5 Stockfish 16                   :   4.5/  6  75.0    -51    139  409 334   16.7 %
  6 Stockfish 18                   :   4.5/  6  75.0    -51    139  321 156   50.0 %
  7 Stockfish 07                   :   4.0/  6  66.7    -51     69  268 112   66.7 %
========================================================
Komodo 9.3 Weak: -142 ELO points


 14 Stockfish 09                   :   3.5/  6  58.3    -51      7  230 215   50.0 %
 15 Stockfish 10                   :   3.5/  6  58.3    -51      7  268  65   83.3 %
 16 Stockfish 12                   :   3.5/  6  58.3    -51      7  342 316   16.7 %
 17 Stockfish 02                   :   3.0/  6  50.0    -51    -51  271 271   33.3 %
 18 Stockfish 15                   :   3.0/  6  50.0    -51    -51  271 271   33.3 %
 19 Stockfish 14                   :   3.0/  6  50.0    -51    -51  271 271   33.3 %
 20 Stockfish 19                   :   3.0/  6  50.0    -51    -51  271 271   33.3 %
=========================================================
Komodo 9.3 Strong: +32  ELO points


Difference: 174 ELO points
I don't know what deep conclusions can be drawn from this sort of rearrangement of matches; the effect occurs independently of how particular engines behave, purely from the statistics of ELO performances.
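The larger split seen here with 6-game mini-matches is what the 1/sqrt(n) scaling of the per-mini-match standard deviation predicts; a quick check of the arithmetic (my own, not from the post):

```python
import math

# The top-half/bottom-half gap is proportional to the per-mini-match SD,
# which goes as 1/sqrt(games per mini-match). So 6-game mini-matches
# should show a gap about sqrt(25/6) times the 25-game gap:
scale = math.sqrt(25 / 6)
print(round(scale, 2))       # 2.04
print(round(100 * scale))    # 204 -- a ~100 Elo gap becomes ~200 Elo
```

This matches Laskos's ~100 vs. ~200 ELO figures for 25-game and 6-game mini-matches; the 174 observed above is a single noisy trial of the same effect.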
Frank Quisinsky
Posts: 6808
Joined: Wed Nov 18, 2009 7:16 pm
Location: Gutweiler, Germany
Full name: Frank Quisinsky

Re: Why the errorbar is wrong ... simple example!

Post by Frank Quisinsky »

Hi Kai,

could you explain "SD"?
I am slow to catch on.

The 123 Elo was only the result from the current test-run. I made some other experiments with older test-runs and often I get over 150 Elo! In most cases with engines that have a clear weak point (the endgame, for example). Have a look at the message by H.G.Muller.

In one of my messages in the thread I wrote about the idea I have for finding out more on this topic. Over the past 4 years I made some experiments here with simulated games and games from my older SWCR Rating List. I found that with 40 games per pairing I need 26 opponents for a stable rating. Now I will do it better with a stronger database and the current TOP-60.

My main interest is ...
to get a strong result with fewer games and more opponents.
The question is ... with how many opponents and how many games.

Your simple example:
Do you know what we could find out with such a strong database as I described before, with the TOP-60 engines and 50 games for each pairing?

A lot of other statistics are possible on the opening, middlegame, and endgame. A dream for me, and I am close to it.

Maybe you will later be interested in the database for your own experiments. I hope many readers are interested in such statistics. I am sure we can find out a lot with different ideas for stats.

Best and thanks
Frank
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Why the errorbar is wrong ... simple example!

Post by Laskos »

Frank Quisinsky wrote:Hi Kai,

could you explain "SD"?
I am slow to catch on.

The 123 Elo was only the result from the current test-run. I made some other experiments with older test-runs and often I get over 150 Elo! In most cases with engines that have a clear weak point (the endgame, for example). Have a look at the message by H.G.Muller.

In one of my messages in the thread I wrote about the idea I have for finding out more on this topic. Over the past 4 years I made some experiments here with simulated games and games from my older SWCR Rating List. I found that with 40 games per pairing I need 26 opponents for a stable rating. Now I will do it better with a stronger database and the current TOP-60.

My main interest is ...
to get a strong result with fewer games and more opponents.
The question is ... with how many opponents and how many games.

Your simple example:
Do you know what we could find out with such a strong database as I described before, with the TOP-60 engines and 50 games for each pairing?

A lot of other statistics are possible on the opening, middlegame, and endgame. A dream for me, and I am close to it.

Maybe you will later be interested in the database for your own experiments. I hope many readers are interested in such statistics. I am sure we can find out a lot with different ideas for stats.

Best and thanks
Frank
SD = standard deviation

Each of your mini-matches is 25 games. Under those conditions that means roughly 560/sqrt(25) ~ 110 ELO points for 2 standard deviations. Also, if you divide your mini-matches into the weakest-performing 1/3, middle 1/3, and strongest-performing 1/3, then the distance between the average ELO of the strongest-performing 1/3 and the weakest-performing 1/3 is roughly 2.4 standard deviations. If 2.0 standard deviations are about 110 ELO points (calculated above, from 25 games), then the ELO difference between the "strong"-performing engine and the "weak"-performing engine is expected to be around 130 ELO points. You saw 123 in this example, and sometimes even 140+, and it's perfectly normal that you see that. It's just statistical noise, and in fact a confirmation that most engines out there follow the normal distribution fairly well. So you can safely combine all the mini-matches into one big database of the games played by that engine and compute its small error margins (or standard deviation), the one the rating tool shows (a very different quantity from what I discussed earlier).
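These numbers can be checked against exact normal theory with Python's `statistics.NormalDist` (my own arithmetic; the 560-Elo-per-game figure for 2 SD is taken from the post). The exact top-third/bottom-third gap comes out near 2.2 SD, where Laskos uses a rough 2.4, and in Elo it lands close to the observed 123:

```python
import math
from statistics import NormalDist

nd = NormalDist()

# 2 SD for a single game ~ 560 Elo (Laskos's figure), so for 25 games:
two_sd = 560 / math.sqrt(25)        # 112 Elo, i.e. "about 110"

# Expected gap between the mean of the top third and the mean of the
# bottom third of a standard normal sample, in units of SD:
z = nd.inv_cdf(2 / 3)               # boundary of the top third
gap_sd = 2 * nd.pdf(z) / (1 / 3)    # symmetric, E[X | top 1/3] - E[X | bottom 1/3]
print(round(gap_sd, 2))             # 2.18

# Convert to Elo: gap in SD units times one SD for a 25-game mini-match
print(round(gap_sd * two_sd / 2))   # 122
```

So exact theory predicts a split of about 122 Elo points between the "strong" and "weak" thirds, right next to the 123 Frank measured.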

I am not a statistician, by the way; I just know some statistics from high school and undergrad. It is often needed later.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Why the errorbar is wrong ... simple example!

Post by bob »

Frank Quisinsky wrote:Hi Bob,

all is OK ...
I have worked for many years on the opening book I created for the Shredder Classic GUI, which I use for testing. Each game is checked for book losses and replayed if one occurred. My opening book can play all 500 ECO codes.

I have around 45,000 playable lines in my opening book, with popular openings given higher priority, of course. I optimize the book after each test-run.

Example:
If a game ends in a draw before move 20 ... the line is directly deactivated. I have around 500 deactivated draw lines. I can do nothing with such games ... so, as I wrote a while ago ... since 1% of my book consists of deactivated draw lines ... I now have contempt = 1

:-)

I know that I need such a database for the experiments I will do ...

Best
Frank

PS: I am not a fan of always using the same openings, because then I can do nothing with the database I produce ... later, for statistics.

It can be seen ...
Have a look here ...
http://www.amateurschach.de/main/_sgbp.htm

Download your Crafty 25.0 DC x64 games and check the opening systems your engine plays with my book. You can check for book losses too ... you will not find any ... OK, maybe 1-2 games I overlooked ... I don't know.
I know this is hard to understand. Students have problems with it all the time. But statistics are statistics. I frequently give a Monte Carlo type assignment, and students are told to find their own PRNG. And to test it. I don't tell them how (initially), and invariably one will come knocking on my door: "Dr. Hyatt, this Mersenne Twister is broken. After 15,000 numbers it produced 1, 2, 3, 4, 5, 6, 7 (the numbers are randomly chosen from the interval 1..52). Obviously this is anything but random, as 1 through 7 in order should never happen."

I tell them, "go look up the probability of my getting a royal flush in poker." They return with "roughly one in 40,000 attempts." I then ask, "what is a royal flush but 5 cards chosen completely randomly by the shuffle and the dealing? And yet you get the 10, J, Q, K and A."

At that point it sinks in that their perception of randomness is not randomness at all. If there were no such strings, the PRNG clearly would be broken; yet they thought it was broken because it did exactly what the laws of probability said it should do.

The same thing happens here. Regardless of what we believe should be the case, regardless of what we believe we are seeing, the statistics are sound and correct. It took me a while to come around to some of this when I started cluster testing 15 years or so ago. You can find the old discussions here. But I watched, and I learned, and I modified my beliefs to mesh with statistical theory and analysis. And I have not looked back.
Frank Quisinsky
Posts: 6808
Joined: Wed Nov 18, 2009 7:16 pm
Location: Gutweiler, Germany
Full name: Frank Quisinsky

Re: Why the errorbar is wrong ... simple example!

Post by Frank Quisinsky »

Hi Bob,

yes, it is hard to understand, and I am honest ... for me it is now quite clear (after a long time), but it still does not feel logical.

With more opponents the rating is stable earlier; the error must be smaller.

Example:
http://talkchess.com/forum/viewtopic.ph ... 72&t=59307

Just have a look at the thread title.
Since game 826 of 2,950 the rating has been the same.

Not again ...

I think I have understood it, and we have to seek salvation in an "opp stability factor" as well as in the error. I will make some statistics here, because I am maybe only about 85% there.

Please don't finish me off.

And as for the other things you wrote:
I liked reading them.

If I were younger, a student, and living in the US, believe me, I would be happy with such a professor as you. But I am sure you would not be happy with me, a really stubborn person.

Enough ...
Thanks again!

Best
Frank