Why the errorbar is wrong ... simple example!

Frank Quisinsky · Post by **Frank Quisinsky** » Wed Feb 24, 2016 6:34 am

Hi Bob,

again hello ...

What we need is a strong database, each one vs. each one (a full round robin), for the simulation.

Example:
60 programs, 59 opponents, 1.770 possible pairings, 50 games = 85500 games.

OK ...

Now it is possible to simulate for each of the 60 engines ...

Table for:
Best and weakest results against 4, 5, ... up to 59 opponents, for each of the 60 engines.

We can see ...

Gaviota produced vs. 4 opponents (best results, 50 games) = 2.925 Elo
Gaviota produced vs. 4 opponents (weakest results, 50 games) = 2.525 Elo

Difference = 400
We do the same vs. 4 opponents for each of the other engines.

---
Note: Perhaps we should drop the best 5 and weakest 5 of the 59 results for such a calculation. I don't know what works well here.
---

We take the average of all 60 of these results.

Now we can write in our table ...
4 opponents = 400 Elo difference (the average) with 200 games.

As the next step ...
We repeat it with 49, 48, 47, 46 games ...

We can write in our table:
4 opponents = 400 Elo difference with 200 games.
4 opponents = 412 Elo difference with 196 games
4 opponents = 420 Elo difference with 192 games

etc.

Now the same for 5 opponents, 6 opponents ... up to 59 opponents.
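
The table-building procedure described above could be sketched roughly as follows. This is only an illustration under assumptions that are not in the post: `best_worst_spread` and `spread_table` are hypothetical names, the performance ratings are made-up numbers, and averaging the performance ratings over the k best or k weakest opponents is just one possible reading of "best and weakest results".

```python
from statistics import mean

def best_worst_spread(perfs, k):
    """Mean of the k best minus mean of the k weakest performance ratings."""
    if not 1 <= k <= len(perfs):
        raise ValueError("k must be between 1 and the number of opponents")
    ordered = sorted(perfs)
    return mean(ordered[-k:]) - mean(ordered[:k])

def spread_table(perf_by_engine, ks):
    """For each k, average the best/weakest spread over all engines."""
    return {
        k: mean(best_worst_spread(perfs, k) for perfs in perf_by_engine.values())
        for k in ks
    }

# Made-up performance ratings of two engines against four opponents each.
demo = {
    "Engine A": [2500, 2600, 2700, 2800],
    "Engine B": [2400, 2500, 2600, 2700],
}
table = spread_table(demo, ks=[1, 2])
for k, spread in table.items():
    print(f"{k} opponents = {spread} Elo difference (the average)")
```

With a real database, `demo` would hold one list of 59 performance ratings per engine, and the whole calculation would be repeated for each number of games per match.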

At the end of the day (or rather the end of the year; we need a program for it) we can look and compare ...

200 games vs. 4 opponents ...
200 games vs. 50 opponents ...

We can see directly what happens!

Easy ...
the Elo calculation program can work with the information we have written in our table.

With such a table ... if we have the results for up to 50 games, we can easily simulate the results for 51 to infinitely many games. The calculation program can do that.

All we need is a strong database.
I am working on it, and maybe Miguel is interested in creating the calculation for it.

After all ...
We replace the error value with this one ... it's much better!

Best
Frank

For around five years I tried this in Excel, but I gave up (I don't have a program for it). I didn't have the games and opponents, so I simulated the games. I found that with 26 opponents and 40 games per match we can produce a very stable Elo. I found that with 25 opponents we need 44 games for the same, with 24 opponents 62 games ... and so on.

But the calculations I made are poor, because the database is poor: it relies on simulated games in place of the real games I don't have.

michiguel · Post by **michiguel** » Wed Feb 24, 2016 6:41 am

Frank Quisinsky wrote:Hi Miguel,

easy ...
I am looking at the round-robin results from the Shredder Classic GUI.
http://www.amateurschach.de/ftptrigger/ ... 6-x64.html

I selected by hand ... the best 20 and weakest 20 results (looking at perf.).

I created two databases and added the games vs. the best and the weakest opponents.

I added the two databases to the other games and got this ...

   # Player                           :    Elo  Games  Score  Draw  Error  OppAvg   OppE   OppD
   1 Komodo 9.3 x64                   : 3185.5   2025   84.7  26.8   15.6  2856.8   11.2   40.8
   2 Stockfish 7 KP BMI2 x64          : 3176.3   2025   84.0  30.0   15.1  2858.2   11.2   40.8
   3 Houdini 4 STD B x64              : 3098.9   2075   77.9  31.7   13.9  2854.6   11.2   41.8
   4 Fire 4 x64                       : 3054.7   1775   71.6  39.9   13.9  2876.3   11.3   35.8
   5 GullChess 3.0 BMI2 x64           : 3050.2   1775   71.3  41.9   13.8  2874.7   11.3   35.8
   6 Equinox 3.30 x64                 : 3005.9   1775   66.1  45.2   13.1  2877.7   11.4   35.8
   7 Fritz 15 x64                     : 2996.2   2000   66.8  43.5   12.5  2862.0   11.1   40.0
   8 Critter 1.6a x64                 : 2995.2   1775   65.0  45.2   13.1  2876.4   11.3   35.8
   9 Protector 1.9.0 x64              : 2954.1   1950   61.6  46.3   12.1  2864.8   11.1   39.0
  10 Nirvanachess 2.2 POP x64         : 2943.0   1975   60.1  46.1   12.1  2865.6   11.3   39.8
  11 iCE 3.0 v658 POP x64             : 2931.3   2025   59.1  44.0   11.7  2862.7   11.3   40.8
  12 Sting SF 6 x64                   : 2925.2   2025   58.3  44.9   11.7  2862.8   11.3   40.8
  13 Andscacs 0.83 POP x64            : 2924.7   1950   57.9  44.6   12.1  2865.5   11.1   39.0
  14 Fizbo 1.6 x64 stark              : 2908.9    499   56.5  44.9   24.0  2860.2   11.5   20.0
  15 Hannibal 1.5 x64                 : 2906.5   2100   57.6  45.0   11.7  2848.8   11.2   42.0
  16 Chiron 2.0 x64                   : 2906.3   2400   60.0  40.3   11.0  2829.5   11.2   48.0
  17 Texel 1.05 x64                   : 2904.9   2074   56.9  42.5   11.9  2853.0   11.3   41.8
  18 Naum 4.6 x64                     : 2890.0   2575   58.9  44.5   10.4  2822.7   11.4   51.8
  19 SmarThink 1.80 AVX x64           : 2852.0   2000   48.8  41.0   11.7  2865.6   11.1   40.0
  20 Senpai 1.0 SSE42 x64             : 2840.7   2800   53.9  43.2   10.0  2814.3   11.2   56.0
  21 Hakkapeliitta 3.0 x64            : 2836.0   2625   51.9  38.5   10.2  2825.6   11.3   52.8
  22 Hiarcs 14 WCSC w32               : 2829.2   2850   52.9  41.3   10.4  2810.9   11.2   57.0
  23 Sjeng c't 2010 w32               : 2811.5   2850   50.7  40.9    9.6  2811.2   11.2   57.0
  24 Cheng 4.39 x64                   : 2802.7   2875   49.5  41.0    9.7  2811.6   11.3   57.8
  25 Shredder 12 x64                  : 2798.3   2875   48.8  43.5    9.8  2812.8   11.3   57.8
  26 Junior 13.3.00 x64               : 2797.0   2925   48.9  40.8    9.8  2810.6   11.3   58.8
  27 Vajolet2 2.0 POP x64             : 2793.0   2900   48.5  40.9    9.6  2809.8   11.2   58.0
  28 Fizbo 1.6 x64 schwach            : 2785.9    499   53.3  45.3   22.4  2763.9   11.1   20.0
  29 Spike 1.4 Leiden w32             : 2784.2   2900   47.4  42.6    9.6  2810.0   11.2   58.0
  30 DiscoCheck 5.2.1 x64             : 2777.2   2925   46.4  39.5    9.6  2810.9   11.3   58.8
  31 Quazar 0.4 x64                   : 2772.2   2875   46.1  42.7    9.4  2808.6   11.3   57.8
  32 Booot 5.2.0 x64                  : 2772.1   2475   48.1  40.0   10.4  2792.5   11.1   49.8
  33 Deuterium 14.3.34.130 POP x64    : 2760.6   2925   44.3  43.6    9.7  2811.2   11.3   58.8
  34 Alfil 15.04 C# Beta 24 x64       : 2756.2   2575   46.7  34.6   10.0  2787.9   11.1   51.8
  35 Zappa Mexico II x64              : 2751.1   2850   43.4  42.8    9.7  2809.1   11.2   57.0
  36 Spark 1.0 x64                    : 2749.8   2875   43.3  42.7    9.4  2809.0   11.3   57.8
  37 Thinker 5.4d Inert x64           : 2746.3   2900   42.6  41.5    9.8  2810.6   11.2   58.0
  38 Doch 1.3.4 JA x64                : 2746.0   2275   48.4  46.0   10.8  2761.5   10.9   45.8
  39 Crafty 25.0 DC x64               : 2736.8   2325   46.9  41.9   10.7  2763.2   10.8   46.8
  40 TogaII 280513 Intel w32          : 2734.1   2425   45.4  40.3   10.5  2774.9   11.0   48.8
  41 Atlas 3.80 x64                   : 2731.8   2725   42.4  40.1   10.0  2796.5   11.1   54.8
  42 Tornado 5.0 SSE4 x64             : 2727.0   2300   45.5  38.9   10.8  2764.0   10.7   46.0
  43 Gaviota 1.0 AVX x64              : 2726.3   2925   40.1  37.9    9.6  2810.7   11.3   58.8
  44 MinkoChess 1.3 JA POP x64        : 2724.1   2475   43.5  43.4   10.4  2777.9   11.0   49.8
  45 Arasan 18.1 POP x64              : 2723.6   2500   43.0  39.6   10.9  2784.1   10.9   50.0
  46 Dirty 03NOV2015 POP x64          : 2722.3   2074   47.9  43.2   10.9  2737.8   10.6   41.8
  47 Bobcat 7.1 x64                   : 2713.6   2075   46.7  44.4   11.1  2738.1   10.6   41.8
  48 EXchess 7.71b x64                : 2711.6   2125   44.1  41.1   11.5  2761.6   11.0   42.8
  49 Rodent 1.7 Build 1 POP x64       : 2707.9   2050   45.7  43.9   11.3  2739.4   10.5   41.0
  50 Murka 3 x64                      : 2707.8   2125   44.6  46.4   11.2  2748.4   10.7   42.8
  51 Nemo 1.01 Beta POP x64           : 2707.7   2375   42.3  42.1   10.4  2768.4   10.9   47.8
  52 Pedone 1.2 BMI2 x64              : 2705.2   2450   41.0  44.0   10.2  2778.9   10.9   49.0
  53 DisasterArea 1.54 x64            : 2682.7   1950   42.3  45.3   11.5  2740.3   10.6   39.0
  54 GNU Chess5 5.60 x64              : 2677.7   2125   41.2  41.1   11.1  2742.9   10.6   42.8
  55 Scorpio 2.77 JA POP x64          : 2675.5   2225   39.9  40.4   11.1  2751.6   10.7   44.8
  56 Glaurung 2.2 JA x64              : 2659.8   2175   38.3  42.9   11.1  2747.1   10.7   43.8
  57 Rhetoric 1.4.3 POP x64           : 2650.9   2075   38.0  41.3   11.6  2739.6   10.6   41.8
  58 The Baron 3.29 x64               : 2644.6   2225   35.8  38.0   11.2  2752.3   10.7   44.8
  59 Octochess r5190 SSE4 x64         : 2637.8   2100   36.0  41.1   11.4  2743.3   10.5   42.0
  60 BugChess2 1.9 POP x64            : 2618.5   1975   34.5  37.5   12.2  2734.2   10.6   39.8
  61 Frenzee 3.5.19 x64               : 2609.5   1925   33.7  35.0   12.4  2731.6   10.6   38.8
  61 Frenzee 3.5.19 x64               &#58; 2609.5   1925   33.7  35.0   12.4  2731.6   10.6   38.8

White advantage = 36.00 +/- 1.02
Draw rate (equal opponents) = 47.68 % +/- 0.21

I am discussing this topic with Jesus in another thread. I would like a GUI for such simulations.

With 59 opponents = 1.770 possible pairings.
Should be easy ... we need a txt file with the perf results from each of these 1.770 possible pairings. Then the GUI can look in the txt file ... if I want the information ... please give me a rating list with the best and weakest Gaviota results against 20, 21, 22 opponents (whatever I want). With such a strong database I can simulate any rating list, and with such a GUI / tool we can simulate the correct error bar.

In other words, we can make the Elo calculation much stronger.

Best
Frank

If I understand correctly, this is expected and normal. If you pick the worst and compare them with the best, of course they will be very different. You are picking one tail and comparing it with the other tail of the distribution.
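
The tail-picking effect can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not anything taken from the thread's data: every opponent is exactly as strong as the tested engine, the 40% draw rate and the seed are arbitrary, and `simulate_match`, `score_to_elo` and `tail_spread` are hypothetical helper names. The best 4 and weakest 4 of 59 matches are picked afterwards, as in the opening post.

```python
import math
import random
from statistics import mean

def score_to_elo(score):
    """Logistic Elo difference implied by a score fraction (clamped away from 0 and 1)."""
    score = min(max(score, 0.001), 0.999)
    return -400.0 * math.log10(1.0 / score - 1.0)

def simulate_match(games, draw_rate, rng):
    """Score fraction of one match between two equally strong engines."""
    points = 0.0
    for _ in range(games):
        r = rng.random()
        if r < draw_rate:
            points += 0.5          # draw
        elif r < draw_rate + (1.0 - draw_rate) / 2.0:
            points += 1.0          # win (the remaining cases are losses)
    return points / games

def tail_spread(opponents=59, games=50, k=4, draw_rate=0.4, seed=1):
    """Elo gap between the k best and the k weakest match results."""
    rng = random.Random(seed)
    scores = sorted(simulate_match(games, draw_rate, rng) for _ in range(opponents))
    return score_to_elo(mean(scores[-k:])) - score_to_elo(mean(scores[:k]))

spread = tail_spread()
print(f"best 4 vs. weakest 4 of 59 equal opponents: {spread:.0f} Elo apart")
```

Even though all simulated opponents are identical, the selected tails end up clearly apart, which is the effect described above.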

If you want to test whether some opponents perform very badly compared to others, you would have to rerun everything and see whether the top 20 are still the top 20 and the bottom 20 are still the bottom 20. I bet they will all be mixed up. This may be a lot of work, but you could do it with the data you already have: split the database in half (randomly) and process both halves. You may see that the worst 20 in one half are not the same in the other.
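
The split-half check suggested here might look roughly like this. It is a sketch only: the games are made-up `(opponent, score)` pairs rather than a real PGN database, the names `weakest_results` and `split_half_overlap` are hypothetical, and per-opponent average score stands in for whatever performance measure the rating tool reports.

```python
import random
from collections import defaultdict

def weakest_results(games, k):
    """The k opponents against which the engine scored worst on average."""
    totals = defaultdict(lambda: [0.0, 0])
    for opponent, score in games:
        totals[opponent][0] += score
        totals[opponent][1] += 1
    average = {opp: pts / n for opp, (pts, n) in totals.items()}
    return set(sorted(average, key=average.get)[:k])

def split_half_overlap(games, k, seed=0):
    """Split the games randomly in half; count opponents in both 'weakest k' sets."""
    rng = random.Random(seed)
    shuffled = list(games)
    rng.shuffle(shuffled)
    middle = len(shuffled) // 2
    first, second = shuffled[:middle], shuffled[middle:]
    return len(weakest_results(first, k) & weakest_results(second, k))

# Made-up data: 59 equally strong opponents, 50 games each, 50% draws.
data_rng = random.Random(42)
games = [
    (f"opp{i}", data_rng.choice([0.0, 0.5, 0.5, 1.0]))
    for i in range(59)
    for _ in range(50)
]
overlap = split_half_overlap(games, k=20)
print(f"{overlap} of the weakest 20 results appear in both halves")
```

If the weakest 20 reflected a real property of those opponents, the overlap between the two halves should be close to 20; an overlap near what chance alone would give suggests the selection is mostly noise.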

Miguel

Frank Quisinsky · Post by **Frank Quisinsky** » Wed Feb 24, 2016 7:06 am

Hi Miguel,

I will do that !!

But not yet. The database isn't ready for it. I have 71.700 of the 85.500 games I need. I will not interrupt the ongoing test runs with the current engine updates. In maybe 4-6 months I should have the missing games.

If you have time for my bad English, please have a look at the message I wrote to Bob.

I have in mind to create such a table from the database I am producing.

For me, the important main information for a possible tolerance output is:

With 26 opponents you need 50 games per match = 1.300 games
And if you want to create an equally stable rating with 14 engines you need 4.000 games.

That is the kind of result I want to see, at the end of the day, in the tolerance information. I think we can calculate it with the example I gave in the message to Bob.

Best
Frank

But with the example I gave you can see that the Elo information we produce is often purely random.

Example:
CEGT Elo of Fizbo 1.6.
For me it is absolutely clear why Fizbo 1.6 has 30 Elo more there than in my test ... the opponents! We can't compare Elo from different rating lists if different opponents were used ... or rather ... the more opponents, the more comparable the lists become.

bob · Post by **bob** » Wed Feb 24, 2016 9:08 am

Frank Quisinsky wrote:Hi Bob,

thanks for your message!

I know that, Bob. I have read the book by Karl-Heinz Glenz (to name one example) several times.

The error bar can't have a practical application in chess. It should have no place here. The statistical standard is useless. What can I do with it, if with any test run I can simulate examples like the one I gave in my first posting?

75 Elo away from the stated error; this case is not isolated.

---

Giving information about tolerance is nice.

Table after Prof. Elo:
Two examples

5 games: standard deviation in points 1.12, in rating 126.0; probable error in points 0.76, in rating 85.0; confidence 57
100 games: standard deviation in points 5.00, in rating 28.3; probable error in points 3.37, in rating 19.1; confidence 99.9

I believe this is where we should look for the problem. I would like such a table for games in combination with opponents.

100 games, 5 opponents
100 games, 10 opponents
100 games, 20 opponents
100 games, 30 opponents

Again, with a strong database we can simulate it. All Prof. Elo did ... he collected game material. The problem Prof. Elo had ... he did not have enough of it between the same players.

But we have it; we can produce it in computer chess. Computer chess can help in trying to build a better rating system. Maybe a rating system for computer chess only.

I say ... tolerance information is great!
I say ... we can try to produce better information than the error bar.

Displaying "error" in a rating list of chess programs is the same as displaying a picture of my grandmother in the Elo calculation. And believe me, the chess players would have much more fun if they could see how my grandmother looked in the past. That is much more interesting.

We have to look for, and produce, tolerance information that combines the number of games with the number of opponents.

Best
Frank

You are missing the point. The real elo can NOT lie outside the error bar 95% of the time. 5% it can. But the Elo number is ONLY applicable to the pool of opponents and games being used. No way to "connect" them to other pools of opponents or games, other than by having inter-play between the groups that will help to normalize the Elo for the programs that play in both groups.

For your example, the Elo is going to be valid with the same error bar for any example you gave. And the Elos might well be different for each group as well. Unfortunately, one will not be more accurate than the other. That being said, when you want to compare two programs in terms of Elo, the more common opponents they have, the better the accuracy of the final Elo numbers, since they are "coupled" by common opponents.

One can easily play group A and then group B, and use Bayeselo to compute the rating for each group. Then combine and compare the resulting Elos. I did a bunch of this when I first started cluster testing, and had several long conversations with Rémi when analyzing the results...

Ozymandias · Post by **Ozymandias** » Wed Feb 24, 2016 9:54 am

bob wrote:The real elo can NOT lie outside the error bar 95% of the time.

If number of games is the only thing you need, to determine the error bar, how many do you need, to get it below 1?

Ajedrecista · Post by **Ajedrecista** » Wed Feb 24, 2016 11:11 am

Hello Juan:

Ozymandias wrote:If number of games is the only thing you need, to determine the error bar, how many do you need, to get it below 1?

Using a normal distribution as an approximation, I would say the following:

1 Elo gain means a score of 1/2 + ln(10)/1600 (just using Taylor series around score = 1/2).
Explanation of ln(10)/1600 here:
http://talkchess.com/forum/viewtopic.php?topic_view=threads&p=470965&t=44100

sigma = (score - 1/2)/z

Where z defines the (confidence level) = erf[z/sqrt(2)]

In the longest case (no draws):

sigma = sqrt[score*(1 - score)/games] ~ 0.5/sqrt(games)

So:

sigma = ln(10)/[1600*z] ~ 0.5/sqrt(games)
games ~ 0.25*[1600*z/ln(10)]² = [800*z/ln(10)]²

For a confidence level of 95%, z ~ 1.96 and games ~ 463725
You need more than 463.7k games in order to get error bars < 1 Elo with 95% confidence.

If you include draws, you can multiply the previous result by (1 - draw_ratio).
For example, if there are 60% draws:
463725*(1 - 0.6) = 185490 ~ 185.5k games.

In any case, you need lots of games.

Regards from Spain.

Ajedrecista.
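
The estimate in the post above can be checked with a few lines of code. This is a small sketch of the same normal approximation around a 50% score; `games_needed` and its parameter names are illustrative, not part of any rating tool.

```python
import math

def games_needed(error_elo=1.0, z=1.96, draw_ratio=0.0):
    """Games per engine for an error bar of +/- error_elo (normal approximation)."""
    # Score margin above 1/2 that corresponds to error_elo (first-order Taylor term).
    score_margin = error_elo * math.log(10) / 1600.0
    # Required standard deviation of the score: sigma = (score - 1/2) / z.
    sigma = score_margin / z
    # Without draws, sigma ~ 0.5 / sqrt(games), so games ~ (0.5 / sigma)^2.
    no_draw_games = (0.5 / sigma) ** 2
    # Draws reduce the variance per game by the factor (1 - draw_ratio).
    return no_draw_games * (1.0 - draw_ratio)

print(round(games_needed()))                 # no draws
print(round(games_needed(draw_ratio=0.6)))   # 60% draws
```

Because the required number of games grows with the inverse square of the error bar, halving the error bar costs four times as many games.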

Frank Quisinsky · Post by **Frank Quisinsky** » Wed Feb 24, 2016 11:12 am

Hello Bob,

thanks for your time and explanations / hints again!

I am interested in providing the following information:

With 26 opponents ... 1.400 games
With 24 opponents ... 1.600 games
...
for the same stable Elo!

The problem is ...
The error looks at the number of games only! With the result that I get the same output no matter how many opponents I use.

But with more opponents fewer games are necessary (again for the same stable rating).

This information the Elo calculation programs are not able to give us. Maybe this information should have a name other than error. "Opp. stability factor" could be the name.

You wrote:

That being said, when you want to compare two programs in terms of Elo, the more common opponents they have, the better the accuracy of the final Elo numbers, since they are "coupled" by common opponents.

No doubt about it ...

The 5% you wrote about is another topic.
Randomly produced? It could be more or less than 5%.

I am very insistent on this topic because I have reasons for it.

"Producing a strong result while saving electricity and time"

I think that for the same stable result with 60 opponents I need only 20 games per pairing.

= 1700 pairings (60 opponents) x 20 games = 34.000 games (each one vs. each one).

And I produce a stronger and more accurate rating than with many more games and fewer opponents.

That's what I have in mind.
I want to know this very precisely!

Best
Frank

Ozymandias · Post by **Ozymandias** » Wed Feb 24, 2016 11:53 am

Ajedrecista wrote:Hello Juan:

Ozymandias wrote:If number of games is the only thing you need, to determine the error bar, how many do you need, to get it below 1?

Using a normal distribution as an approximation, I would say the following:

1 Elo gain means a score of 1/2 + ln(10)/1600 (just using Taylor series around score = 1/2).
Explanation of ln(10)/1600 here:
http://talkchess.com/forum/viewtopic.php?topic_view=threads&p=470965&t=44100

sigma = (score - 1/2)/z

Where z defines the (confidence level) = erf[z/sqrt(2)]

In the longest case (no draws):

sigma = sqrt[score*(1 - score)/games] ~ 0.5/sqrt(games)

So:

sigma = ln(10)/[1600*z] ~ 0.5/sqrt(games)
games ~ 0.25*[1600*z/ln(10)]² = [800*z/ln(10)]²

For a confidence level of 95%, z ~ 1.96 and games ~ 463725
You need more than 463.7k games in order to get error bars < 1 Elo with 95% confidence.

If you include draws, you can multiply the previous result by (1 - draw_ratio).
For example, if there are 60% draws:
463725*(1 - 0.6) = 185490 ~ 185.5k games.

In any case, you need lots of games.

Regards from Spain.

Ajedrecista.

So, for a database containing a 40-50% of drawn games, we'd need a number of games closer to 185.5k than to 463.7k, in order to achieve a sub 1 error bar, correct?

Ajedrecista · Post by **Ajedrecista** » Wed Feb 24, 2016 12:00 pm

Hi again:

Ozymandias wrote:So, for a database containing a 40-50% of drawn games, we'd need a number of games closer to 185.5k than to 463.7k, in order to achieve a sub 1 error bar, correct?

Well, in this case the number of games would be 463725*(1 - 0.4) ~ 278.2k games or 463725*(1 - 0.5) = 231.9k games... each engine! Not the sum of games of all the engines. It is difficult, isn't it?

Just keep an eye on the current state of Frank's rating list: a total of 180250 games but error bars of engines are around ±6.3 to ±14.9 (depending on the number of games of each engine as a first approximation), not even near to ±1.

Regards from Spain.

Ajedrecista.
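
Turning the same approximation around gives a rough error bar for a given number of games per engine. This is a sketch under the same assumptions (normal approximation, score near 50%, opponents' own rating uncertainty ignored); `approx_error_bar` is a hypothetical name, and rating programs compute their error columns in their own way, so only rough agreement should be expected.

```python
import math

def approx_error_bar(games, draw_ratio, z=1.96):
    """Approximate +/- Elo error bar for one engine after `games` games."""
    # Standard deviation of the mean score near 50%, reduced by the draw ratio.
    sigma = 0.5 * math.sqrt((1.0 - draw_ratio) / games)
    # Convert the score margin z*sigma to Elo (first-order: 1600/ln(10) Elo per unit score).
    return z * sigma * 1600.0 / math.log(10)

# Gaviota's row in the table earlier in the thread: 2925 games, 37.9% draws.
print(f"+/- {approx_error_bar(2925, 0.379):.1f} Elo")
```

For the Gaviota row this comes out at roughly 10 Elo, in the same range as the 9.6 listed in the Error column, which fits the point that error bars of this size are expected at a few thousand games per engine.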

Ozymandias · Post by **Ozymandias** » Wed Feb 24, 2016 12:11 pm

Ajedrecista wrote:Hi again:

Ozymandias wrote:So, for a database containing a 40-50% of drawn games, we'd need a number of games closer to 185.5k than to 463.7k, in order to achieve a sub 1 error bar, correct?
Well, in this case the number of games would be 463725*(1 - 0.4) ~ 278.2k games or 463725*(1 - 0.5) = 231.9k games... each engine! Not the sum of games of all the engines. It is difficult, isn't it?

Not really, I'm well over 250k games per engine. In this case, there shouldn't be any rating oscillation over 1 ELO point, for more than 5% of the engines, correct?
