Hi Bob,
hello again ...
What we need is a strong database: a full round robin, each engine against every other, for the simulation.
Example:
60 programs, 59 opponents each, 1,770 possible pairings, 50 games per pairing = 88,500 games.
OK ...
Now it is possible to simulate for each of the 60 engines ...
Table for:
Best and weakest results against 4, 5, up to 59 opponents, for each of the 60 engines.
We can see ...
Gaviota vs. its 4 best-scoring opponents (50 games each) = 2925 Elo
Gaviota vs. its 4 weakest-scoring opponents (50 games each) = 2525 Elo
Difference = 400 Elo
We do the same, vs. 4 opponents, for all 60 engines.
---
Note: Perhaps we should drop the best 5 and weakest 5 of the 59 results before doing such a calculation. I don't know what works best here.
---
Then we take the average over all 60 results.
Now we can write in our table ...
4 opponents = 400 Elo difference (the average) with 200 games.
As next step ...
Next we repeat it with 49, 48, 47, 46 games per pairing ...
We can write in our table:
4 opponents = 400 Elo difference with 200 games.
4 opponents = 412 Elo difference with 196 games
4 opponents = 420 Elo difference with 192 games
etc.
Now the same for 5 opponents, 6 opponents ... up to 59 opponents.
At the end of the day (better: the end of the year, we need a program for this) we can look at it and compare ...
200 games vs. 4 opponents ...
200 games vs. 50 opponents ...
We can see directly what happens!
Easy ...
the Elo calculation program can work with the information we have written into our table.
With such a table ... once we have the results up to 50 games, we can easily extrapolate the results for 51 games to infinity. The calculation program can do that.
All we need is a strong database.
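The procedure above can be sketched as a small simulation. Everything here is hypothetical: the 60 engine strengths are random numbers, a logistic Elo model stands in for real games, and all names are mine, not from any existing tool.

```python
import math
import random

random.seed(1)

def expected(d):
    # logistic expectation: win probability for an Elo advantage of d
    return 1.0 / (1.0 + 10.0 ** (-d / 400.0))

def score(d, games):
    # simulated fraction of points from `games` Bernoulli trials (no draws)
    return sum(random.random() < expected(d) for _ in range(games)) / games

# 60 hypothetical engines; true strengths known only to the simulation
elos = [random.gauss(2800, 100) for _ in range(60)]
me, opponents = elos[0], elos[1:]

# one result per pairing, 50 games each, sorted weakest to best
results = sorted(score(me - opp, 50) for opp in opponents)

def perf(s, field=2800):
    # performance Elo against a (simplified) equal-strength field
    s = min(max(s, 0.01), 0.99)
    return field + 400.0 * math.log10(s / (1.0 - s))

best4 = perf(sum(results[-4:]) / 4)    # 4 best results, 200 games
worst4 = perf(sum(results[:4]) / 4)    # 4 weakest results, 200 games
print(round(best4 - worst4))           # the kind of spread described above
```

With a real game database, the same spread would be computed from actual results per pairing instead of the model.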
I am working on it, and maybe Miguel is interested in creating the calculation for it.
After all ...
We replace the error bar with this one ... it's much better!
Best
Frank
For around five years I tried this in Excel, but I gave up (I don't have a program for it), and I didn't have the games and opponents to simulate with. I found that with 26 opponents and 40 games per match we can produce a very stable Elo. With 25 opponents we need 44 games for the same stability, with 24 opponents 62 games ... and so on.
But the calculations I made are unreliable, because the database is bad without the simulated games I don't have.
Why the errorbar is wrong ... simple example!
-
- Posts: 6808
- Joined: Wed Nov 18, 2009 7:16 pm
- Location: Gutweiler, Germany
- Full name: Frank Quisinsky
-
- Posts: 6401
- Joined: Thu Mar 09, 2006 8:30 pm
- Location: Chicago, Illinois, USA
Re: Why the errorbar is wrong ... simple example!
If I understand correctly, it is expected and normal. If you pick the worst and compare them with the best, of course they will be very different. You are picking one tail of the distribution and comparing it with the other tail.

Frank Quisinsky wrote:
Hi Miguel,
easy ...
I am looking at the round-robin results in the Shredder Classic GUI.
http://www.amateurschach.de/ftptrigger/ ... 6-x64.html
I selected by hand the best 20 and weakest 20 results (looking at performance).
I created two databases and added the games vs. the best and the weakest opponents.
I added the two databases to the other games and got this:
Code:
# Player : Elo Games Score Draw Error OppAvg OppE OppD
1 Komodo 9.3 x64 : 3185.5 2025 84.7 26.8 15.6 2856.8 11.2 40.8
2 Stockfish 7 KP BMI2 x64 : 3176.3 2025 84.0 30.0 15.1 2858.2 11.2 40.8
3 Houdini 4 STD B x64 : 3098.9 2075 77.9 31.7 13.9 2854.6 11.2 41.8
4 Fire 4 x64 : 3054.7 1775 71.6 39.9 13.9 2876.3 11.3 35.8
5 GullChess 3.0 BMI2 x64 : 3050.2 1775 71.3 41.9 13.8 2874.7 11.3 35.8
6 Equinox 3.30 x64 : 3005.9 1775 66.1 45.2 13.1 2877.7 11.4 35.8
7 Fritz 15 x64 : 2996.2 2000 66.8 43.5 12.5 2862.0 11.1 40.0
8 Critter 1.6a x64 : 2995.2 1775 65.0 45.2 13.1 2876.4 11.3 35.8
9 Protector 1.9.0 x64 : 2954.1 1950 61.6 46.3 12.1 2864.8 11.1 39.0
10 Nirvanachess 2.2 POP x64 : 2943.0 1975 60.1 46.1 12.1 2865.6 11.3 39.8
11 iCE 3.0 v658 POP x64 : 2931.3 2025 59.1 44.0 11.7 2862.7 11.3 40.8
12 Sting SF 6 x64 : 2925.2 2025 58.3 44.9 11.7 2862.8 11.3 40.8
13 Andscacs 0.83 POP x64 : 2924.7 1950 57.9 44.6 12.1 2865.5 11.1 39.0
14 Fizbo 1.6 x64 stark : 2908.9 499 56.5 44.9 24.0 2860.2 11.5 20.0
15 Hannibal 1.5 x64 : 2906.5 2100 57.6 45.0 11.7 2848.8 11.2 42.0
16 Chiron 2.0 x64 : 2906.3 2400 60.0 40.3 11.0 2829.5 11.2 48.0
17 Texel 1.05 x64 : 2904.9 2074 56.9 42.5 11.9 2853.0 11.3 41.8
18 Naum 4.6 x64 : 2890.0 2575 58.9 44.5 10.4 2822.7 11.4 51.8
19 SmarThink 1.80 AVX x64 : 2852.0 2000 48.8 41.0 11.7 2865.6 11.1 40.0
20 Senpai 1.0 SSE42 x64 : 2840.7 2800 53.9 43.2 10.0 2814.3 11.2 56.0
21 Hakkapeliitta 3.0 x64 : 2836.0 2625 51.9 38.5 10.2 2825.6 11.3 52.8
22 Hiarcs 14 WCSC w32 : 2829.2 2850 52.9 41.3 10.4 2810.9 11.2 57.0
23 Sjeng c't 2010 w32 : 2811.5 2850 50.7 40.9 9.6 2811.2 11.2 57.0
24 Cheng 4.39 x64 : 2802.7 2875 49.5 41.0 9.7 2811.6 11.3 57.8
25 Shredder 12 x64 : 2798.3 2875 48.8 43.5 9.8 2812.8 11.3 57.8
26 Junior 13.3.00 x64 : 2797.0 2925 48.9 40.8 9.8 2810.6 11.3 58.8
27 Vajolet2 2.0 POP x64 : 2793.0 2900 48.5 40.9 9.6 2809.8 11.2 58.0
28 Fizbo 1.6 x64 schwach : 2785.9 499 53.3 45.3 22.4 2763.9 11.1 20.0
29 Spike 1.4 Leiden w32 : 2784.2 2900 47.4 42.6 9.6 2810.0 11.2 58.0
30 DiscoCheck 5.2.1 x64 : 2777.2 2925 46.4 39.5 9.6 2810.9 11.3 58.8
31 Quazar 0.4 x64 : 2772.2 2875 46.1 42.7 9.4 2808.6 11.3 57.8
32 Booot 5.2.0 x64 : 2772.1 2475 48.1 40.0 10.4 2792.5 11.1 49.8
33 Deuterium 14.3.34.130 POP x64 : 2760.6 2925 44.3 43.6 9.7 2811.2 11.3 58.8
34 Alfil 15.04 C# Beta 24 x64 : 2756.2 2575 46.7 34.6 10.0 2787.9 11.1 51.8
35 Zappa Mexico II x64 : 2751.1 2850 43.4 42.8 9.7 2809.1 11.2 57.0
36 Spark 1.0 x64 : 2749.8 2875 43.3 42.7 9.4 2809.0 11.3 57.8
37 Thinker 5.4d Inert x64 : 2746.3 2900 42.6 41.5 9.8 2810.6 11.2 58.0
38 Doch 1.3.4 JA x64 : 2746.0 2275 48.4 46.0 10.8 2761.5 10.9 45.8
39 Crafty 25.0 DC x64 : 2736.8 2325 46.9 41.9 10.7 2763.2 10.8 46.8
40 TogaII 280513 Intel w32 : 2734.1 2425 45.4 40.3 10.5 2774.9 11.0 48.8
41 Atlas 3.80 x64 : 2731.8 2725 42.4 40.1 10.0 2796.5 11.1 54.8
42 Tornado 5.0 SSE4 x64 : 2727.0 2300 45.5 38.9 10.8 2764.0 10.7 46.0
43 Gaviota 1.0 AVX x64 : 2726.3 2925 40.1 37.9 9.6 2810.7 11.3 58.8
44 MinkoChess 1.3 JA POP x64 : 2724.1 2475 43.5 43.4 10.4 2777.9 11.0 49.8
45 Arasan 18.1 POP x64 : 2723.6 2500 43.0 39.6 10.9 2784.1 10.9 50.0
46 Dirty 03NOV2015 POP x64 : 2722.3 2074 47.9 43.2 10.9 2737.8 10.6 41.8
47 Bobcat 7.1 x64 : 2713.6 2075 46.7 44.4 11.1 2738.1 10.6 41.8
48 EXchess 7.71b x64 : 2711.6 2125 44.1 41.1 11.5 2761.6 11.0 42.8
49 Rodent 1.7 Build 1 POP x64 : 2707.9 2050 45.7 43.9 11.3 2739.4 10.5 41.0
50 Murka 3 x64 : 2707.8 2125 44.6 46.4 11.2 2748.4 10.7 42.8
51 Nemo 1.01 Beta POP x64 : 2707.7 2375 42.3 42.1 10.4 2768.4 10.9 47.8
52 Pedone 1.2 BMI2 x64 : 2705.2 2450 41.0 44.0 10.2 2778.9 10.9 49.0
53 DisasterArea 1.54 x64 : 2682.7 1950 42.3 45.3 11.5 2740.3 10.6 39.0
54 GNU Chess5 5.60 x64 : 2677.7 2125 41.2 41.1 11.1 2742.9 10.6 42.8
55 Scorpio 2.77 JA POP x64 : 2675.5 2225 39.9 40.4 11.1 2751.6 10.7 44.8
56 Glaurung 2.2 JA x64 : 2659.8 2175 38.3 42.9 11.1 2747.1 10.7 43.8
57 Rhetoric 1.4.3 POP x64 : 2650.9 2075 38.0 41.3 11.6 2739.6 10.6 41.8
58 The Baron 3.29 x64 : 2644.6 2225 35.8 38.0 11.2 2752.3 10.7 44.8
59 Octochess r5190 SSE4 x64 : 2637.8 2100 36.0 41.1 11.4 2743.3 10.5 42.0
60 BugChess2 1.9 POP x64 : 2618.5 1975 34.5 37.5 12.2 2734.2 10.6 39.8
61 Frenzee 3.5.19 x64 : 2609.5 1925 33.7 35.0 12.4 2731.6 10.6 38.8
White advantage = 36.00 +/- 1.02
Draw rate (equal opponents) = 47.68 % +/- 0.21
I discussed this topic with Jesus in another thread. I would like a GUI for such a simulation.
With 60 engines (59 opponents each) there are 1,770 possible pairings.
Should be easy ... we need a txt file with the performance results for each of these 1,770 possible pairings. Then the GUI can look into the txt file ... if I want the information ... give me a rating list with the best and weakest Gaviota results and 20, 21 or 22 opponents (whatever I want). With such a strong database I can simulate any rating list, and with such a GUI / tool we can estimate the correct error bar.
In other words, we can make the Elo calculation much stronger.
Best
Frank
If you want to test whether some opponents perform very badly compared to others, you would now have to re-run everything and see whether the top 20 are still the top 20 and the bottom 20 are still the bottom 20. I bet they will all be mixed up. This may be a lot of work, but you could do it with the data you already have: split the database in half (randomly) and process both halves. You may see that the worst 20 in one half are not the same in the other.
Miguel
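Miguel's split-half check can be sketched like this. The game records here are hypothetical (engine, points) pairs standing in for a real PGN database, and the helper names are mine:

```python
import random
from collections import defaultdict

random.seed(2)

# hypothetical result records: (engine, points scored) per game
games = [(f"engine{i % 60}", random.choice([0.0, 0.5, 1.0]))
         for i in range(20000)]

random.shuffle(games)                 # split the database randomly in half
half_a, half_b = games[:10000], games[10000:]

def bottom20(half):
    perf = defaultdict(list)
    for engine, pts in half:
        perf[engine].append(pts)
    ranked = sorted(perf, key=lambda e: sum(perf[e]) / len(perf[e]))
    return set(ranked[:20])           # the 20 weakest performers

overlap = len(bottom20(half_a) & bottom20(half_b))
print(overlap)    # with purely random results, far from a full 20/20 match
```

With real games (where true strength differences exist) the overlap would be larger, but rarely perfect, which is exactly Miguel's point.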
Re: Why the errorbar is wrong ... simple example!
Hi Miguel,
I will do that!
But not yet. The database isn't ready for it. I have 71,700 of the 85,500 games I need. I don't want to interrupt the ongoing test runs with the current engine updates. In maybe 4-6 months I should have the missing games.
If you have time for my bad English, please have a look at the message I wrote to Bob.
My plan is to create such a table from the database I am producing.
For me the important main information for a possible tolerance output:
With 26 opponents you need 50 games per match = 1,300 games
And if you want to create an equally stable rating with 14 engines, you need 4,000 games.
Such a result I want to see, at the end of the day, in the tolerance information. I think we can calculate it with the example I gave in the message to Bob.
Best
Frank
But with the example I gave you can see that the Elo information we produce is often pure randomness.
Example:
CEGT Elo from Fizbo 1.6.
For me it is absolutely clear why Fizbo 1.6 has 30 Elo more there than in my test ... the opponents! We can't compare Elo across different rating lists if they use different opponents ... or better ... with more opponents the lists become more comparable.
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Why the errorbar is wrong ... simple example!
You are missing the point. The real Elo can NOT lie outside the error bar 95% of the time; 5% of the time it can. But the Elo number is ONLY applicable to the pool of opponents and games being used. There is no way to "connect" them to other pools of opponents or games, other than by having inter-play between the groups, which helps to normalize the Elo for the programs that play in both groups.

Frank Quisinsky wrote:
Hi Bob,
thanks for your message!
I know that, Bob. I have read the book by Karl-Heinz Glenz (as one example) several times.
The error bar can't be a practical tool for chess; it should have no place here. The statistical standard is useless to me: what can I do with it, if with any of my test runs I can produce examples like the one I gave in my first posting?
75 Elo difference versus the error bar, and this case is not isolated.
---
Giving information about tolerance is nice.
Table after Prof. Elo, two examples:
5 games: standard deviation in points 1.12, rating 126.0, probable error 0.76, error in rating 85.0, confidence 57
100 games: standard deviation in points 5.00, rating 28.3, probable error 3.37, error in rating 19.1, confidence 99.9
I believe this is where to look for the problem. Such a table is what I want for games in combination with opponents:
100 games, 5 opponents
100 games, 10 opponents
100 games, 20 opponents
100 games, 30 opponents
Again, with a strong database we can simulate it. All Prof. Elo did was collect game material. The problem Prof. Elo had: he did not have enough of it between the same players.
But we have it, and we can produce it in computer chess. Computer chess can help in trying to build a better rating system. Maybe a rating system for computer chess only.
I say ... tolerance information is great!
I say ... we can try to produce better information than the error bar.
Displaying "error" in a rating list of chess programs is like displaying a picture of my grandmother in the Elo calculation. And believe me, chess players would have much more fun seeing how my grandmother looked in the past. That is much more interesting.
We have to search for and produce such tolerance information from the combination of the number of games and the number of opponents.
Best
Frank
For your example, the Elo is going to be valid, with the same error bar, for any group you pick. And the Elos might well be different for each group as well. Unfortunately, one will not be more accurate than the other. That being said, when you want to compare two programs in terms of Elo, the more common opponents they have, the better the accuracy of the final Elo numbers, since they are "coupled" by common opponents.
One can easily play group A and then group B, and use BayesElo to compute the rating for each group. Then combine and compare the resulting Elos. I did a bunch of this when I first started cluster testing, and had several long conversations with Rémi when analyzing the results...
-
- Posts: 1535
- Joined: Sun Oct 25, 2009 2:30 am
Re: Why the errorbar is wrong ... simple example!
If the number of games is the only thing you need to determine the error bar, how many do you need to get it below 1?

bob wrote:
The real elo can NOT lie outside the error bar 95% of the time.
-
- Posts: 1969
- Joined: Wed Jul 13, 2011 9:04 pm
- Location: Madrid, Spain.
Re: Why the error bar is wrong... simple example!
Hello Juan:
In any case, you need lots of games.
Ozymandias wrote:
If the number of games is the only thing you need to determine the error bar, how many do you need to get it below 1?

Using a normal distribution as an approximation, I would say the following:
Code:
1 Elo gain means a score of 1/2 + ln(10)/1600 (just using Taylor series around score = 1/2).
Explanation of ln(10)/1600 here:
http://talkchess.com/forum/viewtopic.php?topic_view=threads&p=470965&t=44100
sigma = (score - 1/2)/z
Where z defines the (confidence level) = erf[z/sqrt(2)]
In the worst case (no draws):
sigma = sqrt[score*(1 - score)/games] ~ 0.5/sqrt(games)
So:
sigma = ln(10)/[1600*z] ~ 0.5/sqrt(games)
games ~ 0.25*[1600*z/ln(10)]² = [800*z/ln(10)]²
For a confidence level of 95%, z ~ 1.96 and games ~ 463725
You need more than 463.7k games in order to get error bars < 1 Elo with 95% confidence.
If you include draws, you can multiply the previous result by (1 - draw_ratio).
For example, if there are 60% of draws:
463725*(1 - 0.6) = 185490 ~ 185.5k games.
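The arithmetic in the derivation above is easy to check numerically; a minimal sketch, assuming z = 1.96 for two-sided 95% confidence:

```python
import math

z = 1.96                                  # two-sided 95% confidence
games = (800.0 * z / math.log(10)) ** 2   # games ~ [800*z/ln(10)]^2
print(round(games))                       # 463725

# a 60% draw ratio multiplies the requirement by (1 - 0.6)
print(round(games * (1 - 0.6)))           # 185490
```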
Regards from Spain.
Ajedrecista.
Re: Why the errorbar is wrong ... simple example!
Hello Bob,
thanks again for your time and explanations / hints!
I want to be able to give the following information:
With 26 opponents ... 1,400 games
With 24 opponents ... 1,600 games
...
for the same stable Elo!
The problem is ...
The error bar looks at the number of games only! With the result that I get the same output no matter how many opponents I use.
But with more opponents, fewer games are necessary (again, for the same stable rating).
The Elo calculation programs cannot give us this information. Maybe it should have a different name than "error"; "opponent stability factor" could be the name.
You wrote:
That being said, when you want to compare two programs in terms of Elo, the more common opponents they have, the better the accuracy of the final Elo numbers, since they are "coupled" by common opponents.

No doubt about it ...
The 5% you mention is another topic.
Randomly produced? The chance can be more or less than 5%.
I push hard on this topic because I have my reasons.
"Producing a strong result with the order to save electricity and time"
I think that with 60 opponents I need only 20 games per pairing for the same stable result.
= 1,770 pairings (60 engines) x 20 games = 35,400 games (each one vs. each one).
And that would produce a stronger, more accurate rating than far more games against fewer opponents.
That's what I have in mind.
I want to know this very exactly!
Best
Frank
Re: Why the error bar is wrong... simple example!
So, for a database containing 40-50% drawn games, we'd need a number of games closer to 185.5k than to 463.7k in order to achieve a sub-1 error bar, correct?

Ajedrecista wrote:
Using a normal distribution as an approximation, I would say the following: [...] In any case, you need lots of games.
-
- Posts: 1969
- Joined: Wed Jul 13, 2011 9:04 pm
- Location: Madrid, Spain.
Re: Why the error bar is wrong... simple example!
Hi again:
Well, in this case the number of games would be 463725*(1 - 0.4) ~ 278.2k games, or 463725*(1 - 0.5) ~ 231.9k games... for each engine! Not the sum of games of all the engines. It is difficult, isn't it?

Ozymandias wrote:
So, for a database containing 40-50% drawn games, we'd need a number of games closer to 185.5k than to 463.7k in order to achieve a sub-1 error bar, correct?
Just keep an eye on the current state of Frank's rating list: a total of 180250 games, but the error bars of the engines are around ±6.3 to ±14.9 (depending mainly on the number of games of each engine), nowhere near ±1.
Regards from Spain.
Ajedrecista.
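Inverting the same normal approximation gives the error bar implied by a given number of games; a sketch evaluated near a 50% score (the helper name is mine, not from any rating tool):

```python
import math

def error_bar(games, draw_ratio=0.0, z=1.96):
    # standard deviation of the score fraction near score = 1/2
    sigma = 0.5 * math.sqrt((1.0 - draw_ratio) / games)
    # convert a score deviation to Elo: 1 Elo ~ ln(10)/1600 score points
    return z * sigma * 1600.0 / math.log(10)

print(round(error_bar(2000, draw_ratio=0.40), 1))  # ~11.8 Elo
print(round(error_bar(463725), 2))                 # ~1.0 Elo, as derived
```

The first figure is of the same order as the double-digit error bars in Frank's list for engines with roughly 2,000 games and a ~40% draw rate.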
-
- Posts: 1535
- Joined: Sun Oct 25, 2009 2:30 am
Re: Why the error bar is wrong... simple example!
Not really, I'm well over 250k games per engine. In this case, there shouldn't be any rating oscillation over 1 Elo point for more than 5% of the engines, correct?

Ajedrecista wrote:
Hi again:
Well, in this case the number of games would be 463725*(1 - 0.4) ~ 278.2k games or 463725*(1 - 0.5) = 231.9k games... each engine! Not the sum of games of all the engines. It is difficult, isn't it?Ozymandias wrote:So, for a database containing a 40-50% of drawn games, we'd need a number of games closer to 185.5k than to 463.7k, in order to achieve a sub 1 error bar, correct?