Why the errorbar is wrong ... simple example!

bob · Post by **bob** » Wed Feb 24, 2016 9:26 pm

Ozymandias wrote:
bob wrote:This is an "independent trials" statistical analysis. If you just stuff the same game into the mix 100K times, it is pretty obvious that Elo calculations will see the error bar drop to the 1-2 range. But it is also pretty obvious that the result will be wrong, because that would not be 100K independent trials.

This means that duplicate games are NFG. Or if you use starting positions, and some of them are pre-determined wins for black or white, or draws, then you have fewer independent trials, and while a program like BayesElo will give you a small error bar, it will be completely wrong.
The first part is covered with more than 8 million unique starting positions. More than 5 million already tested, about 3 million left.

The second part could obviously be a problem for engine rating, because about 5% of the games finish before 10 ply of the starting position. But that's exactly what I'm trying to filter out, bad opening lines. The fact that I'm not getting as accurate a rating as I could, for engines, isn't an awful problem, because I only need to know if a new one is clearly (10 ELO) better or worse. It'd be nice to have finer grain, but that's it.

bob wrote:You are doing something badly wrong somewhere. 2.5 M games should not have a +/- 4 Elo error bar by any known method of calculation I know of. The more common problem is that the REAL error bar is larger than the reported error bar because of duplicate games or openings...
As I said, I don't even run simulations under Ordo, to find out the error bar, because I'm going to find out what the real one is anyway (about +/- 4 for the minimum 2.5 mill).

As an example, I'm looking at the last two updates, where the addition to the roster is SugaR 2.0. After the initial 830k games, exceptionally low, it got a rating of XX53. The subsequent burst of the usual 2.5 mill, where it performed at XX49, brought the current rating to XX50, after 3.3 million games. That translates to a 3 ELO point drop after the initial run.
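As a rough check on the numbers quoted above, pooling two runs can be approximated by a game-weighted average of their performance ratings. This is only a sketch under assumptions: a rating tool such as Ordo fits all games jointly rather than averaging, the name `pooled_rating` is made up for illustration, and only the last two digits of the XX-prefixed ratings are used.

```python
def pooled_rating(runs):
    """Game-weighted average of (games, rating) pairs.

    Approximation only: real rating tools fit all games jointly.
    """
    total_games = sum(games for games, _ in runs)
    return sum(games * rating for games, rating in runs) / total_games

# Last two digits of the ratings quoted in the post (XX53 and XX49).
runs = [(830_000, 53), (2_500_000, 49)]
combined = pooled_rating(runs)

print(f"pooled rating (last two digits): {combined:.2f}")
print(f"drop from the initial run: {53 - round(combined)} Elo")
```

Under that approximation the pooled figure rounds to 50, which is consistent with the XX50 and the 3 point drop reported.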

Can you define "real error bar"? To compute that you need two things. (1) an INFINITE number of games so you know the absolute truth about the ratings and (2) a large sample to compare. Since you don't have (1) the "real error bar" is meaningless.

Now, to sampling theory. See the central limit theorem from probability and statistics first. If you take a random sample from a large population, the mean of that sample will be within some error bar of the mean of the entire population. And in fact, when you take many such random samples from that population, the samples will be normally distributed about the population mean.
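To make the sampling argument concrete, here is a small simulation sketch. The win and draw probabilities, the match length, and the number of matches are all assumptions picked for illustration, and `match_scores` is a hypothetical helper. It plays many independent matches and compares the spread of the match scores with what the central limit theorem predicts.

```python
import random
import statistics

def match_scores(p_win, p_draw, games_per_match, matches, seed=1):
    """Simulate independent matches and return each match's mean score."""
    rng = random.Random(seed)
    scores = []
    for _ in range(matches):
        points = 0.0
        for _ in range(games_per_match):
            r = rng.random()
            if r < p_win:
                points += 1.0          # win
            elif r < p_win + p_draw:
                points += 0.5          # draw
        scores.append(points / games_per_match)
    return scores

P_WIN, P_DRAW = 0.35, 0.40             # assumed "true" probabilities
GAMES, MATCHES = 400, 2000             # assumed sample sizes

true_mean = P_WIN + 0.5 * P_DRAW
per_game_var = P_WIN + 0.25 * P_DRAW - true_mean ** 2
predicted_sigma = (per_game_var / GAMES) ** 0.5

scores = match_scores(P_WIN, P_DRAW, GAMES, MATCHES)
observed_sigma = statistics.stdev(scores)
inside = sum(abs(s - true_mean) <= 2 * predicted_sigma for s in scores) / MATCHES

print(f"mean of match scores: {statistics.mean(scores):.4f} (true {true_mean:.4f})")
print(f"sigma observed {observed_sigma:.4f}, predicted {predicted_sigma:.4f}")
print(f"fraction of matches within two sigma: {inside:.3f}")
```

With independent games, roughly 95% of the simulated matches should land within two sigma of the true mean.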

You have two values of interest. Variance from mean (how far do the samples vary from the total mean) and confidence interval (how confident are you that the mean of your samples is within some fixed error bar of the actual mean.)

We generally use 95% confidence, which means that 95% of the matches we play lie within the two-sigma confidence interval as given by basic probability. That confidence interval gets larger with fewer games, smaller with more games. And nothing else matters. However, if you look at the central limit theorem closely, there are caveats:

The most important is that the observations (samples, games, whatever) have to be independent. I.e., if you only get heads because the coin has a bias, then the samples are not independent. If the samples are from different populations (programs a, b and c) vs (d, e and f) then it doesn't hold.

With chess we have multiple issues to deal with. (1) different games between any two opponents. Repeated games offer no new information, but they artificially reduce the confidence interval, since the calculation assumes independence that is not there. (2) two programs where one is so much stronger than the other that it wins every game. All you can predict from that is that the stronger program is strong enough to win every game, but that tells you nothing about the stronger program against other programs that are in that population but not in that sample.
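The effect of repeated games can be sketched numerically. The code below is not how BayesElo or Ordo compute their intervals. It is a plain normal-approximation error bar on the score, mapped through the logistic Elo formula, and the function names and the 400/300/300 result are assumptions for illustration. Feeding it the same games 100 times over leaves the Elo estimate unchanged but shrinks the reported error bar about tenfold.

```python
import math

def elo_from_score(score):
    """Elo difference implied by a mean score strictly between 0 and 1."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_with_error_bar(wins, draws, losses, z=1.96):
    """Return (elo, half-width of the ~95% interval), assuming independent games."""
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    variance = (wins * (1.0 - score) ** 2
                + draws * (0.5 - score) ** 2
                + losses * (0.0 - score) ** 2) / n
    half_width = z * math.sqrt(variance / n)
    low = elo_from_score(score - half_width)
    high = elo_from_score(score + half_width)
    return elo_from_score(score), (high - low) / 2.0

# 1000 genuinely different games (made-up result).
honest_elo, honest_bar = elo_with_error_bar(400, 300, 300)

# The same 1000 games stuffed into the mix 100 times.
copies = 100
padded_elo, padded_bar = elo_with_error_bar(400 * copies, 300 * copies, 300 * copies)

print(f"1000 games:             {honest_elo:+.1f} +/- {honest_bar:.1f}")
print(f"same games x {copies}:       {padded_elo:+.1f} +/- {padded_bar:.1f}")
```

No information was added by the copies, so the smaller bar is an artifact of the independence assumption, not a better measurement.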

There is no "actual error bar" when it is defined as "the distance between the mean of the observations and the mean of the totality of all games", since the latter is unknown.

This is probability, not a discrete math problem with a closed solution with one final answer.

Ozymandias · Post by **Ozymandias** » Wed Feb 24, 2016 11:24 pm

michiguel wrote:If my guess is correct, it would mean that certain engines may perform slightly better against some specific ones and slightly worse against some others.

That could be the main reason, bad opening lines being a lesser one?

Ozymandias · Post by **Ozymandias** » Wed Feb 24, 2016 11:45 pm

bob wrote:Can you define "real error bar"? To compute that you need two things. (1) an INFINITE number of games so you know the absolute truth about the ratings and (2) a large sample to compare.

I mean I know how much the ELO will fluctuate, with less than several millions of games played. Always talking about integers, you only need an infinite number of games, if you're looking for an ELO number, with an infinite number of decimal places.

michiguel · Post by **michiguel** » Wed Feb 24, 2016 11:58 pm

Ozymandias wrote:
michiguel wrote:If my guess is correct, it would mean that certain engines may perform slightly better against some specific ones and slightly worse against some others.
That could be the main reason, bad opening lines being a lesser one?

Particularly if some know how to exploit them and some not. That would create an anomaly (results are not completely independent). I believe it should be small, but if you have so many games and the statistical error is so low... maybe you are detecting it.

Miguel

bob · Post by **bob** » Wed Feb 24, 2016 11:59 pm

Frank Quisinsky wrote:Hi Miguel,

I will do that !!

But not yet. The database isn't ready for it. I have 71.700 of the 85.500 games I need. I will not take a break during the test runs with current engine updates. In maybe 4-6 months I should have the missing games.

If you have time for my bad English, please have a look in the message I had written to Bob.

I have in my brain to create such a table with the database I produce.

For me important as main information for a possible tolerance output:

With 26 opponents you need 50 games per match = 1.300 games
And if you like to create such a stable rating with 14 engines you need 4.000 games.

Such a result I will see end of the day in the tolerance information. I think we can calculate it with the example I gave in the message to Bob.

Best
Frank

But with the example I gave you can see that the Elo information we produce is often purely random.

Example:
CEGT Elo from Fizbo 1.6.
For me it is absolutely clear why Fizbo 1.6 has 30 Elo more than in my test ... the opponents! We can't compare Elo from different rating lists if we used other opponents ... or better ... with more opponents there is more to compare.

This is what I have been telling you for a while now, in fact. Elo only predicts expected outcomes between players you have already played. If those players have played other players, there is a secondary coupling so that Elo should predict how you would do against that pool pretty well also. But it is not an absolute. It is a prediction based on a tiny sample (just the games the program you are measuring has played) of the total population of all games played.

Fewer games against a greater number of players only improves the coupling with opponents you have not played. But it doesn't affect the error bar at all, since that is purely a function of sample size. Ideally everybody should play everybody. But it won't ever happen. So we accept statistical variability and move on.

bob · Post by **bob** » Thu Feb 25, 2016 12:04 am

Ozymandias wrote:
bob wrote:Can you define "real error bar"? To compute that you need two things. (1) an INFINITE number of games so you know the absolute truth about the ratings and (2) a large sample to compare.
I mean I know how much the ELO will fluctuate, with less than several millions of games played. Always talking about integers, you only need an infinite number of games, if you're looking for an ELO number, with an infinite number of decimal places.

If you want +/- 1 Elo accuracy each program is going to have to play well over 100K games apiece to make that happen. If you have 1000 players, that is 100 million games as a very low lower bound...
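A back-of-the-envelope sketch of where a figure like "well over 100K games" comes from. It assumes two equally strong opponents, independent games, a 95% interval (z = 1.96), and a given draw ratio. `games_for_error_bar` is a hypothetical helper, not part of any rating tool.

```python
import math

def games_for_error_bar(margin_elo, draw_ratio=0.0, z=1.96):
    """Games needed for a +/- margin_elo interval between equal opponents."""
    # Per-game standard deviation of the score when wins and losses are equally likely.
    per_game_sd = math.sqrt(0.25 * (1.0 - draw_ratio))
    # Slope of the logistic Elo curve at a 50% score, in Elo per unit of score.
    elo_per_score = 1600.0 / math.log(10)
    return math.ceil((z * per_game_sd * elo_per_score / margin_elo) ** 2)

for draws in (0.0, 0.3, 0.5):
    print(f"draw ratio {draws:.1f}: {games_for_error_bar(1.0, draws):,} games for +/- 1 Elo")
```

Because the error bar shrinks with the square root of the number of games, halving it costs four times as many games.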

There really is no way to "cheat" the system...

Frank Quisinsky · Post by **Frank Quisinsky** » Thu Feb 25, 2016 7:46 am

Hi Bob,

OK, I understand all what you wrote and I have an order. At first the database at second the experiments with different statistics I can produce.

Maybe I find out a bit in questions how many opponents for a better, more stable rating, are necessary. Others can used the database later for own experiments if they like.

But it's clear ...
Once the database is ready ... if we were able to switch the complete group of 60 engines, we would again have another result. But ... I am sure ... with more and more opponents the final result will come nearer and nearer to the _reality_. Can't say what the reality is ... uncertain ... but I can suppose it!

The one problem I believe I have ...

Example:
First and second place are SF and Komodo with 3.175 Elo, and the last places have 2.600 Elo. That is a 575 Elo difference within the TOP-60. The question is ... should SF, Komodo, and the others in the top places play vs. the last places on the list or not? Normally they should, or the database will not be complete. That could be the weak point in the database because the Elo difference is too high.
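To put a number on that gap, the standard logistic Elo formula gives the expected score for a given rating difference. This is a sketch and `expected_score` is a name chosen here for illustration. The 50-game match length is taken from the figures quoted earlier in the thread.

```python
def expected_score(elo_diff):
    """Expected score of the higher-rated side under the logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

gap = 3175 - 2600
per_game = expected_score(gap)

print(f"{gap} Elo gap: expected score {per_game:.3f} per game")
print(f"over a 50-game match: about {50 * per_game:.1f} points")
```

At roughly 96% expected score, such a pairing comes close to the "wins every game" case, where the games say little beyond the fact that the gap is large.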

Frank Quisinsky · Post by **Frank Quisinsky** » Thu Feb 25, 2016 8:05 am

Hi Bob,

I will give you an example:

Zappa Mexiko II
After 4000 games vs. a lot of opponents in SWCR rating list (40 in 10 with ponder = on on Q9550 hardware) = 2.710 Elo.

Zappa Mexiko II
After 5000 games vs. a lot of other opponents in FCP rating list (40 in 10 with ponder = off on current i7 Intel Software) = 2.750 Elo.

40 Elo difference ...
In SWCR all are playing with ponder.
In FCP all are playing without ponder.

Good to compare because the conditions are about the same (an i7 at 4.0 GHz has around double the power of a Q9550 at 2.86 GHz).

Secret strengths from Zappa or Zappa is weaker with ponder = on or more easy ... a complete group of others opponents ... quantity of games aren't important.

Best
Frank

Frank Quisinsky · Post by **Frank Quisinsky** » Thu Feb 25, 2016 8:19 am

Hi Bob,

all is OK ...
I have worked for many years on the opening book I created for the Shredder Classic GUI, which I am using for testing. Each game is checked for book losses and replayed if one is found. My opening book can play all 500 ECO codes.

I have around 45.000 playable lines in my opening book. For sure, popular openings have the higher priority. I optimize the book after each test-run.

Example:
Games ended with draw under move 20 ... line will be directly deactivated. I have around 500 deactivated draw lines. Can nothing do with such games ... so I wrote for a while ... if 1% of my book have deactivated draw lines ... I have now contempt = 1

I know that I need such a database for the experiments I will do ...

Best
Frank

PS: I am not a fan from ... I am using always the same openings. Because I can nothing do with the database I produced ... later for statistics.

Can be see ...
Have a look here ...
http://www.amateurschach.de/main/_sgbp.htm

Download your Crafty 25.0 DC x64 games and check the opening systems your engine plays with my book. You can check for book losses too ... you will not find any ... OK, maybe 1-2 games I overlooked ... I don't know.

Frank Quisinsky · Post by **Frank Quisinsky** » Thu Feb 25, 2016 9:17 am

Hi Bob,

shortly ...

I have to start six new updates (newer versions, missed in my TOP-60) ... 42 days needed. After this one I have around 77.000 of 88.500 games (60 engines, 1770 pairings) each one vs. each one. For the missing games I need ...

11.500 : 450 per day can be play with my conditions with the hardware I have for it = 25 days.

25 + 42 days = I am ready in around 67 days!
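The schedule arithmetic above can be checked with a few lines. This is only a sketch: the 50 games per pairing comes from the figure quoted earlier in the thread, and the constant names are chosen here for illustration.

```python
import math

ENGINES = 60
GAMES_PER_PAIRING = 50          # figure quoted earlier in the thread
GAMES_AFTER_UPDATES = 77_000    # games on hand after the six updates
GAMES_PER_DAY = 450
DAYS_FOR_UPDATES = 42

pairings = math.comb(ENGINES, 2)              # each engine vs. each other engine
total_games = pairings * GAMES_PER_PAIRING
missing = total_games - GAMES_AFTER_UPDATES
days_for_missing = missing / GAMES_PER_DAY
total_days = DAYS_FOR_UPDATES + days_for_missing

print(f"{pairings} pairings, {total_games} games in the full round robin")
print(f"{missing} missing games = {days_for_missing:.1f} days")
print(f"total: about {total_days:.1f} days")
```

The division gives a little over 25 days for the missing games, so the 67-day total holds if the fraction is rounded down, and is 68 days if it is rounded up.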

OK, 67 days each new engine update will be added on the ToDo list only

But the experiment is really interesting. Furthermore, I think to have such a database will be good for people working on statistics. And of course my FCP Rating List will be better!

I think I should do that.

The only question ...
Should SF, Komodo and Co really play vs. the weakest engines with 2.600 Elo or a bit higher or not?

Best
Frank
