IPON ratings calculation

Discussion of anything and everything relating to chess playing software and machines.

Moderator: Ras

IWB
Posts: 1539
Joined: Thu Mar 09, 2006 2:02 pm

Re: IPON ratings calculation

Post by IWB »

Sven Schüle wrote: ...

And AFAIK the final IPON ratings that are published were calculated with BayesElo. ...

Sven
That is correct, and it is even worse: the final rating is done with BayesElo. When I make a test run I enter this BayesElo rating in the classic GUI, which calculates a rating with a fixed Elo for the opponents and the pure Elo formula (which is basically Elostat). So we have a mixture of different rating systems here. AT BEST this method is a hint of where things are heading, nothing more. To get a real rating one has to wait for the new final calculation after the match is finished!
I have to complain a bit about what people read into this, but I feel like Don Quixote fighting against windmills :-) . A few years ago people didn't care about a 20 or 30 Elo difference. Nowadays they argue about an up and down of 5 Elo or less ...! (And the worst part: I think I am partly responsible for it ... !)

Bye
Ingo
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: IPON ratings calculation

Post by Don »

IWB wrote:
Sven Schüle wrote: ...

And AFAIK the final IPON ratings that are published were calculated with BayesElo. ...

Sven
That is correct, and it is even worse: the final rating is done with BayesElo. When I make a test run I enter this BayesElo rating in the classic GUI, which calculates a rating with a fixed Elo for the opponents and the pure Elo formula (which is basically Elostat). So we have a mixture of different rating systems here. AT BEST this method is a hint of where things are heading, nothing more. To get a real rating one has to wait for the new final calculation after the match is finished!
I have to complain a bit about what people read into this, but I feel like Don Quixote fighting against windmills :-) . A few years ago people didn't care about a 20 or 30 Elo difference. Nowadays they argue about an up and down of 5 Elo or less ...! (And the worst part: I think I am partly responsible for it ... !)

Bye
Ingo
I don't see any problem in how you do this.

Essentially, the ratings you are showing are just a rough estimate until the final official rating is produced after the matches.

What happens is that when you add a new player, that player's rating affects the others, so as soon as some games are complete the ratings of the other players are out of date. So I don't just look to see whether Komodo's final rating changes, but also at the relative difference between Komodo and the other top players.

After you rated Komodo 4 its rating only changed by 1 Elo compared to the estimate, but I noticed that it caused a substantial drop in Houdini 2's rating, so even though in your test Komodo 4 only gained 11 Elo over Komodo 3, it closed the gap between Komodo and Houdini by more than that.

Nevertheless, none of the rating lists are very reliable for quantifying small differences between programs such as Critter and Komodo. The error margins are pretty large and would only go down substantially with a significantly larger number of games, which is probably impractical.

However I would appeal to you to double the number of games and get the error margins at least down to single digits. It would be a big improvement in my opinion. It's unsatisfying that Komodo is 2 Elo ahead of Critter while the error margin is close to 15 Elo, essentially making the two programs indistinguishable. About the only thing we can say with much confidence is that under IPON conditions the two programs are "probably" within about 10 Elo of each other.
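For a rough sense of how that error margin scales with the number of games, here is a back-of-envelope sketch in Python. It uses a simple binomial model with an assumed draw rate and the usual conversion of roughly 7 Elo per percentage point of score near 50%; it is only an approximation, not the interval BayesElo actually reports, which also depends on the opponents' own uncertainties and tends to be somewhat wider.

import math

def elo_error_margin(games, draw_ratio=0.35, z=1.96):
    """Approximate 95% error margin (in Elo) of a rating measured over `games`
    games at a score near 50%, assuming the given draw ratio."""
    var = 0.25 - draw_ratio / 4.0          # per-game score variance with draws
    se = math.sqrt(var / games)            # standard error of the mean score
    # Near 50% the Elo curve has slope ln(10)/1600 score per Elo point, so one
    # unit of score corresponds to about 1600/ln(10), i.e. roughly 695 Elo.
    return z * se * 1600.0 / math.log(10.0)

for n in (1200, 2400, 4800, 9600):
    print(n, "games -> about +/-", round(elo_error_margin(n), 1), "Elo")

Under these assumptions you need roughly 4,000 to 5,000 games before the margin drops into single digits, and real rating-list margins tend to be a bit wider than this simple model suggests, which is consistent with the point that a 2 Elo gap stays well inside the noise.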
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Not realistic!

Post by bob »

Adam Hair wrote:
IWB wrote:Hi Larry,
lkaufman wrote:... We did not design our time control for ponder on games, and I'm thinking this was a big mistake. This probably hurts our results in your testing and IPON. Maybe we can correct this. ...
For a very simple reason I am a bit surprised by this statement. Try to think of it the other way around! Basically, engines are used for only two things:

1. Analysis (mainly)
2. To play against (OTB or on a server)

In case 2 the question is: who is playing ponder OFF? The only people doing this are a few rating lists (for historic reasons, and now they don't want to throw away the games). Everyone else (!) is always playing ponder ON (and is losing, therefore a good method of limiting playing strength is important as well)! So good ponder ON time management is much more important than the ponder OFF case!
I consider Ponder off as completely artificial and useless, sorry. You are right about the number of games, but that is an argument coming from times when there were only single CPUs. Nowadays it is possible to play a sufficient number of games with ponder on. (I admit that engine development with very short time controls is more practicable with POFF, but that has nothing to do with real game play. The other devices where ponder off might be used are smartphones or other mobile devices, to save energy, but there, against humans, the timing of those ponder-off games is less important ...)

Again, any real chess game played by humans, in a computer WC or on a server, is ponder ON. I consider this real chess, and ponder OFF some kind of subgroup for special purposes.

Regards and a few more "happy holidays"
Ingo

EDIT: If you start to make development decisions to please the rating lists and not the users, this will backfire! Someone will come up with a new, better method of testing (it has already happened and will happen again, IMHO) and then you have to adapt again, and again ...
If your total focus is testing the top engines and you are fortunate enough to have multiple computers available for testing, then why not use ponder on? It is easy enough when you only have to stay on top of a couple of dozen engines. If you can afford to do it, then do it.

However, when you try to maintain multiple lists containing 200 to 300 engines (and adding more all of the time), ponder off makes a lot of sense. In addition, when you compare the results of ponder-off testing with ponder-on testing, it is hard to discern much difference. Given the differences in focus between IPON and CEGT/CCRL, and the lack of truly demonstrative proof that ponder off is less accurate in practice than ponder on, I find the statement "I consider Ponder off as completely artificial and useless, sorry." to be off the mark.
This goes back to an old argument that comes up from time to time. I personally believe ponder=off testing is wrong, for the same reason Ingo gave.

I subscribe to the philosophy of "test like you plan on running". If you are testing yourself and only want to measure playing skill improvements, then ponder=off is perfectly OK. Might not give you the same final Elo number as with ponder=on, but if something helps with PON, it should help with POFF, unless you are changing the basic pondering code (or time allocation is different with PON and POFF).

I'm not going to twiddle and tune my wife's car when on Saturday night my son and I are taking his Mustang to the drag strip. Nor would I "practice" with a nitrous system turned off, then race with it on.
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: IPON ratings calculation

Post by Laskos »

Sven Schüle wrote:
lkaufman wrote:
Sven Schüle wrote:
lkaufman wrote:
Michel wrote:
Albert Silver wrote:I can only assume it is I who lack the proper understanding of how the ratings are calculated, but watching the IPON results of Critter 1.4, I began to wonder why its performance was 2978 after 2106 games. I took the 22 performances, added them up, and then divided them by 22 and came up with 3000.59, so why is the total performance 2978?
The calculation method of BayesElo is explained here:

http://remi.coulom.free.fr/Bayesian-Elo/#theory

The Elo values are the result of a maximum-likelihood calculation seeded with a prior (AFAICS this can only be theoretically justified in a Bayesian setting).

The actual algorithm is derived from this paper

http://www.stat.psu.edu/~dhunter/papers/bt.pdf
I think the "prior" may be the problem; it appears to have way too much weight. If an engine performs 3000 against every opponent in over 2000 games, it should get a rating very close to 3000, maybe 2999. But apparently the prior gets way too much weight, because I believe such an engine on the IPON list would get only around 2975.
Part of the problem is that "match performance" is an almost irrelevant number, and also that you can't take the arithmetic average of it because of the non-linearity of the percentage expectancy curve. See also the other thread where this has been discussed (the link was provided above).

Sven
I know all about the averaging problem, but if an engine had a 3000 performance against every other engine, it implies that it would neither lose nor gain rating points if the games were all rated at once starting from a 3000 rating. So that should be the performance rating if thousands of games have been played so that the prior has no significance. I believe that EloStat would give a 3000 rating in this case. EloStat is wrong when the performance ratings differ, but if they are the same it should be right, I think.
The basic error is to look at "match performance" numbers at all, as if they made any sense. A "match performance", in the world of chess engine ratings, is inherently misleading and has no value in my opinion. Total ratings of engines are derived from a whole pool of game results and have zero relation to "match performances"; the latter are at best a by-product of the overall ratings that have already been calculated at that point, and you can't draw any conclusions from these numbers.

So you can at most blame the way a match performance is calculated, and the fact that it is published at all.
Isn't this "derived from a whole pool of games" procedure fundamentally flawed? When I saw it in some rating algorithms, I thought immediately that mixing up all the games and matches is as wrong, naive and simple as one can do. What is important in statistics is not the number of games or matches or anything mixed up there, is the statistical weight of a quantity. Maybe I am wrong, and maybe I will try to think about my own algorithm, first I have to understand what "rating" is supposed to mean.

For example, if Elo curve had been linear, then averaging wouldn't be wrong. Elo curve is linear on infinitesimal intervals, so locally, your statement "and you can't draw any conclusions from these numbers" is wrong.

Kai

Human player ratings are a totally different animal, for the very reason that the rating principle is completely different. Here you have current ratings for each player; then the next tournament event appears and affects the current ratings of its participants, so ratings evolve over time, and you have an incremental rating process where the most recent events have the highest weight while the oldest events fade out. Calculating match performance makes some sense here. Engine rating is done at once for the whole pool of games, though, so a "match performance" in this case can only be derived the way a rating of a human player is derived from a set of games against unrated opponents.


Regarding your remarks about EloStat and "prior", either I have misunderstood you or there is some inconsistency in your statement. The program that uses a "prior" is BayesElo, not EloStat. And AFAIK the final IPON ratings that are published were calculated with BayesElo. But nevertheless I believe that the "prior" has little to no impact on the final ratings when considering the number of games involved here.

Sven
Last edited by Laskos on Fri Dec 30, 2011 5:02 pm, edited 1 time in total.
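A large part of the 3000.59-versus-2978 puzzle can be reproduced without any prior at all, simply because the expectancy curve is non-linear: the average of per-opponent match performances is generally not the single rating that best explains the pooled results. Below is a toy Python sketch with made-up opponent ratings and scores; it treats the opponents' ratings as fixed, Elostat-style, whereas BayesElo fits all engines jointly with a draw model and a small prior along the lines of the Hunter MM paper linked above. If that prior really is only worth a handful of virtual games, its pull on a sample of 2000+ games should be a small fraction of an Elo point, which would leave the non-linearity as the main suspect.

import math

def expected(diff):
    """Elo expected score for a rating advantage of `diff` points."""
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

def perf(opp_elo, score):
    """Classic per-match performance rating (inverse of the expectancy curve)."""
    score = min(max(score, 1e-6), 1.0 - 1e-6)   # avoid infinities at 0% / 100%
    return opp_elo + 400.0 * math.log10(score / (1.0 - score))

# Hypothetical opponents (fixed ratings) and the scores achieved against each.
opponents = [2700, 2800, 2900, 3000, 3100]
scores    = [0.95, 0.85, 0.70, 0.50, 0.35]

avg_of_perfs = sum(perf(o, s) for o, s in zip(opponents, scores)) / len(opponents)

# Single rating R whose total expected score against the whole pool matches the
# total actual score (found by bisection).
target = sum(scores)
lo, hi = 2000.0, 4000.0
for _ in range(60):
    mid = (lo + hi) / 2.0
    if sum(expected(mid - o) for o in opponents) < target:
        lo = mid
    else:
        hi = mid

print("average of match performances:", round(avg_of_perfs, 1))    # ~3070
print("pooled performance rating:    ", round((lo + hi) / 2.0, 1))  # ~3042

With these invented numbers the two figures differ by roughly 28 Elo, in the same direction as the Critter 1.4 observation; with a different spread of opponents and scores the gap can go either way.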
User avatar
Houdini
Posts: 1471
Joined: Tue Mar 16, 2010 12:00 am

Re: IPON ratings calculation

Post by Houdini »

Don wrote:After you rated Komodo 4 its rating only changed by 1 Elo compared to the estimate, but I noticed that it caused a substantial drop in Houdini 2's rating, so even though in your test Komodo 4 only gained 11 Elo over Komodo 3, it closed the gap between Komodo and Houdini by more than that.
Congratulations on closing down the gap to 40 points.
Don wrote:However I would appeal to you to double the number of games and get the error margins at least down to single digits. It would be a big improvement in my opinion.
While it would reduce the random error, the systematic error from using those particular 50 opening positions on that particular hardware would still remain, and is probably at least 20 Elo.
I think IPON is fine as it is; there's little point in bringing down the random error to below 15 points when the systematic error is probably larger than that.

Robert
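For what it's worth, the trade-off Robert describes can be put into rough numbers: if the random and systematic errors are approximately independent, they combine roughly in quadrature, so shrinking the random part well below the systematic part changes the total very little. A quick Python sketch using the 15 and 20 Elo figures quoted above (these are the thread's estimates, not measurements):

import math

def total_error(random_err, systematic_err):
    """Combine two independent error sources in quadrature."""
    return math.hypot(random_err, systematic_err)

print(round(total_error(15, 20), 1))   # about 25.0 Elo
print(round(total_error(8, 20), 1))    # about 21.5 Elo: halving the random error
                                       # trims only a few points off the total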
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Not realistic!

Post by Don »

bob wrote: I subscribe to the philosophy of "test like you plan on running". If you are testing yourself and only want to measure playing skill improvements, then ponder=off is perfectly OK. Might not give you the same final Elo number as with ponder=on, but if something helps with PON, it should help with POFF, unless you are changing the basic pondering code (or time allocation is different with PON and POFF).

I'm not going to twiddle and tune my wife's car when on Saturday night my son and I are taking his Mustang to the drag strip. Nor would I "practice" with a nitrous system turned off, then race with it on.
I agree with you in theory but not in practice. In real tournaments a human operates the machine; I'm sure you don't test your program by manually entering the moves, do you? Of course not, because it's not workable in practice. In theory that is how you play, but in practice you would never get 100,000 games that way.

So I'm afraid that you have to pick and choose which concessions you make for the sake of practicality. We try to pick them in order of how much we believe they are relevant.

Here is a list of concessions that most of us make. There are probably a few exceptions, such as your case where you have a major hardware testing infrastructure, but you probably make some of the same concessions too:

1. Time control A
2. Time control B
3. Ponder vs No ponder
4. Book
5. Hardware
6. Opponents

One at a time:

1. There are two issues with time control. The first is testing with the same style of time control, keeping the same ratio between time and increment (if used) or moves. For example, 40/2 classic should scale to 20/2 if you want to speed up the test. If you want to play at 5 minutes + 5 seconds then you should test at 1 minute + 1 second, preserving the same ratio.

2. The other time control issue is actually playing the exact time control of the tournament you are playing in. If you want Crafty to play well at 40/2 then do you test only at 40/2 ???

3. Ponder vs no ponder. I test with a 6-core i7-980x and it's not much; it's a huge bottleneck for us. Larry has a bit more than I do, but it's still a huge bottleneck. We test with ponder off. If we tested with ponder ON we would have to cut our sample size in half, or double our testing time to get the same number of games. We cannot afford to do this just to be anal retentive about this issue.

See: http://en.wikipedia.org/wiki/Anal_retentiveness

4. Book. Does Crafty use the same book that it will compete with? You would have to in order to follow your principle of testing the same as you will play.

5. Hardware. Does Crafty use the exact same hardware and configuration for your big 20,000-game samples that you intend to compete with? I doubt it.

6. Opponents. When you test Crafty I'm sure you don't test against the same players and versions you will compete with in tournaments. This is not possible anyway since you don't know who will be there and what they will bring and what hardware they will use.

As you see, it is not even CLOSE to possible to "test like you plan on running." I don't mean to be critical about this, but I don't understand why people latch on to what is probably the LEAST important factor in the list above and make it seem like a major blunder, as if there were absolutely no correlation between how a program does with ponder versus without, when all major testing is done with an opening book that does not resemble in any way, shape, or form what a program will use in a serious competition. Which do you think is the greater issue?

Larry and I decided long ago that testing with ponder on, although better in some idealistic sense, is a trade-off in the wrong direction when sample size means so much more.

So if the principle is "test like you plan on running", how would you justify most of these concessions? Do you think ponder is more important than the time control, or using your tournament book, or running the exact same hardware?

When it comes to things like this, a good engineer knows the difference between the lower-order bits and the higher-order bits. The truth of the matter is that hardly anyone has the luxury of "testing like we plan on running", but we have a very good sense of what the trade-offs are. We know that if the program improves, the improvement will probably still show up at a different time control, as long as it's not ridiculously different.

Let me ask you this: if we make an improvement to the evaluation function which shows a definite improvement with no ponder, do you think the result is invalid because we did not test with ponder on? I don't think so .....

You might find this interesting: I'm a lot more anal retentive about stuff like this than Larry is, but compared to you I'm not disciplined at all!
IWB
Posts: 1539
Joined: Thu Mar 09, 2006 2:02 pm

Re: IPON ratings calculation

Post by IWB »

Houdini wrote: ...
I think IPON is fine as it is; there's little point in bringing down the random error to below 15 points when the systematic error is probably larger than that.
Hi Robert,

I agree that the 50 positions are artificial and there is room for improvement (in 2012), but I doubt that the systematic error is that big.
If I compare my results with the rating lists which are using books (which have different problems that might even be bigger), the differences to my list are on average much smaller than 15 Elo ... and yes, for an individual engine my test set might be a bad or a perfect choice. Of course not on purpose, but the possibility is there, no doubt.

Anyhow, there is no 100% result in rating-list statistics, but I try to come close :-)

Bye
Ingo
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: IPON ratings calculation

Post by Don »

Houdini wrote:
Don wrote:After you rated Komodo 4 its rating only changed by 1 Elo compared to the estimate, but I noticed that it caused a substantial drop in Houdini 2's rating, so even though in your test Komodo 4 only gained 11 Elo over Komodo 3, it closed the gap between Komodo and Houdini by more than that.
Congratulations on closing down the gap to 40 points.
Thank you for the encouragement. Does this mean you are now a Komodo fan?
Don wrote:However I would appeal to you to double the number of games and get the error margins at least down to single digits. It would be a big improvement in my opinion.
While it would reduce the random error, the systematic error from using those particular 50 opening positions on that particular hardware would still remain, and is probably at least 20 Elo.
I think IPON is fine as it is; there's little point in bringing down the random error to below 15 points when the systematic error is probably larger than that.
Robert
That's true. However, 2400 games is a ridiculously small sample, and bringing the error margin down a few points would be a step in the right direction, but in reality you need far more games. I assume he would have to double his opening set to accommodate this, however, and that would also be a big step in the right direction, as you point out.

I just recently posted about the priorities people put on various testing factors such as ponder on or off, when I think this is relatively minor compared to other factors like the opening book. There are several popular programs that always do well in tournaments but are not in the same league as the top 5 programs; I think it's because of exceptional opening book preparation. Although these are solid programs in their own right, they would not do well under IPON or any other rating agency's conditions.
User avatar
Houdini
Posts: 1471
Joined: Tue Mar 16, 2010 12:00 am

Re: IPON ratings calculation

Post by Houdini »

IWB wrote: ... and yes, for an individual engine my testset might be a bad or a perfect choice. Of course not on purpose but the possibility is there, no doubt.
Ingo, that's exactly what I mean: your opening positions could very well create a 10 to 20 Elo difference for some engines.
Also, your choice of hardware (Phenom II X6) could easily make a 5 to 10 Elo difference for some engines.

All this means that there's no point in bringing down the random error to 10 Elo or below; it's insignificant when you look at the larger picture.

Robert
IWB
Posts: 1539
Joined: Thu Mar 09, 2006 2:02 pm

Re: IPON ratings calculation

Post by IWB »

Houdini wrote:
IWB wrote: ... and yes, for an individual engine my testset might be a bad or a perfect choice. Of course not on purpose but the possibility is there, no doubt.
Ingo, that's exactly what I mean: your opening positions could very well create a 10 to 20 Elo difference for some engines.
Also, your choice of hardware (Phenom II X6) could easily make a 5 to 10 Elo difference for some engines.

All this means that there's no point in bringing down the random error to 10 Elo or below; it's insignificant when you look at the larger picture.

Robert
While I can't argue about the opening set, I can about the hardware. When I switched to AMD I was concerned as well, but I found out that there was nothing to worry about.

Why:
The engine which suffered the biggest drop in nodes per second was Fritz 12, yet the change in its rating (not to mention ranking) after I switched to the new hardware was non-existent (after a few thousand games it dropped by 1 Elo, but that is more likely the result of a lot of higher-ranked new engines).

So: particularly good or bad opening positions, there is a chance, sure. Hardware: no way! :-)

Bye
Ingo
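For reference on the hardware question, a common rule of thumb is that each doubling of search speed is worth somewhere in the range of 50 to 100 Elo, with the exact figure depending on the engine and shrinking at longer time controls; that is a rule of thumb, not an IPON measurement. The sketch below simply converts a relative nodes-per-second ratio into Elo under that assumption, which is one way to see where a 5 to 10 Elo hardware effect could come from in principle. Whether such a shift actually shows up in a list also depends on how the rest of the pool is affected by the same hardware, which may be part of why no shift was visible for Fritz 12 in practice.

import math

def elo_from_speed_ratio(ratio, elo_per_doubling=60):
    """Elo shift expected if an engine runs at `ratio` times its reference speed,
    assuming a fixed Elo gain per doubling of speed (a rule of thumb)."""
    return elo_per_doubling * math.log2(ratio)

# e.g. an engine whose nodes/second drop 15% relative to the rest of the pool:
print(round(elo_from_speed_ratio(0.85)))      # about -14 Elo at 60 Elo per doubling
print(round(elo_from_speed_ratio(0.85, 50)))  # about -12 Elo at a more cautious 50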