IPON ratings calculation

Discussion of anything and everything relating to chess playing software and machines.

Moderator: Ras

lkaufman
Posts: 6259
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA
Full name: Larry Kaufman

Re: Not realistic!

Post by lkaufman »

Robert Flesher wrote:
lkaufman wrote:Could you please comment on how Komodo 4 uses its time in your testing, compared to other engines and also compared to what you think is ideal? We did not design our time control for ponder on games, and I'm thinking this was a big mistake. This probably hurts our results in your testing and IPON. Maybe we can correct this.

I still believe ponder on testing is less efficient than giving double the time, even your own statistics agree. But since you and IPON do it, we need to pay attention to it.

Best regards,
Larry


In regards to time management, I have noticed some strange behavior in SD games in which each engine has 40 min with pondering off. Komodo can have 30 min left on the clock and will sometimes use only 6-7 seconds before moving, which seems like poor time management. However, because it did not lose these games, maybe it knew what it was doing in each particular position and figured it was fine.
That shouldn't happen. I take it these weren't obvious recaptures or the like? Can you tell me anything about the positions where this happens?
LucenaTheLucid
Posts: 197
Joined: Mon Jul 13, 2009 2:16 am

Re: Not realistic!

Post by LucenaTheLucid »

Larry I can send you a PGN of some games where Komodo does this. It shows the seconds per move + evaluation of what Komodo thinks in the PGN.
lkaufman
Posts: 6259
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA
Full name: Larry Kaufman

Re: Not realistic!

Post by lkaufman »

LucenaTheLucid wrote:From watching Komodo play with ponder=off on my computer it seems to me like Komodo uses very little of its time and has too much left at the end.

As far as changing hash table sizes mid-game, I don't know if the UCI protocol would allow it, because I *think* the GUI sends the command for hash size, but I could be wrong.
Please specify the time limit you refer to; I don't know if it is increment, SD, or repeating, and if it is increment, the ratio of main time to increment is important.
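Just to illustrate why the type of control matters, here is a rough Python sketch with a made-up allocation rule and made-up numbers (this is not Komodo's actual scheme), showing how the per-move budget shifts with the time-control type:

    def move_budget_seconds(remaining, increment=0.0, moves_to_go=None,
                            expected_moves_left=35):
        # Hypothetical per-move budget, NOT Komodo's real logic.
        if moves_to_go:                   # repeating control: the clock refills at the control
            horizon = moves_to_go
        else:                             # sudden death or increment: guess a horizon
            horizon = expected_moves_left
        return remaining / horizon + 0.8 * increment   # most of an increment is safe to spend

    print(move_budget_seconds(30 * 60))            # SD, 30 min left: about 51 s per move
    print(move_budget_seconds(40, increment=1.0))  # 40 s left with 1 s increment: the increment dominates

With a large increment relative to the main time, most of the budget comes from the increment term, which is why the ratio matters so much.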
lkaufman
Posts: 6259
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA
Full name: Larry Kaufman

Re: Not realistic!

Post by lkaufman »

LucenaTheLucid wrote:Larry I can send you a PGN of some games where Komodo does this. It shows the seconds per move + evaluation of what Komodo thinks in the PGN.
Please do so. You can send it to Don Dailey, his email address is listed on this site.
lkaufman
Posts: 6259
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA
Full name: Larry Kaufman

Re: IPON ratings calculation

Post by lkaufman »

Sven Schüle wrote:
lkaufman wrote:
Michel wrote:
Albert Silver wrote:I can only assume it is I who lack the proper understanding of how the ratings are calculated, but watching the IPON results of Critter 1.4, I began to wonder why its performance was 2978 after 2106 games. I took the 22 performances, added them up, and then divided them by 22 and came up with 3000.59, so why is the total performance 2978?
The calculation method of BayesElo is explained here:

http://remi.coulom.free.fr/Bayesian-Elo/#theory

The Elo values are the result of a maximum likelihood calculation seeded with a prior (afaics this can only be theoretically justified in a Bayesian setting).

The actual algorithm is derived from this paper

http://www.stat.psu.edu/~dhunter/papers/bt.pdf
I think the "prior" may be the problem; it appears to have way too much weight. If an engine performs 3000 against every opponent in over 2000 games, it should get a rating very close to 3000, maybe 2999. But apparently the prior gets way too much weight, because I believe such an engine on the IPON list would get only around 2975.
Part of the problem is that "match performance" is an almost irrelevant number, and also that you can't take the arithmetic average of it due to non-linearity of the percentage expectancy curve. See also the other thread where this has been discussed (link was provided above).

Sven
I know all about the averaging problem, but if an engine had a 3000 performance against every other engine, it implies that it would neither lose nor gain rating points if the games were all rated at once starting from a 3000 rating. So that should be the performance rating if thousands of games have been played so that the prior has no significance. I believe that EloStat would give a 3000 rating in this case. EloStat is wrong when the performance ratings differ, but if they are the same it should be right, I think.
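For anyone not familiar with the averaging problem, a tiny Python illustration (made-up scores against a single 3000-rated opponent) shows that equal steps in score are not equal steps in Elo, which is the core reason why averaging per-opponent performances is questionable:

    import math

    def perf(opp, score):
        # Single-opponent performance rating from a score fraction (logistic Elo model).
        return opp + 400.0 * math.log10(score / (1.0 - score))

    for s in (0.50, 0.60, 0.80, 0.90):
        print(f"{s:.0%} -> performance {perf(3000, s):.1f}")
    # 50% -> 3000.0, 60% -> 3070.4, 80% -> 3240.8, 90% -> 3381.7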
IWB
Posts: 1539
Joined: Thu Mar 09, 2006 2:02 pm

Re: Not realistic!

Post by IWB »

Adam Hair wrote: ...
If your total focus is testing the top engines and you are fortunate enough to have multiple computers available for testing, then why not use ponder on? It is easy enough when you only have to stay on top of a couple of dozen engines. If you can afford to do it, then do it.
Yes, I do - for the reasons named above!
Adam Hair wrote: However, when you try to maintain multiple lists containing 200 to 300 engines (and adding more all of the time), ponder off makes a lot of sense.
Here we go, you say it yourself, and I repeat the words of my initial posting: the only people who are doing this are a few rating lists! Do you see what I wanted to say?
Adam Hair wrote: In addition, when you compare the results of ponder off testing with ponder on testing, it is hard to discern much difference. Given the differences in focus between IPON and CEGT/CCRL and the lack of truly demonstrative proof that ponder off is less accurate in practice than ponder on ....
That is a different discussion. I still believe that there are differences in playing style AND in rating, although I have to admit that the difference might be very small and hard to prove. I once had a good example (Shredder 12/Naum 4) but no one cares for Naum 4 anymore ... and again that has nothing to do with the relevance of "ponder off" for chess!
(And regarding the difference between POFF and PON: I know of 3 engines which do completely different timing with ponder on, as they simply assume a ponder hit and therefore take more time. I am 100% sure that more engines are doing this. There IS a difference between PON and POFF!)
Adam Hair wrote: , I find the statement "I consider Ponder off as completly artifical and useless, sorry." to be off the mark.
Actually, you might not like it and obviously it offends you (and I hope only a little, and I apologize again), but that is what I think (reasoning above) about POFF testing/lists!

Regards
Ingo

PS: I write this with all respect for the programmers, and you have to believe me that I know how much work it is to make a list (more than outsiders think). But your statement about "hundreds of engines" is a bit exaggerated, or better, "showy" ;-). Yes, there are hundreds of engines, but who is really interested in them? I once had a mistake in my list for an engine ranked between 15 - 20 (individual list, so best of its kind). The error was there for months and nobody found it! The vast majority of people care about the Top 10, a few about the top 20, and then it ends. With the engines below that you have the programmer and maybe a handful of people worldwide being interested! Again, I am full of respect for the work it takes to make such a list, and I have even more respect for the programmers making these engines (as I can't do it), but I also have my doubts about the relevance!
Another disadvantage of testing 100s of engines is that, if I look at the "huge" rating lists, I see more holes than tests. I know that not many people look closely at the conditions or the way a list is done, but those holes are a good part of why I started my list! I personally think that the POFF lists should be more focused. And again, no offence meant! If you want to discuss this in detail, send me a PM.
IWB
Posts: 1539
Joined: Thu Mar 09, 2006 2:02 pm

Re: IPON ratings calculation

Post by IWB »

Don wrote: ...

So in this example not a single program has a positive score against all others.
And this is the normal case! We just had an exception the last few years, and many came to computer chess too late to realize what is normal and what isn't.

For sure it is much more exciting this way than in the last few years :-)

Bye
Ingo
pohl4711
Posts: 2815
Joined: Sat Sep 03, 2011 7:25 am
Location: Berlin, Germany
Full name: Stefan Pohl

Re: IPON ratings calculation

Post by pohl4711 »

Hi Larry,

in my NEBB ranking lists, Komodo 4 is 42 Elo better than Komodo 3 in blitz (4'+2'')...

http://talkchess.com/forum/viewtopic.ph ... 53&t=41677

Greetings - Stefan
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: Not realistic!

Post by Don »

LucenaTheLucid wrote:From watching Komodo play with ponder=off on my computer it seems to me like Komodo uses very little of its time and has too much left at the end.
My tester measures this, so I will check out if that is happening.

As far as changing hash table sizes mid-game, I don't know if the UCI protocol would allow it, because I *think* the GUI sends the command for hash size, but I could be wrong.
Of course the chess program does not have to honor that command. However, I don't really see a point in making the hash table larger as the game progresses; it's not like memory is a resource that can be "saved up" for later. Time control IS such a resource: what you save now can be used later or vice versa, so trade-offs can be chosen.
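For reference, the hash size does travel from GUI to engine as an ordinary UCI option, so a mid-game change would simply be one more setoption line, and the engine is free to ignore it or postpone the resize. A minimal sketch of the exchange (the sizes are arbitrary examples):

    setoption name Hash value 256    <- sent by the GUI, normally before the game starts
    isready
    readyok
    ...
    setoption name Hash value 512    <- a GUI could resend this between moves; whether the
                                        engine actually resizes at that point is up to the engine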
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: IPON ratings calculation

Post by Sven »

lkaufman wrote:
Sven Schüle wrote:
lkaufman wrote:
Michel wrote:
Albert Silver wrote:I can only assume it is I who lack the proper understanding of how the ratings are calculated, but watching the IPON results of Critter 1.4, I began to wonder why its performance was 2978 after 2106 games. I took the 22 performances, added them up, and then divided them by 22 and came up with 3000.59, so why is the total performance 2978?
The calculation method of BayesElo is explained here:

http://remi.coulom.free.fr/Bayesian-Elo/#theory

The Elo values are the result of a maximum likelihood calculation seeded with a prior (afaics this can only be theoretically justified in a Bayesian setting).

The actual algorithm is derived from this paper

http://www.stat.psu.edu/~dhunter/papers/bt.pdf
I think the "prior" may be the problem; it appears to have way too much weight. If an engine performs 3000 against every opponent in over 2000 games, it should get a rating very close to 3000, maybe 2999. But apparently the prior gets way too much weight, because I believe such an engine on the IPON list would get only around 2975.
Part of the problem is that "match performance" is an almost irrelevant number, and also that you can't take the arithmetic average of it due to non-linearity of the percentage expectancy curve. See also the other thread where this has been discussed (link was provided above).

Sven
I know all about the averaging problem, but if an engine had a 3000 performance against every other engine, it implies that it would neither lose nor gain rating points if the games were all rated at once starting from a 3000 rating. So that should be the performance rating if thousands of games have been played so that the prior has no significance. I believe that EloStat would give a 3000 rating in this case. EloStat is wrong when the performance ratings differ, but if they are the same it should be right, I think.
The basic error is to look at "match performance" numbers at all, as if they made any sense. A "match performance", in the world of chess engine ratings, is inherently misleading and has no value in my opinion. Total ratings of engines are derived from a whole pool of game results and have zero relation to "match performances"; the latter are at best a by-product of the overall ratings that have already been calculated at that point, and you can't draw any conclusions from these numbers.

So you can at most blame the way a match performance is calculated, and the fact that it is published at all.

Human player ratings are a totally different animal, for the very reason that the rating principle is completely different. Here you have current ratings for each player, then the next tournament event appears and affects the current ratings of its participants, so ratings evolve over time, and you have an incremental rating process where the most recent events have the highest weight while the oldest events fade out. Calculating a match performance makes some sense here. Engine rating is done at once for the whole pool of games, though, so a "match performance" in this case can only be derived the way a human player's rating would be derived from a set of games against unrated opponents.


Regarding your remarks about EloStat and "prior", either I have misunderstood you or there is some inconsistency in your statement. The program that uses a "prior" is BayesElo, not EloStat. And AFAIK the final IPON ratings that are published were calculated with BayesElo. But nevertheless I believe that the "prior" has little to no impact on the final ratings when considering the number of games involved here.

Sven
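To make the last point about the prior concrete, here is a rough Python sketch with made-up numbers. It mimics the general idea of a prior as a handful of virtual draws against the pool average; it is not necessarily how BayesElo implements its prior, but it shows why the pull of a prior should fade as the game count grows:

    import math

    def expected(r, opp):
        # Logistic Elo expectancy of a player rated r scoring against opp.
        return 1.0 / (1.0 + 10.0 ** ((opp - r) / 400.0))

    def ml_rating(games):
        # Rating that makes the expected total score match the actual total (bisection).
        lo, hi = 2000.0, 4000.0
        for _ in range(60):
            mid = (lo + hi) / 2.0
            surplus = sum(expected(mid, opp) - score for opp, score in games)
            lo, hi = (lo, mid) if surplus > 0 else (mid, hi)
        return (lo + hi) / 2.0

    real = [(2800, 0.76)] * 100                  # 100 games at roughly a 3000 performance
    prior = [(2900, 0.5)] * 4                    # 4 virtual draws against the pool average

    print(round(ml_rating(real)))                # ~3000 without the prior
    print(round(ml_rating(real + prior)))        # ~2995: the prior pulls the rating a little
    print(round(ml_rating(real * 20 + prior)))   # with 2000 games the pull nearly vanishes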