That shouldn't happen. I take it these weren't obvious recaptures or the like? Can you tell me anything about the positions where this happens?

Robert Flesher wrote:
lkaufman wrote: Could you please comment on how Komodo 4 uses its time in your testing, compared to other engines and also compared to what you think is ideal? We did not design our time control for ponder-on games, and I'm thinking this was a big mistake. This probably hurts our results in your testing and IPON. Maybe we can correct this.

I still believe ponder-on testing is less efficient than giving double the time; even your own statistics agree. But since you and IPON do it, we need to pay attention to it.

Best regards,
Larry

In regards to time management, I have noticed some strange behaviour in SD games in which each engine has 40 minutes with pondering off. Komodo can have 30 minutes left on the clock and will sometimes use only 6-7 seconds and then move, which seems like poor time management. However, because it did not lose these games, maybe it knew what it was doing in each particular position and figured it was fine.
-
- Posts: 6259
- Joined: Sun Jan 10, 2010 6:15 am
- Location: Maryland USA
- Full name: Larry Kaufman
Re: Not realistic!
-
- Posts: 197
- Joined: Mon Jul 13, 2009 2:16 am
Re: Not realistic!
Larry, I can send you a PGN of some games where Komodo does this. The PGN shows the seconds used per move plus Komodo's evaluation.
-
- Posts: 6259
- Joined: Sun Jan 10, 2010 6:15 am
- Location: Maryland USA
- Full name: Larry Kaufman
Re: Not realistic!
Please specify the time limit you refer to; I don't know if it is increment, SD, or repeating, and if increment, the ratio of main time to increment is important.

LucenaTheLucid wrote: From watching Komodo play with ponder=off on my computer it seems to me like Komodo uses very little of its time and has too much left at the end.

As far as changing hash table sizes mid-game, I don't know if the UCI protocol would allow it, because I *think* the GUI sends the command for hash size, but I could be wrong.
-
- Posts: 6259
- Joined: Sun Jan 10, 2010 6:15 am
- Location: Maryland USA
- Full name: Larry Kaufman
Re: Not realistic!
Please do so. You can send it to Don Dailey; his email address is listed on this site.

LucenaTheLucid wrote: Larry, I can send you a PGN of some games where Komodo does this. The PGN shows the seconds used per move plus Komodo's evaluation.
-
- Posts: 6259
- Joined: Sun Jan 10, 2010 6:15 am
- Location: Maryland USA
- Full name: Larry Kaufman
Re: IPON ratings calculation
I know all about the averaging problem, but if an engine had a 3000 performance against every other engine, it implies that it would neither lose nor gain rating points if the games were all rated at once starting from a 3000 rating. So that should be the performance rating if thousands of games have been played, so that the prior has no significance. I believe that EloStat would give a 3000 rating in this case. EloStat is wrong when the performance ratings differ, but if they are the same it should be right, I think.

Sven Schüle wrote: Part of the problem is that "match performance" is an almost irrelevant number, and also that you can't take the arithmetic average of it due to the non-linearity of the percentage expectancy curve. See also the other thread where this has been discussed (link was provided above).

lkaufman wrote: I think the "prior" may be the problem; it appears to have way too much weight. If an engine performs 3000 against every opponent in over 2000 games, it should get a rating very close to 3000, maybe 2999. But apparently the prior gets way too much weight, because I believe such an engine on the IPON list would get only around 2975.

Michel wrote: The calculation method of BayesElo is explained here:
http://remi.coulom.free.fr/Bayesian-Elo/#theory
The Elos are the result of a maximum-likelihood calculation seeded with a prior (AFAICS this can only be theoretically justified in a Bayesian setting). The actual algorithm is derived from this paper:
http://www.stat.psu.edu/~dhunter/papers/bt.pdf

Albert Silver wrote: I can only assume it is I who lack the proper understanding of how the ratings are calculated, but watching the IPON results of Critter 1.4, I began to wonder why its performance was 2978 after 2106 games. I took the 22 performances, added them up, and then divided them by 22 and came up with 3000.59, so why is the total performance 2978?

Sven
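[To make the claim above concrete, here is a minimal sketch of a prior-free maximum-likelihood rating, using made-up opponent ratings rather than IPON data: the ML rating is the one that makes the total expected score equal the total actual score. If the engine's score against every opponent is exactly what a 3000 player would get, the solver lands on 3000, matching Larry's argument.]

```python
def expected(r_a: float, r_b: float) -> float:
    """Elo logistic expectancy: per-game score of a player rated r_a vs r_b."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

# Hypothetical opponents; each score is exactly what a 3000 player would get.
opponents = [2850.0, 2900.0, 2950.0, 3000.0, 3050.0]
scores = [expected(3000.0, r) for r in opponents]

# Maximum likelihood with no prior: expected total score == actual total score.
# The expectancy sum is monotone in the rating, so plain bisection finds it.
lo, hi = 2000.0, 4000.0
for _ in range(100):
    mid = (lo + hi) / 2.0
    if sum(expected(mid, r) for r in opponents) < sum(scores):
        lo = mid
    else:
        hi = mid

print(round((lo + hi) / 2.0, 1))  # 3000.0 -- with no prior, rating == performance
```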
-
- Posts: 1539
- Joined: Thu Mar 09, 2006 2:02 pm
Re: Not realistic!
Yes, I do - for the reasons named above!

Adam Hair wrote: ...
If your total focus is testing the top engines and you are fortunate enough to have multiple computers available for testing, then why not use ponder on? It is easy enough when you only have to stay on top of a couple of dozen engines. If you can afford to do it, then do it.
Here we go, you say it yourself, and I repeat the words of my initial posting: the only people who are doing this are a few rating lists! Do you see what I wanted to say?

Adam Hair wrote: However, when you try to maintain multiple lists containing 200 to 300 engines (and adding more all of the time), ponder off makes a lot of sense.

That is a different discussion. I still believe that there are differences in playing style AND in rating, though I have to admit that the difference might be very small and hard to prove. I once had a good example (Shredder 12/Naum 4), but no one cares for Naum 4 anymore ... and again, that has nothing to do with the relevance of "ponder off" for chess!

Adam Hair wrote: In addition, when you compare the results of ponder off testing with ponder on testing, it is hard to discern much difference. Given the differences in focus between IPON and CEGT/CCRL and the lack of truly demonstrative proof that ponder off is less accurate in practice than ponder on ....
(And regarding the difference between POFF and PON: I know of 3 engines which use completely different timing with ponder on, as they simply assume a ponder hit and therefore allocate more time. I am 100% sure that more engines are doing this. There IS a difference between PON and POFF!)
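[A toy sketch of the mechanism Ingo describes; the actual engines and their formulas are not public, so the hit rate and scaling below are pure assumptions. An engine that expects to ponder-hit part of the time can justify spending more per move, because a hit means the move was already searched on the opponent's clock.]

```python
def time_budget(remaining_s: float, moves_to_go: int,
                ponder: bool, hit_rate: float = 0.5) -> float:
    """Toy per-move time budget. With ponder on, an assumed ponder-hit
    probability stretches the effective clock, so the engine can spend
    more now; with ponder off there is no such subsidy."""
    base = remaining_s / max(moves_to_go, 1)
    if ponder:
        # Crude model: half of each ponder hit's search time is "free".
        return base / (1.0 - 0.5 * hit_rate)
    return base

# Same clock (40 moves in 40 minutes): 60s/move ponder off, 80s/move ponder on.
print(time_budget(2400.0, 40, ponder=False), time_budget(2400.0, 40, ponder=True))
```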
Actually, you might not like it, and obviously it offends you (and I hope only a little, and I apologize again), but that is what I think (reasoning above) about POFF testing/lists!

Adam Hair wrote: ..., I find the statement "I consider Ponder off as completely artificial and useless, sorry." to be off the mark.
Regards
Ingo
PS: I write this with all respect for the programmers, and you have to believe me that I know how much work it is to make a list (more than outsiders think). But your statement about "hundreds of engines" is a bit exaggerated, or better, "showy".

Another disadvantage of testing hundreds of engines is that when I look at the "huge" rating lists, I see more holes than testing. I know that not many people look closely at the conditions or the way a list is done, but those holes are a good part of why I started my list! I personally think that the POFF lists should be more focused. And again, no offence meant! If you want to discuss this in detail, send me a PM.
-
- Posts: 1539
- Joined: Thu Mar 09, 2006 2:02 pm
Re: IPON ratings calculation
And this is the normal case! We just had an exception in the last few years, and many came to computer chess too late to realize what is normal and what isn't.

Don wrote: ...
So in this example not a single program has a positive score against all others.
For sure it is much more exciting this way than in the last years.

Bye
Ingo
-
- Posts: 2815
- Joined: Sat Sep 03, 2011 7:25 am
- Location: Berlin, Germany
- Full name: Stefan Pohl
Re: IPON ratings calculation
Hi Larry,
in my NEBB-Rankinglists Komodo 4 is 42 Elo better than Komodo 3 in Blitz (4'+2'')...
http://talkchess.com/forum/viewtopic.ph ... 53&t=41677
Greetings - Stefan
-
- Posts: 5106
- Joined: Tue Apr 29, 2008 4:27 pm
Re: Not realistic!
My tester measures this, so I will check whether that is happening.

LucenaTheLucid wrote: From watching Komodo play with ponder=off on my computer it seems to me like Komodo uses very little of its time and has too much left at the end.
Of course the chess program does not have to honor that command. However, I don't really see a point in making the hash table larger as the game progresses; it's not like memory is a resource that can be "saved up" for later. Clock time IS such a resource: what you save now can be used later, or vice versa, so trade-offs can be chosen.

LucenaTheLucid wrote: As far as changing hash table sizes mid-game, I don't know if the UCI protocol would allow it, because I *think* the GUI sends the command for hash size, but I could be wrong.
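[For reference, hash size in UCI is indeed set by the GUI: it sends the standard `setoption` text command on the engine's stdin, and `Hash` is a standard spin option measured in MB. Below is a minimal sketch of the GUI side; "./engine" is a placeholder path, not a real binary.]

```python
import subprocess

# Launch any UCI engine; "./engine" is a placeholder path (assumption).
engine = subprocess.Popen(["./engine"], stdin=subprocess.PIPE,
                          stdout=subprocess.PIPE, text=True)

def send(cmd: str) -> None:
    """Write one UCI command line to the engine."""
    engine.stdin.write(cmd + "\n")
    engine.stdin.flush()

send("uci")                            # engine replies with its option list
send("setoption name Hash value 256")  # GUI picks the hash size, in MB
send("isready")                        # sync point; engine answers "readyok"
```

Whether an engine honors a `setoption name Hash` sent in mid-game is, as Don says, up to the engine; GUIs normally send options before a game starts.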
-
- Posts: 4052
- Joined: Thu May 15, 2008 9:57 pm
- Location: Berlin, Germany
- Full name: Sven Schüle
Re: IPON ratings calculation
The basic error is to look at "match performance" numbers at all, as if they would make any sense. A "match performance", in the world of chess engine ratings, is inherently misleading and has no value in my opinion. Total ratings of engines are derived from a whole pool of game results and have zero relation to "match performances"; the latter are at best a by-product of the overall ratings that have already been calculated at that point, and you can't draw any conclusions from these numbers.

lkaufman wrote: I know all about the averaging problem, but if an engine had a 3000 performance against every other engine, it implies that it would neither lose nor gain rating points if the games were all rated at once starting from a 3000 rating. So that should be the performance rating if thousands of games have been played, so that the prior has no significance. I believe that EloStat would give a 3000 rating in this case. EloStat is wrong when the performance ratings differ, but if they are the same it should be right, I think.

Sven Schüle wrote: Part of the problem is that "match performance" is an almost irrelevant number, and also that you can't take the arithmetic average of it due to the non-linearity of the percentage expectancy curve. See also the other thread where this has been discussed (link was provided above).

lkaufman wrote: I think the "prior" may be the problem; it appears to have way too much weight. If an engine performs 3000 against every opponent in over 2000 games, it should get a rating very close to 3000, maybe 2999. But apparently the prior gets way too much weight, because I believe such an engine on the IPON list would get only around 2975.

Michel wrote: The calculation method of BayesElo is explained here:
http://remi.coulom.free.fr/Bayesian-Elo/#theory
The Elos are the result of a maximum-likelihood calculation seeded with a prior (AFAICS this can only be theoretically justified in a Bayesian setting). The actual algorithm is derived from this paper:
http://www.stat.psu.edu/~dhunter/papers/bt.pdf

Albert Silver wrote: I can only assume it is I who lack the proper understanding of how the ratings are calculated, but watching the IPON results of Critter 1.4, I began to wonder why its performance was 2978 after 2106 games. I took the 22 performances, added them up, and then divided them by 22 and came up with 3000.59, so why is the total performance 2978?

Sven
So you can at most blame the way a match performance is calculated, and the fact that it is published at all.
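[A small numerical illustration of this non-linearity, with invented numbers: per-opponent performance ratings pass each score through the logistic expectancy curve, so their arithmetic average generally disagrees with the single rating that explains the pooled games.]

```python
import math

def expected(r_a, r_b):
    """Elo logistic expectancy of a player rated r_a vs r_b."""
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

def perf(opp, score):
    """Classical single-opponent performance rating (EloStat-style)."""
    score = min(max(score, 1e-6), 1.0 - 1e-6)  # keep the logit finite
    return opp + 400.0 * math.log10(score / (1.0 - score))

# Invented results: 90% vs a 2700 opponent, 40% vs a 3100 opponent.
results = [(2700.0, 0.90), (3100.0, 0.40)]

avg_perf = sum(perf(o, s) for o, s in results) / len(results)

# Pooled rating: bisection on expected total score == actual total score.
lo, hi = 2000.0, 4000.0
target = sum(s for _, s in results)
for _ in range(100):
    mid = (lo + hi) / 2.0
    if sum(expected(mid, o) for o, _ in results) < target:
        lo = mid
    else:
        hi = mid

print(round(avg_perf, 1), round((lo + hi) / 2.0, 1))  # ~3055.6 vs ~3044.6
```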
Human player ratings are a totally different animal, for the very reason that the rating principle is completely different. Here you have current ratings for each player; then the next tournament event appears and affects the current ratings of its participants, so ratings evolve over time, and you have an incremental rating process where the most recent events have the highest weight while the oldest events fade out. Calculating match performance makes some sense here. Engine rating is done at once for the whole pool of games, though, so a "match performance" in this case can only be derived like the rating of a human player from a set of games against unrated opponents.
Regarding your remarks about EloStat and "prior", either I have misunderstood you or there is some inconsistency in your statement. The program that uses a "prior" is BayesElo, not EloStat. And AFAIK the final IPON ratings that are published were calculated with BayesElo. But nevertheless I believe that the "prior" has little to no impact on the final ratings when considering the number of games involved here.
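[Sven's closing point can be checked numerically. One simple way to model such a prior, purely for illustration (BayesElo's exact scheme may differ), is a couple of virtual draws against an anchor-rated virtual opponent; the sketch shows the prior's pull fading as the real game count grows.]

```python
def expected(r_a, r_b):
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

def ml_rating(games, prior_draws=2.0, anchor=2800.0):
    """games: list of (opponent_rating, per_game_score, game_count).
    The prior is modelled as virtual draws vs an anchor opponent."""
    pool = games + [(anchor, 0.5, prior_draws)]
    target = sum(s * n for _, s, n in pool)
    lo, hi = 2000.0, 4000.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if sum(expected(mid, o) * n for o, _, n in pool) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# An engine scoring like a 3000 player; the prior drags it toward the 2800
# anchor, but the printed rating approaches 3000.0 as real games accumulate.
for n in (20, 200, 2000):
    print(n, round(ml_rating([(2900.0, expected(3000.0, 2900.0), n)]), 1))
```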
Sven