IPON ratings calculation

Frank Quisinsky · Post by **Frank Quisinsky** » Thu Dec 29, 2011 9:45 am

Uri,

This is logical given the information we have.

1.
With more time there are more drawn (remis) games.

2.
In decisive games (1:0 / 0:1), the decisive advantage appears on average at move 54 (late middlegame, early endgame). That means a game turns towards 1:0 or 0:1 at move 54 on average, as long as the average rating difference between all participants / opponents in a tourney or rating list isn't higher than 135 ELO (the 135 ELO situation is what we have with our TOP-40).

3.
An engine that has as many problems in endgames as Junior can't go higher in the ratings with more time. With more time, the others will find better moves in the game phase where Junior is strong.

This is not a proof, but all in all I think it should be clear, to about 95%, that the Junior rating can't gain more than around 30 ELO with more time (compared to the others). I rather think that Junior loses its advantage with more and more time ...

Example:
40 in 5 ... No advantage
40 in 10 ... 30 ELO advantage to the others
40 in 40 ... 20 ELO advantage to the others
and perhaps with 40 in 120 ... no advantage to the others

Best
Frank

Uri Blass · Post by **Uri Blass** » Thu Dec 29, 2011 10:04 am

Frank Quisinsky wrote:
Uri,

This is logical given the information we have.

1.
With more time there are more drawn (remis) games.

2.
In decisive games (1:0 / 0:1), the decisive advantage appears on average at move 54 (late middlegame, early endgame). That means a game turns towards 1:0 or 0:1 at move 54 on average, as long as the average rating difference between all participants / opponents in a tourney or rating list isn't higher than 135 ELO (the 135 ELO situation is what we have with our TOP-40).

3.
An engine that has as many problems in endgames as Junior can't go higher in the ratings with more time. With more time, the others will find better moves in the game phase where Junior is strong.

This is not a proof, but all in all I think it should be clear, to about 95%, that the Junior rating can't gain more than around 30 ELO with more time (compared to the others). I rather think that Junior loses its advantage with more and more time ...

Example:
40 in 5 ... No advantage
40 in 10 ... 30 ELO advantage to the others
40 in 40 ... 20 ELO advantage to the others
and perhaps with 40 in 120 ... no advantage to the others

Best
Frank

You say:
With more time, the others will find better moves in the game phase where Junior is strong.

The question is why we should not also assume that, with more time, Junior is going to find better moves in the phase where it is weak.

Michel · Post by **Michel** » Thu Dec 29, 2011 10:21 am

Albert Silver wrote:I can only assume it is I who lack the proper understanding of how the ratings are calculated, but watching the IPON results of Critter 1.4, I began to wonder why its performance was 2978 after 2106 games. I took the 22 performances, added them up, and then divided them by 22 and came up with 3000.59, so why is the total performance 2978?

The calculation method of BayesElo is explained here:

http://remi.coulom.free.fr/Bayesian-Elo/#theory

The elo's are the result of a maximum likelihood calculation seeded
with a prior (afaics this can only be theoretically justified in a Bayesian
setting).

The actual algorithm is derived from this paper

http://www.stat.psu.edu/~dhunter/papers/bt.pdf
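As a rough illustration of why the average of per-opponent performances need not match a single maximum-likelihood rating, here is a minimal Python sketch. It assumes a plain logistic Elo model without draws and without a prior, which is simpler than what BayesElo does. The opponent ratings and scores are made-up numbers, not IPON data, and the function names are hypothetical.

```python
import math

def expected(r, opp):
    """Expected score of rating r against rating opp (logistic Elo model)."""
    return 1.0 / (1.0 + 10.0 ** ((opp - r) / 400.0))

def performance(opp, score):
    """Rating whose expected score against opp equals score (0 < score < 1)."""
    return opp + 400.0 * math.log10(score / (1.0 - score))

def ml_rating(results, lo=0.0, hi=5000.0):
    """Single rating that makes total expected points equal total actual points.

    results: list of (opponent_rating, games, score_fraction).
    Solved by bisection, since the expected score grows with the rating.
    """
    actual = sum(n * s for _, n, s in results)
    for _ in range(100):
        mid = (lo + hi) / 2.0
        exp = sum(n * expected(mid, opp) for opp, n, _ in results)
        if exp < actual:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Made-up example: (opponent rating, games, score fraction)
results = [(2700, 100, 0.90), (2900, 100, 0.60), (3000, 100, 0.45)]

avg_perf = sum(performance(opp, s) for opp, _, s in results) / len(results)
ml = ml_rating(results)

print("average of per-opponent performances:", round(avg_perf, 1))
print("single maximum-likelihood rating:    ", round(ml, 1))
```

In this toy example the two numbers differ, because the per-opponent performance is a nonlinear function of the score: a lopsided score against a weak opponent inflates the simple average more than it moves the joint fit.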

Frank Quisinsky · Post by **Frank Quisinsky** » Thu Dec 29, 2011 11:11 am

Hi Uri,

With longer time controls, the average number of moves and the draw (remis) rate will go up. The advantage Junior has in the early middlegame can't compensate for its weaknesses in endgames if more endgames have to be played with more time.

In other words:
More time = longer games = more endgames.

It is possible that Junior will be stronger in the early middlegame with more time, but you have to weigh that against the fact that Junior will also get more problems in endgames with more time.

The advantage Junior has fizzles out.

The same goes for Spark, because both have the same weaknesses and strengths. Spark is more aggressive than Junior and produces more such games than Junior, but broadly speaking ... the strengths and weaknesses are the same.

Best
Frank

lkaufman · Post by **lkaufman** » Thu Dec 29, 2011 8:08 pm

CCRL does 40/40 ratings, and usually a new engine will have a well-established rating on their list within a week or two. For you it is probably not realistic because you are just one person, but for a group like CCRL it appears to be no problem.

lkaufman · Post by **lkaufman** » Thu Dec 29, 2011 8:11 pm

geots wrote:
lkaufman wrote:
melajara wrote:Critter 1.4 match seems to be stalled now. Last time I checked, score was 2978 after 2162 games but as I'm writing this, there is no more ad interim results displayed.

From what I observed from several IPON unfolding matches, for whatever reason the provisional score seems to drop after a few hundred games.
It would be ironic that Critter 1.4 score exactly matches Komodo 4 when the profile of play of both programs is very different (from this match, Critter being stronger with the strongest opponents but somewhat inconsistent with weaker ones).

Anyway, we are clearly in the diminishing return phase from the latest version for both programs.
At current rate of progress, we'll need Komodo 7 and Critter 1.6 to bypass Houdini 2/1.5

This demonstrates the engineering ability of Mr Houdart or the luck he had in tuning the parameters of Houdini 1.5
I don't think so. Although IPON only shows a 14 elo gain for K4 over K3, CCRL so far shows 28 elo, so probably my original estimate of 20 was spot on. I guess we'll need two more versions to catch Houdini at blitz based on this, but the next version should do it at 40/40 minutes.

Right Larry, but which Houdini version are you referring to when you say "catch Houdini". You seem to be in this period of time you refer to assuming Robert will be sitting on his hands doing nothing to come out with possibly a much stronger and faster release. Just a thought.

Best,

george

I refer to either 1.5 or 2.0, as the lists show no net difference between the two; in fact the slower tests give 1.5 an edge. If he couldn't improve Houdini in a year I don't expect miracles, but perhaps he'll find some improvements in 2012.

lkaufman · Post by **lkaufman** » Thu Dec 29, 2011 8:14 pm

Michel wrote:
Albert Silver wrote:I can only assume it is I who lack the proper understanding of how the ratings are calculated, but watching the IPON results of Critter 1.4, I began to wonder why its performance was 2978 after 2106 games. I took the 22 performances, added them up, and then divided them by 22 and came up with 3000.59, so why is the total performance 2978?
The calculation method of BayesElo is explained here:

http://remi.coulom.free.fr/Bayesian-Elo/#theory

The elo's are the result of a maximum likelihood calculation seeded
with a prior (afaics this can only be theoretically justified in a Bayesian
setting).

The actual algorithm is derived from this paper

http://www.stat.psu.edu/~dhunter/papers/bt.pdf

I think the "prior" may be the problem; it appears to have way too much weight. If an engine performs 3000 against every opponent in over 2000 games, it should get a rating very close to 3000, maybe 2999. But apparently the prior gets way too much weight, because I believe such an engine on the IPON list would get only around 2975.
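To get a feel for how much weight a prior could carry in a case like this, here is a minimal Python sketch. It assumes the prior acts like a few virtual draws against each opponent (an assumption for illustration, not a statement of how BayesElo implements it), uses a plain logistic Elo model without a draw model, and uses made-up opponents. The engine scores exactly its 3000-level expectation against every opponent over 2000 games.

```python
def expected(r, opp):
    """Expected score of rating r against rating opp (logistic Elo model)."""
    return 1.0 / (1.0 + 10.0 ** ((opp - r) / 400.0))

def ml_rating(results, lo=0.0, hi=5000.0):
    """Rating at which total expected points equal total actual points.

    results: list of (opponent_rating, games, score_fraction).
    """
    actual = sum(n * s for _, n, s in results)
    for _ in range(100):
        mid = (lo + hi) / 2.0
        exp = sum(n * expected(mid, opp) for opp, n, _ in results)
        if exp < actual:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def rating_with_prior(opponents, games_each, true_perf, virtual_draws):
    """Fit a rating after adding virtual drawn games against each opponent.

    The virtual draws are a hypothetical stand-in for a prior.
    """
    results = []
    for opp in opponents:
        # Real games: the engine scores exactly what a true_perf player would.
        results.append((opp, games_each, expected(true_perf, opp)))
        if virtual_draws > 0:
            results.append((opp, virtual_draws, 0.5))
    return ml_rating(results)

opponents = [2700, 2800, 2900, 3000]   # made-up opponent ratings
no_prior = rating_with_prior(opponents, 500, 3000.0, 0)
weak_prior = rating_with_prior(opponents, 500, 3000.0, 2)

print("no prior:        ", round(no_prior, 2))
print("2 virtual draws: ", round(weak_prior, 2))
```

Varying `virtual_draws` shows how strong a prior of this kind would have to be to move the rating by a given amount under these assumptions.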

Frank Quisinsky · Post by **Frank Quisinsky** » Thu Dec 29, 2011 9:07 pm

Hi Larry,

I would say I have 40 in 60 (conditions: Pentium 4 2.0 GHz, without ponder, with resign).

40 in 40 CCRL without ponder and without resign factors = around 40 in 16 if you compare it with 40 in 10 SWCR conditions.

CCRL is playing without ponder
40 in 40 on older AMD hardware

SWCR is playing with ponder
40 in 10 on faster Intel Q9550 hardware
Ponder = around 40 ELO more, see Crafty 23.3 x64 results in SWCR rating list.

Indeed, CCRL has the highest conditions with around 40 in 16 (compared to 40 in 10 SWCR). CEGT, with 40 in 20 without ponder and on slower hardware than SWCR, is around the same as SWCR. IPON, with ponder, has around 40 in 4 if I compare it with SWCR.

The highest conditions come from CCRL!
CEGT and SWCR are around the same.

But the difference between CCRL and SWCR / CEGT isn't large enough to show anything.

Best
Frank

From CCRL to SWCR, Komodo will not get a big jump in ELO. In CEGT, Komodo 4 is around +14 stronger than Komodo 3. SWCR shows the same result after around 250 games against a lot of opponents (clear), and in IPON it is 12 ELO. So you can be sure that the CCRL rating for Komodo is too high. CCRL doesn't have so many participants, I think; many opponents are much more important than many games.

In other words:
CEGT + 14 so far
SWCR + 14 so far
IPON + 12 so far
Your tester Clemens wrote today in the CSS forum that he found +15 in his Komodo testing.

And CCRL has +28, but this can't be right when I look at the other four results. I think the reason is that CCRL doesn't use as many strong opponents as the others do, or doesn't have many different opponents at the moment.

Best
Frank

My example to current electric other things are 40 in 40 with actual hardware. This is what we need in order to see whether or not Komodo is stronger than the others with longer time controls. Comparing CCRL with SWCR or CEGT, you can't see it.

lkaufman · Post by **lkaufman** » Thu Dec 29, 2011 9:17 pm

Frank Quisinsky wrote:Hi Larry,

I would say I have 40 in 60 (conditions: Pentium 4 2.0 GHz, without ponder, with resign).

40 in 40 CCRL without ponder and without resign factors = around 40 in 16 if you compare it with 40 in 10 SWCR conditions.

CCRL is playing without ponder
40 in 40 on older AMD hardware

SWCR is playing with ponder
40 in 40 on faster Intel Q9550 hardware

Indeed, CCRL has the highest conditions with around 40 in 16 (compared to 40 in 10 SWCR). CEGT, with 40 in 20 without ponder and on slower hardware than SWCR, is around the same as SWCR. IPON, with ponder, has around 40 in 4 if I compare it with SWCR.

The highest conditions come from CCRL; CEGT and SWCR are around the same.

Best
Frank

From CCRL to SWCR, Komodo will not get a big jump in ELO. In CEGT, Komodo 4 is around +14 stronger than Komodo 3. SWCR shows the same result after around 250 games (clear), and in IPON it is 12 ELO. So you can be sure that the CCRL rating for Komodo is too high. CCRL doesn't have so many participants, I think; many opponents are much more important than many games.

In other words:
CEGT + 14 so far
SWCR + 14 so far
IPON + 12 so far
Your tester Clemens wrote today in the CSS forum that he found +15 in his Komodo testing.

And CCRL has +28, but this can't be right when I look at the other results. I think the reason is that CCRL doesn't use as many strong opponents as the others do, or doesn't have many different opponents at the moment.

Best
Frank

My example to current electric other things are 40 in 40 with actual hardware. This is what we need in order to see whether or not Komodo is stronger than the others with longer time controls. Comparing CCRL with SWCR or CEGT, you can't see it.

I don't consider resign on/off or ponder on/off to be important. Ponder on has unpredictable effects, and I think it raises the quality only slightly, not nearly enough to justify cutting your sample in half. We would never test that way. If your machine were used for CCRL 40/40 test, what setting would you use? In other words, what is the hardware-only adjustment between CCRL and your list, disregarding ponder and resign?

Best,
Larry

Frank Quisinsky · Post by **Frank Quisinsky** » Thu Dec 29, 2011 9:28 pm

Hi Larry,

Ponder is a time factor and very important.

Without ponder you can produce more games; with ponder and around 30% ponder hits, the performance goes higher. I think games without ponder are only half games.

Code:

   -  157 Crafty 23.3 JA x64             2593   18   18  1160   33%  2721   34% 
   -  176 Crafty 23.3 JA x64, no ponder  2546   20   20  1000   26%  2729   30%

= 47 ELO.
With double the time you will get 60-65 ELO more.
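The 47 ELO figure is simply the difference between the two ratings in the table. A minimal Python sketch of that arithmetic, assuming the first number after each engine name is its rating and using the standard logistic Elo formula to turn a rating gap into an expected score (the variable and function names are hypothetical):

```python
# Ratings read from the two table rows (assumed to be the rating column).
ratings = {
    "Crafty 23.3 JA x64": 2593,
    "Crafty 23.3 JA x64, no ponder": 2546,
}

def expected_score(diff):
    """Expected score for a player rated `diff` ELO above the opponent."""
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

ponder_gain = ratings["Crafty 23.3 JA x64"] - ratings["Crafty 23.3 JA x64, no ponder"]
print("ponder gain in ELO:", ponder_gain)  # 47

# What a gap of that size would mean in a head-to-head match
print("expected score for that gap:", round(expected_score(ponder_gain), 3))

# The post's estimate for doubling the time, for comparison
for doubling_gain in (60, 65):
    print(doubling_gain, "ELO ->", round(expected_score(doubling_gain), 3))
```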

40 in 40 CCRL without ponder = around 40 in 25 with ponder.
SpeedUp from Q9550 to AMD3800 = a lot.

I have been calculating this for two years ...
40 in 10 SWCR should be around 40 in 16 CCRL if CCRL used the same hardware and ponder I use.

With or without resign matters only for statistics.
Move average without resign = 86 moves
Move average with resign = 67 moves
Better statistics are possible with resign = off.

Again, I have a completely different opinion on ponder than you.
Ponder is a very, very important speed factor.

Best
Frank
