best way to determine elos of a group

Daniel Shawul · Post by **Daniel Shawul** » Tue Jul 30, 2019 4:53 am

I am sure you are all familiar with selfplay elo graphs of lc0 are considered not so reliable ( sometimes called its "ego" due to its inflated value ).
I think this is mostly because each network is being tested against the previous network -- maybe they do more but for this discussion lets stick with
testing against previous network only. So in essence elo is assumed to be incremental. The elo of neworks goes like +50, +40, -20, +30, +25, ... etc

I faced this sort of problem when trying to evaluate my networks. For my case roughly 60 times more nets are produced comapred to lc0's.
So the time I spent evaluating networks takes a significant portion of my resources. Using the lc0 approach for the first 12 networks calculated with bopo (my implementaton of bayeselo algorithms) can be seen here

http://scorpiozero.ddns.net/scorpiozero ... atings.txt

When i got to 160 networks in under 80k games, I decided that I should sample networks and match them to possible fit a curve for all nets without actually playing net ID x with x+1. This thought led me to some interesting questions on maximizing resources for estimation of elo of a group. My question simply put is: what is the best way to match the nets to produce a reliable elo estimate ? The incremental testing resulted in "ego" so there must be a better way to approach the problem of determining elos with few games as possible. I know this is not a very precise question for mathematicians but if you get the idea please frame the question the way you like it. I am thinking of @Michel here.

This problem also seems to be related to Upper confidence bound (UCB) formula -- specifically for perft estimation! Where should you spend your match games to minimize the standard errors of elo estimates of all networks. We had fun with "perft estimation" some years ago that showed that this problem
is different from standard UCB as used in tree search where the goal is to find the best arm. If the problem was to find the best network, the standard UCT approach will zoom in on the latest networks and minimize the errors in their elo esimates.

Should the goal even be to determine elos for all networks? Also determining the best network is probably easy without even doing matches so I think the goal should be somewhere in between. We want to detect when things are going bad so the goal shouldn't be to find only the best net which is probably trivial anyway.

regards,
Daniel

dragontamer5788 · Post by **dragontamer5788** » Tue Jul 30, 2019 6:48 am

If you can manage to find a copy of "The Method of Paired Comparisons" by H. A. David, it will go deep into the Bradley-Terry model (which Elo's ranking system was based on). Elo was designed for ladders, not tournaments. The Bradley-Terry model was originally designed for one-shot tournaments where the skills of players are unlikely to change. Unfortunately, most people have forgotten the relationship between the Elo model and Bradley-Terry model.

My particular copy of the book is the 1988 version.

Chapter 4 covers the "Linear Model" (particularly Bradley-Terry) in depth, including confidence intervals and tests of the null hypothesis.

"Little Pi _ij" stands for "probability that i beats j", and the Bradley Terry model is a regression-like methodology to linearize the probabilities and put them all in relative terms. "Little Pi" is the true probabilities, while the methodology discussed in those pages calculates "p" (an estimate of little-pi).

I think that covers the main gist of the algorithm, and its confidence bounds. "p" is basically the Bradley-Terry score (aka: Elo) in the equations above.

The equation 4.3.4 on page 62 (with the addendum on page 63) is the most important equation in the book. It is how to calculate "p" (the estimate of Little Pi) through an iterative methodology. As long as all players have lost at least one time (you may add a "half loss" of 0.5 to all scores to ensure this effect), you will continuously get more-and-more precise terms each iteration. The series diverges however if an opponent has never lost (that opponent will reach infinity). Page 63 describes it as follows:

The convergence of the iterative process has been established by Zermelo and Ford in this general case, provided the following assumption holds:

"In every possible partition of the objects into two non-empty subsets, some object in the second set has been preferred [Note: Has won a match] at least once to some object in the first set".

Chapter 5 discusses some tournament designs, starting with Round Robin and various cycles. But in my experience, the equation 4.3.4 is extremely flexible. Play at least as many matches needed to get 4.3.4 to converge, and keep playing until your confidence intervals are tight enough to draw the conclusions you need.

-------

Notes:

* Converting from "win probability" to "Elo score" is roughly EloScore = Log(p_ij/p_ji) * 400 (where p_ij is the probability i beats j). In chess, a draw would be considered a half-win to both players (+ 0.5).

* "a_ij" is the number of times "i was preferred to j", or "i beat j" in a game.

* I recall seeing a paper that tried to calculate the pairings which would potentially generate the biggest change to ratings. I believe it took the derivative of the confidence bounds and searched for the pairings that would generate the biggest drop in confidence bounds. Or something along those lines. I forgot the name of that paper however.

Michel · Post by **Michel** » Tue Jul 30, 2019 1:08 pm

Daniel Shawul wrote: ↑Tue Jul 30, 2019 4:53 am I am sure you are all familiar with selfplay elo graphs of lc0 are considered not so reliable ( sometimes called its "ego" due to its inflated value ).
I think this is mostly because each network is being tested against the previous network -- maybe they do more but for this discussion lets stick with
testing against previous network only. So in essence elo is assumed to be incremental. The elo of neworks goes like +50, +40, -20, +30, +25, ... etc

I faced this sort of problem when trying to evaluate my networks. For my case roughly 60 times more nets are produced comapred to lc0's.
So the time I spent evaluating networks takes a significant portion of my resources. Using the lc0 approach for the first 12 networks calculated with bopo (my implementaton of bayeselo algorithms) can be seen here

http://scorpiozero.ddns.net/scorpiozero ... atings.txt

When i got to 160 networks in under 80k games, I decided that I should sample networks and match them to possible fit a curve for all nets without actually playing net ID x with x+1. This thought led me to some interesting questions on maximizing resources for estimation of elo of a group. My question simply put is: what is the best way to match the nets to produce a reliable elo estimate ? The incremental testing resulted in "ego" so there must be a better way to approach the problem of determining elos with few games as possible. I know this is not a very precise question for mathematicians but if you get the idea please frame the question the way you like it. I am thinking of @Michel here.

This problem also seems to be related to Upper confidence bound (UCB) formula -- specifically for perft estimation! Where should you spend your match games to minimize the standard errors of elo estimates of all networks. We had fun with "perft estimation" some years ago that showed that this problem
is different from standard UCB as used in tree search where the goal is to find the best arm. If the problem was to find the best network, the standard UCT approach will zoom in on the latest networks and minimize the errors in their elo esimates.

Should the goal even be to determine elos for all networks? Also determining the best network is probably easy without even doing matches so I think the goal should be somewhere in between. We want to detect when things are going bad so the goal shouldn't be to find only the best net which is probably trivial anyway.

regards,
Daniel

I am not too sure if I understand your question correctly.

If you only play against the last engine and don't do gating(!!) you still get an unbiased elo estimate. The problem however is that the variance of the estimator will become bigger and bigger (you are measuring
X_n-X_1 as (X_n-X_{n-1})+(X_{n-1}-X_{n-2})+...+(X_2-X_1)). So as an absolute elo estimate it becomes worthless.

To get an elo estimate with fixed variance you should test against a fixed engine (or group of engines). But this then not good for comparing engines (it is well known that the variance of elo difference measured against
a 3d engine is 4 times as big as when measured in direct play).

Of course you can combine the two, i.e. occasionally playing against a fixed engine. This will serve as an anchor for the other results.

Maybe there is something to optimize here, but it depends on the question.

Laskos · Post by **Laskos** » Tue Jul 30, 2019 4:40 pm

Michel wrote: ↑Tue Jul 30, 2019 1:08 pm To get an elo estimate with fixed variance you should test against a fixed engine (or group of engines). But this then not good for comparing engines (it is well known that the variance of elo difference measured against
a 3d engine is 4 times as big as when measured in direct play).

Well, 2 times as big if he starts the procedure from the beginning of his line of nets. The issue might become whether that engine as an opponent is not peculiar in some ways and what really we want to measure ("strength", I suppose, but in relation to "something"). How the gating would look now? For example, would it be an orderly gating to require that each successive net performs better against "fixed opponents" than the previous net?

Daniel Shawul · Post by **Daniel Shawul** » Tue Jul 30, 2019 4:59 pm

dragontamer5788 wrote: ↑Tue Jul 30, 2019 6:48 am If you can manage to find a copy of "The Method of Paired Comparisons" by H. A. David, it will go deep into the Bradley-Terry model (which Elo's ranking system was based on). Elo was designed for ladders, not tournaments. The Bradley-Terry model was originally designed for one-shot tournaments where the skills of players are unlikely to change. Unfortunately, most people have forgotten the relationship between the Elo model and Bradley-Terry model.

That bayeselo program from Remi Coulom uses the Bradley-Terry model for estimating elos. Its major improvement over its predecessor (EloStat) is that its use of bayesian methods that improves elos estimates on extreme scores and such. More info can be found at this page. The Bradley-Terry model assumes players tend to overperform and also that there are no ties. The latter we have to address in this draft. Specifically the Davidson model was shown to be better with CCRL/CEGT data.

* I recall seeing a paper that tried to calculate the pairings which would potentially generate the biggest change to ratings. I believe it took the derivative of the confidence bounds and searched for the pairings that would generate the biggest drop in confidence bounds. Or something along those lines. I forgot the name of that paper however.

Ok that is interestng although what I want is the opposite, find parirings (matches) that will minimize elo changes (stabilize ratings). That is why i think the problem can be defined as UCB, where we try to maximize cumulative reward. Ofcourse the reward in our case is framed as minimize standard error of elo estimates (maybe dot producted with some prior distribution). I would rather do no matches between networks to be hones, A0 showed that you can get to a strong network without "gating" in the second paper so why waste reources on it. Plus unless you really invest time on it (exteme would be full-blown round-robin), the elo estimates are not reliable. With the incremental approach, my calculation shown in this plot says elo after 40,000 games is less than a random player with 0 games. However, I match head-to-head and the former is atleast 100 elo ...

Daniel Shawul · Post by **Daniel Shawul** » Tue Jul 30, 2019 5:14 pm

Michel wrote: ↑Tue Jul 30, 2019 1:08 pm
I am not too sure if I understand your question correctly.

If you only play against the last engine and don't do gating(!!) you still get an unbiased elo estimate. The problem however is that the variance of the estimator will become bigger and bigger (you are measuring
X_n-X_1 as (X_n-X_{n-1})+(X_{n-1}-X_{n-2})+...+(X_2-X_1)). So as an absolute elo estimate it becomes worthless.

I am not using gating since A0 showed you can do without it in their second paper. Maybe it is better to use it in theory assuming you have enough resources. Anyway my main qualm with the incremental approach is it sucks for a lack of a better word. Good to know it an unbiased estimate but the variance accumulates as you pointed out.

To get an elo estimate with fixed variance you should test against a fixed engine (or group of engines). But this then not good for comparing engines (it is well known that the variance of elo difference measured against
a 3d engine is 4 times as big as when measured in direct play).

Of course you can combine the two, i.e. occasionally playing against a fixed engine. This will serve as an anchor for the other results.

Say I measure against random player (ID-0) for a while but then i have to move the reference latter when the nets get stronger and all of them
trash ID-0. This will still be a moving target but a really slow one so that could be better.

Maybe there is something to optimize here, but it depends on the question.

Yes, I am just not sure how to define the target of optimization. The goal is somewhere in between "find the elo of the best network" and "find the elo of all networks". Anyway that is defined, we have to make sure we optimize allocation of resources (say you are given 10k games to evaluate 100 networks) to minimize the errors in the elo estimate of the networks -- possibly paired with a prior as we do not want to value all networks uniformly.
So come up with a function that generate match pairs given the outcomes so far which I think boils down to something similar to UCB i think.

Daniel Shawul · Post by **Daniel Shawul** » Tue Jul 30, 2019 5:20 pm

Laskos wrote: ↑Tue Jul 30, 2019 4:40 pm
Michel wrote: ↑Tue Jul 30, 2019 1:08 pm To get an elo estimate with fixed variance you should test against a fixed engine (or group of engines). But this then not good for comparing engines (it is well known that the variance of elo difference measured against
a 3d engine is 4 times as big as when measured in direct play).

Well, 2 times as big if he starts the procedure from the beginning of his line of nets. The issue might become whether that engine as an opponent is not peculiar in some ways and what really we want to measure ("strength", I suppose, but in relation to "something"). How the gating would look now? For example, would it be an orderly gating to require that each successive net performs better against "fixed opponents" than the previous net?

Yes, I think this issue is very serious. In my case ID-11 beat ID-10 and ID-12 with 200 games each, and its elo peaked really high. Even though the following networks are stronger than ID-11 it took really long for that to be shown on the plot since they were not directly paired with ID-11. There must be a better way to decide who to match to next (other than ID-x with ID-x+1) to minimize errors in elo estimates of the whole group within a certain specified total number of games.

dragontamer5788 · Post by **dragontamer5788** » Tue Jul 30, 2019 5:20 pm

Daniel Shawul wrote: ↑Tue Jul 30, 2019 4:59 pm Ok that is interestng although what I want is the opposite, find parirings (matches) that will minimize elo changes (stabilize ratings). That is why i think the problem can be defined as UCB, where we try to maximize cumulative reward. Ofcourse the reward in our case is framed as minimize standard error of elo estimates (maybe dot producted with some prior distribution). I would rather do no matches between networks to be hones, A0 showed that you can get to a strong network without "gating" in the second paper so why waste reources on it. Plus unless you really invest time on it (exteme would be full-blown round-robin), the elo estimates are not reliable. With the incremental approach, my calculation shown in this plot says elo after 40,000 games is less than a random player with 0 games. However, I match head-to-head and the former is atleast 100 elo ...

That graph looks... wrong.

After 40,000 games, it seems like your engine is rated -300 Elo or so, representing a win-loss ratio 15%. That is, your engine allegedly won 6000 out of 40,000 games.

After 70,000 games, your engine is rated +515 Elo, representing a win-loss ratio of 95%. That is, your engine is now estimated to have won 66,500 / 70,000 games.

I don't understand how your statistics are such that your engine gained 60,000 wins in just 30,000 games. This is mathematically impossible. You should be tallying the win/loss conditions across all matches. The variance of the proportion should decrease the more matches you play. At least, that's how Bradley-Terry MM algorithms work.

I don't know what algorithm you're using, but it isn't Bradley Terry Minimum-Maximization. The link you shared links to this paper, which has a good discussion: http://personal.psu.edu/drh20/papers/bt.pdf . I haven't read the paper all the way through yet, but it seems to be talking about the methodology I'm pushing for.

Daniel Shawul · Post by **Daniel Shawul** » Tue Jul 30, 2019 5:24 pm

dragontamer5788 wrote: ↑Tue Jul 30, 2019 5:20 pm
Daniel Shawul wrote: ↑Tue Jul 30, 2019 4:59 pm Ok that is interestng although what I want is the opposite, find parirings (matches) that will minimize elo changes (stabilize ratings). That is why i think the problem can be defined as UCB, where we try to maximize cumulative reward. Ofcourse the reward in our case is framed as minimize standard error of elo estimates (maybe dot producted with some prior distribution). I would rather do no matches between networks to be hones, A0 showed that you can get to a strong network without "gating" in the second paper so why waste reources on it. Plus unless you really invest time on it (exteme would be full-blown round-robin), the elo estimates are not reliable. With the incremental approach, my calculation shown in this plot says elo after 40,000 games is less than a random player with 0 games. However, I match head-to-head and the former is atleast 100 elo ...
That graph looks... wrong.

After 40,000 games, it seems like your engine is rated -300 Elo or so, representing a win-loss ratio 15%.

After 70,000 games, your engine is rated +515 Elo, representing a win-loss ratio of 95%. This is mathematically impossible.

It looks wrong i agree, but is not impossible because the variances are very high. The absurd way to prove the point would be to use just 1 game to estimate elo between ID-x and ID-x+1. I am making 200 games between ID-x and x+1 which is still not that great, but i have way too many nets. Lc0 only have 2 networks by the end of the 60000 games mark while i have maybe 80 nets.

Here is the full ratings table with error bars:

http://scorpiozero.ddns.net/scorpiozero ... atings.txt

I do not know how lc0 calculates elos (just incrementing elo difference of ID x+1 with iD x) or maybe they recalcualte elos of all players from scratch
using all match games like I did.

dragontamer5788 · Post by **dragontamer5788** » Tue Jul 30, 2019 5:25 pm

Daniel Shawul wrote: ↑Tue Jul 30, 2019 5:24 pm
dragontamer5788 wrote: ↑Tue Jul 30, 2019 5:20 pm
Daniel Shawul wrote: ↑Tue Jul 30, 2019 4:59 pm Ok that is interestng although what I want is the opposite, find parirings (matches) that will minimize elo changes (stabilize ratings). That is why i think the problem can be defined as UCB, where we try to maximize cumulative reward. Ofcourse the reward in our case is framed as minimize standard error of elo estimates (maybe dot producted with some prior distribution). I would rather do no matches between networks to be hones, A0 showed that you can get to a strong network without "gating" in the second paper so why waste reources on it. Plus unless you really invest time on it (exteme would be full-blown round-robin), the elo estimates are not reliable. With the incremental approach, my calculation shown in this plot says elo after 40,000 games is less than a random player with 0 games. However, I match head-to-head and the former is atleast 100 elo ...
That graph looks... wrong.

After 40,000 games, it seems like your engine is rated -300 Elo or so, representing a win-loss ratio 15%.

After 70,000 games, your engine is rated +515 Elo, representing a win-loss ratio of 95%. This is mathematically impossible.
It looks wrong i agree, but is not impossible because the variances are very high. The absurd way to prove the point would be to use just 1 game to estimate elo between ID-x and ID-x+1. I am making 200 games between ID-x and x+1 which is still not that great, but i have way too many nets. Lc0 only have 2 networks by the end of the 60000 games mark while i have maybe 80 nets.

EDIT: Hmm.... lemme think a bit more.

The absurd way to prove the point would be to use just 1 game to estimate elo between ID-x and ID-x+1

Hmm, that's a bit more of a hint, but still seems unlikely to me. The win/loss across all nets will affect everyone else's score in Bradley Terry Minorization–Maximization.

If ID_x and ID_x+1 only have one match, then their statistics would be mostly derived from all of the other games they've played if you're doing the MM algorithm correctly.

best way to determine elos of a group

best way to determine elos of a group

Re: best way to determine elos of a group

Re: best way to determine elos of a group

Re: best way to determine elos of a group

Re: best way to determine elos of a group

Re: best way to determine elos of a group

Re: best way to determine elos of a group

Re: best way to determine elos of a group

Re: best way to determine elos of a group

Re: best way to determine elos of a group