best way to determine elos of a group
Moderators: hgm, Harvey Williamson, bob
Forum rules
This textbox is used to restore diagrams posted with the [d] tag before the upgrade.
This textbox is used to restore diagrams posted with the [d] tag before the upgrade.

 Posts: 3746
 Joined: Tue Mar 14, 2006 10:34 am
 Location: Ethiopia
 Contact:
best way to determine elos of a group
I am sure you are all familiar with selfplay elo graphs of lc0 are considered not so reliable ( sometimes called its "ego" due to its inflated value ).
I think this is mostly because each network is being tested against the previous network  maybe they do more but for this discussion lets stick with
testing against previous network only. So in essence elo is assumed to be incremental. The elo of neworks goes like +50, +40, 20, +30, +25, ... etc
I faced this sort of problem when trying to evaluate my networks. For my case roughly 60 times more nets are produced comapred to lc0's.
So the time I spent evaluating networks takes a significant portion of my resources. Using the lc0 approach for the first 12 networks calculated with bopo (my implementaton of bayeselo algorithms) can be seen here
http://scorpiozero.ddns.net/scorpiozero ... atings.txt
When i got to 160 networks in under 80k games, I decided that I should sample networks and match them to possible fit a curve for all nets without actually playing net ID x with x+1. This thought led me to some interesting questions on maximizing resources for estimation of elo of a group. My question simply put is: what is the best way to match the nets to produce a reliable elo estimate ? The incremental testing resulted in "ego" so there must be a better way to approach the problem of determining elos with few games as possible. I know this is not a very precise question for mathematicians but if you get the idea please frame the question the way you like it. I am thinking of @Michel here.
This problem also seems to be related to Upper confidence bound (UCB) formula  specifically for perft estimation! Where should you spend your match games to minimize the standard errors of elo estimates of all networks. We had fun with "perft estimation" some years ago that showed that this problem
is different from standard UCB as used in tree search where the goal is to find the best arm. If the problem was to find the best network, the standard UCT approach will zoom in on the latest networks and minimize the errors in their elo esimates.
Should the goal even be to determine elos for all networks? Also determining the best network is probably easy without even doing matches so I think the goal should be somewhere in between. We want to detect when things are going bad so the goal shouldn't be to find only the best net which is probably trivial anyway.
regards,
Daniel
I think this is mostly because each network is being tested against the previous network  maybe they do more but for this discussion lets stick with
testing against previous network only. So in essence elo is assumed to be incremental. The elo of neworks goes like +50, +40, 20, +30, +25, ... etc
I faced this sort of problem when trying to evaluate my networks. For my case roughly 60 times more nets are produced comapred to lc0's.
So the time I spent evaluating networks takes a significant portion of my resources. Using the lc0 approach for the first 12 networks calculated with bopo (my implementaton of bayeselo algorithms) can be seen here
http://scorpiozero.ddns.net/scorpiozero ... atings.txt
When i got to 160 networks in under 80k games, I decided that I should sample networks and match them to possible fit a curve for all nets without actually playing net ID x with x+1. This thought led me to some interesting questions on maximizing resources for estimation of elo of a group. My question simply put is: what is the best way to match the nets to produce a reliable elo estimate ? The incremental testing resulted in "ego" so there must be a better way to approach the problem of determining elos with few games as possible. I know this is not a very precise question for mathematicians but if you get the idea please frame the question the way you like it. I am thinking of @Michel here.
This problem also seems to be related to Upper confidence bound (UCB) formula  specifically for perft estimation! Where should you spend your match games to minimize the standard errors of elo estimates of all networks. We had fun with "perft estimation" some years ago that showed that this problem
is different from standard UCB as used in tree search where the goal is to find the best arm. If the problem was to find the best network, the standard UCT approach will zoom in on the latest networks and minimize the errors in their elo esimates.
Should the goal even be to determine elos for all networks? Also determining the best network is probably easy without even doing matches so I think the goal should be somewhere in between. We want to detect when things are going bad so the goal shouldn't be to find only the best net which is probably trivial anyway.
regards,
Daniel

 Posts: 98
 Joined: Thu Jun 06, 2019 6:05 pm
 Full name: Percival Tiglao
Re: best way to determine elos of a group
If you can manage to find a copy of "The Method of Paired Comparisons" by H. A. David, it will go deep into the BradleyTerry model (which Elo's ranking system was based on). Elo was designed for ladders, not tournaments. The BradleyTerry model was originally designed for oneshot tournaments where the skills of players are unlikely to change. Unfortunately, most people have forgotten the relationship between the Elo model and BradleyTerry model.
My particular copy of the book is the 1988 version.
Chapter 4 covers the "Linear Model" (particularly BradleyTerry) in depth, including confidence intervals and tests of the null hypothesis.
"Little Pi _ij" stands for "probability that i beats j", and the Bradley Terry model is a regressionlike methodology to linearize the probabilities and put them all in relative terms. "Little Pi" is the true probabilities, while the methodology discussed in those pages calculates "p" (an estimate of littlepi).
I think that covers the main gist of the algorithm, and its confidence bounds. "p" is basically the BradleyTerry score (aka: Elo) in the equations above.
The equation 4.3.4 on page 62 (with the addendum on page 63) is the most important equation in the book. It is how to calculate "p" (the estimate of Little Pi) through an iterative methodology. As long as all players have lost at least one time (you may add a "half loss" of 0.5 to all scores to ensure this effect), you will continuously get moreandmore precise terms each iteration. The series diverges however if an opponent has never lost (that opponent will reach infinity). Page 63 describes it as follows:

Notes:
* Converting from "win probability" to "Elo score" is roughly EloScore = Log(p_ij/p_ji) * 400 (where p_ij is the probability i beats j). In chess, a draw would be considered a halfwin to both players (+ 0.5).
* "a_ij" is the number of times "i was preferred to j", or "i beat j" in a game.
* I recall seeing a paper that tried to calculate the pairings which would potentially generate the biggest change to ratings. I believe it took the derivative of the confidence bounds and searched for the pairings that would generate the biggest drop in confidence bounds. Or something along those lines. I forgot the name of that paper however.
My particular copy of the book is the 1988 version.
Chapter 4 covers the "Linear Model" (particularly BradleyTerry) in depth, including confidence intervals and tests of the null hypothesis.
"Little Pi _ij" stands for "probability that i beats j", and the Bradley Terry model is a regressionlike methodology to linearize the probabilities and put them all in relative terms. "Little Pi" is the true probabilities, while the methodology discussed in those pages calculates "p" (an estimate of littlepi).
I think that covers the main gist of the algorithm, and its confidence bounds. "p" is basically the BradleyTerry score (aka: Elo) in the equations above.
The equation 4.3.4 on page 62 (with the addendum on page 63) is the most important equation in the book. It is how to calculate "p" (the estimate of Little Pi) through an iterative methodology. As long as all players have lost at least one time (you may add a "half loss" of 0.5 to all scores to ensure this effect), you will continuously get moreandmore precise terms each iteration. The series diverges however if an opponent has never lost (that opponent will reach infinity). Page 63 describes it as follows:
Chapter 5 discusses some tournament designs, starting with Round Robin and various cycles. But in my experience, the equation 4.3.4 is extremely flexible. Play at least as many matches needed to get 4.3.4 to converge, and keep playing until your confidence intervals are tight enough to draw the conclusions you need.The convergence of the iterative process has been established by Zermelo and Ford in this general case, provided the following assumption holds:
"In every possible partition of the objects into two nonempty subsets, some object in the second set has been preferred [Note: Has won a match] at least once to some object in the first set".

Notes:
* Converting from "win probability" to "Elo score" is roughly EloScore = Log(p_ij/p_ji) * 400 (where p_ij is the probability i beats j). In chess, a draw would be considered a halfwin to both players (+ 0.5).
* "a_ij" is the number of times "i was preferred to j", or "i beat j" in a game.
* I recall seeing a paper that tried to calculate the pairings which would potentially generate the biggest change to ratings. I believe it took the derivative of the confidence bounds and searched for the pairings that would generate the biggest drop in confidence bounds. Or something along those lines. I forgot the name of that paper however.
Re: best way to determine elos of a group
I am not too sure if I understand your question correctly.Daniel Shawul wrote: ↑Tue Jul 30, 2019 2:53 amI am sure you are all familiar with selfplay elo graphs of lc0 are considered not so reliable ( sometimes called its "ego" due to its inflated value ).
I think this is mostly because each network is being tested against the previous network  maybe they do more but for this discussion lets stick with
testing against previous network only. So in essence elo is assumed to be incremental. The elo of neworks goes like +50, +40, 20, +30, +25, ... etc
I faced this sort of problem when trying to evaluate my networks. For my case roughly 60 times more nets are produced comapred to lc0's.
So the time I spent evaluating networks takes a significant portion of my resources. Using the lc0 approach for the first 12 networks calculated with bopo (my implementaton of bayeselo algorithms) can be seen here
http://scorpiozero.ddns.net/scorpiozero ... atings.txt
When i got to 160 networks in under 80k games, I decided that I should sample networks and match them to possible fit a curve for all nets without actually playing net ID x with x+1. This thought led me to some interesting questions on maximizing resources for estimation of elo of a group. My question simply put is: what is the best way to match the nets to produce a reliable elo estimate ? The incremental testing resulted in "ego" so there must be a better way to approach the problem of determining elos with few games as possible. I know this is not a very precise question for mathematicians but if you get the idea please frame the question the way you like it. I am thinking of @Michel here.
This problem also seems to be related to Upper confidence bound (UCB) formula  specifically for perft estimation! Where should you spend your match games to minimize the standard errors of elo estimates of all networks. We had fun with "perft estimation" some years ago that showed that this problem
is different from standard UCB as used in tree search where the goal is to find the best arm. If the problem was to find the best network, the standard UCT approach will zoom in on the latest networks and minimize the errors in their elo esimates.
Should the goal even be to determine elos for all networks? Also determining the best network is probably easy without even doing matches so I think the goal should be somewhere in between. We want to detect when things are going bad so the goal shouldn't be to find only the best net which is probably trivial anyway.
regards,
Daniel
If you only play against the last engine and don't do gating(!!) you still get an unbiased elo estimate. The problem however is that the variance of the estimator will become bigger and bigger (you are measuring
X_nX_1 as (X_nX_{n1})+(X_{n1}X_{n2})+...+(X_2X_1)). So as an absolute elo estimate it becomes worthless.
To get an elo estimate with fixed variance you should test against a fixed engine (or group of engines). But this then not good for comparing engines (it is well known that the variance of elo difference measured against
a 3d engine is 4 times as big as when measured in direct play).
Of course you can combine the two, i.e. occasionally playing against a fixed engine. This will serve as an anchor for the other results.
Maybe there is something to optimize here, but it depends on the question.
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.
Without ideas there is nothing to simplify.
Re: best way to determine elos of a group
Well, 2 times as big if he starts the procedure from the beginning of his line of nets. The issue might become whether that engine as an opponent is not peculiar in some ways and what really we want to measure ("strength", I suppose, but in relation to "something"). How the gating would look now? For example, would it be an orderly gating to require that each successive net performs better against "fixed opponents" than the previous net?Michel wrote: ↑Tue Jul 30, 2019 11:08 amTo get an elo estimate with fixed variance you should test against a fixed engine (or group of engines). But this then not good for comparing engines (it is well known that the variance of elo difference measured against
a 3d engine is 4 times as big as when measured in direct play).

 Posts: 3746
 Joined: Tue Mar 14, 2006 10:34 am
 Location: Ethiopia
 Contact:
Re: best way to determine elos of a group
That bayeselo program from Remi Coulom uses the BradleyTerry model for estimating elos. Its major improvement over its predecessor (EloStat) is that its use of bayesian methods that improves elos estimates on extreme scores and such. More info can be found at this page. The BradleyTerry model assumes players tend to overperform and also that there are no ties. The latter we have to address in this draft. Specifically the Davidson model was shown to be better with CCRL/CEGT data.dragontamer5788 wrote: ↑Tue Jul 30, 2019 4:48 amIf you can manage to find a copy of "The Method of Paired Comparisons" by H. A. David, it will go deep into the BradleyTerry model (which Elo's ranking system was based on). Elo was designed for ladders, not tournaments. The BradleyTerry model was originally designed for oneshot tournaments where the skills of players are unlikely to change. Unfortunately, most people have forgotten the relationship between the Elo model and BradleyTerry model.
Ok that is interestng although what I want is the opposite, find parirings (matches) that will minimize elo changes (stabilize ratings). That is why i think the problem can be defined as UCB, where we try to maximize cumulative reward. Ofcourse the reward in our case is framed as minimize standard error of elo estimates (maybe dot producted with some prior distribution). I would rather do no matches between networks to be hones, A0 showed that you can get to a strong network without "gating" in the second paper so why waste reources on it. Plus unless you really invest time on it (exteme would be fullblown roundrobin), the elo estimates are not reliable. With the incremental approach, my calculation shown in this plot says elo after 40,000 games is less than a random player with 0 games. However, I match headtohead and the former is atleast 100 elo ...* I recall seeing a paper that tried to calculate the pairings which would potentially generate the biggest change to ratings. I believe it took the derivative of the confidence bounds and searched for the pairings that would generate the biggest drop in confidence bounds. Or something along those lines. I forgot the name of that paper however.

 Posts: 3746
 Joined: Tue Mar 14, 2006 10:34 am
 Location: Ethiopia
 Contact:
Re: best way to determine elos of a group
I am not using gating since A0 showed you can do without it in their second paper. Maybe it is better to use it in theory assuming you have enough resources. Anyway my main qualm with the incremental approach is it sucks for a lack of a better word. Good to know it an unbiased estimate but the variance accumulates as you pointed out.Michel wrote: ↑Tue Jul 30, 2019 11:08 am
I am not too sure if I understand your question correctly.
If you only play against the last engine and don't do gating(!!) you still get an unbiased elo estimate. The problem however is that the variance of the estimator will become bigger and bigger (you are measuring
X_nX_1 as (X_nX_{n1})+(X_{n1}X_{n2})+...+(X_2X_1)). So as an absolute elo estimate it becomes worthless.
Say I measure against random player (ID0) for a while but then i have to move the reference latter when the nets get stronger and all of themTo get an elo estimate with fixed variance you should test against a fixed engine (or group of engines). But this then not good for comparing engines (it is well known that the variance of elo difference measured against
a 3d engine is 4 times as big as when measured in direct play).
Of course you can combine the two, i.e. occasionally playing against a fixed engine. This will serve as an anchor for the other results.
trash ID0. This will still be a moving target but a really slow one so that could be better.
Yes, I am just not sure how to define the target of optimization. The goal is somewhere in between "find the elo of the best network" and "find the elo of all networks". Anyway that is defined, we have to make sure we optimize allocation of resources (say you are given 10k games to evaluate 100 networks) to minimize the errors in the elo estimate of the networks  possibly paired with a prior as we do not want to value all networks uniformly.Maybe there is something to optimize here, but it depends on the question.
So come up with a function that generate match pairs given the outcomes so far which I think boils down to something similar to UCB i think.

 Posts: 3746
 Joined: Tue Mar 14, 2006 10:34 am
 Location: Ethiopia
 Contact:
Re: best way to determine elos of a group
Yes, I think this issue is very serious. In my case ID11 beat ID10 and ID12 with 200 games each, and its elo peaked really high. Even though the following networks are stronger than ID11 it took really long for that to be shown on the plot since they were not directly paired with ID11. There must be a better way to decide who to match to next (other than IDx with IDx+1) to minimize errors in elo estimates of the whole group within a certain specified total number of games.Laskos wrote: ↑Tue Jul 30, 2019 2:40 pmWell, 2 times as big if he starts the procedure from the beginning of his line of nets. The issue might become whether that engine as an opponent is not peculiar in some ways and what really we want to measure ("strength", I suppose, but in relation to "something"). How the gating would look now? For example, would it be an orderly gating to require that each successive net performs better against "fixed opponents" than the previous net?Michel wrote: ↑Tue Jul 30, 2019 11:08 amTo get an elo estimate with fixed variance you should test against a fixed engine (or group of engines). But this then not good for comparing engines (it is well known that the variance of elo difference measured against
a 3d engine is 4 times as big as when measured in direct play).

 Posts: 98
 Joined: Thu Jun 06, 2019 6:05 pm
 Full name: Percival Tiglao
Re: best way to determine elos of a group
That graph looks... wrong.Daniel Shawul wrote: ↑Tue Jul 30, 2019 2:59 pmOk that is interestng although what I want is the opposite, find parirings (matches) that will minimize elo changes (stabilize ratings). That is why i think the problem can be defined as UCB, where we try to maximize cumulative reward. Ofcourse the reward in our case is framed as minimize standard error of elo estimates (maybe dot producted with some prior distribution). I would rather do no matches between networks to be hones, A0 showed that you can get to a strong network without "gating" in the second paper so why waste reources on it. Plus unless you really invest time on it (exteme would be fullblown roundrobin), the elo estimates are not reliable. With the incremental approach, my calculation shown in this plot says elo after 40,000 games is less than a random player with 0 games. However, I match headtohead and the former is atleast 100 elo ...
After 40,000 games, it seems like your engine is rated 300 Elo or so, representing a winloss ratio 15%. That is, your engine allegedly won 6000 out of 40,000 games.
After 70,000 games, your engine is rated +515 Elo, representing a winloss ratio of 95%. That is, your engine is now estimated to have won 66,500 / 70,000 games.
I don't understand how your statistics are such that your engine gained 60,000 wins in just 30,000 games. This is mathematically impossible. You should be tallying the win/loss conditions across all matches. The variance of the proportion should decrease the more matches you play. At least, that's how BradleyTerry MM algorithms work.
I don't know what algorithm you're using, but it isn't Bradley Terry MinimumMaximization. The link you shared links to this paper, which has a good discussion: http://personal.psu.edu/drh20/papers/bt.pdf . I haven't read the paper all the way through yet, but it seems to be talking about the methodology I'm pushing for.
Last edited by dragontamer5788 on Tue Jul 30, 2019 3:24 pm, edited 1 time in total.

 Posts: 3746
 Joined: Tue Mar 14, 2006 10:34 am
 Location: Ethiopia
 Contact:
Re: best way to determine elos of a group
It looks wrong i agree, but is not impossible because the variances are very high. The absurd way to prove the point would be to use just 1 game to estimate elo between IDx and IDx+1. I am making 200 games between IDx and x+1 which is still not that great, but i have way too many nets. Lc0 only have 2 networks by the end of the 60000 games mark while i have maybe 80 nets.dragontamer5788 wrote: ↑Tue Jul 30, 2019 3:20 pmThat graph looks... wrong.Daniel Shawul wrote: ↑Tue Jul 30, 2019 2:59 pmOk that is interestng although what I want is the opposite, find parirings (matches) that will minimize elo changes (stabilize ratings). That is why i think the problem can be defined as UCB, where we try to maximize cumulative reward. Ofcourse the reward in our case is framed as minimize standard error of elo estimates (maybe dot producted with some prior distribution). I would rather do no matches between networks to be hones, A0 showed that you can get to a strong network without "gating" in the second paper so why waste reources on it. Plus unless you really invest time on it (exteme would be fullblown roundrobin), the elo estimates are not reliable. With the incremental approach, my calculation shown in this plot says elo after 40,000 games is less than a random player with 0 games. However, I match headtohead and the former is atleast 100 elo ...
After 40,000 games, it seems like your engine is rated 300 Elo or so, representing a winloss ratio 15%.
After 70,000 games, your engine is rated +515 Elo, representing a winloss ratio of 95%. This is mathematically impossible.
Here is the full ratings table with error bars:
http://scorpiozero.ddns.net/scorpiozero ... atings.txt
I do not know how lc0 calculates elos (just incrementing elo difference of ID x+1 with iD x) or maybe they recalcualte elos of all players from scratch
using all match games like I did.
Last edited by Daniel Shawul on Tue Jul 30, 2019 3:28 pm, edited 1 time in total.

 Posts: 98
 Joined: Thu Jun 06, 2019 6:05 pm
 Full name: Percival Tiglao
Re: best way to determine elos of a group
EDIT: Hmm.... lemme think a bit more.Daniel Shawul wrote: ↑Tue Jul 30, 2019 3:24 pmIt looks wrong i agree, but is not impossible because the variances are very high. The absurd way to prove the point would be to use just 1 game to estimate elo between IDx and IDx+1. I am making 200 games between IDx and x+1 which is still not that great, but i have way too many nets. Lc0 only have 2 networks by the end of the 60000 games mark while i have maybe 80 nets.dragontamer5788 wrote: ↑Tue Jul 30, 2019 3:20 pmThat graph looks... wrong.Daniel Shawul wrote: ↑Tue Jul 30, 2019 2:59 pmOk that is interestng although what I want is the opposite, find parirings (matches) that will minimize elo changes (stabilize ratings). That is why i think the problem can be defined as UCB, where we try to maximize cumulative reward. Ofcourse the reward in our case is framed as minimize standard error of elo estimates (maybe dot producted with some prior distribution). I would rather do no matches between networks to be hones, A0 showed that you can get to a strong network without "gating" in the second paper so why waste reources on it. Plus unless you really invest time on it (exteme would be fullblown roundrobin), the elo estimates are not reliable. With the incremental approach, my calculation shown in this plot says elo after 40,000 games is less than a random player with 0 games. However, I match headtohead and the former is atleast 100 elo ...
After 40,000 games, it seems like your engine is rated 300 Elo or so, representing a winloss ratio 15%.
After 70,000 games, your engine is rated +515 Elo, representing a winloss ratio of 95%. This is mathematically impossible.
Hmm, that's a bit more of a hint, but still seems unlikely to me. The win/loss across all nets will affect everyone else's score in Bradley Terry Minorization–Maximization.The absurd way to prove the point would be to use just 1 game to estimate elo between IDx and IDx+1
If ID_x and ID_x+1 only have one match, then their statistics would be mostly derived from all of the other games they've played if you're doing the MM algorithm correctly.