future of top engines:how much more elo?

Discussion of anything and everything relating to chess playing software and machines.

Moderators: hgm, Rebel, chrisw

Zenmastur
Posts: 919
Joined: Sat May 31, 2014 8:28 am

Re: future of top engines:how much more elo?

Post by Zenmastur »

dragontamer5788 wrote: Mon Jul 22, 2019 3:45 pm
Ovyron wrote: Mon Jul 22, 2019 10:56 am Isn't this a problem with the elo system itself, though? Suppose Magnus Carlsen is rated 2800 and there's a chess engine that beats him 100% of the time, what rating do you give it?
Under the original Bradley Terry system that Elo is based on, you assume that the "losing" player has one half-win out of the whole bunch.
So if you played 50 games and the computer won 50 times out of 50, you calculate a score based on 50/50.5. That is equivalent to 100-to-1 odds, or roughly +800 Elo. If it were instead 5000 games and the computer won all 5000, the score is based on 5000/5000.5, i.e. 10,000-to-1 odds (roughly +1600 Elo).

Under a system I explored (unpublished), you create acyclic graphs between players and assign the hypothetical computer "infinitely" more score than Magnus Carlsen. You can only create a Bradley-Terry (or Elo) score between players who have a loss or draw against each other; all cycles are resolved with Bradley-Terry (or Elo), while the acyclic parts / trees are ordered by a topological sort.

The fundamental assumption of Bradley-Terry (or Elo) is that the win/loss graph is a statistic, and therefore has a chance to be wrong. If the system never observes a loss, you have a "division by zero" situation. The actual division by zero can be avoided by topological sorts + graph theory, but the original papers describing the system use the +0.5 credit for the loser, which seems to work well in practice.
Hmmm....

Not sure I understand what you are saying. Say I have a 10,000-game match between SF (search depth 10) and a random player, and SF won all the games. If I set the random player to Elo = 0, are you saying that I can calculate SF's Elo by assuming it won 10,000 games out of 10,000.5?

If this is what you are saying, then what would be the error margins on such a calculation? I.e., how would the error bands be calculated? Using some form of Poisson distribution?

Regards,

Zenmastur
Only 2 defining forces have ever offered to die for you.....Jesus Christ and the American Soldier. One died for your soul, the other for your freedom.
dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: future of top engines:how much more elo?

Post by dragontamer5788 »

Zenmastur wrote: Mon Jul 22, 2019 4:39 pm
Hmmm....

Not sure I understand what you are saying. Say I have a 10,000-game match between SF (search depth 10) and a random player, and SF won all the games. If I set the random player to Elo = 0, are you saying that I can calculate SF's Elo by assuming it won 10,000 games out of 10,000.5?

If this is what you are saying, then what would be the error margins on such a calculation? I.e., how would the error bands be calculated? Using some form of Poisson distribution?

Regards,

Zenmastur
I've only studied elementary statistics. But standard error of a proportion is sqrt(p * (1-p) / N), where p is the proportion, and N is the number of samples. As such, the standard error of 10,000 / 10,000 games won is sqrt(0 * 1 / 10,000) == 0.

Obviously, statistics don't handle "100%" or "0%" conditions very well, which is why the +0.5 helps out. There's no rigorous theory behind the +0.5 methodology, but it does let you avoid the "division by zero" and carry on.
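A minimal sketch of both calculations (my own illustration, assuming the logistic Elo formula, Elo = 400 * log10(odds)):

```python
import math

def elo_plus_half(wins, games):
    """Elo difference with the +0.5 correction: the loser is credited
    half a win, so a shutout gives odds of wins : 0.5 rather than
    a division by zero."""
    odds = wins / (games - wins + 0.5)
    return 400 * math.log10(odds)

def proportion_std_error(p, n):
    """Naive standard error of a proportion over n games;
    note it collapses to 0 at p = 0 or p = 1."""
    return math.sqrt(p * (1 - p) / n)
```

For example, elo_plus_half(50, 50) gives 800 and elo_plus_half(5000, 5000) gives 1600, matching the numbers above, while proportion_std_error(1.0, 10000) is exactly 0.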

----------

I developed the acyclic graph + topological sort methodology to better handle this case. I haven't researched the subject very much; maybe some other mathematician has already discovered it. In any case, how do you even assign a probability if you don't have a single sample of a loss?

Does winning 10,000 out of 10,000 games mean you have a 100% chance of winning (acyclic graphs), or does it mean you have a 99.995% chance of winning (+0.5 method)? I dunno, but that's the joy of math. You take an assumption, then run with it. No one ever said math had to be perfect. Make a decision, then math out the results. It doesn't really matter at the end of the day, as long as your decisions are consistent.
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: future of top engines:how much more elo?

Post by Laskos »

Zenmastur wrote: Mon Jul 22, 2019 4:39 pm
Hmmm....

Not sure I understand what you are saying. Say I have a 10,000-game match between SF (search depth 10) and a random player, and SF won all the games. If I set the random player to Elo = 0, are you saying that I can calculate SF's Elo by assuming it won 10,000 games out of 10,000.5?

If this is what you are saying, then what would be the error margins on such a calculation? I.e., how would the error bands be calculated? Using some form of Poisson distribution?

Regards,

Zenmastur
Thinking about it a bit: you want the Elo difference (between SF and the random player) and the error margins (confidence interval), and often their ratio, or t-value (that's how the significance of the result is usually derived). For the significance of the result we have something better, which is independent of the rating (ranking) scheme: the Likelihood Of Superiority, or LOS, with a uniform or "uninformed" prior. It is defined independently of the ranking scheme and has none of the pathologies that the Elo difference and the standard deviation suffer from.
The LOS for a 0 : 10,000 result is:

LOS = 2.5 * 10^( -3011 )

This is the likelihood that the random player is stronger than SF after a 0 : 10,000 result.
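That number can be reproduced with a short sketch. My reading of the uniform-prior LOS (an assumption, not spelled out in the post): after n wins and no losses or draws, the posterior on the winner's per-game score is Beta(n+1, 1), so LOS = (1/2)^(n+1). The raw float underflows for large n, so the value is returned as mantissa and exponent:

```python
import math

def los_after_shutout(n):
    """LOS that the shut-out player is actually stronger after an n : 0
    result, under a uniform prior: (1/2)**(n+1). Returned as
    (mantissa, exponent) since the float itself underflows for large n."""
    log10_los = (n + 1) * math.log10(0.5)
    exp = math.floor(log10_los)
    return 10 ** (log10_los - exp), exp  # LOS = mantissa * 10**exp
```

los_after_shutout(10000) gives roughly 2.5 x 10^-3011, matching the value above.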

This gives the p-value. To derive the ratio (Elo difference) / (Elo 1 standard deviation), or t-value, we need the rating model. The two most popular in chess are the Logistic Elo Model and the Gaussian Elo Model.

With a 10,000 : 0 result, the ratios are the following:

For Logistic Elo Model:
(Elo difference) / (Elo 1 standard deviation) = 4258

For Gaussian Elo Model:
(Elo difference) / (Elo 1 standard deviation) = 118

So, we don't have an Elo difference, we don't have the standard deviation, but we have their ratio. And you see that the result is significant :lol:.
Robert Pope
Posts: 558
Joined: Sat Mar 25, 2006 8:27 pm

Re: future of top engines:how much more elo?

Post by Robert Pope »

dragontamer5788 wrote: Mon Jul 22, 2019 5:24 pm Obviously, statistics don't handle "100%" or "0%" conditions very well, which is why the +0.5 helps out. There's no rigorous theory behind the +0.5 methodology, but it does let you avoid the "division by zero" and carry on.
It lets you carry on, but that doesn't mean the results are anywhere near correct for an "always winner":

Win%        Elo diff
0.9              382
0.99             798
0.999           1200
0.9999          1600
0.99999         2000
0.999999        2400

So you could look at a 500 game match and say that they are 1600 elo stronger. Then after playing 500,000 games you find your original estimate was off by 800+ elo!
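The table above is just the logistic Elo formula applied to each win rate; a one-liner to check it:

```python
import math

def elo_from_winrate(p):
    """Elo difference implied by a score fraction p under the
    logistic Elo model: 400 * log10 of the odds p : (1 - p)."""
    return 400 * math.log10(p / (1 - p))

# e.g. elo_from_winrate(0.99) is about 798, elo_from_winrate(0.9999) about 1600
```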

If a player can be good enough to always force a win, there is no limit on elo. So the question becomes one of how well can the weaker programs force draws.
dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: future of top engines:how much more elo?

Post by dragontamer5788 »

Robert Pope wrote: Mon Jul 22, 2019 7:42 pm
dragontamer5788 wrote: Mon Jul 22, 2019 5:24 pm Obviously, statistics don't handle "100%" or "0%" conditions very well, which is why the +0.5 helps out. There's no rigorous theory behind the +0.5 methodology, but it does let you avoid the "division by zero" and carry on.
It lets you carry on, but that doesn't mean the results are anywhere near correct for an "always winner":

Win%        Elo diff
0.9              382
0.99             798
0.999           1200
0.9999          1600
0.99999         2000
0.999999        2400

So you could look at a 500 game match and say that they are 1600 elo stronger. Then after playing 500,000 games you find your original estimate was off by 800+ elo!

If a player can be good enough to always force a win, there is no limit on elo. So the question becomes one of how well can the weaker programs force draws.
Well, yes. That's why I prefer the acyclic graphs with topological sorting, though that methodology has a problem at the opposite end, when only one game has been played.

If after 500 games, the player is still undefeated, the acyclic / topological sort methodology would assume "infinite" elo. After 500,000 games, the player finally loses and you learn that the Elo isn't "infinity", but in fact, is "only" +2400 elo.

Obviously, the acyclic + topological sort methodology is poorly behaved in the case of one win / zero losses / zero draws, while the "+0.5" method handles that case reasonably well.

----------

How do you want to handle your division by zero? Do you want to under-estimate, or do you want to over-estimate? Because "infinite" is clearly the wrong answer, but +0.5 is clearly an underestimate in most cases. Personally speaking, I think there is elegance in calling it "infinite" Elo in the face of the evidence.

My statistics teacher, many decades ago, said that you should simply carry on with any experiment until you observe at least 10 successes and 10 failures, that is, if you want a valid statistical result. If you don't get 10+ successes and 10+ failures on a proportion, the math simply won't work out very well.

If you can't reach that minimum of 10, then you just have to live with weird, wonky math. In any case, the solution is to get more data. You can't squeeze blood out of a stone; garbage in, garbage out; whatever metaphor you want. If you don't have enough win/loss data, then you simply can't calculate good statistics.
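As an aside (a standard textbook tool, not something proposed in the thread): the Wilson score interval gives a non-degenerate error band even at 100% scores, unlike the naive sqrt(p(1-p)/N) formula discussed above:

```python
import math

def wilson_interval(wins, n, z=1.96):
    """Wilson score interval for a proportion at ~95% confidence.
    Stays informative at p = 0 or p = 1, where the naive standard
    error degenerates to zero."""
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half
```

For 10,000 wins out of 10,000 it returns roughly (0.9996, 1.0): the upper bound is pinned at 1, but the lower bound still says something useful about how sure we can be.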
Dann Corbit
Posts: 12542
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: future of top engines:how much more elo?

Post by Dann Corbit »

I think it is important to remember that the actual number at the top of a pile of opponents is completely arbitrary.
You can add 5000 to all of the Elo numbers if you want.
Or you can subtract 5000.
What is the maximum Elo number? Let's add one million to some pool of players and then I shall claim that the maximum Elo is over a million.

Elo has meaning in the differences, not the absolute numbers.
Can an engine reach 4000 Elo? Sure, TSCP can be 4000 Elo simply by adding the right offset to the pool of players.
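The point that only differences matter can be seen directly in the logistic expected-score formula, which depends only on the rating gap:

```python
def expected_score(r_a, r_b):
    """Expected score of player A against player B under the logistic
    Elo model. Adding any constant offset to both ratings changes nothing,
    since only the difference r_a - r_b enters the formula."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

# expected_score(2800, 2000) == expected_score(1_002_800, 1_002_000)
```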
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
User avatar
Ovyron
Posts: 4556
Joined: Tue Jul 03, 2007 4:30 am

Re: future of top engines:how much more elo?

Post by Ovyron »

At least in the case of Random vs. Stockfish specifically, it'd be interesting to continue playing games until random draws one. Remember there's nothing special about random: by chance it can actually play perfect chess in a given game, or even strongly perfect chess (play that makes a non-perfect opponent lose the fastest), and in a video I referenced recently, random is actually significantly stronger than other playing strategies. We'd just need to know how many games it needs for a draw.

I'd suggest using nodes for this, as it has been tested that the more nodes, the stronger the Stockfish (no "node 256 beating node 257" nonsense), so you'd start with Random vs. Stockfish at 1 node and check how many games are required to get a draw, then test Stockfish at 2 nodes, and so on.

I'd expect it to take a larger and larger number of games to get a draw, but after you'd covered several node counts, you could plot a line and extrapolate how many games normal Stockfish would need to play before failing to win one.

And if after 500000 games Stockfish 1 node beats random 100% of the time, then, I give up...
Your beliefs create your reality, so be careful what you wish for.
Uri Blass
Posts: 10314
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: future of top engines:how much more elo?

Post by Uri Blass »

Ovyron wrote: Tue Jul 23, 2019 2:06 am At least in the case of Random vs. Stockfish specifically, it'd be interesting to continue playing games until random draws one. Remember there's nothing special about random: by chance it can actually play perfect chess in a given game, or even strongly perfect chess (play that makes a non-perfect opponent lose the fastest), and in a video I referenced recently, random is actually significantly stronger than other playing strategies. We'd just need to know how many games it needs for a draw.

I'd suggest using nodes for this, as it has been tested that the more nodes, the stronger the Stockfish (no "node 256 beating node 257" nonsense), so you'd start with Random vs. Stockfish at 1 node and check how many games are required to get a draw, then test Stockfish at 2 nodes, and so on.

I'd expect it to take a larger and larger number of games to get a draw, but after you'd covered several node counts, you could plot a line and extrapolate how many games normal Stockfish would need to play before failing to win one.

And if after 500000 games Stockfish 1 node beats random 100% of the time, then, I give up...
I do not know if Stockfish can play with 1 node per move.
I believe that beating random 100% of the time over 500,000 games is, in practice, the expected result even for weak players.

Random may be lucky enough not to lose one game, but the probability is clearly smaller than 0.000001 against every computer opponent with no bugs (opponents with bugs may draw by repetition or give stalemate for no reason, not because of a combination by the random player).

It may be interesting to know which is the weakest program in the CCRL rating list that can beat the random player 500,000-0.

The random player got points against buggy engines that CCRL tests; I think it would be better simply not to allow them in the rating list.

https://ccrl.chessdom.com/ccrl/404/cgi/ ... Brutus_RND
Dann Corbit
Posts: 12542
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: future of top engines:how much more elo?

Post by Dann Corbit »

If a random moving engine played an infinite number of games against SF, some of the games would be wins for the random engine.
Obviously, a very tiny percentage.

But, just by chance, a win could happen early in the chain.

A random engine will not be zero Elo.
An engine that chooses the worst possible move after a careful search might be able to hit zero.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
Uri Blass
Posts: 10314
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: future of top engines:how much more elo?

Post by Uri Blass »

Dann Corbit wrote: Tue Jul 23, 2019 2:51 am If a random moving engine played an infinite number of games against SF, some of the games would be wins for the random engine.
Obviously, a very tiny percentage.

But, just by chance, a win could happen early in the chain.

A random engine will not be zero Elo.
An engine that chooses the worst possible move after a careful search might be able to hit zero.
You mean it is not going to have the lowest elo.

I think the Elo of random players would be less than 0 in the CCRL list if they accepted only the right engines.
Unfortunately, they test a lot of buggy engines.

I guess that many engines can beat the random player 500,000-0 with a one-ply search (assuming a one-ply search is enough for them not to miss a mate in the next move, because they extend checks in the qsearch, and also enough for them to avoid making a move that causes a repetition when they have the advantage, or to avoid giving stalemate, because stalemate is evaluated by them as 0.00).

Of course the random player may be lucky and get a position where it has a mate in 2, but that is not going to happen often, and even when it happens, the probability that it plays the right moves is small enough that I guess the combination of these events happens less than once in a million games.