Regan's conundrum

clumma · Post by **clumma** » Fri Dec 09, 2016 10:20 pm

Long article, but it raises an interesting question

https://rjlipton.wordpress.com/2016/12/ ... y-grinder/

Larry and Kai (among others) would be interested, I would think.

-Carl

Laskos · Post by **Laskos** » Sat Dec 10, 2016 5:15 am

clumma wrote:Long article, but it raises an interesting question

https://rjlipton.wordpress.com/2016/12/ ... y-grinder/

Larry and Kai (among others) would be interested, I would think.

-Carl

Thanks Carl, I will read it this morning, seems substantial.

jdart · Post by **jdart** » Sat Dec 10, 2016 6:02 am

Maybe it is just me, but I am finding that article quite hard to follow.

--Jon

clumma · Post by **clumma** » Sat Dec 10, 2016 6:54 am

jdart wrote:Maybe it is just me, but I am finding that article quite hard to follow.

--Jon

It's not just you.

Laskos · Post by **Laskos** » Sat Dec 10, 2016 7:57 am

clumma wrote:Long article, but it raises an interesting question

https://rjlipton.wordpress.com/2016/12/ ... y-grinder/

Larry and Kai (among others) would be interested, I would think.

-Carl

I read it cursorily in 15 minutes and will try to find some more time. "IPR" seems to be a human rating derived from deviations from engine analysis. It's dubious. So the Carlsen and Karjakin performances are relative, and in any case have large error margins for 12 and 4 games. Generally, deriving ratings from individual moves is a dubious affair.

The next part is more interesting. It basically says that you cannot just rescale engine evaluations by a single general factor to get, for all engines, the same mapping from a given eval to win/draw/loss probabilities. But engines do NOT perforce have to obey a logistic curve in eval/performance. It is an engine like Texel, which defines its eval by logistic outcome, that in principle must do so. That most engines comply roughly with the logistic is a consequence of the way engines evaluate material/positional values. But any monotonic conversion would do the job. One might mock this article by converting all the evals to Gaussian or even linear, without changing the play or strength of an engine.

Then there is what he calls "the “firewall at zero” phenomenon I observed last January." It is something I observed maybe 2 years ago, and posted and plotted here. Engines seem to improve the score very slowly at evals close to 0. So 0.05 compared to 0.00 has much lower relevance than 0.75 compared to 0.70. This also contradicts the pure logistic model, where the maximum derivative is at 0 (so a small change there should have higher relevance). I even proposed some quasi-logistic models for that, but I am too lazy to search for the posts here.
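To spell out the remark about the derivative: assuming a pure logistic model with eval x and a scale parameter s (both symbols are introduced here for illustration and are not from the article), the slope of the expected score is largest exactly at x = 0.

```latex
P(x) = \frac{1}{1 + e^{-x/s}}, \qquad
\frac{dP}{dx} = \frac{1}{s}\,P(x)\,\bigl(1 - P(x)\bigr) \le \frac{1}{4s},
\quad \text{with equality only at } x = 0 \ \bigl(P = \tfrac{1}{2}\bigr).
```

So under a pure logistic, going from 0.00 to 0.05 should move the expected score more than going from 0.70 to 0.75, which is the opposite of the slow improvement near 0 described above.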

Very interesting anyway. There are many things I couldn't follow at a fast pace, so I will read it again later.

clumma · Post by **clumma** » Sat Dec 10, 2016 8:22 am

jdart wrote:Maybe it is just me, but I am finding that article quite hard to follow.

I'll try to summarize what I know:

* Regan, following Guid & Bratko 2006, has been analyzing human games with computers. In particular, for each move he finds the difference between the evaluation (in centipawns) of the computer's move and that of the move actually played. He then averages these differences over a game, a tournament, or whatever.

* It's found that this average difference for players maps linearly to their Elo ratings. The fit is the same today as it was in the 1970s (omitting opening moves), laying to rest speculation about "ratings inflation". It doesn't exist.
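The two bullets above can be sketched roughly as follows. This is only an illustration, not Regan's actual method: the function names, the clamping of negative gaps to zero, and the `intercept` and `slope` parameters are all assumptions made here, and no fitted values from the article are implied.

```python
from statistics import mean


def average_error(best_evals, played_evals):
    """Mean centipawn gap between the engine's preferred move and the move played.

    best_evals[i]   -- engine eval (centipawns) of its own best move at move i
    played_evals[i] -- engine eval (centipawns) of the move the human played
    """
    if len(best_evals) != len(played_evals):
        raise ValueError("need one played eval per best eval")
    if not best_evals:
        raise ValueError("need at least one move")
    # Assumption: gaps are clamped at 0 so search noise cannot produce
    # a "negative error" when the played move scores slightly higher.
    return mean(max(0, best - played)
                for best, played in zip(best_evals, played_evals))


def fitted_rating(avg_error, intercept, slope):
    """Hypothetical linear map from average error to a rating.

    intercept and slope must come from a fit to real data; none are given here.
    """
    return intercept + slope * avg_error
```

For example, `average_error([30, 10, 50], [30, -10, 10])` averages the gaps 0, 20 and 40.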

Now we're caught up with the previous post on their blog and are ready to start the one I linked to above:

* Computer evaluations are found to map logistically to game scores. This has been widely discussed on this mailing list. Of course different engines can have a different scale. +3 for Stockfish might correspond to the same winning probability as +2.5 for Komodo, as an example.

* Regan is in the "intrinsic ratings" business for more than just fun. He uses them to monitor human play for signs of cheating -- in cooperation with FIDE. He alludes to this work as "statistical tests":

So we should be able to multiply Komodo’s values by 1.046 and plug them into statistical tests derived using Stockfish, right?

He doesn't disclose them because he doesn't want potential cheaters to be able to defeat them.

* Anyway, he wants to be able to use different engines without having to run each one over his entire data set. He just wants to use a scaling factor. But there's a problem. The scaling factor changes depending on the strength of the players he's analyzing. He discusses this in the "Revenge of the Turkeys" section through the end of the post.
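A minimal sketch of the scaling idea in the bullets above, assuming each engine's evals follow a logistic curve with its own scale parameter. The function names and the scale values are hypothetical and chosen only to reproduce the "+3 vs +2.5" example; they are not measured values for Stockfish or Komodo.

```python
import math


def expected_score(eval_pawns, scale):
    """Logistic map from an eval (in pawns) to an expected game score in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eval_pawns / scale))


def rescale_factor(scale_a, scale_b):
    """Factor that converts engine B's evals into engine A's units.

    The two engines predict the same score when x_a / scale_a == x_b / scale_b,
    so x_a == x_b * (scale_a / scale_b).
    """
    return scale_a / scale_b


# Hypothetical scales: engine A needs +3.0 where engine B needs +2.5.
SCALE_A = 1.2
SCALE_B = 1.0

factor = rescale_factor(SCALE_A, SCALE_B)
same_score = math.isclose(expected_score(2.5 * factor, SCALE_A),
                          expected_score(2.5, SCALE_B))
```

With a single constant factor this works for every eval. The problem described in the last bullet is that the factor does not stay constant across player strengths, so one multiplier is not enough.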

-Carl

clumma · Post by **clumma** » Sat Dec 10, 2016 8:30 am

Hadn't refreshed the page and didn't see this until after I posted.

Laskos wrote:I read it cursorily in 15 minutes and will try to find some more time. "IPR" seems to be a human rating derived from deviations from engine analysis. It's dubious. So the Carlsen and Karjakin performances are relative, and in any case have large error margins for 12 and 4 games. Generally, deriving ratings from individual moves is a dubious affair.

I can't agree IPRs are dubious, but that's a side point.

The next part is more interesting. It basically says that you cannot just rescale engine evaluations by a single general factor to get, for all engines, the same mapping from a given eval to win/draw/loss probabilities. But engines do NOT perforce have to obey a logistic curve in eval/performance. It is an engine like Texel, which defines its eval by logistic outcome, that in principle must do so. That most engines comply roughly with the logistic is a consequence of the way engines evaluate material/positional values. But any monotonic conversion would do the job. One might mock this article by converting all the evals to Gaussian or even linear, without changing the play or strength of an engine.

Regan claims the evals must be logistic or else the engine won't play well. I don't understand his reasoning so I can't comment. He credits this 2012 paper by Amir Ban with the observation. I glanced at it and don't immediately see such an argument.

-Carl

Laskos · Post by **Laskos** » Sat Dec 10, 2016 8:48 am

clumma wrote:Hadn't refreshed the page and didn't see this until after I posted.

Laskos wrote:I read it cursorily in 15 minutes and will try to find some more time. "IPR" seems to be a human rating derived from deviations from engine analysis. It's dubious. So the Carlsen and Karjakin performances are relative, and in any case have large error margins for 12 and 4 games. Generally, deriving ratings from individual moves is a dubious affair.
I can't agree IPRs are dubious, but that's a side point.

Are they derived from engine analysis of individual moves? The moves humans make often have different long-term goals from engine moves, especially quiet moves. Maybe blunder analysis is more solid, but deriving human ratings from engine analysis of human moves in games is a bit far-fetched, IMO.

The next part is more interesting. It basically says that you cannot just rescale engine evaluations by a single general factor to get, for all engines, the same mapping from a given eval to win/draw/loss probabilities. But engines do NOT perforce have to obey a logistic curve in eval/performance. It is an engine like Texel, which defines its eval by logistic outcome, that in principle must do so. That most engines comply roughly with the logistic is a consequence of the way engines evaluate material/positional values. But any monotonic conversion would do the job. One might mock this article by converting all the evals to Gaussian or even linear, without changing the play or strength of an engine.
Regan claims the evals must be logistic or else the engine won't play well. I don't understand his reasoning so I can't comment. He credits this 2012 paper by Amir Ban with the observation. I glanced at it and don't immediately see such an argument.

-Carl

Suppose one translates the eval from logistic to linear. Engines will still have the goal of improving the eval, and an improvement on the logistic scale is an improvement on the linear scale. The eval scores will look strange, that's true, but I don't see how that would affect the move-by-move choice. Only the eval/performance graph would look very different. Maybe an engine like Strelka with a monotonically distorted eval (I forgot which Strelka) can show it, without degrading the strength. Maybe I am missing something.
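The argument can be illustrated with a toy move chooser. This is a sketch under a strong assumption: the engine simply picks the move with the highest eval from a fixed table. The move names, eval numbers, and transforms are made up for the example, and search heuristics that compare evals against fixed margins are not modeled here.

```python
import math


def best_move(move_evals, transform=lambda e: e):
    """Return the move whose (optionally transformed) eval is highest."""
    return max(move_evals, key=lambda move: transform(move_evals[move]))


def logistic(e, scale=1.0):
    """Strictly increasing squashing of an eval into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-e / scale))


def stretched(e):
    """Another strictly increasing distortion of the eval scale."""
    return e ** 3


# Hypothetical evals (in pawns) for three candidate moves.
moves = {"e4": 0.30, "d4": 0.25, "h4": -0.60}

choices = {best_move(moves),
           best_move(moves, logistic),
           best_move(moves, stretched)}
```

All three calls pick the same move, because a strictly increasing transform preserves the ordering of the evals. Only the numbers shown to the user, and hence the eval/performance graph, change.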

Lyudmil Tsvetkov · Post by **Lyudmil Tsvetkov** » Sat Dec 10, 2016 9:07 am

I guess this is an article which Regan himself does not understand.

performance rating of Carlsen for the match lower than that of Karyakin?
please, spare me the laughter. I replayed the games and, with the exception of 1 or 2, it was Carlsen who has been pushing, and playing better.

the much more likely hypothesis is that Komodo or whatever other engine is used for analysis understands better the (lower level) play of Karyakin than the (positionally emphatic) play of Carlsen.

but who needs tens of pages of unrecycled paper, to get to a certain conclusion, if at least one of the basic assumptions is wrong?

Laskos · Post by **Laskos** » Sat Dec 10, 2016 9:18 am

Lyudmil Tsvetkov wrote:I guess this is an article which Regan himself does not understand.

performance rating of Carlsen for the match lower than that of Karyakin?
please, spare me the laughter. I replayed the games and, with the exception of 1 or 2, it was Carlsen who has been pushing, and playing better.

the much more likely hypothesis is that Komodo or whatever other engine is used for analysis understands better the (lower level) play of Karyakin than the (positionally emphatic) play of Carlsen.

but who needs tens of pages of unrecycled paper, to get to a certain conclusion, if at least one of the basic assumptions is wrong?

Humans have certain goals. Some play for a win, some for a draw, some with contempt, some with respect. Even today, strong humans have longer plans than even top engines. Analysing move by move may bring in style and these human idiosyncrasies. If Carlsen has contempt for an adversary, should we use Komodo +50 Contempt or Komodo 0 Contempt for analysis? Then there is the scaling of blunders. Also never mentioned there is the compression of the logistic with time control. Nor is there any mention of the phases of the game, which change the shape of the logistic.

But all in all an interesting article. I will try to read it more carefully.
