Absolute ELO scale

nionita · Post by **nionita** » Sat Dec 17, 2016 4:23 pm

hgm wrote:
nionita wrote:Is there any theoretical problem if we define ELO 0 = playing strength of an engine which plays in every (legal) position one of the legal moves following a random uniform probability distribution?
This is a logical way to define things. There is the practical problem, however, that almost anything that does some rudimentary thinking is so much stronger that it scores close to 100% against the random mover, so that you cannot derive the Elo difference from it. So the 'calibration point' would lie (too) far outside the populated range.

So you would need to fill the gap with a series of engines that would bridge the difference. There are some non-searching engines like N.E.G., but even these beat a random mover badly (if they have stalemate avoidance), and probably by too much for reliable Elo determination. And they lose equally badly against searching engines, even if these do just 1-ply + QS.

Perhaps a series of searching engines based on the Beal effect, with progressively deeper search (and random evaluation), could bridge the gap and have reasonable scores against each other, against the random mover on one side, and against searching + evaluating engines on the other. There is no guarantee that bridging the gap with another series of engines would not give you a totally different rating, however. Basing ratings on just a single pairing of players is already bad, and making a chain of such pairings...

I was thinking that, for new engine authors, having a standard (weak) engine could be beneficial for measuring progress in the beginning. Even the fact that bugs make you lose more than necessary against that engine could be helpful to them.

Also, if we run experiments with (new) ML algorithms and get (mostly) weaker results, such a reference would make the results comparable.

Dirt · Post by **Dirt** » Sat Dec 17, 2016 6:02 pm

hgm wrote:There are some non-searching engines like N.E.G., but even these beat a random mover badly (if they have stalemate avoidance), and probably by too much for reliable Elo determination. And they lose equally badly against searching engines, even if these do just 1-ply + QS.

Yeah, I immediately thought of N.E.G. when reading the first post. I think that would be a better zero point, but maybe not good enough.

Uri Blass · Post by **Uri Blass** » Sat Dec 17, 2016 6:53 pm

nionita wrote:Is there any theoretical problem if we define ELO 0 = playing strength of an engine which plays in every (legal) position one of the legal moves following a random uniform probability distribution?

Yes.

I think that a random player does not behave the way naturally weak players do, so using it as the reference may distort Elo.

Let's take an extreme example.

Suppose that you have an engine that plays like Stockfish with White but plays random moves with Black.

Suppose that it plays in human tournaments.

What is going to be the engine's rating against humans?
It is clear that the rating is going to depend on the opponent.

In every match against humans who are not extremely weak (so weak that they may draw by stalemate even against the random mover, or lose to it by making two illegal moves), it is going to score 50% (except perhaps against the few humans who have practical chances of not losing when the engine plays like Stockfish).
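The example above can be put into numbers with a small sketch. It assumes the usual logistic expected-score formula, ignores the first-move advantage, and uses made-up ratings (`STRONG_RATING` and `RANDOM_RATING` are hypothetical placeholders, not measured values):

```python
def expected_score(rating_a, rating_b):
    """Logistic expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

# Hypothetical ratings, chosen only for illustration.
STRONG_RATING = 3400.0  # engine half that plays like Stockfish
RANDOM_RATING = 0.0     # engine half that plays random moves

def split_engine_score(opponent_rating):
    """Average score of an engine that is strong in half its games
    and a random mover in the other half."""
    strong_half = expected_score(STRONG_RATING, opponent_rating)
    random_half = expected_score(RANDOM_RATING, opponent_rating)
    return 0.5 * (strong_half + random_half)

for opponent in (1000, 1500, 2000, 2500):
    print(opponent, round(split_engine_score(opponent), 3))
```

Under these assumptions the split engine scores within about one percentage point of 50% against every opponent in the list, so a rating computed from any single one of those matches would simply echo that opponent's rating.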

Laskos · Post by **Laskos** » Sat Dec 17, 2016 8:08 pm

hgm wrote:There are some non-searching engines like N.E.G., but even these beat a random mover badly (if they have stalemate avoidance), and probably by too much for reliable Elo determination. And they lose equally badly against searching engines, even if these do just 1-ply + QS.

It's not so bad. I let Andscacs eval + 1 ply play against a random-mover Andscacs. No time losses and such.

Score of Ands depth=1 vs Ands Random: 39993 - 0 - 7  [1.000] 40000
ELO difference: 1623.18 +/- 165.00
Finished match

This is in line with the previous test and with FIDE ratings: I estimate the FIDE strength of 1-ply Andscacs at about 1200. Engines probably dilute ratings a bit compared with FIDE, and weak humans are unlike both the 1-ply mover and the random mover.
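For readers who want to check the reported figure, here is a minimal sketch of the standard logistic conversion from a match score to an Elo difference. The function name `elo_diff` is only illustrative, and the error margin is not computed here:

```python
import math

def elo_diff(wins, losses, draws):
    """Elo difference implied by a match result under the logistic model."""
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games
    if not 0.0 < score < 1.0:
        raise ValueError("a 0% or 100% score has no finite Elo difference")
    return 400.0 * math.log10(score / (1.0 - score))

# Win-loss-draw counts taken from the match output above.
print(round(elo_diff(39993, 0, 7), 2))
```

With the 39993 - 0 - 7 result this gives roughly 1623, which agrees with the reported difference.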

hgm · Post by **hgm** » Sat Dec 17, 2016 11:11 pm

Well, you cannot get a reliable rating difference from a score of 0.01%. That is way too sensitive for the Elo model.
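A rough way to see this sensitivity, assuming the logistic model, is to vary the number of draws in a 40000-game match and watch the implied gap move. Only the count of 7 comes from the match above; the counts 3 and 14 are hypothetical:

```python
import math

def elo_from_score(score):
    """Elo difference implied by a score fraction under the logistic model."""
    return 400.0 * math.log10(score / (1.0 - score))

GAMES = 40000  # match length from the result quoted in this thread

# 7 draws is the observed count; 3 and 14 are hypothetical alternatives.
for draws in (3, 7, 14):
    weaker_score = 0.5 * draws / GAMES
    gap = elo_from_score(1.0 - weaker_score)
    print(f"{draws:2d} draws -> {gap:7.1f} Elo")
```

This far out in the tail, each factor of two in the number of draws moves the estimate by about 400 * log10(2), or roughly 120 Elo, so a handful of games decides the result.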

tysen2k · Post by **tysen2k** » Sat Dec 17, 2016 11:51 pm

I've been toying around with the idea of setting the "reference" Elo level to be the level that gives handicap odds a multiplying effect. For example, if you subtract about 425 from current Elo levels, knight odds support about a 1.43x difference in Elo across a wide range of Elo.

Laskos · Post by **Laskos** » Sun Dec 18, 2016 2:00 am

hgm wrote:Well, you cannot get a reliable rating difference from a score of 0.01%. That is way too sensitive for the Elo model.

The Elo model for engines seems to be logistic. This is corroborated here by the fact that the small intervals add up to the logistic value for the total result over a large span. Besides that, the point was more to show that a random mover can from time to time, say once in 10,000 games, draw against a 1-ply full engine, not once in 10^20 games.
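The additivity claim can be illustrated with a short sketch. Under the logistic model the odds of each link in a chain multiply, so the Elo differences add. The step scores below are hypothetical, not measurements from any test in this thread:

```python
import math

def score_to_elo(score):
    """Elo difference implied by a score fraction (logistic model)."""
    return 400.0 * math.log10(score / (1.0 - score))

def elo_to_score(diff):
    """Expected score for a given Elo difference (logistic model)."""
    return 1.0 / (1.0 + 10 ** (-diff / 400.0))

# Hypothetical scores between neighbouring engines in a chain.
step_scores = [0.76, 0.80, 0.70, 0.85]

total_elo = sum(score_to_elo(s) for s in step_scores)
predicted_score = elo_to_score(total_elo)

# Product of the per-step odds, which the logistic model says
# should equal the end-to-end odds.
odds_product = math.prod(s / (1.0 - s) for s in step_scores)

print(round(total_elo, 1), predicted_score, odds_product)
```

If a directly measured end-to-end score matches `predicted_score`, the logistic shape holds over that span. Whether it does for differently weakened engines is exactly what is being questioned in this thread.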

hgm · Post by **hgm** » Sun Dec 18, 2016 9:50 am

Laskos wrote:Elo model for engines seems to be logistic.

In what range? I am pretty sure there isn't much statistics this far out in the tails. And what holds in one range of ratings might not hold in a completely different range (where engines must be buggy to be as weak as they are).

Laskos · Post by **Laskos** » Sun Dec 18, 2016 10:03 am

hgm wrote:
Laskos wrote:Elo model for engines seems to be logistic.
In what range? I am pretty sure there isn't much statistics this far out in the tails. And what holds in one range of ratings might not hold in a completely different range (where engines must be buggy to be as weak as they are).

You might look at this thread and plot:
http://www.talkchess.com/forum/viewtopic.php?t=60791

That covers 1400 ELO points. Sure, with no resignations, no time forfeits, and no illegal moves. The conditions for logistic behavior can be set up easily.

hgm · Post by **hgm** » Sun Dec 18, 2016 10:18 am

It seems you only used one method to weaken the engines there, namely reducing the size of the search tree of healthy engines by node count. You cannot assume this would hold for other methods of weakening too (like random pruning, gross misevaluation).
