A better rating method?

Discussion of anything and everything relating to chess playing software and machines.

Moderators: hgm, Rebel, chrisw

Alayan
Posts: 550
Joined: Tue Nov 19, 2019 8:48 pm
Full name: Alayan Feh

A better rating method?

Post by Alayan »

The naive assumption used for standard Elo calculation has been shown many times not to capture enough of an engine's properties. This thread is for discussing how a system that captures more of those properties could be designed.

For a standard rating list, which aims at a single number giving a clear ordering of engines, not many improvements are possible. We have (Elo A, Elo B) => expected score of A vs B.
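
For reference, a minimal sketch of that standard logistic mapping in Python (the 400-point scale is the usual Elo convention):

Code: Select all

def expected_score(elo_a: float, elo_b: float) -> float:
    """Standard logistic Elo model: expected score of A against B."""
    return 1.0 / (1.0 + 10.0 ** ((elo_b - elo_a) / 400.0))

print(expected_score(3100, 3000))  # a 100 Elo edge predicts ~0.64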

When pooling multiple opponents, some care must be taken to compute the Elo with the best fit (in particular, performance against engines of different strengths can NOT be abstracted into performance against a single engine rated at the pool's average).
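
To make the pooling point concrete, here is a sketch (function name and bisection approach are mine, reusing expected_score from above) that fits a single rating to results against several rated opponents by solving "sum of expected scores = sum of actual scores", which is the maximum-likelihood fit under the logistic model:

Code: Select all

def performance_rating(results, lo=0.0, hi=5000.0, iters=60):
    """results: list of (opponent_elo, score) pairs, score in [0, 1]
    per game (or per equal-sized match). Bisection works because the
    total expected score is monotone increasing in the rating."""
    actual = sum(score for _, score in results)
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        expected = sum(expected_score(mid, opp) for opp, _ in results)
        lo, hi = (mid, hi) if expected < actual else (lo, mid)
    return (lo + hi) / 2.0

# 75% against a pool of {2800, 3200} fits to ~3249, while 75% against
# a single 3000-rated opponent would map to ~3191: the pool can NOT
# be replaced by its average.
print(performance_rating([(2800, 0.75), (3200, 0.75)]))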

However, this is clearly not enough to assess an engine's properties correctly. There is no need to even consider Leela-type engines: the contempt value alone can make a big difference.

SF11 with contempt 24 will perform almost as well as with contempt 0 in h2h, but significantly better against weak opponents and significantly worse against strong opponents. So mapping a % score to an Elo difference is inaccurate. One pair of engine configurations needs to be set as a reference, and engines need a (rating, consistency) pair, with a consistent engine scoring closer to 50% against both weaker and stronger opponents.
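
One possible functional form for such a pair (this exact parametrization is my assumption, not an established model) is to let the consistency factor scale the effective rating difference, so that a factor below 1 pulls results toward 50% in both directions while a factor above 1 amplifies them, contempt-style:

Code: Select all

def expected_score_rc(elo_a, elo_b, consistency_a=1.0):
    """Toy (rating, consistency) model: consistency_a < 1 compresses
    A's results toward 50% against both weaker and stronger opponents;
    consistency_a > 1 amplifies them (contempt-like)."""
    diff = consistency_a * (elo_a - elo_b)
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

print(expected_score_rc(3000, 2800, 0.8))  # ~0.72, closer to 50% than plain ~0.76
print(expected_score_rc(3000, 2800, 1.2))  # ~0.80, further from 50%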

Laskos suggested describing Leela's behavior with a pair (rating1, rating2). It's unclear whether this would yield a better fit in some situations than the (rating, consistency) pair, or whether adding more such parameters could significantly improve the fit.

Another major issue is the opening. For very weak players the opening has little relevance, but at the current level of chess engines, having white or black makes a big difference. Using the average of the white and black performance for the computations is clearly incorrect. This issue is most obvious with very skewed openings like the TCEC SuFi openings. The Elo tends to be inflated up to a 75% score, but deflated beyond a 75% score (as getting wins while not losing with the weak side is extremely hard). Any Elo difference computed from games played with skewed openings will be deceptive and a poor predictor, because the % score with the strong side is very different from the % score with the weak side. On the other hand, sticking with dead-drawn openings will hide or minimize very real strength differences.

To get good score predictions, it appears very useful to add an opening-skew input, even though objectively any chess position is drawn or won for one side. So any opening-skew input must be empirical, and how to define or measure it remains an open problem. The same applies to an opening-complexity measure (two positions might be equal, and measured as such through average score or some top-engine eval, yet one will have a much higher Elo spread than the other; the drawkiller openings are a good example).
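
As an illustration of how a skew input could enter a prediction (the form below is only a sketch; calibrating the skew value is exactly the open problem, and this still ignores the win/draw structure that makes plain averaging incorrect), predict each colour of a reversed-opening game pair separately and only then average:

Code: Select all

def pair_score(elo_a, elo_b, skew=0.0):
    """Expected score of A over a two-game pair with colours reversed,
    where `skew` (in Elo-equivalent units, an empirical input) favours
    whichever engine has the strong side of the opening."""
    e = lambda d: 1.0 / (1.0 + 10.0 ** (-d / 400.0))
    return (e(elo_a - elo_b + skew) + e(elo_a - elo_b - skew)) / 2.0

# With balanced openings a 100 Elo gap predicts ~0.64; with a 400-Elo
# skew the same gap predicts only ~0.55, so the naive score -> Elo
# mapping would understate the real difference.
print(pair_score(3100, 3000, 0.0), pair_score(3100, 3000, 400.0))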

Ultimately, to evaluate any system, we'd need a DB of games that exhibit significant inconsistencies, in the sense that the predicted h2h result of A vs B is significantly off, and the proposed improvement would have to show that it allows for much better predictions.
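
For the evaluation itself, a proper scoring rule over held-out games would do; a minimal sketch (log loss is my choice here, and treating a game score as a probability is a crude handling of draws):

Code: Select all

import math

def log_loss(predicted, actual):
    """Mean negative log-likelihood of observed game scores given a
    model's predicted expected scores; lower means better predictions."""
    eps = 1e-12
    loss = 0.0
    for p, s in zip(predicted, actual):
        p = min(max(p, eps), 1.0 - eps)
        loss -= s * math.log(p) + (1.0 - s) * math.log(1.0 - p)
    return loss / len(predicted)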
Frank Quisinsky
Posts: 6808
Joined: Wed Nov 18, 2009 7:16 pm
Location: Gutweiler, Germany
Full name: Frank Quisinsky

Re: A better rating method?

Post by Frank Quisinsky »

Hi there,

- an error-free, balanced opening book (FEOBOS is one)
- many different opponents (each playing every other)
- many games
- the best Excel statistics we can produce from the final results ... the games!
- great Elo calculation programs
- tricky engine settings

and much more ...

All these things aren't enough!

An exact Elo isn't possible if we have:

From the chess computer era, for example the MM V:
- 1900 Elo in the early middlegame
- 1750 Elo in the late middlegame
- 1500 Elo in the transition into the endgame
- 1300 Elo in the endgame

The differences are too big!

Another example, for Stockfish 11:
- 2900 Elo in the early middlegame
- 3200 Elo in the late middlegame
- 3500 Elo in the transition into the endgame
- 3800 Elo in the endgame

Elo = the human view!
The differences are too big!

Third example:
If Wasp can hold the middlegame against engines 100-150 Elo stronger but loses a lot of strength in the transition into the endgame, an exact Elo measurement for Wasp is not possible.

If we build a field of 10 engines where one is Wasp and the other nine are programs strong in endgames, the final Elo performance will be about 40 Elo lower than against a group of opponents with mixed strengths and weaknesses. All of that is very easy to simulate.

Think about this one:
If you had a group of 19 programs, all different in style but all on Stockfish 11's level, and Stockfish 11 participated as engine number 20 in such a tournament, Stockfish 11 would end up with a completely different Elo than the one you know today. This is easy to simulate in Excel with older chess programs, e.g. if we added the old number-one Shredder to a group of equally strong programs today.

Most of the very strong chess programs available today have their strength in the same playing phase:
the transition into the endgame.

Back to my first points:
We can do a lot in engine testing. With better opening books and good organization, we know we can produce the best possible results. But as long as most of the strong chess programs share the same main strength, a better measurement is __today__ not possible.

It was the same back in the years when the older chess computers were available to us.
Some club players thought the MM V had 2000 Elo, other club players thought 1750 at most.

When an engine loses so much strength between the playing phases, as the MM V did or as Stockfish does today ... that is the main point. We can never build the ultimate Elo.

What we can do ...
Find out where the strengths and weaknesses of engines lie.
Compare the strength of engines across the playing phases!

Nothing we do in engine testing today is bad.
Fast or slower time controls, ponder on or off, a group of 10, 20, or 40 engines, or simply a group of 2 engines.
But the results we produce will never be definitive, and they will differ considerably.

For years I searched for the perfect rules and settings ... but I gave that up a long time ago.
Elo is absolutely not interesting (for me, today, it is very boring) ... much more interesting are the strengths and weaknesses of engines!
Provided, of course, that a human can understand them.

So I need programs like Excel and good databases.

Best
Frank

PS: Lc0 is very simple:
In the first playing phase its games don't have the strength the best other chess programs can play; too many tactical holes! But as more moves are played, humans can't understand what Lc0 wants to play. It looks very speculative and fantastic, as if a new style had been born. In reality it is a style humans cannot understand, and for that reason not very interesting for humans, because learning from it is beyond the maximum human horizon. And without the better statistics software can give us, we are quite blind with the 1400-2100 Elo most of us have.
Frank Quisinsky
Posts: 6808
Joined: Wed Nov 18, 2009 7:16 pm
Location: Gutweiler, Germany
Full name: Frank Quisinsky

Re: A better rating method?

Post by Frank Quisinsky »

Chess program testing around the years 1998-2004 was the most interesting, because most chess programs didn't have big differences across the three playing phases. We understood a lot, could spot blunders faster, and the rating systems worked better than today, despite all the possibilities we have so many years later.
D Sceviour
Posts: 570
Joined: Mon Jul 20, 2015 5:06 pm

Re: A better rating method?

Post by D Sceviour »

How about a Power Punch rating? Take the CCRL or any other tester's Elo rating and divide it by the file size of the engine executable. Some relatively weak engines could demonstrate a lot of power for their size. It is also a demonstration of good programming.

Power Punch = ELO / engine size
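
If one wanted to compute it, a literal sketch (units are Elo per byte, so rescaling, e.g. per MiB, gives more readable numbers):

Code: Select all

import os

def power_punch(elo: float, exe_path: str) -> float:
    """D Sceviour's proposed metric: rating divided by executable size."""
    return elo / os.path.getsize(exe_path)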
mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: A better rating method?

Post by mar »

D Sceviour wrote: Tue Oct 27, 2020 10:07 pm How about a Power Punch rating? Take the CCRL or any other tester's Elo rating and divide it by the file size of the engine executable. Some relatively weak engines could demonstrate a lot of power for their size. It is also a demonstration of good programming.

Power Punch = ELO / engine size
Well, people could simply start using UPX. I don't see how this correlates with programming quality: template instantiations in C++, embedded data (the NN in SF), and so on. A load of nonsense, as usual.
Martin Sedlak
Ferdy
Posts: 4833
Joined: Sun Aug 10, 2008 3:15 pm
Location: Philippines

Re: A better rating method?

Post by Ferdy »

Alayan wrote: Tue Oct 27, 2020 7:46 pm The naive assumption used for standard Elo calculation has been shown many times not to capture enough of an engine's properties. This thread is for discussing how a system that captures more of those properties could be designed.
The normal method to measure an engine's strength is by win/loss/draw. Could you describe what these engine properties are?
lkaufman
Posts: 5960
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA

Re: A better rating method?

Post by lkaufman »

I think that in principle the best method is to vary the time limit for each engine (keeping increment at 1% of base time) up or down until that engine scores 50% against the field. The rating would just be a function of the time limit needed to score 50%. Presumably no engine should use Contempt, and it shouldn't matter much whether opening books were drawish or highly unbalanced. The main downside of this idea is that it takes many more games to get an accurate rating for each engine, and it might not be practical for testing groups with multiple testers using different machines.
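
A sketch of how that search could be organized (play_match is a hypothetical harness returning the engine's score fraction at a given time multiplier; each call would need very many games, which is the downside mentioned):

Code: Select all

def time_scale_for_fifty_percent(engine, field, play_match,
                                 lo=0.25, hi=4.0, iters=8):
    """Scale the engine's base time (increment kept at 1% of base)
    until it scores ~50% against the field; assumes the score is
    monotone increasing in thinking time. Returns the multiplier;
    the rating would then be defined as a function of it."""
    for _ in range(iters):
        mid = (lo * hi) ** 0.5  # geometric mean: time acts multiplicatively
        if play_match(engine, field, mid) < 0.5:
            lo = mid            # scored under 50%: needs more time
        else:
            hi = mid
    return (lo * hi) ** 0.5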
Komodo rules!
Frank Quisinsky
Posts: 6808
Joined: Wed Nov 18, 2009 7:16 pm
Location: Gutweiler, Germany
Full name: Frank Quisinsky

Re: A better rating method?

Post by Frank Quisinsky »

Hi Larry,

that is an idea, indeed!

For computer chess ratings, I think it's better to first build four ratings, and from those four later build one:

A rating for the early middlegame, one for the late middlegame, one for the transition into the endgame, and one for the endgame.

Then we give the four ratings proper weights.
The final result will be the main rating, built from the 4 sub-ratings!
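
A minimal sketch of that combination step (the weights are free parameters, to be chosen by what matters most to human observers):

Code: Select all

def main_rating(phase_elos, weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted mean of the (early middlegame, late middlegame,
    transition into endgame, endgame) sub-ratings."""
    return sum(e * w for e, w in zip(phase_elos, weights)) / sum(weights)

# The MM V example above with equal weights: (1900+1750+1500+1300)/4
print(main_rating((1900, 1750, 1500, 1300)))  # 1612.5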

The effect is that rating systems become much more interesting to look at, and the ratings are much more meaningful from a human point of view. I have tried this for a long time, and for me everything is clearly better, but "not perfect".

Ratings without any further information about the engines are, in my opinion, "nothing", because I see no interesting information in them, only a boring four-digit number ... and all of it quite different depending on the conditions. A waste of energy!!

The advantage we have when building ratings is that chess engine versions don't have "form variation" ... I hope the word is correct. We all know that, but we aren't able to use the advantage.

BTW:
The difference between Stockfish 11 and Stockfish 12 NNUE (longer time controls with many opponents) is only 80 Elo. We had a discussion about this a short while ago. I am testing it on a second i9-10900 system ... the result can be seen later when FCP Tourney-2021 runs. My tip of 75 Elo, made before testing, seems to be good.

The main problem for older rating systems = Elo inflation!

Elo results are quite different under different conditions.
We have a lot of users generating ratings, and all the ratings are quite different.
What should we do with so many different results without any further information?

For better understanding, we need better Elo calculation programs based on other ideas.

Best
Frank

Hint: Komodo 14 is around 10 Elo stronger in endgames than Stockfish 12 NNUE.
Congratulations, Komodo 14.0 is really strong in pure endgames, and for the first time in a long while it is the number 1 there.
You can see ... I am looking with different eyes!
Frank Quisinsky
Posts: 6808
Joined: Wed Nov 18, 2009 7:16 pm
Location: Gutweiler, Germany
Full name: Frank Quisinsky

Re: A better rating method?

Post by Frank Quisinsky »

Hint:
Komodo 14 is around 10 Elo stronger in endgames than Stockfish 12 NNUE.
Congratulations, Komodo 14.0 is really strong in pure endgames, and for the first time in a long while it is the number 1 there.
You can see ... I am looking with different eyes!

For the others:
This is easy to see (a script sketch follows after the steps):

Step 1: Search for games with 100-299 moves in databases without resign mode.
Step 2: Search for games with 80-99 moves in databases without resign mode.
Step 3: Search for games with 40-79 moves in databases without resign mode.

Note: an endgame can start after a small number of moves.

Step 4: Filter the games by the pieces left on the board.
Step 5: Build new databases.

Step 6: This needs the most time, because no software does it.
Now have a look at the engines' evals in the new databases.
Equal should still be equal before the endgame started.

And then you have the endgame strength.
I do that from time to time.
That way I can form my own opinion about engine strength in endgames (as an example)!
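
Steps 1-4 are easy to script; a rough sketch with the python-chess library (the move-count bins come from the steps above; the <= 10 piece cutoff for "reached an endgame" is my assumption, and steps 5-6 are left out):

Code: Select all

import chess.pgn

def bucket_games(pgn_path, bins=((100, 299), (80, 99), (40, 79)),
                 max_pieces=10):
    """Bucket games by length, keeping only those whose final position
    has few enough pieces to count as an endgame. Assumes the games
    were played without resign/adjudication mode."""
    buckets = {b: [] for b in bins}
    with open(pgn_path) as f:
        while (game := chess.pgn.read_game(f)) is not None:
            final = game.end().board()   # board after the last mainline move
            if len(final.piece_map()) > max_pieces:
                continue                 # the game never reached an endgame
            moves = final.fullmove_number
            for lo, hi in bins:
                if lo <= moves <= hi:
                    buckets[(lo, hi)].append(game)
                    break
    return buckets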

But in the late middlegame, Stockfish 12 NNUE is 190 Elo stronger than Komodo 14.0.
And in the transition into the endgame, Stockfish 12 NNUE is 120 Elo stronger than Komodo 14.0.

Longer time controls, ratings generated with 40 opponents.
All in all, Stockfish 12 NNUE is 160 Elo stronger than Komodo 14.0.

160 Elo if I give the four playing phases proper weights ... according to what is more important for humans!
Frank Quisinsky
Posts: 6808
Joined: Wed Nov 18, 2009 7:16 pm
Location: Gutweiler, Germany
Full name: Frank Quisinsky

Re: A better rating method?

Post by Frank Quisinsky »

This is the only way in times when the strongest chess programs are all so strong in the transition into the endgame ... as I wrote before!

The engine differences in the first two playing phases are the most interesting today.
Here all the available stronger chess programs produce really big differences.
Most of it has to do with more "aggressive pawns".

The transition into the endgame is near perfection. With better pawn structures coming out of the middlegame, higher ratings in the transition into the endgame are possible ... but not by so much that it's interesting to talk about. Stockfish is really a monster here!! Komodo can win many pure endgames from very small advantages, with slightly better pawn structures move by move. That is Komodo's strength, as I see when I study the games with statistics.

No wonder in today's times:
With more pieces on the board, it's more complicated to make chess programs stronger.
Maybe many of the better moves in the early middlegame look illogical to humans.

The result is that the strongest chess programs can achieve bigger Elo jumps only in the early and late middlegame. Here I see a lot of potential for the Stockfish developers. This is also a reason why Lc0 fans like its often illogical style.

Example: Stockfish 12 NNUE
Stockfish is much more aggressive around 15-25 moves after the opening book phase than before. It doesn't matter that Stockfish is the strongest chess program right after the book; the differences to Komodo 14.0 are small there.

Best
Frank