MRL - The MEA Rating List

Rebel · Post by **Rebel** » Thu Jun 21, 2018 6:50 pm

Thanks Dann, Ferdy, I will look into it.

In the meantime -

Fun with Komodo 9 and King Safety - http://rebel13.nl/k9.html
Komodo 9.42 and Dynamism - http://rebel13.nl/k942.html
ProDeo - http://rebel13.nl/mea.html

So 1 to 5 different time controls are possible; the second overview then shows the overall (average) rating across them, the ratings in blue.

Feedback welcome.

Albert Silver · Post by **Albert Silver** » Thu Jun 21, 2018 7:36 pm

I love the idea, but from my own experience having to clean out the WAC test suite, I know how important it is for all positions to be correct and have unique solutions. Otherwise seeing it not find 500 positions, or whatever the number, will be meaningless if those official solutions were wrong in the first place.

Dann Corbit · Post by **Dann Corbit** » Thu Jun 21, 2018 8:21 pm

Yes, my formula is quite complicated.
But, I think that (for instance) it is crucial to include depth.
And a dominant score at a lower depth should be considered (and probably reanalyzed).
However, your efforts are quite interesting.
I have two computer programs I use to analyze the data to produce the scores.
I try to normalize to 10, 9, 8, ..., 2, 1 scoring.

Joost Buijs · Post by **Joost Buijs** » Fri Jun 22, 2018 8:43 am

Albert Silver wrote: ↑Thu Jun 21, 2018 7:36 pm I love the idea, but from my own experience having to clean out the WAC test suite, I know how important it is for all positions to be correct and have unique solutions. Otherwise seeing it not find 500 positions, or whatever the number, will be meaningless if those official solutions were wrong in the first place.

Why is it important for test positions to have only one unique solution? When there are several moves leading to a significant advantage and your engine chooses one of them, there is nothing wrong with that. The WAC test, as it is, is very useful for quickly checking whether an engine is tactically broken or not. For instance, Leela scores incredibly badly at WAC, but this doesn't mean that WAC is broken; it means that Leela is tactically broken. Maybe it can compensate for this weakness by superior positional play, I don't know, but you certainly don't need to clean out or repair WAC just to make Leela look better.

Albert Silver · Post by **Albert Silver** » Fri Jun 22, 2018 10:32 pm

Joost Buijs wrote: ↑Fri Jun 22, 2018 8:43 am
Albert Silver wrote: ↑Thu Jun 21, 2018 7:36 pm I love the idea, but from my own experience having to clean out the WAC test suite, I know how important it is for all positions to be correct and have unique solutions. Otherwise seeing it not find 500 positions, or whatever the number, will be meaningless if those official solutions were wrong in the first place.
Why is it important for test positions to have only one unique solution? When there are several moves leading to a significant advantage and your engine chooses one of them, there is nothing wrong with that. The WAC test, as it is, is very useful for quickly checking whether an engine is tactically broken or not. For instance, Leela scores incredibly badly at WAC, but this doesn't mean that WAC is broken; it means that Leela is tactically broken. Maybe it can compensate for this weakness by superior positional play, I don't know, but you certainly don't need to clean out or repair WAC just to make Leela look better.

Make Leela look better? What are you talking about?

Dann Corbit · Post by **Dann Corbit** » Sat Jun 23, 2018 1:52 am

Ferdy wrote: ↑Thu Jun 21, 2018 5:46 am {snip}
I had been studying how engine score can be converted to a point system. One method to handle the negative scores is by using logistic function.
Code:
scoring_rate = 1/[1 + 10 ^(-score_cp/400)]
Example from reb-glob.epd
Code:
rnbq1rk1/pp2bppp/5n2/2pN4/2Pp4/3Q1NP1/PP2PPBP/R1B2RK1 b - - bm Nc6; c3 Nc6; acd 34; ce -3;
rnbq1rk1/pp2bppp/5n2/2pN4/2Pp4/3Q1NP1/PP2PPBP/R1B2RK1 b - - bm Nxd5; c3 Nc6; acd 26; ce -37;
rnbq1rk1/pp2bppp/5n2/2pN4/2Pp4/3Q1NP1/PP2PPBP/R1B2RK1 b - - bm Re8; c3 Nc6; acd 26; ce -49;
rnbq1rk1/pp2bppp/5n2/2pN4/2Pp4/3Q1NP1/PP2PPBP/R1B2RK1 b - - bm Be6; c3 Nc6; acd 26; ce -79;
rnbq1rk1/pp2bppp/5n2/2pN4/2Pp4/3Q1NP1/PP2PPBP/R1B2RK1 b - - bm Na6; c3 Nc6; acd 26; ce -85;
rnbq1rk1/pp2bppp/5n2/2pN4/2Pp4/3Q1NP1/PP2PPBP/R1B2RK1 b - - bm a5; c3 Nc6; acd 26; ce -104;
rnbq1rk1/pp2bppp/5n2/2pN4/2Pp4/3Q1NP1/PP2PPBP/R1B2RK1 b - - bm h6; c3 Nc6; acd 26; ce -106;
rnbq1rk1/pp2bppp/5n2/2pN4/2Pp4/3Q1NP1/PP2PPBP/R1B2RK1 b - - bm Bg4; c3 Nc6; acd 26; ce -122;
rnbq1rk1/pp2bppp/5n2/2pN4/2Pp4/3Q1NP1/PP2PPBP/R1B2RK1 b - - bm Bd7; c3 Nc6; acd 26; ce -140;
Example calculations:
rnbq1rk1/pp2bppp/5n2/2pN4/2Pp4/3Q1NP1/PP2PPBP/R1B2RK1 b - - bm Nc6; c3 Nc6; acd 34; ce -3;
ce = -3
sr (scoring_rate) = 1/(1+10^(-(-3)/400)) = 0.49568

rnbq1rk1/pp2bppp/5n2/2pN4/2Pp4/3Q1NP1/PP2PPBP/R1B2RK1 b - - bm Bd7; c3 Nc6; acd 26; ce -140;
ce = -140
sr = 1/(1+10^(-(-140)/400)) = 0.30876

If you want the top to get 10 points,
factor = 10/0.49568 = 20.174

bm Nc6, ce = -3, sr = 0.49568
pt = factor * sr = 10

bm Bd7, ce = -140, sr = 0.30876
pt = factor * sr = 6.23, i.e. about 6
{snip}

I tried coding it up, but I felt horrified when I looked at the scores. I am sure it is purely psychological, but in the example above Bd7 gets 6 points.
Bd7 is truly an awful move, considering all the better alternatives.
In the high end games in my collection, only two moves have been played: Nc6 and Nxd5.

Now, I realize that mathematically, the logistic fit makes perfect sense, especially considering how the Elo values are calculated.
But the artist in me cannot stomach those values.
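The compression described above can be reproduced with a few lines of Python. This is only a sketch of the quoted formula applied to the nine reb-glob.epd evaluations; the names `scoring_rate` and `points` are made up for illustration and are not taken from Ferdy's or Dann's code.

```python
def scoring_rate(score_cp):
    """Logistic conversion of a centipawn score to a rate in (0, 1)."""
    return 1.0 / (1.0 + 10.0 ** (-score_cp / 400.0))


def points(score_cp, best_cp, top=10.0):
    """Scale the rate so that the best move gets `top` points (hypothetical helper)."""
    factor = top / scoring_rate(best_cp)
    return factor * scoring_rate(score_cp)


# (move, ce) pairs copied from the reb-glob.epd example in the quote.
EVALS = [("Nc6", -3), ("Nxd5", -37), ("Re8", -49), ("Be6", -79), ("Na6", -85),
         ("a5", -104), ("h6", -106), ("Bg4", -122), ("Bd7", -140)]

best_cp = max(ce for _, ce in EVALS)
for move, ce in EVALS:
    print(f"{move:5s} ce={ce:5d} sr={scoring_rate(ce):.5f} pt={points(ce, best_cp):.2f}")
```

Because the logistic curve is nearly flat over a span of about 140 centipawns, every move in the list, including Bd7, lands in the upper part of the 10-point scale.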

The way that I perform the calculations is rather arbitrary, and the code for doing it is attached.

The view that I use looks like this:
USE [Chess];
GO

-- Alternate evaluations and primary evaluations for the positions in Rebel.
-- CREATE VIEW must be the first statement in its batch, hence the GO above.
CREATE VIEW [dbo].[Reb] AS
SELECT Epd, abm, c3, ace, aacd, apv
FROM [Chess].[dbo].[epd_and_alternates]
--WHERE Epd IN (select Epd from BareEPD)
WHERE Epd IN (SELECT Epd FROM Rebel)
UNION
SELECT Epd, bm, c3, ce, acd, pv
FROM Epd
--WHERE Epd IN (select Epd from BareEPD)
WHERE Epd IN (SELECT Epd FROM Rebel);

and the view epd_and_alternates looks like this:
USE [Chess];
GO

-- One row per alternate evaluation, joined to its parent Epd record.
CREATE VIEW [dbo].[epd_and_alternates]
AS
SELECT a.ace, a.apv, a.aacd, a.aacn, a.aam, a.abm, a.apm, a.aid, a.aacs,
       e.Epd, e.acn, e.acd, e.ce, e.pv, e.am, e.bm, e.dm, e.pm, e.id,
       e.CheckoutCount, e.c0, e.c1, e.c2, e.c3, e.c4, e.acs
FROM dbo.AlternateEvals AS a
INNER JOIN dbo.Epd AS e ON a.EpdID = e.EpdID;

Ferdy · Post by **Ferdy** » Sun Jun 24, 2018 4:15 am

Dann Corbit wrote: ↑Thu Jun 21, 2018 8:21 pm Yes, my formula is quite complicated.
But, I think that (for instance) it is crucial to include depth.
And a dominant score at a lower depth should be considered (and probably reanalyzed).
However, your efforts are quite interesting.
I have two computer programs I use to analyze the data to produce the scores.
I try to normalize to 10, 9, 8, ..., 2, 1 scoring.

One issue with scoring from 1 to 10 is that if only 8 solutions are provided, the 8th is scored 1, and the position has more than 8 legal moves, then the scale is fitted only to the top 8. This scoring range may help measure the top engines, since those are capable of finding moves within the top 8; for weaker engines it would be a different story.

Example from original rebel.epd

Code:

r3r1k1/1p3nqp/2pp4/p4p2/Pn3P1Q/2N4P/1PPR2P1/3R1BK1 w - - bm Ne2; c0 "positional scores are: Ne2=10, g4=6, Bd3=5, Rxd6=2, Re1=2, Qh5=1, Kh2=1, Be2=1"; id "rebel.pos.01";

Another issue is that with a limited range of only 1 to 10, the point gap between moves may be too small to differentiate them, for example Rxd6=2, Re1=2 or Qh5=1, Kh2=1, Be2=1.
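The ties can be checked directly on the c0 comment of the rebel.pos.01 line. The Python sketch below assumes the c0 text is a comma-separated list of move=points pairs, as in that example; `parse_c0_scores` and `tied_groups` are hypothetical helpers, not part of MEA or any EPD library.

```python
import re
from collections import defaultdict

EPD = ('r3r1k1/1p3nqp/2pp4/p4p2/Pn3P1Q/2N4P/1PPR2P1/3R1BK1 w - - bm Ne2; '
       'c0 "positional scores are: Ne2=10, g4=6, Bd3=5, Rxd6=2, Re1=2, '
       'Qh5=1, Kh2=1, Be2=1"; id "rebel.pos.01";')


def parse_c0_scores(epd_line):
    """Return {move: points} from the c0 comment (assumes 'move=points' pairs)."""
    match = re.search(r'c0\s+"([^"]*)"', epd_line)
    if match is None:
        return {}
    pairs = re.findall(r'([^\s,=:]+)=(\d+)', match.group(1))
    return {move: int(pts) for move, pts in pairs}


def tied_groups(scores):
    """Group moves that share the same point value, keeping only real ties."""
    groups = defaultdict(list)
    for move, pts in scores.items():
        groups[pts].append(move)
    return {pts: moves for pts, moves in groups.items() if len(moves) > 1}


scores = parse_c0_scores(EPD)
print(len(scores), "scored moves")
print(tied_groups(scores))  # {2: ['Rxd6', 'Re1'], 1: ['Qh5', 'Kh2', 'Be2']}
```

Five of the eight listed moves sit in a tie with at least one other move, which is the resolution problem being described.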

To solve these issues, use a 100-point system, i.e. the top move is set to 100 and the other moves get a score based on the logistic function, score_rate = 1/(1+10^(-score/400)).

Using SF to analyze the above position with multipv set to legal_move (the number of legal moves), and using the 100 pt system:

ScoreRate = 1/(1+10^(-engce/400))
factor = 100/ScoreRate_1 = 100/0.68382
pts_1 = factor * ScoreRate_1 = 100
pts_2 = factor * ScoreRate_2 = 95

The dep column is the depth searched by SF. To improve the accuracy of the analysis for all moves, SF can be run deeper than 24.

legal_move: 38

Code:

num move  engce dep ScoreRate pts
  1 Ne2     134  24   0.68382 100
  2 Nb1     110  24   0.65322  95
  3 Re1      77  24   0.60903  89
  4 Be2      50  24   0.57146  83
  5 Kh2      46  24   0.56582  82
  6 Bd3      42  24   0.56015  81
  7 Qh5      40  24   0.55731  81
  8 Rc1      36  24   0.55162  80
  9 Re2      27  24   0.53878  78
 10 g4       27  24   0.53878  78
 11 Kh1      10  24   0.51439  75
 12 Bc4       0  24   0.50000  73
 13 g3        0  24   0.50000  73
 14 Qf2       0  24   0.50000  73
 15 Rb1      -6  24   0.49137  71
 16 Ra1      -6  24   0.49137  71
 17 Rf2      -8  24   0.48849  71
 18 Kf2     -28  24   0.45979  67
 19 Rd4     -74  24   0.39509  57
 20 Nb5    -110  24   0.34678  50
 21 Rd3    -138  24   0.31123  45
 22 Rxd6   -152  24   0.29422  43
 23 Bb5    -173  24   0.26975  39
 24 Na2    -178  24   0.26412  38
 25 b3     -400  24   0.09091  13
 26 Ba6    -428  24   0.07844  11
 27 Rd5    -432  24   0.07679  11
 28 Nd5    -438  24   0.07438  10
 29 Ne4    -463  24   0.06506   9
 30 Qe1    -742  24   0.01377   2
 31 Qg5    -945  24   0.00432   0
 32 Qxh7+ -1161  24   0.00125   0
 33 Qh6   -1239  24   0.00080   0
 34 Qf6   -1240  24   0.00079   0
 35 Qg4   -1259  24   0.00071   0
 36 Qg3   -1267  24   0.00068   0
 37 Qe7   -1267  24   0.00068   0
 38 Qd8   -1280  24   0.00063   0
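For anyone who wants to regenerate the pts column, here is a small Python sketch. It assumes the points are truncated to an integer rather than rounded, which is only inferred from the table (Nb1 works out to about 95.5 and is listed as 95). The names `score_rate` and `to_points` are illustrative, and only a sample of the 38 rows is included.

```python
def score_rate(engce):
    """ScoreRate = 1/(1+10^(-engce/400)), with engce in centipawns."""
    return 1.0 / (1.0 + 10.0 ** (-engce / 400.0))


def to_points(evals, top=100):
    """Turn (move, engce) pairs into (move, engce, rate, pts) rows, best first.

    pts is truncated with int(); this is an assumption read off the table.
    """
    best_rate = max(score_rate(ce) for _, ce in evals)
    rows = []
    for move, ce in sorted(evals, key=lambda item: -item[1]):
        rate = score_rate(ce)
        # Divide first so the best move gets exactly `top` points.
        pts = int(top * (rate / best_rate))
        rows.append((move, ce, rate, pts))
    return rows


# A sample of rows from the table above.
SAMPLE = [("Ne2", 134), ("Nb1", 110), ("Bc4", 0), ("Rxd6", -152),
          ("b3", -400), ("Qe1", -742), ("Qg5", -945)]

for move, ce, rate, pts in to_points(SAMPLE):
    print(f"{move:5s} {ce:6d} {rate:.5f} {pts:3d}")
```

Dividing each rate by the best rate before multiplying by 100 is the same as using factor = 100/ScoreRate_1, but it avoids a floating-point result just under 100 for the top move.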

The final EPD would look like the line below, with moves scoring 0 pts removed.

Code:

r3r1k1/1p3nqp/2pp4/p4p2/Pn3P1Q/2N4P/1PPR2P1/3R1BK1 w - - bm Ne2; id "rebel.pos.01"; c0 "Ne2=100, Nb1=95, Re1=89, Be2=83, Kh2=82, Bd3=81, Qh5=81, Rc1=80, Re2=78, g4=78, Kh1=75, Bc4=73, g3=73, Qf2=73, Rb1=71, Ra1=71, Rf2=71, Kf2=67, Rd4=57, Nb5=50, Rd3=45, Rxd6=43, Bb5=39, Na2=38, b3=13, Ba6=11, Rd5=11, Nd5=10, Ne4=9, Qe1=2";
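Assembling such a line from a scored move list is mechanical. The sketch below follows the field order of the line above (bm, id, c0) and drops 0-point moves; `build_epd` is a hypothetical helper, and the short move list is only a sample of the full one.

```python
FEN = "r3r1k1/1p3nqp/2pp4/p4p2/Pn3P1Q/2N4P/1PPR2P1/3R1BK1 w - -"


def build_epd(fen, scored_moves, pos_id):
    """Build an EPD line from (move, pts) pairs sorted best first.

    Moves with 0 pts are left out of the c0 comment.
    """
    kept = [(move, pts) for move, pts in scored_moves if pts > 0]
    if not kept:
        raise ValueError("no move with a positive score")
    best_move = kept[0][0]
    c0 = ", ".join(f"{move}={pts}" for move, pts in kept)
    return f'{fen} bm {best_move}; id "{pos_id}"; c0 "{c0}";'


sample = [("Ne2", 100), ("Nb1", 95), ("Qe1", 2), ("Qg5", 0), ("Qxh7+", 0)]
print(build_epd(FEN, sample, "rebel.pos.01"))
```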

This kind of scoring may differentiate the engines better, including the weaker ones.
