Evaluation Score History of (some) Engines

Rebel
Posts: 6991
Joined: Thu Aug 18, 2011 12:04 pm

Post by Rebel »

The table below is a byproduct of the SIMEX DEPTH=1 testing.

Input: 10,000 positions.
Output: sum of all eval scores / 10,000 = average eval score, listed per engine as version : average.


Andscacs       Arasan        Ethereal       Komodo      Stockfish   Wasp          Xiphos       Lc0
0.81 : 0.27    17 : 0.29     8.00 : 0.25    1 : 0.17    1 : 0.25    1.01 : 0.19   0.1 : 0.43   2018-08 : -0.08
0.82 : 0.29    18 : 0.30     8.37 : 0.23    2 : 0.20    2 : 0.18    1.02 : 0.19   0.1 : 0.43   2019-01 : 0.02
0.83 : 0.29    19 : 0.31     8.61 : 0.17    3 : 0.18    3 : 0.29    1.25 : 0.19   0.2 : 0.50   2019-02 : 0.00
0.84 : 0.23    20 : 0.52     9.00 : 0.19    4 : 0.17    4 : 0.28    2.00 : 0.21   0.3 : 0.44   2019-03 : 0.04
0.85 : 0.24    21 : 0.37     9.30 : 0.25    5 : 0.18    5 : 0.25    2.60 : 0.19   0.4 : 0.36   2019-04 : 0.05
0.87 : 0.25    21.1 : 0.37   9.65 : 0.26    6 : 0.21    6 : 0.28    3.00 : 0.17   0.5 : 0.36   2019-05 : 0.02
0.88 : 0.26    21.3 : 0.40   10.00 : 0.29   7 : 0.27    7 : 0.29    3.50 : 0.18   0.6 : 0.42   2019-06 : 0.03
0.89 : 0.30                  10.55 : 0.41   8 : 0.27    8 : 0.31    3.60 : 0.20                2019-07 : 0.00
0.90 : 0.29                  11.00 : 0.53   9 : 0.26    9 : 0.52    3.75 : 0.25                2019-08 : -0.11
0.91 : 0.31                  11.25 : 0.48   10 : 0.34   10 : 0.64                              2019-09 : 0.04
0.92 : 0.33
0.93 : 0.33
0.94 : 0.36
0.95 : 0.36
The Lc0 scores are taken from a network of August 2018 and thereafter from a network of each month in 2019.

AB engine evaluations seem to rise over successive versions. Lc0 is obviously a different animal: its average score is about 0.00 :lol:
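For readers who want to reproduce the averaging step, here is a minimal sketch. It is not SIMEX's own code, which is not shown in this thread. It assumes each eval arrives in centipawns from the side to move's point of view and is converted to White's point of view before averaging; the function names and the sample numbers are hypothetical.

```python
def to_white_pov(score_cp, white_to_move):
    """Flip a side-to-move score so that positive always favours White."""
    return score_cp if white_to_move else -score_cp


def average_eval_pawns(evals):
    """Average eval in pawns over a list of (score_cp, white_to_move) pairs.

    Mirrors the formula in the post: sum of all eval scores / number of
    positions. The centipawn unit and the point of view are assumptions.
    """
    if not evals:
        raise ValueError("need at least one eval score")
    total_cp = sum(to_white_pov(cp, wtm) for cp, wtm in evals)
    return total_cp / len(evals) / 100.0


# Made-up scores for illustration only, not data from the table.
sample = [(25, True), (10, False), (40, True), (0, False), (95, True)]
print(f"{average_eval_pawns(sample):.2f}")
```

If the scores are already reported from White's point of view, the conversion step can be dropped and the scores averaged directly.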
chrisw
Posts: 4315
Joined: Tue Apr 03, 2012 4:28 pm

Re: Evaluation Score History of (some) Engines

Post by chrisw »

Rebel wrote: Sat Nov 09, 2019 12:02 pm The below table is a byproduct of the SIMEX DEPTH=1 testing.
AB engine evaluations seem to rise over successive versions. Lc0 is obviously a different animal: its average score is about 0.00 :lol:
When bias hunting, I found a persistent bias towards White in test suites. I guess this is because White wins more games than Black: if the sampling of positions from games doesn't specifically equalize for the White/Black win results of the PGNs being sampled, the bias carries over into the positions. Correcting for this is on my anti-bias list for the next sampling, but it is probably not significant for the current purpose.
There may also be an effect from White starting with a slightly better position due to the tempo. Black either neutralizes this and goes on to draw, or overturns it and goes on to win, but in all variations White's initial moves are going to be skewed positive. I'm not at all sure how to eliminate this bias, although again it may not be consequential.
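The first correction mentioned above, equalizing for game result when sampling, could look roughly like the sketch below. It assumes positions have already been extracted as (FEN, result) pairs with PGN-style result strings; the function name, the per_result parameter and the fixed seed are hypothetical. It does nothing about the tempo effect from the second paragraph.

```python
import random
from collections import defaultdict


def balanced_sample(positions, per_result, seed=0):
    """Draw equally many positions from White wins and Black wins.

    positions: iterable of (fen, result), result in {"1-0", "0-1", "1/2-1/2"}.
    Positions from drawn games are included in the same quantity when available.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_result = defaultdict(list)
    for fen, result in positions:
        by_result[result].append(fen)

    # The smaller decisive pool caps the sample so both colours stay equal.
    n = min(per_result, len(by_result["1-0"]), len(by_result["0-1"]))
    sample = []
    for result in ("1-0", "0-1", "1/2-1/2"):
        pool = by_result[result]
        for fen in rng.sample(pool, min(n, len(pool))):
            sample.append((fen, result))
    return sample
```

Capping at the smaller decisive pool discards some data, which seems an acceptable trade when the goal is a test suite that does not lean towards either colour.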