Evaluation Score History of (some) Engines

Rebel · Post by **Rebel** » Sat Nov 09, 2019 12:02 pm

The table below is a byproduct of the SIMEX DEPTH=1 testing. Each entry reads "version : average eval score".

Input: 10,000 positions.
Output: sum of all eval scores / 10,000 = average eval score.
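As a minimal sketch of that averaging step (the function name and the four sample scores are made up for illustration; the actual run used 10,000 positions and the SIMEX tooling, which is not shown here):

```python
def average_eval(scores):
    """Return the arithmetic mean of a list of per-position eval scores."""
    if not scores:
        raise ValueError("no scores given")
    return sum(scores) / len(scores)


# Hypothetical depth-1 evals for four positions, not real engine output.
sample = [0.35, -0.10, 0.80, 0.15]
print(round(average_eval(sample), 2))
```

A positive average means the engine, on balance, scores the test positions in favour of one side, which is what the table tracks per version.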

Andscacs       Arasan        Ethereal       Komodo      Stockfish   Wasp          Xiphos       Lc0
0.81 : 0.27    17 : 0.29     8.00 : 0.25    1 : 0.17    1 : 0.25    1.01 : 0.19   0.1 : 0.43   2018-08 : -0.08
0.82 : 0.29    18 : 0.30     8.37 : 0.23    2 : 0.20    2 : 0.18    1.02 : 0.19   0.1 : 0.43   2019-01 : 0.02
0.83 : 0.29    19 : 0.31     8.61 : 0.17    3 : 0.18    3 : 0.29    1.25 : 0.19   0.2 : 0.50   2019-02 : 0.00
0.84 : 0.23    20 : 0.52     9.00 : 0.19    4 : 0.17    4 : 0.28    2.00 : 0.21   0.3 : 0.44   2019-03 : 0.04
0.85 : 0.24    21 : 0.37     9.30 : 0.25    5 : 0.18    5 : 0.25    2.60 : 0.19   0.4 : 0.36   2019-04 : 0.05
0.87 : 0.25    21.1 : 0.37   9.65 : 0.26    6 : 0.21    6 : 0.28    3.00 : 0.17   0.5 : 0.36   2019-05 : 0.02
0.88 : 0.26    21.3 : 0.40   10.00 : 0.29   7 : 0.27    7 : 0.29    3.50 : 0.18   0.6 : 0.42   2019-06 : 0.03
0.89 : 0.30                  10.55 : 0.41   8 : 0.27    8 : 0.31    3.60 : 0.20                2019-07 : 0.00
0.90 : 0.29                  11.00 : 0.53   9 : 0.26    9 : 0.52    3.75 : 0.25                2019-08 : -0.11
0.91 : 0.31                  11.25 : 0.48   10 : 0.34   10 : 0.64                              2019-09 : 0.04
0.92 : 0.33
0.93 : 0.33
0.94 : 0.36
0.95 : 0.36

The first Lc0 score is taken from a network of August 2018, the others from one network for each month in 2019.

The evaluations of the AB (alpha-beta) engines seem to rise from version to version. Lc0 is obviously a different animal: its average score is about 0.00.
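A quick way to check both observations against the table, as a rough sketch: the lists are copied from the Lc0 and Stockfish columns, while the helper names `mean` and `drift` are only illustrative.

```python
# Average eval per version or network, copied from the table columns.
lc0 = [-0.08, 0.02, 0.00, 0.04, 0.05, 0.02, 0.03, 0.00, -0.11, 0.04]
stockfish = [0.25, 0.18, 0.29, 0.28, 0.25, 0.28, 0.29, 0.31, 0.52, 0.64]


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def drift(values):
    """Change from the first listed version to the last one."""
    return values[-1] - values[0]


print("Lc0 mean over all networks:", round(mean(lc0), 2))
print("Stockfish first-to-last change:", round(drift(stockfish), 2))
```

The first-to-last difference is a crude measure; it ignores the dips in between, such as Stockfish 2.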

chrisw · Post by **chrisw** » Sat Nov 09, 2019 5:17 pm

Rebel wrote: ↑Sat Nov 09, 2019 12:02 pm The below table is a byproduct of the SIMEX DEPTH=1 testing.

Input: 10,000 positions.
Output : sum all eval scores / 10000 = average eval score.

Code: Select all

Andscacs       Arasan        Ethereal       Komodo      Stockfish   Wasp          Xiphos       Lc0
0.81 : 0.27    17 : 0.29     8.00 : 0.25    1 : 0.17    1 : 0.25    1.01 : 0.19   0.1 : 0.43   2018-08 : -0.08
0.82 : 0.29    18 : 0.30     8.37 : 0.23    2 : 0.20    2 : 0.18    1.02 : 0.19   0.1 : 0.43   2019-01 : 0.02
0.83 : 0.29    19 : 0.31     8.61 : 0.17    3 : 0.18    3 : 0.29    1.25 : 0.19   0.2 : 0.50   2019-02 : 0.00
0.84 : 0.23    20 : 0.52     9.00 : 0.19    4 : 0.17    4 : 0.28    2.00 : 0.21   0.3 : 0.44   2019-03 : 0.04
0.85 : 0.24    21 : 0.37     9.30 : 0.25    5 : 0.18    5 : 0.25    2.60 : 0.19   0.4 : 0.36   2019-04 : 0.05
0.87 : 0.25    21.1 : 0.37   9.65 : 0.26    6 : 0.21    6 : 0.28    3.00 : 0.17   0.5 : 0.36   2019-05 : 0.02
0.88 : 0.26    21.3 : 0.40   10.00 : 0.29   7 : 0.27    7 : 0.29    3.50 : 0.18   0.6 : 0.42   2019-06 : 0.03
0.89 : 0.30                  10.55 : 0.41   8 : 0.27    8 : 0.31    3.60 : 0.20                2019-07 : 0.00
0.90 : 0.29                  11.00 : 0.53   9 : 0.26    9 : 0.52    3.75 : 0.25                2019-08 : -0.11
0.91 : 0.31                  11.25 : 0.48   10 : 0.34   10 : 0.64                              2019-09 : 0.04
0.92 : 0.33
0.93 : 0.33
0.94 : 0.36
0.95 : 0.36

The Lc0 scores are taken from a network of August 2018, there after from each month in 2019.

AB engines evaluations seems to rise, Lc0 obviously is a different animal, the average score is 0.00

When bias hunting, I found there’s a persistent bias toward white in test suites. I guess this is because white wins more games than black: if the sampling of positions from games doesn’t specifically equalize the white-win and black-win results of the PGNs being sampled, the suite inherits that bias. Correcting for this is on my anti-bias list for the next sampling, but it is probably not significant for the current purpose.
There may also be an effect from white starting with a slightly better position due to the tempo. Black either neutralizes this and goes on to draw, or negates it and goes on to win, but in all variations white’s initial moves are going to be skewed positive. I’m not at all sure how to eliminate this bias, although again it may not be consequential.
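One possible way to do the result balancing, sketched under assumptions: positions are assumed to arrive as `(fen, result)` pairs with PGN-style result strings, and `balanced_sample` is a hypothetical helper, not part of any existing tool. It only addresses the win-ratio bias, not the tempo effect.

```python
import random
from collections import defaultdict


def balanced_sample(positions, n_per_result, seed=0):
    """Pick the same number of positions from white wins and black wins.

    positions: iterable of (fen, result) pairs, result being "1-0",
    "0-1" or "1/2-1/2" (an assumed input format). Draws are skipped.
    """
    rng = random.Random(seed)
    by_result = defaultdict(list)
    for fen, result in positions:
        by_result[result].append((fen, result))

    picked = []
    for result in ("1-0", "0-1"):
        pool = by_result[result]
        if len(pool) < n_per_result:
            raise ValueError(f"only {len(pool)} positions with result {result}")
        picked.extend(rng.sample(pool, n_per_result))
    return picked


# Placeholder data: ten white wins, four black wins, three draws.
games = (
    [(f"white-win-{i}", "1-0") for i in range(10)]
    + [(f"black-win-{i}", "0-1") for i in range(4)]
    + [(f"draw-{i}", "1/2-1/2") for i in range(3)]
)
suite = balanced_sample(games, 3)
print(len(suite))
```

Whether draws should be kept, and in what proportion, is a separate choice that this sketch leaves open.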
