Why Lc0 eval (in cp) is asymmetric against AB engines?

Laskos · Post by **Laskos** » Wed Jul 25, 2018 1:21 pm

I guess that they convert winning probabilities to evals in CPs, but do they convert them asymmetrically? I mean, against an opponent of equal strength, the performance vs eval graph should ideally be symmetric about the point eval=0, performance=50% (the point of symmetry).

I managed to choose time controls such that Komodo 12.1.1 on one i7 3.8 GHz core is almost exactly equal in strength to Lc0 ID10166 on a GTX 1060.
The result of a 1000-game match at fast time control is:

Score of Komodo 12.1.1 vs lc0_v16 ID10166: 399 - 406 - 195 [0.496] 1000
Elo difference: -2.43 +/- 19.31

So, at these chosen time controls, they are practically equal in strength. But while Komodo's curve is approximately symmetric about the point of symmetry, Lc0's is not. Also, a high eval, say 300cp to 400cp, does not mean more than 65-70% performance against Komodo (of equal strength), which is unusual. Moreover, an Lc0 eval of 0.00 in middlegames means a performance of only some 32% against an equally strong AB engine like Komodo. Lc0 reaches 50% only at an eval of +140cp.
Here is the plot of performance of engines versus eval shown by them in middlegames (moves 10 to 30):

Also, out of curiosity, I plotted the length of the won games for Komodo (median of 46 moves) and Lc0 (median of 67 moves). No adjudication was used. Lc0 seems to take longer to mate the opponent.
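The Elo figure quoted with the match score above can be reproduced from the win/loss/draw counts. The sketch below uses the standard logistic Elo formula; the error margin is an assumed normal-approximation 95% interval, since the thread does not say how the match tool computes it, and the function names are illustrative only.

```python
import math

def elo_from_score(score: float) -> float:
    """Logistic Elo difference implied by a score fraction in (0, 1)."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def elo_with_margin(wins: int, losses: int, draws: int, z: float = 1.96):
    """Return (elo, margin) for a W-L-D result.

    The margin is an ASSUMED normal-approximation 95% interval,
    not necessarily what the match tool itself computes.
    """
    n = wins + losses + draws
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result (win=1, draw=0.5, loss=0).
    var = (wins * (1.0 - score) ** 2
           + losses * (0.0 - score) ** 2
           + draws * (0.5 - score) ** 2) / n
    se = math.sqrt(var / n)  # standard error of the mean score
    lo = elo_from_score(score - z * se)
    hi = elo_from_score(score + z * se)
    return elo_from_score(score), (hi - lo) / 2.0

# Komodo 12.1.1 vs lc0_v16 ID10166: 399 - 406 - 195
elo, margin = elo_with_margin(399, 406, 195)
print(f"Elo difference: {elo:.2f} +/- {margin:.2f}")
```

With these counts the Elo difference comes out at about -2.43, and the margin lands near the reported value, so the two engines are indeed indistinguishable in strength at this sample size.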

yanquis1972 · Post by **yanquis1972** » Wed Jul 25, 2018 5:04 pm

a couple possibly related thoughts on the finer points:

blunders (partic at fast TC, which i assume you used to get 1000 games): leela could have a correct 0.00 eval & not draw against a strong A/B opponent. whether or not it'd be that frequent, no idea.

length of wins: observationally, leela has a tendency to be passive in late-middlegame positions she evaluates as winning (seems somehow related to the 50 move rule but idk. it's like she assumes her opponent will blink first, even though it's not in their interest, & the active move is presumably evaluated as suboptimal unless the alternative is a draw. could just boil down to search)

i don't usually see the eval at 1.50 if a win isn't there [edit: discounting endgame evals; with those i think leela is going to be quite terrible until LR is reduced once or even twice, & that may go a long way in explaining the average], but even at much longer TC these often end up drawn.

so my guess is symmetric but combination of (relatively) terrible search + typical 'still just needs a lot of training' excuse

some 30/30 games against single-core stockfish, if they help: https://lichess.org/study/guI3lnix

crem · Post by **crem** » Wed Jul 25, 2018 6:00 pm

Laskos wrote: ↑Wed Jul 25, 2018 1:21 pm I guess that they convert winning probabilities to evals in CPs, but do they convert them asymmetrically? I mean, against an opponent of equal strength, the performance vs eval graph should ideally be symmetric about the point eval=0, performance=50% (the point of symmetry).

Interesting! Our Q to centipawns formula is symmetric (namely, 290.680623072 * tan(1.548090806 * Q), where Q is between -1 and 1).
The reason for that bias is probably Lc0 being overly optimistic, due to blunders, not knowing how to win won endgames, etc.

It would be interesting to see distribution of win/loss/draw games per every bucket, and also do the same with games with itself. Not sure how insightful it would be though.
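The formula above is an odd function of Q, which is what makes the conversion symmetric. A small sketch to check this numerically and to invert it; the helper names are made up here, and the `(Q + 1) / 2` expected-score mapping is an assumption about how Q relates to game outcome, not something stated in this post.

```python
import math

# Constants exactly as quoted in the post above.
A = 290.680623072
B = 1.548090806

def q_to_cp(q: float) -> float:
    """Convert Q in [-1, 1] to centipawns using the quoted formula."""
    return A * math.tan(B * q)

def cp_to_q(cp: float) -> float:
    """Inverse of q_to_cp (hypothetical helper, derived algebraically)."""
    return math.atan(cp / A) / B

def expected_score_from_cp(cp: float) -> float:
    """ASSUMPTION: Q is an expected outcome on a -1..1 scale,
    so the expected score is (Q + 1) / 2."""
    return (cp_to_q(cp) + 1.0) / 2.0

for q in (-0.5, 0.0, 0.29, 0.5, 1.0):
    print(f"Q={q:+.2f} -> {q_to_cp(q):+9.1f} cp")

for cp in (0, 140, 400):
    print(f"{cp:4d} cp -> expected score {expected_score_from_cp(cp):.3f}")
```

Under that assumption, the eval where Lc0 actually scores 50% against Komodo corresponds to an expected score well above 50%, which is one way to quantify the optimism discussed here.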

yanquis1972 · Post by **yanquis1972** » Wed Jul 25, 2018 6:06 pm

is bucket terminology or is that just shorthand for 'bucketful of nets'

Laskos · Post by **Laskos** » Wed Jul 25, 2018 8:45 pm

I should perhaps specify the conditions, as the plots depend (mildly) on them. The time control was short: 0.1s/move for Lc0 ID10166 and 0.013s/move for Komodo 12.1.1 (I tuned Komodo's time control down to such a small value to make it equal in strength to Lc0). In these conditions Lc0 had about 150-200 nodes per move, a pretty low number.
As for the raw data for the first (eval) plot:

                 players,      min,      max,     Gcnt,     Wcnt,     Lcnt,     Dcnt,  perf(%),
                 
                   K1211,    -4.00,    -3.60,      207,        5,      189,       13,    5.56,
                   K1211,    -3.60,    -3.20,      330,       10,      300,       20,    6.06,
                   K1211,    -3.20,    -2.80,      406,       25,      365,       16,    8.13,
                   K1211,    -2.80,    -2.40,      548,       56,      440,       52,   14.96,
                   K1211,    -2.40,    -2.00,      714,       78,      524,      112,   18.77,
                   K1211,    -2.00,    -1.60,      934,      129,      610,      195,   24.25,
                   K1211,    -1.60,    -1.20,     1352,      243,      774,      335,   30.36,
                   K1211,    -1.20,    -0.80,     1659,      413,      812,      434,   37.97,
                   K1211,    -0.80,    -0.40,     2223,      680,      935,      608,   44.26,
                   K1211,    -0.40,     0.00,     2993,     1101,     1014,      878,   51.45,
                   K1211,     0.00,     0.40,     2965,     1350,      780,      835,   59.61,
                   K1211,     0.40,     0.80,     1605,      929,      352,      324,   67.98,
                   K1211,     0.80,     1.20,      858,      597,      114,      147,   78.15,
                   K1211,     1.20,     1.60,      505,      391,       56,       58,   83.17,
                   K1211,     1.60,     2.00,      383,      325,       30,       28,   88.51,
                   K1211,     2.00,     2.40,      284,      257,       17,       10,   92.25,
                   K1211,     2.40,     2.80,      204,      193,        7,        4,   95.59,
                   K1211,     2.80,     3.20,      174,      167,        5,        2,   96.55,
                   K1211,     3.20,     3.60,      149,      138,        3,        8,   95.30,
                   K1211,     3.60,     4.00,      132,      125,        1,        6,   96.97,
                   
                 lc0_v16,    -4.00,    -3.60,       96,        6,       87,        3,    7.81,
                 lc0_v16,    -3.60,    -3.20,       93,        1,       88,        4,    3.23,
                 lc0_v16,    -3.20,    -2.80,       99,        5,       91,        3,    6.57,
                 lc0_v16,    -2.80,    -2.40,      124,        3,      110,       11,    6.85,
                 lc0_v16,    -2.40,    -2.00,      159,        9,      142,        8,    8.18,
                 lc0_v16,    -2.00,    -1.60,      244,        9,      214,       21,    7.99,
                 lc0_v16,    -1.60,    -1.20,      336,       23,      275,       38,   12.50,
                 lc0_v16,    -1.20,    -0.80,      428,       29,      327,       72,   15.19,
                 lc0_v16,    -0.80,    -0.40,      566,       80,      368,      118,   24.56,
                 lc0_v16,    -0.40,     0.00,     1019,      148,      586,      285,   28.51,
                 lc0_v16,     0.00,     0.40,     1571,      316,      800,      455,   34.60,
                 lc0_v16,     0.40,     0.80,     1606,      444,      770,      392,   39.85,
                 lc0_v16,     0.80,     1.20,     1482,      526,      589,      367,   47.87,
                 lc0_v16,     1.20,     1.60,     1306,      475,      475,      356,   50.00,
                 lc0_v16,     1.60,     2.00,     1046,      436,      334,      276,   54.88,
                 lc0_v16,     2.00,     2.40,      918,      382,      293,      243,   54.85,
                 lc0_v16,     2.40,     2.80,      798,      379,      212,      207,   60.46,
                 lc0_v16,     2.80,     3.20,      661,      335,      188,      138,   61.12,
                 lc0_v16,     3.20,     3.60,      566,      301,      124,      141,   65.64,
                 lc0_v16,     3.60,     4.00,      505,      273,       93,      139,   67.82,

(min, max) is the eval window for the designated engine.

Gcnt is the total number of times an eval inside this window was shown by that engine during the match.
Wcnt is how many of those Gcnt occurrences were in games the engine won.
Lcnt is how many of them were in games it lost.
Dcnt is how many of them were in drawn games.
So, Wcnt + Lcnt + Dcnt = Gcnt

perf (%) is the averaged performance of the designated engine inside that (min, max) eval interval, or (Wcnt + Dcnt/2) / Gcnt.
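As a quick check of the perf (%) definition, here is a minimal sketch that recomputes it for a few rows copied from the table above. The function name and the tuple layout are illustrative choices, not part of the original tooling.

```python
def perf_percent(wins: int, losses: int, draws: int) -> float:
    """Performance in percent: (Wcnt + Dcnt/2) / Gcnt * 100."""
    games = wins + losses + draws  # Gcnt = Wcnt + Lcnt + Dcnt
    return 100.0 * (wins + 0.5 * draws) / games

# (player, min, max, Wcnt, Lcnt, Dcnt), copied from the raw data above.
rows = [
    ("K1211",   -4.00, -3.60,    5,  189,  13),
    ("K1211",   -0.40,  0.00, 1101, 1014, 878),
    ("lc0_v16",  1.20,  1.60,  475,  475, 356),
    ("lc0_v16",  3.60,  4.00,  273,   93, 139),
]

for player, lo, hi, w, l, d in rows:
    print(f"{player:>8} [{lo:+.2f}, {hi:+.2f}): {perf_percent(w, l, d):6.2f}%")
```

The recomputed values agree with the perf(%) column for these rows, including the exact 50.00% for Lc0 in the 1.20 to 1.60 window.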
=====================

I made the same computation for self-games of ID10166 at 0.1s/move:

In self-games the plot is pretty symmetric, although not completely. The raw data is here:

                  players,      min,      max,     Gcnt,     Wcnt,     Lcnt,     Dcnt,  perf(%),
                 
                 lc0_v16,    -4.00,    -3.60,       61,        6,       50,        5,   13.93,
                 lc0_v16,    -3.60,    -3.20,       61,        4,       54,        3,    9.02,
                 lc0_v16,    -3.20,    -2.80,       80,       10,       59,       11,   19.38,
                 lc0_v16,    -2.80,    -2.40,      121,       19,       76,       26,   26.45,
                 lc0_v16,    -2.40,    -2.00,      159,       20,      100,       39,   24.84,
                 lc0_v16,    -2.00,    -1.60,      234,       43,      132,       59,   30.98,
                 lc0_v16,    -1.60,    -1.20,      325,       62,      176,       87,   32.46,
                 lc0_v16,    -1.20,    -0.80,      493,      134,      231,      128,   40.16,
                 lc0_v16,    -0.80,    -0.40,      762,      208,      376,      178,   38.98,
                 lc0_v16,    -0.40,     0.00,     1153,      382,      478,      293,   45.84,
                 lc0_v16,     0.00,     0.40,     1428,      547,      519,      362,   50.98,
                 lc0_v16,     0.40,     0.80,     1195,      524,      356,      315,   57.03,
                 lc0_v16,     0.80,     1.20,      750,      349,      211,      190,   59.20,
                 lc0_v16,     1.20,     1.60,      542,      243,      145,      154,   59.04,
                 lc0_v16,     1.60,     2.00,      366,      182,       74,      110,   64.75,
                 lc0_v16,     2.00,     2.40,      312,      164,       52,       96,   67.95,
                 lc0_v16,     2.40,     2.80,      251,      145,       41,       65,   70.72,
                 lc0_v16,     2.80,     3.20,      174,      107,       34,       33,   70.98,
                 lc0_v16,     3.20,     3.60,      135,       85,       18,       32,   74.81,
                 lc0_v16,     3.60,     4.00,       99,       69,        8,       22,   80.81,

The won self-games of Lc0 have a median length of 55.5 moves, below the median length of Lc0's wins against the equally strong Komodo (67) and above the median length of Komodo's won games (46). No adjudication was used.
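One rough way to put a number on "pretty symmetric, although not completely" is to pair each eval window with its mirror image and see how far the two performances are from summing to 100%. The sketch below does this with the perf(%) columns copied from the two tables; the unweighted mean is an arbitrary choice of statistic made for illustration, not something used in the posts.

```python
# perf(%) columns copied from the raw data, windows from [-4.0, -3.6) to [3.6, 4.0).
KOMODO_VS_LC0 = [5.56, 6.06, 8.13, 14.96, 18.77, 24.25, 30.36, 37.97, 44.26, 51.45,
                 59.61, 67.98, 78.15, 83.17, 88.51, 92.25, 95.59, 96.55, 95.30, 96.97]
LC0_VS_KOMODO = [7.81, 3.23, 6.57, 6.85, 8.18, 7.99, 12.50, 15.19, 24.56, 28.51,
                 34.60, 39.85, 47.87, 50.00, 54.88, 54.85, 60.46, 61.12, 65.64, 67.82]
LC0_SELFPLAY = [13.93, 9.02, 19.38, 26.45, 24.84, 30.98, 32.46, 40.16, 38.98, 45.84,
                50.98, 57.03, 59.20, 59.04, 64.75, 67.95, 70.72, 70.98, 74.81, 80.81]

def mean_asymmetry(perf: list[float]) -> float:
    """Unweighted mean of perf(-x) + perf(+x) - 100 over mirrored windows.

    0 means perfectly symmetric about (eval=0, perf=50%);
    negative means the engine scores less than its eval suggests.
    """
    half = len(perf) // 2
    pairs = [perf[i] + perf[-1 - i] - 100.0 for i in range(half)]
    return sum(pairs) / half

for name, data in [("Komodo vs Lc0", KOMODO_VS_LC0),
                   ("Lc0 vs Komodo", LC0_VS_KOMODO),
                   ("Lc0 self-play", LC0_SELFPLAY)]:
    print(f"{name}: {mean_asymmetry(data):+.2f}")
```

By this measure Lc0 against Komodo is strongly negative, Lc0 in self-play is only mildly negative, and Komodo is somewhat positive. Weighting the pairs by Gcnt would be a reasonable refinement.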

jorose · Post by **jorose** » Wed Jul 25, 2018 10:47 pm

I see two options.

Perhaps this is a sign that Leela is overfitting on certain features? In essence it is showing a positive bias in the positions it is getting and not performing on the level it believes the position is worth. Since when facing itself both sides have the same bias the engines will naturally avoid allowing the other engine to have a position that the engines agree is better for the side it is biased towards.

An alternative possibility is that Leela's evaluations are actually completely correct, but if it blunders then a small advantage will not matter and a rapid end will ensue. This would fit well with my experience playing against engines. When I win it tends to be some long drawn out struggle, slowly building up a position, but when I lose I can often resign very quickly as I missed a tactic costing me a minor or something of that magnitude.

If I had to guess I'd vote for option number 2. There are quite a few things one can be critical of with regards to Leela's design, but I don't think there can be any doubt with regards to the precision of a large, well trained, convolutional neural net based eval function vs an eval designed by hand in a general position. On the other hand I don't see any reason to believe Leela has yet been able to over-fit to a large degree.

Laskos · Post by **Laskos** » Thu Jul 26, 2018 11:10 am

jorose wrote: ↑Wed Jul 25, 2018 10:47 pm I see two options.

Perhaps this is a sign that Leela is overfitting on certain features? In essence it is showing a positive bias in the positions it is getting and not performing on the level it believes the position is worth. Since when facing itself both sides have the same bias the engines will naturally avoid allowing the other engine to have a position that the engines agree is better for the side it is biased towards.

An alternative possibility is that Leela's evaluations are actually completely correct, but if it blunders then a small advantage will not matter and a rapid end will ensue. This would fit well with my experience playing against engines. When I win it tends to be some long drawn out struggle, slowly building up a position, but when I lose I can often resign very quickly as I missed a tactic costing me a minor or something of that magnitude.

If I had to guess I'd vote for option number 2. There are quite a few things one can be critical of with regards to Leela's design, but I don't think there can be any doubt with regards to the precision of a large, well trained, convolutional neural net based eval function vs an eval designed by hand in a general position. On the other hand I don't see any reason to believe Leela has yet been able to over-fit to a large degree.

Yes, I also tend to believe in option 2. The fact is that Leela is prone to blunders which are completely uncharacteristic of regular AB search engines with hand-crafted evals. It also plays endgames poorly against AB engines with hand-crafted knowledge of some endgames. So it often does have an objective advantage in the midgame against a regular engine, an advantage which it then fails to convert, either due to blunders or by mishandling the endgame. Endgames are perhaps the clearest evidence of where hand-crafted knowledge helps greatly, and where Leela acquires this knowledge only very slowly through self-games. In self-games these shortcomings are not revealed, so the shape of the performance vs eval graph is different.

Indirect evidence that option 2 is probably the culprit for the shape of the performance vs eval graph against such a knowledgeable engine as Komodo 12.1.1 comes from comparing the tactically stronger main-server ID517 (which, by the way, also plays better endgames) to the previous plot for test-server ID10166. I now tuned Komodo 12.1.1 to 0.017s/move to make it equal in strength to Lc0 ID517 at 0.1s/move (a smaller, faster net, getting about 400 nodes per move at this tc). They are practically equal in strength in these conditions (1000 games):

   # PLAYER             : RATING  ERROR    POINTS  PLAYED     (%)   CFS(next)
   1 Komodo 12.1.1      :    0.2   10.0     500.5    1000    50.0      51    
   2 lc0_v16 ID517      :   -0.2   10.0     499.5    1000    50.0     ---

And I plotted the behavior of ID517 and that of ID10166:

ID10166 hits 50% performance at about 140cp in the midgame, while ID517 does so at about 100cp. Generally, the ID517 shape for the midgame is a bit better, probably because it is stronger tactically and converts a midgame advantage against Komodo a bit more often.
Also, while the median length of the games won by ID10166 was 67 moves, that median is 61 moves for ID517. The median length of Komodo's won games against ID10166 was 46 moves; against ID517 it is 49 moves. So the difference compared to Komodo is smaller with ID517.
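The 140cp figure for ID10166 can be read off the raw table posted earlier in the thread. A minimal sketch, assuming each eval window is represented by its midpoint and that linear interpolation between neighbouring windows is good enough; the raw data for ID517 was not posted, so only ID10166 and Komodo are shown.

```python
# perf(%) per eval window from the earlier raw data (windows of width 0.40 from -4.00).
LC0_ID10166 = [7.81, 3.23, 6.57, 6.85, 8.18, 7.99, 12.50, 15.19, 24.56, 28.51,
               34.60, 39.85, 47.87, 50.00, 54.88, 54.85, 60.46, 61.12, 65.64, 67.82]
KOMODO = [5.56, 6.06, 8.13, 14.96, 18.77, 24.25, 30.36, 37.97, 44.26, 51.45,
          59.61, 67.98, 78.15, 83.17, 88.51, 92.25, 95.59, 96.55, 95.30, 96.97]

def bin_midpoints(lo: float = -4.0, width: float = 0.4, count: int = 20) -> list[float]:
    """Midpoint of each eval window, in pawns (ASSUMED representative eval)."""
    return [lo + width * (i + 0.5) for i in range(count)]

def crossing(xs: list[float], ys: list[float], level: float = 50.0):
    """First eval at which perf reaches `level`, by linear interpolation."""
    for x0, y0, x1, y1 in zip(xs, ys, xs[1:], ys[1:]):
        if y0 < level <= y1:
            return x0 + (level - y0) * (x1 - x0) / (y1 - y0)
    return None  # the curve never reaches the level

xs = bin_midpoints()
print(f"Lc0 ID10166 reaches 50% at about {crossing(xs, LC0_ID10166):+.2f} pawns")
print(f"Komodo reaches 50% at about {crossing(xs, KOMODO):+.2f} pawns")
```

For ID10166 the crossing falls at the midpoint of the 1.20 to 1.60 window, consistent with the roughly 140cp quoted above, while Komodo's crossing sits slightly below zero.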

yanquis1972 · Post by **yanquis1972** » Thu Jul 26, 2018 3:56 pm

the test net endgames are quite unfortunate at this level (from the very small sample of LTC games i've seen). it seems to do fine with small-piece endgames but when there are many pieces it seems to suffer if there's no king attack. it's not that surprising; even w/ the LR reduction the rate is apparently the same as the one deepmind started with.

fwiw, supposedly the intention is to saturate for another ~200 nets, so the mainserver may be the more interesting one yet again for the next 2+ weeks.

dvneal · Post by **dvneal** » Thu Jul 26, 2018 9:26 pm

Laskos, can you try the same experiment with two other (non-leela) engines against each other? It seems like whichever engine's ratings more correctly predict the result of the specific matchup between the two would show up as being "unbiased" or "symmetrical" in this example. What this really seems to show is that Leela's ratings are good at predicting how she will do against herself, but not so good at predicting how she will do against Komodo.

Laskos · Post by **Laskos** » Fri Jul 27, 2018 2:13 pm

dvneal wrote: ↑Thu Jul 26, 2018 9:26 pm Laskos, can you try the same experiment with two other (non-leela) engines against each other? It seems like whichever engine's ratings more correctly predict the result of the specific matchup between the two would show up as being "unbiased" or "symmetrical" in this example. What this really seems to show is that Leela's ratings are good at predicting how she will do against herself, but not so good at predicting how she will do against Komodo.

I tuned Komodo at some 80 ms/move against SF dev at some 50 ms/move, to have them almost equal in strength (contempt for both was set to 0, as they show a fake eval with the default). Engines on 1 core, move overhead is default.

Score of Komodo 12.1.1 vs SF_dev: 1192 - 1174 - 1634 [0.502] 4000
Elo difference: 1.56 +/- 8.27
Finished match

Komodo shows an eval that is much more symmetric about the origin, but I am not sure what it means.
