Effect of adjudication and TC on testing process

xr_a_y · Post by **xr_a_y** » Tue Feb 09, 2021 9:21 am

Last week I tested more than 200 nets from a long training process with the latest Minic.

At 3s+0.3 with adjudication at "-resign 3 700 -draw 8 10", I have many nets above current Niggling Nymph and some at/near +50Elo (https://ibb.co/BCh5D1L)
Book is 8 moves GM, hardware is AVX2.

Code: Select all

   # PLAYER                   :  RATING  ERROR   POINTS  PLAYED   (%)  CFS(%)
   1 logs/nn-epoch225.nnue    :    59.5   10.6   1308.0    2240    58      54
   2 logs/nn-epoch242.nnue    :    58.6   12.0   1212.0    2080    58      51
   3 logs/nn-epoch272.nnue    :    58.4   14.3   1188.0    2040    58      60
   4 logs/nn-epoch258.nnue    :    56.2   12.9   1124.0    1940    58      53
   5 logs/nn-epoch255.nnue    :    55.6   12.2   1087.5    1880    58      51
   6 logs/nn-epoch270.nnue    :    55.3   13.7   1052.0    1820    58      52
   7 logs/nn-epoch273.nnue    :    54.8   12.9    995.5    1724    58      53
   8 logs/nn-epoch267.nnue    :    54.1   13.6   1003.0    1740    58      51
   9 logs/nn-epoch234.nnue    :    53.9   13.1   1037.0    1800    58      53
  10 logs/nn-epoch231.nnue    :    53.1   14.2    943.0    1640    58      54
  11 logs/nn-epoch274.nnue    :    52.1   15.0    883.5    1540    57      51

At 20s+0.1 without any adjudication, I have no nets above Niggling Nymph ! Same book, AVX hardware.

Code: Select all

   # PLAYER                 :  RATING  ERROR   POINTS  PLAYED   (%)  CFS(%)
   1 nn/nn-epoch221.nnue    :     6.5   25.9    275.0     540    51      69
   2 master                 :     0.0   ----  13071.5   23816    55      64
   3 nn/nn-epoch185.nnue    :    -4.8   26.8    217.0     440    49      58
   4 nn/nn-epoch263.nnue    :    -8.2   26.0    146.5     300    49      51
   5 nn/nn-epoch245.nnue    :    -8.8   32.7    156.0     320    49      51
   6 nn/nn-epoch212.nnue    :    -9.3   31.2    165.5     340    49      50
   7 nn/nn-epoch172.nnue    :    -9.5   37.6    126.5     260    49      50
   8 nn/nn-epoch240.nnue    :    -9.5   32.9    126.5     260    49      52
   9 nn/nn-epoch196.nnue    :   -10.5   27.8    194.0     400    48      52
  10 nn/nn-epoch179.nnue    :   -11.4   25.8    193.5     400    48      51
  11 nn/nn-epoch198.nnue    :   -11.7   34.0    116.0     240    48      54
  12 nn/nn-epoch205.nnue    :   -14.3   30.1    153.5     320    48      51

Best net "225" is not even in the top ten. Of course, far fewer games were played at the longer TC, but I don't really know what to deduce from those tests ...

I'm currently running a 10s+0.1 test of the "225" net versus Niggling Nymph, without adjudication, on my AVX hardware (the one where the 3s+0.3 with adjudication test ran). I'll report the results soon.

Any hints or ideas?

xr_a_y · Post by **xr_a_y** » Tue Feb 09, 2021 11:54 am

So the 10s+0.1 no adjudication test, still same book, on the AVX hardware gives

Code: Select all

Score of minic_3.03_niggling_nymph vs minic_dev+net225: 560 - 717 - 1101 [0.467]
...      minic_3.03_niggling_nymph playing White: 317 - 296 - 576  [0.509] 1189
...      minic_3.03_niggling_nymph playing Black: 243 - 421 - 525  [0.425] 1189
...      White vs Black: 738 - 539 - 1101  [0.542] 2378
Elo difference: -23.0 +/- 10.2, LOS: 0.0 %, DrawRatio: 46.3 %

So we are far from the expected +50Elo based on the 3s+0.3 test with adjudication, but better than the 20s+0.1 test.
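For readers who want to check the summary line, here is a minimal sketch that recomputes the Elo difference, error margin, LOS and draw ratio from the win/loss/draw counts reported above. It assumes the usual logistic Elo model and a normal approximation for the 95% interval; the function names are mine, and the match tool may compute its figures slightly differently.

```python
import math

def elo_from_score(score: float) -> float:
    """Logistic Elo difference implied by an expected score in (0, 1)."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def match_stats(wins: int, losses: int, draws: int):
    """Return (score, elo, margin95, los, draw_ratio) for one side of a match."""
    n = wins + losses + draws
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result, where a game is worth 1, 0.5 or 0.
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * score ** 2) / n
    half_width = 1.96 * math.sqrt(var / n)  # 95% interval on the score (assumption)
    elo = elo_from_score(score)
    margin = (elo_from_score(score + half_width)
              - elo_from_score(score - half_width)) / 2.0
    # Likelihood of superiority, normal approximation on decisive games only.
    los = 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * (wins + losses))))
    return score, elo, margin, los, draws / n

# Counts from the 10s+0.1 match above, from Niggling Nymph's point of view.
score, elo, margin, los, draw_ratio = match_stats(560, 717, 1101)
print(f"score={score:.3f} elo={elo:.1f} +/- {margin:.1f} "
      f"LOS={100 * los:.1f}% draws={100 * draw_ratio:.1f}%")
```

With the counts above this lands on the same rounded figures as the reported line, which suggests the tool uses something close to this model.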

So what? Are nets more "expressive", more "different" from each other, at short TC, where search compensates less?

I'll now run a test at 20s+0.1, no adjudication, same book, on the AVX2 hardware including Niggling Nymph, "225" and Minic HCE to see where those nets are versus standard evaluation.

brianr · Post by **brianr** » Tue Feb 09, 2021 1:19 pm

I have observed something similar with Leela nets, which are not the same as NNUE nets, of course.

I got significantly different results with and without draw adjudication (for example: -draw movenumber=50 movecount=6 score=10).

My thinking is that what would have been considered small score differences for traditional HCEs are much more meaningful for the nets. Leela nets work with winning probabilities, and the centipawn values are derived from those by formula. Perhaps there is enough compression to mask small differences that the nets "see" in close positions.

Accordingly, I now run matches with only tablebase adjudication (most common 7 piece syzygy). The vast majority of my matches are 10b nets using fixed nodes per move games (when the net architectures are exactly the same), and there does not seem to be any major slowdown (running 5 or 6 games at a time). Time per move games are vastly slower of course, and I only ever run one at a time as I only have one GPU.

Incidentally, the fastest Leela net time per move games I run are 0:10+1.0. In the past I have also seen inconsistent results with sub-second increments with Leela nets on GPUs. Of course, that would not matter for NNUE nets running on CPUs.

Finally, there is some randomness in Leela net training even with identical input and training parameters. Some nets are just "lucky" and stronger than others. It is often difficult to run enough games to overcome the training noise variance to be able to determine strength differences that actually result from other small training parameter changes. Again, NNUE nets may not be as random.

I try to run matches to at least 95% CFS, and typically wait until 99 or 100%.
Sometimes it is quite difficult to remain patient, but I have seen too many reversals with small sample sizes, FWIW.
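To put rough numbers on the sample-size point, here is a back-of-the-envelope sketch of how the 95% Elo margin shrinks with the number of games. It assumes two engines of nearly equal strength (score close to 50%), a normal approximation, and the 46.3% draw ratio reported earlier in this thread; the function names are illustrative only.

```python
import math

# Slope of the logistic Elo curve at a 50% score, in Elo per unit of score.
ELO_PER_SCORE = 400.0 / (math.log(10) * 0.25)

def elo_margin_95(games: int, draw_ratio: float) -> float:
    """Approximate 95% Elo margin for a near-even match."""
    # With score ~0.5, per-game variance is (1 - draw_ratio) / 4.
    return 1.96 * ELO_PER_SCORE * math.sqrt((1.0 - draw_ratio) / (4.0 * games))

def games_needed(target_margin: float, draw_ratio: float) -> int:
    """Games required to bring the 95% margin down to target_margin Elo."""
    return math.ceil((1.0 - draw_ratio) / 4.0
                     * (1.96 * ELO_PER_SCORE / target_margin) ** 2)

for n in (300, 540, 2378):
    print(n, "games ->", round(elo_margin_95(n, 0.463), 1), "Elo margin")
print("games for +/-5 Elo:", games_needed(5.0, 0.463))
```

Because the margin only falls with the square root of the game count, halving it costs four times as many games, which is why runs of a few hundred games rarely separate nets that are within 10 Elo of each other.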

Ferdy · Post by **Ferdy** » Tue Feb 09, 2021 2:11 pm

xr_a_y wrote: ↑Tue Feb 09, 2021 9:21 am Last week I tested more than 200 nets with last Minic from a long training process.

I think "-draw 8 10" is a bit aggressive. Perhaps request an additional feature that includes the move number, like
-draw movenumber count score
so that the "-draw count score" rule is only activated after move number 40 or so.
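A minimal sketch of the rule proposed above, to make the intended behaviour concrete. The semantics are my assumption, not those of any particular match tool: a game is adjudicated drawn once the move number has reached `min_move_number` and both engines have reported scores within `score_limit` centipawns of zero for `move_count` consecutive moves each. All names and defaults are hypothetical.

```python
from typing import Sequence

def draw_adjudicated(scores_cp: Sequence[int],
                     move_number: int,
                     min_move_number: int = 40,
                     move_count: int = 8,
                     score_limit: int = 10) -> bool:
    """Decide whether to adjudicate a draw.

    scores_cp: engine-reported scores in centipawns, one per ply,
               most recent last (both sides interleaved).
    move_number: current full-move number of the game.
    """
    if move_number < min_move_number:
        return False  # rule not active yet
    plies_needed = 2 * move_count  # move_count moves by each side (assumption)
    if len(scores_cp) < plies_needed:
        return False
    return all(abs(s) <= score_limit for s in scores_cp[-plies_needed:])

# Quiet position at move 45: adjudicated. Same scores at move 20: not yet.
print(draw_adjudicated([3, -2] * 8, move_number=45))
print(draw_adjudicated([3, -2] * 8, move_number=20))
```

Setting `min_move_number=0` reduces this to the plain "count score" rule, so the same function can cover both the current and the proposed behaviour.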

To truly compare the effect of adjudication, you should use the same TC: run TC 3+0.3 both with and without adjudication. Then measure the effect of time scaling with another test at TC 20+0.1, again with and without adjudication. From those four runs we will be able to separate the effect of time scaling from the effect of adjudication. The performance of nets at a given TC may also give a hint on what depth the training data has to be generated at.
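The four-run design described above can be sketched as follows. This only enumerates the runs and shows how the two effects would be read off once the Elo results exist; the dictionary layout and function names are hypothetical, and no results are filled in.

```python
from itertools import product
from typing import Dict, List, Tuple

TIME_CONTROLS = ["3+0.3", "20+0.1"]   # short and long TC from this thread
ADJUDICATION = [True, False]          # with / without resign and draw rules

def test_matrix() -> List[Tuple[str, bool]]:
    """All (time control, adjudication) combinations to run."""
    return list(product(TIME_CONTROLS, ADJUDICATION))

def effects(elo: Dict[Tuple[str, bool], float]) -> Dict[str, float]:
    """Split measured Elo (net vs. reference) into the two effects.

    elo maps (time control, adjudication) to the Elo measured in that run.
    """
    short_tc, long_tc = TIME_CONTROLS
    return {
        # Adjudication effect, holding the TC fixed.
        f"adjudication@{short_tc}": elo[(short_tc, True)] - elo[(short_tc, False)],
        f"adjudication@{long_tc}": elo[(long_tc, True)] - elo[(long_tc, False)],
        # Time-scaling effect, holding adjudication fixed.
        "tc@adjudicated": elo[(short_tc, True)] - elo[(long_tc, True)],
        "tc@unadjudicated": elo[(short_tc, False)] - elo[(long_tc, False)],
    }

for tc, adjudicated in test_matrix():
    print(f"run: tc={tc} adjudication={'on' if adjudicated else 'off'}")
```

Each difference inherits the error margins of both runs it is built from, so every cell of the matrix needs enough games for the comparison to mean anything.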

xr_a_y · Post by **xr_a_y** » Tue Feb 09, 2021 2:18 pm

Looks like I'm getting the 50Elo back when running versus HCE

Code: Select all

Rank Name                          Elo     +/-   Games   Score    Draw 
   1 minic_dev_net225               51      18     780   57.3%   42.8% 
   2 minic_3.03_niggling_nymph       0      18     779   50.1%   44.2% 
   3 minic_3.03_HCE                -52      18     779   42.6%   43.1%

Yes indeed, I'll have to run with and without adjudication to really check the effect, but this is a huge test and requires a lot of time.
