Why do the latest Sergio Vieri 384x30b networks scale so badly?

Laskos · Post by **Laskos** » Thu Jun 25, 2020 12:29 am

mbabigian wrote: ↑Wed Jun 24, 2020 10:25 pm
Here is the result for 6s+0.1s
How many time forfeits did you get running that time control? I tried reproducing your results under similar conditions and got a ridiculous number of time forfeits, more than 10%. I ran 5s+0.08s, as I have a faster GPU and wanted to be close to your nodes/move. Perhaps that's just too fast for reliable results.

Did you adjust any of the time settings in LC0?

I was terribly disappointed with cutechess as I see no suspend and resume capability like Arena. Running a test that finishes overnight is doable, but tests that take days are not. I'm not giving up my machine for a week to run silly engine matches. I need a match player that can be killed and restarted where it left off. Is there something I'm missing with not-so-cutechess?

Mike

First, for Lc0 I use the following in the json file:

{
"name" : "MoveOverheadMs",
"value" : 0
},

Second, in Cutechess-Cli I set a ridiculous "timemargin=5000" for every match at every time control. Lc0 might therefore overstep its time limit, but in these fast games it never does so by more than a few dozen milliseconds. All in all, I have no time losses (none at all, unless something else is wrong), and I think the result is reliable: engines usually don't cheat, and between two Lc0's both might cheat just a bit, but equally. I use Windows 10 with its 1ms clock tick.

Laskos · Post by **Laskos** » Thu Jun 25, 2020 12:37 am

Here is the LTC test 60s+1s:

Score of SV_384x30_4206_mlh vs SV_384x30_3010_mlh: 46 - 42 - 112  [0.510] 200
...      SV_384x30_4206_mlh playing White: 46 - 3 - 51  [0.715] 100
...      SV_384x30_4206_mlh playing Black: 0 - 39 - 61  [0.305] 100
...      White vs Black: 85 - 3 - 112  [0.705] 200
Elo difference: 6.9 +/- 32.0, LOS: 66.5 %, DrawRatio: 56.0 %
Finished match

The real pentanomial error margins are 30% smaller than those shown.

This is within the error margins of the non-mlh test, and it shows, outside the error margins, that the scaling is poor.
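For readers who want to check the summary line above, here is a minimal Python sketch that recomputes the Elo difference, its error margin, and the LOS from the 46 - 42 - 112 record. It assumes the usual trinomial (per-game) model with a 95% interval and the standard logistic Elo formula; the function name `match_stats` is made up for this example. It does not compute the pentanomial margins, which treat each opening pair as the unit and need per-pair results that are not posted here.

```python
import math

def match_stats(wins, losses, draws, z=1.96):
    """Return (elo, elo_margin, los) for a win/loss/draw record.

    Assumes independent games (trinomial model) and a z-score of 1.96
    for a roughly 95% interval.
    """
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games

    # Per-game variance of the result around the mean score.
    var = (wins * (1.0 - score) ** 2
           + losses * (0.0 - score) ** 2
           + draws * (0.5 - score) ** 2) / games
    half_width = z * math.sqrt(var / games)

    def to_elo(s):
        # Logistic Elo model: expected score s -> rating difference.
        return -400.0 * math.log10(1.0 / s - 1.0)

    elo = to_elo(score)
    # Half the width of the Elo interval around the measured score.
    margin = (to_elo(score + half_width) - to_elo(score - half_width)) / 2.0
    # Likelihood of superiority: draws carry no information here.
    los = 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * (wins + losses))))
    return elo, margin, los

elo, margin, los = match_stats(46, 42, 112)
print(f"Elo {elo:.1f} +/- {margin:.1f}, LOS {los:.1%}")
```

The printed values should land close to the 6.9 +/- 32.0 and 66.5 % in the match output, which suggests the posted margins are indeed the trinomial ones.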

mbabigian · Post by **mbabigian** » Thu Jun 25, 2020 1:12 am

Ok, I'll adjust the time margins and try that, although I will probably use Arena so I can stop and start the match when I need to. I also notice that the opening pairings with cutechess make no sense to me. If I select Gauntlet and "play each opening twice", it should pair Komodo vs 3010 with the same opening with both colors and then do the same for 4175, but it pairs 4175 twice with the second opening in the pgn book, not the first. So I am running two head-to-head matches and combining the pgn output files. Otherwise the two nets end up playing different openings, which isn't fair.

I am testing against Komodo 14 with 17 cores. At that core count it seems a bit stronger than 3010 and a bit weaker than 4175 on my slightly overclocked 2080TI (at least if the Elo differences were not random noise created by the forfeits). It takes about 6 hours to run 1000 games, which can easily be completed overnight if I don't have the PC doing something else. I'm using Drawkiller_EloZoom_small500.pgn for the openings. The 50s/game tests will take 5 days to complete, hence the need to stop and restart the match when I need to "disturb" the test environment.

I set no adjudication, but all the programs have access to 7 piece syzygy on SSD. As you have pointed out, there is currently no point in watching NN's play endgames, and I never use the nets without tablebases anyway.
Mike

mbabigian · Post by **mbabigian** » Thu Jun 25, 2020 10:39 pm

So the Elos were way off due to the time forfeits. Also, after trying both cutechess 1.0 and 1.1, I dumped it, as it doesn't appear to pair the openings properly. Unfortunately, Arena has no "time margin" setting that I've seen, so I have been playing with the time controls and move overhead settings to find ones that will avoid time defaults. I'm currently testing 6s+315ms for the lower-end time control. Once I find a setting that avoids all defaults, I'll use that as the fastest setting for testing. The current test shows Komodo 14 with 30 cores getting its butt kicked by both 3010 and 4229. If the Elo delta is too big during this forfeit test, I may have to switch to SF-dev.

Hopefully NVIDIA will hurry up and deliver the already late Ampere GPUs so I can get out of this single GPU configuration. Having only 1 GPU is what makes using the machine for this type of testing so annoying. A spare GPU that can run undisturbed by my other uses would make it possible to test more often and I never intended to have a single GPU beyond June.

I'll post results once I have verified the output pgn and have experienced zero other issues.

mbabigian · Post by **mbabigian** » Fri Jun 26, 2020 11:06 pm

Ok, I have some results. First, I increased the increment to 450ms (6s+450ms) and put a move overhead of 300ms into LC0 and Stockfish. Everything looked good for quite a while, but I was shocked to see 4 time forfeits anyway. Going forward I will use a move overhead of 400ms. Equally surprising was seeing Stockfish with the bulk of the forfeits. Since Komodo wasn't going over, I had originally assumed that the delay of loading the big nets was contributing to the LC0 forfeits at the faster times. Stockfish forfeiting 3 times ruined that theory. By the way, I still got time forfeits with the cutechess margin set to 5000ms (I wasn't using a large move overhead at the time), but due to the opening book pairing issues I didn't try 9999ms before switching to Arena.

SF played with 24 threads and contempt 0. LC0 used defaults except minibatchsize 224, and 2,000,000 NN cache. Also all engines had access to 7 man syzygy on SSD.

Due to the long increment, I also decided to only run 500 games rather than 1000. As it got close to reaching 500, I went out to walk the dog and it took longer than I anticipated; hence, the 556 games.

   # PLAYER                           :  RATING  ERROR  POINTS  PLAYED   (%)  CFS(%)    W    D    L  D(%)
   1 Lc0-3010                         :    11.4   27.1   143.5     278  51.6      67   72  143   63  51.4
   2 Lc0-4229                         :     2.5   28.4   140.0     278  50.4      57   62  156   60  56.1
   3 Stockfish_20062422_x64_modern    :     0.0   ----   272.5     556  49.0     ---  123  299  134  53.8

White advantage = 32.26 +/- 9.96
Draw rate (equal opponents) = 54.31 % +/- 2.16

I don't like the noise created by the forfeits, so I removed the 8 opening pairs (16 games) spoiled by the 4 forfeits. With 540 games we get:

   # PLAYER                           :  RATING  ERROR  POINTS  PLAYED   (%)  CFS(%)    W    D    L  D(%)
   1 Lc0-3010                         :    11.8   27.7   139.5     270  51.7      80   71  137   62  50.7
   2 Stockfish_20062422_x64_modern    :     0.0   ----   266.0     541  49.2      50  121  290  130  53.6
   3 Lc0-4229                         :    -0.1   28.6   135.5     271  50.0     ---   59  153   59  56.5

White advantage = 37.09 +/- 10.05
Draw rate (equal opponents) = 54.30 % +/- 2.17
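The filtering step described above (dropping every game from an opening in which any game ended in a forfeit, so that both nets keep an identical opening set) could be sketched in Python as follows. This is only an illustration, not the procedure actually used: the function name, the dict keys `opening` and `forfeit`, and the toy data are all hypothetical, and PGN parsing is left out.

```python
def drop_forfeited_openings(games):
    """Remove every game played from an opening that saw a time forfeit.

    `games` is a list of dicts with hypothetical keys:
      "opening" - identifier of the book line the game started from
      "forfeit" - True if the game was lost on time
    """
    # Collect the openings touched by at least one forfeit.
    spoiled = {g["opening"] for g in games if g["forfeit"]}
    # Keep only games from untouched openings, so pairs stay complete.
    return [g for g in games if g["opening"] not in spoiled]

# Toy data (made up): two openings, each played by two nets with both
# colors, i.e. four games per opening. One game of opening "A" is forfeited.
games = [
    {"opening": op, "net": net, "net_is_white": white, "forfeit": False}
    for op in ("A", "B")
    for net in ("Lc0-3010", "Lc0-4229")
    for white in (True, False)
]
games[0]["forfeit"] = True

clean = drop_forfeited_openings(games)
print(len(games), "->", len(clean))
```

With this scheme a single forfeit removes four games (two opening pairs), which matches the count of 4 forfeits costing 8 pairs, or 16 games.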

The other surprise was that on my hardware (2080TI & 3970X) at this time control, 3010 is already outperforming 4229. MLH was enabled for both nets, with the mlh3 values set.

I will likely run a shorter (200-game) test at 60s+2.050s, but it will be a while before I have those results.

FYI

mbabigian · Post by **mbabigian** » Mon Jun 29, 2020 3:59 am

I have some results for the 60+2 time control and as I may not have free CPU/GPU to continue this match for a while, I'll post the results so far.

   # PLAYER                           :  RATING  ERROR  POINTS  PLAYED   (%)  CFS(%)    W    D    L  D(%)
   1 Lc0-4229                         :    45.8   41.3    56.5     100  56.5      86   24   65   11  65.0
   2 Lc0-3010                         :    14.0   41.0    52.0     100  52.0      75   20   64   16  64.0
   3 Stockfish_20062422_x64_modern    :     0.0   ----    91.5     200  45.8     ---   27  129   44  64.5

White advantage = 5.31 +/- 14.64
Draw rate (equal opponents) = 65.39 % +/- 3.47

One note on what I posted for the 6s+450ms match conditions: when I downloaded SF to update to the latest dev before switching from Komodo, I failed to paste the syzygy path into SF. So although LC0 had 7-piece TBs, SF had none. This isn't a problem, as this is a scaling test and both time controls were played with SF not having TBs. Perhaps over the next week or two I'll have enough time to lower the error bars on the above with another couple hundred games.

FYI
