Why do the latest Sergio Vieri 384x30b networks scale so badly?

Laskos · Post by **Laskos** » Fri Jun 19, 2020 7:51 pm

Here is the link to Sergio Vieri's series of very strong large networks (net 3010 won TCEC 17), the best Lc0 nets on RTX GPUs up to LTC (maybe the latest LS nets can match them).
https://www.comp.nus.edu.sg/~sergio-v/t60/384x30/

His latest releases, such as
384x30-t60-4155.pb.gz 2020-06-18 09:22 131M
seem very strong in fast testing, significantly stronger than net 3010.
Hardware: RTX 2070 GPU

TC: 6s + 0.1s

Score of SV_384x30_4155 vs SV_384x30_3010: 259 - 178 - 363  [0.551] 800
...      SV_384x30_4155 playing White: 217 - 19 - 164  [0.748] 400
...      SV_384x30_4155 playing Black: 42 - 159 - 199  [0.354] 400
...      White vs Black: 376 - 61 - 363  [0.697] 800
Elo difference: 35.3 +/- 17.8, LOS: 100.0 %, DrawRatio: 45.4 %
Finished match

The pentanomial error margins are 30% smaller than shown above.
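For readers who want to check the numbers, here is a minimal sketch of how the Elo difference and the trinomial error margin in the output above can be reproduced from the win-loss-draw counts. It assumes the usual logistic Elo model and a 95% normal-approximation interval (z = 1.96). The function names are illustrative, not taken from the testing tool, and the smaller pentanomial margins are not computed here.

```python
import math

def score_to_elo(score):
    """Logistic Elo difference implied by an expected score in (0, 1)."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def trinomial_elo(wins, losses, draws, z=1.96):
    """Return (elo, margin) from W/L/D counts, treating games as independent."""
    n = wins + losses + draws
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the outcome (1, 0.5 or 0) around the mean score.
    var = (wins * (1.0 - score) ** 2
           + losses * score ** 2
           + draws * (0.5 - score) ** 2) / n
    se = math.sqrt(var / n)  # standard error of the mean score
    lo = score_to_elo(score - z * se)
    hi = score_to_elo(score + z * se)
    return score_to_elo(score), (hi - lo) / 2.0

# W-L-D taken from the 6s + 0.1s match above.
elo, margin = trinomial_elo(259, 178, 363)
print(f"Elo difference: {elo:+.1f} +/- {margin:.1f}")
```

The same function can be applied to any W-L-D triple. It says nothing about game pairs, which is exactly what the pentanomial treatment adds.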

But at longer time control, they are invariably weaker:

TC: 60s + 1s

Score of SV_384x30_4155 vs SV_384x30_3010: 30 - 39 - 81  [0.470] 150
...      SV_384x30_4155 playing White: 30 - 1 - 44  [0.693] 75
...      SV_384x30_4155 playing Black: 0 - 38 - 37  [0.247] 75
...      White vs Black: 68 - 1 - 81  [0.723] 150
Elo difference: -20.9 +/- 37.8, LOS: 13.9 %, DrawRatio: 54.0 %
Finished match

The pentanomial error margins are 35% smaller than shown above.

The result is outside error margins. I am not sure what he has been doing since net 3290, when he reset the LR, but the nets don't seem to have recovered from the bad scaling that started then.
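As a rough check of the "outside error margins" statement, the sketch below compares the shift between the two time controls with the combined margin of the two matches. It assumes the two matches are independent, so their margins add in quadrature, and it applies the 30% and 35% pentanomial reductions quoted above as simple scale factors. Both are simplifying assumptions, not a full pentanomial computation.

```python
import math

# Figures copied from the two match outputs quoted above.
stc_elo, stc_margin = 35.3, 17.8    # TC: 6s + 0.1s
ltc_elo, ltc_margin = -20.9, 37.8   # TC: 60s + 1s

def combined_margin(m1, m2):
    """Margin of a difference of two independent estimates (assumption)."""
    return math.hypot(m1, m2)

shift = stc_elo - ltc_elo
trinomial = combined_margin(stc_margin, ltc_margin)
# Pentanomial margins, stated as 30% and 35% smaller than the printed ones.
pentanomial = combined_margin(0.70 * stc_margin, 0.65 * ltc_margin)

print(f"STC-to-LTC shift:   {shift:.1f} Elo")
print(f"trinomial margin:   {trinomial:.1f} Elo")
print(f"pentanomial margin: {pentanomial:.1f} Elo")
```

Under these assumptions the shift exceeds the combined margin with either set of error bars.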

dkappe · Post by **dkappe** » Sat Jun 20, 2020 7:54 pm

It seems the MLH is still an experiment. Most testing in the leela discord is short tc, so unlikely to uncover this.

We’ll see if the bad start in tcec SuFi is just a statistical wobble or the start of an epic collapse.

jjoshua2 · Post by **jjoshua2** » Sat Jun 20, 2020 9:09 pm

I will still be surprised if lc0 does not win a couple of games in a row sometime in the remainder. Some engine has to win first, and there is close to a 50% chance it will win the second one too, so getting 2-0 means nothing really. EDIT: well, 3 losses in a row now.

dkappe · Post by **dkappe** » Sat Jun 20, 2020 11:13 pm

Something looks suspicious in the tcec gpu utilization. Maybe.

https://tcec-chess.com/gpu_temperature.txt

Milos · Post by **Milos** » Sun Jun 21, 2020 12:30 am

dkappe wrote: ↑Sat Jun 20, 2020 11:13 pm Something looks suspicious in the tcec gpu utilization. Maybe.

https://tcec-chess.com/gpu_temperature.txt

It's sampled every 5 minutes, so when a reading is low, you are looking at one taken during an SF move, when the GPU is idle. Lc0 fanboys are really starting to get paranoid.
So hard to accept the fact that SF might have actually improved significantly since S17 and Lc0 didn't improve much if at all.
The current 3972 net is very, very close to 3010. And while MLH might produce somewhat faster wins for Lc0, there is no clear indication that it gains any Elo at all (or that it loses Elo).

Ovyron · Post by **Ovyron** » Sun Jun 21, 2020 9:31 am

Milos wrote: ↑Sun Jun 21, 2020 12:30 am So hard to accept the fact that SF might have actually improved significantly since S17

Did you mean to write "S1.7"? I don't think comparing Stockfish's development since a version as early as 1.7 (1.7.1?) with that of Leela is relevant.

Alayan · Post by **Alayan** » Sun Jun 21, 2020 9:35 am

S17 obviously means TCEC season 17.

Ovyron · Post by **Ovyron** » Sun Jun 21, 2020 10:00 am

Oh, that makes sense. I'd never gotten why SF is shortened like that when its name is not Stock Fish, but it's probably so that when people write "S17" for "Season 17", readers don't get confused.

Laskos · Post by **Laskos** » Sun Jun 21, 2020 1:29 pm

dkappe wrote: ↑Sat Jun 20, 2020 7:54 pm It seems the MLH is still an experiment. Most testing in the leela discord is short tc, so unlikely to uncover this.

We’ll see if the bad start in tcec SuFi is just a statistical wobble or the start of an epic collapse.

As you seem to be in contact with the Lc0 community, advise them on testing methodology. My 150-game 60s+1s match is probably equivalent to some 300 games of "others'" tests. People for some reason choose balanced, very regular openings for their tests, and often not particularly short ones. With such a "regular" opening suite, net 3972 versus net 3010 at 60s+1s on an RTX GPU will give an 80-90% draw rate. It has been shown that this is a very inefficient way of testing, often needing 2-6 times more games for the same statistical significance than testing from an unbalanced opening suite. The right amount of imbalance was shown to be at the border between a White win and a draw (50% White wins, 50% draws), or an SF eval of 0.8-1.0. I take my suites from human games: short 3-movers, but in that 0.8-1.0 range of SF evals. One can build hundreds of such openings from the not very balanced openings of human players rated above FIDE Elo 2200, and play thousands of side-reversed games in a match.

Read "Match Statistics" in the Chess Wiki on how to use pentanomial variance with unbalanced openings, which often is almost 2 times smaller than the usual trinomial variance with ultra-balanced openings, thus needing 4 times less games for the same statistical significance. Don't people read the Chess Wiki and its links? Michel posted very useful material and links there. One cannot test two similar Lc0 nets at LTC and expect high statistical significance without reading that page. A 90% draw rate is no joke when trying to separate engines by strength, while trinomial errors are no smaller than the pentanomial ones using unbalanced openings. Lc0 nets are so similar style-wise that high draw rates between them from regular balanced openings are unavoidable.
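To make the trinomial versus pentanomial distinction concrete, here is a minimal sketch of the two standard-error formulas. The trinomial one treats every game as independent. The pentanomial one treats each side-reversed game pair as a single observation with five possible outcomes. The function names and the example counts are illustrative assumptions (limiting cases, not measured results).

```python
import math

def trinomial_se(wins, losses, draws):
    """Standard error of the mean score, treating games as independent."""
    n = wins + losses + draws
    score = (wins + 0.5 * draws) / n
    var = (wins * (1.0 - score) ** 2
           + losses * score ** 2
           + draws * (0.5 - score) ** 2) / n
    return math.sqrt(var / n)

def pentanomial_se(pair_counts):
    """Standard error of the mean score from side-reversed game pairs.

    pair_counts[k] is the number of pairs in which the engine scored
    k/2 points out of 2 (k = 0..4), i.e. a per-game score of k/4.
    """
    n_pairs = sum(pair_counts)
    mean = sum(c * k / 4 for k, c in enumerate(pair_counts)) / n_pairs
    var = sum(c * (k / 4 - mean) ** 2
              for k, c in enumerate(pair_counts)) / n_pairs
    return math.sqrt(var / n_pairs)

# Limiting case (not measured data): 100 pairs from openings so unbalanced
# that White wins every game, so each engine scores exactly 1/2 in each pair.
print(trinomial_se(100, 100, 0))          # sees 200 decisive games
print(pentanomial_se([0, 0, 100, 0, 0]))  # sees 100 identical pairs

# If the two games of a pair were truly independent, both agree.
print(pentanomial_se([25, 0, 50, 0, 25]))
```

The gap between the two only appears when the results within a pair are correlated, which is what an unbalanced opening played with sides reversed tends to produce.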

mbabigian · Post by **mbabigian** » Sun Jun 21, 2020 6:12 pm

Laskos wrote: ↑Sun Jun 21, 2020 1:29 pm Read "Match Statistics" in the Chess Wiki on how to use pentanomial variance with unbalanced openings, which often is almost 2 times smaller than the usual trinomial variance with ultra-balanced openings, thus needing 4 times less games for the same statistical significance.

The direct link: https://www.chessprogramming.org/Match_Statistics

Unfortunately the Leela project is still in its infancy, and its maturation is hamstrung by the normal social nonsense that plagues all open source efforts. If the project remains active long enough, "to be or not to be" will eventually be typed. It is just a matter of patience, unfortunately. It drives me nuts every time they rely on voting for decisions that can be empirically derived, but social norms and an obsession with trying to "appear" democratic keep progress at a snail's pace.

No worries, however: they are slowly rediscovering knowledge documented decades ago. Someone looking to earn a PhD in the social sciences could find a gaggle of thesis ideas by studying human behavior in open source projects. The results could even suggest ways to speed up the maturing process!
