Smallnet (128x10) run1 progresses remarkably well

yanquis1972 · Post by **yanquis1972** » Thu Dec 20, 2018 2:15 am

But Kai's main point, I think, was that experimental runs should be as efficient as possible. Could e.g. test20 and test30, and again test40 (if the aim is truly experimental, as the name implies), be performed entirely on 128x10 (or smaller) nets, or do the results not translate?

jp · Post by **jp** » Thu Dec 20, 2018 2:27 am

jkiliani wrote: ↑Thu Dec 20, 2018 2:15 am
jp wrote: ↑Thu Dec 20, 2018 2:02 am Again I hope TB rescoring will not be part of test40, now we know what A0's C(s) was.
As far as I know it will be. Why shouldn't it? TB rescoring has been shown to actually help, unlike a lot of proposed changes to Lc0...

How has it been shown to actually help? Many things were changed in test30 compared with test10.

It shouldn't be used now, whether it helps or not, because it's clearly non-zero.

jkiliani · Post by **jkiliani** » Thu Dec 20, 2018 10:36 am

yanquis1972 wrote: ↑Thu Dec 20, 2018 2:15 am But Kai's main point, I think, was that experimental runs should be as efficient as possible. Could e.g. test20 and test30, and again test40 (if the aim is truly experimental, as the name implies), be performed entirely on 128x10 (or smaller) nets, or do the results not translate?

test30 is an experimental run, and so is test35. test40 will not be an experimental run, the name "test" is something of a misnomer. Instead, it will be a main run with hopefully no parameter changes during the run, like the (failed) test20, just with a better architecture and much more confidence in parameters this time.

Arguably, many of the experiments of test30 could have been done with a 128x10 net, but it is too late for that now, and as I mentioned earlier, test20 was a "main run" with some bad parameters derived from an erroneous deduction out of a publication. I agree with Kai's point that experimental runs should ideally not be "full size", but this whole project has been a learning experience for the developers as well.

chrisw · Post by **chrisw** » Thu Dec 20, 2018 11:28 am

jkiliani wrote: ↑Thu Dec 20, 2018 1:45 am What is called smallnet in this thread is called test35, and its only real purpose is to test whether the SE (Squeeze-Excitation) Neural network architecture is implemented correctly in Lc0 0.20 Dev. So far it looks like that is the case, and SE nets are significantly stronger at the same network size compared to Residual Neural nets (used in test10, test30 and all other previous runs).

After Lc0 0.20 is released, test35 will be retired in favour of test40, a SE net of the same dimensions as test10 and test30, i.e. 256x20. That run is widely expected to significantly improve on test10.

SE should theoretically give more bang for the same net size, or conversely allow a smaller net.
AZ showed me that chess is an image processing problem. We see patterns, literally, visually, and we use our visual system to process and solve it. When I am thinking about chess, I do things like scan the diagonals, or the dark squares, and while a 3x3 conv can do that by linking together different board parts, it doesn't come naturally. Sometime, I guess, someone will propose presenting a different spatial mapping to the network. SE goes part of the way, if my reading is right, as it naturally makes links between different board regions. Smart spatial mapping would be better, but it becomes non-zero, I think.
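To illustrate the "links between different board regions" point, here is a minimal pure-Python sketch of a generic Squeeze-Excitation block as described in the SE literature. It is not Lc0's implementation, which may differ in detail; the function name `se_block`, the tiny channel counts, and the random weights are assumptions made purely for illustration. The gate for each channel is computed from the average over all squares, so every square influences the scaling applied to every other square in a single step.

```python
import math
import random

def se_block(x, w1, b1, w2, b2):
    """Generic SE block. x is a list of channels, each a list of rows of floats.
    Returns (rescaled channels, per-channel gates)."""
    # Squeeze: one number per channel, the mean over every square of the board.
    squeezed = [sum(sum(row) for row in ch) / sum(len(row) for row in ch) for ch in x]
    # Excitation, step 1: bottleneck layer with ReLU.
    hidden = [max(0.0, sum(w * s for w, s in zip(w_row, squeezed)) + b)
              for w_row, b in zip(w1, b1)]
    # Excitation, step 2: expand back to one sigmoid gate per channel.
    gates = [1.0 / (1.0 + math.exp(-(sum(w * h for w, h in zip(w_row, hidden)) + b)))
             for w_row, b in zip(w2, b2)]
    # Scale: multiply every square of a channel by that channel's gate.
    out = [[[v * g for v in row] for row in ch] for ch, g in zip(x, gates)]
    return out, gates

# Hypothetical toy sizes: 4 channels on an 8x8 board, bottleneck of 2.
random.seed(0)
C, R = 4, 2
x = [[[random.uniform(-1, 1) for _ in range(8)] for _ in range(8)] for _ in range(C)]
w1 = [[random.uniform(-1, 1) for _ in range(C)] for _ in range(R)]
b1 = [0.0] * R
w2 = [[random.uniform(-1, 1) for _ in range(R)] for _ in range(C)]
b2 = [0.0] * C

out, gates = se_block(x, w1, b1, w2, b2)
print("per-channel gates:", [round(g, 3) for g in gates])
```

A 3x3 convolution needs several stacked layers before a1 can affect h8; the squeeze step above mixes all squares at once, at the cost of discarding where on the board the signal came from.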

yanquis1972 · Post by **yanquis1972** » Thu Dec 20, 2018 9:36 pm

jkiliani wrote: ↑Thu Dec 20, 2018 10:36 am
yanquis1972 wrote: ↑Thu Dec 20, 2018 2:15 am But Kai's main point, I think, was that experimental runs should be as efficient as possible. Could e.g. test20 and test30, and again test40 (if the aim is truly experimental, as the name implies), be performed entirely on 128x10 (or smaller) nets, or do the results not translate?
test30 is an experimental run, and so is test35. test40 will not be an experimental run, the name "test" is something of a misnomer. Instead, it will be a main run with hopefully no parameter changes during the run, like the (failed) test20, just with a better architecture and much more confidence in parameters this time.

Arguably, many of the experiments of test30 could have been done with a 128x10 net, but it is too late for that now, and as I mentioned earlier, test20 was a "main run" with some bad parameters derived from an erroneous deduction out of a publication. I agree with Kai's point that experimental runs should ideally not be "full size", but this whole project has been a learning experience for the developers as well.

Thanks; perhaps test30 is a blessing in disguise, since the A0 publication might've occurred somewhere in test40's run (and it should, I think, surpass test10 in the meantime).

Laskos · Post by **Laskos** » Fri Dec 21, 2018 1:14 am

jkiliani wrote: ↑Thu Dec 20, 2018 10:36 am

I have a question about the engine I have: lc0 v0.20-dev 12 Dec

With older test30 bignets it achieves high NPS from the initial board position only after filling about half of its large cache, which takes a long time:
ID 32112
info depth 17 seldepth 46 time 166191 nodes 7327539 score cp 48 hashfull 426 nps 44091

With 4x smaller test35 nets, I would expect 3.5-4x speed-up, but at similarly large cache with similar settings I get a maximum after just 3-4 seconds, then a stall and a decrease in NPS:
ID 35245
info depth 10 seldepth 31 time 3722 nodes 340617 score cp 40 hashfull 31 nps 91514

Only about a 2.1x speed-up at a cache fill of only 3%, and then even a decrease in NPS. Is it a bug or a feature?
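For anyone who wants to reproduce the comparison, the arithmetic can be sketched as below. The helper name `parse_info` is hypothetical, and the parser only handles the simple key/value layout of the two lines quoted above, not every field the UCI protocol allows. It relies on the UCI convention that `hashfull` is reported in permille.

```python
def parse_info(line):
    """Turn one of the UCI 'info ...' lines quoted above into a dict of integers."""
    tokens = line.split()
    fields = {}
    i = 1  # skip the leading 'info'
    while i < len(tokens):
        key = tokens[i]
        if key == "score":
            # 'score cp N' carries a unit token before the number.
            fields["score_" + tokens[i + 1]] = int(tokens[i + 2])
            i += 3
        else:
            fields[key] = int(tokens[i + 1])
            i += 2
    return fields

big = parse_info("info depth 17 seldepth 46 time 166191 nodes 7327539 "
                 "score cp 48 hashfull 426 nps 44091")    # ID 32112
small = parse_info("info depth 10 seldepth 31 time 3722 nodes 340617 "
                   "score cp 40 hashfull 31 nps 91514")   # ID 35245

speedup = small["nps"] / big["nps"]
fill_big = big["hashfull"] / 10.0      # permille -> percent
fill_small = small["hashfull"] / 10.0  # permille -> percent

print(f"speed-up: {speedup:.2f}x")
print(f"cache fill: big net {fill_big:.1f}%, small net {fill_small:.1f}%")
```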

On a separate note, test35 really advances fast, almost beyond error margins day by day:

Games Completed = 100 of 100 (Avg game length = 61.417 sec)
Settings = Gauntlet/512MB/15000ms+250ms/M 2000cp for 5 moves, D 200 moves/EPD:F:\LittleBlitzer\gm2600_12plies.epd(27202)
Time = 6417 sec elapsed, 0 sec remaining
 1.  lc0_v20-dev 35229          	37.0/100	14-40-46  	(L: m=19 t=0 i=0 a=21)	(D: r=28 i=11 f=7 s=0 a=0)	(tpm=439.2 d=9.10 nps=1061510)
 2.  Stockfish 8 64 BMI2         	63.0/100	40-14-46  	(L: m=0 t=0 i=0 a=14)	(D: r=28 i=11 f=7 s=0 a=0)	(tpm=429.3 d=26.26 nps=9257779)

Games Completed = 100 of 100 (Avg game length = 63.061 sec)
Settings = Gauntlet/512MB/15000ms+250ms/M 2000cp for 5 moves, D 200 moves/EPD:F:\LittleBlitzer\gm2600_12plies.epd(27202)
Time = 6582 sec elapsed, 0 sec remaining
 1.  lc0_v20-dev 35255        	45.0/100	19-29-52  	(L: m=12 t=0 i=0 a=17)	(D: r=39 i=6 f=3 s=2 a=2)	(tpm=425.5 d=9.35 nps=281308)
 2.  Stockfish 8 64 BMI2      	55.0/100	29-19-52  	(L: m=0 t=0 i=0 a=19)	(D: r=39 i=6 f=3 s=2 a=2)	(tpm=418.4 d=27.19 nps=9319530)

Already close to SF8 in my conditions.
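To put the two gauntlets in Elo terms and see how wide the error margins are at 100 games, one can use the standard logistic Elo formula with a normal approximation for the score. This is a rough sketch, not the method LittleBlitzer itself uses; the function name `elo_from_wld` and the choice of a 95% interval are assumptions.

```python
import math

def elo_from_wld(wins, losses, draws, z=1.96):
    """Return (elo, elo_low, elo_high) from a W-L-D record.
    Uses the logistic Elo model and a normal approximation for the score."""
    n = wins + losses + draws
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result (win = 1, draw = 0.5, loss = 0).
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0.0 - score) ** 2) / n
    margin = z * math.sqrt(var / n)

    def to_elo(p):
        return -400.0 * math.log10(1.0 / p - 1.0)

    return to_elo(score), to_elo(score - margin), to_elo(score + margin)

# Records of lc0 against Stockfish 8, taken from the two tables above.
for net_id, record in (("35229", (14, 40, 46)), ("35255", (19, 29, 52))):
    elo, low, high = elo_from_wld(*record)
    print(f"ID {net_id}: {elo:+.0f} Elo (95% interval {low:+.0f} to {high:+.0f})")
```

With only 100 games per match the intervals are wide, so the day-to-day gain is best read as a trend rather than a precise figure.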

ankan · Post by **ankan** » Fri Dec 21, 2018 8:18 pm

Laskos wrote: ↑Fri Dec 21, 2018 1:14 am I have a question about the engine I have: lc0 v0.20-dev 12 Dec

With older test30 bignets it achieves high NPS from the initial board position only after filling about half of its large cache, which takes a long time:
ID 32112
info depth 17 seldepth 46 time 166191 nodes 7327539 score cp 48 hashfull 426 nps 44091

With 4x smaller test35 nets, I would expect 3.5-4x speed-up, but at similarly large cache with similar settings I get a maximum after just 3-4 seconds, then a stall and a decrease in NPS:
ID 35245
info depth 10 seldepth 31 time 3722 nodes 340617 score cp 40 hashfull 31 nps 91514

Only about a 2.1x speed-up at a cache fill of only 3%, and then even a decrease in NPS. Is it a bug or a feature?

This is not a bug but rather a limitation of the current lc0 engine. We run into severe CPU-side bottlenecks at around 70-100 knps (depending on the CPU). This was also reported with the 256x20 net, but only on very powerful hardware (like the 4x 2080 Ti setup used by CCC). With the smaller 128x10 network it happens much sooner (like on your single 2070). The NPS decreases with time because the tree grows and the CPU-side search code has to hold locks for longer periods. We are trying to understand and reduce these bottlenecks.

Laskos · Post by **Laskos** » Fri Dec 21, 2018 8:30 pm

ankan wrote: ↑Fri Dec 21, 2018 8:18 pm
Laskos wrote: ↑Fri Dec 21, 2018 1:14 am I have a question about the engine I have: lc0 v0.20-dev 12 Dec

With older test30 bignets it achieves high NPS from the initial board position only after filling about half of its large cache, which takes a long time:
ID 32112
info depth 17 seldepth 46 time 166191 nodes 7327539 score cp 48 hashfull 426 nps 44091

With 4x smaller test35 nets, I would expect 3.5-4x speed-up, but at similarly large cache with similar settings I get a maximum after just 3-4 seconds, then a stall and a decrease in NPS:
ID 35245
info depth 10 seldepth 31 time 3722 nodes 340617 score cp 40 hashfull 31 nps 91514

Only about a 2.1x speed-up at a cache fill of only 3%, and then even a decrease in NPS. Is it a bug or a feature?
This is not a bug but rather a limitation of the current lc0 engine. We run into severe CPU-side bottlenecks at around 70-100 knps (depending on the CPU). This was also reported with the 256x20 net, but only on very powerful hardware (like the 4x 2080 Ti setup used by CCC). With the smaller 128x10 network it happens much sooner (like on your single 2070). The NPS decreases with time because the tree grows and the CPU-side search code has to hold locks for longer periods. We are trying to understand and reduce these bottlenecks.

Thanks, so it is a CPU bottleneck. I was not sure; with a different architecture and engine, I thought it might be a feature.

ankan · Post by **ankan** » Sun Dec 23, 2018 2:04 pm

The latest v0.20.0 rc1 release has some improvements for CPU bottlenecked scenarios in case you want to try (but don't expect more than 5-10% speedup).
https://github.com/LeelaChessZero/lc0/releases

Laskos · Post by **Laskos** » Sun Dec 23, 2018 2:36 pm

ankan wrote: ↑Sun Dec 23, 2018 2:04 pm The latest v0.20.0 rc1 release has some improvements for CPU bottlenecked scenarios in case you want to try (but don't expect more than 5-10% speedup).
https://github.com/LeelaChessZero/lc0/releases

Yes, I saw it this morning. It has higher NPS (at peak, by some 15-20%) and stalls later than before, but then NPS still decreases.

The peaks for the same net:

ID 35245:

v0.20-dev 12 Dec
info depth 10 seldepth 31 time 3722 nodes 340617 score cp 40 hashfull 31 nps 91514

v0.20.0 rc1
info depth 12 seldepth 34 time 12521 nodes 1356132 score cp 38 hashfull 99 nps 108308

Still, there seems to be headroom on the GPU: it's not at full load because of the CPU bottleneck. In the ideal case I would expect some 130,000-140,000 peak NPS on my card.
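The numbers in this post can be checked with a few lines of arithmetic. The sketch below takes the two peak NPS values from the info lines above and the 130,000-140,000 figure, which is only an expectation for this particular card and not a measurement; the variable names are made up for illustration.

```python
peak_dev = 91514     # v0.20-dev 12 Dec, net ID 35245
peak_rc1 = 108308    # v0.20.0 rc1, same net
ideal_low, ideal_high = 130_000, 140_000  # expected ideal peak, an estimate only

# Relative gain of rc1 over the 12 Dec dev build at peak.
gain = peak_rc1 / peak_dev - 1.0

# Fraction of the estimated ideal throughput that rc1 reaches.
frac_of_high = peak_rc1 / ideal_high
frac_of_low = peak_rc1 / ideal_low

print(f"rc1 peak gain: {gain:.1%}")
print(f"rc1 reaches {frac_of_high:.0%} to {frac_of_low:.0%} of the estimated ideal peak")
```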
