Yeah, I forgot about the policy part when I wrote that post earlier.

Laskos wrote: ↑Tue Oct 29, 2019 9:55 pm
    Not necessarily, there are policy and value parts affecting the tactics, not just NPS. It also depends on, say, the time control. I have mixed results on tactical suites like Arasan and trimmed WAC. Bigger nets seem to improve on tactics faster at longer TC than smaller nets, but many smaller nets start from better tactics at very short TC.

dragontamer5788 wrote: ↑Tue Oct 29, 2019 8:51 pm
    Smaller nets will probably be better at tactics, while bigger nets will probably be stronger at positional play. A bigger net means spending more time evaluating the neural net... which necessarily means fewer and fewer nodes per second.

Dann Corbit wrote: ↑Tue Oct 29, 2019 8:47 pm
    I suspect that I made a bad net choice, so I am replaying.

dragontamer5788 wrote: ↑Tue Oct 29, 2019 8:41 pm
    LC0 searches only a few hundred thousand nodes per second. That still makes it significantly stronger at tactics than any human player, but far slower than the millions or even hundreds of millions of nodes per second of classic A/B engines.

Dann Corbit wrote: ↑Tue Oct 29, 2019 5:51 pm
    Since LC0 performed so terribly (I gave 12 minutes per position on a test set meant to run for 15 seconds per position) and is beaten by all the strong tactical engines by a landslide, I can only conclude that tactics are not very important for the game of chess.
Tactics remain important, but LC0 has "enough" tactics to play decently. Where it wins is in its very strong positional play and long-term planning.
But I gave LC0 48 times more time per position, and the results were still terrible.
So I suspect your diagnosis is correct.
I also expect the bigger net to do better.
Case in point: a small net may take 10 microseconds per node (100,000 nodes per second), while a big net may take 50 microseconds per node (20,000 nodes per second).
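The arithmetic above is just the reciprocal of the per-node latency; a quick back-of-the-envelope in Python (the 10 µs / 50 µs figures are the illustrative numbers from the sentence above, not measured values):

```python
def nps(per_node_us):
    """Nodes per second, given microseconds spent evaluating each node."""
    return 1_000_000 / per_node_us

print(nps(10))  # small net: 100000.0 nodes/second
print(nps(50))  # big net:    20000.0 nodes/second
```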
One more potential issue: GPUs need a very large number of threads before they're practical. The NVidia 1080 Ti has 3584 CUDA cores, so you need at least 3584 SIMD threads before you really utilize the system (1 thread will run just as quickly as 3584 threads). Furthermore, GPUs don't have branch prediction, memory prefetching, or the other high-performance features that CPU programmers rely on.
Instead, GPUs use a hyperthreading-like feature where they quickly switch between many threads while waiting on memory requests. For NVidia, IIRC there can be up to 32 warps per SM, or roughly 8 logical threads per physical CUDA core. Max occupancy would therefore be 28672 SIMD threads on an NVidia 1080 Ti.
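Putting those occupancy numbers together (using the IIRC figures above, so treat them as approximate rather than spec-sheet values):

```python
cuda_cores = 3584        # physical CUDA cores on a 1080 Ti
threads_per_core = 8     # ~8 logical threads per physical core (from the 32-warps-per-SM figure)

max_occupancy = cuda_cores * threads_per_core
print(max_occupancy)     # 28672 SIMD threads in flight at max occupancy
```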
I figure the neural nets are evaluated in parallel to some degree via MCTS virtual loss plus asynchronous compute. But a larger neural net forces a larger amount of work per batch, which makes the GPU more efficient: having lots and lots of work helps a lot. It's not that the bigger net will run faster; it's that you have so many CUDA cores that the bigger net might not really lose much in nodes per second compared to a smaller net.
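A toy model of that last point (my own simplification, not Lc0's actual scheduler; all numbers are made up for illustration): if a batch is too small to fill the GPU, the launch takes the same wall time regardless, so per-eval cost only starts rising once the net actually saturates the hardware.

```python
def batch_time_us(flops_per_eval, batch_size, gpu_flops_per_us, min_launch_us=50):
    """Wall time for one batched net evaluation: compute time, floored at a
    fixed launch/underutilization cost when the GPU isn't saturated."""
    compute_us = flops_per_eval * batch_size / gpu_flops_per_us
    return max(min_launch_us, compute_us)

small_net, big_net = 1e6, 5e6  # hypothetical FLOPs per position evaluation
for net in (small_net, big_net):
    t = batch_time_us(net, batch_size=256, gpu_flops_per_us=1e7)
    print(f"{net:.0e} FLOPs/eval -> {256 / t:.2f} evals per microsecond")
```

With these made-up numbers the big net does 5x the work but only loses ~2.5x in throughput, because the small net was leaving the GPU idle anyway.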