Are you doing rollouts without an NN? I think even in the original paper they trained a smaller policy network for that purpose. In my case, the search is always guided by the NN policy. I stopped guiding the search with a hand-crafted eval or other heuristic search (qsearch captures, etc.) once I added the policy head.

chrisw wrote: ↑Fri Apr 26, 2019 8:03 pm
Well, if batch=1, you get full NN guidance whenever you want it. If batch=infinity, you get no NN guidance at all. As batch size increases from 1, you get increasingly less NN guidance at and near leaf nodes. On the other side of the balance you get faster NN lookups.

Daniel Shawul wrote: ↑Fri Apr 26, 2019 6:16 pm
Which begs the question: why use small batch sizes at all? I don't use a batch size of less than 128.
Even launching 128 to 256 threads for multi-threaded batching on a 4-core CPU, I see no problems...
Lc0 uses single-threaded batching and defaults to a batch size of 256 -- though a smaller batch size of 32 is used for training ...
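The guidance-versus-batch tradeoff chrisw describes comes from virtual loss: each leaf selected into a batch is temporarily penalized so the next selection in the same batch explores a different path, which is exactly why larger batches dilute NN guidance near the leaves. A minimal sketch of that mechanism (all names and the cpuct value here are illustrative assumptions, not any engine's actual code):

```python
import math

class Node:
    def __init__(self, prior=1.0):
        self.prior = prior
        self.visits = 0
        self.value_sum = 0.0
        self.virtual_loss = 0
        self.children = {}  # move -> Node

    def q(self):
        n = self.visits + self.virtual_loss
        # each virtual loss counts as one extra, lost playout
        return (self.value_sum - self.virtual_loss) / n if n else 0.0

def select_leaf(root, cpuct=1.5):
    """Walk down with PUCT, applying virtual loss along the path."""
    node, path = root, [root]
    while node.children:
        total = sum(c.visits + c.virtual_loss for c in node.children.values()) + 1
        node = max(node.children.values(),
                   key=lambda c: c.q() + cpuct * c.prior
                   * math.sqrt(total) / (1 + c.visits + c.virtual_loss))
        path.append(node)
    for n in path:
        n.virtual_loss += 1  # discourage the next in-batch selection from repeating this path
    return node, path

def backup(path, value):
    """Replace the virtual loss with the real NN evaluation."""
    for n in path:
        n.virtual_loss -= 1
        n.visits += 1
        n.value_sum += value
```

Once the NN results for the batch come back, `backup` swaps each virtual loss for the real evaluation, so after the batch completes the tree looks as if the playouts had been done sequentially.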
The best N for batch size is easily established by results. So far, so obvious; I guess the interesting bit is how (or whether) one provides useful search guidance in the absence of the NN. The original paper, I think, just did a random selection. Does it make sense to use some kind of hand-crafted selector? I've been trying selecting obvious captures over the last few days, but only succeeded in breaking everything, so nothing useful to report right now. Which is basically why I want C++ GPU; fiddling around at the chess level in Python is not good. Need C++.
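For what it's worth, one hypothetical shape for such a hand-crafted selector (not chrisw's code; the piece values and uniform base are my assumptions) is a capture-biased prior that plugs into the same slot an NN policy head would fill:

```python
# Hypothetical fallback prior when no NN policy is available: a uniform base
# probability plus an MVV-style bonus for captures, renormalized so it can be
# used in the same PUCT formula an NN prior would feed.

PIECE_VALUE = {'P': 1, 'N': 3, 'B': 3, 'R': 5, 'Q': 9}

def fallback_priors(moves):
    """moves: list of (move, captured_piece_or_None). Returns move -> prior."""
    scores = {}
    for move, captured in moves:
        score = 1.0  # uniform base keeps quiet moves in play
        if captured:
            score += PIECE_VALUE[captured]  # bias toward obvious captures
        scores[move] = score
    total = sum(scores.values())
    return {m: s / total for m, s in scores.items()}
```

Whether this actually helps is exactly the open question in the thread; the report above is that a capture-based selector so far broke more than it fixed.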
Well, going from a batch size of 128 to 16, my nps goes down by a factor of 4x, so I haven't really bothered to measure whether the increased selectivity from a smaller batch size can compensate for the loss in nps. However, I have now started using single-threaded search for generating training games. When I do 800-playout training, say with a batch size of 128, each thread builds its own tree and 128 games are produced separately.

Did you measure the effect of batch size on playing strength? Nodes per second is not a good measure of performance. I am using very large batches for self-play game generation, because I can play N self-play games in parallel. But when playing a single game in a tournament, I feel such huge batches may hurt performance, especially when the number of nodes is small. I have not measured this very seriously, though. I will run some tests during the weekend.
For actual search (not training with a small number of playouts), one can make cpuct a function of the batch size to account for the added exploration due to virtual loss. I wonder how A0 got away with a batch size of 8 -- maybe the TPU has much less memory-transfer overhead than a GPU.
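A hedged sketch of that cpuct-versus-batch-size idea (the log form, the coefficient, and the floor are illustrative guesses on my part, not a tested formula): since virtual loss already adds exploration, cpuct could shrink as the batch grows.

```python
import math

def effective_cpuct(base_cpuct, batch_size, k=0.1):
    # Assumed schedule: shrink cpuct by a term growing with log2(batch size),
    # floored at a small positive value so exploitation never fully dominates.
    return max(0.25, base_cpuct - k * math.log2(batch_size))
```

With `base_cpuct=1.5`, batch 1 leaves cpuct untouched while batch 128 cuts it to about 0.8; the right `k` would have to come from strength testing, per the discussion above.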
I noticed that batching helps not only because it reduces this latency, but also because the tensorflow/tensorrt NN evaluation code uses generic std containers (list/vector, etc.), which are very slow. I remember the first time I tried to use tensorflow on the CPU, my nps tanked even after I commented out the actual NN evaluation code while keeping the construction of input tensors etc.
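A small illustration of the input-tensor point (the plane layout and names are hypothetical; real engines hand such buffers to tensorflow/tensorrt): writing into one preallocated flat buffer for the whole batch avoids rebuilding per-position containers on every NN call.

```python
from array import array

PLANES, SIZE = 12, 64  # assumed encoding: 12 piece planes on an 8x8 board

def make_batch_buffer(batch_size):
    # one contiguous float32 buffer, zero-initialized, reused across NN calls
    return array('f', bytes(4 * batch_size * PLANES * SIZE))

def encode_position(buf, slot, occupied):
    """occupied: iterable of (plane, square) pairs for one position."""
    base = slot * PLANES * SIZE
    for plane, square in occupied:
        buf[base + plane * SIZE + square] = 1.0
```

The cost per position is then a handful of indexed writes into memory that is already laid out the way the backend wants it, rather than allocating nested lists/vectors each time.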
Looking forward to your test results.