ankan wrote: ↑Sun Oct 28, 2018 5:27 am
chrisw wrote: ↑Sat Oct 27, 2018 11:28 am
Oh, that’s interesting. What does minibatch do? I guess it evaluates 512 positions as a batch?! But who provides the positions? I mean, if every time I want an evaluation I have to wait for somebody else to want another 511 positions before I get mine, I’m going to be doing a lot of stalling and waiting, or?
Yes, the lc0 search algorithm tries to gather a batch of positions that are evaluated in a single call to the NN evaluation backend (i.e., one CPU->GPU->CPU round trip per batch). The minibatch-size parameter controls the maximum size of the batch (i.e., it can be smaller if not enough positions are ready to be evaluated). I am not entirely familiar with the search algorithm; Crem is the author. For details, see the GatherMinibatch() function in:
https://github.com/LeelaChessZero/lc0/b ... /search.cc
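To make the "max size, but can be smaller" point concrete, here is a minimal Python sketch. It is not the lc0 code: gather_minibatch, CountingBackend and run are hypothetical names made up for illustration, and the backend just returns placeholder values. It only shows that each batch costs one backend call and that the last batch may come up short.

```python
from collections import deque


def gather_minibatch(pending, max_batch_size):
    """Take up to max_batch_size positions from the pending queue.

    The batch is smaller when fewer positions are ready.
    """
    batch = []
    while pending and len(batch) < max_batch_size:
        batch.append(pending.popleft())
    return batch


class CountingBackend:
    """Hypothetical stand-in for the NN backend; it only counts round trips."""

    def __init__(self):
        self.calls = 0

    def evaluate(self, batch):
        self.calls += 1                  # one CPU->GPU->CPU trip per batch
        return [0.0 for _ in batch]      # placeholder evaluations


def run(positions, max_batch_size):
    """Evaluate all positions in batches; return (batch sizes, backend calls)."""
    pending = deque(positions)
    backend = CountingBackend()
    batch_sizes = []
    while pending:
        batch = gather_minibatch(pending, max_batch_size)
        backend.evaluate(batch)
        batch_sizes.append(len(batch))
    return batch_sizes, backend.calls
```

With five positions ready and a maximum of 512, run(range(5), 512) makes a single call with a batch of five rather than waiting for 512.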
On one hand, small batch sizes are inefficient (especially on bigger GPUs); on the other hand, trying to gather a bigger batch at the cost of evaluating positions that are less interesting weakens the search. Also, the latency of waiting for results (before the tree can be explored further) is a bottleneck.
If the NN eval can be made faster for smaller batch sizes (ideally batch size of 1), it would help a lot.
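A toy way to see the trade-off, assuming (and this is only an assumption) that one backend call costs a fixed overhead plus a per-position cost. Real GPU timing is not this linear, and the function names and any numbers plugged in are hypothetical, not measurements.

```python
def batch_time_ms(batch_size, overhead_ms, per_position_ms):
    """Assumed cost of one backend call: fixed overhead plus linear work."""
    return overhead_ms + batch_size * per_position_ms


def throughput_per_ms(batch_size, overhead_ms, per_position_ms):
    """Positions evaluated per millisecond under the assumed linear model."""
    return batch_size / batch_time_ms(batch_size, overhead_ms, per_position_ms)
```

Under this model a bigger batch always has higher throughput as long as the overhead is non-zero, but the search also waits longer for each result. If the overhead were zero, throughput would be the same for a batch of 1 as for a batch of 512, which is why a fast batch-size-1 eval would help so much.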
Hmmm, complicated. This is my first pass over what seems to be happening, so possibly entirely random ....
It also seems connected to the nn-cache. Is that hashed? Looks so.
On non-terminal nodes due for expansion, it seems to get all the children and add them to the batch list (512). It’s chess, so that would add somewhere between 1 and 40 or so positions. I guess other parallel threads are doing the same.
When 512 is reached, all of them are sent to the NN. Results get saved in the nn-cache.
Then in search, if a node is found in the nn-cache, the search will process it ...
if not, then it gets sent to be batched as above.
There’s presumably a gain from the batching.
And a loss, because not all nodes sent to the nn-cache are going to be used. Sending all children is a speculative send, although there may/will be some cross-hashing from other parts of the tree/other threads, which is a bonus.
I guess the idea is that once the caching has got going, the search is not impeded by waiting for batches to process, because it keeps finding nodes that are already in the cache. Nevertheless, there is the inefficiency that never-to-be-used child nodes get batched and then cached.
(And some problems to deal with, like collisions and locks and so on.)
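Here is my guess at the flow above written as a small Python sketch. It is only my reading, not the lc0 implementation: NNCache, fake_nn_eval, flush and queue_children are names I made up, positions are plain string keys, and collisions, locks and threads are ignored completely.

```python
class NNCache:
    """Toy cache keyed by position; also tracks which entries search read."""

    def __init__(self):
        self.values = {}     # position key -> evaluation
        self.used = set()    # keys the search actually looked up later

    def __contains__(self, key):
        return key in self.values

    def lookup(self, key):
        """Return the cached evaluation, or None on a miss."""
        if key in self.values:
            self.used.add(key)
            return self.values[key]
        return None

    def store(self, key, value):
        self.values[key] = value

    def unused_count(self):
        """Entries that were evaluated and cached but never read back."""
        return len(self.values) - len(self.used)


def fake_nn_eval(batch):
    """Hypothetical stand-in for the NN: one placeholder value per position."""
    return [0.0 for _ in batch]


def flush(batch, cache):
    """Send the whole batch to the (fake) NN and save the results."""
    for key, value in zip(batch, fake_nn_eval(batch)):
        cache.store(key, value)
    batch.clear()


def queue_children(children, cache, batch, max_batch_size):
    """Speculatively queue every uncached child; flush when the batch is full."""
    for child in children:
        if child not in cache and child not in batch:
            batch.append(child)
        if len(batch) >= max_batch_size:
            flush(batch, cache)
```

If a node has children "a", "b" and "c" and the search later only follows "a", then after a flush lookup("a") is a hit and unused_count() reports the two children that were evaluated for nothing.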
Does that make sense?