
Re: How good is the RTX 2080 Ti for Leela?

Posted: Thu Oct 25, 2018 11:32 am
by jpqy
Ipman's Lc0 benchmark also includes the RTX 2070, and the 2080 and 2080 Ti were run with the same Lc0 v0.18.1.
The differences are bigger there.

https://docs.google.com/spreadsheets/d/ ... 1508569046

JP.

Re: How good is the RTX 2080 Ti for Leela?

Posted: Thu Oct 25, 2018 10:06 pm
by Laskos
jpqy wrote: Thu Oct 25, 2018 11:32 am Ipman's Lc0 benchmark also includes the RTX 2070, and the 2080 and 2080 Ti were run with the same Lc0 v0.18.1.
The differences are bigger there.

https://docs.google.com/spreadsheets/d/ ... 1508569046

JP.
I am a bit skeptical about such a large difference between 2080 and 2080 Ti in that link. The tests were performed by different people maybe with different parameters.

Re: How good is the RTX 2080 Ti for Leela?

Posted: Fri Oct 26, 2018 1:02 pm
by crem
Laskos wrote: Thu Oct 25, 2018 10:06 pm
jpqy wrote: Thu Oct 25, 2018 11:32 am Ipman's Lc0 benchmark also includes the RTX 2070, and the 2080 and 2080 Ti were run with the same Lc0 v0.18.1.
The differences are bigger there.

https://docs.google.com/spreadsheets/d/ ... 1508569046

JP.
I am a bit skeptical about such a large difference between 2080 and 2080 Ti in that link. The tests were performed by different people maybe with different parameters.
The 47664 nps figure for the 2080 Ti in that sheet is from Ankan, who reported 32472 nps for the 2080 Ti in the table above (and who was also active in this thread a few messages ago).

So some settings are surely different. I can see at least that the batch size is different (512 vs 1024), and it was also run through the multiplexing backend (why?).

Re: How good is the RTX 2080 Ti for Leela?

Posted: Fri Oct 26, 2018 1:17 pm
by jpqy
So a lot of improvement is needed in all this data and these settings, such as a bench tool that everyone can run to get a more accurate comparison!

JP.

Re: How good is the RTX 2080 Ti for Leela?

Posted: Fri Oct 26, 2018 10:49 pm
by Dann Corbit
Laskos wrote: Thu Oct 25, 2018 10:06 pm
jpqy wrote: Thu Oct 25, 2018 11:32 am Ipman's Lc0 benchmark also includes the RTX 2070, and the 2080 and 2080 Ti were run with the same Lc0 v0.18.1.
The differences are bigger there.

https://docs.google.com/spreadsheets/d/ ... 1508569046

JP.
I am a bit skeptical about such a large difference between 2080 and 2080 Ti in that link. The tests were performed by different people maybe with different parameters.
Ankan is a serious expert.
I guess he can configure better than anyone else.

Re: How good is the RTX 2080 Ti for Leela?

Posted: Sat Oct 27, 2018 8:11 am
by ankan
I think I might have used different settings (just to maximize nps). I also might have had the GPU overclocked when I posted that result.
Generally, letting it run for longer (e.g. for 5M nodes) with a bigger nncache setting results in higher nps.

I just updated the sheet with benchmark at two different settings for RTX 2080Ti:
42915 nps for "--minibatch-size=512 -t 3 --backend=cudnn-fp16 --nncache=2000000; go nodes 5000000"
37499 nps for "--minibatch-size=512 -t 2 --backend=cudnn-fp16 --nncache=2000000; go nodes 1000000"

On the RTX 2080 Ti, batch sizes 512 and 1024 don't differ much in nps, so I would recommend 512, as people have found a loss in strength with bigger batch sizes.

The reason a longer run increases nps is more NN cache hits. Actual raw NN eval speed peaks at about 24 knps on the 2080 Ti.
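
As a rough sanity check on the cache-hit explanation, here is a minimal sketch. It assumes a deliberately simple model in which a cache hit costs nothing and every miss costs one raw NN evaluation, so reported nps = raw nps / (1 - hit rate). The model and the function name are illustrative assumptions, not anything taken from lc0, and it ignores other effects such as overclocking or thread count.

```python
def implied_hit_rate(reported_nps: float, raw_nps: float) -> float:
    """Cache hit rate implied by reported vs. raw nps.

    Assumes reported_nps = raw_nps / (1 - hit_rate), i.e. hits are free
    and each miss costs exactly one raw NN evaluation (illustrative model).
    """
    if reported_nps < raw_nps:
        raise ValueError("reported nps is below raw nps; the model does not apply")
    return 1.0 - raw_nps / reported_nps


# "about 24 knps" raw NN eval speed, as quoted in the post above
RAW_NPS = 24_000

for label, nps in [("5M nodes, -t 3", 42_915), ("1M nodes, -t 2", 37_499)]:
    rate = implied_hit_rate(nps, RAW_NPS)
    print(f"{label}: implied cache hit rate ~ {rate:.0%}")
```

Under this model the two runs would correspond to hit rates of roughly 44% and 36%, which is at least consistent with the longer run benefiting more from the cache.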

Re: How good is the RTX 2080 Ti for Leela?

Posted: Sat Oct 27, 2018 11:28 am
by chrisw
ankan wrote: Sat Oct 27, 2018 8:11 am I think I might have used different settings (just to maximize nps). I also might have had the GPU overclocked when I posted that result.
Generally, letting it run for longer (e.g. for 5M nodes) with a bigger nncache setting results in higher nps.

I just updated the sheet with benchmark at two different settings for RTX 2080Ti:
42915 nps for "--minibatch-size=512 -t 3 --backend=cudnn-fp16 --nncache=2000000; go nodes 5000000"
37499 nps for "--minibatch-size=512 -t 2 --backend=cudnn-fp16 --nncache=2000000; go nodes 1000000"

On the RTX 2080 Ti, batch sizes 512 and 1024 don't differ much in nps, so I would recommend 512, as people have found a loss in strength with bigger batch sizes.

The reason a longer run increases nps is more NN cache hits. Actual raw NN eval speed peaks at about 24 knps on the 2080 Ti.
Oh, that’s interesting. What does minibatch do? I guess it evaluates 512 positions as a batch?! But who provides the positions? I mean, if every time I want an evaluation I have to wait for somebody else to want another 511 positions before I get mine, I’m going to be doing a lot of stalling and waiting, or?

Re: How good is the RTX 2080 Ti for Leela?

Posted: Sun Oct 28, 2018 5:27 am
by ankan
chrisw wrote: Sat Oct 27, 2018 11:28 am Oh, that’s interesting. What does minibatch do? I guess it evaluates 512 positions as a batch?! But who provides the positions? I mean, if every time I want an evaluation I have to wait for somebody else to want another 511 positions before I get mine, I’m going to be doing a lot of stalling and waiting, or?
Yes, the lc0 search algorithm tries to gather a batch of positions that are evaluated in a single call to the NN evaluation backend (i.e. one CPU->GPU->CPU trip per batch). The minibatch-size param controls the max size of the batch (i.e. it can be smaller if not enough positions are ready to be evaluated). I am not entirely familiar with the search algorithm; crem is the author. For details see the GatherMinibatch() function in:
https://github.com/LeelaChessZero/lc0/b ... /search.cc
On one hand, small batch sizes are inefficient (especially on bigger GPUs); on the other hand, trying to gather a bigger batch at the cost of evaluating positions that are less interesting weakens the search. Also, the latency of waiting for results (before the tree can be explored further) is a bottleneck.

If the NN eval can be made faster for smaller batch sizes (ideally batch size of 1), it would help a lot.
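
The "gather up to a maximum, then make one backend call" idea described above can be sketched as follows. This is a minimal illustration and not lc0's actual GatherMinibatch(): the function names, the queue of ready positions, and the stand-in backend are all hypothetical.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence, Tuple


def gather_minibatch(ready: Deque[str], max_batch_size: int) -> List[str]:
    """Take up to max_batch_size positions that are ready for evaluation.

    The batch may be smaller than the maximum if fewer positions are ready.
    """
    batch: List[str] = []
    while ready and len(batch) < max_batch_size:
        batch.append(ready.popleft())
    return batch


def evaluate_all(
    ready: Deque[str],
    max_batch_size: int,
    backend: Callable[[Sequence[str]], List[float]],
) -> Tuple[List[float], int]:
    """Drain the queue with one backend call (one CPU->GPU->CPU trip) per batch."""
    results: List[float] = []
    calls = 0
    while ready:
        batch = gather_minibatch(ready, max_batch_size)
        results.extend(backend(batch))
        calls += 1
    return results, calls


def fake_backend(batch: Sequence[str]) -> List[float]:
    """Hypothetical stand-in for the NN: returns one dummy value per position."""
    return [float(len(position)) for position in batch]


positions = deque(f"pos{i}" for i in range(1200))
values, calls = evaluate_all(positions, 512, fake_backend)
print(calls)  # 3 backend calls: 512 + 512 + 176 positions
```

With a maximum of 512, the 1200 queued positions need three round trips instead of 1200, and the last batch is simply smaller than the maximum.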

Re: How good is the RTX 2080 Ti for Leela?

Posted: Sun Oct 28, 2018 10:31 am
by chrisw
ankan wrote: Sun Oct 28, 2018 5:27 am
chrisw wrote: Sat Oct 27, 2018 11:28 am Oh, that’s interesting. What does minibatch do? I guess it evaluates 512 positions as a batch?! But who provides the positions? I mean, if every time I want an evaluation I have to wait for somebody else to want another 511 positions before I get mine, I’m going to be doing a lot of stalling and waiting, or?
Yes, the lc0 search algorithm tries to gather a batch of positions that are evaluated in a single call to the NN evaluation backend (i.e. one CPU->GPU->CPU trip per batch). The minibatch-size param controls the max size of the batch (i.e. it can be smaller if not enough positions are ready to be evaluated). I am not entirely familiar with the search algorithm; crem is the author. For details see the GatherMinibatch() function in:
https://github.com/LeelaChessZero/lc0/b ... /search.cc
On one hand, small batch sizes are inefficient (especially on bigger GPUs); on the other hand, trying to gather a bigger batch at the cost of evaluating positions that are less interesting weakens the search. Also, the latency of waiting for results (before the tree can be explored further) is a bottleneck.

If the NN eval can be made faster for smaller batch sizes (ideally batch size of 1), it would help a lot.
Hmmm, complicated. This is my first pass over what seems to be happening, so possibly entirely random ...

It also seems connected to the nn-cache. Is that hashed? It looks so.

On non-terminal nodes due for expansion, it seems to get all the children and add them to the batch list (512). It’s chess, so that would add between 1 and 40 or so positions. I guess other parallel threads are doing the same.
When 512 is reached, all are sent to the NN. Results get saved in the nn-cache.

Then in search, if a node is found in the nn-cache, the search will process it;
if not, it is sent to be batched as above.

There’s presumably a gain from the batching.
And a loss, because not all nodes sent to the nn-cache are going to be used. Sending all children is a speculative send, although there may/will be some cross-hashing from other parts of the tree or other threads, which is a bonus.

I guess the idea is that once the caching has got going, the search is not impeded by waiting for batches to process, because it keeps finding nodes that are already in the cache. Nevertheless, there is the inefficiency that never-to-be-used child nodes get batched and then cached.

(And some problems to deal with, like collisions and locks and so on.)

Does that make sense?
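
The lookup-then-batch flow guessed at in this post can be sketched as below. It only illustrates the description, not how lc0 actually works: the class and function names are hypothetical, LRU eviction is an assumption, and positions are keyed by a plain string rather than a real position hash.

```python
from collections import OrderedDict
from typing import Callable, Dict, List, Optional, Sequence


class NNCache:
    """Hypothetical hashed cache of NN results with LRU eviction (assumed policy)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._entries: "OrderedDict[str, float]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> Optional[float]:
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._entries[key]
        self.misses += 1
        return None

    def put(self, key: str, value: float) -> None:
        self._entries[key] = value
        self._entries.move_to_end(key)
        while len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict least recently used


def evaluate(
    positions: Sequence[str],
    cache: NNCache,
    backend: Callable[[Sequence[str]], List[float]],
    max_batch_size: int = 512,
) -> Dict[str, float]:
    """Serve cache hits directly; send misses to the backend in batches."""
    values: Dict[str, float] = {}
    pending: List[str] = []
    for position in positions:
        cached = cache.get(position)
        if cached is not None:
            values[position] = cached
        else:
            pending.append(position)
    for start in range(0, len(pending), max_batch_size):
        batch = pending[start:start + max_batch_size]
        for position, value in zip(batch, backend(batch)):
            cache.put(position, value)
            values[position] = value
    return values


def fake_backend(batch: Sequence[str]) -> List[float]:
    """Hypothetical stand-in for the NN: returns one dummy value per position."""
    return [float(len(position)) for position in batch]


cache = NNCache(capacity=1000)
evaluate(["a", "b", "c"], cache, fake_backend)
evaluate(["b", "c", "d"], cache, fake_backend)
print(cache.hits, cache.misses)  # 2 hits (b, c) and 4 misses (a, b, c, then d)
```

The second call only needs the backend for the one position it has not seen, which is the effect that lets a warmed-up cache keep the search from stalling on every batch.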

Re: How good is the RTX 2080 Ti for Leela?

Posted: Wed Nov 07, 2018 12:07 pm
by Gian-Carlo Pascutto
crem wrote: Thu Oct 25, 2018 10:27 am Updated information from Ankan on discord:

with cudnn 7.3 and 411.63 driver available at nvidia.com 
minibatch-size=512, network id: 11250, go nodes 1000000
v0.17, default values for all other settings
(2070 run was with v0.18.1 lc0 build but with same settings)
             fp32    fp16    
GTX 1080Ti:   8996     -
Titan V:     13295   29379
RTX 2070:     8841   23721
RTX 2080:     9708   26678
RTX 2080Ti:  12208   32472
I'm not sure, though, whether a v0.17 vs v0.18.1 comparison is fair; I don't remember what changed between them.
Why is the RTX 2070 on par with the GTX 1080 Ti in fp32 mode?

It's 11 TFLOPS fp32 vs 7.5 TFLOPS fp32. The performance discrepancy is strange, unless the RTX is somehow much better at getting to the theoretical performance. Or it's the build difference (0.17 vs 0.18). It would be nice to get more reliable numbers.
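
To put a number on the discrepancy, here is a small sketch that divides the fp32 nps from the table by the peak fp32 figures in this post (11 and 7.5, taken here as TFLOPS, which matches the published specifications for these cards). The dictionary and function names are illustrative.

```python
# fp32 nps from the benchmark table quoted above
FP32_NPS = {"GTX 1080Ti": 8996, "RTX 2070": 8841}

# Approximate peak fp32 throughput in TFLOPS, as quoted in this post
FP32_PEAK_TFLOPS = {"GTX 1080Ti": 11.0, "RTX 2070": 7.5}


def nps_per_tflop(gpu: str) -> float:
    """Benchmark nps achieved per TFLOP of theoretical fp32 peak."""
    return FP32_NPS[gpu] / FP32_PEAK_TFLOPS[gpu]


for gpu in FP32_NPS:
    print(f"{gpu}: {nps_per_tflop(gpu):.0f} nps per peak TFLOP")

ratio = nps_per_tflop("RTX 2070") / nps_per_tflop("GTX 1080Ti")
print(f"RTX 2070 vs GTX 1080Ti efficiency ratio: {ratio:.2f}")
```

On these figures the RTX 2070 gets roughly 1.4 times as many nps out of each theoretical TFLOP, which is the gap the question is about.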

Is Leela Zero Chess using fp16 and/or tensor mode in training (not inference!)? Or are you using fp32? (I saw that the RTX cards, unlike the V100, have gimped fp32 accumulate in tensor cores, so it may not be useful for training on them even if it is on a V100.)

Does anyone have the numbers for having tensor cores enabled/disabled in fp16 mode? I know they aren't accurate enough to do Winograd, so there's a factor of ~3 loss of efficiency for 3x3 convolution. I'm curious if there's still a worthwhile speedup left.
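
For context on the "factor ~3" figure, here is a minimal sketch of the standard multiplication count for Winograd convolution: an F(m x m, r x r) tile needs (m + r - 1)^2 multiplications, against m^2 * r^2 for direct convolution. It counts multiplications only and ignores the cost of the input, filter, and output transforms, so real-world gains are lower. Which tile size any particular backend uses is not stated in this thread.

```python
def winograd_mult_reduction(m: int, r: int = 3) -> float:
    """Ratio of multiplications, direct vs. Winograd F(m x m, r x r).

    Direct convolution needs m*m outputs times r*r multiplications each;
    Winograd needs (m + r - 1)^2 multiplications per output tile.
    Transform overhead is deliberately ignored.
    """
    direct = (m * m) * (r * r)
    winograd = (m + r - 1) ** 2
    return direct / winograd


for m in (2, 4):
    print(f"F({m}x{m}, 3x3): {winograd_mult_reduction(m):.2f}x fewer multiplications")
```

The two common tile sizes give 2.25x and 4x, so a loss of around 3x from falling back to direct 3x3 convolution is in the expected range.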