How good is the RTX 2080 Ti for Leela?

ankan · Post by **ankan** » Sun Oct 28, 2018 5:27 am

chrisw wrote: ↑Sat Oct 27, 2018 11:28 am oh, that’s interesting. what does minibatch do? I guess it evaluates 512 positions as a batch?! but who provides the positions? I mean, if every time I want an evaluation I have to wait for somebody else to want another 511 positions before I get mine, I’m going to be doing a lot of stalling, waiting, or?

Yes, the lc0 search algorithm tries to gather a batch of positions that are evaluated in a single call to the NN evaluation backend (i.e., one CPU->GPU->CPU trip per batch). The minibatch-size param controls the maximum size of the batch (i.e., it can be smaller if not enough positions are ready to be evaluated). I am not entirely familiar with the search algorithm; Crem is the author. For details see the GatherMinibatch() function in:
https://github.com/LeelaChessZero/lc0/b ... /search.cc
On the one hand, small batch sizes are inefficient (especially on bigger GPUs); on the other hand, trying to gather a bigger batch at the cost of evaluating positions that are less interesting weakens the search. Also, the latency of waiting for results (before the tree can be explored further) is a bottleneck.

If the NN eval can be made faster for smaller batch sizes (ideally batch size of 1), it would help a lot.
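To make the "max size" point concrete, here is a minimal sketch of the batching idea described above. It is not lc0's GatherMinibatch(); the names `gather_minibatch`, `evaluate_batch` and the `ready` queue are hypothetical, and the evaluation is a dummy stand-in for the GPU call.

```python
from collections import deque


def gather_minibatch(ready_positions, max_batch_size):
    """Take up to max_batch_size positions that are ready for evaluation.

    Hypothetical helper, not lc0's GatherMinibatch(): it only shows that the
    batch can come back smaller than the cap when fewer positions are ready.
    """
    batch = []
    while ready_positions and len(batch) < max_batch_size:
        batch.append(ready_positions.popleft())
    return batch


def evaluate_batch(batch):
    """Stand-in for one CPU->GPU->CPU round trip; one dummy value per position."""
    return [0.0 for _ in batch]


ready = deque(f"pos{i}" for i in range(700))
first = gather_minibatch(ready, max_batch_size=512)   # cap reached
second = gather_minibatch(ready, max_batch_size=512)  # fewer positions left
values = evaluate_batch(first)                        # one call for the whole batch
print(len(first), len(second), len(values))
```

With 700 positions ready, the first batch is capped at 512 and the second one is simply smaller, which is the behaviour the minibatch-size description implies.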

chrisw · Post by **chrisw** » Sun Oct 28, 2018 10:31 am

ankan wrote: ↑Sun Oct 28, 2018 5:27 am
chrisw wrote: ↑Sat Oct 27, 2018 11:28 am oh, that’s interesting. what does minibatch do? I guess it evaluates 512 positions as a batch?! but who provides the positions? I mean, if every time I want an evaluation I have to wait for somebody else to want another 511 positions before I get mine, I’m going to be doing a lot of stalling, waiting, or?
Yes, the lc0 search algorithm tries to gather a batch of positions that are evaluated in a single call to the NN evaluation backend (i.e., one CPU->GPU->CPU trip per batch). The minibatch-size param controls the maximum size of the batch (i.e., it can be smaller if not enough positions are ready to be evaluated). I am not entirely familiar with the search algorithm; Crem is the author. For details see the GatherMinibatch() function in:
https://github.com/LeelaChessZero/lc0/b ... /search.cc
On the one hand, small batch sizes are inefficient (especially on bigger GPUs); on the other hand, trying to gather a bigger batch at the cost of evaluating positions that are less interesting weakens the search. Also, the latency of waiting for results (before the tree can be explored further) is a bottleneck.

If the NN eval can be made faster for smaller batch sizes (ideally batch size of 1), it would help a lot.

hmmm, complicated. my first pass over what seems to be happening, so possibly entirely random ....

seems also connected to the nn-cache, is that hashed? looks so.

it seems, on non-terminal nodes due for expansion, to get all the children and add them to the batch-list (512). It’s chess, so that would add between 1 and 40 or so positions. I guess other parallel threads are doing the same.
When 512 reached, then all are sent to the NN. Results get saved in nn cache.

Then in search, if a node is found in nn-cache, the search will process it ...
if not, then send it to be batched as above.

There’s presumably a gain from the batching.
And a loss, because not all nodes sent to nn-cache are going to be used. Sending all children is a speculative send. Although there may/will be some cross-hashing from other parts of the tree/other threads which is a bonus.

I guess the idea is that once the caching has got going, the search is not impeded by waiting for batches to process, because it keeps finding nodes that are already in the cache. Nevertheless, there is the inefficiency that never-to-be-used child nodes get batched and then cached.

(And some problems to deal with, like collisions and locks and so on.)

Does that make sense?
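The flow guessed at above (look in the cache first, otherwise queue the position until the batch is full) can be sketched like this. It only illustrates that reading of the code and has not been checked against lc0; the class `CachedEvaluator`, its methods and the dummy evaluation are all hypothetical.

```python
class CachedEvaluator:
    """Toy model: cache lookup first, otherwise queue for the next batch."""

    def __init__(self, max_batch_size):
        self.cache = {}        # position key -> evaluation
        self.pending = []      # keys waiting for the next batch
        self.max_batch_size = max_batch_size
        self.gpu_calls = 0     # number of simulated NN calls

    def request(self, key):
        """Return the evaluation if available, else None (still waiting)."""
        if key in self.cache:
            return self.cache[key]
        if key not in self.pending:
            self.pending.append(key)
        if len(self.pending) >= self.max_batch_size:
            self.flush()
            return self.cache[key]
        return None

    def flush(self):
        """Simulate one NN call that evaluates every pending position."""
        self.gpu_calls += 1
        for key in self.pending:
            self.cache[key] = float(len(key))  # dummy evaluation
        self.pending.clear()


ev = CachedEvaluator(max_batch_size=4)
results = [ev.request(k) for k in ["a", "b", "c", "d"]]  # 4th request fills the batch
hit = ev.request("a")                                   # served from the cache
print(results, hit, ev.gpu_calls)
```

The first three requests stall, the fourth triggers the single batched call, and any later request for the same positions is answered from the cache without another call.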

Gian-Carlo Pascutto · Post by **Gian-Carlo Pascutto** » Wed Nov 07, 2018 12:07 pm

crem wrote: ↑Thu Oct 25, 2018 10:27 am Updated information from Ankan on discord:
with cudnn 7.3 and 411.63 driver available at nvidia.com 
minibatch-size=512, network id: 11250, go nodes 1000000
v0.17, default values for all other settings
(2070 run was with v0.18.1 lc0 build but with same settings)
             fp32    fp16    
GTX 1080Ti:   8996     -
Titan V:     13295   29379
RTX 2070:     8841   23721
RTX 2080:     9708   26678
RTX 2080Ti:  12208   32472
I'm not sure, though, whether a v0.17 vs v0.18.1 comparison is fair; I don't remember which changes were made in between.

Why is the RTX 2070 on par with the GTX 1080 Ti in fp32 mode?

The GTX 1080 Ti is rated at about 11 TFLOPS fp32 vs about 7.5 TFLOPS fp32 for the RTX 2070. The performance discrepancy is strange, unless the RTX is somehow much better at getting to the theoretical performance. Or it's the build difference (0.17 vs 0.18). Would be nice to get more reliable numbers.
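To put a number on the discrepancy, a quick sketch dividing the fp32 nps from the table by the peak figures quoted here (about 11 and 7.5, in TFLOPS). The peak values are rounded spec-sheet numbers, so treat the result as a rough indication only.

```python
# (fp32 nps from the table above, approximate peak fp32 TFLOPS as quoted)
cards = {
    "GTX 1080 Ti": (8996, 11.0),
    "RTX 2070": (8841, 7.5),
}

nps_per_tflops = {name: nps / tflops for name, (nps, tflops) in cards.items()}
ratio = nps_per_tflops["RTX 2070"] / nps_per_tflops["GTX 1080 Ti"]

for name, value in nps_per_tflops.items():
    print(f"{name}: {value:.0f} nps per TFLOPS")
print(f"RTX 2070 / GTX 1080 Ti: {ratio:.2f}x")
```

On these figures the RTX 2070 gets roughly 1.4x more nps per theoretical TFLOPS, which is the gap that needs explaining (better utilisation, the build difference, or both).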

Is Leela Chess Zero using fp16 and/or tensor mode in training (not inference!)? Or are you using fp32? (I saw that the RTX cards, unlike the V100, have gimped fp32 accumulate in tensor cores, so it may not be useful for training on them even if it is on a V100.)

Does anyone have the numbers for having tensor cores enabled/disabled in fp16 mode? I know they aren't accurate enough to do Winograd, so there's a factor ~3 loss of efficiency for 3x3 convolution. I'm curious if there's still a worthwhile speedup left.

Gian-Carlo Pascutto · Post by **Gian-Carlo Pascutto** » Wed Nov 07, 2018 1:37 pm

Another thing to notice in this table is that fp16 numbers tend to be >> 2x fp32 numbers.

This makes me a bit suspicious.
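For reference, the fp16/fp32 ratios computed straight from the table quoted earlier in the thread (nothing here beyond the posted numbers):

```python
# (fp32 nps, fp16 nps) from the table posted earlier in the thread
table = {
    "Titan V": (13295, 29379),
    "RTX 2070": (8841, 23721),
    "RTX 2080": (9708, 26678),
    "RTX 2080Ti": (12208, 32472),
}

speedup = {name: fp16 / fp32 for name, (fp32, fp16) in table.items()}

for name, s in speedup.items():
    print(f"{name}: {s:.2f}x")
```

Every card lands above 2x, with the three RTX cards between roughly 2.6x and 2.8x and the Titan V a bit lower.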

h1a8 · Post by **h1a8** » Wed Nov 14, 2018 9:00 pm

I have a question.
I can buy a dual 2080 OC with 8gb GDDR6 each
for a similar price as
A single 2080 ti OC with 11GB GDDR6

Which would give the stronger Leela performance and by how much?

Milos · Post by **Milos** » Wed Nov 14, 2018 9:18 pm

h1a8 wrote: ↑Wed Nov 14, 2018 9:00 pm I have a question.
I can buy a dual 2080 OC with 8gb GDDR6 each
for a similar price as
A single 2080 ti OC with 11GB GDDR6

Which would give the stronger Leela performance and by how much?

Dual 2080 hands down.

Dann Corbit · Post by **Dann Corbit** » Wed Nov 14, 2018 9:20 pm

Milos wrote: ↑Wed Nov 14, 2018 9:18 pm
h1a8 wrote: ↑Wed Nov 14, 2018 9:00 pm I have a question.
I can buy a dual 2080 OC with 8gb GDDR6 each
for a similar price as
A single 2080 ti OC with 11GB GDDR6

Which would give the stronger Leela performance and by how much?
Dual 2080 hands down.

Better have a big power supply.

ankan · Post by **ankan** » Thu Nov 15, 2018 7:42 am

Gian-Carlo Pascutto wrote: ↑Wed Nov 07, 2018 1:37 pm Another thing to notice in this table is that fp16 numbers tend to be >> 2x fp32 numbers.

This makes me a bit suspicious.

That is because we use Tensor Cores for the FP16 path. The speedup is only about 3X (and not 8X as the raw TFLOPS numbers would suggest) because the fp32 path uses the Winograd algorithm for convolutions but the FP16 path doesn't (not well supported by cudnn).
Right now we don't have a path using FP16 without tensor math. It should be pretty easy to support: we just need to change the tensor layout, the tensor math option in cudnn, and the cudnn algorithm selection setting. But it would be slower than the current version (with tensor cores enabled).
AFAIK, there is only one Nvidia GPU (P100) supporting high-throughput fp16 that doesn't have tensor cores, so adding another path just to support it is probably not worth it, given that the P100 is a server-only product anyway (Tesla/Quadro).
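A rough sanity check of the "about 3X instead of 8X" figure, counting only multiplications. Which Winograd tile size cudnn picks for the fp32 path is not stated in this thread, so both common 3x3 variants are shown as assumptions; transform overhead and memory bandwidth are ignored.

```python
def winograd_mult_reduction(m, r=3):
    """Multiplication savings of 2D Winograd F(m x m, r x r) vs direct convolution.

    Direct: m*m outputs, each needing r*r multiplications.
    Winograd: (m + r - 1)^2 multiplications per output tile.
    """
    direct = (m * m) * (r * r)
    winograd = (m + r - 1) ** 2
    return direct / winograd


raw_tensor_core_advantage = 8.0  # the "8X" raw throughput figure quoted above

# Assumed tile sizes: F(2x2, 3x3) and F(4x4, 3x3)
expected = {
    m: raw_tensor_core_advantage / winograd_mult_reduction(m) for m in (2, 4)
}

for m, s in expected.items():
    print(f"F({m}x{m}, 3x3): saves {winograd_mult_reduction(m):.2f}x, "
          f"net fp16 speedup ~{s:.2f}x")
```

Under these assumptions the net speedup comes out between about 2x and 3.6x, which brackets the measured figure of roughly 3X.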

h1a8 · Post by **h1a8** » Tue Nov 20, 2018 10:11 am

Milos wrote: ↑Wed Nov 14, 2018 9:18 pm
h1a8 wrote: ↑Wed Nov 14, 2018 9:00 pm I have a question.
I can buy a dual 2080 OC with 8gb GDDR6 each
for a similar price as
A single 2080 ti OC with 11GB GDDR6

Which would give the stronger Leela performance and by how much?
Dual 2080 hands down.

Thank you! From the spreadsheet,
the rtx 2080 gives about 26000 nps and the rtx 2080 ti gives about 37500 nps. A dual rtx 2080 would give about how many nps?

jpqy wrote: ↑Thu Oct 25, 2018 11:32 am Ipman's Lc0 Benchmark also includes the RTX 2070; the 2080 and 2080 Ti used the same Lc0 v0.18.1.
The differences are bigger there.

JP.

I have a few questions about the spreadsheet.

1. Why is the rtx 2080 ti listed for 3 threads and for 2 threads? Are there two different versions we can buy?

2. Why does the rtx 2070 have 4 threads and not 2? Is that really a dual 2070 setup?

3. Why is the 2 thread rtx 2070 (slightly OC) significantly faster than the 4 thread rtx 2070?

4. Would a dual rtx 2080 setup result in about 52000nps for Leela (since a single 2080 is about 26000nps)?
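The arithmetic behind question 4, as a sketch. Perfect linear scaling across two GPUs is an assumption and should be read as an upper bound; the `efficiency` parameter is hypothetical and nothing in this thread measures it.

```python
def multi_gpu_nps(single_gpu_nps, num_gpus, efficiency=1.0):
    """Estimated nps for num_gpus cards; efficiency=1.0 means perfect scaling."""
    return single_gpu_nps * num_gpus * efficiency


single_2080 = 26000     # nps, from the spreadsheet figure quoted above
single_2080_ti = 37500  # nps, from the spreadsheet figure quoted above

ideal_dual_2080 = multi_gpu_nps(single_2080, 2)
break_even = single_2080_ti / ideal_dual_2080  # scaling needed to match one 2080 Ti

print(f"ideal dual 2080: {ideal_dual_2080:.0f} nps")
print(f"break-even scaling efficiency vs 2080 Ti: {break_even:.0%}")
```

So the 52000 nps figure only holds if scaling is perfect, and the dual setup stays ahead of a single 2080 Ti as long as it keeps a bit more than about 72% of that ideal.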

Robert Pope · Post by **Robert Pope** » Tue Nov 20, 2018 4:03 pm

The thread count refers to how many instances of Lc0 you have running at the same time; it has nothing to do with different GPU cards. Fast cards don't run at full capacity with just one CPU thread, so you get more output by having two threads sending data to the GPU.
