How good is the RTX 2080 Ti for Leela?

Discussion of anything and everything relating to chess playing software and machines.

Moderators: hgm, Rebel, chrisw

jpqy
Posts: 550
Joined: Thu Apr 24, 2008 9:31 am
Location: Belgium

Re: How good is the RTX 2080 Ti for Leela?

Post by jpqy »

Ipman's Lc0 benchmark also includes the RTX 2070; the 2080 & 2080 Ti entries used the same Lc0 v0.18.1.
There the differences are bigger.

https://docs.google.com/spreadsheets/d/ ... 1508569046

JP.
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: How good is the RTX 2080 Ti for Leela?

Post by Laskos »

jpqy wrote: Thu Oct 25, 2018 11:32 am Ipman's Lc0 benchmark also includes the RTX 2070; the 2080 & 2080 Ti entries used the same Lc0 v0.18.1.
There the differences are bigger.

https://docs.google.com/spreadsheets/d/ ... 1508569046

JP.
I am a bit skeptical about such a large difference between the 2080 and the 2080 Ti in that link. The tests were performed by different people, perhaps with different parameters.
crem
Posts: 177
Joined: Wed May 23, 2018 9:29 pm

Re: How good is the RTX 2080 Ti for Leela?

Post by crem »

Laskos wrote: Thu Oct 25, 2018 10:06 pm
jpqy wrote: Thu Oct 25, 2018 11:32 am Ipman's Lc0 benchmark also includes the RTX 2070; the 2080 & 2080 Ti entries used the same Lc0 v0.18.1.
There the differences are bigger.

https://docs.google.com/spreadsheets/d/ ... 1508569046

JP.
I am a bit skeptical about such a large difference between the 2080 and the 2080 Ti in that link. The tests were performed by different people, perhaps with different parameters.
The 47664 nps for the 2080 Ti there is from Ankan, who in the table above said it's 32472 nps for the 2080 Ti (and he was also active in this thread a few messages ago).

So some settings are surely different. I can see at least that the batch size differs (512 vs 1024), and that it was run through the multiplexing backend (why?..).
jpqy
Posts: 550
Joined: Thu Apr 24, 2008 9:31 am
Location: Belgium

Re: How good is the RTX 2080 Ti for Leela?

Post by jpqy »

So there is a lot of improvement needed in all this data and these settings, like a bench tool that everyone can run to get a more accurate comparison!

JP.
Dann Corbit
Posts: 12537
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: How good is the RTX 2080 Ti for Leela?

Post by Dann Corbit »

Laskos wrote: Thu Oct 25, 2018 10:06 pm
jpqy wrote: Thu Oct 25, 2018 11:32 am Ipman's Lc0 benchmark also includes the RTX 2070; the 2080 & 2080 Ti entries used the same Lc0 v0.18.1.
There the differences are bigger.

https://docs.google.com/spreadsheets/d/ ... 1508569046

JP.
I am a bit skeptical about such a large difference between the 2080 and the 2080 Ti in that link. The tests were performed by different people, perhaps with different parameters.
Ankan is a serious expert.
I guess he can configure it better than anyone else.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
ankan
Posts: 77
Joined: Sun Apr 21, 2013 3:29 pm
Full name: Ankan Banerjee

Re: How good is the RTX 2080 Ti for Leela?

Post by ankan »

I think I might have used different settings (just to maximize nps). I also might have had the GPU overclocked when I posted that result.
Generally, letting it run longer (e.g. for 5M nodes) with a bigger nncache setting results in higher nps.

I just updated the sheet with benchmarks at two different settings for the RTX 2080 Ti:
42915 nps for "--minibatch-size=512 -t 3 --backend=cudnn-fp16 --nncache=2000000; go nodes 5000000"
37499 nps for "--minibatch-size=512 -t 2 --backend=cudnn-fp16 --nncache=2000000; go nodes 1000000"

On the RTX 2080 Ti, batch sizes of 512 vs 1024 don't make much of an nps difference, so I would recommend 512, as people have found a loss in strength with bigger batch sizes.

The reason a longer run increases nps is more NN cache hits. The actual raw NN eval speed peaks at about 24 knps on the 2080 Ti.
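
To illustrate why more cache hits raise the measured nps, here is a minimal sketch of a position-keyed eval cache, assuming a plain hash map. All names are invented, not lc0's actual code; the real cache is bounded (--nncache sets its capacity) and shared between search threads, which this toy is not.

Code: Select all

#include <cstdint>
#include <optional>
#include <unordered_map>
#include <vector>

// Hypothetical NN output: a position value plus a policy vector.
struct Evaluation {
    float value;
    std::vector<float> policy;
};

// Position-hash-keyed eval cache. A hit means the search reuses an old
// result instead of paying another GPU round trip, so the longer a run
// goes (and the more transpositions recur), the higher the nps climbs.
class NNCache {
public:
    std::optional<Evaluation> Lookup(uint64_t position_hash) const {
        auto it = cache_.find(position_hash);
        if (it == cache_.end()) return std::nullopt;  // Miss: batch it for the GPU.
        return it->second;                            // Hit: no NN eval needed.
    }
    void Store(uint64_t position_hash, Evaluation eval) {
        cache_.emplace(position_hash, std::move(eval));
    }
private:
    std::unordered_map<uint64_t, Evaluation> cache_;
};

int main() {
    NNCache cache;
    cache.Store(0xABCD, Evaluation{0.31f, {0.7f, 0.3f}});
    bool hit  = cache.Lookup(0xABCD).has_value();  // true: skip the GPU
    bool miss = cache.Lookup(0x1234).has_value();  // false: enqueue for a batch
    return (hit && !miss) ? 0 : 1;
}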
chrisw
Posts: 4313
Joined: Tue Apr 03, 2012 4:28 pm

Re: How good is the RTX 2080 Ti for Leela?

Post by chrisw »

ankan wrote: Sat Oct 27, 2018 8:11 am I think I might have used different settings (just to maximize nps). I also might have had the GPU overclocked when I posted that result.
Generally, letting it run longer (e.g. for 5M nodes) with a bigger nncache setting results in higher nps.

I just updated the sheet with benchmarks at two different settings for the RTX 2080 Ti:
42915 nps for "--minibatch-size=512 -t 3 --backend=cudnn-fp16 --nncache=2000000; go nodes 5000000"
37499 nps for "--minibatch-size=512 -t 2 --backend=cudnn-fp16 --nncache=2000000; go nodes 1000000"

On the RTX 2080 Ti, batch sizes of 512 vs 1024 don't make much of an nps difference, so I would recommend 512, as people have found a loss in strength with bigger batch sizes.

The reason a longer run increases nps is more NN cache hits. The actual raw NN eval speed peaks at about 24 knps on the 2080 Ti.
Oh, that's interesting. What does minibatch do? I guess it evaluates 512 positions as a batch?! But who provides the positions? I mean, if every time I want an evaluation I have to wait for somebody else to want another 511 positions before I get mine, I'm going to be doing a lot of stalling and waiting, or?
ankan
Posts: 77
Joined: Sun Apr 21, 2013 3:29 pm
Full name: Ankan Banerjee

Re: How good is the RTX 2080 Ti for Leela?

Post by ankan »

chrisw wrote: Sat Oct 27, 2018 11:28 am Oh, that's interesting. What does minibatch do? I guess it evaluates 512 positions as a batch?! But who provides the positions? I mean, if every time I want an evaluation I have to wait for somebody else to want another 511 positions before I get mine, I'm going to be doing a lot of stalling and waiting, or?
Yes, lc0's search algorithm tries to gather a batch of positions that are evaluated in a single call to the NN evaluation backend (i.e. one CPU->GPU->CPU trip per batch). The minibatch-size param controls the maximum size of the batch (i.e. it can be smaller if not enough positions are ready to be evaluated). I am not entirely familiar with the search algorithm; crem is the author. For details, see the GatherMinibatch() function in:
https://github.com/LeelaChessZero/lc0/b ... /search.cc
On one hand, small batch sizes are inefficient (especially on bigger GPUs); on the other hand, trying to gather a bigger batch at the cost of evaluating less interesting positions weakens the search. The latency of waiting for results (before the tree can be explored further) is also a bottleneck.

If the NN eval can be made faster for smaller batch sizes (ideally a batch size of 1), it would help a lot.
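
The gathering idea can be sketched roughly as below, with the caveat that the real GatherMinibatch() in search.cc is far more involved (it selects leaves by walking the tree); every name here is invented for illustration.

Code: Select all

#include <cstddef>
#include <cstdint>
#include <deque>
#include <iostream>
#include <vector>

// Invented stand-in for a position waiting on an NN evaluation.
struct Pending { uint64_t position_hash; };

// Fake backend call returning dummy values: in reality this is the one
// CPU->GPU->CPU round trip that evaluates the whole batch at once.
std::vector<float> EvaluateBatch(const std::vector<Pending>& batch) {
    return std::vector<float>(batch.size(), 0.0f);
}

// Gather up to max_batch pending positions and evaluate them in a single
// backend call. The batch may come out smaller than max_batch: if nothing
// else is ready, waiting for stragglers would only stall the search.
void GatherAndEvaluate(std::deque<Pending>& ready, std::size_t max_batch) {
    std::vector<Pending> batch;
    while (batch.size() < max_batch && !ready.empty()) {
        batch.push_back(ready.front());
        ready.pop_front();
    }
    if (batch.empty()) return;
    std::vector<float> values = EvaluateBatch(batch);
    std::cout << "evaluated " << values.size() << " positions in one trip\n";
}

int main() {
    std::deque<Pending> ready{{1}, {2}, {3}};
    GatherAndEvaluate(ready, 512);  // Sends a batch of 3, not 512.
}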
chrisw
Posts: 4313
Joined: Tue Apr 03, 2012 4:28 pm

Re: How good is the RTX 2080 Ti for Leela?

Post by chrisw »

ankan wrote: Sun Oct 28, 2018 5:27 am
chrisw wrote: Sat Oct 27, 2018 11:28 am Oh, that's interesting. What does minibatch do? I guess it evaluates 512 positions as a batch?! But who provides the positions? I mean, if every time I want an evaluation I have to wait for somebody else to want another 511 positions before I get mine, I'm going to be doing a lot of stalling and waiting, or?
Yes, lc0's search algorithm tries to gather a batch of positions that are evaluated in a single call to the NN evaluation backend (i.e. one CPU->GPU->CPU trip per batch). The minibatch-size param controls the maximum size of the batch (i.e. it can be smaller if not enough positions are ready to be evaluated). I am not entirely familiar with the search algorithm; crem is the author. For details, see the GatherMinibatch() function in:
https://github.com/LeelaChessZero/lc0/b ... /search.cc
On one hand, small batch sizes are inefficient (especially on bigger GPUs); on the other hand, trying to gather a bigger batch at the cost of evaluating less interesting positions weakens the search. The latency of waiting for results (before the tree can be explored further) is also a bottleneck.

If the NN eval can be made faster for smaller batch sizes (ideally a batch size of 1), it would help a lot.
Hmmm, complicated. This is my first pass at what seems to be happening, so possibly entirely wrong ...

It also seems connected to the nn-cache; is that hashed? Looks so.

On non-terminal nodes due for expansion, it seems to get all the children and add them to the batch list (512). It's chess, so that would add between 1 and 40 or so positions. I guess other parallel threads are doing the same.
When 512 is reached, all of them are sent to the NN, and the results get saved in the NN cache.

Then in the search, if a node is found in the nn-cache, the search will process it ...
if not, it is sent off to be batched as above.

There's presumably a gain from the batching.
And a loss, because not all the nodes sent to the nn-cache are going to be used: sending all children is a speculative send. Although there may/will be some cross-hashing from other parts of the tree and other threads, which is a bonus.

I guess the idea is that once the caching has got going, the search is not impeded by waiting for batches to process, because it keeps finding nodes that are already in the cache. Nevertheless, there is the inefficiency that never-to-be-used child nodes get batched and then cached.

(And some problems to deal with, like collisions and locks and so on.)

Does that make sense?
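
In code form, the flow described above might look roughly like this. This is a sketch of that reading, not lc0's actual implementation; every name below is invented and the stubs are toys.

Code: Select all

#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

struct Eval { float value; };

std::unordered_map<uint64_t, Eval> nn_cache;  // hashed, as guessed above
std::vector<uint64_t> batch;                  // positions awaiting NN eval
const std::size_t kBatchSize = 512;

// Toy stand-in: a real engine would generate the legal moves here,
// giving the 1-40 or so children of a chess position.
std::vector<uint64_t> ChildHashes(uint64_t node) {
    return {node * 2, node * 2 + 1};
}

// Toy stand-in for the GPU trip: one result per batched position, all
// saved into the cache, including children that may never be visited.
void SendBatchToNN() {
    for (uint64_t h : batch) nn_cache[h] = Eval{0.0f};
    batch.clear();
}

// Cache hit: the search processes the node immediately, no GPU wait.
// Cache miss: all children are queued speculatively for the next batch.
void Visit(uint64_t node_hash) {
    if (nn_cache.count(node_hash) != 0) {
        // ...process the cached eval and continue the search...
        return;
    }
    for (uint64_t child : ChildHashes(node_hash)) batch.push_back(child);
    if (batch.size() >= kBatchSize) SendBatchToNN();
}

int main() {
    Visit(1);          // miss: queues children 2 and 3
    SendBatchToNN();   // flush the partial batch into the cache
    Visit(2);          // hit: found in nn_cache, no NN trip needed
}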
Gian-Carlo Pascutto
Posts: 1243
Joined: Sat Dec 13, 2008 7:00 pm

Re: How good is the RTX 2080 Ti for Leela?

Post by Gian-Carlo Pascutto »

crem wrote: Thu Oct 25, 2018 10:27 am Updated information from Ankan on Discord:

Code: Select all

with cudnn 7.3 and 411.63 driver available at nvidia.com 
minibatch-size=512, network id: 11250, go nodes 1000000
v0.17, default values for all other settings
(2070 run was with v0.18.1 lc0 build but with same settings)
             fp32    fp16    
GTX 1080Ti:   8996     -
Titan V:     13295   29379
RTX 2070:     8841   23721
RTX 2080:     9708   26678
RTX 2080Ti:  12208   32472
I'm not sure, though, whether it's fair to have a v0.17 vs v0.18.1 comparison; I don't remember which changes landed in between.
Why is the RTX 2070 on par with the GTX 1080 Ti in fp32 mode?

It's 11 TFLOPS fp32 vs 7.5 TFLOPS fp32. The performance discrepancy is strange, unless the RTX is somehow much better at reaching its theoretical performance. Or it's the build difference (0.17 vs 0.18). It would be nice to get more reliable numbers.

Is Leela Chess Zero using fp16 and/or tensor mode in training (not inference!)? Or are you using fp32? (I saw that the RTX cards, unlike the V100, have gimped fp32 accumulate in the tensor cores, so it may not be useful for training on them even if it is on a V100.)

Does anyone have the numbers for having tensor cores enabled/disabled in fp16 mode? I know they aren't accurate enough to do Winograd, so there's a factor ~3 loss of efficiency for 3x3 convolution. I'm curious if there's still a worthwhile speedup left.