
Correct LC0 syntax for multiple GPUs

Posted: Mon Dec 30, 2019 7:33 pm
by Dann Corbit
Here you can see how I attempted my UCI settings:
http://rybkaforum.net/cgi-bin/rybkaforu ... ?tid=33296

Here is the part that I find confusing:
How do I tell it how many threads to use? I see no place for that.
And after choosing multiplexing, how do I describe my GPUs to it correctly?
I have 2x 2080 Ti Supers.
When I try to describe it as I have seen in posts, nothing happens when I try to run it.
Do I need to make an ini file?

Re: Correct LC0 syntax for multiple GPUs

Posted: Mon Dec 30, 2019 7:42 pm
by AdminX
Dann Corbit wrote: Mon Dec 30, 2019 7:33 pm
How do I tell it how many threads to use? [...] Do I need to make an ini file?
Hi Dann,

For some reason it is not showing in Arena. You can set it in the lc0.config file.

Here is an example:

Code:

--weights=384x30-T40-1573.pb.gz
--backend=multiplexing
--backend-opts=(backend=cudnn-fp16,gpu=0),(backend=cudnn-fp16,gpu=1)
--threads=3
--show-wdl=true
--syzygy-paths=E:\SyzygyBases\DTZ_7;E:\SyzygyBases\DTZ_6;E:\SyzygyBases\DTZ_345;C:\SyzygyBases\WDL_7;E:\SyzygyBases\WDL_6;E:\SyzygyBases\WDL_345
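
(As far as I know, lc0 should pick up a file named lc0.config sitting next to the lc0 executable, one flag per line as above, so that takes the place of an ini file.)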

Re: Correct LC0 syntax for multiple GPUs

Posted: Mon Dec 30, 2019 8:08 pm
by Dann Corbit
Thanks, that is exactly what I needed.

Re: Correct LC0 syntax for multiple GPUs

Posted: Mon Dec 30, 2019 10:06 pm
by Dann Corbit
I also found out from this guide:
http://blog.lczero.org/2018/09/guide-se ... s-gui.html
how to supply the commands on the command line for various GUIs.

Re: Correct LC0 syntax for multiple GPUs

Posted: Tue Dec 31, 2019 7:19 am
by mwyoung
I was told that when running identical GPUs, it is better to use roundrobin or demux instead of multiplexing.

You might find this useful.

roundrobin
Can have multiple child backends. Alternates which backend each request is sent to: e.g. with 3 children, the 1st request goes to the 1st backend, the 2nd to the 2nd, then the 3rd, then 1st, 2nd, 3rd, 1st, ... and so on.
Somewhat similar to the multiplexing backend, but it doesn't combine/accumulate requests from different threads; it sends them on verbatim immediately. It also doesn't need to use any locks, which makes it a bit faster.

It's important for this backend that all child backends have the same speed (e.g. same GPU model, and none of them is throttled/overheated). Otherwise all backends will be slowed down to the slowest one. If you use non-uniform child backends, it's better to use multiplexing backend.

Options:
Takes a list of subdictionaries as options, and creates one child backend per dictionary. All subdictionary parameters are passed to those backends, but there is also one additional parameter:

backend=<string> (default: name of the subdictionary) Name of child backend to use.
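
For example, a two-GPU roundrobin setup could reuse the syntax from the multiplexing example above, just swapping the backend name (cudnn-fp16 and the GPU indices are carried over from that example, not mandated):

Code:

--backend=roundrobin
--backend-opts=(backend=cudnn-fp16,gpu=0),(backend=cudnn-fp16,gpu=1)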


demux
Does the opposite of what multiplexing does: takes the large batch that comes from search, splits it into smaller batches and sends them to the child backends to compute in parallel.
May be useful for multi-GPU configurations, or multicore CPU configurations too.

As with roundrobin backend, it's important that all child backends have the same performance, otherwise everyone will wait for the slowest one.

Options:

minimum-split-size=<int> (default: 0) Do not split the batch into sub-batches smaller than this.
Also takes a list of subdictionaries as options, and creates one child backend per dictionary. All subdictionary parameters are passed to those backends, but there are also additional parameters:
threads=<int> (default: 1) Number of eval threads allocated for this backend.
backend=<string> (default: name of the subdictionary) Name of child backend to use.
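
As a sketch, a two-GPU demux setup with a minimum split size might look like this (the value 256 is purely illustrative, not a tuned recommendation):

Code:

--backend=demux
--backend-opts=minimum-split-size=256,(backend=cudnn-fp16,gpu=0),(backend=cudnn-fp16,gpu=1)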


Backend configuration at competitions
Here is what we use in competitions (as far as I could find):
CCCC:
backend: demux
backend-opts: backend=cudnn-fp16,(gpu=0),(gpu=1),(gpu=2),(gpu=3)
TCEC:
backend: roundrobin (starting from current DivP; before it was multiplexing)
backend-opts: backend=cudnn-fp16,(gpu=0),(gpu=1)
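
Translated into a complete lc0.config for two identical GPUs like yours, the TCEC-style setup would be roughly this (weights file name and thread count reused from the example earlier in the thread):

Code:

--weights=384x30-T40-1573.pb.gz
--backend=roundrobin
--backend-opts=backend=cudnn-fp16,(gpu=0),(gpu=1)
--threads=3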

Re: Correct LC0 syntax for multiple GPUs

Posted: Tue Dec 31, 2019 9:03 am
by corres
mwyoung wrote: Tue Dec 31, 2019 7:19 am
I was told that when running identical GPUs, it is better to use roundrobin or demux instead of multiplexing. [...]
Some notes:
With well-tuned backends there is no important difference between them. In practice there is no way to prevent the GPUs from heating and throttling differently, so in the long run multiplexing gives the best stability and the best reproducibility for tests. With multiplexing you do not need GPUs chosen as a matched pair, and you can use different GPUs together.