Correct LC0 syntax for multiple GPUs

Discussion of anything and everything relating to chess playing software and machines.

Moderators: hgm, Rebel, chrisw

Dann Corbit
Posts: 12540
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Correct LC0 syntax for multiple GPUs

Post by Dann Corbit »

Here you can see how I attempted my UCI settings:
http://rybkaforum.net/cgi-bin/rybkaforu ... ?tid=33296

Here is the part that I find confusing:
How do I tell it how many threads to use? I see no place for that.
And after choosing multiplexing, how do I describe my GPUs to it correctly?
I have 2x 2080 Ti Supers.
When I try to describe them as I have seen in posts, nothing happens when I try to run it.
Do I need to make an ini file?
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
AdminX
Posts: 6340
Joined: Mon Mar 13, 2006 2:34 pm
Location: Acworth, GA

Re: Correct LC0 syntax for multiple GPUs

Post by AdminX »

Dann Corbit wrote: Mon Dec 30, 2019 7:33 pm Here you can see how I attempted my UCI settings:
http://rybkaforum.net/cgi-bin/rybkaforu ... ?tid=33296

Here is the part that I find confusing:
How do I tell it how many threads to use? I see no place for that.
And after choosing multiplexing, how do I describe my GPUs to it correctly?
I have 2x 2080 Ti Supers.
When I try to describe them as I have seen in posts, nothing happens when I try to run it.
Do I need to make an ini file?
Hi Dann,

For some reason it is not showing in Arena. You can set it in the lc0.config file.

Here is an example:

Code: Select all

--weights=384x30-T40-1573.pb.gz
--backend=multiplexing
--backend-opts=(backend=cudnn-fp16,gpu=0),(backend=cudnn-fp16,gpu=1)
--threads=3
--show-wdl=true
--syzygy-paths=E:\SyzygyBases\DTZ_7;E:\SyzygyBases\DTZ_6;E:\SyzygyBases\DTZ_345;C:\SyzygyBases\WDL_7;E:\SyzygyBases\WDL_6;E:\SyzygyBases\WDL_345
"Good decisions come from experience, and experience comes from bad decisions."
__________________________________________________________________
Ted Summers
Dann Corbit
Posts: 12540
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Correct LC0 syntax for multiple GPUs

Post by Dann Corbit »

Thanks, that is exactly what I needed.
Dann Corbit
Posts: 12540
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Correct LC0 syntax for multiple GPUs

Post by Dann Corbit »

I also found in this guide:
http://blog.lczero.org/2018/09/guide-se ... s-gui.html
how to supply the options on the command line for various GUIs.
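For reference, passing the same options on the command line might look like this; the weights file name is taken from the example above, and depending on the GUI, the whole argument string may need to be quoted:

Code: Select all

lc0 --weights=384x30-T40-1573.pb.gz --backend=multiplexing --backend-opts=(backend=cudnn-fp16,gpu=0),(backend=cudnn-fp16,gpu=1)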
mwyoung
Posts: 2727
Joined: Wed May 12, 2010 10:00 pm

Re: Correct LC0 syntax for multiple GPUs

Post by mwyoung »

I was told that when running identical GPUs, it is better to use roundrobin or demux instead of multiplexing.

You might find this useful.

roundrobin
Can have multiple child backends. It alternates which backend each request is sent to: e.g. with three children, the 1st request goes to the 1st backend, the 2nd to the 2nd, the 3rd to the 3rd, then the 4th back to the 1st, and so on.
Somewhat similar to the multiplexing backend, but it doesn't combine/accumulate requests from different threads; instead it sends them on immediately, verbatim. It also doesn't need to use any locks, which makes it a bit faster.

It's important for this backend that all child backends have the same speed (e.g. same GPU model, and none of them is throttled/overheated). Otherwise all backends will be slowed down to the slowest one. If you use non-uniform child backends, it's better to use multiplexing backend.

Options:
Takes a list of subdictionaries as options, and creates one child backend per dictionary. All subdictionary parameters are passed to those backends, but there is also one additional parameter:

backend=<string> (default: name of the subdictionary) Name of child backend to use.
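Put together in the same lc0.config style as the example earlier in the thread, a roundrobin setup for two identical GPUs might look like this (the backend names and GPU indices are illustrative):

Code: Select all

--backend=roundrobin
--backend-opts=(backend=cudnn-fp16,gpu=0),(backend=cudnn-fp16,gpu=1)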


demux
Does the opposite of what multiplexing does: it takes the large batch coming from search, splits it into smaller batches, and sends them to the child backends to compute in parallel.
May be useful for multi-GPU configurations, or for multicore CPU configurations too.

As with roundrobin backend, it's important that all child backends have the same performance, otherwise everyone will wait for the slowest one.

Options:

minimum-split-size=<int> (default: 0) Do not split the batch into sub-batches smaller than this.
It also takes a list of subdictionaries as options, and creates one child backend per dictionary. All subdictionary parameters are passed to those backends, but there are also additional parameters:
threads=<int> (default: 1) Number of eval threads allocated for this backend.
backend=<string> (default: name of the subdictionary) Name of child backend to use.
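A corresponding demux sketch for two identical GPUs, again in lc0.config style (the minimum-split-size value here is just an example, not a recommendation):

Code: Select all

--backend=demux
--backend-opts=minimum-split-size=8,(backend=cudnn-fp16,gpu=0),(backend=cudnn-fp16,gpu=1)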


Backend configuration at competitions
Here is what we use in competitions (as far as I could find):
CCCC:
backend: demux
backend-opts: backend=cudnn-fp16,(gpu=0),(gpu=1),(gpu=2),(gpu=3)
TCEC:
backend: roundrobin (starting from current DivP; before it was multiplexing)
backend-opts: backend=cudnn-fp16,(gpu=0),(gpu=1)
"The worst thing that can happen to a forum is a running wild attacking moderator(HGM) who is not corrected by the community." - Ed Schröder
But my words like silent raindrops fell. And echoed in the wells of silence.
corres
Posts: 3657
Joined: Wed Nov 18, 2015 11:41 am
Location: hungary

Re: Correct LC0 syntax for multiple GPUs

Post by corres »

mwyoung wrote: Tue Dec 31, 2019 7:19 am I was told that when running identical GPUs, it is better to use roundrobin or demux instead of multiplexing.
You might find this useful.
roundrobin
Can have multiple child backends. It alternates which backend each request is sent to: e.g. with three children, the 1st request goes to the 1st backend, the 2nd to the 2nd, the 3rd to the 3rd, then the 4th back to the 1st, and so on.
Somewhat similar to the multiplexing backend, but it doesn't combine/accumulate requests from different threads; instead it sends them on immediately, verbatim. It also doesn't need to use any locks, which makes it a bit faster.
It's important for this backend that all child backends have the same speed (e.g. same GPU model, and none of them is throttled/overheated). Otherwise all backends will be slowed down to the slowest one. If you use non-uniform child backends, it's better to use multiplexing backend.
Options:
Takes a list of subdictionaries as options, and creates one child backend per dictionary. All subdictionary parameters are passed to those backends, but there is also one additional parameter:
backend=<string> (default: name of the subdictionary) Name of child backend to use.
demux
Does the opposite of what multiplexing does: it takes the large batch coming from search, splits it into smaller batches, and sends them to the child backends to compute in parallel.
May be useful for multi-GPU configurations, or for multicore CPU configurations too.
As with roundrobin backend, it's important that all child backends have the same performance, otherwise everyone will wait for the slowest one.
Options:
minimum-split-size=<int> (default: 0) Do not split the batch into sub-batches smaller than this.
It also takes a list of subdictionaries as options, and creates one child backend per dictionary. All subdictionary parameters are passed to those backends, but there are also additional parameters:
threads=<int> (default: 1) Number of eval threads allocated for this backend.
backend=<string> (default: name of the subdictionary) Name of child backend to use.
Backend configuration at competitions
Here is what we use in competitions (as far as I could find):
CCCC:
backend: demux
backend-opts: backend=cudnn-fp16,(gpu=0),(gpu=1),(gpu=2),(gpu=3)
TCEC:
backend: roundrobin (starting from current DivP; before it was multiplexing)
backend-opts: backend=cudnn-fp16,(gpu=0),(gpu=1)
Some notes:
With well-tuned backends there is no significant difference between them.
In practice there is no way to prevent the GPUs from heating differently and throttling differently.
So in the long run, multiplexing gives the best stability and the best reproducibility for tests.
With multiplexing you do not need a matched pair of GPUs, and you can use different GPUs together.
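Following AdminX's config example above, a multiplexing setup that mixes two different GPUs might look like this; the second card is assumed here to be an older one without fast fp16 support, hence the plain cudnn backend for it:

Code: Select all

--backend=multiplexing
--backend-opts=(backend=cudnn-fp16,gpu=0),(backend=cudnn,gpu=1)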