I am running LC on dual RTX 2080 Tis and have been suffering from reboots when both GPUs are in use over CUDA. Other apps, including Leela Zero (Go), which only uses OpenCL, and KataGo (both CUDA and OpenCL), do not cause this problem even though the power and temperature profiles they create appear similar.
One thing I have observed is two different entries under Clock Throttle Reasons if you run the query
nvidia-smi -q -d PERFORMANCE
As soon as LC's turn to move starts the 'SW Power Cap' goes Active.
I have noticed that often the power draw slightly exceeds the current limit of 250W, usually just by a handful of watts to 258W or so.
Also once the temperature hits about 84-85C I see
SW Thermal Slowdown Active.
I am not sure if these status reports are a red herring or not - the device is supposed to be good up to 89C and the hottest I've seen it is 86C. I'd be interested to hear if anyone else has seen these conditions, or any feedback on whether the throttling is anything to be concerned about from either a performance or a stability point of view. Obviously I would accept a bit less performance rather than have a tournament reboot halfway through.
I have had a spring clean inside the box and turned up the chassis fans to full blast. It is also possible to take over GPU fan management, but I have not done that so far.
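For anyone who wants to log these readings over a whole game rather than eyeballing `nvidia-smi -q -d PERFORMANCE`, here is a minimal sketch in Python. It assumes the `--query-gpu` field names below are accepted by your driver (check `nvidia-smi --help-query-gpu`, since the names can differ between driver versions). The helper names `parse_gpu_csv`, `query_gpus` and `over_limit` are made up for illustration.

```python
import csv
import io
import subprocess

# Assumed field names; verify them against `nvidia-smi --help-query-gpu`
# for the driver version actually installed.
FIELDS = [
    "index",
    "temperature.gpu",
    "power.draw",
    "power.limit",
    "clocks.sm",
    "clocks_throttle_reasons.sw_power_cap",
    "clocks_throttle_reasons.sw_thermal_slowdown",
]


def parse_gpu_csv(text):
    """Turn nvidia-smi CSV output (noheader, nounits) into one dict per GPU."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if not record:
            continue  # skip blank lines
        rows.append(dict(zip(FIELDS, (value.strip() for value in record))))
    return rows


def query_gpus():
    """Run nvidia-smi once and return the parsed rows (needs an NVIDIA driver)."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(FIELDS),
        "--format=csv,noheader,nounits",
    ]
    return parse_gpu_csv(subprocess.check_output(cmd, text=True))


def over_limit(row):
    """True when the reported draw exceeds the enforced power limit."""
    return float(row["power.draw"]) > float(row["power.limit"])
```

Calling `query_gpus()` every second or so from a loop and appending the rows to a file would give a record of temperature, draw and throttle flags leading up to a reboot.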
Leela and NVIDIA clock throttling (and overheating)
-
- Posts: 55
- Joined: Sun Feb 04, 2018 12:38 pm
- Location: UK
Leela and NVIDIA clock throttling (and overheating)
Author of the actively developed PSYCHO chess engine
-
- Posts: 6340
- Joined: Mon Mar 13, 2006 2:34 pm
- Location: Acworth, GA
Re: Leela and NVIDIA clock throttling (and overheating)
When I had that problem with dual GPUs it was because I needed a bigger PSU. Once I upgraded the Power Supply on my system the problem never happened again. Not sure if this is your issue as well, but wanted to give you a lead to follow up on.
"Good decisions come from experience, and experience comes from bad decisions."
__________________________________________________________________
Ted Summers
-
- Posts: 6442
- Joined: Tue Jan 09, 2007 12:31 am
- Location: PA USA
- Full name: Louis Zulli
Re: Leela and NVIDIA clock throttling (and overheating)
You can alter the default power limit (up to the value of Max Power Limit) using nvidia-smi:
-pl, --power-limit=POWER_LIMIT
Specifies maximum power limit in watts. Accepts integer and floating point numbers. Only on supported devices from Kepler family. Requires administrator privileges. Value needs to be between Min and Max Power Limit as reported by nvidia-smi.
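As a rough illustration of using that option safely, the sketch below builds the command only after checking the requested value against the Min and Max Power Limit, which `nvidia-smi -q -d POWER` reports. The function name `power_limit_command` and the numbers in the usage note are hypothetical, not values for any particular card.

```python
def power_limit_command(gpu_index, watts, min_watts, max_watts):
    """Build an nvidia-smi power-limit command after a range check.

    min_watts and max_watts should be the Min/Max Power Limit values
    that nvidia-smi reports for the card in question.
    """
    if not (min_watts <= watts <= max_watts):
        raise ValueError(
            f"{watts} W is outside the allowed range {min_watts}-{max_watts} W"
        )
    # -i selects the GPU, -pl sets the limit; sudo because the option
    # requires administrator privileges.
    return ["sudo", "nvidia-smi", "-i", str(gpu_index), "-pl", str(watts)]
```

For example, `power_limit_command(0, 225, 100, 280)` returns a command list that can be passed to `subprocess.run`. Lowering the limit a little on both GPUs trades some speed for less heat and less load on the PSU.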
-
- Posts: 55
- Joined: Sun Feb 04, 2018 12:38 pm
- Location: UK
Re: Leela and NVIDIA clock throttling (and overheating)
Thanks for the comments.
My power supply is 1000W and the machine was built around the dual GPUs so I'm hoping that suffices. It is sold as a deep learning workstation and I told them what GPU usage I expected from it.
The clock speed is mostly 1545MHz on gpu0 which is actually the official BOOST speed. It is higher (and cooler) on gpu1. They are not supposed to be overclocked so I'm not quite sure where the 'throttling' comes in (hardly ever goes below 1545). The Performance Level in Nvidia X Server config is level 3 with a max speed of 2100MHz.
-
- Posts: 3657
- Joined: Wed Nov 18, 2015 11:41 am
- Location: hungary
Re: Leela and NVIDIA clock throttling (and overheating)
If the machine reboots, it is quite possible that your system draws more than 1000 W or that the PSU is overheating.
Maybe it is a weaker sample.
The GPU temperatures are rather high (80-85 degrees Celsius), so GPU throttling is a normal phenomenon. This is one reason why I use Backend = Multiplexing with 4 threads for 2 GPUs.
-
- Posts: 4889
- Joined: Thu Mar 09, 2006 6:34 am
- Location: Pen Argyl, Pennsylvania
Re: Leela and NVIDIA clock throttling (and overheating)
You are running hot. I have dual 2060 Supers and my temps max out on the GPUs at 80C. They are pulling 350 watts combined. Your dual RTX 2080 Tis are pulling 550 watts. At full load my motherboard and CPU are pulling 430 watts, so I'm right at the sweet spot of 78% (780/1000). You could be underpowered. I max out my entire system with everything running at 79C (3970X). FWIW, my Fast Fritz nps is around 45k-50k on average from the opening position with 4 threads, multiplexing, FP16.
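The arithmetic behind that 78% figure can be written out as a small sketch. The helper names `psu_load_fraction` and `headroom_watts` are made up for illustration, and the only inputs are the wattages quoted in this post.

```python
def psu_load_fraction(component_watts, psu_watts):
    """Fraction of the PSU rating used by the listed component draws."""
    return sum(component_watts) / psu_watts


def headroom_watts(component_watts, psu_watts):
    """Watts left on the PSU rating after the listed component draws."""
    return psu_watts - sum(component_watts)


# Dual 2060 Super (350 W combined) plus motherboard and CPU (430 W) on 1000 W.
load = psu_load_fraction([350, 430], 1000)
print(f"{load:.0%}")

# Dual 2080 Ti at about 550 W leaves this much for CPU, board, drives and fans.
print(headroom_watts([550], 1000))
```

The second figure only says what is left for the rest of the system. The draw of the CPU and motherboard in the 2080 Ti machine is not given in this thread, so whether it fits inside that remainder has to be measured.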
-
- Posts: 204
- Joined: Tue Oct 15, 2013 2:34 am
- Location: US
- Full name: Mike Babigian
Re: Leela and NVIDIA clock throttling (and overheating)
If you are running 1 or 2 CPU threads to drive your 2080 TIs (ala LC0) then it is unlikely you are hitting the PSU limit; however, if you are also maxing your CPU, you can easily go over 1000W. Mike and I both have 3970X systems so they are a bit power hungry anyway, but if I run 32 threads and max my 2080TI, I can hit 780 watts out. If I run hyperthreads, even more. That doesn't leave enough headroom for an additional 2080TI with only a 1000W PSU (I have an AX1600i in preparation for two 3080TIs).
If you are not pushing the CPU, I suspect you are overheating one of the GPUs. To test, open your case completely and put the biggest house fan you own blowing on the system at max (I'm assuming you are air cooling your GPUs). Max out the GPUs as usual. You should stay below 80C. If there is no reboot, overheating is definitely the cause. If you reboot anyway, either the PSU is too small or one of the GPUs is defective.
Hope that helps,
Mike
“Censorship is telling a man he can't have a steak just because a baby can't chew it.” ― Mark Twain
-
- Posts: 87
- Joined: Sun Jun 15, 2014 6:40 am
- Location: New Zealand
- Full name: Graham O'Neill
Re: Leela and NVIDIA clock throttling (and overheating)
If it's a really hot day and I'm running LC0 I sometimes throttle back the GPU using MSI Afterburner. I've only got a GTX970 but I still get worried when it's running flat out.
https://www.msi.com/page/afterburner
It displays loads of monitoring data too which you might find useful.
-
- Posts: 3657
- Joined: Wed Nov 18, 2015 11:41 am
- Location: hungary
Re: Leela and NVIDIA clock throttling (and overheating)
GPU-Z is a good tool for monitoring the state of the GPUs.
It continuously measures a lot of parameters for each GPU.
-
- Posts: 55
- Joined: Sun Feb 04, 2018 12:38 pm
- Location: UK
Re: Leela and NVIDIA clock throttling (and overheating)
Thanks again, I did include some single threaded a-b engines as opponents in testing so it seems it isn't dependent on CPU usage. It even went down playing Sabrina.
Afterburner and GPU-Z are both only on Windows I believe, can someone recommend a good Linux tool?