Leela and NVIDIA clock throttling (and overheating)

IanKennedy
Posts: 55
Joined: Sun Feb 04, 2018 12:38 pm
Location: UK

Post by IanKennedy »

I am running Lc0 on dual RTX 2080 Tis and have been suffering from reboots when both GPUs are in use over CUDA. Other apps, including Leela Zero (Go), which only uses OpenCL, and KataGo (both CUDA and OpenCL), do not cause this problem even though the power and temperature profiles they create appear similar.

One thing I have observed is two different entries under Clocks Throttle Reasons when running the query:

nvidia-smi -q -d PERFORMANCE

As soon as Lc0's turn to move starts, 'SW Power Cap' goes Active.
I have noticed that the power draw often slightly exceeds the current limit of 250W, usually by just a handful of watts, to 258W or so.

Also, once the temperature hits about 84-85C, I see

SW Thermal Slowdown Active.

I am not sure whether these status reports are a red herring - the device is supposed to be good up to 89C and the hottest I've seen it is 86C. I'd be interested to hear if anyone else sees these conditions, or to get any feedback on whether the throttling is anything to be concerned about from either a performance or a stability point of view. Obviously I would accept a bit less performance rather than have a tournament reboot halfway through.
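To log these flags over a whole game rather than read them by eye, something like the following Python sketch could pull the entries marked Active out of the report. The line layout it expects (an unindented `GPU <bus id>` line followed by `name : value` rows) is an assumption based on typical output and may differ between driver versions; the function names are made up for illustration.

```python
import subprocess
from typing import Dict, List


def active_throttle_reasons(report: str) -> Dict[str, List[str]]:
    """Map each GPU's bus id to the report rows whose value is 'Active'.

    Assumes each GPU section starts with an unindented 'GPU <bus id>' line
    and that throttle reasons appear as 'Name : Active' / 'Name : Not Active'.
    """
    reasons: Dict[str, List[str]] = {}
    current = None
    for line in report.splitlines():
        if line.startswith("GPU "):
            current = line[4:].strip()
            reasons[current] = []
        elif current is not None and ":" in line:
            name, _, value = line.rpartition(":")
            if value.strip() == "Active":
                reasons[current].append(name.strip())
    return reasons


def read_performance_report() -> str:
    """Run the same query as above and return its text (needs nvidia-smi on PATH)."""
    result = subprocess.run(
        ["nvidia-smi", "-q", "-d", "PERFORMANCE"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Typical use on the machine with the GPUs:
#   print(active_throttle_reasons(read_performance_report()))
```

Calling it once a second during a game and writing the result to a file would show which flags were set just before a reboot.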

I have had a spring clean inside the box and turned up the chassis fans to full blast. It is also possible to take over GPU fan management, but I have not done that so far.
Author of the actively developed PSYCHO chess engine
AdminX
Posts: 6340
Joined: Mon Mar 13, 2006 2:34 pm
Location: Acworth, GA

Re: Leela and NVIDIA clock throttling (and overheating)

Post by AdminX »

When I had that problem with dual GPUs it was because I needed a bigger PSU. Once I upgraded the Power Supply on my system the problem never happened again. Not sure if this is your issue as well, but wanted to give you a lead to follow up on.
Ted Summers
zullil
Posts: 6442
Joined: Tue Jan 09, 2007 12:31 am
Location: PA USA
Full name: Louis Zulli

Post by zullil »

IanKennedy wrote: Wed May 20, 2020 12:07 pm I am running Lc0 on dual RTX 2080 Tis and have been suffering from reboots when both GPUs are in use over CUDA. Other apps, including Leela Zero (Go), which only uses OpenCL, and KataGo (both CUDA and OpenCL), do not cause this problem even though the power and temperature profiles they create appear similar.
You can alter the default power limit (up to the value of Max Power Limit) using nvidia-smi:

-pl, --power-limit=POWER_LIMIT
Specifies maximum power limit in watts. Accepts integer and floating point numbers. Only on supported devices from Kepler family. Requires administrator privileges. Value needs to be between Min and Max Power Limit as reported by nvidia-smi.
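As a rough illustration of how this could be scripted, the Python sketch below builds the command for one GPU and refuses values outside the allowed range. The helper name and the 100 W / 300 W bounds in the usage comment are placeholders, not values for any particular card; the real Min and Max Power Limit should be read from `nvidia-smi -q -d POWER`.

```python
import subprocess
from typing import List


def power_limit_command(gpu_index: int, watts: float,
                        min_watts: float, max_watts: float) -> List[str]:
    """Build an 'nvidia-smi -i <gpu> -pl <watts>' command line.

    min_watts / max_watts must be the Min and Max Power Limit that
    nvidia-smi reports for that GPU; they are not looked up here.
    """
    if not min_watts <= watts <= max_watts:
        raise ValueError(
            f"{watts} W is outside the allowed range {min_watts}-{max_watts} W"
        )
    return ["nvidia-smi", "-i", str(gpu_index), "-pl", f"{watts:g}"]


def apply_power_limit(command: List[str]) -> None:
    """Run the command; this needs administrator privileges (e.g. via sudo)."""
    subprocess.run(command, check=True)


# Example with placeholder bounds: cap GPU 0 at 225 W instead of 250 W.
#   apply_power_limit(power_limit_command(0, 225, 100, 300))
```

Setting the limit below the default trades some speed for lower power draw and heat, which may be acceptable if stability matters more than performance.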
IanKennedy

Post by IanKennedy »

Thanks for the comments.

My power supply is 1000W and the machine was built around the dual GPUs so I'm hoping that suffices. It is sold as a deep learning workstation and I told them what GPU usage I expected from it.

The clock speed is mostly 1545MHz on gpu0 which is actually the official BOOST speed. It is higher (and cooler) on gpu1. They are not supposed to be overclocked so I'm not quite sure where the 'throttling' comes in (hardly ever goes below 1545). The Performance Level in Nvidia X Server config is level 3 with a max speed of 2100MHz.
corres
Posts: 3657
Joined: Wed Nov 18, 2015 11:41 am
Location: hungary

Post by corres »

IanKennedy wrote: Wed May 20, 2020 2:53 pm Thanks for the comments.

My power supply is 1000W and the machine was built around the dual GPUs so I'm hoping that suffices. It is sold as a deep learning workstation and I told them what GPU usage I expected from it.

The clock speed is mostly 1545MHz on gpu0 which is actually the official BOOST speed. It is higher (and cooler) on gpu1. They are not supposed to be overclocked so I'm not quite sure where the 'throttling' comes in (hardly ever goes below 1545). The Performance Level in Nvidia X Server config is level 3 with a max speed of 2100MHz.
If the system reboots, it is quite possible that it draws more than 1000 W or that the PSU is overheating.
The PSU may also be a weaker sample.
The GPU temperatures are rather high (80-85 degrees Celsius), so GPU throttling is normal. This is one reason why I use Backend = Multiplexing with 4 threads for 2 GPUs.
MikeB
Posts: 4889
Joined: Thu Mar 09, 2006 6:34 am
Location: Pen Argyl, Pennsylvania

Post by MikeB »

IanKennedy wrote: Wed May 20, 2020 12:07 pm I am running Lc0 on dual RTX 2080 Tis and have been suffering from reboots when both GPUs are in use over CUDA. Other apps, including Leela Zero (Go), which only uses OpenCL, and KataGo (both CUDA and OpenCL), do not cause this problem even though the power and temperature profiles they create appear similar.
You are running hot. I have dual 2060 Supers, and my GPU temperatures max out at 80C. They pull 350 watts combined; your dual RTX 2080 Tis are pulling 550 watts. At full load my motherboard and CPU pull 430 watts, so I'm right at the sweet spot of 78% (780/1000). You could be underpowered. I max out my entire system with everything running at 79C (3970X). FWIW, my Fast Fritz speed averages around 45k-50k nps from the opening position with 4 threads, multiplexing, FP16.
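The arithmetic behind that 78% figure can be written out as a small Python sketch. The function names are invented for illustration, and treating 78% as a target load is only this post's rule of thumb, not a PSU specification.

```python
from typing import Iterable


def psu_load_fraction(draws_watts: Iterable[float], psu_watts: float) -> float:
    """Fraction of the PSU rating used by the listed component draws."""
    return sum(draws_watts) / psu_watts


def budget_for_rest(gpu_watts: float, psu_watts: float,
                    target_fraction: float) -> float:
    """Watts left for CPU, motherboard, drives and fans at a target load."""
    return psu_watts * target_fraction - gpu_watts


# Dual 2060 Super (350 W) plus motherboard and CPU (430 W) on a 1000 W PSU.
mine = psu_load_fraction([350, 430], 1000)

# Dual 2080 Ti (550 W) on a 1000 W PSU, aiming for the same 78% load:
# whatever remains is the budget for everything that is not a GPU.
rest = budget_for_rest(550, 1000, 0.78)

print(f"load: {mine:.0%}, budget for the rest at 550 W of GPUs: {rest:.0f} W")
```

With 550 W going to the GPUs, the rest of the system would have to stay near 230 W to sit at the same load, which leaves much less margin than in the 2060 Super build.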
mbabigian
Posts: 204
Joined: Tue Oct 15, 2013 2:34 am
Location: US
Full name: Mike Babigian

Post by mbabigian »

If you are running 1 or 2 CPU threads to drive your 2080 Tis (à la Lc0), then it is unlikely you are hitting the PSU limit; however, if you are also maxing out your CPU, you can easily go over 1000W. Mike and I both have 3970X systems, so they are a bit power hungry anyway, but if I run 32 threads and max out my 2080 Ti, I can hit 780 watts out. If I run hyperthreads, even more. That doesn't leave enough headroom for an additional 2080 Ti with only a 1000W PSU (I have an AX1600i in preparation for two 3080 Tis).

If you are not pushing the CPU, I suspect you are overheating one of the GPUs. To test, open your case completely and put the biggest house fan you own blowing on the system at max (I'm assuming you are air cooling your GPUs). Max out the GPUs as usual. You should stay below 80C. If there is no reboot, you are definitely overheating. If it reboots anyway, either the PSU is too small or one of the GPUs is defective.
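The reasoning above amounts to a short decision procedure, sketched here in Python. The enum and function names are invented for illustration, and the outcome is only as reliable as the open-case test itself.

```python
from enum import Enum


class LikelyCause(Enum):
    PSU_LIMIT_WITH_CPU_LOAD = "CPU plus both GPUs may exceed the PSU rating"
    OVERHEATING = "GPUs overheat inside the closed case"
    PSU_OR_DEFECTIVE_GPU = "PSU too small or one GPU defective"


def diagnose(cpu_maxed: bool, reboots_with_open_case: bool) -> LikelyCause:
    """Encode the test described above.

    cpu_maxed: the CPU is fully loaded alongside the GPUs.
    reboots_with_open_case: the machine still reboots with the case open
    and a large fan blowing on it while the GPUs run flat out.
    """
    if cpu_maxed:
        return LikelyCause.PSU_LIMIT_WITH_CPU_LOAD
    if reboots_with_open_case:
        return LikelyCause.PSU_OR_DEFECTIVE_GPU
    return LikelyCause.OVERHEATING


# Light CPU load, and the open-case run survives without a reboot:
print(diagnose(cpu_maxed=False, reboots_with_open_case=False).value)
```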

Hope that helps,
Mike
GONeill
Posts: 87
Joined: Sun Jun 15, 2014 6:40 am
Location: New Zealand
Full name: Graham O'Neill

Post by GONeill »

If it's a really hot day and I'm running LC0 I sometimes throttle back the GPU using MSI Afterburner. I've only got a GTX970 but I still get worried when it's running flat out.

https://www.msi.com/page/afterburner

It displays loads of monitoring data too which you might find useful.
corres

Post by corres »

GONeill wrote: Thu May 21, 2020 7:23 am If it's a really hot day and I'm running LC0 I sometimes throttle back the GPU using MSI Afterburner. I've only got a GTX970 but I still get worried when it's running flat out.
https://www.msi.com/page/afterburner
It displays loads of monitoring data too which you might find useful.
A good tool for monitoring the state of GPUs is GPU-Z.
It continuously measures many parameters of each GPU.
IanKennedy

Post by IanKennedy »

Thanks again. I did include some single-threaded alpha-beta engines as opponents in testing, so it seems it isn't dependent on CPU usage. It even went down playing Sabrina.

Afterburner and GPU-Z are both Windows-only, I believe. Can someone recommend a good Linux tool?
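One option that needs nothing beyond the driver is nvidia-smi itself, polled through its `--query-gpu` CSV output. The Python sketch below is only a starting point: the `Sample` type, the function names and the one-second interval are arbitrary choices, and it assumes every queried field returns a number (a card that reports `[N/A]` for a field would need extra handling).

```python
import subprocess
import time
from typing import List, NamedTuple

FIELDS = "index,temperature.gpu,power.draw,power.limit,clocks.sm"


class Sample(NamedTuple):
    index: int
    temp_c: float
    power_w: float
    limit_w: float
    sm_mhz: float


def parse_samples(csv_text: str) -> List[Sample]:
    """Parse '--format=csv,noheader,nounits' output, one line per GPU."""
    samples = []
    for line in csv_text.strip().splitlines():
        idx, temp, power, limit, clock = (f.strip() for f in line.split(","))
        samples.append(
            Sample(int(idx), float(temp), float(power), float(limit), float(clock))
        )
    return samples


def query_gpus() -> List[Sample]:
    """Ask nvidia-smi for one reading per GPU (needs nvidia-smi on PATH)."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={FIELDS}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_samples(result.stdout)


def watch(interval_s: float = 1.0) -> None:
    """Print a line per GPU every interval, flagging draw above the limit."""
    while True:
        for s in query_gpus():
            flag = "  OVER LIMIT" if s.power_w > s.limit_w else ""
            print(f"gpu{s.index}: {s.temp_c:.0f}C {s.power_w:.1f}/"
                  f"{s.limit_w:.1f}W {s.sm_mhz:.0f}MHz{flag}", flush=True)
        time.sleep(interval_s)


# Start monitoring with: watch()
```

Redirecting the output to a file keeps a record that survives a reboot, so the last readings before the crash can be inspected afterwards.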