Diagnosing NVIDIA GPU Issues

Jun 3, 2021 · 2 min read · Linux Windows NVidia drivers troubleshooting ·


A short note on troubleshooting and diagnosing issues with NVIDIA GPUs (NOTE: this covers GPUs, not GFX cards).

You can check for issues with NVIDIA SMI (`nvidia-smi`) or by looking for Xid errors in the OS event logs:

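Use NVIDIA SMI.
- Linux and VMware
  - nvidia-bug-report.sh
  Generates a diagnostic bundle (nvidia-bug-report.log.gz by default)
- Windows
  - "C:\Program Files\NVIDIA Corporation\NVSMI\nvidia-smi.exe" -q > nvidia.log
  Redirects the full query output to 'nvidia.log' (the quotes are needed because the path contains spaces)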
Xid errors - these are usually software related and can be found in the system's kernel log or event log.
- RHEL/CentOS-based distributions
  - sosreport
- Windows
  - Check the Windows System Event logs
- VMware
  - Gather the VMware support bundle.
NOTE: NVIDIA's documentation on Xid errors: https://docs.nvidia.com/deploy/xid-errors/index.html

Trying a different driver version can have a big impact on Xid errors, depending on the workload/application.
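
Once the kernel log or support bundle has been collected, the Xid lines can be pulled out with a short script. The following is a minimal sketch, not a definitive parser. It assumes the log lines look like `NVRM: Xid (PCI:0000:3b:00): 79, ...`, with the `PCI:` prefix optional. The function names and the sample log lines are made up for illustration.

```python
import re
from collections import Counter

# Assumed log shape: "NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, ..."
# Some driver versions omit the "PCI:" prefix, so it is optional here.
XID_PATTERN = re.compile(r"NVRM: Xid \((?:PCI:)?([0-9A-Fa-f:.]+)\): (\d+)")


def find_xid_events(log_text):
    """Return a list of (pci_address, xid_code) tuples found in log_text."""
    events = []
    for line in log_text.splitlines():
        match = XID_PATTERN.search(line)
        if match:
            events.append((match.group(1), int(match.group(2))))
    return events


def count_xid_codes(events):
    """Count how often each Xid code appears across all GPUs."""
    return Counter(code for _address, code in events)


# Illustrative sample lines (invented for this sketch, not real output).
SAMPLE_LOG = """\
[  100.000000] NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus.
[  101.000000] usb 1-1: new high-speed USB device number 2 using xhci_hcd
[  102.000000] NVRM: Xid (0000:af:00): 13, Graphics Exception
[  103.000000] NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus.
"""

if __name__ == "__main__":
    events = find_xid_events(SAMPLE_LOG)
    for code, count in sorted(count_xid_codes(events).items()):
        print(f"Xid {code}: {count} occurrence(s)")
```

On Linux the input would typically be the saved kernel log (for example the output of `dmesg`). The resulting Xid codes can then be looked up in the NVIDIA Xid document linked above.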

SMI Commands

Command	Explanation
`nvidia-smi -pl 100`	Set power limit to 100W
`nvidia-smi -q -d PAGE_RETIREMENT`	Queries the GPU memory for retired pages. If the output shows a single-bit ECC error (SBE) count of 100+ or a double-bit ECC error (DBE) count of 10+, replace the GPU and retest
`nvidia-smi --query-retired-pages=gpu_uuid,retired_pages.address,retired_pages.cause --format=csv`	View the current retired pages in CSV format; add redirection (`> retiredpages.csv`) to save the output for reading later
`nvidia-smi -q -d ECC`	Output error correction information
`nvidia-smi --query-gpu=timestamp,name,pci.bus_id,driver_version,pstate,pcie.link.gen.max,pcie.link.gen.current,temperature.gpu,utilization.gpu,utilization.memory,memory.total,memory.free,memory.used --format=csv -l 1`	Log performance state, PCIe link generation, temperature, GPU and memory utilization, and memory usage once per second (CSV format; add redirection to write to a CSV file)
`nvidia-smi --query-gpu=index,clocks_throttle_reasons.active,clocks_throttle_reasons.gpu_idle,clocks_throttle_reasons.applications_clocks_setting,clocks_throttle_reasons.sw_power_cap,clocks_throttle_reasons.hw_slowdown,clocks_throttle_reasons.hw_thermal_slowdown,clocks_throttle_reasons.hw_power_brake_slowdown,clocks_throttle_reasons.sync_boost --format=csv`	Check the reasons for power/clock throttling
`nvidia-smi -q -d SUPPORTED_CLOCKS`	Check supported GPU clocks
`nvidia-smi dmon -s pucvmet -i 0`	Monitor the status of the selected GPU (`-i 0` = GPU index 0). Metrics: p = power usage and temperature; u = utilization; c = processor and memory clocks; v = power and thermal violations; m = FB and BAR1 memory; e = ECC errors and PCIe replay errors; t = PCIe Rx and Tx throughput
`nvidia-smi pmon -i 0 -s u -o T`	'top'-like per-process monitor for the GPU (`-i 0` = GPU index 0). Columns: sm% = CUDA core utilization; mem% = sampled time ratio for memory operations; enc%/dec% = hardware encoder/decoder utilization; fb = FB memory usage (add `m` to `-s` to display it)
`nvidia-smi topo -m`	Output GPU topology information
`nvidia-smi topo -p2p rwnap`	Check peer to peer access
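
The `--format=csv` queries above are easy to post-process once they have been redirected to a file. The following is a minimal sketch under two assumptions: fields are separated by a comma and a space, and header names may carry a unit suffix such as ` [%]` or ` [MiB]`. The function names, the sample data and the 85 °C threshold are hypothetical and chosen only for illustration.

```python
import csv
import io


def parse_query_csv(text):
    """Parse `nvidia-smi --query-gpu=... --format=csv` output into dicts.

    Header names lose any trailing unit such as " [%]" or " [MiB]";
    values are kept as stripped strings.
    """
    reader = csv.reader(io.StringIO(text), skipinitialspace=True)
    rows = [row for row in reader if row]  # skip blank lines
    if not rows:
        return []
    header = [name.split(" [")[0].strip() for name in rows[0]]
    return [dict(zip(header, (v.strip() for v in row))) for row in rows[1:]]


def leading_number(value):
    """Return the leading number of a value such as '83 %', or None."""
    parts = value.split()
    try:
        return float(parts[0]) if parts else None
    except ValueError:
        return None  # e.g. "[N/A]"


def hot_gpus(records, limit_c=85.0):
    """Return records whose temperature.gpu exceeds limit_c.

    limit_c is a hypothetical threshold, not an NVIDIA recommendation.
    """
    hot = []
    for rec in records:
        temp = leading_number(rec.get("temperature.gpu", ""))
        if temp is not None and temp > limit_c:
            hot.append(rec)
    return hot


# Illustrative sample data (invented for this sketch, not real output).
SAMPLE_CSV = """\
name, pci.bus_id, temperature.gpu, utilization.gpu [%], memory.used [MiB]
Example GPU A, 00000000:3B:00.0, 45, 10 %, 160 MiB
Example GPU B, 00000000:AF:00.0, 91, 99 %, 15000 MiB
"""

if __name__ == "__main__":
    for rec in hot_gpus(parse_query_csv(SAMPLE_CSV)):
        print(f"{rec['name']} ({rec['pci.bus_id']}): {rec['temperature.gpu']} C")
```

Values are kept as strings and converted only on demand because some fields can report a non-numeric placeholder on GPUs that do not support them.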
