Diagnosing NVIDIA GPU Issues
A short note on how to troubleshoot and diagnose issues with NVIDIA GPUs (NOTE: this applies to GPUs, not graphics cards).
You can check for issues by using nvidia-smi or by looking for Xid errors in the OS kernel/event logs:
Use nvidia-smi.
Linux and VMware
nvidia-bug-report.sh
Generates a diagnostic bundle
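A minimal sketch of gathering the bundle on a Linux host (assumes nvidia-bug-report.sh is on the PATH, as with a standard driver install, and that it runs as root so kernel logs can be collected; the output directory is an arbitrary choice):

```bash
#!/usr/bin/env bash
# Minimal sketch: collect an NVIDIA diagnostic bundle on a Linux host.
# nvidia-bug-report.sh ships with the driver and writes
# nvidia-bug-report.log.gz into the current working directory.
set -euo pipefail

outdir="/var/tmp/nvidia-diag-$(date +%Y%m%d-%H%M%S)"   # arbitrary location
mkdir -p "$outdir"
cd "$outdir"

nvidia-bug-report.sh

echo "Diagnostic bundle written to: $outdir/nvidia-bug-report.log.gz"
```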
Windows
c:\Program Files\NVIDIA Corporation\NVSMI\nvidia-smi.exe -q > nvidia.log
Redirects the output to 'nvidia.log'
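The same full query can be captured on Linux (or the ESXi shell), where nvidia-smi is normally already on the PATH; a minimal sketch, with an arbitrary timestamped filename:

```bash
# Minimal sketch: capture the full nvidia-smi device query on Linux/ESXi.
# -q dumps the full query; the filename is an arbitrary choice.
nvidia-smi -q > "nvidia-$(date +%Y%m%d-%H%M%S).log"
```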
Xid errors - these are error reports from the NVIDIA driver, often software/driver related, and can be found in the system's kernel or event log.
- For RHEL/CentOS-based distributions
  sosreport
- Windows
  - Check the Windows System event log
- VMware
  - Gather the VMware support bundle.
NOTE: NVIDIA's document on Xid errors: https://docs.nvidia.com/deploy/xid-errors/index.html. Trying different driver versions can have a big impact here, depending on the workload/application.
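A minimal sketch for searching a Linux host's kernel log for Xid messages (journalctl assumes a systemd-based distribution; /var/log/messages is the traditional RHEL/CentOS location):

```bash
# Minimal sketch: search for NVRM Xid messages in the kernel log on Linux.
# Xid lines look roughly like: NVRM: Xid (PCI:0000:3b:00): 79, ...
dmesg -T | grep -i 'xid' || true

# On systemd-based distributions, the kernel journal can be searched as well.
journalctl -k --no-pager | grep -i 'xid' || true

# Traditional syslog location on RHEL/CentOS-style systems.
grep -i 'xid' /var/log/messages 2>/dev/null || true
```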
SMI Commands
Command | Explanation |
---|---|
nvidia-smi -pl 100 | Set power limit to 100W |
nvidia-smi -q -d PAGE_RETIREMENT | Queries the GPU memory to check for retired pages. If the table shows an SBE rate of 100+ or a DBE rate of 10+, replace the GPU and test |
nvidia-smi --query-retired-pages=gpu_uuid,retired_pages.address,retired_pages.cause --format=csv | View current retired pages in CSV format - could use redirection ( > retiredpages.csv ) to save for reading later |
nvidia-smi -q -d ECC | Output error correction information |
nvidia-smi --query-gpu=timestamp,name,pci.bus_id,driver_version,pstate,pcie.link.gen.max,pcie.link.gen.current,temperature.gpu,utilization.gpu,utilization.memory,memory.used,memory.free --format=csv -l 1 | Check power state, temperature, core/memory utilization and memory usage every second (CSV format; could use redirection to save to a file - see the collection sketch after this table) |
nvidia-smi --query-gpu=index,clocks_throttle_reasons.active,clocks_throttle_reasons.gpu_idle,clocks_throttle_reasons.applications_clocks_setting,clocks_throttle_reasons.sw_power_cap,clocks_throttle_reasons.hw_slowdown,clocks_throttle_reasons.hw_thermal_slowdown,clocks_throttle_reasons.hw_power_brake_slowdown,clocks_throttle_reasons.sync_boost --format=csv | Check reasons for power/clock throttle |
nvidia-smi -q -d SUPPORTED_CLOCKS | Check supported GPU clocks |
nvidia-smi dmon -s pucvmet -i 0 | Probe selected GPU status (-i 0 selects GPU index 0). Metric groups: p = power usage and temperature; u = utilization; c = processor and memory clocks; v = power and thermal violations; m = FB and BAR1 memory; e = ECC errors and PCIe replay errors; t = PCIe Rx and Tx throughput |
nvidia-smi pmon -i 0 -s u -o T | 'top'-like per-process view for a GPU (-i 0 selects GPU index 0). Columns: sm% = CUDA core utilization; mem% = sampled time ratio for memory operations; enc%/dec% = hardware encoder/decoder utilization; fb = FB memory usage |
nvidia-smi topo -m | Output GPU topology information |
nvidia-smi topo -p2p rwnap | Check peer-to-peer access |
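To keep a rolling record while reproducing a problem, a few of the queries above can be wrapped in a small collection script; a minimal sketch, where the sampling interval, duration, and output filenames are arbitrary choices:

```bash
#!/usr/bin/env bash
# Minimal sketch: snapshot GPU health counters and log utilization while
# reproducing an issue. Interval, duration and filenames are arbitrary.
set -euo pipefail

interval=5      # seconds between samples
duration=600    # total seconds to sample

# One-off snapshots: retired pages (CSV) and ECC counters.
nvidia-smi --query-retired-pages=gpu_uuid,retired_pages.address,retired_pages.cause \
           --format=csv > retiredpages.csv
nvidia-smi -q -d ECC > ecc.log

# Rolling log: -l repeats the query every $interval seconds; timeout stops it.
timeout "$duration" nvidia-smi \
    --query-gpu=timestamp,name,pci.bus_id,driver_version,pstate,temperature.gpu,utilization.gpu,utilization.memory,memory.used,memory.free \
    --format=csv -l "$interval" > utilization.csv || true
```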