Diagnosing NVIDIA GPU Issues

A small note on what you can do to troubleshoot and diagnose issues with NVIDIA GPUs (NOTE: this covers GPUs, i.e. compute accelerators, not consumer graphics cards).

You can check for issues by using nvidia-smi or by looking for Xid errors in the OS kernel/event logs:

  1. Use nvidia-smi.

    • Linux and VMware
      • nvidia-bug-report.sh

      Generates a diagnostic bundle

    • Windows
      • c:\Program Files\NVIDIA Corporation\NVSMI\nvidia-smi.exe -q > nvidia.log

      Redirects the output to 'nvidia.log'

  2. Check for Xid errors - these are usually software related and can be found in the system's kernel log or event log.

    • RHEL/CentOS-based distributions
      • sosreport
    • Windows
      • Check the Windows System Event logs
    • VMware
      • Gather the VMware support bundle.

    NOTE: NVIDIA's document on Xid errors: https://docs.nvidia.com/deploy/xid-errors/index.html. Trying different driver versions can have a significant impact, depending on the workload/application. A quick way to scan the Linux kernel log for Xid messages is sketched below.
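
On Linux, Xid events typically appear in the kernel log as lines containing "NVRM: Xid". Before pulling a full sosreport or support bundle, a quick scan of the kernel log can confirm whether any Xid events have been recorded. The following is a minimal Python sketch, not an official tool: it assumes dmesg is readable by the current user, and the /var/log/kern.log fallback path is an assumption that varies by distribution.

  # xid_scan.py - minimal sketch: scan the Linux kernel log for NVIDIA Xid
  # messages (lines containing "NVRM: Xid"). Assumes `dmesg` is readable by
  # the current user; the /var/log/kern.log fallback path is an assumption
  # and varies by distribution.
  import re
  import subprocess
  from pathlib import Path

  XID_RE = re.compile(r"NVRM: Xid \((.+?)\): (\d+)")  # captures (bus id, Xid code)

  def kernel_log_lines():
      try:
          result = subprocess.run(["dmesg"], capture_output=True, text=True, check=True)
          return result.stdout.splitlines()
      except (OSError, subprocess.CalledProcessError):
          fallback = Path("/var/log/kern.log")  # assumption: Debian/Ubuntu-style path
          return fallback.read_text(errors="replace").splitlines() if fallback.exists() else []

  def main():
      found = 0
      for line in kernel_log_lines():
          match = XID_RE.search(line)
          if match:
              found += 1
              bus, code = match.group(1), match.group(2)
              print(f"GPU {bus}: Xid {code}: {line.strip()}")
      if not found:
          print("No Xid messages found in the kernel log.")

  if __name__ == "__main__":
      main()

The Xid code printed for each hit can then be looked up in the NVIDIA Xid table linked above.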

SMI Commands

  • nvidia-smi -pl 100
    Set the power limit to 100 W.

  • nvidia-smi -q -d PAGE_RETIREMENT
    Query the GPU memory for retired pages. If the table shows an SBE rate of 100+ or a DBE rate of 10+, replace the GPU and test (see the first sketch after this list).

  • nvidia-smi --query-retired-pages=gpu_uuid,retired_pages.address,retired_pages.cause --format=csv
    View the current retired pages in CSV format - redirection ( > retiredpages.csv) can be used to save the output for reading later.

  • nvidia-smi -q -d ECC
    Output error-correction (ECC) information.

  • nvidia-smi --query-gpu=timestamp,name,pci.bus_id,driver_version,pstate,pcie.link.gen.max,pcie.link.gen.current,temperature.gpu,utilization.gpu,utilization.memory,memory.used,memory.free --format=csv -l 1
    Check the performance state, temperature, core utilization and memory usage every second (CSV format; redirection can be used to write the output to a CSV file).

  • nvidia-smi --query-gpu=index,clocks_throttle_reasons.active,clocks_throttle_reasons.gpu_idle,clocks_throttle_reasons.applications_clocks_setting,clocks_throttle_reasons.sw_power_cap,clocks_throttle_reasons.hw_slowdown,clocks_throttle_reasons.hw_thermal_slowdown,clocks_throttle_reasons.hw_power_brake_slowdown,clocks_throttle_reasons.sync_boost --format=csv
    Check the reasons for power/clock throttling (see the second sketch after this list).

  • nvidia-smi -q -d SUPPORTED_CLOCKS
    Check the supported GPU clocks.

  • nvidia-smi dmon -s pucvmet -i 0
    Probe the status of the selected GPU (where 0 = GPU index):
      • p: power usage and temperature
      • u: utilization
      • c: processor and memory clocks
      • v: power and thermal violations
      • m: FB and BAR1 memory
      • e: ECC errors and PCIe replay errors
      • t: PCIe Rx and Tx throughput

  • nvidia-smi pmon -i 0 -s u -o T
    'top'-like per-process view for the GPU (where 0 = GPU index):
      • sm%: CUDA core utilization
      • mem%: sampled time ratio for memory operations
      • enc%/dec%: hardware encoder/decoder utilization
      • fb: FB memory usage

  • nvidia-smi topo -m
    Output GPU topology information.

  • nvidia-smi topo -p2p rwnap
    Check peer-to-peer (P2P) access between GPUs.
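
The page-retirement check lends itself to a small wrapper that applies the replace-and-test thresholds automatically. The first sketch below is a minimal example, not an official tool: it reuses the retired-pages CSV query from the list above, assumes nvidia-smi is on the PATH, and matches the cause strings ("Single Bit ECC" / "Double Bit ECC") seen in typical output, which may differ between driver versions.

  # retired_pages_check.py - minimal sketch: count retired pages per GPU using
  # the retired-pages query from the list above, and flag GPUs that hit the
  # replace-and-test thresholds (100+ single-bit or 10+ double-bit retirements).
  # Assumes nvidia-smi is on the PATH; the cause strings matched below are
  # based on typical output and may vary by driver version.
  import csv
  import subprocess
  from collections import defaultdict

  QUERY = ["nvidia-smi",
           "--query-retired-pages=gpu_uuid,retired_pages.address,retired_pages.cause",
           "--format=csv"]

  def main():
      output = subprocess.run(QUERY, capture_output=True, text=True, check=True)
      counts = defaultdict(lambda: {"sbe": 0, "dbe": 0})
      for row in csv.reader(output.stdout.splitlines()):
          if len(row) != 3 or row[0].strip() == "gpu_uuid":
              continue  # skip the CSV header and malformed lines
          uuid, _address, cause = (field.strip() for field in row)
          if "Single" in cause:
              counts[uuid]["sbe"] += 1
          elif "Double" in cause:
              counts[uuid]["dbe"] += 1
      if not counts:
          print("No retired pages reported.")
      for uuid, pages in counts.items():
          flagged = pages["sbe"] >= 100 or pages["dbe"] >= 10
          verdict = "replace the GPU and test" if flagged else "within thresholds"
          print(f"{uuid}: SBE pages={pages['sbe']} DBE pages={pages['dbe']} -> {verdict}")

  if __name__ == "__main__":
      main()

Redirecting the raw query to a file ( > retiredpages.csv, as noted in the list) is still worth doing if the GPU ends up in a support case.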
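
Similarly, the throttle-reasons query returns one column per possible reason, so it is easy to filter the output down to just the reasons that are currently active. A minimal sketch, assuming the per-reason fields report "Active" / "Not Active" (the clocks_throttle_reasons.active field from the full query above is a hex bitmask and is omitted here for simpler parsing):

  # throttle_check.py - minimal sketch: report which clock-throttle reasons are
  # currently active on each GPU. Field names come from the query in the list
  # above; the "Active" / "Not Active" strings are an assumption based on
  # typical nvidia-smi CSV output.
  import csv
  import subprocess

  FIELDS = ["index",
            "clocks_throttle_reasons.gpu_idle",
            "clocks_throttle_reasons.applications_clocks_setting",
            "clocks_throttle_reasons.sw_power_cap",
            "clocks_throttle_reasons.hw_slowdown",
            "clocks_throttle_reasons.hw_thermal_slowdown",
            "clocks_throttle_reasons.hw_power_brake_slowdown",
            "clocks_throttle_reasons.sync_boost"]

  def main():
      cmd = ["nvidia-smi",
             "--query-gpu=" + ",".join(FIELDS),
             "--format=csv,noheader"]
      output = subprocess.run(cmd, capture_output=True, text=True, check=True)
      for row in csv.reader(output.stdout.splitlines()):
          if not row:
              continue
          values = [field.strip() for field in row]
          gpu_index, states = values[0], values[1:]
          active = [name.split(".")[-1]
                    for name, state in zip(FIELDS[1:], states)
                    if state == "Active"]
          print(f"GPU {gpu_index}: " + (", ".join(active) if active else "no throttling reported"))

  if __name__ == "__main__":
      main()

Run this while the workload is active: sw_power_cap means the clocks are held back by the configured power limit (cf. nvidia-smi -pl above), while the hw_* reasons generally point to thermal or power-delivery slowdowns.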