Why not nvidia-smi

According to the documentation, the utilization field in the nvidia-smi output only reflects the fraction of time the GPU was busy, not how much of the GPU's compute capacity was actually used. Knowing the fraction of busy time is useful early in a project, when the goal is to remove unnecessary CPU work so that the application is not blocked on the CPU.

But once most of the work is done by the GPU, we need to profile how hard the GPU itself is working and determine which part is the bottleneck.
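As a quick reference, here is a minimal sketch of sampling that time-based utilization (and the instantaneous power draw) from Python by shelling out to nvidia-smi; the one-second polling loop and the choice of query fields are illustrative choices, not part of the workflow described below.

    import subprocess
    import time

    # Sample the time-based utilization and the instantaneous power draw a few
    # times. "utilization.gpu" is the fraction-of-time metric discussed above.
    for _ in range(5):
        result = subprocess.run(
            ["nvidia-smi",
             "--query-gpu=utilization.gpu,power.draw",
             "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        )
        print(result.stdout.strip())
        time.sleep(1)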

Find the slow stage with nsys

Nsight Systems can be used to get a timeline of CUDA API calls, kernel launches, and NVTX ranges.

  1. Tag the different stages of the application with NVTX (a combined sketch follows this list).
    # pytorch
    torch.cuda.nvtx.range_push(tag_name)
    # code of the stage
    torch.cuda.nvtx.range_pop()
    # or you can wrap the stage with the context manager torch.cuda.nvtx.range(tag_name)
    
  2. Exclude the stages that you do not want to profile, such as preparation and finalization, by bracketing the region of interest with the CUDA profiler API (this is what --capture-range=cudaProfilerApi in the next step keys on).
    torch.cuda.cudart().cudaProfilerStart()
    # the code that you want to profile
    torch.cuda.cudart().cudaProfilerStop()
    
  3. Profile the code with nsys:
    nsys profile \
        --gpu-metrics-device=0,1 \
        --capture-range=cudaProfilerApi \
        --stop-on-range-end=true \
        -w true \
        -t cuda,nvtx \
        -o report%p \
        -f true \
        your_program \
        your_args
    

    If you launch the program with MPI, %p expands to the process ID, so each rank writes its own report file.
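
Putting steps 1 and 2 together, an instrumented script might look like the following minimal sketch; the toy model, the data, and the stage names are hypothetical placeholders, not anything from a real workload.

    import torch

    def main():
        model = torch.nn.Linear(1024, 1024).cuda()  # placeholder model
        data = [torch.randn(64, 1024, device="cuda") for _ in range(20)]

        # Warm-up / preparation stays outside the profiled range.
        for x in data[:5]:
            model(x)
        torch.cuda.synchronize()

        torch.cuda.cudart().cudaProfilerStart()     # nsys starts capturing here
        for step, x in enumerate(data[5:]):
            # push/pop style
            torch.cuda.nvtx.range_push(f"forward_{step}")
            y = model(x)
            torch.cuda.nvtx.range_pop()

            # context-manager style
            with torch.cuda.nvtx.range(f"loss_{step}"):
                loss = y.square().mean()
        torch.cuda.synchronize()
        torch.cuda.cudart().cudaProfilerStop()      # and stops here

    if __name__ == "__main__":
        main()

With the nsys command above, capture only covers the region between cudaProfilerStart and cudaProfilerStop, and the NVTX ranges show up as named spans on the timeline so each stage can be compared.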

Refer to the Nsight Systems GPU Metrics documentation for an explanation of each metric. Some important ones are:

  1. GR Active
  2. Compute Warps in Flight
  3. Active SM Unused Warp Slots
  4. Idle SM Unused Warp Slots
  5. SM Active
  6. SM Issue
  7. Tensor Active
  8. DRAM Bandwidth