Why not nvidia-smi

According to the documentation, the utilization field in the nvidia-smi output only reflects the fraction of time the GPU was busy, not how much of the GPU's compute capacity was actually used. Knowing the fraction of busy time is useful early in a project, when the goal is to remove unnecessary CPU work so that the application is not blocked on the CPU.

But once most of the work is done by the GPU, we need to profile how hard the GPU itself is working and determine which part is the bottleneck.
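As a quick reference, here is a minimal sketch of sampling that time-based utilization (and the instantaneous power draw) from Python by shelling out to nvidia-smi; the one-second polling loop and the choice of query fields are illustrative choices, not part of the workflow described below.

    import subprocess
    import time

    # Sample the time-based utilization and the instantaneous power draw a few
    # times. "utilization.gpu" is the fraction-of-time metric discussed above.
    for _ in range(5):
        result = subprocess.run(
            ["nvidia-smi",
             "--query-gpu=utilization.gpu,power.draw",
             "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        )
        print(result.stdout.strip())
        time.sleep(1)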

Find the slow stage with nsys

Nsight Systems can be used to get a timeline of CUDA API calls, kernel launches, and NVTX ranges.

  1. Tag the different stages of the application with NVTX (a combined sketch follows this list).
    # pytorch
    torch.cuda.nvtx.range_push(tag_name)
    # code of the stage
    torch.cuda.nvtx.range_pop()
    # or you can wrap the stage with the context manager torch.cuda.nvtx.range(tag_name)
    
  2. Exclude the stages that you do not want to profile, such as preparation and finalization, by bracketing the region of interest with the CUDA profiler API (this is what --capture-range=cudaProfilerApi in the next step keys on).
    torch.cuda.cudart().cudaProfilerStart()
    # the code that you want to profile
    torch.cuda.cudart().cudaProfilerStop()
    
  3. Profile the code with nsys:
    nsys profile \
        --gpu-metrics-device=0,1 \
        --capture-range=cudaProfilerApi \
        --stop-on-range-end=true \
        -w true \
        -t cuda,nvtx \
        -o report%p \
        -f true \
        your_program \
        your_args
    

    If you launch the program with MPI, %p expands to the process ID, so each rank writes its own report file.
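
Putting steps 1 and 2 together, an instrumented script might look like the following minimal sketch; the toy model, the data, and the stage names are hypothetical placeholders, not anything from a real workload.

    import torch

    def main():
        model = torch.nn.Linear(1024, 1024).cuda()  # placeholder model
        data = [torch.randn(64, 1024, device="cuda") for _ in range(20)]

        # Warm-up / preparation stays outside the profiled range.
        for x in data[:5]:
            model(x)
        torch.cuda.synchronize()

        torch.cuda.cudart().cudaProfilerStart()     # nsys starts capturing here
        for step, x in enumerate(data[5:]):
            # push/pop style
            torch.cuda.nvtx.range_push(f"forward_{step}")
            y = model(x)
            torch.cuda.nvtx.range_pop()

            # context-manager style
            with torch.cuda.nvtx.range(f"loss_{step}"):
                loss = y.square().mean()
        torch.cuda.synchronize()
        torch.cuda.cudart().cudaProfilerStop()      # and stops here

    if __name__ == "__main__":
        main()

With the nsys command above, capture only covers the region between cudaProfilerStart and cudaProfilerStop, and the NVTX ranges show up as named spans on the timeline so each stage can be compared.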

Refer to the Nsight Systems GPU Metrics documentation for an explanation of each metric. Some important ones are:

  1. GR Active
  2. Compute Warps in Flight
  3. Active SM Unused Warp Slots
  4. Idle SM Unused Warp Slots
  5. SM Active
  6. SM Issue
  7. Tensor Active
  8. DRAM Bandwidth