How to properly profile GPU applications
Why not nvidia-smi?
According to the documentation, the utilization field in the nvidia-smi output only reflects the fraction of time during which the GPU was busy, not how much of the GPU's compute capacity was actually used. Knowing the fraction of busy time is still useful early in a project, when the goal is to remove unnecessary CPU work so that the application is not blocked on the CPU.
But once most of the work is done by the GPU, we need to measure how much of the GPU's throughput is actually being used and determine which part is the bottleneck.
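As an illustration, here is a minimal sketch (assuming the pynvml package is installed, which is not mentioned in the original setup) that reads the same utilization number programmatically. It only tells you whether at least one kernel was running during the sampling window, not how busy the SMs were:

```python
# Minimal sketch, assuming the `pynvml` package (pip install nvidia-ml-py) is available.
# It reads the same "utilization" value that nvidia-smi reports: the fraction of time
# during which at least one kernel was running, not how hard the SMs were working.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0
util = pynvml.nvmlDeviceGetUtilizationRates(handle)
print(f"GPU busy-time fraction: {util.gpu}%")    # "utilization.gpu" in nvidia-smi
print(f"Memory controller busy: {util.memory}%")
pynvml.nvmlShutdown()
```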
Find the slow stage with nsys
Nsight Systems can be used to get a timeline of CUDA API calls, kernel launches, and NVTX ranges.
- Tag different stages of the application with NVTX (a fuller end-to-end sketch follows this list):

```python
# PyTorch
torch.cuda.nvtx.range_push(tag_name)
# code of the stage
torch.cuda.nvtx.range_pop()
# or you can wrap the stage with the context manager torch.cuda.nvtx.range(tag_name)
```
- Limit the capture to the region you care about, excluding stages such as preparation and finalization, by wrapping it with the CUDA profiler API:

```python
torch.cuda.cudart().cudaProfilerStart()
# the code that you want to profile
torch.cuda.cudart().cudaProfilerStop()
```
- Profile the code with nsys:

```bash
nsys profile --gpu-metrics-device=0,1 \
    --capture-range=cudaProfilerApi \
    --stop-on-range-end=true \
    -w true \
    -t cuda,nvtx \
    -o report%p \
    -f true \
    your_program your_args
```
If you launch the program with MPI, the %p placeholder in the output name expands to the process ID, so each rank writes its own report.
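Putting the pieces together, here is a minimal sketch of a training loop that tags stages with NVTX and restricts the capture range with the CUDA profiler API. The model, data, and tag names are made up for illustration; the script is meant to be run under the nsys command above:

```python
# Minimal sketch: the model, data, and tag names are illustrative, not from the original post.
import torch

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
data = [torch.randn(64, 1024, device="cuda") for _ in range(20)]

# Warm-up iterations stay outside the capture range.
for batch in data[:5]:
    model(batch).sum().backward()
    optimizer.step()
    optimizer.zero_grad()

torch.cuda.cudart().cudaProfilerStart()          # nsys starts capturing here
for step, batch in enumerate(data[5:]):
    with torch.cuda.nvtx.range(f"step_{step}"):  # NVTX range shown on the timeline
        with torch.cuda.nvtx.range("forward"):
            loss = model(batch).sum()
        with torch.cuda.nvtx.range("backward"):
            loss.backward()
        with torch.cuda.nvtx.range("optimizer"):
            optimizer.step()
            optimizer.zero_grad()
torch.cuda.cudart().cudaProfilerStop()           # nsys stops capturing here
```

Because the command uses --capture-range=cudaProfilerApi, only the loop between cudaProfilerStart() and cudaProfilerStop() appears in the report, with each NVTX range shown as a labeled block on the timeline.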
Refer to the Nsight Systems GPU Metrics documentation for an explanation of each metric. Some important metrics are:
- GR Active
- Compute Warps in Flight
- Active SM Unused Warp Slots
- Idle SM Unused Warp Slots
- SM Active
- SM Issue
- Tensor Active
- DRAM Bandwidth