Profiling¶
HPCToolkit¶
HPCToolkit can profile any code built with debug
symbols, it does not require special code instrumenting. In CMake, one simply
needs to select RelWithDebInfo as the build type.
Consult HPCToolkit documentation to find out how to obtain and install all the required profiling software. Hopefully, your system administrator will install HPCToolkit for you and you would only need to load appropriate modules.
Note
On Frontier supercomputer all you need to get are ums,
ums023 and hpctoolkit modules.
Once you built your code with debug symbols and got the working HPCToolkit installation, you run profiling with
hpcrun -t -e CPUTIME -e gpu=amd -o out.m my_executable.exe
hpcrun is a profiling tool that collects call path profiles of program executions using statistical sampling of hardware counters, software counters, or timers. In the example above:
Flag
-ttells profiler to generate a call path trace in addition to a call path profile.Flag
-eselects events or sampling periods for profiling.
-e CPUTIMEwill tell profiler to use Linux system timer to set the sampling rate.
-e gpu=amdwill instruct profiler to collect comprehensive operation-level measurements for HIP programs on AMD GPU.Flag
-ospecifies the directory where the output data will be stored.In this example,
my_executable.exeis the binary being profiled.
It is often helpful to map profiling data to specific CPU and GPU code. This can be done by invoking hpcstruct command as
hpcstruct out.m
where out.m is the directory with profiling measurements collected by
hpcrun. This step is optional.
hpcprof analyzes profile performance measurements and attributes them to static source code structure. One can simply call
hpcprof -o out.d out.m
This will generate profiling information using measurements in directory
out.m and store it in directory out.d. The data in out.d can be
viewed using hpcviewer tool.
It is recommended to download and install the graphical version of HPCToolkit viewer on your local machine and analyze profiling data there. The graphical user interface provides many productivity features but often runs slowly over SSH connection.
When running the profiler on a machine with a scheduler, it is best to use a script. When using SLURM, for example, a script to run the profiler would look something like this:
#!/bin/bash
#SBATCH -A MyAllocation
#SBATCH -J resolve_profile
#SBATCH -o %x-%j.out
#SBATCH -t 00:30:00
#SBATCH -N 1
EXE=resolve_executable.exe
OUT=hpctoolkit_resolve_profile
# Profile Re::Solve code on a single GPU
echo "`date` Starting run"
srun -N 1 -n 1 -c 1 -G 1 \
hpcrun -t -e CPUTIME -e gpu=amd -o ${OUT}.m ${EXE}
echo "`date` Finished run"
# Use 56 cores to process profiling measurements
srun -N 1 -n 1 -c 56 hpcstruct ${OUT}.m
srun -N 1 -n 1 -c 56 hpcprof -o ${OUT}.d ${OUT}.m
Any arguments that the executable takes can be simply added as when the executable is called locally. A similar script could be written for LSF or MOAB scheduler.
AMD ROCProfiler¶
To profile GPU kernels in the HIP backend, we recommend using AMD tools. Since ROCm 6.2,
the recommended approach is ROCprofiler-SDK.
ROCprofiler-SDK is part of the ROCm library, so no additional software needs to be installed
once you obtain ROCm.
ROCTX annotations are useful to place kernel execution in the context of CPU code execution. These can be added as follow:
// some include files ...
#include <rocprofiler-sdk-roctx/roctx.h>
// some code ...
roctxRangePush("My Event");
// my event code ...
roctxRangePop();
roctxMarkA("My Event");
The string label is an optional argument to the annotation code.
At this time, Re::Solve implements the macros RESOLVE_RANGE_PUSH and RESOLVE_RANGE_POP
to select NVTX or ROCTX with the appropriate nvtxRangePush or roctxRangePush,
respectively, depending on the active backend. Make sure your Re::Solve library was built with
profiling support, i.e. the build was configured with CMake boolean flag RESOLVE_USE_PROFILING
set to On. The appropriate profiling library will be linked depending on the backend selected.
Once your instrumented code is built, it can be profiled as follows:
rocprofv3 --stats --hip-trace --roctx-trace -o out.csv ./my_executable.exe
In this example
Flag
-ospecifies the output file in comma separated values format, in this caseout.csv.File
my_executable.exeis the binary being profiled.Flag
--statsenables kernel execution stats.Flag
--hip-traceincludes HIP API timelines in profiling data.Flag
--roctx-traceenables rocTX application code annotation trace.
The profiler will create several files with name out but with different
extensions. To visualize output, one can upload the out.json file to
Perfetto.
When running ROCProfiler on a machine with a scheduler, it is a good idea to write a profiling script. Here is an example for a SLURM scheduler:
#!/bin/bash
#SBATCH -A CSC359
#SBATCH -J resolve_test
#SBATCH -o %x-%j.out
#SBATCH -t 00:30:00
#SBATCH -N 1
EXE=build/examples/klu_rocsolverrf_fgmres.exe
OUT=rocprof-resolve25k
ARGS=""
echo "`date` Starting run"
srun -N 1 -n 1 -c 1 -G 1 \
rocprofv3 --stats --hip-trace --roctx-trace -o ${OUT}.csv \
${EXE} ${ARGS}
echo "`date` Finished run"
NVIDIA Nsight Systems¶
Similar to ROCTX annotations, NVTX annotations are useful to place kernel execution in the context of CPU code execution when using the CUDA backend.
You can annotate events you want to trace in your code execution as follows:
// some include files ...
#include <nvToolsExt.h>
// some code ...
nvtxRangePush("My Event");
// my event code ...
nvtxRangePop();
nvtxMarkA("My Event");
Multiple tools support NVTX annotations. We recommend Nsight Systems for a high level profile. It can be used as follows:
nsys profile --stats=true --trace=cuda,nvtx ./my_executable.exe
This will generate a text output with high level execution statistics (time spent in kernels,
NVTX ranges, memory) as well as *.nsys-rep and *.sqlite files. The *nsys-rep
can be opened in the Nsight System graphical user interface to visualize the timeline.
You will need to download the file to a local machine if you are profiling on the remote
system that does not support graphical user interfaces.