Profiling#
Tools#
Resource bottlenecks are identified using specialized monitoring and accounting tools. These tools enable the empirical measurement of Central Processing Unit (CPU), memory, Input/Output (I/O), and network utilization across different storage and compute architectures.
Job accounting data is retrieved via the `sacct` and `sstat` utilities, enabling both retrospective and real-time analysis of resource consumption: `sacct` queries the accounting database for historical execution metrics, while `sstat` provides near real-time telemetry for active job steps.
```bash
# Retrospective: List all past jobs for a specific user since the specified date
sacct -u <username> -S 2026-01-01 --format=JobID,State,NodeList,CPUTime,MaxRSS,MaxRSSNode

# Retrospective: Detailed resource accounting for a specific job ID
sacct -j <jobID> --format=JobID,JobName,State,AllocCPUS,CPUTimeRAW,TotalCPU,ReqMem,AveRSS,MaxRSS

# Real-time: View active resource utilization for a running job step
sstat -j <jobID.stepID> --format=JobID,AveCPU,AveRSS,MaxRSS,MaxDiskRead,MaxDiskWrite

# (Plugin) High-frequency data collection: Request continuous profiling during submission
sbatch --profile=all my_script.sh
```
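The accounting output can also be post-processed. As a minimal sketch (assuming `sacct -j <jobID> --parsable2 --noheader --format=JobID,CPUTimeRAW,TotalCPU` output, where `CPUTimeRAW` is the allocated core-seconds and `TotalCPU` the CPU time actually consumed), the ratio of the two yields a rough CPU-efficiency figure:

```python
def totalcpu_seconds(s):
    """Convert sacct's [dd-]hh:mm:ss[.ms] time string to seconds."""
    days = 0
    if "-" in s:
        d, s = s.split("-")
        days = int(d)
    parts = [float(p) for p in s.split(":")]
    while len(parts) < 3:          # pad mm:ss or ss forms to h:m:s
        parts.insert(0, 0.0)
    h, m, sec = parts
    return days * 86400 + h * 3600 + m * 60 + sec

def cpu_efficiency(line):
    """Consumed CPU time / allocated CPU time for one sacct record
    (fields: JobID|CPUTimeRAW|TotalCPU, as requested via --parsable2)."""
    _jobid, cputime_raw, totalcpu = line.strip().split("|")
    return totalcpu_seconds(totalcpu) / float(cputime_raw)

# Hypothetical record: 2 h of CPU consumed out of 4 core-hours allocated
print(cpu_efficiency("123456|14400|02:00:00"))  # → 0.5
```

A value well below 1.0 usually indicates over-allocated cores or an I/O-bound workload.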
General-purpose UNIX utilities provide high-frequency observation of the current system state.
| Command | Function |
|---|---|
| `htop` | Interactive resource monitor (CPU, memory, disks, network). |
| `mpstat 1` | Reports CPU utilization at 1-second intervals. |
| `lshw -short` | Generates a brief hardware inventory. |
| `iostat -x 1` | Reports extended CPU and device input/output statistics at 1-second intervals. |
| `nvidia-smi` | Monitors NVIDIA GPU utilization, power draw, and memory allocation. |
| `nvtop` | Interactive GPU resource monitor (supports NVIDIA, AMD, and Intel GPUs). |
The Flexible I/O Tester (`fio`) is used to benchmark storage backends by simulating precise I/O workloads. Throughput (bandwidth) is typically evaluated using sequential operations with large block sizes, whereas latency (IOPS) is assessed using random operations with small block sizes. The `--direct=1` parameter is crucial: it bypasses the operating system's buffer cache, ensuring measurements reflect the underlying hardware capabilities.
Intensive Throughput Warning
Execution of fio on shared filesystems (e.g., CephFS, NFS, Lustre) generates heavy synthetic load that may severely degrade performance for concurrent users. Coordination with system administrators is required before execution on shared infrastructure.
The `clat` (completion latency) section of fio's output details the performance profile of an exceptionally fast storage operation. To translate these values into an architectural understanding of latency, the base unit of measurement must first be contextualized: fio reports `clat` values in nanoseconds (nsec), while in standard storage benchmarking latency is more commonly discussed in microseconds ($\mu$s) or milliseconds (ms).
Standard mechanical Hard Disk Drives (HDDs) operate in the millisecond range (e.g., 5,000 to 10,000 $\mu$s). Standard SATA Solid State Drives (SSDs) operate in the mid-microsecond range (e.g., 100 to 500 $\mu$s).
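These ranges can be encoded as a rough classifier. The thresholds below are the illustrative figures quoted above, not calibrated benchmarks:

```python
def classify_latency(clat_nsec):
    """Rough storage-tier guess from a completion latency in nsec,
    using the illustrative ranges quoted in the text above."""
    usec = clat_nsec / 1_000          # fio reports clat in nsec
    if usec < 100:
        return "NVMe-class / cached"
    if usec <= 500:
        return "SATA SSD range"
    if usec >= 5_000:
        return "HDD range"
    return "intermediate"

print(classify_latency(250_000))    # 250 us → SATA SSD range
print(classify_latency(8_000_000))  # 8 ms   → HDD range
```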
```bash
# Throughput (Bandwidth): Sequential Read/Write (1MB blocks, queue depth 16)
fio --name=seq_throughput --directory=/var/tmp/test --rw=readwrite --bs=1M \
    --time_based --runtime=60 \
    --size=10G --numjobs=1 --iodepth=16 --direct=1

# Latency (IOPS): Random Read/Write (4KB blocks, queue depth 1, 4 threads)
fio --name=rand_latency --directory=/tmp/test --rw=randrw --bs=4k \
    --time_based --runtime=60 \
    --size=10G --numjobs=4 --iodepth=1 --direct=1
```
**Base metrics (`clat`):** for example, a standard deviation of `stdev=412` (in nsec) corresponds to 0.41 $\mu$s.
```ini
# bench.fio
[global]
directory=${FIO_TEST_DIR}
size=10G
time_based
runtime=60
direct=1
iodepth=1
numjobs=1

[throughput]
rw=readwrite
bs=1M
numjobs=1
iodepth=16

[latency]
rw=randrw
bs=4k
numjobs=4
iodepth=1
```

```bash
FIO_TEST_DIR=/var/tmp/test fio bench.fio --section=latency
```
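fio can also emit machine-readable results via `--output-format=json`. A minimal sketch that extracts read bandwidth and mean completion latency from such a report (field names follow fio's JSON layout; the sample below is a synthetic report reduced to the fields actually used):

```python
import json

def summarize_fio(report_text):
    """Extract read bandwidth and mean completion latency per job
    from a fio --output-format=json report."""
    report = json.loads(report_text)
    out = []
    for job in report["jobs"]:
        rd = job["read"]
        out.append({
            "name": job["jobname"],
            "bw_MiBps": rd["bw"] / 1024,            # fio reports bw in KiB/s
            "clat_usec": rd["clat_ns"]["mean"] / 1000,
        })
    return out

# Synthetic report reduced to the fields used above
sample = '''{"jobs": [{"jobname": "latency",
  "read": {"bw": 51200, "clat_ns": {"mean": 412000}}}]}'''
print(summarize_fio(sample))
# → [{'name': 'latency', 'bw_MiBps': 50.0, 'clat_usec': 412.0}]
```

Parsing the JSON report is far more robust across fio versions than scraping the human-readable output.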
The `wrk` utility is a modern HTTP benchmarking tool capable of generating significant load when run on a single multi-core CPU. It is deployed to empirically measure the response latency and maximum throughput of RESTful endpoints, making it the standard tool for profiling object storage gateways (e.g., S3, Swift) where data is retrieved over HTTP/HTTPS rather than a POSIX filesystem.
```bash
# Throughput and Latency: HTTP GET requests against an object storage endpoint
# Configuration: 12 threads, 100 concurrent connections, 30-second duration
wrk -t12 -c100 -d30s --latency "https://object-storage.example.com/bucket/test-file.bin"
```
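The resulting text report can be post-processed. A minimal sketch that pulls the latency percentiles and request rate out of wrk's `--latency` output with regular expressions (fed a synthetic excerpt here):

```python
import re

UNIT_MS = {"us": 0.001, "ms": 1.0, "s": 1000.0}

def parse_wrk(text):
    """Pull latency percentiles (converted to ms) and Requests/sec
    out of wrk's --latency text report."""
    pct = {
        int(m.group(1)): float(m.group(2)) * UNIT_MS[m.group(3)]
        for m in re.finditer(r"(\d+)%\s+([\d.]+)(us|ms|s)", text)
    }
    rps = float(re.search(r"Requests/sec:\s+([\d.]+)", text).group(1))
    return pct, rps

# Synthetic excerpt in wrk's report layout
sample = """  Latency Distribution
     50%  842.00us
     99%    8.33ms
Requests/sec:  31234.12
"""
pct, rps = parse_wrk(sample)
print(round(pct[50], 3), pct[99], rps)  # → 0.842 8.33 31234.12
```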
Runtime#
Resource profiling is categorized into three primary architectural tiers. The selection of methodology determines the degree of measurement interference, temporal accuracy, and the overall reproducibility of the analytical pipeline.
Infrastructure-Level (Workload Manager)#
Data collection is executed passively by optimized daemons integrated directly into the compute node architecture. This methodology operates independently of the application runtime and avoids the overhead of localized polling loops. It is prioritized for generating highly reproducible, objective system-level telemetry.
Primary Tools: SLURM `acct_gather_profile/hdf5` plugin, `sh5util`.

Underlying Principle: System metrics (energy, CPU, memory, network) are aggregated at the hardware/OS level by the scheduling daemon and written to standardized, open-source HDF5 binary archives.
Example (SLURM Batch Script):
```bash
#!/bin/bash
#SBATCH --job-name=infrastructure_profiling
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4
#SBATCH --time=00:30:00
#SBATCH --output=job_%j.out

# The --profile=all directive instructs the SLURM step launcher to activate node-level telemetry.
srun --profile=all ./my_application

# Post-execution consolidation of distributed binary archives into a single reproducible dataset.
# Note: Executable only if acct_gather.conf permissions permit user access.
sh5util -j $SLURM_JOB_ID
```
Application-Level (Native Instrumentation)#
Telemetry hooks are inserted directly into the application’s runtime environment or memory allocator. This methodology eliminates temporal misalignment by directly correlating hardware utilization with specific computational operations. It is the optimal approach for granular hardware analysis, specifically for accelerator memory tracking.
Primary Tools: PyTorch Profiler, Score-P, HPCToolkit.
Underlying Principle: The application programming interface (API) is instrumented to intercept and record resource allocation events natively during execution.
Example (PyTorch GPU Memory Profiling):
```python
import torch
import torchvision.models as models
from torch.profiler import profile, record_function, ProfilerActivity

# Initialize workload and allocate to hardware accelerator
model = models.resnet18().cuda()
inputs = torch.randn(5, 3, 224, 224).cuda()

# Instantiate the open-source profiler with memory tracking enabled
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    profile_memory=True,
    record_shapes=True,
    with_stack=True,
) as prof:
    # Context manager correlates hardware telemetry with a specific operational block
    with record_function("model_forward_pass"):
        model(inputs)

# Export standard summary table to standard output
print(prof.key_averages().table(sort_by="cuda_memory_usage", row_limit=10))

# Export comprehensive time-series trace to an open format for reproducible
# visualization (e.g., via chrome://tracing)
prof.export_chrome_trace("pytorch_memory_trace.json")
```
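The exported trace follows the Chrome trace-event JSON format, so it can be inspected without the interactive viewer as well. A minimal sketch that aggregates total duration per event name (fed a synthetic trace string here; a real run would read the exported file instead):

```python
import json

def top_events(trace_json, n=3):
    """Total duration per event name from a Chrome-trace JSON string,
    sorted descending (the format written by export_chrome_trace)."""
    totals = {}
    for ev in json.loads(trace_json).get("traceEvents", []):
        if "dur" in ev:                 # complete events carry a duration in us
            totals[ev["name"]] = totals.get(ev["name"], 0) + ev["dur"]
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]

# Synthetic trace with two operator events
sample = '''{"traceEvents": [
  {"name": "aten::conv2d", "ph": "X", "dur": 1200},
  {"name": "aten::relu",   "ph": "X", "dur": 80},
  {"name": "aten::conv2d", "ph": "X", "dur": 900}]}'''
print(top_events(sample))  # → [('aten::conv2d', 2100), ('aten::relu', 80)]
```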
User-Space Polling Level (Concurrent Shell Processes)#
Concurrent monitoring utilities are executed via background loops. This approach is classified as a structural fallback. Reliance on the operating system’s standard process scheduler introduces resource contention and temporal misalignment, reducing the objective reproducibility of the telemetry.
Primary Tools: `sysstat` (`pidstat`), `nvidia-smi` (or `rocm-smi`).

Underlying Principle: The host operating system's pseudo-filesystem (`/proc`) and vendor-specific hardware drivers are queried iteratively by independent, user-space processes.
Example (Concurrent Bash Polling):
```bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --time=00:20:00
#SBATCH --output=job_%j.out

# Ensure propagation of termination signals to background telemetry processes
trap 'kill -TERM $APP_PID $HOST_PROF_PID $GPU_PROF_PID 2>/dev/null; wait $APP_PID' TERM INT

# 1. Execute primary workload in the background
./my_application &
APP_PID=$!

# 2. Execute host-level telemetry (CPU, Memory, I/O)
#    -h: Horizontal formatting for automated parsing
#    5:  Polling interval (seconds)
pidstat -p $APP_PID -u -r -d -h 5 > "host_telemetry_${SLURM_JOB_ID}.log" &
HOST_PROF_PID=$!

# 3. Execute hardware-level telemetry (GPU)
#    --format=csv: Enforces structured data output
#    -l 5:         Polling interval (seconds)
nvidia-smi --query-gpu=timestamp,index,utilization.gpu,utilization.memory,memory.used,power.draw \
    --format=csv -l 5 > "gpu_telemetry_${SLURM_JOB_ID}.csv" &
GPU_PROF_PID=$!

# Suspend script execution until the primary workload terminates natively
wait $APP_PID

# Clean termination of isolated polling loops
kill -TERM $HOST_PROF_PID $GPU_PROF_PID 2>/dev/null
```
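The CSV telemetry written by such a polling loop can then be reduced offline. A minimal sketch that averages `utilization.gpu` across samples, assuming the column layout produced by the `--query-gpu` list above (the sample log is synthetic):

```python
import csv, io

def mean_gpu_util(csv_text):
    """Average utilization.gpu across all samples of the CSV telemetry
    written by an nvidia-smi --format=csv polling loop."""
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    vals = [float(row["utilization.gpu [%]"].rstrip(" %")) for row in reader]
    return sum(vals) / len(vals)

# Synthetic two-sample log in nvidia-smi's CSV layout
sample = """timestamp, index, utilization.gpu [%], utilization.memory [%], memory.used [MiB], power.draw [W]
2026/01/01 12:00:00.000, 0, 80 %, 40 %, 4096 MiB, 210.50 W
2026/01/01 12:00:05.000, 0, 90 %, 45 %, 4096 MiB, 220.00 W
"""
print(mean_gpu_util(sample))  # → 85.0
```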
👷 Practical Part 👷#
The theoretical understanding of profiling tools is solidified through empirical testing across different hardware architectures. The following exercises isolate specific storage tiers to measure their distinct latency and throughput characteristics.
Local vs. Cloud VM#
Evaluate the performance discrepancy between locally attached and network-attached storage from the perspective of an isolated virtual machine. Use `fio` to compare the read/write performance of a locally attached block storage device against network-attached block storage.
Check the discrepancy between volatile temporary storage (typically `/tmp/`) and non-volatile temporary storage (typically `/var/tmp/`):

- Use `df` to check the type of temporary storage.
- Use `fio` to perform read/write profiling.
fio overwrites!
fio will attempt to overwrite the provided destination!
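The `df` check can also be reproduced programmatically. A minimal sketch that resolves a path's filesystem type by longest-prefix matching against `/proc/mounts`-style entries (fed a synthetic mounts table here rather than the live file):

```python
def fs_type(path, mounts_text):
    """Resolve a path's filesystem type by longest-prefix match of its
    mount point, given /proc/mounts-style lines (device mountpoint fstype ...).
    A robust version would match on whole path components."""
    best = ("", "unknown")
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and path.startswith(fields[1]) and len(fields[1]) > len(best[0]):
            best = (fields[1], fields[2])
    return best[1]

# Synthetic mounts table: /tmp on RAM-backed tmpfs, everything else on disk
sample = """/dev/sda1 / ext4 rw 0 0
tmpfs /tmp tmpfs rw 0 0
"""
print(fs_type("/tmp/test", sample))      # → tmpfs
print(fs_type("/var/tmp/test", sample))  # → ext4
```

A `tmpfs` result indicates volatile, RAM-backed storage whose contents vanish on reboot, which is exactly the distinction this exercise asks you to measure.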