Debugging & Profiling
Systematic approaches to finding and fixing GPU kernel issues. From cryptic error messages to silent numerical bugs—learn to diagnose problems efficiently.
- Diagnose common kernel errors from CUDA error messages
- Apply systematic debugging flowcharts for "wrong results" and "slow kernel" issues
- Detect and fix numerical issues (NaN, Inf, precision loss)
- Ensure deterministic execution for reproducibility
- Use Nsight tools to identify performance bottlenecks
Decoding CUDA Errors
CUDA errors can be cryptic. The CUDA Runtime API defines error codes that tell you what went wrong—if you know how to interpret them.
CUDA kernel launches are asynchronous.
Errors may not surface until you synchronize. Always check the status returned by cudaDeviceSynchronize()
or cudaStreamSynchronize() to catch kernel failures.
Common Error Patterns
// Thread index can run past the end of the array
int idx = blockIdx.x * blockDim.x + threadIdx.x;
output[idx] = input[idx]; // No bounds check! Error: cudaErrorIllegalAddress
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < N) { // Guard clause
output[idx] = input[idx];
}
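The guard is needed because the grid is sized by ceiling division, so the last block usually contains surplus threads. A quick sketch of the arithmetic (the helper name `grid_size` is ours, for illustration):

```python
# Ceil-division grid sizing: the last block usually has extra threads,
# which is why the `idx < N` guard clause is required in the kernel.
def grid_size(n, block_dim):
    return (n + block_dim - 1) // block_dim

N, BLOCK = 1000, 256
blocks = grid_size(N, BLOCK)      # 4 blocks
total_threads = blocks * BLOCK    # 1024 threads launched
excess = total_threads - N        # 24 threads would read out of bounds
```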
"My Kernel Gives Wrong Results"
When a kernel runs but produces incorrect output, systematic diagnosis beats random guessing. Follow this flowchart to isolate the problem.
Comparison Testing
The gold standard: compare your kernel output against a known-correct reference. PyTorch and NumPy provide reliable baselines.
import numpy as np

# NumPy reference (trusted CPU baseline)
def layernorm_ref(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

# Compare with tolerance
ref = layernorm_ref(x_np, gamma_np, beta_np)
out = my_kernel(x_gpu, gamma_gpu, beta_gpu)
out_np = out.cpu().numpy()
if not np.allclose(ref, out_np, rtol=1e-3, atol=1e-5):
    print("MISMATCH:", np.abs(ref - out_np).max())
Different precisions require different tolerances. IEEE 754 binary32 (FP32) carries ~7 decimal digits of precision, FP16 (binary16) ~3-4, and BF16 ~2-3.
As a rule of thumb, use rtol=1e-3 for FP16 and rtol=1e-5 for FP32.
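A small NumPy check of why these tolerances are reasonable: rounding a value through each format bounds the relative error by roughly half a unit in the last mantissa bit.

```python
import numpy as np

# Round a double through FP16 and FP32 and measure the relative error.
# FP16's 10 mantissa bits bound it near 2**-11 ~ 4.9e-4, hence rtol=1e-3.
x = np.float64(3.14159265)
rel_err_fp16 = abs(np.float64(np.float16(x)) - x) / x
rel_err_fp32 = abs(np.float64(np.float32(x)) - x) / x
```

Here `rel_err_fp16` lands around 3e-4, which passes rtol=1e-3 but would fail rtol=1e-5; `rel_err_fp32` is around 1e-8.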
"My Kernel is Slow"
Before optimizing, you must identify the bottleneck. Is your kernel memory-bound or compute-bound?
Memory-Bound vs Compute-Bound
Arithmetic intensity = FLOPs / bytes moved. If your kernel's intensity is below the hardware's ridge point (peak FLOP/s divided by peak memory bandwidth), you're memory-bound—no amount of compute optimization helps.
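As a sketch, here is that classification using approximate published A100 figures (19.5 TFLOP/s FP32, ~1.55 TB/s HBM bandwidth); treat both numbers as illustrative assumptions, not a spec:

```python
# Roofline back-of-envelope with assumed A100-class peaks.
PEAK_FLOPS = 19.5e12          # FP32 FLOP/s (assumed)
PEAK_BW = 1.55e12             # bytes/s (assumed)
ridge = PEAK_FLOPS / PEAK_BW  # ~12.6 FLOP/byte: below this, memory-bound

def arithmetic_intensity(flops, bytes_moved):
    return flops / bytes_moved

# Vector add c = a + b over fp32: 1 FLOP per element, 12 bytes moved
ai_vec_add = arithmetic_intensity(1, 12)            # ~0.08 -> memory-bound
# N x N fp32 matmul with ideal reuse: 2*N^3 FLOPs, 3*N^2*4 bytes
N = 4096
ai_matmul = arithmetic_intensity(2 * N**3, 3 * N * N * 4)  # ~683 -> compute-bound
```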
If occupancy is the limiter, compile with --maxrregcount to cap registers per thread, or reduce shared memory per block.
Key Metrics to Check
| Metric | Good Value | What Low Values Mean |
|---|---|---|
| Memory Throughput | >80% of peak | Poor coalescing, cache thrashing |
| Occupancy | 50-100% | Too many registers or shared memory per thread |
| SM Efficiency | >80% | Warp divergence, load imbalance |
| L2 Hit Rate | Depends on kernel | Working set doesn't fit in cache |
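Register pressure's effect on occupancy can be estimated with a toy model. The per-SM limits below are illustrative placeholders, not tied to any specific architecture:

```python
# Toy occupancy estimate: resident threads are capped by the register file.
# Both per-SM limits are assumed for illustration; real values vary by GPU.
REGS_PER_SM = 65536
MAX_THREADS_PER_SM = 2048

def occupancy(regs_per_thread):
    threads = min(MAX_THREADS_PER_SM, REGS_PER_SM // regs_per_thread)
    return threads / MAX_THREADS_PER_SM

# 32 regs/thread: register budget allows all 2048 threads -> 1.0
# 128 regs/thread: only 512 threads fit -> 0.25 occupancy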
NaN, Inf, and Precision Loss
Numerical bugs are insidious—kernels run without errors but produce garbage. Understanding floating-point behavior is essential for GPU programming.
How NaN and Inf Propagate
Once NaN appears, it propagates through all subsequent operations. The key is finding where it first appears.
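A minimal NumPy demonstration of propagation, plus a cheap way to locate the first NaN instead of inspecting downstream garbage:

```python
import numpy as np

# sqrt of a negative input produces NaN, which survives later arithmetic.
x = np.array([1.0, -1.0, 4.0])
with np.errstate(invalid="ignore"):
    y = np.sqrt(x)        # [1.0, nan, 2.0]
z = y + 100.0             # NaN propagates through the addition

# Locate the first NaN: argmax on the boolean mask finds the first True.
first_bad = int(np.argmax(np.isnan(y)))   # index 1
```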
Common NaN Sources
| Operation | Dangerous Input | Result | Fix |
|---|---|---|---|
| sqrt(x) | x < 0 | NaN | sqrt(max(x, 0)) |
| log(x) | x ≤ 0 | -Inf or NaN | log(x + eps) |
| x / y | y = 0 | Inf or NaN | x / (y + eps) |
| exp(x) | x > 88 (FP32) | Inf | Clamp input or use log-space |
| Inf - Inf | Softmax overflow | NaN | Subtract max before exp() |
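The softmax row above reproduces in a few lines of NumPy; the stable variant applies the subtract-max fix:

```python
import numpy as np

# Naive softmax overflows for large logits: exp(1000) -> inf, inf/inf -> nan.
def softmax_naive(x):
    e = np.exp(x)
    return e / e.sum()

# Stable variant: subtracting the max makes the largest exponent exactly 0.
def softmax_stable(x):
    e = np.exp(x - x.max())
    return e / e.sum()

x = np.array([1000.0, 1001.0, 1002.0])
with np.errstate(over="ignore", invalid="ignore"):
    naive = softmax_naive(x)   # all NaN
stable = softmax_stable(x)     # well-behaved probabilities summing to 1
```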
Precision Comparison
| Format | Bits | Exponent/Mantissa | Range | Precision |
|---|---|---|---|---|
| FP32 | 32 | 8 / 23 | ±3.4×10³⁸ | ~7 decimal digits |
| FP16 | 16 | 5 / 10 | ±65,504 | ~3-4 digits |
| BF16 | 16 | 8 / 7 | ±3.4×10³⁸ | ~2-3 digits |
| FP8 (E4M3) | 8 | 4 / 3 | ±448 | ~2 digits |
FP16: Better precision but limited range (max ~65K). Overflow-prone in training. BF16: Same range as FP32, less precision. Generally safer for training. See Mixed Precision Training (Micikevicius et al., 2018).
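A quick NumPy illustration of the range difference (NumPy has no native BF16, so FP32 stands in for the wide-range formats):

```python
import numpy as np

# 300 * 300 = 90000 exceeds FP16's max of 65504 and overflows to inf.
with np.errstate(over="ignore"):
    a = np.float16(300.0) * np.float16(300.0)

# The same product is trivially in range for FP32 (and for BF16's range).
b = np.float32(300.0) * np.float32(300.0)
```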
Reproducibility
Non-deterministic results make debugging nightmares. GPU operations can vary between runs due to floating-point reassociation, atomics, and framework choices.
Sources of Non-Determinism
| Source | Why It's Non-Deterministic | How to Fix |
|---|---|---|
| Atomic operations | Order of accumulation varies | Use deterministic reductions or sorting |
| cuBLAS/cuDNN | May choose different algorithms per run | Set CUBLAS_WORKSPACE_CONFIG and torch.backends.cudnn.deterministic = True |
| Parallel reductions | FP addition is not associative | Use tree reduction with fixed order |
| Thread scheduling | Warp execution order varies | Ensure algorithm doesn't depend on order |
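Non-associativity of floating-point addition is easy to demonstrate on the CPU; here `math.fsum` stands in for a fixed-order, exact accumulation:

```python
import math

# Grouping changes the result: FP addition is not associative.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
different = a != b            # True: 0.6000000000000001 vs 0.6

# Left-to-right accumulation drifts; an exact/fixed-order sum does not.
vals = [0.1] * 10
drifted = sum(vals)           # not exactly 1.0
exact = math.fsum(vals)       # exactly 1.0
```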
PyTorch Determinism Settings
# PyTorch Reproducibility Guide
import torch
import os

# Set seeds
torch.manual_seed(42)
torch.cuda.manual_seed_all(42)

# Force deterministic algorithms
torch.use_deterministic_algorithms(True)

# Required for some cuBLAS operations
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

# cuDNN determinism
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
Deterministic mode can be significantly slower. Per PyTorch docs, some operations have no deterministic implementation and will raise errors. Use deterministic mode for debugging and validation, not production.
Nsight Deep Dive
NVIDIA provides two primary profiling tools: Nsight Systems for system-wide timeline analysis and Nsight Compute for detailed kernel analysis.
When to Use Each Tool
| Tool | Best For | Key Metrics |
|---|---|---|
| Nsight Systems | Finding bottlenecks, CPU-GPU overlap, timeline | Kernel duration, memory transfers, API calls, gaps |
| Nsight Compute | Optimizing specific kernels, roofline | Memory bandwidth, compute throughput, occupancy, stalls |
Nsight Compute Key Sections
GPU Speed of Light
Shows achieved percentage of peak compute and memory throughput. If both are low, the kernel has inefficiencies. If memory is high but compute is low, you're memory-bound.
Memory Workload Analysis
L1/L2 cache hit rates, global memory access patterns. Low hit rates suggest poor locality or working set doesn't fit.
Warp State Statistics
Shows why warps are stalled. Common reasons: waiting for memory (memory-bound), waiting for barrier (sync overhead), waiting for instruction fetch (code too large).
Source Correlation
Maps metrics back to source lines. Requires compiling with -lineinfo.
Shows which lines cause the most stalls or memory traffic.
Command Line Quick Start
# Capture timeline
nsys profile -o report python train.py

# View in GUI
nsys-ui report.nsys-rep

# Profile specific kernel
ncu --set full -o report python train.py

# Quick metrics only
ncu --metrics sm__throughput.avg.pct_of_peak_sustained_elapsed python train.py
Nsight Compute reruns kernels multiple times for accurate metrics, causing significant slowdown.
Profile representative workloads, not full training runs. Use --kernel-name to target specific kernels.
Citations & Further Reading
Official Documentation
- CUDA C++ Programming Guide - Floating Point: IEEE 754 compliance, precision guarantees (docs.nvidia.com/cuda)
- CUDA Runtime API - Error Types: complete list of CUDA error codes (docs.nvidia.com/cuda/cuda-runtime-api)
- Nsight Compute Documentation: kernel profiling and optimization guide (docs.nvidia.com/nsight-compute)
- Nsight Systems Documentation: system-wide profiling and timeline analysis (docs.nvidia.com/nsight-systems)
- PyTorch Reproducibility Guide: determinism settings and limitations (pytorch.org/docs)
- cuBLAS Reproducibility: CUBLAS_WORKSPACE_CONFIG and deterministic modes (docs.nvidia.com/cuda/cublas)
Papers
- Mixed Precision Training: Micikevicius et al., 2018. Loss scaling for FP16 training. (arXiv:1710.03740)