Chapter 6

Debugging & Profiling

Systematic approaches to finding and fixing GPU kernel issues. From cryptic error messages to silent numerical bugs—learn to diagnose problems efficiently.

What You'll Learn
  1. Diagnose common kernel errors from CUDA error messages
  2. Apply systematic debugging flowcharts for "wrong results" and "slow kernel" issues
  3. Detect and fix numerical issues (NaN, Inf, precision loss)
  4. Ensure deterministic execution for reproducibility
  5. Use Nsight tools to identify performance bottlenecks
01 - ERROR MESSAGES

Decoding CUDA Errors

CUDA errors can be cryptic. The CUDA Runtime API defines error codes that tell you what went wrong—if you know how to interpret them.

Common codes worth memorizing: cudaErrorIllegalAddress (out-of-bounds or invalid pointer access), cudaErrorInvalidConfiguration (grid/block dimensions exceed device limits), and cudaErrorMemoryAllocation (device out of memory).

Async Error Checking

CUDA kernel launches are asynchronous. Errors may not appear until you synchronize. Always check errors after cudaDeviceSynchronize() or cudaStreamSynchronize() to catch kernel failures.

Common Error Patterns

Out-of-Bounds Access
// Thread index can exceed the array length
__global__ void copy(float* output, const float* input, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    output[idx] = input[idx];  // No bounds check!
}
// Error: cudaErrorIllegalAddress
With Bounds Check
__global__ void copy(float* output, const float* input, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {  // Guard clause
        output[idx] = input[idx];
    }
}
A kernel that worked yesterday now gives "cudaErrorIllegalAddress". What's most likely?
  a. The GPU driver updated
  b. Input data changed size, causing out-of-bounds access
  c. The kernel is using too much shared memory

02 - WRONG RESULTS

"My Kernel Gives Wrong Results"

When a kernel runs but produces incorrect output, systematic diagnosis beats random guessing. Follow this flowchart to isolate the problem.

Does a small input (e.g., 32 elements) give correct results?
Test with trivially small inputs to isolate scaling issues.
YES → Problem is size-dependent
Check: loop bounds, grid/block dimensions, buffer sizes, index calculations at boundaries.
NO → Core logic is wrong
Check: algorithm implementation, operator precedence, off-by-one errors.
Does running with 1 thread give correct results?
Eliminate parallelism to test sequential correctness.
YES → Race condition or synchronization issue
Check: missing __syncthreads(), atomic operations needed, shared memory bank conflicts.
NO → Algorithm bug, not parallelism issue
Debug as you would sequential code. Print intermediate values.

Comparison Testing

The gold standard: compare your kernel output against a known-correct reference. PyTorch and NumPy provide reliable baselines.

Reference Implementation
# NumPy reference (always correct)
def layernorm_ref(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta
Comparison Code
# Compare with tolerance
ref = layernorm_ref(x_np, gamma_np, beta_np)
out = my_kernel(x_gpu, gamma_gpu, beta_gpu)
out_np = out.cpu().numpy()  # move to host before comparing

if not np.allclose(ref, out_np, rtol=1e-3, atol=1e-5):
    print("MISMATCH:", np.abs(ref - out_np).max())
Tolerance Selection

Different precisions require different tolerances. FP32 (IEEE 754 binary32) carries ~7 decimal digits of precision, FP16 ~3-4, and BF16 ~2-3. As a rule of thumb, use rtol=1e-3 for FP16 and rtol=1e-5 for FP32.
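NumPy's finfo reports each format's machine epsilon, a floor beneath which rtol cannot meaningfully go (a quick sketch; BF16 is not a native NumPy dtype):

```python
import numpy as np

# Machine epsilon: spacing between 1.0 and the next representable value.
# Pick rtol a couple of orders of magnitude above it.
eps32 = np.finfo(np.float32).eps   # 2**-23 ~ 1.19e-7 -> rtol=1e-5 has headroom
eps16 = np.finfo(np.float16).eps   # 2**-10 ~ 9.77e-4 -> rtol=1e-3 is already tight

print(f"FP32 eps={eps32:.2e}  FP16 eps={eps16:.2e}")
```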

Your kernel works for batch_size=1 but fails for batch_size=64. What should you check first?
  a. GPU temperature
  b. Grid/block dimensions and index calculations
  c. CUDA driver version
  d. CPU memory usage

03 - SLOW KERNELS

"My Kernel is Slow"

Before optimizing, you must identify the bottleneck. Is your kernel memory-bound or compute-bound?

Memory-Bound vs Compute-Bound

H100 memory bandwidth: 3.35 TB/s
H100 FP16 Tensor Core throughput: 1,979 TFLOPS
Arithmetic intensity threshold: ~590 FLOP/byte (FP16)

Arithmetic intensity = FLOPs / Bytes loaded. If your kernel's intensity is below the threshold, you're memory-bound—no amount of compute optimization helps.
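The intensity check can be sketched in a few lines of Python (peak numbers are the H100 figures quoted above; the byte counts are idealized, ignoring caching):

```python
# Peak FP16 compute and memory bandwidth for H100 (figures quoted above)
PEAK_FLOPS = 1979e12               # FLOP/s
PEAK_BW = 3.35e12                  # bytes/s
THRESHOLD = PEAK_FLOPS / PEAK_BW   # ~590 FLOP/byte

def classify(flops, bytes_moved):
    """Label a kernel by comparing its arithmetic intensity to the ridge point."""
    intensity = flops / bytes_moved
    return "compute-bound" if intensity > THRESHOLD else "memory-bound"

# Elementwise FP16 add: 1 FLOP per element, 6 bytes (two 2-byte reads, one write)
n = 1 << 20
print(classify(n, 6 * n))                 # memory-bound

# Square FP16 matmul: 2*m^3 FLOPs over three 2*m^2-byte matrices (idealized)
m = 4096
print(classify(2 * m**3, 3 * 2 * m**2))   # compute-bound
```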

Is achieved memory bandwidth close to peak?
Check with Nsight Compute. H100 peak: 3.35 TB/s
YES (>80% peak) → Memory-bound, well-optimized
You've hit the limit. Reduce memory traffic via fusion, caching, or algorithmic changes.
NO (<50% peak) → Memory access pattern issue
Check: coalescing, bank conflicts, L2 cache utilization, TLB misses.
Is occupancy reasonable?
Check warps per SM. Low occupancy → not enough parallelism to hide latency.
Low occupancy → Check register and shared memory usage
Use --maxrregcount to limit registers. Reduce shared memory per block.

Key Metrics to Check

Metric | Good Value | What Low Values Mean
Memory Throughput | >80% of peak | Poor coalescing, cache thrashing
Occupancy | 50-100% | Too many registers or shared memory per thread
SM Efficiency | >80% | Warp divergence, load imbalance
L2 Hit Rate | Depends on kernel | Working set doesn't fit in cache
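The register and shared-memory limits behind the occupancy row can be sanity-checked with a toy model (the per-SM limits below are illustrative, roughly Hopper-like assumptions; it ignores the blocks-per-SM cap and allocation granularity):

```python
# Assumed per-SM resource limits (illustrative, roughly Hopper-like)
MAX_WARPS_PER_SM = 64
REGS_PER_SM = 65536
SMEM_PER_SM = 228 * 1024

def occupancy(regs_per_thread, smem_per_block, threads_per_block):
    """Fraction of the SM's warp slots that can be resident at once."""
    warps_per_block = threads_per_block // 32
    blocks_by_regs = REGS_PER_SM // (regs_per_thread * threads_per_block)
    blocks_by_smem = (SMEM_PER_SM // smem_per_block
                      if smem_per_block else float("inf"))
    blocks = min(blocks_by_regs, blocks_by_smem)
    return min(blocks * warps_per_block, MAX_WARPS_PER_SM) / MAX_WARPS_PER_SM

print(occupancy(regs_per_thread=32, smem_per_block=0, threads_per_block=256))
# 1.0 -- nothing limits residency
print(occupancy(regs_per_thread=128, smem_per_block=48 * 1024,
                threads_per_block=256))
# 0.25 -- register-limited to 2 blocks per SM
```

Nsight Compute's Occupancy section reports the real limiter; this sketch just shows why heavy register or shared-memory use caps resident warps.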
Your kernel achieves 95% of peak memory bandwidth but is still "slow". What does this mean?
  a. There's a bug in your kernel
  b. The kernel is memory-bound; algorithmic changes are needed to go faster
  c. You should increase occupancy

04 - NUMERICAL ISSUES

NaN, Inf, and Precision Loss

Numerical bugs are insidious—kernels run without errors but produce garbage. Understanding floating-point behavior is essential for GPU programming.

How NaN and Inf Propagate

Input: 1e38  →  after x²: Inf  →  after Inf − Inf: NaN  →  everything after: NaN

Once NaN appears, it propagates through all subsequent operations. The key is finding where it first appears.
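The chain above reproduces directly in NumPy:

```python
import numpy as np

with np.errstate(over="ignore", invalid="ignore"):
    x = np.float32(1e38)
    sq = x * x                       # 1e76 overflows FP32's ~3.4e38 max -> inf
    diff = sq - sq                   # inf - inf -> nan
    downstream = diff * 0.0 + 1.0    # nan poisons every later operation

print(sq, diff, downstream)   # inf nan nan
```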

Common NaN Sources

Operation | Dangerous Input | Result | Fix
sqrt(x) | x < 0 | NaN | sqrt(max(x, 0))
log(x) | x ≤ 0 | -Inf or NaN | log(x + eps)
x / y | y = 0 | Inf or NaN | x / (y + eps)
exp(x) | x > 88 (FP32) | Inf | Clamp input or use log-space
Inf - Inf | Softmax overflow | NaN | Subtract max before exp()
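The last row is worth seeing concretely; a minimal NumPy softmax shows why subtracting the max fixes the overflow:

```python
import numpy as np

def softmax_naive(x):
    e = np.exp(x)        # overflows to inf for x > ~88 in FP32
    return e / e.sum()   # inf / inf -> nan

def softmax_safe(x):
    z = x - x.max()      # largest exponent becomes exp(0) = 1
    e = np.exp(z)
    return e / e.sum()

x = np.array([1000.0, 1000.0], dtype=np.float32)
with np.errstate(over="ignore", invalid="ignore"):
    print(softmax_naive(x))   # [nan nan]
print(softmax_safe(x))        # [0.5 0.5]
```

Subtracting the max changes nothing mathematically (it cancels in the ratio) but keeps every exponent ≤ 0.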

Precision Comparison

Format | Bits | Exponent/Mantissa | Range | Precision
FP32 | 32 | 8 / 23 | ±3.4×10³⁸ | ~7 decimal digits
FP16 | 16 | 5 / 10 | ±65,504 | ~3-4 digits
BF16 | 16 | 8 / 7 | ±3.4×10³⁸ | ~2-3 digits
FP8 (E4M3) | 8 | 4 / 3 | ±448 | ~2 digits
FP16 vs BF16 Trade-off

FP16: Better precision but limited range (max ~65K). Overflow-prone in training. BF16: Same range as FP32, less precision. Generally safer for training. See Mixed Precision Training (Micikevicius et al., 2018).
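A quick NumPy illustration of both failure modes (NumPy has no native bfloat16; ml_dtypes or PyTorch provide one, so only FP16 is shown here):

```python
import numpy as np

# Range: FP16 tops out at 65504; anything larger becomes inf
with np.errstate(over="ignore"):
    print(np.float16(65504.0))   # 65504.0 -- the representable max
    print(np.float16(70000.0))   # inf

# Precision: increments below FP16's epsilon (~9.8e-4) vanish near 1.0
print(np.float16(1.0) + np.float16(0.0004))   # 1.0
```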

Softmax produces NaN. What's the most likely cause?
  a. Division by zero in the denominator
  b. exp() overflow followed by Inf - Inf
  c. Input contains negative numbers
  d. Wrong data type

05 - DETERMINISM

Reproducibility

Non-deterministic results make debugging nightmares. GPU operations can vary between runs due to floating-point reassociation, atomics, and framework choices.

Sources of Non-Determinism

Source | Why It's Non-Deterministic | How to Fix
Atomic operations | Order of accumulation varies | Use deterministic reductions or sorting
cuBLAS/cuDNN | May choose different algorithms per run | Set CUBLAS_WORKSPACE_CONFIG and torch.backends.cudnn.deterministic = True
Parallel reductions | FP addition is not associative | Use tree reduction with fixed order
Thread scheduling | Warp execution order varies | Ensure algorithm doesn't depend on order
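The reduction row is the subtle one: floating-point addition loses low-order bits depending on grouping, which a three-term FP32 sum demonstrates:

```python
import numpy as np

a, b, c = np.float32(1e8), np.float32(-1e8), np.float32(1.0)

left = (a + b) + c    # (0) + 1 = 1.0
right = a + (b + c)   # -1e8 + 1 rounds back to -1e8 (ulp is 8 there), so 0.0

print(left, right)    # 1.0 0.0
```

An atomic or tree reduction that sums these in different orders across runs will produce different totals, even with identical inputs.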

PyTorch Determinism Settings

Enable Deterministic Mode
# PyTorch Reproducibility Guide
import torch
import os

# Set seeds
torch.manual_seed(42)
torch.cuda.manual_seed_all(42)

# Force deterministic algorithms
torch.use_deterministic_algorithms(True)

# Required for some cuBLAS operations
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

# cuDNN determinism
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
Performance Cost

Deterministic mode can be significantly slower. Per PyTorch docs, some operations have no deterministic implementation and will raise errors. Use deterministic mode for debugging and validation, not production.

Why is floating-point addition non-associative?
  a. It's a GPU hardware bug
  b. Rounding errors accumulate differently based on order
  c. CUDA uses different addition algorithms

06 - PROFILING TOOLS

Nsight Deep Dive

NVIDIA provides two primary profiling tools: Nsight Systems for system-wide timeline analysis and Nsight Compute for detailed kernel analysis.

When to Use Each Tool

Tool | Best For | Key Metrics
Nsight Systems | Finding bottlenecks, CPU-GPU overlap, timeline view | Kernel duration, memory transfers, API calls, gaps
Nsight Compute | Optimizing specific kernels, roofline analysis | Memory bandwidth, compute throughput, occupancy, stalls

Nsight Compute Key Sections

GPU Speed of Light

Shows achieved percentage of peak compute and memory throughput. If both are low, the kernel has inefficiencies. If memory is high but compute is low, you're memory-bound.

Memory Workload Analysis

L1/L2 cache hit rates, global memory access patterns. Low hit rates suggest poor locality or working set doesn't fit.

Warp State Statistics

Shows why warps are stalled. Common reasons: waiting for memory (memory-bound), waiting for barrier (sync overhead), waiting for instruction fetch (code too large).

Source Correlation

Maps metrics back to source lines. Requires compiling with -lineinfo. Shows which lines cause the most stalls or memory traffic.

Command Line Quick Start

Nsight Systems
# Capture timeline
nsys profile -o report python train.py

# View in GUI
nsys-ui report.nsys-rep
Nsight Compute
# Profile specific kernel
ncu --set full -o report python train.py

# Quick metrics only
ncu --metrics sm__throughput.avg.pct_of_peak_sustained_elapsed python train.py
Profiling Overhead

Nsight Compute reruns kernels multiple times for accurate metrics, causing significant slowdown. Profile representative workloads, not full training runs. Use --kernel-name to target specific kernels.


REFERENCES

Citations & Further Reading

Official Documentation

  1. CUDA C++ Programming Guide - Floating Point
    IEEE 754 compliance, precision guarantees
    docs.nvidia.com/cuda
  2. CUDA Runtime API - Error Types
    Complete list of CUDA error codes
    docs.nvidia.com/cuda/cuda-runtime-api
  3. Nsight Compute Documentation
    Kernel profiling and optimization guide
    docs.nvidia.com/nsight-compute
  4. Nsight Systems Documentation
    System-wide profiling and timeline analysis
    docs.nvidia.com/nsight-systems
  5. PyTorch Reproducibility Guide
    Determinism settings and limitations
    pytorch.org/docs
  6. cuBLAS Reproducibility
    CUBLAS_WORKSPACE_CONFIG and deterministic modes
    docs.nvidia.com/cuda/cublas

Papers

  1. Mixed Precision Training
    Micikevicius et al., 2018. Loss scaling for FP16 training.
    arXiv:1710.03740
All material licensed under CC BY-NC-SA 4.0