Chapter 6

Debugging & Profiling

Systematic approaches to finding and fixing GPU kernel issues. From cryptic error messages to silent numerical bugs—learn to diagnose problems efficiently.

What You'll Learn
  1. Diagnose common kernel errors from CUDA error messages
  2. Apply systematic debugging flowcharts for "wrong results" and "slow kernel" issues
  3. Detect and fix numerical issues (NaN, Inf, precision loss)
  4. Ensure deterministic execution for reproducibility
  5. Use Nsight tools to identify performance bottlenecks
01 - ERROR MESSAGES

Decoding CUDA Errors

CUDA errors can be cryptic. The CUDA Runtime API defines error codes that tell you what went wrong—if you know how to interpret them.

Common codes worth memorizing: cudaErrorIllegalAddress (out-of-bounds or invalid pointer access), cudaErrorInvalidConfiguration (grid/block dimensions exceed device limits), and cudaErrorMemoryAllocation (device out of memory).

Async Error Checking

CUDA kernel launches are asynchronous. Errors may not appear until you synchronize. Always check errors after cudaDeviceSynchronize() or cudaStreamSynchronize() to catch kernel failures.

Common Error Patterns

Out-of-Bounds Access
// Thread index can exceed the array length
__global__ void copy(float* output, const float* input, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    output[idx] = input[idx];  // No bounds check!
}
// Error: cudaErrorIllegalAddress
With Bounds Check
__global__ void copy(float* output, const float* input, int N) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) {  // Guard clause
        output[idx] = input[idx];
    }
}
A kernel that worked yesterday now gives "cudaErrorIllegalAddress". What's most likely?
  a. The GPU driver updated
  b. Input data changed size, causing out-of-bounds access
  c. The kernel is using too much shared memory

02 - WRONG RESULTS

"My Kernel Gives Wrong Results"

When a kernel runs but produces incorrect output, systematic diagnosis beats random guessing. Follow this flowchart to isolate the problem.

Does a small input (e.g., 32 elements) give correct results?
Test with trivially small inputs to isolate scaling issues.
YES → Problem is size-dependent
Check: loop bounds, grid/block dimensions, buffer sizes, index calculations at boundaries.
NO → Core logic is wrong
Check: algorithm implementation, operator precedence, off-by-one errors.
Does running with 1 thread give correct results?
Eliminate parallelism to test sequential correctness.
YES → Race condition or synchronization issue
Check: missing __syncthreads(), atomic operations needed, shared memory bank conflicts.
NO → Algorithm bug, not parallelism issue
Debug as you would sequential code. Print intermediate values.

Comparison Testing

The gold standard: compare your kernel output against a known-correct reference. PyTorch and NumPy provide reliable baselines.

Reference Implementation
# NumPy reference (always correct)
def layernorm_ref(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta
Comparison Code
# Compare with tolerance
ref = layernorm_ref(x_np, gamma_np, beta_np)
out = my_kernel(x_gpu, gamma_gpu, beta_gpu)
out_np = out.cpu().numpy()  # move to host before comparing

if not np.allclose(ref, out_np, rtol=1e-3, atol=1e-5):
    print("MISMATCH:", np.abs(ref - out_np).max())
Tolerance Selection

Different precisions require different tolerances. FP32 (IEEE 754 binary32) carries ~7 decimal digits of precision, FP16 ~3-4, and BF16 ~2-3. As a rule of thumb, use rtol=1e-3 for FP16 and rtol=1e-5 for FP32.
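NumPy's finfo reports each format's machine epsilon, a floor beneath which rtol cannot meaningfully go (a quick sketch; BF16 is not a native NumPy dtype):

```python
import numpy as np

# Machine epsilon: spacing between 1.0 and the next representable value.
# Pick rtol a couple of orders of magnitude above it.
eps32 = np.finfo(np.float32).eps   # 2**-23 ~ 1.19e-7 -> rtol=1e-5 has headroom
eps16 = np.finfo(np.float16).eps   # 2**-10 ~ 9.77e-4 -> rtol=1e-3 is already tight

print(f"FP32 eps={eps32:.2e}  FP16 eps={eps16:.2e}")
```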

Your kernel works for batch_size=1 but fails for batch_size=64. What should you check first?
  a. GPU temperature
  b. Grid/block dimensions and index calculations
  c. CUDA driver version
  d. CPU memory usage

03 - SLOW KERNELS

"My Kernel is Slow"

Before optimizing, you must identify the bottleneck. Is your kernel memory-bound or compute-bound?

Memory-Bound vs Compute-Bound

H100 memory bandwidth: 3.35 TB/s
H100 FP16 Tensor Core throughput: 1,979 TFLOPS
Arithmetic intensity threshold: ~590 FLOP/byte (FP16)

Arithmetic intensity = FLOPs / Bytes loaded. If your kernel's intensity is below the threshold, you're memory-bound—no amount of compute optimization helps.
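The intensity check can be sketched in a few lines of Python (peak numbers are the H100 figures quoted above; the byte counts are idealized, ignoring caching):

```python
# Peak FP16 compute and memory bandwidth for H100 (figures quoted above)
PEAK_FLOPS = 1979e12               # FLOP/s
PEAK_BW = 3.35e12                  # bytes/s
THRESHOLD = PEAK_FLOPS / PEAK_BW   # ~590 FLOP/byte

def classify(flops, bytes_moved):
    """Label a kernel by comparing its arithmetic intensity to the ridge point."""
    intensity = flops / bytes_moved
    return "compute-bound" if intensity > THRESHOLD else "memory-bound"

# Elementwise FP16 add: 1 FLOP per element, 6 bytes (two 2-byte reads, one write)
n = 1 << 20
print(classify(n, 6 * n))                 # memory-bound

# Square FP16 matmul: 2*m^3 FLOPs over three 2*m^2-byte matrices (idealized)
m = 4096
print(classify(2 * m**3, 3 * 2 * m**2))   # compute-bound
```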

Is achieved memory bandwidth close to peak?
Check with Nsight Compute. H100 peak: 3.35 TB/s
YES (>80% peak) → Memory-bound, well-optimized
You've hit the limit. Reduce memory traffic via fusion, caching, or algorithmic changes.
NO (<50% peak) → Memory access pattern issue
Check: coalescing, bank conflicts, L2 cache utilization, TLB misses.
Is occupancy reasonable?
Check warps per SM. Low occupancy → not enough parallelism to hide latency.
Low occupancy → Check register and shared memory usage
Use --maxrregcount to limit registers. Reduce shared memory per block.

Key Metrics to Check

Metric | Good Value | What Low Values Mean
Memory Throughput | >80% of peak | Poor coalescing, cache thrashing
Occupancy | 50-100% | Too many registers or shared memory per thread
SM Efficiency | >80% | Warp divergence, load imbalance
L2 Hit Rate | Depends on kernel | Working set doesn't fit in cache
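The register and shared-memory limits behind the occupancy row can be sanity-checked with a toy model (the per-SM limits below are illustrative, roughly Hopper-like assumptions; it ignores the blocks-per-SM cap and allocation granularity):

```python
# Assumed per-SM resource limits (illustrative, roughly Hopper-like)
MAX_WARPS_PER_SM = 64
REGS_PER_SM = 65536
SMEM_PER_SM = 228 * 1024

def occupancy(regs_per_thread, smem_per_block, threads_per_block):
    """Fraction of the SM's warp slots that can be resident at once."""
    warps_per_block = threads_per_block // 32
    blocks_by_regs = REGS_PER_SM // (regs_per_thread * threads_per_block)
    blocks_by_smem = (SMEM_PER_SM // smem_per_block
                      if smem_per_block else float("inf"))
    blocks = min(blocks_by_regs, blocks_by_smem)
    return min(blocks * warps_per_block, MAX_WARPS_PER_SM) / MAX_WARPS_PER_SM

print(occupancy(regs_per_thread=32, smem_per_block=0, threads_per_block=256))
# 1.0 -- nothing limits residency
print(occupancy(regs_per_thread=128, smem_per_block=48 * 1024,
                threads_per_block=256))
# 0.25 -- register-limited to 2 blocks per SM
```

Nsight Compute's Occupancy section reports the real limiter; this sketch just shows why heavy register or shared-memory use caps resident warps.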
Your kernel achieves 95% of peak memory bandwidth but is still "slow". What does this mean?
  a. There's a bug in your kernel
  b. The kernel is memory-bound; algorithmic changes are needed to go faster
  c. You should increase occupancy

04 - NUMERICAL ISSUES

NaN, Inf, and Precision Loss

Numerical bugs are insidious—kernels run without errors but produce garbage. Understanding floating-point behavior is essential for GPU programming.

How NaN and Inf Propagate

Input: 1e38  →  after x²: Inf  →  after Inf − Inf: NaN  →  everything after: NaN

Once NaN appears, it propagates through all subsequent operations. The key is finding where it first appears.
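The chain above reproduces directly in NumPy:

```python
import numpy as np

with np.errstate(over="ignore", invalid="ignore"):
    x = np.float32(1e38)
    sq = x * x                       # 1e76 overflows FP32's ~3.4e38 max -> inf
    diff = sq - sq                   # inf - inf -> nan
    downstream = diff * 0.0 + 1.0    # nan poisons every later operation

print(sq, diff, downstream)   # inf nan nan
```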

Common NaN Sources

Operation | Dangerous Input | Result | Fix
sqrt(x) | x < 0 | NaN | sqrt(max(x, 0))
log(x) | x ≤ 0 | -Inf or NaN | log(x + eps)
x / y | y = 0 | Inf or NaN | x / (y + eps)
exp(x) | x > 88 (FP32) | Inf | Clamp input or use log-space
Inf - Inf | Softmax overflow | NaN | Subtract max before exp()
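The last row is worth seeing concretely; a minimal NumPy softmax shows why subtracting the max fixes the overflow:

```python
import numpy as np

def softmax_naive(x):
    e = np.exp(x)        # overflows to inf for x > ~88 in FP32
    return e / e.sum()   # inf / inf -> nan

def softmax_safe(x):
    z = x - x.max()      # largest exponent becomes exp(0) = 1
    e = np.exp(z)
    return e / e.sum()

x = np.array([1000.0, 1000.0], dtype=np.float32)
with np.errstate(over="ignore", invalid="ignore"):
    print(softmax_naive(x))   # [nan nan]
print(softmax_safe(x))        # [0.5 0.5]
```

Subtracting the max changes nothing mathematically (it cancels in the ratio) but keeps every exponent ≤ 0.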

Precision Comparison

Format | Bits | Exponent/Mantissa | Range | Precision
FP32 | 32 | 8 / 23 | ±3.4×10³⁸ | ~7 decimal digits
FP16 | 16 | 5 / 10 | ±65,504 | ~3-4 digits
BF16 | 16 | 8 / 7 | ±3.4×10³⁸ | ~2-3 digits
FP8 (E4M3) | 8 | 4 / 3 | ±448 | ~2 digits
FP16 vs BF16 Trade-off

FP16: Better precision but limited range (max ~65K). Overflow-prone in training. BF16: Same range as FP32, less precision. Generally safer for training. See Mixed Precision Training (Micikevicius et al., 2018).
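A quick NumPy illustration of both failure modes (NumPy has no native bfloat16; ml_dtypes or PyTorch provide one, so only FP16 is shown here):

```python
import numpy as np

# Range: FP16 tops out at 65504; anything larger becomes inf
with np.errstate(over="ignore"):
    print(np.float16(65504.0))   # 65504.0 -- the representable max
    print(np.float16(70000.0))   # inf

# Precision: increments below FP16's epsilon (~9.8e-4) vanish near 1.0
print(np.float16(1.0) + np.float16(0.0004))   # 1.0
```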

Softmax produces NaN. What's the most likely cause?
  a. Division by zero in the denominator
  b. exp() overflow followed by Inf - Inf
  c. Input contains negative numbers
  d. Wrong data type

05 - DETERMINISM

Reproducibility

Non-deterministic results make debugging nightmares. GPU operations can vary between runs due to floating-point reassociation, atomics, and framework choices.

Sources of Non-Determinism

Source | Why It's Non-Deterministic | How to Fix
Atomic operations | Order of accumulation varies | Use deterministic reductions or sorting
cuBLAS/cuDNN | May choose different algorithms per run | Set CUBLAS_WORKSPACE_CONFIG and torch.backends.cudnn.deterministic = True
Parallel reductions | FP addition is not associative | Use tree reduction with fixed order
Thread scheduling | Warp execution order varies | Ensure algorithm doesn't depend on order
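The reduction row is the subtle one: floating-point addition loses low-order bits depending on grouping, which a three-term FP32 sum demonstrates:

```python
import numpy as np

a, b, c = np.float32(1e8), np.float32(-1e8), np.float32(1.0)

left = (a + b) + c    # (0) + 1 = 1.0
right = a + (b + c)   # -1e8 + 1 rounds back to -1e8 (ulp is 8 there), so 0.0

print(left, right)    # 1.0 0.0
```

An atomic or tree reduction that sums these in different orders across runs will produce different totals, even with identical inputs.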

PyTorch Determinism Settings

Enable Deterministic Mode
# PyTorch Reproducibility Guide
import torch
import os

# Set seeds
torch.manual_seed(42)
torch.cuda.manual_seed_all(42)

# Force deterministic algorithms
torch.use_deterministic_algorithms(True)

# Required for some cuBLAS operations
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

# cuDNN determinism
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
Performance Cost

Deterministic mode can be significantly slower. Per PyTorch docs, some operations have no deterministic implementation and will raise errors. Use deterministic mode for debugging and validation, not production.

Why is floating-point addition non-associative?
  a. It's a GPU hardware bug
  b. Rounding errors accumulate differently based on order
  c. CUDA uses different addition algorithms

06 - PROFILING TOOLS

Nsight Deep Dive

NVIDIA provides two primary profiling tools: Nsight Systems for system-wide timeline analysis and Nsight Compute for detailed kernel analysis.

When to Use Each Tool

Tool | Best For | Key Metrics
Nsight Systems | Finding bottlenecks, CPU-GPU overlap, timeline view | Kernel duration, memory transfers, API calls, gaps
Nsight Compute | Optimizing specific kernels, roofline analysis | Memory bandwidth, compute throughput, occupancy, stalls

Nsight Compute Key Sections

GPU Speed of Light

Shows achieved percentage of peak compute and memory throughput. If both are low, the kernel has inefficiencies. If memory is high but compute is low, you're memory-bound.

Memory Workload Analysis

L1/L2 cache hit rates, global memory access patterns. Low hit rates suggest poor locality or working set doesn't fit.

Warp State Statistics

Shows why warps are stalled. Common reasons: waiting for memory (memory-bound), waiting for barrier (sync overhead), waiting for instruction fetch (code too large).

Source Correlation

Maps metrics back to source lines. Requires compiling with -lineinfo. Shows which lines cause the most stalls or memory traffic.

Command Line Quick Start

Nsight Systems
# Capture timeline
nsys profile -o report python train.py

# View in GUI
nsys-ui report.nsys-rep
Nsight Compute
# Profile specific kernel
ncu --set full -o report python train.py

# Quick metrics only
ncu --metrics sm__throughput.avg.pct_of_peak_sustained_elapsed python train.py
Profiling Overhead

Nsight Compute reruns kernels multiple times for accurate metrics, causing significant slowdown. Profile representative workloads, not full training runs. Use --kernel-name to target specific kernels.


REFERENCES

Citations & Further Reading

Official Documentation

  1. CUDA C++ Programming Guide - Floating Point
    IEEE 754 compliance, precision guarantees
    docs.nvidia.com/cuda
  2. CUDA Runtime API - Error Types
    Complete list of CUDA error codes
    docs.nvidia.com/cuda/cuda-runtime-api
  3. Nsight Compute Documentation
    Kernel profiling and optimization guide
    docs.nvidia.com/nsight-compute
  4. Nsight Systems Documentation
    System-wide profiling and timeline analysis
    docs.nvidia.com/nsight-systems
  5. PyTorch Reproducibility Guide
    Determinism settings and limitations
    pytorch.org/docs
  6. cuBLAS Reproducibility
    CUBLAS_WORKSPACE_CONFIG and deterministic modes
    docs.nvidia.com/cuda/cublas

Papers

  1. Mixed Precision Training
    Micikevicius et al., 2018. Loss scaling for FP16 training.
    arXiv:1710.03740
All material licensed under CC BY-NC-SA 4.0