Optimization
Push your kernels to 80%+ of peak performance. Profiling, bank conflicts, Tensor Cores, and TMA—the techniques that separate good code from great code.
- Use Nsight Compute to identify kernel bottlenecks
- Apply tiling to improve cache utilization
- Optimize memory access patterns for coalescing
- Balance occupancy against per-thread resources
- Achieve >50% of theoretical memory bandwidth
This chapter assumes you've written tiled kernels (Chapter 3: First Kernels) and understand the memory hierarchy (Chapter 2: Memory Hierarchy).
Part 2: Optimization Labs - Profiling through optimized GEMM
Measure Before You Optimize
The first rule of optimization: don't guess. Use Nsight Compute to understand where time is actually spent.
# Profile a kernel
ncu --set full -o profile python my_kernel.py
# Key metrics to check:
# - Memory throughput (% of peak)
# - Compute throughput (% of peak)
# - Occupancy (warps/SM)
# - L1/L2 hit rates
# - Bank conflicts
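The "% of peak" metrics above reduce to simple arithmetic. As a sanity check on what ncu reports, you can compute achieved bandwidth yourself from bytes moved and runtime. The numbers below are hypothetical placeholders; substitute values from your own profile.

```python
# Back-of-envelope check: what fraction of peak bandwidth did a kernel hit?
# All inputs are hypothetical; substitute values from your ncu report.

def bandwidth_utilization(bytes_moved: int, runtime_s: float, peak_gbps: float) -> float:
    """Return achieved bandwidth as a fraction of theoretical peak."""
    achieved_gbps = bytes_moved / runtime_s / 1e9
    return achieved_gbps / peak_gbps

# Example: a vector add reads two arrays and writes one, each 256 MiB
bytes_moved = 3 * 256 * 1024 * 1024
runtime_s = 300e-6        # 300 microseconds (assumed measurement)
peak_gbps = 3350.0        # H100 SXM HBM3 peak, ~3.35 TB/s

util = bandwidth_utilization(bytes_moved, runtime_s, peak_gbps)
print(f"{util:.1%} of peak")  # -> 80.1% of peak
```

If this number disagrees wildly with ncu's memory throughput metric, check whether your kernel moves more bytes than the algorithm strictly requires (redundant loads, poor cache reuse).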
Common Bottlenecks
| Symptom | Likely Cause | Fix |
|---|---|---|
| High memory latency | Uncoalesced access | Fix access patterns |
| Low occupancy | Too many registers | Reduce register pressure |
| SMEM conflicts | Bank conflicts | Pad arrays, change stride |
| Low compute % | Memory bound | Increase data reuse |
If you're at 80% of theoretical peak, you're done. The last 20% requires heroic effort and architecture-specific tricks. Ship it and move on.
Shared Memory Banks
Shared memory is divided into 32 banks. Each bank can serve one address per cycle. When multiple threads in a warp access different addresses in the same bank, accesses serialize—this is a bank conflict.
Bank = (address / 4) % 32. Consecutive 4-byte words map to consecutive banks.
Conflict-Free
// Each thread hits different bank
smem[threadIdx.x]
// All 32 accesses parallel
N-way Conflict
// Stride of 32 = same bank!
smem[threadIdx.x * 32]
// Serialized: 32x slower
If multiple threads read the same address, it's a broadcast—no conflict. Conflicts only occur when threads access different addresses in the same bank.
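The bank rule and the broadcast exception can be checked on paper with a few lines of Python. This is a sketch of the mapping above (bank = (address / 4) % 32), counting only distinct addresses per bank since broadcasts are free:

```python
# Sketch of the bank mapping rule: bank = (byte_address / 4) % 32.
# Reports the worst-case serialization degree for a warp's access pattern.

NUM_BANKS = 32
WORD_BYTES = 4

def bank_of(byte_addr: int) -> int:
    return (byte_addr // WORD_BYTES) % NUM_BANKS

def conflict_degree(word_indices) -> int:
    """Max number of *distinct* addresses hitting one bank (broadcasts don't count)."""
    per_bank = {}
    for idx in word_indices:
        per_bank.setdefault(bank_of(idx * WORD_BYTES), set()).add(idx)
    return max(len(addrs) for addrs in per_bank.values())

warp = range(32)
print(conflict_degree([t for t in warp]))       # smem[threadIdx.x]      -> 1 (conflict-free)
print(conflict_degree([t * 32 for t in warp]))  # smem[threadIdx.x * 32] -> 32 (serialized)
print(conflict_degree([0 for t in warp]))       # same address           -> 1 (broadcast)
print(conflict_degree([t * 33 for t in warp]))  # padded stride of 33    -> 1 (conflict-free)
```

The last case shows why padding works: a stride of 33 words walks through all 32 banks because 33 mod 32 = 1.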
Matrix Multiply-Accumulate Units
Tensor Cores are specialized hardware for matrix operations. They perform matrix multiply-accumulate (MMA) operations on small tiles in a single cycle—8-16x faster than CUDA cores for compatible workloads.
Tensor Core Specs (H100 SXM)
| Format | Shape (M×N×K) | Peak TFLOPS (dense) |
|---|---|---|
| FP16 | 16×8×16 | ~990 |
| BF16 | 16×8×16 | ~990 |
| FP8 | 16×8×32 | ~1979 |
| INT8 | 16×8×32 | ~1979 |
Source: NVIDIA H100 Datasheet. MMA shapes from PTX ISA.
# Triton automatically uses Tensor Cores when:
# 1. Matrix dimensions align to MMA shapes
# 2. Data types are FP16, BF16, FP8, or INT8
# 3. Using tl.dot() operation
@triton.jit
def matmul_kernel(...):
    # This uses Tensor Cores automatically
    acc = tl.dot(a, b, acc)  # MMA operation
Tensor Core performance requires alignment. Tile sizes should be multiples of MMA shapes (16, 32, etc.). Misaligned tiles fall back to slower CUDA core execution.
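A common fix for misaligned shapes is to pad matrix dimensions up to the next MMA-friendly multiple before launching, then mask the padding inside the kernel. A minimal helper, assuming 16 as the alignment target (the M dimension of the MMA shapes above):

```python
# Hypothetical helper: round a matrix dimension up to a Tensor Core friendly
# multiple. Padded elements are masked or zero-filled inside the kernel.

def pad_to_multiple(n: int, multiple: int = 16) -> int:
    """Smallest value >= n that is a multiple of `multiple`."""
    return ((n + multiple - 1) // multiple) * multiple

# M=1000 and K=250 are not MMA-aligned; pad before launching
print(pad_to_multiple(1000))  # -> 1008
print(pad_to_multiple(250))   # -> 256
print(pad_to_multiple(256))   # -> 256 (already aligned)
```

The extra FLOPs on padding are almost always cheaper than falling back to CUDA core execution.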
Tensor Memory Accelerator
Hopper introduced TMA—dedicated hardware for bulk data movement between HBM and shared memory. TMA offloads address computation from SMs, enabling asynchronous tile transfers.
Pipeline: global tensors in HBM → TMA unit performs address generation and the transfer → tiles land in shared memory, ready for compute.
TMA vs Manual Loads
| Aspect | Manual (LDG/STG) | TMA |
|---|---|---|
| Address computation | SM cycles | TMA unit (free) |
| Tile dimensions | Manual indexing | Descriptor-based |
| Multicast | Not available | Built in; one load can feed multiple SMs |
| Async | cp.async | Native, with barriers |
TMA shines for structured tile access patterns (GEMM, convolution, attention). For irregular access, manual loads may still be necessary. CuTe and Triton abstract TMA through their copy operations.
Optimization Priority Order
Memory optimization follows a hierarchy of impact. Address these in order:
Memory Coalescing
Adjacent threads access adjacent addresses. Single biggest impact on bandwidth.
Data Reuse (Tiling)
Load once from HBM, use many times from SMEM. Transforms memory-bound to compute-bound.
Occupancy
Enough warps to hide latency. Balance registers/SMEM against parallelism.
Bank Conflicts
Avoid same-bank access in SMEM. Pad arrays or use different strides.
Tensor Cores + TMA
Hardware acceleration. Requires proper alignment and data types.
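To decide where your kernel sits in this hierarchy, a rough roofline check helps: compare the kernel's arithmetic intensity (FLOPs per byte of HBM traffic) against the machine's balance point. The sketch below uses the H100 SXM figures quoted earlier (~990 dense FP16 TFLOPS, ~3.35 TB/s) as placeholder peaks:

```python
# Rough roofline check: is a kernel compute-bound or memory-bound?
# Peak numbers are placeholders; substitute your GPU's specs.

def machine_balance(peak_tflops: float, peak_tbps: float) -> float:
    """FLOPs per byte needed to saturate compute before bandwidth."""
    return peak_tflops / peak_tbps

def is_compute_bound(flops: float, bytes_moved: float,
                     peak_tflops: float, peak_tbps: float) -> bool:
    return flops / bytes_moved >= machine_balance(peak_tflops, peak_tbps)

# Square FP16 GEMM with perfect reuse: 2*N^3 FLOPs, 3*N^2 values * 2 bytes
N = 4096
flops = 2 * N**3
bytes_moved = 3 * N * N * 2

print(machine_balance(990.0, 3.35))   # ~295 FLOPs/byte balance point
print(flops / bytes_moved)            # ~1365 FLOPs/byte for this GEMM
print(is_compute_bound(flops, bytes_moved, 990.0, 3.35))  # True
```

Below the balance point, steps 1-2 (coalescing and tiling) dominate; above it, steps 3-5 matter most. Note that tiling itself changes `bytes_moved`, which is exactly how it converts a memory-bound kernel into a compute-bound one.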
Performance Targets
| Metric | Good | Great |
|---|---|---|
| Memory Bandwidth | 60% | 80%+ |
| Compute Throughput | 50% | 70%+ |
| vs cuBLAS | 50% | 80%+ |
Hands-On Labs
Part 2 Labs
- Profiling - Learn Nsight Compute
- Coalescing - Optimize memory access patterns
- Bank Conflicts - Eliminate shared memory bottlenecks
- Pipelining - Overlap compute and memory
- TMA (Hopper+) - Hardware-accelerated data movement
- Tensor Cores - MMA operations
- Optimized GEMM - Put it all together
References
- Nsight Compute Documentation - NVIDIA's kernel profiler
- CUDA Programming Guide: Shared Memory - Bank conflict details
- PTX ISA: Warp-Level Matrix Instructions - MMA shapes and semantics
- Hopper Tuning Guide - TMA and H100-specific optimizations
- NVIDIA H100 Datasheet - Official specs and TFLOPS numbers
- NVIDIA Tensor Core Resources - Architecture whitepapers
Reducing Launch Overhead
Each kernel launch has overhead: ~5-10 microseconds for the CPU to set up and dispatch work to the GPU. For large kernels, this is negligible. For many small kernels (common in inference), it dominates runtime.
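The break-even point follows from simple arithmetic: total runtime is roughly kernels × (launch overhead + kernel time), so the overhead fraction depends only on how the two compare. A toy model, assuming 7.5 μs per launch (the midpoint of the 5-10 μs range above):

```python
# Toy model of launch overhead: total = kernels * (launch + compute).
# LAUNCH_US is an assumed midpoint of the 5-10 us range, not a measurement.

LAUNCH_US = 7.5

def launch_overhead_fraction(num_kernels: int, kernel_us: float) -> float:
    """Fraction of total runtime spent on launch overhead."""
    total = num_kernels * (LAUNCH_US + kernel_us)
    return num_kernels * LAUNCH_US / total

# 100 small kernels of 10 us each: overhead is ~43% of runtime
print(f"{launch_overhead_fraction(100, 10.0):.0%}")
# 100 large kernels of 1 ms each: overhead is ~0.7% of runtime
print(f"{launch_overhead_fraction(100, 1000.0):.1%}")
```

This is the regime split behind the table further down: at small batch sizes kernels shrink toward launch-overhead scale and graphs pay off; at large batch sizes they don't.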
When CUDA Graphs Help
Good Candidates
- Inference with fixed shapes — Same operations repeated many times
- Small batch sizes — Kernel compute time comparable to launch overhead
- Many small kernels — Transformer layers with separate attention, FFN, norm calls
Not Useful For
- Training — Dynamic shapes, gradient checkpointing, varying batch sizes
- Large kernels — Compute time >> launch overhead anyway
- Control flow dependent on data — Graphs can't handle dynamic branches
How CUDA Graphs Work
A CUDA Graph captures a sequence of operations (kernels, memory copies) and their dependencies, then replays them with minimal CPU involvement.
Workflow: capture (record ops) → instantiate (create executable graph) → replay (single API call).
PyTorch Example
import torch

# Warmup (required to allocate memory, compile kernels)
for _ in range(3):
    output = model(static_input)

# Capture graph
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_output = model(static_input)

# Replay graph for inference (much faster for small batches)
for batch in dataloader:
    # Copy new data into static input buffer
    static_input.copy_(batch)
    # Replay the captured graph
    g.replay()
    # Results are in static_output
    process(static_output)
The graph captures exact tensor shapes and memory addresses. Changing shapes requires recapturing the graph. For variable-length sequences, pad to max length or bucket by length and capture multiple graphs.
Performance Impact
| Scenario | Without Graphs | With Graphs | Speedup |
|---|---|---|---|
| Small batch inference (BS=1) | ~500 μs | ~200 μs | ~2.5x |
| Large batch inference (BS=64) | ~8 ms | ~7.9 ms | ~1.01x |
Typical numbers for a Transformer layer. Speedup depends on kernel count and individual kernel duration. Profile your specific workload.