Chapter 2

Memory Hierarchy

GPU performance lives or dies by memory access patterns. Master the hierarchy from registers to HBM and learn the techniques that separate 10% from 80% hardware utilization.

Building on Chapter 1
In Chapter 1, you learned that warps execute 32 threads in lockstep. This seemingly simple fact has profound implications for memory access. When 32 threads read memory simultaneously, where they read from determines whether you get 8 TB/s or 500 GB/s.
What You'll Learn
  1. List the GPU memory types in order of speed and size
  2. Explain why memory bandwidth is often the bottleneck
  3. Demonstrate coalesced vs uncoalesced memory access patterns
  4. Identify and fix shared memory bank conflicts
  5. Calculate effective memory bandwidth for a given access pattern
01 - THE BOTTLENECK

Memory-Bound Reality

Modern GPUs have enormous compute capacity. A B200 delivers roughly 2.5 PFLOPS of dense FP16 compute, and even more at FP8. But that compute is useless if you can't feed it data fast enough.

Most GPU workloads are memory-bound, not compute-bound. The arithmetic intensity (FLOPS per byte) of your kernel determines which regime you're in.

Arithmetic Intensity

AI = FLOPS / Bytes transferred

For a B200 with ~8 TB/s bandwidth and ~2500 TFLOPS (FP16):

Ridge point = 2500 TFLOPS / 8 TB/s = 312.5 FLOPS/byte

Below 312.5: Memory-bound (bandwidth limited)
Above 312.5: Compute-bound (can saturate ALUs)

GEMM has high AI (~hundreds). Element-wise ops have low AI (~1-2).
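
As a concrete example, consider a SAXPY-style element-wise kernel (a minimal sketch; the byte counts in the comments assume FP32 operands):

// y[i] = a * x[i] + y[i]
// Per element: 2 FLOPs (multiply + add), 12 bytes moved (read x, read y, write y).
// AI = 2 / 12 ≈ 0.17 FLOPs/byte -- far below the ~312 FLOPs/byte ridge point,
// so this kernel is firmly memory-bound: optimize bandwidth, not arithmetic.
__global__ void saxpy(float a, const float* __restrict__ x, float* __restrict__ y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}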

The Real Challenge

Peak bandwidth is theoretical. Achieving even 80% requires perfect access patterns. Random access can drop effective bandwidth by 10-100x.

Which memory type is fastest?
Global memory (HBM)
Shared memory
Registers

02 - THE HIERARCHY

Four Levels of Memory

GPU memory forms a hierarchy trading capacity for speed. Each level has distinct characteristics and use cases.

Registers 256KB/SM | ~1 cycle | ~20 TB/s

Per-thread private storage. Instant, like recalling your own name.

Characteristics:

  • Each thread has private registers (up to 255 per thread)
  • Very low latency (~1 cycle read-after-write)
  • Compiler-managed allocation
  • Register pressure limits occupancy

Usage: Loop variables, intermediate results, frequently accessed values. The compiler will spill to local memory (slow!) if you use too many.
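
For instance, a partial dot-product can keep its running sum in a register so intermediate values never touch memory (a minimal sketch; `out` is assumed to be zero-initialized):

__global__ void dot_partial(const float* __restrict__ x, const float* __restrict__ y,
                            float* out, int n) {
    float acc = 0.0f;  // lives in a register for the entire loop
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
        acc += x[i] * y[i];        // intermediate results stay on-chip
    atomicAdd(out, acc);           // one global write per thread
}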

Shared Memory (SMEM) 228KB/SM | ~20 cycles | ~20 TB/s

Block-level scratchpad. Like a quick glance at a sticky note.

Characteristics:

  • Shared among all threads in a block
  • On-chip SRAM with very low latency
  • Organized into 32 banks (bank conflicts possible)
  • Configurable L1/SMEM split on some architectures

Usage: Inter-thread communication, tiled algorithms, reused data. Essential for efficient matrix multiplication.
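
A minimal sketch of the staging pattern, here a 3-point moving average (illustrative; assumes blockDim.x == BLOCK, n a multiple of BLOCK, and zero boundaries):

#define BLOCK 256

// Each input element is reused by up to three threads, so staging a
// block-sized tile in shared memory cuts HBM reads roughly 3x.
__global__ void stencil3(const float* __restrict__ in, float* __restrict__ out, int n) {
    __shared__ float s[BLOCK + 2];                     // tile plus one halo element per side
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int lid = threadIdx.x + 1;

    s[lid] = in[gid];                                  // coalesced load of the tile
    if (threadIdx.x == 0)
        s[0] = (gid > 0) ? in[gid - 1] : 0.0f;         // left halo
    if (threadIdx.x == BLOCK - 1)
        s[BLOCK + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;  // right halo
    __syncthreads();                                   // make the tile visible to the whole block

    out[gid] = (s[lid - 1] + s[lid] + s[lid + 1]) / 3.0f;   // all three reads hit SMEM
}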

L2 Cache ~60MB | ~200 cycles | ~10 TB/s

Chip-wide cache. Hardware-managed.

Characteristics:

  • Shared across all SMs on the GPU
  • Automatic caching of global memory
  • Can use persistence hints (cudaAccessPolicyWindow)
  • Helps with irregular access patterns

Usage: Automatic caching. Can hint persistence for working sets that fit. Effective for data reused across blocks.
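
A sketch of setting a persistence hint via cudaAccessPolicyWindow; `lut_buffer`, `lut_bytes`, and `stream` are illustrative placeholders:

// Mark a frequently reused buffer as "persisting" in L2 for kernels launched
// on this stream. Requires a device where cudaDevAttrMaxAccessPolicyWindowSize > 0.
cudaStreamAttrValue attr = {};
attr.accessPolicyWindow.base_ptr  = lut_buffer;          // reused buffer (placeholder)
attr.accessPolicyWindow.num_bytes = lut_bytes;           // must fit within the policy window limit
attr.accessPolicyWindow.hitRatio  = 1.0f;                // fraction of accesses treated as persisting
attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;
cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);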

HBM (Global Memory) 192GB | ~400 cycles | ~8 TB/s

Main GPU memory. Like a walk to a filing cabinet in another room.

Characteristics:

  • Largest capacity, highest latency tier in the hierarchy
  • Accessible by every thread on the GPU
  • Accessed in 128-byte cache lines
  • Effective bandwidth depends heavily on coalescing

Usage: Input/output tensors, large datasets. Access patterns determine whether you get 10% or 90% of peak bandwidth.

The Latency Gap

Register access: ~1 cycle. HBM access: ~400 cycles. This enormous gap (hundreds of cycles) is why data reuse dominates GPU optimization. Every byte loaded from HBM should be used as many times as possible before eviction.

Note on Latency Figures

Cycle counts shown above are approximate and vary by architecture, workload, and memory access patterns. Actual latency depends on factors like cache hits, bank conflicts, and memory coalescing. For precise measurements, refer to Jia et al.'s GPU microbenchmarking or use Nsight Compute profiling.

Why is the 128-byte cache line important?
When you request even 4 bytes from global memory, the hardware fetches an entire 128-byte cache line. This is why coalescing matters: 32 threads reading 4 bytes each = 128 bytes = 1 cache line. But 32 threads reading scattered 4-byte values = potentially 32 cache lines (4KB!) for 128 bytes of useful data. That's 32x wasted bandwidth.

03 - COALESCING

Memory Access Patterns

When threads in a warp access global memory, the hardware combines (coalesces) their requests into as few transactions as possible. Proper coalescing is the single most important optimization for memory-bound kernels.

Prerequisite Check

Understanding thread indexing is key. Need a refresher? See Prerequisites: Index Arithmetic

Coalescing Patterns

Good: Coalesced

// Thread i accesses element i
data[threadIdx.x]

// 1 transaction for 32 threads

Bad: Strided

// Thread i accesses element i*32
data[threadIdx.x * 32]

// 32 transactions! (worst case)
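
The same two patterns as complete kernels (a sketch for illustration):

// Coalesced: thread i touches element i -- a warp reads one 128-byte line.
__global__ void scale_coalesced(float* data, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= s;
}

// Strided: thread i touches element i*32 -- a warp touches 32 separate lines.
__global__ void scale_strided(float* data, float s, int n) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * 32;
    if (i < n) data[i] *= s;
}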
For a warp accessing 32 consecutive float values starting at a 128-byte aligned address, how many memory transactions occur?
1 transaction (128 bytes = 32 floats)
4 transactions (32 bytes each)
32 transactions (one per thread)
2 transactions (64 bytes each)
Coalesced memory access means:
All threads access the same address
Adjacent threads access adjacent addresses
Threads access memory in random order

04 - BANK CONFLICTS

Shared Memory Banks

Shared memory is divided into 32 banks. Each bank can serve one address per cycle. When multiple threads in a warp access different addresses in the same bank, accesses serialize—this is a bank conflict.

Bank Layout

Consecutive 4-byte words map to consecutive banks. Bank = (address / 4) % 32.
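
A tiny host-side helper makes the mapping concrete (illustrative):

// Bank that a 4-byte shared-memory word at byte offset `addr` maps to.
int bank_of(int addr) { return (addr / 4) % 32; }

// float smem[...]: word index i lives at byte offset 4*i.
// Stride 1 : threads 0..31 touch words 0..31    -> banks 0..31 (conflict-free)
// Stride 32: threads 0..31 touch words 0,32,64… -> bank 0 every time (32-way conflict)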


Conflict-Free

// Each thread hits different bank
smem[threadIdx.x]

// All 32 accesses parallel

N-way Conflict

// Stride of 32 = same bank!
smem[threadIdx.x * 32]

// Serialized: 32x slower
Broadcast Exception

If multiple threads read the same address, it's a broadcast—no conflict. Conflicts only occur when threads access different addresses in the same bank.
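
One common fix is padding the shared array by one word per row, as in this transpose-style sketch (tile size and names are illustrative; launch with a 32x32 block):

#define TILE 32

__global__ void transpose(const float* __restrict__ in, float* __restrict__ out,
                          int width, int height) {
    // TILE+1 columns: the row stride becomes 33 words, so walking down a column
    // visits a different bank each step instead of hammering one bank 32-way.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];     // coalesced read

    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;                        // swap block coordinates
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];   // coalesced write, conflict-free read
}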

What stride causes maximum bank conflicts (32-way)?
Stride of 1 (consecutive access)
Stride of 16
Stride of 32 (or any multiple of 32)
Stride of 33
A 2-way bank conflict means the access takes:
Half as long
Twice as long (serialized)
The same time

05 - TMA

Tensor Memory Accelerator

Hopper introduced TMA—dedicated hardware for bulk data movement between HBM and shared memory. TMA offloads address computation from SMs, enabling asynchronous tile transfers.

HBM (global tensors) → TMA unit (address generation + transfer) → SMEM (ready for compute)

TMA vs Manual Loads

Aspect                Manual (LDG/STG)    TMA
Address computation   SM cycles           TMA unit (free)
Tile dimensions       Manual indexing     Descriptor-based
Multicast             Not available       Built-in (to multiple SMs)
Async                 cp.async            Native, with barriers
// TMA descriptor setup (host side)
CUtensorMap tensor_map;
cuuint64_t global_dim[2]    = {N, K};                // Global shape
cuuint64_t global_stride[1] = {lda * sizeof(half)};  // Outer stride in bytes (innermost stride is the element size, implied)
cuuint32_t box_dim[2]       = {TILE_N, TILE_K};      // Box (tile) shape
cuTensorMapEncodeTiled(&tensor_map,
    CU_TENSOR_MAP_DATA_TYPE_FLOAT16,
    2,                    // 2D tensor
    global_ptr,           // Base address
    global_dim,
    global_stride,
    box_dim,
    ...);                 // Element strides, interleave, swizzle, L2 promotion, OOB fill

// Kernel: single instruction loads entire tile
cp.async.bulk.tensor.2d.shared::cluster.global.tile
    [smem_ptr], [tensor_map, {tile_x, tile_y}];
When to Use TMA

TMA shines for structured tile access patterns (GEMM, convolution, attention). For irregular access, manual loads may still be necessary. CuTe abstracts TMA through its copy operations.


06 - OPTIMIZATION

Putting It Together

Memory optimization follows a hierarchy of impact. Address these in order:

Optimization Priority

  1. Coalescing (10-100x impact)
    Ensure warps access contiguous memory. Transpose data if needed.
  2. Data Reuse via SMEM (2-10x impact)
    Tile algorithms to maximize reuse. Classic: GEMM tiling.
  3. Bank Conflict Elimination (1.1-2x impact)
    Pad shared memory arrays or use conflict-free access patterns.
  4. Occupancy Tuning (1.1-1.5x impact)
    Balance register/SMEM usage against parallelism.
Bandwidth Efficiency

Achieved bandwidth = data transferred / transfer time. For example, a kernel that sustains 5.0 TB/s against the B200's 8.0 TB/s peak is running at 5.0 / 8.0 = 62.5% efficiency.
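
A minimal sketch of measuring achieved bandwidth with CUDA events (the copy kernel and sizes are illustrative):

#include <cstdio>
#include <cuda_runtime.h>

// Simple copy kernel: 4 bytes read + 4 bytes written per element.
__global__ void copy_kernel(const float* __restrict__ in, float* __restrict__ out, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

int main() {
    const size_t n = 1ull << 28;                  // 256M floats = 1 GiB per buffer
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    copy_kernel<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    double bytes = 2.0 * n * sizeof(float);       // read traffic + write traffic
    double tbps  = bytes / (ms * 1e-3) / 1e12;
    printf("Achieved bandwidth: %.2f TB/s\n", tbps);
    return 0;
}

Compare the printed figure against the device's peak to get the efficiency percentage above.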
Profile, Don't Guess

Use ncu (Nsight Compute) to measure actual memory throughput, cache hit rates, and bank conflicts. Theoretical analysis guides optimization; profiling validates it.

What if register spilling occurs?
When a kernel uses too many registers, the compiler "spills" excess values to local memory (which is actually in global memory, just private to each thread). This is slow! Signs of spilling: look for lmem in ncu output. Fix by: reducing live values, using __launch_bounds__, or accepting lower occupancy for more registers.
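
A minimal sketch of capping register usage with __launch_bounds__ (the bounds and kernel body are illustrative; verify the effect with ncu or -Xptxas -v before and after):

// Tell the compiler to target at most 256 threads per block and at least
// 2 resident blocks per SM. This caps registers per thread, which can
// eliminate spills -- or, if set too aggressively, cause them.
__global__ void __launch_bounds__(256, 2)
scale_kernel(const float* __restrict__ in, float* __restrict__ out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;   // placeholder body
}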

PRACTICE

Hands-On Labs

REFERENCES

Citations & Further Reading

Video Resources

Deep dives into GPU memory optimization.

Introduction to CUDA Programming

Comprehensive introduction to NVIDIA's GPU parallel programming architecture including memory hierarchy.

Understanding GPU Memory (CoffeeBeforeArch)

Practical walkthrough of coalescing, bank conflicts, and memory access patterns.


Primary Documentation

  1. NVIDIA CUDA C++ Programming Guide
    Chapter 5: Memory Hierarchy - Registers, Shared Memory, Global Memory
    docs.nvidia.com/cuda/cuda-c-programming-guide
  2. NVIDIA CUDA C++ Best Practices Guide
    Memory Optimizations: Coalescing, Bank Conflicts, Caching
    docs.nvidia.com/cuda/cuda-c-best-practices-guide
  3. NVIDIA Hopper Architecture Whitepaper
    TMA (Tensor Memory Accelerator), Asynchronous Copy, Distributed Shared Memory
    resources.nvidia.com/en-us-hopper-architecture
  4. NVIDIA GB200 NVL72 Specifications
    HBM3e bandwidth: 576 TB/s total (72 GPUs) = ~8 TB/s per GPU
    nvidia.com/data-center/gb200-nvl72
  5. NVIDIA Nsight Compute Documentation
    Memory throughput analysis, profiling, bottleneck identification
    docs.nvidia.com/nsight-compute

Key Specifications with Sources

Specification              Value               Source
Shared memory banks        32                  CUDA Programming Guide, Ch. 5
Cache line size            128 bytes           CUDA Best Practices Guide
Max registers per thread   255                 CUDA Programming Guide
B200 HBM3e bandwidth       ~8 TB/s per GPU     GB200 NVL72 Specifications
H100 shared memory         Up to 228 KB/SM     H100 Datasheet
HBM memory latency         ~400-800 cycles     Varies by access pattern
About Latency Numbers

Memory latency figures are approximations. Actual latencies depend on access patterns, cache behavior, memory contention, and specific workloads. Always use ncu (Nsight Compute) to profile your actual kernels.