Chapter 2

Memory Hierarchy

GPU performance lives or dies by memory access patterns. Master the hierarchy from registers to HBM and learn the techniques that separate 10% from 80% hardware utilization.

Building on Chapter 1
In Chapter 1, you learned that warps execute 32 threads in lockstep. This seemingly simple fact has profound implications for memory access. When 32 threads read memory simultaneously, where they read from determines whether you get 8 TB/s or 500 GB/s.
What You'll Learn
  1. List the GPU memory types in order of speed and size
  2. Explain why memory bandwidth is often the bottleneck
  3. Demonstrate coalesced vs uncoalesced memory access patterns
  4. Identify and fix shared memory bank conflicts
  5. Calculate effective memory bandwidth for a given access pattern
01 - THE BOTTLENECK

Memory-Bound Reality

Modern GPUs have enormous compute capacity. A B200 delivers roughly 2.5 PFLOPS of dense FP16 compute, and even more at FP8. But that compute is useless if you can't feed it data fast enough.

Most GPU workloads are memory-bound, not compute-bound. The arithmetic intensity (FLOPS per byte) of your kernel determines which regime you're in.

Arithmetic Intensity

AI = FLOPS / Bytes transferred

For a B200 with ~8 TB/s bandwidth and ~2500 TFLOPS (FP16):

Ridge point = 2500 TFLOPS / 8 TB/s = 312.5 FLOPS/byte

Below 312.5: Memory-bound (bandwidth limited)
Above 312.5: Compute-bound (can saturate ALUs)

GEMM has high AI (~hundreds). Element-wise ops have low AI (~1-2).
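
As a concrete example, consider a SAXPY-style element-wise kernel (a minimal sketch; the byte counts in the comments assume FP32 operands):

// y[i] = a * x[i] + y[i]
// Per element: 2 FLOPs (multiply + add), 12 bytes moved (read x, read y, write y).
// AI = 2 / 12 ≈ 0.17 FLOPs/byte -- far below the ~312 FLOPs/byte ridge point,
// so this kernel is firmly memory-bound: optimize bandwidth, not arithmetic.
__global__ void saxpy(float a, const float* __restrict__ x, float* __restrict__ y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}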

The Real Challenge

Peak bandwidth is theoretical. Achieving even 80% requires perfect access patterns. Random access can drop effective bandwidth by 10-100x.

Which memory type is fastest?
Global memory (HBM)
Shared memory
Registers

02 - THE HIERARCHY

Four Levels of Memory

GPU memory forms a hierarchy trading capacity for speed. Each level has distinct characteristics and use cases.

Registers 256KB/SM | ~1 cycle | ~20 TB/s

Per-thread private storage. Instant, like recalling your own name.

Characteristics:

  • Each thread has private registers (up to 255 per thread)
  • Very low latency (~1 cycle read-after-write)
  • Compiler-managed allocation
  • Register pressure limits occupancy

Usage: Loop variables, intermediate results, frequently accessed values. The compiler will spill to local memory (slow!) if you use too many.
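
For instance, a partial dot-product can keep its running sum in a register so intermediate values never touch memory (a minimal sketch; `out` is assumed to be zero-initialized):

__global__ void dot_partial(const float* __restrict__ x, const float* __restrict__ y,
                            float* out, int n) {
    float acc = 0.0f;  // lives in a register for the entire loop
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
        acc += x[i] * y[i];        // intermediate results stay on-chip
    atomicAdd(out, acc);           // one global write per thread
}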

Shared Memory (SMEM) 228KB/SM | ~20 cycles | ~20 TB/s

Block-level scratchpad. Like a quick glance at a sticky note.

Characteristics:

  • Shared among all threads in a block
  • On-chip SRAM with very low latency
  • Organized into 32 banks (bank conflicts possible)
  • Configurable L1/SMEM split on some architectures

Usage: Inter-thread communication, tiled algorithms, reused data. Essential for efficient matrix multiplication.
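
A minimal sketch of the staging pattern, here a 3-point moving average (illustrative; assumes blockDim.x == BLOCK, n a multiple of BLOCK, and zero boundaries):

#define BLOCK 256

// Each input element is reused by up to three threads, so staging a
// block-sized tile in shared memory cuts HBM reads roughly 3x.
__global__ void stencil3(const float* __restrict__ in, float* __restrict__ out, int n) {
    __shared__ float s[BLOCK + 2];                     // tile plus one halo element per side
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int lid = threadIdx.x + 1;

    s[lid] = in[gid];                                  // coalesced load of the tile
    if (threadIdx.x == 0)
        s[0] = (gid > 0) ? in[gid - 1] : 0.0f;         // left halo
    if (threadIdx.x == BLOCK - 1)
        s[BLOCK + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;  // right halo
    __syncthreads();                                   // make the tile visible to the whole block

    out[gid] = (s[lid - 1] + s[lid] + s[lid + 1]) / 3.0f;   // all three reads hit SMEM
}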

L2 Cache ~60MB | ~200 cycles | ~10 TB/s

Chip-wide cache. Hardware-managed.

Characteristics:

  • Shared across all SMs on the GPU
  • Automatic caching of global memory
  • Can use persistence hints (cudaAccessPolicyWindow)
  • Helps with irregular access patterns

Usage: Automatic caching. Can hint persistence for working sets that fit. Effective for data reused across blocks.
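
A sketch of setting a persistence hint via cudaAccessPolicyWindow; `lut_buffer`, `lut_bytes`, and `stream` are illustrative placeholders:

// Mark a frequently reused buffer as "persisting" in L2 for kernels launched
// on this stream. Requires a device where cudaDevAttrMaxAccessPolicyWindowSize > 0.
cudaStreamAttrValue attr = {};
attr.accessPolicyWindow.base_ptr  = lut_buffer;          // reused buffer (placeholder)
attr.accessPolicyWindow.num_bytes = lut_bytes;           // must fit within the policy window limit
attr.accessPolicyWindow.hitRatio  = 1.0f;                // fraction of accesses treated as persisting
attr.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;
attr.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;
cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &attr);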

HBM (Global Memory) 192GB | ~400 cycles | ~8 TB/s

Main GPU memory. Like a walk to a filing cabinet in another room.

Characteristics:

  • Largest capacity, highest latency tier in the hierarchy
  • Accessible by every thread on the GPU
  • Accessed in 128-byte cache lines
  • Effective bandwidth depends heavily on coalescing

Usage: Input/output tensors, large datasets. Access patterns determine whether you get 10% or 90% of peak bandwidth.

The Latency Gap

Register access: ~1 cycle. HBM access: ~400 cycles. This enormous gap (hundreds of cycles) is why data reuse dominates GPU optimization. Every byte loaded from HBM should be used as many times as possible before eviction.

Note on Latency Figures

Cycle counts shown above are approximate and vary by architecture, workload, and memory access patterns. Actual latency depends on factors like cache hits, bank conflicts, and memory coalescing. For precise measurements, refer to Jia et al.'s GPU microbenchmarking or use Nsight Compute profiling.

Why is the 128-byte cache line important?
When you request even 4 bytes from global memory, the hardware fetches an entire 128-byte cache line. This is why coalescing matters: 32 threads reading 4 bytes each = 128 bytes = 1 cache line. But 32 threads reading scattered 4-byte values = potentially 32 cache lines (4KB!) for 128 bytes of useful data. That's 32x wasted bandwidth.

03 - COALESCING

Memory Access Patterns

When threads in a warp access global memory, the hardware combines (coalesces) their requests into as few transactions as possible. Proper coalescing is the single most important optimization for memory-bound kernels.

Prerequisite Check

Understanding thread indexing is key. Need a refresher? See Prerequisites: Index Arithmetic

Coalescing Patterns

Good: Coalesced

// Thread i accesses element i
data[threadIdx.x]

// 1 transaction for 32 threads

Bad: Strided

// Thread i accesses element i*32
data[threadIdx.x * 32]

// 32 transactions! (worst case)
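
The same two patterns as complete kernels (a sketch for illustration):

// Coalesced: thread i touches element i -- a warp reads one 128-byte line.
__global__ void scale_coalesced(float* data, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= s;
}

// Strided: thread i touches element i*32 -- a warp touches 32 separate lines.
__global__ void scale_strided(float* data, float s, int n) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * 32;
    if (i < n) data[i] *= s;
}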
For a warp accessing 32 consecutive float values starting at a 128-byte aligned address, how many memory transactions occur?
1 transaction (128 bytes = 32 floats)
4 transactions (32 bytes each)
32 transactions (one per thread)
2 transactions (64 bytes each)
Coalesced memory access means:
All threads access the same address
Adjacent threads access adjacent addresses
Threads access memory in random order

04 - BANK CONFLICTS

Shared Memory Banks

Shared memory is divided into 32 banks. Each bank can serve one address per cycle. When multiple threads in a warp access different addresses in the same bank, accesses serialize—this is a bank conflict.

Bank Layout

Consecutive 4-byte words map to consecutive banks. Bank = (address / 4) % 32.
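
A tiny host-side helper makes the mapping concrete (illustrative):

// Bank that a 4-byte shared-memory word at byte offset `addr` maps to.
int bank_of(int addr) { return (addr / 4) % 32; }

// float smem[...]: word index i lives at byte offset 4*i.
// Stride 1 : threads 0..31 touch words 0..31    -> banks 0..31 (conflict-free)
// Stride 32: threads 0..31 touch words 0,32,64… -> bank 0 every time (32-way conflict)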


Conflict-Free

// Each thread hits different bank
smem[threadIdx.x]

// All 32 accesses parallel

N-way Conflict

// Stride of 32 = same bank!
smem[threadIdx.x * 32]

// Serialized: 32x slower
Broadcast Exception

If multiple threads read the same address, it's a broadcast—no conflict. Conflicts only occur when threads access different addresses in the same bank.
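
One common fix is padding the shared array by one word per row, as in this transpose-style sketch (tile size and names are illustrative; launch with a 32x32 block):

#define TILE 32

__global__ void transpose(const float* __restrict__ in, float* __restrict__ out,
                          int width, int height) {
    // TILE+1 columns: the row stride becomes 33 words, so walking down a column
    // visits a different bank each step instead of hammering one bank 32-way.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < width && y < height)
        tile[threadIdx.y][threadIdx.x] = in[y * width + x];     // coalesced read

    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;                        // swap block coordinates
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < height && y < width)
        out[y * height + x] = tile[threadIdx.x][threadIdx.y];   // coalesced write, conflict-free read
}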

What stride causes maximum bank conflicts (32-way)?
Stride of 1 (consecutive access)
Stride of 16
Stride of 32 (or any multiple of 32)
Stride of 33
A 2-way bank conflict means the access takes:
Half as long
Twice as long (serialized)
The same time

05 - TMA

Tensor Memory Accelerator

Hopper introduced TMA—dedicated hardware for bulk data movement between HBM and shared memory. TMA offloads address computation from SMs, enabling asynchronous tile transfers.

HBM (global tensors) → TMA unit (address generation + transfer) → SMEM (ready for compute)

TMA vs Manual Loads

Aspect                Manual (LDG/STG)    TMA
Address computation   SM cycles           TMA unit (free)
Tile dimensions       Manual indexing     Descriptor-based
Multicast             Not available       Built-in (to multiple SMs)
Async                 cp.async            Native, with barriers
// TMA descriptor setup (host side)
CUtensorMap tensor_map;
cuuint64_t global_dim[2]    = {N, K};                // Global shape
cuuint64_t global_stride[1] = {lda * sizeof(half)};  // Outer stride in bytes (innermost stride is the element size, implied)
cuuint32_t box_dim[2]       = {TILE_N, TILE_K};      // Box (tile) shape
cuTensorMapEncodeTiled(&tensor_map,
    CU_TENSOR_MAP_DATA_TYPE_FLOAT16,
    2,                    // 2D tensor
    global_ptr,           // Base address
    global_dim,
    global_stride,
    box_dim,
    ...);                 // Element strides, interleave, swizzle, L2 promotion, OOB fill

// Kernel: single instruction loads entire tile
cp.async.bulk.tensor.2d.shared::cluster.global.tile
    [smem_ptr], [tensor_map, {tile_x, tile_y}];
When to Use TMA

TMA shines for structured tile access patterns (GEMM, convolution, attention). For irregular access, manual loads may still be necessary. CuTe abstracts TMA through its copy operations.


06 - OPTIMIZATION

Putting It Together

Memory optimization follows a hierarchy of impact. Address these in order:

Optimization Priority

  1. Coalescing (10-100x impact)
    Ensure warps access contiguous memory. Transpose data if needed.
  2. Data Reuse via SMEM (2-10x impact)
    Tile algorithms to maximize reuse. Classic: GEMM tiling.
  3. Bank Conflict Elimination (1.1-2x impact)
    Pad shared memory arrays or use conflict-free access patterns.
  4. Occupancy Tuning (1.1-1.5x impact)
    Balance register/SMEM usage against parallelism.
Bandwidth Efficiency

Achieved bandwidth = data transferred / transfer time. For example, a kernel that sustains 5.0 TB/s against the B200's 8.0 TB/s peak is running at 5.0 / 8.0 = 62.5% efficiency.
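
A minimal sketch of measuring achieved bandwidth with CUDA events (the copy kernel and sizes are illustrative):

#include <cstdio>
#include <cuda_runtime.h>

// Simple copy kernel: 4 bytes read + 4 bytes written per element.
__global__ void copy_kernel(const float* __restrict__ in, float* __restrict__ out, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

int main() {
    const size_t n = 1ull << 28;                  // 256M floats = 1 GiB per buffer
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    copy_kernel<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    double bytes = 2.0 * n * sizeof(float);       // read traffic + write traffic
    double tbps  = bytes / (ms * 1e-3) / 1e12;
    printf("Achieved bandwidth: %.2f TB/s\n", tbps);
    return 0;
}

Compare the printed figure against the device's peak to get the efficiency percentage above.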
Profile, Don't Guess

Use ncu (Nsight Compute) to measure actual memory throughput, cache hit rates, and bank conflicts. Theoretical analysis guides optimization; profiling validates it.

What if register spilling occurs?
When a kernel uses too many registers, the compiler "spills" excess values to local memory (which is actually in global memory, just private to each thread). This is slow! Signs of spilling: look for lmem in ncu output. Fix by: reducing live values, using __launch_bounds__, or accepting lower occupancy for more registers.
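
A minimal sketch of capping register usage with __launch_bounds__ (the bounds and kernel body are illustrative; verify the effect with ncu or -Xptxas -v before and after):

// Tell the compiler to target at most 256 threads per block and at least
// 2 resident blocks per SM. This caps registers per thread, which can
// eliminate spills -- or, if set too aggressively, cause them.
__global__ void __launch_bounds__(256, 2)
scale_kernel(const float* __restrict__ in, float* __restrict__ out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;   // placeholder body
}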

PRACTICE

Hands-On Labs

REFERENCES

Citations & Further Reading

Video Resources

Deep dives into GPU memory optimization.

Introduction to CUDA Programming

Comprehensive introduction to NVIDIA's GPU parallel programming architecture including memory hierarchy.

Understanding GPU Memory (CoffeeBeforeArch)

Practical walkthrough of coalescing, bank conflicts, and memory access patterns.


Primary Documentation

  1. NVIDIA CUDA C++ Programming Guide
    Chapter 5: Memory Hierarchy - Registers, Shared Memory, Global Memory
    docs.nvidia.com/cuda/cuda-c-programming-guide
  2. NVIDIA CUDA C++ Best Practices Guide
    Memory Optimizations: Coalescing, Bank Conflicts, Caching
    docs.nvidia.com/cuda/cuda-c-best-practices-guide
  3. NVIDIA Hopper Architecture Whitepaper
    TMA (Tensor Memory Accelerator), Asynchronous Copy, Distributed Shared Memory
    resources.nvidia.com/en-us-hopper-architecture
  4. NVIDIA GB200 NVL72 Specifications
    HBM3e bandwidth: 576 TB/s total (72 GPUs) = ~8 TB/s per GPU
    nvidia.com/data-center/gb200-nvl72
  5. NVIDIA Nsight Compute Documentation
    Memory throughput analysis, profiling, bottleneck identification
    docs.nvidia.com/nsight-compute

Key Specifications with Sources

Specification              Value               Source
Shared memory banks        32                  CUDA Programming Guide, Ch. 5
Cache line size            128 bytes           CUDA Best Practices Guide
Max registers per thread   255                 CUDA Programming Guide
B200 HBM3e bandwidth       ~8 TB/s per GPU     GB200 NVL72 Specifications
H100 shared memory         Up to 228 KB/SM     H100 Datasheet
HBM memory latency         ~400-800 cycles     Varies by access pattern
About Latency Numbers

Memory latency figures are approximations. Actual latencies depend on access patterns, cache behavior, memory contention, and specific workloads. Always use ncu (Nsight Compute) to profile your actual kernels.