GPU Fundamentals
Understanding the hardware execution model: from SMs to warps to threads, learn how modern NVIDIA GPUs actually execute your code. By the end of this section, you should be able to:
- Explain why GPUs optimize for throughput over latency
- Describe the SM → Warp → Thread hierarchy
- Calculate theoretical occupancy given resource constraints
- Identify warp divergence in code and explain its performance impact
- Interpret basic GPU specifications (SM count, memory bandwidth)
Throughput Over Latency
CPUs optimize for latency—making single tasks fast. GPUs optimize for throughput—completing many tasks in parallel, even if each individual task takes longer.
A CPU might have 8-16 cores running at 4+ GHz with massive caches. A GPU has thousands of smaller cores running at ~2 GHz with limited cache per core. The magic is in the parallelism.
- 192 SMs = 192 mini-processors running thousands of threads each
- 8 TB/s of memory bandwidth = every book ever written in under 2 seconds
*B200 specifications from NVIDIA Blackwell Architecture. Verify against official datasheets for production use.
GPU programming is about keeping thousands of threads busy. When one thread waits for memory, others execute. This latency hiding is fundamental to GPU performance.
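To make that concrete, here is a minimal sketch of the canonical vector-add kernel: one thread per element, so while some warps stall on global-memory loads, the SM's schedulers issue instructions from other resident warps. Kernel and variable names are illustrative.

```cuda
#include <cuda_runtime.h>

// One thread per element: each thread does tiny work, but thousands of
// resident threads give the warp schedulers something to run while
// other warps wait on memory.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)                                       // guard the final partial block
        c[i] = a[i] + b[i];
}

// Host-side launch: enough blocks to cover all n elements.
//   int threads = 256;
//   int blocks  = (n + threads - 1) / threads;      // ceiling division
//   vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);
```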
GPU → SM → Warp → Thread
NVIDIA GPUs have a strict hierarchy. Understanding each level is essential for writing efficient kernels.
- GPU Device: Blackwell B200 has up to 192 SMs, 192GB HBM3e, and ~8TB/s bandwidth*
- Streaming Multiprocessor (SM): the fundamental execution unit
- Warp: 32 threads in lockstep; the atomic scheduling unit (SIMT), like a marching band playing in unison
- Warpgroup (Hopper+): 4 warps (128 threads) for Tensor Core operations
- Thread: an individual execution context with private registers
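You can see where your own GPU sits in this hierarchy by querying the CUDA runtime. The short sketch below prints SM count, warp size, and per-SM limits using standard cudaDeviceProp fields; the printed values differ by GPU and driver.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);               // query device 0

    printf("GPU: %s (compute capability %d.%d)\n", prop.name, prop.major, prop.minor);
    printf("SMs:                %d\n", prop.multiProcessorCount);
    printf("Warp size:          %d threads\n", prop.warpSize);
    printf("Max threads per SM: %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Registers per SM:   %d (32-bit)\n", prop.regsPerMultiprocessor);
    printf("Shared mem per SM:  %zu KB\n", prop.sharedMemPerMultiprocessor / 1024);
    return 0;
}
```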
Why does the warp size matter?
The Warp: GPU's Atomic Unit
A warp is 32 threads that execute in SIMT (Single Instruction, Multiple Threads) fashion. Every thread in a warp executes the same instruction, but on different data.
When threads in a warp take different branches (if/else), execution serializes. Both paths run, with inactive threads masked. 16 threads per path = 50% throughput. Minimize divergence within warps for maximum performance.
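The sketch below contrasts the two cases: branching on threadIdx.x % 2 splits every warp into two serialized paths, while branching on the warp index keeps all 32 lanes of a warp on the same path. The kernels are illustrative only; they exist to show the branch pattern, not to compute anything useful.

```cuda
// Divergent: odd and even lanes of the SAME warp take different branches,
// so the warp runs both paths with half its lanes masked off each time.
__global__ void divergent(float* data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0)
        data[i] *= 2.0f;   // even lanes active, odd lanes masked
    else
        data[i] += 1.0f;   // odd lanes active, even lanes masked
}

// Divergence-free: the condition is uniform within each warp (all 32 lanes
// of a warp share the same warp index), so neither path is serialized.
__global__ void uniform(float* data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int warpId = threadIdx.x / 32;     // same value for every lane in a warp
    if (warpId % 2 == 0)
        data[i] *= 2.0f;
    else
        data[i] += 1.0f;
}
```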
Organizing Your Parallelism
Block (also called CTA - Cooperative Thread Array): A group of warps that share resources and can synchronize. Blocks run on a single SM.
Grid: Your problem decomposition into blocks. The grid is how you map your problem to the GPU's parallel execution model.
```
Grid (your problem)
├── Block 0 (256 threads = 8 warps)
│   ├── Warp 0: threads 0-31
│   ├── Warp 1: threads 32-63
│   ├── ...
│   ├── Warp 7: threads 224-255
│   ├── [Shared Memory: 48KB]
│   └── [Can __syncthreads()]
├── Block 1 (256 threads = 8 warps)
├── Block 2
│   ...
└── Block N-1
```
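As a hedged illustration of block-level cooperation, the sketch below stages one tile per block in shared memory and uses __syncthreads() before threads read each other's data. The tile size and kernel name are placeholders, and the tail case (n not a multiple of the tile size) is ignored for brevity.

```cuda
#define TILE 256   // threads per block = one tile per block (8 warps)

// Each block reverses one 256-element tile in shared memory.
// Assumes n is a multiple of TILE for brevity.
__global__ void reverseTiles(const float* in, float* out, int n) {
    __shared__ float tile[TILE];             // shared by all threads in this block
    int i = blockIdx.x * TILE + threadIdx.x;

    tile[threadIdx.x] = in[i];               // each thread loads one element
    __syncthreads();                         // block-wide barrier: all 8 warps arrive

    out[i] = tile[TILE - 1 - threadIdx.x];   // read an element loaded by another thread
}

// Host launch: one block per tile.
//   reverseTiles<<<n / TILE, TILE>>>(d_in, d_out, n);
```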
Block Size Selection
Common block sizes and their trade-offs:
- 128 threads (4 warps): 1 warpgroup, good for Tensor Core ops
- 256 threads (8 warps): Balanced, most common choice
- 512 threads (16 warps): More parallelism, higher register pressure
Block size affects: shared memory per thread, register availability, and occupancy.
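Instead of guessing, you can also ask the runtime for an occupancy-friendly block size. The sketch below uses cudaOccupancyMaxPotentialBlockSize (a standard CUDA runtime call) with a hypothetical kernel, then sizes the grid by ceiling division.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel, used only to illustrate the occupancy API.
__global__ void myKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

void launchMyKernel(float* d_data, int n) {
    int minGridSize = 0;   // minimum grid size for full device occupancy
    int blockSize   = 0;   // suggested threads per block for this kernel

    // Suggests a block size that maximizes theoretical occupancy given
    // the kernel's register and shared-memory usage.
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, myKernel, 0, 0);

    int gridSize = (n + blockSize - 1) / blockSize;   // cover all n elements
    myKernel<<<gridSize, blockSize>>>(d_data, n);
}
```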
Keeping the GPU Busy
Occupancy is the ratio of active warps to the maximum warps an SM can support. Higher occupancy generally means better latency hiding, but it's not the only factor in performance.
Simplified model based on Hopper SM (64 max warps, 256KB registers, 228KB shared memory).
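As a worked example under that simplified model (illustrative numbers, register-limited only): a kernel using 64 registers per thread with 256-thread blocks fits 4 blocks per SM, i.e. 32 resident warps out of 64, for 50% theoretical occupancy. The sketch below spells out the arithmetic.

```cuda
#include <cstdio>

int main() {
    // Illustrative kernel resource usage (check yours with nvcc -Xptxas -v):
    const int regsPerThread   = 64;
    const int threadsPerBlock = 256;      // 8 warps per block
    // Simplified Hopper SM limits from this section:
    const int regsPerSM       = 65536;    // 256 KB of 32-bit registers
    const int maxWarpsPerSM   = 64;

    int regsPerBlock = regsPerThread * threadsPerBlock;        // 16,384
    int blocksPerSM  = regsPerSM / regsPerBlock;               // 4 (register-limited)
    int warpsPerSM   = blocksPerSM * (threadsPerBlock / 32);   // 32 resident warps
    double occupancy = 100.0 * warpsPerSM / maxWarpsPerSM;     // 50%

    printf("Theoretical occupancy (register-limited): %.0f%%\n", occupancy);
    return 0;
}
```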
When is low occupancy actually better?
Use the --maxrregcount flag in nvcc or an occupancy calculator to experiment. Always profile with real workloads.
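Two common knobs, shown as a sketch (kernel and file names are placeholders): a per-kernel __launch_bounds__ hint, and the per-file --maxrregcount flag mentioned above.

```cuda
// Per-kernel hint: compile for at most 256 threads per block and request at
// least 2 resident blocks per SM, which pressures the compiler to use fewer
// registers per thread.
__global__ void __launch_bounds__(256, 2) tunedKernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Per-file register cap from the command line (placeholder file name):
//   nvcc --maxrregcount=32 kernel.cu -o kernel
```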
100% occupancy doesn't guarantee best performance. Sometimes using more registers (lower occupancy) enables better instruction-level parallelism. Profile your actual kernel to find the sweet spot.
Hands-On Labs
Citations & Further Reading
Video Resources
High-quality explanations of GPU architecture concepts.
- Excellent visual explanation of GPU architecture fundamentals, parallelism, and memory hierarchy. Watch on YouTube.
- Official NVIDIA explanation of threads, blocks, grids, and the CUDA execution model. Read: CUDA Refresher Series - developer.nvidia.com

Primary Documentation
- NVIDIA CUDA C++ Programming Guide: Chapters 4-5 (Thread Hierarchy, SIMT Architecture, Memory Hierarchy). docs.nvidia.com/cuda/cuda-c-programming-guide
- NVIDIA CUDA C++ Best Practices Guide: performance optimization, occupancy, memory access patterns. docs.nvidia.com/cuda/cuda-c-best-practices-guide
- NVIDIA Hopper Architecture Whitepaper: SM specifications, warpgroup operations, Tensor Core details. resources.nvidia.com/en-us-hopper-architecture
- NVIDIA Blackwell Architecture: fifth-gen Tensor Cores, NVLink 5, 208B transistors. nvidia.com/blackwell-architecture
- NVIDIA GB200 NVL72 Specifications: official Blackwell product specifications and performance data. nvidia.com/data-center/gb200-nvl72
- NVIDIA H100 Tensor Core GPU Datasheet: hardware specifications (256KB registers/SM, 228KB SMEM, 64 max warps). H100 Datasheet (PDF)
Key Specifications with Sources
| Specification | Value | Source |
|---|---|---|
| Warp size | 32 threads | CUDA Programming Guide, Ch. 4 |
| Warpgroup (Hopper+) | 128 threads (4 warps) | Hopper Architecture Whitepaper |
| H100 registers/SM | 256KB (65,536 x 32-bit) | H100 Datasheet |
| H100 shared memory/SM | Up to 228KB configurable | H100 Datasheet |
| Max warps per SM | 64 (Hopper) | Hopper Architecture Whitepaper |
| Blackwell transistors | 208 billion | NVIDIA Blackwell Architecture |
| B200 HBM bandwidth | ~8 TB/s | GB200 NVL72 Specifications |
Note: GPU specifications vary by SKU and configuration. Always verify against official NVIDIA datasheets for production use. Memory latencies are approximate and vary by access pattern and workload.