Multi-GPU Programming
Scaling training across multiple GPUs with data parallelism, tensor parallelism, and communication primitives. Understanding the topology and trade-offs that determine whether scaling actually speeds things up.
- Explain data, tensor, and pipeline parallelism trade-offs
- Use NCCL primitives for GPU-to-GPU communication
- Understand NVLink/NVSwitch topology and bandwidth implications
- Implement basic distributed data parallel training
- Describe ZeRO optimization stages and their memory savings
Scaling Beyond One GPU
The largest models don't fit on a single GPU, and even when a model does fit, single-GPU training is too slow. Multi-GPU training addresses both problems, but introduces communication overhead.
The Scale Problem
The LLaMA paper trained its 65B-parameter model on 2048 A100 GPUs for 21 days. Scaling efficiently across thousands of GPUs is essential for frontier model training.
Scaling Efficiency
Perfect scaling means N GPUs give N× throughput. Reality is worse due to:
- Communication overhead — time spent synchronizing gradients
- Bubble time — GPUs waiting for data from other GPUs
- Memory overhead — duplicate buffers, communication buffers
- Load imbalance — some GPUs finish before others
Efficiency = actual_throughput / (N × single_GPU_throughput)
Good systems achieve 90%+ efficiency at 100s of GPUs.
Megatron-LM reports 52% efficiency at 3072 GPUs for GPT-3-scale models.
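As a sanity check, the efficiency ratio is easy to compute (the throughput numbers below are invented for illustration):

```python
def scaling_efficiency(actual_throughput, n_gpus, single_gpu_throughput):
    """Fraction of ideal linear scaling actually achieved."""
    ideal = n_gpus * single_gpu_throughput
    return actual_throughput / ideal

# Hypothetical numbers: one GPU does 1000 samples/s,
# 64 GPUs together do 57,600 samples/s.
eff = scaling_efficiency(57_600, 64, 1_000)
print(f"{eff:.0%}")  # 90%
```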
Data, Tensor, and Pipeline
There are three fundamental ways to distribute work across GPUs. Modern large-scale training combines all three in what's called 3D parallelism.
Data Parallelism
Same model on each GPU, different data batches. Gradients are synchronized after the backward pass.
Pro: Simple, scales batch size
Con: Model must fit on one GPU
Tensor Parallelism
Split individual layers across GPUs. Each GPU computes part of each layer.
Pro: Splits large layers
Con: High communication every layer
Pipeline Parallelism
Different layers on different GPUs. Data flows through GPUs sequentially.
Pro: Low communication
Con: Pipeline bubbles, memory imbalance
When to Use Each
| Strategy | Best For | Communication Pattern | Typical Scale |
|---|---|---|---|
| Data Parallel | Model fits on 1 GPU | AllReduce after each step | Any number of GPUs |
| Tensor Parallel | Large layers (big hidden dim) | AllReduce after each layer | 2-8 GPUs (NVLink required) |
| Pipeline Parallel | Many layers | P2P between stages | 4-64 GPUs per pipeline |
Large-scale training uses all three: tensor parallelism within a node (NVLink), pipeline parallelism across nodes, and data parallelism across replicas. For example, a 3072-GPU job can be factored as (TP=8) × (PP=8) × (DP=48).
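One way to see how the three dimensions compose is to map a global rank to its (TP, PP, DP) coordinates. The layout below, with TP fastest-varying so that TP groups stay inside a node, is a common convention but not the only one:

```python
def rank_to_3d(rank, tp=8, pp=8, dp=48):
    """Map a global rank to (tp_rank, pp_rank, dp_rank).

    Assumes ranks are laid out with TP innermost (intra-node),
    then PP, then DP -- one common convention, not the only one.
    """
    assert 0 <= rank < tp * pp * dp
    tp_rank = rank % tp
    pp_rank = (rank // tp) % pp
    dp_rank = rank // (tp * pp)
    return tp_rank, pp_rank, dp_rank

print(rank_to_3d(0))   # (0, 0, 0)
print(rank_to_3d(7))   # (7, 0, 0)  same node, same TP group
print(rank_to_3d(64))  # (0, 0, 1)  first rank of the next DP replica
```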
Collective Communication Operations
NCCL (NVIDIA Collective Communications Library) provides optimized primitives for GPU-to-GPU communication. Understanding these operations is essential for distributed training.
AllReduce
The most common operation: sum gradients across all GPUs, result available on all GPUs.
AllGather
Concatenate data from all GPUs onto all GPUs. Used in tensor parallelism to reconstruct tensors.
ReduceScatter
Reduce then scatter: each GPU gets a different portion of the reduced result. This is the key operation in ZeRO Stage 2.
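The semantics of these three collectives can be sketched without any GPUs by treating each "rank" as a numpy array. This is a toy model of what NCCL computes, not how it computes it:

```python
import numpy as np

def all_reduce(per_rank):
    """Every rank ends up with the elementwise sum."""
    total = np.sum(per_rank, axis=0)
    return [total.copy() for _ in per_rank]

def all_gather(per_rank):
    """Every rank ends up with the concatenation of all shards."""
    full = np.concatenate(per_rank)
    return [full.copy() for _ in per_rank]

def reduce_scatter(per_rank):
    """Rank i ends up with the i-th shard of the elementwise sum."""
    total = np.sum(per_rank, axis=0)
    return list(np.split(total, len(per_rank)))

grads = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]  # two "ranks"
print(all_reduce(grads))      # both ranks hold [4., 6.]
print(reduce_scatter(grads))  # rank 0 holds [4.], rank 1 holds [6.]
```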
Communication Cost
| Operation | Data Transferred per GPU | Use Case |
|---|---|---|
| AllReduce | 2 × data_size × (N-1)/N | Gradient sync (DDP) |
| AllGather | data_size × (N-1)/N | Tensor parallel gather |
| ReduceScatter | data_size × (N-1)/N | ZeRO gradient reduce |
| Broadcast | data_size | Send from one to all |
NCCL implements AllReduce using a ring algorithm that achieves near-optimal bandwidth utilization. Each GPU sends and receives approximately 2 × data_size in total, almost independent of the number of GPUs, which makes it highly scalable.
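The ring algorithm itself can be simulated on CPU to see why each chunk of data travels twice around the ring (a didactic sketch, not NCCL's implementation):

```python
import numpy as np

def ring_all_reduce(per_rank):
    """Simulate a ring AllReduce over N 'ranks'. Each rank's vector is
    split into N chunks; a reduce-scatter phase then an all-gather
    phase, each taking N-1 steps in which every rank sends exactly one
    chunk to its right-hand neighbor."""
    n = len(per_rank)
    data = [list(np.array_split(x.astype(float), n)) for x in per_rank]

    # Phase 1: reduce-scatter. Afterwards, rank r holds the fully
    # summed chunk (r + 1) % n.
    for s in range(n - 1):
        outgoing = [data[r][(r - s) % n].copy() for r in range(n)]
        for r in range(n):
            data[(r + 1) % n][(r - s) % n] += outgoing[r]

    # Phase 2: all-gather. Completed chunks circulate around the ring.
    for s in range(n - 1):
        outgoing = [data[r][(r + 1 - s) % n].copy() for r in range(n)]
        for r in range(n):
            data[(r + 1) % n][(r + 1 - s) % n] = outgoing[r]

    return [np.concatenate(chunks) for chunks in data]

ranks = [np.arange(8) * (r + 1) for r in range(4)]  # 4 ranks, 8 elements
out = ring_all_reduce(ranks)
expected = sum(ranks).astype(float)
assert all(np.array_equal(o, expected) for o in out)
```

Each of the 2(N-1) steps moves data_size/N bytes per GPU, which is where the 2 × data_size × (N-1)/N cost in the table comes from.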
NVLink, NVSwitch, and Interconnects
Communication bandwidth determines which parallelism strategies are practical. GPU interconnect technology varies dramatically in bandwidth.
Interconnect Bandwidth
| Interconnect | Bandwidth (bidirectional) | Typical Use |
|---|---|---|
| NVLink 4.0 (H100) | 900 GB/s | Intra-node GPU-to-GPU |
| NVLink 3.0 (A100) | 600 GB/s | Intra-node GPU-to-GPU |
| PCIe 5.0 x16 | ~128 GB/s | GPU to CPU, older systems |
| InfiniBand HDR | ~50 GB/s | Inter-node |
| 100 GbE | ~25 GB/s | Ethernet clusters |
DGX H100 Topology
The DGX H100 uses NVSwitch to provide full-bisection bandwidth between all 8 GPUs: every GPU can talk to every other GPU at full NVLink speed simultaneously.
Without NVSwitch, GPU pairs connect via point-to-point NVLink, and some pairs must route through intermediate GPUs. This creates topology-dependent performance, so the tensor-parallel degree should match the number of GPUs with direct NVLink connections.
Multi-Node Considerations
Inter-node bandwidth (InfiniBand/Ethernet) is 10-50× slower than NVLink. This constrains which parallelism works across nodes:
- Data parallelism — works well (gradient sync once per step)
- Pipeline parallelism — works well (point-to-point only)
- Tensor parallelism — avoid across nodes (communication every layer)
Implementing DDP
DistributedDataParallel (DDP) is the workhorse of multi-GPU training. Each GPU has a full model copy, processes different data, and synchronizes gradients.
Basic DDP Setup
```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

# Initialize process group (called once per process)
dist.init_process_group(backend="nccl")

# Get local rank (which GPU this process uses)
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Create model and wrap with DDP
model = MyModel().cuda()
model = DDP(model, device_ids=[local_rank])

# Distributed sampler so each rank sees a different shard of the data
train_sampler = DistributedSampler(train_dataset)
train_loader = DataLoader(
    train_dataset,
    sampler=train_sampler,
    batch_size=per_gpu_batch_size,
)

# Training loop (gradients auto-synced by DDP)
for epoch in range(epochs):
    train_sampler.set_epoch(epoch)  # Reshuffle shards each epoch
    for batch in train_loader:
        loss = model(batch)
        loss.backward()   # DDP syncs gradients here
        optimizer.step()
        optimizer.zero_grad()
```
How DDP Works
- Bucket gradients — group gradients into ~25MB buckets for efficient AllReduce
- Overlap compute and communication — AllReduce starts as soon as bucket is ready
- Broadcast initial weights — rank 0 broadcasts weights to all ranks at start
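The bucketing idea can be sketched with a toy greedy grouper. This is a simplification: real DDP buckets parameters roughly in reverse registration order and launches an AllReduce as soon as every gradient in a bucket is ready.

```python
def bucket_params(param_sizes_bytes, cap_bytes=25 * 2**20):
    """Greedily group parameter indices into ~cap_bytes buckets
    (a simplified model of DDP's gradient bucketing)."""
    buckets, current, current_size = [], [], 0
    for i, size in enumerate(param_sizes_bytes):
        current.append(i)
        current_size += size
        if current_size >= cap_bytes:  # bucket full: flush it
            buckets.append(current)
            current, current_size = [], 0
    if current:  # leftover partial bucket
        buckets.append(current)
    return buckets

# Seven 10 MB gradients against the 25 MB default cap:
print(bucket_params([10 * 2**20] * 7))  # [[0, 1, 2], [3, 4, 5], [6]]
```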
```bash
# Single node, 8 GPUs
torchrun --nproc_per_node=8 train.py

# Multi-node (run on each node)
torchrun --nnodes=2 --nproc_per_node=8 \
    --rdzv_id=job1 --rdzv_backend=c10d \
    --rdzv_endpoint=master_ip:29500 train.py
```
DDP groups gradients into buckets for efficient AllReduce: small AllReduces have high overhead, and bucketing amortizes it. The default bucket size is 25 MB, tunable via bucket_cap_mb.
Scaling Efficiency
DDP efficiency depends on the ratio of compute to communication:
| Model Size | Gradient Size (FP16) | AllReduce Time (8×H100) | Typical Efficiency |
|---|---|---|---|
| 1B params | 2 GB | ~4 ms | >95% |
| 7B params | 14 GB | ~31 ms | ~90% |
| 70B params | 140 GB | ~311 ms | Needs ZeRO |
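The table's AllReduce times can be roughly reproduced from the ring-AllReduce cost model, assuming 900 GB/s NVLink-class bandwidth (an idealization: real runs add latency and protocol overhead, which is why measured times are somewhat higher):

```python
def allreduce_time_ms(grad_bytes, n_gpus, bus_gb_s=900):
    """Rough lower bound on ring AllReduce time, ignoring latency.
    bus_gb_s=900 assumes H100 NVLink bandwidth; effective bandwidth
    is lower in practice."""
    traffic = 2 * grad_bytes * (n_gpus - 1) / n_gpus  # bytes per GPU
    return traffic / (bus_gb_s * 1e9) * 1e3

print(f"{allreduce_time_ms(2e9, 8):.1f} ms")   # 1B params in FP16: 3.9 ms
print(f"{allreduce_time_ms(14e9, 8):.1f} ms")  # 7B params in FP16: 27.2 ms
```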
Memory-Efficient Data Parallelism
ZeRO (Zero Redundancy Optimizer) eliminates memory redundancy in data parallelism. Instead of every GPU storing the full optimizer states, gradients, and parameters, each of these is partitioned across the data-parallel ranks.
ZeRO Stages
- Baseline DDP — each GPU stores full parameters, gradients, and optimizer states
- ZeRO Stage 1 — partition optimizer states
- ZeRO Stage 2 — partition gradients as well (ReduceScatter instead of AllReduce)
- ZeRO Stage 3 — partition parameters as well (AllGather before each forward/backward)
Memory Savings
| Stage | Memory per GPU (N GPUs, FP16+FP32 opt) | Communication |
|---|---|---|
| Baseline DDP | 2Ψ + 2Ψ + 12Ψ = 16Ψ | AllReduce |
| ZeRO-1 | 2Ψ + 2Ψ + 12Ψ/N = 4Ψ + 12Ψ/N | AllReduce |
| ZeRO-2 | 2Ψ + 2Ψ/N + 12Ψ/N = 2Ψ + 14Ψ/N | ReduceScatter + AllGather |
| ZeRO-3 | 16Ψ/N | AllGather (2x per layer) |
Ψ = model parameters. Adam FP32 optimizer states = 12 bytes/param (momentum + variance + master weight).
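The table's expressions can be packaged into a small calculator (a sketch assuming FP16 weights/gradients and FP32 Adam states, matching the byte counts above; activations and buffers are excluded):

```python
def zero_memory_gb(params_billion, n_gpus, stage=0):
    """Per-GPU training-state memory in GB for DDP (stage 0) and
    ZeRO stages 1-3. Bytes per parameter: 2 (FP16 weights) +
    2 (FP16 grads) + 12 (FP32 Adam: momentum, variance, master weight)."""
    psi = params_billion * 1e9
    p, g, o = 2 * psi, 2 * psi, 12 * psi
    if stage >= 1:
        o /= n_gpus  # partition optimizer states
    if stage >= 2:
        g /= n_gpus  # partition gradients
    if stage >= 3:
        p /= n_gpus  # partition parameters
    return (p + g + o) / 1e9

# 7B parameters on 8 GPUs:
for s in range(4):
    print(f"Stage {s}: {zero_memory_gb(7, 8, s):.1f} GB/GPU")
```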
```python
import deepspeed

# Config enables ZeRO Stage 2
ds_config = {
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu"},  # Optional CPU offload
        "contiguous_gradients": True,
        "overlap_comm": True,
    },
    "fp16": {"enabled": True},
    "train_batch_size": 32,
}

model, optimizer, _, _ = deepspeed.initialize(
    model=model,
    config=ds_config,
)
```
ZeRO-Offload extends ZeRO by offloading optimizer states and computation to CPU memory. This enables training 10B+ models on a single GPU, at the cost of CPU-GPU transfer overhead.
Splitting Layers Across GPUs
Megatron-LM style tensor parallelism splits individual linear layers across GPUs. This is essential for models where a single layer doesn't fit in GPU memory.
Column-Parallel Linear
Split the weight matrix along the output dimension. Each GPU computes part of the output.
```python
# Weight W shape: (in_features, out_features)
# Split into W = [W_0, W_1] along columns
#   GPU 0: A_0 = X @ W_0   (all of X, part of W)
#   GPU 1: A_1 = X @ W_1   (all of X, part of W)
# Result: A = [A_0, A_1]  (concatenated)
# Requires AllGather to produce the full A on all GPUs (if needed)
```
Row-Parallel Linear
Split the weight matrix along the input dimension. Requires input to be partitioned.
```python
# Weight W shape: (in_features, out_features)
# Split into W = [W_0; W_1] along rows
# Input A = [A_0, A_1] partitioned across GPUs
#   GPU 0: B_0 = A_0 @ W_0   (partial product)
#   GPU 1: B_1 = A_1 @ W_1   (partial product)
# Result: B = B_0 + B_1  (requires AllReduce)
```
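Both schemes can be verified numerically on CPU with numpy, simulating the two "GPUs" as array slices (a correctness check of the math, not a claim about Megatron's implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 6))       # batch of activations
W = rng.standard_normal((6, 8))       # full weight matrix
reference = X @ W

# Column parallel: split W along the output dimension.
W0, W1 = W[:, :4], W[:, 4:]
A0, A1 = X @ W0, X @ W1               # each "GPU" computes part of A
A = np.concatenate([A0, A1], axis=1)  # AllGather
assert np.allclose(A, reference)

# Row parallel: split W along the input dimension; the input arrives
# already partitioned (e.g. from a preceding column-parallel layer).
V0, V1 = W[:3, :], W[3:, :]
X0, X1 = X[:, :3], X[:, 3:]
B = X0 @ V0 + X1 @ V1                 # AllReduce of partial products
assert np.allclose(B, reference)
```

Chaining a column-parallel layer into a row-parallel layer is exactly the trick the transformer pattern below exploits: the intermediate AllGather can be skipped entirely.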
Transformer Tensor Parallelism
In transformers, column-parallel and row-parallel layers are combined to minimize communication:
| Layer | Parallelism Type | Communication |
|---|---|---|
| QKV projection | Column parallel | None (input broadcast) |
| Attention output | Row parallel | AllReduce after |
| FFN first linear | Column parallel | None |
| FFN second linear | Row parallel | AllReduce after |
This pattern requires only 2 AllReduces per transformer layer (one after attention, one after FFN).
Tensor parallel degree is typically 2, 4, or 8—matching GPUs connected by NVLink. Higher degrees have diminishing returns: more communication, smaller per-GPU matrices (worse GPU utilization).
Key Takeaways
- Data parallel — simplest, scales batch size, model must fit on 1 GPU
- Tensor parallel — splits layers, requires NVLink, typically 2-8 GPUs
- Pipeline parallel — splits by layers, works across nodes, has bubble overhead
- 3D parallelism — combine all three for maximum scale
- AllReduce — sum to all, used in DDP gradient sync
- AllGather — concat to all, used in tensor parallel and ZeRO-3
- ReduceScatter — sum and partition, used in ZeRO-2
- ZeRO-1 — partition optimizer states (~4x memory savings)
- ZeRO-2 — + partition gradients (~8x savings)
- ZeRO-3 — + partition parameters (~N× savings, more communication)