From architecture to production. Learn memory optimization, kernel development, training at scale, multi-GPU parallelism, and deployment through interactive lessons and hands-on notebooks.
Not confident with linear algebra or floating point math? Take the diagnostic quiz to identify gaps and find curated resources.
Amdahl's Law, parallel patterns, thinking in SIMT
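Amdahl's Law bounds the speedup from parallelizing a fraction `p` of a program across `n` workers; a quick sketch in plain Python (function name is illustrative, not from the course):

```python
def amdahl_speedup(p, n):
    """Speedup when a fraction p of the runtime is parallelized across n workers.

    The serial fraction (1 - p) is untouched, so even with unlimited workers
    the speedup is capped at 1 / (1 - p).
    """
    return 1.0 / ((1.0 - p) + p / n)

# A 5% serial fraction caps speedup below 20x no matter how many GPUs you add.
print(amdahl_speedup(0.95, 1024))
```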
SMs, warps, blocks, and the execution hierarchy
Registers, SMEM, L2, HBM—latency, bandwidth, coalescing
Triton basics, index arithmetic, tiling for data reuse
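The index arithmetic behind tiling can be previewed without a GPU; this sketch computes the row-major offsets a tile of threads would load, the same arithmetic a Triton kernel builds with `tl.arange` and broadcasting (helper name is illustrative):

```python
def tile_offsets(row_start, col_start, tile_m, tile_n, stride):
    """Flat row-major offsets for a tile_m x tile_n tile of a matrix
    whose rows are `stride` elements apart in memory."""
    return [[(row_start + i) * stride + (col_start + j) for j in range(tile_n)]
            for i in range(tile_m)]

# 2x2 tile starting at row 2, col 4 of a matrix with row stride 8.
print(tile_offsets(2, 4, 2, 2, 8))  # [[20, 21], [28, 29]]
```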
Barriers, warp shuffles, parallel reductions, atomics
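A parallel reduction combines n values in log2(n) steps; a serial simulation of the tree pattern, assuming a power-of-two input (illustrative sketch, not the course code):

```python
def tree_reduce(values):
    """Pairwise (tree) reduction: halve the active range each step,
    mirroring how threads in a block combine partial sums.
    Assumes len(values) is a power of two."""
    vals = list(values)
    stride = len(vals) // 2
    while stride > 0:
        # On a GPU, a barrier (__syncthreads) would separate these steps.
        for i in range(stride):
            vals[i] += vals[i + stride]
        stride //= 2
    return vals[0]

print(tree_reduce([1, 2, 3, 4, 5, 6, 7, 8]))  # 36
```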
Profiling, bank conflicts, Tensor Cores, TMA, CUDA Graphs
Error messages, flowcharts, NaN hunting, Nsight tools
Fusion, LayerNorm, RMSNorm, embeddings, fused Adam
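RMSNorm is LayerNorm without centering or bias: scale by the root-mean-square, then apply a learned weight. A plain-Python sketch of the math (not the fused kernel the module builds):

```python
import math

def rmsnorm(xs, weight, eps=1e-6):
    """Normalize by the root-mean-square of the inputs (no mean subtraction),
    then scale element-wise by a learned weight."""
    rms = math.sqrt(sum(x * x for x in xs) / len(xs) + eps)
    return [w * x / rms for w, x in zip(weight, xs)]
```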
Dot products, softmax, online algorithms, FlashAttention
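Online softmax is the one-pass trick FlashAttention builds on: keep a running maximum and a running normalizer, rescaling the normalizer whenever a new maximum appears. A minimal sketch (illustrative, not the course implementation):

```python
import math

def online_softmax(xs):
    """Single pass over xs: m tracks the running max, d the running
    normalizer sum(exp(x - m)); d is rescaled when m changes."""
    m = float("-inf")
    d = 0.0
    for x in xs:
        m_new = max(m, x)
        d = d * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return [math.exp(x - m) / d for x in xs]
```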
Floating point formats, INT8/INT4, FP8, numerical stability
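Symmetric per-tensor INT8 quantization maps values into [-127, 127] by a single scale; a toy round-trip showing the quantization error (names are illustrative):

```python
def quantize_int8(xs):
    """Symmetric quantization: scale by max |x| so the range maps
    onto [-127, 127], then round to integers."""
    amax = max(abs(x) for x in xs)
    scale = amax / 127.0 if amax > 0 else 1.0
    q = [max(-127, min(127, round(x / scale))) for x in xs]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats; error is at most ~scale/2 per element."""
    return [v * scale for v in q]
```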
Backward passes, mixed precision, checkpointing, gradient accumulation
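Gradient accumulation sums gradients over several micro-batches and applies one averaged update, simulating a larger batch without holding it in memory. A framework-agnostic sketch with plain SGD (function name is illustrative):

```python
def sgd_with_accumulation(params, grads_per_microbatch, lr, accum_steps):
    """Accumulate gradients for accum_steps micro-batches, then take a
    single SGD step with the averaged gradient."""
    acc = [0.0] * len(params)
    for step, grads in enumerate(grads_per_microbatch, start=1):
        acc = [a + g for a, g in zip(acc, grads)]
        if step % accum_steps == 0:
            params = [p - lr * (a / accum_steps) for p, a in zip(params, acc)]
            acc = [0.0] * len(params)
    return params
```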
NCCL, data/tensor/pipeline parallelism, ZeRO optimizer
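The core collective in data parallelism is an all-reduce: every rank ends up with the element-wise mean of all ranks' gradients. A serial simulation of the result (NCCL does this in parallel over rings or trees; function name is illustrative):

```python
def allreduce_mean(per_rank_grads):
    """Simulate an averaging all-reduce: each rank receives the
    element-wise mean of all ranks' gradient vectors."""
    n = len(per_rank_grads)
    summed = [sum(col) for col in zip(*per_rank_grads)]
    return [[s / n for s in summed] for _ in range(n)]

# Two ranks with different local gradients converge to the same average.
print(allreduce_mean([[1.0, 2.0], [3.0, 4.0]]))  # [[2.0, 3.0], [2.0, 3.0]]
```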
KV cache strategies, TensorRT-LLM, vLLM integration
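The KV cache idea in one toy class: append each new token's keys and values so every decode step attends over all previous tokens without recomputing them. A deliberately simple sketch (real engines like vLLM manage this in paged GPU memory blocks):

```python
class KVCache:
    """Toy per-layer cache: grows by one (key, value) pair per decoded
    token; get() returns everything seen so far for attention."""
    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

    def get(self):
        return self.keys, self.values

cache = KVCache()
cache.append([0.1, 0.2], [1.0, 0.0])  # token 1
cache.append([0.3, 0.4], [0.0, 1.0])  # token 2
keys, values = cache.get()
print(len(keys))  # 2
```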
33 hands-on Jupyter notebooks with free GPU access. View all notebooks