From architecture to production. Learn memory optimization, kernel development, training at scale, multi-GPU parallelism, and deployment through interactive lessons and hands-on notebooks.
Not confident with linear algebra or floating point math? Take the diagnostic quiz to identify gaps and find curated resources.
Amdahl's Law, parallel patterns, thinking in SIMT
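Amdahl's Law bounds the speedup from parallelizing a fraction `p` of a program across `n` workers; a quick sketch in plain Python (function name is illustrative, not from the course):

```python
def amdahl_speedup(p, n):
    """Speedup when a fraction p of the runtime is parallelized across n workers.

    The serial fraction (1 - p) is untouched, so even with unlimited workers
    the speedup is capped at 1 / (1 - p).
    """
    return 1.0 / ((1.0 - p) + p / n)

# A 5% serial fraction caps speedup below 20x no matter how many GPUs you add.
print(amdahl_speedup(0.95, 1024))
```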
SMs, warps, blocks, and the execution hierarchy
Registers, SMEM, L2, HBM—latency, bandwidth, coalescing
Triton basics, index arithmetic, tiling for data reuse
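The index arithmetic behind tiling can be previewed without a GPU; this sketch computes the row-major offsets a tile of threads would load, the same arithmetic a Triton kernel builds with `tl.arange` and broadcasting (helper name is illustrative):

```python
def tile_offsets(row_start, col_start, tile_m, tile_n, stride):
    """Flat row-major offsets for a tile_m x tile_n tile of a matrix
    whose rows are `stride` elements apart in memory."""
    return [[(row_start + i) * stride + (col_start + j) for j in range(tile_n)]
            for i in range(tile_m)]

# 2x2 tile starting at row 2, col 4 of a matrix with row stride 8.
print(tile_offsets(2, 4, 2, 2, 8))  # [[20, 21], [28, 29]]
```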
Barriers, warp shuffles, parallel reductions, atomics
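A parallel reduction combines n values in log2(n) steps; a serial simulation of the tree pattern, assuming a power-of-two input (illustrative sketch, not the course code):

```python
def tree_reduce(values):
    """Pairwise (tree) reduction: halve the active range each step,
    mirroring how threads in a block combine partial sums.
    Assumes len(values) is a power of two."""
    vals = list(values)
    stride = len(vals) // 2
    while stride > 0:
        # On a GPU, a barrier (__syncthreads) would separate these steps.
        for i in range(stride):
            vals[i] += vals[i + stride]
        stride //= 2
    return vals[0]

print(tree_reduce([1, 2, 3, 4, 5, 6, 7, 8]))  # 36
```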
Profiling, bank conflicts, Tensor Cores, TMA, CUDA Graphs
Error messages, flowcharts, NaN hunting, Nsight tools
Fusion, LayerNorm, RMSNorm, embeddings, fused Adam
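RMSNorm is LayerNorm without centering or bias: scale by the root-mean-square, then apply a learned weight. A plain-Python sketch of the math (not the fused kernel the module builds):

```python
import math

def rmsnorm(xs, weight, eps=1e-6):
    """Normalize by the root-mean-square of the inputs (no mean subtraction),
    then scale element-wise by a learned weight."""
    rms = math.sqrt(sum(x * x for x in xs) / len(xs) + eps)
    return [w * x / rms for w, x in zip(weight, xs)]
```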
Dot products, softmax, online algorithms, FlashAttention
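Online softmax is the one-pass trick FlashAttention builds on: keep a running maximum and a running normalizer, rescaling the normalizer whenever a new maximum appears. A minimal sketch (illustrative, not the course implementation):

```python
import math

def online_softmax(xs):
    """Single pass over xs: m tracks the running max, d the running
    normalizer sum(exp(x - m)); d is rescaled when m changes."""
    m = float("-inf")
    d = 0.0
    for x in xs:
        m_new = max(m, x)
        d = d * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return [math.exp(x - m) / d for x in xs]
```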
Floating point formats, INT8/INT4, FP8, numerical stability
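Symmetric per-tensor INT8 quantization maps values into [-127, 127] by a single scale; a toy round-trip showing the quantization error (names are illustrative):

```python
def quantize_int8(xs):
    """Symmetric quantization: scale by max |x| so the range maps
    onto [-127, 127], then round to integers."""
    amax = max(abs(x) for x in xs)
    scale = amax / 127.0 if amax > 0 else 1.0
    q = [max(-127, min(127, round(x / scale))) for x in xs]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats; error is at most ~scale/2 per element."""
    return [v * scale for v in q]
```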
Backward passes, mixed precision, checkpointing, gradient accumulation
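Gradient accumulation sums gradients over several micro-batches and applies one averaged update, simulating a larger batch without holding it in memory. A framework-agnostic sketch with plain SGD (function name is illustrative):

```python
def sgd_with_accumulation(params, grads_per_microbatch, lr, accum_steps):
    """Accumulate gradients for accum_steps micro-batches, then take a
    single SGD step with the averaged gradient."""
    acc = [0.0] * len(params)
    for step, grads in enumerate(grads_per_microbatch, start=1):
        acc = [a + g for a, g in zip(acc, grads)]
        if step % accum_steps == 0:
            params = [p - lr * (a / accum_steps) for p, a in zip(params, acc)]
            acc = [0.0] * len(params)
    return params
```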
NCCL, data/tensor/pipeline parallelism, ZeRO optimizer
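The core collective in data parallelism is an all-reduce: every rank ends up with the element-wise mean of all ranks' gradients. A serial simulation of the result (NCCL does this in parallel over rings or trees; function name is illustrative):

```python
def allreduce_mean(per_rank_grads):
    """Simulate an averaging all-reduce: each rank receives the
    element-wise mean of all ranks' gradient vectors."""
    n = len(per_rank_grads)
    summed = [sum(col) for col in zip(*per_rank_grads)]
    return [[s / n for s in summed] for _ in range(n)]

# Two ranks with different local gradients converge to the same average.
print(allreduce_mean([[1.0, 2.0], [3.0, 4.0]]))  # [[2.0, 3.0], [2.0, 3.0]]
```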
KV cache strategies, TensorRT-LLM, vLLM integration
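The KV cache idea in one toy class: append each new token's keys and values so every decode step attends over all previous tokens without recomputing them. A deliberately simple sketch (real engines like vLLM manage this in paged GPU memory blocks):

```python
class KVCache:
    """Toy per-layer cache: grows by one (key, value) pair per decoded
    token; get() returns everything seen so far for attention."""
    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

    def get(self):
        return self.keys, self.values

cache = KVCache()
cache.append([0.1, 0.2], [1.0, 0.0])  # token 1
cache.append([0.3, 0.4], [0.0, 1.0])  # token 2
keys, values = cache.get()
print(len(keys))  # 2
```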
33 hands-on Jupyter notebooks with free GPU access. View all notebooks