From Python to production-grade CUDA kernels. Learn GPU architecture, memory optimization, Triton, Tensor Cores, and quantization through interactive lessons and hands-on notebooks.
SMs, warps, blocks, and the execution hierarchy
Registers, SMEM, L2, HBM: latency, bandwidth, and coalescing
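Coalescing can be previewed with a toy counting model: how many memory sectors does one warp's worth of loads touch? This is a CPU-side sketch under the assumption of 32-byte sectors and 4-byte floats; `sectors_touched` is an illustrative name, not a real profiler API.

```python
# Toy model of memory coalescing: a warp of 32 threads issues one
# float load each; the hardware services them in 32-byte sectors
# (8 floats per sector, an assumption for illustration). Count how
# many distinct sectors a warp touches for a given element stride.
WARP = 32
SECTOR_FLOATS = 8  # assumed: 32-byte sectors of 4-byte floats

def sectors_touched(stride):
    addrs = [lane * stride for lane in range(WARP)]
    return len({a // SECTOR_FLOATS for a in addrs})

# sectors_touched(1) -> 4   (coalesced: 32 floats span 4 sectors)
# sectors_touched(8) -> 32  (strided: every lane hits its own sector)
```

Fewer sectors per warp means fewer memory transactions, which is why unit-stride access patterns dominate well-optimized kernels.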
Triton basics, index arithmetic, tiling for data reuse
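The index arithmetic at the heart of Triton can be previewed in plain Python: each kernel "program" owns one tile of elements and masks out-of-bounds lanes. This is a CPU sketch mimicking Triton's `tl.program_id`/`tl.arange` idiom; `add_kernel`, `launch`, and `BLOCK` are illustrative names, not Triton APIs.

```python
# Pure-Python sketch of Triton-style tiling: each program instance
# (identified by pid) handles one BLOCK-sized tile, with a mask
# guarding the ragged final tile.
BLOCK = 4

def add_kernel(x, y, out, pid):
    # offsets = pid * BLOCK + arange(0, BLOCK), as in a Triton kernel
    offsets = [pid * BLOCK + i for i in range(BLOCK)]
    for off in offsets:
        if off < len(x):  # mask: skip out-of-bounds lanes
            out[off] = x[off] + y[off]

def launch(x, y):
    out = [0.0] * len(x)
    grid = -(-len(x) // BLOCK)  # ceil-division: number of programs
    for pid in range(grid):
        add_kernel(x, y, out, pid)
    return out

# launch([1, 2, 3, 4, 5], [10, 20, 30, 40, 50]) -> [11, 22, 33, 44, 55]
```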
Profiling, bank conflicts, Tensor Cores, TMA
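Bank conflicts also admit a toy counting model: shared memory is divided into 32 banks, and a warp's accesses serialize when multiple lanes map to the same bank. A sketch under the usual bank = word-address mod 32 rule; `conflict_degree` is an illustrative name, not a real profiler metric.

```python
from collections import Counter

# Toy model of shared-memory bank conflicts: 32 banks, one 4-byte
# word wide each; the worst-case serialization factor for a warp is
# the largest number of lanes hitting any single bank.
BANKS = 32
WARP = 32

def conflict_degree(stride):
    banks = [(lane * stride) % BANKS for lane in range(WARP)]
    return max(Counter(banks).values())

# conflict_degree(1)  -> 1  (conflict-free)
# conflict_degree(2)  -> 2  (two-way conflict)
# conflict_degree(32) -> 32 (fully serialized)
```

The classic fix, padding a shared-memory tile by one element so the stride is no longer a multiple of 32, falls directly out of this model.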
Dot products, softmax, online algorithms, FlashAttention
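The online-softmax trick, the building block FlashAttention rests on, fits in a few lines of plain Python: keep a running max and rescale the running denominator whenever the max grows, so the input is consumed in a single pass.

```python
import math

def online_softmax(xs):
    # Single-pass softmax: maintain running max m and running
    # denominator d, rescaling d whenever a new max appears.
    # Avoids a separate max pass over the data.
    m = float("-inf")
    d = 0.0
    for x in xs:
        m_new = max(m, x)
        d = d * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return [math.exp(x - m) / d for x in xs]
```

Because the max never needs to be known up front, the same rescaling lets FlashAttention process attention scores tile by tile without materializing the full score matrix.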
Floating point formats, INT8/INT4, FP8, numerical stability
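Symmetric INT8 quantization, the simplest of the formats covered here, can be sketched in a few lines: one scale maps the largest absolute value to 127, and dequantization multiplies back. A toy per-tensor version; production code works on tensors and often uses per-channel scales.

```python
def quantize_int8(xs):
    # Symmetric per-tensor INT8: scale maps max|x| to 127; values
    # are rounded and clamped to [-127, 127]. The `or 1.0` guards
    # the all-zero input, where any scale works.
    scale = max(abs(x) for x in xs) / 127.0 or 1.0
    q = [max(-127, min(127, round(x / scale))) for x in xs]
    return q, scale

def dequantize_int8(q, scale):
    return [v * scale for v in q]
```

The round trip loses at most half a quantization step per value, which is the error budget the later lessons on numerical stability account for.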
KV cache strategies, TensorRT-LLM, vLLM integration
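The core idea behind every KV cache strategy is the same: during autoregressive decoding, append each new token's keys and values instead of recomputing the whole prefix. A minimal sketch; real engines such as vLLM add paging and block tables on top, and tensor shapes are omitted here.

```python
class KVCache:
    # Toy per-layer KV cache for autoregressive decoding: store each
    # step's key/value and attend over the accumulated history.
    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, k, v):
        # One decode step: add this token's K/V, return full history.
        self.keys.append(k)
        self.values.append(v)
        return self.keys, self.values
```

With a cache, each decode step costs attention over the stored history rather than a full recomputation of K/V for every prior token.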
Not confident with linear algebra or floating-point math? Take the diagnostic quiz to identify gaps and find curated resources. This is a reference: work through it as needed, not a required prerequisite.
Take the diagnostic quiz →
29 hands-on Jupyter notebooks covering the complete GPU programming journey. Run them in Google Colab with free GPU access.
View all notebooks →