From Python to production-grade CUDA kernels. Learn GPU architecture, memory optimization, Triton, Tensor Cores, and quantization through interactive lessons and hands-on notebooks.
SMs, warps, blocks, and the execution hierarchy
Registers, SMEM, L2, HBM: latency, bandwidth, and coalescing
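Coalescing can be previewed with a toy counting model: how many memory sectors does one warp's worth of loads touch? This is a CPU-side sketch under the assumption of 32-byte sectors and 4-byte floats; `sectors_touched` is an illustrative name, not a real profiler API.

```python
# Toy model of memory coalescing: a warp of 32 threads issues one
# float load each; the hardware services them in 32-byte sectors
# (8 floats per sector, an assumption for illustration). Count how
# many distinct sectors a warp touches for a given element stride.
WARP = 32
SECTOR_FLOATS = 8  # assumed: 32-byte sectors of 4-byte floats

def sectors_touched(stride):
    addrs = [lane * stride for lane in range(WARP)]
    return len({a // SECTOR_FLOATS for a in addrs})

# sectors_touched(1) -> 4   (coalesced: 32 floats span 4 sectors)
# sectors_touched(8) -> 32  (strided: every lane hits its own sector)
```

Fewer sectors per warp means fewer memory transactions, which is why unit-stride access patterns dominate well-optimized kernels.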
Triton basics, index arithmetic, tiling for data reuse
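The index arithmetic at the heart of Triton can be previewed in plain Python: each kernel "program" owns one tile of elements and masks out-of-bounds lanes. This is a CPU sketch mimicking Triton's `tl.program_id`/`tl.arange` idiom; `add_kernel`, `launch`, and `BLOCK` are illustrative names, not Triton APIs.

```python
# Pure-Python sketch of Triton-style tiling: each program instance
# (identified by pid) handles one BLOCK-sized tile, with a mask
# guarding the ragged final tile.
BLOCK = 4

def add_kernel(x, y, out, pid):
    # offsets = pid * BLOCK + arange(0, BLOCK), as in a Triton kernel
    offsets = [pid * BLOCK + i for i in range(BLOCK)]
    for off in offsets:
        if off < len(x):  # mask: skip out-of-bounds lanes
            out[off] = x[off] + y[off]

def launch(x, y):
    out = [0.0] * len(x)
    grid = -(-len(x) // BLOCK)  # ceil-division: number of programs
    for pid in range(grid):
        add_kernel(x, y, out, pid)
    return out

# launch([1, 2, 3, 4, 5], [10, 20, 30, 40, 50]) -> [11, 22, 33, 44, 55]
```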
Profiling, bank conflicts, Tensor Cores, TMA
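Bank conflicts also admit a toy counting model: shared memory is divided into 32 banks, and a warp's accesses serialize when multiple lanes map to the same bank. A sketch under the usual bank = word-address mod 32 rule; `conflict_degree` is an illustrative name, not a real profiler metric.

```python
from collections import Counter

# Toy model of shared-memory bank conflicts: 32 banks, one 4-byte
# word wide each; the worst-case serialization factor for a warp is
# the largest number of lanes hitting any single bank.
BANKS = 32
WARP = 32

def conflict_degree(stride):
    banks = [(lane * stride) % BANKS for lane in range(WARP)]
    return max(Counter(banks).values())

# conflict_degree(1)  -> 1  (conflict-free)
# conflict_degree(2)  -> 2  (two-way conflict)
# conflict_degree(32) -> 32 (fully serialized)
```

The classic fix, padding a shared-memory tile by one element so the stride is no longer a multiple of 32, falls directly out of this model.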
Dot products, softmax, online algorithms, FlashAttention
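The online-softmax trick, the building block FlashAttention rests on, fits in a few lines of plain Python: keep a running max and rescale the running denominator whenever the max grows, so the input is consumed in a single pass.

```python
import math

def online_softmax(xs):
    # Single-pass softmax: maintain running max m and running
    # denominator d, rescaling d whenever a new max appears.
    # Avoids a separate max pass over the data.
    m = float("-inf")
    d = 0.0
    for x in xs:
        m_new = max(m, x)
        d = d * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return [math.exp(x - m) / d for x in xs]
```

Because the max never needs to be known up front, the same rescaling lets FlashAttention process attention scores tile by tile without materializing the full score matrix.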
Floating point formats, INT8/INT4, FP8, numerical stability
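Symmetric INT8 quantization, the simplest of the formats covered here, can be sketched in a few lines: one scale maps the largest absolute value to 127, and dequantization multiplies back. A toy per-tensor version; production code works on tensors and often uses per-channel scales.

```python
def quantize_int8(xs):
    # Symmetric per-tensor INT8: scale maps max|x| to 127; values
    # are rounded and clamped to [-127, 127]. The `or 1.0` guards
    # the all-zero input, where any scale works.
    scale = max(abs(x) for x in xs) / 127.0 or 1.0
    q = [max(-127, min(127, round(x / scale))) for x in xs]
    return q, scale

def dequantize_int8(q, scale):
    return [v * scale for v in q]
```

The round trip loses at most half a quantization step per value, which is the error budget the later lessons on numerical stability account for.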
KV cache strategies, TensorRT-LLM, vLLM integration
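The core idea behind every KV cache strategy is the same: during autoregressive decoding, append each new token's keys and values instead of recomputing the whole prefix. A minimal sketch; real engines such as vLLM add paging and block tables on top, and tensor shapes are omitted here.

```python
class KVCache:
    # Toy per-layer KV cache for autoregressive decoding: store each
    # step's key/value and attend over the accumulated history.
    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, k, v):
        # One decode step: add this token's K/V, return full history.
        self.keys.append(k)
        self.values.append(v)
        return self.keys, self.values
```

With a cache, each decode step costs attention over the stored history rather than a full recomputation of K/V for every prior token.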
Not confident with linear algebra or floating-point math? Take the diagnostic quiz to identify gaps and find curated resources. This is a reference: work through it as needed, not a required prerequisite.
Take the diagnostic quiz →
29 hands-on Jupyter notebooks covering the complete GPU programming journey. Run them in Google Colab with free GPU access.
View all notebooks →