Hands-On Practice

Practice Notebooks

29 Jupyter notebooks cover the complete GPU programming journey, from NumPy baselines to production-ready kernels. Run them in Google Colab with free GPU access.

Part 1

Foundations: NumPy to Triton

Chapters 1-3
00
Environment Check

Verify GPU setup and dependencies

Open in Colab
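As a rough idea of what notebook 00 checks, the sketch below asks PyTorch (preinstalled on Colab GPU runtimes) whether a CUDA device is visible; the notebook's actual cells may differ.

```python
# Sanity check: is a CUDA GPU visible from this runtime?
# Assumes PyTorch is available, as it is on Colab GPU runtimes.
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}")
    print(f"Streaming multiprocessors: {props.multi_processor_count}")
    print(f"Global memory: {props.total_memory / 1e9:.1f} GB")
else:
    print("No CUDA device found - switch the Colab runtime type to GPU.")
```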
01
NumPy Baseline

Establish CPU performance benchmarks

Open in Colab
02
CuPy Introduction

Instant GPU speedup from a drop-in NumPy replacement

Open in Colab
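The core idea of notebook 02 fits in a few lines: CuPy mirrors the NumPy API, so moving the data to the GPU is often the only change. A sketch, not the notebook's exact code.

```python
import numpy as np
import cupy as cp  # GPU-backed, NumPy-compatible array library

x_cpu = np.random.rand(4096, 4096).astype(np.float32)

x_gpu = cp.asarray(x_cpu)        # host -> device copy
y_gpu = cp.matmul(x_gpu, x_gpu)  # same call as NumPy, runs as a CUDA kernel
y_cpu = cp.asnumpy(y_gpu)        # device -> host copy when the result is needed on the CPU
```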
03
GPU Architecture

Understand SMs, warps, and the execution model

Open in Colab
04
First Triton Kernel

Write your first GPU kernel with Triton

Open in Colab
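For orientation before notebook 04, this is the canonical vector-add kernel from the Triton tutorials; the notebook's own kernel may differ in detail.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                     # each program instance handles one block
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                     # guard the final, partially filled block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

x = torch.rand(98432, device="cuda")
y = torch.rand(98432, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)              # one program per 1024-element block
add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)
```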
05
Memory Hierarchy

Profile and understand memory access patterns

Open in Colab
06
Tiling Basics

Implement tiled memory access for better cache utilization

Open in Colab
07
Fast Matrix Multiplication

Achieve 500+ GFLOPS with optimized matmul

Open in Colab
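The 500+ GFLOPS figure is computed the standard way: a dense M x N x K matmul performs 2*M*N*K floating-point operations, so GFLOPS = 2*M*N*K / time / 1e9. A rough CuPy timing sketch (real benchmarks warm up and synchronize the device, as below):

```python
import time
import cupy as cp

M = N = K = 4096
a = cp.random.rand(M, K, dtype=cp.float32)
b = cp.random.rand(K, N, dtype=cp.float32)

cp.matmul(a, b)                      # warm-up: exclude compilation and allocation
cp.cuda.Device(0).synchronize()

t0 = time.perf_counter()
c = cp.matmul(a, b)
cp.cuda.Device(0).synchronize()      # kernel launches are asynchronous; wait before timing
elapsed = time.perf_counter() - t0

gflops = 2 * M * N * K / elapsed / 1e9   # 2*M*N*K FLOPs per dense matmul
print(f"{gflops:.1f} GFLOPS")
```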
Part 2

Optimization Deep Dive

Chapter 4
01
Profiling with Nsight

Learn to use Nsight Compute for kernel analysis

Open in Colab
02
Memory Coalescing

Optimize global memory access patterns

Open in Colab
03
Bank Conflicts

Eliminate shared memory bottlenecks

Open in Colab
04
Software Pipelining

Overlap compute and memory operations

Open in Colab
05
TMA (Hopper+)

Hardware-accelerated asynchronous data movement with Hopper's Tensor Memory Accelerator

Open in Colab
06
Tensor Cores

Use matrix multiply-accumulate (MMA) operations for matrix math

Open in Colab
07
Optimized GEMM

Put it all together for peak performance

Open in Colab
Part 3

Attention Mechanisms

Chapter 5
01
Dot Product Attention

Implement the basic QK^T attention score computation

Open in Colab
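Written out in NumPy, the computation notebook 01 implements looks like the sketch below (single head, naive softmax; the later notebooks deal with its numerical problems).

```python
import numpy as np

def basic_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V with a naive softmax."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                  # (n_q, n_k) dot-product scores
    weights = np.exp(scores)                       # naive softmax; notebooks 02-03 fix its overflow
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                             # weighted sum of value rows

Q, K, V = np.random.rand(8, 64), np.random.rand(16, 64), np.random.rand(16, 64)
print(basic_attention(Q, K, V).shape)              # (8, 64)
```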
02
The Softmax Problem

Why naive softmax fails at scale

Open in Colab
03
Stable Softmax

Numerical stability with max subtraction

Open in Colab
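The fix that notebooks 02 and 03 build up to is a one-line change: subtracting the row maximum before exponentiating leaves the result mathematically unchanged but keeps exp() in range. A NumPy sketch:

```python
import numpy as np

x = np.array([1000.0, 1001.0, 1002.0])

# Naive softmax overflows: np.exp(1000.0) is inf, so the result is all nan.
naive = np.exp(x) / np.exp(x).sum()

# Subtracting max(x) shifts the largest exponent to 0 without changing the ratios.
z = x - x.max()
stable = np.exp(z) / np.exp(z).sum()   # [0.090, 0.245, 0.665]
```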
04
Full Attention

Complete attention implementation

Open in Colab
05
Online Softmax

Single-pass softmax algorithm

Open in Colab
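The single-pass idea in plain Python: keep a running maximum m and a running denominator l, rescaling l whenever the maximum grows. An illustrative sketch; the notebook does the same thing block-wise on the GPU.

```python
import numpy as np

def online_softmax(x):
    """Compute the softmax statistics (max and denominator) in a single pass over x."""
    m = -np.inf     # running maximum
    l = 0.0         # running sum of exp(x_i - m)
    for xi in x:
        m_new = max(m, xi)
        l = l * np.exp(m - m_new) + np.exp(xi - m_new)   # rescale the old sum to the new max
        m = m_new
    return np.exp(x - m) / l

x = np.random.rand(10) * 100
assert np.allclose(online_softmax(x), np.exp(x - x.max()) / np.exp(x - x.max()).sum())
```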
06
Tiled Attention

Block-wise computation for memory efficiency

Open in Colab
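The block-wise pattern notebooks 06 and 07 rely on, sketched in NumPy: process K and V in tiles and carry the online-softmax statistics (running max, denominator, and unnormalized output) across tiles, so the full n x n score matrix is never materialized. Single head, no masking assumed.

```python
import numpy as np

def tiled_attention(Q, K, V, block=64):
    """Attention over K/V tiles with online-softmax rescaling (single head, no mask)."""
    d = Q.shape[-1]
    out = np.zeros_like(Q)
    m = np.full(Q.shape[0], -np.inf)    # running row max of the scores
    l = np.zeros(Q.shape[0])            # running softmax denominator
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        s = Q @ Kb.T / np.sqrt(d)                    # scores for this tile only
        m_new = np.maximum(m, s.max(axis=1))
        scale = np.exp(m - m_new)                    # rescale old statistics to the new max
        p = np.exp(s - m_new[:, None])
        l = l * scale + p.sum(axis=1)
        out = out * scale[:, None] + p @ Vb
        m = m_new
    return out / l[:, None]

Q, K, V = np.random.rand(32, 64), np.random.rand(256, 64), np.random.rand(256, 64)
s = Q @ K.T / np.sqrt(64)
w = np.exp(s - s.max(axis=1, keepdims=True)); w /= w.sum(axis=1, keepdims=True)
assert np.allclose(tiled_attention(Q, K, V), w @ V)   # matches the un-tiled reference
```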
07
FlashAttention

Production-grade fused attention kernel

Open in Colab
Part 4

Production & Quantization

Chapters 6-7
01
FP8 Conversion

Convert between floating-point formats

Open in Colab
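As a rough sketch of the round-trip measurement this notebook is concerned with, assuming a recent PyTorch build that exposes the torch.float8_e4m3fn dtype (the cast itself works on CPU as well as GPU in current releases):

```python
import torch

x = torch.randn(1024) * 4

# Round trip through 8-bit floating point (E4M3: 4 exponent bits, 3 mantissa bits).
x_fp8 = x.to(torch.float8_e4m3fn)    # quantize to FP8
x_back = x_fp8.to(torch.float32)     # cast back up for comparison

rel_err = ((x - x_back).abs() / x.abs().clamp_min(1e-6)).mean()
print(f"mean relative error after the FP8 round trip: {rel_err.item():.3%}")
```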
02
Quantization Fundamentals

Symmetric and asymmetric quantization

Open in Colab
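The two schemes from notebook 02 in NumPy form: symmetric quantization uses a scale only (zero maps to zero), while asymmetric quantization adds a zero point so the full [min, max] range is covered. An INT8/UINT8 sketch:

```python
import numpy as np

x = np.random.randn(1024).astype(np.float32) * 3

# Symmetric INT8: one scale, real zero maps to integer zero.
scale_s = np.abs(x).max() / 127
q_sym = np.clip(np.round(x / scale_s), -127, 127).astype(np.int8)
x_sym = q_sym.astype(np.float32) * scale_s                        # dequantize

# Asymmetric UINT8: scale plus zero point, covers the full [min, max] range.
scale_a = (x.max() - x.min()) / 255
zero_point = int(np.round(-x.min() / scale_a))
q_asym = np.clip(np.round(x / scale_a) + zero_point, 0, 255).astype(np.uint8)
x_asym = (q_asym.astype(np.float32) - zero_point) * scale_a       # dequantize

print("max symmetric error: ", np.abs(x - x_sym).max())
print("max asymmetric error:", np.abs(x - x_asym).max())
```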
03
INT8 and INT4

Integer quantization for inference

Open in Colab
04
NVFP4

NVIDIA's 4-bit floating-point format

Open in Colab
05
KV Cache Strategy

Efficient key-value cache management

Open in Colab
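A toy version of the idea behind notebook 05: preallocate the key/value buffers once and write each new token's K and V in place rather than concatenating tensors every step. Class name and shapes here are illustrative, not the notebook's API.

```python
import numpy as np

class KVCache:
    """Preallocated per-layer cache; illustrative layout (max_len, n_heads, head_dim)."""
    def __init__(self, max_len, n_heads, head_dim):
        self.k = np.zeros((max_len, n_heads, head_dim), dtype=np.float16)
        self.v = np.zeros((max_len, n_heads, head_dim), dtype=np.float16)
        self.length = 0

    def append(self, k_new, v_new):
        # In-place write: no reallocation or concatenation as the sequence grows.
        self.k[self.length] = k_new
        self.v[self.length] = v_new
        self.length += 1

    def view(self):
        # Only the filled prefix participates in attention.
        return self.k[:self.length], self.v[:self.length]

cache = KVCache(max_len=2048, n_heads=8, head_dim=64)
cache.append(np.random.rand(8, 64), np.random.rand(8, 64))
k, v = cache.view()   # each has shape (1, 8, 64)
```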
06
Fused Quantized Attention

Combine quantization with FlashAttention

Open in Colab
07
Production Integration

Deploy optimized kernels in serving systems

Open in Colab