First Kernels
Write your first GPU kernels with Triton. From Python to 500,000x faster code: understanding index arithmetic, memory access patterns, and why tiling matters.
- Write a basic Triton kernel with proper grid/block configuration
- Use tl.load and tl.store with appropriate masking
- Implement element-wise operations (add, multiply, activation functions)
- Debug common kernel errors (out-of-bounds, wrong output)
- Profile kernel performance using basic metrics
This chapter builds on the GPU architecture and memory concepts from Chapter 1 and Chapter 2.
Part 1: NumPy to Triton - Environment setup through fast matmul
Your Code is Slow
Let's establish a baseline. Here's matrix multiplication in Python with NumPy:
import numpy as np
import time
# Two 1024x1024 matrices
A = np.random.randn(1024, 1024).astype(np.float32)
B = np.random.randn(1024, 1024).astype(np.float32)
# Time it
start = time.perf_counter()
C = A @ B
elapsed = time.perf_counter() - start
# Calculate GFLOPS (2 * N^3 operations for matmul)
flops = 2 * 1024**3
gflops = flops / elapsed / 1e9
print(f"NumPy: {gflops:.1f} GFLOPS") # ~50-100 GFLOPS on modern CPU
NumPy uses optimized BLAS libraries, so it's not terrible—maybe 50-100 GFLOPS on a good CPU. But your GPU can do 100+ TFLOPS. That's 1000x more potential, sitting unused.
GFLOPS = billion floating-point operations per second. Matrix multiplication of two N×N matrices requires 2N³ operations (multiply + add for each element). For 1024×1024: 2 × 1024³ ≈ 2.1 billion operations.
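For context on the "500,000x" headline: that figure is measured against naive Python, which presumably means a pure-Python triple loop rather than NumPy. A sketch of such a baseline (hypothetical, and only practical at small sizes; at 1024x1024 it runs for minutes to hours):

def matmul_python(A, B):
    # Pure-Python triple loop: roughly single-digit MFLOPS, orders of magnitude below NumPy
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(n):
                s += A[i][k] * B[k][j]
            C[i][j] = s
    return C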
CuPy: One Import Changes Everything
import cupy as cp # Drop-in NumPy replacement for GPU
# Same code, different import
A_gpu = cp.asarray(A) # Copy to GPU
B_gpu = cp.asarray(B)
start = time.perf_counter()
C_gpu = A_gpu @ B_gpu
cp.cuda.Stream.null.synchronize() # Wait for GPU
elapsed = time.perf_counter() - start
gflops = 2 * 1024**3 / elapsed / 1e9
print(f"CuPy: {gflops:.1f} GFLOPS") # ~5000-10000 GFLOPS!
50-100x speedup with zero algorithm changes. CuPy calls cuBLAS under the hood, which is NVIDIA's heavily optimized matrix library.
But why is it so fast? To write our own fast kernels, we need to understand how GPUs execute code—which you learned in Chapters 1 and 2.
Writing GPU Code with Triton
Let's write a simple kernel in Triton, a Python-like language for GPU programming:
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    # Which block am I?
    pid = tl.program_id(0)
    # Calculate which elements this block handles
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    # Mask for bounds checking
    mask = offsets < n
    # Load, compute, store
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

# Input and output tensors live on the GPU
n = 1024
x = torch.randn(n, device="cuda")
y = torch.randn(n, device="cuda")
out = torch.empty(n, device="cuda")

# Launch kernel: one block per 256 elements (cdiv rounds up so no element is missed)
grid = (triton.cdiv(n, 256),)
add_kernel[grid](x, y, out, n, BLOCK=256)
Key concepts:
- program_id: Which block am I? (like blockIdx in CUDA)
- BLOCK: How many elements each block processes
- offsets: Linear indices into the array
- mask: Bounds checking (don't access beyond array end)
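Before moving on, it is worth sanity-checking the result against PyTorch (a two-line check reusing the tensors from the launch above):

torch.cuda.synchronize()           # make sure the kernel has finished
assert torch.allclose(out, x + y)  # compare against PyTorch's own element-wise add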
For 2D arrays stored in row-major order: linear_idx = row * num_cols + col
This formula is the foundation of all GPU memory access patterns.
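To see the formula in action, here is a small, self-contained example (hypothetical, not one of the chapter's labs): a Triton kernel that copies a row-major 2D matrix tile by tile, computing each element's linear index as row * num_cols + col.

import torch
import triton
import triton.language as tl

@triton.jit
def copy2d_kernel(in_ptr, out_ptr, num_rows, num_cols,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):
    # Each program copies one BLOCK_M x BLOCK_N tile.
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    rows = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    cols = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    # Row-major index arithmetic: linear_idx = row * num_cols + col
    offsets = rows[:, None] * num_cols + cols[None, :]
    mask = (rows[:, None] < num_rows) & (cols[None, :] < num_cols)
    tile = tl.load(in_ptr + offsets, mask=mask, other=0.0)
    tl.store(out_ptr + offsets, tile, mask=mask)

a = torch.randn(300, 500, device="cuda")
b = torch.empty_like(a)
grid = (triton.cdiv(300, 32), triton.cdiv(500, 32))
copy2d_kernel[grid](a, b, 300, 500, BLOCK_M=32, BLOCK_N=32)
assert torch.equal(a, b)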
Why do we need masks in tl.load? Because n is rarely an exact multiple of BLOCK, so the last block's offsets run past the end of the array, and reading or writing those addresses is undefined behavior: silently wrong results at best, a crash at worst. That is why we compute
mask = offsets < n
and pass it to both tl.load and tl.store, so out-of-range lanes are skipped.
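A quick way to see what the mask does is to reproduce the index arithmetic on the CPU, with NumPy standing in for tl.arange (an illustrative sketch):

import numpy as np

n, BLOCK = 1000, 256
pid = 3                                   # the last of ceil(1000 / 256) = 4 programs
offsets = pid * BLOCK + np.arange(BLOCK)  # 768, 769, ..., 1023
mask = offsets < n                        # True for 768..999, False for 1000..1023
print(mask.sum())                         # 232 valid elements in the final block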
Why Naive Kernels Are Slow
The add kernel works, and a matrix multiplication written in the same naive style works too, but the naive matmul runs 10-50x slower than cuBLAS. Why? Memory bandwidth is the bottleneck, not compute.
Reading from global memory costs on the order of 400 cycles (latencies are approximate and vary with the access pattern; see Chapter 2 for details). If your kernel goes back to HBM for every operation, most of its time is spent waiting, not computing.
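A back-of-envelope estimate shows how tight that bottleneck is for an element-wise kernel like add. The bandwidth figure below is an assumption (roughly HBM-class hardware); substitute the number from your GPU's datasheet:

# Vector add moves 12 bytes per useful FLOP: load x (4 B), load y (4 B), store out (4 B)
bytes_per_flop = 12
hbm_bandwidth = 3e12                              # assumed ~3 TB/s; check your GPU's specs
max_gflops = hbm_bandwidth / bytes_per_flop / 1e9
print(f"Bandwidth-limited ceiling: {max_gflops:.0f} GFLOPS")  # ~250 GFLOPS
# The compute peak is tens of TFLOPS, so no amount of clever arithmetic
# makes this kernel faster -- only moving fewer bytes does.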
Adjacent threads should access adjacent memory addresses for maximum bandwidth.
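In Triton, coalescing falls out of how you build the offsets. The following micro-benchmark sketch (hypothetical, not one of the labs) copies the same number of elements either from consecutive addresses or from addresses 32 floats apart:

import torch
import triton
import triton.language as tl

@triton.jit
def strided_copy_kernel(in_ptr, out_ptr, n, STRIDE: tl.constexpr, BLOCK: tl.constexpr):
    # STRIDE=1: lanes read consecutive addresses (coalesced, few wide transactions).
    # STRIDE=32: each lane lands in a different memory segment (many wasted transactions).
    pid = tl.program_id(0)
    idx = pid * BLOCK + tl.arange(0, BLOCK)
    mask = idx < n
    x = tl.load(in_ptr + idx * STRIDE, mask=mask)
    tl.store(out_ptr + idx, x, mask=mask)

n, BLOCK = 1 << 20, 256
for stride in (1, 32):
    src = torch.randn(n * stride, device="cuda")
    dst = torch.empty(n, device="cuda")
    strided_copy_kernel[(triton.cdiv(n, BLOCK),)](src, dst, n, STRIDE=stride, BLOCK=BLOCK)
    torch.cuda.synchronize()

Timing each launch (for example with triton.testing.do_bench) shows the strided version reaching only a fraction of the coalesced bandwidth, even though both execute the same instructions.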
Tiling: Data Reuse is Everything
The key insight: load data once from slow HBM, reuse many times from fast shared memory.
# Tiled matrix multiplication, conceptually (pseudocode, not a runnable kernel):
# Instead of: for each output element, re-read an entire row/column from HBM,
# do this: load small tiles into fast on-chip memory and compute partial results.
for tile in range(num_tiles):
    # 1. Load one tile of A and one tile of B (one HBM read each)
    A_tile = tl.load(A_ptr + a_tile_offsets)
    B_tile = tl.load(B_ptr + b_tile_offsets)
    # 2. Accumulate partial results from the on-chip tiles (FAST!)
    for k in range(TILE_SIZE):
        C_acc += A_tile[:, k] * B_tile[k, :]  # every loaded value is reused many times
    # 3. Synchronize before loading the next tile -- Triton inserts this barrier
    #    automatically (in hand-written CUDA this is where __syncthreads() would go).
Data reuse is everything. A TILE_SIZE×TILE_SIZE tile is loaded once but used TILE_SIZE times for computation. This transforms your kernel from memory-bound to compute-bound.
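Here is the same idea as a real Triton kernel. This is a minimal sketch (illustrative names and block sizes, float32 inputs, no autotuning), not the chapter's reference solution, but it runs and already shows the reuse pattern: every loaded tile element feeds BLOCK_M or BLOCK_N multiply-adds inside tl.dot.

import torch
import triton
import triton.language as tl

@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                  stride_am, stride_ak, stride_bk, stride_bn, stride_cm, stride_cn,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    # Each program computes one BLOCK_M x BLOCK_N tile of C.
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)

    # Pointers to the first BLOCK_K-wide slab of A and B for this tile
    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn

    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        # Each loaded value is reused across an entire tile dimension by tl.dot
        a = tl.load(a_ptrs, mask=(offs_m[:, None] < M) & (offs_k[None, :] + k < K), other=0.0)
        b = tl.load(b_ptrs, mask=(offs_k[:, None] + k < K) & (offs_n[None, :] < N), other=0.0)
        acc += tl.dot(a, b)
        a_ptrs += BLOCK_K * stride_ak  # slide the slab along the K dimension
        b_ptrs += BLOCK_K * stride_bk

    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, acc, mask=(offs_m[:, None] < M) & (offs_n[None, :] < N))

a = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, 1024, device="cuda")
c = torch.empty(1024, 1024, device="cuda")
grid = (triton.cdiv(1024, 64), triton.cdiv(1024, 64))
matmul_kernel[grid](a, b, c, 1024, 1024, 1024,
                    a.stride(0), a.stride(1), b.stride(0), b.stride(1), c.stride(0), c.stride(1),
                    BLOCK_M=64, BLOCK_N=64, BLOCK_K=32)
torch.cuda.synchronize()
# Loose tolerance: tl.dot may use TF32 internally on recent GPUs
assert torch.allclose(c, a @ b, rtol=1e-2, atol=1e-1)

Time it the same way as the CuPy example (synchronize before reading the clock) to see how close it gets to the milestone below.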
Arithmetic Intensity = FLOPS / Bytes
Naive matmul: ~1 FLOP/byte (memory-bound)
Tiled matmul: ~100 FLOPS/byte (compute-bound)
Higher intensity = less time waiting for memory.
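As a rough sanity check on those numbers (an estimate, not a measurement: it assumes float32 and, for the tiled case, that each matrix element crosses the HBM bus about once; caches and real tile sizes land real kernels somewhere in between):

N = 1024
flops = 2 * N**3                    # one multiply + one add per (i, j, k)
naive_bytes = (2 * N * 4) * N * N   # each output element re-reads a row of A and a column of B
tiled_bytes = 3 * N * N * 4         # A and B read ~once, C written once (ideal reuse)
print(flops / naive_bytes)          # ~0.25 FLOP/byte: hopelessly memory-bound
print(flops / tiled_bytes)          # ~170 FLOP/byte ideal; real kernels land lower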
Chapter 3 Milestone
By implementing tiled matmul, you can achieve 500+ GFLOPS, roughly a 500,000x improvement over a naive pure-Python loop. More importantly, you understand WHY: coalesced access, data reuse in shared memory, and hiding memory latency.
How do you choose the right BLOCK_SIZE? There is no universal answer. It must be a power of two (tl.arange requires it), tl.dot needs tile dimensions of at least 16, and bigger blocks buy more data reuse at the cost of more registers and shared memory per program, which can hurt occupancy. In practice you benchmark a few candidates, which Triton automates with @triton.autotune, as sketched below.
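A minimal autotuning sketch for the add kernel from earlier (the configs are plausible starting points, not recommendations):

import triton
import triton.language as tl

@triton.autotune(
    configs=[
        triton.Config({"BLOCK": 128}, num_warps=4),
        triton.Config({"BLOCK": 256}, num_warps=4),
        triton.Config({"BLOCK": 1024}, num_warps=8),
    ],
    key=["n"],  # re-benchmark whenever n changes
)
@triton.jit
def add_kernel_tuned(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

# BLOCK now comes from the winning config, so the grid reads it from `meta`
grid = lambda meta: (triton.cdiv(n, meta["BLOCK"]),)
add_kernel_tuned[grid](x, y, out, n)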
Build It Yourself
The best way to learn is hands-on. Work through these notebooks in order:
Part 1 Labs
- Environment Check - Verify GPU setup
- NumPy Baseline - Establish CPU performance
- CuPy Introduction - Instant GPU speedup
- GPU Architecture - Understand the hardware
- First Triton Kernel - Write your own GPU code
- Memory Hierarchy - Profile memory access
- Tiling Basics - Implement tiled access
- Fast Matrix Multiplication - Achieve 500+ GFLOPS
References
- Triton Language - Python-like GPU programming
- CuPy Documentation - NumPy-compatible GPU arrays
- CUDA Best Practices: Memory Optimizations - Coalescing and bandwidth
- NVIDIA H100 Datasheet - Memory bandwidth specs