Chapter 3

First Kernels

Write your first GPU kernels with Triton. From pure Python to code up to 500,000x faster: understanding index arithmetic, memory access patterns, and why tiling matters.

Building on Chapters 1-2
Chapters 1-2 gave you the mental model: thousands of threads executing in warps, memory access patterns determining performance. Now you'll write actual code. Every line maps directly to those concepts—thread indices become array offsets, block sizes affect occupancy, access patterns determine coalescing.
What You'll Learn
  1. Write a basic Triton kernel with proper grid/block configuration
  2. Use tl.load and tl.store with appropriate masking
  3. Implement element-wise operations (add, multiply, activation functions)
  4. Debug common kernel errors (out-of-bounds, wrong output)
  5. Profile kernel performance using basic metrics
📚 Prerequisites

This chapter builds on GPU architecture and memory concepts. Chapter 1 | Chapter 2 | Index Arithmetic

01 — THE CHALLENGE

Your Code is Slow

Let's establish a baseline. Here's matrix multiplication in Python with NumPy:

import numpy as np
import time

# Two 1024x1024 matrices
A = np.random.randn(1024, 1024).astype(np.float32)
B = np.random.randn(1024, 1024).astype(np.float32)

# Time it
start = time.perf_counter()
C = A @ B
elapsed = time.perf_counter() - start

# Calculate GFLOPS (2 * N^3 operations for matmul)
flops = 2 * 1024**3
gflops = flops / elapsed / 1e9
print(f"NumPy: {gflops:.1f} GFLOPS")   # ~50-100 GFLOPS on modern CPU

NumPy uses optimized BLAS libraries, so it's not terrible—maybe 50-100 GFLOPS on a good CPU. But your GPU can do 100+ TFLOPS. That's 1000x more potential, sitting unused.

What's a GFLOP?

GFLOPS = billion floating-point operations per second. Matrix multiplication of two N×N matrices requires 2N³ operations (multiply + add for each element). For 1024×1024: 2 × 1024³ ≈ 2.1 billion operations.
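For scale, the "500,000x over naive Python" figure later in this chapter compares against a pure-Python triple loop rather than NumPy. A rough sketch of that baseline, shrunk to N = 128 so it finishes quickly (at 1024 it would run for several minutes):

import time

N = 128
A_py = [[1.0] * N for _ in range(N)]
B_py = [[1.0] * N for _ in range(N)]
C_py = [[0.0] * N for _ in range(N)]

start = time.perf_counter()
for i in range(N):
    for j in range(N):
        s = 0.0
        for k in range(N):
            s += A_py[i][k] * B_py[k][j]
        C_py[i][j] = s
elapsed = time.perf_counter() - start

print(f"Pure Python: {2 * N**3 / elapsed / 1e9:.4f} GFLOPS")   # typically well under 0.01 GFLOPS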

02 — THE INSTANT FIX

CuPy: One Import Changes Everything

import cupy as cp   # Drop-in NumPy replacement for GPU

# Same code, different import
A_gpu = cp.asarray(A)   # Copy to GPU
B_gpu = cp.asarray(B)

# Warm up once so the timing excludes one-time initialization
_ = A_gpu @ B_gpu
cp.cuda.Stream.null.synchronize()

start = time.perf_counter()
C_gpu = A_gpu @ B_gpu
cp.cuda.Stream.null.synchronize()   # Wait for the GPU to finish
elapsed = time.perf_counter() - start

gflops = 2 * 1024**3 / elapsed / 1e9
print(f"CuPy: {gflops:.1f} GFLOPS")   # ~5000-10000 GFLOPS!

50-100x speedup with zero algorithm changes. CuPy calls cuBLAS under the hood, which is NVIDIA's heavily optimized matrix library.

But why is it so fast? To write our own fast kernels, we need to understand how GPUs execute code—which you learned in Chapters 1 and 2.


03 — YOUR FIRST KERNEL

Writing GPU Code with Triton

Let's write a simple kernel in Triton, a Python-like language for GPU programming:

import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    # Which block (program) am I?
    pid = tl.program_id(0)
    # Calculate which elements this block handles
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    # Mask for bounds checking
    mask = offsets < n
    # Load, compute, store
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

# Launch the kernel on two vectors
n = 1024
x = torch.randn(n, device="cuda")
y = torch.randn(n, device="cuda")
out = torch.empty(n, device="cuda")
grid = (triton.cdiv(n, 256),)   # Number of blocks (ceiling division handles any n)
add_kernel[grid](x, y, out, n, BLOCK=256)

Key concepts:

Index Arithmetic

For 2D arrays stored in row-major order: linear_idx = row * num_cols + col
This formula is the foundation of all GPU memory access patterns.
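A quick way to convince yourself of the formula with NumPy (the shape and indices here are just for illustration):

import numpy as np

M = np.arange(12, dtype=np.float32).reshape(3, 4)   # 3 rows x 4 columns, row-major by default
row, col = 1, 2
linear_idx = row * M.shape[1] + col                  # row * num_cols + col = 1 * 4 + 2 = 6
assert M.flat[linear_idx] == M[row, col]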

In a 4×8 row-major matrix, what is the linear index of element [2, 5]?
13 (5 * 2 + 3)
21 (2 * 8 + 5)
25 (5 * 4 + 5)
10 (2 * 4 + 2)
What does the @triton.jit decorator do?
Runs the function on CPU
Compiles the function for GPU execution
Optimizes Python bytecode
Why do we need masks in tl.load?
Masks prevent out-of-bounds memory access. When your array size isn't a multiple of BLOCK_SIZE, the last block would read beyond the array. Without masking, this causes undefined behavior (usually crashes). Always use mask = offsets < n and pass it to both tl.load and tl.store.
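For example, with add_kernel from the listing above and a length that is not a multiple of BLOCK (a small usage sketch reusing those imports):

n = 1000                                   # not a multiple of 256
x = torch.randn(n, device="cuda")
y = torch.randn(n, device="cuda")
out = torch.empty(n, device="cuda")

grid = (triton.cdiv(n, 256),)              # 4 blocks; the last one has only 232 valid elements
add_kernel[grid](x, y, out, n, BLOCK=256)
torch.testing.assert_close(out, x + y)     # the mask keeps the last block in bounds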

04 — THE BOTTLENECK

Why Naive Kernels Are Slow

Your first kernel works, but a matmul written in the same naive style runs 10-50x slower than cuBLAS. Why? Memory bandwidth is the bottleneck, not compute.

Memory Hierarchy Latency (H100 estimates)

  Registers       ~1 cycle      ~20 TB/s    Fastest
  Shared Memory   ~20 cycles    ~12 TB/s    Fast
  L2 Cache        ~200 cycles   ~5 TB/s     Medium
  HBM (Global)    ~400 cycles   ~3 TB/s     Slow

Latencies are approximate and vary by access pattern. See Chapter 2 for details.

400 cycles to read from global memory! If your kernel goes back to HBM for every operation, most of its time is spent waiting, not computing.

Memory Coalescing

Adjacent threads should access adjacent memory addresses for maximum bandwidth. When the 32 threads of a warp read 32 consecutive floats, the hardware merges those reads into a few wide memory transactions; strided or random patterns force separate transactions and waste most of the available bandwidth.
Which access pattern achieves maximum bandwidth?
Thread i accesses address i (coalesced)
Thread i accesses address i*32 (strided)
Thread i accesses a random address
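To put numbers on this, one option is to copy the same count of elements with stride 1 versus stride 32 and compare the times. This is a sketch, not one of the chapter's labs; strided_copy_kernel and the sizes are illustrative:

import torch
import triton
import triton.language as tl
from triton.testing import do_bench

@triton.jit
def strided_copy_kernel(src_ptr, dst_ptr, n, stride, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    idx = pid * BLOCK + tl.arange(0, BLOCK)
    mask = idx < n
    # stride=1: neighboring threads read neighboring floats (coalesced)
    # stride=32: neighboring reads are 128 bytes apart (roughly one transaction each)
    val = tl.load(src_ptr + idx * stride, mask=mask)
    tl.store(dst_ptr + idx, val, mask=mask)

n, max_stride = 1 << 20, 32
src = torch.randn(n * max_stride, device="cuda")
dst = torch.empty(n, device="cuda")
grid = (triton.cdiv(n, 256),)

for stride in (1, max_stride):
    ms = do_bench(lambda: strided_copy_kernel[grid](src, dst, n, stride, BLOCK=256))
    print(f"stride={stride:2d}: {ms:.4f} ms")   # expect the strided read to be several times slower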
Your kernel outputs all zeros. Most likely cause?
GPU ran out of memory
Wrong pointer/offset calculation
Block size too large
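For reference, the classic version of that "all zeros" bug is an offset that ignores the program id, so every block writes the same first BLOCK elements and the rest of the output is never touched. The kernel below is a hypothetical broken variant of add_kernel, shown only to illustrate the symptom:

import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel_buggy(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(0)                 # computed...
    offsets = tl.arange(0, BLOCK)          # ...but never used: should be pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

n = 1024
x = torch.randn(n, device="cuda")
y = torch.randn(n, device="cuda")
out = torch.zeros(n, device="cuda")
add_kernel_buggy[(triton.cdiv(n, 256),)](x, y, out, n, BLOCK=256)
print(torch.count_nonzero(out[256:]).item())   # 0 -- everything past the first block was never written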

05 — THE SOLUTION

Tiling: Data Reuse is Everything

The key insight: load data once from slow HBM, reuse many times from fast shared memory.

# Tiled matrix multiplication concept (pseudocode, not a runnable kernel):
# Instead of: for each output element, read an entire row/column from HBM
# Do this: load tiles into fast on-chip memory and reuse them

for tile in range(num_tiles):
    # 1. Load one tile of A and one tile of B (one HBM read each)
    A_tile = tl.load(A_ptr + tile_offsets)
    B_tile = tl.load(B_ptr + tile_offsets)

    # 2. Compute partial results from the on-chip tiles (FAST!)
    for k in range(TILE_SIZE):
        C_acc += A_tile[:, k] * B_tile[k, :]   # each loaded value reused many times
    # In real Triton code this inner loop is a single tl.dot(A_tile, B_tile)

    # 3. Synchronize before loading the next tile
    #    (explicit in CUDA; Triton inserts barriers automatically)
    tl.debug_barrier()

Data reuse is everything. A TILE_SIZE×TILE_SIZE tile is loaded once but used TILE_SIZE times for computation. This transforms your kernel from memory-bound to compute-bound.
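One way to turn that concept into a complete kernel is sketched below. It follows the general structure of the official Triton matmul tutorial; the matmul_kernel and matmul names and the 64x64x32 block sizes are illustrative choices, not part of the labs, and tl.dot lets the compiler stage the tiles through shared memory for you.

import torch
import triton
import triton.language as tl

@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                  stride_am, stride_ak, stride_bk, stride_bn, stride_cm, stride_cn,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    # Each program computes one BLOCK_M x BLOCK_N tile of C.
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)

    # Pointers to the first K-tile of A (BLOCK_M x BLOCK_K) and B (BLOCK_K x BLOCK_N)
    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn

    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        # Load one tile of A and one tile of B from HBM (masked at the ragged edges)
        a = tl.load(a_ptrs, mask=(offs_m[:, None] < M) & (offs_k[None, :] < K - k), other=0.0)
        b = tl.load(b_ptrs, mask=(offs_k[:, None] < K - k) & (offs_n[None, :] < N), other=0.0)
        # Each element of a is reused BLOCK_N times, each element of b BLOCK_M times, inside tl.dot
        acc += tl.dot(a, b)
        # Advance both tiles along the K dimension
        a_ptrs += BLOCK_K * stride_ak
        b_ptrs += BLOCK_K * stride_bk

    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, acc, mask=(offs_m[:, None] < M) & (offs_n[None, :] < N))

def matmul(a, b):
    M, K = a.shape
    _, N = b.shape
    c = torch.empty((M, N), device=a.device, dtype=torch.float32)
    BLOCK_M, BLOCK_N, BLOCK_K = 64, 64, 32          # illustrative; tune per GPU
    grid = (triton.cdiv(M, BLOCK_M), triton.cdiv(N, BLOCK_N))
    matmul_kernel[grid](a, b, c, M, N, K,
                        a.stride(0), a.stride(1), b.stride(0), b.stride(1),
                        c.stride(0), c.stride(1),
                        BLOCK_M=BLOCK_M, BLOCK_N=BLOCK_N, BLOCK_K=BLOCK_K)
    return c

The masks play the same role as in the add kernel: the last tile along each dimension is padded with zeros instead of reading out of bounds.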

Arithmetic Intensity

Arithmetic Intensity = FLOPs / bytes moved
Naive matmul: ~1 FLOP/byte (memory-bound)
Tiled matmul: ~100 FLOPs/byte (compute-bound)
Higher intensity = less time waiting for memory.
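Plugging the chapter's approximate numbers into this definition gives a quick upper bound. This is a back-of-envelope estimate, not a measurement, and real naive kernels usually land well below their bandwidth ceiling because of uncoalesced access and latency:

hbm_bandwidth = 3e12                      # ~3 TB/s HBM, from the latency table above

for name, intensity in [("naive", 1), ("tiled", 100)]:   # FLOPs per byte moved
    ceiling_gflops = hbm_bandwidth * intensity / 1e9
    print(f"{name:>5}: at most ~{ceiling_gflops:,.0f} GFLOPS from memory bandwidth alone")
# naive: ~3,000 GFLOPS ceiling -- memory-bound
# tiled: ~300,000 GFLOPS ceiling -- above the GPU's peak compute, so compute becomes the limit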


Chapter 3 Milestone

By implementing tiled matmul, you can reach 500+ GFLOPS, roughly a 500,000x improvement over a pure-Python triple loop and several times the NumPy baseline. More importantly, you understand WHY: coalesced access, data reuse in shared memory, and hiding memory latency.
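A quick way to verify correctness and measure throughput, assuming the matmul wrapper from the tiling sketch above is in scope (the loose tolerance allows for tl.dot using TF32 on recent GPUs):

import torch
from triton.testing import do_bench

a = torch.randn(1024, 1024, device="cuda", dtype=torch.float32)
b = torch.randn(1024, 1024, device="cuda", dtype=torch.float32)

torch.testing.assert_close(matmul(a, b), a @ b, atol=1e-1, rtol=1e-2)   # correctness before speed

ms = do_bench(lambda: matmul(a, b))            # average kernel time in milliseconds
gflops = 2 * 1024**3 / (ms * 1e-3) / 1e9
print(f"Tiled Triton matmul: {gflops:.0f} GFLOPS")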

How do you choose the right BLOCK_SIZE?
BLOCK_SIZE affects occupancy and memory usage. Common choices: 64, 128, 256, 512. Too small = not enough work per block, poor latency hiding. Too large = not enough blocks to fill the GPU, register pressure. For matmul tiles, 64x64 or 128x128 are typical starting points. Always benchmark—optimal size depends on kernel and GPU.
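A minimal sweep, reusing add_kernel and the imports from earlier, might look like the sketch below; triton.autotune can automate the same search for real kernels.

from triton.testing import do_bench

n = 1 << 20
x = torch.randn(n, device="cuda")
y = torch.randn(n, device="cuda")
out = torch.empty_like(x)

for block in (64, 128, 256, 512, 1024):
    grid = (triton.cdiv(n, block),)
    ms = do_bench(lambda: add_kernel[grid](x, y, out, n, BLOCK=block))
    gb_s = 3 * n * 4 / (ms * 1e-3) / 1e9      # 2 loads + 1 store of 4-byte floats
    print(f"BLOCK={block:4d}: {ms:.4f} ms, {gb_s:.0f} GB/s")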

06 — PRACTICE

Build It Yourself

The best way to learn is hands-on. Work through these notebooks in order:

Part 1 Labs

  1. Environment Check - Verify GPU setup
  2. NumPy Baseline - Establish CPU performance
  3. CuPy Introduction - Instant GPU speedup
  4. GPU Architecture - Understand the hardware
  5. First Triton Kernel - Write your own GPU code
  6. Memory Hierarchy - Profile memory access
  7. Tiling Basics - Implement tiled access
  8. Fast Matrix Multiplication - Achieve 500+ GFLOPS

References