First Kernels
Write your first GPU kernels with Triton. From Python to 500,000x faster code: understanding index arithmetic, memory access patterns, and why tiling matters.
- Write a basic Triton kernel with proper grid/block configuration
- Use tl.load and tl.store with appropriate masking
- Implement element-wise operations (add, multiply, activation functions)
- Debug common kernel errors (out-of-bounds, wrong output)
- Profile kernel performance using basic metrics
This chapter builds on the GPU architecture and memory concepts from Chapter 1 and Chapter 2.
Part 1: NumPy to Triton - Environment setup through fast matmul
Your Code is Slow
Let's establish a baseline. Here's matrix multiplication in Python with NumPy:
import numpy as np
import time
# Two 1024x1024 matrices
A = np.random.randn(1024, 1024).astype(np.float32)
B = np.random.randn(1024, 1024).astype(np.float32)
# Time it
start = time.perf_counter()
C = A @ B
elapsed = time.perf_counter() - start
# Calculate GFLOPS (2 * N^3 operations for matmul)
flops = 2 * 1024**3
gflops = flops / elapsed / 1e9
print(f"NumPy: {gflops:.1f} GFLOPS") # ~50-100 GFLOPS on modern CPU
NumPy uses optimized BLAS libraries, so it's not terrible—maybe 50-100 GFLOPS on a good CPU. But your GPU can do 100+ TFLOPS. That's 1000x more potential, sitting unused.
GFLOPS = billion floating-point operations per second. Matrix multiplication of two N×N matrices requires 2N³ operations (multiply + add for each element). For 1024×1024: 2 × 1024³ ≈ 2.1 billion operations.
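For context on the "500,000x" headline: that figure is measured against naive Python, which presumably means a pure-Python triple loop rather than NumPy. A sketch of such a baseline (hypothetical, and only practical at small sizes; at 1024x1024 it runs for minutes to hours):

def matmul_python(A, B):
    # Pure-Python triple loop: roughly single-digit MFLOPS, orders of magnitude below NumPy
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            s = 0.0
            for k in range(n):
                s += A[i][k] * B[k][j]
            C[i][j] = s
    return C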
CuPy: One Import Changes Everything
import cupy as cp # Drop-in NumPy replacement for GPU
# Same code, different import
A_gpu = cp.asarray(A) # Copy to GPU
B_gpu = cp.asarray(B)
start = time.perf_counter()
C_gpu = A_gpu @ B_gpu
cp.cuda.Stream.null.synchronize() # Wait for GPU
elapsed = time.perf_counter() - start
gflops = 2 * 1024**3 / elapsed / 1e9
print(f"CuPy: {gflops:.1f} GFLOPS") # ~5000-10000 GFLOPS!
50-100x speedup with zero algorithm changes. CuPy calls cuBLAS under the hood, which is NVIDIA's heavily optimized matrix library.
But why is it so fast? To write our own fast kernels, we need to understand how GPUs execute code—which you learned in Chapters 1 and 2.
Writing GPU Code with Triton
Let's write a simple kernel in Triton, a Python-like language for GPU programming:
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    # Which block am I?
    pid = tl.program_id(0)
    # Calculate which elements this block handles
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    # Mask for bounds checking
    mask = offsets < n
    # Load, compute, store
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

# Input and output tensors live on the GPU
n = 1024
x = torch.randn(n, device="cuda")
y = torch.randn(n, device="cuda")
out = torch.empty(n, device="cuda")

# Launch kernel: one block per 256 elements (cdiv rounds up so no element is missed)
grid = (triton.cdiv(n, 256),)
add_kernel[grid](x, y, out, n, BLOCK=256)
Key concepts:
- program_id: Which block am I? (like blockIdx in CUDA)
- BLOCK: How many elements each block processes
- offsets: Linear indices into the array
- mask: Bounds checking (don't access beyond array end)
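Before moving on, it is worth sanity-checking the result against PyTorch (a two-line check reusing the tensors from the launch above):

torch.cuda.synchronize()           # make sure the kernel has finished
assert torch.allclose(out, x + y)  # compare against PyTorch's own element-wise add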
For 2D arrays stored in row-major order: linear_idx = row * num_cols + col
This formula is the foundation of all GPU memory access patterns.
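To see the formula in action, here is a small, self-contained example (hypothetical, not one of the chapter's labs): a Triton kernel that copies a row-major 2D matrix tile by tile, computing each element's linear index as row * num_cols + col.

import torch
import triton
import triton.language as tl

@triton.jit
def copy2d_kernel(in_ptr, out_ptr, num_rows, num_cols,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):
    # Each program copies one BLOCK_M x BLOCK_N tile.
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    rows = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    cols = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    # Row-major index arithmetic: linear_idx = row * num_cols + col
    offsets = rows[:, None] * num_cols + cols[None, :]
    mask = (rows[:, None] < num_rows) & (cols[None, :] < num_cols)
    tile = tl.load(in_ptr + offsets, mask=mask, other=0.0)
    tl.store(out_ptr + offsets, tile, mask=mask)

a = torch.randn(300, 500, device="cuda")
b = torch.empty_like(a)
grid = (triton.cdiv(300, 32), triton.cdiv(500, 32))
copy2d_kernel[grid](a, b, 300, 500, BLOCK_M=32, BLOCK_N=32)
assert torch.equal(a, b)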
Why do we need masks in tl.load? Because n is rarely an exact multiple of BLOCK, so the last block's offsets run past the end of the array, and reading or writing those addresses is undefined behavior: silently wrong results at best, a crash at worst. That is why we compute
mask = offsets < n
and pass it to both tl.load and tl.store, so out-of-range lanes are skipped.
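A quick way to see what the mask does is to reproduce the index arithmetic on the CPU, with NumPy standing in for tl.arange (an illustrative sketch):

import numpy as np

n, BLOCK = 1000, 256
pid = 3                                   # the last of ceil(1000 / 256) = 4 programs
offsets = pid * BLOCK + np.arange(BLOCK)  # 768, 769, ..., 1023
mask = offsets < n                        # True for 768..999, False for 1000..1023
print(mask.sum())                         # 232 valid elements in the final block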
Why Naive Kernels Are Slow
The add kernel works, and a matrix multiplication written in the same naive style works too, but the naive matmul runs 10-50x slower than cuBLAS. Why? Memory bandwidth is the bottleneck, not compute.
Reading from global memory costs on the order of 400 cycles (latencies are approximate and vary with the access pattern; see Chapter 2 for details). If your kernel goes back to HBM for every operation, most of its time is spent waiting, not computing.
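A back-of-envelope estimate shows how tight that bottleneck is for an element-wise kernel like add. The bandwidth figure below is an assumption (roughly HBM-class hardware); substitute the number from your GPU's datasheet:

# Vector add moves 12 bytes per useful FLOP: load x (4 B), load y (4 B), store out (4 B)
bytes_per_flop = 12
hbm_bandwidth = 3e12                              # assumed ~3 TB/s; check your GPU's specs
max_gflops = hbm_bandwidth / bytes_per_flop / 1e9
print(f"Bandwidth-limited ceiling: {max_gflops:.0f} GFLOPS")  # ~250 GFLOPS
# The compute peak is tens of TFLOPS, so no amount of clever arithmetic
# makes this kernel faster -- only moving fewer bytes does.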
Adjacent threads should access adjacent memory addresses for maximum bandwidth.
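In Triton, coalescing falls out of how you build the offsets. The following micro-benchmark sketch (hypothetical, not one of the labs) copies the same number of elements either from consecutive addresses or from addresses 32 floats apart:

import torch
import triton
import triton.language as tl

@triton.jit
def strided_copy_kernel(in_ptr, out_ptr, n, STRIDE: tl.constexpr, BLOCK: tl.constexpr):
    # STRIDE=1: lanes read consecutive addresses (coalesced, few wide transactions).
    # STRIDE=32: each lane lands in a different memory segment (many wasted transactions).
    pid = tl.program_id(0)
    idx = pid * BLOCK + tl.arange(0, BLOCK)
    mask = idx < n
    x = tl.load(in_ptr + idx * STRIDE, mask=mask)
    tl.store(out_ptr + idx, x, mask=mask)

n, BLOCK = 1 << 20, 256
for stride in (1, 32):
    src = torch.randn(n * stride, device="cuda")
    dst = torch.empty(n, device="cuda")
    strided_copy_kernel[(triton.cdiv(n, BLOCK),)](src, dst, n, STRIDE=stride, BLOCK=BLOCK)
    torch.cuda.synchronize()

Timing each launch (for example with triton.testing.do_bench) shows the strided version reaching only a fraction of the coalesced bandwidth, even though both execute the same instructions.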
Tiling: Data Reuse is Everything
The key insight: load data once from slow HBM, reuse many times from fast shared memory.
# Tiled matrix multiplication, conceptually (pseudocode, not a runnable kernel):
# Instead of: for each output element, re-read an entire row/column from HBM,
# do this: load small tiles into fast on-chip memory and compute partial results.
for tile in range(num_tiles):
    # 1. Load one tile of A and one tile of B (one HBM read each)
    A_tile = tl.load(A_ptr + a_tile_offsets)
    B_tile = tl.load(B_ptr + b_tile_offsets)
    # 2. Accumulate partial results from the on-chip tiles (FAST!)
    for k in range(TILE_SIZE):
        C_acc += A_tile[:, k] * B_tile[k, :]  # every loaded value is reused many times
    # 3. Synchronize before loading the next tile -- Triton inserts this barrier
    #    automatically (in hand-written CUDA this is where __syncthreads() would go).
Data reuse is everything. A TILE_SIZE×TILE_SIZE tile is loaded once but used TILE_SIZE times for computation. This transforms your kernel from memory-bound to compute-bound.
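Here is the same idea as a real Triton kernel. This is a minimal sketch (illustrative names and block sizes, float32 inputs, no autotuning), not the chapter's reference solution, but it runs and already shows the reuse pattern: every loaded tile element feeds BLOCK_M or BLOCK_N multiply-adds inside tl.dot.

import torch
import triton
import triton.language as tl

@triton.jit
def matmul_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                  stride_am, stride_ak, stride_bk, stride_bn, stride_cm, stride_cn,
                  BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    # Each program computes one BLOCK_M x BLOCK_N tile of C.
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)

    # Pointers to the first BLOCK_K-wide slab of A and B for this tile
    a_ptrs = a_ptr + offs_m[:, None] * stride_am + offs_k[None, :] * stride_ak
    b_ptrs = b_ptr + offs_k[:, None] * stride_bk + offs_n[None, :] * stride_bn

    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        # Each loaded value is reused across an entire tile dimension by tl.dot
        a = tl.load(a_ptrs, mask=(offs_m[:, None] < M) & (offs_k[None, :] + k < K), other=0.0)
        b = tl.load(b_ptrs, mask=(offs_k[:, None] + k < K) & (offs_n[None, :] < N), other=0.0)
        acc += tl.dot(a, b)
        a_ptrs += BLOCK_K * stride_ak  # slide the slab along the K dimension
        b_ptrs += BLOCK_K * stride_bk

    c_ptrs = c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn
    tl.store(c_ptrs, acc, mask=(offs_m[:, None] < M) & (offs_n[None, :] < N))

a = torch.randn(1024, 1024, device="cuda")
b = torch.randn(1024, 1024, device="cuda")
c = torch.empty(1024, 1024, device="cuda")
grid = (triton.cdiv(1024, 64), triton.cdiv(1024, 64))
matmul_kernel[grid](a, b, c, 1024, 1024, 1024,
                    a.stride(0), a.stride(1), b.stride(0), b.stride(1), c.stride(0), c.stride(1),
                    BLOCK_M=64, BLOCK_N=64, BLOCK_K=32)
torch.cuda.synchronize()
# Loose tolerance: tl.dot may use TF32 internally on recent GPUs
assert torch.allclose(c, a @ b, rtol=1e-2, atol=1e-1)

Time it the same way as the CuPy example (synchronize before reading the clock) to see how close it gets to the milestone below.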
Arithmetic Intensity = FLOPS / Bytes
Naive matmul: ~1 FLOP/byte (memory-bound)
Tiled matmul: ~100 FLOPS/byte (compute-bound)
Higher intensity = less time waiting for memory.
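As a rough sanity check on those numbers (an estimate, not a measurement: it assumes float32 and, for the tiled case, that each matrix element crosses the HBM bus about once; caches and real tile sizes land real kernels somewhere in between):

N = 1024
flops = 2 * N**3                    # one multiply + one add per (i, j, k)
naive_bytes = (2 * N * 4) * N * N   # each output element re-reads a row of A and a column of B
tiled_bytes = 3 * N * N * 4         # A and B read ~once, C written once (ideal reuse)
print(flops / naive_bytes)          # ~0.25 FLOP/byte: hopelessly memory-bound
print(flops / tiled_bytes)          # ~170 FLOP/byte ideal; real kernels land lower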
Chapter 3 Milestone
By implementing tiled matmul, you can achieve 500+ GFLOPS, roughly a 500,000x improvement over a naive pure-Python loop. More importantly, you understand WHY: coalesced access, data reuse in shared memory, and hiding memory latency.
How do you choose the right BLOCK_SIZE? There is no universal answer. It must be a power of two (tl.arange requires it), tl.dot needs tile dimensions of at least 16, and bigger blocks buy more data reuse at the cost of more registers and shared memory per program, which can hurt occupancy. In practice you benchmark a few candidates, which Triton automates with @triton.autotune, as sketched below.
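A minimal autotuning sketch for the add kernel from earlier (the configs are plausible starting points, not recommendations):

import triton
import triton.language as tl

@triton.autotune(
    configs=[
        triton.Config({"BLOCK": 128}, num_warps=4),
        triton.Config({"BLOCK": 256}, num_warps=4),
        triton.Config({"BLOCK": 1024}, num_warps=8),
    ],
    key=["n"],  # re-benchmark whenever n changes
)
@triton.jit
def add_kernel_tuned(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

# BLOCK now comes from the winning config, so the grid reads it from `meta`
grid = lambda meta: (triton.cdiv(n, meta["BLOCK"]),)
add_kernel_tuned[grid](x, y, out, n)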
Build It Yourself
The best way to learn is hands-on. Work through these notebooks in order:
Part 1 Labs
- Environment Check - Verify GPU setup
- NumPy Baseline - Establish CPU performance
- CuPy Introduction - Instant GPU speedup
- GPU Architecture - Understand the hardware
- First Triton Kernel - Write your own GPU code
- Memory Hierarchy - Profile memory access
- Tiling Basics - Implement tiled access
- Fast Matrix Multiplication - Achieve 500+ GFLOPS
References
- Triton Language - Python-like GPU programming
- CuPy Documentation - NumPy-compatible GPU arrays
- CUDA Best Practices: Memory Optimizations - Coalescing and bandwidth
- NVIDIA H100 Datasheet - Memory bandwidth specs