Chapter 4

Synchronization & Reductions

When threads need to coordinate, things get interesting. Master barriers, warp shuffles, parallel reductions, and atomic operations—the building blocks of efficient collective operations.

Building on Chapter 3
In Chapter 3, you wrote kernels where each thread worked independently on its own data. Now we tackle the harder case: what happens when threads need to communicate? Reductions, scans, and histograms all require coordination between threads.
What You'll Learn
  1. Implement thread-block synchronization with __syncthreads()
  2. Use warp shuffle operations for intra-warp communication
  3. Write efficient parallel reductions (sum, max, argmax)
  4. Apply atomic operations correctly and understand their cost
  5. Use cooperative groups for flexible synchronization
01 - THE PROBLEM

Why Threads Need to Coordinate

Consider computing the sum of an array. Each thread could add some elements to a total variable—but without coordination, threads overwrite each other's results.

# Naive parallel sum (WRONG - race condition)
@triton.jit
def broken_sum(x_ptr, out_ptr, N):
    pid = tl.program_id(0)
    x = tl.load(x_ptr + pid)
    
    # Every thread reads, adds, writes to same location
    current = tl.load(out_ptr)  # Thread A reads 0
    tl.store(out_ptr, current + x)  # Thread A writes 5
    # Meanwhile Thread B also read 0, writes 3
    # Final result: 3 (not 8!)

The problem is a race condition: multiple threads reading and writing the same memory location without coordination. Solutions include barriers, warp-level primitives, and atomic operations, all covered in the sections that follow.
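To make the lost update concrete, here is a minimal Python simulation of the bad interleaving described in the comments above: both "threads" read the counter before either writes, so one contribution vanishes (a sketch; the function name is illustrative).

```python
# Simulate the lost-update interleaving from the broken_sum example.
# Every thread snapshots the counter first; then each writes read + value,
# so later writes clobber earlier ones.
def lost_update(values):
    counter = 0
    reads = [counter for _ in values]   # all threads read before any writes
    for r, v in zip(reads, values):
        counter = r + v                 # each write is based on a stale read
    return counter

print(lost_update([5, 3]))  # 3, not the correct 8
```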

Synchronization Has Cost

Every synchronization point is a potential bottleneck. Threads that finish early must wait. The goal is to minimize synchronization while maintaining correctness.

A race condition occurs when multiple threads:
Execute the same instruction
Access shared data with at least one write
Run on different SMs

02 - BARRIERS

Block-Level Synchronization

__syncthreads() (or tl.debug_barrier() in Triton) is a barrier: all threads in a block must reach it before any can proceed.

When Barriers Are Required

  1. After writing to shared memory that other threads will read
  2. Before reading shared memory that other threads wrote
  3. Between reduction stages that depend on previous results
# Shared-memory pattern requiring a barrier (CUDA-style pseudocode)
shared = tl.zeros([BLOCK_SIZE], dtype=tl.float32)

# Stage 1: Each thread writes its value
shared[tid] = my_value

# BARRIER: Wait for ALL threads to finish writing
tl.debug_barrier()

# Stage 2: Now safe to read neighbor's values
neighbor = shared[(tid + 1) % BLOCK_SIZE]
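The same write/barrier/read pattern can be exercised on the CPU with Python's threading.Barrier (a sketch, not GPU code; thread and array names are illustrative):

```python
import threading

N = 4
shared = [0] * N
neighbor = [0] * N
barrier = threading.Barrier(N)  # all N workers must arrive before any proceeds

def worker(tid):
    shared[tid] = tid * 10                  # stage 1: write my slot
    barrier.wait()                          # wait for ALL workers to finish writing
    neighbor[tid] = shared[(tid + 1) % N]   # stage 2: safe to read a neighbor

threads = [threading.Thread(target=worker, args=(t,)) for t in range(N)]
for t in threads: t.start()
for t in threads: t.join()

print(neighbor)  # [10, 20, 30, 0]
```

Without the `barrier.wait()`, a fast worker could read a neighbor's slot before it was written, exactly the bug the barrier prevents.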
Deadlock Danger

All threads in a block must reach the barrier. If some threads take a branch with a barrier and others don't, you get a deadlock—threads waiting forever for each other.

# DEADLOCK: Conditional barrier
if tid < 16:
    tl.debug_barrier()  # Only half the threads reach this!
# Threads 16-31 never call barrier → deadlock

# CORRECT: Barrier outside conditional
if tid < 16:
    do_something()
tl.debug_barrier()  # ALL threads reach this
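You can reproduce the hang on the CPU: if a threading.Barrier expects more parties than ever arrive, the waiting threads never get released. Here a timeout stands in for the GPU's silent hang (a sketch; on real hardware there is no timeout, just a frozen kernel):

```python
import threading

barrier = threading.Barrier(4)   # expects 4 parties
errors = []

def worker(tid):
    if tid < 2:                  # only half the threads reach the barrier
        try:
            barrier.wait(timeout=0.2)
        except threading.BrokenBarrierError:
            errors.append(tid)   # on a GPU this would simply hang forever

threads = [threading.Thread(target=worker, args=(t,)) for t in range(4)]
for t in threads: t.start()
for t in threads: t.join()

print(sorted(errors))  # [0, 1] - the waiting threads were never released
```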
What happens if only half the threads in a block call __syncthreads()?
Only those threads synchronize
Deadlock - threads wait forever
The kernel crashes immediately

03 - WARP SHUFFLES

Intra-Warp Communication

Within a warp (32 threads), threads can directly exchange register values using shuffle instructions—no shared memory or barriers needed.

Shuffle Operations

Operation          Description                     Use Case
__shfl_sync        Get value from a specific lane  Broadcast, gather
__shfl_down_sync   Get value from lane + delta     Reduction (log N steps)
__shfl_up_sync     Get value from lane - delta     Prefix scan
__shfl_xor_sync    Get value from lane XOR delta   Butterfly reduction
[Diagram: shfl_down(delta=2): each destination lane receives the register value from the source lane two positions above it]
# Warp-level sum reduction using shfl_down (CUDA-style pseudocode)
# Each thread starts with its own value
val = my_value

# Log2(32) = 5 steps to reduce 32 values
for offset in [16, 8, 4, 2, 1]:
    val += __shfl_down_sync(0xFFFFFFFF, val, offset)

# After loop: lane 0 has the sum of all 32 values
if lane_id == 0:
    result = val
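The shfl_down schedule above can be simulated with a plain Python list, shifting by the offset at each step (illustrative only; out-of-range lanes contribute 0 here, which is harmless because only lane 0's total matters):

```python
# Simulate __shfl_down_sync-based warp sum over 32 lanes.
# shfl_down(val, offset) gives lane i the value held by lane i + offset.
def warp_reduce_sum(vals):
    vals = list(vals)
    for offset in [16, 8, 4, 2, 1]:
        vals = [v + (vals[i + offset] if i + offset < len(vals) else 0)
                for i, v in enumerate(vals)]
    return vals[0]  # after 5 steps, lane 0 holds the sum of all 32 values

print(warp_reduce_sum(range(32)))  # 496 == sum(range(32))
```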
Why is warp shuffle faster than shared memory?
Shuffle operates directly on registers—no memory access at all. Shared memory requires a load/store through the memory subsystem (~20 cycles). Shuffles are essentially free (1 cycle). For warp-level operations, always prefer shuffles.
How many shfl_down steps are needed to sum 32 values in a warp?
32
5 (log2(32))
16

04 - REDUCTIONS

Parallel Reduction Patterns

A reduction combines N values into 1 using an associative operator (sum, max, min, product). The parallel approach: repeatedly combine pairs until one remains.
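The pairwise idea in miniature, sketched with Python lists (any associative operator works; this version assumes a power-of-two input length):

```python
# Tree reduction: halve the array each round by combining element pairs.
def tree_reduce(vals, op):
    vals = list(vals)
    while len(vals) > 1:
        half = len(vals) // 2
        vals = [op(vals[i], vals[i + half]) for i in range(half)]
    return vals[0]

print(tree_reduce([3, 1, 4, 1, 5, 9, 2, 6], max))   # 9
print(tree_reduce(range(8), lambda a, b: a + b))    # 28
```

Each round is one parallel step on a GPU, so N values reduce in log2(N) rounds instead of N-1 sequential additions.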

[Interactive: step-by-step tree reduction demo]

Efficient reduction structure:

  1. Warp-level: Use shuffles (5 steps for 32 threads, no sync needed)
  2. Block-level: Combine warp results in shared memory (need barriers)
  3. Grid-level: Either multiple kernel launches or atomics
# Two-level reduction: warp then block
# Step 1: Each warp reduces its 32 values
warp_sum = warp_reduce(my_value)  # Uses shuffles

# Step 2: First thread of each warp writes to shared mem
if lane_id == 0:
    shared[warp_id] = warp_sum

tl.debug_barrier()  # Wait for all warps

# Step 3: First warp reduces the warp sums
if warp_id == 0:
    val = shared[lane_id] if lane_id < num_warps else 0
    block_sum = warp_reduce(val)
    if lane_id == 0:
        output[block_id] = block_sum
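Here is a CPU sketch of that two-level structure for a 1024-thread block (32 warps of 32 lanes; the helper names mirror the pseudocode above but are illustrative):

```python
# Two-level sum of 1024 values: per-warp shuffle reduction, then one
# pass by the "first warp" over the 32 per-warp partial sums.
def warp_reduce(vals):                      # stands in for 5 shfl_down steps
    vals = list(vals)
    for offset in [16, 8, 4, 2, 1]:
        vals = [v + (vals[i + offset] if i + offset < len(vals) else 0)
                for i, v in enumerate(vals)]
    return vals[0]

def block_reduce(values, warp_size=32):
    warps = [values[w:w + warp_size] for w in range(0, len(values), warp_size)]
    shared = [warp_reduce(w) for w in warps]   # lane 0 of each warp writes
    # (a single barrier sits here on a real GPU)
    return warp_reduce(shared)                 # first warp reduces warp sums

print(block_reduce(list(range(1024))))  # 523776 == sum(range(1024))
```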
Reducing 1024 values with 2-level reduction (warp + block) requires how many sync barriers?
0 (shuffles don't need sync)
1 (between warp and block levels)
10 (log2(1024))

05 - ATOMICS

Atomic Operations

Atomic operations guarantee that a read-modify-write sequence completes without interference from other threads. The hardware ensures only one thread accesses the location at a time.

Common Atomics

Operation       Action                        Use Case
atomicAdd       *addr += val                  Histograms, counters, grid reduction
atomicMax/Min   *addr = max/min(*addr, val)   Finding extrema
atomicExch      swap(*addr, val)              Locks, flags
atomicCAS       compare-and-swap              Lock-free algorithms
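Compare-and-swap is the primitive the others can be built from. A Python sketch of constructing atomicAdd out of CAS (the lock only stands in for the hardware's atomicity guarantee; class and function names are illustrative):

```python
import threading

class AtomicCell:
    """One memory cell exposing a hardware-style compare-and-swap."""
    def __init__(self, value=0):
        self.value = value
        self._lock = threading.Lock()   # models the hardware guarantee

    def cas(self, expected, new):
        """Atomically: if value == expected, store new; return the old value."""
        with self._lock:
            old = self.value
            if old == expected:
                self.value = new
            return old

def atomic_add(cell, delta):
    # Classic CAS retry loop: snapshot, attempt, retry if someone interfered.
    while True:
        old = cell.value
        if cell.cas(old, old + delta) == old:
            return old

cell = AtomicCell()
workers = [threading.Thread(target=lambda: [atomic_add(cell, 1) for _ in range(1000)])
           for _ in range(4)]
for w in workers: w.start()
for w in workers: w.join()
print(cell.value)  # 4000: no updates lost
```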
[Interactive: atomic contention visualizer showing 8 threads each atomicAdd-ing 1 to a single counter and serializing]

Atomics Are Slow

When many threads atomicAdd to the same location, they serialize—only one succeeds at a time. High contention = sequential performance. Use atomics for final accumulation, not inner loops. Prefer reductions when possible.

Slow: Atomic in Inner Loop

# Every thread hits same address
for i in range(1000):
    atomicAdd(total, val[i])
# 1000 serialized atomics!

Fast: Local Sum + One Atomic

# Sum locally first
local_sum = 0
for i in range(1000):
    local_sum += val[i]
atomicAdd(total, local_sum)
# Only 1 atomic per thread!
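A Python sketch that counts the atomic operations each strategy issues (the counters are illustrative; the real speedup comes from reduced contention, but the op counts show why):

```python
# Compare atomic traffic: per-element atomicAdd vs. privatized local sums.
def naive_total(per_thread_vals):
    total, atomic_ops = 0, 0
    for vals in per_thread_vals:          # each "thread"
        for v in vals:
            total += v                    # one atomicAdd per element
            atomic_ops += 1
    return total, atomic_ops

def privatized_total(per_thread_vals):
    total, atomic_ops = 0, 0
    for vals in per_thread_vals:
        local = sum(vals)                 # registers only: no contention
        total += local                    # one atomicAdd per thread
        atomic_ops += 1
    return total, atomic_ops

data = [[1] * 1000 for _ in range(8)]     # 8 threads, 1000 elements each
print(naive_total(data))       # (8000, 8000): 8000 serialized atomics
print(privatized_total(data))  # (8000, 8):    same sum, 8 atomics
```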
When 1000 threads all atomicAdd to the same location, performance is:
1000x faster than sequential
Approximately sequential (serialized)
Undefined behavior

06 - COOPERATIVE GROUPS

Flexible Synchronization

Cooperative Groups (CUDA 9+) provide flexible thread groupings beyond the fixed block/warp hierarchy. You can synchronize subsets of threads or even across the entire grid.

Group Types

Group                  Size                   Sync Scope
thread_block           Up to 1024 threads     Same as __syncthreads()
thread_block_tile<N>   N (power of 2, ≤ 32)   Warp or sub-warp
coalesced_threads      Active threads         Dynamic (after divergence)
grid_group             Entire grid            All blocks (requires special launch)
// Cooperative groups example (cg::reduce requires CUDA 11+)
#include <cooperative_groups.h>
#include <cooperative_groups/reduce.h>  // cg::reduce lives in this header
namespace cg = cooperative_groups;

__global__ void kernel() {
    // Get block group (replaces __syncthreads)
    cg::thread_block block = cg::this_thread_block();
    block.sync();
    
    // Get warp-sized tile for shuffle operations
    cg::thread_block_tile<32> warp = cg::tiled_partition<32>(block);
    int sum = cg::reduce(warp, my_val, cg::plus<int>());
    
    // Sub-warp tiles (e.g., 16 threads)
    cg::thread_block_tile<16> half_warp = cg::tiled_partition<16>(block);
}
Grid-Wide Sync

grid_group.sync() synchronizes all blocks in the grid—something impossible with __syncthreads(). Requires cudaLaunchCooperativeKernel() and limits grid size to what can run concurrently. Useful for iterative algorithms without kernel launch overhead.

Cooperative groups allow synchronization of:
Only threads within a warp
Only threads within a block
Flexible groups including entire grid

PRACTICE

Hands-On Labs

REFERENCES

Further Reading

Primary Documentation

  1. CUDA Programming Guide: Synchronization
    docs.nvidia.com/cuda/.../synchronization-functions
  2. CUDA Programming Guide: Warp Shuffle
    docs.nvidia.com/cuda/.../warp-shuffle-functions
  3. CUDA Programming Guide: Atomic Functions
    docs.nvidia.com/cuda/.../atomic-functions
  4. CUDA Programming Guide: Cooperative Groups
    docs.nvidia.com/cuda/.../cooperative-groups
  5. NVIDIA Developer Blog: Faster Parallel Reductions on Kepler
    developer.nvidia.com/blog/faster-parallel-reductions-kepler
All material licensed under CC BY-NC-SA 4.0