Synchronization & Reductions
When threads need to coordinate, things get interesting. Barriers, warp shuffles, parallel reductions, and atomic operations are the building blocks of efficient collective operations. By the end of this section, you should be able to:
- Implement thread-block synchronization with __syncthreads()
- Use warp shuffle operations for intra-warp communication
- Write efficient parallel reductions (sum, max, argmax)
- Apply atomic operations correctly and understand their cost
- Use cooperative groups for flexible synchronization
Why Threads Need to Coordinate
Consider computing the sum of an array. Each thread could add some elements to a
total variable—but without coordination, threads overwrite each other's results.
```python
# Naive parallel sum (WRONG: race condition)
@triton.jit
def broken_sum(x_ptr, out_ptr, N):
    pid = tl.program_id(0)
    x = tl.load(x_ptr + pid)
    # Every thread reads, adds, and writes the same location
    current = tl.load(out_ptr)      # Thread A reads 0
    tl.store(out_ptr, current + x)  # Thread A writes 5
    # Meanwhile thread B also read 0 and writes 3
    # Final result: 3 (not 8!)
```
The problem is a race condition: multiple threads reading and writing the same memory location. Solutions include:
- Barriers: Force all threads to reach a point before continuing
- Atomics: Hardware-guaranteed read-modify-write operations
- Reduction patterns: Structured algorithms that avoid conflicts
Every synchronization point is a potential bottleneck. Threads that finish early must wait. The goal is to minimize synchronization while maintaining correctness.
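The same lost-update race is easy to reproduce on a CPU with ordinary threads. This is a minimal Python sketch (plain threads, not GPU code): the unprotected read-modify-write may lose increments, while a lock, standing in for a hardware atomic, restores correctness.

```python
import threading

def run(n_threads=8, n_incr=10_000, use_lock=False):
    total = [0]
    lock = threading.Lock()

    def worker():
        for _ in range(n_incr):
            if use_lock:
                with lock:            # protected read-modify-write
                    total[0] += 1
            else:
                v = total[0]          # read
                total[0] = v + 1      # write (another thread may have raced in between)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads: t.start()
    for t in threads: t.join()
    return total[0]

print(run(use_lock=True))   # always 80000
print(run(use_lock=False))  # may be less than 80000 (lost updates)
```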
Block-Level Synchronization
__syncthreads()
(or tl.debug_barrier() in Triton) is a barrier: all threads in a block must reach it before any can proceed.
When Barriers Are Required
- After writing to shared memory that other threads will read
- Before reading shared memory that other threads wrote
- Between reduction stages that depend on previous results
```python
# Shared-memory pattern requiring a barrier (pseudocode)
shared = tl.zeros([BLOCK_SIZE], dtype=tl.float32)

# Stage 1: each thread writes its value
shared[tid] = my_value

# BARRIER: wait for ALL threads to finish writing
tl.debug_barrier()

# Stage 2: now safe to read a neighbor's value
neighbor = shared[(tid + 1) % BLOCK_SIZE]
```
All threads in a block must reach the barrier. If some threads take a branch containing a barrier and others don't, you get a deadlock: threads waiting forever for each other.
```python
# DEADLOCK: conditional barrier
if tid < 16:
    tl.debug_barrier()  # Only half the threads reach this!
# Threads 16-31 never call the barrier -> deadlock

# CORRECT: barrier outside the conditional
if tid < 16:
    do_something()
tl.debug_barrier()  # ALL threads reach this
```
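The write/barrier/read pattern above can be simulated on the CPU with `threading.Barrier`, which, like `__syncthreads()`, blocks until a fixed number of participants arrive. A minimal sketch, with one Python thread playing each GPU thread:

```python
import threading

BLOCK_SIZE = 8
shared = [0.0] * BLOCK_SIZE
result = [0.0] * BLOCK_SIZE
barrier = threading.Barrier(BLOCK_SIZE)   # all "threads in the block" must arrive

def thread_fn(tid):
    # Stage 1: each thread writes its own slot
    shared[tid] = float(tid)
    # Wait until EVERY thread has finished writing
    barrier.wait()
    # Stage 2: now safe to read a neighbor's slot
    result[tid] = shared[(tid + 1) % BLOCK_SIZE]

threads = [threading.Thread(target=thread_fn, args=(t,)) for t in range(BLOCK_SIZE)]
for t in threads: t.start()
for t in threads: t.join()

print(result)  # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 0.0]
```

Without the `barrier.wait()`, a fast thread could read its neighbor's slot before the neighbor has written it, exactly the race the barrier exists to prevent.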
Intra-Warp Communication
Within a warp (32 threads), threads can directly exchange register values using shuffle instructions, with no shared memory or barriers needed.
Shuffle Operations
| Operation | Description | Use Case |
|---|---|---|
| `__shfl_sync` | Get value from a specific lane | Broadcast, gather |
| `__shfl_down_sync` | Get value from lane + delta | Reduction (log N steps) |
| `__shfl_up_sync` | Get value from lane - delta | Prefix scan |
| `__shfl_xor_sync` | Get value from lane XOR delta | Butterfly reduction |
```c
// Warp-level sum reduction using __shfl_down_sync
// Each thread starts with its own value
float val = my_value;

// log2(32) = 5 steps to reduce 32 values
for (int offset = 16; offset > 0; offset /= 2)
    val += __shfl_down_sync(0xFFFFFFFF, val, offset);

// After the loop, lane 0 holds the sum of all 32 values
if (lane_id == 0)
    result = val;
```
Why is warp shuffle faster than shared memory? A shuffle moves data register-to-register in a single instruction, whereas the shared-memory equivalent needs a store, a barrier, and a load. Shuffles also consume no shared-memory capacity, which can improve occupancy.
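The shuffle-down reduction is easier to follow when you can watch the data move. This Python sketch models each lane's register as one array element and mimics `__shfl_down_sync` semantics (a lane whose source would fall past the end of the warp keeps its own value):

```python
import numpy as np

def shfl_down(vals, offset):
    """Simulate __shfl_down_sync: lane i receives the value from lane i + offset.
    Lanes whose source lane is out of range keep their own value."""
    out = vals.copy()
    out[:len(vals) - offset] = vals[offset:]
    return out

def warp_reduce_sum(vals):
    vals = vals.astype(np.float64).copy()
    for offset in (16, 8, 4, 2, 1):   # log2(32) = 5 steps
        vals = vals + shfl_down(vals, offset)
    return vals[0]                    # lane 0 holds the full sum

lanes = np.arange(32)                 # one value per lane: 0..31
print(warp_reduce_sum(lanes))         # 496.0  (= 31 * 32 / 2)
```

After the first step, lanes 0-15 each hold a pair sum; after the second, lanes 0-7 hold sums of four; and so on until lane 0 holds all 32.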
Parallel Reduction Patterns
A reduction combines N values into 1 using an associative operator (sum, max, min, product). The parallel approach: repeatedly combine pairs until one remains.
Efficient reduction structure:
- Warp-level: Use shuffles (5 steps for 32 threads, no sync needed)
- Block-level: Combine warp results in shared memory (need barriers)
- Grid-level: Either multiple kernel launches or atomics
```python
# Two-level reduction: warp then block (pseudocode)

# Step 1: each warp reduces its 32 values
warp_sum = warp_reduce(my_value)  # uses shuffles

# Step 2: the first thread of each warp writes to shared memory
if lane_id == 0:
    shared[warp_id] = warp_sum
tl.debug_barrier()  # wait for all warps

# Step 3: the first warp reduces the warp sums
if warp_id == 0:
    val = shared[lane_id] if lane_id < num_warps else 0
    block_sum = warp_reduce(val)
    if lane_id == 0:
        output[block_id] = block_sum
```
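The two-level structure can be checked end to end with a small CPU model. Here `warp_reduce_sum` is a trivial stand-in for the shuffle reduction (it just sums the 32 lane values), and a NumPy array plays the block's shared memory:

```python
import numpy as np

def warp_reduce_sum(vals):
    # Stand-in for the 32-lane shuffle reduction: lane 0 ends up with the warp's sum
    return vals.sum()

def block_reduce_sum(block_vals, warp_size=32):
    # Step 1: each warp reduces its own 32 values
    warps = block_vals.reshape(-1, warp_size)
    # Step 2: lane 0 of each warp writes its partial sum to "shared memory"
    shared = np.array([warp_reduce_sum(w) for w in warps])
    # (barrier here on real hardware)
    # Step 3: the first warp reduces the per-warp partial sums
    return warp_reduce_sum(shared)

block = np.arange(256, dtype=np.float64)   # 8 warps x 32 threads
print(block_reduce_sum(block))             # 32640.0  (= 255 * 256 / 2)
```

For a grid-level sum, each block's result would then be combined either by a second kernel launch over the per-block partials or by one `atomicAdd` per block.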
Atomic Operations
Atomic operations guarantee that a read-modify-write sequence completes without interference from other threads. The hardware ensures only one thread accesses the location at a time.
Common Atomics
| Operation | Action | Use Case |
|---|---|---|
| `atomicAdd` | `*addr += val` | Histograms, counters, grid reduction |
| `atomicMax` / `atomicMin` | `*addr = max(*addr, val)` | Finding extrema |
| `atomicExch` | `swap(*addr, val)` | Locks, flags |
| `atomicCAS` | Compare-and-swap | Lock-free algorithms |
When many threads atomicAdd to the same location, they serialize: only one update succeeds at a time, so high contention means near-sequential performance. Use atomics for final accumulation, not inner loops, and prefer structured reductions when possible.
Slow: Atomic in Inner Loop

```python
# Every iteration hits the same address
for i in range(1000):
    atomicAdd(total, val[i])
# 1000 serialized atomics per thread!
```

Fast: Local Sum + One Atomic

```python
# Sum locally first
local_sum = 0
for i in range(1000):
    local_sum += val[i]
atomicAdd(total, local_sum)
# Only 1 atomic per thread!
```
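The same contrast can be demonstrated on the CPU, with a lock modeling the atomic's serialization: the slow worker takes the lock once per element, the fast worker accumulates privately and takes it once. Both produce the same total; only the contention differs.

```python
import threading

N_THREADS, N_ITEMS = 4, 1000
vals = list(range(N_ITEMS))
total = [0]
lock = threading.Lock()

def slow_worker():
    # One "atomic" (locked update) per element: maximal contention
    for v in vals:
        with lock:
            total[0] += v

def fast_worker():
    # Accumulate privately, then do a single locked update
    local_sum = sum(vals)
    with lock:
        total[0] += local_sum

for worker in (slow_worker, fast_worker):
    total[0] = 0
    threads = [threading.Thread(target=worker) for _ in range(N_THREADS)]
    for t in threads: t.start()
    for t in threads: t.join()
    print(total[0])  # 1998000 both times (= 4 * 499500); the fast version takes 4 locks total
```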
Flexible Synchronization
Cooperative Groups (CUDA 9+) provide flexible thread groupings beyond the fixed block/warp hierarchy. You can synchronize subsets of threads or even across the entire grid.
Group Types
| Group | Size | Sync Scope |
|---|---|---|
| `thread_block` | Up to 1024 threads | Same as `__syncthreads()` |
| `thread_block_tile<N>` | N (power of 2, ≤ 32) | Warp or sub-warp |
| `coalesced_threads` | Active threads | Dynamic (after divergence) |
| `grid_group` | Entire grid | All blocks (requires special launch) |
```cpp
// Cooperative groups example
#include <cooperative_groups.h>
#include <cooperative_groups/reduce.h>  // required for cg::reduce (CUDA 11+)
namespace cg = cooperative_groups;

__global__ void kernel(int my_val) {
    // Get the block group (replaces __syncthreads)
    cg::thread_block block = cg::this_thread_block();
    block.sync();

    // Get a warp-sized tile for shuffle operations
    cg::thread_block_tile<32> warp = cg::tiled_partition<32>(block);
    int sum = cg::reduce(warp, my_val, cg::plus<int>());

    // Sub-warp tiles (e.g., 16 threads)
    cg::thread_block_tile<16> half_warp = cg::tiled_partition<16>(block);
}
```
grid_group.sync() synchronizes all blocks in the grid, something __syncthreads() cannot do. It requires launching with cudaLaunchCooperativeKernel() and limits the grid to the number of blocks that can be resident concurrently. This is useful for iterative algorithms that would otherwise pay a kernel-launch overhead on every iteration.
Further Reading
Primary Documentation
- CUDA Programming Guide: Synchronization (docs.nvidia.com/cuda/.../synchronization-functions)
- CUDA Programming Guide: Warp Shuffle (docs.nvidia.com/cuda/.../warp-shuffle-functions)
- CUDA Programming Guide: Atomic Functions (docs.nvidia.com/cuda/.../atomic-functions)
- CUDA Programming Guide: Cooperative Groups (docs.nvidia.com/cuda/.../cooperative-groups)
- NVIDIA Developer Blog: Faster Parallel Reductions on Kepler (developer.nvidia.com/blog/faster-parallel-reductions-kepler)