Synchronization & Reductions
When threads need to coordinate, things get interesting. Barriers, warp shuffles, parallel reductions, and atomic operations are the building blocks of efficient collective operations. By the end of this section, you should be able to:
- Implement thread-block synchronization with __syncthreads()
- Use warp shuffle operations for intra-warp communication
- Write efficient parallel reductions (sum, max, argmax)
- Apply atomic operations correctly and understand their cost
- Use cooperative groups for flexible synchronization
Why Threads Need to Coordinate
Consider computing the sum of an array. Each thread could add some elements to a
total variable—but without coordination, threads overwrite each other's results.
```python
# Naive parallel sum (WRONG: race condition)
@triton.jit
def broken_sum(x_ptr, out_ptr, N):
    pid = tl.program_id(0)
    x = tl.load(x_ptr + pid)
    # Every thread reads, adds, and writes the same location
    current = tl.load(out_ptr)      # Thread A reads 0
    tl.store(out_ptr, current + x)  # Thread A writes 5
    # Meanwhile thread B also read 0 and writes 3
    # Final result: 3 (not 8!)
```
The problem is a race condition: multiple threads reading and writing the same memory location. Solutions include:
- Barriers: Force all threads to reach a point before continuing
- Atomics: Hardware-guaranteed read-modify-write operations
- Reduction patterns: Structured algorithms that avoid conflicts
Every synchronization point is a potential bottleneck. Threads that finish early must wait. The goal is to minimize synchronization while maintaining correctness.
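The same lost-update race is easy to reproduce on a CPU with ordinary threads. This is a minimal Python sketch (plain threads, not GPU code): the unprotected read-modify-write may lose increments, while a lock, standing in for a hardware atomic, restores correctness.

```python
import threading

def run(n_threads=8, n_incr=10_000, use_lock=False):
    total = [0]
    lock = threading.Lock()

    def worker():
        for _ in range(n_incr):
            if use_lock:
                with lock:            # protected read-modify-write
                    total[0] += 1
            else:
                v = total[0]          # read
                total[0] = v + 1      # write (another thread may have raced in between)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads: t.start()
    for t in threads: t.join()
    return total[0]

print(run(use_lock=True))   # always 80000
print(run(use_lock=False))  # may be less than 80000 (lost updates)
```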
Block-Level Synchronization
__syncthreads()
(or tl.debug_barrier() in Triton) is a barrier: all threads in a block must reach it before any can proceed.
When Barriers Are Required
- After writing to shared memory that other threads will read
- Before reading shared memory that other threads wrote
- Between reduction stages that depend on previous results
```python
# Shared-memory pattern requiring a barrier (pseudocode)
shared = tl.zeros([BLOCK_SIZE], dtype=tl.float32)

# Stage 1: each thread writes its value
shared[tid] = my_value

# BARRIER: wait for ALL threads to finish writing
tl.debug_barrier()

# Stage 2: now safe to read a neighbor's value
neighbor = shared[(tid + 1) % BLOCK_SIZE]
```
All threads in a block must reach the barrier. If some threads take a branch containing a barrier and others don't, you get a deadlock: threads waiting forever for each other.
```python
# DEADLOCK: conditional barrier
if tid < 16:
    tl.debug_barrier()  # Only half the threads reach this!
# Threads 16-31 never call the barrier -> deadlock

# CORRECT: barrier outside the conditional
if tid < 16:
    do_something()
tl.debug_barrier()  # ALL threads reach this
```
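The write/barrier/read pattern above can be simulated on the CPU with `threading.Barrier`, which, like `__syncthreads()`, blocks until a fixed number of participants arrive. A minimal sketch, with one Python thread playing each GPU thread:

```python
import threading

BLOCK_SIZE = 8
shared = [0.0] * BLOCK_SIZE
result = [0.0] * BLOCK_SIZE
barrier = threading.Barrier(BLOCK_SIZE)   # all "threads in the block" must arrive

def thread_fn(tid):
    # Stage 1: each thread writes its own slot
    shared[tid] = float(tid)
    # Wait until EVERY thread has finished writing
    barrier.wait()
    # Stage 2: now safe to read a neighbor's slot
    result[tid] = shared[(tid + 1) % BLOCK_SIZE]

threads = [threading.Thread(target=thread_fn, args=(t,)) for t in range(BLOCK_SIZE)]
for t in threads: t.start()
for t in threads: t.join()

print(result)  # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 0.0]
```

Without the `barrier.wait()`, a fast thread could read its neighbor's slot before the neighbor has written it, exactly the race the barrier exists to prevent.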
Intra-Warp Communication
Within a warp (32 threads), threads can directly exchange register values using shuffle instructions, with no shared memory or barriers needed.
Shuffle Operations
| Operation | Description | Use Case |
|---|---|---|
| `__shfl_sync` | Get value from a specific lane | Broadcast, gather |
| `__shfl_down_sync` | Get value from lane + delta | Reduction (log N steps) |
| `__shfl_up_sync` | Get value from lane - delta | Prefix scan |
| `__shfl_xor_sync` | Get value from lane XOR delta | Butterfly reduction |
```c
// Warp-level sum reduction using __shfl_down_sync
// Each thread starts with its own value
float val = my_value;

// log2(32) = 5 steps to reduce 32 values
for (int offset = 16; offset > 0; offset /= 2)
    val += __shfl_down_sync(0xFFFFFFFF, val, offset);

// After the loop, lane 0 holds the sum of all 32 values
if (lane_id == 0)
    result = val;
```
Why is warp shuffle faster than shared memory? A shuffle moves data register-to-register in a single instruction, whereas the shared-memory equivalent needs a store, a barrier, and a load. Shuffles also consume no shared-memory capacity, which can improve occupancy.
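The shuffle-down reduction is easier to follow when you can watch the data move. This Python sketch models each lane's register as one array element and mimics `__shfl_down_sync` semantics (a lane whose source would fall past the end of the warp keeps its own value):

```python
import numpy as np

def shfl_down(vals, offset):
    """Simulate __shfl_down_sync: lane i receives the value from lane i + offset.
    Lanes whose source lane is out of range keep their own value."""
    out = vals.copy()
    out[:len(vals) - offset] = vals[offset:]
    return out

def warp_reduce_sum(vals):
    vals = vals.astype(np.float64).copy()
    for offset in (16, 8, 4, 2, 1):   # log2(32) = 5 steps
        vals = vals + shfl_down(vals, offset)
    return vals[0]                    # lane 0 holds the full sum

lanes = np.arange(32)                 # one value per lane: 0..31
print(warp_reduce_sum(lanes))         # 496.0  (= 31 * 32 / 2)
```

After the first step, lanes 0-15 each hold a pair sum; after the second, lanes 0-7 hold sums of four; and so on until lane 0 holds all 32.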
Parallel Reduction Patterns
A reduction combines N values into 1 using an associative operator (sum, max, min, product). The parallel approach: repeatedly combine pairs until one remains.
Efficient reduction structure:
- Warp-level: Use shuffles (5 steps for 32 threads, no sync needed)
- Block-level: Combine warp results in shared memory (need barriers)
- Grid-level: Either multiple kernel launches or atomics
```python
# Two-level reduction: warp then block (pseudocode)

# Step 1: each warp reduces its 32 values
warp_sum = warp_reduce(my_value)  # uses shuffles

# Step 2: the first thread of each warp writes to shared memory
if lane_id == 0:
    shared[warp_id] = warp_sum
tl.debug_barrier()  # wait for all warps

# Step 3: the first warp reduces the warp sums
if warp_id == 0:
    val = shared[lane_id] if lane_id < num_warps else 0
    block_sum = warp_reduce(val)
    if lane_id == 0:
        output[block_id] = block_sum
```
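The two-level structure can be checked end to end with a small CPU model. Here `warp_reduce_sum` is a trivial stand-in for the shuffle reduction (it just sums the 32 lane values), and a NumPy array plays the block's shared memory:

```python
import numpy as np

def warp_reduce_sum(vals):
    # Stand-in for the 32-lane shuffle reduction: lane 0 ends up with the warp's sum
    return vals.sum()

def block_reduce_sum(block_vals, warp_size=32):
    # Step 1: each warp reduces its own 32 values
    warps = block_vals.reshape(-1, warp_size)
    # Step 2: lane 0 of each warp writes its partial sum to "shared memory"
    shared = np.array([warp_reduce_sum(w) for w in warps])
    # (barrier here on real hardware)
    # Step 3: the first warp reduces the per-warp partial sums
    return warp_reduce_sum(shared)

block = np.arange(256, dtype=np.float64)   # 8 warps x 32 threads
print(block_reduce_sum(block))             # 32640.0  (= 255 * 256 / 2)
```

For a grid-level sum, each block's result would then be combined either by a second kernel launch over the per-block partials or by one `atomicAdd` per block.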
Atomic Operations
Atomic operations guarantee that a read-modify-write sequence completes without interference from other threads. The hardware ensures only one thread accesses the location at a time.
Common Atomics
| Operation | Action | Use Case |
|---|---|---|
| `atomicAdd` | `*addr += val` | Histograms, counters, grid reduction |
| `atomicMax` / `atomicMin` | `*addr = max(*addr, val)` | Finding extrema |
| `atomicExch` | `swap(*addr, val)` | Locks, flags |
| `atomicCAS` | Compare-and-swap | Lock-free algorithms |
When many threads atomicAdd to the same location, they serialize: only one update succeeds at a time, so high contention means near-sequential performance. Use atomics for final accumulation, not inner loops, and prefer structured reductions when possible.
Slow: Atomic in Inner Loop

```python
# Every iteration hits the same address
for i in range(1000):
    atomicAdd(total, val[i])
# 1000 serialized atomics per thread!
```

Fast: Local Sum + One Atomic

```python
# Sum locally first
local_sum = 0
for i in range(1000):
    local_sum += val[i]
atomicAdd(total, local_sum)
# Only 1 atomic per thread!
```
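The same contrast can be demonstrated on the CPU, with a lock modeling the atomic's serialization: the slow worker takes the lock once per element, the fast worker accumulates privately and takes it once. Both produce the same total; only the contention differs.

```python
import threading

N_THREADS, N_ITEMS = 4, 1000
vals = list(range(N_ITEMS))
total = [0]
lock = threading.Lock()

def slow_worker():
    # One "atomic" (locked update) per element: maximal contention
    for v in vals:
        with lock:
            total[0] += v

def fast_worker():
    # Accumulate privately, then do a single locked update
    local_sum = sum(vals)
    with lock:
        total[0] += local_sum

for worker in (slow_worker, fast_worker):
    total[0] = 0
    threads = [threading.Thread(target=worker) for _ in range(N_THREADS)]
    for t in threads: t.start()
    for t in threads: t.join()
    print(total[0])  # 1998000 both times (= 4 * 499500); the fast version takes 4 locks total
```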
Flexible Synchronization
Cooperative Groups (CUDA 9+) provide flexible thread groupings beyond the fixed block/warp hierarchy. You can synchronize subsets of threads or even across the entire grid.
Group Types
| Group | Size | Sync Scope |
|---|---|---|
| `thread_block` | Up to 1024 threads | Same as `__syncthreads()` |
| `thread_block_tile<N>` | N (power of 2, ≤ 32) | Warp or sub-warp |
| `coalesced_threads` | Active threads | Dynamic (after divergence) |
| `grid_group` | Entire grid | All blocks (requires special launch) |
```cpp
// Cooperative groups example
#include <cooperative_groups.h>
#include <cooperative_groups/reduce.h>  // required for cg::reduce (CUDA 11+)
namespace cg = cooperative_groups;

__global__ void kernel(int my_val) {
    // Get the block group (replaces __syncthreads)
    cg::thread_block block = cg::this_thread_block();
    block.sync();

    // Get a warp-sized tile for shuffle operations
    cg::thread_block_tile<32> warp = cg::tiled_partition<32>(block);
    int sum = cg::reduce(warp, my_val, cg::plus<int>());

    // Sub-warp tiles (e.g., 16 threads)
    cg::thread_block_tile<16> half_warp = cg::tiled_partition<16>(block);
}
```
grid_group.sync() synchronizes all blocks in the grid, something __syncthreads() cannot do. It requires launching with cudaLaunchCooperativeKernel() and limits the grid to the number of blocks that can be resident concurrently. This is useful for iterative algorithms that would otherwise pay a kernel-launch overhead on every iteration.
Further Reading
Primary Documentation
- CUDA Programming Guide: Synchronization (docs.nvidia.com/cuda/.../synchronization-functions)
- CUDA Programming Guide: Warp Shuffle (docs.nvidia.com/cuda/.../warp-shuffle-functions)
- CUDA Programming Guide: Atomic Functions (docs.nvidia.com/cuda/.../atomic-functions)
- CUDA Programming Guide: Cooperative Groups (docs.nvidia.com/cuda/.../cooperative-groups)
- NVIDIA Developer Blog: Faster Parallel Reductions on Kepler (developer.nvidia.com/blog/faster-parallel-reductions-kepler)