The Parallel Mindset
Before diving into GPU hardware, we need to rewire our thinking. Sequential programming intuition will mislead you. This chapter builds the mental model for thinking in parallel. By the end, you should be able to:
- Explain why parallelism is fundamentally different from sequential programming
- Apply Amdahl's Law to estimate parallel speedup limits
- Identify embarrassingly parallel vs communication-bound problems
- Describe the mental model shift from "one fast thing" to "many slow things"
- Recognize SIMD patterns in CPU code as a bridge to GPU SIMT
Why Parallelism is Hard
You've written sequential code for years. Your intuition says: "make the thing faster." But parallel programming requires a different question: "how do I divide this work?"
Sequential thinking leads to wrong parallel code. Consider summing an array:
```python
# Sequential (correct, simple)
total = 0
for x in array:
    total += x  # Each step depends on the previous
```

```python
# Naive parallel (WRONG - race condition!)
total = 0
parallel_for x in array:  # pseudocode: each iteration runs on its own thread
    total += x            # Multiple threads read/write total
```
The naive parallel version has a race condition: multiple threads read `total`, add their value, and write back, overwriting each other's work.
Parallel programming requires thinking about what can happen simultaneously. Your code must be correct regardless of execution order. This is fundamentally different from sequential programming where order is guaranteed.
Common parallel hazards include:
- Race conditions: Multiple threads access shared data, at least one writes
- Deadlocks: Threads wait for each other in a cycle, forever
- Data dependencies: Step N needs the result of step N-1
The good news: GPU programming has patterns that avoid these hazards. The bad news: you need to learn to recognize which patterns apply to your problem.
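One such pattern is the reduction: give each worker a private partial result, then combine them in a cheap final step. Here is a minimal CPU sketch of that structure using Python's standard library (the chunking scheme and worker count are arbitrary choices, and on CPython the threads won't actually run the sums faster because of the GIL; the point is the race-free shape of the computation):

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_sum(array, n_workers=4):
    # Split the input into chunks; each worker sums its own chunk.
    # No shared accumulator means no race condition.
    chunk = max(1, len(array) // n_workers)
    chunks = [array[i:i + chunk] for i in range(0, len(array), chunk)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(sum, chunks))  # independent partial sums
    return sum(partials)  # small sequential reduction at the end

print(parallel_sum(list(range(100))))  # 4950
```

This is exactly the shape GPU reductions take: many independent partial results, then a combine step.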
The Speedup Limit
Amdahl's Law tells us the maximum speedup from parallelization. If part of your program is inherently sequential, it limits how fast the whole program can go—no matter how many processors you throw at it.
The Formula
Speedup = 1 / (S + P/N)
where S is the sequential fraction, P = 1 − S is the parallel fraction, and N is the number of processors.
The devastating insight: as N → ∞, speedup approaches 1/S. If 10% of your code is sequential, maximum speedup is 10x. Period.
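The formula is worth plugging numbers into at least once. A minimal sketch (the function name is ours):

```python
def amdahl_speedup(s, n):
    """Speedup = 1 / (S + P/N), with P = 1 - S.
    s: sequential fraction, n: number of processors."""
    return 1.0 / (s + (1.0 - s) / n)

# With 10% sequential code, piling on processors stops helping fast:
# the speedup climbs toward, but never reaches, 1/0.10 = 10x.
for n in (10, 100, 1000, 1_000_000):
    print(f"N={n:>9}: speedup = {amdahl_speedup(0.10, n):.2f}")
```

Even a million processors cannot push past the 10x ceiling set by the sequential 10%.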
What about Gustafson's Law? Amdahl assumes a fixed problem size. Gustafson's Law observes that in practice we grow the problem with the machine: with more processors we process bigger datasets, so the parallel fraction dominates and useful speedup keeps scaling. Both laws are right; they answer different questions.
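For comparison with the Amdahl formula above, Gustafson's scaled speedup is S + (1 − S)·N (a sketch; the function name is ours):

```python
def gustafson_speedup(s, n):
    """Scaled speedup: the sequential part stays fixed while the
    parallel workload grows with n, giving s + (1 - s) * n."""
    return s + (1.0 - s) * n

# Same 10% sequential fraction, 1000 processors: where Amdahl's
# fixed-size speedup caps out below 10x, the scaled view keeps
# growing roughly linearly with n.
print(gustafson_speedup(0.10, 1000))
```

This is why GPUs shine on large problems: we rarely run a fixed small workload on 100,000 threads; we run a much bigger one.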
Categories of Parallelism
Not all problems parallelize equally well. Recognizing which category your problem falls into determines your implementation strategy.
When you see a problem, ask: Is each output independent (embarrassingly parallel, a map)? Does it combine all values into one result (a reduction)? Does each output read its neighbors (a stencil, communication-bound at the boundaries)? The answer determines which GPU optimization techniques apply.
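The three questions above correspond to three recurring shapes of computation. A minimal NumPy sketch of each (the example operations are ours):

```python
import numpy as np

x = np.arange(8, dtype=float)

# Map (embarrassingly parallel): each output depends on exactly one input.
squared = x ** 2

# Reduction: outputs combine through an associative operator;
# on a GPU this becomes a tree of partial sums.
total = np.sum(x)

# Stencil: each output reads a small neighborhood of inputs,
# so parallel chunks must exchange boundary ("halo") values.
smoothed = (x[:-2] + x[1:-1] + x[2:]) / 3.0
```

Maps need no coordination at all; reductions need a combine step; stencils need neighbor data. That ordering is roughly the ordering of implementation difficulty on a GPU.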
From CPU Vectors to GPU Warps
If you've used NumPy, you already understand parallel thinking at a basic level. `a + b` on arrays doesn't loop; it operates on all elements simultaneously. This is SIMD (Single Instruction, Multiple Data).
Modern CPUs have SIMD units like AVX-512 that process 8-16 elements at once. GPUs take this further with SIMT (Single Instruction, Multiple Threads): 32 threads execute in lockstep.
CPU SIMD (e.g. AVX): 8 floats processed per instruction. GPU SIMT (one warp): 32 threads execute the same instruction, with thousands more warps in flight.
The key difference: a GPU doesn't have 32 lanes—it has thousands of warps, each with 32 threads. A B200 can have ~200,000+ threads in flight.
NumPy as Training Wheels
```python
import numpy as np

# NumPy: implicit parallelism (CPU SIMD under the hood)
result = np.exp(x) + np.log(y)  # Operates on all elements at once

# GPU thinking: explicit parallelism.
# Ask: "What does ONE thread do to ONE element?"
# Then run that a million times in parallel.
```
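To make the shift concrete, here is the same computation rewritten the GPU way in plain Python: a hypothetical `kernel` function describes what one thread does to one element, and an explicit loop stands in for the hardware's parallel launch (a sketch of the mental model, not real GPU code):

```python
import math

def kernel(thread_id, x, y, out):
    # One thread's entire job: read its element, compute, write its slot.
    out[thread_id] = math.exp(x[thread_id]) + math.log(y[thread_id])

x = [0.0, 1.0, 2.0]
y = [1.0, math.e, math.e]
out = [0.0] * len(x)

# On a GPU, this loop disappears: the hardware launches one
# instance of kernel() per element, all at once.
for tid in range(len(x)):
    kernel(tid, x, y, out)
```

Notice there is no accumulation and no ordering: any permutation of the "launches" produces the same result, which is exactly what makes it GPU-friendly.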
Thinking in Parallel
The hardest part of GPU programming isn't the syntax—it's the mental model. You must stop thinking about the computation and start thinking about how to divide the computation.
The Thought Experiment
When approaching a problem, ask: "What if 10,000 copies of me worked on this?"
- What would each copy need to know? (input data, index)
- What would each copy produce? (one output element? a partial result?)
- Would any copies need to talk to each other? (synchronization)
- Would any copies fight over the same resource? (race conditions)
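Applied to the simplest possible case, elementwise vector addition, all four questions have easy answers. This plain-Python sketch (the `vector_add_thread` name is ours, not any API) annotates each one:

```python
def vector_add_thread(tid, a, b, out):
    # Needs to know: its index (tid) and the inputs a[tid], b[tid].
    # Produces: exactly one output element, out[tid].
    # Talks to others: no — every element is independent.
    # Fights over resources: no — each thread writes a distinct slot.
    out[tid] = a[tid] + b[tid]

a, b = [1, 2, 3], [4, 5, 6]
out = [0] * 3
for tid in range(3):  # stands in for the parallel hardware launch
    vector_add_thread(tid, a, b, out)
print(out)  # [5, 7, 9]
```

When all four answers are this clean, the problem is embarrassingly parallel and the GPU version is nearly mechanical to write.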
Sequential Thinking
- "Process item 1, then item 2..."
- "Loop through the array"
- "Accumulate into a variable"
- Focus on the algorithm
Parallel Thinking
- "Each thread handles item[thread_id]"
- "All elements processed simultaneously"
- "Reduce partial results at the end"
- Focus on data flow
In GPU programming, you don't write "a loop that processes data." You write "what happens to one piece of data" and let the hardware run it everywhere. Think about the data, not the loop.
This mental shift takes practice. The good news: once it clicks, you'll see parallelization opportunities everywhere—even in code you thought was inherently sequential.
Hands-On Labs
Further Reading
Key Concepts
- Amdahl's Law — Wikipedia: Amdahl's Law — Original 1967 formulation and implications
- Gustafson's Law — Wikipedia: Gustafson's Law — The scaled speedup alternative
- SIMT Architecture — CUDA Programming Guide: SIMT — NVIDIA's official explanation of the GPU execution model
- Race Conditions — CUDA Programming Guide: Synchronization — How to avoid race conditions in GPU code