Concurrency & Parallelism
Concurrency vs Parallelism
Two words the Python community uses interchangeably and incorrectly, constantly.
Concurrency — multiple tasks are in progress at the same time. They take turns on the CPU. One plumber fixing two sinks: while the glue dries on sink A, they work on sink B. No moment where both are being worked on simultaneously.
Parallelism — multiple tasks are literally executing at the same time, on different cores. Two plumbers, two sinks, both working simultaneously.
ELI5: Concurrency is a juggler keeping 5 balls in the air — at any given microsecond, they’re only touching one ball. Parallelism is 5 jugglers each holding their own ball. The juggler is faster because they waste no time waiting; the 5-person team is faster because they have 5x the hands.
Why this matters for Python: in a standard CPython build, the GIL prevents threads in one process from executing Python bytecode in parallel. Threading in CPython gives you concurrency, not parallelism. This single fact drives almost every architectural decision in Python concurrency.
| | Concurrency | Parallelism |
|---|---|---|
| Definition | Tasks overlap in time | Tasks execute simultaneously |
| CPU at any instant | One task runs | Multiple tasks run |
| Python threading | Yes | No (GIL) |
| Python multiprocessing | Yes | Yes |
| asyncio | Yes | No |
| Useful for | I/O-bound | CPU-bound |
The Fundamental Decision: I/O-bound vs CPU-bound
This is the question you must answer before choosing any concurrency approach:
I/O-bound: Your code spends most time waiting — for network, disk, database. The CPU is idle. A web scraper making 1000 HTTP requests: 95% of time is waiting for responses.
CPU-bound: Your code does real computation. Image processing, ML inference, matrix math, password hashing. The CPU is pegged at 100%.
ELI5: I/O-bound is waiting at the DMV — you’re not doing anything, just waiting. Concurrency helps because while you wait, someone else can use the counter. CPU-bound is doing a 1000-piece puzzle — you’re working the whole time. Concurrency doesn’t help; you need more people (processes) to work faster.
If you answer this question wrong, you’ll make your code slower. Threading a CPU-bound workload adds overhead with zero benefit.
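A rough way to see the I/O-bound case, assuming time.sleep() as a stand-in for a network wait. The fake_io helper and the 0.1 s delay are made up for illustration; real numbers depend on your workload.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_io(delay):
    """Hypothetical stand-in for a network call: sleeping releases the GIL."""
    time.sleep(delay)
    return delay

def timed(fn):
    """Run fn() and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start

delays = [0.1] * 4

# One after another: total is roughly the sum of the waits.
sequential = timed(lambda: [fake_io(d) for d in delays])

# In threads: the waits overlap, so total is roughly the longest wait.
def run_threaded():
    with ThreadPoolExecutor(max_workers=4) as executor:
        list(executor.map(fake_io, delays))

threaded = timed(run_threaded)

print(f"sequential: {sequential:.2f}s, threaded: {threaded:.2f}s")
```

Swap fake_io for a pure-Python computation and the threaded version stops winning, which is the whole point of asking the question first.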
The Global Interpreter Lock (GIL)
The single most misunderstood thing in Python.
What it is
A mutex (mutual exclusion lock) protecting CPython’s internal state. At any moment, only one thread can execute Python bytecode. It’s not a language feature — it’s a CPython implementation detail.
Why it exists
CPython uses reference counting for memory management. Every Python object has an ob_refcnt field. When you do x = obj, that counter increments. When x goes out of scope, it decrements. Hits zero → object gets freed.
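Reference counting is not thread-safe. Without the GIL, the read-modify-write on a ref count can interleave: two threads could simultaneously decrement, both read 1, both subtract 1, both think they got 0, and both try to free the object. That’s a double free. A lost increment is just as bad: the count hits zero while a reference still exists, which is a use-after-free.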
The fix in 1992 was: one big lock around the interpreter. Fast to implement, still here 30 years later.
ELI5: Imagine a shared notebook where everyone tracks who borrowed what book. If two people try to erase and rewrite at the same time, the notebook gets corrupted. The GIL is a rule: only one person can touch the notebook at a time. It works, but it means no two people can ever update it simultaneously.
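You can watch the counter move with sys.getrefcount(), which is CPython-specific. This is only a sketch: the absolute numbers vary between versions, so the helper below (refcount_delta, a name made up here) looks at differences only.

```python
import sys

def refcount_delta():
    obj = object()
    before = sys.getrefcount(obj)
    holder = [obj]            # the list now holds one more reference to obj
    after = sys.getrefcount(obj)
    del holder                # dropping the list releases that reference again
    released = sys.getrefcount(obj)
    # Report changes relative to the starting count, not absolute values.
    return after - before, released - before

print(refcount_delta())
```

Every one of those increments and decrements is a plain, unsynchronized integer update, which is exactly what the GIL is protecting.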
What the GIL prevents
True parallel execution of Python bytecode across multiple threads. Period.
What the GIL does NOT prevent
This is where people get confused:
- Parallel I/O: When a thread blocks on socket.recv() or open(), it releases the GIL. Other threads can run Python code while one thread waits for the OS.
- C extensions that release the GIL: NumPy, pandas, and cryptography drop the GIL during computation. numpy.dot() on two large matrices runs in parallel with other Python threads.
- Subinterpreters (3.12+): Each has its own GIL.
GIL release cycle
When other threads are waiting, CPython asks the running thread to release the GIL after sys.getswitchinterval() seconds (default: 5ms). This lets other threads run. But “switching” has overhead: context switches, cache misses. For CPU-bound code, this overhead can make multithreaded CPython slower than single-threaded.
import sys
sys.getswitchinterval() # 0.005 (5ms)
sys.setswitchinterval(0.001) # switch more often (rarely useful)
PEP 703 — Free-threaded Python (3.13+)
Python 3.13 ships with an experimental no-GIL build. Run with python3.13t. The GIL is replaced with fine-grained locks on individual objects. Overhead is still real, but true thread parallelism is possible.
This is not production-ready for most workloads yet. Watch this space.
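To check which kind of interpreter you are actually running, something like the sketch below should work. It assumes CPython: the Py_GIL_DISABLED config variable marks a free-threaded build, and sys._is_gil_enabled() (3.13+, underscore-prefixed, so treat it as subject to change) reports whether the GIL is active right now. gil_status is a name invented for this example.

```python
import sys
import sysconfig

def gil_status():
    """Return (free_threaded_build, gil_enabled) for the running interpreter."""
    # Truthy only on builds compiled for free threading (e.g. python3.13t).
    free_threaded_build = bool(sysconfig.get_config_var("Py_GIL_DISABLED"))
    # Older versions lack this function; there the GIL is always on.
    check = getattr(sys, "_is_gil_enabled", None)
    gil_enabled = bool(check()) if check is not None else True
    return free_threaded_build, gil_enabled

print(gil_status())
```

Even on a free-threaded build the GIL can be re-enabled at runtime, which is why the two values are reported separately.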
Why this matters: If you’ve ever written a multithreaded Python app and wondered why it doesn’t use all your CPU cores, now you know. The GIL is the answer. The fix is either multiprocessing, asyncio (for I/O), or waiting for free-threaded Python to mature.
Threading
When to reach for it
I/O-bound tasks where you want concurrency without the overhead of multiple processes. Database queries, HTTP requests, file I/O, reading from queues.
Not for: CPU-bound work. Threading a CPU-bound task in CPython = paying context-switch overhead with zero parallelism benefit.
Core primitives
import threading

# Basic thread
def worker(name):
    print(f"Thread {name} starting")
    # ... do work ...

t = threading.Thread(target=worker, args=("A",), daemon=True)
t.start()
t.join()  # wait for completion
daemon=True means the thread is killed abruptly when the process exits, which happens once the main thread and every other non-daemon thread has finished. Non-daemon threads keep the process alive.
| Primitive | Purpose |
|---|---|
| Lock | Mutual exclusion: only one thread at a time |
| RLock | Reentrant lock: same thread can acquire multiple times |
| Semaphore | Allow N threads simultaneously (connection pool) |
| Event | Signal between threads (one sets, others wait) |
| Condition | Wait for a state change (producer-consumer) |
| Barrier | Wait until N threads all reach a point |
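As one example from the table, here is a sketch of Event used as a stop signal. The 10 ms polling interval and the 50 ms run time are arbitrary values picked for the demo.

```python
import threading
import time

stop = threading.Event()
ticks = []

def worker():
    # wait() returns False on timeout and True once the event has been set.
    while not stop.wait(timeout=0.01):
        ticks.append(time.monotonic())

t = threading.Thread(target=worker)
t.start()
time.sleep(0.05)   # let the worker tick a few times
stop.set()         # signal: time to finish
t.join()

print(f"worker ticked {len(ticks)} times, alive={t.is_alive()}")
```

This tends to be cleaner than a daemon thread when the worker holds something that needs an orderly shutdown.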
Race conditions and Locks
import threading

counter = 0
lock = threading.Lock()

def increment():
    global counter
    with lock:  # acquire on enter, release on exit (even if exception)
        counter += 1  # the read-modify-write is now protected by the lock
Without the lock: two threads read counter = 5, both add 1, both write 6. You lost an increment. This is a race condition.
ELI5: A lock is a bathroom key. The door says “occupied” and no one else can enter. When you leave, you hang the key back. Without a lock, two people could open the door at the same time and chaos ensues. The cost: everyone else waits.
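A runnable version of the same idea, as a sketch: four threads hammer one counter and the lock guarantees the total. The Counter class and hammer helper are illustrative names, not part of any library.

```python
import threading

class Counter:
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:       # only one thread at a time gets past here
            self.value += 1

def hammer(counter, n_threads=4, n_increments=10_000):
    def work():
        for _ in range(n_increments):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value

total = hammer(Counter())
print(total)
```

Remove the with self._lock line and the result is no longer guaranteed; whether you actually observe lost updates depends on the interpreter version and timing.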
Deadlocks
Four conditions (Coffman) must ALL be true for deadlock:
- Mutual exclusion — resources can’t be shared
- Hold and wait — thread holds A and waits for B
- No preemption — you can’t forcibly take a resource
- Circular wait — Thread 1 holds A wants B, Thread 2 holds B wants A
Break any one condition and deadlock is impossible. Most practical fix: always acquire locks in the same order across all threads.
import threading

lock_a = threading.Lock()
lock_b = threading.Lock()

# Deadlock-prone
def transfer_ab():
    with lock_a:
        with lock_b: ...

def transfer_ba():
    with lock_b:  # opposite order!
        with lock_a: ...

# Safe: always acquire in one global order (here, sorted by id())
def transfer_any(from_lock, to_lock):
    first, second = sorted([from_lock, to_lock], key=id)
    with first:
        with second: ...
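To see lock ordering hold up under opposing traffic, here is a small sketch with two hypothetical Account objects transferring money in both directions at once. It assumes src and dst are always different accounts.

```python
import threading

class Account:
    def __init__(self, balance):
        self.balance = balance
        self.lock = threading.Lock()

def transfer(src, dst, amount):
    # Lock both accounts in one global order (by id), whatever the direction.
    first, second = sorted([src, dst], key=id)
    with first.lock:
        with second.lock:
            src.balance -= amount
            dst.balance += amount

a, b = Account(100), Account(100)

def many(src, dst):
    for _ in range(1_000):
        transfer(src, dst, 1)

threads = [
    threading.Thread(target=many, args=(a, b)),
    threading.Thread(target=many, args=(b, a)),  # opposite direction
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(a.balance, b.balance)
```

Both threads finish and the balances are conserved, because neither thread can ever hold the "second" lock while waiting for the "first".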
Thread-local storage
threading.local() gives each thread its own copy of a variable. Django relies on thread-local storage so that each thread gets its own database connection.
local = threading.local()

def worker():
    local.connection = create_db_connection()  # each thread has its own
    # ... use local.connection ...
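A self-contained sketch of the isolation, using plain strings instead of real connections. The barrier is there only to make sure every thread has written its value before any thread reads.

```python
import threading

local = threading.local()
seen = {}
seen_lock = threading.Lock()
barrier = threading.Barrier(3)

def worker(name):
    local.value = name     # visible only to the thread that set it
    barrier.wait()         # all three threads have written by now
    with seen_lock:
        seen[name] = local.value

threads = [threading.Thread(target=worker, args=(n,)) for n in ("A", "B", "C")]
for t in threads:
    t.start()
for t in threads:
    t.join()

# The main thread never set local.value, so it has no such attribute.
main_has_value = hasattr(local, "value")
print(seen, main_has_value)
```

Each thread reads back exactly what it wrote, even though all of them used the same local object.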
The modern way: ThreadPoolExecutor
from concurrent.futures import ThreadPoolExecutor, as_completed

urls = ["https://...", "https://...", ...]

with ThreadPoolExecutor(max_workers=10) as executor:
    futures = {executor.submit(fetch, url): url for url in urls}
    for future in as_completed(futures):
        url = futures[future]
        result = future.result()  # raises if the task raised
Prefer this over managing threads manually. It handles thread lifecycle, error propagation, and result collection.
Common mistake: Setting max_workers too high. For HTTP requests, 10-50 is usually the sweet spot. More workers = more overhead + server rate limiting. For database queries, match your connection pool size.
Multiprocessing
When to reach for it
CPU-bound tasks that need real parallelism. Image processing, data transformation, ML inference, compression, anything where you’d see 100% CPU on one core.
Each process has its own Python interpreter, its own GIL, its own memory space. True parallelism.
ELI5: Threading is one chef switching rapidly between 5 dishes. Multiprocessing is 5 chefs each with their own kitchen. The 5-chef solution actually cooks 5x faster, but you need 5x the kitchen equipment and can’t easily share ingredients between kitchens.
Core API
from multiprocessing import Pool

def cpu_heavy(n):
    return sum(i * i for i in range(n))

if __name__ == "__main__":  # needed under spawn: workers re-import this module
    with Pool(processes=4) as pool:  # 4 worker processes
        results = pool.map(cpu_heavy, [10**6, 10**6, 10**6, 10**6])
| Primitive | Purpose |
|---|---|
| Pool | Fixed worker pool with map/apply |
| Process | Single explicit process |
| Queue | Thread/process-safe FIFO |
| Pipe | Two-way channel between processes |
| Value/Array | Shared typed memory |
| shared_memory (3.8+) | Raw shared memory block |
The pickle bottleneck
Inter-process communication serializes everything with pickle. This is the performance ceiling. If you’re sending large numpy arrays back and forth, the serialization cost can exceed the computation benefit.
What can’t be pickled: lambda functions, local functions, database connections, file handles, many C extension objects.
# This raises a pickling error
pool.map(lambda x: x*2, data)  # lambda can't be pickled

# Fix: use a module-level function
def double(x): return x * 2
pool.map(double, data)
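You can check picklability up front without starting any processes. A minimal sketch, with is_picklable as a made-up helper; the exact exception type differs between Python versions, so it catches the usual three.

```python
import pickle

def double(x):
    return x * 2

def is_picklable(obj):
    """Return True if pickle can serialize obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False

print(is_picklable(double))            # module-level function: pickled by name
print(is_picklable(lambda x: x * 2))   # lambda: has no importable name
print(is_picklable(len))               # builtins pickle by name too
```

Functions are pickled as a reference (module plus name), not as code, which is why anything the worker can’t import by name fails.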
Start methods: fork vs spawn vs forkserver
| Method | How | Speed | Safety | Default on |
|---|---|---|---|---|
| fork | Copy entire parent process | Fast | Unsafe with threads | Linux (through 3.13) |
| spawn | Fresh interpreter, import everything | Slow startup | Safe | macOS (3.8+), Windows |
| forkserver | Dedicated server forks on demand | Medium | Safe | Linux (3.14+) |
macOS gotcha: macOS changed its default from fork to spawn in Python 3.8. Code that worked fine on Linux may break on macOS. fork + threads is unsafe because the child inherits locks that were held by threads which don’t exist in the child, so it can deadlock.
import multiprocessing

if __name__ == "__main__":
    multiprocessing.set_start_method('spawn')  # explicit, portable; call once at startup
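If you’d rather not change a process-wide setting (for example inside a library), a context object is one alternative. This sketch only inspects the context and doesn’t start any workers.

```python
import multiprocessing as mp

# Start methods supported on this platform; "spawn" is available everywhere.
available = mp.get_all_start_methods()

# A context exposes the same API as the multiprocessing module,
# but pinned to one start method.
ctx = mp.get_context("spawn")

print(available, ctx.get_start_method())
# ctx.Pool(...) or ctx.Process(...) would use spawn regardless of the global default.
```

The design choice here is scope: set_start_method() affects everything in the process, while a context affects only the objects created from it.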
Shared memory (3.8+)
For large numpy arrays, skip pickle entirely:
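from multiprocessing import shared_memory
import numpy as np

shape, dtype = (1000, 1000), np.float64
nbytes = int(np.prod(shape)) * np.dtype(dtype).itemsize  # 8 MB for this array
shm = shared_memory.SharedMemory(create=True, size=nbytes)
arr = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
# Pass shm.name to worker processes; they attach without copying
# When finished: shm.close() in every process, then shm.unlink() once in the creator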
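The attach-by-name step can be sketched with the standard library alone. For brevity both handles live in one process here; in real use the second handle would be opened in a worker that received the name. roundtrip is a hypothetical helper.

```python
from multiprocessing import shared_memory

def roundtrip(payload=b"hello"):
    creator = shared_memory.SharedMemory(create=True, size=len(payload))
    try:
        creator.buf[:len(payload)] = payload
        # A worker would receive creator.name and attach like this:
        attached = shared_memory.SharedMemory(name=creator.name)
        try:
            # Copy the bytes out before closing the handle.
            return bytes(attached.buf[:len(payload)])
        finally:
            attached.close()
    finally:
        creator.close()
        creator.unlink()   # free the block: once, in the creating process

print(roundtrip())
```

Only the short name string crosses the process boundary; the data itself is never pickled.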
The modern way: ProcessPoolExecutor
from concurrent.futures import ProcessPoolExecutor

with ProcessPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(cpu_heavy, data))
Same API as ThreadPoolExecutor. Easy to swap between the two.
Common mistake: Using multiprocessing for I/O-bound tasks. Each process has startup overhead (~50-100ms) and higher memory usage. For I/O-bound work with many short tasks, threading or asyncio is almost always better.
asyncio In Depth
The mental model
asyncio is single-threaded cooperative multitasking. One thread, one event loop, multiple coroutines that voluntarily yield control when they’re waiting.
The event loop is a dispatcher: it holds a queue of coroutines ready to run, picks one, runs it until it hits an await, suspends it, picks the next ready coroutine, runs it, and so on.
Event Loop
│
├── coroutine A runs → hits await → suspended
├── coroutine B runs → hits await → suspended
├── coroutine C runs → completes → removed
└── ... I/O event fires → coroutine A marked ready → runs again
ELI5: asyncio is one chef running a restaurant kitchen alone. When they put something in the oven (await), they don’t stand there watching it — they go work on something else. When the oven timer goes off (I/O event), they come back to finish that dish. No waiting around. One person, maximum efficiency through smart scheduling.
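The hand-off at each await is easy to observe. In this sketch, asyncio.sleep() stands in for real I/O and the delays are arbitrary.

```python
import asyncio

async def worker(name, delay, log):
    log.append(f"{name} start")
    await asyncio.sleep(delay)     # suspends here; the loop runs someone else
    log.append(f"{name} end")

async def main():
    log = []
    # A is scheduled first but waits longer than B.
    await asyncio.gather(worker("A", 0.05, log), worker("B", 0.01, log))
    return log

log = asyncio.run(main())
print(log)
```

B starts while A is still "in the oven" and finishes first, all on a single thread.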
Coroutines: the mechanics
async def fetch_data(url):  # coroutine function
    ...

coro = fetch_data("https://...")  # coroutine object; nothing runs yet
result = await coro  # run it, suspending this coroutine until done
async def does NOT run anything. It creates a coroutine function. Calling it returns a coroutine object. Nothing executes until you await it or schedule it on the event loop.
Common mistake: Forgetting await. result = fetch_data(url) gives you a coroutine object, not the result. CPython emits a RuntimeWarning (“coroutine ... was never awaited”) when that object is garbage-collected, but it is easy to miss in logs.
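A tiny sketch of the difference between calling and running. The URL is a placeholder and fetch_data fakes the work with asyncio.sleep(0).

```python
import asyncio
import inspect

async def fetch_data(url):
    await asyncio.sleep(0)             # pretend to do I/O
    return f"data from {url}"

coro = fetch_data("https://example.invalid")   # nothing has executed yet
is_coro = inspect.iscoroutine(coro)

result = asyncio.run(coro)             # now the body actually runs
print(is_coro, result)
```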
Task management
import asyncio

# gather: run concurrently, wait for ALL, return results in order
results = await asyncio.gather(fetch(url1), fetch(url2), fetch(url3))

# create_task: schedule immediately, get a handle
task = asyncio.create_task(fetch(url))
# ... do other stuff ...
result = await task

# TaskGroup (3.11+): structured concurrency
async with asyncio.TaskGroup() as tg:
    t1 = tg.create_task(fetch(url1))
    t2 = tg.create_task(fetch(url2))
# All tasks done here. If ANY raises, all others are cancelled.
| API | Behavior on exception |
|---|---|
| gather() | By default propagates first exception; others keep running |
| gather(return_exceptions=True) | Returns exceptions as values |
| TaskGroup | Cancels all siblings, re-raises after all cleaned up |
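The return_exceptions=True row is the easiest to demonstrate. In this sketch, ok and boom are hypothetical coroutines.

```python
import asyncio

async def ok(value):
    await asyncio.sleep(0.01)
    return value

async def boom():
    await asyncio.sleep(0.01)
    raise ValueError("boom")

async def main():
    # Exceptions come back as values, in the same order as the arguments.
    return await asyncio.gather(ok(1), boom(), ok(3), return_exceptions=True)

results = asyncio.run(main())
print(results)
```

You then have to check each result with isinstance(r, Exception) yourself, which is the price of not losing the successful results.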
Why TaskGroup is better than bare gather
asyncio.gather() has subtle cancellation behavior. If one task fails and you have cleanup to do in others, it’s easy to leave coroutines dangling.
TaskGroup implements structured concurrency: tasks can’t outlive the scope they’re created in. When the async with block exits, all tasks are done (or cancelled). This prevents coroutine leaks.
ELI5: gather() is like sending 5 employees on errands and telling them to meet back whenever: if one gets in a car crash, the others don’t know and keep shopping. TaskGroup is like a field trip with a chaperone: if one student goes missing, the chaperone halts everything and accounts for everyone before leaving.
Cancellation
task = asyncio.create_task(long_operation())
await asyncio.sleep(5)
task.cancel()  # sends CancelledError into the coroutine

try:
    await task
except asyncio.CancelledError:
    pass  # expected
Inside the coroutine, CancelledError is raised at the next await. You can catch it to clean up, but you must re-raise it:
async def long_operation():
    try:
        await some_io()
    except asyncio.CancelledError:
        await cleanup()
        raise  # MUST re-raise: swallowing CancelledError breaks everything
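Put together as something you can run, with asyncio.sleep() standing in for some_io() and a log entry standing in for cleanup():

```python
import asyncio

async def long_operation(log):
    try:
        await asyncio.sleep(10)        # stand-in for real I/O
    except asyncio.CancelledError:
        log.append("cleanup")          # stand-in for real cleanup work
        raise                          # let the cancellation propagate

async def main():
    log = []
    task = asyncio.create_task(long_operation(log))
    await asyncio.sleep(0.01)          # give the task a chance to start
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        log.append("cancelled")
    return log, task.cancelled()

log, was_cancelled = asyncio.run(main())
print(log, was_cancelled)
```

Because the coroutine re-raises, task.cancelled() reports True; if it swallowed the error, the task would look like it finished normally.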
Running blocking code in async context
def read_file(path):
    with open(path) as f:
        return f.read()

# WRONG: blocks the entire event loop
async def bad():
    data = open("large_file.txt").read()  # blocks! all other coroutines freeze

# RIGHT: run the whole blocking call (open + read) in a thread pool
async def good():
    data = await asyncio.to_thread(read_file, "large_file.txt")

# Or explicitly, from inside a coroutine
async def also_good():
    loop = asyncio.get_running_loop()
    data = await loop.run_in_executor(None, blocking_function, arg1, arg2)
Common mistake: Calling time.sleep() inside an async function. This is the most common asyncio bug. Use await asyncio.sleep() instead. The blocking version freezes the entire event loop: all your “concurrent” requests stop while one sleeps.
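One way to convince yourself the loop stays responsive: run a blocking call through asyncio.to_thread() (3.9+) while a ticker coroutine keeps counting. blocking_io and the timings are invented for this sketch.

```python
import asyncio
import time

def blocking_io():
    time.sleep(0.2)                    # a genuinely blocking call
    return "done"

async def ticker(stop, ticks):
    while not stop.is_set():
        ticks.append(1)
        await asyncio.sleep(0.02)

async def main():
    stop = asyncio.Event()
    ticks = []
    tick_task = asyncio.create_task(ticker(stop, ticks))
    # The blocking call runs in a worker thread; the loop keeps ticking.
    result = await asyncio.to_thread(blocking_io)
    stop.set()
    await tick_task
    return result, len(ticks)

result, n_ticks = asyncio.run(main())
print(result, n_ticks)
```

Call blocking_io() directly inside main() instead and the ticker gets almost no turns until the sleep is over.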
concurrent.futures — The Unified API
Why start here
Before reaching for raw threading or multiprocessing, use concurrent.futures. Same API for both, easy to swap, handles lifecycle and error propagation.
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor, as_completed

# I/O-bound: threads
with ThreadPoolExecutor(max_workers=10) as executor:
    future = executor.submit(fetch, url)
    result = future.result()  # blocks until done, re-raises exceptions

# CPU-bound: processes (same API!)
with ProcessPoolExecutor(max_workers=4) as executor:
    future = executor.submit(compress, data)
    result = future.result()
A Future from executor.submit() has .done(), .running(), .cancelled(), .result(timeout=N) (blocks, re-raises), .exception(), .cancel(), and .add_done_callback(). That’s the full public API: lean but complete.
map vs as_completed
# map: simple, preserves order, blocks
results = list(executor.map(fn, items))  # blocks until ALL done

# as_completed: process results as they arrive (better for long-tail latency)
futures = {executor.submit(fn, item): item for item in items}
for future in as_completed(futures):
    item = futures[future]
    try:
        result = future.result()
    except Exception as e:
        print(f"{item} failed: {e}")
as_completed is better when tasks have variable duration. First result in 100ms doesn’t have to wait for the slow one that takes 5s.
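The ordering difference shows up clearly with tasks of different lengths. In this sketch, slow_echo is a hypothetical task that just sleeps and returns its delay.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def slow_echo(delay):
    time.sleep(delay)
    return delay

delays = [0.3, 0.1, 0.2]

with ThreadPoolExecutor(max_workers=3) as executor:
    # map yields results in submission order, whatever finishes first.
    in_submit_order = list(executor.map(slow_echo, delays))

    # as_completed yields each future as soon as it is done.
    futures = [executor.submit(slow_echo, d) for d in delays]
    in_finish_order = [f.result() for f in as_completed(futures)]

print(in_submit_order, in_finish_order)
```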
Synchronization Primitives
When you need them
Whenever two tasks share mutable state. If your concurrency model involves shared data, you need synchronization.
ELI5: Shared mutable state without synchronization is two people editing the same Google Doc but offline — when they sync, one person’s changes get overwritten. Locks are like Google’s “one editor at a time” cursor.
| Primitive | Threading | asyncio | Use case |
|---|---|---|---|
| Lock | threading.Lock | asyncio.Lock | Exclusive access to resource |
| Semaphore | threading.Semaphore | asyncio.Semaphore | Limit concurrent access (connection pool) |
| Event | threading.Event | asyncio.Event | Signal: “this thing happened” |
| Condition | threading.Condition | asyncio.Condition | Wait for state change |
| Queue | queue.Queue | asyncio.Queue | Safe producer-consumer |
Semaphore for rate limiting
# Limit to 5 concurrent HTTP requests
sem = asyncio.Semaphore(5)

async def fetch(url):
    async with sem:
        return await http.get(url)

await asyncio.gather(*[fetch(url) for url in urls])  # only 5 at a time
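To verify that the limit actually holds, you can track how many tasks are inside the semaphore at once. A sketch with asyncio.sleep() in place of the HTTP call; the active and peak counters exist only for the demonstration.

```python
import asyncio

async def main(limit=5, n_tasks=20):
    sem = asyncio.Semaphore(limit)
    active = 0
    peak = 0

    async def fetch(i):
        nonlocal active, peak
        async with sem:
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0.01)   # stand-in for the real request
            active -= 1
            return i

    results = await asyncio.gather(*(fetch(i) for i in range(n_tasks)))
    return results, peak

results, peak = asyncio.run(main())
print(len(results), peak)
```

All 20 tasks are created up front, but no more than 5 are ever past the async with line at the same time.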
Common mistake: Using a plain list as a queue between threads. list.append() and list.pop() are not atomic in all Python implementations. Use queue.Queue(maxsize=N): it blocks put() when full and get() when empty, and is safe across threads by design.
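A minimal producer-consumer sketch with queue.Queue. The None sentinel used to signal "no more items" is a common convention, not something the queue module requires.

```python
import queue
import threading

q = queue.Queue(maxsize=2)   # small on purpose, so the producer has to wait
SENTINEL = None
consumed = []

def producer():
    for i in range(5):
        q.put(i)             # blocks while the queue is full
    q.put(SENTINEL)          # tell the consumer we are finished

def consumer():
    while True:
        item = q.get()       # blocks while the queue is empty
        if item is SENTINEL:
            break
        consumed.append(item)

threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(consumed)
```

The bounded size gives you backpressure for free: a fast producer can’t run arbitrarily far ahead of a slow consumer.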
Decision Framework
The big decision table
| Task type | Volume | Recommended | Why |
|---|---|---|---|
| I/O-bound | Moderate (10-100 tasks) | ThreadPoolExecutor | Simple, low overhead |
| I/O-bound | High (1000+ tasks) | asyncio | Scales to thousands; threads are expensive |
| CPU-bound | Any | ProcessPoolExecutor | Bypasses GIL, real parallelism |
| Mixed I/O + CPU | Separate phases | asyncio for I/O + ProcessPool for CPU | Each tool for its job |
| Fire-and-forget background tasks | Low | threading.Thread(daemon=True) | Simplest possible |
| Batch data processing | Large datasets | multiprocessing.Pool | Full process control |
Ask in order: (1) CPU-bound? → ProcessPoolExecutor. (2) I/O-bound, how many tasks? Dozens → ThreadPoolExecutor. Hundreds+ → asyncio. (3) Can you go async all the way? If not, threading avoids the pain of mixing sync/async code where one blocking call freezes the event loop.
Summary: the tl;dr
| Scenario | Use |
|---|---|
| HTTP client making many requests | asyncio + aiohttp |
| Scraping 50 URLs | ThreadPoolExecutor(10) |
| Processing 1M CSV rows | ProcessPoolExecutor(os.cpu_count()) |
| Background DB writes from web server | ThreadPoolExecutor |
| Parallel ML batch inference | multiprocessing.Pool |
| Real-time websocket server | asyncio |
| Simple parallel map | concurrent.futures.ProcessPoolExecutor.map() |