Concurrency & Parallelism

Table of Contents

Concurrency vs Parallelism
- The Fundamental Decision: I/O-bound vs CPU-bound
The Global Interpreter Lock (GIL)
Threading
Multiprocessing
asyncio In Depth
concurrent.futures — The Unified API
- Why start here
- map vs as_completed
Synchronization Primitives
- When you need them
- Semaphore for rate limiting
Decision Framework
- The big decision table
- Summary: the tl;dr

Concurrency vs Parallelism

Two words the Python community uses interchangeably and incorrectly, constantly.

Concurrency — multiple tasks are in progress at the same time. They take turns on the CPU. One plumber fixing two sinks: while the glue dries on sink A, they work on sink B. No moment where both are being worked on simultaneously.

Parallelism — multiple tasks are literally executing at the same time, on different cores. Two plumbers, two sinks, both working simultaneously.

ELI5: Concurrency is a juggler keeping 5 balls in the air — at any given microsecond, they’re only touching one ball. Parallelism is 5 jugglers each holding their own ball. The juggler is faster because they waste no time waiting; the 5-person team is faster because they have 5x the hands.

Why this matters for Python: in a standard CPython build, the GIL prevents threads in one process from executing Python bytecode in parallel. Threading in CPython therefore gives you concurrency, not parallelism, for pure-Python code. This single fact drives almost every architectural decision in Python concurrency.

	Concurrency	Parallelism
Definition	Tasks overlap in time	Tasks execute simultaneously
CPU at any instant	One task runs	Multiple tasks run
Python threading	Yes	No (GIL)
Python multiprocessing	Yes	Yes
asyncio	Yes	No
Useful for	I/O-bound	CPU-bound

The Fundamental Decision: I/O-bound vs CPU-bound

This is the question you must answer before choosing any concurrency approach:

I/O-bound: Your code spends most time waiting — for network, disk, database. The CPU is idle. A web scraper making 1000 HTTP requests: 95% of time is waiting for responses.

CPU-bound: Your code does real computation. Image processing, ML inference, matrix math, password hashing. The CPU is pegged at 100%.

ELI5: I/O-bound is waiting at the DMV — you’re not doing anything, just waiting. Concurrency helps because while you wait, someone else can use the counter. CPU-bound is doing a 1000-piece puzzle — you’re working the whole time. Concurrency doesn’t help; you need more people (processes) to work faster.

If you answer this question wrong, you’ll make your code slower. Threading a CPU-bound workload adds overhead with zero benefit.
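A minimal sketch of the I/O-bound case, assuming time.sleep() is a fair stand-in for a network call (it releases the GIL while waiting, as blocking I/O does). The helper names fake_io, run_sequential, and run_threaded are made up for this illustration. Four 0.2-second waits run back to back take roughly 0.8 seconds; run on four threads they overlap.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_io(delay: float) -> float:
    """Hypothetical stand-in for a network call: the thread just waits."""
    time.sleep(delay)  # releases the GIL while sleeping
    return delay


def run_sequential(delays):
    """Run every wait one after another and return the elapsed seconds."""
    start = time.perf_counter()
    for d in delays:
        fake_io(d)
    return time.perf_counter() - start


def run_threaded(delays):
    """Run every wait on its own thread and return the elapsed seconds."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=len(delays)) as pool:
        list(pool.map(fake_io, delays))
    return time.perf_counter() - start


if __name__ == "__main__":
    delays = [0.2] * 4
    print(f"sequential: {run_sequential(delays):.2f}s")
    print(f"threaded:   {run_threaded(delays):.2f}s")
```

Swapping fake_io for a pure-Python computation would be expected to remove the speedup, since the threads would then compete for the GIL instead of waiting.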

The Global Interpreter Lock (GIL)

The single most misunderstood thing in Python.

What it is

A mutex (mutual exclusion lock) protecting CPython’s internal state. At any moment, only one thread can execute Python bytecode. It’s not a language feature — it’s a CPython implementation detail.

Why it exists

CPython uses reference counting for memory management. Every Python object has an ob_refcnt field. When you do x = obj, that counter increments. When x goes out of scope, it decrements. When it hits zero, the object is freed.

Reference counting is not thread-safe on its own. Without the GIL, two threads could decrement the same count at once and both conclude it reached zero, so both try to free the object (a double free). Or two increments could collide and lose one, so the object is later freed while something still points at it (a use-after-free).

The fix in 1992 was: one big lock around the interpreter. Fast to implement, still here 30 years later.

ELI5: Imagine a shared notebook where everyone tracks who borrowed what book. If two people try to erase and rewrite at the same time, the notebook gets corrupted. The GIL is a rule: only one person can touch the notebook at a time. It works, but it means no two people can ever update it simultaneously.
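The reference count described above can be observed from Python with sys.getrefcount(). This is a small sketch, not a view into how the GIL itself works; the function name refcount_delta is made up for the example. Absolute counts vary between CPython versions, so the sketch only looks at the difference before and after binding a second name.

```python
import sys


def refcount_delta() -> int:
    """Return how much the refcount changes when a second name is bound."""
    obj = []                       # a fresh list object
    before = sys.getrefcount(obj)  # includes a temporary reference for the call itself
    alias = obj                    # binding another name adds one reference
    after = sys.getrefcount(obj)
    del alias                      # dropping the name removes that reference again
    return after - before


if __name__ == "__main__":
    print(refcount_delta())
```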

What the GIL prevents

True parallel execution of Python bytecode across multiple threads. Period.

What the GIL does NOT prevent

This is where people get confused:

Parallel I/O: When a thread blocks on a socket.recv() or open(), it releases the GIL. Other threads can run Python code while one thread waits for the OS.
C extensions that release the GIL: NumPy, pandas, cryptography — they drop the GIL during computation. numpy.dot() on two large matrices runs in parallel with other Python threads.
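Subinterpreters (3.12+): each interpreter can have its own GIL (PEP 684). In 3.12 and 3.13 this is reachable only through the C API; Python 3.14 added the concurrent.interpreters module.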

GIL release cycle

When another thread is waiting, CPython asks the running thread to release the GIL after sys.getswitchinterval() seconds (default: 5ms). This lets other threads run. But switching has overhead: context switches, cache misses, lock contention. For CPU-bound code, this overhead can make multithreaded CPython slower than single-threaded.

import sys
sys.getswitchinterval()  # 0.005 (5ms)
sys.setswitchinterval(0.001)  # switch more often (rarely useful)

PEP 703 — Free-threaded Python (3.13+)

Python 3.13 ships with an experimental no-GIL build. Run with python3.13t. The GIL is replaced with finer-grained mechanisms such as per-object locks and thread-safe reference counting. Overhead is still real, but true thread parallelism is possible.

This is not production-ready for most workloads yet. Watch this space.

Why this matters: If you’ve ever written a multithreaded Python app and wondered why it doesn’t use all your CPU cores, now you know. The GIL is the answer. The fix is either multiprocessing, asyncio (for I/O), or waiting for free-threaded Python to mature.

Threading

When to reach for it

I/O-bound tasks where you want concurrency without the overhead of multiple processes. Database queries, HTTP requests, file I/O, reading from queues.

Not for: CPU-bound work. Threading a CPU-bound task in CPython = paying context-switch overhead with zero parallelism benefit.

Core primitives

import threading

# Basic thread
def worker(name):
    print(f"Thread {name} starting")
    # ... do work ...

t = threading.Thread(target=worker, args=("A",), daemon=True)
t.start()
t.join()  # wait for completion

daemon=True means the thread is killed abruptly when the process exits, without running any cleanup. Non-daemon threads keep the process alive until they finish.

Primitive	Purpose
`Lock`	Mutual exclusion — only one thread at a time
`RLock`	Reentrant lock — same thread can acquire multiple times
`Semaphore`	Allow N threads simultaneously (connection pool)
`Event`	Signal between threads (one sets, others wait)
`Condition`	Wait for a state change (producer-consumer)
`Barrier`	Wait until N threads all reach a point

Race conditions and Locks

import threading

counter = 0
lock = threading.Lock()

def increment():
    global counter
    with lock:  # acquire on enter, release on exit (even if exception)
        counter += 1  # the read-modify-write is now protected by the lock

Without the lock: two threads read counter = 5, both add 1, both write 6. You lost an increment. This is a race condition.

ELI5: A lock is a bathroom key. The door says “occupied” and no one else can enter. When you leave, you hang the key back. Without a lock, two people could open the door at the same time and chaos ensues. The cost: everyone else waits.
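A runnable sketch of the locked counter, with several threads hammering the same variable. The function name count_with_lock and its parameters are made up for this illustration. Because every increment happens under the lock, the final value is exactly threads × increments.

```python
import threading


def count_with_lock(num_threads: int = 4, increments: int = 10_000) -> int:
    """Increment a shared counter from several threads, guarded by one lock."""
    counter = 0
    lock = threading.Lock()

    def worker():
        nonlocal counter
        for _ in range(increments):
            with lock:        # the read-modify-write happens as one unit
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


if __name__ == "__main__":
    print(count_with_lock())
```

Removing the with lock: line may or may not lose increments on a given run; races are timing-dependent, which is what makes them hard to catch in testing.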

Deadlocks

Four conditions (Coffman) must ALL be true for deadlock:

Mutual exclusion — resources can’t be shared
Hold and wait — thread holds A and waits for B
No preemption — you can’t forcibly take a resource
Circular wait — Thread 1 holds A wants B, Thread 2 holds B wants A

Break any one condition and deadlock is impossible. Most practical fix: always acquire locks in the same order across all threads.

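import threading

lock_a = threading.Lock()
lock_b = threading.Lock()

# Deadlock-prone: the two functions take the same locks in opposite order
def transfer_ab():
    with lock_a:
        with lock_b: ...

def transfer_ba():
    with lock_b:       # opposite order!
        with lock_a: ...

# Safe: every caller acquires in one global order (here, by id())
# Assumes from_lock and to_lock are two distinct locks
def transfer_any(from_lock, to_lock):
    first, second = sorted([from_lock, to_lock], key=id)
    with first:
        with second: ...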

Thread-local storage

threading.local() gives each thread its own copy of a variable. Django, for example, keeps its database connections per thread this way.

import threading

local = threading.local()

def worker():
    # create_db_connection is a placeholder for your own connection factory
    local.connection = create_db_connection()  # each thread has its own
    # ... use local.connection ...
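A self-contained sketch of the same idea, with plain strings instead of database connections. The function name collect_thread_values is made up for the example. Each thread stores a value on the shared local object and reads back only its own; the main thread, which never set the attribute, does not see it at all.

```python
import threading

local = threading.local()


def collect_thread_values(names):
    """Have each thread store its own name in thread-local storage and read it back."""
    seen = {}

    def worker(name):
        local.value = name        # visible only inside this thread
        seen[name] = local.value

    threads = [threading.Thread(target=worker, args=(n,)) for n in names]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # The calling thread never assigned local.value, so the attribute is absent here.
    return seen, hasattr(local, "value")


if __name__ == "__main__":
    print(collect_thread_values(["a", "b", "c"]))
```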

The modern way: ThreadPoolExecutor

from concurrent.futures import ThreadPoolExecutor, as_completed

urls = ["https://...", "https://...", ...]

with ThreadPoolExecutor(max_workers=10) as executor:
    futures = {executor.submit(fetch, url): url for url in urls}
    for future in as_completed(futures):
        url = futures[future]
        result = future.result()  # raises if the task raised

Prefer this over managing threads manually. It handles thread lifecycle, error propagation, and result collection.

Common mistake: Setting max_workers too high. For HTTP requests, 10-50 is usually the sweet spot. More workers means more overhead and a higher chance of hitting server rate limits. For database queries, match your connection pool size.

Multiprocessing

When to reach for it

CPU-bound tasks that need real parallelism. Image processing, data transformation, ML inference, compression, anything where you’d see 100% CPU on one core.

Each process has its own Python interpreter, its own GIL, its own memory space. True parallelism.

ELI5: Threading is one chef switching rapidly between 5 dishes. Multiprocessing is 5 chefs each with their own kitchen. The 5-chef solution actually cooks 5x faster, but you need 5x the kitchen equipment and can’t easily share ingredients between kitchens.

Core API

from multiprocessing import Pool

def cpu_heavy(n):
    return sum(i * i for i in range(n))

if __name__ == "__main__":  # required with spawn/forkserver: workers re-import this module
    with Pool(processes=4) as pool:  # 4 worker processes
        results = pool.map(cpu_heavy, [10**6, 10**6, 10**6, 10**6])

Primitive	Purpose
`Pool`	Fixed worker pool with map/apply
`Process`	Single explicit process
`Queue`	Thread/process-safe FIFO
`Pipe`	Two-way channel between processes
`Value`/`Array`	Shared typed memory
`shared_memory` (3.8+)	Raw shared memory block

The pickle bottleneck

Inter-process communication serializes everything with pickle. This is the performance ceiling. If you’re sending large numpy arrays back and forth, the serialization cost can exceed the computation benefit.

What can’t be pickled: lambda functions, local functions, database connections, file handles, many C extension objects.

# This raises a pickling error when the task is sent to a worker
pool.map(lambda x: x*2, data)  # lambda can't be pickled

# Fix: use a module-level function
def double(x): return x * 2
pool.map(double, data)
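One way to catch pickling problems before they surface inside a pool is to try pickle.dumps() up front. This is a rough sketch; the helper name is_picklable is made up for the example, and catching every Exception is a deliberate simplification because different objects fail with different exception types.

```python
import pickle


def is_picklable(obj) -> bool:
    """Return True if pickle can serialize obj, False if it raises."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:  # PicklingError, TypeError, AttributeError, ... depending on the object
        return False


if __name__ == "__main__":
    print(is_picklable([1, 2, 3]))        # plain data
    print(is_picklable(lambda x: x * 2))  # a lambda has no importable name
```

A check like this fits well in a unit test for anything you intend to pass to pool.map() or executor.submit().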

Start methods: fork vs spawn vs forkserver

Method	How	Speed	Safety	Default on
`fork`	Copy entire parent process	Fast	Unsafe with threads	Linux (through 3.13)
`spawn`	Fresh interpreter, import everything	Slow startup	Safe	macOS (3.8+), Windows
`forkserver`	Dedicated server forks on demand	Medium	Safe	Linux (3.14+)

macOS gotcha: the macOS default changed from fork to spawn in Python 3.8. Code that worked fine on Linux may break on macOS, typically because it relied on workers inheriting module-level state from the parent. fork plus threads is hazardous: the child gets a copy of every lock, including ones held by threads that do not exist in the child, so it can deadlock the first time it touches one.

import multiprocessing

if __name__ == "__main__":
    # explicit, portable; call once, before creating any pools or processes
    multiprocessing.set_start_method('spawn')

Shared memory (3.8+)

For large numpy arrays, skip pickle entirely:

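from multiprocessing import shared_memory
import numpy as np

shape, dtype = (1000, 1000), np.float64
nbytes = int(np.prod(shape)) * np.dtype(dtype).itemsize  # 8 MB for this array
shm = shared_memory.SharedMemory(create=True, size=nbytes)
arr = np.ndarray(shape, dtype=dtype, buffer=shm.buf)
# Pass shm.name to worker processes: they attach without copying
# When finished: shm.close() in every process, then shm.unlink() once in the owner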
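The attach-by-name step can be sketched with the standard library alone. For simplicity the second handle is opened in the same process here; a real worker would receive only the block's name and attach the same way. The function name roundtrip_through_shared_memory is made up for the example, and the payload is assumed to be non-empty.

```python
from multiprocessing import shared_memory


def roundtrip_through_shared_memory(payload: bytes) -> bytes:
    """Write bytes into a named shared block and read them back via a second handle."""
    # Create a named block large enough for the payload (must be non-empty).
    owner = shared_memory.SharedMemory(create=True, size=len(payload))
    try:
        owner.buf[:len(payload)] = payload
        # A worker process would do this step, given only owner.name.
        reader = shared_memory.SharedMemory(name=owner.name)
        try:
            return bytes(reader.buf[:len(payload)])
        finally:
            reader.close()   # every handle closes its own mapping
    finally:
        owner.close()
        owner.unlink()       # the owner frees the block once everyone is done


if __name__ == "__main__":
    print(roundtrip_through_shared_memory(b"hello"))
```

The slice uses len(payload) rather than the block's size because the operating system may round the allocation up to a page boundary.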

The modern way: ProcessPoolExecutor

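from concurrent.futures import ProcessPoolExecutor

# cpu_heavy as defined in the Core API example; data is any iterable of inputs
if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=4) as executor:
        results = list(executor.map(cpu_heavy, data))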

Same API as ThreadPoolExecutor. Easy to swap between the two.

Common mistake: Using multiprocessing for I/O-bound tasks. Each process has startup overhead (~50-100ms) and higher memory usage. For I/O-bound work with many short tasks, threading or asyncio is almost always better.

asyncio In Depth

The mental model

asyncio is single-threaded cooperative multitasking. One thread, one event loop, multiple coroutines that voluntarily yield control when they’re waiting.

The event loop is a dispatcher: it holds a queue of coroutines ready to run, picks one, runs it until it hits an await that actually has to wait, suspends it, picks the next ready coroutine, runs it, and so on.

Event Loop
    │
    ├── coroutine A runs → hits await → suspended
    ├── coroutine B runs → hits await → suspended  
    ├── coroutine C runs → completes → removed
    └── ... I/O event fires → coroutine A marked ready → runs again

ELI5: asyncio is one chef running a restaurant kitchen alone. When they put something in the oven (await), they don’t stand there watching it — they go work on something else. When the oven timer goes off (I/O event), they come back to finish that dish. No waiting around. One person, maximum efficiency through smart scheduling.

Coroutines: the mechanics

async def fetch_data(url):   # coroutine function
    ...

async def caller():          # await is only valid inside a coroutine
    coro = fetch_data("https://...")  # coroutine object: nothing runs yet
    result = await coro               # run it, suspending caller() until done

async def does NOT run anything. It creates a coroutine function. Calling it returns a coroutine object. Nothing executes until you await it or schedule it on the event loop.

Common mistake: Forgetting await. result = fetch_data(url) gives you a coroutine object, not the result. CPython emits a RuntimeWarning (“coroutine ... was never awaited”) only when that object is garbage-collected, so the bug is easy to miss.
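A runnable sketch of the distinction between a coroutine object and its result. The URL is a placeholder, asyncio.sleep(0) stands in for real I/O, and the function names are made up for the example.

```python
import asyncio
import inspect


async def fetch_data(url: str) -> str:
    await asyncio.sleep(0)  # stand-in for real I/O
    return f"data from {url}"


async def main():
    coro = fetch_data("https://example.invalid")  # placeholder URL
    created_only = inspect.iscoroutine(coro)      # an object exists, but no work has run
    result = await coro                           # now the body actually executes
    return created_only, result


if __name__ == "__main__":
    print(asyncio.run(main()))
```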

Task management

import asyncio

# gather: run concurrently, wait for ALL, return results in order
results = await asyncio.gather(fetch(url1), fetch(url2), fetch(url3))

# create_task: schedule immediately, get a handle
task = asyncio.create_task(fetch(url))
# ... do other stuff ...
result = await task

# TaskGroup (3.11+): structured concurrency
async with asyncio.TaskGroup() as tg:
    t1 = tg.create_task(fetch(url1))
    t2 = tg.create_task(fetch(url2))
# All tasks done here. If ANY raises, all others are cancelled.

API	Behavior on exception
`gather()`	By default propagates first exception; others keep running
`gather(return_exceptions=True)`	Returns exceptions as values
`TaskGroup`	Cancels all siblings, re-raises after all cleaned up
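The return_exceptions=True row can be sketched as follows. The coroutines ok and boom are made up for the example. With that flag, a failing awaitable does not abort the call; its exception object appears in the result list at the same position as the awaitable that raised it.

```python
import asyncio


async def ok(value):
    await asyncio.sleep(0)
    return value


async def boom():
    await asyncio.sleep(0)
    raise ValueError("boom")


async def collect():
    # Exceptions come back as values, in the same positions as the awaitables.
    return await asyncio.gather(ok(1), boom(), ok(3), return_exceptions=True)


if __name__ == "__main__":
    for item in asyncio.run(collect()):
        print(type(item).__name__, item)
```

The caller then has to check each item with isinstance(item, Exception); nothing is raised automatically.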

Why TaskGroup is better than bare gather

asyncio.gather() has subtle cancellation behavior. If one task fails and you have cleanup to do in others, it’s easy to leave coroutines dangling.

TaskGroup implements structured concurrency: tasks can’t outlive the scope they’re created in. When the async with block exits, all tasks are done (or cancelled). This prevents coroutine leaks.

ELI5: gather() is like sending 5 employees on errands and telling them to meet back whenever — if one gets in a car crash, the others don’t know and keep shopping. TaskGroup is like a field trip with a chaperone: if one student goes missing, the chaperone halts everything and accounts for everyone before leaving.

Cancellation

task = asyncio.create_task(long_operation())
await asyncio.sleep(5)
task.cancel()  # sends CancelledError into the coroutine

try:
    await task
except asyncio.CancelledError:
    pass  # expected

Inside the coroutine, CancelledError is raised at the next await. You can catch it to clean up, but you must re-raise it:

async def long_operation():
    try:
        await some_io()
    except asyncio.CancelledError:
        await cleanup()
        raise  # MUST re-raise: swallowing it hides the cancellation from the caller (timeouts, TaskGroup)
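Both halves of the cancellation pattern can be put together in one runnable sketch. The names cancel_demo and events are made up for the example, and asyncio.sleep(10) stands in for a long operation. The list records the order in which things happen: cleanup inside the task first, then the CancelledError reaching the awaiting caller.

```python
import asyncio


async def cancel_demo():
    events = []

    async def long_operation():
        try:
            await asyncio.sleep(10)   # stand-in for slow I/O
        except asyncio.CancelledError:
            events.append("cleanup")  # release resources here
            raise                     # re-raise so the task ends up cancelled

    task = asyncio.create_task(long_operation())
    await asyncio.sleep(0)            # let the task start and reach its await
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        events.append("cancelled")
    return events, task.cancelled()


if __name__ == "__main__":
    print(asyncio.run(cancel_demo()))
```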

Running blocking code in async context

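import asyncio
from pathlib import Path

# WRONG: blocks the entire event loop
async def bad():
    data = open("large_file.txt").read()  # blocks! all other coroutines freeze

# RIGHT: run in a thread pool (asyncio.to_thread, Python 3.9+)
async def good():
    # pass the callable itself; calling open(...) here would still run on the event loop thread
    data = await asyncio.to_thread(Path("large_file.txt").read_text)

# Or explicitly, from inside a coroutine
async def explicit():
    loop = asyncio.get_running_loop()
    data = await loop.run_in_executor(None, blocking_function, arg1, arg2)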

Common mistake: Calling time.sleep() inside an async function. This is one of the most common asyncio bugs. Use await asyncio.sleep() instead. The blocking version freezes the entire event loop: all your “concurrent” requests stop while one sleeps.
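A sketch showing that the loop stays responsive when the blocking call is moved to a thread. The names blocking_work, heartbeat, and run_without_blocking are made up for the example, and time.sleep() stands in for any blocking function. While the blocking call runs in a worker thread, the heartbeat coroutine keeps ticking on the event loop.

```python
import asyncio
import time


def blocking_work(seconds: float) -> str:
    time.sleep(seconds)  # would freeze the loop if called directly in a coroutine
    return "done"


async def heartbeat(stop: asyncio.Event) -> int:
    """Count how many times the loop got a chance to run this coroutine."""
    ticks = 0
    while not stop.is_set():
        ticks += 1
        await asyncio.sleep(0.01)
    return ticks


async def run_without_blocking():
    stop = asyncio.Event()
    beat = asyncio.create_task(heartbeat(stop))
    result = await asyncio.to_thread(blocking_work, 0.2)  # Python 3.9+
    stop.set()
    return result, await beat


if __name__ == "__main__":
    print(asyncio.run(run_without_blocking()))
```

Calling blocking_work(0.2) directly inside run_without_blocking would be expected to leave the heartbeat at a single tick, because nothing else can run until the call returns.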

concurrent.futures — The Unified API

Why start here

Before reaching for raw threading or multiprocessing, use concurrent.futures. Same API for both, easy to swap, handles lifecycle and error propagation.

from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor, as_completed

# I/O-bound: threads
with ThreadPoolExecutor(max_workers=10) as executor:
    future = executor.submit(fetch, url)
    result = future.result()  # blocks until done, re-raises exceptions

# CPU-bound: processes (same API!)
with ProcessPoolExecutor(max_workers=4) as executor:
    future = executor.submit(compress, data)
    result = future.result()

A Future from executor.submit() has .done(), .running(), .cancelled(), .result(timeout=N) (blocks, re-raises), .exception(), .cancel(), and .add_done_callback(). That’s essentially the whole API: lean but complete.

map vs as_completed

# map: simple, preserves order, blocks
results = list(executor.map(fn, items))  # blocks until ALL done

# as_completed: process results as they arrive (better for long-tail latency)
futures = {executor.submit(fn, item): item for item in items}
for future in as_completed(futures):
    item = futures[future]
    try:
        result = future.result()
    except Exception as e:
        print(f"{item} failed: {e}")

as_completed is better when tasks have variable duration. A result that arrives in 100ms doesn’t have to wait behind a slow task that takes 5s.
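The ordering difference can be seen with a toy task whose duration is its argument. The names slow_echo and compare_orderings are made up for the example. map() returns results in input order; as_completed() yields them in the order they finish.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def slow_echo(delay: float) -> float:
    time.sleep(delay)  # stand-in for a task of variable duration
    return delay


def compare_orderings(delays):
    """Return (results in input order, results in completion order)."""
    with ThreadPoolExecutor(max_workers=len(delays)) as executor:
        in_input_order = list(executor.map(slow_echo, delays))
        futures = [executor.submit(slow_echo, d) for d in delays]
        in_completion_order = [f.result() for f in as_completed(futures)]
    return in_input_order, in_completion_order


if __name__ == "__main__":
    print(compare_orderings([0.3, 0.05, 0.15]))
```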

Synchronization Primitives

When you need them

Whenever two tasks share mutable state. If your concurrency model involves shared data, you need synchronization.

ELI5: Shared mutable state without synchronization is two people editing the same Google Doc but offline — when they sync, one person’s changes get overwritten. Locks are like Google’s “one editor at a time” cursor.

Primitive	Threading	asyncio	Use case
Lock	`threading.Lock`	`asyncio.Lock`	Exclusive access to resource
Semaphore	`threading.Semaphore`	`asyncio.Semaphore`	Limit concurrent access (connection pool)
Event	`threading.Event`	`asyncio.Event`	Signal: “this thing happened”
Condition	`threading.Condition`	`asyncio.Condition`	Wait for state change
Queue	`queue.Queue`	`asyncio.Queue`	Safe producer-consumer

Semaphore for rate limiting

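# Limit to 5 concurrent HTTP requests (caps in-flight requests, not requests per second)
sem = asyncio.Semaphore(5)

async def fetch(url):
    async with sem:
        return await http.get(url)  # http is a placeholder for your async HTTP client

await asyncio.gather(*[fetch(url) for url in urls])  # only 5 at a time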
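A runnable sketch that measures the cap, with asyncio.sleep() standing in for the HTTP call. The names peak_concurrency and limited_job are made up for the example. It launches more tasks than the limit and records the highest number that were ever inside the semaphore at once.

```python
import asyncio


async def peak_concurrency(num_tasks: int = 20, limit: int = 3) -> int:
    """Run num_tasks jobs behind a semaphore and return the peak number in flight."""
    sem = asyncio.Semaphore(limit)  # created inside the running loop
    active = 0
    peak = 0

    async def limited_job():
        nonlocal active, peak
        async with sem:
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0.01)  # stand-in for the HTTP request
            active -= 1

    await asyncio.gather(*(limited_job() for _ in range(num_tasks)))
    return peak


if __name__ == "__main__":
    print(asyncio.run(peak_concurrency()))
```

Creating the semaphore inside the coroutine rather than at module level keeps it tied to the loop that actually runs the tasks, which matters on Python versions before 3.10.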

Common mistake: Using a plain list as a queue between threads. In CPython, list.append() and list.pop() happen to be atomic because of the GIL, but that is an implementation detail, and a list gives you no blocking or backpressure. Use queue.Queue(maxsize=N): it blocks put() when full and get() when empty, and is safe across threads by design.
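A minimal producer-consumer sketch with queue.Queue. The function name produce_and_consume and the sentinel object are choices made for this example, not the only way to signal completion. The queue is deliberately small so that put() really does block when the consumer falls behind.

```python
import queue
import threading

_SENTINEL = object()  # unique marker meaning "no more items"


def produce_and_consume(items):
    """Pass items from a producer thread to a consumer thread through a bounded queue."""
    q = queue.Queue(maxsize=2)
    consumed = []

    def producer():
        for item in items:
            q.put(item)       # blocks while the queue is full
        q.put(_SENTINEL)

    def consumer():
        while True:
            item = q.get()    # blocks while the queue is empty
            if item is _SENTINEL:
                break
            consumed.append(item)

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return consumed


if __name__ == "__main__":
    print(produce_and_consume(list(range(10))))
```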

Decision Framework

The big decision table

Task type	Volume	Recommended	Why
I/O-bound	Moderate (10-100 tasks)	`ThreadPoolExecutor`	Simple, low overhead
I/O-bound	High (1000+ tasks)	`asyncio`	Scales to thousands; threads are expensive
CPU-bound	Any	`ProcessPoolExecutor`	Bypasses GIL, real parallelism
Mixed I/O + CPU	Separate phases	asyncio for I/O + ProcessPool for CPU	Each tool for its job
Fire-and-forget background tasks	Low	`threading.Thread(daemon=True)`	Simplest possible
Batch data processing	Large datasets	`multiprocessing.Pool`	Full process control

Ask in order: (1) CPU-bound? → ProcessPoolExecutor. (2) I/O-bound, how many tasks? Dozens → ThreadPoolExecutor. Hundreds+ → asyncio. (3) Can you go async all the way? If not, threading avoids the pain of mixing sync/async code where one blocking call freezes the event loop.

Summary: the tl;dr

Scenario	Use
HTTP client making many requests	`asyncio` + `aiohttp`
Scraping 50 URLs	`ThreadPoolExecutor(10)`
Processing 1M CSV rows	`ProcessPoolExecutor(os.cpu_count())`
Background DB writes from web server	`ThreadPoolExecutor`
Parallel ML batch inference	`multiprocessing.Pool`
Real-time websocket server	`asyncio`
Simple parallel map	`concurrent.futures.ProcessPoolExecutor.map()`
