
Concurrency & Parallelism

Concurrency vs Parallelism

Two words the Python community uses interchangeably and incorrectly, constantly.

Concurrency — multiple tasks are in progress at the same time. They take turns on the CPU. One plumber fixing two sinks: while the glue dries on sink A, they work on sink B. No moment where both are being worked on simultaneously.

Parallelism — multiple tasks are literally executing at the same time, on different cores. Two plumbers, two sinks, both working simultaneously.

ELI5: Concurrency is a juggler keeping 5 balls in the air — at any given microsecond, they’re only touching one ball. Parallelism is 5 jugglers each holding their own ball. The juggler is faster because they waste no time waiting; the 5-person team is faster because they have 5x the hands.

Why this matters for Python: in the default CPython build, the GIL stops threads from running Python bytecode in parallel inside one interpreter. Threading in CPython gives you concurrency, not parallelism. This single fact drives almost every architectural decision in Python concurrency.

| | Concurrency | Parallelism |
| --- | --- | --- |
| Definition | Tasks overlap in time | Tasks execute simultaneously |
| CPU at any instant | One task runs | Multiple tasks run |
| Python threading | Yes | No (GIL) |
| Python multiprocessing | Yes | Yes |
| asyncio | Yes | No |
| Useful for | I/O-bound | CPU-bound |

The Fundamental Decision: I/O-bound vs CPU-bound

This is the question you must answer before choosing any concurrency approach:

I/O-bound: Your code spends most time waiting — for network, disk, database. The CPU is idle. A web scraper making 1000 HTTP requests: 95% of time is waiting for responses.

CPU-bound: Your code does real computation. Image processing, ML inference, matrix math, password hashing. The CPU is pegged at 100%.

ELI5: I/O-bound is waiting at the DMV — you’re not doing anything, just waiting. Concurrency helps because while you wait, someone else can use the counter. CPU-bound is doing a 1000-piece puzzle — you’re working the whole time. Concurrency doesn’t help; you need more people (processes) to work faster.

If you answer this question wrong, you’ll make your code slower. Threading a CPU-bound workload adds overhead with zero benefit.
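
To make the I/O-bound case measurable, here is a minimal sketch that assumes time.sleep() is a fair stand-in for waiting on a network response. The names fake_io and timed are illustrative, not part of any library.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_io(delay):
    """Stand-in for a network call: the thread sleeps and the CPU stays idle."""
    time.sleep(delay)
    return delay

def timed(fn):
    """Return how long fn() takes, in seconds."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start

delays = [0.1] * 4

# One after another: the waits add up.
sequential = timed(lambda: [fake_io(d) for d in delays])

# In threads: the waits overlap, because a sleeping thread does not hold the CPU.
with ThreadPoolExecutor(max_workers=4) as pool:
    threaded = timed(lambda: list(pool.map(fake_io, delays)))

print(f"sequential: {sequential:.2f}s, threaded: {threaded:.2f}s")
```

Swapping fake_io for a pure-Python computation would be expected to erase the gain, since the work then needs the CPU the whole time.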


The Global Interpreter Lock (GIL)

The single most misunderstood thing in Python.

What it is

A mutex (mutual exclusion lock) protecting CPython’s internal state. At any moment, only one thread can execute Python bytecode. It’s not a language feature — it’s a CPython implementation detail.

Why it exists

CPython uses reference counting for memory management. Every Python object has an ob_refcnt field. When you do x = obj, that counter increments. When x goes out of scope, it decrements. Hits zero → object gets freed.

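Reference counting is not thread-safe. Without the GIL, two threads could simultaneously decrement a ref count: both read 1, both subtract 1, both think they got 0, and both try to free the object. That’s a double free. The mirror-image race, two increments where one is lost, frees the object too early and leaves a dangling reference (use-after-free).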

The fix in 1992 was: one big lock around the interpreter. Fast to implement, still here 30 years later.

ELI5: Imagine a shared notebook where everyone tracks who borrowed what book. If two people try to erase and rewrite at the same time, the notebook gets corrupted. The GIL is a rule: only one person can touch the notebook at a time. It works, but it means no two people can ever update it simultaneously.
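
The counter itself can be observed from Python. This is a small sketch using sys.getrefcount(); the absolute numbers vary by interpreter version, so only the differences are meaningful.

```python
import sys

obj = object()

# getrefcount reports one extra reference: the temporary held by the call itself.
before = sys.getrefcount(obj)

alias = obj                         # binding another name increments the count
after_bind = sys.getrefcount(obj)

del alias                           # removing the name decrements it again
after_del = sys.getrefcount(obj)

print(after_bind - before, after_del - before)
```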

What the GIL prevents

True parallel execution of Python bytecode across multiple threads. Period.

What the GIL does NOT prevent

This is where people get confused:

  • Parallel I/O: When a thread blocks on a socket.recv() or open(), it releases the GIL. Other threads can run Python code while one thread waits for the OS.
  • C extensions that release the GIL: NumPy, pandas, cryptography release the GIL around many of their heavy operations. numpy.dot() on two large matrices runs in parallel with other Python threads.
  • Subinterpreters (3.12+): Each can have its own GIL (PEP 684).

GIL release cycle

CPython asks the running thread to release the GIL once it has held it for sys.getswitchinterval() seconds (default: 5 ms) and another thread is waiting. This lets other threads run. But switching has overhead: context switches, cache misses. For CPU-bound code, this overhead can make multithreaded CPython slower than single-threaded.

import sys
sys.getswitchinterval()  # 0.005 (5ms)
sys.setswitchinterval(0.001)  # switch more often (rarely useful)

PEP 703 — Free-threaded Python (3.13+)

Python 3.13 adds an optional, experimental free-threaded build, usually installed as python3.13t. The GIL is replaced with finer-grained synchronization such as per-object locks. The single-threaded overhead is still real, but true thread parallelism is possible.

This is not production-ready for most workloads yet. Watch this space.

Why this matters: If you’ve ever written a multithreaded Python app and wondered why it doesn’t use all your CPU cores, now you know. The GIL is the answer. The fix is either multiprocessing, asyncio (for I/O), or waiting for free-threaded Python to mature.


Threading

When to reach for it

I/O-bound tasks where you want concurrency without the overhead of multiple processes. Database queries, HTTP requests, file I/O, reading from queues.

Not for: CPU-bound work. Threading a CPU-bound task in CPython = paying context-switch overhead with zero parallelism benefit.

Core primitives

import threading

# Basic thread
def worker(name):
    print(f"Thread {name} starting")
    # ... do work ...

t = threading.Thread(target=worker, args=("A",), daemon=True)
t.start()
t.join()  # wait for completion

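daemon=True means the thread is killed abruptly, without running cleanup code, when the process exits. The process exits once every non-daemon thread (including the main thread) has finished, so non-daemon threads keep it alive.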

| Primitive | Purpose |
| --- | --- |
| Lock | Mutual exclusion: only one thread at a time |
| RLock | Reentrant lock: same thread can acquire multiple times |
| Semaphore | Allow N threads simultaneously (connection pool) |
| Event | Signal between threads (one sets, others wait) |
| Condition | Wait for a state change (producer-consumer) |
| Barrier | Wait until N threads all reach a point |

Race conditions and Locks

import threading

counter = 0
lock = threading.Lock()

def increment():
    global counter
    with lock:  # acquire on enter, release on exit (even if exception)
        counter += 1  # the read-modify-write now happens under the lock

Without the lock: two threads read counter = 5, both add 1, both write 6. You lost an increment. This is a race condition.

ELI5: A lock is a bathroom key. The door says “occupied” and no one else can enter. When you leave, you hang the key back. Without a lock, two people could open the door at the same time and chaos ensues. The cost: everyone else waits.
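
A runnable version of the counter, as a sketch: eight threads each add 10,000 under the lock. The helper name add_many and the thread count are arbitrary choices for illustration.

```python
import threading

counter = 0
lock = threading.Lock()

def add_many(n):
    """Increment the shared counter n times, one locked step at a time."""
    global counter
    for _ in range(n):
        with lock:
            counter += 1

threads = [threading.Thread(target=add_many, args=(10_000,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 80000
```

With the lock removed, the total may come out lower. Whether the race actually shows up depends on the interpreter version and on timing, which is exactly what makes it dangerous.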

Deadlocks

Four conditions (Coffman) must ALL be true for deadlock:

  1. Mutual exclusion — resources can’t be shared
  2. Hold and wait — thread holds A and waits for B
  3. No preemption — you can’t forcibly take a resource
  4. Circular wait — Thread 1 holds A wants B, Thread 2 holds B wants A

Break any one condition and deadlock is impossible. Most practical fix: always acquire locks in the same order across all threads.

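import threading

lock_a = threading.Lock()
lock_b = threading.Lock()

# Deadlock-prone
def transfer_ab():
    with lock_a:
        with lock_b: ...

def transfer_ba():
    with lock_b:       # opposite order!
        with lock_a: ...

# Safe: one global order for every thread (here: by id(), not by name)
def transfer_any(from_lock, to_lock):
    # Assumes two distinct locks; a plain Lock acquired twice self-deadlocks
    first, second = sorted([from_lock, to_lock], key=id)
    with first:
        with second: ...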
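
The lock-ordering fix can be exercised end to end. This sketch assumes a hypothetical Account class with one lock per account and runs transfers in opposite directions at the same time.

```python
import threading

class Account:
    """Hypothetical account: a balance guarded by its own lock."""
    def __init__(self, balance):
        self.balance = balance
        self.lock = threading.Lock()

def transfer(src, dst, amount):
    if src is dst:
        return  # nothing to move, and locking the same Lock twice would self-deadlock
    # Every thread acquires the two locks in the same global order.
    first, second = sorted((src, dst), key=id)
    with first.lock:
        with second.lock:
            src.balance -= amount
            dst.balance += amount

a, b = Account(1000), Account(1000)

def shuffle(src, dst):
    for _ in range(1000):
        transfer(src, dst, 1)

t1 = threading.Thread(target=shuffle, args=(a, b))
t2 = threading.Thread(target=shuffle, args=(b, a))  # opposite direction
t1.start()
t2.start()
t1.join()
t2.join()

print(a.balance, b.balance)  # 1000 1000
```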

Thread-local storage

threading.local() gives each thread its own copy of a variable. Django relies on the same idea to give each thread its own database connection.

local = threading.local()

def worker():
    local.connection = create_db_connection()  # each thread has its own
    # ... use local.connection ...
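
A self-contained sketch of the isolation, with a string standing in for the connection object:

```python
import threading

local = threading.local()
seen = {}

def worker(name):
    local.value = name          # stored on this thread's private copy
    seen[name] = local.value    # each thread reads back only what it wrote

threads = [threading.Thread(target=worker, args=(f"t{i}",)) for i in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(seen.items()))
print(hasattr(local, "value"))  # False: the main thread never set it
```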

The modern way: ThreadPoolExecutor

from concurrent.futures import ThreadPoolExecutor, as_completed

urls = ["https://...", "https://...", ...]

with ThreadPoolExecutor(max_workers=10) as executor:
    futures = {executor.submit(fetch, url): url for url in urls}
    for future in as_completed(futures):
        url = futures[future]
        result = future.result()  # raises if the task raised

Prefer this over managing threads manually. It handles thread lifecycle, error propagation, and result collection.

Common mistake: Setting max_workers too high. For HTTP requests, 10-50 is usually the sweet spot. More workers = more overhead + server rate limiting. For database queries, match your connection pool size.


Multiprocessing

When to reach for it

CPU-bound tasks that need real parallelism. Image processing, data transformation, ML inference, compression, anything where you’d see 100% CPU on one core.

Each process has its own Python interpreter, its own GIL, its own memory space. True parallelism.

ELI5: Threading is one chef switching rapidly between 5 dishes. Multiprocessing is 5 chefs each with their own kitchen. The 5-chef solution actually cooks 5x faster, but you need 5x the kitchen equipment and can’t easily share ingredients between kitchens.

Core API

from multiprocessing import Pool

def cpu_heavy(n):
    return sum(i * i for i in range(n))

# The guard is required under spawn/forkserver, where workers re-import this module
if __name__ == "__main__":
    with Pool(processes=4) as pool:  # 4 worker processes
        results = pool.map(cpu_heavy, [10**6, 10**6, 10**6, 10**6])

| Primitive | Purpose |
| --- | --- |
| Pool | Fixed worker pool with map/apply |
| Process | Single explicit process |
| Queue | Process- and thread-safe FIFO |
| Pipe | Two-way channel between processes |
| Value/Array | Shared typed memory |
| shared_memory (3.8+) | Raw shared memory block |

The pickle bottleneck

Inter-process communication serializes everything with pickle. This is the performance ceiling. If you’re sending large numpy arrays back and forth, the serialization cost can exceed the computation benefit.

What can’t be pickled: lambda functions, local functions, database connections, file handles, many C extension objects.

# This raises a pickling error when the task is sent to a worker
pool.map(lambda x: x*2, data)  # lambda can't be pickled

# Fix: use a module-level function
def double(x): return x * 2
pool.map(double, data)
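
A quick way to check whether something will survive the trip to a worker is to pickle it yourself. The helper can_pickle below is an illustrative name, not a library function.

```python
import pickle

def double(x):
    return x * 2

def can_pickle(obj):
    """Return True if obj can be serialized the way multiprocessing would send it."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False

print(can_pickle(double))            # module-level function: pickled by reference
print(can_pickle(lambda x: x * 2))   # False
```

Functions are pickled by qualified name, not by value, which is why a lambda (whose name is just "<lambda>") cannot be looked up again on the other side.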

Start methods: fork vs spawn vs forkserver

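| Method | How | Speed | Safety | Default on |
| --- | --- | --- | --- | --- |
| fork | Copy entire parent process | Fast | Unsafe with threads | Linux (through 3.13) |
| spawn | Fresh interpreter, import everything | Slow startup | Safe | macOS (3.8+), Windows |
| forkserver | Dedicated server forks on demand | Medium | Safe | Linux (3.14+) |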

macOS gotcha: macOS changed its default from fork to spawn in Python 3.8. Code that worked fine on Linux may break on macOS. fork + threads is unsafe: the child copies locks that were held by threads which do not exist in the child, so it can deadlock the first time it touches one.

import multiprocessing

if __name__ == "__main__":
    multiprocessing.set_start_method('spawn')  # explicit, portable; call at most once

Shared memory (3.8+)

For large numpy arrays, skip pickle entirely:

from multiprocessing import shared_memory
import numpy as np

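shape = (1000, 1000)
nbytes = shape[0] * shape[1] * np.dtype(np.float64).itemsize  # 8 MB, sized to the array
shm = shared_memory.SharedMemory(create=True, size=nbytes)
arr = np.ndarray(shape, dtype=np.float64, buffer=shm.buf)
# Pass shm.name to worker processes; they attach without copying
# When finished: shm.close() in every process, shm.unlink() once in the owner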
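
The attach-by-name step can be sketched with the standard library alone. This version uses struct in place of NumPy so it runs without third-party packages, and it calls the reader in the same process for brevity; in real use read_sum (a hypothetical name) would run in a worker that received only the block's name.

```python
import struct
from multiprocessing import shared_memory

N = 4  # number of float64 values stored in the block

def read_sum(name, n):
    """What a worker would do: attach by name, read in place, detach."""
    shm = shared_memory.SharedMemory(name=name)
    try:
        return sum(struct.unpack_from(f"{n}d", shm.buf, 0))
    finally:
        shm.close()  # detach only; the owner is responsible for unlink()

owner = shared_memory.SharedMemory(create=True, size=N * 8)
try:
    struct.pack_into(f"{N}d", owner.buf, 0, 1.0, 2.0, 3.0, 4.0)
    total = read_sum(owner.name, N)
finally:
    owner.close()
    owner.unlink()   # free the block once, from the process that created it

print(total)  # 10.0
```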

The modern way: ProcessPoolExecutor

from concurrent.futures import ProcessPoolExecutor

with ProcessPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(cpu_heavy, data))

Same API as ThreadPoolExecutor. Easy to swap between the two.
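
A complete script version, as a sketch. The input sizes and chunksize are arbitrary; the main guard matters because worker processes may re-import the module under spawn or forkserver.

```python
from concurrent.futures import ProcessPoolExecutor

def cpu_heavy(n):
    """Pure-Python CPU work: sum of squares below n."""
    return sum(i * i for i in range(n))

def main():
    data = [200_000] * 8
    with ProcessPoolExecutor(max_workers=4) as executor:
        # chunksize batches several inputs per pickle round-trip
        results = list(executor.map(cpu_heavy, data, chunksize=2))
    print(len(results), results[0] == cpu_heavy(200_000))

if __name__ == "__main__":
    main()
```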

Common mistake: Using multiprocessing for I/O-bound tasks. Each process has startup overhead (~50-100ms) and higher memory usage. For I/O-bound work with many short tasks, threading or asyncio is almost always better.


asyncio In Depth

The mental model

asyncio is single-threaded cooperative multitasking. One thread, one event loop, multiple coroutines that voluntarily yield control when they’re waiting.

The event loop is a dispatcher: it holds a queue of coroutines ready to run, picks one, runs it until it hits an await, suspends it, picks the next ready coroutine, runs it, and so on.

Event Loop
    │
    ├── coroutine A runs → hits await → suspended
    ├── coroutine B runs → hits await → suspended  
    ├── coroutine C runs → completes → removed
    └── ... I/O event fires → coroutine A marked ready → runs again

ELI5: asyncio is one chef running a restaurant kitchen alone. When they put something in the oven (await), they don’t stand there watching it — they go work on something else. When the oven timer goes off (I/O event), they come back to finish that dish. No waiting around. One person, maximum efficiency through smart scheduling.

Coroutines: the mechanics

async def fetch_data(url):   # coroutine function
    ...

coro = fetch_data("https://...")  # coroutine object — nothing runs yet
result = await coro              # run it, suspending this coroutine until done

async def does NOT run anything. It creates a coroutine function. Calling it returns a coroutine object. Nothing executes until you await it or schedule it on the event loop.

Common mistake: Forgetting await. result = fetch_data(url) gives you a coroutine object, not the result. CPython emits “RuntimeWarning: coroutine ... was never awaited” when the abandoned coroutine is garbage-collected, but the code keeps running with the wrong value.
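
The difference between the coroutine object and its result is easy to see in a runnable sketch. The URL is a placeholder and asyncio.sleep(0) stands in for real I/O.

```python
import asyncio

async def fetch_data(url):
    await asyncio.sleep(0)          # stand-in for real I/O
    return f"data from {url}"

async def main():
    coro = fetch_data("https://example.com")   # nothing has run yet
    is_coro = asyncio.iscoroutine(coro)
    result = await coro                         # now the body executes
    return is_coro, result

is_coro, result = asyncio.run(main())
print(is_coro, result)
```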

Task management

import asyncio

# gather: run concurrently, wait for ALL, return results in order
results = await asyncio.gather(fetch(url1), fetch(url2), fetch(url3))

# create_task: schedule immediately, get a handle
task = asyncio.create_task(fetch(url))
# ... do other stuff ...
result = await task

# TaskGroup (3.11+): structured concurrency
async with asyncio.TaskGroup() as tg:
    t1 = tg.create_task(fetch(url1))
    t2 = tg.create_task(fetch(url2))
# All tasks done here. If ANY raises, all others are cancelled.

| API | Behavior on exception |
| --- | --- |
| gather() | By default propagates first exception; others keep running |
| gather(return_exceptions=True) | Returns exceptions as values |
| TaskGroup | Cancels all siblings, then raises an ExceptionGroup once they have finished |
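
A minimal sketch of the return_exceptions=True behavior, using a hypothetical job coroutine in which one call fails:

```python
import asyncio

async def job(n):
    await asyncio.sleep(0.01 * n)
    if n == 2:
        raise ValueError(f"job {n} failed")
    return n

async def main():
    # Failures come back as values in their original positions.
    return await asyncio.gather(job(1), job(2), job(3), return_exceptions=True)

results = asyncio.run(main())
print(results)  # [1, ValueError('job 2 failed'), 3]
```

The caller then has to check each slot with isinstance(result, Exception), which is the price of not losing the successful results.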

Why TaskGroup is better than bare gather

asyncio.gather() has subtle cancellation behavior. If one task fails and you have cleanup to do in others, it’s easy to leave coroutines dangling.

TaskGroup implements structured concurrency: tasks can’t outlive the scope they’re created in. When the async with block exits, all tasks are done (or cancelled). This prevents coroutine leaks.

ELI5: gather() is like sending 5 employees on errands and telling them to meet back whenever — if one gets in a car crash, the others don’t know and keep shopping. TaskGroup is like a field trip with a chaperone: if one student goes missing, the chaperone halts everything and accounts for everyone before leaving.

Cancellation

task = asyncio.create_task(long_operation())
await asyncio.sleep(5)
task.cancel()  # sends CancelledError into the coroutine

try:
    await task
except asyncio.CancelledError:
    pass  # expected

Inside the coroutine, CancelledError is raised at the next await. You can catch it to clean up, but you must re-raise it:

async def long_operation():
    try:
        await some_io()
    except asyncio.CancelledError:
        await cleanup()
        raise  # MUST re-raise — swallowing CancelledError breaks everything
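
Both halves of the cancellation handshake can be run together. In this sketch the events list records the order in which the cleanup and the caller's handler execute.

```python
import asyncio

events = []

async def long_operation():
    try:
        await asyncio.sleep(10)
    except asyncio.CancelledError:
        events.append("cleanup")    # release resources here
        raise                       # let the cancellation propagate

async def main():
    task = asyncio.create_task(long_operation())
    await asyncio.sleep(0.01)       # let the task start and reach its await
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        events.append("cancelled")
    return task.cancelled()

was_cancelled = asyncio.run(main())
print(events, was_cancelled)  # ['cleanup', 'cancelled'] True
```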

Running blocking code in async context

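import asyncio
from pathlib import Path

# WRONG: blocks the entire event loop
async def bad():
    data = open("large_file.txt").read()  # blocks! all other coroutines freeze

# RIGHT: run in a thread pool (asyncio.to_thread is 3.9+)
async def good():
    # Pass the callable itself so both the open and the read happen in the worker thread
    data = await asyncio.to_thread(Path("large_file.txt").read_text)

# Or explicitly, from inside a coroutine
async def explicit():
    loop = asyncio.get_running_loop()
    data = await loop.run_in_executor(None, blocking_function, arg1, arg2)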

Common mistake: Calling time.sleep() inside an async function. This is one of the most common asyncio bugs. Use await asyncio.sleep() instead. The blocking version freezes the entire event loop: all your “concurrent” requests stop while one sleeps.
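
The freeze is observable. This sketch runs a hypothetical ticker coroutine alongside a 0.2 s blocking call, once directly on the loop and once through asyncio.to_thread (3.9+), and counts how often the ticker got to run.

```python
import asyncio
import time

async def count_ticks(stop):
    """Tick every 10 ms until told to stop; returns how many ticks ran."""
    ticks = 0
    while not stop.is_set():
        ticks += 1
        await asyncio.sleep(0.01)
    return ticks

async def run(offload):
    stop = asyncio.Event()
    ticker = asyncio.create_task(count_ticks(stop))
    await asyncio.sleep(0)                          # let the ticker start
    if offload:
        await asyncio.to_thread(time.sleep, 0.2)    # loop stays free
    else:
        time.sleep(0.2)                             # loop is frozen
    stop.set()
    return await ticker

blocked = asyncio.run(run(offload=False))
offloaded = asyncio.run(run(offload=True))
print(blocked, offloaded)
```

The blocked run should report only a tick or so, while the offloaded run reports many; the exact counts depend on timer resolution.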


concurrent.futures — The Unified API

Why start here

Before reaching for raw threading or multiprocessing, use concurrent.futures. Same API for both, easy to swap, handles lifecycle and error propagation.

from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor, as_completed

# I/O-bound: threads
with ThreadPoolExecutor(max_workers=10) as executor:
    future = executor.submit(fetch, url)
    result = future.result()  # blocks until done, re-raises exceptions

# CPU-bound: processes (same API!)
with ProcessPoolExecutor(max_workers=4) as executor:
    future = executor.submit(compress, data)
    result = future.result()

The methods you’ll use most on a Future from executor.submit() are .done(), .result(timeout=N) (blocks, re-raises), .exception(), .cancel(), and .add_done_callback(). The rest of the API, such as .running() and .cancelled(), is similarly small.

map vs as_completed

# map: simple, preserves order, blocks
results = list(executor.map(fn, items))  # blocks until ALL done

# as_completed: process results as they arrive (better for long-tail latency)
futures = {executor.submit(fn, item): item for item in items}
for future in as_completed(futures):
    item = futures[future]
    try:
        result = future.result()
    except Exception as e:
        print(f"{item} failed: {e}")

as_completed is better when tasks have variable duration. First result in 100ms doesn’t have to wait for the slow one that takes 5s.
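
The ordering difference shows up in a small sketch where slow_echo (an illustrative helper) sleeps for its argument and returns it:

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def slow_echo(delay):
    time.sleep(delay)
    return delay

delays = [0.3, 0.05, 0.15]

with ThreadPoolExecutor(max_workers=3) as executor:
    # map yields results in input order, so the 0.3 s task gates the first result
    in_input_order = list(executor.map(slow_echo, delays))

    # as_completed yields each future as soon as it finishes
    futures = [executor.submit(slow_echo, d) for d in delays]
    in_finish_order = [f.result() for f in as_completed(futures)]

print(in_input_order)   # [0.3, 0.05, 0.15]
print(in_finish_order)  # [0.05, 0.15, 0.3]
```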


Synchronization Primitives

When you need them

Whenever two tasks share mutable state. If your concurrency model involves shared data, you need synchronization.

ELI5: Shared mutable state without synchronization is two people editing the same Google Doc but offline — when they sync, one person’s changes get overwritten. Locks are like Google’s “one editor at a time” cursor.

| Primitive | Threading | asyncio | Use case |
| --- | --- | --- | --- |
| Lock | threading.Lock | asyncio.Lock | Exclusive access to resource |
| Semaphore | threading.Semaphore | asyncio.Semaphore | Limit concurrent access (connection pool) |
| Event | threading.Event | asyncio.Event | Signal: “this thing happened” |
| Condition | threading.Condition | asyncio.Condition | Wait for state change |
| Queue | queue.Queue | asyncio.Queue | Safe producer-consumer |

Semaphore for rate limiting

# Limit to 5 concurrent HTTP requests
sem = asyncio.Semaphore(5)

async def fetch(url):
    async with sem:
        return await http.get(url)

await asyncio.gather(*[fetch(url) for url in urls])  # only 5 at a time
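
To confirm the cap holds, this runnable sketch replaces the HTTP call with asyncio.sleep() and tracks the peak number of coroutines inside the semaphore. The in_flight and peak counters are illustrative bookkeeping only.

```python
import asyncio

async def main(limit=5, total=20):
    sem = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def fetch(i):
        nonlocal in_flight, peak
        async with sem:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)   # stand-in for the HTTP call
            in_flight -= 1
            return i

    results = await asyncio.gather(*(fetch(i) for i in range(total)))
    return peak, results

peak, results = asyncio.run(main())
print(peak, len(results))  # 5 20
```

Note that this limits how many requests are in flight at once, not how many start per second; a true requests-per-second limit needs a timer as well.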

Common mistake: Using a plain list as a queue between threads. list.append() and list.pop() happen to be atomic in CPython, but that is an implementation detail, and a list can neither block nor apply backpressure. Use queue.Queue(maxsize=N): it blocks put() when full and get() when empty, and is safe across threads by design.
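
A minimal producer-consumer sketch with a bounded queue. Using None as the shutdown sentinel is an assumption of this example; any unique marker object works.

```python
import queue
import threading

q = queue.Queue(maxsize=2)   # small bound so the producer feels backpressure
SENTINEL = None
consumed = []

def producer():
    for i in range(5):
        q.put(i)             # blocks while the queue is full
    q.put(SENTINEL)          # tell the consumer there is nothing more

def consumer():
    while True:
        item = q.get()       # blocks while the queue is empty
        if item is SENTINEL:
            break
        consumed.append(item * 10)

threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(consumed)  # [0, 10, 20, 30, 40]
```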


Decision Framework

The big decision table

| Task type | Volume | Recommended | Why |
| --- | --- | --- | --- |
| I/O-bound | Moderate (10-100 tasks) | ThreadPoolExecutor | Simple, low overhead |
| I/O-bound | High (1000+ tasks) | asyncio | Scales to thousands; threads are expensive |
| CPU-bound | Any | ProcessPoolExecutor | Bypasses GIL, real parallelism |
| Mixed I/O + CPU | Separate phases | asyncio for I/O + ProcessPool for CPU | Each tool for its job |
| Fire-and-forget background tasks | Low | threading.Thread(daemon=True) | Simplest possible |
| Batch data processing | Large datasets | multiprocessing.Pool | Full process control |

Ask in order: (1) CPU-bound? → ProcessPoolExecutor. (2) I/O-bound, how many tasks? Dozens → ThreadPoolExecutor. Hundreds+ → asyncio. (3) Can you go async all the way? If not, threading avoids the pain of mixing sync/async code where one blocking call freezes the event loop.

Summary: the tl;dr

| Scenario | Use |
| --- | --- |
| HTTP client making many requests | asyncio + aiohttp |
| Scraping 50 URLs | ThreadPoolExecutor(10) |
| Processing 1M CSV rows | ProcessPoolExecutor(os.cpu_count()) |
| Background DB writes from web server | ThreadPoolExecutor |
| Parallel ML batch inference | multiprocessing.Pool |
| Real-time websocket server | asyncio |
| Simple parallel map | concurrent.futures.ProcessPoolExecutor.map() |