Memory Management & Performance
Memory Management & Performance
CPython manages memory so you don’t have to — until you’re debugging a production memory leak at 2am, or trying to hit sub-millisecond latency, or wondering why your process is eating 8GB for no apparent reason. That’s when you need to actually understand the machinery.
1. Reference Counting
Every Python object has an ob_refcnt field — a simple integer tracking how many references point to it. When that counter hits zero, CPython immediately deallocates the object. No GC pass needed.
import sys
x = []
print(sys.getrefcount(x)) # 2, not 1 — why?
y = x
print(sys.getrefcount(x)) # 3
del y
print(sys.getrefcount(x)) # 2 again
The +1 mystery: getrefcount itself holds a temporary reference to the object while executing — so you always see count + 1. This is expected and correct.
When you do x = []:
- A list object is created,
ob_refcnt = 1 xbinds to it — but the call togetrefcount(x)passes the object, adding a second ref temporarily
ELI5: Think of
ob_refcntas a sticky note counter on a library book. When you borrow it, the number goes up. When you return it, it goes down. The moment it hits zero, the library throws the book away. The +1 fromgetrefcountis like the librarian picking it up to check the count — they’re also “holding” it for a second.
When Reference Counting Fails: Circular References
a = []
b = [a]
a.append(b) # a → b → a: cycle
del a
del b
# Both objects still alive! ob_refcnt never reaches 0.
Reference counting cannot handle this. a holds a ref to b, b holds a ref to a. Even after you delete both names, each object’s ob_refcnt is still 1. This is exactly why Python has a garbage collector on top of reference counting.
2. Garbage Collection
Python’s GC exists solely to handle circular references. It’s generational, and it tracks container objects (lists, dicts, sets, instances) — not scalars like ints and strings.
Generational GC
| Generation | Threshold (default) | Survives |
|---|---|---|
| 0 (young) | 700 objects | Most objects die here |
| 1 (middle) | Every 10 gen-0 runs | Objects that survived gen-0 |
| 2 (old) | Every 10 gen-1 runs | Long-lived objects |
import gc
print(gc.get_threshold()) # (700, 10, 10)
print(gc.get_count()) # current counts per generation
# Tune for long-lived servers where gen-0 cycles are rare
gc.set_threshold(1000, 15, 15)
The GC runs automatically when the number of tracked objects in gen-0 exceeds the threshold. It does a mark-and-sweep specifically looking for cycles.
When to Disable GC
import gc
gc.disable()
# ... your latency-critical code ...
gc.enable()
Instagram famously disabled GC in their Python workers and saw a ~10% memory reduction because COW (copy-on-write) pages stopped getting dirtied by GC bookkeeping. This only makes sense if you control object lifetimes carefully (no cycles, or you call gc.collect() at known safe points).
ELI5: Generational GC is like a sorting machine at the post office. New packages (gen-0) get checked first and most often — they’re likely to be sorted quickly. The ones that stick around go to a slower secondary sorting (gen-1, gen-2). Old long-term packages rarely need re-sorting.
gc.collect() — When to Force It
gc.collect(generation=0) # only gen-0
gc.collect() # full collection, all generations
Useful when: you’ve just finished a large batch job, deleted a massive data structure you know had cycles, or before a latency-sensitive window. Don’t call it in tight loops — it’s expensive.
__del__ Finalizer — Almost Never What You Want
class Leaky:
def __del__(self):
print("cleaning up")
a = Leaky()
b = Leaky()
a.other = b
b.other = a
del a, b
# __del__ may not get called at all if cyclic GC handles these
# In CPython ≤ 3.3, objects with __del__ were never collected if cyclic
In Python 3.4+ (PEP 442), objects with __del__ in cycles can be collected, but the order of __del__ calls is undefined. Use context managers (__enter__/__exit__) or weakref finalizers instead.
Common mistake: Using
__del__to close file handles or DB connections. If the object gets stuck in a cycle, cleanup never runs. Usewithstatements.
Weak References — Breaking Cycles Without GC Help
import weakref
class Node:
def __init__(self, name):
self.name = name
self.parent = None
parent = Node("parent")
child = Node("child")
child.parent = weakref.ref(parent) # weak ref — doesn't increment ob_refcnt
# Access the referent
p = child.parent() # call the weakref to get the object
if p is not None:
print(p.name) # "parent"
Weak references don’t bump ob_refcnt. When parent has no more strong references, it gets deallocated, and child.parent() returns None. This cleanly breaks parent-child cycles in trees, caches, observer patterns.
3. Object Interning & Small Object Optimization
CPython pre-allocates frequently used objects as singletons. This is an optimization, not a language guarantee — don’t write code that depends on it.
Integer Cache: -5 to 256
a = 100; b = 100
print(a is b) # True — same object
a = 1000; b = 1000
print(a is b) # False in most contexts (may be True in same code block)
Integers from -5 to 256 are pre-allocated at interpreter startup. Any int in that range is the exact same object. Outside that range, each assignment typically creates a new object.
Common mistake: Using
isto compare integers.x is 1may work for small ints but fails for large ones. Always use==.
String Interning
import sys
a = "hello"
b = "hello"
print(a is b) # True — CPython interns simple identifier-like strings
a = "hello world" # space → not always interned
b = "hello world"
print(a is b) # implementation-dependent
# Force interning when you need it
a = sys.intern("my_key_that_appears_millions_of_times")
b = sys.intern("my_key_that_appears_millions_of_times")
print(a is b) # True, guaranteed
CPython automatically interns strings that look like identifiers (alphanumeric + underscore). Useful for sys.intern() when you have a huge dict with repeated string keys — makes dict lookup faster (pointer equality vs string comparison).
The pymalloc Allocator
For objects ≤ 512 bytes, CPython uses its own allocator — pymalloc — instead of calling malloc for each object:
- Arena: 256KB chunk requested from the OS
- Pool: 4KB page within an arena, holds same-sized objects
- Block: single object slot within a pool
This dramatically reduces malloc overhead for small, frequently allocated objects (which is most Python objects). Objects > 512 bytes fall through to the system allocator.
Why sys.getsizeof() Lies
import sys
d = {"key": "value"}
print(sys.getsizeof(d)) # ~232 bytes — the dict object itself
# But the actual memory held is much more:
# the dict object + key string + value string + hash table internals
sys.getsizeof returns the size of the object’s own storage, not the transitive closure of everything it references. For a real size measurement use pympler or objgraph.
from pympler import asizeof
print(asizeof.asizeof(d)) # includes referenced objects
ELI5:
sys.getsizeofis like asking “how big is the box?” — it only measures the cardboard, not the stuff inside.asizeofweighs the whole package including contents.
4. Memory Profiling Tools
tracemalloc — Built-in, Line-Level Tracking
import tracemalloc
tracemalloc.start()
# ... your code ...
result = [dict(i=i) for i in range(10000)]
snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics("lineno")
for stat in top_stats[:5]:
print(stat)
# myfile.py:6: size=3.1 MiB, count=30001, average=108 B
Minimal overhead, ships with Python 3.4+. Best first tool to reach for.
memory_profiler — Line-by-Line Memory
# Decorate the function you want to profile
from memory_profiler import profile
@profile
def load_data():
data = []
for i in range(100000):
data.append({"id": i, "value": i * 2})
return data
Run with python -m memory_profiler script.py. Shows incremental memory change per line — very useful for finding the exact line where allocation spikes.
objgraph — Visualizing Reference Graphs
import objgraph
objgraph.show_most_common_types(limit=10)
# dict 12345
# list 8901
# MyClass 203
# Find what's holding references to an object
objgraph.show_backrefs(some_object, max_depth=3)
Excellent for diagnosing memory leaks — shows what object types are accumulating and what’s holding references to them.
Practical Memory Leak Workflow
- Run
objgraph.show_most_common_types()before and after suspect operation - Look for types whose count keeps growing across requests/calls
- Use
objgraph.show_backrefs()on a sample of the leaking objects - Use
tracemallocsnapshots withsnapshot.compare_to()to see the diff - Fix the root cause (usually: global collection, callback holding reference, circular ref without weak ref)
| Tool | Overhead | Granularity | When to Use |
|---|---|---|---|
tracemalloc | Low | File + line | First investigation |
memory_profiler | Medium | Line-by-line | Pinpoint allocation spike |
objgraph | Low | Object types | Finding what’s leaking |
pympler | Medium | Object + referents | Accurate size measurement |
5. CPU Profiling & Optimization
The Profiler Ladder
Start cheap, escalate only if needed:
# 1. cProfile — function-level, built-in
import cProfile
cProfile.run("my_function()", sort="cumulative")
# Or from CLI:
# python -m cProfile -s cumulative script.py | head -30
# 2. line_profiler — line-level (install: pip install line-profiler)
from line_profiler import LineProfiler
lp = LineProfiler()
lp_wrapper = lp(my_function)
lp_wrapper()
lp.print_stats()
# 3. py-spy — attach to running process, zero modification to code
py-spy top --pid 12345
py-spy record -o profile.svg --pid 12345
py-spy is the production tool — it samples the call stack of a running process without stopping it, no code changes required.
The Optimization Hierarchy
Do these in order. Most bugs live at the top.
| Priority | Level | Example |
|---|---|---|
| 1 | Algorithm | O(n²) → O(n log n) |
| 2 | Data structure | list scan → set lookup |
| 3 | Python tricks | List comp → generator, local var caching |
| 4 | C extension | numpy, pandas, ujson |
| 5 | Other runtime | Cython, PyPy, native extension |
ELI5: Optimizing Python at the “tricks” level before fixing an O(n²) algorithm is like polishing your car before changing flat tires. Fix the big stuff first.
Common Python Performance Antipatterns
# BAD: string concatenation in a loop — O(n²) copies
result = ""
for item in items:
result += str(item) # each += creates a new string object
# GOOD: join
result = "".join(str(item) for item in items)
# BAD: repeated dict/attribute lookup in hot loop
for i in range(1000000):
math.sqrt(i) # LOAD_GLOBAL math, then LOAD_ATTR sqrt — every iteration
# GOOD: cache the reference
sqrt = math.sqrt
for i in range(1000000):
sqrt(i) # LOAD_FAST — single bytecode op
# BAD: building full list when you only need iteration
total = sum([x * 2 for x in range(1000000)]) # materializes list first
# GOOD: generator — lazy, O(1) memory
total = sum(x * 2 for x in range(1000000))
Common mistake: Premature micro-optimization. Profile first. The bottleneck is almost never where you think it is.
6. Data Structure Performance
Time Complexity Reference
| Operation | list | dict | set | deque |
|---|---|---|---|---|
| Access by index | O(1) | — | — | O(n) |
Search / in | O(n) | O(1) avg | O(1) avg | O(n) |
| Append (end) | O(1) amortized | — | — | O(1) |
| Insert (front) | O(n) | — | — | O(1) |
| Delete by value | O(n) | O(1) avg | O(1) avg | O(n) |
| Union/intersect | — | — | O(min(m,n)) | — |
Use deque (from collections) when you need frequent head insertions/deletions. Use set for membership testing, not list.
Choosing the Right Array Type
| Type | Memory | Speed | Use When |
|---|---|---|---|
list | High (pointers + objects) | Fast for mixed types | General purpose |
array.array | Low (raw C types) | Fast for homogeneous | Typed numeric data, no numpy |
numpy.ndarray | Low + SIMD | Extremely fast | Numerical computation |
bytes/bytearray | Lowest | Fast | Binary data |
import array, sys
lst = list(range(10000))
arr = array.array('i', range(10000)) # 'i' = signed int
print(sys.getsizeof(lst)) # ~87624 bytes
print(sys.getsizeof(arr)) # ~40064 bytes — ~2x smaller
__slots__ Memory Savings
class WithoutSlots:
def __init__(self, x, y):
self.x = x
self.y = y
class WithSlots:
__slots__ = ('x', 'y')
def __init__(self, x, y):
self.x = x
self.y = y
import sys
a = WithoutSlots(1, 2)
b = WithSlots(1, 2)
print(sys.getsizeof(a)) # 48 bytes + 232 bytes for __dict__
print(sys.getsizeof(b)) # 56 bytes, no __dict__ at all
For classes that will be instantiated millions of times, __slots__ typically saves 40-60% memory per instance. Trade-off: no dynamic attributes, harder to pickle in some cases, breaks multiple inheritance in tricky ways.
Comparison: Named Tuples vs Dataclasses vs Dicts
| Type | Memory | Mutable | Access | Best For |
|---|---|---|---|---|
tuple | Smallest | No | Index only | Immutable records |
namedtuple | Small | No | Name or index | Read-only value objects |
dataclass | Medium (has __dict__) | Yes | Name | Mutable domain objects |
dataclass(slots=True) | Small | Yes | Name | High-volume mutable objects |
dict | Large | Yes | Key | Dynamic structure |
from dataclasses import dataclass
@dataclass(slots=True) # Python 3.10+
class Point:
x: float
y: float
ELI5:
__slots__tells Python “this class will only ever have these specific attributes” — so Python replaces the flexible dictionary with a fixed array of slots. Less flexible, much leaner.
7. Advanced Optimization Techniques
functools.lru_cache and functools.cache
from functools import lru_cache, cache
@cache # Python 3.9+ — unbounded cache, slightly faster than lru_cache
def fib(n):
if n < 2: return n
return fib(n-1) + fib(n-2)
@lru_cache(maxsize=1024) # bounded — won't grow forever
def expensive_lookup(key):
return database.query(key)
cache is lru_cache(maxsize=None) — use it when you know the key space is bounded. Use lru_cache with a maxsize for anything that could grow without bound.
Common mistake: Decorating with
@cacheon a method of a long-lived object. The cache holds a reference toself, which can prevent GC. Use@lru_cacheon module-level functions, or usefunctools.cached_propertyfor instance-level caching.
Generator vs List Comprehension Trade-off
# List comp: O(n) memory, fast random access, iterable multiple times
squares_list = [x**2 for x in range(10**6)] # ~8MB allocated upfront
# Generator: O(1) memory, single-pass only, lazy evaluation
squares_gen = (x**2 for x in range(10**6)) # near-zero memory upfront
Rule: if you only iterate once and don’t need indexing, use a generator. If you need to iterate multiple times or random access, materialize to a list.
LOAD_FAST vs LOAD_GLOBAL
import dis
def with_global():
return len([1, 2, 3])
def with_local():
_len = len # cache in local scope
return _len([1, 2, 3])
dis.dis(with_global)
# LOAD_GLOBAL 'len' — looks up in module globals dict first
dis.dis(with_local)
# LOAD_FAST '_len' — reads from local frame array by index
LOAD_FAST is 2-3x faster than LOAD_GLOBAL because locals are a C array (index lookup), globals are a dict (hash lookup). In tight loops calling the same global 10M times, caching it locally makes a measurable difference.
When to Escape Python
| Tool | When to Use | Effort |
|---|---|---|
numpy/pandas | Numerical/tabular operations | Low — just use it |
| Cython | Hot loop you own, type annotations available | Medium |
ctypes/cffi | Calling existing C library | Medium |
| Native C extension | Maximum control, Python API needed | High |
| PyPy | Existing codebase, can’t touch C extensions | Low (just run PyPy) |
Most Python performance problems are solved at the algorithm or data structure level long before reaching Cython.
Summary: Decision Table
| You see… | Likely cause | Tool to diagnose | Fix |
|---|---|---|---|
| Memory grows per request | Accumulating references, no cleanup | objgraph, tracemalloc | Fix leaking container, use weak refs |
| Circular refs not collected | Missing weak refs | gc.collect(); objgraph | Add weakref.ref for back-pointers |
| High GC pause latency | Frequent gen-0 sweeps | gc.callbacks, profiler | Tune threshold, gc.disable() at critical path |
| Slow dict/list lookups | Wrong data structure | cProfile | Use set for membership, deque for queue |
| High object count per class | No __slots__ | pympler.asizeof | Add __slots__ to hot classes |
| CPU hot function | Pure Python algorithm | cProfile → line_profiler | Algorithmic fix or numpy |
| Unexplained slowness in loop | Global lookups, string concat | line_profiler, dis | Cache locals, use join() |
| Need size of object graph | getsizeof lies | pympler.asizeof | Use asizeof.asizeof() |
The read order for these notes: understand the reference counting + GC machinery first (sections 1-2), then object interning (3) — those explain why things cost what they cost. Profiling tools (4-5) tell you where the costs are. Data structures and optimization techniques (6-7) are where you actually fix them.