Memory Management & Performance

12 min read 2552 words

Table of Contents

Memory Management & Performance

Memory Management & Performance

CPython manages memory so you don’t have to — until you’re debugging a production memory leak at 2am, or trying to hit sub-millisecond latency, or wondering why your process is eating 8GB for no apparent reason. That’s when you need to actually understand the machinery.

1. Reference Counting

Every Python object has an ob_refcnt field — a simple integer tracking how many references point to it. When that counter hits zero, CPython immediately deallocates the object. No GC pass needed.

import sys

x = []
print(sys.getrefcount(x))  # 2, not 1 — why?

y = x
print(sys.getrefcount(x))  # 3

del y
print(sys.getrefcount(x))  # 2 again

The +1 mystery: getrefcount itself holds a temporary reference to the object while executing — so you always see count + 1. This is expected and correct.

When you do x = []:

A list object is created, ob_refcnt = 1
x binds to it — but the call to getrefcount(x) passes the object, adding a second ref temporarily

ELI5: Think of ob_refcnt as a sticky note counter on a library book. When you borrow it, the number goes up. When you return it, it goes down. The moment it hits zero, the library throws the book away. The +1 from getrefcount is like the librarian picking it up to check the count — they’re also “holding” it for a second.

When Reference Counting Fails: Circular References

a = []
b = [a]
a.append(b)   # a → b → a: cycle

del a
del b
# Both objects still alive! ob_refcnt never reaches 0.

Reference counting cannot handle this. a holds a ref to b, b holds a ref to a. Even after you delete both names, each object’s ob_refcnt is still 1. This is exactly why Python has a garbage collector on top of reference counting.

2. Garbage Collection

Python’s GC exists solely to handle circular references. It’s generational, and it tracks container objects (lists, dicts, sets, instances) — not scalars like ints and strings.

Generational GC

Generation	Threshold (default)	Survives
0 (young)	700 objects	Most objects die here
1 (middle)	Every 10 gen-0 runs	Objects that survived gen-0
2 (old)	Every 10 gen-1 runs	Long-lived objects

import gc

print(gc.get_threshold())  # (700, 10, 10)
print(gc.get_count())      # current counts per generation

# Tune for long-lived servers where gen-0 cycles are rare
gc.set_threshold(1000, 15, 15)

The GC runs automatically when the number of tracked objects in gen-0 exceeds the threshold. It does a mark-and-sweep specifically looking for cycles.

When to Disable GC

import gc

gc.disable()
# ... your latency-critical code ...
gc.enable()

Instagram famously disabled GC in their Python workers and saw a ~10% memory reduction because COW (copy-on-write) pages stopped getting dirtied by GC bookkeeping. This only makes sense if you control object lifetimes carefully (no cycles, or you call gc.collect() at known safe points).

ELI5: Generational GC is like a sorting machine at the post office. New packages (gen-0) get checked first and most often — they’re likely to be sorted quickly. The ones that stick around go to a slower secondary sorting (gen-1, gen-2). Old long-term packages rarely need re-sorting.

`gc.collect()` — When to Force It

gc.collect(generation=0)  # only gen-0
gc.collect()               # full collection, all generations

Useful when: you’ve just finished a large batch job, deleted a massive data structure you know had cycles, or before a latency-sensitive window. Don’t call it in tight loops — it’s expensive.

`del` Finalizer — Almost Never What You Want

class Leaky:
    def __del__(self):
        print("cleaning up")

a = Leaky()
b = Leaky()
a.other = b
b.other = a
del a, b
# __del__ may not get called at all if cyclic GC handles these
# In CPython ≤ 3.3, objects with __del__ were never collected if cyclic

In Python 3.4+ (PEP 442), objects with __del__ in cycles can be collected, but the order of __del__ calls is undefined. Use context managers (__enter__/__exit__) or weakref finalizers instead.

Common mistake: Using __del__ to close file handles or DB connections. If the object gets stuck in a cycle, cleanup never runs. Use with statements.

Weak References — Breaking Cycles Without GC Help

import weakref

class Node:
    def __init__(self, name):
        self.name = name
        self.parent = None

parent = Node("parent")
child = Node("child")
child.parent = weakref.ref(parent)  # weak ref — doesn't increment ob_refcnt

# Access the referent
p = child.parent()  # call the weakref to get the object
if p is not None:
    print(p.name)   # "parent"

Weak references don’t bump ob_refcnt. When parent has no more strong references, it gets deallocated, and child.parent() returns None. This cleanly breaks parent-child cycles in trees, caches, observer patterns.

3. Object Interning & Small Object Optimization

CPython pre-allocates frequently used objects as singletons. This is an optimization, not a language guarantee — don’t write code that depends on it.

Integer Cache: -5 to 256

a = 100; b = 100
print(a is b)   # True — same object

a = 1000; b = 1000
print(a is b)   # False in most contexts (may be True in same code block)

Integers from -5 to 256 are pre-allocated at interpreter startup. Any int in that range is the exact same object. Outside that range, each assignment typically creates a new object.

Common mistake: Using is to compare integers. x is 1 may work for small ints but fails for large ones. Always use ==.

String Interning

import sys

a = "hello"
b = "hello"
print(a is b)   # True — CPython interns simple identifier-like strings

a = "hello world"  # space → not always interned
b = "hello world"
print(a is b)   # implementation-dependent

# Force interning when you need it
a = sys.intern("my_key_that_appears_millions_of_times")
b = sys.intern("my_key_that_appears_millions_of_times")
print(a is b)   # True, guaranteed

CPython automatically interns strings that look like identifiers (alphanumeric + underscore). Useful for sys.intern() when you have a huge dict with repeated string keys — makes dict lookup faster (pointer equality vs string comparison).

The pymalloc Allocator

For objects ≤ 512 bytes, CPython uses its own allocator — pymalloc — instead of calling malloc for each object:

Arena: 256KB chunk requested from the OS
Pool: 4KB page within an arena, holds same-sized objects
Block: single object slot within a pool

This dramatically reduces malloc overhead for small, frequently allocated objects (which is most Python objects). Objects > 512 bytes fall through to the system allocator.

Why `sys.getsizeof()` Lies

import sys

d = {"key": "value"}
print(sys.getsizeof(d))   # ~232 bytes — the dict object itself

# But the actual memory held is much more:
# the dict object + key string + value string + hash table internals

sys.getsizeof returns the size of the object’s own storage, not the transitive closure of everything it references. For a real size measurement use pympler or objgraph.

from pympler import asizeof
print(asizeof.asizeof(d))  # includes referenced objects

ELI5: sys.getsizeof is like asking “how big is the box?” — it only measures the cardboard, not the stuff inside. asizeof weighs the whole package including contents.

4. Memory Profiling Tools

`tracemalloc` — Built-in, Line-Level Tracking

import tracemalloc

tracemalloc.start()

# ... your code ...
result = [dict(i=i) for i in range(10000)]

snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics("lineno")

for stat in top_stats[:5]:
    print(stat)
# myfile.py:6: size=3.1 MiB, count=30001, average=108 B

Minimal overhead, ships with Python 3.4+. Best first tool to reach for.

`memory_profiler` — Line-by-Line Memory

# Decorate the function you want to profile
from memory_profiler import profile

@profile
def load_data():
    data = []
    for i in range(100000):
        data.append({"id": i, "value": i * 2})
    return data

Run with python -m memory_profiler script.py. Shows incremental memory change per line — very useful for finding the exact line where allocation spikes.

`objgraph` — Visualizing Reference Graphs

import objgraph

objgraph.show_most_common_types(limit=10)
# dict          12345
# list           8901
# MyClass         203

# Find what's holding references to an object
objgraph.show_backrefs(some_object, max_depth=3)

Excellent for diagnosing memory leaks — shows what object types are accumulating and what’s holding references to them.

Practical Memory Leak Workflow

Run objgraph.show_most_common_types() before and after suspect operation
Look for types whose count keeps growing across requests/calls
Use objgraph.show_backrefs() on a sample of the leaking objects
Use tracemalloc snapshots with snapshot.compare_to() to see the diff
Fix the root cause (usually: global collection, callback holding reference, circular ref without weak ref)

Tool	Overhead	Granularity	When to Use
`tracemalloc`	Low	File + line	First investigation
`memory_profiler`	Medium	Line-by-line	Pinpoint allocation spike
`objgraph`	Low	Object types	Finding what’s leaking
`pympler`	Medium	Object + referents	Accurate size measurement

5. CPU Profiling & Optimization

The Profiler Ladder

Start cheap, escalate only if needed:

# 1. cProfile — function-level, built-in
import cProfile
cProfile.run("my_function()", sort="cumulative")

# Or from CLI:
# python -m cProfile -s cumulative script.py | head -30

# 2. line_profiler — line-level (install: pip install line-profiler)
from line_profiler import LineProfiler

lp = LineProfiler()
lp_wrapper = lp(my_function)
lp_wrapper()
lp.print_stats()

# 3. py-spy — attach to running process, zero modification to code
py-spy top --pid 12345
py-spy record -o profile.svg --pid 12345

py-spy is the production tool — it samples the call stack of a running process without stopping it, no code changes required.

The Optimization Hierarchy

Do these in order. Most bugs live at the top.

Priority	Level	Example
1	Algorithm	O(n²) → O(n log n)
2	Data structure	`list` scan → `set` lookup
3	Python tricks	List comp → generator, local var caching
4	C extension	`numpy`, `pandas`, `ujson`
5	Other runtime	Cython, PyPy, native extension

ELI5: Optimizing Python at the “tricks” level before fixing an O(n²) algorithm is like polishing your car before changing flat tires. Fix the big stuff first.

Common Python Performance Antipatterns

# BAD: string concatenation in a loop — O(n²) copies
result = ""
for item in items:
    result += str(item)   # each += creates a new string object

# GOOD: join
result = "".join(str(item) for item in items)

# BAD: repeated dict/attribute lookup in hot loop
for i in range(1000000):
    math.sqrt(i)   # LOAD_GLOBAL math, then LOAD_ATTR sqrt — every iteration

# GOOD: cache the reference
sqrt = math.sqrt
for i in range(1000000):
    sqrt(i)        # LOAD_FAST — single bytecode op

# BAD: building full list when you only need iteration
total = sum([x * 2 for x in range(1000000)])   # materializes list first

# GOOD: generator — lazy, O(1) memory
total = sum(x * 2 for x in range(1000000))

Common mistake: Premature micro-optimization. Profile first. The bottleneck is almost never where you think it is.

6. Data Structure Performance

Time Complexity Reference

Operation	list	dict	set	deque
Access by index	O(1)	—	—	O(n)
Search / `in`	O(n)	O(1) avg	O(1) avg	O(n)
Append (end)	O(1) amortized	—	—	O(1)
Insert (front)	O(n)	—	—	O(1)
Delete by value	O(n)	O(1) avg	O(1) avg	O(n)
Union/intersect	—	—	O(min(m,n))	—

Use deque (from collections) when you need frequent head insertions/deletions. Use set for membership testing, not list.

Choosing the Right Array Type

Type	Memory	Speed	Use When
`list`	High (pointers + objects)	Fast for mixed types	General purpose
`array.array`	Low (raw C types)	Fast for homogeneous	Typed numeric data, no numpy
`numpy.ndarray`	Low + SIMD	Extremely fast	Numerical computation
`bytes`/`bytearray`	Lowest	Fast	Binary data

import array, sys

lst = list(range(10000))
arr = array.array('i', range(10000))   # 'i' = signed int

print(sys.getsizeof(lst))  # ~87624 bytes
print(sys.getsizeof(arr))  # ~40064 bytes  — ~2x smaller

`slots` Memory Savings

class WithoutSlots:
    def __init__(self, x, y):
        self.x = x
        self.y = y

class WithSlots:
    __slots__ = ('x', 'y')
    def __init__(self, x, y):
        self.x = x
        self.y = y

import sys
a = WithoutSlots(1, 2)
b = WithSlots(1, 2)
print(sys.getsizeof(a))  # 48 bytes + 232 bytes for __dict__
print(sys.getsizeof(b))  # 56 bytes, no __dict__ at all

For classes that will be instantiated millions of times, __slots__ typically saves 40-60% memory per instance. Trade-off: no dynamic attributes, harder to pickle in some cases, breaks multiple inheritance in tricky ways.

Comparison: Named Tuples vs Dataclasses vs Dicts

Type	Memory	Mutable	Access	Best For
`tuple`	Smallest	No	Index only	Immutable records
`namedtuple`	Small	No	Name or index	Read-only value objects
`dataclass`	Medium (has `__dict__`)	Yes	Name	Mutable domain objects
`dataclass(slots=True)`	Small	Yes	Name	High-volume mutable objects
`dict`	Large	Yes	Key	Dynamic structure

from dataclasses import dataclass

@dataclass(slots=True)   # Python 3.10+
class Point:
    x: float
    y: float

ELI5: __slots__ tells Python “this class will only ever have these specific attributes” — so Python replaces the flexible dictionary with a fixed array of slots. Less flexible, much leaner.

7. Advanced Optimization Techniques

`functools.lru_cache` and `functools.cache`

from functools import lru_cache, cache

@cache   # Python 3.9+ — unbounded cache, slightly faster than lru_cache
def fib(n):
    if n < 2: return n
    return fib(n-1) + fib(n-2)

@lru_cache(maxsize=1024)  # bounded — won't grow forever
def expensive_lookup(key):
    return database.query(key)

cache is lru_cache(maxsize=None) — use it when you know the key space is bounded. Use lru_cache with a maxsize for anything that could grow without bound.

Common mistake: Decorating with @cache on a method of a long-lived object. The cache holds a reference to self, which can prevent GC. Use @lru_cache on module-level functions, or use functools.cached_property for instance-level caching.

Generator vs List Comprehension Trade-off

# List comp: O(n) memory, fast random access, iterable multiple times
squares_list = [x**2 for x in range(10**6)]    # ~8MB allocated upfront

# Generator: O(1) memory, single-pass only, lazy evaluation
squares_gen = (x**2 for x in range(10**6))     # near-zero memory upfront

Rule: if you only iterate once and don’t need indexing, use a generator. If you need to iterate multiple times or random access, materialize to a list.

LOAD_FAST vs LOAD_GLOBAL

import dis

def with_global():
    return len([1, 2, 3])

def with_local():
    _len = len   # cache in local scope
    return _len([1, 2, 3])

dis.dis(with_global)
# LOAD_GLOBAL 'len'    — looks up in module globals dict first
dis.dis(with_local)
# LOAD_FAST '_len'     — reads from local frame array by index

LOAD_FAST is 2-3x faster than LOAD_GLOBAL because locals are a C array (index lookup), globals are a dict (hash lookup). In tight loops calling the same global 10M times, caching it locally makes a measurable difference.

When to Escape Python

Tool	When to Use	Effort
`numpy`/`pandas`	Numerical/tabular operations	Low — just use it
Cython	Hot loop you own, type annotations available	Medium
`ctypes`/`cffi`	Calling existing C library	Medium
Native C extension	Maximum control, Python API needed	High
PyPy	Existing codebase, can’t touch C extensions	Low (just run PyPy)

Most Python performance problems are solved at the algorithm or data structure level long before reaching Cython.

Summary: Decision Table

You see…	Likely cause	Tool to diagnose	Fix
Memory grows per request	Accumulating references, no cleanup	`objgraph`, `tracemalloc`	Fix leaking container, use weak refs
Circular refs not collected	Missing weak refs	`gc.collect(); objgraph`	Add `weakref.ref` for back-pointers
High GC pause latency	Frequent gen-0 sweeps	`gc.callbacks`, profiler	Tune threshold, `gc.disable()` at critical path
Slow dict/list lookups	Wrong data structure	`cProfile`	Use `set` for membership, `deque` for queue
High object count per class	No `__slots__`	`pympler.asizeof`	Add `__slots__` to hot classes
CPU hot function	Pure Python algorithm	`cProfile` → `line_profiler`	Algorithmic fix or numpy
Unexplained slowness in loop	Global lookups, string concat	`line_profiler`, `dis`	Cache locals, use `join()`
Need size of object graph	`getsizeof` lies	`pympler.asizeof`	Use `asizeof.asizeof()`

The read order for these notes: understand the reference counting + GC machinery first (sections 1-2), then object interning (3) — those explain why things cost what they cost. Profiling tools (4-5) tell you where the costs are. Data structures and optimization techniques (6-7) are where you actually fix them.

Memory Management & Performance#

1. Reference Counting#

When Reference Counting Fails: Circular References#

2. Garbage Collection#

Generational GC#

When to Disable GC#

gc.collect() — When to Force It#

__del__ Finalizer — Almost Never What You Want#

Weak References — Breaking Cycles Without GC Help#

3. Object Interning & Small Object Optimization#

Integer Cache: -5 to 256#

String Interning#

The pymalloc Allocator#

Why sys.getsizeof() Lies#

4. Memory Profiling Tools#

tracemalloc — Built-in, Line-Level Tracking#

memory_profiler — Line-by-Line Memory#

objgraph — Visualizing Reference Graphs#

Practical Memory Leak Workflow#

5. CPU Profiling & Optimization#

The Profiler Ladder#

The Optimization Hierarchy#

Common Python Performance Antipatterns#

6. Data Structure Performance#

Time Complexity Reference#

Choosing the Right Array Type#

__slots__ Memory Savings#

Comparison: Named Tuples vs Dataclasses vs Dicts#

7. Advanced Optimization Techniques#

functools.lru_cache and functools.cache#

Generator vs List Comprehension Trade-off#

LOAD_FAST vs LOAD_GLOBAL#

When to Escape Python#

Summary: Decision Table#

Memory Management & Performance

1. Reference Counting

When Reference Counting Fails: Circular References

2. Garbage Collection

Generational GC

When to Disable GC

`gc.collect()` — When to Force It

`del` Finalizer — Almost Never What You Want

Weak References — Breaking Cycles Without GC Help

3. Object Interning & Small Object Optimization

Integer Cache: -5 to 256

String Interning

The pymalloc Allocator

Why `sys.getsizeof()` Lies

4. Memory Profiling Tools

`tracemalloc` — Built-in, Line-Level Tracking

`memory_profiler` — Line-by-Line Memory

`objgraph` — Visualizing Reference Graphs

Practical Memory Leak Workflow

5. CPU Profiling & Optimization

The Profiler Ladder

The Optimization Hierarchy

Common Python Performance Antipatterns

6. Data Structure Performance

Time Complexity Reference

Choosing the Right Array Type

`slots` Memory Savings

Comparison: Named Tuples vs Dataclasses vs Dicts

7. Advanced Optimization Techniques

`functools.lru_cache` and `functools.cache`

Generator vs List Comprehension Trade-off

LOAD_FAST vs LOAD_GLOBAL

When to Escape Python

Summary: Decision Table