← Python Mastery — Senior to Principal

Memory Management & Performance

Memory Management & Performance

CPython manages memory so you don’t have to — until you’re debugging a production memory leak at 2am, or trying to hit sub-millisecond latency, or wondering why your process is eating 8GB for no apparent reason. That’s when you need to actually understand the machinery.


1. Reference Counting

Every Python object has an ob_refcnt field — a simple integer tracking how many references point to it. When that counter hits zero, CPython immediately deallocates the object. No GC pass needed.

import sys

x = []
print(sys.getrefcount(x))  # 2, not 1 — why?

y = x
print(sys.getrefcount(x))  # 3

del y
print(sys.getrefcount(x))  # 2 again

The +1 mystery: getrefcount itself holds a temporary reference to the object while executing — so you always see count + 1. This is expected and correct.

When you do x = []:

  1. A list object is created, ob_refcnt = 1
  2. x binds to it — but the call to getrefcount(x) passes the object, adding a second ref temporarily

ELI5: Think of ob_refcnt as a sticky note counter on a library book. When you borrow it, the number goes up. When you return it, it goes down. The moment it hits zero, the library throws the book away. The +1 from getrefcount is like the librarian picking it up to check the count — they’re also “holding” it for a second.

When Reference Counting Fails: Circular References

a = []
b = [a]
a.append(b)   # a → b → a: cycle

del a
del b
# Both objects still alive! ob_refcnt never reaches 0.

Reference counting cannot handle this. a holds a ref to b, b holds a ref to a. Even after you delete both names, each object’s ob_refcnt is still 1. This is exactly why Python has a garbage collector on top of reference counting.


2. Garbage Collection

Python’s GC exists solely to handle circular references. It’s generational, and it tracks container objects (lists, dicts, sets, instances) — not scalars like ints and strings.

Generational GC

GenerationThreshold (default)Survives
0 (young)700 objectsMost objects die here
1 (middle)Every 10 gen-0 runsObjects that survived gen-0
2 (old)Every 10 gen-1 runsLong-lived objects
import gc

print(gc.get_threshold())  # (700, 10, 10)
print(gc.get_count())      # current counts per generation

# Tune for long-lived servers where gen-0 cycles are rare
gc.set_threshold(1000, 15, 15)

The GC runs automatically when the number of tracked objects in gen-0 exceeds the threshold. It does a mark-and-sweep specifically looking for cycles.

When to Disable GC

import gc

gc.disable()
# ... your latency-critical code ...
gc.enable()

Instagram famously disabled GC in their Python workers and saw a ~10% memory reduction because COW (copy-on-write) pages stopped getting dirtied by GC bookkeeping. This only makes sense if you control object lifetimes carefully (no cycles, or you call gc.collect() at known safe points).

ELI5: Generational GC is like a sorting machine at the post office. New packages (gen-0) get checked first and most often — they’re likely to be sorted quickly. The ones that stick around go to a slower secondary sorting (gen-1, gen-2). Old long-term packages rarely need re-sorting.

gc.collect() — When to Force It

gc.collect(generation=0)  # only gen-0
gc.collect()               # full collection, all generations

Useful when: you’ve just finished a large batch job, deleted a massive data structure you know had cycles, or before a latency-sensitive window. Don’t call it in tight loops — it’s expensive.

__del__ Finalizer — Almost Never What You Want

class Leaky:
    def __del__(self):
        print("cleaning up")

a = Leaky()
b = Leaky()
a.other = b
b.other = a
del a, b
# __del__ may not get called at all if cyclic GC handles these
# In CPython ≤ 3.3, objects with __del__ were never collected if cyclic

In Python 3.4+ (PEP 442), objects with __del__ in cycles can be collected, but the order of __del__ calls is undefined. Use context managers (__enter__/__exit__) or weakref finalizers instead.

Common mistake: Using __del__ to close file handles or DB connections. If the object gets stuck in a cycle, cleanup never runs. Use with statements.

Weak References — Breaking Cycles Without GC Help

import weakref

class Node:
    def __init__(self, name):
        self.name = name
        self.parent = None

parent = Node("parent")
child = Node("child")
child.parent = weakref.ref(parent)  # weak ref — doesn't increment ob_refcnt

# Access the referent
p = child.parent()  # call the weakref to get the object
if p is not None:
    print(p.name)   # "parent"

Weak references don’t bump ob_refcnt. When parent has no more strong references, it gets deallocated, and child.parent() returns None. This cleanly breaks parent-child cycles in trees, caches, observer patterns.


3. Object Interning & Small Object Optimization

CPython pre-allocates frequently used objects as singletons. This is an optimization, not a language guarantee — don’t write code that depends on it.

Integer Cache: -5 to 256

a = 100; b = 100
print(a is b)   # True — same object

a = 1000; b = 1000
print(a is b)   # False in most contexts (may be True in same code block)

Integers from -5 to 256 are pre-allocated at interpreter startup. Any int in that range is the exact same object. Outside that range, each assignment typically creates a new object.

Common mistake: Using is to compare integers. x is 1 may work for small ints but fails for large ones. Always use ==.

String Interning

import sys

a = "hello"
b = "hello"
print(a is b)   # True — CPython interns simple identifier-like strings

a = "hello world"  # space → not always interned
b = "hello world"
print(a is b)   # implementation-dependent

# Force interning when you need it
a = sys.intern("my_key_that_appears_millions_of_times")
b = sys.intern("my_key_that_appears_millions_of_times")
print(a is b)   # True, guaranteed

CPython automatically interns strings that look like identifiers (alphanumeric + underscore). Useful for sys.intern() when you have a huge dict with repeated string keys — makes dict lookup faster (pointer equality vs string comparison).

The pymalloc Allocator

For objects ≤ 512 bytes, CPython uses its own allocator — pymalloc — instead of calling malloc for each object:

  • Arena: 256KB chunk requested from the OS
  • Pool: 4KB page within an arena, holds same-sized objects
  • Block: single object slot within a pool

This dramatically reduces malloc overhead for small, frequently allocated objects (which is most Python objects). Objects > 512 bytes fall through to the system allocator.

Why sys.getsizeof() Lies

import sys

d = {"key": "value"}
print(sys.getsizeof(d))   # ~232 bytes — the dict object itself

# But the actual memory held is much more:
# the dict object + key string + value string + hash table internals

sys.getsizeof returns the size of the object’s own storage, not the transitive closure of everything it references. For a real size measurement use pympler or objgraph.

from pympler import asizeof
print(asizeof.asizeof(d))  # includes referenced objects

ELI5: sys.getsizeof is like asking “how big is the box?” — it only measures the cardboard, not the stuff inside. asizeof weighs the whole package including contents.


4. Memory Profiling Tools

tracemalloc — Built-in, Line-Level Tracking

import tracemalloc

tracemalloc.start()

# ... your code ...
result = [dict(i=i) for i in range(10000)]

snapshot = tracemalloc.take_snapshot()
top_stats = snapshot.statistics("lineno")

for stat in top_stats[:5]:
    print(stat)
# myfile.py:6: size=3.1 MiB, count=30001, average=108 B

Minimal overhead, ships with Python 3.4+. Best first tool to reach for.

memory_profiler — Line-by-Line Memory

# Decorate the function you want to profile
from memory_profiler import profile

@profile
def load_data():
    data = []
    for i in range(100000):
        data.append({"id": i, "value": i * 2})
    return data

Run with python -m memory_profiler script.py. Shows incremental memory change per line — very useful for finding the exact line where allocation spikes.

objgraph — Visualizing Reference Graphs

import objgraph

objgraph.show_most_common_types(limit=10)
# dict          12345
# list           8901
# MyClass         203

# Find what's holding references to an object
objgraph.show_backrefs(some_object, max_depth=3)

Excellent for diagnosing memory leaks — shows what object types are accumulating and what’s holding references to them.

Practical Memory Leak Workflow

  1. Run objgraph.show_most_common_types() before and after suspect operation
  2. Look for types whose count keeps growing across requests/calls
  3. Use objgraph.show_backrefs() on a sample of the leaking objects
  4. Use tracemalloc snapshots with snapshot.compare_to() to see the diff
  5. Fix the root cause (usually: global collection, callback holding reference, circular ref without weak ref)
ToolOverheadGranularityWhen to Use
tracemallocLowFile + lineFirst investigation
memory_profilerMediumLine-by-linePinpoint allocation spike
objgraphLowObject typesFinding what’s leaking
pymplerMediumObject + referentsAccurate size measurement

5. CPU Profiling & Optimization

The Profiler Ladder

Start cheap, escalate only if needed:

# 1. cProfile — function-level, built-in
import cProfile
cProfile.run("my_function()", sort="cumulative")

# Or from CLI:
# python -m cProfile -s cumulative script.py | head -30
# 2. line_profiler — line-level (install: pip install line-profiler)
from line_profiler import LineProfiler

lp = LineProfiler()
lp_wrapper = lp(my_function)
lp_wrapper()
lp.print_stats()
# 3. py-spy — attach to running process, zero modification to code
py-spy top --pid 12345
py-spy record -o profile.svg --pid 12345

py-spy is the production tool — it samples the call stack of a running process without stopping it, no code changes required.

The Optimization Hierarchy

Do these in order. Most bugs live at the top.

PriorityLevelExample
1AlgorithmO(n²) → O(n log n)
2Data structurelist scan → set lookup
3Python tricksList comp → generator, local var caching
4C extensionnumpy, pandas, ujson
5Other runtimeCython, PyPy, native extension

ELI5: Optimizing Python at the “tricks” level before fixing an O(n²) algorithm is like polishing your car before changing flat tires. Fix the big stuff first.

Common Python Performance Antipatterns

# BAD: string concatenation in a loop — O(n²) copies
result = ""
for item in items:
    result += str(item)   # each += creates a new string object

# GOOD: join
result = "".join(str(item) for item in items)
# BAD: repeated dict/attribute lookup in hot loop
for i in range(1000000):
    math.sqrt(i)   # LOAD_GLOBAL math, then LOAD_ATTR sqrt — every iteration

# GOOD: cache the reference
sqrt = math.sqrt
for i in range(1000000):
    sqrt(i)        # LOAD_FAST — single bytecode op
# BAD: building full list when you only need iteration
total = sum([x * 2 for x in range(1000000)])   # materializes list first

# GOOD: generator — lazy, O(1) memory
total = sum(x * 2 for x in range(1000000))

Common mistake: Premature micro-optimization. Profile first. The bottleneck is almost never where you think it is.


6. Data Structure Performance

Time Complexity Reference

Operationlistdictsetdeque
Access by indexO(1)O(n)
Search / inO(n)O(1) avgO(1) avgO(n)
Append (end)O(1) amortizedO(1)
Insert (front)O(n)O(1)
Delete by valueO(n)O(1) avgO(1) avgO(n)
Union/intersectO(min(m,n))

Use deque (from collections) when you need frequent head insertions/deletions. Use set for membership testing, not list.

Choosing the Right Array Type

TypeMemorySpeedUse When
listHigh (pointers + objects)Fast for mixed typesGeneral purpose
array.arrayLow (raw C types)Fast for homogeneousTyped numeric data, no numpy
numpy.ndarrayLow + SIMDExtremely fastNumerical computation
bytes/bytearrayLowestFastBinary data
import array, sys

lst = list(range(10000))
arr = array.array('i', range(10000))   # 'i' = signed int

print(sys.getsizeof(lst))  # ~87624 bytes
print(sys.getsizeof(arr))  # ~40064 bytes  — ~2x smaller

__slots__ Memory Savings

class WithoutSlots:
    def __init__(self, x, y):
        self.x = x
        self.y = y

class WithSlots:
    __slots__ = ('x', 'y')
    def __init__(self, x, y):
        self.x = x
        self.y = y

import sys
a = WithoutSlots(1, 2)
b = WithSlots(1, 2)
print(sys.getsizeof(a))  # 48 bytes + 232 bytes for __dict__
print(sys.getsizeof(b))  # 56 bytes, no __dict__ at all

For classes that will be instantiated millions of times, __slots__ typically saves 40-60% memory per instance. Trade-off: no dynamic attributes, harder to pickle in some cases, breaks multiple inheritance in tricky ways.

Comparison: Named Tuples vs Dataclasses vs Dicts

TypeMemoryMutableAccessBest For
tupleSmallestNoIndex onlyImmutable records
namedtupleSmallNoName or indexRead-only value objects
dataclassMedium (has __dict__)YesNameMutable domain objects
dataclass(slots=True)SmallYesNameHigh-volume mutable objects
dictLargeYesKeyDynamic structure
from dataclasses import dataclass

@dataclass(slots=True)   # Python 3.10+
class Point:
    x: float
    y: float

ELI5: __slots__ tells Python “this class will only ever have these specific attributes” — so Python replaces the flexible dictionary with a fixed array of slots. Less flexible, much leaner.


7. Advanced Optimization Techniques

functools.lru_cache and functools.cache

from functools import lru_cache, cache

@cache   # Python 3.9+ — unbounded cache, slightly faster than lru_cache
def fib(n):
    if n < 2: return n
    return fib(n-1) + fib(n-2)

@lru_cache(maxsize=1024)  # bounded — won't grow forever
def expensive_lookup(key):
    return database.query(key)

cache is lru_cache(maxsize=None) — use it when you know the key space is bounded. Use lru_cache with a maxsize for anything that could grow without bound.

Common mistake: Decorating with @cache on a method of a long-lived object. The cache holds a reference to self, which can prevent GC. Use @lru_cache on module-level functions, or use functools.cached_property for instance-level caching.

Generator vs List Comprehension Trade-off

# List comp: O(n) memory, fast random access, iterable multiple times
squares_list = [x**2 for x in range(10**6)]    # ~8MB allocated upfront

# Generator: O(1) memory, single-pass only, lazy evaluation
squares_gen = (x**2 for x in range(10**6))     # near-zero memory upfront

Rule: if you only iterate once and don’t need indexing, use a generator. If you need to iterate multiple times or random access, materialize to a list.

LOAD_FAST vs LOAD_GLOBAL

import dis

def with_global():
    return len([1, 2, 3])

def with_local():
    _len = len   # cache in local scope
    return _len([1, 2, 3])

dis.dis(with_global)
# LOAD_GLOBAL 'len'    — looks up in module globals dict first
dis.dis(with_local)
# LOAD_FAST '_len'     — reads from local frame array by index

LOAD_FAST is 2-3x faster than LOAD_GLOBAL because locals are a C array (index lookup), globals are a dict (hash lookup). In tight loops calling the same global 10M times, caching it locally makes a measurable difference.

When to Escape Python

ToolWhen to UseEffort
numpy/pandasNumerical/tabular operationsLow — just use it
CythonHot loop you own, type annotations availableMedium
ctypes/cffiCalling existing C libraryMedium
Native C extensionMaximum control, Python API neededHigh
PyPyExisting codebase, can’t touch C extensionsLow (just run PyPy)

Most Python performance problems are solved at the algorithm or data structure level long before reaching Cython.


Summary: Decision Table

You see…Likely causeTool to diagnoseFix
Memory grows per requestAccumulating references, no cleanupobjgraph, tracemallocFix leaking container, use weak refs
Circular refs not collectedMissing weak refsgc.collect(); objgraphAdd weakref.ref for back-pointers
High GC pause latencyFrequent gen-0 sweepsgc.callbacks, profilerTune threshold, gc.disable() at critical path
Slow dict/list lookupsWrong data structurecProfileUse set for membership, deque for queue
High object count per classNo __slots__pympler.asizeofAdd __slots__ to hot classes
CPU hot functionPure Python algorithmcProfileline_profilerAlgorithmic fix or numpy
Unexplained slowness in loopGlobal lookups, string concatline_profiler, disCache locals, use join()
Need size of object graphgetsizeof liespympler.asizeofUse asizeof.asizeof()

The read order for these notes: understand the reference counting + GC machinery first (sections 1-2), then object interning (3) — those explain why things cost what they cost. Profiling tools (4-5) tell you where the costs are. Data structures and optimization techniques (6-7) are where you actually fix them.