CPython Internals
Most Python engineers treat the interpreter as a black box — code goes in, results come out. That works until it doesn’t: a performance regression you can’t explain, a threading bug that makes no sense, a memory spike with no obvious cause. Understanding CPython’s internals turns those mysteries into five-minute diagnoses.
1. How Python Executes Code
The pipeline is: source text → tokens → AST → bytecode → execution. Running a script always goes through the whole pipeline. Importing a module does too, unless a fresh .pyc cache lets Python skip straight from the cached bytecode to execution.
source.py
│
▼ tokenize (lexer)
[TOKENS] — keywords, names, literals, operators
│
▼ parse (parser)
[AST] — Abstract Syntax Tree (tree of nodes)
│
▼ compile (compiler + peephole optimizer)
[BYTECODE] — sequence of opcodes for the VM
│
▼ execute (ceval.c evaluation loop)
[RESULT]
“Compiled AND interpreted” is not a contradiction — Python compiles to bytecode (compilation step), then a software interpreter executes that bytecode (interpretation step). It’s compiled to an intermediate form, not to machine code like C.
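The stages in the diagram can be driven by hand from the standard library. The following is a minimal sketch, not what the import machinery literally runs: the one-line source string and the `<demo>` filename are made up for illustration, and `tokenize`, `ast.parse`, `compile`, and `exec` stand in for the internal C code.

```python
import ast
import io
import tokenize

source = "x = 1 + 2\n"  # hypothetical one-line module

# Stage 1: source text -> tokens
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))

# Stage 2: source text -> AST (ast.parse runs the tokenizer and parser together)
tree = ast.parse(source, filename="<demo>")

# Stage 3: AST -> code object holding the bytecode
code = compile(tree, "<demo>", "exec")

# Stage 4: execute the bytecode in a fresh namespace
namespace = {}
exec(code, namespace)

visible_tokens = [tok.string for tok in tokens if tok.string.strip()]
print(visible_tokens)
print(type(tree).__name__, type(code).__name__, namespace["x"])
```

Each intermediate value is an ordinary Python object, so any stage can be inspected or swapped out, which is how linters and code-rewriting tools hook into the pipeline.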
The compilation happens at import time, not at some separate build step. When you import foo, Python compiles foo.py right then, caches the bytecode to __pycache__/foo.cpython-312.pyc, and runs it. On next import, if the .pyc is fresh (mtime + size match), compilation is skipped.
ELI5: It’s like a chef who writes a recipe in French, then a translator converts it to a standard kitchen notation card. The kitchen staff read and execute the notation card. Next time, they reuse the same card unless the French recipe changed. Python’s “notation card” format is bytecode; the .pyc file is the saved card.
What’s in a .pyc file
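| Field | Size | Purpose |
|---|---|---|
| Magic number | 4 bytes | Identifies the bytecode version; a mismatch forces a recompile |
| Bit field | 4 bytes | Flags (bit 0 selects hash-based invalidation, PEP 552) |
| Source mtime | 4 bytes | Primary freshness check in timestamp-based .pyc files |
| Source size | 4 bytes | Secondary freshness check in timestamp-based .pyc files |
| Marshalled code object | variable | The actual bytecode payload |

The header is 16 bytes in total. In a hash-based .pyc, the 8 bytes after the bit field hold a hash of the source file instead of the mtime and size.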
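To make the header concrete, here is a hedged sketch that compiles a throwaway module and decodes the start of the resulting file. It assumes the 16-byte header used since Python 3.7 (PEP 552); `read_pyc_header` and the `demo_mod` file names are hypothetical helpers invented for this example.

```python
import importlib.util
import os
import py_compile
import struct
import tempfile

def read_pyc_header(pyc_path):
    """Parse the fixed 16-byte header at the start of a .pyc file."""
    with open(pyc_path, "rb") as fh:
        header = fh.read(16)
    magic = header[:4]
    (flags,) = struct.unpack("<I", header[4:8])
    if flags & 0x1:  # bit 0 set: hash-based pyc, next 8 bytes are a source hash
        return {"magic": magic, "flags": flags, "source_hash": header[8:16]}
    mtime, size = struct.unpack("<II", header[8:16])
    return {"magic": magic, "flags": flags, "mtime": mtime, "source_size": size}

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "demo_mod.py")   # hypothetical module
    pyc = os.path.join(tmp, "demo_mod.pyc")
    with open(src, "w") as fh:
        fh.write("ANSWER = 42\n")
    # Force timestamp-based invalidation so the mtime/size fields are present.
    py_compile.compile(
        src, cfile=pyc, doraise=True,
        invalidation_mode=py_compile.PycInvalidationMode.TIMESTAMP,
    )
    header = read_pyc_header(pyc)
    source_size = os.path.getsize(src)

print(header)
```

The magic number should equal `importlib.util.MAGIC_NUMBER` for the running interpreter, which is the comparison the import system makes before trusting a cached file.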
2. Bytecode Deep Dive
Bytecode is the language the CPython VM speaks. It is not machine code but a higher-level, stack-based instruction set that is independent of the host CPU (though it changes between Python versions).
The dis module
import dis
def greet(name):
    return "Hello, " + name
dis.dis(greet)
Output (Python 3.12):
2 RESUME 0
3 LOAD_CONST 1 ('Hello, ')
LOAD_FAST 0 (name)
BINARY_OP 0 (+)
RETURN_VALUE
Read each line as: <lineno> <opcode> <arg> (<human-readable arg>).
Key opcodes
| Opcode | What it does | Speed note |
|---|---|---|
| LOAD_FAST | Push local variable onto stack | Array index: O(1), very fast |
| LOAD_GLOBAL | Push global/builtin onto stack | Dict lookup: slower |
| LOAD_CONST | Push a literal/constant | Just a pointer copy: instant |
| STORE_FAST | Pop stack → local variable slot | Array write |
| BINARY_OP | Pop two operands, push result | Dispatches to type’s __add__ etc. |
| CALL | Call a callable | Involves frame creation |
| RETURN_VALUE | Return top of stack | Tears down current frame |
| BUILD_LIST | Build list from N stack items | Used for list literals |
Why LOAD_FAST beats LOAD_GLOBAL
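Local variables are stored in a fixed-size C array on the frame, indexed by a position known at compile time, so access is array[index]. Global variables live in f_globals, a Python dict, so a plain LOAD_GLOBAL is a hash lookup (and a second one in the builtins dict when the name is a builtin). Since 3.11 the specializing interpreter caches most global lookups, which narrows the gap but does not close it. This is the entire reason “cache globals as locals” is a real micro-optimization in hot loops.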
# In a tight inner loop, this matters:
import math
def slow():
    for _ in range(1_000_000):
        math.sqrt(2)        # LOAD_GLOBAL math + LOAD_ATTR sqrt on every iteration
def fast():
    sqrt = math.sqrt        # hoist to a local once
    for _ in range(1_000_000):
        sqrt(2)             # LOAD_FAST: the global and attribute lookups are skipped
ELI5: Local variables are like items on your desk — you just reach for them by position. Global variables are like items in a filing cabinet you have to look up alphabetically every single time. For one lookup it doesn’t matter; for a million iterations it adds up.
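Whether the hoisting trick pays off depends on the interpreter version and the loop body, so it is worth measuring rather than assuming. A small sketch with `timeit` follows; the iteration counts are arbitrary and the accumulation into `total` is added only so both functions return a comparable value.

```python
import math
import timeit

def slow(n=100_000):
    total = 0.0
    for _ in range(n):
        total += math.sqrt(2)   # global lookup + attribute lookup each time
    return total

def fast(n=100_000):
    sqrt = math.sqrt            # one lookup up front, then a local
    total = 0.0
    for _ in range(n):
        total += sqrt(2)
    return total

# Take the best of three runs to reduce noise from other processes.
slow_time = min(timeit.repeat(slow, number=5, repeat=3))
fast_time = min(timeit.repeat(fast, number=5, repeat=3))
print(f"slow: {slow_time:.4f}s  fast: {fast_time:.4f}s")
```

On recent CPython versions the difference tends to be modest; treat the numbers as specific to your machine and interpreter build.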
Code objects
Every function, class body, and module has a code object (types.CodeType) attached:
def add(a, b):
    return a + b
c = add.__code__
print(c.co_varnames)   # ('a', 'b'): local variable names, arguments first
print(c.co_consts)     # (None,) on 3.12: constants used
print(c.co_names)      # (): global and attribute names referenced
print(c.co_argcount)   # 2
The code object is immutable and shared — all calls to add reuse the same code object. The frame (below) is what’s per-call.
Common mistake: co_varnames lists locals IN ORDER, starting with arguments. Don’t assume it’s only arguments; it includes all locals defined in the function.
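A quick way to separate arguments from other locals is to slice co_varnames at co_argcount. The sketch below uses a made-up `area` function; note that co_argcount counts only positional parameters, so functions with keyword-only parameters or *args need more care than this shows.

```python
def area(width, height):
    result = width * height   # 'result' is a local, not an argument
    return result

code = area.__code__
arg_names = code.co_varnames[:code.co_argcount]
other_locals = code.co_varnames[code.co_argcount:]
print(arg_names, other_locals)

# The same code object is reused for every call; only frames are per-call.
same_object = area.__code__ is code
```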
3. The Evaluation Loop (ceval.c)
All bytecode execution happens in a single C function: _PyEval_EvalFrameDefault in Python/ceval.c. It’s roughly 3000 lines of C, conceptually a giant switch statement over opcode values (compiled as computed gotos where the C compiler supports them). This is CPython’s hot path.
// Simplified pseudocode
while (1) {
    opcode = *next_instr++;
    switch (opcode) {
        case LOAD_FAST:    PUSH(frame->localsplus[oparg]); break;
        case BINARY_OP:    ... ; break;
        case RETURN_VALUE: return POP();
        // ... ~150 cases
    }
}
The GIL check
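The eval loop does not hand over the GIL after every instruction. Before Python 3.2 it offered a switch every 100 bytecode instructions; since 3.2 the handoff is time-based. A thread waiting for the GIL sleeps for sys.getswitchinterval() seconds (default 0.005, i.e. 5 ms) and then sets a drop request. The running thread notices the request through eval_breaker, a flag that signal handlers and other threads also set, and which the loop checks at safe points such as function entry and backward jumps. If it is set, the current thread releases the GIL and lets another thread run. Switches therefore happen only between bytecode instructions, never in the middle of one, which is why Python’s threading behaves cooperatively at the bytecode level rather than being truly preemptive.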
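The switch interval is readable and tunable at runtime. The snippet below is only a sketch of the API; the 1 ms value is an arbitrary choice for illustration, not a recommendation.

```python
import sys

original = sys.getswitchinterval()   # in seconds; 0.005 unless something changed it
sys.setswitchinterval(0.001)         # ask for more frequent GIL handoffs
shortened = sys.getswitchinterval()
sys.setswitchinterval(original)      # restore the previous value

print(f"original={original}s shortened={shortened}s")
```

A shorter interval can improve responsiveness of I/O-handling threads at the cost of more switching overhead for CPU-bound ones.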
Exception handling
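Try/except is not free, but the cost sits almost entirely on the exception path. Up to Python 3.10, CPython maintained a block stack on the frame: each try block pushed an entry with a SETUP_FINALLY instruction, and a raised exception used it to find the handler. Since 3.11 (“zero-cost” exception handling), the compiler instead writes an exception table into the code object, and the interpreter consults it only when something is raised. The happy path (no exception) is nearly zero overhead.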
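You can check which mechanism your interpreter uses by looking at a function’s bytecode. This sketch assumes the behaviour described for CPython 3.11+, where handler ranges live in `co_exceptiontable` and no SETUP_* instruction appears in the listing; `parse_int` is a made-up example function.

```python
import dis
import sys

def parse_int(text):
    try:
        return int(text)
    except ValueError:
        return None

opnames = [ins.opname for ins in dis.get_instructions(parse_int)]
# Older interpreters emit SETUP_FINALLY to push a block at runtime.
has_setup_opcode = any(name.startswith("SETUP_") for name in opnames)
# Newer interpreters store handler ranges in a side table instead.
table_size = len(getattr(parse_int.__code__, "co_exceptiontable", b""))

print(sys.version_info[:2], has_setup_opcode, table_size)
```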
PEP 659: Adaptive Specializing Interpreter (Python 3.11+)
The generic BINARY_OP opcode dispatches through Python’s type system on every call. In 3.11+, CPython tracks how an opcode is actually used at runtime. After a few executions, it specializes the opcode in-place:
BINARY_OP → BINARY_OP_ADD_INT (if both operands are ints)
LOAD_ATTR → LOAD_ATTR_INSTANCE_VALUE (if it's always an instance attr)
Specialized opcodes bypass the type dispatch entirely: they check “are both ints?” and if yes, call C integer addition directly. This is a major reason Python 3.11 is about 25% faster than 3.10 on average on the pyperformance benchmark suite, without any code changes.
ELI5: The generic opcode is like a restaurant waiter who, every time you order coffee, asks “what kind of coffee, what size, what milk?” The specialized opcode is one who remembers your order: after a few visits he just brings your usual without asking. The speedup is real because in real code, a + b is almost always the same types every time.
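Specialization can be observed with the dis module. The sketch below assumes CPython 3.11 or newer for the `adaptive=True` flag and falls back to the plain listing elsewhere; the warm-up count of 2,000 calls is an arbitrary number chosen to be comfortably above the interpreter’s threshold, and the exact specialized names vary by version.

```python
import dis
import sys

def add(a, b):
    return a + b

for _ in range(2_000):   # warm up with ints only so the site can specialize
    add(1, 2)

if sys.version_info >= (3, 11):
    # adaptive=True shows the quickened instructions currently in place
    opnames = [ins.opname for ins in dis.get_instructions(add, adaptive=True)]
else:
    opnames = [ins.opname for ins in dis.get_instructions(add)]

print(opnames)
```

On a specializing interpreter the addition is expected to show up under a name such as BINARY_OP_ADD_INT rather than the generic BINARY_OP.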
4. Frame Objects
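Every function call gets a frame, the execution context for that call: local variables, instruction pointer, reference to the code object, globals dict, etc. Since 3.11 the interpreter keeps this state in a lightweight internal struct and builds the Python-visible frame object only on demand, for example when you call sys._getframe() or a traceback is created.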
import sys
def inner():
    frame = sys._getframe()              # current frame
    print(frame.f_code.co_name)          # 'inner'
    print(frame.f_back.f_code.co_name)   # caller's name
    print(frame.f_locals)                # local variable dict (snapshot)
    print(frame.f_lineno)                # current line number
inner()
Frame fields you’ll actually use
| Field | Type | Use case |
|---|---|---|
| f_code | code object | What function am I in? |
| f_back | frame or None | Walk up the call stack |
| f_locals | dict (snapshot) | Inspect locals; useful for reading, writes unreliable |
| f_globals | dict (live) | The module’s global namespace |
| f_lineno | int | Current executing line |
| f_builtins | dict | Builtin namespace for this frame |
ELI5: A frame is like a whiteboard for one function call. It has the function’s local variables, a pointer to where in the code you are, and a sticky note saying “when I’m done, return to THIS other whiteboard (f_back).” When the function returns, the whiteboard is erased.
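Following f_back from one whiteboard to the next is all a stack walk is. Here is a minimal sketch; `call_chain`, `inner`, and `outer` are hypothetical names, and `sys._getframe` is a CPython-specific function.

```python
import sys

def call_chain():
    """Return function names from the caller outward by following f_back."""
    names = []
    frame = sys._getframe(1)   # 1 = the frame of whoever called call_chain
    while frame is not None:
        names.append(frame.f_code.co_name)
        frame = frame.f_back
    return names

def inner():
    return call_chain()

def outer():
    return inner()

chain = outer()
print(chain)
```

The list starts with the innermost caller; what follows `outer` depends on how the script itself was launched.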
Why Python recursion is expensive
Each recursive call needs its own frame: space for the locals and the evaluation stack, plus bookkeeping. Deep recursion means thousands of frames alive at once, and CPython cannot collapse them. Compare that to a language implementation with tail-call optimization, where tail recursion can reuse the same stack frame. CPython has no tail-call optimization, and Guido van Rossum has explicitly rejected adding it, because it would destroy the tracebacks that make debugging possible.
import sys
sys.getrecursionlimit() # 1000 by default
sys.setrecursionlimit(5000) # raise with caution — stack overflow is a real risk
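Python 3.11 improvement: frame state now lives in a contiguous, reusable block of memory instead of a separate heap allocation per call, full frame objects are created lazily, and most Python-to-Python calls no longer recurse through the C evaluation function. The 3.11 release notes attribute a 3-7% pyperformance speedup to the cheaper frames alone.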
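The practical consequence is that a recursive algorithm has a hard depth ceiling while its iterative rewrite does not. A small sketch with two made-up functions that compute the same thing:

```python
import sys

def depth_recursive(n):
    # One frame per level: fails once n exceeds the recursion limit.
    return 0 if n == 0 else 1 + depth_recursive(n - 1)

def depth_iterative(n):
    # One frame in total, regardless of n.
    count = 0
    while n:
        count += 1
        n -= 1
    return count

too_deep = sys.getrecursionlimit() + 100
try:
    depth_recursive(too_deep)
    outcome = "completed"
except RecursionError:
    outcome = "RecursionError"

print(outcome, depth_iterative(too_deep))
```

Converting the recursion to a loop (or an explicit stack) is usually a safer fix than raising the limit.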
Common mistake: Inspecting frame.f_locals inside a decorator and expecting writes to affect the function. Through Python 3.12, f_locals is a copy, so writing to it does nothing to the real locals array (3.13 changes this: PEP 667 makes it a write-through proxy). Use ctypes black magic (not recommended in production) if you really need this on older versions.
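The behaviour is easy to probe. The sketch below writes to f_locals and then reads the real local; it assumes no debugger or tracer is active, and the version check reflects the PEP 667 change in Python 3.13, where f_locals became a write-through proxy.

```python
import sys

def poke_local():
    x = 1
    sys._getframe().f_locals["x"] = 99   # write through the frame's locals mapping
    return x                             # read the real local variable

observed = poke_local()
# Assumption: the snapshot semantics hold before 3.13, write-through from 3.13 on.
expected = 99 if sys.version_info >= (3, 13) else 1
print(observed, expected)
```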
5. Object Layout in C
Every Python object in memory starts with the same C header — this is what makes Python’s type system uniform.
PyObject — the universal header
typedef struct _object {
    Py_ssize_t ob_refcnt;    // reference count
    PyTypeObject *ob_type;   // pointer to type (what type(x) returns)
} PyObject;
Every single Python object — int, str, list, your custom class — starts with these two fields. That’s why type(x) and sys.getrefcount(x) work on literally anything.
PyVarObject extends this with ob_size (number of elements) for variable-length objects like lists and tuples.
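Both header fields are visible from Python. This sketch uses a throwaway `Point` class and measures a refcount difference rather than an absolute value, because the absolute number includes temporary references held by the call itself.

```python
import sys

class Point:
    pass

def refcount_delta():
    obj = Point()
    before = sys.getrefcount(obj)
    holders = [obj, obj, obj]      # three more references to the same object
    after = sys.getrefcount(obj)
    del holders
    return after - before

delta = refcount_delta()
# ob_type is what type() reads, and it works for every kind of object.
type_names = [type(x).__name__ for x in (1, "a", [], Point())]
print(delta, type_names)
```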
How int works
Python ints are arbitrary precision. Small ints (-5 to 256) are cached singletons: after a = 5; b = 5, the expression a is b evaluates to True. Beyond that range, each int is a heap-allocated PyLongObject with an ob_digit array of 30-bit “digits” in base $2^{30}$ (on typical 64-bit builds).
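>>> (256) is (256)   # inside the small-int cache (3.8+ also prints a SyntaxWarning for "is" with a literal)
True
>>> (257) is (257)   # True here only because both literals share one constant in the same code object
True
>>> a = 257
>>> b = 257
>>> a is b           # False: separate statements create two separate objects
False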
Common mistake: Using is to compare integers. == checks value; is checks identity (same memory address). Outside the -5 to 256 cache, two int objects with the same value are different objects.
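Literals in one code object can be merged by the compiler, which hides the effect. Building the ints at runtime avoids that; the sketch below relies on CPython’s small-int cache, an implementation detail that other interpreters need not share.

```python
# Parse from strings so the compiler cannot merge the values into one constant.
small_a, small_b = int("256"), int("256")   # inside the cached range
big_a, big_b = int("257"), int("257")       # just outside the cached range

small_shared = small_a is small_b
big_shared = big_a is big_b
values_equal = big_a == big_b

print(small_shared, big_shared, values_equal)
```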
How str works (PEP 393, Python 3.3+)
Before 3.3, every string was stored as UCS-2 or UCS-4 (fixed width per character). Wasteful for ASCII-heavy text. PEP 393 introduced flexible string representation:
| Kind | Width | When used |
|---|---|---|
| Compact ASCII | 1 byte/char | All chars ≤ 127 |
| Latin-1 (UCS-1) | 1 byte/char | All chars ≤ 255 |
| UCS-2 | 2 bytes/char | Chars fit in BMP (≤ U+FFFF) |
| UCS-4 | 4 bytes/char | Any Unicode codepoint |
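A pure ASCII string uses 1 byte per character. Mixing in one emoji forces the whole string to UCS-4, because every character in a string is stored at the width of its widest code point. This fixed width keeps len() and indexing O(1), but memory usage can grow 4x from adding a single character outside the BMP (and 2x from one BMP character above U+00FF).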
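The per-character width can be estimated by comparing the sizes of two strings of the same kind. A short sketch, with `bytes_per_char` as a hypothetical helper and one sample character per row of the table:

```python
import sys

def bytes_per_char(ch, n=1000):
    """Estimate storage per character by differencing two string sizes."""
    return (sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch)) // n

samples = {
    "ascii": "a",
    "latin-1": "\u00e9",      # e with acute accent
    "bmp": "\u20ac",          # euro sign
    "astral": "\U0001F600",   # an emoji outside the BMP
}
widths = {label: bytes_per_char(ch) for label, ch in samples.items()}
print(widths)
```

Differencing cancels the fixed object header, so only the character payload remains.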
How dict works (Python 3.6+ compact dict)
Pre-3.6 dicts used a sparse hash table — lots of empty slots, unordered. Since 3.6, dicts use a compact representation with a separate indices array:
indices: [empty, 2, empty, 0, empty, 1, ...] # sparse, small entries
entries: [(hash, key, value), ...] # dense, insertion-ordered
Iteration walks the dense entries array, which gives insertion order (an implementation detail in 3.6, a language guarantee since 3.7) and faster iteration. The indices array handles collision resolution via open addressing.
ELI5: The old dict was a sparse parking lot where cars park in a slot determined by their license plate hash — lots of empty spaces. The new dict is a numbered ticket system: cars park in a compact row in arrival order, and a separate small board maps license plates to ticket numbers. Iteration just reads the compact row left to right.
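The two-array idea fits in a few lines of Python. This is a toy model for illustration only, not CPython’s implementation: `CompactDict` is a made-up class that uses linear probing, a fixed table size, and supports neither resizing nor deletion.

```python
class CompactDict:
    """Toy model of the compact layout: no resizing, no deletion."""

    def __init__(self, slots=8):
        self.indices = [None] * slots   # sparse: slot -> position in entries
        self.entries = []               # dense: (hash, key, value), insertion-ordered

    def _slots(self, key):
        # Linear probing for brevity; CPython uses a perturbed probe sequence.
        mask = len(self.indices) - 1    # assumes slots is a power of two
        slot = hash(key) & mask
        for _ in range(len(self.indices)):
            yield slot
            slot = (slot + 1) & mask

    def __setitem__(self, key, value):
        for slot in self._slots(key):
            pos = self.indices[slot]
            if pos is None:                     # free slot: append a new entry
                self.indices[slot] = len(self.entries)
                self.entries.append((hash(key), key, value))
                return
            if self.entries[pos][1] == key:     # existing key: overwrite in place
                self.entries[pos] = (hash(key), key, value)
                return
        raise OverflowError("table full (this sketch never resizes)")

    def __getitem__(self, key):
        for slot in self._slots(key):
            pos = self.indices[slot]
            if pos is None:
                break
            if self.entries[pos][1] == key:
                return self.entries[pos][2]
        raise KeyError(key)

    def __iter__(self):
        return (key for _, key, _ in self.entries)   # walk the dense array

d = CompactDict()
d["b"], d["a"], d["c"] = 1, 2, 3
d["b"] = 10   # overwriting keeps the original position
print(list(d), d["b"])
```

Insertion order falls out of the design for free: iteration never looks at the sparse table at all.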
How list works
A list is a dynamic array — a C array of PyObject* pointers with over-allocation to amortize append cost. The growth factor is roughly 1.125× (plus some additive constant), giving amortized O(1) append.
import sys
l = []
for i in range(10):
    l.append(i)
    print(f"len={len(l)}, bytes={sys.getsizeof(l)}")  # size jumps only when the array is regrown
Insert at index 0 is O(n) — it shifts all existing pointers. If you need fast prepend/pop-from-front, use collections.deque.
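Over-allocation shows up as long runs of appends during which the list’s size does not change. This sketch records the lengths at which the size jumps; the exact growth points are a CPython implementation detail and may differ between versions.

```python
import sys

items = []
resize_points = []                  # lengths at which the allocation grew
last_size = sys.getsizeof(items)
for i in range(64):
    items.append(i)
    size = sys.getsizeof(items)
    if size != last_size:           # the underlying array was reallocated
        resize_points.append(len(items))
        last_size = size

print(resize_points)
```

Far fewer resizes than appends is exactly what makes append amortized O(1).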
6. The Peephole Optimizer
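Between the AST and the final bytecode, the compiler runs a few simple optimization passes that replace common patterns with more efficient ones. The classic one is the peephole optimizer, which works on the emitted instructions; since Python 3.7, constant folding is done earlier, on the AST itself.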
Constant folding
def f():
    x = 3 * 4                # optimized to LOAD_CONST 12
    y = "hello " + "world"   # optimized to LOAD_CONST 'hello world'
    z = (1, 2, 3)            # tuple literal stays a constant
Check it:
import dis
def f():
    return 3 * 4
dis.dis(f)
# The listing shows the constant 12 (RETURN_CONST on 3.12, LOAD_CONST + RETURN_VALUE
# on older versions): not 3, not 4, not BINARY_OP
The compiler pre-computes constant expressions — 3 * 4 never runs at runtime.
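Folding can also be checked without reading a disassembly, by looking at the constants stored on the code object. A hedged sketch with two made-up functions follows; it uses values large enough that they are stored as ordinary constants on all the CPython versions I am aware of.

```python
import dis

def seconds_per_day():
    return 60 * 60 * 24

def greeting():
    return "hello " + "world"

# The folded results are stored as constants on the code objects.
folded_int = 86400 in seconds_per_day.__code__.co_consts
folded_str = "hello world" in greeting.__code__.co_consts

# No arithmetic instruction is left to run at call time.
runtime_ops = [ins.opname for ins in dis.get_instructions(seconds_per_day)
               if ins.opname.startswith("BINARY_")]

print(folded_int, folded_str, runtime_ops)
```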
What the peephole optimizer does NOT do
- No loop unrolling
- No inlining of function calls
- No dead code elimination beyond trivial cases (if False:)
- No branch prediction hints
For real optimization beyond this, look at PyPy (JIT), Cython, or Numba.
Specializing Adaptive Interpreter (3.11+)
At runtime, after an opcode executes a few times with the same types, CPython replaces it with a type-specialized version in the bytecode array itself — this is called quickening. The specialized opcodes skip the general dispatch:
| Generic opcode | Specialized version | Condition |
|---|---|---|
| BINARY_OP | BINARY_OP_ADD_INT | Both operands are int |
| BINARY_OP | BINARY_OP_ADD_UNICODE | Both operands are str |
| LOAD_ATTR | LOAD_ATTR_INSTANCE_VALUE | Attribute is a plain instance var |
| CALL | CALL_PY_EXACT_ARGS | Pure Python function, exact arg count |
If the type assumption breaks (you pass a float to the ADD_INT slot), the opcode de-optimizes back to generic and re-specializes over time.
7. Import System Internals
sys.modules — the module cache
The first rule of the import system: every successful import is cached in sys.modules. Re-importing the same module returns the cached object — no re-execution.
import sys
import json
print(sys.modules['json']) # <module 'json' from '...'>
print(sys.modules['json'] is json) # True — same object
This is why mutating an imported module’s globals affects all importers — they all hold the same module object.
Finders and loaders
import foo triggers Python to walk sys.meta_path — a list of finder objects. Each finder either returns a spec (found it, here’s how to load it) or None (not mine, try next).
sys.meta_path = [
    BuiltinImporter,   # handles 'sys', 'builtins', etc.
    FrozenImporter,    # frozen modules (embedded in interpreter)
    PathFinder,        # scans sys.path for file-based modules
]
PathFinder then uses sys.path_hooks to find a loader for the specific file type (.py, .so, .pyc, zip files, etc.).
ELI5: Importing is like going to a library. The meta_path finders are the librarians: you ask each one “do you have this book?” in order. The first one that says yes goes and gets it. sys.modules is the “books you’ve already checked out” pile: if it’s there, no need to ask the librarians again.
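You can ask the librarians directly. The sketch below lists the finders on sys.meta_path and uses `importlib.util.find_spec` to resolve a module without executing it; `json` is just a convenient standard-library example, and the finder list may contain extra entries if tools such as test runners have installed their own.

```python
import importlib.util
import sys

# Finders may be classes (which have __name__) or instances (use the type name).
finder_names = [getattr(f, "__name__", type(f).__name__) for f in sys.meta_path]

spec = importlib.util.find_spec("json")   # resolve the module, do not run it
import json
cached = sys.modules["json"] is json      # after import, the cache holds it

print(finder_names)
print(spec.name, type(spec.loader).__name__, spec.origin)
```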
Circular imports — why they hurt
# a.py
from b import B_thing
# b.py
from a import A_thing
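When Python starts importing a.py, it adds a partially initialized a module to sys.modules immediately, before a.py finishes executing. If b.py imports from a during that window, it gets the partial module, and from a import A_thing raises ImportError because A_thing has not been defined yet. The fix: use import a and look up a.A_thing later (at call time, after both modules have finished loading) instead of from a import X, or restructure to break the cycle.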
Common mistake: Circular imports work fine until they don’t — they usually blow up only when the import order changes (e.g., you add a third module). The root fix is restructuring, not import tricks.
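The failure and the plain-import workaround can be reproduced in isolation. This sketch writes two tiny modules to a temporary directory and imports them; `try_import` and the `cyc_a`/`cyc_b` module names are hypothetical, chosen to avoid clashing with real packages.

```python
import importlib
import sys
import tempfile
from pathlib import Path

def try_import(files, module_name):
    """Write the given modules to a temp dir and report how the import goes."""
    with tempfile.TemporaryDirectory() as tmp:
        for filename, body in files.items():
            Path(tmp, filename).write_text(body)
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            importlib.import_module(module_name)
            return "ok"
        except ImportError as exc:
            return f"ImportError: {exc}"
        finally:
            sys.path.remove(tmp)
            for filename in files:   # forget the demo modules again
                sys.modules.pop(Path(filename).stem, None)

broken = {
    "cyc_a.py": "from cyc_b import B_thing\nA_thing = 1\n",
    "cyc_b.py": "from cyc_a import A_thing\nB_thing = 2\n",
}
fixed = {
    "cyc_a.py": "import cyc_b\nA_thing = 1\n",
    "cyc_b.py": "import cyc_a\nB_thing = 2\n",
}

broken_result = try_import(broken, "cyc_a")
fixed_result = try_import(fixed, "cyc_a")
print(broken_result)
print(fixed_result)
```

The plain-import version succeeds only because neither module touches the other’s attributes at import time; restructuring remains the more robust fix.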
Namespace packages (PEP 420)
A directory without __init__.py is still importable as a namespace package (Python 3.3+). Useful for splitting a large package across multiple directories or install locations. Performance cost: finding a namespace package requires scanning all sys.path entries — regular packages (with __init__.py) short-circuit on first match.
8. Alternative Implementations
CPython is the reference implementation but not the only game in town.
Comparison table
| Implementation | Language | Execution | Best use case |
|---|---|---|---|
| CPython | C | Bytecode interpreter | Everything — default choice |
| PyPy | RPython/C | JIT compiler | Long-running CPU-bound code, 2-10× faster |
| Cython | C + Python | Compiles to C | Tight loops, C interop, numpy extensions |
| GraalPy | Java | Truffle/GraalVM JIT | Polyglot (run Python + Java + JS together) |
| RustPython | Rust | Bytecode interpreter | Embedding Python in Rust, learning project |
| MicroPython | C | Bytecode interpreter | Microcontrollers, embedded systems |
When to reach for each
- PyPy: you have a CPU-bound Python loop and can’t use numpy/numba. PyPy’s JIT compiles hot loops to machine code. Caveat: C extensions (numpy, pandas) go through a C API compatibility layer on PyPy, so they may be unsupported or run slower than on CPython.
- Cython: you need to call C code, or you have a hot function that’s the bottleneck and you want C speeds with Python syntax.
- CPython C API: writing a Python extension in C. You get full control and maximum compatibility, at the cost of manual reference counting.
- ctypes: calling an existing compiled library (a .so/.dll) from Python without writing any C extension. Zero compilation required, but no type safety.
- cffi: like ctypes but more Pythonic, better for complex C APIs, preferred in PyPy-compatible code.
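As a taste of the ctypes route, the sketch below calls a C function in a library that is guaranteed to be loaded already: the CPython runtime itself, exposed as `ctypes.pythonapi`. Using the runtime instead of a system libc is a choice made here for portability; calling your own .so/.dll follows the same pattern via `ctypes.CDLL`.

```python
import ctypes
import sys

# ctypes.pythonapi exposes the already-loaded CPython runtime as a C library.
get_version = ctypes.pythonapi.Py_GetVersion
get_version.restype = ctypes.c_char_p   # declare the C return type: const char *
get_version.argtypes = []               # the function takes no arguments

version_from_c = get_version().decode()
print(version_from_c)
```

Declaring restype and argtypes is the only type safety you get; with a wrong declaration ctypes will happily misread memory.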
ELI5: PyPy is like a race car driver who learns your track after a few laps and starts taking optimal lines. CPython is a cautious driver following the rules every lap. For a straight highway (predictable code), PyPy wins by a lot. For a chaotic city with random pedestrians (unpredictable types, C extensions), CPython’s reliability wins.
The C API reality
Writing CPython C extensions means managing reference counts manually. Get it wrong and you have either a memory leak (forgot Py_DECREF) or a crash (called Py_DECREF too early — object freed while still reachable).
PyObject *result = PyLong_FromLong(42); // ob_refcnt = 1
Py_INCREF(result); // ob_refcnt = 2
// ... give result to caller ...
Py_DECREF(result); // ob_refcnt = 1, not freed
For new C extensions, prefer pybind11 (C++11 RAII handles refcounts automatically) or maturin + PyO3 (Rust, also automatic).
Summary
| Topic | The one thing to remember |
|---|---|
| Execution pipeline | Source → AST → bytecode → ceval loop; .pyc caches bytecode |
| LOAD_FAST vs LOAD_GLOBAL | Array index vs dict lookup; cache globals as locals in hot loops |
| Code objects | Immutable, shared across calls; frames are per-call |
| Evaluation loop | Giant C switch statement; a waiting thread requests the GIL after ~5 ms and the holder releases it at the next check |
| Frames | Per-call heap allocation; f_back chains form the call stack |
| Object layout | Every object starts with ob_refcnt + ob_type |
| dict (3.6+) | Compact + ordered; still hash-based, still O(1) average |
| int cache | -5 to 256 are singletons; use == not is for value comparison |
| Peephole optimizer | Constant folding at compile time; specializing interpreter at runtime (3.11+) |
| Import system | sys.modules cache → sys.meta_path finders → loaders |
| Circular imports | Python caches partial modules — restructure to fix, don’t work around |
| Alternatives | PyPy for CPU-bound loops; Cython for C interop; ctypes for existing libs |