CPython Internals


Table of Contents

1. How Python Executes Code
- What’s in a .pyc file
2. Bytecode Deep Dive
3. The Evaluation Loop (ceval.c)
4. Frame Objects
- Frame fields you’ll actually use
- Why Python recursion is expensive
5. Object Layout in C
6. The Peephole Optimizer
7. Import System Internals
8. Alternative Implementations
Summary

Most Python engineers treat the interpreter as a black box — code goes in, results come out. That works until it doesn’t: a performance regression you can’t explain, a threading bug that makes no sense, a memory spike with no obvious cause. Understanding CPython’s internals turns those mysteries into five-minute diagnoses.

1. How Python Executes Code

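The pipeline is: source text → tokens → AST → bytecode → execution. When you import a module whose cached .pyc is missing or stale, the whole pipeline runs; a fresh .pyc lets Python skip straight to execution. A script run directly (python script.py) is compiled on every run, because only imported modules are cached.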

source.py
   │
   ▼  tokenize (lexer)
[TOKENS]  — keywords, names, literals, operators
   │
   ▼  parse (parser)
[AST]     — Abstract Syntax Tree (tree of nodes)
   │
   ▼  compile (compiler + peephole optimizer)
[BYTECODE] — sequence of opcodes for the VM
   │
   ▼  execute (ceval.c evaluation loop)
[RESULT]
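The four stages in the diagram can be driven by hand from the standard library. This is a minimal sketch, not what the import machinery literally does; the source string and the `<example>` filename are made up for illustration.

```python
import ast
import io
import tokenize

source = "x = 1 + 2\n"  # hypothetical one-line program

# 1. Tokenize: source text -> stream of tokens
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))

# 2. Parse: source text -> AST
tree = ast.parse(source)

# 3. Compile: AST -> code object holding the bytecode
code = compile(tree, "<example>", "exec")

# 4. Execute: the evaluation loop runs the bytecode
namespace = {}
exec(code, namespace)

print(tokens[0].string)        # first token is the name 'x'
print(type(tree).__name__)     # Module
print(namespace["x"])          # 3
```

In normal use these steps are hidden behind `import` and `python script.py`; calling them separately is mostly useful for tooling and for exploring.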

“Compiled AND interpreted” is not a contradiction — Python compiles to bytecode (compilation step), then a software interpreter executes that bytecode (interpretation step). It’s compiled to an intermediate form, not to machine code like C.

The compilation happens at import time, not at some separate build step. When you import foo, Python compiles foo.py right then, caches the bytecode to __pycache__/foo.cpython-312.pyc, and runs it. On next import, if the .pyc is fresh (mtime + size match), compilation is skipped.

ELI5: It’s like a chef who writes a recipe in French, then a translator converts it to a standard kitchen notation card. The kitchen staff read and execute the notation card. Next time, they reuse the same card unless the French recipe changed. Python’s “notation card” format is bytecode; the .pyc file is the saved card.

What’s in a `.pyc` file

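Field	Size	Purpose
Magic number	4 bytes	Identifies the bytecode format of the Python version; a mismatch forces a recompile
Bit field	4 bytes	Flags (PEP 552): selects timestamp-based or hash-based invalidation
Source mtime	4 bytes	Freshness check (timestamp-based .pyc)
Source size	4 bytes	Secondary freshness check (timestamp-based .pyc)
Marshalled code object	variable	The actual bytecode payload

In a hash-based .pyc, the 8 bytes after the bit field hold a hash of the source instead of mtime + size, so the header is 16 bytes either way.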
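To see the header on disk, the sketch below compiles a throwaway file and unpacks the first 16 bytes. It assumes a timestamp-based .pyc (forced explicitly here), where the header is magic, flags, source mtime, and source size; the `demo.py` file name is hypothetical.

```python
import importlib.util
import os
import py_compile
import struct
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "demo.py")          # hypothetical source file
    with open(src, "w") as f:
        f.write("x = 1\n")

    # Force timestamp-based invalidation so the header layout is predictable.
    pyc = py_compile.compile(
        src,
        cfile=os.path.join(tmp, "demo.pyc"),
        doraise=True,
        invalidation_mode=py_compile.PycInvalidationMode.TIMESTAMP,
    )
    with open(pyc, "rb") as f:
        header = f.read(16)
    src_size = os.path.getsize(src)

# Little-endian: 4-byte magic, then three unsigned 32-bit integers.
magic, flags, mtime, size = struct.unpack("<4sIII", header)

print(magic == importlib.util.MAGIC_NUMBER)  # magic matches this interpreter
print(flags)                                 # 0 means timestamp-based
print(size == src_size)                      # recorded size matches the source
```

Everything after those 16 bytes is the marshalled code object.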

2. Bytecode Deep Dive

Bytecode is the language the CPython VM speaks. It is not machine code but a higher-level, stack-based instruction set: portable across platforms, though not stable across Python versions.

The `dis` module

import dis

def greet(name):
    return "Hello, " + name

dis.dis(greet)

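Output (Python 3.13 layout; 3.12 and earlier also print a byte-offset column before each opcode, and the line numbers depend on where greet sits in your file):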

  2           RESUME                   0

  3           LOAD_CONST               1 ('Hello, ')
              LOAD_FAST                0 (name)
              BINARY_OP                0 (+)
              RETURN_VALUE

Read each line as: <lineno> <opcode> <arg> (<human-readable arg>).

Key opcodes

Opcode	What it does	Speed note
`LOAD_FAST`	Push local variable onto stack	Array index — O(1), very fast
`LOAD_GLOBAL`	Push global/builtin onto stack	Dict lookup — slower
`LOAD_CONST`	Push a literal/constant	Just a pointer copy — instant
`STORE_FAST`	Pop stack → local variable slot	Array write
`BINARY_OP`	Pop two operands, push result	Dispatches to type’s `__add__` etc.
`CALL`	Call a callable	Involves frame creation
`RETURN_VALUE`	Return top of stack	Tears down current frame
`BUILD_LIST`	Build list from N stack items	Used for list literals
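The same information that `dis.dis` prints is available programmatically through `dis.get_instructions`, which is handy when you want to search for a particular opcode. A small sketch; exact opcode names vary between Python versions, so treat the printed listing as illustrative.

```python
import dis

def greet(name):
    return "Hello, " + name

# Each Instruction carries the opcode name, its raw argument,
# and a human-readable rendering of that argument.
for instr in dis.get_instructions(greet):
    print(f"{instr.offset:>3} {instr.opname:<16} {instr.argrepr}")

opnames = [instr.opname for instr in dis.get_instructions(greet)]
uses_local_load = any(name.startswith("LOAD_FAST") for name in opnames)
print(uses_local_load)
```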

Why `LOAD_FAST` beats `LOAD_GLOBAL`

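Local variables are stored in a fixed-size C array on the frame, indexed by a position known at compile time, so access is array[index]. Global variables live in f_globals, a Python dict, so an unspecialized access is a hash lookup (plus a second one in builtins if the name is not a module global). Since 3.11 the specializing interpreter caches many of these lookups, which narrows the gap, but “cache globals as locals” is still a real micro-optimization in hot loops, especially when it also removes an attribute lookup.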

# In a tight inner loop, this matters:
import math

def slow():
    for _ in range(1_000_000):
        math.sqrt(2)          # LOAD_GLOBAL math + LOAD_ATTR sqrt on every iteration

def fast():
    sqrt = math.sqrt          # hoist to a local once
    for _ in range(1_000_000):
        sqrt(2)               # LOAD_FAST: global and attribute lookups both skipped

ELI5: Local variables are like items on your desk — you just reach for them by position. Global variables are like items in a filing cabinet you have to look up alphabetically every single time. For one lookup it doesn’t matter; for a million iterations it adds up.
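Whether the hoisting trick pays off on your interpreter is easy to measure with `timeit`. The sketch below uses its own pair of functions (the loop size and repeat count are arbitrary choices) and prints both timings; the gap depends heavily on the Python version, so no particular ratio is claimed.

```python
import math
import timeit

def slow(n):
    total = 0.0
    for _ in range(n):
        total += math.sqrt(2)   # global lookup + attribute lookup each time
    return total

def fast(n):
    sqrt = math.sqrt            # resolve once, keep in a local slot
    total = 0.0
    for _ in range(n):
        total += sqrt(2)
    return total

t_slow = timeit.timeit(lambda: slow(100_000), number=5)
t_fast = timeit.timeit(lambda: fast(100_000), number=5)
print(f"slow: {t_slow:.3f}s  fast: {t_fast:.3f}s")
```

Measure before committing to this style: it costs some readability, and on recent versions the difference may be small.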

Code objects

Every function, class body, and module has a code object (types.CodeType) attached:

def add(a, b):
    return a + b

c = add.__code__
print(c.co_varnames)   # ('a', 'b'): local variable names, arguments first
print(c.co_consts)     # constants used; (None,) on 3.12, may differ by version
print(c.co_names)      # (): global and attribute names referenced
print(c.co_argcount)   # 2

The code object is immutable and shared — all calls to add reuse the same code object. The frame (below) is what’s per-call.

Common mistake: co_varnames lists locals IN ORDER, starting with arguments. Don’t assume it’s only arguments — it includes all locals defined in the function.
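To separate arguments from the other locals, slice `co_varnames` with `co_argcount`. A short sketch with a made-up function (it has no keyword-only or `*args` parameters, which would need `co_kwonlyargcount` and the flags as well):

```python
def scale(values, factor):
    total = 0
    for v in values:
        total += v * factor
    return total

code = scale.__code__

# Positional arguments come first, then every other local in the body.
arg_names = code.co_varnames[:code.co_argcount]
other_locals = code.co_varnames[code.co_argcount:]

print(arg_names)       # ('values', 'factor')
print(other_locals)    # the remaining locals: total and v
```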

3. The Evaluation Loop (ceval.c)

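All bytecode execution happens in a single C function: _PyEval_EvalFrameDefault in Python/ceval.c. Conceptually it is a giant switch statement over opcode values, several thousand lines long (since 3.12 the case bodies are generated from Python/bytecodes.c, and compilers that support it use computed gotos instead of a literal switch). This is CPython’s hot path.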

// Simplified pseudocode, not the real source
while (1) {
    opcode = *next_instr++;
    switch (opcode) {
        case LOAD_FAST:    PUSH(frame->localsplus[oparg]); break;
        case BINARY_OP:    /* pop two operands, push the result */ break;
        case RETURN_VALUE: return POP();
        // ... one case per opcode, well over 100 in total
    }
}
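The fetch-decode-dispatch shape of the loop is easy to mimic in Python. The toy interpreter below is only a sketch: the opcode names are borrowed from CPython, but the tuple-based program format and the `run` function are invented for illustration.

```python
def run(program, local_vars):
    """Execute a list of (opcode, oparg) pairs on a tiny stack machine."""
    stack = []
    pc = 0                                   # index of the next instruction
    while True:
        opcode, oparg = program[pc]          # fetch
        pc += 1
        if opcode == "LOAD_FAST":            # dispatch
            stack.append(local_vars[oparg])  # locals are indexed by position
        elif opcode == "LOAD_CONST":
            stack.append(oparg)
        elif opcode == "BINARY_ADD":
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        elif opcode == "RETURN_VALUE":
            return stack.pop()
        else:
            raise ValueError(f"unknown opcode {opcode!r}")

# Hand-assembled equivalent of: return "Hello, " + name
program = [
    ("LOAD_CONST", "Hello, "),
    ("LOAD_FAST", 0),
    ("BINARY_ADD", None),
    ("RETURN_VALUE", None),
]
print(run(program, ["world"]))  # Hello, world
```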

The GIL check

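Before Python 3.2 the interpreter counted bytecode “ticks” and offered to release the GIL every 100 of them (sys.setcheckinterval). Since 3.2 the switch is time-based: a thread waiting for the GIL times out after sys.getswitchinterval() seconds (default 0.005, i.e. 5 ms) and sets a drop request. The running thread notices the request when the eval loop checks eval_breaker, a flag also set by signal handlers and pending calls, and then releases the GIL so another thread can run. Switches therefore happen only at points where the eval loop checks the flag, never in the middle of a bytecode instruction, which is why threading here is closer to cooperative than truly preemptive.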
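The interval is readable and adjustable at runtime through `sys`. A minimal sketch; the 1 ms value is an arbitrary choice for demonstration, not a recommendation, and the original value is restored afterwards.

```python
import sys

original = sys.getswitchinterval()          # seconds, 0.005 by default
print(f"switch interval: {original * 1000:.1f} ms")

# A shorter interval lets waiting threads request the GIL sooner,
# at the cost of more frequent switching.
sys.setswitchinterval(0.001)
shortened = sys.getswitchinterval()

sys.setswitchinterval(original)             # restore the previous setting
```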

Exception handling

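Try/except is cheap when nothing is raised. Through Python 3.10, CPython kept a block stack on the frame: each try block pushed an entry that the unwinder used to find the matching handler. Since 3.11, “zero-cost” exception handling replaces that with a per-code-object exception table that is consulted only when an exception is actually raised, so entering a try block executes no extra instructions. Either way the real cost sits on the exception path: creating the exception object, building the traceback, and unwinding.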
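On Python 3.11 and later, the handler information for a function is visible as the `co_exceptiontable` attribute of its code object. The sketch below looks it up defensively with `getattr`, since older versions do not have the attribute; `safe_div` is a made-up example function.

```python
def safe_div(a, b):
    try:
        return a / b
    except ZeroDivisionError:
        return None

# bytes on 3.11+, None on older interpreters that lack the attribute
table = getattr(safe_div.__code__, "co_exceptiontable", None)

print(safe_div(4, 2))   # 2.0 (no exception raised, handler never consulted)
print(safe_div(1, 0))   # None (handler located and run)
print(table is not None and len(table) > 0)
```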

PEP 659: Adaptive Specializing Interpreter (Python 3.11+)

The generic BINARY_OP opcode dispatches through Python’s type system on every call. In 3.11+, CPython tracks how an opcode is actually used at runtime. After a few executions, it specializes the opcode in-place:

BINARY_OP  →  BINARY_OP_ADD_INT    (if both operands are ints)
LOAD_ATTR  →  LOAD_ATTR_INSTANCE_VALUE  (if it's always an instance attr)

Specialized opcodes bypass the generic dispatch: they check “are both ints?” and, if so, call C integer addition directly. Specialization is one of the main reasons Python 3.11 is about 25% faster than 3.10 on average (10-60% depending on the workload) without any code changes.

ELI5: The generic opcode is like a restaurant waiter who, every time you order coffee, asks “what kind of coffee, what size, what milk?” The specialized opcode is one who remembers your order — after a few visits he just brings your usual without asking. The speedup is real because in real code, a + b is almost always the same types every time.
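On 3.11+ you can watch specialization happen by asking `dis` for the adaptive view after warming a function up. This is a sketch: the warm-up count of 1000 is an arbitrary choice, the specialized names differ between versions, and some builds do not specialize at all, so the output is not guaranteed.

```python
import dis
import sys

def add(a, b):
    return a + b

# Warm up with ints so the interpreter has a chance to specialize.
for _ in range(1000):
    add(1, 2)

if sys.version_info >= (3, 11):
    # adaptive=True shows the instructions as currently specialized
    instructions = dis.get_instructions(add, adaptive=True)
else:
    instructions = dis.get_instructions(add)

opnames = [instr.opname for instr in instructions]
print(opnames)   # on many 3.11+ builds this includes BINARY_OP_ADD_INT
```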

4. Frame Objects

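Every function call gets a frame: the execution context holding local variables, the instruction pointer, a reference to the code object, the globals dict, and so on. Since 3.11 the interpreter keeps this in a lightweight internal struct and only materializes a full Python frame object when something asks for one, for example sys._getframe() or a traceback.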

import sys

def inner():
    frame = sys._getframe()         # current frame
    print(frame.f_code.co_name)     # 'inner'
    print(frame.f_back.f_code.co_name)  # caller's name
    print(frame.f_locals)           # local variable dict (snapshot)
    print(frame.f_lineno)           # current line number

inner()

Frame fields you’ll actually use

Field	Type	Use case
`f_code`	code object	What function am I in?
`f_back`	frame or None	Walk up the call stack
`f_locals`	dict (snapshot through 3.12; write-through proxy from 3.13)	Inspect locals; on 3.12 and earlier, writes do not reach the real locals
`f_globals`	dict (live)	The module’s global namespace
`f_lineno`	int	Current executing line
`f_builtins`	dict	Builtin namespace for this frame

ELI5: A frame is like a whiteboard for one function call. It has the function’s local variables, a pointer to where in the code you are, and a sticky note saying “when I’m done, return to THIS other whiteboard (f_back).” When the function returns, the whiteboard is erased.
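Following `f_back` from the current frame reconstructs the call stack, which is essentially what a traceback does. A small sketch with made-up function names:

```python
import sys

def call_chain():
    """Return the names of all functions on the current call stack."""
    names = []
    frame = sys._getframe()          # frame of call_chain itself
    while frame is not None:
        names.append(frame.f_code.co_name)
        frame = frame.f_back         # step to the caller's frame
    return names

def middle():
    return call_chain()

def outer():
    return middle()

chain = outer()
print(chain[:3])   # ['call_chain', 'middle', 'outer']
```

For real code, `inspect.stack()` or the `traceback` module wrap the same walk in a friendlier API.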

Why Python recursion is expensive

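Each recursive call needs a new frame: a C struct holding the locals array, the evaluation stack, and links to the code object and the caller (no dict is involved unless something reads f_locals). Deep recursion means thousands of these, plus call overhead for each. Compare that to compilers that perform tail-call elimination, where a tail-recursive call can reuse the current stack frame. CPython has no tail-call optimization, and that is a deliberate design decision: Guido van Rossum rejected it in order to keep tracebacks complete.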

import sys
sys.getrecursionlimit()    # 1000 by default
sys.setrecursionlimit(5000)  # raise with caution — stack overflow is a real risk

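Python 3.11 improvement: frames became much cheaper. The interpreter now stores them as lightweight structs in contiguous chunks of memory instead of allocating a full frame object per call, creates the Python-visible frame object lazily, and runs most Python-to-Python calls without recursing in C. Together these changes noticeably reduce per-call overhead.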
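When depth is driven by input size, rewriting the recursion as a loop avoids the limit entirely. The sketch below contrasts the two with a deliberately trivial pair of functions (both just count down), so the only difference is the call depth.

```python
import sys

def depth_recursive(n):
    # One new frame per level
    return 0 if n == 0 else 1 + depth_recursive(n - 1)

def depth_iterative(n):
    # Same result with a single frame
    depth = 0
    while n > 0:
        depth += 1
        n -= 1
    return depth

too_deep = sys.getrecursionlimit() + 100

try:
    depth_recursive(too_deep)
    hit_limit = False
except RecursionError:
    hit_limit = True

print(hit_limit)                                # True
print(depth_iterative(too_deep) == too_deep)    # True
```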

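Common mistake: Inspecting frame.f_locals inside a decorator and expecting writes to affect the function. Through Python 3.12, f_locals on a function frame is a snapshot dict, so writing to it does not change the real locals array. (Python 3.13 implements PEP 667, which makes f_locals a write-through proxy.) On older versions the only workaround is ctypes black magic, which is not recommended in production.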

5. Object Layout in C

Every Python object in memory starts with the same C header — this is what makes Python’s type system uniform.

PyObject — the universal header

typedef struct _object {
    Py_ssize_t ob_refcnt;    // reference count
    PyTypeObject *ob_type;   // pointer to type (what type(x) returns)
} PyObject;

Every single Python object — int, str, list, your custom class — starts with these two fields. That’s why type(x) and sys.getrefcount(x) work on literally anything.

PyVarObject extends this with ob_size (number of elements) for variable-length objects like lists and tuples.
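Both header fields are observable from Python. A quick sketch using a fresh `object()`; the absolute count that `sys.getrefcount` returns includes the temporary reference held by the call itself, so only the differences are meaningful.

```python
import sys

obj = object()

before = sys.getrefcount(obj)
alias = obj                      # a second name bumps ob_refcnt by one
after = sys.getrefcount(obj)
del alias                        # dropping the name gives it back
restored = sys.getrefcount(obj)

print(after - before)            # 1
print(restored == before)        # True
print(type(obj))                 # <class 'object'>, read from ob_type
```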

How `int` works

Python ints are arbitrary precision. Small ints (-5 to 256) are cached singletons, so a = 5; b = 5; a is b is True. Every int, cached or not, is a PyLongObject whose ob_digit array stores the magnitude as 30-bit “digits” in base $2^{30}$ (15-bit digits on some 32-bit builds); ints outside the cache are allocated on the heap as needed.

>>> a = 256
>>> b = 256
>>> a is b           # True: both names point at the cached 256
True
>>> a = 257
>>> b = 257
>>> a is b           # False in the REPL: each line is compiled separately
False
>>> a = 257; b = 257
>>> a is b           # True: one compilation unit shares a single constant
True

Common mistake: Using is to compare integers. == checks value; is checks identity (same memory address). Outside the -5 to 256 cache, two int objects with the same value are different objects.
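To see the cache boundary without constant sharing getting in the way, build the ints at runtime. This sketch relies on CPython implementation details (the small-int cache and `int()` returning a fresh object for larger values), so the identity results are not guaranteed on other implementations.

```python
# Parsing from a string forces the value to be created at runtime,
# so the compiler cannot merge the two constants.
a = int("257")
b = int("257")
print(a == b)     # True: same value
print(a is b)     # False on CPython: two distinct objects

small_a = int("5")
small_b = int("5")
print(small_a is small_b)   # True on CPython: both are the cached 5
```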

How `str` works (PEP 393, Python 3.3+)

Before 3.3, every string was stored as UCS-2 or UCS-4 (fixed width per character). Wasteful for ASCII-heavy text. PEP 393 introduced flexible string representation:

Kind	Width	When used
Compact ASCII	1 byte/char	All chars ≤ 127
Latin-1 (UCS-1)	1 byte/char	All chars ≤ 255
UCS-2	2 bytes/char	Chars fit in BMP (≤ U+FFFF)
UCS-4	4 bytes/char	Any Unicode codepoint

A pure ASCII string uses 1 byte per character. Mixing in one emoji forces the whole string to UCS-4. This means len() is O(1) but memory usage can 4x from adding a single non-ASCII character.
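The width jump is visible with `sys.getsizeof`. The sketch builds four 1000-character strings that differ only in their last character; exact byte counts vary slightly between versions, so the example prints them rather than promising specific numbers.

```python
import sys

ascii_s = "a" * 1000
latin_s = "a" * 999 + "\u00e9"       # é: fits in Latin-1, 1 byte/char
bmp_s = "a" * 999 + "\u20ac"         # €: needs UCS-2, 2 bytes/char
astral_s = "a" * 999 + "\U0001F600"  # emoji: needs UCS-4, 4 bytes/char

for label, s in [("ascii", ascii_s), ("latin-1", latin_s),
                 ("ucs-2", bmp_s), ("ucs-4", astral_s)]:
    print(f"{label:8} len={len(s)} bytes={sys.getsizeof(s)}")
```

All four have the same `len()`, but the last one takes roughly four times the memory of the first.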

How `dict` works (Python 3.6+ compact dict)

Pre-3.6 dicts used a sparse hash table — lots of empty slots, unordered. Since 3.6, dicts use a compact representation with a separate indices array:

indices:  [empty, 2, empty, 0, empty, 1, ...]   # sparse, small entries
entries:  [(hash, key, value), ...]             # dense, insertion-ordered

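Iteration walks the dense entries array, which is why dicts preserve insertion order (a CPython implementation detail in 3.6, a language guarantee since 3.7) and iterate faster. A lookup hashes the key into the indices array, which resolves collisions with open addressing, and then follows the stored index into entries.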

ELI5: The old dict was a sparse parking lot where cars park in a slot determined by their license plate hash — lots of empty spaces. The new dict is a numbered ticket system: cars park in a compact row in arrival order, and a separate small board maps license plates to ticket numbers. Iteration just reads the compact row left to right.
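The two-array idea fits in a few lines of Python. The `CompactDict` class below is a toy model written for this article, not CPython's implementation: it uses plain linear probing (CPython perturbs the probe sequence with the hash bits), never resizes, and does not support deletion.

```python
class CompactDict:
    """Toy model: sparse index table + dense, insertion-ordered entries."""

    def __init__(self, size=8):
        self.indices = [None] * size   # sparse: slot -> position in entries
        self.entries = []              # dense: (hash, key, value) tuples

    def _probe(self, key):
        h = hash(key)
        slot = h % len(self.indices)
        while True:
            pos = self.indices[slot]
            if pos is None or self.entries[pos][1] == key:
                return h, slot, pos
            slot = (slot + 1) % len(self.indices)   # linear probing

    def __setitem__(self, key, value):
        h, slot, pos = self._probe(key)
        if pos is None:
            # Keep one slot free so probing always terminates.
            if len(self.entries) + 1 >= len(self.indices):
                raise RuntimeError("toy table is full (no resizing here)")
            self.indices[slot] = len(self.entries)
            self.entries.append((h, key, value))
        else:
            self.entries[pos] = (h, key, value)     # overwrite in place

    def __getitem__(self, key):
        _, _, pos = self._probe(key)
        if pos is None:
            raise KeyError(key)
        return self.entries[pos][2]

    def keys(self):
        # Iteration only touches the dense array, in insertion order.
        return [key for _, key, _ in self.entries]


d = CompactDict()
d["b"] = 1
d["a"] = 2
d["b"] = 3          # update keeps the original position
print(d.keys())     # ['b', 'a']
print(d["b"])       # 3
```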

How `list` works

A list is a dynamic array — a C array of PyObject* pointers with over-allocation to amortize append cost. The growth factor is roughly 1.125× (plus some additive constant), giving amortized O(1) append.

import sys
l = []
for i in range(10):
    l.append(i)
    # getsizeof reports bytes, which jump only when the array is reallocated
    print(f"len={len(l)}, bytes={sys.getsizeof(l)}")

Insert at index 0 is O(n) — it shifts all existing pointers. If you need fast prepend/pop-from-front, use collections.deque.
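Counting how often the reported size changes shows the amortization at work: most appends reuse spare capacity. A minimal sketch; the exact number of resizes is an implementation detail, so it is printed rather than asserted as a specific value.

```python
import sys

items = []
resizes = 0
last_size = sys.getsizeof(items)

for i in range(1000):
    items.append(i)
    size = sys.getsizeof(items)
    if size != last_size:       # the backing array was reallocated
        resizes += 1
        last_size = size

print(f"{resizes} reallocations for {len(items)} appends")
```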

6. The Peephole Optimizer

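On the way from AST to final bytecode, the compiler runs a few simple optimization passes that replace common patterns with cheaper ones. The name “peephole optimizer” comes from the original pass over the emitted bytecode; since 3.7 most constant folding has been done in an AST-level pass, and the exact stage has shifted between versions.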

Constant folding

def f():
    x = 3 * 4       # optimized to a single constant 12
    y = "hello " + "world"  # optimized to LOAD_CONST 'hello world'
    z = (1, 2, 3)   # tuple of constants becomes a single LOAD_CONST

Check it:

import dis
def f():
    return 3 * 4
dis.dis(f)
# LOAD_CONST 12   ← not 3, not 4, not BINARY_OP

The compiler pre-computes constant expressions — 3 * 4 never runs at runtime.
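You can also confirm folding without reading disassembly by looking for the pre-computed value in `co_consts`. A sketch with two made-up functions; the values are chosen to be larger than the small integers that some versions encode directly in the instruction rather than in `co_consts`.

```python
import dis

def seconds_per_day():
    return 60 * 60 * 24

def greeting():
    return "hello " + "world"

consts = seconds_per_day.__code__.co_consts
opnames = [i.opname for i in dis.get_instructions(seconds_per_day)]

print(86400 in consts)                                  # folded product
print("hello world" in greeting.__code__.co_consts)     # folded string
print(any(n.startswith("BINARY") for n in opnames))     # no arithmetic left
```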

What the peephole optimizer does NOT do

No loop unrolling
No inlining of function calls
No dead code elimination beyond trivial cases (if False:)
No branch prediction hints

For real optimization beyond this, look at PyPy (JIT), Cython, or Numba.

Specializing Adaptive Interpreter (3.11+)

At runtime, after an opcode executes a few times with the same types, CPython replaces it with a type-specialized version in the bytecode array itself — this is called quickening. The specialized opcodes skip the general dispatch:

Generic opcode	Specialized version	Condition
`BINARY_OP`	`BINARY_OP_ADD_INT`	Both operands are `int`
`BINARY_OP`	`BINARY_OP_ADD_UNICODE`	Both operands are `str`
`LOAD_ATTR`	`LOAD_ATTR_INSTANCE_VALUE`	Attribute is a plain instance var
`CALL`	`CALL_PY_EXACT_ARGS`	Pure Python function, exact arg count

If the type assumption breaks (you pass a float to the ADD_INT slot), the opcode de-optimizes back to generic and re-specializes over time.

7. Import System Internals

`sys.modules` — the module cache

The first rule of the import system: every successful import is cached in sys.modules. Re-importing the same module returns the cached object — no re-execution.

import sys
import json
print(sys.modules['json'])    # <module 'json' from '...'>
print(sys.modules['json'] is json)   # True — same object

This is why mutating an imported module’s globals affects all importers — they all hold the same module object.
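The sharing is easy to demonstrate: set an attribute on a module through one reference and read it through `sys.modules`. The `demo_flag` attribute is invented for this sketch and removed again at the end.

```python
import importlib
import sys

import json

# A second import returns the cached module object, not a fresh copy.
again = importlib.import_module("json")
print(again is json)                 # True

# A mutation made through one reference is visible through every other.
json.demo_flag = True                # hypothetical attribute, for illustration
seen_elsewhere = sys.modules["json"].demo_flag
print(seen_elsewhere)                # True

del json.demo_flag                   # leave the module as we found it
```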

Finders and loaders

import foo triggers Python to walk sys.meta_path — a list of finder objects. Each finder either returns a spec (found it, here’s how to load it) or None (not mine, try next).

# Typical default contents of sys.meta_path (simplified)
[
    BuiltinImporter,       # handles 'sys', 'builtins', etc.
    FrozenImporter,        # frozen modules (embedded in interpreter)
    PathFinder,            # scans sys.path for file-based modules
]

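PathFinder then asks the callables in sys.path_hooks for a path entry finder for each sys.path entry (a directory, a zip file, and so on); that finder picks the loader for the specific file type (.py, .pyc, extension modules such as .so).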

ELI5: Importing is like going to a library. The meta_path finders are the librarians — you ask each one “do you have this book?” in order. The first one that says yes goes and gets it. sys.modules is the “books you’ve already checked out” pile — if it’s there, no need to ask the librarians again.
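`importlib.util.find_spec` runs the finder search without actually importing anything, which makes it a convenient way to ask “who would load this?”. A short sketch; the nonexistent module name is made up, and the finder list may contain extra entries if tools such as test runners have installed their own.

```python
import importlib.util
import sys

# Names of the finders that will be consulted, in order.
finder_names = [getattr(f, "__name__", type(f).__name__) for f in sys.meta_path]
print(finder_names)

file_spec = importlib.util.find_spec("json")      # found on sys.path
builtin_spec = importlib.util.find_spec("sys")    # compiled into the interpreter
missing_spec = importlib.util.find_spec("no_such_module_xyz_123")  # hypothetical

print(file_spec.name, type(file_spec.loader).__name__)
print(builtin_spec.origin)     # built-in
print(missing_spec)            # None: every finder declined
```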

Circular imports — why they hurt

# a.py
from b import B_thing

# b.py
from a import A_thing

When Python starts importing a.py, it adds a partially initialized a module to sys.modules immediately, before a.py finishes executing. If b.py imports from a during that window, it gets the partial module, and from a import A_thing fails with ImportError because A_thing has not been defined yet. The fix: use import a (binding the module and looking up a.A_thing later, at call time) instead of from a import X, or restructure to break the cycle.

Common mistake: Circular imports work fine until they don’t — they usually blow up only when the import order changes (e.g., you add a third module). The root fix is restructuring, not import tricks.
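The failure can be reproduced in isolation by writing two mutually importing modules to a temporary directory. The module names `cyc_a` and `cyc_b` are made up for this sketch, and the cleanup at the end removes them from `sys.path` and `sys.modules` again.

```python
import importlib
import os
import sys
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Two modules that import a name from each other at the top.
    with open(os.path.join(tmp, "cyc_a.py"), "w") as f:
        f.write("from cyc_b import B_thing\nA_thing = 'a'\n")
    with open(os.path.join(tmp, "cyc_b.py"), "w") as f:
        f.write("from cyc_a import A_thing\nB_thing = 'b'\n")

    sys.path.insert(0, tmp)
    try:
        importlib.import_module("cyc_a")
        error = None
    except ImportError as exc:
        # cyc_b saw a half-built cyc_a that has no A_thing yet
        error = str(exc)
    finally:
        sys.path.remove(tmp)
        sys.modules.pop("cyc_a", None)
        sys.modules.pop("cyc_b", None)

print(error)
```

On recent Python versions the message names the partially initialized module and suggests a circular import as the likely cause.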

Namespace packages (PEP 420)

A directory without __init__.py is still importable as a namespace package (Python 3.3+). Useful for splitting a large package across multiple directories or install locations. Performance cost: finding a namespace package requires scanning all sys.path entries — regular packages (with __init__.py) short-circuit on first match.

8. Alternative Implementations

CPython is the reference implementation but not the only game in town.

Comparison table

Implementation	Language	Execution	Best use case
CPython	C	Bytecode interpreter	Everything — default choice
PyPy	RPython	Tracing JIT compiler	Long-running CPU-bound pure-Python code; often several times faster than CPython
Cython	C + Python	Compiles to C	Tight loops, C interop, numpy extensions
GraalPy	Java	Truffle/GraalVM JIT	Polyglot (run Python + Java + JS together)
RustPython	Rust	Bytecode interpreter	Embedding Python in Rust, learning project
MicroPython	C	Bytecode interpreter	Microcontrollers, embedded systems

When to reach for each

PyPy: you have a CPU-bound Python loop and can’t use numpy/numba. PyPy’s JIT compiles hot loops to machine code. Caveat: C extensions (numpy, pandas) run through PyPy’s cpyext compatibility layer, which is often slower than on CPython, and some extensions do not work at all.
Cython: you need to call C code, or you have a hot function that’s the bottleneck and you want C speeds with Python syntax.
CPython C API: writing a Python extension in C. You get full control and maximum compatibility, at the cost of manual reference counting.
ctypes: calling an existing compiled library (a .so / .dll) from Python without writing any C extension. Zero compilation required, but no type safety.
cffi: like ctypes but more Pythonic, better for complex C APIs, preferred in PyPy-compatible code.

ELI5: PyPy is like a race car driver who learns your track after a few laps and starts taking optimal lines. CPython is a cautious driver following the rules every lap. For a straight highway (predictable code), PyPy wins by a lot. For a chaotic city with random pedestrians (unpredictable types, C extensions), CPython’s reliability wins.

The C API reality

Writing CPython C extensions means managing reference counts manually. Get it wrong and you have either a memory leak (forgot Py_DECREF) or a crash (called Py_DECREF too early — object freed while still reachable).

PyObject *result = PyLong_FromLong(1000000);  // new reference: we own one
Py_INCREF(result);                             // now two references
// ... hand one reference to the caller ...
Py_DECREF(result);                             // drop ours: one remains, not freed

For new C extensions, prefer pybind11 (C++11 RAII handles refcounts automatically) or maturin + PyO3 (Rust, also automatic).

Summary

Topic	The one thing to remember
Execution pipeline	Source → AST → bytecode → ceval loop; `.pyc` caches bytecode
`LOAD_FAST` vs `LOAD_GLOBAL`	Array index vs dict lookup — cache globals as locals in hot loops
Code objects	Immutable, shared across calls; frames are per-call
Evaluation loop	Giant C switch statement; a waiting thread can request the GIL after ~5 ms
Frames	Per-call heap allocation; `f_back` chains form the call stack
Object layout	Every object starts with `ob_refcnt` + `ob_type`
`dict` (3.6+)	Compact + ordered; still hash-based, still O(1) average
`int` cache	-5 to 256 are singletons; use `==` not `is` for value comparison
Peephole optimizer	Constant folding at compile time; specializing interpreter at runtime (3.11+)
Import system	`sys.modules` cache → `sys.meta_path` finders → loaders
Circular imports	Python caches partial modules — restructure to fix, don’t work around
Alternatives	PyPy for CPU-bound loops; Cython for C interop; ctypes for existing libs
