CPython Internals

Most Python engineers treat the interpreter as a black box — code goes in, results come out. That works until it doesn’t: a performance regression you can’t explain, a threading bug that makes no sense, a memory spike with no obvious cause. Understanding CPython’s internals turns those mysteries into five-minute diagnoses.


1. How Python Executes Code

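The pipeline is: source text → tokens → AST → bytecode → execution. Every import runs this full pipeline unless a fresh .pyc cache lets Python skip straight to execution. A script run directly is always recompiled, because the bytecode of the main script is never cached.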

source.py
   │
   ▼  tokenize (lexer)
[TOKENS]  — keywords, names, literals, operators
   │
   ▼  parse (parser)
[AST]     — Abstract Syntax Tree (tree of nodes)
   │
   ▼  compile (compiler + peephole optimizer)
[BYTECODE] — sequence of opcodes for the VM
   │
   ▼  execute (ceval.c evaluation loop)
[RESULT]

“Compiled AND interpreted” is not a contradiction — Python compiles to bytecode (compilation step), then a software interpreter executes that bytecode (interpretation step). It’s compiled to an intermediate form, not to machine code like C.
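The stages above can be driven by hand from the standard library. This is a minimal sketch: the tokenize and ast modules expose Python-level views of the stages, and the one-line source string and the "<demo>" filename are made up for illustration.

```python
import ast
import io
import tokenize

source = "x = 1 + 2\n"  # illustrative one-line program

# Stage 1: source text -> tokens (skip whitespace-only tokens like NEWLINE)
tokens = tokenize.generate_tokens(io.StringIO(source).readline)
token_strings = [tok.string for tok in tokens if tok.string.strip()]

# Stage 2: source text -> AST
tree = ast.parse(source)

# Stage 3: AST -> code object holding the bytecode
code = compile(tree, filename="<demo>", mode="exec")

# Stage 4: execute the bytecode in a fresh namespace
namespace = {}
exec(code, namespace)

print(token_strings)                 # ['x', '=', '1', '+', '2']
print(type(tree.body[0]).__name__)   # Assign
print(namespace["x"])                # 3
```

An import performs the same compile step and then stores the resulting code object in the .pyc file.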

The compilation happens at import time, not at some separate build step. When you import foo, Python compiles foo.py right then, caches the bytecode to __pycache__/foo.cpython-312.pyc, and runs it. On next import, if the .pyc is fresh (mtime + size match), compilation is skipped.

ELI5: It’s like a chef who writes a recipe in French, then a translator converts it to a standard kitchen notation card. The kitchen staff read and execute the notation card. Next time, they reuse the same card unless the French recipe changed. Python’s “notation card” format is bytecode; the .pyc file is the saved card.

What’s in a .pyc file

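Field | Size | Purpose
--- | --- | ---
Magic number | 4 bytes | Identifies the bytecode format of the Python version; a mismatch forces a recompile
Bit field | 4 bytes | Flags (timestamp-based or hash-based invalidation)
Source mtime | 4 bytes | Primary freshness check (timestamp-based .pyc)
Source size | 4 bytes | Secondary freshness check (timestamp-based .pyc)
Source hash | 8 bytes | Replaces mtime and size in a hash-based .pyc
Marshalled code object | variable | The actual bytecode payload

The header is always 16 bytes: magic number, bit field, then either mtime plus size or the 8-byte hash.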
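To see the header on disk, one option is to compile a throwaway file and unpack the first 16 bytes. This is a sketch that assumes a timestamp-based .pyc (forced here through invalidation_mode); the file name demo.py is made up.

```python
import importlib.util
import py_compile
import struct
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "demo.py"          # hypothetical throwaway module
    src.write_text("x = 1\n")
    pyc_path = py_compile.compile(
        str(src),
        cfile=str(Path(tmp) / "demo.pyc"),
        invalidation_mode=py_compile.PycInvalidationMode.TIMESTAMP,
    )
    header = Path(pyc_path).read_bytes()[:16]
    source_size = src.stat().st_size

magic = header[:4]                                    # version-specific magic number
(flags,) = struct.unpack("<I", header[4:8])           # 0 means timestamp-based
mtime, size = struct.unpack("<II", header[8:16])      # source mtime and source size

print(magic == importlib.util.MAGIC_NUMBER)  # True
print(flags, size == source_size)            # 0 True
```

Everything after those 16 bytes is the marshalled code object.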

2. Bytecode Deep Dive

Bytecode is the language the CPython VM speaks. Not machine code — a higher-level, portable, stack-based instruction set.

The dis module

import dis

def greet(name):
    return "Hello, " + name

dis.dis(greet)

Output (Python 3.12, byte-offset column omitted; line numbers count from the top of the snippet):

  3           RESUME                   0

  4           LOAD_CONST               1 ('Hello, ')
              LOAD_FAST                0 (name)
              BINARY_OP                0 (+)
              RETURN_VALUE

Read each line as: <lineno> <opcode> <arg> (<human-readable arg>).

Key opcodes

Opcode | What it does | Speed note
--- | --- | ---
LOAD_FAST | Push local variable onto stack | Array index, O(1), very fast
LOAD_GLOBAL | Push global/builtin onto stack | Dict lookup, slower
LOAD_CONST | Push a literal/constant | Just a pointer copy, effectively instant
STORE_FAST | Pop stack → local variable slot | Array write
BINARY_OP | Pop two operands, push result | Dispatches to type’s __add__ etc.
CALL | Call a callable | Involves frame creation
RETURN_VALUE | Return top of stack | Tears down current frame
BUILD_LIST | Build list from N stack items | Used for list literals

Why LOAD_FAST beats LOAD_GLOBAL

Local variables are stored in a fixed-size C array on the frame — indexed by position (known at compile time). Access is array[index]. Global variables live in f_globals, a Python dict — so every access is a hash lookup. This is the entire reason “cache globals as locals” is a real micro-optimization in hot loops.

# In a tight inner loop, this matters:
import math
def slow():
    for _ in range(1_000_000):
        math.sqrt(2)          # LOAD_GLOBAL math + attribute lookup of sqrt, every iteration

def fast():
    sqrt = math.sqrt          # hoist to local (one global + one attribute lookup, done once)
    for _ in range(1_000_000):
        sqrt(2)               # LOAD_FAST: both lookups saved per iteration

ELI5: Local variables are like items on your desk — you just reach for them by position. Global variables are like items in a filing cabinet you have to look up alphabetically every single time. For one lookup it doesn’t matter; for a million iterations it adds up.
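The difference is visible in the bytecode itself. A small sketch follows; the two functions are made up for illustration, and it compares opcode names by prefix because the exact LOAD_FAST variant differs between CPython versions.

```python
import dis
import math

def uses_global():
    return math.sqrt(2.0)           # looks up the global name 'math' at runtime

def uses_local(sqrt=math.sqrt):
    return sqrt(2.0)                # 'sqrt' is a local slot, bound at definition time

def opnames(func):
    """Return the opcode names of a function, in order."""
    return [ins.opname for ins in dis.get_instructions(func)]

global_ops = opnames(uses_global)
local_ops = opnames(uses_local)

print("LOAD_GLOBAL" in global_ops)                          # True
print("LOAD_GLOBAL" in local_ops)                           # False
print(any(op.startswith("LOAD_FAST") for op in local_ops))  # True
```

Binding the callable as a default argument is one way to turn a global into a local. Plain assignment inside the function, as in fast() above, does the same.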

Code objects

Every function, class body, and module has a code object (types.CodeType) attached:

def add(a, b):
    return a + b

c = add.__code__
print(c.co_varnames)   # ('a', 'b') — local variable names
print(c.co_consts)     # (None,)    — constants used
print(c.co_names)      # ()         names of globals and attributes referenced
print(c.co_argcount)   # 2

The code object is immutable and shared — all calls to add reuse the same code object. The frame (below) is what’s per-call.

Common mistake: co_varnames lists locals IN ORDER, starting with arguments. Don’t assume it’s only arguments — it includes all locals defined in the function.
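A quick check of that ordering, using a made-up function. Slicing by co_argcount is enough here because the function has only positional parameters; keyword-only parameters are counted separately in co_kwonlyargcount.

```python
def area(width, height):
    scale = 2                        # a plain local, not an argument
    result = width * height * scale
    return result

code = area.__code__
arg_names = code.co_varnames[:code.co_argcount]      # the positional arguments
other_locals = code.co_varnames[code.co_argcount:]   # every other local

print(code.co_varnames)   # ('width', 'height', 'scale', 'result')
print(arg_names)          # ('width', 'height')
print(other_locals)       # ('scale', 'result')
```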


3. The Evaluation Loop (ceval.c)

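All bytecode execution happens in a single C function: _PyEval_EvalFrameDefault in Python/ceval.c. Conceptually it is one giant switch statement over opcode values (compiled with computed gotos where the C compiler supports them), and it is CPython’s hot path. Since 3.12 the individual instruction bodies are generated from Python/bytecodes.c rather than written by hand in ceval.c.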

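// Simplified pseudocode (not the real source)
while (1) {
    opcode = next_instr->opcode;   // each instruction carries an opcode...
    oparg  = next_instr->oparg;    // ...and a small argument
    next_instr++;
    switch (opcode) {
        case LOAD_FAST:    PUSH(frame->localsplus[oparg]); break;
        case BINARY_OP:    /* pop two operands, push the result */ break;
        case RETURN_VALUE: return POP();
        // ... well over 100 more cases
    }
}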
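The same fetch, decode, dispatch shape can be modelled in a few lines of Python. This is a toy sketch, not CPython’s implementation: instructions are (name, argument) tuples rather than packed bytes, LOAD_CONST carries its value directly, and BINARY_OP only models addition.

```python
def run(instructions, fast_locals):
    """Execute (opcode, oparg) pairs on a value stack, like a tiny eval loop."""
    stack = []
    pc = 0  # index of the next instruction, playing the role of next_instr
    while True:
        opcode, oparg = instructions[pc]
        pc += 1
        if opcode == "LOAD_FAST":
            stack.append(fast_locals[oparg])   # array index, no dict lookup
        elif opcode == "LOAD_CONST":
            stack.append(oparg)                # toy shortcut: oparg is the value
        elif opcode == "BINARY_OP":
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)         # only "+" is modelled here
        elif opcode == "RETURN_VALUE":
            return stack.pop()
        else:
            raise ValueError(f"unknown opcode: {opcode}")

# The bytecode of greet() from section 2, transcribed by hand.
program = [
    ("LOAD_CONST", "Hello, "),
    ("LOAD_FAST", 0),
    ("BINARY_OP", "+"),
    ("RETURN_VALUE", None),
]
result = run(program, fast_locals=["world"])
print(result)  # Hello, world
```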

The GIL check

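Before Python 3.2, the eval loop offered to release the GIL every 100 bytecode instructions. Since 3.2 the trigger is time-based: a thread waiting for the GIL times out after sys.getswitchinterval() seconds (default 0.005, i.e. 5 ms) and sets a drop request, which raises the eval_breaker flag. The running thread checks eval_breaker at safe points in the eval loop (the same flag also signals pending signal handlers) and, if it is set, releases the GIL so another thread can run. The handover is cooperative inside the interpreter: a switch only ever happens between bytecode instructions, never in the middle of one.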
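The interval is readable and tunable at runtime. A small sketch; the 1 ms value is an arbitrary choice for illustration, not a recommendation.

```python
import sys

original = sys.getswitchinterval()   # seconds; 0.005 unless something changed it

# A shorter interval makes waiting threads request the GIL sooner,
# at the price of more frequent switching overhead.
sys.setswitchinterval(0.001)
shortened = sys.getswitchinterval()

sys.setswitchinterval(original)      # restore the previous setting
restored = sys.getswitchinterval()

print(original, shortened, restored)
```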

Exception handling

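Try/except is not free, but on modern CPython the cost sits almost entirely on the exception path. Up to 3.10, each try block pushed an entry onto a per-frame block stack at runtime, and the matching except clause used it to find the handler. Since 3.11, handlers are located through an exception table stored on the code object (co_exceptiontable), so entering a try block does essentially no extra work (“zero-cost” exceptions). The lookup cost is paid only when an exception is actually raised.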
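On CPython 3.11 and later, handler locations live in a per-code-object exception table, which can be inspected directly. A hedged sketch with two made-up functions; the getattr fallback is there because co_exceptiontable does not exist on older versions.

```python
def guarded(values):
    try:
        return values[0]
    except IndexError:
        return None

def unguarded(values):
    return values[0]

# co_exceptiontable exists on CPython 3.11+; fall back to None elsewhere.
guarded_table = getattr(guarded.__code__, "co_exceptiontable", None)
unguarded_table = getattr(unguarded.__code__, "co_exceptiontable", None)

print(guarded(["a"]), guarded([]))   # a None
if guarded_table is not None:
    # The function with a try block has a non-empty table; the other has none.
    print(len(guarded_table) > 0, len(unguarded_table) == 0)
```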

PEP 659: Adaptive Specializing Interpreter (Python 3.11+)

The generic BINARY_OP opcode dispatches through Python’s type system every time it executes. In 3.11+, CPython tracks how an opcode is actually used at runtime. After a few executions, it specializes the opcode in-place:

BINARY_OP  →  BINARY_OP_ADD_INT    (if both operands are ints)
LOAD_ATTR  →  LOAD_ATTR_INSTANCE_VALUE  (if it's always an instance attr)

Specialized opcodes bypass the generic dispatch: they check “are both ints?” and, if so, call the C integer addition directly. This is one of the main reasons Python 3.11 is about 25% faster than 3.10 on average (measured on the pyperformance suite) without any code changes.

ELI5: The generic opcode is like a restaurant waiter who, every time you order coffee, asks “what kind of coffee, what size, what milk?” The specialized opcode is one who remembers your order — after a few visits he just brings your usual without asking. The speedup is real because in real code, a + b is almost always the same types every time.


4. Frame Objects

Every function call creates a frame object — the execution context: local variables, instruction pointer, reference to code object, globals dict, etc.

import sys

def inner():
    frame = sys._getframe()         # current frame
    print(frame.f_code.co_name)     # 'inner'
    print(frame.f_back.f_code.co_name)  # caller's name
    print(frame.f_locals)           # local variable dict (snapshot)
    print(frame.f_lineno)           # current line number

inner()

Frame fields you’ll actually use

Field | Type | Use case
--- | --- | ---
f_code | code object | What function am I in?
f_back | frame or None | Walk up the call stack
f_locals | dict snapshot (write-through proxy from 3.13) | Inspect locals; reads are useful, writes are unreliable before 3.13
f_globals | dict (live) | The module’s global namespace
f_lineno | int | Current executing line
f_builtins | dict | Builtin namespace for this frame

ELI5: A frame is like a whiteboard for one function call. It has the function’s local variables, a pointer to where in the code you are, and a sticky note saying “when I’m done, return to THIS other whiteboard (f_back).” When the function returns, the whiteboard is erased.
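Following f_back until it runs out reconstructs the call stack by hand. A short sketch with made-up function names; sys._getframe is a CPython-specific helper.

```python
import sys

def call_chain():
    """Return the names of the functions on the stack, innermost caller first."""
    names = []
    frame = sys._getframe(1)          # start at the caller of call_chain()
    while frame is not None:
        names.append(frame.f_code.co_name)
        frame = frame.f_back          # step to the frame that called this one
    return names

def middle():
    return call_chain()

def outer():
    return middle()

chain = outer()
print(chain[:2])  # ['middle', 'outer']
```

The entries after the first two depend on how the script is launched, which is why only the start of the chain is shown.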

Why Python recursion is expensive

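Each recursive call needs a new frame: space for locals, the evaluation stack, and bookkeeping. Up to 3.10 that was a heap-allocated frame object per call; since 3.11 the interpreter uses a lightweight internal frame and only materializes a full frame object on demand. Either way, deep recursion means thousands of frames plus full call overhead at every level. Compare that with languages whose compilers perform tail-call optimization and reuse the same stack frame. CPython has no tail-call optimization, and Guido van Rossum explicitly rejected adding it, chiefly because it would destroy tracebacks.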

import sys
sys.getrecursionlimit()    # 1000 by default
sys.setrecursionlimit(5000)  # raise with caution — stack overflow is a real risk
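Rather than raising the limit, a recursive function can often be rewritten as a loop. A minimal sketch with two made-up functions that compute the same sum:

```python
import sys

def total_recursive(n):
    """Sum 1..n with one Python frame per level."""
    return 0 if n == 0 else n + total_recursive(n - 1)

def total_iterative(n):
    """Same result with a single frame, so no depth limit applies."""
    total = 0
    for i in range(1, n + 1):
        total += i
    return total

too_deep = sys.getrecursionlimit() + 100
try:
    total_recursive(too_deep)
    hit_limit = False
except RecursionError:
    hit_limit = True

print(hit_limit)                  # True
print(total_iterative(100_000))   # 5000050000
```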

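Python 3.11 improvement: frame creation was streamlined. Full frame objects are now created lazily, only when a debugger or an introspection call such as sys._getframe() asks for one, and calls from one pure-Python function to another no longer recurse through the C eval function. The 3.11 release notes attribute a speedup of a few percent on the pyperformance suite to each of these changes.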

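Common mistake: Inspecting frame.f_locals inside a decorator and expecting writes to affect the function. Through Python 3.12, f_locals of a function frame is a snapshot dict: writing to it does nothing to the real locals array. From 3.13, PEP 667 makes f_locals a write-through proxy, so check which version you target. On older versions the only route is ctypes black magic, which is not recommended in production.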


5. Object Layout in C

Every Python object in memory starts with the same C header — this is what makes Python’s type system uniform.

PyObject — the universal header

typedef struct _object {
    Py_ssize_t ob_refcnt;    // reference count
    PyTypeObject *ob_type;   // pointer to type (what type(x) returns)
} PyObject;

Every single Python object — int, str, list, your custom class — starts with these two fields. That’s why type(x) and sys.getrefcount(x) work on literally anything.

PyVarObject extends this with ob_size (number of elements) for variable-length objects like lists and tuples.
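Both header fields can be observed from Python. A small sketch: sys.getrefcount always reports one more than you might expect, because passing the object as an argument temporarily adds a reference, so only the differences are meaningful.

```python
import sys

payload = object()

before = sys.getrefcount(payload)
alias = payload                       # a second name: ob_refcnt goes up by one
after = sys.getrefcount(payload)
del alias                             # dropping the name: ob_refcnt goes back down
restored = sys.getrefcount(payload)

print(after - before)      # 1
print(restored == before)  # True
print(type(payload))       # <class 'object'>, read from ob_type
```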

How int works

Python ints are arbitrary precision. Small ints (-5 to 256) are cached singletons: after a = 5; b = 5, the expression a is b is True. Beyond that range, each int is a heap-allocated PyLongObject with an ob_digit array of 30-bit “digits” in base $2^{30}$ (on typical 64-bit builds).

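>>> (256) is (256)   # 3.8+ also prints a SyntaxWarning for "is" with a literal
True
>>> (257) is (257)   # True: both literals are the same constant in one code object
True
>>> a = 257
>>> b = 257          # compiled separately from the line above
>>> a is b           # False in the CPython REPL: two separate objects
False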

Common mistake: Using is to compare integers. == checks value; is checks identity (same memory address). Outside the -5 to 256 cache, two int objects with the same value are different objects.
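Building the ints at runtime avoids the compiler’s constant sharing, so the cache boundary shows up reliably. This sketch relies on a CPython implementation detail and the values are arbitrary; the identity results are not guaranteed by the language.

```python
# int() on a string creates the value at runtime, so the compiler cannot
# merge the two results into one shared constant.
small_a = int("100")
small_b = int("100")
big_a = int("1000000")
big_b = int("1000000")

print(small_a is small_b)   # True on CPython: both names point at the cached 100
print(big_a is big_b)       # False on CPython: two separate heap objects
print(big_a == big_b)       # True: value comparison is what you almost always want
```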

How str works (PEP 393, Python 3.3+)

Before 3.3, every string was stored as UCS-2 or UCS-4 (fixed width per character). Wasteful for ASCII-heavy text. PEP 393 introduced flexible string representation:

Kind | Width | When used
--- | --- | ---
Compact ASCII | 1 byte/char | All chars ≤ 127
Latin-1 (UCS-1) | 1 byte/char | All chars ≤ 255
UCS-2 | 2 bytes/char | Chars fit in BMP (≤ U+FFFF)
UCS-4 | 4 bytes/char | Any Unicode codepoint

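A pure ASCII string uses 1 byte per character. Mixing in one emoji (a code point above U+FFFF) forces the whole string to UCS-4. Because the width is fixed per string, len() and indexing stay O(1), but memory usage can grow 4× from adding a single such character. One BMP character outside Latin-1 doubles it.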
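The width change is easy to measure with sys.getsizeof. A sketch with four made-up 1000-character strings that differ only in their last character; exact byte counts vary slightly between CPython versions, so only the rough sizes matter.

```python
import sys

texts = {
    "ascii": "a" * 1000,
    "latin1": "a" * 999 + "\u00e9",      # é: fits in one byte per char
    "bmp": "a" * 999 + "\u20ac",         # €: needs two bytes per char
    "astral": "a" * 999 + "\U0001F600",  # emoji: needs four bytes per char
}

sizes = {name: sys.getsizeof(text) for name, text in texts.items()}

for name, text in texts.items():
    print(f"{name:7} len={len(text)} bytes={sizes[name]}")
```

All four strings have the same length, yet the last one takes roughly four times the memory of the first.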

How dict works (Python 3.6+ compact dict)

Pre-3.6 dicts used a sparse hash table — lots of empty slots, unordered. Since 3.6, dicts use a compact representation with a separate indices array:

indices:  [empty, 2, empty, 0, empty, 1, ...]   # sparse, small entries
entries:  [(hash, key, value), ...]             # dense, insertion-ordered

Iteration walks the dense entries array, which is why insertion order is preserved (an implementation detail in CPython 3.6, a language guarantee since 3.7) and why iteration is faster. The indices array handles collision resolution via open addressing.

ELI5: The old dict was a sparse parking lot where cars park in a slot determined by their license plate hash — lots of empty spaces. The new dict is a numbered ticket system: cars park in a compact row in arrival order, and a separate small board maps license plates to ticket numbers. Iteration just reads the compact row left to right.
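The two-array layout can be modelled in pure Python. This is a toy sketch under simplifying assumptions, not CPython’s algorithm: the class name CompactDict is made up, probing is linear (CPython perturbs the probe sequence), the table never resizes, and deletion is not supported.

```python
class CompactDict:
    """Toy model: sparse index table plus dense, insertion-ordered entries."""

    def __init__(self, table_size=8):
        self.indices = [None] * table_size   # sparse: slot -> position in entries
        self.entries = []                    # dense: (hash, key, value) tuples

    def _probe(self, key):
        """Yield candidate slots for key using simple linear probing."""
        size = len(self.indices)
        slot = hash(key) % size
        for _ in range(size):
            yield slot
            slot = (slot + 1) % size

    def __setitem__(self, key, value):
        for slot in self._probe(key):
            pos = self.indices[slot]
            if pos is None:                      # free slot: append a new entry
                self.indices[slot] = len(self.entries)
                self.entries.append((hash(key), key, value))
                return
            if self.entries[pos][1] == key:      # existing key: overwrite in place
                self.entries[pos] = (hash(key), key, value)
                return
        raise RuntimeError("table full (this toy never resizes)")

    def __getitem__(self, key):
        for slot in self._probe(key):
            pos = self.indices[slot]
            if pos is None:                      # hit an empty slot: key is absent
                raise KeyError(key)
            if self.entries[pos][1] == key:
                return self.entries[pos][2]
        raise KeyError(key)

    def keys(self):
        """Iteration reads the dense array, so order is insertion order."""
        return [key for _, key, _ in self.entries]


d = CompactDict()
d["b"] = 1
d["a"] = 2
d["c"] = 3
d["b"] = 10          # overwriting keeps the original position
print(d.keys())      # ['b', 'a', 'c']
print(d["b"])        # 10
```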

How list works

A list is a dynamic array — a C array of PyObject* pointers with over-allocation to amortize append cost. The growth factor is roughly 1.125× (plus some additive constant), giving amortized O(1) append.

import sys
l = []
for i in range(10):
    l.append(i)
    print(f"len={len(l)}, size_bytes={sys.getsizeof(l)}")  # bytes, not slots; jumps mark over-allocation

Insert at index 0 is O(n) — it shifts all existing pointers. If you need fast prepend/pop-from-front, use collections.deque.


6. The Peephole Optimizer

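Between the AST and the final bytecode, the compiler runs a few simple optimization passes that replace common patterns with more efficient ones. The name “peephole optimizer” is partly historical: in Python 3.7 constant folding moved from the bytecode peephole pass to an AST-level optimizer, and the exact split between passes has shifted between releases.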

Constant folding

def f():
    x = 3 * 4       # optimized to LOAD_CONST 12
    y = "hello " + "world"  # optimized to LOAD_CONST 'hello world'
    z = (1, 2, 3)   # tuple literal stays a constant

Check it:

import dis
def f():
    return 3 * 4
dis.dis(f)
# LOAD_CONST 12   ← not 3, not 4, not BINARY_OP

The compiler pre-computes constant expressions — 3 * 4 never runs at runtime.

What the peephole optimizer does NOT do

  • No loop unrolling
  • No inlining of function calls
  • No dead code elimination beyond trivial cases (if False:)
  • No branch prediction hints

For real optimization beyond this, look at PyPy (JIT), Cython, or Numba.
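The boundary between what is and is not optimized can be checked from the bytecode. A hedged sketch with three made-up functions; it matches opcode names loosely because the call and binary opcodes were renamed across CPython versions.

```python
import dis

def folded():
    return "hello " + "world"      # constant expression: folded at compile time

def not_folded():
    return len("abc")              # a call: left for runtime, never pre-computed

def dead_branch():
    if False:
        print("never")             # unreachable: no bytecode is emitted for this
    return 1

def opnames(func):
    """Return the opcode names of a function, in order."""
    return [ins.opname for ins in dis.get_instructions(func)]

print("hello world" in folded.__code__.co_consts)              # True
print(any(op.startswith("BINARY") for op in opnames(folded)))  # False
print(any("CALL" in op for op in opnames(not_folded)))         # True
print(any("CALL" in op for op in opnames(dead_branch)))        # False
```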

Specializing Adaptive Interpreter (3.11+)

At runtime, after an opcode executes a few times with the same types, CPython replaces it with a type-specialized version in the bytecode array itself — this is called quickening. The specialized opcodes skip the general dispatch:

Generic opcode | Specialized version | Condition
--- | --- | ---
BINARY_OP | BINARY_OP_ADD_INT | Both operands are int
BINARY_OP | BINARY_OP_ADD_UNICODE | Both operands are str
LOAD_ATTR | LOAD_ATTR_INSTANCE_VALUE | Attribute is a plain instance var
CALL | CALL_PY_EXACT_ARGS | Pure Python function, exact arg count

If the type assumption breaks (you pass a float to the ADD_INT slot), the opcode de-optimizes back to generic and re-specializes over time.
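On 3.11+ the dis module can show the quickened instructions through its adaptive flag. A sketch under assumptions: the warm-up count of 1000 is arbitrary, and the specialized name you see depends on the CPython version, so the output is described rather than promised.

```python
import dis
import sys

def add(a, b):
    return a + b

for _ in range(1000):      # warm up so the interpreter has a chance to specialize
    add(1, 2)

if sys.version_info >= (3, 11):
    # adaptive=True asks dis for the instructions as currently specialized
    instructions = dis.get_instructions(add, adaptive=True)
else:
    instructions = dis.get_instructions(add)

binary_ops = [ins.opname for ins in instructions if ins.opname.startswith("BINARY_")]
print(binary_ops)   # typically ['BINARY_OP_ADD_INT'] on 3.11+ after warm-up
```

Calling add with floats afterwards and re-running the inspection is a simple way to watch the instruction change.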


7. Import System Internals

sys.modules — the module cache

The first rule of the import system: every successful import is cached in sys.modules. Re-importing the same module returns the cached object — no re-execution.

import sys
import json
print(sys.modules['json'])    # <module 'json' from '...'>
print(sys.modules['json'] is json)   # True — same object

This is why mutating an imported module’s globals affects all importers — they all hold the same module object.
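The shared-object behaviour is easy to confirm. A small sketch; demo_flag is a made-up attribute that is removed again at the end so the real json module is left untouched.

```python
import importlib
import json
import sys

# A second import, by any route, returns the cached module object.
again = importlib.import_module("json")
same_object = again is json is sys.modules["json"]

# An attribute set through one reference is visible through every other one.
json.demo_flag = "set once"                  # hypothetical attribute for the demo
seen_elsewhere = sys.modules["json"].demo_flag
del json.demo_flag                           # clean up

print(same_object)      # True
print(seen_elsewhere)   # set once
```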

Finders and loaders

import foo triggers Python to walk sys.meta_path — a list of finder objects. Each finder either returns a spec (found it, here’s how to load it) or None (not mine, try next).

import sys
print(sys.meta_path)
# Typical default contents, in order:
#   BuiltinImporter   handles 'sys', 'builtins', etc.
#   FrozenImporter    frozen modules (embedded in interpreter)
#   PathFinder        scans sys.path for file-based modules

PathFinder then consults sys.path_hooks to get a path entry finder for each sys.path entry (a directory, a zip file, etc.), and that finder picks the loader for the specific file type (.py, .pyc, .so).

ELI5: Importing is like going to a library. The meta_path finders are the librarians — you ask each one “do you have this book?” in order. The first one that says yes goes and gets it. sys.modules is the “books you’ve already checked out” pile — if it’s there, no need to ask the librarians again.
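The finder walk can be replayed by hand. A sketch: the helper name locate is made up, and it only handles top-level modules (it passes None as the search path).

```python
import sys

def locate(name):
    """Ask each meta path finder in order; return the first (finder, spec) hit."""
    for finder in sys.meta_path:
        find_spec = getattr(finder, "find_spec", None)
        if find_spec is None:          # skip anything without the finder API
            continue
        spec = find_spec(name, None)   # None: top-level module, no parent path
        if spec is not None:
            return finder, spec
    return None, None

builtin_finder, builtin_spec = locate("sys")
path_finder, json_spec = locate("json")

print(builtin_spec.origin)   # built-in
print(json_spec.name)        # json
print(json_spec.origin)      # file path of the json package on this machine
```

In everyday code, importlib.util.find_spec(name) performs this walk for you and also checks sys.modules first.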

Circular imports — why they hurt

# a.py
from b import B_thing

# b.py
from a import A_thing

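When Python starts importing a.py, it adds a partially initialized a module to sys.modules immediately, before a.py finishes executing. If b.py runs from a import A_thing during that window, it gets the partial module, A_thing does not exist yet, and the import fails with ImportError (“cannot import name ... from partially initialized module”). The fix: use import a at module level and look up a.A_thing only when it is needed at call time, or restructure to break the cycle.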

Common mistake: Circular imports work fine until they don’t — they usually blow up only when the import order changes (e.g., you add a third module). The root fix is restructuring, not import tricks.
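The failure can be reproduced in isolation by writing the two modules to a temporary directory. A sketch; the module names cyc_a and cyc_b are made up so they cannot clash with real modules, and the exact wording of the message is what CPython 3.8+ produces.

```python
import importlib
import sys
import tempfile
from pathlib import Path

message = None
with tempfile.TemporaryDirectory() as tmp:
    # Two modules that import a name from each other at module level.
    Path(tmp, "cyc_a.py").write_text("from cyc_b import B_thing\nA_thing = 'a'\n")
    Path(tmp, "cyc_b.py").write_text("from cyc_a import A_thing\nB_thing = 'b'\n")

    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        importlib.import_module("cyc_a")
    except ImportError as exc:
        message = str(exc)             # raised while cyc_b looks for A_thing
    finally:
        sys.path.remove(tmp)
        sys.modules.pop("cyc_a", None)  # drop any half-imported leftovers
        sys.modules.pop("cyc_b", None)

print(message)
```

The error names cyc_a because cyc_b asked for A_thing while cyc_a was still on its first line.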

Namespace packages (PEP 420)

A directory without __init__.py is still importable as a namespace package (Python 3.3+). Useful for splitting a large package across multiple directories or install locations. Performance cost: finding a namespace package requires scanning all sys.path entries — regular packages (with __init__.py) short-circuit on first match.


8. Alternative Implementations

CPython is the reference implementation but not the only game in town.

Comparison table

Implementation | Language | Execution | Best use case
--- | --- | --- | ---
CPython | C | Bytecode interpreter | Everything; the default choice
PyPy | RPython/C | JIT compiler | Long-running CPU-bound code, 2-10× faster
Cython | C + Python | Compiles to C | Tight loops, C interop, numpy extensions
GraalPy | Java | Truffle/GraalVM JIT | Polyglot (run Python + Java + JS together)
RustPython | Rust | Bytecode interpreter | Embedding Python in Rust, learning project
MicroPython | C | Bytecode interpreter | Microcontrollers, embedded systems

When to reach for each

  • PyPy: you have a CPU-bound Python loop and can’t use numpy/numba. PyPy’s JIT compiles hot loops to machine code. Caveat: C extensions (numpy, pandas) go through a compatibility layer on PyPy, so they often run slower than on CPython and some do not work at all.
  • Cython: you need to call C code, or you have a hot function that’s the bottleneck and you want C speeds with Python syntax.
  • CPython C API: writing a Python extension in C. You get full control and maximum compatibility, at the cost of manual reference counting.
  • ctypes: calling an existing compiled library (a .so / .dll) from Python without writing any C extension. Zero compilation required, but no type safety.
  • cffi: like ctypes but more Pythonic, better for complex C APIs, preferred in PyPy-compatible code.

ELI5: PyPy is like a race car driver who learns your track after a few laps and starts taking optimal lines. CPython is a cautious driver following the rules every lap. For a straight highway (predictable code), PyPy wins by a lot. For a chaotic city with random pedestrians (unpredictable types, C extensions), CPython’s reliability wins.

The C API reality

Writing CPython C extensions means managing reference counts manually. Get it wrong and you have either a memory leak (forgot Py_DECREF) or a crash (called Py_DECREF too early — object freed while still reachable).

PyObject *result = PyLong_FromLong(42);  // ob_refcnt = 1
Py_INCREF(result);                        // ob_refcnt = 2
// ... give result to caller ...
Py_DECREF(result);                        // ob_refcnt = 1, not freed

For new C extensions, prefer pybind11 (C++11 RAII handles refcounts automatically) or maturin + PyO3 (Rust, also automatic).


Summary

Topic | The one thing to remember
--- | ---
Execution pipeline | Source → AST → bytecode → ceval loop; .pyc caches bytecode
LOAD_FAST vs LOAD_GLOBAL | Array index vs dict lookup; cache globals as locals in hot loops
Code objects | Immutable, shared across calls; frames are per-call
Evaluation loop | Giant C switch statement; a waiting thread can claim the GIL after about 5 ms
Frames | Per-call execution context; f_back chains form the call stack
Object layout | Every object starts with ob_refcnt + ob_type
dict (3.6+) | Compact + ordered; still hash-based, still O(1) average
int cache | -5 to 256 are singletons; use == not is for value comparison
Peephole optimizer | Constant folding at compile time; specializing interpreter at runtime (3.11+)
Import system | sys.modules cache → sys.meta_path finders → loaders
Circular imports | Python caches partial modules; restructure to fix, don’t work around
Alternatives | PyPy for CPU-bound loops; Cython for C interop; ctypes for existing libs