Lifecycle of Python Code: From Source to Execution¶
Overview¶
When you run python hello.py, a surprising amount happens before your first
print fires. CPython, the reference implementation, transforms plain text
through six distinct phases before the first bytecode instruction of your program executes.
flowchart LR
A["Source text\n(.py file)"] --> B["Tokenizer"]
B --> C["Parser → AST"]
C --> D["Symbol Table\n& Semantic Analysis"]
D --> E["Compiler → Bytecode\n(.pyc / code objects)"]
E --> F["Eval Loop / Virtual Machine\n(ceval.c)"]
F --> G["Output / side effects"]
This guide walks every phase from the simplest angle a beginner needs through the low-level details a CPython contributor cares about.
Part 1 — Beginner: Running Your First Script¶
What actually happens when you type python hello.py?¶
Invisible to you, CPython:
- Reads the file as bytes and decodes them to Unicode.
- Splits the stream into tokens (keywords, names, operators, literals).
- Parses tokens into a tree of Python constructs.
- Compiles that tree to a sequence of simple instructions.
- Executes those instructions one by one.
That's it at 30,000 feet. Each step is explored below.
Phase 0 — Source Encoding¶
Before any parsing begins, CPython looks for an encoding declaration in the first two lines of the file.
# -*- coding: utf-8 -*- # explicit — tells CPython the encoding
# or implicitly: UTF-8 is the default since Python 3
If the file declares no encoding, CPython assumes UTF-8. A mismatch raises
SyntaxError: encoding problem before a single token is produced.
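You can see the declaration at work by handing bytes to compile(), which honours the coding comment just as the file reader does (a minimal sketch; the 0xE9 byte is chosen because it is valid Latin-1 but not valid standalone UTF-8):

```python
# 0xE9 is 'é' in Latin-1, but an invalid byte sequence on its own in UTF-8.
src = b"# -*- coding: latin-1 -*-\nname = '\xe9'\n"
ns = {}
exec(compile(src, "<demo>", "exec"), ns)
print(ns["name"])  # é

# Without the declaration, the same byte fails to decode as UTF-8:
try:
    compile(b"name = '\xe9'\n", "<demo>", "exec")
except SyntaxError as e:
    print(type(e).__name__)
```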
Phase 1 — Tokenization¶
The tokenizer turns a flat sequence of characters into a stream of labelled chunks called tokens.
import tokenize, io
source = 'x = 1 + 2\n'
tokens = tokenize.generate_tokens(io.StringIO(source).readline)
for tok in tokens:
    print(tok)
TokenInfo(type=1 (NAME), string='x', start=(1, 0), end=(1, 1), ...)
TokenInfo(type=54 (OP), string='=', start=(1, 2), end=(1, 3), ...)
TokenInfo(type=2 (NUMBER), string='1', start=(1, 4), end=(1, 5), ...)
TokenInfo(type=54 (OP), string='+', start=(1, 6), end=(1, 7), ...)
TokenInfo(type=2 (NUMBER), string='2', start=(1, 8), end=(1, 9), ...)
TokenInfo(type=4 (NEWLINE), string='\n', ...)
TokenInfo(type=0 (ENDMARKER), ...)
Key token types:
| Token type | Examples |
|---|---|
| NAME | if, def, my_var |
| NUMBER | 42, 3.14, 0xFF |
| STRING | "hello", b"bytes" |
| OP | +, =, (, : |
| NEWLINE / INDENT / DEDENT | block structure markers |
Indentation-based blocks are handled entirely at the tokenizer level using a stack of indent widths. The parser never sees raw whitespace.
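You can watch the tokenizer emit these synthetic tokens for an indented block. A small check (using the tokenize module from the example above) lists just the token names:

```python
import io
import tokenize

src = "if True:\n    x = 1\ny = 2\n"
names = [tokenize.tok_name[tok.type]
         for tok in tokenize.generate_tokens(io.StringIO(src).readline)]
print(names)
# The indented block is bracketed by an INDENT and a DEDENT token;
# no raw whitespace ever reaches the parser.
```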
Phase 2 — Parsing → Abstract Syntax Tree (AST)¶
Tokens feed into a PEG parser that recognises grammatical structure and builds an Abstract Syntax Tree (AST) — a tree of Python constructs.
import ast
source = """
def add(a, b):
    return a + b

result = add(1, 2)
"""
tree = ast.parse(source)
print(ast.dump(tree, indent=2))
Module(
  body=[
    FunctionDef(
      name='add',
      args=arguments(args=[arg(arg='a'), arg(arg='b')]),
      body=[
        Return(
          value=BinOp(
            left=Name(id='a'),
            op=Add(),
            right=Name(id='b')))]),
    Assign(
      targets=[Name(id='result')],
      value=Call(
        func=Name(id='add'),
        args=[Constant(value=1), Constant(value=2)]))],
  ...)
The AST is the last representation that is still human-readable and easily manipulated — it is the target of linters, formatters, and type checkers.
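As a taste of what linters do, here is a minimal, hypothetical check built on ast.walk that flags names assigned but never read (real linters handle far more binding forms than plain assignment):

```python
import ast

src = "def f():\n    unused = 1\n    return 2\n"
tree = ast.parse(src)

# Names bound by simple assignment statements
assigned = {t.id for node in ast.walk(tree) if isinstance(node, ast.Assign)
            for t in node.targets if isinstance(t, ast.Name)}
# Names that are ever read (Load context)
read = {node.id for node in ast.walk(tree)
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load)}
print(assigned - read)  # {'unused'}
```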
Part 2 — Intermediate: Symbol Tables, Bytecode & .pyc Files¶
Phase 3 — Symbol Table & Semantic Analysis¶
After the AST is built, CPython walks it to build symbol tables — one per scope (module, function, class, comprehension). Each table records:
- which names are local, global, free (closure), or cell.
- whether a name is defined before use (for UnboundLocalError detection).
import symtable
source = """
x = 10
def outer():
    y = 20
    def inner():
        return x + y  # x is global, y is free (closure)
    return inner
"""
table = symtable.symtable(source, "<string>", "exec")
def dump_table(t, indent=0):
    prefix = "  " * indent
    print(f"{prefix}Scope: {t.get_name()!r} type={t.get_type()}")
    for sym in t.get_symbols():
        flags = []
        if sym.is_global(): flags.append("global")
        if sym.is_local(): flags.append("local")
        if sym.is_free(): flags.append("free")
        if sym.is_assigned(): flags.append("assigned")
        print(f"{prefix}  {sym.get_name()}: {flags}")
    for child in t.get_children():
        dump_table(child, indent + 1)

dump_table(table)
Scope: 'top' type=module
  x: ['assigned', 'local']
  outer: ['assigned', 'local']
  Scope: 'outer' type=function
    y: ['assigned', 'local']
    inner: ['assigned', 'local']
    Scope: 'inner' type=function
      x: ['global']
      y: ['free']
This is why the following raises UnboundLocalError — the symbol table marks
x as local (because of the assignment), so the read before assignment fails:
x = 10

def broken():
    print(x)  # UnboundLocalError: local variable 'x' referenced before assignment
    x = 99    # this assignment makes x LOCAL for the whole function

broken()
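Declaring the name global changes its symbol table entry, so the same read succeeds:

```python
x = 10

def fixed():
    global x   # the symbol table now marks x as global in this scope
    print(x)   # reads the module-level x: 10
    x = 99     # rebinds the module-level x

fixed()
print(x)  # 99
```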
Phase 4 — Compilation to Bytecode¶
The AST + symbol table feeds the compiler, which emits a code object
(types.CodeType) for every scope.
Disassembling a code object with dis prints one instruction per line in the form: source_line offset opcode argument (name).
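For example (the exact opcodes vary between CPython versions; the sketch in the comments is from a 3.11-era interpreter):

```python
import dis

def add(a, b):
    return a + b

dis.dis(add)
# Typical 3.11+ output includes lines such as:
#   2    LOAD_FAST     0 (a)
#        LOAD_FAST     1 (b)
#        BINARY_OP     0 (+)
#        RETURN_VALUE
```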
A code object carries everything needed to run a scope:

def add(a, b):
    return a + b

code = add.__code__
print("co_varnames  :", code.co_varnames)   # ('a', 'b')
print("co_consts    :", code.co_consts)     # (None,)
print("co_stacksize :", code.co_stacksize)  # max stack depth needed
print("co_filename  :", code.co_filename)
print("co_firstlineno:", code.co_firstlineno)
.pyc files — the bytecode cache¶
CPython saves compiled code objects to __pycache__/<module>.cpython-3XX.pyc
so subsequent imports skip re-parsing.
import marshal, struct, time, dis

pyc_path = "__pycache__/hello.cpython-312.pyc"
with open(pyc_path, "rb") as f:
    magic = f.read(4)                          # CPython version magic
    flags = f.read(4)                          # bit flags
    mtime = struct.unpack("<I", f.read(4))[0]
    size  = struct.unpack("<I", f.read(4))[0]
    code  = marshal.load(f)

print("Magic      :", magic.hex())
print("mtime      :", time.ctime(mtime))
print("Source size:", size, "bytes")
dis.dis(code)
The magic number changes whenever the bytecode format changes (at least once per
minor release). A stale .pyc written by a different version is silently ignored
and the source is re-compiled.
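The running interpreter's own magic number is exposed by importlib.util, which is what the import machinery compares against the .pyc header:

```python
import importlib.util

# First 4 bytes of every .pyc this interpreter writes or accepts
print(importlib.util.MAGIC_NUMBER.hex())
# The final two bytes are always b'\r\n', a guard against
# text-mode line-ending corruption of the file.
print(importlib.util.MAGIC_NUMBER[2:])
```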
Understanding the Evaluation Stack¶
CPython's VM is a stack machine. Every expression pushes and pops values from a runtime value stack.
LOAD_CONST 1 # stack: [1]
LOAD_CONST 2 # stack: [1, 2]
BINARY_OP + # stack: [3] (pop 2, push 1+2)
LOAD_CONST 3 # stack: [3, 3]
BINARY_OP * # stack: [9]
STORE_NAME result # stack: []
co_stacksize is the maximum depth this stack ever reaches — CPython
pre-allocates this space when a frame is created.
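You can check this on a compiled snippet. Names are used instead of literals so constant folding does not collapse the expression first:

```python
# (a + b) * c needs at most two operands on the stack at once
code = compile("result = (a + b) * c", "<expr>", "exec")
print(code.co_stacksize)  # 2
```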
Part 3 — Advanced: Frames, the Eval Loop & Import System¶
Phase 5 — Execution in the Eval Loop¶
Each call to a Python function creates a frame object (PyFrameObject in
C). A frame binds a code object to a set of local variables and tracks the
current instruction pointer.
import sys

def show_frame():
    frame = sys._getframe()
    print("Function  :", frame.f_code.co_name)
    print("File      :", frame.f_code.co_filename)
    print("Line      :", frame.f_lineno)
    print("Locals    :", list(frame.f_locals.keys()))
    print("Back frame:", frame.f_back.f_code.co_name)

def caller():
    x = 42
    show_frame()

caller()
Frames form a call stack. sys._getframe(n) walks n levels up. traceback.print_stack()
traverses the same chain of f_back links, and so does the traceback printed when an
exception propagates.
The CPython Eval Loop (ceval.c)¶
At the C level the main interpreter loop is a giant switch in
Python/ceval.c:
// Simplified sketch of Python/ceval.c (not actual code)
for (;;) {
    opcode = next_opcode(frame);
    switch (opcode) {
    case LOAD_FAST:
        PUSH(frame->localsplus[oparg]);
        break;
    case BINARY_OP:
        right = POP();
        left = TOP();
        SET_TOP(PyNumber_Add(left, right));
        break;
    case RETURN_VALUE:
        retval = POP();
        goto return_or_yield;
    // ... ~150 more opcodes
    }
}
CPython has long used computed gotos (goto *dispatch_table[opcode]) where the
compiler supports them; CPython 3.11 added the specialising adaptive interpreter
(PEP 659), whose inline caches speed up repeated patterns like attribute lookups.
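On 3.11+ you can watch specialisation happen: dis.dis(..., adaptive=True) shows the quickened instructions once a function has run enough times (the opcode name in the comment is illustrative and version-dependent):

```python
import dis
import sys

class Point:
    def __init__(self):
        self.x = 1

def get_x(p):
    return p.x

# Warm the function so the adaptive interpreter can specialise LOAD_ATTR
for _ in range(1000):
    get_x(Point())

if sys.version_info >= (3, 11):
    dis.dis(get_x, adaptive=True)  # may show e.g. LOAD_ATTR_INSTANCE_VALUE
```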
Tracing Execution in Pure Python¶
Python exposes a trace hook that fires on every line, call, and return:
import sys

def tracer(frame, event, arg):
    if event in ("call", "return", "line"):
        print(f"[{event:6s}] {frame.f_code.co_name}:{frame.f_lineno} arg={arg!r}")
    return tracer  # must return itself to keep tracing

def greet(name):
    msg = f"Hello, {name}"
    return msg

sys.settrace(tracer)
result = greet("Alice")
sys.settrace(None)
print(result)
[call ] greet:6 arg=None
[line ] greet:7 arg=None
[line ] greet:8 arg=None
[return] greet:8 arg='Hello, Alice'
Hello, Alice
This is the mechanism used by debuggers (pdb), coverage tools
(coverage.py), and profilers (cProfile).
The Import System¶
import foo triggers its own mini-lifecycle:
1. sys.modules cache check → already imported? return cached module
2. Finder loop (sys.meta_path) → locate the module
3. Loader → read source / bytecode, create module object
4. Execution → run the module's top-level code in its namespace
5. sys.modules registration → cache for future imports
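Step 1 is easy to observe: anything already present in sys.modules is returned as-is, so you can even plant a module there by hand (the name demo_cached below is made up for the demonstration):

```python
import sys
import types

mod = types.ModuleType("demo_cached")
mod.value = 123
sys.modules["demo_cached"] = mod  # pre-seed the import cache

import demo_cached  # no finder runs; the cached object is returned
print(demo_cached is mod)  # True
```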
By default sys.meta_path holds three finders:

BuiltinImporter   # handles math, sys, etc.
FrozenImporter    # handles _frozen_importlib
PathFinder        # handles everything on sys.path
Writing a custom import hook¶
import sys
from importlib.abc import MetaPathFinder, Loader
from importlib.util import spec_from_loader

class DebugLoader(Loader):
    def __init__(self, name):
        self.name = name

    def create_module(self, spec):
        return None  # use default semantics

    def exec_module(self, module):
        print(f"[hook] executing module: {self.name!r}")
        module.answer = 42

class DebugFinder(MetaPathFinder):
    def find_spec(self, fullname, path, target=None):
        if fullname.startswith("magic_"):
            loader = DebugLoader(fullname)
            return spec_from_loader(fullname, loader)
        return None  # let other finders handle it

sys.meta_path.insert(0, DebugFinder())

import magic_module  # triggers our hook
print(magic_module.answer)  # 42
AST Manipulation at Runtime¶
Because the AST is exposed through the ast module you can intercept and
rewrite code before it is compiled — used by frameworks like pytest
(assertion rewriting) and numba (JIT compilation).
import ast

# Rewrite every integer literal to be doubled
class DoubleLiterals(ast.NodeTransformer):
    def visit_Constant(self, node):
        if isinstance(node.value, int):
            node.value *= 2
        return node

source = "x = 1 + 2; print(x)"
tree = ast.parse(source)
tree = DoubleLiterals().visit(tree)
ast.fix_missing_locations(tree)
code = compile(tree, "<doubled>", "exec")
exec(code)  # prints 6 (doubled 1→2, 2→4)
ast.fix_missing_locations propagates lineno and col_offset to any nodes
that the transformer added or modified — required before compile().
Bytecode Manipulation¶
You can build or patch bytecode directly using the bytecode library (third
party) or the lower-level opcode / dis modules.
# pip install bytecode
from bytecode import Bytecode, Instr

def original():
    return 6 * 7  # the peephole optimiser folds this to the constant 42

# Read, patch, write back
bc = Bytecode.from_code(original.__code__)
for i, instr in enumerate(bc):
    # 3.12+ may emit RETURN_CONST instead of LOAD_CONST + RETURN_VALUE
    if (isinstance(instr, Instr)
            and instr.name in ("LOAD_CONST", "RETURN_CONST")
            and instr.arg == 42):
        bc[i] = Instr(instr.name, 700)  # patch 42 → 700
original.__code__ = bc.to_code()
print(original())  # 700
Full Lifecycle Cheat-Sheet¶
.py source
│
▼ Phase 0 — encoding detection (BOM / coding comment)
│
▼ Phase 1 — tokenizer (tokenize module)
│ → stream of TOKEN objects (NAME, NUMBER, OP, INDENT…)
│
▼ Phase 2 — PEG parser (Parser/parser.c, generated from Grammar/python.gram)
│ → ast.Module / ast.FunctionDef / ast.Expr …
│
▼ Phase 3 — symbol table (symtable module)
│ → scope resolution (local / global / free / cell)
│
▼ Phase 4 — compiler (Python/compile.c)
│ → code objects (co_code, co_consts, co_varnames …)
│ → optionally cached as .pyc in __pycache__/
│
▼ Phase 5 — eval loop (Python/ceval.c)
→ frames created per call, stack machine executes opcodes
→ objects allocated on the heap, reference counted + GC
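The last line can be poked at directly: sys.getrefcount reports an object's reference count, and the gc module controls the cycle collector that supplements reference counting:

```python
import gc
import sys

obj = []
# At least two references exist at this point: `obj` itself and
# the temporary reference held by getrefcount's argument.
print(sys.getrefcount(obj) >= 2)
print(gc.isenabled())  # the cyclic collector is on by default
```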
Key Modules Reference¶
| Module | What it exposes |
|---|---|
| tokenize | token stream from source text |
| ast | parse, inspect, and transform AST nodes |
| symtable | scope/symbol information per function |
| dis | disassemble code objects to human-readable bytecode |
| marshal | low-level serialisation of code objects (used by .pyc) |
| importlib | full import machinery: finders, loaders, specs |
| sys | sys.meta_path, sys.modules, sys.settrace, sys._getframe |
| types | CodeType, FunctionType, ModuleType (the runtime objects) |
| opcode | opcode names and numbers for the current CPython version |