Lifecycle of Python Code: From Source to Execution¶
Overview¶
When you run python hello.py, a surprising amount happens before your first
print fires. CPython, the reference implementation, transforms plain text
through six distinct phases before the first bytecode instruction of your program executes.
flowchart LR
A["Source text\n(.py file)"] --> B["Tokenizer"]
B --> C["Parser → AST"]
C --> D["Symbol Table\n& Semantic Analysis"]
D --> E["Compiler → Bytecode\n(.pyc / code objects)"]
E --> F["Eval Loop / Virtual Machine\n(ceval.c)"]
F --> G["Output / side effects"]
This guide walks every phase from the simplest angle a beginner needs through the low-level details a CPython contributor cares about.
Part 1 — Beginner: Running Your First Script¶
What actually happens when you type python hello.py?¶
Invisible to you, CPython:
- Reads the file as bytes and decodes them to Unicode.
- Splits the stream into tokens (keywords, names, operators, literals).
- Parses tokens into a tree of Python constructs.
- Compiles that tree to a sequence of simple instructions.
- Executes those instructions one by one.
That's it at 30,000 feet. Each step is explored below.
Phase 0 — Source Encoding¶
Before any parsing begins, CPython looks for an encoding declaration in the first two lines of the file.
# -*- coding: utf-8 -*- # explicit — tells CPython the encoding
# or implicitly: UTF-8 is the default since Python 3
If the file declares no encoding, CPython assumes UTF-8. A mismatch raises
SyntaxError: encoding problem before a single token is produced.
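You can see the declaration at work by handing bytes to compile(), which honours the coding comment just as the file reader does (a minimal sketch; the 0xE9 byte is chosen because it is valid Latin-1 but not valid standalone UTF-8):

```python
# 0xE9 is 'é' in Latin-1, but an invalid byte sequence on its own in UTF-8.
src = b"# -*- coding: latin-1 -*-\nname = '\xe9'\n"
ns = {}
exec(compile(src, "<demo>", "exec"), ns)
print(ns["name"])  # é

# Without the declaration, the same byte fails to decode as UTF-8:
try:
    compile(b"name = '\xe9'\n", "<demo>", "exec")
except SyntaxError as e:
    print(type(e).__name__)
```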
Phase 1 — Tokenization¶
The tokenizer turns a flat sequence of characters into a stream of labelled chunks called tokens.
import tokenize, io
source = 'x = 1 + 2\n'
tokens = tokenize.generate_tokens(io.StringIO(source).readline)
for tok in tokens:
    print(tok)
TokenInfo(type=1 (NAME), string='x', start=(1, 0), end=(1, 1), ...)
TokenInfo(type=54 (OP), string='=', start=(1, 2), end=(1, 3), ...)
TokenInfo(type=2 (NUMBER), string='1', start=(1, 4), end=(1, 5), ...)
TokenInfo(type=54 (OP), string='+', start=(1, 6), end=(1, 7), ...)
TokenInfo(type=2 (NUMBER), string='2', start=(1, 8), end=(1, 9), ...)
TokenInfo(type=4 (NEWLINE), string='\n', ...)
TokenInfo(type=0 (ENDMARKER), ...)
Key token types:
| Token type | Examples |
|---|---|
| NAME | if, def, my_var |
| NUMBER | 42, 3.14, 0xFF |
| STRING | "hello", b"bytes" |
| OP | +, =, (, : |
| NEWLINE / INDENT / DEDENT | block structure markers |
Indentation-based blocks are handled entirely at the tokenizer level using a stack of indent widths. The parser never sees raw whitespace.
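You can watch the tokenizer emit these synthetic tokens for an indented block. A small check (using the tokenize module from the example above) lists just the token names:

```python
import io
import tokenize

src = "if True:\n    x = 1\ny = 2\n"
names = [tokenize.tok_name[tok.type]
         for tok in tokenize.generate_tokens(io.StringIO(src).readline)]
print(names)
# The indented block is bracketed by an INDENT and a DEDENT token;
# no raw whitespace ever reaches the parser.
```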
Phase 2 — Parsing → Abstract Syntax Tree (AST)¶
Tokens feed into a PEG parser that recognises grammatical structure and builds an Abstract Syntax Tree (AST) — a tree of Python constructs.
import ast
source = """
def add(a, b):
    return a + b

result = add(1, 2)
"""
tree = ast.parse(source)
print(ast.dump(tree, indent=2))
Module(
  body=[
    FunctionDef(
      name='add',
      args=arguments(args=[arg(arg='a'), arg(arg='b')]),
      body=[
        Return(
          value=BinOp(
            left=Name(id='a'),
            op=Add(),
            right=Name(id='b')))]),
    Assign(
      targets=[Name(id='result')],
      value=Call(
        func=Name(id='add'),
        args=[Constant(value=1), Constant(value=2)]))],
  ...)
The AST is the last representation that is still human-readable and easily manipulated — it is the target of linters, formatters, and type checkers.
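As a taste of what linters do, here is a minimal, hypothetical check built on ast.walk that flags names assigned but never read (real linters handle far more binding forms than plain assignment):

```python
import ast

src = "def f():\n    unused = 1\n    return 2\n"
tree = ast.parse(src)

# Names bound by simple assignment statements
assigned = {t.id for node in ast.walk(tree) if isinstance(node, ast.Assign)
            for t in node.targets if isinstance(t, ast.Name)}
# Names that are ever read (Load context)
read = {node.id for node in ast.walk(tree)
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load)}
print(assigned - read)  # {'unused'}
```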
Part 2 — Intermediate: Symbol Tables, Bytecode & .pyc Files¶
Phase 3 — Symbol Table & Semantic Analysis¶
After the AST is built, CPython walks it to build symbol tables — one per scope (module, function, class, comprehension). Each table records:
- which names are local, global, free (closure), or cell.
- whether a name is defined before use (for UnboundLocalError detection).
import symtable
source = """
x = 10
def outer():
    y = 20
    def inner():
        return x + y  # x is global, y is free (closure)
    return inner
"""
table = symtable.symtable(source, "<string>", "exec")
def dump_table(t, indent=0):
    prefix = "  " * indent
    print(f"{prefix}Scope: {t.get_name()!r} type={t.get_type()}")
    for sym in t.get_symbols():
        flags = []
        if sym.is_global(): flags.append("global")
        if sym.is_local(): flags.append("local")
        if sym.is_free(): flags.append("free")
        if sym.is_assigned(): flags.append("assigned")
        print(f"{prefix}  {sym.get_name()}: {flags}")
    for child in t.get_children():
        dump_table(child, indent + 1)

dump_table(table)
Scope: 'top' type=module
  x: ['assigned', 'local']
  outer: ['assigned', 'local']
  Scope: 'outer' type=function
    y: ['assigned', 'local']
    inner: ['assigned', 'local']
    Scope: 'inner' type=function
      x: ['global']
      y: ['free']
This is why the following raises UnboundLocalError — the symbol table marks
x as local (because of the assignment), so the read before assignment fails:
x = 10

def broken():
    print(x)  # UnboundLocalError: local variable 'x' referenced before assignment
    x = 99    # this assignment makes x LOCAL for the whole function

broken()
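Declaring the name global changes its symbol table entry, so the same read succeeds:

```python
x = 10

def fixed():
    global x   # the symbol table now marks x as global in this scope
    print(x)   # reads the module-level x: 10
    x = 99     # rebinds the module-level x

fixed()
print(x)  # 99
```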
Phase 4 — Compilation to Bytecode¶
The AST + symbol table feeds the compiler, which emits a code object
(types.CodeType) for every scope.
Disassembling a code object with dis prints one instruction per line in the form: source_line offset opcode argument (name).
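For example (the exact opcodes vary between CPython versions; the sketch in the comments is from a 3.11-era interpreter):

```python
import dis

def add(a, b):
    return a + b

dis.dis(add)
# Typical 3.11+ output includes lines such as:
#   2    LOAD_FAST     0 (a)
#        LOAD_FAST     1 (b)
#        BINARY_OP     0 (+)
#        RETURN_VALUE
```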
A code object carries everything needed to run a scope:

def add(a, b):
    return a + b

code = add.__code__
print("co_varnames  :", code.co_varnames)   # ('a', 'b')
print("co_consts    :", code.co_consts)     # (None,)
print("co_stacksize :", code.co_stacksize)  # max stack depth needed
print("co_filename  :", code.co_filename)
print("co_firstlineno:", code.co_firstlineno)
.pyc files — the bytecode cache¶
CPython saves compiled code objects to __pycache__/<module>.cpython-3XX.pyc
so subsequent imports skip re-parsing.
import marshal, struct, time, dis

pyc_path = "__pycache__/hello.cpython-312.pyc"
with open(pyc_path, "rb") as f:
    magic = f.read(4)                          # CPython version magic
    flags = f.read(4)                          # bit flags
    mtime = struct.unpack("<I", f.read(4))[0]
    size  = struct.unpack("<I", f.read(4))[0]
    code  = marshal.load(f)

print("Magic      :", magic.hex())
print("mtime      :", time.ctime(mtime))
print("Source size:", size, "bytes")
dis.dis(code)
The magic number changes whenever the bytecode format changes (at least once per
minor release). A stale .pyc written by a different version is silently ignored
and the source is re-compiled.
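The running interpreter's own magic number is exposed by importlib.util, which is what the import machinery compares against the .pyc header:

```python
import importlib.util

# First 4 bytes of every .pyc this interpreter writes or accepts
print(importlib.util.MAGIC_NUMBER.hex())
# The final two bytes are always b'\r\n', a guard against
# text-mode line-ending corruption of the file.
print(importlib.util.MAGIC_NUMBER[2:])
```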
Understanding the Evaluation Stack¶
CPython's VM is a stack machine. Every expression pushes and pops values from a runtime value stack.
LOAD_CONST 1 # stack: [1]
LOAD_CONST 2 # stack: [1, 2]
BINARY_OP + # stack: [3] (pop 2, push 1+2)
LOAD_CONST 3 # stack: [3, 3]
BINARY_OP * # stack: [9]
STORE_NAME result # stack: []
co_stacksize is the maximum depth this stack ever reaches — CPython
pre-allocates this space when a frame is created.
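You can check this on a compiled snippet. Names are used instead of literals so constant folding does not collapse the expression first:

```python
# (a + b) * c needs at most two operands on the stack at once
code = compile("result = (a + b) * c", "<expr>", "exec")
print(code.co_stacksize)  # 2
```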
Part 3 — Advanced: Frames, the Eval Loop & Import System¶
Phase 5 — Execution in the Eval Loop¶
Each call to a Python function creates a frame object (PyFrameObject in
C). A frame binds a code object to a set of local variables and tracks the
current instruction pointer.
import sys

def show_frame():
    frame = sys._getframe()
    print("Function  :", frame.f_code.co_name)
    print("File      :", frame.f_code.co_filename)
    print("Line      :", frame.f_lineno)
    print("Locals    :", list(frame.f_locals.keys()))
    print("Back frame:", frame.f_back.f_code.co_name)

def caller():
    x = 42
    show_frame()

caller()
Frames form a call stack. sys._getframe(n) walks n levels up. traceback.print_stack()
traverses the same chain of f_back links, and so does the traceback printed when an
exception propagates.
The CPython Eval Loop (ceval.c)¶
At the C level the main interpreter loop is a giant switch in
Python/ceval.c:
// Simplified sketch of Python/ceval.c (not actual code)
for (;;) {
    opcode = next_opcode(frame);
    switch (opcode) {
    case LOAD_FAST:
        PUSH(frame->localsplus[oparg]);
        break;
    case BINARY_OP:
        right = POP();
        left = TOP();
        SET_TOP(PyNumber_Add(left, right));
        break;
    case RETURN_VALUE:
        retval = POP();
        goto return_or_yield;
    // ... ~150 more opcodes
    }
}
CPython has long used computed gotos (goto *dispatch_table[opcode]) where the
compiler supports them; CPython 3.11 added the specialising adaptive interpreter
(PEP 659), whose inline caches speed up repeated patterns like attribute lookups.
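On 3.11+ you can watch specialisation happen: dis.dis(..., adaptive=True) shows the quickened instructions once a function has run enough times (the opcode name in the comment is illustrative and version-dependent):

```python
import dis
import sys

class Point:
    def __init__(self):
        self.x = 1

def get_x(p):
    return p.x

# Warm the function so the adaptive interpreter can specialise LOAD_ATTR
for _ in range(1000):
    get_x(Point())

if sys.version_info >= (3, 11):
    dis.dis(get_x, adaptive=True)  # may show e.g. LOAD_ATTR_INSTANCE_VALUE
```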
Tracing Execution in Pure Python¶
Python exposes a trace hook that fires on every line, call, and return:
import sys

def tracer(frame, event, arg):
    if event in ("call", "return", "line"):
        print(f"[{event:6s}] {frame.f_code.co_name}:{frame.f_lineno} arg={arg!r}")
    return tracer  # must return itself to keep tracing

def greet(name):
    msg = f"Hello, {name}"
    return msg

sys.settrace(tracer)
result = greet("Alice")
sys.settrace(None)
print(result)
[call ] greet:6 arg=None
[line ] greet:7 arg=None
[line ] greet:8 arg=None
[return] greet:8 arg='Hello, Alice'
Hello, Alice
This is the mechanism used by debuggers (pdb), coverage tools
(coverage.py), and profilers (cProfile).
The Import System¶
import foo triggers its own mini-lifecycle:
1. sys.modules cache check → already imported? return cached module
2. Finder loop (sys.meta_path) → locate the module
3. Loader → read source / bytecode, create module object
4. Execution → run the module's top-level code in its namespace
5. sys.modules registration → cache for future imports
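Step 1 is easy to observe: anything already present in sys.modules is returned as-is, so you can even plant a module there by hand (the name demo_cached below is made up for the demonstration):

```python
import sys
import types

mod = types.ModuleType("demo_cached")
mod.value = 123
sys.modules["demo_cached"] = mod  # pre-seed the import cache

import demo_cached  # no finder runs; the cached object is returned
print(demo_cached is mod)  # True
```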
By default sys.meta_path holds three finders:

BuiltinImporter   # handles math, sys, etc.
FrozenImporter    # handles _frozen_importlib
PathFinder        # handles everything on sys.path
Writing a custom import hook¶
import sys
from importlib.abc import MetaPathFinder, Loader
from importlib.util import spec_from_loader

class DebugLoader(Loader):
    def __init__(self, name):
        self.name = name

    def create_module(self, spec):
        return None  # use default semantics

    def exec_module(self, module):
        print(f"[hook] executing module: {self.name!r}")
        module.answer = 42

class DebugFinder(MetaPathFinder):
    def find_spec(self, fullname, path, target=None):
        if fullname.startswith("magic_"):
            loader = DebugLoader(fullname)
            return spec_from_loader(fullname, loader)
        return None  # let other finders handle it

sys.meta_path.insert(0, DebugFinder())

import magic_module  # triggers our hook
print(magic_module.answer)  # 42
AST Manipulation at Runtime¶
Because the AST is exposed through the ast module you can intercept and
rewrite code before it is compiled — used by frameworks like pytest
(assertion rewriting) and numba (JIT compilation).
import ast

# Rewrite every integer literal to be doubled
class DoubleLiterals(ast.NodeTransformer):
    def visit_Constant(self, node):
        if isinstance(node.value, int):
            node.value *= 2
        return node

source = "x = 1 + 2; print(x)"
tree = ast.parse(source)
tree = DoubleLiterals().visit(tree)
ast.fix_missing_locations(tree)
code = compile(tree, "<doubled>", "exec")
exec(code)  # prints 6 (doubled 1→2, 2→4)
ast.fix_missing_locations propagates lineno and col_offset to any nodes
that the transformer added or modified — required before compile().
Bytecode Manipulation¶
You can build or patch bytecode directly using the bytecode library (third
party) or the lower-level opcode / dis modules.
# pip install bytecode
from bytecode import Bytecode, Instr

def original():
    return 6 * 7  # the peephole optimiser folds this to the constant 42

# Read, patch, write back
bc = Bytecode.from_code(original.__code__)
for i, instr in enumerate(bc):
    # 3.12+ may emit RETURN_CONST instead of LOAD_CONST + RETURN_VALUE
    if (isinstance(instr, Instr)
            and instr.name in ("LOAD_CONST", "RETURN_CONST")
            and instr.arg == 42):
        bc[i] = Instr(instr.name, 700)  # patch 42 → 700
original.__code__ = bc.to_code()
print(original())  # 700
Full Lifecycle Cheat-Sheet¶
.py source
│
▼ Phase 0 — encoding detection (BOM / coding comment)
│
▼ Phase 1 — tokenizer (tokenize module)
│ → stream of TOKEN objects (NAME, NUMBER, OP, INDENT…)
│
▼ Phase 2 — PEG parser (Parser/parser.c, generated from Grammar/python.gram)
│ → ast.Module / ast.FunctionDef / ast.Expr …
│
▼ Phase 3 — symbol table (symtable module)
│ → scope resolution (local / global / free / cell)
│
▼ Phase 4 — compiler (Python/compile.c)
│ → code objects (co_code, co_consts, co_varnames …)
│ → optionally cached as .pyc in __pycache__/
│
▼ Phase 5 — eval loop (Python/ceval.c)
→ frames created per call, stack machine executes opcodes
→ objects allocated on the heap, reference counted + GC
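The last line can be poked at directly: sys.getrefcount reports an object's reference count, and the gc module controls the cycle collector that supplements reference counting:

```python
import gc
import sys

obj = []
# At least two references exist at this point: `obj` itself and
# the temporary reference held by getrefcount's argument.
print(sys.getrefcount(obj) >= 2)
print(gc.isenabled())  # the cyclic collector is on by default
```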
Key Modules Reference¶
| Module | What it exposes |
|---|---|
| tokenize | token stream from source text |
| ast | parse, inspect, and transform AST nodes |
| symtable | scope/symbol information per function |
| dis | disassemble code objects to human-readable bytecode |
| marshal | low-level serialisation of code objects (used by .pyc) |
| importlib | full import machinery: finders, loaders, specs |
| sys | sys.meta_path, sys.modules, sys.settrace, sys._getframe |
| types | CodeType, FunctionType, ModuleType (the runtime objects) |
| opcode | opcode names and numbers for the current CPython version |