Architecture: Autograd & Tape
netcl's autograd API is a classical reverse-mode
automatic-differentiation engine built around a per-thread
Tape. When you wrap a block of code in
with ag.Tape():, every differentiable op inside the block creates a
Node that records (a) its output
Tensor, (b) a grad_fn that knows how to compute local
gradients, and (c) its parents in the computation graph. Calling
tape.backward(loss) then walks the graph in reverse topological
order and accumulates gradients back into the leaf nodes.
The forward graph and the implicit backward DAG are illustrated below
in the canonical y = (x+1)**2 ; z = 3*y ; loss = z.sum() example.
Caption: solid arrows are forward edges (parents); dashed arrows
are the reverse edges traversed during backward(). The seed gradient
at loss is a tensor of ones with the same shape as loss.value; the chain rule
combines it with each grad_fn to produce parent gradients.
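As a hand-worked check of the example above, the chain rule gives d(loss)/dx = 6 * (x + 1) per element. The sketch below uses plain Python lists in place of netcl tensors (an assumption, so that it runs without the library) and applies each local gradient in reverse order, the way the dashed edges are traversed.

```python
# Hand-rolled reverse pass for y = (x+1)**2 ; z = 3*y ; loss = z.sum().
# Plain lists stand in for netcl Tensors; nothing here calls netcl.
x = [1.0, 2.0]

# Forward pass, keeping the intermediates the backward pass needs.
u = [xi + 1.0 for xi in x]       # u = x + 1
y = [ui ** 2 for ui in u]        # y = u ** 2
z = [3.0 * yi for yi in y]       # z = 3 * y
loss = sum(z)                    # scalar loss

# Backward pass: seed with ones, then apply local gradients in reverse.
grad_loss = 1.0                                       # seed (loss is scalar here)
grad_z = [grad_loss for _ in z]                       # d(sum)/dz = 1 per element
grad_y = [3.0 * g for g in grad_z]                    # d(3*y)/dy = 3
grad_u = [2.0 * ui * g for ui, g in zip(u, grad_y)]   # d(u**2)/du = 2*u
grad_x = list(grad_u)                                 # d(x+1)/dx = 1

print(loss, grad_x)
```

Each element of grad_x equals 6 * (x + 1), which is the value a tape-based backward() should accumulate on the leaf.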
Node — the per-op record
@dataclass
class Node:
    value: Tensor
    grad_fn: Optional[GradFn]             # Callable[[Tensor], List[Optional[Tensor]]]
    parents: List[Node]
    grad: Optional[Tensor]                # accumulated gradient
    requires_grad: bool
    op_name: Optional[str]                # for debug prints
    creation_trace: Optional[List[str]]
A Node is not the same thing as a
Tensor. The tensor is the numeric value; the node is the
graph record. In particular, leaf tensors that do not need gradients
have no node, and intermediate tensors whose requires_grad=False
have a node only when at least one of their parents requires
gradients (and even then the node's grad_fn is None).
The fields:
- value: the Tensor produced by the forward op.
- grad_fn: a callable that, given the upstream gradient, returns a list of local gradients (one per parent). May be None for terminal nodes.
- parents: the nodes whose values were the inputs.
- grad: the accumulated gradient flowing back into this node. On leaf parameters this is what the Optimizer reads.
- requires_grad: propagated from the parents. Used to decide whether the node participates in backward() at all.
- op_name: a human-readable name (e.g. "matmul", "relu", "conv2d"). Used by detect_anomaly and debug_tape.
- creation_trace: a captured stack trace, populated only inside with ag.detect_anomaly(): because it is expensive.
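To make the field contract concrete, here is a minimal sketch of a node record and one op that fills it in. SketchNode and mul are hypothetical names, floats stand in for Tensors, and unlike the real engine this sketch creates a node even for a leaf that does not need gradients.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Hypothetical stand-in for Node; floats replace Tensors.
# eq=False keeps identity-based equality and hashing.
@dataclass(eq=False)
class SketchNode:
    value: float
    grad_fn: Optional[Callable[[float], List[Optional[float]]]] = None
    parents: List["SketchNode"] = field(default_factory=list)
    grad: Optional[float] = None
    requires_grad: bool = False
    op_name: Optional[str] = None

def mul(a: SketchNode, b: SketchNode) -> SketchNode:
    """Record a * b; grad_fn returns one local gradient per parent."""
    def grad_fn(upstream: float) -> List[Optional[float]]:
        return [
            upstream * b.value if a.requires_grad else None,  # d(a*b)/da = b
            upstream * a.value if b.requires_grad else None,  # d(a*b)/db = a
        ]
    needs_grad = a.requires_grad or b.requires_grad  # propagated from parents
    return SketchNode(
        value=a.value * b.value,
        grad_fn=grad_fn if needs_grad else None,
        parents=[a, b],
        requires_grad=needs_grad,
        op_name="mul",
    )

w = SketchNode(3.0, requires_grad=True, op_name="leaf")
c = SketchNode(4.0, op_name="leaf")          # constant input, no gradient
out = mul(w, c)
local_grads = out.grad_fn(1.0)               # one entry per parent
```

The None entry in local_grads marks a parent that does not take part in backward(), which matches the Optional in the grad_fn signature above.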
Tape — the per-thread recorder
class Tape:
    def __enter__(self): ...                     # sets itself as current_tape
    def __exit__(self, *exc_info): ...           # clears current_tape
    def backward(self, loss: Node) -> None: ...
The Tape is bound to the current thread
through a threading.local() named _tls. This means:
- Two threads can be in different with ag.Tape(): blocks at the same time without interfering.
- In the single-threaded default case, the tape installed with set_current_tape(tape) and read back with get_current_tape() is the implicit tape that apply_op uses when the caller does not pass an explicit tape= argument.
- Nested with ag.Tape(): blocks stack: the inner block becomes the current tape; the outer block resumes when the inner block exits.
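The thread isolation and nesting behaviour described above can be sketched with a thread-local stack. SketchTape is a hypothetical class, and keeping a list on the thread-local object is an assumption about one way to get stacking; the real Tape may store its previous tape differently.

```python
import threading

_tls = threading.local()   # per-thread storage, mirroring the `_tls` name above

class SketchTape:
    """Hypothetical tape that only tracks which tape is current."""
    def __init__(self, name):
        self.name = name

    def __enter__(self):
        stack = getattr(_tls, "stack", None)
        if stack is None:              # first tape on this thread
            stack = _tls.stack = []
        stack.append(self)             # this tape becomes current
        return self

    def __exit__(self, *exc_info):
        _tls.stack.pop()               # the outer tape (if any) resumes
        return False                   # never swallow exceptions

def get_current_tape():
    stack = getattr(_tls, "stack", [])
    return stack[-1] if stack else None

seen = {}

def worker(label):
    with SketchTape(label) as outer:
        with SketchTape(label + "-inner"):
            inner_name = get_current_tape().name
        # After the inner block exits, the outer tape is current again.
        seen[label] = (inner_name, get_current_tape() is outer)

threads = [threading.Thread(target=worker, args=(n,)) for n in ("a", "b")]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because each thread sees its own stack attribute, neither worker observes the other's tapes, and the main thread has no current tape at all.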
Tape.backward(loss) does four things, in order:
- Seed the loss with loss.grad = ones_like(loss.value) (or the grad the caller passed).
- Build a reverse-topological order of the graph rooted at loss using a depth-first post-order traversal (the recursive build_topo in autograd/engine.py).
- For each node, in reverse order, call grad_fn(node.grad) to get local gradients, then accumulate them onto each parent's grad field. Accumulation is done with add_inplace so gradients from multiple consumers add correctly.
- After all gradients are produced, capture the GPU queue into Tape._pending_flush_queue so the next Tape.__enter__ knows the previous step's backward is still in flight. (The to_host() of the loss value is the natural sync point; an explicit queue.finish() is not needed and would cost ~90 ms.)
A small but important detail: when _detect_anomaly is on, the
backward loop also pulls every produced gradient to the host and
checks for NaN/Inf. If it finds one, it raises
RuntimeError("Anomaly detected: …") together with the captured
creation_trace of the offending node.
apply_op — the public entry point
def apply_op(forward, grad_fn, *args, tape=None,
             op_name="op", attrs=None) -> Node:
    ...
apply_op is the single function every
differentiable op in autograd/ops.py calls. Internally:
- If the JIT Compiler is in tracing mode (tracing_context.active is true), apply_op builds a TraceNode instead of executing the op, so the JIT Compiler can later fold the op into a fused kernel.
- Otherwise, if is_grad_enabled() is false, apply_op runs forward and returns the raw Tensor (no node is created). This is the fast path used under with ag.no_grad():.
- Otherwise, apply_op runs forward to produce a Tensor, wraps it in a Node with the supplied grad_fn, parents, and op name, and appends the node to the current Tape.
The op name is normally derived from the caller frame
(sys._getframe(1).f_code.co_name); explicit op_name=... overrides
this. When apply_op is called from inside
the JIT Compiler, it also scrapes a
small set of well-known attribute names (min_val, max_val,
alpha, negative_slope, exponent) from the caller frame and
attaches them to the trace.
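The two non-tracing paths and the caller-frame naming can be sketched as follows. Everything here is a simplified stand-in: SketchNode, the _state dict, and tape_nodes are hypothetical, the tracing branch is omitted, and op_name defaults to None in this sketch so the frame lookup is visible.

```python
import sys

_state = {"grad_enabled": True}   # stand-in for the library's grad-mode flag
tape_nodes = []                   # stand-in for the current Tape's node list

class SketchNode:
    def __init__(self, value, grad_fn=None, parents=(), op_name=None):
        self.value = value
        self.grad_fn = grad_fn
        self.parents = list(parents)
        self.op_name = op_name

def apply_op(forward, grad_fn, *args, op_name=None):
    if op_name is None:
        # Name the op after the function that called apply_op.
        op_name = sys._getframe(1).f_code.co_name
    out = forward(*[a.value for a in args])
    if not _state["grad_enabled"]:
        return out                               # fast path: raw value, no node
    node = SketchNode(out, grad_fn, args, op_name)
    tape_nodes.append(node)                      # record on the "tape"
    return node

def square(x):
    # d(x*x)/dx = 2*x, scaled by the upstream gradient g.
    return apply_op(lambda v: v * v, lambda g: [2.0 * x.value * g], x)

x = SketchNode(3.0, op_name="leaf")
recorded = square(x)              # grad enabled: returns a recorded node
_state["grad_enabled"] = False
raw = square(x)                   # grad disabled: returns the bare value
print(recorded.op_name, recorded.value, raw, len(tape_nodes))
```

Note that sys._getframe is a CPython implementation detail, which is one reason an explicit op_name=... override is useful.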
Depth-first topological sort for backward
The backward method uses a recursive depth-first traversal to
collect nodes in post-order, then iterates the result in reverse:
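visited = set()   # ids of nodes already expanded
topo = []         # post-order: parents always precede their consumers

def build_topo(n: Node):
    if id(n) not in visited:
        visited.add(id(n))
        for p in n.parents:
            build_topo(p)
        topo.append(n)

build_topo(loss)
for node in reversed(topo):
    # Skip nodes that received no gradient or have nothing to propagate.
    if node.grad is None or node.grad_fn is None:
        continue
    grads = node.grad_fn(node.grad)
    # ... accumulate grads into parents ...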
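The id(n)-keyed visited set is important because the same
Node can be reached along several paths in the graph
(e.g. when a tensor is consumed by more than one op, or twice by the
same op). The id() check ensures each node is expanded and placed in
the order exactly once, which makes the traversal correct for DAGs,
not just trees.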
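The following sketch puts seeding, the traversal, and gradient accumulation together on a graph where nodes are reused. SketchNode, add, and mul are hypothetical, floats replace Tensors, and plain addition stands in for add_inplace.

```python
class SketchNode:
    def __init__(self, value, parents=(), grad_fn=None):
        self.value = value
        self.parents = list(parents)
        self.grad_fn = grad_fn
        self.grad = None

def add(a, b):
    return SketchNode(a.value + b.value, [a, b], lambda g: [g, g])

def mul(a, b):
    return SketchNode(a.value * b.value, [a, b],
                      lambda g: [g * b.value, g * a.value])

def backward(loss):
    visited, topo = set(), []

    def build_topo(n):
        if id(n) not in visited:      # expand each node once, even if reused
            visited.add(id(n))
            for p in n.parents:
                build_topo(p)
            topo.append(n)            # post-order: parents before consumers

    build_topo(loss)
    loss.grad = 1.0                   # seed (ones_like for a scalar)
    for node in reversed(topo):       # consumers before their parents
        if node.grad is None or node.grad_fn is None:
            continue
        for parent, g in zip(node.parents, node.grad_fn(node.grad)):
            if g is not None:         # accumulate, never overwrite
                parent.grad = g if parent.grad is None else parent.grad + g
    return topo

x = SketchNode(2.0)
a = add(x, x)        # a = 2x; x is a parent twice
loss = mul(a, a)     # loss = a*a = 4x^2; a is a parent twice
order = backward(loss)
print(x.grad, len(order))
```

Since loss = 4x^2, the gradient at x = 2 is 8x = 16, and the order contains each of the three nodes once even though five parent edges were followed.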
no_grad / set_grad_enabled / is_grad_enabled
with ag.no_grad():              # context manager
    y = model.eval_forward(x)   # no nodes are recorded

ag.set_grad_enabled(False)      # global switch
y = heavy_op(x)                 # no nodes are recorded
ag.set_grad_enabled(True)
- no_grad saves memory (no Tensor snapshots are kept) and is the right choice for evaluation, weight initialization, and weight-norm computations that are part of the forward graph but do not need gradients.
- set_grad_enabled is the imperative form; it returns the previous value implicitly because the global flag _GRAD_MODE is just a boolean.
- is_grad_enabled is the query; the JIT Compiler uses it to decide whether to build traces or just call the forward.
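One practical difference between the two forms is what happens when the guarded code raises. The sketch below is an assumption about how such a trio could be written, not netcl's implementation; the _state dict is a hypothetical stand-in for the _GRAD_MODE flag. It shows a context manager that always restores the previous mode.

```python
from contextlib import contextmanager

_state = {"grad_enabled": True}   # stand-in for the global _GRAD_MODE flag

def is_grad_enabled():
    return _state["grad_enabled"]

def set_grad_enabled(mode):
    _state["grad_enabled"] = bool(mode)

@contextmanager
def no_grad():
    previous = is_grad_enabled()      # remember the mode on entry
    set_grad_enabled(False)
    try:
        yield
    finally:
        set_grad_enabled(previous)    # restore even if the body raised

log = []
with no_grad():
    log.append(is_grad_enabled())     # disabled inside the block
    with no_grad():
        log.append(is_grad_enabled()) # still disabled when nested
    log.append(is_grad_enabled())     # inner exit restores "disabled"
log.append(is_grad_enabled())         # outer exit restores "enabled"

try:
    with no_grad():
        raise ValueError("forward failed")
except ValueError:
    pass
restored_after_error = is_grad_enabled()
```

With the imperative pair, the same guarantee needs an explicit try/finally around the code between the two set_grad_enabled calls.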
detect_anomaly — the numerical sanity check
with ag.detect_anomaly(): (or ag.set_detect_anomaly(True)) turns
on two pieces of machinery:
- Per-node stack capture: every Node.creation_trace is set to traceback.format_stack() at creation time. This is expensive (it walks the Python stack on every op), so it is off by default.
- Per-grad NaN/Inf check: during backward(), every produced gradient is pulled to host with g.to_host() and checked for np.isnan/np.isinf. The first violation raises RuntimeError("Anomaly detected: …") with the op name and the captured creation_trace of the node that produced the bad gradient.
This is the right tool when you suspect an op has the wrong backward — turn it on, run one step, and the stack trace will point exactly at the line in your model that called the broken op.
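The two mechanisms can be sketched together with the standard library alone. check_gradients and make_log_op are hypothetical helpers, lists of floats stand in for host-side gradients, and the text after "Anomaly detected:" is invented for illustration because the real message is elided above.

```python
import math
import traceback

def check_gradients(op_name, grads, creation_trace):
    """Raise if any produced gradient contains NaN or Inf."""
    for g in grads:
        if g is None:                 # parent that takes no gradient
            continue
        if any(math.isnan(v) or math.isinf(v) for v in g):
            trace = "".join(creation_trace or [])
            # Message wording after the prefix is hypothetical.
            raise RuntimeError(
                f"Anomaly detected: non-finite gradient from '{op_name}'\n{trace}"
            )

def make_log_op():
    # Capture the stack at op-creation time, as creation_trace does.
    return traceback.format_stack()

trace = make_log_op()
check_gradients("relu", [[0.0, 1.0], None], trace)   # finite: passes silently

try:
    check_gradients("log", [[1.0, float("inf")]], trace)
except RuntimeError as err:
    message = str(err)
```

The captured trace names the frame that created the op (make_log_op here), which is what lets the error point back at the model code rather than at the backward loop.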
debug_tape — pretty-printer
debug_tape(tape) is a context manager (autograd/debug.py) that
yields the tape for inspection. It does not change behaviour; it
exists so that debuggers and pdb post-mortem sessions can reach
into the tape without having to capture it as a local variable:
with ag.Tape() as tape, ag.debug_tape(tape):
    loss = model(x)
# After the with-block the tape is no longer current (the name `tape`
# stays bound), but a breakpoint set inside the with-block can see it.
The function is a thin @contextmanager wrapper around a yield;
it does no I/O on its own. Printing, graphing, and step-by-step
inspection are left to user code (or a debugger).
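A wrapper of the shape described above fits in a few lines. This is a sketch under the assumption that the context manager yields the same tape object it was given; SketchTape and its nodes list are hypothetical.

```python
from contextlib import contextmanager

@contextmanager
def debug_tape(tape):
    # No I/O and no behaviour change: just hand the tape back.
    yield tape

class SketchTape:
    """Hypothetical tape holding a list of recorded op names."""
    def __init__(self):
        self.nodes = []

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        return False

with SketchTape() as tape, debug_tape(tape) as inspected:
    tape.nodes.append("matmul")            # pretend an op was recorded
    same_object = inspected is tape        # the wrapper yields the tape itself
    snapshot = list(inspected.nodes)       # user code does the inspecting
```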
Operator overloading
Node defines the dunder methods
__add__, __sub__, __mul__, __truediv__, __pow__, __neg__,
__lt__, __le__, __gt__, __ge__ (plus the right-hand variants
__radd__, etc.). Each one forwards to the matching function in
netcl.autograd:
def __add__(self, other):
    from netcl import autograd as ag
    return ag.add(self, other)
The result is that autograd code reads like ordinary arithmetic:
y = (x + 1.0) ** 2
z = 3.0 * y
loss = z.sum()
…with the Tape recording every +, *,
**, and .sum() automatically.
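The forwarding pattern, including the right-hand variants that make 3.0 * y work, can be sketched like this. SketchNode, the module-level add/mul/power functions, and recorded_ops are hypothetical, and scalars are used so .sum() is left out.

```python
recorded_ops = []   # stand-in for what a Tape would record

class SketchNode:
    def __init__(self, value):
        self.value = value

    # Each dunder forwards to a module-level function.
    def __add__(self, other):
        return add(self, other)

    def __radd__(self, other):       # handles  float + node
        return add(other, self)

    def __mul__(self, other):
        return mul(self, other)

    def __rmul__(self, other):       # handles  float * node
        return mul(other, self)

    def __pow__(self, exponent):
        return power(self, exponent)

def _value(v):
    return v.value if isinstance(v, SketchNode) else v

def add(a, b):
    recorded_ops.append("add")
    return SketchNode(_value(a) + _value(b))

def mul(a, b):
    recorded_ops.append("mul")
    return SketchNode(_value(a) * _value(b))

def power(a, exponent):
    recorded_ops.append("pow")
    return SketchNode(_value(a) ** exponent)

x = SketchNode(2.0)
y = (x + 1.0) ** 2
z = 3.0 * y          # float.__mul__ returns NotImplemented, so __rmul__ runs
print(z.value, recorded_ops)
```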
Differentiable wrappers vs. standalone functions
In the actual code (autograd/ops.py), a handful of frequently-used
combinations are exposed as differentiable wrappers rather than as
standalone functions. These are:
- linear_relu(x, w, bias=None): Linear + ReLU, fused.
- conv2d_relu(x, w, bias=None): Conv2d + ReLU, fused.
- batch_norm2d_relu(x, gamma, beta, ...): BatchNorm2d + ReLU, fused.
- dropout(x, p, seed=None): applies the dropout mask, records it for backward, and scales by 1/(1-p).
Each one is implemented as a single apply_op(forward, grad_fn, ...)
call, so they participate in the Tape like any
other op. They are not in the JIT Compiler's
fusion set; they are just the nn.functional counterpart with the
post-activation baked in.
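As an illustration of what such a wrapper hands to apply_op, the sketch below builds a forward result and a matching grad_fn for dropout. dropout_forward is a hypothetical helper, Python lists replace Tensors, and the standard random module stands in for whatever generator the library uses; it assumes 0 <= p < 1.

```python
import random

def dropout_forward(x, p, seed=None):
    """Return (output, grad_fn, mask) for inverted dropout on a list."""
    rng = random.Random(seed)
    scale = 1.0 / (1.0 - p)           # keeps the expected activation unchanged
    # The mask already contains the scale: 0.0 for dropped, 1/(1-p) for kept.
    mask = [0.0 if rng.random() < p else scale for _ in x]
    out = [xi * m for xi, m in zip(x, mask)]

    def grad_fn(upstream):
        # Backward reuses the recorded mask; one gradient per parent (just x).
        return [[g * m for g, m in zip(upstream, mask)]]

    return out, grad_fn, mask

x = [1.0, 2.0, 3.0, 4.0]
out, grad_fn, mask = dropout_forward(x, p=0.5, seed=0)
(grad_x,) = grad_fn([1.0] * len(x))
```

Recording the mask in the closure is what guarantees that backward zeroes exactly the elements that forward dropped.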
Memory notes
- Every recorded Node holds its output Tensor. For a 100-layer model this can dominate VRAM. In eval/calibration phases, wrap the call site in with ag.no_grad(): to drop the snapshot.
- creation_trace is allocated on every op inside detect_anomaly(). Use it only when debugging.
- Tape._pending_flush_queue holds a single queue reference; it does not copy tensors and is safe to leave populated.
See also
- Tutorial: Understanding Autograd, a step-by-step walk-through of a one-hidden-layer net.
- Architecture: JIT Compiler, the layer that uses apply_op's tracing mode to fuse elementwise op chains.
- autograd API: the full symbol list.
- Tensor API: the value type that Node wraps.
- core API: Tensor.from_host, to_host, and the ones_like helper that backward uses to seed the loss gradient.