
Architecture: Autograd & Tape

netcl's autograd API is a classical reverse-mode automatic-differentiation engine built around a per-thread Tape. When you wrap a block of code in with ag.Tape():, every differentiable op inside the block creates a Node that records (a) its output Tensor, (b) a grad_fn that knows how to compute local gradients, and (c) its parents in the computation graph. Calling tape.backward(loss) then walks the graph in reverse topological order and accumulates gradients back into the leaf nodes.

The forward graph and the implicit backward DAG are illustrated below in the canonical y = (x+1)**2 ; z = 3*y ; loss = z.sum() example.

Caption: solid arrows are forward edges (parents); dashed arrows are the reverse edges traversed during backward(). The seed gradient at loss is a tensor of ones with the same shape as loss.value; the chain rule combines it with each grad_fn to produce the parent gradients.
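As a hand check of the canonical example, the chain rule gives d(loss)/dx = 3 · 2(x + 1) = 6(x + 1) elementwise. The sketch below works this out with plain Python lists and does not use netcl; the input values are arbitrary and chosen only for illustration.

```python
# Hand-computed forward and backward pass for
#   y = (x + 1) ** 2 ; z = 3 * y ; loss = z.sum()
# using plain Python floats (no netcl involved).
x = [1.0, 2.0, 3.0]                       # illustrative input

# Forward pass.
y = [(xi + 1.0) ** 2 for xi in x]
z = [3.0 * yi for yi in y]
loss = sum(z)

# Backward pass, visiting the ops in reverse order.
grad_loss = 1.0                           # seed gradient (loss is a scalar here)
grad_z = [grad_loss for _ in z]           # d(sum)/dz_i = 1
grad_y = [3.0 * g for g in grad_z]        # dz/dy = 3
grad_x = [g * 2.0 * (xi + 1.0)            # dy/dx = 2 * (x + 1)
          for g, xi in zip(grad_y, x)]

print(loss)
print(grad_x)
```

Each backward line corresponds to one dashed edge in the figure: the gradient of a node is the upstream gradient multiplied by the local derivative of the op that consumed it.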

Node — the per-op record

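from __future__ import annotations   # lets Node refer to itself in annotations

from dataclasses import dataclass
from typing import List, Optional

# Tensor and GradFn are netcl types; their imports are omitted here.

@dataclass
class Node:
    value:          Tensor
    grad_fn:        Optional[GradFn]   # Callable[[Tensor], List[Optional[Tensor]]]
    parents:        List[Node]
    grad:           Optional[Tensor]   # accumulated gradient
    requires_grad:  bool
    op_name:        Optional[str]      # for debug prints
    creation_trace: Optional[List[str]]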

A Node is not the same thing as a Tensor. The tensor is the numeric value; the node is the graph record. In particular, leaf tensors that do not need gradients have no node, and intermediate tensors whose requires_grad=False have a node only when at least one of their parents requires gradients (and even then the node's grad_fn is None).

The fields:

  • value — the Tensor produced by the forward op.
  • grad_fn — a callable that, given the upstream gradient, returns a list of local gradients (one per parent). May be None for terminal nodes.
  • parents — the nodes whose values were the inputs.
  • grad — the accumulated gradient flowing back into this node. On leaf parameters this is what the Optimizer reads.
  • requires_grad — propagated from the parents. Used to decide whether the node participates in backward() at all.
  • op_name — a human-readable name (e.g. "matmul", "relu", "conv2d"). Used by detect_anomaly and debug_tape.
  • creation_trace — a captured stack trace, populated only inside with ag.detect_anomaly(): because it is expensive.
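To make the record concrete, here is a minimal sketch of how a multiply op might fill in these fields. ToyNode and toy_mul are hypothetical stand-ins that use Python floats instead of netcl Tensors; they illustrate the shape of the record and are not netcl's implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class ToyNode:
    """Hypothetical float-valued stand-in for netcl's Node."""
    value: float
    grad_fn: Optional[Callable[[float], List[Optional[float]]]] = None
    parents: List["ToyNode"] = field(default_factory=list)
    grad: Optional[float] = None
    requires_grad: bool = False
    op_name: Optional[str] = None


def toy_mul(a: ToyNode, b: ToyNode) -> ToyNode:
    # requires_grad is propagated from the parents.
    needs_grad = a.requires_grad or b.requires_grad

    def grad_fn(upstream: float) -> List[Optional[float]]:
        # One local gradient per parent; None for parents that need none.
        return [
            upstream * b.value if a.requires_grad else None,
            upstream * a.value if b.requires_grad else None,
        ]

    return ToyNode(
        value=a.value * b.value,
        grad_fn=grad_fn if needs_grad else None,
        parents=[a, b],
        requires_grad=needs_grad,
        op_name="mul",
    )


w = ToyNode(2.0, requires_grad=True)   # leaf parameter
c = ToyNode(5.0)                       # constant, no gradient needed
out = toy_mul(w, c)
local_grads = out.grad_fn(1.0)         # upstream gradient of 1.0
print(out.value, local_grads)
```

The list returned by grad_fn lines up with parents by position, which is what lets the backward loop pair each local gradient with the node it belongs to.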

Tape — the per-thread recorder

class Tape:
    def __enter__(self):  ...   # sets itself as current_tape
    def __exit__(self, *exc_info): ...  # clears current_tape
    def backward(self, loss: Node) -> None: ...

The Tape is bound to the current thread through a threading.local() named _tls. This means:

  • Two threads can be in different with ag.Tape(): blocks at the same time without interfering.
  • set_current_tape(tape) and get_current_tape() set and read the current thread's tape. apply_op records onto that implicit tape when the caller does not pass an explicit tape= argument.
  • Nested with ag.Tape(): blocks stack: the inner block becomes the current tape; the outer block resumes when the inner block exits.
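The three properties above can be reproduced with a few lines of standard-library Python. The sketch below is an assumption about the mechanism, not netcl's code: ToyTape and the per-thread list `tapes` are hypothetical, and netcl may track the outer tape differently.

```python
import threading

_tls = threading.local()   # one independent namespace per thread


def _stack():
    # Lazily create this thread's stack of active tapes.
    if not hasattr(_tls, "tapes"):
        _tls.tapes = []
    return _tls.tapes


def get_current_tape():
    stack = _stack()
    return stack[-1] if stack else None


class ToyTape:
    """Hypothetical tape that only tracks 'who is current'."""

    def __init__(self, name):
        self.name = name

    def __enter__(self):
        _stack().append(self)      # becomes the current tape
        return self

    def __exit__(self, *exc_info):
        _stack().pop()             # the enclosing tape (if any) resumes
        return False


seen = {}


def worker():
    # A new thread starts without a current tape, whatever the main thread does.
    seen["worker_before"] = get_current_tape()
    with ToyTape("worker"):
        seen["worker_inside"] = get_current_tape().name


with ToyTape("outer"):
    with ToyTape("inner"):
        seen["nested"] = get_current_tape().name
    seen["after_inner"] = get_current_tape().name
    t = threading.Thread(target=worker)
    t.start()
    t.join()
    seen["main_after_thread"] = get_current_tape().name
seen["outside"] = get_current_tape()
print(seen)
```

Because the stack lives in threading.local(), the worker thread's tape never shows up in the main thread's stack, and vice versa.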

Tape.backward(loss) does four things, in order:

  1. Seed the loss with loss.grad = ones_like(loss.value) (or the grad the caller passed).
  2. Build a topological order (parents before consumers) of the graph rooted at loss using a depth-first post-order traversal (recursive build_topo in autograd/engine.py).
  3. For each node, in reverse order, call grad_fn(node.grad) to get local gradients, then accumulate them onto each parent's grad field. Accumulation is done with add_inplace so gradients from multiple consumers add correctly.
  4. After all gradients are produced, capture the GPU queue into Tape._pending_flush_queue so the next Tape.__enter__ knows the previous step's backward is still in flight. (The to_host() of the loss value is the natural sync point; an explicit queue.finish() is not needed and would cost ~90 ms.)

A small but important detail: when _detect_anomaly is on, the backward loop also pulls every produced gradient to the host and checks for NaN/Inf. If it finds one, it raises RuntimeError("Anomaly detected: …") together with the captured creation_trace of the offending node.
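Steps 1 to 3 can be sketched end to end on scalar values. The class N and the helpers below are hypothetical and use Python floats in place of Tensors; step 4 (the queue capture) is GPU-specific and is left out.

```python
class N:
    """Hypothetical scalar node: value, parents, grad_fn, grad."""

    def __init__(self, value, parents=(), grad_fn=None):
        self.value = value
        self.parents = list(parents)
        self.grad_fn = grad_fn
        self.grad = None


def add(a, b):
    return N(a.value + b.value, (a, b), lambda g: [g, g])


def mul(a, b):
    return N(a.value * b.value, (a, b), lambda g: [g * b.value, g * a.value])


def backward(loss, seed=1.0):
    loss.grad = seed                              # step 1: seed the loss

    topo, visited = [], set()

    def build_topo(n):                            # step 2: DFS post-order
        if id(n) not in visited:
            visited.add(id(n))
            for p in n.parents:
                build_topo(p)
            topo.append(n)

    build_topo(loss)

    for node in reversed(topo):                   # step 3: consumers first
        if node.grad is None or node.grad_fn is None:
            continue
        for parent, g in zip(node.parents, node.grad_fn(node.grad)):
            if g is None:
                continue
            # Accumulate, so several consumers of one node add up.
            parent.grad = g if parent.grad is None else parent.grad + g


a, b = N(3.0), N(4.0)
loss = add(mul(a, b), a)      # loss = a*b + a, so a has two consumers
backward(loss)
print(a.grad, b.grad)         # d/da = b + 1, d/db = a
```

Here a receives 4.0 from the multiply and 1.0 from the add; without accumulation the second contribution would overwrite the first.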

apply_op — the public entry point

def apply_op(forward, grad_fn, *args, tape=None,
             op_name="op", attrs=None) -> Node:
    ...

apply_op is the single function every differentiable op in autograd/ops.py calls. Internally:

  1. If the JIT Compiler is in tracing mode (tracing_context.active is true), apply_op builds a TraceNode instead of executing the op, so the JIT Compiler can later fold the op into a fused kernel.
  2. Otherwise, if is_grad_enabled() is false, apply_op runs forward and returns the raw Tensor (no node is created). This is the fast path used under with ag.no_grad():.
  3. Otherwise, apply_op runs forward to produce a Tensor, wraps it in a Node with the supplied grad_fn, parents, and op name, and appends the node to the current Tape.
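The three branches can be sketched as a plain dispatch function. Everything below is a simplified assumption for illustration: ToyNode, ToyTraceNode, the `state` dictionary and the module-level `tape` list are hypothetical stand-ins for tracing_context.active, is_grad_enabled() and the current Tape.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Tuple


@dataclass
class ToyNode:
    value: Any
    grad_fn: Optional[Callable]
    parents: List["ToyNode"]
    op_name: str


@dataclass
class ToyTraceNode:
    op_name: str
    args: Tuple[Any, ...]


state = {"tracing": False, "grad_enabled": True}   # hypothetical global switches
tape: List[ToyNode] = []                           # hypothetical current tape


def toy_apply_op(forward, grad_fn, *args, op_name="op"):
    # 1. Tracing: record the op symbolically, do not execute it.
    if state["tracing"]:
        return ToyTraceNode(op_name, args)
    raw = [a.value if isinstance(a, ToyNode) else a for a in args]
    # 2. Gradients disabled: run forward and return the raw value.
    if not state["grad_enabled"]:
        return forward(*raw)
    # 3. Normal path: run forward, wrap in a node, append to the tape.
    parents = [a for a in args if isinstance(a, ToyNode)]
    node = ToyNode(forward(*raw), grad_fn, parents, op_name)
    tape.append(node)
    return node


def double(v):
    return 2.0 * v


def double_grad(upstream):
    return [2.0 * upstream]


x = ToyNode(3.0, None, [], "leaf")

recorded = toy_apply_op(double, double_grad, x, op_name="double")

state["grad_enabled"] = False
raw_result = toy_apply_op(double, double_grad, x, op_name="double")
state["grad_enabled"] = True

state["tracing"] = True
traced = toy_apply_op(double, double_grad, x, op_name="double")
state["tracing"] = False

print(type(recorded).__name__, raw_result, type(traced).__name__, len(tape))
```

Only the third branch touches the tape, which is why the no_grad path is cheap: it neither allocates a node nor keeps a reference to the output.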

The op name is normally derived from the caller frame (sys._getframe(1).f_code.co_name); explicit op_name=... overrides this. When apply_op is called from inside the JIT Compiler, it also scrapes a small set of well-known attribute names (min_val, max_val, alpha, negative_slope, exponent) from the caller frame and attaches them to the trace.
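The caller-frame lookup is a standard CPython idiom and can be tried in isolation. In this sketch toy_apply_op is a hypothetical stand-in, and using None as the "no explicit name" sentinel is an assumption (the real signature defaults to op_name="op").

```python
import sys


def toy_apply_op(op_name=None):
    # Hypothetical stand-in: only the naming logic is shown.
    if op_name is None:
        # Frame 0 is toy_apply_op itself; frame 1 is whoever called it.
        op_name = sys._getframe(1).f_code.co_name
    return op_name


def relu():
    # No explicit name: the name is taken from this function.
    return toy_apply_op()


def my_wrapper():
    # An explicit op_name overrides the caller-derived one.
    return toy_apply_op(op_name="leaky_relu")


derived = relu()
explicit = my_wrapper()
print(derived, explicit)
```

sys._getframe is a CPython implementation detail, so the derived name depends on the op being called directly from the function whose name should appear; an extra wrapper layer changes the result.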

Topological sort for backward

The backward method uses a recursive depth-first traversal to collect nodes in post-order, then iterates the result in reverse:

visited = set()   # ids of nodes already expanded
topo = []         # nodes in post-order: parents before consumers

def build_topo(n: Node):
    if id(n) not in visited:
        visited.add(id(n))
        for p in n.parents:
            build_topo(p)
        topo.append(n)

build_topo(loss)
for node in reversed(topo):   # consumers first, leaves last
    if node.grad is None or node.grad_fn is None:
        continue
    grads = node.grad_fn(node.grad)
    # ... accumulate grads into parents ...

The id(n)-keyed visited set is important because the same Node can be reached along more than one path (e.g. when a tensor feeds several ops). The id() check ensures each node is expanded exactly once, which makes the traversal correct for DAGs, not just trees.
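The effect of the visited set is easiest to see on a diamond-shaped graph, where one node feeds two consumers. The sketch below uses a hypothetical name-only node class and compares the traversal with and without the check.

```python
class N:
    """Hypothetical node that carries only a name and its parents."""

    def __init__(self, name, parents=()):
        self.name = name
        self.parents = list(parents)


def topo_with_visited(root):
    topo, visited = [], set()

    def build(n):
        if id(n) not in visited:
            visited.add(id(n))
            for p in n.parents:
                build(p)
            topo.append(n)

    build(root)
    return [n.name for n in topo]


def topo_without_visited(root):
    topo = []

    def build(n):
        for p in n.parents:
            build(p)
        topo.append(n)

    build(root)
    return [n.name for n in topo]


# Diamond: x feeds both `left` and `right`, which both feed `out`.
x = N("x")
left = N("left", [x])
right = N("right", [x])
out = N("out", [left, right])

with_visited = topo_with_visited(out)
without_visited = topo_without_visited(out)
print(with_visited)
print(without_visited)
```

Without the check, x appears twice in the order, so a backward loop over it would process x before all of its consumers have contributed and would run its grad_fn more than once. On deep graphs with heavy sharing the unchecked traversal also grows exponentially.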

no_grad / set_grad_enabled / is_grad_enabled

with ag.no_grad():           # context manager
    y = model.eval_forward(x)  # no nodes are recorded

ag.set_grad_enabled(False)   # global switch
y = heavy_op(x)              # no nodes are recorded
ag.set_grad_enabled(True)

  • no_grad saves memory (no Tensor snapshots are kept) and is the right choice for evaluation, weight initialization, and weight-norm computations that are part of the forward graph but do not need gradients.
  • set_grad_enabled is the imperative form; it returns the previous value implicitly because the global flag _GRAD_MODE is just a boolean.
  • is_grad_enabled is the query; the JIT Compiler uses it to decide whether to build traces or just call the forward.
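One common way to build a no_grad context manager on top of a boolean flag is to save the previous mode and restore it in a finally block. The sketch below shows that pattern with hypothetical stand-ins; it keeps the flag in a small dictionary (netcl uses the _GRAD_MODE boolean) and is not netcl's implementation.

```python
from contextlib import contextmanager

_grad_state = {"enabled": True}   # hypothetical stand-in for _GRAD_MODE


def is_grad_enabled():
    return _grad_state["enabled"]


def set_grad_enabled(mode):
    _grad_state["enabled"] = bool(mode)


@contextmanager
def no_grad():
    previous = is_grad_enabled()      # remember the mode on entry
    set_grad_enabled(False)
    try:
        yield
    finally:
        set_grad_enabled(previous)    # restore even if the body raises


states = []
with no_grad():
    states.append(is_grad_enabled())      # False
    with no_grad():
        states.append(is_grad_enabled())  # False
    states.append(is_grad_enabled())      # still False: outer block is active
states.append(is_grad_enabled())          # True again

try:
    with no_grad():
        raise ValueError("boom")
except ValueError:
    pass
states.append(is_grad_enabled())          # restored despite the exception
print(states)
```

Restoring the saved value, rather than unconditionally setting True, is what keeps nested blocks and exceptions from leaving the flag in the wrong state. The same concern applies to the imperative form: code that toggles the flag by hand should put the reset in a finally clause.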

detect_anomaly — the numerical sanity check

with ag.detect_anomaly(): (or ag.set_detect_anomaly(True)) turns on two pieces of machinery:

  1. Per-node stack capture — every Node.creation_trace is set to traceback.format_stack() at creation time. This is expensive (it walks the Python stack on every op), so it is off by default.
  2. Per-grad NaN/Inf check — during backward(), every produced gradient is pulled to host with g.to_host() and checked for np.isnan / np.isinf. The first violation raises RuntimeError("Anomaly detected: …") with the op name and the captured creation_trace of the node that produced the bad gradient.

This is the right tool when you suspect an op has the wrong backward — turn it on, run one step, and the stack trace will point exactly at the line in your model that called the broken op.
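The two pieces fit together roughly as follows. This is a sketch under stated assumptions: ToyNode and check_gradient are hypothetical, gradients are lists of Python floats instead of host arrays, and the text after the "Anomaly detected:" prefix is illustrative, not netcl's actual message.

```python
import math
import traceback


class ToyNode:
    """Hypothetical node that optionally captures where it was created."""

    def __init__(self, op_name, capture_trace):
        self.op_name = op_name
        # Walking the stack on every op is the expensive part.
        self.creation_trace = traceback.format_stack() if capture_trace else None


def check_gradient(node, grad_values):
    # Raise on the first NaN or Inf produced by this node's backward.
    for g in grad_values:
        if math.isnan(g) or math.isinf(g):
            trace = "".join(node.creation_trace or ["<no trace captured>\n"])
            raise RuntimeError(
                f"Anomaly detected: non-finite gradient from op "
                f"'{node.op_name}', created at:\n{trace}"
            )


def build_model():
    # The captured trace will contain this function's frame.
    return ToyNode("log", capture_trace=True)


node = build_model()
check_gradient(node, [0.5, 1.0])            # finite: passes silently

try:
    check_gradient(node, [0.5, float("inf")])
    message = None
except RuntimeError as err:
    message = str(err)

print(message.splitlines()[0])
```

The trace is captured at creation time, in the forward pass, but only read when a bad gradient shows up in the backward pass. That is why the error can point at model code even though the failure happens inside backward().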

debug_tape — pretty-printer

debug_tape(tape) is a context manager (autograd/debug.py) that yields the tape for inspection. It does not change behaviour; it exists so that debuggers and pdb post-mortem sessions can reach into the tape without having to capture it as a local variable:

with ag.Tape() as tape, ag.debug_tape(tape):
    loss = model(x)
# `tape` stays bound after the with-block (Python does not scope `as` targets),
# and a breakpoint set inside the with-block can see it as well.

The function is a thin @contextmanager wrapper around a yield; it does no I/O on its own. Printing, graphing, and step-by-step inspection are left to user code (or a debugger).
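A pass-through context manager of this kind takes only a few lines with contextlib. The sketch below uses hypothetical stand-ins (toy_debug_tape, ToyTape) to show that the wrapper hands back the same object and changes nothing.

```python
from contextlib import contextmanager


@contextmanager
def toy_debug_tape(tape):
    # No setup, no teardown, no I/O: just expose the tape.
    yield tape


class ToyTape:
    """Hypothetical tape that stores op names in a list."""

    def __init__(self):
        self.nodes = []

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        return False


with ToyTape() as tape, toy_debug_tape(tape) as inspected:
    tape.nodes.append("matmul")          # stands in for a recorded op
    same_object = inspected is tape
    recorded = list(inspected.nodes)

print(same_object, recorded)
```

Because the wrapper adds a frame with the tape as a local, a debugger stopped anywhere below that frame can walk up the stack and find it.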

Operator overloading

Node defines the dunder methods __add__, __sub__, __mul__, __truediv__, __pow__, __neg__, __lt__, __le__, __gt__, __ge__ (plus the right-hand variants __radd__, etc.). Each one forwards to the matching function in netcl.autograd:

def __add__(self, other):
    from netcl import autograd as ag
    return ag.add(self, other)

The result is that autograd code reads like ordinary arithmetic:

y = (x + 1.0) ** 2
z = 3.0 * y
loss = z.sum()

…with the Tape recording every +, *, **, and .sum() automatically.
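The forwarding pattern, including the right-hand variants needed for expressions such as 3.0 * y, can be sketched with a hypothetical float-valued node. The toy_add, toy_mul and toy_pow functions below stand in for the ops in netcl.autograd and simply log their names instead of recording onto a tape.

```python
recorded_ops = []   # stands in for the tape


class ToyNode:
    def __init__(self, value):
        self.value = value

    @staticmethod
    def _lift(other):
        # Wrap plain numbers so the op functions always see nodes.
        return other if isinstance(other, ToyNode) else ToyNode(other)

    def __add__(self, other):
        return toy_add(self, ToyNode._lift(other))

    def __radd__(self, other):          # handles: number + node
        return toy_add(ToyNode._lift(other), self)

    def __mul__(self, other):
        return toy_mul(self, ToyNode._lift(other))

    def __rmul__(self, other):          # handles: number * node
        return toy_mul(ToyNode._lift(other), self)

    def __pow__(self, exponent):
        return toy_pow(self, exponent)


def toy_add(a, b):
    recorded_ops.append("add")
    return ToyNode(a.value + b.value)


def toy_mul(a, b):
    recorded_ops.append("mul")
    return ToyNode(a.value * b.value)


def toy_pow(a, exponent):
    recorded_ops.append("pow")
    return ToyNode(a.value ** exponent)


x = ToyNode(2.0)
y = (x + 1.0) ** 2
z = 3.0 * y                 # float.__mul__ fails, so ToyNode.__rmul__ runs
print(y.value, z.value, recorded_ops)
```

Without __rmul__, the expression 3.0 * y would raise TypeError, because float does not know how to multiply by a node.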

Differentiable wrappers vs. standalone functions

In the actual code (autograd/ops.py), a handful of frequently-used combinations are exposed as differentiable wrappers rather than as standalone functions. These are:

  • linear_relu(x, w, bias=None): Linear + ReLU, fused.
  • conv2d_relu(x, w, bias=None): Conv2d + ReLU, fused.
  • batch_norm2d_relu(x, gamma, beta, ...): BatchNorm2d + ReLU, fused.
  • dropout(x, p, seed=None): applies the dropout mask, records it for backward, and scales by 1/(1-p).

Each one is implemented as a single apply_op(forward, grad_fn, ...) call, so they participate in the Tape like any other op. They are not in the JIT Compiler's fusion set; they are just the nn.functional counterpart with the post-activation baked in.
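As an illustration of what such a forward/grad_fn pair looks like, here is a sketch of inverted dropout on a list of floats. toy_dropout is hypothetical and uses Python's random module; netcl's random number source and tensor handling will differ. It assumes 0 <= p < 1.

```python
import random


def toy_dropout(x, p, seed=None):
    """Hypothetical inverted dropout on a list of floats (0 <= p < 1)."""
    rng = random.Random(seed)
    scale = 1.0 / (1.0 - p)
    # Each entry is either dropped (0) or kept and scaled by 1/(1-p).
    mask = [0.0 if rng.random() < p else scale for _ in x]
    out = [xi * m for xi, m in zip(x, mask)]

    def grad_fn(upstream):
        # Backward reuses the recorded mask: dropped units get no gradient.
        return [g * m for g, m in zip(upstream, mask)]

    return out, grad_fn


x = [1.0, 2.0, 3.0, 4.0]
out, grad_fn = toy_dropout(x, p=0.5, seed=0)
grads = grad_fn([1.0] * len(x))

out_again, _ = toy_dropout(x, p=0.5, seed=0)   # same seed, same mask
print(out, grads)
```

Recording the mask in the closure is the essential part: the backward pass must zero exactly the positions that the forward pass zeroed, so the mask cannot be re-sampled.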

Memory notes

  • Every recorded Node holds its output Tensor. For a 100-layer model this can dominate VRAM. In eval/calibration phases, wrap the call site in with ag.no_grad(): to drop the snapshot.
  • creation_trace is allocated on every op inside detect_anomaly(). Use it only when debugging.
  • Tape._pending_flush_queue holds a single queue reference; it does not copy tensors and is safe to leave populated.

See also