Tape

Status: Public API in netcl.autograd.engine.Tape

The Tape is netcl's autograd graph. It is a doubly-linked DAG of Node objects, one per intermediate tensor, that records what op produced the node, what the inputs were, and what the backward function is. Walking the tape from the loss back to the inputs is how netcl computes gradients.

Like PyTorch's autograd, the netcl Tape is built eagerly: nodes are added as ops are executed (unless a no_grad() context is active). Unlike PyTorch, which frees the graph after backward by default, netcl leaves the reset between training steps to the user: it does not auto-detach the loss tensor or guess when a step has ended. This explicit reset keeps the API predictable: what is on the tape is exactly the set of nodes recorded since the last reset().

Overview

A Node on the tape carries:

value — the Tensor produced by the op.
grad_fn — a callable that, given the upstream gradient, produces the gradient w.r.t. each input.
inputs — the list of Nodes the op consumed.
requires_grad — propagated from the inputs; if any input had it True, the output does too.
name — a debug-friendly identifier.
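
The fields above map naturally onto a small record type. The following is a minimal sketch, not netcl's actual Node class: it uses a plain float in place of a Tensor, and the grad field and the __post_init__ hook are assumptions added for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence


@dataclass(eq=False)  # identity-based equality, so nodes stay hashable
class Node:
    """Illustrative stand-in for a tape node; not netcl's real class."""
    value: float                                                  # stands in for the Tensor
    grad_fn: Optional[Callable[[float], Sequence[float]]] = None  # upstream grad -> per-input grads
    inputs: List["Node"] = field(default_factory=list)            # nodes the op consumed
    requires_grad: bool = False
    name: str = ""
    grad: Optional[float] = None                                  # assumed slot filled by backward

    def __post_init__(self) -> None:
        # Rule from the text: an output requires grad if any input does.
        if self.inputs:
            self.requires_grad = any(n.requires_grad for n in self.inputs)


x = Node(value=2.0, requires_grad=True, name="x")
c = Node(value=3.0, name="c")  # a constant

# y = x * c, so dy/dx = c and dy/dc = x
y = Node(
    value=x.value * c.value,
    grad_fn=lambda g: (g * c.value, g * x.value),
    inputs=[x, c],
    name="mul",
)
```

Here y.requires_grad ends up True because x requires a gradient, even though c does not.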

When you call backward(loss), the engine walks the graph from loss in reverse topological order and, for each Node, calls grad_fn with the upstream gradient. Each returned gradient is either stored on the input node's grad field (so the user can read it) or, if the input is itself the output of another op, propagated further back toward the graph's inputs.

The tape is thread-local. There is exactly one active Tape per thread, accessed via get_current_tape(). This is how the engine knows where to record a new op without the user passing a tape argument everywhere.
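
One common way to get a per-thread singleton like this in Python is threading.local. The sketch below assumes that approach; the Tape class here is a hypothetical stand-in with only a node list and reset(), and the lazy creation inside get_current_tape() is an assumption rather than netcl's documented behaviour.

```python
import threading


class Tape:
    """Hypothetical minimal tape: an ordered list of recorded entries."""

    def __init__(self) -> None:
        self.nodes = []

    def reset(self) -> None:
        self.nodes.clear()


_state = threading.local()  # each thread sees its own attributes on this object


def get_current_tape() -> Tape:
    # Lazily create one Tape per thread on first access.
    tape = getattr(_state, "tape", None)
    if tape is None:
        tape = _state.tape = Tape()
    return tape


def worker(results: dict, key: str) -> None:
    # Each worker records onto its own thread's tape.
    get_current_tape().nodes.append(key)
    results[key] = list(get_current_tape().nodes)


results: dict = {}
threads = [threading.Thread(target=worker, args=(results, k)) for k in ("a", "b")]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Each worker sees only its own entry, and the main thread's tape stays empty, which is the isolation the paragraph above describes.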

Where It Lives

File path: autograd/engine.py.
Module path: netcl.autograd.engine.
Public re-export: from netcl.autograd import Tape.
Siblings: autograd.graph (the lower-level DAG helpers) and autograd.training_compiler (the pattern-based fusion of detection losses).

How It Works

Recording is driven by apply_op. When a user calls ag.add(x, y), apply_op is invoked with the op name, the input nodes, and the closure that produces the output tensor. If any input node has requires_grad=True and grad mode is enabled (it is by default), apply_op:

Allocates the output Tensor (via the standard factory).
Constructs a new Node whose grad_fn is a closure that, given the upstream gradient, returns the per-input gradients by calling the op's backward function.
Inserts the node into the current Tape.
Returns the output tensor.
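
The four steps above can be sketched in plain Python. This is an illustration under stated assumptions, not netcl's apply_op: the signature (separate forward and backward callables), the TAPE list, and the GRAD_ENABLED flag are all hypothetical, values are floats rather than Tensors, and the function returns the Node itself instead of a tensor to keep the example short.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence


@dataclass(eq=False)
class Node:
    value: float
    grad_fn: Optional[Callable[[float], Sequence[float]]] = None
    inputs: List["Node"] = field(default_factory=list)
    requires_grad: bool = False
    name: str = ""


TAPE: List[Node] = []  # stand-in for the current thread's tape
GRAD_ENABLED = True    # stand-in for the global grad mode


def apply_op(name, inputs, forward, backward) -> Node:
    """forward(*values) -> value; backward(upstream, *values) -> per-input grads."""
    values = [n.value for n in inputs]
    out_value = forward(*values)                  # step 1: produce the output
    if not (GRAD_ENABLED and any(n.requires_grad for n in inputs)):
        return Node(out_value, name=name)         # constant: nothing is recorded
    node = Node(
        out_value,
        grad_fn=lambda g: backward(g, *values),   # step 2: closure over the op's backward
        inputs=list(inputs),
        requires_grad=True,
        name=name,
    )
    TAPE.append(node)                             # step 3: insert into the tape
    return node                                   # step 4: hand the result back


def mul(a: Node, b: Node) -> Node:
    return apply_op("mul", [a, b],
                    lambda x, y: x * y,
                    lambda g, x, y: (g * y, g * x))


x = Node(2.0, requires_grad=True, name="x")
w = Node(5.0, requires_grad=True, name="w")
c = Node(3.0, name="c")

y = mul(x, w)  # recorded: both inputs require grad
k = mul(c, c)  # not recorded: no input requires grad
```

Only y lands on the tape; k is computed but carries no grad_fn, so a later backward pass would treat it as a constant.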

When the user calls backward(loss_tensor):

The engine finds the Node whose value is loss_tensor (an O(N) scan in the worst case; usually a hash hit if the loss tensor was registered when it was created).
It walks the graph from that node in reverse topological order, using a depth-first traversal to establish the order.
For each node, the engine calls grad_fn(upstream_grad) and either stores the result on the input's grad (when the input is a leaf) or propagates it further back toward the graph's inputs.
The walk stops at requires_grad=False nodes — those are treated as constants and their gradients are skipped.

The walk is implemented in autograd.engine._walk_backward. It is not symbolic: it actually launches the backward kernels (or, for a @jit_compiled function, the fused backward kernel that the JIT generated alongside the forward).
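
To make the walk concrete, here is a minimal scalar sketch of a reverse topological backward pass. It is not the code in _walk_backward: the Node class, the add and mul helpers, and the pending dictionary used to accumulate gradients are assumptions for illustration, and the recursive ordering step would need an iterative rewrite for very deep graphs.

```python
class Node:
    def __init__(self, value, inputs=(), grad_fn=None, requires_grad=False, name=""):
        self.value = value
        self.inputs = list(inputs)
        self.grad_fn = grad_fn
        # An output requires grad if any of its inputs does.
        self.requires_grad = requires_grad or any(n.requires_grad for n in self.inputs)
        self.name = name
        self.grad = None


def mul(a, b):
    return Node(a.value * b.value, (a, b), lambda g: (g * b.value, g * a.value), name="mul")


def add(a, b):
    return Node(a.value + b.value, (a, b), lambda g: (g, g), name="add")


def backward(loss):
    # Post-order DFS yields a topological order; reversing it visits each
    # node only after every consumer has contributed its gradient.
    order, seen = [], set()

    def visit(node):
        if id(node) in seen or not node.requires_grad:
            return  # constants are skipped entirely
        seen.add(id(node))
        for inp in node.inputs:
            visit(inp)
        order.append(node)

    visit(loss)
    pending = {id(loss): 1.0}  # seed with d(loss)/d(loss)
    for node in reversed(order):
        upstream = pending.pop(id(node))
        if node.grad_fn is None:  # leaf: expose the gradient to the user
            node.grad = upstream
            continue
        for inp, g in zip(node.inputs, node.grad_fn(upstream)):
            if inp.requires_grad:
                pending[id(inp)] = pending.get(id(inp), 0.0) + g


x = Node(3.0, requires_grad=True, name="x")
w = Node(4.0, requires_grad=True, name="w")
b = Node(1.0, name="b")  # constant

# loss = x*w + x*x + b, so x feeds two different ops
loss = add(add(mul(x, w), mul(x, x)), b)
backward(loss)
```

Because x is consumed by two ops, its gradient is the sum of both contributions (w + 2x), which is why the ordering matters: a naive depth-first descent could process x before the second contribution arrives.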

Code Example

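import numpy as np
import netcl as nc
import netcl.autograd as ag

# Placeholder host data (shapes and dtype are assumed); substitute real arrays.
numpy_x = np.random.rand(4, 3).astype(np.float32)
numpy_w = np.random.rand(4, 3).astype(np.float32)
target = nc.Tensor.from_host(np.zeros((4, 3), dtype=np.float32))

# Implicit tape; new ops are recorded as they execute.
x = nc.Tensor.from_host(numpy_x)
x.requires_grad = True
w = nc.Tensor.from_host(numpy_w)
w.requires_grad = True

y = ag.add(ag.mul(x, w), 1.0)         # recorded on the current Tape
loss = ag.mse_loss(y, target)         # target needs no gradient, so it is a constant

ag.backward(loss)                      # walks the tape from loss back to x and w

print(x.grad)                          # populated
print(w.grad)                          # populated

ag.get_current_tape().reset()          # start a fresh step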

Performance & Trade-offs

The tape is an explicit structure: the user must remember to reset() it. Forgetting to do so leaves every earlier step's nodes on the tape, so the graph grows with each step and the process eventually runs out of memory (OOM).
A no_grad() context (or set_grad_enabled(False)) suppresses recording. Wrap evaluation and weight-update code in it.
detect_anomaly(True) inserts NaN / inf checks at every node. It is expensive; turn it on only when debugging.
The walk is single-threaded by default. For models with a large tape, the shape of the graph matters more than its size: a wide MLP runs its backward pass faster than a 200-layer ResNet with a similar node count, because the ResNet has a longer chain of sequential dependencies.