Tape
Status: Public API in netcl.autograd.engine.Tape
The Tape is netcl's autograd graph. It is a doubly-linked DAG of
Node objects, one per intermediate tensor, that records what op
produced the node, what the inputs were, and what the backward function
is. Walking the tape from the loss back to the inputs is how netcl
computes gradients.
Like PyTorch's autograd graph, the netcl Tape is built eagerly: nodes
are added as ops are executed (unless a no_grad() context is active).
Unlike PyTorch, the tape is reset between training steps by the user;
netcl does not auto-detach the loss tensor or guess when a step has
ended. This explicit reset keeps the API predictable: what is on the
tape is exactly the set of nodes recorded since the last reset().
Overview
A Node on the tape carries:
- value: the Tensor produced by the op.
- grad_fn: a callable that, given the upstream gradient, produces the gradient w.r.t. each input.
- inputs: the list of Nodes the op consumed.
- requires_grad: propagated from the inputs; if any input had it True, the output does too.
- name: a debug-friendly identifier.
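The fields listed above map naturally onto a small record type. The following is a minimal sketch rather than netcl's actual class: SketchNode and make_node are hypothetical names, and plain floats stand in for Tensor values.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Sequence


@dataclass
class SketchNode:
    """Hypothetical stand-in for a tape Node; floats replace Tensors."""
    value: float
    grad_fn: Optional[Callable[[float], Sequence[float]]] = None
    inputs: List["SketchNode"] = field(default_factory=list)
    requires_grad: bool = False
    name: str = ""


def make_node(value, grad_fn, inputs, name=""):
    # requires_grad propagates: True if any input requires a gradient.
    needs_grad = any(n.requires_grad for n in inputs)
    return SketchNode(value, grad_fn, list(inputs), needs_grad, name)


x = SketchNode(2.0, requires_grad=True, name="x")
c = SketchNode(3.0, name="c")  # constant, no gradient needed

# For a product: d(x*c)/dx = c and d(x*c)/dc = x.
prod = make_node(x.value * c.value,
                 lambda g: (g * c.value, g * x.value), [x, c], "mul")
# An op fed only by constants does not require a gradient.
const_only = make_node(c.value + c.value, lambda g: (g, g), [c, c], "add")
```

Here prod inherits requires_grad from x, while const_only stays a constant because neither of its inputs asked for a gradient.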
When you call backward(loss), the engine does a topological reverse
walk from loss and, for each Node, calls grad_fn with the
upstream gradient. The returned gradient is either stored on the
input node's grad field (so the user can read it) or passed
further upstream if the input itself is the output of another op.
The tape is thread-local. There is exactly one active Tape per
thread, accessed via get_current_tape(). This is how the engine
knows where to record a new op without the user passing a tape
argument everywhere.
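One common way to get a per-thread singleton like this in Python is threading.local. The sketch below illustrates the idea under that assumption; SketchTape and get_current_tape_sketch are hypothetical names, and netcl's real accessor may be implemented differently.

```python
import threading


class SketchTape:
    """Hypothetical minimal tape: an ordered list of recorded nodes."""

    def __init__(self):
        self.nodes = []

    def reset(self):
        self.nodes.clear()


_state = threading.local()


def get_current_tape_sketch():
    # Lazily create one tape per thread on first access.
    tape = getattr(_state, "tape", None)
    if tape is None:
        tape = _state.tape = SketchTape()
    return tape


main_tape = get_current_tape_sketch()
main_tape.nodes.append("op recorded on the main thread")

seen = {}


def worker():
    # A different thread gets its own, initially empty tape.
    tape = get_current_tape_sketch()
    seen["same_object"] = tape is main_tape
    seen["worker_len"] = len(tape.nodes)


t = threading.Thread(target=worker)
t.start()
t.join()
```

Because each thread sees its own tape, ops recorded on one thread never appear in another thread's backward walk.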
Where It Lives
- File path: autograd/engine.py.
- Module path: netcl.autograd.
- Public re-export: from netcl.autograd import Tape.
- Siblings: autograd.graph (the lower-level DAG helpers) and autograd.training_compiler (the pattern-based fusion of detection losses).
How It Works
The recording is driven by apply_op. When a user calls
ag.add(x, y), apply_op is invoked with the op name, the input
nodes, and the closure that produced the output tensor. If any input
node has requires_grad=True and grad mode is enabled (it is by
default), apply_op:
- Allocates the output Tensor (via the standard factory).
- Constructs a new Node with the output's grad_fn set to a closure that, given the upstream gradient, returns the per-input gradients by calling the op's backward function.
- Inserts the node into the current Tape.
- Returns the output tensor.
When the user calls backward(loss_tensor):
- The engine finds the Node whose value is loss_tensor (an O(N) scan in the worst case; usually a hash hit if the loss tensor was registered when it was created).
- It performs a depth-first reverse topological walk.
- For each node, the engine calls grad_fn(upstream_grad) and either stores the result on the input's grad (terminal input, i.e. a leaf) or threads it further upstream.
- The walk stops at requires_grad=False nodes; those are treated as constants and their gradients are skipped.
The walk is implemented in autograd.engine._walk_backward. It is
not symbolic: it actually launches the backward kernels (or, for a
@jit_compiled function, the fused backward kernel that the JIT
generated alongside the forward).
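To make the backward walk concrete, here is a toy version on scalar values. It is a sketch under stated assumptions, not the code in _walk_backward: the N class, topo_order, and backward_sketch are hypothetical, and gradients are accumulated with += so that a node consumed by several ops receives the sum of its upstream gradients.

```python
class N:
    """Hypothetical scalar node with the fields the walk needs."""

    def __init__(self, value, requires_grad=False, inputs=(), grad_fn=None):
        self.value = value
        self.requires_grad = requires_grad
        self.inputs = list(inputs)
        self.grad_fn = grad_fn
        self.grad = 0.0


def topo_order(root):
    """Post-order DFS: every node appears after all of its inputs."""
    order, seen = [], set()

    def visit(node):
        if id(node) in seen or not node.requires_grad:
            return  # constants are skipped entirely
        seen.add(id(node))
        for inp in node.inputs:
            visit(inp)
        order.append(node)

    visit(root)
    return order


def backward_sketch(loss):
    loss.grad = 1.0  # d(loss)/d(loss)
    for node in reversed(topo_order(loss)):
        if node.grad_fn is None:
            continue  # leaf: its accumulated grad is the final result
        for inp, g in zip(node.inputs, node.grad_fn(node.grad)):
            if inp.requires_grad:
                inp.grad += g  # accumulate across multiple consumers


def mul(a, b):
    return N(a.value * b.value, a.requires_grad or b.requires_grad, [a, b],
             lambda g: (g * b.value, g * a.value))


def add(a, b):
    return N(a.value + b.value, a.requires_grad or b.requires_grad, [a, b],
             lambda g: (g, g))


x = N(3.0, requires_grad=True)
w = N(4.0, requires_grad=True)
one = N(1.0)  # constant, never receives a gradient

# loss = x*w + x*1; x is consumed twice, so its gradients must sum.
loss = add(mul(x, w), mul(x, one))
backward_sketch(loss)
```

Visiting nodes in reverse topological order guarantees that a node's own gradient is complete before its grad_fn runs, which is what makes the accumulation for x (w + 1) come out right.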
Code Example
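```python
import numpy as np

import netcl as nc
import netcl.autograd as ag

# Placeholder host data for illustration; any matching shapes work.
numpy_x = np.random.rand(4, 3).astype(np.float32)
numpy_w = np.random.rand(4, 3).astype(np.float32)
target = nc.Tensor.from_host(np.zeros((4, 3), dtype=np.float32))

# Implicit tape; new ops are recorded as they execute.
x = nc.Tensor.from_host(numpy_x)
x.requires_grad = True
w = nc.Tensor.from_host(numpy_w)
w.requires_grad = True

y = ag.add(ag.mul(x, w), 1.0)   # recorded on the current Tape
loss = ag.mse_loss(y, target)

ag.backward(loss)               # walks the tape
print(x.grad)                   # populated
print(w.grad)                   # populated

ag.get_current_tape().reset()   # start a fresh step
```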
Performance & Trade-offs
- The tape is an explicit structure: the user must remember to reset() it. Forgetting to do so makes each subsequent step walk an ever-growing graph until the process runs out of memory.
- A no_grad() context (or set_grad_enabled(False)) suppresses recording. Wrap evaluation and weight-update code in it.
- detect_anomaly(True) inserts NaN / inf checks at every node. It is expensive; turn it on only when debugging.
- The walk is single-threaded by default. For models with a large tape, the shape of the graph matters more than its size: a wide MLP is faster to run backward on than a 200-layer ResNet because the latter has more sequential dependencies.
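The cost of anomaly detection is easiest to see in a sketch. The wrapper below is a hypothetical illustration of a per-node NaN / inf check, not netcl's detect_anomaly implementation; with_anomaly_check and log_backward_at are names invented for this example.

```python
import math


def with_anomaly_check(grad_fn, node_name):
    """Wrap a grad_fn so every gradient it returns is checked for NaN/inf."""

    def checked(upstream):
        grads = grad_fn(upstream)
        for i, g in enumerate(grads):
            if not math.isfinite(g):
                raise FloatingPointError(
                    f"non-finite gradient {g!r} for input {i} of node '{node_name}'")
        return grads

    return checked


def log_backward_at(a):
    # d(log a)/da = 1/a; guarded so a == 0 yields inf instead of raising.
    return lambda g: (g / a if a != 0.0 else math.inf,)


ok = with_anomaly_check(log_backward_at(2.0), "log")
bad = with_anomaly_check(log_backward_at(0.0), "log")

ok_result = ok(1.0)
try:
    bad(1.0)
    caught = None
except FloatingPointError as exc:
    caught = str(exc)
```

Every gradient of every node passes through an extra check, which on real tensors means an additional reduction per node. That overhead is why such a mode is best reserved for debugging.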
See also
- Autograd & Tape: the architecture page with a larger tape diagram.
- Autograd API: apply_op, backward, no_grad.
- JIT Compiler: how the JIT traces the tape at the first call to produce a fused kernel.
- Tensor: the grad and grad_fn fields are the per-tensor hook into the tape.
- Tape: this article.