netcl.autograd — Tape, Node, JIT Compiler
netcl.autograd is the reverse-mode automatic differentiation layer of netcl. It
implements a tape-based recorder (Tape), a graph data structure built from Nodes, and a dynamic
JIT Compiler that fuses a chain of elementwise ops into a single
forward and backward OpenCL kernel pair. Together they let
you express models as ordinary Python and get an analytical gradient for free.
Note: Top-level re-exports. The bulk of this page is reachable via
import netcl.autograd as ag because autograd/__init__.py bundles the public surface. A few of the heavier pieces (CompiledGraph, TraceNode, TrainingGraphCompiler, debug_tape, …) are imported from their submodule path. Always prefer the ag.* shortcut when one exists; the long-form imports are shown next to the relevant section heading.
Overview
| Symbol | Path | Purpose |
|---|---|---|
| Tape | autograd/engine.py | Records ops on a thread-local context; drives backward |
| Node | autograd/engine.py | One value in the computational graph, with a grad_fn and parents |
| apply_op | autograd/engine.py | Public entry point for registering a forward+grad-fn pair onto the tape |
| no_grad, set_grad_enabled, is_grad_enabled | autograd/engine.py | Globally disable / query gradient recording |
| set_detect_anomaly, detect_anomaly | autograd/engine.py | Numerical-gradient checks during backward |
| set_current_tape, get_current_tape | autograd/engine.py | Thread-local tape for tape-free APIs |
| CompiledGraph | autograd/graph.py | Capture-and-replay wrapper for inference / training micrographs |
| TraceNode, TracingContext | autograd/compiler.py | The dynamic dataflow graph used by the JIT Compiler |
| jit_compile | autograd/compiler.py | Decorator that fuses a Python function into one OpenCL kernel pair |
| AutogradPrimitive | autograd/compiler.py | Describes the forward / backward of a fusible op |
| register_primitive | autograd/compiler.py | Register a custom primitive for the JIT Compiler |
| TrainingGraphCompiler | autograd/training_compiler.py | Registry of stable fused-loss training patterns |
| TrainingPattern | autograd/training_compiler.py | One (matcher, planner) entry in the registry |
| fused_weighted_bce_smooth_l1_loss | autograd/training_compiler.py | Detection-style fused loss used in object detection heads |
| get_training_compiler | autograd/training_compiler.py | Process-wide singleton of TrainingGraphCompiler |
| debug_tape | autograd/debug.py | Trivial context manager that exposes the active Tape |
| Elementwise / loss ops (add, relu, cross_entropy, …) | autograd/ops.py | The per-op autograd wrappers |
Node
A Node is one value in the dynamic computational graph. It is created by
apply_op on every differentiable call; plain
Tensors are also wrapped in Nodes by ag.tensor(x).
@dataclass
class Node:
    value: Tensor                          # the forward result
    grad_fn: Optional[GradFn] = None
    parents: List["Node"] = field(default_factory=list)
    grad: Optional[Tensor] = None
    requires_grad: bool = False
    creation_trace: Optional[List[str]] = None
    op_name: Optional[str] = None
| Field | Meaning |
|---|---|
| value | The forward Tensor. For ops recorded on a tape, this is the output buffer. During a JIT trace it is a TraceNode placeholder. |
| grad_fn | Callable (grad_out) -> List[Optional[Tensor]] that scatters the gradient back to each parent. None for leaf nodes. |
| parents | The input Nodes. Order matches what grad_fn returns. |
| grad | The accumulated upstream gradient, kept in sync with value.grad for the Optimizer. |
| requires_grad | True if any parent had requires_grad=True (and the Tape was active). |
| creation_trace | A list of formatted traceback strings, captured when detect_anomaly is on. |
| op_name | The name passed to apply_op, or the caller frame's function name when None was given. |
Operator overloading on Tensor makes the Node mostly invisible in user code
— a + b calls ag.add(a, b), which in turn calls
apply_op and returns the new Node. The full overload set is
__add__, __radd__, __sub__, __rsub__, __mul__, __rmul__, __truediv__,
__rtruediv__, __neg__, __pow__, __lt__, __le__, __gt__, __ge__.
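The dispatch described above can be illustrated with a small, self-contained sketch. It does not use netcl: ToyTensor, add, sub, and the calls log are hypothetical stand-ins, and the real overloads go through ag.* and apply_op rather than plain functions. The point is only how the reflected methods (__radd__, __rsub__) keep the operand order when the left operand is a plain Python number.

```python
calls = []  # hypothetical log of which op each operator dispatched to


def _unwrap(v):
    # Accept either a ToyTensor or a plain Python number.
    return v.data if isinstance(v, ToyTensor) else float(v)


def add(a, b):
    calls.append("add")
    return ToyTensor(_unwrap(a) + _unwrap(b))


def sub(a, b):
    calls.append("sub")
    return ToyTensor(_unwrap(a) - _unwrap(b))


class ToyTensor:
    def __init__(self, data):
        self.data = float(data)

    def __add__(self, other):   # a + b
        return add(self, other)

    def __radd__(self, other):  # 2.0 + a: the left operand is not a ToyTensor
        return add(other, self)

    def __sub__(self, other):   # a - b
        return sub(self, other)

    def __rsub__(self, other):  # 2.0 - a: operand order must be preserved
        return sub(other, self)


a = ToyTensor(1.5)
result = 2.0 - (a + 0.5)  # dispatches to add first, then to sub via __rsub__
```

For non-commutative operators such as subtraction and division, swapping the arguments in the reflected method is what keeps `2.0 - a` from being computed as `a - 2.0`.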
Tape
Tape is the recorder of a forward pass. It is a context manager that
installs itself as the current tape in a thread-local slot, records the Node that
apply_op produces for every op that goes through
autograd/ops.py, and then walks the recorded graph in reverse
topological order on backward(loss).
import netcl.autograd as ag
with ag.Tape() as tape:
    pred = model(x)
    loss = ag.cross_entropy(pred, y)
tape.backward(loss)
| Method | Purpose |
|---|---|
| __enter__ / __exit__ | Install / restore the thread-local current tape. Resets the Tape._pending_flush_queue flush flag. |
| record(node) | Append a Node to the tape. No-op when tape.enabled is False (used by no_grad). |
| backward(loss, grad=None) | Topologically sort the graph reachable from loss, then call each grad_fn in reverse order, accumulating into each parent's grad and value.grad. |
Tape also manages two subtle but important pieces of
OpenCL bookkeeping:
- The __enter__ hook does not drain the GPU command queue from the previous step's backward pass. Draining happens naturally when the next forward pass calls loss.to_host() (a synchronization point). This avoids an explicit clFinish() per step and saves on the order of 90 ms/step on a typical training loop.
- The backward hook remembers the queue it used and the Tape.__enter__ of the next step drains it. This prevents the well-known "OpenCL command-queue-full stall" that otherwise inflates forward time by ~180 ms.
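To make the backward(loss) traversal concrete, here is a minimal sketch of reverse-mode differentiation over a recorded graph. It assumes scalar floats instead of Tensors and leaves out all OpenCL queue handling; ToyNode, add, mul, and backward are illustrative names, not the netcl implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class ToyNode:
    value: float
    grad_fn: Optional[Callable[[float], List[float]]] = None
    parents: List["ToyNode"] = field(default_factory=list)
    grad: float = 0.0


def mul(a: ToyNode, b: ToyNode) -> ToyNode:
    # d(a*b)/da = b and d(a*b)/db = a
    return ToyNode(a.value * b.value, lambda g: [g * b.value, g * a.value], [a, b])


def add(a: ToyNode, b: ToyNode) -> ToyNode:
    return ToyNode(a.value + b.value, lambda g: [g, g], [a, b])


def backward(loss: ToyNode, grad: float = 1.0) -> None:
    order: List[ToyNode] = []
    seen = set()

    def visit(node: ToyNode) -> None:
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent in node.parents:
            visit(parent)
        order.append(node)  # parents are appended before their consumers

    visit(loss)
    loss.grad = grad
    for node in reversed(order):  # consumers first, then their parents
        if node.grad_fn is None:
            continue  # leaf node: nothing to propagate
        for parent, g in zip(node.parents, node.grad_fn(node.grad)):
            parent.grad += g  # multiple uses of a parent are summed


x = ToyNode(3.0)
y = ToyNode(4.0)
loss = add(mul(x, x), mul(x, y))  # x*x + x*y
backward(loss)
print(x.grad, y.grad)  # 10.0 3.0
```

Visiting consumers before parents guarantees that a node's gradient is complete before its own grad_fn runs, which is why x, used three times here, ends up with the full 2x + y.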
Thread-Local Current Tape
For convenience, ops can run without an explicit with ag.Tape() block — they
automatically find the active tape via the thread-local store:
ag.set_current_tape(tape) # install
ag.get_current_tape() # -> tape | None
apply_op falls back to get_current_tape() whenever no tape= keyword
is passed. Multi-threaded trainers typically use one Tape per worker
thread; the thread-locality keeps them isolated.
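The isolation between worker threads can be sketched with the standard library's threading.local. This is an illustration of the mechanism only: the module-level _state object and the string "tapes" are assumptions made for the example, not netcl internals.

```python
import threading

_state = threading.local()  # each thread sees its own attributes


def set_current_tape(tape):
    _state.tape = tape


def get_current_tape():
    # Threads that never installed a tape get None.
    return getattr(_state, "tape", None)


seen = {}


def worker(name, install):
    if install:
        set_current_tape(f"tape-{name}")
    seen[name] = get_current_tape()


set_current_tape("tape-main")
threads = [
    threading.Thread(target=worker, args=("a", True)),
    threading.Thread(target=worker, args=("b", True)),
    threading.Thread(target=worker, args=("c", False)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(seen)                # each worker saw only its own tape (or None)
print(get_current_tape())  # the main thread's tape is untouched
```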
apply_op
apply_op is the single public entry point that bridges a Python
forward function and a Python backward function with the
Tape recorder. Every op in
autograd/ops.py is implemented in terms of it.
def apply_op(
    fn: Callable[..., Tensor],        # forward implementation
    grad_fn: Optional[GradFn],        # backward implementation (None = no grad)
    *args: Node,                      # input nodes (or plain values)
    tape: Optional[Tape] = None,      # override the current tape
    op_name: Optional[str] = None,    # name in the graph (defaults to caller frame)
    attrs: Optional[dict] = None,     # extra kwargs for the JIT Compiler tracer
) -> Node
The function does four things in order:
- Tracing bypass. If the JIT Compiler has a TracingContext active, apply_op returns a Node whose value is a TraceNode, a symbolic placeholder used to build the fused kernel source. No tape recording happens in this mode.
- Grad off. If is_grad_enabled() is False, apply_op just runs fn(...) and returns the raw Tensor, bypassing both Tape and the autograd machinery entirely.
- Normal path. Run the forward function, build a Node with the right grad_fn, parents, op_name, and (when anomaly detection is on) creation_trace, then tape.record(node) if a tape is in scope.
- In-place grad accumulation. The backward() loop on Tape calls each grad_fn in reverse topological order and accumulates into the corresponding parent Node.grad (and value.grad for Optimizer compatibility). Multiple uses of a parent are summed in-place via an OpenCL ADD kernel.
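The branch order above can be sketched in plain Python. Everything here is a hypothetical stand-in: MODE replaces is_grad_enabled() and the TracingContext flag, ToyNode and ToyTape replace the real classes, and values are floats rather than Tensors. The sketch covers the first three branches; accumulation happens later, during backward.

```python
import inspect
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional

# Hypothetical stand-ins for is_grad_enabled() and TracingContext.active.
MODE = {"grad": True, "tracing": False}


@dataclass
class ToyNode:
    value: Any
    grad_fn: Optional[Callable] = None
    parents: List["ToyNode"] = field(default_factory=list)
    op_name: Optional[str] = None


class ToyTape:
    def __init__(self) -> None:
        self.nodes: List[ToyNode] = []

    def record(self, node: ToyNode) -> None:
        self.nodes.append(node)


def toy_apply_op(fn, grad_fn, *args, tape=None, op_name=None):
    # Default the op name to the calling function's name.
    name = op_name or inspect.currentframe().f_back.f_code.co_name
    if MODE["tracing"]:                      # 1. tracing bypass: symbolic value
        return ToyNode(("trace", name), None, list(args), name)
    values = [a.value for a in args]
    if not MODE["grad"]:                     # 2. grad off: raw result, no Node
        return fn(*values)
    node = ToyNode(fn(*values), grad_fn, list(args), name)  # 3. normal path
    if tape is not None:
        tape.record(node)
    return node


def relu(x, tape=None):
    return toy_apply_op(
        lambda v: max(v, 0.0),
        lambda g: [g if x.value > 0 else 0.0],
        x,
        tape=tape,
    )


tape = ToyTape()
out = relu(ToyNode(-2.0), tape=tape)   # recorded on the tape
MODE["grad"] = False
raw = relu(ToyNode(3.0), tape=tape)    # plain float, nothing recorded
MODE["grad"] = True
```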
no_grad, set_grad_enabled, is_grad_enabled
These three form the standard PyTorch-style grad-mode switch.
with ag.no_grad():
    y = model(x)  # forward still runs, but no Nodes are recorded
| Symbol | Purpose |
|---|---|
| set_grad_enabled(mode: bool) | Process-wide toggle; affects every subsequent apply_op call. |
| is_grad_enabled() | Returns the current process-wide grad mode (True by default). |
| no_grad | Context manager that saves the prior mode, sets it to False, and restores on exit. |
The toggle is the recommended way to run inference or any inner loop that should
not pollute the autograd graph. Note that recording can also be suppressed for a single tape,
because record() is a no-op while tape.enabled is False, but the global flag is what every
high-level helper (e.g. model.eval() in the nn API) ultimately respects.
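The save, disable, restore behaviour of no_grad is a standard context-manager pattern. A minimal sketch, assuming the mode lives in a module-level dictionary (the _mode name is hypothetical, not the netcl storage):

```python
from contextlib import contextmanager

_mode = {"grad": True}  # hypothetical process-wide flag


def is_grad_enabled() -> bool:
    return _mode["grad"]


def set_grad_enabled(mode: bool) -> None:
    _mode["grad"] = bool(mode)


@contextmanager
def no_grad():
    prev = is_grad_enabled()      # save the prior mode
    set_grad_enabled(False)
    try:
        yield
    finally:
        set_grad_enabled(prev)    # restore even if the body raises


states = []
with no_grad():
    states.append(is_grad_enabled())
    with no_grad():               # nesting is safe
        states.append(is_grad_enabled())
    states.append(is_grad_enabled())  # inner exit restores False, not True
states.append(is_grad_enabled())
```

Restoring the saved value rather than unconditionally setting True is what makes nested blocks behave correctly.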
set_detect_anomaly, detect_anomaly
A slow but high-signal mode that compares analytical gradients (the kernel chain you
authored) against finite-difference numerical gradients for the same input. Useful when
you are writing a new op or when an existing one starts producing NaNs.
with ag.detect_anomaly():
    tape.backward(loss)
Internally, Tape.backward calls grad.to_host() for every parent and
checks np.isnan / np.isinf. On hit, it raises a RuntimeError containing the
creation_trace of the offending Node, so you can jump straight to the
Python frame that registered the bad op.
| Symbol | Purpose |
|---|---|
| set_detect_anomaly(enable: bool) | Module-level switch. The corresponding context manager does the same. |
| detect_anomaly(enable=True) | Context manager; saves and restores the prior setting on __exit__. |
Anomaly detection also installs a traceback.format_stack() capture into
Node.creation_trace, so even successful (non-NaN) backward passes leave breadcrumbs for
later inspection.
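The non-finite check and the creation_trace breadcrumb can be sketched with the standard library alone. This follows the NaN/Inf check described above and uses scalar gradients; make_node, check_grad, and build_bad_op are hypothetical helpers, and the dictionary stands in for a Node.

```python
import math
import traceback


def make_node(op_name, grad_value):
    # Capture the Python stack at creation time, like Node.creation_trace.
    return {
        "op_name": op_name,
        "grad": grad_value,
        "creation_trace": traceback.format_stack(),
    }


def check_grad(node):
    g = node["grad"]
    if math.isnan(g) or math.isinf(g):
        raise RuntimeError(
            f"non-finite gradient in op '{node['op_name']}', created at:\n"
            + "".join(node["creation_trace"])
        )


def build_bad_op():
    # Stands in for user code that registered the faulty op.
    return make_node("log", float("nan"))


try:
    check_grad(build_bad_op())
    message = None
except RuntimeError as exc:
    message = str(exc)  # names the op and the frame that created it
```

Because the stack is captured when the node is built rather than when the check fails, the error points at the code that registered the op instead of at the backward loop.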
debug_tape
debug_tape is a thin context manager from
autograd/debug.py that yields the active
Tape so you can inspect tape.nodes and tape.enabled while stepping
through a forward pass with a debugger.
from netcl.autograd.debug import debug_tape
with debug_tape(tape) as t:
    # t is the same object as `tape`
    pred = model(x)
    print(len(t.nodes))  # how many ops have been recorded so far
    loss = ag.cross_entropy(pred, y)
    print(len(t.nodes))  # one more
For most debugging scenarios the tape.nodes list combined with node.creation_trace is
the fastest path to the source of a gradient issue. If you find yourself wanting richer
graph visualizations, see JIT Compiler for a tour of
TraceNode inspection.
CompiledGraph
CompiledGraph, in autograd/graph.py, is the
capture-and-replay primitive that lets the JIT Compiler
amortize Python overhead across iterations.
from netcl.autograd.graph import CompiledGraph
g = CompiledGraph(params=model.parameters())
with g:
    loss = train_step(model, batch)
    loss.backward()
g.compile([x, y])    # one-time setup of static/dynamic arg slots
g.replay([x2, y2])   # zero-overhead replay
| Method | Purpose |
|---|---|
| __enter__ / __exit__ | On the first entry, begin capture of every kernel launch. On later entries, restore cached grads. |
| compile(inputs) | Resolve static vs. dynamic kernel arguments and pre-bind them for fast replay. |
| replay(new_inputs) | Zero the gradient buffers and re-enqueue the captured kernel sequence with new input pointers. |
The companion primitives in runtime/capture.py
(get_capture()) are what CompiledGraph calls into under the hood. The replay path is
what makes a steady-state training loop drop into a tight C-level cl.enqueue_nd_range_kernel
loop with no Python frame per launch.
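The capture-and-replay idea can be sketched without OpenCL: record each "kernel launch" together with the named buffers it reads and writes, then rerun the recorded sequence after rebinding only the dynamic inputs. ToyGraph and its methods are hypothetical, and Python callables on floats stand in for enqueued kernels on device buffers.

```python
from typing import Callable, Dict, List, Tuple

Launch = Tuple[Callable[..., float], str, Tuple[str, ...]]


class ToyGraph:
    def __init__(self) -> None:
        self.launches: List[Launch] = []
        self.buffers: Dict[str, float] = {}

    def launch(self, fn: Callable[..., float], out: str, *ins: str) -> None:
        """Run a 'kernel' now and remember it for later replay."""
        self.launches.append((fn, out, ins))
        self.buffers[out] = fn(*(self.buffers[name] for name in ins))

    def replay(self, **inputs: float) -> float:
        """Rebind the dynamic inputs and rerun the captured sequence."""
        self.buffers.update(inputs)
        for fn, out, ins in self.launches:
            self.buffers[out] = fn(*(self.buffers[name] for name in ins))
        return self.buffers[self.launches[-1][1]]


g = ToyGraph()
g.buffers.update(x=2.0, w=3.0)                # x is dynamic, w stays static
g.launch(lambda x, w: x * w, "h", "x", "w")   # captured launch 1
g.launch(lambda h: max(h, 0.0), "y", "h")     # captured launch 2
first = g.buffers["y"]
second = g.replay(x=-1.0)                     # same sequence, new input
third = g.replay(x=5.0)
```

Replay never re-executes the code that decided which kernels to launch; it only walks the recorded list, which is where the saved per-iteration overhead comes from.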
jit_compile
jit_compile is the decorator at the heart of the
JIT Compiler. It traces a Python function containing a chain
of fusible ops, generates a single OpenCL forward kernel and a single backward kernel,
caches them by (fn, shapes, kwargs), and re-uses the cached kernels on every subsequent
call.
from netcl.autograd.compiler import jit_compile
import netcl.autograd as ag
@jit_compile
def fused(x):
    y = ag.relu(x)
    z = ag.gelu(y) + 0.5
    return ag.sigmoid(z)
On the first call jit_compile does the following:
1. Tracing pass. A TracingContext.active flag is set, a parallel set of dummy Nodes is built whose values are TraceNodes (placeholder op nodes), and fused is run on them. The result is a symbolic DAG of operations.
2. Topological sort. A DFS over the TraceNode DAG produces a linear order from input leaves to the single output root.
3. Forward kernel generation. For each node in order the registered AutogradPrimitive.forward emits a single C statement (float node_N = EXPR(parent0, parent1, …);). The whole body is wrapped in a __kernel void fused_forward(…) function.
4. Backward kernel generation. Starting from the output, the compiler walks order in reverse and, for each op, asks the registered AutogradPrimitive.backward for the gradient w.r.t. each parent. Symbolically accumulated gradients are folded with + so an op with multiple consumers gets a single g_node = g_consumer1 + g_consumer2 line. The forward pass is recomputed inside the backward kernel so no intermediates are stored to global memory.
5. Compile & cache. Both kernels go through runtime.cache.cacher to get a cl.Program. The kernel objects are stashed in a per-process _COMPILER_CACHE keyed on (fn, sig_shapes, kwargs).
6. Subsequent calls. The decorator skips steps 1–5, goes straight to kernel_fw(...), and registers a grad_fn closure that calls kernel_bw(...).
If the function's output is not a TraceNode (e.g. it returns a plain
Python float, or it calls an op that is not registered as fusible), the decorator silently
falls back to the un-fused implementation.
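The topological sort and forward kernel generation steps can be sketched as a small string-emitting pass. This is an illustration under stated assumptions, not the netcl compiler: ToyTraceNode, the FORWARD table, and the in0/out parameter naming are hypothetical, and only the kernel source text is produced (nothing is compiled).

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyTraceNode:
    op_name: str
    inputs: List["ToyTraceNode"] = field(default_factory=list)


# Forward emitters: C names of the inputs in, one C expression out.
FORWARD = {
    "relu": lambda a: f"fmax({a[0]}, 0.0f)",
    "add": lambda a: f"({a[0]} + {a[1]})",
    "mul": lambda a: f"({a[0]} * {a[1]})",
}


def topo_order(root: ToyTraceNode) -> List[ToyTraceNode]:
    order, seen = [], set()

    def visit(node: ToyTraceNode) -> None:
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent in node.inputs:
            visit(parent)
        order.append(node)  # inputs come before the ops that use them

    visit(root)
    return order


def emit_forward(root: ToyTraceNode) -> str:
    names: Dict[int, str] = {}
    params: List[str] = []
    body: List[str] = []
    for node in topo_order(root):
        if node.op_name == "input":
            idx = len(params)
            names[id(node)] = f"in{idx}[i]"
            params.append(f"__global const float* in{idx}")
        else:
            var = f"node_{len(body)}"
            expr = FORWARD[node.op_name]([names[id(p)] for p in node.inputs])
            body.append(f"    float {var} = {expr};")
            names[id(node)] = var
    params.append("__global float* out")
    lines = [
        f"__kernel void fused_forward({', '.join(params)}) {{",
        "    int i = get_global_id(0);",
        *body,
        f"    out[i] = {names[id(root)]};",
        "}",
    ]
    return "\n".join(lines)


x = ToyTraceNode("input")
y = ToyTraceNode("relu", [x])
z = ToyTraceNode("add", [ToyTraceNode("mul", [y, y]), x])  # relu(x)^2 + x
source = emit_forward(z)
print(source)
```

Each intermediate lives in a private float, so the generated kernel reads the input once and writes the output once per work item.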
CPU Backend
When all inputs are CPU tensors (see Tensor Backend),
jit_compile takes a parallel path: it generates two plain Python
functions fused_forward and fused_backward by exec()-ing NumPy expressions, then
caches them. There is no OpenCL kernel in this branch — the speedup comes from removing
the per-element Python overhead, not from kernel fusion.
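The exec()-based path can be sketched as follows. To stay dependency-free, this example emits scalar math expressions instead of the NumPy expressions described above; the EXPR table, the ORDER list, and build_fused_forward are hypothetical names chosen for illustration.

```python
import math

# Expression emitters: Python names of the inputs in, one expression out.
EXPR = {
    "relu": lambda a: f"max({a[0]}, 0.0)",
    "sigmoid": lambda a: f"(1.0 / (1.0 + math.exp(-{a[0]})))",
}

# Linear order from the trace: (output variable, op name, input variables).
ORDER = [("v1", "relu", ["v0"]), ("v2", "sigmoid", ["v1"])]


def build_fused_forward(order, input_var, output_var):
    lines = [f"def fused_forward({input_var}):"]
    for out, op, args in order:
        lines.append(f"    {out} = {EXPR[op](args)}")
    lines.append(f"    return {output_var}")
    namespace = {"math": math}
    exec("\n".join(lines), namespace)  # define the function from source text
    return namespace["fused_forward"]


fused_forward = build_fused_forward(ORDER, "v0", "v2")
print(fused_forward(0.0))   # sigmoid(relu(0.0)) = 0.5
print(fused_forward(-3.0))  # relu clamps to 0.0, so again 0.5
```

The generated function is built once and then called like any other Python function, so repeated calls skip the tracing and string assembly entirely.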
Registering Custom Primitives
from netcl.autograd.compiler import register_primitive
def fwd(args, attrs):
    return f"{args[0]} * {args[1]}"

def bwd(args, grad_var, attrs, out_var):
    return [f"{grad_var} * {args[1]}", f"{grad_var} * {args[0]}"]

register_primitive("mul", fwd, bwd, arity=2, fusible=True)
The compiler automatically looks up a primitive by the op_name string passed to
apply_op. See
Writing a Custom OpenCL Kernel for a worked example that
combines register_primitive with jit_compile.
AutogradPrimitive and register_primitive
An AutogradPrimitive is a frozen dataclass that describes a single
fusible op:
@dataclass(frozen=True)
class AutogradPrimitive:
    name: str
    forward: Callable[[List[str], dict], str]
    backward: Callable[[List[str], str, dict, str], List[str]]
    arity: int | None = None
    fusible: bool = True
- forward(parents, attrs) -> str returns a single C expression of the output in terms of the input variable names (e.g. "fmax(v0, v1)").
- backward(parents, grad_var, attrs, out_var) -> List[str] returns one C expression per parent, expressing that parent's gradient. grad_var is the C name of the upstream gradient ("g_node_42" in the generated kernel) and out_var is the local C name of the op's output, in case the backward needs it.
- arity is used for the scalar-broadcast and ternary variants; the compiler can recognize an arity-2 op with a scalar attribute and emit a scalar-fused kernel.
- fusible=False excludes the primitive from JIT fusion even if it is registered. Use this for ops whose forward or backward is genuinely unfusible (e.g. it needs an internal reduction).
register_primitive writes the entry into the module-level
_PRIMITIVES dictionary. The default set is registered at import time and covers every
op in autograd/ops.py plus the
weighted_bce and weighted_smooth_l1 detection losses.
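To show how backward callbacks with this signature compose, the sketch below walks a linear op order in reverse and folds the contributions of multiple consumers with +. It is a simplified illustration: the BACKWARD table, the ORDER list, the grad_out[i] seed, and the g_ prefix naming are assumptions, and the real compiler emits a complete kernel rather than a list of statements.

```python
from collections import defaultdict


def mul_bwd(args, grad_var, attrs, out_var):
    return [f"{grad_var} * {args[1]}", f"{grad_var} * {args[0]}"]


def relu_bwd(args, grad_var, attrs, out_var):
    return [f"{args[0]} > 0.0f ? {grad_var} : 0.0f"]


BACKWARD = {"mul": mul_bwd, "relu": relu_bwd}

# Linear forward order: (output variable, op name, input variables).
# v2 = v1 * v1 uses v1 twice, so v1 receives two gradient contributions.
ORDER = [("v1", "relu", ["v0"]), ("v2", "mul", ["v1", "v1"])]


def emit_backward(order, output_var):
    contributions = defaultdict(list)
    contributions[output_var].append("grad_out[i]")  # seed from upstream
    lines = []
    for out_var, op, args in reversed(order):
        grad_var = f"g_{out_var}"
        # Parenthesize each term so folding with + is precedence-safe.
        total = " + ".join(f"({c})" for c in contributions[out_var])
        lines.append(f"float {grad_var} = {total};")
        exprs = BACKWARD[op](args, grad_var, {}, out_var)
        for arg, expr in zip(args, exprs):
            contributions[arg].append(expr)
    return lines, contributions


lines, contributions = emit_backward(ORDER, "v2")
for line in lines:
    print(line)
```

Collecting contributions per variable first and emitting a single summed statement later is what lets a value with several consumers end up with one gradient line.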
TraceNode and TracingContext
TraceNode is the symbolic placeholder that the
JIT Compiler builds the fused kernel source from. A
TraceNode carries the op_name, the list of input
TraceNodes, the output shape, the dtype, and an attrs dict
(typically the attrs= keyword forwarded by apply_op).
TracingContext is a thread-local active flag. When active is True,
apply_op takes the tracing branch and returns a Node whose value is
a TraceNode. When False, apply_op runs the real
forward and records onto the active Tape. The compiler is the only
caller that flips active.
The two together implement a dynamic dataflow graph: the compiler traces, the forward kernels compute, the backward kernels consume, and there is no global mutable state besides the _COMPILER_CACHE.
TrainingGraphCompiler, TrainingPattern, fused_weighted_bce_smooth_l1_loss, get_training_compiler
The elementwise JIT Compiler is great for linear chains of
pointwise ops, but the training loop in object-detection heads routinely combines a
pointwise weighted BCE with a pointwise weighted Smooth-L1, and the reduction over
the entire image (a sum of ~tens of millions of elements) is a non-trivial op on its own.
The dedicated fused kernel for that pattern is what
autograd/training_compiler.py is for.
| Symbol | Purpose |
|---|---|
| TrainingGraphCompiler | A small registry of stable TrainingPatterns. New patterns can be added via register(); the built-in entry is weighted_bce_smooth_l1_sum. |
| TrainingPattern | A (name, matcher, planner) triple. The matcher decides whether a forward call is eligible; the planner builds the cached FusedDetectionLossPlan. |
| fused_weighted_bce_smooth_l1_loss(pred, target, heat_weight, reg_weight, heat_scale, reg_scale, beta=0.05) | Returns a Node whose forward is the partial+reduce kernel and whose backward is a single OpenCL kernel that produces the gradient w.r.t. pred only. target and the two weight tensors are treated as non-trainable inputs. |
| get_training_compiler() | Returns the process-wide singleton of TrainingGraphCompiler. Call this from custom training loops if you want to add your own pattern. |
The full set of autograd-supported fused detection losses is summarized in the table above; refer to Architecture: Autograd & Tape for the dataflow diagram.
Putting It Together: Training One Step
import netcl.autograd as ag
import netcl.amp as amp
from netcl.optim import Adam
opt = Adam(model.parameters(), lr=1e-3)
scaler = amp.GradScaler()
for x, y in loader:
    with ag.Tape() as tape:
        with amp.autocast(enabled=True):
            pred = model(x)
            loss = ag.cross_entropy(pred, y)
        loss = scaler.scale_loss(loss)
    tape.backward(loss)
    scaler.step(opt, model.parameters())
    scaler.update()
    opt.zero_grad()
This is the canonical netcl training step. It composes the Tape for
gradient recording, autocast for half-precision forward, and
GradScaler for the loss scaling. See MNIST with MLP for a
complete worked example.
See also
- Understanding Autograd: the step-by-step tutorial this API is designed to support.
- Architecture: Autograd & Tape: the design rationale, the topological-sort algorithm, and the in-place ADD accumulation.
- Architecture: JIT Compiler: how TraceNodes become a single OpenCL kernel pair.
- Writing a Custom OpenCL Kernel: shows how to use register_primitive and jit_compile for a custom op.
- amp API: the autocast and GradScaler used together with the Tape in the example above.
- Tensor: the value carried by every Node.value.
- OpenCL: the runtime that executes the compiled kernels.