netcl.autograd — Tape, Node, JIT Compiler

netcl.autograd is the reverse-mode automatic differentiation layer of netcl. It implements a tape-based recorder (Tape), a per-value Node graph data structure, and a JIT Compiler that fuses a chain of elementwise ops into a single forward and backward OpenCL kernel pair. Together they let you express models as ordinary Python and get analytical gradients for free.

Note — Top-level re-exports. The bulk of this page is reachable via import netcl.autograd as ag because autograd/__init__.py bundles the public surface. A few of the heavier pieces (CompiledGraph, TraceNode, TrainingGraphCompiler, debug_tape, …) are imported from their submodule path. Always prefer the ag.* shortcut when one exists; the long-form imports are shown next to the relevant section heading.

Overview

| Symbol | Path | Purpose |
| --- | --- | --- |
| Tape | autograd/engine.py | Records ops on a thread-local context; drives backward |
| Node | autograd/engine.py | One value in the computational graph, with a grad_fn and parents |
| apply_op | autograd/engine.py | Public entry point for registering a forward+grad-fn pair onto the tape |
| no_grad, set_grad_enabled, is_grad_enabled | autograd/engine.py | Globally disable / query gradient recording |
| set_detect_anomaly, detect_anomaly | autograd/engine.py | NaN / Inf gradient checks during backward |
| set_current_tape, get_current_tape | autograd/engine.py | Thread-local tape for tape-free APIs |
| CompiledGraph | autograd/graph.py | Capture-and-replay wrapper for inference / training micrographs |
| TraceNode, TracingContext | autograd/compiler.py | The dynamic dataflow graph used by the JIT Compiler |
| jit_compile | autograd/compiler.py | Decorator that fuses a Python function into one OpenCL kernel pair |
| AutogradPrimitive | autograd/compiler.py | Describes the forward / backward of a fusible op |
| register_primitive | autograd/compiler.py | Register a custom primitive for the JIT Compiler |
| TrainingGraphCompiler | autograd/training_compiler.py | Registry of stable fused-loss training patterns |
| TrainingPattern | autograd/training_compiler.py | One (matcher, planner) entry in the registry |
| fused_weighted_bce_smooth_l1_loss | autograd/training_compiler.py | Detection-style fused loss used in object detection heads |
| get_training_compiler | autograd/training_compiler.py | Process-wide singleton of TrainingGraphCompiler |
| debug_tape | autograd/debug.py | Trivial context manager that exposes the active Tape |
| Elementwise / loss ops (add, relu, cross_entropy, …) | autograd/ops.py | The per-op autograd wrappers |

Node

A Node is one value in the dynamic computational graph. It is created by apply_op on every differentiable call; plain Tensors are also wrapped in Nodes by ag.tensor(x).

@dataclass
class Node:
    value: Tensor                  # the forward result
    grad_fn: Optional[GradFn] = None
    parents: List["Node"] = field(default_factory=list)
    grad: Optional[Tensor] = None
    requires_grad: bool = False
    creation_trace: Optional[List[str]] = None
    op_name: Optional[str] = None

| Field | Meaning |
| --- | --- |
| value | The forward Tensor. For ops recorded on a tape, this is the output buffer. During a JIT trace it is a TraceNode placeholder. |
| grad_fn | Callable (grad_out) -> List[Optional[Tensor]] that scatters the gradient back to each parent. None for leaf nodes. |
| parents | The input Nodes. Order matches what grad_fn returns. |
| grad | The accumulated upstream gradient, kept in sync with value.grad for the Optimizer. |
| requires_grad | True if any parent had requires_grad=True (and the Tape was active). |
| creation_trace | A list of formatted traceback strings, captured when detect_anomaly is on. |
| op_name | The name passed to apply_op, or the caller frame's function name when None was given. |

Operator overloading on Tensor makes the Node mostly invisible in user code — a + b calls ag.add(a, b), which in turn calls apply_op and returns the new Node. The full overload set is __add__, __radd__, __sub__, __rsub__, __mul__, __rmul__, __truediv__, __rtruediv__, __neg__, __pow__, __lt__, __le__, __gt__, __ge__.
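The overload mechanism can be illustrated with a minimal sketch. ToyNode below is a hypothetical scalar stand-in (plain floats instead of Tensors, no tape, no OpenCL) and is not part of netcl; it only shows how an overloaded operator can return a new node that remembers its parents and a grad_fn.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class ToyNode:
    """Hypothetical scalar stand-in for Node (floats instead of Tensors)."""
    value: float
    grad_fn: Optional[Callable[[float], List[float]]] = None
    parents: List["ToyNode"] = field(default_factory=list)

    @staticmethod
    def _wrap(x):
        # Plain Python numbers become leaf nodes, like the constant in a + 0.5.
        return x if isinstance(x, ToyNode) else ToyNode(float(x))

    def __add__(self, other):
        other = ToyNode._wrap(other)
        # d(a + b)/da = 1 and d(a + b)/db = 1, so the gradient passes through.
        return ToyNode(self.value + other.value, lambda g: [g, g], [self, other])

    __radd__ = __add__  # addition is commutative, so 1.0 + node reuses __add__

    def __mul__(self, other):
        other = ToyNode._wrap(other)
        a, b = self.value, other.value
        # d(a * b)/da = b and d(a * b)/db = a.
        return ToyNode(a * b, lambda g: [g * b, g * a], [self, other])


a, b = ToyNode(2.0), ToyNode(3.0)
c = a * b + 1.0  # builds two new nodes: the product, then the sum
```

The order of the list returned by each grad_fn matches the order of parents, which is the same contract the Node table describes.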

Tape

Tape is the recorder of a forward pass. It is a context manager that installs itself as the current tape in a thread-local slot, receives a Node from apply_op for every op that goes through autograd/ops.py, and walks the recorded graph in reverse topological order when you call backward(loss).

import netcl.autograd as ag

with ag.Tape() as tape:
    pred = model(x)
    loss = ag.cross_entropy(pred, y)

tape.backward(loss)

| Method | Purpose |
| --- | --- |
| __enter__ / __exit__ | Install / restore the thread-local current tape. Resets the Tape._pending_flush_queue flush flag. |
| record(node) | Append a Node to the tape. No-op when tape.enabled is False (used by no_grad). |
| backward(loss, grad=None) | Topologically sort the graph reachable from loss, then call each grad_fn in reverse order, accumulating into each parent's grad and value.grad. |

Tape also manages two subtle but important pieces of OpenCL bookkeeping:

  • Tape.__enter__ never blocks on the GPU: it does not call clFinish() on the command queue used by the previous step's backward pass. Blocking synchronization happens naturally at the next host read, such as loss.to_host(). Skipping the explicit clFinish() per step saves on the order of 90 ms/step on a typical training loop.
  • The backward hook remembers the queue it used in Tape._pending_flush_queue, and the Tape.__enter__ of the next step flushes that queue, submitting its pending commands without waiting for them. This prevents the well-known "OpenCL command-queue-full stall" that otherwise inflates forward time by ~180 ms.
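Leaving the OpenCL bookkeeping aside, the record / backward contract from the method table can be sketched in a few lines. ToyTape, ToyNode, and the mul helper are hypothetical scalar stand-ins written for this example, not the netcl implementation.

```python
class ToyNode:
    """Hypothetical scalar node: value, grad_fn, parents, accumulated grad."""
    def __init__(self, value, grad_fn=None, parents=()):
        self.value = value
        self.grad_fn = grad_fn
        self.parents = list(parents)
        self.grad = 0.0


class ToyTape:
    def __init__(self):
        self.nodes = []
        self.enabled = True

    def record(self, node):
        # No-op while the tape is disabled.
        if self.enabled:
            self.nodes.append(node)

    def backward(self, loss, grad=1.0):
        # Post-order DFS puts every parent before its consumers.
        order, seen = [], set()

        def visit(node):
            if id(node) in seen:
                return
            seen.add(id(node))
            for parent in node.parents:
                visit(parent)
            order.append(node)

        visit(loss)
        loss.grad = grad
        # Reverse order: a node's grad is complete before its grad_fn runs.
        for node in reversed(order):
            if node.grad_fn is None:
                continue
            for parent, g in zip(node.parents, node.grad_fn(node.grad)):
                if g is not None:
                    parent.grad += g  # multiple uses of a parent are summed


def mul(tape, a, b):
    out = ToyNode(a.value * b.value,
                  lambda g: [g * b.value, g * a.value], [a, b])
    tape.record(out)
    return out


tape = ToyTape()
x = ToyNode(3.0)
y = mul(tape, x, x)  # y = x * x, so x is used twice
tape.backward(y)
```

Because x appears twice among the parents of y, its gradient is the sum of both contributions, which gives the expected derivative 2x.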

Thread-Local Current Tape

For convenience, ops can run without an explicit with ag.Tape() block — they automatically find the active tape via the thread-local store:

ag.set_current_tape(tape)    # install
ag.get_current_tape()        # -> tape | None

apply_op falls back to get_current_tape() whenever no tape= keyword is passed. Multi-threaded trainers typically use one Tape per worker thread; the thread-locality keeps them isolated.
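The isolation between worker threads follows directly from Python's threading.local. The sketch below uses hypothetical toy_set_current_tape / toy_get_current_tape helpers and strings in place of real Tape objects; it assumes nothing about how netcl stores the slot beyond it being thread-local.

```python
import threading

_slot = threading.local()  # each thread sees its own attributes on this object


def toy_set_current_tape(tape):
    _slot.tape = tape


def toy_get_current_tape():
    # Threads that never installed a tape get None.
    return getattr(_slot, "tape", None)


seen = {}


def worker(name):
    toy_set_current_tape(f"tape-{name}")  # a string stands in for a Tape
    seen[name] = toy_get_current_tape()


threads = [threading.Thread(target=worker, args=(n,)) for n in ("a", "b")]
for t in threads:
    t.start()
for t in threads:
    t.join()

main_thread_tape = toy_get_current_tape()  # the main thread installed nothing
```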

apply_op

apply_op is the single public entry point that bridges a Python forward function and a Python backward function with the Tape recorder. Every op in autograd/ops.py is implemented in terms of it.

def apply_op(
    fn: Callable[..., Tensor],          # forward implementation
    grad_fn: Optional[GradFn],          # backward implementation (None = no grad)
    *args: Node,                        # input nodes (or plain values)
    tape: Optional[Tape] = None,        # override the current tape
    op_name: Optional[str] = None,      # name in the graph (defaults to caller frame)
    attrs: Optional[dict] = None,       # extra kwargs for the JIT Compiler tracer
) -> Node

apply_op itself takes one of three paths, checked in this order; a fourth step completes the picture later, at backward time:

  1. Tracing bypass. If the JIT Compiler has a TracingContext active, apply_op returns a Node whose value is a TraceNode, a symbolic placeholder used to build the fused kernel source. No tape recording happens in this mode.
  2. Grad off. If is_grad_enabled() is False, apply_op just runs fn(...) and returns the raw Tensor, bypassing both Tape and the autograd machinery entirely.
  3. Normal path. Run the forward function, build a Node with the right grad_fn, parents, op_name, and (when anomaly detection is on) creation_trace, then tape.record(node) if a tape is in scope.
  4. In-place grad accumulation. This step runs in Tape.backward rather than in apply_op: the backward() loop calls each recorded grad_fn in reverse topological order and accumulates into the corresponding parent Node.grad (and value.grad for Optimizer compatibility). Multiple uses of a parent are summed in-place via an OpenCL ADD kernel.
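The three dispatch branches can be sketched as follows. toy_apply_op, ToyNode, and Flags are hypothetical names for this illustration; the sketch uses a plain list as the tape and takes the default op name from fn.__name__ rather than from the caller frame, which is a simplification of the behavior described above.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Optional


@dataclass
class ToyNode:
    value: Any
    grad_fn: Optional[Callable] = None
    parents: List["ToyNode"] = field(default_factory=list)
    op_name: Optional[str] = None


class Flags:
    """Hypothetical stand-ins for TracingContext.active and the grad mode."""
    tracing = False
    grad_enabled = True


def toy_apply_op(fn, grad_fn, *args, tape=None, op_name=None):
    name = op_name or getattr(fn, "__name__", "op")
    # 1. Tracing bypass: return a symbolic placeholder, record nothing.
    if Flags.tracing:
        return ToyNode(value=("trace", name), parents=list(args), op_name=name)
    out = fn(*[a.value for a in args])
    # 2. Grad off: hand back the raw forward result.
    if not Flags.grad_enabled:
        return out
    # 3. Normal path: build the node and record it if a tape is in scope.
    node = ToyNode(out, grad_fn, list(args), name)
    if tape is not None:
        tape.append(node)
    return node


def add(a, b):
    return a + b


tape = []
lhs, rhs = ToyNode(1.0), ToyNode(2.0)
recorded = toy_apply_op(add, lambda g: [g, g], lhs, rhs, tape=tape)
```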

no_grad, set_grad_enabled, is_grad_enabled

These three form the standard PyTorch-style grad-mode switch.

with ag.no_grad():
    y = model(x)        # forward still runs, but no Nodes are recorded

| Symbol | Purpose |
| --- | --- |
| set_grad_enabled(mode: bool) | Process-wide toggle; affects every subsequent apply_op call. |
| is_grad_enabled() | Returns the current process-wide grad mode (True by default). |
| no_grad | Context manager that saves the prior mode, sets it to False, and restores on exit. |

The toggle is the recommended way to run inference or any inner loop that should not pollute the autograd graph. Recording can also be suppressed for a single tape, since record() is a no-op while tape.enabled is False, but the global flag is what every high-level helper (e.g. model.eval() in the nn API) ultimately respects.
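The save / set / restore behavior of no_grad can be sketched with contextlib. The toy_* names and the dictionary holding the flag are assumptions made for this example, not netcl internals.

```python
from contextlib import contextmanager

_mode = {"grad_enabled": True}  # hypothetical process-wide flag


def toy_is_grad_enabled():
    return _mode["grad_enabled"]


def toy_set_grad_enabled(mode):
    _mode["grad_enabled"] = bool(mode)


@contextmanager
def toy_no_grad():
    previous = toy_is_grad_enabled()  # save the prior mode
    toy_set_grad_enabled(False)
    try:
        yield
    finally:
        toy_set_grad_enabled(previous)  # restore it even if the body raises


with toy_no_grad():
    inside = toy_is_grad_enabled()
    with toy_no_grad():
        pass
    after_nested = toy_is_grad_enabled()  # the inner exit restores False, not True
after = toy_is_grad_enabled()
```

Restoring the saved value instead of unconditionally setting True is what makes nested blocks behave correctly.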

set_detect_anomaly, detect_anomaly

A slow but high-signal mode that checks every gradient produced during backward for NaN and Inf values and records where each op was created. Useful when you are writing a new op or when an existing one starts producing NaNs.

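with ag.detect_anomaly():
    with ag.Tape() as tape:
        loss = ag.cross_entropy(model(x), y)   # creation_trace is captured when each op is registered
    tape.backward(loss)                        # NaN / Inf checks run here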

Internally, Tape.backward calls grad.to_host() for every parent and checks np.isnan / np.isinf. On hit, it raises a RuntimeError containing the creation_trace of the offending Node, so you can jump straight to the Python frame that registered the bad op.

| Symbol | Purpose |
| --- | --- |
| set_detect_anomaly(enable: bool) | Module-level switch. The corresponding context manager does the same. |
| detect_anomaly(enable=True) | Context manager; saves and restores the prior setting on __exit__. |

Anomaly detection also installs a traceback.format_stack() capture into Node.creation_trace, so even successful (non-NaN) backward passes leave breadcrumbs for later inspection.
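The check-and-report logic can be sketched in plain Python. make_node and check_grads are hypothetical helpers; the sketch uses lists of floats and math.isnan / math.isinf where the text above describes grad.to_host() with np.isnan / np.isinf.

```python
import math
import traceback


def make_node(op_name):
    # Capture the Python stack at creation time, as creation_trace does.
    return {"op_name": op_name, "creation_trace": traceback.format_stack()}


def check_grads(node, grads):
    """Raise if any gradient returned for a parent contains NaN or Inf."""
    for index, grad in enumerate(grads):
        if any(math.isnan(v) or math.isinf(v) for v in grad):
            raise RuntimeError(
                f"non-finite gradient for parent {index} of op "
                f"'{node['op_name']}'; op was created at:\n"
                + "".join(node["creation_trace"])
            )


node = make_node("log")
check_grads(node, [[0.5, 1.0]])  # finite gradients pass silently

try:
    check_grads(node, [[1.0], [float("nan")]])
    message = None
except RuntimeError as exc:
    message = str(exc)
```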

debug_tape

debug_tape is a thin context manager from autograd/debug.py that yields the active Tape so you can inspect tape.nodes and tape.enabled while stepping through a forward pass with a debugger.

from netcl.autograd.debug import debug_tape   # long-form import (see the re-export note above)

with debug_tape(tape) as t:
    # t is the same object as `tape`
    pred = model(x)
    print(len(t.nodes))   # how many ops have been recorded so far
    loss = ag.cross_entropy(pred, y)
    print(len(t.nodes))   # one more

For most debugging scenarios the tape.nodes list combined with node.creation_trace is the fastest path to the source of a gradient issue. If you find yourself wanting richer graph visualizations, see JIT Compiler for a tour of TraceNode inspection.

CompiledGraph

CompiledGraph, in autograd/graph.py, is the capture-and-replay primitive that lets the JIT Compiler amortize Python overhead across iterations.

from netcl.autograd.graph import CompiledGraph

g = CompiledGraph(params=model.parameters())
with g:
    loss = train_step(model, batch)
    loss.backward()

g.compile([x, y])          # one-time setup of static/dynamic arg slots
g.replay([x2, y2])         # zero-overhead replay

| Method | Purpose |
| --- | --- |
| __enter__ / __exit__ | On the first entry, begin capture of every kernel launch. On later entries, restore cached grads. |
| compile(inputs) | Resolve static vs. dynamic kernel arguments and pre-bind them for fast replay. |
| replay(new_inputs) | Zero the gradient buffers and re-enqueue the captured kernel sequence with new input pointers. |

The companion primitives in runtime/capture.py (get_capture()) are what CompiledGraph calls into under the hood. The replay path is what makes a steady-state training loop drop into a tight C-level cl.enqueue_nd_range_kernel loop with no Python frame per launch.
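The capture-and-replay idea can be sketched without OpenCL: record the launch sequence once, then re-run it against new inputs. ToyCapture is a hypothetical class in which Python callables stand in for kernels and a dictionary of named slots stands in for device buffers.

```python
class ToyCapture:
    """Hypothetical recorder: a list of (kernel, output slot, input slots)."""

    def __init__(self):
        self.launches = []

    def record(self, kernel, out, *ins):
        self.launches.append((kernel, out, ins))

    def replay(self, inputs):
        # Dynamic arguments come from `inputs`; every intermediate is
        # recomputed by re-running the captured sequence in order.
        buffers = dict(inputs)
        for kernel, out, ins in self.launches:
            buffers[out] = kernel(*[buffers[name] for name in ins])
        return buffers


capture = ToyCapture()
capture.record(lambda x, w: x * w, "h", "x", "w")   # h = x * w
capture.record(lambda h: max(h, 0.0), "y", "h")     # y = relu(h)

first = capture.replay({"x": 2.0, "w": 3.0})["y"]
second = capture.replay({"x": -1.0, "w": 3.0})["y"]
```

Only the input slots change between replays; the launch list itself is built once, which is where the saving in per-step Python work comes from.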

jit_compile

jit_compile is the decorator at the heart of the JIT Compiler. It traces a Python function containing a chain of fusible ops, generates a single OpenCL forward kernel and a single backward kernel, caches them by (fn, shapes, kwargs), and re-uses the cached kernels on every subsequent call.

from netcl.autograd.compiler import jit_compile
import netcl.autograd as ag

@jit_compile
def fused(x):
    y = ag.relu(x)
    z = ag.gelu(y) + 0.5
    return ag.sigmoid(z)

jit_compile works in six steps. Steps 1–5 run only on the first call for a given cache key; step 6 describes every later call:

  1. Tracing pass. A TracingContext.active flag is set, a parallel set of dummy Nodes is built whose values are TraceNodes (placeholder op nodes), and fused is run on them. The result is a symbolic DAG of operations.
  2. Topological sort. A DFS over the TraceNode DAG produces a linear order from input leaves to the single output root.
  3. Forward kernel generation. For each node in order the registered AutogradPrimitive.forward emits a single C statement (float node_N = EXPR(parent0, parent1, …);). The whole body is wrapped in a __kernel void fused_forward(…) function.
  4. Backward kernel generation. Starting from the output, the compiler walks order in reverse and, for each op, asks the registered AutogradPrimitive.backward for the gradient w.r.t. each parent. Symbolically accumulated gradients are folded with + so an op with multiple consumers gets a single g_node = g_consumer1 + g_consumer2 line. The forward pass is recomputed inside the backward kernel so no intermediates are stored to global memory.
  5. Compile & cache. Both kernels go through runtime.cache.cacher to get a cl.Program. The kernel objects are stashed in a per-process _COMPILER_CACHE keyed on (fn, sig_shapes, kwargs).
  6. Subsequent calls. The decorator skips steps 1–5 and goes straight to kernel_fw(...) and registers a grad_fn closure that calls kernel_bw(...).

If the function's output is not a TraceNode (e.g. it returns a plain Python float, or it calls an op that is not registered as fusible), the decorator silently falls back to the un-fused implementation.
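Steps 2 to 4 can be sketched as a small string-emitting code generator. ToyTrace, the FORWARD / BACKWARD tables, and emit are hypothetical names for this example; the sketch emits only the statement bodies, not the surrounding __kernel wrapper, buffer loads, or the forward recomputation inside the backward kernel.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(eq=False)
class ToyTrace:
    op: str
    parents: List["ToyTrace"] = field(default_factory=list)


# Hypothetical primitive tables: op name -> C expression builders.
FORWARD = {
    "relu": lambda a: f"fmax({a[0]}, 0.0f)",
    "mul": lambda a: f"{a[0]} * {a[1]}",
}
BACKWARD = {
    "relu": lambda a, g: [f"({a[0]} > 0.0f ? {g} : 0.0f)"],
    "mul": lambda a, g: [f"{g} * {a[1]}", f"{g} * {a[0]}"],
}


def topo(root):
    # DFS post-order: inputs first, the output root last.
    order, seen = [], set()

    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent in node.parents:
            visit(parent)
        order.append(node)

    visit(root)
    return order


def emit(root):
    order = topo(root)
    name = {id(n): f"node_{i}" for i, n in enumerate(order)}
    fwd = []
    for n in order:
        if n.op != "input":
            args = [name[id(p)] for p in n.parents]
            fwd.append(f"float {name[id(n)]} = {FORWARD[n.op](args)};")
    bwd, pending = [], {id(root): ["g_out"]}
    for n in reversed(order):
        g = f"g_{name[id(n)]}"
        # Contributions from all consumers are folded with '+'.
        bwd.append(f"float {g} = {' + '.join(pending[id(n)])};")
        if n.op != "input":
            args = [name[id(p)] for p in n.parents]
            for p, expr in zip(n.parents, BACKWARD[n.op](args, g)):
                pending.setdefault(id(p), []).append(expr)
    return fwd, bwd


x = ToyTrace("input")
y = ToyTrace("mul", [ToyTrace("relu", [x]), x])  # x has two consumers
fwd, bwd = emit(y)
```

Here the input node_0 feeds both the relu and the mul, so its gradient line in bwd is a sum of two terms.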

CPU Backend

When all inputs are CPU tensors (see Tensor Backend), jit_compile takes a parallel path: it generates two plain Python functions fused_forward and fused_backward by exec()-ing NumPy expressions, then caches them. There is no OpenCL kernel in this branch; the speedup comes from removing the per-op Python and autograd dispatch overhead, not from kernel fusion.
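The exec()-based generation can be sketched as follows. build_cpu_forward is a hypothetical helper, and the generated statements use scalar math functions instead of NumPy so the example has no dependencies; the mechanism of assembling source text and exec()-ing it once is the point being shown.

```python
import math


def build_cpu_forward(statements, inputs, output):
    """Assemble Python source for a fused function and compile it once."""
    lines = [f"def fused_forward({', '.join(inputs)}):"]
    lines += [f"    {statement}" for statement in statements]
    lines.append(f"    return {output}")
    namespace = {"math": math}  # names the generated code may reference
    exec("\n".join(lines), namespace)
    return namespace["fused_forward"]


fused_forward = build_cpu_forward(
    statements=[
        "node_1 = max(x, 0.0)",                       # relu
        "node_2 = 1.0 / (1.0 + math.exp(-node_1))",   # sigmoid
    ],
    inputs=["x"],
    output="node_2",
)

at_zero = fused_forward(0.0)
at_negative = fused_forward(-3.0)  # relu clamps to 0 before the sigmoid
```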

Registering Custom Primitives

from netcl.autograd.compiler import register_primitive

def fwd(args, attrs):
    return f"{args[0]} * {args[1]}"

def bwd(args, grad_var, attrs, out_var):
    return [f"{grad_var} * {args[1]}", f"{grad_var} * {args[0]}"]

register_primitive("mul", fwd, bwd, arity=2, fusible=True)

The compiler automatically looks up a primitive by the op_name string passed to apply_op. See Writing a Custom OpenCL Kernel for a worked example that combines register_primitive with jit_compile.

AutogradPrimitive and register_primitive

An AutogradPrimitive is a frozen dataclass that describes a single fusible op:

@dataclass(frozen=True)
class AutogradPrimitive:
    name: str
    forward: Callable[[List[str], dict], str]
    backward: Callable[[List[str], str, dict, str], List[str]]
    arity: int | None = None
    fusible: bool = True

  • forward(parents, attrs) -> str returns a single C expression of the output in terms of the input variable names (e.g. "fmax(v0, v1)").
  • backward(parents, grad_var, attrs, out_var) -> List[str] returns one C expression per parent, expressing that parent's gradient. grad_var is the C name of the upstream gradient ("g_node_42" in the generated kernel) and out_var is the local C name of the op's output, in case the backward needs it.
  • arity is used for the scalar-broadcast and ternary variants; the compiler can recognize an arity-2 op with a scalar attribute and emit a scalar-fused kernel.
  • fusible=False excludes the primitive from JIT fusion even if it is registered. Use this for ops whose forward or backward is genuinely unfusible (e.g. it needs an internal reduction).

register_primitive writes the entry into the module-level _PRIMITIVES dictionary. The default set is registered at import time and covers every op in autograd/ops.py plus the weighted_bce and weighted_smooth_l1 detection losses.
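A registry of this shape can be sketched with a frozen dataclass and a dictionary. ToyPrimitive, _TOY_PRIMITIVES, and toy_register_primitive are hypothetical mirrors of the names above, and the sigmoid primitive is an illustrative choice that shows why backward receives out_var.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class ToyPrimitive:
    name: str
    forward: Callable[[List[str], dict], str]
    backward: Callable[[List[str], str, dict, str], List[str]]
    arity: Optional[int] = None
    fusible: bool = True


_TOY_PRIMITIVES: Dict[str, ToyPrimitive] = {}


def toy_register_primitive(name, forward, backward, arity=None, fusible=True):
    _TOY_PRIMITIVES[name] = ToyPrimitive(name, forward, backward, arity, fusible)


def sigmoid_fwd(args, attrs):
    return f"1.0f / (1.0f + exp(-{args[0]}))"


def sigmoid_bwd(args, grad_var, attrs, out_var):
    # d(sigmoid)/dx = s * (1 - s): reuse the forward output instead of
    # recomputing the exponential.
    return [f"{grad_var} * {out_var} * (1.0f - {out_var})"]


toy_register_primitive("sigmoid", sigmoid_fwd, sigmoid_bwd, arity=1)

prim = _TOY_PRIMITIVES["sigmoid"]
fwd_expr = prim.forward(["v0"], {})
bwd_exprs = prim.backward(["v0"], "g_node_1", {}, "node_1")
```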

TraceNode and TracingContext

TraceNode is the symbolic placeholder that the JIT Compiler builds the fused kernel source from. A TraceNode carries the op_name, the list of input TraceNodes, the output shape, the dtype, and an attrs dict (typically the attrs= keyword forwarded by apply_op).

TracingContext is a thread-local active flag. When active is True, apply_op takes the tracing branch and returns a Node whose value is a TraceNode. When False, apply_op runs the real forward and records onto the active Tape. The compiler is the only caller that flips active.

The two together implement a dynamic dataflow graph: the compiler traces, the forward kernels compute, the backward kernels consume, and there is no global mutable state besides the _COMPILER_CACHE.

TrainingGraphCompiler, TrainingPattern, fused_weighted_bce_smooth_l1_loss, get_training_compiler

The elementwise JIT Compiler works well for linear chains of pointwise ops, but the training loop in object-detection heads routinely combines a pointwise weighted BCE with a pointwise weighted Smooth-L1 and then reduces over the entire image. That reduction, a sum over tens of millions of elements, is a non-trivial op in its own right. autograd/training_compiler.py provides the dedicated fused kernel for that pattern.

| Symbol | Purpose |
| --- | --- |
| TrainingGraphCompiler | A small registry of stable TrainingPatterns. New patterns can be added via register(); the built-in entry is weighted_bce_smooth_l1_sum. |
| TrainingPattern | A (name, matcher, planner) triple. The matcher decides whether a forward call is eligible; the planner builds the cached FusedDetectionLossPlan. |
| fused_weighted_bce_smooth_l1_loss(pred, target, heat_weight, reg_weight, heat_scale, reg_scale, beta=0.05) | Returns a Node whose forward is the partial+reduce kernel and whose backward is a single OpenCL kernel that produces the gradient w.r.t. pred only. target and the two weight tensors are treated as non-trainable inputs. |
| get_training_compiler() | Returns the process-wide singleton of TrainingGraphCompiler. Call this from custom training loops if you want to add your own pattern. |

The full set of autograd-supported fused detection losses is summarized in the table above; refer to Architecture: Autograd & Tape for the dataflow diagram.
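The matcher / planner split can be sketched as a small registry with a plan cache. ToyPattern, ToyTrainingCompiler, plan_for, and the shape of the call dictionary are all assumptions made for this illustration; the real register() signature and FusedDetectionLossPlan are not shown on this page.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class ToyPattern:
    name: str
    matcher: Callable[[dict], bool]   # is this call eligible?
    planner: Callable[[dict], str]    # build the plan (a string here)


class ToyTrainingCompiler:
    def __init__(self):
        self._patterns: Dict[str, ToyPattern] = {}
        self._plans: Dict[tuple, str] = {}

    def register(self, pattern):
        self._patterns[pattern.name] = pattern

    def plan_for(self, call):
        for pattern in self._patterns.values():
            if pattern.matcher(call):
                key = (pattern.name, call["shape"])
                if key not in self._plans:  # plan once per pattern and shape
                    self._plans[key] = pattern.planner(call)
                return self._plans[key]
        return None  # no pattern matched: the caller uses the unfused path


compiler = ToyTrainingCompiler()
compiler.register(ToyPattern(
    name="weighted_bce_smooth_l1_sum",
    matcher=lambda call: (
        call["losses"] == ("weighted_bce", "weighted_smooth_l1")
        and call["reduction"] == "sum"
    ),
    planner=lambda call: f"fused plan for shape {call['shape']}",
))

call = {"losses": ("weighted_bce", "weighted_smooth_l1"),
        "reduction": "sum", "shape": (2, 4)}
plan = compiler.plan_for(call)
miss = compiler.plan_for({"losses": ("mse",), "reduction": "mean",
                          "shape": (2, 4)})
```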

Putting It Together: Training One Step

import netcl.autograd as ag
import netcl.amp as amp
from netcl.optim import Adam

opt = Adam(model.parameters(), lr=1e-3)
scaler = amp.GradScaler()

for x, y in loader:
    with ag.Tape() as tape:
        with amp.autocast(enabled=True):
            pred = model(x)
            loss = ag.cross_entropy(pred, y)
        loss = scaler.scale_loss(loss)

    tape.backward(loss)
    scaler.step(opt, model.parameters())
    scaler.update()
    opt.zero_grad()

This is the canonical netcl training step. It composes the Tape for gradient recording, autocast for half-precision forward, and GradScaler for the loss scaling. See MNIST with MLP for a complete worked example.

See also