Tutorial: Understanding Autograd
netcl.autograd is a tape-based reverse-mode automatic differentiation system. In
this tutorial we lift the hood: we trace what a Tape actually records, we
write a small custom op end-to-end (forward + backward closure) and register it with the
JIT Compiler, and we touch every knob you are likely to reach
for while debugging a gradient — no_grad, set_grad_enabled,
detect_anomaly, debug_tape.
By the end of this page you should be able to read a Node graph in your head and predict exactly which kernels the backward pass will launch.
Prerequisites
- Quickstart — you should be able to construct a Tensor and run a single op.
- Tensor — what Tensor.queue, Tensor.buffer, and Tensor.grad are.
- Tensor Backend — the role of the OpenCL command queue and why Tape flushes it between steps.
You do not need to have read the JIT Compiler page first; we explain the bits we touch.
What You'll Build
A custom op my_op(a, b) = sin(a) * b, written in three layers of decreasing
abstraction:
- A pure-Python forward that composes two built-in elementwise ops.
- A pure-Python grad_fn that uses the chain rule to produce the gradient w.r.t. each parent.
- An apply_op wrapper that registers the pair on the active Tape.
We then trace the resulting graph with debug_tape, register the op with the JIT Compiler via register_primitive, and confirm the fused kernel is the one that runs.
Step-by-Step
1. The Big Picture: What Reverse-Mode Autodiff Does
Three facts cover ~90% of what is going on under the hood:
- Every Tensor carries a requires_grad flag. The flag is set when you wrap a tensor with ag.tensor(x, requires_grad=True) or when a parent in a recorded computation already has it.
- Every op that goes through autograd (which is every op that goes through ag.<name>(...) or an operator overload on a Node) registers a grad_fn on the output Node.
- When you call tape.backward(loss), the Tape does a reverse topological sort over the recorded graph, calls grad_fn(grad_out) for each op, and accumulates the resulting per-parent gradients into parent.grad (and the wrapped parent.value.grad for Optimizer compatibility).
That is the entire mechanism. The rest of this tutorial is "let's look at each piece".
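The three facts above can be sketched in a few lines of pure Python. This is a toy scalar model and not netcl's implementation: ToyNode, record, toy_sin, toy_mul, toy_backward, and the TAPE list are hypothetical names invented for illustration, and the values are plain floats rather than device Tensors.

```python
import math

class ToyNode:
    """Hypothetical stand-in for a Node: a value plus bookkeeping for backward."""
    def __init__(self, value, requires_grad=False, parents=(), grad_fn=None):
        self.value = value
        self.requires_grad = requires_grad
        self.parents = parents
        self.grad_fn = grad_fn  # maps the upstream gradient to one gradient per parent
        self.grad = 0.0

TAPE = []  # toy tape: op outputs in the order the ops ran

def record(value, parents, grad_fn):
    # Fact 1: requires_grad propagates from any parent to the output.
    needs_grad = any(p.requires_grad for p in parents)
    # Fact 2: every recorded op attaches a grad_fn to its output.
    out = ToyNode(value, needs_grad, parents, grad_fn if needs_grad else None)
    if needs_grad:
        TAPE.append(out)
    return out

def toy_sin(a):
    return record(math.sin(a.value), (a,), lambda g: [g * math.cos(a.value)])

def toy_mul(a, b):
    return record(a.value * b.value, (a, b), lambda g: [g * b.value, g * a.value])

def toy_backward(loss):
    # Fact 3: seed the loss with 1, then walk the recorded ops in reverse.
    loss.grad = 1.0
    for node in reversed(TAPE):
        for parent, g in zip(node.parents, node.grad_fn(node.grad)):
            if parent.requires_grad:
                parent.grad += g

a = ToyNode(0.5, requires_grad=True)
b = ToyNode(3.0, requires_grad=True)
loss = toy_mul(toy_sin(a), b)
toy_backward(loss)
print(a.grad, b.grad)  # 3*cos(0.5) and sin(0.5)
```

Walking the recording order backwards is a valid reverse topological order here because an op can only consume outputs that were recorded before it.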
2. Wrapping a Tensor in a Node
A Node is a thin container around a Tensor that also holds
the grad_fn, the parent Nodes, the accumulated gradient, and the
op_name. You almost never construct one yourself; you get one back from every op.
import netcl.autograd as ag
from netcl.core.tensor import Tensor
from netcl.core.device import manager
q = manager.default("auto").queue
x_t = Tensor.from_host(q, [[1.0, 2.0], [3.0, 4.0]])
x = ag.tensor(x_t, requires_grad=True) # wrap the tensor in a Node
print(x.value.to_host()) # same data as x_t
print(x.requires_grad, x.grad_fn, x.op_name)
The value attribute is the underlying Tensor (the same object that lives
on the device); grad_fn and op_name are None for a leaf node. Once you run an op
on x, the resulting Node will have grad_fn set and op_name will be
the name of the op (e.g. "relu", "add", or whatever you pass to apply_op).
3. Using Built-In Ops and Operator Overloading
autograd re-exports every mathematical op as a function that takes Nodes and returns Nodes:
y = ag.relu(x) # y = max(0, x)
z = (y * 2.0) + 1.0 # operator overloading: ag.mul / ag.add under the hood
loss = ag.sum(z) # scalar
The arithmetic operators on Node are defined in
autograd/engine.py — __add__, __radd__, __sub__, __rsub__, __mul__,
__rmul__, __truediv__, __rtruediv__, __neg__, __pow__, plus the comparison
operators __lt__, __le__, __gt__, __ge__. Every one of them dispatches to the
corresponding ag.<op>(self, other) so the resulting Node is recorded
on the active Tape.
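The dispatch pattern described above can be illustrated with a small pure-Python sketch. ToyNode, toy_add, toy_mul, and _as_node are hypothetical names, and the sketch leaves out tape recording entirely; it only shows how each operator method forwards to a named function, including the reflected variants that handle a Python scalar on the left.

```python
class ToyNode:
    """Hypothetical stand-in for a Node wrapping a plain float."""
    def __init__(self, value, op_name=None):
        self.value = value
        self.op_name = op_name

    # Each dunder forwards to a named function, so an op is created in
    # exactly one place regardless of which syntax the caller used.
    def __add__(self, other):
        return toy_add(self, other)

    def __radd__(self, other):  # handles `1.0 + node`
        return toy_add(other, self)

    def __mul__(self, other):
        return toy_mul(self, other)

    def __rmul__(self, other):  # handles `2.0 * node`
        return toy_mul(other, self)

def _as_node(x):
    # Promote Python scalars so that both operands are nodes.
    return x if isinstance(x, ToyNode) else ToyNode(float(x))

def toy_add(a, b):
    a, b = _as_node(a), _as_node(b)
    return ToyNode(a.value + b.value, op_name="add")

def toy_mul(a, b):
    a, b = _as_node(a), _as_node(b)
    return ToyNode(a.value * b.value, op_name="mul")

y = ToyNode(3.0)
z = (y * 2.0) + 1.0   # __mul__ then __add__
w = 1.0 + 2.0 * y     # __rmul__ then __radd__
print(z.value, z.op_name)
print(w.value, w.op_name)
```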
4. Calling backward and Reading Gradients
with ag.Tape() as tape:
    y = ag.relu(x)
    z = (y * 2.0) + 1.0
    loss = ag.sum(z)
tape.backward(loss)
print(x.value.grad.to_host())  # d(loss)/dx = 2 where x > 0, else 0
A few things to notice:
- The forward pass runs normally — the only thing the Tape does on the way in is append a Node to tape.nodes.
- tape.backward(loss) does a depth-first topological sort starting at loss, then walks the resulting list in reverse and calls node.grad_fn(node.grad).
- The gradient is accumulated into parent.grad (and into parent.value.grad so Optimizers can read it). If a parent is used more than once, its per-consumer gradients are summed in place via an OpenCL ADD kernel.
- The loss's grad is seeded with ones_like(loss.value) when no grad= argument is given to backward.
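The sort, the seeding, and the accumulation for a multi-use parent can be sketched on a diamond-shaped graph. This is an illustration under assumptions, not the Tape's actual code: ToyNode, topo_order, and toy_backward are hypothetical, gradients are floats, and the in-place ADD kernel is modelled by ordinary addition.

```python
class ToyNode:
    """Hypothetical node: parents plus a grad_fn, no device data."""
    def __init__(self, name, parents=(), grad_fn=None):
        self.name = name
        self.parents = parents
        self.grad_fn = grad_fn
        self.grad = None  # no gradient until backward reaches this node

def topo_order(root):
    """Depth-first post-order: every node appears after all of its parents."""
    order, seen = [], set()

    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent in node.parents:
            visit(parent)
        order.append(node)

    visit(root)
    return order

def toy_backward(loss, grad=None):
    # Seed with 1.0 when no explicit gradient is given (the scalar
    # counterpart of ones_like(loss.value)).
    loss.grad = 1.0 if grad is None else grad
    for node in reversed(topo_order(loss)):
        if node.grad_fn is None:  # leaf: nothing to propagate
            continue
        for parent, g in zip(node.parents, node.grad_fn(node.grad)):
            # The first contribution initialises, later ones are summed.
            parent.grad = g if parent.grad is None else parent.grad + g

# Diamond graph: x feeds two ops whose outputs are added together.
x = ToyNode("x")
u = ToyNode("u = 2*x", (x,), lambda g: [2.0 * g])
v = ToyNode("v = 3*x", (x,), lambda g: [3.0 * g])
loss = ToyNode("loss = u + v", (u, v), lambda g: [g, g])

toy_backward(loss)
print([n.name for n in topo_order(loss)])
print(x.grad)  # 2 + 3 = 5.0, one contribution per consumer of x
```

Because the walk is in reverse topological order, both consumers of x have delivered their contribution before x itself is visited.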
5. Writing a Custom Op
This is the core of the tutorial. We are going to write out = sin(a) * b, then derive
its backward by hand.
Forward. out = sin(a) * b, both elementwise. We can express it as a composition of
two stock ops: elementwise_unary for sin(a) and
elementwise_binary for the product. The kernel expressions are taken from the
OpenCL primitive set.
Backward. By the product rule and the chain rule:
d/da (sin(a) * b) = cos(a) * b
d/db (sin(a) * b) = sin(a) * 1
So we need cos(a) (for the first gradient) and sin(a) (for the second). The
forward call already computed sin(a) — we need to keep that Tensor
around so the backward can reuse it instead of recomputing.
from netcl.autograd.engine import apply_op
from netcl.ops.elementwise import elementwise_binary
from netcl.ops.elementwise_optimized import elementwise_unary
def my_op(a, b, tape=None):
    """out = sin(a) * b — both elementwise."""
    def forward(x, y):
        sin_x = elementwise_unary(x, expression="sin(v0)")
        return elementwise_binary(sin_x, y, expression="MUL(v0, v1)")

    def grad_fn(grad_out):
        # d/da (sin(a)*b) = cos(a) * b
        cos_a = elementwise_unary(a.value, expression="cos(v0)")
        ga = elementwise_binary(grad_out, b.value, expression="MUL(v0, v1)")
        ga = elementwise_binary(ga, cos_a, expression="MUL(v0, v1)")
        # d/db (sin(a)*b) = sin(a); recomputed here, see footnote
        sin_a = elementwise_unary(a.value, expression="sin(v0)")
        gb = elementwise_binary(grad_out, sin_a, expression="MUL(v0, v1)")
        return [ga, gb]

    return apply_op(forward, grad_fn, a, b, tape=tape, op_name="my_op")
Footnote — the sin_x_or_a typo. The German original used a.sin_x_or_a as the second input to the gb multiplication. That name does not exist in scope. The intended value is the sin_x that the forward closure computed and consumed. There are two ways to get it: recompute it inside grad_fn (one extra kernel launch), or stash the Tensor where grad_fn can reach it so the backward pass reuses it. The listing above takes the first route: it recomputes sin_a inside grad_fn and uses it for the gb gradient.
Two extra notes:
- apply_op's real signature (per autograd/engine.py) is apply_op(fn, grad_fn, *args, tape=None, op_name=None, attrs=None). The grad_fn receives a single argument — the upstream gradient — and must return one Tensor per parent in the order they were passed to apply_op.
- The custom op above runs as several separate kernels: sin, cos, and a handful of multiplies. Section 6 shows how to register it with the JIT Compiler so that the forward and backward chains each become a single fused kernel.
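The "compute sin(a) once and reuse it" idea and the one-gradient-per-parent contract can both be checked in a scalar model. The sketch below is pure Python with floats; toy_my_op and central_difference are hypothetical helpers and do not call netcl. It captures sin(a) in the closure during the forward pass, then compares the analytical gradients against central differences.

```python
import math

def toy_my_op(a, b):
    """Scalar model of out = sin(a) * b; returns (out, grad_fn)."""
    sin_a = math.sin(a)  # computed once in the forward pass ...
    out = sin_a * b

    def grad_fn(grad_out):
        ga = grad_out * math.cos(a) * b  # d/da = cos(a) * b
        gb = grad_out * sin_a            # ... and reused here: d/db = sin(a)
        return [ga, gb]                  # one gradient per parent, in (a, b) order

    return out, grad_fn

def central_difference(f, x, eps=1e-6):
    """Numerical derivative of a scalar function f at x."""
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

a, b = 0.7, 2.5
out, grad_fn = toy_my_op(a, b)
ga, gb = grad_fn(1.0)

num_ga = central_difference(lambda t: toy_my_op(t, b)[0], a)
num_gb = central_difference(lambda t: toy_my_op(a, t)[0], b)
print(abs(ga - num_ga) < 1e-6, abs(gb - num_gb) < 1e-6)
```

A check like this is cheap to keep next to a new op as a unit test; on real Tensors the same comparison would be done elementwise.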
6. Fusing the Custom Op with the JIT Compiler
Registering a custom op with the JIT Compiler is what turns
it from a chain of separate kernels into a fused one. The recipe is to define an
AutogradPrimitive (a (forward, backward, arity, fusible) record) and
register it by name. From then on, any apply_op(..., op_name="my_op") call inside a
@jit_compiled function is recognized and fused.
from netcl.autograd.compiler import register_primitive
def my_op_fwd(args, attrs):
    # args == ["v0", "v1"] for (a, b); return a single C expression for the output.
    a, b = args
    return f"({b} * sin({a}))"

def my_op_bwd(args, grad_var, attrs, out_var):
    a, b = args
    g_a = f"({grad_var} * {b} * cos({a}))"  # d/da = cos(a) * b
    g_b = f"({grad_var} * sin({a}))"        # d/db = sin(a)
    return [g_a, g_b]

register_primitive("my_op", my_op_fwd, my_op_bwd, arity=2, fusible=True)
The two callable arguments are tiny code generators:
- forward(args, attrs) -> str — given the C names of the input variables ("v0", "v1", …) and a dict of scalar attributes, return a single C expression for the output (e.g. "v1 * sin(v0)").
- backward(args, grad_var, attrs, out_var) -> List[str] — given the inputs, the C name of the upstream gradient, and the local C name of the output, return one C expression per parent, in the same order the parents were passed to apply_op.
Once registered, you can write a jit_compiled function that uses my_op and the
JIT Compiler will fuse the entire forward chain — including
the custom op — into a single forward kernel and a single backward kernel. See
Writing a Custom OpenCL Kernel for the full worked example.
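Because the two generators are ordinary Python functions that return strings, they can be exercised without a device or a compiler. The snippet below repeats the two functions from this section and calls them directly; the variable names "g" and "out" are placeholders chosen for illustration, since the names the JIT Compiler actually passes are not documented here.

```python
def my_op_fwd(args, attrs):
    a, b = args
    return f"({b} * sin({a}))"

def my_op_bwd(args, grad_var, attrs, out_var):
    a, b = args
    g_a = f"({grad_var} * {b} * cos({a}))"
    g_b = f"({grad_var} * sin({a}))"
    return [g_a, g_b]

# Placeholder variable names; a real compiler would pick its own.
args = ["v0", "v1"]
fwd_expr = my_op_fwd(args, attrs={})
bwd_exprs = my_op_bwd(args, grad_var="g", attrs={}, out_var="out")

print(fwd_expr)   # (v1 * sin(v0))
print(bwd_exprs)  # ['(g * v1 * cos(v0))', '(g * sin(v0))']
```

Asserting on these strings in a unit test catches the most common mistake, returning the gradients in the wrong parent order, before any kernel is compiled.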
7. Grad-Mode Switches: no_grad, set_grad_enabled
The three grad-mode knobs are:
- ag.no_grad() — context manager. Saves the current grad mode, sets it to False, restores on __exit__. Use it in inference and validation loops so the Tape does not pollute the graph.
- ag.set_grad_enabled(False) — module-level switch. Same effect, but global rather than scoped.
- ag.is_grad_enabled() — query the current process-wide grad mode.
with ag.no_grad():
    out = model(x_eval)  # forward runs, but no Nodes are recorded
no_grad is the recommended way to do inference. Inside the block, every
op that goes through apply_op short-circuits to fn(...) and returns
the raw Tensor — no grad_fn, no tape recording, no allocation of the
gradient buffer on the output. It is also the cheapest way to avoid building a backward
graph you do not need.
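The save, disable, restore behaviour and the short-circuit in apply_op can be modelled with a module-level flag and contextlib. This is a sketch under assumptions: the flag _grad_enabled, the toy_apply_op function, and the dict used as a stand-in for a recorded Node are all hypothetical and say nothing about how netcl stores its grad mode.

```python
from contextlib import contextmanager

_grad_enabled = True  # hypothetical module-level flag

def is_grad_enabled():
    return _grad_enabled

def set_grad_enabled(mode):
    global _grad_enabled
    _grad_enabled = bool(mode)

@contextmanager
def no_grad():
    previous = is_grad_enabled()  # save
    set_grad_enabled(False)       # disable
    try:
        yield
    finally:
        set_grad_enabled(previous)  # restore, even if the body raised

def toy_apply_op(fn, grad_fn, *args):
    """Hypothetical short-circuit: with grad mode off, return the raw result."""
    if not is_grad_enabled():
        return fn(*args)
    return {"value": fn(*args), "grad_fn": grad_fn}  # stand-in for a recorded Node

def double(x):
    return 2.0 * x

def double_grad(g):
    return [2.0 * g]

recorded = toy_apply_op(double, double_grad, 3.0)
with no_grad():
    raw = toy_apply_op(double, double_grad, 3.0)
    inside = is_grad_enabled()
after = is_grad_enabled()
print(recorded["value"], raw, inside, after)
```

Restoring the previous value rather than unconditionally setting True is what makes nested no_grad blocks behave correctly.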
8. Anomaly Detection: detect_anomaly
A slow but high-signal mode that inspects the gradients the backward pass produces
instead of trusting them. It is the right tool when you are writing a new op and loss
is producing NaNs, or when an existing op starts misbehaving after a refactor.
with ag.detect_anomaly():
    tape.backward(loss)
When detect_anomaly is on, the Tape checks every
parent's gradient for NaN / inf after each grad_fn call, and raises a
RuntimeError with the offending Node's creation_trace — the
traceback.format_stack() captured at registration time. The
set_detect_anomaly module function is the same flag at module level,
useful for one-line debugging from a breakpoint.
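The check described above amounts to a finiteness test after each grad_fn call, plus a stack trace captured when the op was created. A minimal pure-Python sketch, assuming gradients are flat lists of floats; ToyNode and check_gradients are hypothetical, and the error text follows the message quoted in the Troubleshooting section.

```python
import math
import traceback

class ToyNode:
    """Hypothetical node that remembers where it was created."""
    def __init__(self, op_name):
        self.op_name = op_name
        # Captured at registration time so the error can point at user code.
        self.creation_trace = traceback.format_stack()

def check_gradients(node, parent_grads):
    """Raise if any per-parent gradient (a flat list of floats) has NaN or Inf."""
    for index, grad in enumerate(parent_grads):
        if any(not math.isfinite(g) for g in grad):
            raise RuntimeError(
                f"Gradient w.r.t. parent {index} of op '{node.op_name}' "
                "contains NaN or Inf\n" + "".join(node.creation_trace)
            )

node = ToyNode("div")
check_gradients(node, [[1.0, 2.0], [0.5, 0.25]])  # all finite: passes silently

try:
    check_gradients(node, [[1.0, 2.0], [float("inf"), 0.0]])
except RuntimeError as err:
    message = str(err)
    print(message.splitlines()[0])
```

The per-element scan is why the mode is slow: on a device it implies reading every gradient back or launching an extra reduction after each op.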
9. Graph Inspection: debug_tape
debug_tape is a thin context manager that yields the active
Tape. Combined with tape.nodes and node.creation_trace, it lets you
introspect the graph from inside a forward pass — useful in a debugger or in unit tests
that assert "this op chain has exactly N nodes".
with ag.debug_tape(tape) as t:
    pred = model(x)
    print(len(t.nodes))  # number of recorded ops so far
    loss = ag.cross_entropy(pred, y)
    print(len(t.nodes))  # one more
You can also poke at the graph after the forward pass without the context manager:
with ag.Tape() as tape:
    pred = model(x)
    loss = ag.cross_entropy(pred, y)
for node in tape.nodes:
    print(node.op_name, node.requires_grad, len(node.parents))
Visualizing the Tape
When you want a richer graph visualization than print(tape.nodes), the
debug_tape helper is the entry point. For the fused view (the kernel
graph the JIT Compiler builds), see the
TraceNode section of the JIT Compiler page.
Troubleshooting
- Gradient is zero everywhere. A Node's requires_grad is True if any parent had requires_grad=True. If you wrap a Tensor with ag.tensor(x) (the default requires_grad=False), the resulting Node is a leaf and the gradient never propagates past it. Wrap with ag.tensor(x, requires_grad=True) (or use a Parameter — every Parameter sets the flag automatically).
- Gradient is None on a parameter after tape.backward. Most often the parameter was not on the path from the loss — i.e. it is not actually used by the model. Confirm with id(p.value) in {id(n.value) for n in tape.nodes}.
- detect_anomaly raises with Gradient w.r.t. parent N of op 'X' contains NaN or Inf. The creation_trace in the error message points at the Python frame that registered the bad op. The usual suspects are a divide-by-zero (div with no eps guard) or a log of a non-positive number (log after a relu is safe; log of raw activations is not).
- Custom op produces a RuntimeError: op 'my_op' is not registered with the JIT Compiler. You decorated the function with @jit_compile but did not call register_primitive for "my_op". Either add the registration, or set fusible=False to exclude the op from fusion (the decorator will silently fall back to the un-fused implementation).
- Backward is slow. The Tape does an in-place ADD for every multi-use parent; if your graph has hundreds of small ops the JIT Compiler is the right tool to fold them into one fused kernel. Wrap the body in @jit_compile and confirm the cache hit by looking for the cached log line.
See also
- MNIST with an MLP — uses the Tape in its plain form.
- Writing a Custom OpenCL Kernel — pairs this tutorial's apply_op recipe with a hand-written OpenCL kernel and a register_primitive registration.
- Data-Parallel Training — multiple Tapes running in subprocesses, with cross-process all_reduce.
- Autograd API — the full symbol reference for the Tape / Node / apply_op machinery.
- Architecture: Autograd — the reverse-topological-sort algorithm and the in-place ADD accumulation.
- Architecture: JIT Compiler — how the TraceNode graph is built and compiled to a single OpenCL kernel pair.
- OpenCL — the OpenCL primitive set that elementwise_unary and elementwise_binary draw from.