Tutorial: Understanding Autograd

netcl.autograd is a tape-based reverse-mode automatic differentiation system. In this tutorial we lift the hood: we trace what a Tape actually records, we write a small custom op end-to-end (forward + backward closure) and register it with the JIT Compiler, and we touch every knob you are likely to reach for while debugging a gradient — no_grad, set_grad_enabled, detect_anomaly, debug_tape.

By the end of this page you should be able to read a Node graph in your head and predict exactly which kernels the backward pass will launch.

Prerequisites

  • Quickstart — you should be able to construct a Tensor and run a single op.
  • Tensor — what Tensor.queue, Tensor.buffer, and Tensor.grad are.
  • Tensor Backend — the role of the OpenCL command queue and why Tape flushes it between steps.

You do not need to have read the JIT Compiler page first; we explain the bits we touch.

What You'll Build

A custom op my_op(a, b) = sin(a) * b (recorded under the op name "my_op"), written in three layers of decreasing abstraction:

  1. A pure-Python forward that composes two built-in elementwise ops.
  2. A pure-Python grad_fn that uses the chain rule to produce the gradient w.r.t. each parent.
  3. An apply_op wrapper that registers the pair on the active Tape.

We then trace the resulting graph with debug_tape, register the op with the JIT Compiler via register_primitive, and confirm the fused kernel is the one that runs.

Step-by-Step

1. The Big Picture: What Reverse-Mode Autodiff Does

Three facts cover ~90% of what is going on under the hood:

  • Every Tensor carries a requires_grad flag. The flag is set when you wrap a tensor with ag.tensor(x, requires_grad=True) or when a parent in a recorded computation already has it.
  • Every op that goes through autograd (which is every op that goes through ag.<name>(...) or an operator overload on a Node) registers a grad_fn on the output Node.
  • When you call tape.backward(loss), the Tape does a reverse topological sort over the recorded graph, calls grad_fn(grad_out) for each op, and accumulates the resulting per-parent gradients into parent.grad (and the wrapped parent.value.grad for Optimizer compatibility).

That is the entire mechanism. The rest of this tutorial is "let's look at each piece".
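The three facts above can be sketched in a few lines of plain Python. This is a toy model on scalar floats, not netcl's implementation: ToyNode, mul, sin, and backward are hypothetical names chosen for illustration, and the real Tape works on device Tensors rather than floats.

```python
import math

class ToyNode:
    """Hypothetical stand-in for a Node: a float value plus graph bookkeeping."""
    def __init__(self, value, parents=(), grad_fn=None, op_name=None):
        self.value = value
        self.parents = tuple(parents)
        self.grad_fn = grad_fn      # maps the upstream grad to one grad per parent
        self.op_name = op_name
        self.grad = 0.0

def mul(a, b):
    return ToyNode(a.value * b.value, (a, b),
                   lambda g: [g * b.value, g * a.value], "mul")

def sin(a):
    return ToyNode(math.sin(a.value), (a,),
                   lambda g: [g * math.cos(a.value)], "sin")

def backward(loss):
    # Depth-first post-order yields a topological order (parents first).
    order, seen = [], set()

    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent in node.parents:
            visit(parent)
        order.append(node)

    visit(loss)
    loss.grad = 1.0                      # seed, analogous to ones_like(loss.value)
    for node in reversed(order):         # consumers are processed before producers
        if node.grad_fn is None:
            continue                     # leaf node: nothing to propagate
        for parent, g in zip(node.parents, node.grad_fn(node.grad)):
            parent.grad += g             # accumulate per-parent gradients

a, b = ToyNode(0.5), ToyNode(3.0)
out = mul(sin(a), b)
backward(out)
```

After the call, a.grad holds cos(0.5) * 3.0 and b.grad holds sin(0.5), which is the same result the custom op later in this tutorial is expected to produce.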

2. Wrapping a Tensor in a Node

A Node is a thin container around a Tensor that also holds the grad_fn, the parent Nodes, the accumulated gradient, and the op_name. You almost never construct one yourself; you get one back from every op.

import netcl.autograd as ag
from netcl.core.tensor import Tensor
from netcl.core.device import manager

q = manager.default("auto").queue
x_t = Tensor.from_host(q, [[1.0, 2.0], [3.0, 4.0]])
x = ag.tensor(x_t, requires_grad=True)   # wrap the tensor in a Node
print(x.value.to_host())                 # same data as x_t
print(x.requires_grad, x.grad_fn, x.op_name)

The value attribute is the underlying Tensor (the same object that lives on the device); grad_fn and op_name are None for a leaf node. Once you run an op on x, the resulting Node will have grad_fn set and op_name will be the name of the op (e.g. "relu", "add", or whatever you pass to apply_op).

3. Using Built-In Ops and Operator Overloading

autograd re-exports every mathematical op as a function that takes Nodes and returns Nodes:

y = ag.relu(x)            # y = max(0, x)
z = (y * 2.0) + 1.0       # operator overloading: ag.mul / ag.add under the hood
loss = ag.sum(z)          # scalar

The arithmetic operators on Node are defined in autograd/engine.py: __add__, __radd__, __sub__, __rsub__, __mul__, __rmul__, __truediv__, __rtruediv__, __neg__, __pow__, plus the comparison operators __lt__, __le__, __gt__, __ge__. Every one of them dispatches to the corresponding ag.<op>(self, other) so the resulting Node is recorded on the active Tape.
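The dispatch pattern can be illustrated with a toy class. This is a minimal sketch under the assumption that each dunder method simply forwards to a module-level function; ToyNode, add, mul, and _value are hypothetical names, and nothing here is recorded on a real Tape.

```python
class ToyNode:
    """Hypothetical Node stand-in that forwards operators to functions."""
    def __init__(self, value, op_name=None):
        self.value = value
        self.op_name = op_name

    def __add__(self, other):
        return add(self, other)

    def __radd__(self, other):          # handles `1.0 + node`
        return add(other, self)

    def __mul__(self, other):
        return mul(self, other)

    def __rmul__(self, other):          # handles `2.0 * node`
        return mul(other, self)

def _value(x):
    # Accept either a ToyNode or a plain Python scalar.
    return x.value if isinstance(x, ToyNode) else float(x)

def add(a, b):
    return ToyNode(_value(a) + _value(b), "add")

def mul(a, b):
    return ToyNode(_value(a) * _value(b), "mul")

y = ToyNode(3.0)
z = (y * 2.0) + 1.0     # __mul__ then __add__
w = 2.0 * y             # falls through to __rmul__
```

The reflected variants (__radd__, __rmul__, and so on) are what make expressions with the scalar on the left-hand side work.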

4. Calling backward and Reading Gradients

with ag.Tape() as tape:
    y = ag.relu(x)
    z = (y * 2.0) + 1.0
    loss = ag.sum(z)

tape.backward(loss)
print(x.value.grad.to_host())     # d(loss)/dx = 2 where x > 0, else 0

A few things to notice:

  • The forward pass runs normally — the only thing the Tape does on the way in is append a Node to tape.nodes.
  • tape.backward(loss) does a depth-first topological sort starting at loss, then walks the resulting list in reverse and calls node.grad_fn(node.grad).
  • The gradient is accumulated into parent.grad (and into parent.value.grad so Optimizers can read it). If a parent is used more than once, its per-consumer gradients are summed in place via an OpenCL ADD kernel.
  • The loss's grad is seeded with ones_like(loss.value) when no grad= argument is given to backward.
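To make the walk concrete, the same relu / multiply / add / sum chain can be traced by hand on nested Python lists. This is only an illustration of the arithmetic; it does not call netcl, and the variable names (grad_z, grad_y, grad_x) are chosen for this sketch.

```python
x = [[1.0, 2.0], [3.0, 4.0]]

# Forward pass, one op at a time.
y = [[max(0.0, v) for v in row] for row in x]          # relu
z = [[v * 2.0 + 1.0 for v in row] for row in y]        # mul, then add
loss = sum(sum(row) for row in z)                      # sum -> scalar

# Backward pass, in reverse order of the forward ops.
grad_loss = 1.0                                        # seed for a scalar loss
grad_z = [[grad_loss for _ in row] for row in z]       # sum passes the seed to every element
grad_y = [[g * 2.0 for g in row] for row in grad_z]    # add passes g through; mul scales by 2
grad_x = [[g if v > 0 else 0.0 for g, v in zip(g_row, x_row)]
          for g_row, x_row in zip(grad_y, x)]          # relu gates on x > 0
```

Because every entry of x is positive, grad_x is 2.0 everywhere; any non-positive entry would receive 0.0 instead.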

5. Writing a Custom Op

This is the core of the tutorial. We are going to write out = sin(a) * b, then derive its backward by hand.

Forward. out = sin(a) * b, both elementwise. We can express it as a composition of two stock ops: elementwise_unary for sin(a) and elementwise_binary for the product. The kernel expressions are taken from the OpenCL primitive set.

Backward. By the product rule and the chain rule:

d/da (sin(a) * b) = cos(a) * b
d/db (sin(a) * b) = sin(a) * 1

So we need cos(a) (for the first gradient) and sin(a) (for the second). The forward call already computed sin(a) — we need to keep that Tensor around so the backward can reuse it instead of recomputing.
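Before writing any kernel code it can be worth checking the hand-derived formulas numerically. The sketch below compares them against a central finite difference using only the standard library; f, grad_f, and central_diff are names introduced here for illustration.

```python
import math

def f(a, b):
    return math.sin(a) * b

def grad_f(a, b):
    # Hand-derived partials: d/da = cos(a) * b, d/db = sin(a)
    return math.cos(a) * b, math.sin(a)

def central_diff(fn, x, eps=1e-6):
    # Second-order accurate estimate of fn'(x)
    return (fn(x + eps) - fn(x - eps)) / (2 * eps)

a0, b0 = 0.7, -1.3
ga, gb = grad_f(a0, b0)
num_ga = central_diff(lambda a: f(a, b0), a0)
num_gb = central_diff(lambda b: f(a0, b), b0)
```

If the analytic and numeric values disagree by more than a small tolerance, the derivation (not the kernel) is the first place to look.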

from netcl.autograd.engine import apply_op
from netcl.ops.elementwise import elementwise_binary
from netcl.ops.elementwise_optimized import elementwise_unary

def my_op(a, b, tape=None):
    """out = sin(a) * b, both elementwise."""
    saved = {}  # tensors that grad_fn reuses from the forward pass

    def forward(x, y):
        sin_x = elementwise_unary(x, expression="sin(v0)")
        saved["sin_a"] = sin_x                     # keep sin(a) for grad_fn
        return elementwise_binary(sin_x, y, expression="MUL(v0, v1)")

    def grad_fn(grad_out):
        # d/da (sin(a)*b) = cos(a) * b, scaled by the upstream gradient
        cos_a = elementwise_unary(a.value, expression="cos(v0)")
        ga = elementwise_binary(grad_out, b.value, expression="MUL(v0, v1)")
        ga = elementwise_binary(ga, cos_a, expression="MUL(v0, v1)")
        # d/db (sin(a)*b) = sin(a), scaled by the upstream gradient
        gb = elementwise_binary(grad_out, saved["sin_a"], expression="MUL(v0, v1)")
        return [ga, gb]

    return apply_op(forward, grad_fn, a, b, tape=tape, op_name="my_op")

Note on reusing sin(a). The forward closure stores sin_x in the saved dict so that grad_fn can read it back when computing gb. This assumes apply_op runs forward before grad_fn is ever called, which holds when the forward pass executes eagerly. The alternative is to recompute the value inside grad_fn with elementwise_unary(a.value, expression="sin(v0)"), which is simpler but costs one extra kernel launch per backward pass.

Two extra notes:

  1. apply_op's real signature (per autograd/engine.py) is apply_op(fn, grad_fn, *args, tape=None, op_name=None, attrs=None). The grad_fn receives a single argument — the upstream gradient — and must return one Tensor per parent in the order they were passed to apply_op.
  2. The custom op above launches several small kernels: a sin and a multiply in the forward pass, then a cos and a handful of multiplies in the backward pass. Section 6 shows how to register it with the JIT Compiler so that the forward pass and the backward pass each collapse into a single fused kernel.
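The grad_fn contract from note 1 (one gradient per parent, in argument order) can be sketched with a toy wrapper. toy_apply_op below is a hypothetical, heavily simplified stand-in written for this page; it is not netcl's apply_op, it works on floats, and the length check is an assumption about what a helpful engine might enforce.

```python
import math
from types import SimpleNamespace

def toy_apply_op(fn, grad_fn, *parents, op_name=None):
    """Hypothetical stand-in for apply_op: run fn, attach a checked grad_fn."""
    out_value = fn(*(p.value for p in parents))

    def checked_grad_fn(grad_out):
        grads = grad_fn(grad_out)
        if len(grads) != len(parents):
            raise ValueError(
                f"{op_name}: expected {len(parents)} gradients, got {len(grads)}"
            )
        return grads

    return SimpleNamespace(value=out_value, parents=parents,
                           grad_fn=checked_grad_fn, op_name=op_name)

a = SimpleNamespace(value=0.5)
b = SimpleNamespace(value=3.0)
out = toy_apply_op(
    lambda x, y: math.sin(x) * y,
    lambda g: [g * math.cos(a.value) * b.value,   # gradient for parent 0 (a)
               g * math.sin(a.value)],            # gradient for parent 1 (b)
    a, b, op_name="my_op",
)
ga, gb = out.grad_fn(1.0)
```

Returning the gradients in the wrong order is a silent bug when both parents have the same shape, so it helps to label each entry with the parent it belongs to, as the comments above do.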

6. Fusing the Custom Op with the JIT Compiler

Registering a custom op with the JIT Compiler is what turns it from a chain of separate kernels into one fused kernel per pass. The recipe is to define an AutogradPrimitive (a (forward, backward, arity, fusible) record) and register it by name. From then on, any apply_op(..., op_name="my_op") call inside a @jit_compiled function is recognized and fused.

from netcl.autograd.compiler import register_primitive

def my_op_fwd(args, attrs):
    # args == ["v0", "v1"], the C names of a and b; return one C expression for the output.
    a, b = args
    return f"({b} * sin({a}))"

def my_op_bwd(args, grad_var, attrs, out_var):
    a, b = args
    g_a = f"({grad_var} * {b} * cos({a}))"
    g_b = f"({grad_var} * sin({a}))"
    return [g_a, g_b]

register_primitive("my_op", my_op_fwd, my_op_bwd, arity=2, fusible=True)

The two callable arguments are tiny code generators:

  • forward(args, attrs) -> str — given the C names of the input variables ("v0", "v1", …) and a dict of scalar attributes, return a single C expression for the output (e.g. "v1 * sin(v0)").
  • backward(args, grad_var, attrs, out_var) -> List[str] — given the inputs, the C name of the upstream gradient, and the local C name of the output, return one C expression per parent, in the same order the parents were passed to apply_op.
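Since both generators are ordinary Python functions that return strings, they can be exercised without a device. The sketch below calls them with the variable names "v0" and "v1" and splices the results into kernel-body text; the splicing step and the names g_out, out, g0, g1 are assumptions made for illustration, not a description of what the JIT Compiler actually emits.

```python
def my_op_fwd(args, attrs):
    a, b = args
    return f"({b} * sin({a}))"

def my_op_bwd(args, grad_var, attrs, out_var):
    a, b = args
    return [f"({grad_var} * {b} * cos({a}))",   # gradient w.r.t. a
            f"({grad_var} * sin({a}))"]         # gradient w.r.t. b

fwd_expr = my_op_fwd(["v0", "v1"], {})
bwd_exprs = my_op_bwd(["v0", "v1"], "g_out", {}, "out")

# Hypothetical splice into per-pass kernel bodies (one line per expression).
forward_body = f"float out = {fwd_expr};"
backward_body = "\n".join(
    f"float g{i} = {expr};" for i, expr in enumerate(bwd_exprs)
)
```

Testing the generators this way catches swapped arguments or unbalanced parentheses before any OpenCL compilation is involved.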

Once registered, you can write a jit_compiled function that uses my_op and the JIT Compiler will fuse the entire forward chain — including the custom op — into a single forward kernel and a single backward kernel. See Writing a Custom OpenCL Kernel for the full worked example.

7. Grad-Mode Switches: no_grad, set_grad_enabled

The three grad-mode knobs are:

  • ag.no_grad(): context manager. Saves the current grad mode, sets it to False, and restores it on __exit__. Use it in inference and validation loops so that no Nodes are recorded on the Tape.
  • ag.set_grad_enabled(False): module-level switch. Same effect, but global rather than scoped.
  • ag.is_grad_enabled(): query the current process-wide grad mode.

with ag.no_grad():
    out = model(x_eval)    # forward runs, but no Nodes are recorded

no_grad is the recommended way to do inference. Inside the block, every op that goes through apply_op short-circuits to fn(...) and returns the raw Tensor: no grad_fn, no tape recording, no allocation of the gradient buffer on the output. It is also the cheapest way to avoid building a backward graph you will never use.
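The save / disable / restore behaviour described above is the standard scoped-flag pattern. A minimal sketch with contextlib follows; the _state dict and the three function bodies are hypothetical and only mirror the names used on this page, not netcl's source.

```python
from contextlib import contextmanager

_state = {"grad_enabled": True}     # process-wide flag in this sketch

def is_grad_enabled():
    return _state["grad_enabled"]

def set_grad_enabled(mode):
    _state["grad_enabled"] = bool(mode)

@contextmanager
def no_grad():
    previous = is_grad_enabled()    # save the current mode
    set_grad_enabled(False)
    try:
        yield
    finally:
        set_grad_enabled(previous)  # restore even if the body raises

states = [is_grad_enabled()]
with no_grad():
    states.append(is_grad_enabled())
    with no_grad():                 # nesting is safe: inner exit restores False
        states.append(is_grad_enabled())
    states.append(is_grad_enabled())
states.append(is_grad_enabled())
```

The try/finally is the important design choice: without it, an exception inside an evaluation loop would leave gradient recording switched off for the rest of the process.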

8. Anomaly Detection: detect_anomaly

A slow but high-signal mode that checks every gradient produced during the backward pass for NaN and Inf values and reports where the offending op was created. It is the right tool when you are writing a new op and loss is producing NaNs, or when an existing op starts misbehaving after a refactor.

with ag.detect_anomaly():
    tape.backward(loss)

When detect_anomaly is on, the Tape checks every parent's gradient for NaN / inf after each grad_fn call, and raises a RuntimeError with the offending Node's creation_trace — the traceback.format_stack() captured at registration time. The set_detect_anomaly module function is the same flag at module level, useful for one-line debugging from a breakpoint.
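The check itself is simple to picture. The sketch below is a stand-alone illustration using plain lists of floats: record_op and check_gradients are hypothetical helpers, and the error text is modelled on the message quoted in the Troubleshooting section rather than copied from netcl's source.

```python
import math
import traceback

def record_op(op_name):
    """Hypothetical node record; captures the Python stack at creation time."""
    return {"op_name": op_name, "creation_trace": traceback.format_stack()}

def check_gradients(node, grads):
    # grads holds one flat list of floats per parent.
    for index, grad in enumerate(grads):
        if any(not math.isfinite(v) for v in grad):
            trace = "".join(node["creation_trace"])
            raise RuntimeError(
                f"Gradient w.r.t. parent {index} of op '{node['op_name']}' "
                f"contains NaN or Inf. Op was created at:\n{trace}"
            )

node = record_op("div")
check_gradients(node, [[1.0, 2.0], [0.5, 0.25]])        # finite: passes silently
try:
    check_gradients(node, [[1.0, 2.0], [float("inf"), 0.0]])
    message = None
except RuntimeError as exc:
    message = str(exc)
```

Capturing the stack when the op is created, rather than when the check fails, is what lets the error point at the forward-pass line that introduced the problem.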

9. Graph Inspection: debug_tape

debug_tape is a thin context manager that yields the active Tape. Combined with tape.nodes and node.creation_trace, it lets you introspect the graph from inside a forward pass — useful in a debugger or in unit tests that assert "this op chain has exactly N nodes".

with ag.debug_tape(tape) as t:
    pred = model(x)
    print(len(t.nodes))           # number of recorded ops so far
    loss = ag.cross_entropy(pred, y)
    print(len(t.nodes))           # one more

You can also poke at the graph after the forward pass without the context manager:

with ag.Tape() as tape:
    pred = model(x)
    loss = ag.cross_entropy(pred, y)

for node in tape.nodes:
    print(node.op_name, node.requires_grad, len(node.parents))
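For unit tests it is often handier to summarize the recorded ops than to print them. The helper below counts nodes by op_name; it runs against stand-in objects built with SimpleNamespace, so the three-node list is a hypothetical example and not the output of a real Tape.

```python
from collections import Counter
from types import SimpleNamespace

def summarize(nodes):
    """Count recorded ops by name; nodes without an op_name count as 'leaf'."""
    return Counter(n.op_name or "leaf" for n in nodes)

# Stand-ins for the nodes a tape might hold after relu -> mul -> sum.
x = SimpleNamespace(op_name=None, requires_grad=True, parents=())
y = SimpleNamespace(op_name="relu", requires_grad=True, parents=(x,))
z = SimpleNamespace(op_name="mul", requires_grad=True, parents=(y,))
loss = SimpleNamespace(op_name="sum", requires_grad=True, parents=(z,))

nodes = [y, z, loss]
summary = summarize(nodes)
```

A test can then assert on summary["relu"] or on sum(summary.values()) to pin down the expected shape of the graph.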

Visualizing the Tape

When you want a richer graph visualization than print(tape.nodes), the debug_tape helper is the entry point. For the fused view (the kernel graph the JIT Compiler builds), see the TraceNode section of the JIT Compiler page.

Troubleshooting

  • Gradient is zero everywhere. A Node's requires_grad is True if any parent had requires_grad=True. If you wrap a Tensor with ag.tensor(x) (the default requires_grad=False), the resulting Node is a leaf and the gradient never propagates past it. Wrap with ag.tensor(x, requires_grad=True) (or use a Parameter — every Parameter sets the flag automatically).
  • Gradient is None on a parameter after tape.backward. Most often the parameter was not on the path from the loss — i.e. it is not actually used by the model. Confirm with id(p.value) in {id(n.value) for n in tape.nodes}.
  • detect_anomaly raises with Gradient w.r.t. parent N of op 'X' contains NaN or Inf. The creation_trace in the error message points at the Python frame that registered the bad op. The usual suspects are a divide-by-zero (div with no eps guard) or a log of a non-positive number (log after a relu is safe; log of raw activations is not).
  • Custom op produces a RuntimeError: op 'my_op' is not registered with the JIT Compiler. You decorated the function with @jit_compile but did not call register_primitive for "my_op". Either add the registration, or set fusible=False to exclude the op from fusion (the decorator will silently fall back to the un-fused implementation).
  • Backward is slow. The Tape does an in-place ADD for every multi-use parent; if your graph has hundreds of small ops the JIT Compiler is the right tool to fold them into one fused kernel. Wrap the body in @jit_compile and confirm the cache hit by looking for the cached log line.

See also