Adam

Status: Public API in netcl.optim.adam.Adam (re-exported from netcl.optim)

Adam (Kingma and Ba, 2014) is the workhorse first-order optimizer of deep learning and is bundled with netcl under the same name. It maintains per-parameter exponential moving averages of the gradient (m_t, the first moment) and of the squared gradient (v_t, the second moment), and uses these to compute a bias-corrected update.

The Adam class in netcl follows the original paper closely:

m_t     = beta1 * m_{t-1} + (1 - beta1) * g_t
v_t     = beta2 * v_{t-1} + (1 - beta2) * g_t ** 2
m_hat   = m_t / (1 - beta1 ** t)
v_hat   = v_t / (1 - beta2 ** t)
theta_t = theta_{t-1} - lr * m_hat / (sqrt(v_hat) + eps)
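
To see what the bias correction in m_hat and v_hat buys, the following minimal sketch runs the moment recurrences on a constant scalar gradient. It is plain Python written for illustration, not netcl code, and the function name moment_estimates is hypothetical.

```python
def moment_estimates(grads, beta1=0.9, beta2=0.999):
    """Yield (t, m_t, m_hat, v_t, v_hat) for a sequence of scalar gradients."""
    m = 0.0
    v = 0.0
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g ** 2
        # Dividing by (1 - beta ** t) undoes the pull towards the zero initial value.
        m_hat = m / (1.0 - beta1 ** t)
        v_hat = v / (1.0 - beta2 ** t)
        yield t, m, m_hat, v, v_hat


# Constant gradient: the raw averages start close to zero,
# while the corrected estimates recover g and g ** 2 from the first step on.
g = 0.5
for t, m, m_hat, v, v_hat in moment_estimates([g] * 3):
    print(f"t={t}  m={m:.4f}  m_hat={m_hat:.4f}  v={v:.6f}  v_hat={v_hat:.6f}")
```

For a constant gradient g, m_t equals g * (1 - beta1 ** t), so the corrected estimate is exactly g (up to rounding). Without the correction, early steps would be scaled down by the zero initialisation of m and v.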

AdamW is a closely related variant (see AdamW) that decouples weight decay from the gradient step. The two optimizers share the same moment-estimation code; only the final parameter update differs.
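
The difference between the two final updates can be sketched on a single scalar weight. The code below is an illustration under assumptions, not the netcl implementation: the function names are hypothetical, and the decoupled form follows the commonly used AdamW update (decay added outside the normalised step), which may differ in detail from netcl.optim.adamw.AdamW.

```python
import math


def adam_direction(g, m, v, t, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the bias-corrected Adam direction plus the updated moments."""
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    return m_hat / (math.sqrt(v_hat) + eps), m, v


def step_coupled(theta, g, m, v, t, lr, wd):
    """Adam with L2: the decay term is folded into the gradient."""
    d, m, v = adam_direction(g + wd * theta, m, v, t)
    return theta - lr * d, m, v


def step_decoupled(theta, g, m, v, t, lr, wd):
    """AdamW-style: the decay term bypasses the moment estimates."""
    d, m, v = adam_direction(g, m, v, t)
    return theta - lr * (d + wd * theta), m, v


# One step from theta = 2.0 with a zero loss gradient isolates the decay term.
theta, lr, wd = 2.0, 0.1, 0.01
coupled, _, _ = step_coupled(theta, 0.0, 0.0, 0.0, 1, lr, wd)
decoupled, _, _ = step_decoupled(theta, 0.0, 0.0, 0.0, 1, lr, wd)
print("coupled:", coupled, "decoupled:", decoupled)
```

In the coupled case the decay term passes through the sqrt(v_hat) normalisation, so on this first step the weight moves by roughly lr regardless of how small wd is. In the decoupled case the weight shrinks by lr * wd * theta, which is the behaviour usually expected from weight decay.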

Overview

Adam is a stateful optimizer: each Parameter it sees gets a pair of state tensors (m and v) allocated lazily on the first step. The state is held in a dict keyed on id(parameter), so re-using a parameter across multiple optimizers (e.g. for a discriminator / generator pair) requires care.
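
The lazy, id-keyed state described above can be sketched as follows. This is a NumPy illustration under assumptions: the class name LazyMomentState is hypothetical, and arrays stand in for netcl Parameters.

```python
import numpy as np


class LazyMomentState:
    """Hypothetical helper: per-parameter (m, v) buffers created on first use."""

    def __init__(self):
        self._state = {}  # id(param) -> {"m": array, "v": array}

    def get(self, param):
        key = id(param)
        if key not in self._state:
            # First time this object is seen: allocate zeroed moments.
            self._state[key] = {
                "m": np.zeros_like(param),
                "v": np.zeros_like(param),
            }
        return self._state[key]

    def __len__(self):
        return len(self._state)


w = np.ones((2, 2), dtype=np.float32)
opt_a, opt_b = LazyMomentState(), LazyMomentState()

opt_a.get(w)["m"] += 1.0             # opt_a now tracks w
print(len(opt_a), len(opt_b))        # opt_b has not seen w yet
print(opt_b.get(w)["m"].sum())       # opt_b allocates its own, independent moments
```

Two consequences follow from keying on id(). Each optimizer keeps its own moments for a shared parameter, so the two never see each other's history. And a copy of a parameter is a different object, so it starts from fresh zeroed state; in CPython an id can also be reused once the original object has been garbage-collected, which is one reason to keep parameters alive for as long as the optimizer is.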

The optimizer works with any netcl Tensor whose requires_grad is True. Tensors that are not leaf tensors (i.e. produced by an op) are ignored. The standard call site is the Trainer loop or a hand-written training step.

Where It Lives

File path: optim/adam.py.
Module path: netcl.optim.adam.
Public re-export: from netcl.optim import Adam.
Sibling optimizers: optim.sgd.SGD, optim.momentum.Momentum, optim.rmsprop.RMSProp, optim.adamw.AdamW.

How It Works

On step():

1. For each parameter with a non-None grad:
   * Read the gradient from param.grad.
   * Update the first moment m = beta1 * m + (1 - beta1) * g.
   * Update the second moment v = beta2 * v + (1 - beta2) * g * g.
   * Compute m_hat = m / (1 - beta1 ** t) and v_hat = v / (1 - beta2 ** t).
   * Apply the update param -= lr * m_hat / (sqrt(v_hat) + eps).
2. Increment the internal step counter.
3. Optionally apply coupled weight decay (an L2 penalty added to the gradient before the moment update); see the weight_decay parameter.
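
The steps above can be restated as a small NumPy sketch. It is illustrative only and not the netcl kernel code: Param and AdamSketch are hypothetical names, and the sketch assumes the step counter is advanced before the bias correction so that t is 1 on the first call and 1 - beta ** t is never zero.

```python
import numpy as np


class Param:
    """Hypothetical stand-in for a leaf tensor: a value plus an optional gradient."""

    def __init__(self, data):
        self.data = np.asarray(data, dtype=np.float64)
        self.grad = None


class AdamSketch:
    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
                 weight_decay=0.0):
        self.params = list(params)
        self.lr, self.betas, self.eps = lr, betas, eps
        self.weight_decay = weight_decay
        self.t = 0        # shared step counter
        self.state = {}   # id(param) -> (m, v), filled lazily

    def step(self):
        beta1, beta2 = self.betas
        self.t += 1  # assumption: advanced first, so t == 1 on the first step
        for p in self.params:
            if p.grad is None:
                continue  # parameters without a gradient are skipped
            g = p.grad
            if self.weight_decay != 0.0:
                g = g + self.weight_decay * p.data  # coupled L2 term
            zeros = np.zeros_like(p.data)
            m, v = self.state.get(id(p), (zeros, zeros))
            m = beta1 * m + (1.0 - beta1) * g
            v = beta2 * v + (1.0 - beta2) * g * g
            self.state[id(p)] = (m, v)
            m_hat = m / (1.0 - beta1 ** self.t)
            v_hat = v / (1.0 - beta2 ** self.t)
            p.data = p.data - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)


w = Param([1.0, -2.0])
frozen = Param([5.0])            # never receives a gradient
opt = AdamSketch([w, frozen], lr=0.1)
w.grad = np.array([0.3, -0.7])
opt.step()
print(w.data, frozen.data)
```

On the first step m_hat / sqrt(v_hat) is close to sign(g), so each updated entry moves by roughly lr whatever the gradient magnitude. The parameter without a gradient is left untouched and never gets state allocated.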

The math is implemented in OpenCL kernels (one fused kernel per parameter) and is dispatched through the same op system as the forward / backward passes, so it benefits from the same BufferPool and async-queue path.

Code Example

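import netcl.autograd as ag
from netcl.nn import Linear, ReLU, Sequential, cross_entropy
from netcl.optim import Adam, CosineAnnealingLR
from netcl.core.device import manager

epochs = 10  # placeholder value; set to the number of passes you want
# dataloader is assumed to be defined elsewhere and to yield (x, y) batches.

# Build a small MLP on the default device queue.
q = manager.default("auto").queue
model = Sequential(Linear(q, 784, 256), ReLU(), Linear(q, 256, 10))
optimizer = Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999),
                 eps=1e-8, weight_decay=0.0)
scheduler = CosineAnnealingLR(optimizer, T_max=epochs)

for epoch in range(epochs):
    for x, y in dataloader:
        # Record the forward pass so gradients can be computed.
        with ag.Tape() as tape:
            logits = model(x)
            loss = cross_entropy(logits, y)
        tape.backward(loss)
        optimizer.step()       # apply the Adam update
        optimizer.zero_grad()  # clear gradients before the next batch
    scheduler.step()           # advance the learning-rate schedule once per epoch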

Performance & Trade-offs

Adam's per-step compute is about 3x the cost of SGD with momentum (one m update, one v update, one bias-corrected step). On small models, this is dominated by the kernel-launch overhead; the JIT Compiler does not fuse optimizer steps.
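The two moment tensors add twice the parameter memory, so parameters plus optimizer state take about three times the memory of the parameters alone. For a 100 M-parameter model in fp32, that is 2 x 100 M x 4 bytes, or about 800 MB of additional state on top of the 400 MB model itself.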
Use AdamW when you want a regularizer that actually shrinks weights (and not just a coupled L2 penalty on the gradient).
Under AMP, keep the optimizer state in fp32 even when the parameters are autocast to fp16; otherwise Adam will silently lose precision, most visibly in the second moment, where squared gradients can underflow to zero.
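
The fp16 precision problem can be reproduced without netcl. The sketch below is a NumPy illustration under assumptions (the function name second_moment is hypothetical): it accumulates v for a small constant gradient while storing every intermediate in the given dtype.

```python
import numpy as np


def second_moment(grad, steps, dtype, beta2=0.999):
    """Accumulate v_t for a constant gradient, keeping all values in `dtype`."""
    g = dtype(grad)
    b2 = dtype(beta2)
    one_minus_b2 = dtype(1.0 - beta2)
    v = dtype(0.0)
    for _ in range(steps):
        v = b2 * v + one_minus_b2 * g * g
    return v


# A gradient of 1e-4 is representable in fp16, but its scaled square is not.
v16 = second_moment(1e-4, 100, np.float16)
v32 = second_moment(1e-4, 100, np.float32)
print("fp16 v:", v16, " fp32 v:", v32)
```

In fp16 the term (1 - beta2) * g * g is on the order of 1e-11, far below the smallest positive fp16 value, so it rounds to zero and v never leaves zero. The default eps of 1e-8 also rounds to zero in fp16, so the denominator sqrt(v_hat) + eps would vanish as well. Keeping m, v and eps in fp32 avoids both effects.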