Adam
Status: Public API in netcl.optim.adam.Adam (re-exported from netcl.optim)
Adam (Kingma and Ba, 2014) is the workhorse first-order optimizer of
deep learning and is bundled with netcl under the same name. It
maintains per-parameter exponential moving averages of the gradient
(m_t, the first moment) and of the squared gradient (v_t, the
second moment), and uses these to compute a bias-corrected update.
The Adam class in netcl follows the original paper closely:
m_t = beta1 * m_{t-1} + (1 - beta1) * g_t
v_t = beta2 * v_{t-1} + (1 - beta2) * g_t ** 2
m_hat = m_t / (1 - beta1 ** t)
v_hat = v_t / (1 - beta2 ** t)
theta_t = theta_{t-1} - lr * m_hat / (sqrt(v_hat) + eps)
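As a worked example of the bias correction, the sketch below evaluates the equations above for a single scalar parameter at t = 1, starting from m = v = 0. The function name `adam_first_step` is hypothetical and is not part of netcl; it only illustrates the arithmetic. At the first step, m_hat equals g and v_hat equals g ** 2, so the update has magnitude close to lr regardless of the gradient's scale.

```python
import math


def adam_first_step(g, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the change Adam applies to theta at t = 1, starting from m = v = 0."""
    m = (1 - beta1) * g            # m_1 = beta1 * 0 + (1 - beta1) * g
    v = (1 - beta2) * g ** 2       # v_1 = beta2 * 0 + (1 - beta2) * g^2
    m_hat = m / (1 - beta1 ** 1)   # bias correction undoes the (1 - beta1) factor
    v_hat = v / (1 - beta2 ** 1)   # likewise for the second moment
    return -lr * m_hat / (math.sqrt(v_hat) + eps)


# Gradients spanning four orders of magnitude all move theta by about -lr.
for g in (0.01, 1.0, 100.0):
    print(g, adam_first_step(g))
```

Without the two bias-correction lines, the same step would be scaled by (1 - beta1) / sqrt(1 - beta2), which is why the correction matters most early in training.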
AdamW is a closely related variant (see
AdamW) that decouples weight decay from the gradient
step. The two optimizers share the same moment-estimation code; only
the final parameter update differs.
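The difference between coupled and decoupled weight decay can be illustrated with a scalar sketch. This is not netcl's code: the helper names (`moments`, `adam_l2_step`, `adamw_step`) are hypothetical, and the decoupled update follows the common AdamW formulation rather than anything confirmed about netcl's implementation. With a zero gradient, the coupled penalty passes through the adaptive normalization, while the decoupled decay shrinks the weight by lr * wd * theta directly.

```python
import math


def moments(m, v, g, t, beta1=0.9, beta2=0.999):
    """Shared moment estimation; returns updated (m, v) and bias-corrected values."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    return m, v, m / (1 - beta1 ** t), v / (1 - beta2 ** t)


def adam_l2_step(theta, g, m, v, t, lr, wd, eps=1e-8):
    """Adam with coupled weight decay: the L2 term is added to the gradient."""
    g = g + wd * theta
    m, v, m_hat, v_hat = moments(m, v, g, t)
    return theta - lr * m_hat / (math.sqrt(v_hat) + eps), m, v


def adamw_step(theta, g, m, v, t, lr, wd, eps=1e-8):
    """Decoupled weight decay: the decay term bypasses the moment estimates."""
    m, v, m_hat, v_hat = moments(m, v, g, t)
    return theta - lr * (m_hat / (math.sqrt(v_hat) + eps) + wd * theta), m, v


# Zero gradient, so any movement of theta comes from weight decay alone.
theta_l2, _, _ = adam_l2_step(1.0, 0.0, 0.0, 0.0, t=1, lr=0.01, wd=0.1)
theta_w, _, _ = adamw_step(1.0, 0.0, 0.0, 0.0, t=1, lr=0.01, wd=0.1)
print(theta_l2, theta_w)
```

In this setup the coupled variant moves theta by roughly lr (the normalized step), while the decoupled variant moves it by lr * wd * theta.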
Overview
Adam is a stateful optimizer: each Parameter it sees gets a
pair of state tensors (m and v) allocated lazily on the first
step. The state is held in a dict keyed on id(parameter), so
re-using a parameter across multiple optimizers (e.g. for a
discriminator / generator pair) requires care.
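The sketch below illustrates lazy state allocation keyed on id(parameter). It is a simplified model, not netcl's implementation: `Param` and `LazyAdamState` are hypothetical names, and it assumes each optimizer instance owns its own state dict. Under that assumption, a parameter shared by two optimizers ends up with two independent sets of moments.

```python
class Param:
    """Hypothetical stand-in for a netcl Parameter."""

    def __init__(self, value):
        self.value = value
        self.grad = None


class LazyAdamState:
    """Holds per-parameter moments, created on first access."""

    def __init__(self):
        self.state = {}  # id(param) -> {"m": ..., "v": ...}

    def get(self, param):
        key = id(param)
        if key not in self.state:
            # First time this optimizer sees the parameter: allocate m and v.
            self.state[key] = {"m": 0.0, "v": 0.0}
        return self.state[key]


shared = Param(1.0)
opt_a, opt_b = LazyAdamState(), LazyAdamState()

opt_a.get(shared)["m"] = 0.5   # opt_a has accumulated some momentum
print(opt_b.get(shared))       # opt_b starts from zeros for the same parameter
```

A second reason for care is general to Python: id() values are only unique among live objects, so a dict keyed on id() can return stale state if a parameter is garbage-collected and a new object later receives the same id.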
The optimizer works with any netcl Tensor whose requires_grad is
True. Tensors that are not leaf tensors (i.e. produced by an op)
are ignored. The standard call site is the Trainer
loop or a hand-written training step.
Where It Lives
- File path: optim/adam.py.
- Module path: netcl.optim.adam.
- Public re-export: from netcl.optim import Adam.
- Sibling optimizers: optim.sgd.SGD, optim.momentum.Momentum, optim.rmsprop.RMSProp, optim.adamw.AdamW.
How It Works
On step():
- For each parameter with a non-None grad:
  * Read the gradient from param.grad.
  * Update the first moment: m = beta1 * m + (1 - beta1) * g.
  * Update the second moment: v = beta2 * v + (1 - beta2) * g * g.
  * Compute m_hat = m / (1 - beta1 ** t) and v_hat = v / (1 - beta2 ** t).
  * Apply the update: param -= lr * m_hat / (sqrt(v_hat) + eps).
- Increment the internal step counter.
- Optionally apply coupled weight decay (an L2 penalty added to the gradient before the moment update); see the weight_decay parameter.
The math is implemented in OpenCL kernels (one fused kernel per parameter) and is dispatched through the same op system as the forward / backward passes, so it benefits from the same BufferPool and async-queue path.
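The steps of step() can be sketched in NumPy as a readable reference. This is not the OpenCL implementation and not netcl's API: `Param` and `AdamSketch` are hypothetical stand-ins, and the sketch assumes the step index used for bias correction is 1-based (with t = 0 the correction would divide by zero).

```python
import numpy as np


class Param:
    """Hypothetical stand-in for a netcl Parameter (not the real class)."""

    def __init__(self, data):
        self.data = np.asarray(data, dtype=np.float64)
        self.grad = None


class AdamSketch:
    """Reference-style Adam mirroring the listed steps, not the fused kernels."""

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
                 weight_decay=0.0):
        self.params = list(params)
        self.lr = lr
        self.beta1, self.beta2 = betas
        self.eps = eps
        self.weight_decay = weight_decay
        self.state = {}  # id(param) -> {"m": ..., "v": ...}, allocated lazily
        self.t = 0       # number of completed steps

    def step(self):
        t = self.t + 1  # assumption: bias correction uses a 1-based step index
        for p in self.params:
            if p.grad is None:
                continue  # parameters without a gradient are skipped
            g = p.grad
            if self.weight_decay != 0.0:
                # Coupled weight decay: L2 penalty added before the moments.
                g = g + self.weight_decay * p.data
            st = self.state.get(id(p))
            if st is None:
                st = {"m": np.zeros_like(p.data), "v": np.zeros_like(p.data)}
                self.state[id(p)] = st
            st["m"] = self.beta1 * st["m"] + (1 - self.beta1) * g
            st["v"] = self.beta2 * st["v"] + (1 - self.beta2) * g * g
            m_hat = st["m"] / (1 - self.beta1 ** t)
            v_hat = st["v"] / (1 - self.beta2 ** t)
            p.data -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
        self.t = t


w = Param([1.0, -2.0])
w.grad = np.array([0.5, -0.5])
frozen = Param([3.0])  # grad stays None, so step() leaves it untouched

opt = AdamSketch([w, frozen], lr=0.1)
opt.step()
print(w.data)  # each element moves by about lr against the sign of its gradient
```

A reference like this is mainly useful for checking kernel output on small tensors; it makes no attempt to match the buffer pooling or asynchronous dispatch described above.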
Code Example
import netcl.autograd as ag
from netcl.nn import Linear, ReLU, Sequential, cross_entropy
from netcl.optim import Adam, CosineAnnealingLR
from netcl.core.device import manager

# `epochs` and `dataloader` are assumed to be defined by the caller.
q = manager.default("auto").queue
model = Sequential(Linear(q, 784, 256), ReLU(), Linear(q, 256, 10))
optimizer = Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999),
                 eps=1e-8, weight_decay=0.0)
scheduler = CosineAnnealingLR(optimizer, T_max=epochs)

for epoch in range(epochs):
    for x, y in dataloader:
        # Record the forward pass so gradients can be computed afterwards.
        with ag.Tape() as tape:
            logits = model(x)
            loss = cross_entropy(logits, y)
        tape.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
    # One scheduler step per epoch, matching T_max=epochs.
    scheduler.step()
Performance & Trade-offs
- Adam's per-step compute is about 3x the cost of SGD with momentum (one m update, one v update, one bias-corrected step). On small models, this is dominated by the kernel-launch overhead; the JIT Compiler does not fuse optimizer steps.
- The two moment tensors add twice the parameter memory on top of the model itself. For a 100 M-parameter model (about 400 MB of fp32 parameters), Adam needs about 800 MB of additional state (fp32).
- Use AdamW when you want a regularizer that actually shrinks weights (and not just a coupled L2 penalty on the gradient).
- Under AMP, keep the optimizer state in fp32 even when the parameters are autocast to fp16; Adam will silently lose precision otherwise.
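The memory figure for the optimizer state follows from simple arithmetic: two state tensors (m and v) per parameter, each stored in fp32 at 4 bytes per element. The helper below is a hypothetical illustration, not a netcl function.

```python
def adam_state_bytes(num_params, bytes_per_element=4, tensors_per_param=2):
    """Bytes needed for Adam's m and v tensors (fp32 by default)."""
    return num_params * bytes_per_element * tensors_per_param


n = 100_000_000                      # 100 M parameters
print(adam_state_bytes(n) / 1e6)     # 800.0 (MB, decimal)
print(n * 4 / 1e6)                   # 400.0 (MB for the fp32 parameters alone)
```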