netcl.optim — Optimizers & Schedules
netcl.optim provides the parameter-update machinery that turns accumulated gradients
into actual Tensor mutations. It mirrors torch.optim: a step() call
applies the update rule, a zero_grad() clears the gradient buffers, and a separate
scheduler family adjusts the learning rate over epochs.
All public symbols are re-exported from the package root:
from netcl.optim import (
    SGD, Adam, AdamW, Momentum, RMSProp,
    CosineAnnealingLR, ReduceLROnPlateau, WarmupCosine,
    clip_grad_norm, clip_grad_norm_device,
    AMPGradScaler,
)
Optimizer Index
| Class / function | Purpose |
|---|---|
| SGD | Stochastic gradient descent (+ optional momentum/Nesterov). |
| Momentum | SGD with classical (Polyak) momentum. |
| Adam | Adaptive moments (Kingma & Ba, 2014). |
| AdamW | Adam with decoupled weight decay. |
| RMSProp | Per-parameter adaptive learning rate. |
| clip_grad_norm | Norm-based gradient clipping (host-side). |
| clip_grad_norm_device | Same, runs as an OpenCL kernel. |
| CosineAnnealingLR | Cosine annealing to eta_min. |
| WarmupCosine | Linear warmup followed by cosine decay. |
| ReduceLROnPlateau | Plateau-triggered decay. |
| AMPGradScaler | Loss scaling for AMP training. |
SGD
from netcl.optim import SGD
opt = SGD(
    params,
    lr=1e-2,
    momentum=0.0,
    dampening=0.0,
    weight_decay=0.0,
    nesterov=False,
)
Update rule (per parameter θ, with gradient g):
- If weight_decay > 0: g ← g + weight_decay * θ
- If momentum > 0:
  - v ← momentum * v + (1 - dampening) * g
  - if nesterov: g ← g + momentum * v
  - else: g ← v
- θ ← θ - lr * g
Set nesterov=True for the Nesterov variant, which uses the lookahead gradient
g + momentum * v instead of v in the step. The defaults (momentum=0.0,
nesterov=False) reduce to vanilla gradient descent.
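The update rule above can be traced by hand with a small pure-Python sketch. It works on a single scalar parameter and is not the netcl implementation; the function name sgd_step and the convention of returning the momentum buffer are assumptions made for illustration.

```python
def sgd_step(theta, grad, v, lr=1e-2, momentum=0.0, dampening=0.0,
             weight_decay=0.0, nesterov=False):
    """Apply one SGD update to a scalar parameter; return the new (theta, v)."""
    g = grad
    if weight_decay > 0:
        g = g + weight_decay * theta             # L2 penalty folded into the gradient
    if momentum > 0:
        v = momentum * v + (1 - dampening) * g   # update the momentum buffer
        if nesterov:
            g = g + momentum * v                 # lookahead gradient
        else:
            g = v
    return theta - lr * g, v


# Defaults reduce to vanilla gradient descent; the buffer is never touched.
theta, v = sgd_step(theta=1.0, grad=0.5, v=0.0, lr=0.1)

# Two momentum steps on a constant gradient: the buffer accumulates.
theta_m, v_m = 1.0, 0.0
for _ in range(2):
    theta_m, v_m = sgd_step(theta_m, 0.5, v_m, lr=0.1, momentum=0.9)
```

With a constant gradient the second momentum step is larger than the first, because the buffer has grown from 0.5 to 0.95.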
Momentum
Classic Polyak momentum. The implementation is identical to SGD with
momentum > 0 and dampening=0.0, but the constructor only exposes the momentum-style
knobs:
from netcl.optim import Momentum
opt = Momentum(params, lr=1e-2, momentum=0.9, weight_decay=1e-4, nesterov=False)
Adam / AdamW
Adam keeps two exponential moving averages per parameter — the first moment m and the
uncentered second moment v — and computes a bias-corrected adaptive step.
from netcl.optim import Adam, AdamW
opt = Adam(params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.0)
optw = AdamW(params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.01)
Per parameter θ with gradient g and step t:
- m ← β1 * m + (1 - β1) * g
- v ← β2 * v + (1 - β2) * g²
- m̂ ← m / (1 - β1^t)
- v̂ ← v / (1 - β2^t)
- θ ← θ - lr * m̂ / (sqrt(v̂) + eps)
In Adam, weight_decay is applied as an L2 penalty on the gradient (the original
formulation); in AdamW the decay is decoupled from the gradient update — it is
applied directly to the weights as θ ← θ - lr * (m̂ / (sqrt(v̂) + eps) + weight_decay * θ).
Decoupled weight decay is the modern default for transformer-style training and is the
recommended variant unless you have a specific reason to use the L2-in-gradient form.
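To make the difference between the two weight-decay forms concrete, here is a minimal scalar sketch of one step. It follows the formulas in this section but is not the netcl code; adam_step and its decoupled flag are hypothetical names used only for this illustration.

```python
import math


def adam_step(theta, grad, m, v, t, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
              weight_decay=0.0, decoupled=False):
    """One Adam (decoupled=False) or AdamW (decoupled=True) step on a scalar.

    t is the 1-based step count. Returns the new (theta, m, v).
    """
    beta1, beta2 = betas
    g = grad
    if weight_decay > 0 and not decoupled:
        g = g + weight_decay * theta            # Adam: L2 penalty on the gradient
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    update = m_hat / (math.sqrt(v_hat) + eps)
    if weight_decay > 0 and decoupled:
        update = update + weight_decay * theta  # AdamW: decay applied to the weights
    return theta - lr * update, m, v


theta_adam, _, _ = adam_step(1.0, 0.5, 0.0, 0.0, t=1, weight_decay=0.01)
theta_adamw, _, _ = adam_step(1.0, 0.5, 0.0, 0.0, t=1, weight_decay=0.01,
                              decoupled=True)
```

On the first step the bias-corrected ratio is close to 1 whatever the gradient magnitude, so the L2 form moves θ by about lr, while the decoupled form adds lr * weight_decay * θ on top of that.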
RMSProp
Per-parameter adaptive learning rate using a moving average of squared gradients:
from netcl.optim import RMSProp
opt = RMSProp(
    params,
    lr=1e-3,
    alpha=0.99,
    eps=1e-8,
    weight_decay=0.0,
    momentum=0.0,
    centered=False,
)
Per parameter θ with gradient g:
- v ← alpha * v + (1 - alpha) * g²
- If centered: additionally maintain a moving average g_avg of g, and use sqrt(v - g_avg²) + eps as the denominator (a more stable variance estimate).
- θ ← θ - lr * g / (sqrt(v) + eps)
Set momentum > 0 to add a momentum term on top of the RMSProp update.
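A scalar sketch of the plain and centered updates, for illustration only. It leaves out weight_decay and momentum, assumes g_avg is averaged with the same alpha as v, and clamps the centered variance at zero to guard against rounding; none of these details are confirmed by the netcl source, and rmsprop_step is a hypothetical name.

```python
import math


def rmsprop_step(theta, grad, v, g_avg=0.0, lr=1e-3, alpha=0.99, eps=1e-8,
                 centered=False):
    """One RMSProp step on a scalar; returns the new (theta, v, g_avg)."""
    v = alpha * v + (1 - alpha) * grad * grad
    if centered:
        # Assumption: the gradient average uses the same decay rate as v.
        g_avg = alpha * g_avg + (1 - alpha) * grad
        denom = math.sqrt(max(v - g_avg * g_avg, 0.0)) + eps
    else:
        denom = math.sqrt(v) + eps
    return theta - lr * grad / denom, v, g_avg


theta_plain, _, _ = rmsprop_step(1.0, 0.5, v=0.0)
theta_centered, _, _ = rmsprop_step(1.0, 0.5, v=0.0, centered=True)
```

Subtracting g_avg² shrinks the denominator, so the centered variant takes a slightly larger step for the same gradient.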
Gradient Clipping
Two helpers, both in netcl.optim.clip, operate in place on the
Tensor.grad fields of the parameters you pass in. Call them after tape.backward(loss)
and before opt.step().
from netcl.optim import clip_grad_norm, clip_grad_norm_device
# Host-side: pull grads to NumPy, scale, copy back.
clip_grad_norm(parameters, max_norm=1.0)
# Device-side: runs as an OpenCL kernel — avoids the H2D/D2H round-trip.
clip_grad_norm_device(parameters, max_norm=1.0)
clip_grad_norm computes the total L2-norm of the concatenated gradient vector and
scales every gradient by min(1, max_norm / total_norm). clip_grad_norm_device does
the same computation entirely on the device and is faster for large parameter counts.
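The scaling rule can be sketched on plain Python lists. This is an illustration of global-norm clipping, not the netcl helper: clip_global_norm is a hypothetical name, and both the returned pre-clip norm and the zero-norm guard are assumptions.

```python
import math


def clip_global_norm(grads, max_norm):
    """Scale a list of gradient lists in place; return the pre-clip total norm."""
    # L2 norm of all gradients treated as one concatenated vector.
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    # Guard against division by zero when every gradient is zero.
    scale = min(1.0, max_norm / total_norm) if total_norm > 0 else 1.0
    for grad in grads:
        for i in range(len(grad)):
            grad[i] *= scale
    return total_norm


grads = [[3.0, 0.0], [0.0, 4.0]]   # two parameters, total norm 5
norm_before = clip_global_norm(grads, max_norm=1.0)
```

Because every gradient is multiplied by the same factor, the direction of the overall update is preserved; only its length is capped at max_norm.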
LR Schedules
Schedulers adjust optimizer.lr on every call to scheduler.step(). Call sched.step()
once per epoch for epoch-based schedulers, once per optimizer step for WarmupCosine,
or sched.step(metric) for metric-based ones.
CosineAnnealingLR
Smoothly anneal from the initial learning rate to eta_min over T_max epochs.
from netcl.optim import CosineAnnealingLR
sched = CosineAnnealingLR(opt, T_max=50, eta_min=0.0)
for epoch in range(50):
    train(...)
    sched.step()
The curve is lr(e) = eta_min + (lr_0 - eta_min) * (1 + cos(π * e / T_max)) / 2.
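The formula can be evaluated directly to check the endpoints. The helper name cosine_lr is hypothetical; it reproduces the closed form above rather than the scheduler class.

```python
import math


def cosine_lr(epoch, lr_0, T_max, eta_min=0.0):
    """Closed-form cosine annealing value at a given epoch."""
    return eta_min + (lr_0 - eta_min) * (1 + math.cos(math.pi * epoch / T_max)) / 2


# Start, midpoint, and end of a 50-epoch schedule starting from lr_0 = 3e-4.
schedule = [cosine_lr(e, lr_0=3e-4, T_max=50) for e in (0, 25, 50)]
```

The schedule starts at lr_0, passes through the midpoint between lr_0 and eta_min at T_max / 2, and ends at eta_min.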
WarmupCosine
Linear warmup for warmup_steps steps followed by cosine decay. Useful for transformer
training where a cold start risks divergence.
from netcl.optim import WarmupCosine
sched = WarmupCosine(opt, warmup_steps=500, total_steps=10000, eta_min=0.0)
# call sched.step() once per optimizer step (not per epoch)
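A sketch of what such a schedule might compute per step. The exact warmup shape is an assumption: here the rate ramps linearly so that it reaches lr_0 on the last warmup step, and the cosine phase then runs over the remaining total_steps - warmup_steps steps. warmup_cosine_lr is a hypothetical helper, not part of netcl.

```python
import math


def warmup_cosine_lr(step, lr_0, warmup_steps, total_steps, eta_min=0.0):
    """Learning rate at a 0-based optimizer step (illustrative only)."""
    if step < warmup_steps:
        # Assumed linear ramp: lr_0 / warmup_steps on step 0, lr_0 on the last warmup step.
        return lr_0 * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return eta_min + (lr_0 - eta_min) * (1 + math.cos(math.pi * progress)) / 2


first = warmup_cosine_lr(0, 1e-3, warmup_steps=500, total_steps=10000)
peak = warmup_cosine_lr(499, 1e-3, warmup_steps=500, total_steps=10000)
last = warmup_cosine_lr(10000, 1e-3, warmup_steps=500, total_steps=10000)
```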
ReduceLROnPlateau
The only metric-based scheduler. It watches a scalar (typically validation loss) and
reduces the LR when the metric has stopped improving for patience measurements.
from netcl.optim import ReduceLROnPlateau
plateau = ReduceLROnPlateau(
    optimizer,
    mode="min",            # "min" or "max"
    factor=0.5,            # new_lr = old_lr * factor
    patience=5,            # epochs with no improvement before decay
    threshold=1e-4,        # significant change threshold
    threshold_mode="rel",  # "rel" or "abs"
    cooldown=0,
    min_lr=0.0,
    eps=1e-8,
)
for epoch in range(epochs):
    val_loss = validate(...)
    plateau.step(val_loss)
Set mode="max" for accuracy-like metrics. threshold_mode="rel" interprets
threshold as a relative change (default 1e-4); use "abs" for an absolute tolerance.
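The plateau logic can be pictured with a small stand-alone tracker. This is a hypothetical sketch, not the netcl class: it omits cooldown and eps, and it assumes the decay fires once the count of non-improving measurements exceeds patience (the convention torch.optim uses).

```python
class PlateauTracker:
    """Illustrative stand-in for plateau detection; names are hypothetical."""

    def __init__(self, lr, mode="min", factor=0.5, patience=5,
                 threshold=1e-4, threshold_mode="rel", min_lr=0.0):
        self.lr = lr
        self.mode, self.factor, self.patience = mode, factor, patience
        self.threshold, self.threshold_mode = threshold, threshold_mode
        self.min_lr = min_lr
        self.best = None
        self.num_bad = 0

    def _improved(self, metric):
        if self.best is None:
            return True
        # "rel" scales the tolerance by the best value seen so far.
        if self.threshold_mode == "rel":
            margin = abs(self.best) * self.threshold
        else:
            margin = self.threshold
        if self.mode == "min":
            return metric < self.best - margin
        return metric > self.best + margin

    def step(self, metric):
        if self._improved(metric):
            self.best, self.num_bad = metric, 0
        else:
            self.num_bad += 1
        if self.num_bad > self.patience:   # assumed trigger condition
            self.lr = max(self.lr * self.factor, self.min_lr)
            self.num_bad = 0
        return self.lr


tracker = PlateauTracker(lr=1e-3, patience=2)
lrs = [tracker.step(loss) for loss in [1.0, 0.9, 0.9, 0.9, 0.9, 0.9]]
```

In this run the loss stalls at 0.9 from the third measurement on, and the rate is halved on the fifth, after three consecutive measurements without improvement.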
AMPGradScaler & AMP
Mixed-precision training scales the loss to keep fp16 gradients representable.
AMPGradScaler is re-exported from netcl.optim; the underlying class lives in
netcl.amp.
from netcl.optim import AMPGradScaler
from netcl.amp import autocast
import netcl.autograd as ag
scaler = AMPGradScaler()
for x, y in loader:
    with ag.Tape() as tape:
        with autocast():
            logits = model(x)
            loss = loss_fn(logits, y)
    tape.backward(scaler.scale(loss))
    scaler.step(opt)
    scaler.update()
    opt.zero_grad()
The full AMP contract, including the autocast context manager and the device support matrix, is documented on its own page.
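To see why loss scaling matters, the following sketch rounds a small gradient to half precision with and without scaling, using only the standard library. The scale factor 65536 is an illustrative choice, not a documented netcl default, and to_fp16 is a hypothetical helper.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


grad = 1e-8              # below fp16's smallest subnormal (about 6e-8)
loss_scale = 65536.0     # illustrative scale factor

# Stored directly in fp16, the gradient underflows to zero.
unscaled = to_fp16(grad)

# Scaled before the fp16 round-trip, then unscaled in full precision.
recovered = to_fp16(grad * loss_scale) / loss_scale
```

Scaling the loss multiplies every gradient by the same factor through the chain rule, so dividing by that factor before the optimizer step recovers the original gradient up to fp16 rounding error.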
Full Training Loop
import netcl.autograd as ag
from netcl.optim import AdamW, CosineAnnealingLR, clip_grad_norm
opt = AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
sched = CosineAnnealingLR(opt, T_max=20)
for epoch in range(20):
    for x, y in loader:
        with ag.Tape() as tape:
            logits = model(x)
            loss = ag.cross_entropy(logits, y)
        tape.backward(loss)
        clip_grad_norm(model.parameters(), max_norm=1.0)
        opt.step()
        opt.zero_grad()
    sched.step()
Distributed Notes
Each worker in a Data Parallel setup holds its own copy of every
optimizer. The distributed API takes care of gradient all-reduce
before the optimizer's step() is called, so a parameter update is identical to
the single-process case. Parameter sharding (ZeRO-style) is not part of the
optimizer contract; if you need it, wrap the parameters in sharded views before handing
them to the optimizer constructor.
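The following toy sketch illustrates why replicas stay in sync: once every worker holds the same reduced gradient, identical optimizer state yields identical updates. It simulates workers as plain lists and assumes the all-reduce averages gradients; whether netcl averages or sums is not stated here, and all_reduce_mean is a hypothetical stand-in.

```python
def all_reduce_mean(worker_grads):
    """Average per-parameter gradients across workers (stand-in for all-reduce)."""
    n = len(worker_grads)
    return [sum(gs) / n for gs in zip(*worker_grads)]


worker_grads = [[0.2, -0.4], [0.6, 0.0]]   # two workers, two parameters each
avg = all_reduce_mean(worker_grads)

lr = 0.1
replicas = [[1.0, 1.0], [1.0, 1.0]]        # every worker starts from the same weights
for params in replicas:
    # Each worker applies the same plain SGD update with the shared gradient.
    for i, g in enumerate(avg):
        params[i] -= lr * g
```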
For an end-to-end multi-device example, see Data Parallel.
See also
- MNIST with MLP: end-to-end training loop using AdamW, gradient clipping, and a cosine schedule.
- Data Parallel: using these optimizers in a multi-process setup.
- autograd API: the Tape that produces the gradients passed into opt.step().
- amp API: the GradScaler and autocast context that wrap these optimizers for mixed precision.
- distributed API: gradient all-reduce in front of opt.step().
- Tensor: the actual Tensor objects that the optimizers mutate in place.