netcl.optim — Optimizers & Schedules

netcl.optim provides the parameter-update machinery that turns accumulated gradients into in-place Tensor updates. It mirrors torch.optim: step() applies the update rule, zero_grad() clears the gradient buffers, and a separate family of schedulers adjusts the learning rate over the course of training.

All public symbols are re-exported from the package root:

from netcl.optim import (
    SGD, Adam, AdamW, Momentum, RMSProp,
    CosineAnnealingLR, ReduceLROnPlateau, WarmupCosine,
    clip_grad_norm, clip_grad_norm_device,
    AMPGradScaler,
)

Optimizer Index

| Class / function | Purpose |
|---|---|
| `SGD` | Stochastic gradient descent (+ optional momentum/Nesterov). |
| `Momentum` | SGD with classical (Polyak) momentum. |
| `Adam` | Adaptive moments (Kingma & Ba, 2014). |
| `AdamW` | Adam with decoupled weight decay. |
| `RMSProp` | Per-parameter adaptive learning rate. |
| `clip_grad_norm` | Norm-based gradient clipping (host-side). |
| `clip_grad_norm_device` | Same, runs as an OpenCL kernel. |
| `CosineAnnealingLR` | Cosine annealing to `eta_min`. |
| `WarmupCosine` | Linear warmup followed by cosine decay. |
| `ReduceLROnPlateau` | Plateau-triggered decay. |
| `AMPGradScaler` | Loss scaling for AMP training. |

SGD

from netcl.optim import SGD

opt = SGD(
    params,
    lr=1e-2,
    momentum=0.0,
    dampening=0.0,
    weight_decay=0.0,
    nesterov=False,
)

Update rule (per parameter θ, with gradient g and velocity buffer v):

If weight_decay > 0: g ← g + weight_decay * θ
If momentum > 0:
    v ← momentum * v + (1 - dampening) * g
    If nesterov: g ← g + momentum * v
    Else: g ← v
θ ← θ - lr * g

Set nesterov=True for the Nesterov variant, which uses the lookahead gradient g + momentum * v instead of v in the step. The defaults (momentum=0.0, nesterov=False) reduce to vanilla gradient descent.
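The update rule above can be traced by hand with a small scalar reference function. This is a minimal sketch in plain Python, not netcl's implementation: `sgd_step` is a hypothetical helper, and it assumes the velocity buffer starts at zero.

```python
def sgd_step(theta, grad, velocity, lr=1e-2, momentum=0.0, dampening=0.0,
             weight_decay=0.0, nesterov=False):
    """Apply one SGD update to a scalar parameter; returns (theta, velocity)."""
    g = grad
    if weight_decay > 0:
        g = g + weight_decay * theta           # L2 penalty folded into the gradient
    if momentum > 0:
        velocity = momentum * velocity + (1 - dampening) * g
        # Nesterov uses the lookahead gradient; otherwise step along the velocity.
        g = g + momentum * velocity if nesterov else velocity
    return theta - lr * g, velocity


# Vanilla step: theta moves by -lr * grad.
theta, v = sgd_step(1.0, 0.5, 0.0, lr=0.1)

# Two momentum steps with a constant gradient: the velocity accumulates.
theta_m, v_m = sgd_step(1.0, 0.5, 0.0, lr=0.1, momentum=0.9)
theta_m, v_m = sgd_step(theta_m, 0.5, v_m, lr=0.1, momentum=0.9)
```

With momentum=0.0 the velocity branch is skipped entirely, which is why the defaults reduce to vanilla gradient descent.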

Momentum

Classic Polyak momentum. The implementation is identical to SGD with momentum > 0 and dampening=0.0, but the constructor only exposes the momentum-style knobs:

from netcl.optim import Momentum

opt = Momentum(params, lr=1e-2, momentum=0.9, weight_decay=1e-4, nesterov=False)

Adam / AdamW

Adam keeps two exponential moving averages per parameter — the first moment m and the uncentered second moment v — and computes a bias-corrected adaptive step.

from netcl.optim import Adam, AdamW

opt  = Adam(params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.0)
optw = AdamW(params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.01)

Per parameter θ with gradient g and step t:

m ← β1 * m + (1 - β1) * g
v ← β2 * v + (1 - β2) * g²
m̂ ← m / (1 - β1^t)
v̂ ← v / (1 - β2^t)
θ ← θ - lr * m̂ / (sqrt(v̂) + eps)

In Adam, weight_decay is applied as an L2 penalty on the gradient (the original formulation), that is, g ← g + weight_decay * θ before the moment updates, so the decay term is rescaled by the adaptive denominator. In AdamW the decay is decoupled from the gradient update and applied directly to the weights: θ ← θ - lr * (m̂ / (sqrt(v̂) + eps) + weight_decay * θ). Decoupled weight decay is the modern default for transformer-style training and is the recommended variant unless you have a specific reason to use the L2-in-gradient form.
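The difference between the two decay styles is easiest to see on a single scalar step. The following is a sketch under stated assumptions rather than netcl's code: `adam_step` and its `decoupled` flag are hypothetical names used only to put both variants in one function.

```python
import math


def adam_step(theta, grad, m, v, t, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
              weight_decay=0.0, decoupled=False):
    """One Adam (decoupled=False) or AdamW (decoupled=True) step on a scalar."""
    beta1, beta2 = betas
    g = grad
    if weight_decay > 0 and not decoupled:
        g = g + weight_decay * theta            # Adam: decay enters the moments
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)                # bias correction, t starts at 1
    v_hat = v / (1 - beta2 ** t)
    update = m_hat / (math.sqrt(v_hat) + eps)
    if weight_decay > 0 and decoupled:
        update = update + weight_decay * theta  # AdamW: decay bypasses the moments
    return theta - lr * update, m, v


theta_adam, _, _ = adam_step(1.0, 0.5, 0.0, 0.0, t=1, weight_decay=0.01)
theta_adamw, _, _ = adam_step(1.0, 0.5, 0.0, 0.0, t=1, weight_decay=0.01,
                              decoupled=True)
```

On the first step the bias-corrected ratio is close to 1 regardless of gradient scale, so the L2 term in Adam is almost entirely normalized away, while AdamW still shrinks the weight by lr * weight_decay * θ.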

RMSProp

Per-parameter adaptive learning rate using a moving average of squared gradients:

from netcl.optim import RMSProp

opt = RMSProp(
    params,
    lr=1e-3,
    alpha=0.99,
    eps=1e-8,
    weight_decay=0.0,
    momentum=0.0,
    centered=False,
)

Per parameter θ with gradient g:

v ← alpha * v + (1 - alpha) * g²
θ ← θ - lr * g / (sqrt(v) + eps)

If centered=True, the optimizer additionally maintains a moving average g_avg of g and uses sqrt(v - g_avg²) + eps as the denominator, so the step is normalized by an estimate of the gradient variance rather than by the uncentered second moment.

Set momentum > 0 to add a momentum term on top of the RMSProp update.
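A scalar sketch of the plain and centered variants follows. It is an illustration only: `rmsprop_step` is a hypothetical helper, the momentum term is left out because its exact form is not specified here, and the clamp to zero under the square root is a defensive assumption.

```python
import math


def rmsprop_step(theta, grad, v, g_avg=0.0, lr=1e-3, alpha=0.99, eps=1e-8,
                 centered=False):
    """One RMSProp step on a scalar; returns (theta, v, g_avg)."""
    v = alpha * v + (1 - alpha) * grad * grad
    if centered:
        g_avg = alpha * g_avg + (1 - alpha) * grad
        # Variance estimate; clamp guards against tiny negative rounding errors.
        denom = math.sqrt(max(v - g_avg * g_avg, 0.0)) + eps
    else:
        denom = math.sqrt(v) + eps
    return theta - lr * grad / denom, v, g_avg


theta_plain, _, _ = rmsprop_step(1.0, 0.5, 0.0)
theta_centered, _, _ = rmsprop_step(1.0, 0.5, 0.0, centered=True)
```

Because the centered denominator subtracts g_avg², it is never larger than the plain one, so the centered step is at least as large for the same gradient history.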

Gradient Clipping

Two helpers in netcl.optim.clip operate in place on the Tensor.grad fields of the parameters you pass in. Call them after tape.backward(loss) and before opt.step().

from netcl.optim import clip_grad_norm, clip_grad_norm_device

# Host-side: pull grads to NumPy, scale, copy back.
clip_grad_norm(parameters, max_norm=1.0)

# Device-side: runs as an OpenCL kernel — avoids the H2D/D2H round-trip.
clip_grad_norm_device(parameters, max_norm=1.0)

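clip_grad_norm computes the total L2 norm of the concatenated gradient vector and scales every gradient by min(1, max_norm / total_norm), so gradients whose total norm is already within max_norm are left unchanged. clip_grad_norm_device does the same computation entirely on the device; because the gradients never leave device memory, it is typically faster for large parameter counts.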
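The scaling rule can be checked on plain Python lists. This is a reference sketch, not the library function: `clip_grad_norm_reference` is a hypothetical name, and both the returned pre-clip norm and the zero-norm guard are assumptions made for the illustration.

```python
import math


def clip_grad_norm_reference(grads, max_norm):
    """Scale a list of gradient vectors in place; returns the pre-clip norm."""
    # Total L2 norm over all gradients, as if they were one concatenated vector.
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    scale = min(1.0, max_norm / total_norm) if total_norm > 0 else 1.0
    for vec in grads:
        for i in range(len(vec)):
            vec[i] *= scale
    return total_norm


grads = [[3.0, 0.0], [0.0, 4.0]]          # total norm is 5
norm_before = clip_grad_norm_reference(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for vec in grads for g in vec))
```

All gradients share one scale factor, so clipping shortens the overall update without changing its direction.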

LR Schedules

Schedulers adjust optimizer.lr on every call to scheduler.step(). Call sched.step() once per epoch for epoch-based schedulers (CosineAnnealingLR), once per optimizer step for WarmupCosine, or sched.step(metric) for the metric-based ReduceLROnPlateau.

`CosineAnnealingLR`

Smoothly anneal from the initial learning rate to eta_min over T_max epochs.

from netcl.optim import CosineAnnealingLR

sched = CosineAnnealingLR(opt, T_max=50, eta_min=0.0)
for epoch in range(50):
    train(...)
    sched.step()

The curve is lr(e) = eta_min + (lr_0 - eta_min) * (1 + cos(π * e / T_max)) / 2.
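The curve can be evaluated directly to see the endpoints and the midpoint. This is a small standalone sketch; `cosine_lr` is a hypothetical helper that only transcribes the formula above.

```python
import math


def cosine_lr(epoch, lr_0, T_max, eta_min=0.0):
    """Learning rate after `epoch` scheduler steps, per the cosine formula."""
    return eta_min + (lr_0 - eta_min) * (1 + math.cos(math.pi * epoch / T_max)) / 2


# Start, midpoint, and end of a 50-epoch schedule starting at lr_0 = 0.1.
lrs = [cosine_lr(e, lr_0=0.1, T_max=50) for e in (0, 25, 50)]
```

The schedule starts at lr_0, passes through the average of lr_0 and eta_min at T_max / 2, and ends at eta_min.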

`WarmupCosine`

Linear warmup for warmup_steps steps followed by cosine decay. Useful for transformer training where a cold start risks divergence.

from netcl.optim import WarmupCosine

sched = WarmupCosine(opt, warmup_steps=500, total_steps=10000, eta_min=0.0)
# call sched.step() once per optimizer step (not per epoch)
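One plausible shape for this schedule is sketched below. The details are assumptions, not netcl's documented behavior: the ramp is taken to reach the base rate on the last warmup step, the cosine phase is taken to span the remaining total_steps - warmup_steps steps, and `warmup_cosine_lr` is a hypothetical helper.

```python
import math


def warmup_cosine_lr(step, lr_0, warmup_steps, total_steps, eta_min=0.0):
    """Learning rate at a given optimizer step (0-based)."""
    if step < warmup_steps:
        # Assumed linear ramp that reaches lr_0 on the final warmup step.
        return lr_0 * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return eta_min + (lr_0 - eta_min) * (1 + math.cos(math.pi * progress)) / 2


first = warmup_cosine_lr(0, 3e-4, warmup_steps=500, total_steps=10000)
peak = warmup_cosine_lr(499, 3e-4, warmup_steps=500, total_steps=10000)
last = warmup_cosine_lr(10000, 3e-4, warmup_steps=500, total_steps=10000)
```

Starting from a small fraction of the base rate keeps the first updates small, which is the point of warming up before the cosine decay takes over.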

`ReduceLROnPlateau`

The only metric-based scheduler. It watches a scalar (typically validation loss) and reduces the LR when the metric has stopped improving for patience measurements.

from netcl.optim import ReduceLROnPlateau

plateau = ReduceLROnPlateau(
    opt,
    mode="min",      # "min" or "max"
    factor=0.5,      # new_lr = old_lr * factor
    patience=5,      # epochs with no improvement before decay
    threshold=1e-4,  # significant change threshold
    threshold_mode="rel",  # "rel" or "abs"
    cooldown=0,
    min_lr=0.0,
    eps=1e-8,
)

for epoch in range(epochs):
    val_loss = validate(...)
    plateau.step(val_loss)

Set mode="max" for accuracy-like metrics. threshold_mode="rel" interprets threshold as a relative change (default 1e-4); use "abs" for an absolute tolerance.
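The interaction of patience, threshold, and cooldown can be followed in a stripped-down model. This sketch covers mode="min" only and borrows its decision rule from the equally named PyTorch scheduler (decay once the count of non-improving steps exceeds patience); whether netcl matches that rule exactly is an assumption, and `PlateauSketch` is a hypothetical class.

```python
class PlateauSketch:
    """Minimal plateau logic for mode="min"; not netcl's implementation."""

    def __init__(self, lr, factor=0.5, patience=5, threshold=1e-4,
                 threshold_mode="rel", cooldown=0, min_lr=0.0, eps=1e-8):
        self.lr, self.factor, self.patience = lr, factor, patience
        self.threshold, self.threshold_mode = threshold, threshold_mode
        self.cooldown, self.min_lr, self.eps = cooldown, min_lr, eps
        self.best = float("inf")
        self.bad_steps = 0
        self.cooldown_left = 0

    def _improved(self, metric):
        if self.threshold_mode == "rel":
            return metric < self.best * (1 - self.threshold)
        return metric < self.best - self.threshold

    def step(self, metric):
        if self._improved(metric):
            self.best, self.bad_steps = metric, 0
        else:
            self.bad_steps += 1
        if self.cooldown_left > 0:
            self.cooldown_left -= 1
            self.bad_steps = 0               # ignore stalls while cooling down
        if self.bad_steps > self.patience:
            new_lr = max(self.lr * self.factor, self.min_lr)
            if self.lr - new_lr > self.eps:  # skip negligible reductions
                self.lr = new_lr
            self.cooldown_left = self.cooldown
            self.bad_steps = 0
        return self.lr


sched = PlateauSketch(lr=0.1, patience=2)
history = [sched.step(loss) for loss in [1.0, 0.9, 0.9, 0.9, 0.9, 0.9]]
```

In this run the loss stops improving after the second value, and the rate is halved on the third consecutive stalled measurement.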

`AMPGradScaler` & AMP

Mixed-precision training scales the loss to keep fp16 gradients representable. AMPGradScaler is re-exported from netcl.optim; the underlying class lives in netcl.amp.

from netcl.optim import AMPGradScaler
from netcl.amp import autocast
import netcl.autograd as ag

scaler = AMPGradScaler()

for x, y in loader:
    with ag.Tape() as tape:
        with autocast():
            logits = model(x)
            loss = loss_fn(logits, y)
    tape.backward(scaler.scale(loss))
    scaler.step(opt)
    scaler.update()
    opt.zero_grad()

The full AMP contract, including the autocast context manager and the device support matrix, is documented on its own page.
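Why scaling the loss helps can be shown without any device code, using the standard library's half-precision packing. This is a sketch of the idea only: the scale of 2**16 is an illustrative choice, not a documented AMPGradScaler default, and `to_fp16` is a hypothetical helper.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


grad = 1e-8                   # a gradient value below the fp16 subnormal range
scale = 65536.0               # illustrative loss scale (2**16)

# Stored directly in fp16, the gradient underflows to zero.
unscaled = to_fp16(grad)

# Scaling the loss scales every gradient by the same factor (chain rule), so
# the value survives fp16 storage and is divided back out in higher precision.
recovered = to_fp16(grad * scale) / scale
```

The recovered value matches the true gradient to within fp16 rounding error, whereas the unscaled one is lost entirely.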

Full Training Loop

import netcl.autograd as ag
from netcl.optim import AdamW, CosineAnnealingLR, clip_grad_norm

opt   = AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
sched = CosineAnnealingLR(opt, T_max=20)

for epoch in range(20):
    for x, y in loader:
        with ag.Tape() as tape:
            logits = model(x)
            loss = ag.cross_entropy(logits, y)
        tape.backward(loss)
        clip_grad_norm(model.parameters(), max_norm=1.0)
        opt.step()
        opt.zero_grad()
    sched.step()

Distributed Notes

Each worker in a Data Parallel setup holds its own optimizer instance and optimizer state. The distributed API performs the gradient all-reduce before the optimizer's step() is called, so every replica applies the same parameter update as in the single-process case. Parameter sharding (ZeRO-style) is not part of the optimizer contract; if you need it, wrap the parameters in sharded views before handing them to the optimizer constructor.
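The reason replicas stay in sync can be simulated in a few lines of plain Python. This sketch assumes the all-reduce averages gradients across workers (summing is the other common convention) and uses a plain SGD step; `all_reduce_mean` is a hypothetical stand-in for the distributed API.

```python
def all_reduce_mean(per_worker_grads):
    """Average gradients element-wise across workers."""
    n = len(per_worker_grads)
    return [sum(vals) / n for vals in zip(*per_worker_grads)]


lr = 0.1
theta = [1.0, -2.0]                        # identical starting weights
worker_grads = [[0.2, 0.4], [0.6, 0.0]]    # one local gradient per worker

mean_grad = all_reduce_mean(worker_grads)

# Each worker applies the same averaged gradient to its own replica.
replicas = [[p - lr * g for p, g in zip(theta, mean_grad)]
            for _ in worker_grads]
```

Since every worker starts from the same weights and steps with the same reduced gradient, the replicas remain identical after the update.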

For an end-to-end multi-device example, see Data Parallel.