SGD
Status: Public API in netcl.optim.sgd.SGD (re-exported from netcl.optim)
Stochastic Gradient Descent with optional momentum and Nesterov
acceleration. SGD is the simplest optimizer in netcl: it applies
theta = theta - lr * (g + weight_decay * theta) per parameter,
with an optional momentum buffer.
The SGD class in netcl supports three modes:
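- Vanilla SGD: no momentum. The update is theta -= lr * g.
- SGD with momentum: v = momentum * v + g; theta -= lr * v.
- SGD with Nesterov momentum: the same velocity update as momentum, but the gradient is conceptually evaluated at the "look-ahead" point theta - lr * momentum * v. The kernel applies this through the standard reformulation theta -= lr * (g + momentum * v), with v already updated, so no second gradient evaluation is needed. This often converges somewhat faster than plain momentum.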
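The relationship between the look-ahead description of Nesterov momentum and the gradient-reuse form can be checked numerically. The sketch below is a minimal illustration on a toy quadratic, not netcl code: `grad`, `lookahead_nesterov`, and `reformulated_nesterov` are hypothetical helpers, and the look-ahead point is taken as theta - lr * momentum * v (the step the momentum term alone would take). Under that assumption, the reformulated variable phi tracks theta - lr * momentum * v at every step.

```python
def grad(theta, a=2.0):
    # Gradient of the toy objective f(theta) = 0.5 * a * theta**2.
    return a * theta


def lookahead_nesterov(theta, steps, lr=0.1, momentum=0.9):
    # Textbook form: evaluate the gradient at the look-ahead point.
    v = 0.0
    trace = []
    for _ in range(steps):
        g = grad(theta - lr * momentum * v)
        v = momentum * v + g
        theta -= lr * v
        trace.append((theta, v))
    return trace


def reformulated_nesterov(phi, steps, lr=0.1, momentum=0.9):
    # Gradient-reuse form: the stored variable phi is itself the look-ahead point.
    v = 0.0
    trace = []
    for _ in range(steps):
        g = grad(phi)
        v = momentum * v + g
        phi -= lr * (g + momentum * v)
        trace.append((phi, v))
    return trace


lookahead = lookahead_nesterov(1.0, steps=20)
reformulated = reformulated_nesterov(1.0, steps=20)
for (theta, v_a), (phi, v_b) in zip(lookahead, reformulated):
    # Same velocity, and phi equals theta shifted to its look-ahead point.
    assert abs(v_a - v_b) < 1e-9
    assert abs(phi - (theta - 0.1 * 0.9 * v_a)) < 1e-9
```

The two forms differ only in which variable they store, which is why an implementation can apply Nesterov momentum with a single gradient evaluation per step.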
Overview
SGD is the canonical baseline optimizer. It has the smallest
memory footprint of any optimizer in netcl (no per-parameter state
in the vanilla case, one momentum buffer per parameter otherwise)
and the lowest per-step compute.
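To make the footprint concrete, the optimizer state can be estimated from the parameter count alone. The sketch below assumes 4-byte (float32) elements and a hypothetical model with 25 million parameters; `sgd_state_bytes` is an illustrative helper, not part of netcl.

```python
def sgd_state_bytes(num_params, momentum=0.0, bytes_per_element=4):
    """Estimate optimizer-state size for SGD (illustrative helper, not a netcl API)."""
    # Vanilla SGD keeps no state; any nonzero momentum adds one buffer per parameter.
    buffers_per_param = 1 if momentum != 0 else 0
    return num_params * buffers_per_param * bytes_per_element


num_params = 25_000_000  # hypothetical model size
print(sgd_state_bytes(num_params))                # vanilla: 0 bytes
print(sgd_state_bytes(num_params, momentum=0.9))  # momentum: 100000000 bytes (100 MB)
```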
Where It Lives
- File path: optim/sgd.py.
- Module path: netcl.optim.sgd.
- Public re-export: from netcl.optim import SGD.
How It Works
For each parameter, the kernel does:
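    if (momentum != 0) {
        v[i] = momentum * v[i] + g[i];           // update the momentum buffer
        if (nesterov) g[i] += momentum * v[i];   // Nesterov: step along g + momentum * v
        else          g[i] = v[i];               // plain momentum: step along v
    }
    g[i] += weight_decay * param[i];             // weight decay, added after the momentum update
    param[i] -= lr * g[i];                       // apply the step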
This is a single fused kernel per parameter. The momentum buffer
v is allocated lazily on the first step, exactly like
Adam's moments.
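For readers who want to trace the kernel by hand, the same per-element logic can be mirrored in plain Python. This is a reference sketch only: `ReferenceSGD` is a hypothetical class written for this page, it works on Python lists rather than netcl tensors, and unlike the kernel it leaves the gradient list unmodified. It does follow the kernel's order of operations and allocates the momentum buffer lazily on the first step.

```python
class ReferenceSGD:
    """Pure-Python mirror of the fused kernel (hypothetical helper, not netcl code)."""

    def __init__(self, lr, momentum=0.0, weight_decay=0.0, nesterov=False):
        self.lr = lr
        self.momentum = momentum
        self.weight_decay = weight_decay
        self.nesterov = nesterov
        self.v = None  # momentum buffer, allocated lazily on the first step

    def step(self, param, grad):
        """Update `param` (a list of floats) in place using `grad`."""
        if self.momentum != 0 and self.v is None:
            self.v = [0.0] * len(param)
        for i in range(len(param)):
            g = grad[i]
            if self.momentum != 0:
                self.v[i] = self.momentum * self.v[i] + g
                if self.nesterov:
                    g = g + self.momentum * self.v[i]
                else:
                    g = self.v[i]
            # Weight decay is added after the momentum update, as in the kernel.
            g += self.weight_decay * param[i]
            param[i] -= self.lr * g


# Two plain-momentum steps with a constant gradient of 1.0.
opt = ReferenceSGD(lr=0.1, momentum=0.9)
theta = [0.0]
opt.step(theta, [1.0])  # v = 1.0, theta is about -0.1
opt.step(theta, [1.0])  # v = 1.9, theta is about -0.29
```

Because weight decay is added after the momentum update, the decay term never enters the buffer v in this sketch; that mirrors the kernel listing above rather than any documented guarantee.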
Code Example
import netcl.optim as opt

optimizer = opt.SGD(
    model.parameters(),
    lr=0.1,  # typical for ResNet on ImageNet
    momentum=0.9,
    weight_decay=5e-4,
    nesterov=True,
)
A cosine learning-rate schedule pairs naturally:
scheduler = opt.lr_scheduler.CosineAnnealingLR(optimizer, T_max=90)
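As a rough guide to what such a schedule does to lr, the sketch below evaluates the standard cosine-annealing formula. It assumes that netcl's CosineAnnealingLR follows this formula with a minimum learning rate of 0 and that T_max counts epochs; `cosine_lr` and `eta_min` are names introduced here for illustration, not netcl APIs.

```python
import math


def cosine_lr(step, base_lr=0.1, t_max=90, eta_min=0.0):
    """Standard cosine annealing from base_lr at step 0 down to eta_min at step t_max."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * step / t_max))


# Learning rate at the start, midpoint, and end of a 90-epoch run.
for epoch in (0, 45, 90):
    print(epoch, round(cosine_lr(epoch), 6))
```

The rate starts at the base value, passes through half of it at the midpoint, and reaches the minimum at T_max, so most of the decay happens in the middle of training.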
Performance & Trade-offs
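- The cheapest optimizer in netcl. In the vanilla case with no weight decay, the fused kernel does one multiply-add per parameter element per step; momentum, Nesterov, and weight decay each add one more.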
- Vanilla SGD is brittle on noisy gradients and rarely used in practice; momentum 0.9 is a near-universal default.
- For ResNet-style vision training, SGD with Nesterov momentum and a cosine schedule is still the most common recipe, and adaptive optimizers rarely beat it on a model of the same size.
See also
- SGD — the API page.
- Adam — the adaptive alternative.
- Momentum — momentum-only variant.
- CosineAnnealingLR — typical schedule.