Momentum
Status: Public API in netcl.optim.momentum.Momentum (re-exported from netcl.optim)
Heavy-ball (Polyak) momentum is one of the oldest tricks in stochastic optimisation. The idea is to give the optimizer a short-term memory: instead of updating the parameters in the direction of the current gradient alone, the optimizer updates in the direction of an exponentially weighted running sum of past gradients (an unnormalised moving average). This dampens the noise of minibatch SGD and often gives a meaningful speed-up, especially on ill-conditioned loss surfaces.
Momentum is the netcl class that exposes heavy-ball momentum
as a stand-alone optimizer. It keeps a per-parameter velocity
buffer, updates it as v = momentum * v + g, and then applies
theta -= lr * v. It is essentially SGD with momentum > 0,
exposed as its own class for users who want the explicit name.
The Momentum class itself is rarely used in practice, because
SGD(momentum=0.9) is strictly more general, but the class is
kept for symmetry with the rest of the optim module and to
provide a clean home for the nesterov and weight_decay options
that pure-momentum users typically want.
Overview
The math is identical to SGD with momentum > 0 and
nesterov=False:
v_{t+1} = momentum * v_t + g_t
theta_{t+1} = theta_t - lr * v_{t+1}
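The two update equations can be traced by hand with a few lines of plain Python. The sketch below is not netcl's fused kernel; the function names (heavy_ball_step, run) and the toy quadratic loss are assumptions made for illustration. It compares plain gradient descent (momentum = 0) with heavy-ball momentum on a loss whose two axes differ in curvature by a factor of 100.

```python
def heavy_ball_step(theta, grad, velocity, lr, momentum):
    """One heavy-ball update over lists of floats (illustrative only)."""
    # v_{t+1} = momentum * v_t + g_t
    new_velocity = [momentum * v + g for v, g in zip(velocity, grad)]
    # theta_{t+1} = theta_t - lr * v_{t+1}
    new_theta = [t - lr * v for t, v in zip(theta, new_velocity)]
    return new_theta, new_velocity


CURVATURE = (1.0, 100.0)  # toy ill-conditioned quadratic (assumption)


def loss(theta):
    return 0.5 * sum(c * t * t for c, t in zip(CURVATURE, theta))


def gradient(theta):
    return [c * t for c, t in zip(CURVATURE, theta)]


def run(momentum, steps=100, lr=0.01):
    theta = [1.0, 1.0]
    velocity = [0.0, 0.0]  # velocity starts at zero
    for _ in range(steps):
        theta, velocity = heavy_ball_step(
            theta, gradient(theta), velocity, lr, momentum)
    return loss(theta)


plain_loss = run(momentum=0.0)     # reduces to plain gradient descent
momentum_loss = run(momentum=0.9)  # heavy-ball momentum
print(plain_loss, momentum_loss)
```

On this toy problem the momentum run ends at a lower loss after the same number of steps, because the velocity keeps accumulating along the low-curvature axis where plain gradient descent crawls.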
Momentum and SGD share the same fused OpenCL kernel; the
two classes differ only in their default arguments. The
Momentum constructor takes lr, momentum, weight_decay,
and nesterov; the SGD constructor takes the same arguments
plus an extra dampening parameter that pure-momentum users
typically do not want.
Where It Lives
- File path: optim/momentum.py.
- Module path: netcl.optim.momentum.
- Public re-export: from netcl.optim import Momentum.
How It Works
For each parameter, the kernel does:
v[i] = momentum * v[i] + g[i];
param[i] -= lr * v[i];
Optionally, coupled weight decay is folded into the gradient
before the update. The velocity buffer v is allocated lazily on
the first step, exactly like Adam's moments. With nesterov=True,
the parameter update becomes the look-ahead form
param -= lr * (momentum * v + g), which is the Nesterov
accelerated gradient (NAG).
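A scalar sketch of the step described above, covering coupled weight decay and the Nesterov variant. The function name momentum_step is hypothetical, and two details are assumptions rather than confirmed netcl behaviour: that coupled weight decay means adding weight_decay * param to the gradient, and that the v in the Nesterov formula is the already-updated velocity.

```python
def momentum_step(param, grad, v, lr, momentum,
                  weight_decay=0.0, nesterov=False):
    """Return (new_param, new_v) for one scalar parameter."""
    if weight_decay != 0.0:
        # Coupled weight decay (assumed form): fold the L2 term
        # into the gradient before it reaches the velocity buffer.
        grad = grad + weight_decay * param
    v = momentum * v + grad
    if nesterov:
        # Look-ahead form: param -= lr * (momentum * v + g).
        update = momentum * v + grad
    else:
        update = v
    return param - lr * update, v


plain = momentum_step(1.0, 0.5, 0.0, lr=0.1, momentum=0.9)
nag = momentum_step(1.0, 0.5, 0.0, lr=0.1, momentum=0.9, nesterov=True)
decayed = momentum_step(1.0, 0.5, 0.0, lr=0.1, momentum=0.9,
                        weight_decay=0.1)
print(plain, nag, decayed)
```

Starting from a zero velocity, the Nesterov step moves the parameter further than the plain step for the same gradient, since it applies the momentum term a second time in the look-ahead.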
Code Example
import netcl.optim as opt

# `model` is any existing model object that exposes parameters().
optimizer = opt.Momentum(model.parameters(),
                         lr=0.01, momentum=0.9,
                         weight_decay=0.0, nesterov=False)
With Nesterov acceleration (the recommended default for vision training):
optimizer = opt.Momentum(model.parameters(),
                         lr=0.1, momentum=0.9,
                         weight_decay=5e-4, nesterov=True)
A cosine learning-rate schedule pairs naturally:
scheduler = opt.lr_scheduler.CosineAnnealingLR(optimizer,
                                               T_max=90)
for epoch in range(90):
    train_one_epoch(...)
    scheduler.step()
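For intuition about what the schedule does to lr, the standard cosine-annealing closed form can be evaluated directly. This sketch assumes netcl's CosineAnnealingLR follows that formula; the helper name cosine_lr and the eta_min parameter are illustrative and not taken from the netcl API.

```python
import math


def cosine_lr(base_lr, step, t_max, eta_min=0.0):
    """Cosine-annealed learning rate at `step`, for 0 <= step <= t_max."""
    cosine = math.cos(math.pi * step / t_max)
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + cosine)


# Learning rate at the start, midpoint, and end of a 90-epoch run.
for epoch in (0, 45, 90):
    print(epoch, cosine_lr(0.1, epoch, t_max=90))
```

Under this formula the learning rate starts at base_lr, passes through the midpoint between base_lr and eta_min halfway through, and reaches eta_min at step t_max.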
Performance & Trade-offs
- Identical to SGD(momentum=0.9) when the other arguments match. The class is provided for readability; pick whichever name reads better at the call site.
- The velocity buffer is fp32 even when the parameters are fp16 under AMP. The fused kernel is aware of the autocast context and casts appropriately.
- Nesterov momentum (nesterov=True) typically gives a small but consistent speed-up over plain momentum on smooth, well-behaved loss surfaces (vision classifiers, language models). It is the default for most vision recipes.
- Momentum is not adaptive: it does not rescale the per-parameter learning rate the way Adam or RMSprop do. When gradient magnitudes differ widely across parameters, switch to one of those.
When to use heavy-ball momentum
Heavy-ball momentum is at its best on smooth loss surfaces with low-to-moderate stochastic noise. The two regimes where it shines are:
- Vision classifiers on ImageNet-style data. A combination of Momentum(lr=0.1, momentum=0.9, weight_decay=1e-4, nesterov=True) with a cosine schedule is still a workhorse recipe for ResNet-style convolutional networks. The loss surface is well-behaved and the minibatch size is usually large enough (256 to 1024) that the per-step gradient noise is small.
- Reinforcement learning policy gradients. Policy-gradient updates are notoriously noisy; a momentum buffer smooths the per-step updates and avoids the oscillations that plague raw SGD. Momentum is a reasonable option to evaluate in policy-gradient codebases (PPO, A2C), although common implementations of those algorithms default to Adam or RMSprop.
The two regimes where heavy-ball momentum is not the right choice:
- Transformer-style attention models. Adaptive optimizers (AdamW, Lion) consistently outperform plain momentum on these losses. The per-parameter gradient magnitudes vary by orders of magnitude across the embedding, attention, and MLP blocks, and a single global learning rate applied to the velocity buffer cannot compensate for that variation.
- Small-batch fine-tuning. With minibatch size 1 to 16, the per-step gradient is too noisy for the EMA in the velocity buffer to be useful. Switch to AdamW and a low learning rate.
Debugging tips
If training with Momentum diverges (the loss spikes after a
warm-up period), the usual suspects are:
- Learning rate too high. Try dividing the learning rate by 10 and re-running.
- Missing weight decay. Without weight_decay, the parameters can drift to large magnitudes; set weight_decay=1e-4 and re-run.
- Missing Nesterov on a convex loss. Switch nesterov=False to nesterov=True; the speed-up is small, but the update is typically more stable.
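The learning-rate item in the list above can be reproduced on a toy problem. The sketch below uses a one-dimensional quadratic loss with curvature 100 (an assumption chosen for illustration, not a netcl workload) and a hypothetical helper final_loss; it shows the heavy-ball update blowing up at lr=0.1 and converging once the learning rate is divided by 10.

```python
def final_loss(lr, momentum=0.9, steps=50, curvature=100.0):
    """Run heavy-ball momentum on 0.5 * curvature * x**2 from x = 1."""
    x, v = 1.0, 0.0
    for _ in range(steps):
        grad = curvature * x
        v = momentum * v + grad
        x -= lr * v
    return 0.5 * curvature * x * x


initial_loss = 0.5 * 100.0 * 1.0 ** 2
high = final_loss(lr=0.1)       # diverges on this toy loss
low = final_loss(lr=0.1 / 10)   # same run with lr divided by 10
print(initial_loss, high, low)
```

A real training loss is not a quadratic, but the pattern is the same: if the loss ends above where it started, the step size is a likely culprit.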
If the loss plateaus early, the usual suspects are:
- Learning rate too low. Try multiplying by 2 and re-running.
- Learning rate schedule too aggressive. A cosine schedule decays to zero over T_max steps; if T_max is shorter than the run, the model spends a large part of training at a near-zero learning rate. Set T_max to the full epoch count (when scheduler.step() is called once per epoch).
See also
- Momentum — the API page.
- SGD — the more general class.
- Adam — the adaptive alternative.
- CosineAnnealingLR — typical schedule.