Momentum

Status: Public API in netcl.optim.momentum.Momentum (re-exported from netcl.optim)

Heavy-ball (Polyak) momentum is one of the oldest tricks in stochastic optimisation. The idea is to give the optimizer a short-term memory: instead of stepping along the current gradient alone, it steps along an exponentially decaying sum of past gradients. This dampens the noise of minibatch SGD and often gives a meaningful speed-up, especially on ill-conditioned loss surfaces.

Momentum is the netcl class that exposes heavy-ball momentum as a stand-alone optimizer. It keeps a per-parameter velocity buffer, updates it as v = momentum * v + g, and then applies theta -= lr * v. It is equivalent to SGD with momentum > 0, exposed as its own class for users who want the explicit name.

In practice most code calls SGD(momentum=0.9) directly, since SGD is strictly more general. The Momentum class is kept for symmetry with the rest of the optim module and to provide a clean home for the nesterov and weight_decay options that pure-momentum users typically want.

Overview

The math is identical to SGD with momentum > 0 and nesterov=False:

v_{t+1}   = momentum * v_t + g_t
theta_{t+1} = theta_t - lr * v_{t+1}
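
To make the velocity recursion concrete, the pure-Python sketch below applies the first equation to two hand-made gradient sequences. It is an illustration only, not the netcl kernel, and the helper name run_velocity is hypothetical.

```python
def run_velocity(grads, momentum=0.9):
    """Apply v = momentum * v + g over a gradient sequence; return the final v."""
    v = 0.0
    for g in grads:
        v = momentum * v + g
    return v

# A consistent gradient direction accumulates: v approaches g / (1 - momentum).
v_const = run_velocity([1.0] * 200)

# A sign-flipping gradient mostly cancels: |v| approaches 1 / (1 + momentum).
v_flip = run_velocity([(-1.0) ** t for t in range(200)])

print(round(v_const, 3), round(abs(v_flip), 3))  # prints: 10.0 0.526
```

With momentum=0.9 a steady gradient is therefore amplified about tenfold (an effective step of roughly lr / (1 - momentum)), while an oscillating gradient is roughly halved.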

Momentum and SGD share the same fused OpenCL kernel; the two classes differ only in their constructor signatures. The Momentum constructor takes lr, momentum, weight_decay, and nesterov; the SGD constructor takes the same arguments plus a dampening parameter that pure-momentum users typically do not want.
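
The effect of the extra dampening argument can be sketched as follows, assuming netcl follows the PyTorch convention in which the incoming gradient is scaled by 1 - dampening before it enters the velocity. That convention is an assumption, not something this page confirms, and velocity_update is a hypothetical helper.

```python
def velocity_update(v, g, momentum=0.9, dampening=0.0):
    """One velocity update; dampening=0.0 reduces to v = momentum * v + g."""
    # Assumed convention (as in PyTorch): the gradient is scaled by 1 - dampening.
    return momentum * v + (1.0 - dampening) * g

v_plain = velocity_update(2.0, 1.0)                  # heavy-ball form: 0.9 * 2 + 1
v_damped = velocity_update(2.0, 1.0, dampening=0.9)  # gradient scaled by 0.1
```

Under this assumption, leaving dampening at zero recovers exactly the update that Momentum performs.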

Where It Lives

  • File path: optim/momentum.py.
  • Module path: netcl.optim.momentum.
  • Public re-export: from netcl.optim import Momentum.

How It Works

For each parameter, the kernel does:

v[i] = momentum * v[i] + g[i];
param[i] -= lr * v[i];

Optionally, coupled weight decay is folded into the gradient before the velocity update. The velocity buffer v is allocated lazily on the first step, exactly like Adam's moments. With nesterov=True the parameter update becomes the look-ahead form param -= lr * (momentum * v + g), where v is the freshly updated velocity; this is Nesterov accelerated gradient (NAG).
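
The per-element logic described above can be sketched in pure Python. This is a readable stand-in under stated assumptions, not the fused OpenCL kernel: the class name MomentumSketch is hypothetical, parameters are plain lists of floats, and coupled weight decay is assumed to mean adding weight_decay * param to the gradient.

```python
class MomentumSketch:
    """Pure-Python stand-in for the per-element update described above."""

    def __init__(self, lr=0.01, momentum=0.9, weight_decay=0.0, nesterov=False):
        self.lr = lr
        self.momentum = momentum
        self.weight_decay = weight_decay
        self.nesterov = nesterov
        self.velocity = None  # allocated lazily on the first step

    def step(self, params, grads):
        """Update `params` (a list of floats) in place from `grads`."""
        if self.velocity is None:
            self.velocity = [0.0] * len(params)
        for i, g in enumerate(grads):
            # Assumed coupled weight decay: fold the L2 term into the gradient.
            g = g + self.weight_decay * params[i]
            self.velocity[i] = self.momentum * self.velocity[i] + g
            if self.nesterov:
                # Look-ahead form, using the freshly updated velocity.
                update = self.momentum * self.velocity[i] + g
            else:
                update = self.velocity[i]
            params[i] -= self.lr * update
        return params


params = [1.0, -2.0]
sketch = MomentumSketch(lr=0.1, momentum=0.9)
sketch.step(params, [0.5, 0.5])  # first step: v = g = 0.5
sketch.step(params, [0.5, 0.5])  # second step: v = 0.9 * 0.5 + 0.5 = 0.95
```

After the two steps the first parameter has moved from 1.0 to 0.855: a step of 0.05 followed by a step of 0.095.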

Code Example

import netcl.optim as opt

optimizer = opt.Momentum(model.parameters(),
                         lr=0.01, momentum=0.9,
                         weight_decay=0.0, nesterov=False)

With Nesterov acceleration (the recommended default for vision training):

optimizer = opt.Momentum(model.parameters(),
                         lr=0.1, momentum=0.9,
                         weight_decay=5e-4, nesterov=True)

A cosine learning-rate schedule pairs naturally:

scheduler = opt.lr_scheduler.CosineAnnealingLR(optimizer,
                                               T_max=90)
for epoch in range(90):
    train_one_epoch(...)
    scheduler.step()
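
For intuition about what the scheduler does to lr, the sketch below evaluates the standard cosine-annealing closed form. It assumes netcl's CosineAnnealingLR follows that formula; the function cosine_lr and the eta_min parameter name are illustrative, not taken from netcl.

```python
import math

def cosine_lr(step, base_lr=0.1, t_max=90, eta_min=0.0):
    """Cosine-annealed learning rate after `step` scheduler steps (0..t_max)."""
    cos_term = 1.0 + math.cos(math.pi * step / t_max)
    return eta_min + 0.5 * (base_lr - eta_min) * cos_term

# Learning rate at the start, the midpoint, and the end of a 90-epoch run.
for epoch in (0, 45, 90):
    print(epoch, round(cosine_lr(epoch), 6))
```

The learning rate starts at base_lr, passes through half of it at the midpoint, and reaches eta_min at step T_max.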

Performance & Trade-offs

  • Identical to SGD(momentum=0.9). The class is provided for readability; pick whichever name reads better at the call site.
  • The velocity buffer is fp32 even when the parameters are fp16 under AMP. The fused kernel is aware of the autocast context and casts appropriately.
  • Nesterov momentum (nesterov=True) typically gives a small but consistent speed-up over plain momentum on smooth, well-behaved loss surfaces such as those of vision classifiers. Many vision recipes enable it.
  • Momentum is not adaptive: it does not rescale the per-parameter learning rate the way Adam or RMSprop do. For problems where gradient magnitudes differ widely across parameters, switch to one of those.

When to use heavy-ball momentum

Heavy-ball momentum is at its best on smooth loss surfaces with low-to-moderate stochastic noise. The two regimes where it shines are:

  • Vision classifiers on ImageNet-style data. A combination of Momentum(lr=0.1, momentum=0.9, weight_decay=1e-4, nesterov=True) with a cosine schedule is still a workhorse recipe for ResNet and similar convolutional networks. The loss surface is well-behaved and the minibatch size is usually large enough (256 to 1024) that the per-step gradient noise is small.

  • Reinforcement learning policy gradients. Policy-gradient updates are notoriously noisy; a momentum buffer smooths the per-step updates and damps the oscillations that plague raw SGD. Common PPO and A2C implementations default to Adam or RMSprop, so treat Momentum here as an option to evaluate rather than a standard choice.

The two regimes where heavy-ball momentum is not the right choice:

  • Transformer-style attention models. Adaptive optimizers (AdamW, Lion) consistently outperform plain momentum on these losses. The per-parameter gradient magnitudes vary by orders of magnitude across the embedding, attention, and MLP blocks, and a single global learning rate applied to the velocity buffer cannot compensate for that variation.

  • Small-batch fine-tuning. With minibatch size 1 to 16, the per-step gradient is too noisy for the EMA in the velocity buffer to be useful. Switch to AdamW and a low learning rate.

Debugging tips

If training with Momentum diverges (the loss spikes after a warm-up period), the usual suspects are:

  • Learning rate too high. Try dividing the learning rate by 10 and re-running.
  • Missing weight decay. Without weight_decay, the parameters can drift to large magnitudes; set weight_decay=1e-4 and re-run.
  • Missing Nesterov on a convex loss. Switch nesterov=False to nesterov=True; the speed-up is small but the stability is real.
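
Why a too-large learning rate diverges can be seen on a toy one-dimensional quadratic, where heavy-ball momentum is stable only while lr * curvature < 2 * (1 + momentum). The sketch below is a simplified model, not a diagnostic for real networks, and run_quadratic is a hypothetical helper.

```python
def run_quadratic(lr, momentum=0.9, curvature=1.0, steps=200):
    """Minimise 0.5 * curvature * x**2 with heavy-ball momentum; return |x|."""
    x, v = 1.0, 0.0
    for _ in range(steps):
        g = curvature * x  # gradient of the quadratic at x
        v = momentum * v + g
        x -= lr * v
    return abs(x)

# With momentum=0.9 and curvature=1.0 the stability threshold is lr < 3.8.
diverged = run_quadratic(lr=4.0)   # just past the threshold: |x| blows up
converged = run_quadratic(lr=0.4)  # one tenth of that: |x| shrinks toward 0
```

On a real loss surface the curvature is unknown and varies during training, which is why a coarse factor-of-10 reduction is a reasonable first experiment.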

If the loss plateaus early, the usual suspects are:

  • Learning rate too low. Try multiplying by 2 and re-running.
  • Learning rate schedule too aggressive. A cosine schedule decays the learning rate to its minimum over T_max scheduler steps; if T_max is shorter than the run, the learning rate bottoms out well before training ends. Set T_max to the number of scheduler.step() calls in the run (the epoch count when stepping once per epoch).

See also