SGD

Status: Public API in netcl.optim.sgd.SGD (re-exported from netcl.optim)

Stochastic Gradient Descent with optional momentum and Nesterov acceleration. SGD is the simplest optimizer in netcl: it applies theta = theta - lr * (g + weight_decay * theta) per parameter, with an optional momentum buffer.

The SGD class in netcl supports three modes:

  • Vanilla SGD: no momentum. The update is theta -= lr * g.
  • SGD with momentum: v = momentum * v + g; theta -= lr * v.
  • SGD with Nesterov momentum: conceptually the momentum update with the gradient evaluated at the "look-ahead" point theta - lr * momentum * v. The kernel does not re-evaluate the gradient; it uses the common reformulation that keeps the same v and steps along g + momentum * v instead of v. This typically converges slightly faster than plain momentum.

Overview

SGD is the canonical baseline optimizer. It has the smallest memory footprint of any optimizer in netcl (no per-parameter state in the vanilla case, one momentum buffer per parameter otherwise) and the lowest per-step compute.
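
The memory claim above can be made concrete with a small estimate. This is a sketch, not part of netcl: the helper name sgd_state_bytes is hypothetical, and the 4-byte element size assumes float32 storage, which the page does not state.

```python
def sgd_state_bytes(num_elements, momentum=0.0, bytes_per_element=4):
    """Estimate optimizer state size in bytes (hypothetical helper).

    Vanilla SGD keeps no state; any non-zero momentum adds exactly one
    buffer with the same number of elements as the parameters.
    """
    buffers = 0 if momentum == 0.0 else 1
    return buffers * num_elements * bytes_per_element


# Example: a model with 25 million parameter elements, assumed float32.
n = 25_000_000
print(sgd_state_bytes(n))                 # vanilla: no optimizer state
print(sgd_state_bytes(n, momentum=0.9))   # one float32 buffer per element
```

Under these assumptions the momentum buffer costs as much memory as the parameters themselves, and the vanilla case costs nothing beyond them.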

Where It Lives

  • File path: optim/sgd.py.
  • Module path: netcl.optim.sgd.
  • Public re-export: from netcl.optim import SGD.

How It Works

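For each element i of a parameter, the kernel does:

if (momentum != 0) {
    // Update the momentum buffer with the raw gradient.
    v[i] = momentum * v[i] + g[i];
    // Nesterov steps along g + momentum * v; plain momentum steps along v.
    if (nesterov) g[i] += momentum * v[i];
    else g[i] = v[i];
}
// Weight decay is added after the momentum update, so it never enters v.
g[i] += weight_decay * param[i];
param[i] -= lr * g[i];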

This is a single fused kernel per parameter. The momentum buffer v is allocated lazily on the first step, exactly like Adam's moments.
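
The kernel logic above can be mirrored in plain Python to check values by hand. This is a reference sketch, not netcl's implementation: the function sgd_step and the state dict are hypothetical names, parameters are modelled as lists of floats, and unlike the kernel the sketch leaves the gradient list unmodified.

```python
def sgd_step(param, grad, state, lr, momentum=0.0, weight_decay=0.0,
             nesterov=False):
    """Apply one SGD step to `param` in place and return it.

    `param` and `grad` are equal-length lists of floats. `state` is a dict
    that holds the momentum buffer under the key "v" once it exists.
    """
    if momentum != 0.0 and "v" not in state:
        # Lazy allocation: the buffer is created on the first step only.
        state["v"] = [0.0] * len(param)
    for i in range(len(param)):
        g = grad[i]
        if momentum != 0.0:
            v = momentum * state["v"][i] + g
            state["v"][i] = v
            # Nesterov steps along g + momentum * v; plain momentum along v.
            g = g + momentum * v if nesterov else v
        # Weight decay is added after the momentum update, as in the kernel.
        g += weight_decay * param[i]
        param[i] -= lr * g
    return param


# Two momentum steps on a single scalar parameter with a constant gradient.
p, state = [1.0], {}
sgd_step(p, [0.5], state, lr=0.1, momentum=0.9)  # v = 0.5, step = 0.1 * 0.5
sgd_step(p, [0.5], state, lr=0.1, momentum=0.9)  # v = 0.9 * 0.5 + 0.5 = 0.95
print(p, state)
```

With momentum set to 0 the state dict stays empty, which matches the claim that vanilla SGD keeps no per-parameter state.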

Code Example

import netcl.optim as opt

optimizer = opt.SGD(
    model.parameters(),
    lr=0.1,            # typical for ResNet on ImageNet
    momentum=0.9,
    weight_decay=5e-4,
    nesterov=True,
)

A cosine learning-rate schedule pairs naturally:

scheduler = opt.lr_scheduler.CosineAnnealingLR(optimizer, T_max=90)
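
To see what such a schedule does to the learning rate, the standard closed-form cosine annealing formula can be evaluated directly. This is a sketch under assumptions: cosine_lr is a hypothetical helper, and whether netcl's CosineAnnealingLR follows exactly this formula or exposes a minimum learning rate (eta_min here) is not confirmed by this page.

```python
import math


def cosine_lr(step, base_lr, t_max, eta_min=0.0):
    """Closed-form cosine annealing from base_lr down to eta_min.

    `step` counts scheduler steps (for example epochs) from 0 to t_max.
    """
    cosine = math.cos(math.pi * step / t_max)
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + cosine)


# Learning rate at the start, midpoint, and end of a 90-step schedule.
for epoch in (0, 45, 90):
    print(epoch, cosine_lr(epoch, base_lr=0.1, t_max=90))
```

With base_lr=0.1 and T_max=90, the rate starts at 0.1, passes through roughly 0.05 at the midpoint, and reaches approximately 0 at the end.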

Performance & Trade-offs

  • The cheapest optimizer in netcl. In the vanilla case without weight decay, the fused kernel does one multiply-add per element per step; momentum, Nesterov, and weight decay each add one more.
  • Vanilla SGD is brittle on noisy gradients and rarely used in practice; momentum 0.9 is a near-universal default.
  • For ResNet-style vision training, SGD with Nesterov + a cosine schedule is still the most common recipe and is hard to beat with adaptive optimizers at the same parameter count.

See also