SGD

Status: Public API in netcl.optim.sgd.SGD (re-exported from netcl.optim)

Stochastic Gradient Descent with optional momentum and Nesterov acceleration. SGD is the simplest optimizer in netcl: it applies theta = theta - lr * (g + weight_decay * theta) per parameter, with an optional momentum buffer.

The SGD class in netcl supports three modes:

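Vanilla SGD: no momentum. The update is theta -= lr * g.
SGD with momentum: v = momentum * v + g; theta -= lr * v, where v is the per-parameter momentum buffer.
SGD with Nesterov momentum: the buffer update is the same as for momentum, but the step is taken along g + momentum * v instead of v. This is the usual buffer-based approximation of evaluating the gradient at the "look-ahead" point theta - lr * momentum * v. In practice it often gives a small speed-up in convergence over plain momentum.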
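
The three update rules can be traced by hand on a single scalar. The sketch below is plain Python for illustration, not netcl code: sgd_scalar_step is a hypothetical helper, and its Nesterov branch assumes the buffer-based form (a step along g + momentum * v) that the kernel in How It Works uses.

```python
def sgd_scalar_step(theta, g, v, lr, momentum=0.0, nesterov=False):
    """One SGD step on a scalar; returns the new (theta, v).

    Hypothetical helper for illustration only, not part of netcl.
    """
    if momentum == 0.0:
        # Vanilla SGD: step directly along the gradient, no state.
        return theta - lr * g, v
    v = momentum * v + g  # update the momentum buffer
    # Nesterov steps along g + momentum * v, plain momentum along v.
    step = g + momentum * v if nesterov else v
    return theta - lr * step, v


# Same starting point, gradient, and empty buffer for all three modes.
theta0, g0, lr = 1.0, 0.5, 0.1
vanilla, _ = sgd_scalar_step(theta0, g0, 0.0, lr)
heavy, v_heavy = sgd_scalar_step(theta0, g0, 0.0, lr, momentum=0.9)
nag, v_nag = sgd_scalar_step(theta0, g0, 0.0, lr, momentum=0.9, nesterov=True)
print(vanilla, heavy, nag)
```

On the first step plain momentum matches vanilla SGD because the buffer starts at zero, while the Nesterov variant already takes a larger step. The modes only diverge further once the buffer has accumulated history.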

Overview

SGD is the canonical baseline optimizer. It has the smallest memory footprint of any optimizer in netcl (no per-parameter state in the vanilla case, one momentum buffer per parameter otherwise) and the lowest per-step compute.
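
To make the memory claim concrete, the following back-of-the-envelope sketch counts optimizer state bytes. sgd_state_bytes is a hypothetical helper, not a netcl function, and the 4-byte element size assumes float32 buffers, which this page does not state.

```python
def sgd_state_bytes(num_elements, momentum=0.0, bytes_per_element=4):
    """Estimate SGD optimizer state size (illustrative, not a netcl API).

    Assumes float32 buffers (4 bytes per element) unless told otherwise.
    """
    # Vanilla SGD keeps no state; momentum adds one buffer
    # with the same number of elements as the parameters.
    buffers = 0 if momentum == 0.0 else 1
    return buffers * num_elements * bytes_per_element


n = 1_000_000  # total parameter elements in a hypothetical model
print(sgd_state_bytes(n))                # vanilla SGD
print(sgd_state_bytes(n, momentum=0.9))  # SGD with momentum
```

For comparison, an optimizer that keeps two moment buffers per parameter would need twice the momentum figure under the same assumptions.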

Where It Lives

File path: optim/sgd.py.
Module path: netcl.optim.sgd.
Public re-export: from netcl.optim import SGD.

How It Works

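For each element i of a parameter, the kernel does:

if (momentum != 0) {
    // Accumulate the gradient into the momentum buffer.
    v[i] = momentum * v[i] + g[i];
    // Nesterov steps along g + momentum * v; plain momentum steps along v.
    if (nesterov) g[i] += momentum * v[i];
    else g[i] = v[i];
}
// Weight decay is added after the momentum update, so it never enters v.
g[i] += weight_decay * param[i];
param[i] -= lr * g[i];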

This is a single fused kernel per parameter. The momentum buffer v is allocated lazily on the first step, exactly like Adam's moments.
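
The kernel logic and the lazy buffer allocation can be mirrored in plain Python for reference. This is a sketch under stated assumptions rather than netcl's implementation: ReferenceSGD is a hypothetical class, parameters are plain lists keyed by a caller-chosen name, and the gradient is copied instead of overwritten in place.

```python
class ReferenceSGD:
    """Plain-Python mirror of the per-element kernel (illustrative, not netcl code)."""

    def __init__(self, lr, momentum=0.0, weight_decay=0.0, nesterov=False):
        self.lr = lr
        self.momentum = momentum
        self.weight_decay = weight_decay
        self.nesterov = nesterov
        self.buffers = {}  # parameter name -> momentum buffer, filled lazily

    def step(self, name, param, grad):
        """Update the list `param` in place from the list `grad`."""
        g = list(grad)  # work on a copy so the caller's gradient is untouched
        if self.momentum != 0.0:
            v = self.buffers.get(name)
            if v is None:
                # Lazy allocation: the buffer is created on the first step only.
                v = self.buffers[name] = [0.0] * len(param)
            for i in range(len(param)):
                v[i] = self.momentum * v[i] + g[i]
                g[i] = g[i] + self.momentum * v[i] if self.nesterov else v[i]
        for i in range(len(param)):
            # Weight decay is added after the momentum update, as in the kernel.
            g[i] += self.weight_decay * param[i]
            param[i] -= self.lr * g[i]


opt_ref = ReferenceSGD(lr=0.1, momentum=0.9, weight_decay=0.01)
w = [1.0, -2.0]
state_before = dict(opt_ref.buffers)  # empty: nothing allocated yet
opt_ref.step("w", w, [0.5, 0.5])
print(state_before, w, opt_ref.buffers["w"])
```

With momentum set to zero the buffers dictionary stays empty, which matches the claim that vanilla SGD carries no per-parameter state.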

Code Example

import netcl.optim as opt

optimizer = opt.SGD(
    model.parameters(),
    lr=0.1,            # typical for ResNet on ImageNet
    momentum=0.9,
    weight_decay=5e-4,
    nesterov=True,
)

A cosine learning-rate schedule pairs naturally:

scheduler = opt.lr_scheduler.CosineAnnealingLR(optimizer, T_max=90)
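
For intuition about what such a schedule does to lr, here is a sketch of the standard closed-form cosine annealing rule. It assumes netcl's CosineAnnealingLR follows this usual formula with a minimum learning rate of zero, which this page does not confirm; cosine_lr is a hypothetical helper.

```python
import math


def cosine_lr(base_lr, step, t_max, eta_min=0.0):
    """Closed-form cosine annealing: base_lr at step 0, eta_min at step t_max."""
    cos_term = 1.0 + math.cos(math.pi * step / t_max)
    return eta_min + 0.5 * (base_lr - eta_min) * cos_term


# Learning rate at the start, midpoint, and end of a T_max=90 schedule.
for step in (0, 45, 90):
    print(step, cosine_lr(0.1, step, t_max=90))
```

Under these assumptions the rate starts at the configured lr, passes through half of it at the midpoint, and decays smoothly to the minimum at T_max.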

Performance & Trade-offs

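The cheapest optimizer in netcl. In the vanilla case without weight decay, the fused kernel does one multiply-add per parameter element per step; momentum, Nesterov, and weight decay each add one more.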
Vanilla SGD is brittle on noisy gradients and rarely used in practice; momentum 0.9 is a near-universal default.
For ResNet-style vision training, SGD with Nesterov momentum and a cosine schedule is still one of the most common recipes, and adaptive optimizers often struggle to beat it on the same model.
