RMSProp

Status: Public API in netcl.optim.rmsprop.RMSProp (re-exported from netcl.optim)

RMSProp (Hinton, unpublished lecture notes) is an adaptive learning-rate optimizer that maintains a per-parameter moving average of the squared gradient and divides each gradient by the square root of that average. The intuition: parameters that receive consistently large gradients get their effective learning rate reduced, and parameters that receive consistently small gradients get it increased.

v_t     = alpha * v_{t-1} + (1 - alpha) * g_t ** 2
theta_t = theta_{t-1} - lr * g_t / (sqrt(v_t) + eps)

The original formulation uses alpha = 0.9. RMSProp is closely related to Adam: it is roughly Adam without the first-moment term and without bias correction. It is a good default for recurrent networks and for fine-tuning, where you want gentle adaptation without momentum.
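
To make the update rule concrete, here is a minimal pure-Python sketch of a single step on a list of scalars. The function name rmsprop_step and the list-based layout are illustrative assumptions and are not part of netcl.

```python
import math

def rmsprop_step(theta, grad, v, lr=1e-3, alpha=0.9, eps=1e-8):
    """Apply one RMSProp update in place; return the step taken per parameter."""
    steps = []
    for i, g in enumerate(grad):
        # Moving average of the squared gradient.
        v[i] = alpha * v[i] + (1.0 - alpha) * g * g
        # Gradient scaled by the root of that average.
        step = lr * g / (math.sqrt(v[i]) + eps)
        theta[i] -= step
        steps.append(step)
    return steps

theta = [1.0, 1.0]
v = [0.0, 0.0]
grad = [100.0, 0.01]  # gradients four orders of magnitude apart
steps = rmsprop_step(theta, grad, v)
print(steps)
```

Starting from v = 0, both parameters move by roughly lr / sqrt(1 - alpha), about 0.00316 here, even though their raw gradients differ by a factor of 10,000. This is the normalization described above.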

Overview

Like Adam, RMSProp is stateful: a per-parameter v buffer is allocated lazily on the first step. The state is held in a dict keyed on id(parameter). The usual zero_grad() call on the user side clears only the gradients; the optimizer's own state survives across steps. This is intended: the moving average is meant to be persistent.
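
The state handling described above can be sketched in a few lines of pure Python. Param and TinyRMSProp are hypothetical stand-ins written for this illustration; they are not netcl classes, and the real optimizer stores device buffers rather than Python lists.

```python
import math

class Param:
    """Hypothetical parameter: a list of values plus a matching gradient list."""
    def __init__(self, data):
        self.data = list(data)
        self.grad = [0.0] * len(self.data)

class TinyRMSProp:
    """Illustrative optimizer mirroring the lazy, id-keyed state described above."""
    def __init__(self, params, lr=1e-3, alpha=0.99, eps=1e-8):
        self.params = list(params)
        self.lr, self.alpha, self.eps = lr, alpha, eps
        self.state = {}  # id(parameter) -> v buffer

    def step(self):
        for p in self.params:
            # Allocate the v buffer the first time this parameter is seen.
            v = self.state.setdefault(id(p), [0.0] * len(p.data))
            for i, g in enumerate(p.grad):
                v[i] = self.alpha * v[i] + (1.0 - self.alpha) * g * g
                p.data[i] -= self.lr * g / (math.sqrt(v[i]) + self.eps)

    def zero_grad(self):
        # Clears gradients only; self.state is deliberately left alone.
        for p in self.params:
            p.grad = [0.0] * len(p.data)

p = Param([1.0, -2.0])
opt = TinyRMSProp([p])
state_before_first_step = dict(opt.state)  # empty: nothing allocated yet
p.grad = [0.5, -0.5]
opt.step()
opt.zero_grad()
```

After zero_grad() the gradients are zero, but opt.state[id(p)] still holds the accumulated v. One consequence of keying on id() is that the entries are only meaningful while the original parameter objects stay alive, since Python guarantees id() uniqueness only among simultaneously existing objects.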

Where It Lives

File path: optim/rmsprop.py.
Module path: netcl.optim.rmsprop.
Public re-export: from netcl.optim import RMSProp (note capital P).

How It Works

For each parameter element i, the kernel computes:

v[i] = alpha * v[i] + (1 - alpha) * g[i] * g[i];
param[i] -= lr * g[i] / (sqrt(v[i]) + eps);

Optionally, coupled weight decay (the decay term is folded into the gradient, as in L2 regularization) is applied before the update. The momentum parameter (default 0) adds a Polyak-style momentum buffer on top of the per-parameter update; this is the "RMSProp with momentum" variant sometimes used for CIFAR.
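
This page does not spell out exactly how weight decay and momentum enter the kernel. The sketch below assumes a convention used by several common frameworks: weight decay is added to the gradient (g + weight_decay * param) before v is updated, and the momentum buffer accumulates the scaled gradient before lr is applied. The function name and the buf argument are illustrative, and netcl's kernel may differ in these details.

```python
import math

def rmsprop_momentum_step(param, grad, v, buf, lr=1e-3, alpha=0.99,
                          eps=1e-8, weight_decay=0.0, momentum=0.0):
    """One in-place step of RMSProp with optional weight decay and momentum."""
    for i in range(len(param)):
        # Assumed coupled weight decay: fold the decay term into the gradient.
        g = grad[i] + weight_decay * param[i]
        v[i] = alpha * v[i] + (1.0 - alpha) * g * g
        scaled = g / (math.sqrt(v[i]) + eps)
        if momentum > 0.0:
            # Assumed momentum form: accumulate scaled gradients, then apply lr.
            buf[i] = momentum * buf[i] + scaled
            param[i] -= lr * buf[i]
        else:
            param[i] -= lr * scaled

# Two steps with a constant gradient, with and without momentum.
plain, v_plain, buf_plain = [1.0], [0.0], [0.0]
heavy, v_heavy, buf_heavy = [1.0], [0.0], [0.0]
for _ in range(2):
    rmsprop_momentum_step(plain, [1.0], v_plain, buf_plain)
    rmsprop_momentum_step(heavy, [1.0], v_heavy, buf_heavy, momentum=0.9)
print(plain, heavy)
```

With a constant gradient, the first step is identical in both runs; from the second step on, the momentum buffer carries over part of the previous update, so the momentum run travels further.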

Code Example

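from netcl.optim import RMSProp

# `model` is assumed to be an existing model that exposes parameters().
optimizer = RMSProp(
    model.parameters(),
    lr=1e-3,           # base learning rate
    alpha=0.99,        # decay rate of the squared-gradient average v
    eps=1e-8,          # added to sqrt(v) to avoid division by zero
    weight_decay=0.0,  # coupled weight decay; 0 disables it
    momentum=0.0,      # 0 disables the momentum buffer
)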

Performance & Trade-offs

One state buffer (v) per parameter when momentum is 0, which is half the optimizer state of Adam; enabling momentum adds a second buffer.
Empirically less stable than Adam on transformer-style models but a good default for RNNs and for fine-tuning a pre-trained classifier.
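The eps term is critical when v is zero or close to it, for example for a parameter whose gradients have so far been zero or tiny: without it, the 1 / sqrt(v) term can blow up or divide by zero.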
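
The role of eps can be checked with a small scalar sketch. The helper names below are illustrative and not part of netcl. Pure Python raises ZeroDivisionError on a zero denominator; array libraries typically produce inf or NaN instead.

```python
import math

def rmsprop_delta(g, v_prev, lr=1e-3, alpha=0.99, eps=1e-8):
    """Return (delta, v) for one scalar step; delta is subtracted from the parameter."""
    v = alpha * v_prev + (1.0 - alpha) * g * g
    return lr * g / (math.sqrt(v) + eps), v

def fails_without_eps(g):
    """True if a first step with gradient g divides by zero when eps is 0."""
    try:
        rmsprop_delta(g, 0.0, eps=0.0)
    except ZeroDivisionError:
        return True
    return False

zero_grad_fails = fails_without_eps(0.0)     # 0 / 0: the gradient was exactly zero
underflow_fails = fails_without_eps(1e-200)  # g * g underflows to 0.0, so v == 0
safe_delta, _ = rmsprop_delta(0.0, 0.0)      # eps keeps the denominator positive
print(zero_grad_fails, underflow_fails, safe_delta)
```

Because v is updated with the current gradient before the division, v is at least (1 - alpha) * g ** 2, so the ratio g / sqrt(v) is bounded by 1 / sqrt(1 - alpha) whenever v is nonzero. The failure cases are therefore the ones where v is exactly zero, either because no gradient has arrived yet or because the squared gradient underflowed, and those are the cases eps guards against.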