RMSProp
Status: Public API in netcl.optim.rmsprop.RMSProp (re-exported from netcl.optim)
RMSProp (Hinton, unpublished lecture notes) is an adaptive learning
rate optimizer that maintains a per-parameter moving average of the
squared gradient and divides the gradient by its square root. The
intuition: parameters that receive large gradients get their effective
learning rate reduced, and parameters that receive small gradients
get it increased.
v_t = alpha * v_{t-1} + (1 - alpha) * g_t ** 2
theta_t = theta_{t-1} - lr * g_t / (sqrt(v_t) + eps)
The original formulation uses alpha = 0.9. RMSProp can be viewed as
Adam without the first-moment term and bias correction; it is a good
default for recurrent networks and for fine-tuning where you want
gentle adaptation without momentum.
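The two update equations above can be sketched in plain Python. This is a minimal illustration, not the netcl implementation: the function name rmsprop_step and the list-of-floats representation are assumptions made for the example.

```python
import math

def rmsprop_step(theta, grad, v, lr=1e-3, alpha=0.9, eps=1e-8):
    """Apply one RMSProp update to lists of floats; returns (new_theta, new_v).

    Hypothetical helper for illustration only (not part of netcl).
    """
    new_theta, new_v = [], []
    for p, g, v_prev in zip(theta, grad, v):
        # v_t = alpha * v_{t-1} + (1 - alpha) * g_t ** 2
        v_t = alpha * v_prev + (1.0 - alpha) * g * g
        new_v.append(v_t)
        # theta_t = theta_{t-1} - lr * g_t / (sqrt(v_t) + eps)
        new_theta.append(p - lr * g / (math.sqrt(v_t) + eps))
    return new_theta, new_v

# Two parameters whose gradients differ by four orders of magnitude.
theta = [1.0, 1.0]
grad = [100.0, 0.01]
v = [0.0, 0.0]

theta, v = rmsprop_step(theta, grad, v, lr=0.01)
steps = [1.0 - t for t in theta]  # how far each parameter moved
print(steps)
```

Because each gradient is divided by the root of its own running average, both parameters move by nearly the same amount on this first step even though their raw gradients are very different.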
Overview
Like Adam, RMSProp is stateful: a per-parameter
v buffer is allocated lazily on the first step. The state is held
in a dict keyed on id(parameter). The usual zero_grad() call on the
user side clears only the gradients; the optimizer's own state
survives across steps (this is correct: the moving average is meant
to be persistent).
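The lazy, id-keyed state described here might look roughly like the sketch below. The Param and LazyRMSPropState classes are hypothetical stand-ins invented for this example; the real netcl types and attribute names may differ.

```python
class Param:
    """Hypothetical stand-in for a parameter tensor (not the netcl type)."""
    def __init__(self, data):
        self.data = list(data)
        self.grad = [0.0] * len(self.data)

class LazyRMSPropState:
    """Holds one v buffer per parameter, created on first access."""
    def __init__(self):
        self._state = {}  # maps id(param) -> list of running averages

    def buffer_for(self, param):
        key = id(param)
        if key not in self._state:
            # First step for this parameter: allocate a zeroed buffer.
            self._state[key] = [0.0] * len(param.data)
        return self._state[key]

w = Param([1.0, 2.0])
state = LazyRMSPropState()
allocated_before = len(state._state)  # nothing allocated yet

v = state.buffer_for(w)               # allocated lazily here
v[0] = 0.5                            # pretend a step updated the average

w.grad = [0.0, 0.0]                   # conceptually what zero_grad() does
v_after = state.buffer_for(w)         # same buffer, value preserved
```

One consequence of keying on id(parameter) is that the mapping is only meaningful while the parameter object stays alive, since CPython may reuse an id after an object is garbage-collected.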
Where It Lives
- File path: optim/rmsprop.py.
- Module path: netcl.optim.rmsprop.
- Public re-export: from netcl.optim import RMSProp (note capital P).
How It Works
For each parameter, the kernel does:
v[i] = alpha * v[i] + (1 - alpha) * g[i] * g[i];
param[i] -= lr * g[i] / (sqrt(v[i]) + eps);
Optionally, coupled weight decay is added to the gradient before the
update. The momentum parameter (default 0) adds a Polyak-style
momentum buffer on top of the per-parameter update; this is the
"RMSProp with momentum" variant sometimes used for CIFAR.
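A scalar sketch of how the optional terms could combine is shown below. The exact ordering inside the netcl kernel is not specified here, so this assumes the common arrangement: weight decay is folded into the gradient first, and the momentum buffer accumulates the already-scaled gradient. The function name rmsprop_momentum_step is hypothetical.

```python
import math

def rmsprop_momentum_step(p, g, v, buf, lr=1e-3, alpha=0.99, eps=1e-8,
                          weight_decay=0.0, momentum=0.0):
    """One scalar RMSProp step with optional coupled decay and momentum.

    Assumed ordering (not confirmed against netcl): decay, then the
    moving average, then momentum on the scaled gradient.
    """
    # Coupled weight decay: add weight_decay * p to the gradient.
    g = g + weight_decay * p
    # Moving average of the squared (decayed) gradient.
    v = alpha * v + (1.0 - alpha) * g * g
    scaled = g / (math.sqrt(v) + eps)
    if momentum > 0.0:
        # Momentum buffer accumulates the scaled gradient.
        buf = momentum * buf + scaled
        p = p - lr * buf
    else:
        p = p - lr * scaled
    return p, v, buf

# Weight decay alone moves a parameter even when its gradient is zero.
p_decay, v_decay, _ = rmsprop_momentum_step(1.0, 0.0, 0.0, 0.0, weight_decay=0.1)

# With momentum, a non-empty buffer enlarges the step for the same gradient.
p_plain, _, _ = rmsprop_momentum_step(1.0, 1.0, 0.0, 0.0)
p_mom, _, buf_mom = rmsprop_momentum_step(1.0, 1.0, 0.0, 2.0, momentum=0.9)
```

Because the decay term passes through the same sqrt(v) scaling as the gradient, coupled decay behaves differently from a decoupled (AdamW-style) decay applied directly to the weights.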
Code Example
from netcl.optim import RMSProp

optimizer = RMSProp(
    model.parameters(),
    lr=1e-3,
    alpha=0.99,
    eps=1e-8,
    weight_decay=0.0,
    momentum=0.0,
)
Performance & Trade-offs
- One state buffer per parameter (when momentum is 0); half the optimizer-state memory of Adam.
- Empirically less stable than Adam on transformer-style models but a good default for RNNs and for fine-tuning a pre-trained classifier.
- The eps term is critical on small gradients: without it, the 1 / sqrt(v) term can blow up.
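The role of eps can be checked with a few lines of plain Python. This is an illustrative sketch with a hypothetical helper, scaled_grad, that mirrors the g / (sqrt(v) + eps) expression used in the update.

```python
import math

def scaled_grad(g, v, eps):
    """The per-element quantity g / (sqrt(v) + eps) from the update rule."""
    return g / (math.sqrt(v) + eps)

# A parameter that has only ever seen zero gradients has v == 0.
safe = scaled_grad(0.0, 0.0, eps=1e-8)  # well defined: no update

try:
    scaled_grad(0.0, 0.0, eps=0.0)      # 0 / 0 without eps
    raised = False
except ZeroDivisionError:
    raised = True

# With a tiny v, eps caps the multiplier near 1 / eps.
tiny_v = 1e-30
capped = 1.0 / (math.sqrt(tiny_v) + 1e-8)
uncapped = 1.0 / math.sqrt(tiny_v)
print(safe, raised, capped, uncapped)
```

With eps in place the multiplier applied to the gradient can never exceed 1 / eps, which bounds the effective learning rate at lr / eps in the worst case.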