netcl.amp — Mixed-Precision (autocast + GradScaler)
The amp API is the mixed-precision training layer. It exposes the two
pieces every half-precision recipe needs: a GradScaler that scales the
loss to avoid fp16 underflow and a thread-local autocast context manager
that flips inputs to fp16 inside the forward pass. Both pieces are device-aware — they
read the cl_khr_fp16 extension from the active
OpenCL device and silently degrade to fp32 when the
device does not support it.
Note: Top-level re-exports.
netcl/amp.py lives at the package root (not inside a sub-package), so every public symbol is reachable as netcl.amp.<name> and is also re-exported through netcl/__init__.py for an ergonomic from netcl import amp.
Symbol Table
| Symbol | Purpose |
|---|---|
| GradScaler | Dynamic loss scaler with inf/nan-aware step() |
| supports_fp16(queue) | Capability probe: True if the device advertises cl_khr_fp16 |
| autocast_enabled(profile_supports_fp16: bool) | Heuristic that returns True iff the device profile supports fp16 |
| autocast | Context manager that flips a thread-local autocast flag |
| is_autocast_enabled() | Query helper for the thread-local autocast flag |
| maybe_cast_tensor(t) | Idempotent dtype promotion: cast to fp16 if autocast is on and the device can take it |
| master_param(param) | FP32 master copy helper for Optimizer updates |
GradScaler
GradScaler is the centerpiece of mixed-precision training. It maintains a
single scalar scale (initial value 2**16) that the user multiplies the loss by
before backward, and then divides the gradients by after backward but before
opt.step(). The trick keeps the gradient magnitudes in the safe range of fp16 even
when the un-scaled gradients would underflow to zero.
from netcl.amp import GradScaler

scaler = GradScaler(
    init_scale=2.0**16,
    growth_factor=2.0,
    backoff_factor=0.5,
    growth_interval=2000,
    enabled=True,
)
| Field | Default | Purpose |
|---|---|---|
| init_scale | 2**16 | Initial value of the loss scale. |
| growth_factor | 2.0 | Multiplier applied to the scale after growth_interval clean steps. |
| backoff_factor | 0.5 | Multiplier applied to the scale on the first inf/nan step after a clean run. |
| growth_interval | 2000 | Number of consecutive clean steps before the scale is grown. |
| enabled | True | When False, every method is a no-op; useful for fp32-only runs. |
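The underflow that loss scaling guards against can be reproduced without any GPU. The following is a minimal sketch using only Python's standard struct module to round values to IEEE 754 half precision; the helper to_fp16 and the gradient value 1e-8 are hypothetical and are not part of netcl.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

SCALE = 2.0 ** 16      # same value as the documented init_scale default
tiny_grad = 1e-8       # hypothetical gradient, below fp16's smallest subnormal

unscaled = to_fp16(tiny_grad)          # stored directly in fp16: flushes to zero
scaled = to_fp16(tiny_grad * SCALE)    # stored after scaling: stays nonzero
recovered = scaled / SCALE             # unscale in higher precision

print(unscaled, scaled, recovered)
```

The unscaled value is lost entirely, while the scaled value survives the fp16 round trip and can be divided back down in higher precision with only a small rounding error.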
scale_loss(loss)
scaled = scaler.scale_loss(loss) # scaled.value == loss.value * scale
Returns a new Tensor scaled = loss * scale computed on the device via the
elementwise op. The original loss is untouched, so the same loss can be inspected for
logging without the scale being baked in. When scaler.enabled is False this returns
loss unchanged.
unscale_grads(params) → bool
found_inf = scaler.unscale_grads(model.parameters())
Pulls every param.grad to the host and checks for inf / nan with
np.any(~np.isfinite(...)). If clean, multiplies every grad in place by
1.0 / scale on the device. Returns True if any inf/nan was found, False otherwise.
The probe uses the same supports_fp16 capability check that the
autocast manager uses, so devices without cl_khr_fp16 support
are treated identically by both pieces.
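The check-then-unscale order described above can be sketched in plain Python. This is an illustration only: gradients are modelled as nested lists rather than device tensors, the function name toy_unscale_grads is hypothetical, and leaving the gradients untouched when inf/nan is found is an assumption based on the "if clean" wording.

```python
import math

def toy_unscale_grads(grads, scale):
    """Return True if any value is inf/nan; otherwise scale every grad by 1/scale in place."""
    found_inf = any(not math.isfinite(v) for g in grads for v in g)
    if not found_inf:
        inv = 1.0 / scale
        for g in grads:
            for i in range(len(g)):
                g[i] *= inv
    return found_inf

clean = [[65536.0, 131072.0], [-32768.0]]
bad = [[1.0, float("inf")], [2.0]]

clean_flag = toy_unscale_grads(clean, 2.0 ** 16)  # clean grads are unscaled in place
bad_flag = toy_unscale_grads(bad, 2.0 ** 16)      # bad grads are reported and left alone
```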
step(optimizer, params)
scaler.step(opt, model.parameters())
The recommended one-call form. It runs unscale_grads(params), then:
- If clean: calls optimizer.step() and increments an internal growth counter. When the counter reaches growth_interval, the scale is multiplied by growth_factor and the counter resets.
- If inf/nan: skips optimizer.step(), multiplies the scale by backoff_factor, and resets the growth counter to zero. The optimizer's zero_grad is not called here; that is the caller's job (or update() below).
When scaler.enabled is False, this is just optimizer.step().
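The growth and backoff rules above amount to a small state machine. The sketch below models only that bookkeeping, under the assumption that the inf/nan result is passed in as a boolean; ToyScaler and its fields are hypothetical names, not the netcl implementation.

```python
from dataclasses import dataclass

@dataclass
class ToyScaler:
    scale: float = 2.0 ** 16
    growth_factor: float = 2.0
    backoff_factor: float = 0.5
    growth_interval: int = 2000
    _clean_steps: int = 0

    def step(self, found_inf: bool) -> bool:
        """Update the scale; return True if the optimizer step should run."""
        if found_inf:
            # Back off and restart the clean-step count; the optimizer is skipped.
            self.scale *= self.backoff_factor
            self._clean_steps = 0
            return False
        self._clean_steps += 1
        if self._clean_steps >= self.growth_interval:
            # Enough consecutive clean steps: grow the scale and reset the count.
            self.scale *= self.growth_factor
            self._clean_steps = 0
        return True

toy = ToyScaler(growth_interval=3)  # short interval so growth is visible
history = []
for found_inf in [False, False, False, True, False]:
    ran = toy.step(found_inf)
    history.append((ran, toy.scale))
```

With growth_interval=3, the scale doubles after the third clean step and halves again on the step that reports inf/nan.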
update()
scaler.update()
A no-op kept for API compatibility with PyTorch's GradScaler. The real "update" logic
(growth / backoff) is inline in step().
supports_fp16
from netcl.amp import supports_fp16
capable = supports_fp16(queue)
Returns True if the device bound to queue advertises cl_khr_fp16 in its
extensions string. Returns False on any error (missing device, missing extension
string, etc.) — never raises.
This is the same probe as core.capabilities.device_profile(...).has_fp16;
use whichever import path feels more natural.
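A probe with the "never raises" contract described above could look like the following sketch. The attribute path queue.device.extensions is an assumption about the queue object, and the stand-in queues built with SimpleNamespace are hypothetical; the real supports_fp16 may be implemented differently.

```python
from types import SimpleNamespace

def probe_fp16(queue) -> bool:
    """Return True if queue.device.extensions lists cl_khr_fp16; never raise."""
    try:
        return "cl_khr_fp16" in queue.device.extensions.split()
    except Exception:
        # Missing device, missing extension string, wrong type: treat as unsupported.
        return False

# Stand-in queue objects for illustration only.
fp16_queue = SimpleNamespace(device=SimpleNamespace(extensions="cl_khr_fp64 cl_khr_fp16"))
fp32_queue = SimpleNamespace(device=SimpleNamespace(extensions="cl_khr_fp64"))

results = [probe_fp16(fp16_queue), probe_fp16(fp32_queue), probe_fp16(None)]
```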
autocast_enabled
from netcl.amp import autocast_enabled
should_autocast = autocast_enabled(profile_supports_fp16=True)
A trivial heuristic: returns the boolean it was given. The intent is that callers write
autocast_enabled(device_profile.supports_fp16) at the top of their forward pass, so the
decision is a single named function call rather than a bare boolean. The
autocast context manager is the actual mechanism that flips the thread-local
flag; this helper exists for code that prefers the function form.
autocast
from netcl.amp import autocast
with autocast(enabled=True):
    y = model(x)  # forward runs in fp16 where safe
autocast is a context manager that sets the thread-local
_AUTOCAST_ENABLED flag. The autograd ops in autograd/ops.py
read this flag via maybe_cast_tensor and cast their inputs to fp16 when
both the flag is on and the device can take fp16.
| Argument | Default | Purpose |
|---|---|---|
| enabled | True | When False, autocast is disabled inside the block: the flag is still set on entry, just to False. |
| device_queue | None | If given, the fp16 capability is re-probed via supports_fp16(device_queue) on __enter__; if the device cannot take fp16, the flag stays False even when enabled=True. |
The flag is thread-local and the context manager restores the prior value on
__exit__, so nested autocast regions compose cleanly.
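The save-and-restore behaviour that makes nesting safe can be sketched with the standard library alone. The names toy_autocast, is_enabled, and _state below are hypothetical, and the sketch omits the device_queue re-probe; it shows only the thread-local flag handling.

```python
import threading
from contextlib import contextmanager

_state = threading.local()  # each thread sees its own 'enabled' attribute

def is_enabled() -> bool:
    return getattr(_state, "enabled", False)

@contextmanager
def toy_autocast(enabled: bool = True):
    prev = is_enabled()          # remember the outer region's value
    _state.enabled = enabled
    try:
        yield
    finally:
        _state.enabled = prev    # restore on exit, even if the body raised

trace = []
with toy_autocast(True):
    trace.append(is_enabled())
    with toy_autocast(False):    # nested region temporarily disables autocast
        trace.append(is_enabled())
    trace.append(is_enabled())   # outer value is back
trace.append(is_enabled())

seen_in_thread = []
with toy_autocast(True):
    worker = threading.Thread(target=lambda: seen_in_thread.append(is_enabled()))
    worker.start()
    worker.join()                # the worker thread never entered a region
```

Because the flag lives in threading.local, a worker thread started inside an enabled region still sees the default value.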
is_autocast_enabled
is_autocast_enabled is the query helper for the same thread-local flag. It
is useful inside an op implementation that wants to be autocast-aware but is not invoked
through autocast directly.
from netcl.amp import autocast, is_autocast_enabled

with autocast(enabled=True):
    assert is_autocast_enabled() is True
assert is_autocast_enabled() is False
maybe_cast_tensor
maybe_cast_tensor is the idempotent dtype promotion the autograd ops
call. It is safe to apply unconditionally — when autocast is off, it returns the input
unchanged; when autocast is on but the device does not support fp16, it also returns
the input unchanged. When autocast is on and the device supports fp16 and the input
is currently float32, it returns a new Tensor with dtype="float16" and
the same logical shape. Other dtypes are left alone.
from netcl.amp import maybe_cast_tensor
y = maybe_cast_tensor(x) # fp16 if autocast is on and the device can take it
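The three-way condition described above (flag on, device capable, input is float32) can be written out as a small sketch. ToyTensor and toy_maybe_cast are hypothetical, and the two booleans are passed explicitly here, whereas the real helper is described as reading the thread-local flag and the device capability itself.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class ToyTensor:
    shape: tuple
    dtype: str

def toy_maybe_cast(t: ToyTensor, autocast_on: bool, device_has_fp16: bool) -> ToyTensor:
    """Return an fp16 copy only when all three conditions hold; otherwise return t itself."""
    if autocast_on and device_has_fp16 and t.dtype == "float32":
        return replace(t, dtype="float16")  # new tensor, same logical shape
    return t

x = ToyTensor(shape=(2, 3), dtype="float32")
idx = ToyTensor(shape=(2,), dtype="int32")

cast_once = toy_maybe_cast(x, True, True)
cast_twice = toy_maybe_cast(cast_once, True, True)  # already fp16: returned unchanged
```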
master_param
master_param is the helper that an Optimizer calls to keep an
fp32 master copy of any fp16 parameter. When the parameter is already fp32, it is
returned unchanged (with a back-reference master._model_param = master). When it is
fp16, the helper returns a new fp32 Tensor whose _model_param points back at
the original fp16 parameter, so the optimizer can update the master in fp32 and the
next forward can copy the master back into the fp16 model parameter.
The requires_grad and _frozen flags are preserved on the master copy.
from netcl.amp import master_param
master = master_param(model.fc1.weight) # if weight is fp16, master is fp32
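The two branches described above can be illustrated with a toy parameter class. ToyParam and toy_master_param are hypothetical stand-ins; only the _model_param, requires_grad, and _frozen attribute names come from the description, and the data is a plain list rather than a device buffer.

```python
class ToyParam:
    def __init__(self, data, dtype, requires_grad=True, frozen=False):
        self.data = list(data)
        self.dtype = dtype
        self.requires_grad = requires_grad
        self._frozen = frozen
        self._model_param = None

def toy_master_param(param: ToyParam) -> ToyParam:
    """Return an fp32 master for param, linked back through _model_param."""
    if param.dtype == "float32":
        param._model_param = param   # already fp32: the master is the parameter itself
        return param
    master = ToyParam(param.data, "float32", param.requires_grad, param._frozen)
    master._model_param = param      # back-reference to the original fp16 parameter
    return master

w16 = ToyParam([0.5, -1.0], "float16", requires_grad=True, frozen=True)
w32 = ToyParam([0.25], "float32")

m16 = toy_master_param(w16)
m32 = toy_master_param(w32)
```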
Recommended Pattern
The canonical netcl training step ties autocast, GradScaler,
and the Tape together:
import netcl.autograd as ag
import netcl.amp as amp
from netcl.optim import Adam
opt = Adam(model.parameters(), lr=1e-3)
scaler = amp.GradScaler()
for x, y in loader:
    with ag.Tape() as tape:
        with amp.autocast(enabled=True):
            pred = model(x)
            loss = ag.cross_entropy(pred, y)
        scaled = scaler.scale_loss(loss)  # multiply by `scale` on-device
        tape.backward(scaled)             # backward of (loss * scale)
    # Optimizer step (with inf/nan guard + scale update)
    scaler.step(opt, model.parameters())
    opt.zero_grad()
The same loop without AMP uses autocast(enabled=False) and makes no scaler calls.
The minimal-diff property is intentional: adding AMP to an existing
netcl training loop should require two extra lines (the autocast block and the
scaler.scale_loss call) and zero refactoring of the model or the Tape
code.
See also
- MNIST with MLP: the worked example that uses GradScaler and autocast end-to-end.
- Tensor: the value type that maybe_cast_tensor returns.
- Optimizer: the parameter-update step that GradScaler.step wraps.
- optim API: the optimizer family that consumes master_param when the weights are fp16.
- OpenCLBackend: the OpenCL transport that supports_fp16 probes.
- JIT Compiler: the kernel-fusion path that autocast's dtype promotion flows into.
- Understanding Autograd: the Tape / grad-flow story that the scaler's unscale_grads and step are designed against.