netcl.amp — Mixed-Precision (autocast + GradScaler)

The amp API is netcl's mixed-precision training layer. It exposes the two pieces every half-precision recipe needs: a GradScaler that scales the loss to avoid fp16 gradient underflow, and an autocast context manager that sets a thread-local flag telling ops to cast their inputs to fp16 inside the forward pass. Both pieces are device-aware: they read the cl_khr_fp16 extension from the active OpenCL device and silently fall back to fp32 when the device does not support it.

Note: top-level re-exports. netcl/amp.py lives at the package root (not inside a sub-package), so every public symbol is reachable as netcl.amp.<name>. The module is also re-exported through netcl/__init__.py, so from netcl import amp works.

Symbol Table

| Symbol | Purpose |
| --- | --- |
| GradScaler | Dynamic loss scaler with inf/nan-aware step() |
| supports_fp16(queue) | Capability probe: True if the device advertises cl_khr_fp16 |
| autocast_enabled(profile_supports_fp16: bool) | Heuristic that returns True iff the device profile supports fp16 |
| autocast | Context manager that flips a thread-local autocast flag |
| is_autocast_enabled() | Query helper for the thread-local autocast flag |
| maybe_cast_tensor(t) | Idempotent dtype cast: converts to fp16 if autocast is on and the device can take it |
| master_param(param) | FP32 master-copy helper for Optimizer updates |

GradScaler

GradScaler is the centerpiece of mixed-precision training. It maintains a single scalar scale (initial value 2**16). The loss is multiplied by scale before backward, and the gradients are divided by scale after backward but before opt.step(). This keeps gradient magnitudes within fp16's representable range even when the unscaled gradients would underflow to zero.

from netcl.amp import GradScaler

scaler = GradScaler(
    init_scale=2.0**16,
    growth_factor=2.0,
    backoff_factor=0.5,
    growth_interval=2000,
    enabled=True,
)

| Field | Default | Purpose |
| --- | --- | --- |
| init_scale | 2**16 | Initial value of the loss scale. |
| growth_factor | 2.0 | Multiplier applied to the scale after growth_interval consecutive clean steps. |
| backoff_factor | 0.5 | Multiplier applied to the scale on any step where inf/nan gradients are found. |
| growth_interval | 2000 | Number of consecutive clean steps before the scale is grown. |
| enabled | True | When False, scaling is disabled: scale_loss returns the loss unchanged and step just calls optimizer.step(). Useful for fp32-only runs. |
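
To make the underflow argument concrete, the following is a minimal standalone sketch (not netcl code) that rounds a small gradient value to half precision with and without the default 2**16 scale. It uses only the standard library's struct module, whose "e" format is IEEE half precision; the helper name to_fp16 and the sample value 1e-8 are illustrative assumptions.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

scale = 2.0 ** 16
tiny_grad = 1e-8                      # below fp16's smallest subnormal (~6e-8)

unscaled = to_fp16(tiny_grad)         # flushes to zero in fp16
scaled = to_fp16(tiny_grad * scale)   # ~6.55e-4, a normal fp16 value
recovered = scaled / scale            # unscale in higher precision
```

The unscaled value is lost entirely, while the scaled one survives the fp16 round trip and is recovered to within fp16's relative precision after dividing by the scale.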

scale_loss(loss)

scaled = scaler.scale_loss(loss)        # scaled.value == loss.value * scale

Returns a new Tensor scaled = loss * scale computed on the device via the elementwise op. The original loss is untouched, so the same loss can be inspected for logging without the scale being baked in. When scaler.enabled is False this returns loss unchanged.

unscale_grads(params) -> bool

found_inf = scaler.unscale_grads(model.parameters())

Pulls every param.grad to the host and checks for inf / nan with np.any(~np.isfinite(...)). If clean, multiplies every grad in place by 1.0 / scale on the device. Returns True if any inf/nan was found, False otherwise. The probe uses the same supports_fp16 capability check that the autocast manager uses, so devices without cl_khr_fp16 support are treated identically by both pieces.
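
The check-then-unscale behaviour described above can be sketched in plain Python. This is a toy model under stated assumptions, not the netcl implementation: gradients are nested lists of floats rather than device tensors, and math.isfinite stands in for the np.isfinite probe.

```python
import math

def unscale_grads(grads, scale):
    """Toy model: `grads` is a list of lists of floats, modified in place.

    Returns True if any inf/nan was found (grads are then left untouched).
    """
    found_inf = any(not math.isfinite(g) for grad in grads for g in grad)
    if not found_inf:
        inv = 1.0 / scale
        for grad in grads:
            for i, g in enumerate(grad):
                grad[i] = g * inv
    return found_inf

clean = [[65536.0, -131072.0], [32768.0]]
bad = [[1.0, float("inf")], [2.0]]
clean_result = unscale_grads(clean, 2.0 ** 16)
bad_result = unscale_grads(bad, 2.0 ** 16)
```

Note that in this sketch a single non-finite value anywhere leaves every gradient scaled, which matches the "if clean" wording above.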

step(optimizer, params)

scaler.step(opt, model.parameters())

The recommended one-call form. It runs unscale_grads(params), then:

  • If clean: calls optimizer.step() and increments an internal growth counter. When the counter reaches growth_interval, the scale is multiplied by growth_factor and the counter resets.
  • If inf/nan: skips optimizer.step(), multiplies the scale by backoff_factor, and resets the growth counter to zero. The optimizer's zero_grad is not called here; that is the caller's job.

When scaler.enabled is False, this is just optimizer.step().
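
The growth/backoff rules above amount to a small state machine. The sketch below models only that bookkeeping; ToyScaler is a hypothetical stand-in that takes the found_inf result directly instead of an optimizer and parameters, and it uses a small growth_interval so the behaviour is visible in a few steps.

```python
class ToyScaler:
    """Hypothetical model of the scale bookkeeping described for step()."""

    def __init__(self, init_scale=2.0 ** 16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self._clean_steps = 0

    def step(self, found_inf: bool) -> bool:
        """Return True if the optimizer step should run."""
        if found_inf:
            self.scale *= self.backoff_factor   # back off and skip the step
            self._clean_steps = 0
            return False
        self._clean_steps += 1
        if self._clean_steps >= self.growth_interval:
            self.scale *= self.growth_factor    # grow after a clean run
            self._clean_steps = 0
        return True

scaler = ToyScaler(init_scale=8.0, growth_interval=3)
history = []
for found_inf in [False, False, False, True, False]:
    ran = scaler.step(found_inf)
    history.append((ran, scaler.scale))
```

Three clean steps double the scale from 8 to 16, and the following inf/nan step halves it back to 8 and skips the optimizer update.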

update()

scaler.update()

A no-op kept for API compatibility with PyTorch's GradScaler. The real "update" logic (growth / backoff) is inline in step().

supports_fp16

from netcl.amp import supports_fp16
capable = supports_fp16(queue)

Returns True if the device bound to queue advertises cl_khr_fp16 in its extensions string. Returns False on any error (missing device, missing extension string, etc.) — never raises.

This is the same probe as core.capabilities.device_profile(...).has_fp16; use whichever import path feels more natural.
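
A never-raising probe of this kind could look roughly like the sketch below. It assumes a pyopencl-style queue whose device exposes a space-separated extensions string; this page does not say how netcl reads the string, so the attribute path and the name supports_fp16_sketch are assumptions. SimpleNamespace objects stand in for real queues.

```python
from types import SimpleNamespace

def supports_fp16_sketch(queue) -> bool:
    """Return True if the (assumed) extensions string lists cl_khr_fp16."""
    try:
        return "cl_khr_fp16" in queue.device.extensions.split()
    except Exception:
        # Missing device, missing attribute, wrong type: report "no fp16".
        return False

# Stand-in queues for illustration only.
fp16_queue = SimpleNamespace(
    device=SimpleNamespace(extensions="cl_khr_fp64 cl_khr_fp16"))
fp32_queue = SimpleNamespace(
    device=SimpleNamespace(extensions="cl_khr_fp64"))
```

Splitting the string before the membership test avoids matching a longer extension name that merely contains cl_khr_fp16 as a substring.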

autocast_enabled

from netcl.amp import autocast_enabled
should_autocast = autocast_enabled(profile_supports_fp16=True)

A trivial heuristic: returns the boolean it was given. The intent is that callers write autocast_enabled(device_profile.supports_fp16) at the top of their forward pass, so the decision is a single named function call rather than a bare boolean. The autocast context manager is the actual mechanism that flips the thread-local flag; this helper exists for code that prefers the function form.

autocast

from netcl.amp import autocast

with autocast(enabled=True):
    y = model(x)         # forward runs in fp16 where safe

autocast is a context manager that sets the thread-local _AUTOCAST_ENABLED flag. The autograd ops in autograd/ops.py read this flag via maybe_cast_tensor and cast their inputs to fp16 when both the flag is on and the device can take fp16.

| Argument | Default | Purpose |
| --- | --- | --- |
| enabled | True | When False, the flag is set to False for the duration of the block, so autocast is off inside it. |
| device_queue | None | If given, the fp16 capability is re-probed via supports_fp16(device_queue) on __enter__; if the device cannot take fp16, the flag stays False even when enabled=True. |

The flag is thread-local and the context manager restores the prior value on __exit__, so nested autocast regions compose cleanly.
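
The save-and-restore pattern behind this can be sketched with threading.local and contextlib. This is an illustrative model rather than the netcl source: the names _state, is_enabled, and autocast_sketch are hypothetical, and the device_queue probe is omitted.

```python
import threading
from contextlib import contextmanager

_state = threading.local()

def is_enabled() -> bool:
    # Threads that never entered a region see the default, False.
    return getattr(_state, "enabled", False)

@contextmanager
def autocast_sketch(enabled: bool = True):
    previous = is_enabled()
    _state.enabled = enabled
    try:
        yield
    finally:
        _state.enabled = previous      # restore, even if the body raised

seen = []
with autocast_sketch(True):
    seen.append(is_enabled())          # outer region: on
    with autocast_sketch(False):
        seen.append(is_enabled())      # inner region: off
    seen.append(is_enabled())          # outer value restored
    worker = threading.Thread(target=lambda: seen.append(is_enabled()))
    worker.start()
    worker.join()                      # other thread is unaffected
seen.append(is_enabled())              # back to the default
```

Restoring the previous value rather than resetting to False is what lets an inner enabled=False region sit inside an outer enabled=True region without clobbering it.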

is_autocast_enabled

is_autocast_enabled is the query helper for the same thread-local flag. It is useful inside an op implementation that wants to be autocast-aware but is not invoked through autocast directly.

from netcl.amp import autocast, is_autocast_enabled

with autocast(enabled=True):
    assert is_autocast_enabled() is True
# On exit the prior value is restored (False unless an outer autocast is active).
assert is_autocast_enabled() is False

maybe_cast_tensor

maybe_cast_tensor is the idempotent dtype cast the autograd ops call. It is safe to apply unconditionally: when autocast is off, or when autocast is on but the device does not support fp16, it returns the input unchanged. When autocast is on, the device supports fp16, and the input is currently float32, it returns a new Tensor with dtype="float16" and the same logical shape. Other dtypes are left alone.

from netcl.amp import maybe_cast_tensor
y = maybe_cast_tensor(x)            # fp16 if autocast is on and the device can take it
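
The decision rule above reduces to a three-way conjunction. The sketch below captures only the dtype decision, with dtypes as plain strings; target_dtype is a hypothetical helper, and the real function operates on Tensors and reads the flag and device capability itself.

```python
def target_dtype(dtype: str, autocast_on: bool, device_fp16: bool) -> str:
    """Return the dtype a tensor would have after the cast decision."""
    if autocast_on and device_fp16 and dtype == "float32":
        return "float16"
    return dtype                       # everything else passes through

once = target_dtype("float32", autocast_on=True, device_fp16=True)
twice = target_dtype(once, autocast_on=True, device_fp16=True)
```

Applying the rule a second time leaves the result at float16, which is the idempotence the description refers to.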

master_param

master_param is the helper that an Optimizer calls to keep an fp32 master copy of any fp16 parameter. When the parameter is already fp32, it is returned unchanged, with _model_param set to point at itself (master._model_param = master). When it is fp16, master_param returns a new fp32 Tensor whose _model_param points back at the original fp16 parameter, so the optimizer can update the master in fp32 and the master can be copied back into the fp16 model parameter before the next forward pass.

The requires_grad and _frozen flags are preserved on the master copy.

from netcl.amp import master_param
master = master_param(model.fc1.weight)   # if weight is fp16, master is fp32
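
The two branches can be modelled with a small stand-in class. ToyTensor and master_param_sketch are hypothetical names for illustration; only the dtype, requires_grad, _frozen, and _model_param attributes mentioned on this page are modelled, and the data is a plain list.

```python
class ToyTensor:
    """Hypothetical stand-in carrying only the attributes discussed here."""

    def __init__(self, data, dtype="float32", requires_grad=True, frozen=False):
        self.data = list(data)
        self.dtype = dtype
        self.requires_grad = requires_grad
        self._frozen = frozen
        self._model_param = None

def master_param_sketch(param):
    if param.dtype == "float32":
        param._model_param = param     # fp32 parameter is its own master
        return param
    # fp16 parameter: make an fp32 copy that remembers where it came from.
    master = ToyTensor(param.data, "float32", param.requires_grad, param._frozen)
    master._model_param = param
    return master

half = ToyTensor([0.5, -1.0], dtype="float16", requires_grad=True, frozen=True)
full = ToyTensor([0.5, -1.0], dtype="float32")
half_master = master_param_sketch(half)
full_master = master_param_sketch(full)
```

In this model the optimizer can always follow master._model_param to find the tensor the model actually uses, regardless of which branch was taken.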

The canonical netcl training step ties autocast, GradScaler, and the Tape together:

import netcl.autograd as ag
import netcl.amp as amp
from netcl.optim import Adam

opt = Adam(model.parameters(), lr=1e-3)
scaler = amp.GradScaler()

for x, y in loader:
    with ag.Tape() as tape:
        with amp.autocast(enabled=True):
            pred = model(x)
            loss = ag.cross_entropy(pred, y)
        scaled = scaler.scale_loss(loss)         # multiply by `scale` on-device

    tape.backward(scaled)                         # backward of (loss * scale)

    # Optimizer step (with inf/nan guard + scale update)
    scaler.step(opt, model.parameters())

    opt.zero_grad()

The same loop without AMP uses with amp.autocast(enabled=False): (or drops the block), calls tape.backward(loss) directly, and calls opt.step() instead of scaler.step(...). The minimal-diff property is intentional: adding AMP to an existing netcl training loop touches only a few lines (constructing the GradScaler, the autocast block, the scaler.scale_loss call, and scaler.step in place of opt.step) and requires no refactoring of the model or the Tape code.

See also