mse_loss

Status: Public API in netcl.nn.loss.mse_loss and netcl.autograd.ops.mse_loss

mse_loss is the mean squared error loss, the standard regression loss for problems where the target is a continuous vector. It is the average over the batch of the per-example squared L2 distance between the prediction and the target.

For a batch of size N and per-example prediction / target vectors of shape C:

loss = (1 / N) * sum_n sum_c (pred[n, c] - target[n, c]) ** 2
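
As a sanity check on the formula, here is a minimal NumPy reference with a worked example. It is not netcl's kernel; the name mse_reference is hypothetical and exists only for this illustration. It follows the normalisation exactly as written above (divide by the batch size N).

```python
import numpy as np

def mse_reference(pred: np.ndarray, target: np.ndarray) -> float:
    """Plain NumPy version of the formula: sum of squared errors divided by N."""
    if pred.shape != target.shape:
        raise ValueError("pred and target must have the same shape")
    n = pred.shape[0]                  # batch size N
    sq = (pred - target) ** 2          # per-element squared error, shape (N, C)
    return float(sq.sum() / n)         # sum over n and c, then divide by N

pred = np.array([[1.0, 2.0], [3.0, 5.0]])
target = np.array([[0.0, 2.0], [3.0, 3.0]])

# Squared errors are [[1, 0], [0, 4]]; their sum is 5 and N is 2.
loss = mse_reference(pred, target)     # 5 / 2 = 2.5
```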

netcl implements mse_loss as a fused OpenCL kernel: a single launch that computes the per-element difference, the per-element square, and the reduction, in one pass. The fused implementation is in ops/reduction.py; the autograd registration is in autograd/ops.py.

Overview

mse_loss is one of the most frequently called ops in regression training. The fused kernel is the reason a netcl training loop can hit high GPU utilisation even on tiny batches — the launch overhead is paid once, not three times (one for the difference, one for the square, one for the reduction).

The loss has three reduction modes:

  • reduction="mean" (default) — average the per-example losses over the batch (and over the per-example shape).
  • reduction="sum" — sum the per-example losses.
  • reduction="none" — return the per-example loss as a tensor of the same shape as the prediction (no reduction).
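
The three modes can be sketched in NumPy as below. This is an illustration, not netcl's implementation: mse_with_reduction is a hypothetical helper, and the "mean" branch assumes the divisor N from the formula at the top of the page (an implementation that also averages over the per-example shape would divide by N * C instead).

```python
import numpy as np

def mse_with_reduction(pred, target, reduction="mean"):
    """Hypothetical NumPy stand-in showing how the three modes relate."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    sq = (pred - target) ** 2          # unreduced squared errors, same shape as pred
    if reduction == "none":
        return sq
    if reduction == "sum":
        return sq.sum()
    if reduction == "mean":
        return sq.sum() / pred.shape[0]  # assumption: divide by N, as in the formula
    raise ValueError(f"unknown reduction: {reduction!r}")

pred = [[1.0, 2.0], [3.0, 5.0]]
target = [[0.0, 2.0], [3.0, 3.0]]

unreduced = mse_with_reduction(pred, target, "none")   # shape (2, 2)
total = mse_with_reduction(pred, target, "sum")        # 5.0
mean = mse_with_reduction(pred, target, "mean")        # 2.5
```

Under this assumption the modes are related by total == unreduced.sum() and mean == total / N.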

Where It Lives

  • File path: nn/loss.py (mse_loss), autograd/ops.py (autograd registration), ops/reduction.py (fused OpenCL kernel).
  • Module path: netcl.nn.functional.mse_loss (functional API), netcl.autograd.ops.mse_loss (autograd-aware op).

How It Works

The fused kernel is a single launch with two steps:

  1. Per-element pass: for each (n, c) index, compute d = pred[n, c] - target[n, c], then sq = d * d. Write the per-element sq to a small intermediate buffer.
  2. Reduction pass: tree-reduce the intermediate buffer to a single scalar; divide by N (or N * C for mean).
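
The two steps above can be mimicked on the host. The sketch below is a NumPy emulation, not the OpenCL kernel: tree_reduce_sum and mse_two_pass are hypothetical names, the fp32 buffer type is an assumption, and the final division uses N as in the formula at the top of the page.

```python
import numpy as np

def tree_reduce_sum(values: np.ndarray) -> np.float32:
    """Pairwise (tree) sum: fold the buffer in half each round."""
    buf = values.astype(np.float32).ravel().copy()
    if buf.size == 0:
        return np.float32(0.0)
    while buf.size > 1:
        if buf.size % 2 == 1:
            # Pad odd lengths with the additive identity so the halves match.
            buf = np.append(buf, np.float32(0.0))
        half = buf.size // 2
        buf = buf[:half] + buf[half:]  # one reduction round
    return buf[0]

def mse_two_pass(pred: np.ndarray, target: np.ndarray) -> np.float32:
    # Step 1: per-element pass, producing the intermediate buffer.
    d = pred - target
    sq = d * d
    # Step 2: reduction pass, then divide by N (assumed, per the formula).
    total = tree_reduce_sum(sq)
    return total / np.float32(pred.shape[0])

pred = np.array([[1.0, 2.0], [3.0, 5.0]])
target = np.array([[0.0, 2.0], [3.0, 3.0]])
loss = mse_two_pass(pred, target)      # (1 + 0 + 0 + 4) / 2
```

A tree reduction needs about log2(n) rounds for n elements, which is why it maps well onto the parallel work-items of a GPU.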

The backward pass is similarly fused: it computes 2 * (pred - target) / N (or the analogous sum-form expression) and registers the corresponding grad_fn on the Tape.
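
The gradient expression 2 * (pred - target) / N can be checked against central finite differences. The following is a NumPy-only sketch under the assumption that the forward pass divides by N as in the formula at the top of the page; the helper names are hypothetical and it does not touch netcl's Tape.

```python
import numpy as np

def mse(pred, target):
    """Forward pass: sum of squared errors divided by the batch size N."""
    return ((pred - target) ** 2).sum() / pred.shape[0]

def mse_grad(pred, target):
    """Analytic gradient of mse with respect to pred."""
    return 2.0 * (pred - target) / pred.shape[0]

def numerical_grad(f, x, eps=1e-6):
    """Central finite differences, perturbing one element at a time."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(x.shape):
        orig = x[idx]
        x[idx] = orig + eps
        hi = f(x)
        x[idx] = orig - eps
        lo = f(x)
        x[idx] = orig                  # restore the element
        grad[idx] = (hi - lo) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
pred = rng.normal(size=(4, 3))
target = rng.normal(size=(4, 3))

analytic = mse_grad(pred, target)
numeric = numerical_grad(lambda p: mse(p, target), pred.copy())
max_abs_diff = np.abs(analytic - numeric).max()   # should be near rounding error
```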

Code Example

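import numpy as np
import netcl as nc
import netcl.nn.functional as F

# Example host data: any float arrays with matching shape (N, C) will do.
numpy_pred   = np.random.rand(8, 4).astype(np.float32)
numpy_target = np.random.rand(8, 4).astype(np.float32)

pred   = nc.Tensor.from_host(numpy_pred)         # shape (N, C)
target = nc.Tensor.from_host(numpy_target)       # shape (N, C)

loss = F.mse_loss(pred, target, reduction="mean")  # scalar loss
loss.backward()                                    # run the fused backward pass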

A minimal training step with mse_loss:

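# Assumes F is imported as above; opt (the optimizer module), model and
# dataloader are defined elsewhere and are not shown on this page.
optimizer = opt.Adam(model.parameters(), lr=1e-3)
for x, y in dataloader:
    optimizer.zero_grad()          # clear gradients from the previous step
    pred = model(x)                # forward pass
    loss = F.mse_loss(pred, y)     # reduction="mean" is the default
    loss.backward()                # backward pass through the fused op
    optimizer.step()               # update the parameters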
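
To see what such a loop does numerically without netcl, here is a NumPy-only sketch of the same cycle for a linear model. It is an illustration under stated assumptions: plain gradient descent stands in for Adam, the data are synthetic and noiseless, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 3))                    # inputs, shape (N, features)
true_w = np.array([[1.5], [-2.0], [0.5]])
y = x @ true_w                                  # noiseless targets, shape (N, 1)

w = np.zeros((3, 1))                            # parameters to learn
lr = 0.1                                        # learning rate
losses = []

for step in range(200):
    pred = x @ w                                # forward pass
    diff = pred - y
    losses.append(float((diff ** 2).sum() / x.shape[0]))  # MSE, divided by N
    grad_pred = 2.0 * diff / x.shape[0]         # dL/dpred
    grad_w = x.T @ grad_pred                    # chain rule: dL/dw
    w -= lr * grad_w                            # gradient descent update
```

The loss recorded in losses shrinks as w approaches true_w.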

For the unreduced loss (used in some metric-learning recipes):

elementwise = F.mse_loss(pred, target, reduction="none")
# shape: (N, C), the same shape as pred (no reduction is applied)

Performance & Trade-offs

  • The fused kernel is about 3x faster than a naive ((pred - target) ** 2).mean() chain, because it pays the launch overhead once and writes the intermediate buffer to on-chip memory.
  • mse_loss is sensitive to outliers: a single bad example with a large error dominates the gradient. For problems where outliers are common, use Huber loss (smooth_l1_loss) instead.
  • Under AMP, the loss runs in fp16 but the accumulator is fp32. The fused kernel is aware of the autocast context and switches to the fp32 kernel if the loss is fp32-typed.
  • For very small per-example shapes (C = 1), the launch overhead dominates the compute. The kernel is still correct, but the speed-up over a manual chain is small.
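
Two of the notes above can be illustrated with small NumPy experiments. These are sketches with made-up numbers, not measurements of netcl: the Huber threshold delta = 1.0 is an assumption, and the second part only shows why an fp32 accumulator matters for fp16 data in general.

```python
import numpy as np

# Part 1: outlier sensitivity.
errors = np.array([0.1, 0.1, 0.1, 10.0])       # the last example is an outlier
mse_grads = 2.0 * errors                        # derivative of e**2
delta = 1.0                                     # assumed Huber threshold
huber_grads = np.clip(errors, -delta, delta)    # derivative of the Huber loss
mse_share = mse_grads[-1] / mse_grads.sum()     # about 0.97
huber_share = huber_grads[-1] / huber_grads.sum()  # about 0.77

# Part 2: fp16 versus fp32 accumulation of fp16 squared errors.
sq = np.full(4096, 0.25, dtype=np.float16)      # exact sum is 1024
acc16 = np.float16(0.0)
acc32 = np.float32(0.0)
for v in sq:
    acc16 = np.float16(acc16 + v)               # rounds to fp16 after every add
    acc32 = np.float32(acc32 + np.float32(v))   # keeps fp32 precision
# acc32 reaches 1024; acc16 stalls once its spacing exceeds the addend.
```

With squared loss the outlier supplies almost all of the gradient, while the clipped Huber gradient caps its contribution. In the second part, the fp16 running sum stops growing well short of the true total, which is the failure an fp32 accumulator avoids.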

See also

  • mse_loss — the API page.
  • CrossEntropyLoss — the classification analogue.
  • Tape — the autograd graph mse_loss registers its grad_fn on.
  • AMP — the autocast context.
  • JIT Compiler: mse_loss is fusible when wrapped in @jit_compile.