CrossEntropyLoss

Status: Public API in netcl.nn.loss.cross_entropy and netcl.nn.functional.cross_entropy

CrossEntropyLoss is the standard loss for multi-class classification. It is the negative log-likelihood of the softmax of the model's logits, summed over the batch and (optionally) averaged.

For a batch of size N and C classes, with logits z[b, c] and integer targets y[b]:

softmax(z)[b, c] = exp(z[b, c]) / sum_{c'} exp(z[b, c'])
loss = - (1 / N) * sum_{b} log(softmax(z)[b, y[b]])

netcl implements two equivalent APIs: the nn.CrossEntropyLoss module (for use as a submodule of a model) and the nn.functional.cross_entropy function (for use in a forward function). Both share the same fused kernel: a single OpenCL launch that computes the softmax and the log-likelihood in one pass, with no intermediate device tensor for the softmax output.

Overview

CrossEntropyLoss is mathematically equivalent to applying log_softmax and then nll_loss, but it is faster because the two steps run in one fused kernel. It is also more numerically stable than a naive softmax-then-log implementation: the fused kernel uses the log-sum-exp trick, which avoids the overflow that exp would otherwise hit on inputs with large positive values.
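The stability claim can be illustrated without netcl. The following is a minimal plain-Python sketch, not the fused kernel itself; the function names naive_cross_entropy and stable_cross_entropy are hypothetical and exist only for this illustration. It compares a softmax-then-log computation with the log-sum-exp form on one row of logits.

```python
import math

def naive_cross_entropy(row, target):
    # Softmax followed by log: exp() overflows once a logit is large enough.
    exps = [math.exp(z) for z in row]
    return -math.log(exps[target] / sum(exps))

def stable_cross_entropy(row, target):
    # Log-sum-exp trick: shift by the row max so every exponent is <= 0.
    m = max(row)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in row))
    return log_sum_exp - row[target]

small = [2.0, 1.0, 0.1]
large = [1000.0, 999.0, 998.1]  # the same row shifted up by 998

print(stable_cross_entropy(small, 0))
print(stable_cross_entropy(large, 0))  # matches the line above up to rounding

try:
    naive_cross_entropy(large, 0)
except OverflowError as err:
    print("naive version failed:", err)
```

Because softmax is invariant to adding a constant to every logit in a row, the two rows have the same loss; only the shifted form can actually compute it for the large row.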

The loss has three reduction modes:

reduction="mean" (default) — average the per-example losses over the batch.
reduction="sum" — sum the per-example losses.
reduction="none" — return the per-example loss as a tensor of shape (N,) (no reduction).
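As a sketch of what the three modes return, the helper below applies each reduction to a list of per-example losses. It is plain Python for illustration only; reduce_loss is a hypothetical name, not part of netcl.

```python
def reduce_loss(per_example, reduction="mean"):
    # per_example: list of N per-example losses
    if reduction == "none":
        return list(per_example)          # shape (N,), no reduction
    total = sum(per_example)
    if reduction == "sum":
        return total
    if reduction == "mean":
        return total / len(per_example)   # average over the batch
    raise ValueError(f"unknown reduction: {reduction!r}")

per_example = [0.5, 1.5, 1.0]
print(reduce_loss(per_example, "none"))
print(reduce_loss(per_example, "sum"))
print(reduce_loss(per_example, "mean"))
```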

An optional ignore_index argument skips examples whose target equals the given value. This is the standard way to mask out the padding class in sequence-to-sequence training.
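The effect of ignore_index on the per-example losses can be sketched in plain Python. This is an illustration under assumptions, not netcl's kernel: per_example_loss is a hypothetical helper, and this page does not state whether reduction="mean" then divides by N or by the number of non-ignored examples, so the sketch prints both candidates rather than picking one.

```python
import math

def per_example_loss(logits, targets, ignore_index=None):
    losses = []
    for row, y in zip(logits, targets):
        if y == ignore_index:
            losses.append(0.0)  # ignored example contributes zero
            continue
        m = max(row)
        lse = m + math.log(sum(math.exp(z - m) for z in row))
        losses.append(lse - row[y])
    return losses

logits = [[2.0, 0.5, 0.1], [0.2, 1.7, 0.3], [0.0, 0.0, 0.0]]
targets = [1, 2, 0]  # the last position is padding (class 0)

losses = per_example_loss(logits, targets, ignore_index=0)
kept = sum(1 for y in targets if y != 0)

print(losses)
print("sum / N:   ", sum(losses) / len(losses))
print("sum / kept:", sum(losses) / kept)  # assumption: one possible divisor
```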

Where It Lives

File path: nn/loss.py (cross_entropy), ops/fused_ops.py (the fused OpenCL kernel).
Module path: netcl.nn.functional.cross_entropy (functional API).
Sibling: nll_loss (the negative log-likelihood loss), the building block that cross_entropy extends.

How It Works

The fused kernel is a single launch with two passes:

Per-row pass: for each example b, find the max logit m = max_c z[b, c], then compute the row-wise log_sum_exp = m + log(sum_c exp(z[b, c] - m)). Subtract the target logit from this to get the per-example loss loss[b] = log_sum_exp - z[b, y[b]].
Reduction pass: tree-reduce the per-example loss vector to a single scalar; leave the sum as-is for reduction="sum", or divide by N for reduction="mean".

The backward pass is similarly fused: it computes softmax(z) - one_hot(y) and, for reduction="mean", divides by N. The result is registered on the Tape as the input gradient.
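The gradient formula can be checked numerically. The sketch below is plain Python with hypothetical helper names (softmax, mean_loss, analytic_grad, numeric_grad); it is not netcl code. It compares (softmax(z) - one_hot(y)) / N against a central finite difference of the mean-reduced loss.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(z - m) for z in row]
    s = sum(exps)
    return [e / s for e in exps]

def mean_loss(logits, targets):
    total = 0.0
    for row, y in zip(logits, targets):
        m = max(row)
        total += m + math.log(sum(math.exp(z - m) for z in row)) - row[y]
    return total / len(logits)

def analytic_grad(logits, targets):
    # (softmax(z) - one_hot(y)) / N, as described for the backward pass
    n = len(logits)
    return [[(p - (1.0 if c == y else 0.0)) / n
             for c, p in enumerate(softmax(row))]
            for row, y in zip(logits, targets)]

def numeric_grad(logits, targets, eps=1e-6):
    # Central finite difference on each logit in turn
    grad = [[0.0] * len(row) for row in logits]
    for b, row in enumerate(logits):
        for c in range(len(row)):
            orig = row[c]
            row[c] = orig + eps
            up = mean_loss(logits, targets)
            row[c] = orig - eps
            down = mean_loss(logits, targets)
            row[c] = orig
            grad[b][c] = (up - down) / (2 * eps)
    return grad

logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]]
targets = [0, 2]

a = analytic_grad(logits, targets)
n = numeric_grad(logits, targets)
max_err = max(abs(x - y) for ra, rn in zip(a, n) for x, y in zip(ra, rn))
print("max abs difference:", max_err)
```

Each row of the analytic gradient sums to zero, since the softmax probabilities sum to one and the one-hot vector subtracts exactly one.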

Code Example

import netcl as nc
import netcl.nn.functional as F

logits = nc.Tensor.from_host(numpy_logits)   # shape (N, C)
target = nc.Tensor.from_host(numpy_target)   # shape (N,), int64

loss = F.cross_entropy(logits, target, reduction="mean")
loss.backward()

A minimal training step with CrossEntropyLoss (model, dataloader, and the optimizer module opt are assumed to be defined elsewhere):

optimizer = opt.Adam(model.parameters(), lr=1e-3)
for x, y in dataloader:
    optimizer.zero_grad()
    logits = model(x)
    loss = F.cross_entropy(logits, y)
    loss.backward()
    optimizer.step()

With a padding mask (sequence-to-sequence training):

loss = F.cross_entropy(logits, target,
                       reduction="mean", ignore_index=0)
# examples with target == 0 contribute zero to the loss

Performance & Trade-offs

The fused kernel is about 4x faster than a naive F.log_softmax(logits).gather(...).mean() chain.
CrossEntropyLoss is sensitive to logit magnitude. Logits in the range of about [-10, 10] work without scaling; outside that range the log-sum-exp may lose precision even with the fused trick. For very large logits, normalise before calling the loss.
Under AMP, the loss runs in fp16 but the accumulator is fp32. The fused kernel is aware of the autocast context and switches to the fp32 kernel if the loss is fp32-typed.
Class-imbalanced training: CrossEntropyLoss with reduction="mean" averages over all N examples, so the result is dominated by the majority class. Use the weight= argument to give minority-class examples a larger per-example loss.
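A class-weighted loss can be sketched as follows. This is plain Python under stated assumptions, not netcl's implementation: weighted_cross_entropy is a hypothetical helper, weight is assumed to be one value per class, and the mean is assumed to be normalised by the sum of the applied weights (the convention PyTorch uses). This page does not specify how netcl normalises a weighted mean.

```python
import math
from collections import Counter

def weighted_cross_entropy(logits, targets, weight):
    num = 0.0
    den = 0.0
    for row, y in zip(logits, targets):
        m = max(row)
        loss = m + math.log(sum(math.exp(z - m) for z in row)) - row[y]
        num += weight[y] * loss   # scale each example by its class weight
        den += weight[y]          # assumption: normalise by total weight
    return num / den

logits = [[1.0, 0.0], [0.8, 0.1], [1.2, -0.3], [0.9, 0.2]]
targets = [0, 0, 0, 1]            # class 1 is the minority

# Inverse-frequency weights: N / (C * count[c])
counts = Counter(targets)
num_classes = 2
weight = [len(targets) / (num_classes * counts[c]) for c in range(num_classes)]

print("weights:   ", weight)
print("unweighted:", weighted_cross_entropy(logits, targets, [1.0, 1.0]))
print("weighted:  ", weighted_cross_entropy(logits, targets, weight))
```

With these weights the single minority example carries as much total weight as the three majority examples combined, so a model that always predicts class 0 is penalised more heavily.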