CrossEntropyLoss
Status: Public API in netcl.nn.loss.cross_entropy and netcl.nn.functional.cross_entropy
CrossEntropyLoss is the standard loss for multi-class
classification. It is the negative log-likelihood of the softmax
of the model's logits, summed over the batch and (optionally)
averaged.
For a batch of size N and C classes, with logits
z[b, c] and integer targets y[b]:
softmax(z)[b, c] = exp(z[b, c]) / sum_{c'} exp(z[b, c'])
loss = - (1 / N) * sum_{b} log(softmax(z)[b, y[b]])
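As a reference point, the two formulas above can be written out directly in NumPy. This is a minimal sketch for checking values by hand, not netcl code; the function name `cross_entropy_reference` is made up for illustration.

```python
import numpy as np

def cross_entropy_reference(z, y):
    """Mean cross-entropy, transcribed literally from the two formulas above."""
    z = np.asarray(z, dtype=np.float64)
    y = np.asarray(y)
    n = z.shape[0]
    # softmax(z)[b, c] = exp(z[b, c]) / sum_{c'} exp(z[b, c'])
    e = np.exp(z)
    softmax = e / e.sum(axis=1, keepdims=True)
    # loss = -(1 / N) * sum_b log(softmax(z)[b, y[b]])
    return -np.log(softmax[np.arange(n), y]).mean()

z = np.array([[2.0, 0.5, -1.0],
              [0.0, 0.0, 0.0]])   # logits, shape (N=2, C=3)
y = np.array([0, 2])              # integer targets, shape (N,)
loss = cross_entropy_reference(z, y)
```

Because it exponentiates the raw logits, this literal form is only suitable for logits of moderate magnitude.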
netcl implements two equivalent APIs: the nn.CrossEntropyLoss
module (for use as a submodule of a model) and the
nn.functional.cross_entropy function (for use in a forward
function). Both share the same fused kernel: a single OpenCL
launch that computes the softmax and the log-likelihood in one
pass, with no intermediate device tensor for the softmax output.
Overview
CrossEntropyLoss is mathematically equivalent to applying
log_softmax and then nll_loss, but is faster because the two
steps run in one kernel. It is also more numerically stable than
a naive softmax-then-log implementation: the fused kernel uses
the log-sum-exp trick to avoid the exp overflow that the naive
form would suffer on inputs with large positive values.
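The overflow that the log-sum-exp trick avoids is easy to reproduce on the host. The sketch below uses plain NumPy in float32 (it does not call netcl), and the helper names are illustrative only.

```python
import numpy as np

def naive_log_softmax(z):
    # Direct log(exp(z) / sum(exp(z))): exp overflows to inf for large z.
    e = np.exp(z)
    return np.log(e / e.sum(axis=1, keepdims=True))

def stable_log_softmax(z):
    # Log-sum-exp trick: subtract the row max before exponentiating,
    # so the largest exponent is exactly 0.
    m = z.max(axis=1, keepdims=True)
    return z - (m + np.log(np.exp(z - m).sum(axis=1, keepdims=True)))

z = np.array([[1000.0, 999.0, 0.0]], dtype=np.float32)

# Silence the overflow / invalid-value warnings the naive form triggers.
with np.errstate(over="ignore", invalid="ignore", divide="ignore"):
    naive = naive_log_softmax(z)    # contains nan and -inf
stable = stable_log_softmax(z)      # all entries finite
```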
The loss has three reduction modes:
- reduction="mean" (default): average the per-example losses over the batch.
- reduction="sum": sum the per-example losses.
- reduction="none": return the per-example loss as a tensor of shape (N,) (no reduction).
An optional ignore_index argument skips examples whose target
equals the given value. This is the standard way to mask out
the padding class in sequence-to-sequence training.
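The reduction modes and ignore_index can be illustrated with a small NumPy sketch. Note the assumption: this page does not say what "mean" divides by when some examples are ignored, so the sketch averages over the non-ignored examples only (the convention PyTorch uses); netcl may behave differently. The function name `cross_entropy_np` is hypothetical.

```python
import numpy as np

def cross_entropy_np(z, y, reduction="mean", ignore_index=None):
    z = np.asarray(z, dtype=np.float64)
    y = np.asarray(y)
    # Boolean mask of the examples that count towards the loss.
    keep = np.ones(y.shape, dtype=bool) if ignore_index is None else (y != ignore_index)
    # Ignored rows may hold an out-of-range target, so index with a placeholder.
    safe_y = np.where(keep, y, 0)
    m = z.max(axis=1)
    lse = m + np.log(np.exp(z - m[:, None]).sum(axis=1))
    per_example = (lse - z[np.arange(len(y)), safe_y]) * keep
    if reduction == "none":
        return per_example
    if reduction == "sum":
        return per_example.sum()
    if reduction == "mean":
        # ASSUMPTION: divide by the number of non-ignored examples.
        return per_example.sum() / max(int(keep.sum()), 1)
    raise ValueError(f"unknown reduction: {reduction!r}")

z = np.zeros((3, 4))          # uniform logits: every kept example costs log(4)
y = np.array([1, 0, 3])       # the middle example is padding (target == 0)
per_example = cross_entropy_np(z, y, reduction="none", ignore_index=0)
total = cross_entropy_np(z, y, reduction="sum", ignore_index=0)
mean = cross_entropy_np(z, y, reduction="mean", ignore_index=0)
```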
Where It Lives
- File path: nn/loss.py (cross_entropy), ops/fused_ops.py (the fused OpenCL kernel).
- Module path: netcl.nn.functional.cross_entropy (functional API).
- Sibling: nll_loss (the negative log-likelihood loss), the building block that cross_entropy extends.
How It Works
The fused kernel is a single launch with two passes:
- Per-row pass: for each example b, find the max logit m = max_c z[b, c], then compute the row-wise log_sum_exp = m + log(sum_c exp(z[b, c] - m)). Subtract the target logit from this to get the per-example loss loss[b] = log_sum_exp - z[b, y[b]].
- Reduction pass: tree-reduce the per-example loss vector to a single scalar, then divide by N for reduction="mean".
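The two passes described above can be mimicked on the host in NumPy. This is a sequential sketch of the arithmetic only; it says nothing about how the OpenCL kernel assigns work-items, and the function names are illustrative.

```python
import numpy as np

def per_row_pass(z, y):
    """Per-example loss: log_sum_exp of the row minus the target logit."""
    losses = np.empty(z.shape[0])
    for b in range(z.shape[0]):
        m = z[b].max()                                   # max logit of the row
        log_sum_exp = m + np.log(np.exp(z[b] - m).sum())
        losses[b] = log_sum_exp - z[b, y[b]]
    return losses

def tree_reduce_sum(values):
    """Pairwise (tree) sum: fold the upper half onto the lower half until one value is left."""
    buf = np.array(values, dtype=np.float64)
    n = buf.size
    if n == 0:
        return 0.0
    while n > 1:
        half = (n + 1) // 2
        # With odd n the middle element has no partner and is carried over.
        buf[: n - half] += buf[half:n]
        n = half
    return float(buf[0])

rng = np.random.default_rng(0)
z = rng.normal(size=(5, 3))
y = np.array([0, 2, 1, 1, 0])

losses = per_row_pass(z, y)
mean_loss = tree_reduce_sum(losses) / len(losses)   # reduction="mean"
```

A tree reduction needs about log2(N) steps instead of N - 1 sequential additions, which is why it suits a parallel device.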
The backward pass is similarly fused: it computes
softmax(z) - one_hot(y) and, for reduction="mean", divides by N.
The result is registered on the Tape as the input gradient.
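The gradient formula can be checked against central finite differences in NumPy. This sketch covers the reduction="mean" case only and is independent of netcl's Tape; the function names are illustrative.

```python
import numpy as np

def cross_entropy_mean(z, y):
    m = z.max(axis=1, keepdims=True)
    lse = (m + np.log(np.exp(z - m).sum(axis=1, keepdims=True)))[:, 0]
    return (lse - z[np.arange(len(y)), y]).mean()

def cross_entropy_grad(z, y):
    """d(mean loss)/dz = (softmax(z) - one_hot(y)) / N."""
    n = z.shape[0]
    e = np.exp(z - z.max(axis=1, keepdims=True))
    grad = e / e.sum(axis=1, keepdims=True)   # softmax(z)
    grad[np.arange(n), y] -= 1.0              # subtract one_hot(y)
    return grad / n

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 3))
y = np.array([0, 2, 1, 1])

analytic = cross_entropy_grad(z, y)

# Central finite differences, one logit at a time.
numeric = np.zeros_like(z)
eps = 1e-6
for b in range(z.shape[0]):
    for c in range(z.shape[1]):
        step = np.zeros_like(z)
        step[b, c] = eps
        numeric[b, c] = (cross_entropy_mean(z + step, y)
                         - cross_entropy_mean(z - step, y)) / (2 * eps)
```

Each row of the gradient sums to zero, since both softmax(z) and one_hot(y) sum to one along the class axis.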
Code Example
import netcl as nc
import netcl.nn.functional as F
logits = nc.Tensor.from_host(numpy_logits) # shape (N, C)
target = nc.Tensor.from_host(numpy_target) # shape (N,), int64
loss = F.cross_entropy(logits, target, reduction="mean")
loss.backward()
A minimal training step with CrossEntropyLoss:
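# model, dataloader, and the optimizer module `opt` are assumed to be defined elsewhere
optimizer = opt.Adam(model.parameters(), lr=1e-3)
for x, y in dataloader:
    optimizer.zero_grad()              # clear gradients from the previous step
    logits = model(x)                  # shape (N, C)
    loss = F.cross_entropy(logits, y)  # reduction="mean" by default
    loss.backward()
    optimizer.step()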
With a padding mask (sequence-to-sequence training):
loss = F.cross_entropy(logits, target,
                       reduction="mean", ignore_index=0)
# examples with target == 0 contribute zero to the loss
Performance & Trade-offs
- The fused kernel is about 4x faster than a naive F.log_softmax(logits).gather(...).mean() chain.
- CrossEntropyLoss is sensitive to logit magnitude. Logits in the range of about [-10, 10] work without scaling; outside that range the log-sum-exp may lose precision even with the fused trick. For very large logits, normalise before calling the loss.
- Under AMP, the loss runs in fp16 but the accumulator is fp32. The fused kernel is aware of the autocast context and switches to the fp32 kernel if the loss is fp32-typed.
- Class-imbalanced training: CrossEntropyLoss with reduction="mean" divides by N, so the averaged loss is dominated by the majority class. Use weight= to give the minority class a larger per-example loss.
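To make the class-imbalance point concrete, here is a NumPy sketch of a class-weighted cross-entropy. Two assumptions are not confirmed by this page: that weight= takes one weight per class, and that the weighted mean is normalised by the sum of the applied weights (the PyTorch convention). Both helper names are hypothetical.

```python
import numpy as np

def weighted_cross_entropy(z, y, class_weight):
    z = np.asarray(z, dtype=np.float64)
    y = np.asarray(y)
    w = np.asarray(class_weight, dtype=np.float64)[y]   # weight of each example's class
    m = z.max(axis=1)
    lse = m + np.log(np.exp(z - m[:, None]).sum(axis=1))
    per_example = lse - z[np.arange(len(y)), y]
    # ASSUMPTION: normalise by the summed weights rather than by N.
    return (w * per_example).sum() / w.sum()

def inverse_frequency_weights(y, num_classes):
    """One common heuristic: weight[c] = N / (C * count[c])."""
    counts = np.bincount(y, minlength=num_classes).astype(np.float64)
    # Classes absent from y get weight 0 instead of a division by zero.
    return np.divide(counts.sum(), num_classes * counts,
                     out=np.zeros_like(counts), where=counts > 0)

y = np.array([0, 0, 0, 1])                 # class 1 is the minority
weights = inverse_frequency_weights(y, num_classes=2)
z = np.zeros((4, 2))                       # uniform logits
loss = weighted_cross_entropy(z, y, weights)
```

With these weights the single minority example carries as much total weight as the three majority examples combined.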
See also
- CrossEntropyLoss: the API page.
- mse_loss: the regression analogue.
- Tape: the autograd graph the loss registers on.
- AMP: the autocast context.
- JIT Compiler: the loss is fusible when wrapped in @jit_compile.