bf16
Status: External standard — Google Brain float (bfloat16)
bf16 (brain floating-point) is a 16-bit floating-point format
designed by Google Brain. It has the same exponent range as fp32
(8 exponent bits) but with a shorter mantissa (7 mantissa bits).
The netcl tensor type does not currently use bf16 as a first-class
dtype, but the format is documented here because the architecture
treats it as a possible future addition (it is the format of
choice for some training pipelines on TPUs and on recent NVIDIA
hardware).
The format layout is:
- 1 sign bit
- 8 exponent bits (bias 127, same as fp32)
- 7 mantissa bits
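This gives nearly the same range as fp32: normal values span about
1.2e-38 to 3.4e38. Subnormals reach down to about 9.2e-41 rather
than fp32's 1.4e-45, because the mantissa is shorter. Precision is
only 2 to 3 significant decimal digits. The narrower mantissa is
acceptable in practice for deep-learning forward and backward
passes, where the wider range avoids the scaling problem of fp16.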
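The range and precision figures follow directly from the field widths. The sketch below decodes a 16-bit pattern by widening it to fp32 and derives the limits from specific bit patterns. The function and constant names are illustrative and are not part of netcl.

```python
import struct


def bf16_bits_to_float(bits: int) -> float:
    """Interpret a 16-bit pattern as bf16 by widening it to an fp32 word."""
    # bf16 is the top half of an fp32 word; the low 16 bits are zero.
    word = (bits & 0xFFFF) << 16
    return struct.unpack("<f", struct.pack("<I", word))[0]


def bf16_fields(bits: int) -> tuple[int, int, int]:
    """Split a bf16 bit pattern into (sign, biased exponent, mantissa)."""
    sign = (bits >> 15) & 0x1
    exponent = (bits >> 7) & 0xFF   # 8 bits, bias 127
    mantissa = bits & 0x7F          # 7 bits
    return sign, exponent, mantissa


# Largest finite value: exponent 254, mantissa all ones.
BF16_MAX = bf16_bits_to_float(0x7F7F)
# Smallest positive normal value: exponent 1, mantissa 0.
BF16_MIN_NORMAL = bf16_bits_to_float(0x0080)
# Smallest positive subnormal value: exponent 0, mantissa 1.
BF16_MIN_SUBNORMAL = bf16_bits_to_float(0x0001)
# Gap between 1.0 and the next representable value (2**-7 = 0.0078125).
BF16_EPSILON = bf16_bits_to_float(0x3F81) - bf16_bits_to_float(0x3F80)

if __name__ == "__main__":
    print(bf16_fields(0xC000), bf16_bits_to_float(0xC000))
    print(BF16_MAX, BF16_MIN_NORMAL, BF16_MIN_SUBNORMAL, BF16_EPSILON)
```

An epsilon of 2**-7 means a relative rounding error of up to about 0.4%, which is where the "2 to 3 decimal digits" figure comes from.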
Overview
Compared to fp16, bf16 trades precision for
range. For deep learning, this trade is usually favourable:
activations overflow fp16's range much more often than they need
its 3 digits of precision. The downside is that bf16 is not
natively supported by most consumer OpenCL hardware; native bf16
compute is found mainly on Google TPUs, on recent NVIDIA Tensor
Core generations, and on some recent datacenter parts from other
vendors. Because netcl has no bf16 dtype yet, any support would
go through a software path that pairs bf16 storage with fp32
compute, similar to the AMP pattern for fp16.
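The trade can be made concrete by rounding the same values to both formats. This is a minimal sketch using only the Python standard library: fp16 rounding uses the struct module's "e" format, and bf16 rounding is done by hand on the fp32 bit pattern. The helper names are illustrative, and the bf16 helper assumes finite inputs.

```python
import math
import struct


def round_to_bf16(x: float) -> float:
    """Round a finite value to the nearest bf16 value (ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add just under half of the discarded range, plus one if the kept
    # least-significant bit is odd, then drop the low 16 bits.
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def round_to_fp16(x: float) -> float:
    """Round a value to fp16, mapping overflow to infinity."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        # struct refuses to pack values beyond the fp16 range.
        return math.copysign(math.inf, x)


if __name__ == "__main__":
    for value in (70000.0, 1.001):
        print(value, "fp16:", round_to_fp16(value), "bf16:", round_to_bf16(value))
```

The value 70000.0 is beyond fp16's largest finite value (65504) but survives in bf16 as 70144.0, while 1.001 keeps its third decimal digit in fp16 and collapses to 1.0 in bf16.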
Where It Lives
- The format is defined by Google's open documentation; the bit-level layout is identical to the top 16 bits of an fp32 value.
- netcl does not currently allocate bf16 tensors; the architecture reserves the dtype name bfloat16 for future use.
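Since the bf16 bit pattern is the top 16 bits of an fp32 word, conversion in both directions can be sketched with integer operations. The code below shows truncation, round-to-nearest-even, and the exact widening back to fp32. The function names are illustrative and not netcl API, and NaN inputs are deliberately not handled.

```python
import struct


def fp32_bits(x: float) -> int:
    """Return the 32-bit pattern of x after rounding it to fp32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def fp32_to_bf16_truncate(x: float) -> int:
    """Keep the top 16 bits and discard the rest (rounds toward zero)."""
    return fp32_bits(x) >> 16


def fp32_to_bf16_rne(x: float) -> int:
    """Round to nearest, ties to even, then keep the top 16 bits."""
    bits = fp32_bits(x)
    # The bias is 0x7FFF, or 0x8000 when the kept low bit is odd, so an
    # exact tie rounds to the neighbour whose last bit is zero.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF


def bf16_to_fp32(bits: int) -> float:
    """Widen a bf16 pattern to fp32 by appending 16 zero bits (exact)."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]


if __name__ == "__main__":
    tie = 1.01171875  # exactly halfway between two bf16 values
    print(hex(fp32_to_bf16_truncate(tie)), hex(fp32_to_bf16_rne(tie)))
    print(bf16_to_fp32(fp32_to_bf16_rne(3.14159265)))
```

Truncation is cheaper but biases every value toward zero; round-to-nearest-even costs one extra add and is the usual choice when values are accumulated over many steps.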
How It Works
A bf16 value is the top 16 bits of the corresponding fp32
value, rounded. Conversion from fp32 to bf16 by truncation is a
single shift; round-to-nearest-even adds a small rounding bias
before the shift. Conversion back pads the mantissa with 16 zero
bits and is exact (no rounding loss). This makes bf16 a
"lossy compression" of fp32 that keeps the dynamic range.
A typical recipe for bf16 training is to keep the master weights in fp32, cast them to bf16 for the forward and backward passes, and apply the update in fp32. The netcl AMP module is structured around this pattern, even when fp16 is the storage format.
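The reason for the fp32 master copy can be seen in a toy example: a weight update smaller than half the bf16 spacing is rounded away if the weight itself is stored in bf16. The sketch below fits a single weight by gradient descent on one sample, simulating bf16 and fp32 rounding in plain Python. It is an illustration of the recipe under those assumptions, not the netcl AMP implementation, and all names are hypothetical.

```python
import struct


def to_fp32(x: float) -> float:
    """Round a Python float (fp64) to the nearest fp32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x: float) -> float:
    """Round a finite value to the nearest bf16 value (ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def train(steps: int, lr: float, keep_fp32_master: bool) -> float:
    """Minimise (w * x - target)**2 for one sample; return the final weight."""
    x, target = 1.0, 0.5
    w = 1.0                                # stored weight
    for _ in range(steps):
        w_lp = to_bf16(w)                  # bf16 copy used for compute
        err = to_bf16(w_lp * x) - target   # forward pass in bf16
        grad = to_bf16(2.0 * err * x)      # backward pass in bf16
        w = to_fp32(w - lr * grad)         # update applied in fp32
        if not keep_fp32_master:
            w = to_bf16(w)                 # weight stored back in bf16
    return w


if __name__ == "__main__":
    print("bf16 weights only:", train(100, 1e-3, keep_fp32_master=False))
    print("fp32 master copy: ", train(100, 1e-3, keep_fp32_master=True))
```

Without the master copy the weight stays at 1.0, because each step of about 0.001 is less than half the bf16 spacing just below 1.0 (about 0.004). With the master copy the small updates accumulate in fp32 and the weight moves toward the target.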
Code Example
bf16 is not yet a netcl dtype; the code below is illustrative
of how it would be used if it were:
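import netcl as nc

# Hypothetical API: "bfloat16" is reserved but not yet accepted as a dtype.
# ctx, q and model are assumed to have been created elsewhere.
x = nc.Tensor.zeros((4, 1024), dtype="bfloat16",
                    context=ctx, queue=q)
y = model(x)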
In current netcl, the closest substitutes are fp16 under AMP (same storage size, narrower range) or fp32 (same range, twice the storage).
Performance & Trade-offs
- Range matches fp32, so no GradScaler is needed. The "no underflow / overflow" property makes bf16 much easier to use than fp16 for mixed-precision training.
- Precision is about 2 decimal digits, which is enough for most vision and language models but is on the edge for tasks with small but meaningful weight differences (some reinforcement-learning setups, some physics-informed models).
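- Hardware support is the binding constraint. As of 2024, native bf16 compute is available on NVIDIA GPUs from Ampere onwards, on Google's TPUs, and on some recent CPUs and accelerators from other vendors, but rarely on consumer OpenCL devices; devices without it emulate bf16 in fp32 with no speed-up.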
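The first point can be illustrated with a small gradient value. In the sketch below, a gradient of 1e-8 underflows to zero in fp16 unless it is multiplied by a loss scale first, while bf16 represents it directly. The scale factor of 65536 is an arbitrary illustrative choice, and the helpers are plain-Python simulations rather than netcl functions.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a value within fp16 range to the nearest fp16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bf16(x: float) -> float:
    """Round a finite value to the nearest bf16 value (ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


grad = 1e-8        # a small but meaningful gradient
scale = 65536.0    # illustrative loss scale, a power of two

fp16_plain = to_fp16(grad)                    # below the fp16 subnormal range
fp16_scaled = to_fp16(grad * scale) / scale   # scale, store, then unscale
bf16_plain = to_bf16(grad)                    # no scaling needed

if __name__ == "__main__":
    print("fp16 without scaling:", fp16_plain)
    print("fp16 with scaling:   ", fp16_scaled)
    print("bf16 without scaling:", bf16_plain)
```

The scaling step is what a gradient scaler automates for fp16; with bf16 the value is stored at lower precision but is not lost.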
See also
- fp16 — the other 16-bit format.
- AMP — the mixed-precision wrapper.
- cl_khr_fp16 — the OpenCL extension for fp16.