
bf16

Status: External standard — Google Brain float (bfloat16)

bf16 (brain floating-point) is a 16-bit floating-point format designed by Google Brain. It has the same exponent width as fp32 (8 bits), and therefore roughly the same dynamic range, but a much shorter mantissa (7 bits instead of 23). netcl does not currently expose bf16 as a first-class tensor dtype; the format is documented here because the architecture treats it as a possible future addition. It is the format of choice for many training pipelines on TPUs and on recent NVIDIA hardware.

The format layout is:

1 sign bit
8 exponent bits (bias 127, same as fp32)
7 mantissa bits

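This gives bf16 essentially the same range as fp32: normal values run from about 1.2e-38 to about 3.4e38. (fp32 reaches down to ~1.4e-45 only through its longer subnormal tail; bf16 subnormals stop near 9.2e-41.) Precision is 8 significant bits (7 stored plus the implicit leading 1), or about 2 to 3 decimal digits. The narrower mantissa is acceptable in practice for deep-learning forward and backward passes, where the wider range avoids the loss-scaling problem of fp16.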
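The layout and the range figures can be checked directly from bit patterns. The following is a minimal sketch in plain Python (standard library only); the helper names are made up for illustration and are not part of netcl.

```python
import struct

def bf16_fields(bits: int):
    """Split a 16-bit bf16 pattern into (sign, exponent, mantissa)."""
    sign = (bits >> 15) & 0x1
    exponent = (bits >> 7) & 0xFF   # 8 bits, bias 127
    mantissa = bits & 0x7F          # 7 stored bits
    return sign, exponent, mantissa

def bf16_bits_to_float(bits: int) -> float:
    """Decode a bf16 pattern by placing it in the top half of an fp32 word."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]

MAX_NORMAL = bf16_bits_to_float(0x7F7F)     # exponent 254, mantissa all ones
MIN_NORMAL = bf16_bits_to_float(0x0080)     # exponent 1, mantissa zero
MIN_SUBNORMAL = bf16_bits_to_float(0x0001)  # exponent 0, mantissa 1
EPSILON = bf16_bits_to_float(0x3C00)        # 2**-7, gap between 1.0 and the next value

if __name__ == "__main__":
    print(bf16_fields(0x3F80))              # fields of 1.0
    print(MAX_NORMAL, MIN_NORMAL, MIN_SUBNORMAL, EPSILON)
```

The gap of 2**-7 (about 0.0078) between 1.0 and the next representable value is where the "2 to 3 decimal digits" figure comes from.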

Overview

Compared to fp16 (5 exponent bits, 10 mantissa bits), bf16 trades precision for range. For deep learning, this trade is usually favourable: activations and gradients fall outside fp16's range (maximum 65504) far more often than they need a third decimal digit of precision. The downside is that bf16 is not natively supported by most consumer OpenCL hardware; native support is concentrated in Google TPUs, recent NVIDIA Tensor Core generations, and a few other recent accelerator and server parts. Because bf16 is not yet a netcl dtype, support would most likely take the form of a software path that pairs bf16 storage with fp32 compute, similar to the AMP pattern for fp16.
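The range-for-precision trade can be seen on two sample values. This is an illustrative sketch using only Python's standard library: fp16 rounding uses the struct module's half-precision format, and bf16 rounding is emulated on fp32 bits. The function names are hypothetical and NaN inputs are not handled.

```python
import math
import struct

def round_to_fp16(x: float) -> float:
    """Round x to the nearest fp16 value; overflow becomes +/- infinity."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except (OverflowError, struct.error):
        return math.copysign(math.inf, x)

def round_to_bf16(x: float) -> float:
    """Round x (finite, within fp32 range) to the nearest bf16 value, ties to even."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)          # rounding bias
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

if __name__ == "__main__":
    # Range: 70000 exceeds fp16's maximum of 65504 but fits easily in bf16.
    print(round_to_fp16(70000.0), round_to_bf16(70000.0))   # inf 70144.0
    # Precision: fp16 keeps the third decimal digit of 1.001, bf16 does not.
    print(round_to_fp16(1.001), round_to_bf16(1.001))       # 1.0009765625 1.0
```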

Where It Lives

The format is defined by Google's open documentation; the bit-level layout is identical to the top 16 bits of an fp32 value.
netcl does not currently allocate bf16 tensors; the architecture reserves the dtype name bfloat16 for future use.

How It Works

A bf16 value has the bit pattern of the top 16 bits of an fp32 value. Converting fp32 to bf16 by truncation is a single 16-bit right shift; round-to-nearest-even, the usual choice, adds a small rounding bias to the fp32 bits before the shift. Converting back appends 16 zero bits to the mantissa and is exact (no rounding loss). This makes bf16 a "lossy compression" of fp32 that keeps the dynamic range.
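The conversions described above can be sketched on raw bit patterns as follows. This is an illustration in standard-library Python, not netcl code; the function names are assumptions, and the NaN handling (forcing a mantissa bit so the result stays a NaN) is one common convention rather than a requirement of the format.

```python
import math
import struct

def fp32_bits(x: float) -> int:
    """Return the 32-bit pattern of x rounded to fp32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def fp32_to_bf16_truncate(x: float) -> int:
    """Drop the low 16 bits (rounds toward zero)."""
    return fp32_bits(x) >> 16

def fp32_to_bf16_rne(x: float) -> int:
    """Round to nearest, ties to even, by adding a bias before the shift."""
    bits = fp32_bits(x)
    if math.isnan(x):
        return (bits >> 16) | 0x0040      # keep at least one mantissa bit set
    lsb = (bits >> 16) & 1                # last bit that survives the shift
    return ((bits + 0x7FFF + lsb) >> 16) & 0xFFFF

def bf16_to_fp32(b: int) -> float:
    """Widen bf16 to fp32 by appending 16 zero bits; always exact."""
    return struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))[0]

if __name__ == "__main__":
    x = 1.0077                            # just below the bf16 value 1.0078125
    print(hex(fp32_to_bf16_truncate(x)))  # 0x3f80, i.e. 1.0
    print(hex(fp32_to_bf16_rne(x)))       # 0x3f81, i.e. 1.0078125
    print(bf16_to_fp32(0x4049))           # 3.140625
```

Widening and then narrowing again returns the original bf16 pattern, which is what "exact" means here; the loss happens only in the fp32-to-bf16 direction.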

A typical recipe for bf16 training is to keep the master weights in fp32, cast to bf16 for the forward and backward passes, and apply the optimizer update in fp32. The netcl AMP module is structured around this pattern, even though its storage format today is fp16.
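A toy example shows why the master copy stays in fp32. The sketch below trains a single scalar weight with a made-up loss (0.05 * w**2) and a made-up learning rate; it emulates bf16 in standard-library Python and does not use or represent the netcl AMP API.

```python
import struct

def round_to_fp32(x: float) -> float:
    return struct.unpack("<f", struct.pack("<f", x))[0]

def round_to_bf16(x: float) -> float:
    """Round x (finite, within fp32 range) to the nearest bf16 value, ties to even."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

LEARNING_RATE = 0.01  # hypothetical value chosen for the demonstration

def train(steps: int, fp32_master: bool) -> float:
    """Run `steps` SGD updates on one weight, starting from 1.0."""
    weight = 1.0
    for _ in range(steps):
        w_bf16 = round_to_bf16(weight)         # cast for forward/backward
        grad = round_to_bf16(0.1 * w_bf16)     # gradient of 0.05 * w**2, in bf16
        updated = weight - LEARNING_RATE * grad
        # Store the result in fp32 (master copy) or directly in bf16.
        weight = round_to_fp32(updated) if fp32_master else round_to_bf16(updated)
    return weight

if __name__ == "__main__":
    print(train(100, fp32_master=False))  # stays at 1.0: each update rounds away
    print(train(100, fp32_master=True))   # decreases steadily toward about 0.9
```

Each update is about 0.001, which is smaller than half the bf16 spacing just below 1.0 (2**-8, about 0.0039), so a weight stored only in bf16 never moves. The fp32 master copy accumulates the same updates without loss.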

Code Example

bf16 is not yet a netcl dtype; the code below is illustrative of how it would be used if it were:

import netcl as nc

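# Hypothetical API: "bfloat16" is a reserved dtype name, not yet implemented.
# ctx and q are assumed to be an existing netcl context and command queue;
# model is assumed to be any callable netcl module.
x = nc.Tensor.zeros((4, 1024), dtype="bfloat16",
                    context=ctx, queue=q)
y = model(x)  # forward pass on the bf16 input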

In current netcl, the closest options are fp16 under AMP (same memory footprint, narrower range) or plain fp32 (same range, twice the memory).

Performance & Trade-offs

Range matches fp32, so loss scaling (a GradScaler) is normally unnecessary. Values that would overflow or underflow in fp16 stay representable, which makes bf16 much easier to use than fp16 for mixed-precision training.
Precision is about 2 to 3 decimal digits, which is enough for most vision and language models but is on the edge for tasks with small but meaningful weight differences (some reinforcement-learning setups, some physics-informed models).
Hardware support is the binding constraint. As of 2024, native bf16 compute is available on NVIDIA GPUs from Ampere onward, on Google TPUs, and on some recent AMD, Intel, and Arm parts; older devices have to emulate it in fp32 with no speed-up.
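To make the first point in this list concrete (range matches fp32, so loss scaling is rarely needed), the sketch below pushes a small gradient value through emulated fp16 and bf16. It uses only Python's standard library; the function names and the scale factor of 2**16 are illustrative assumptions, not netcl settings.

```python
import struct

def round_to_fp16(x: float) -> float:
    """Round x to the nearest fp16 value (x must be within fp16 range)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def round_to_bf16(x: float) -> float:
    """Round x (finite, within fp32 range) to the nearest bf16 value, ties to even."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

SMALL_GRAD = 1e-8        # below fp16's smallest subnormal (about 6e-8)
LOSS_SCALE = 2.0 ** 16   # hypothetical loss-scaling factor

if __name__ == "__main__":
    # fp16 flushes the gradient to zero unless it is scaled up first.
    print(round_to_fp16(SMALL_GRAD))               # 0.0
    print(round_to_fp16(SMALL_GRAD * LOSS_SCALE))  # nonzero
    # bf16 keeps it (to about 2 to 3 digits) with no scaling at all.
    print(round_to_bf16(SMALL_GRAD))
```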