fp16
Status: External standard — IEEE 754 binary16
fp16 (also called half, binary16, or IEEE 754 half-precision
floating-point) is a 16-bit floating-point format defined by the
IEEE 754 standard (and the older IEEE 754-2008, where the format
was introduced). The netcl tensor type uses it for half-precision
storage and arithmetic when the underlying device supports
cl_khr_fp16.
The format layout is:
- 1 sign bit
- 5 exponent bits (bias 15)
- 10 mantissa bits (with an implicit leading 1)
This gives a normal range of roughly 6.1e-5 to 65504 and a
precision of about 3 decimal digits. The representation is half
the size of fp32, so buffer sizes and memory-bandwidth demand are
roughly halved, and (on most modern GPUs) compute throughput is
roughly doubled.
Overview
In netcl, a tensor with dtype="float16" (or the equivalent
dtype="half") is stored as a cl.Buffer of half the byte count
of an equivalent fp32 tensor. The __global half* kernel
argument is generated automatically when the source is emitted.
The fp16 format is the storage format. The compute precision
may be higher; the netcl kernel selector picks a mixed-precision
strategy where the storage is half but the accumulators in
matmul, conv, and batch-norm are fp32. This is the standard
recipe for half-precision training and is the one AMP enables.
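The effect of a wider accumulator can be illustrated without a GPU. The following is a minimal sketch, not netcl code: the helper names are hypothetical, fp16 rounding is emulated with the standard-library struct module, and a Python float (fp64) stands in for the fp32 accumulator.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest fp16 value (hypothetical helper)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct rejects out-of-range values; hardware would produce infinity.
        return float("inf") if x > 0 else float("-inf")

def sum_fp16_accumulator(values):
    """Accumulate with every partial sum rounded back to fp16."""
    acc = 0.0
    for v in values:
        acc = to_fp16(acc + to_fp16(v))
    return acc

def sum_wide_accumulator(values):
    """Accumulate fp16 inputs in a wider type, rounding to fp16 only once."""
    acc = 0.0
    for v in values:
        acc += to_fp16(v)
    return to_fp16(acc)

ones = [1.0] * 4096
narrow = sum_fp16_accumulator(ones)
wide = sum_wide_accumulator(ones)
print(narrow, wide)  # 2048.0 4096.0
```

The fp16 accumulator stalls at 2048 because 2049 is not representable with 11 bits of precision, so each further addition of 1.0 rounds back to 2048. Rounding only the final result avoids that.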
Where It Lives
- The format itself is defined by IEEE 754, section 3.6, and by the OpenCL cl_khr_fp16 extension (OpenCL 1.2 specification, section 5.1.1).
- The netcl-side dtype mapping is in core/tensor.py:_np_dtype and core/tensor.py:_dtype_nbytes.
- The runtime detection is in amp.py:supports_fp16 and core/capabilities.py:device_profile(...).has_fp16.
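Runtime detection of this kind usually comes down to looking for cl_khr_fp16 in the device's extension string. The sketch below shows that check in isolation. The function name is hypothetical and this is not the body of supports_fp16, which the source does not show.

```python
def has_fp16_extension(extensions: str) -> bool:
    """Return True if a space-separated OpenCL extension string lists cl_khr_fp16.

    Hypothetical helper for illustration; not netcl's implementation.
    """
    # Split on whitespace so that a longer name containing the same
    # prefix does not produce a false positive.
    return "cl_khr_fp16" in extensions.split()

# With pyopencl, the string would typically come from device.extensions
# (CL_DEVICE_EXTENSIONS). A literal is used here so the sketch runs anywhere.
sample = "cl_khr_global_int32_base_atomics cl_khr_fp16 cl_khr_fp64"
print(has_fp16_extension(sample))
```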
How It Works
The fp16 bit layout, from MSB to LSB, is:
s eeeee mmmmmmmmmm
1   5       10
- s = 0 means positive, s = 1 means negative.
- The exponent is stored with a bias of 15: the actual exponent is e - 15. The values e = 0 and e = 31 are reserved: e = 0 encodes zero and subnormal numbers, and e = 31 encodes infinities and NaN.
- The mantissa has an implicit leading 1 for normal numbers. The total precision is therefore 11 bits (1 implicit + 10 explicit).
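The field rules above can be turned into a small decoder. This is an illustrative sketch using only the Python standard library; decode_fp16 and reference are hypothetical names, and the struct "e" format code is used only as a cross-check.

```python
import math
import struct

def decode_fp16(bits: int) -> float:
    """Decode a 16-bit pattern by hand, following the sign/exponent/mantissa rules."""
    sign = -1.0 if (bits >> 15) & 0x1 else 1.0
    exponent = (bits >> 10) & 0x1F   # 5 exponent bits
    mantissa = bits & 0x3FF          # 10 mantissa bits
    if exponent == 0:
        # Zero or subnormal: no implicit leading 1, exponent fixed at -14.
        return sign * (mantissa / 1024.0) * 2.0 ** -14
    if exponent == 31:
        # Infinity when the mantissa is zero, NaN otherwise.
        return sign * float("inf") if mantissa == 0 else float("nan")
    # Normal number: implicit leading 1, exponent bias of 15.
    return sign * (1.0 + mantissa / 1024.0) * 2.0 ** (exponent - 15)

def reference(bits: int) -> float:
    """Decode the same pattern with the standard library for comparison."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]

for bits in (0x3C00, 0xC000, 0x0001, 0x7BFF):
    print(f"0x{bits:04X}", decode_fp16(bits), reference(bits))
```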
Special values:
- 0x0001 to 0x03FF: positive subnormal numbers (smaller than 6.1e-5).
- 0x7C00: positive infinity; 0xFC00: negative infinity.
- 0x7E00: NaN. Any bit pattern with e = 31 and a non-zero mantissa is a NaN, so some implementations also use 0x7C01 or 0xFC01; the OpenCL spec allows either.
The smallest positive normal is 0x0400, which equals
6.103515625e-5. The largest finite value is 0x7BFF, which
equals 65504.
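These boundary values can be checked directly. The snippet below is a small sketch that reinterprets each bit pattern with the standard-library struct module. The fp16_from_bits helper and the landmarks table are names introduced here for illustration.

```python
import math
import struct

def fp16_from_bits(bits: int) -> float:
    """Reinterpret a 16-bit integer as an IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]

landmarks = {
    "smallest subnormal": 0x0001,  # 2**-24
    "largest subnormal": 0x03FF,
    "smallest normal": 0x0400,     # 2**-14
    "largest finite": 0x7BFF,
    "+infinity": 0x7C00,
    "-infinity": 0xFC00,
    "NaN": 0x7E00,
}

for name, bits in landmarks.items():
    print(f"{name:>18}  0x{bits:04X}  {fp16_from_bits(bits)!r}")
```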
Code Example
A netcl tensor with dtype="float16":
import netcl as nc
# ctx and q are assumed to be an existing OpenCL context and command queue.
x = nc.Tensor.zeros((4, 1024), dtype="float16",
                    context=ctx, queue=q)
print(x.nbytes)  # 8192 (4 * 1024 * 2)
A custom OpenCL kernel that uses half:
#pragma OPENCL EXTENSION cl_khr_fp16 : enable
__kernel void scale_half(__global half* x, half s) {
    int gid = get_global_id(0);
    // Multiply in fp32, then round the result back to half.
    x[gid] = (half)((float)x[gid] * (float)s);
}
The (float) casts are not strictly required once cl_khr_fp16 is
enabled, but the half-precision operator* is not well optimised on
every driver; on those drivers the explicit fp32 multiply followed
by a cast back to half can be faster in practice.
Performance & Trade-offs
- Range is the binding constraint. Activations can easily exceed 65504, and small gradients can underflow; under AMP the loss is multiplied by a GradScaler to push the gradients into the well-represented range.
- Precision is about 3 decimal digits. A 1.0 weight and a 0.999999 weight are the same value in fp16. For models where the relative difference between weights matters, keep a master copy in fp32.
- Arithmetic cost: on most modern GPUs fp16 throughput is about 2x that of fp32 in general-purpose pipelines, and up to 4x to 8x on matrix-multiply pipelines with dedicated half-precision support.
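- Underflow / overflow are silent. A weight update whose magnitude falls below 6.1e-5 drops into the subnormal range and loses precision (or is flushed to zero on devices without subnormal support), and one below about 6.0e-8 becomes zero; one above 65504 becomes infinity. Both are bugs that are easy to miss; the GradScaler is the standard defence.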
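The range and precision limits in this list, and the way loss scaling works around underflow, can be reproduced with emulated fp16 rounding. This is a hedged sketch rather than netcl's GradScaler: to_fp16 is a hypothetical helper built on the standard-library struct module, and the scale factor of 65536 is an arbitrary illustrative choice.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest fp16 value (hypothetical helper)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct rejects out-of-range values; hardware would produce infinity.
        return float("inf") if x > 0 else float("-inf")

# Precision: two nearby weights collapse to the same fp16 value.
same_weight = to_fp16(0.999999) == to_fp16(1.0)

# Overflow: a value beyond the largest finite fp16 number becomes infinity.
overflowed = to_fp16(70000.0)

# Underflow: a tiny gradient rounds to zero when stored directly in fp16.
grad = 1e-8
unscaled = to_fp16(grad)

# Loss scaling: scale before storing in fp16, unscale in higher precision.
scale = 65536.0  # illustrative value, not a netcl default
scaled = to_fp16(grad * scale)
recovered = scaled / scale

print(same_weight, overflowed, unscaled, recovered)
```

The recovered gradient matches the original to within fp16 rounding error, whereas the unscaled one is lost entirely.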