fp16

Status: External standard — IEEE 754 binary16

fp16 (also called half, binary16, or IEEE 754 half-precision floating-point) is a 16-bit floating-point format defined by the IEEE 754 standard; it was introduced in the 2008 revision (IEEE 754-2008). The netcl tensor type uses it for half-precision storage and arithmetic when the underlying device supports cl_khr_fp16.

The format layout is:

1 sign bit
5 exponent bits (bias 15)
10 mantissa bits (with an implicit leading 1)

This gives a normal range of roughly 6.1e-5 to 65504 (subnormals extend the low end to about 6.0e-8) and a precision of about 3 decimal digits. An fp16 value takes half the bits of an fp32 value, so buffer sizes and memory traffic are roughly halved, and on many modern GPUs arithmetic throughput roughly doubles.
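The range and precision figures follow directly from the three field widths. The short calculation below is a sketch of that arithmetic in plain Python; the constant and variable names are chosen here for illustration and are not part of netcl.

```python
import math

EXP_BITS = 5     # exponent field width
MANT_BITS = 10   # explicit mantissa field width
BIAS = 15        # exponent bias

# Smallest positive normal: stored exponent 1, mantissa all zeros.
min_normal = 2.0 ** (1 - BIAS)

# Largest finite value: stored exponent 30 (31 is reserved), mantissa all ones.
max_finite = (2.0 - 2.0 ** -MANT_BITS) * 2.0 ** (2 ** EXP_BITS - 2 - BIAS)

# Smallest positive subnormal: stored exponent 0, mantissa 1.
min_subnormal = 2.0 ** (1 - BIAS - MANT_BITS)

# Decimal digits carried by the 11-bit significand (1 implicit + 10 explicit).
decimal_digits = (MANT_BITS + 1) * math.log10(2)

print(min_normal)                 # 6.103515625e-05
print(max_finite)                 # 65504.0
print(round(decimal_digits, 2))   # 3.31
```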

Overview

In netcl, a tensor with dtype="float16" (or the equivalent dtype="half") is stored as a cl.Buffer of half the byte count of an equivalent fp32 tensor. The __global half* kernel argument is generated automatically when the source is emitted.

The fp16 format is the storage format. The compute precision may be higher; the netcl kernel selector picks a mixed-precision strategy where the storage is half but the accumulators in matmul, conv, and batch-norm are fp32. This is the standard recipe for half-precision training and is the one AMP enables.
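Why the accumulator width matters can be seen in a small host-side simulation. This is only an illustrative sketch, not netcl's kernel code: it uses Python's struct module (format "e") to round values to fp16, and a Python float (a 64-bit double) stands in for the wider fp32 accumulator. The function names are hypothetical.

```python
import struct

def to_fp16(value: float) -> float:
    """Round a Python float to the nearest fp16 value (returned as a float)."""
    return struct.unpack("<e", struct.pack("<e", value))[0]

def sum_fp16_accumulator(values):
    """Sum fp16 inputs, rounding the running total to fp16 after every add."""
    acc = 0.0
    for v in values:
        acc = to_fp16(acc + to_fp16(v))
    return acc

def sum_wide_accumulator(values):
    """Sum fp16 inputs in a wider accumulator, rounding to fp16 once at the end."""
    acc = 0.0
    for v in values:
        acc += to_fp16(v)
    return to_fp16(acc)

values = [1.0] * 4096
fp16_sum = sum_fp16_accumulator(values)
wide_sum = sum_wide_accumulator(values)
print(fp16_sum, wide_sum)  # 2048.0 4096.0
```

The fp16 accumulator stalls at 2048 because the spacing between adjacent fp16 values there is 2, so adding 1.0 rounds back to the same value. The wider accumulator reaches the exact total, which is itself representable in fp16.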

Where It Lives

The format itself is defined by IEEE 754, section 3.6, and by the OpenCL cl_khr_fp16 extension (OpenCL 1.2 specification, section 5.1.1).
The netcl-side dtype mapping is in core/tensor.py:_np_dtype and core/tensor.py:_dtype_nbytes.
The runtime detection is in amp.py:supports_fp16 and core/capabilities.py:device_profile(...).has_fp16.

How It Works

The fp16 bit layout, from MSB to LSB, is:

s eeeee mmmmmmmmmm
1 5     10

s = 0 means positive, s = 1 means negative.
The exponent is stored with a bias of 15: for normal numbers the actual exponent is e - 15. The values e = 0 and e = 31 are reserved: e = 0 encodes zero and subnormal numbers, and e = 31 encodes infinities and NaN.
The mantissa has an implicit leading 1 for normal numbers. The total precision is therefore 11 bits (1 implicit + 10 explicit).

Special values:

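0x0000 and 0x8000: positive and negative zero.
0x0001 to 0x03FF: positive subnormal numbers (smaller than 6.1e-5); 0x8001 to 0x83FF are their negative counterparts.
0x7C00: positive infinity; 0xFC00: negative infinity.
0x7E00: the usual quiet NaN. Any pattern with e = 31 and a non-zero mantissa (for example 0x7C01 or 0xFC01) is also a NaN.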

The smallest positive normal is 0x0400, which equals 2^-14 = 6.103515625e-5. The largest finite value is 0x7BFF, which equals 65504.
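These bit patterns can be checked on the host without an OpenCL device, since Python's struct module supports binary16 through the "e" format character. The two helper names below are chosen for this sketch only.

```python
import math
import struct

def fp16_from_bits(bits: int) -> float:
    """Reinterpret a 16-bit pattern as an fp16 value."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]

def fp16_to_bits(value: float) -> int:
    """Return the 16-bit pattern of a value that is representable in fp16."""
    return struct.unpack("<H", struct.pack("<e", value))[0]

print(hex(fp16_to_bits(2.0 ** -14)))        # 0x400 (smallest positive normal)
print(fp16_from_bits(0x7BFF))               # 65504.0 (largest finite value)
print(fp16_from_bits(0x7C00))               # inf
print(math.isnan(fp16_from_bits(0x7E00)))   # True
print(math.isnan(fp16_from_bits(0x7C01)))   # True
```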

Code Example

A netcl tensor with dtype="float16":

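import pyopencl as cl   # assumption: netcl sits on top of PyOpenCL (cl.Buffer)
import netcl as nc

# Any OpenCL context and queue will do; these two calls are one way to get them.
ctx = cl.create_some_context()
q = cl.CommandQueue(ctx)

x = nc.Tensor.zeros((4, 1024), dtype="float16",
                    context=ctx, queue=q)
print(x.nbytes)             # 8192 (4 * 1024 elements * 2 bytes each)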

A custom OpenCL kernel that uses half:

#pragma OPENCL EXTENSION cl_khr_fp16 : enable
__kernel void scale_half(__global half* x, half s) {
    int gid = get_global_id(0);
    x[gid] = (half)((float)x[gid] * (float)s);
}

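The (float) casts are not required for correctness: once cl_khr_fp16 is enabled, half supports operator* directly. They are a defensive choice, because native half arithmetic is not equally well optimised on every driver, and on some of them an fp32 multiply followed by a cast back to half is faster. Measure on the target device before settling on either form.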

Performance & Trade-offs

Range is the binding constraint. Activations can exceed 65504 and overflow to infinity, while small gradients can underflow to zero. Under AMP the loss is multiplied by a GradScaler factor so that the gradients are pushed up into the well-represented range.
Precision is about 3 decimal digits. A 1.0 weight and a 0.999999 weight are the same value in fp16. For models where the relative difference between weights matters, keep a master copy in fp32.
Arithmetic cost: on many modern GPUs fp16 throughput is up to 2x that of fp32 in general-purpose pipelines, and roughly 4x to 8x on matrix-multiply pipelines that have dedicated half-precision support.
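Underflow and overflow are silent. A weight update whose magnitude falls below 6.1e-5 is stored as a subnormal with reduced precision (or flushed to zero on devices that do not support subnormals), and below about 3e-8 it rounds to zero; one whose magnitude exceeds the 65504 limit becomes infinity. Both are bugs that are easy to miss; the GradScaler is the standard defence.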
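The trade-offs above can be reproduced on the host with a few lines of Python. This is a sketch under stated assumptions: struct's "e" format stands in for device-side fp16 rounding, to_fp16 is a helper defined here, and LOSS_SCALE is an illustrative constant rather than the value or policy used by netcl's GradScaler.

```python
import struct

def to_fp16(value: float) -> float:
    """Round a Python float to fp16, mapping out-of-range values to infinity."""
    try:
        return struct.unpack("<e", struct.pack("<e", value))[0]
    except (OverflowError, struct.error):
        # struct refuses values beyond the fp16 range; hardware would
        # silently produce an infinity instead.
        return float("inf") if value > 0 else float("-inf")

print(to_fp16(0.999999))   # 1.0  (precision: indistinguishable from 1.0)
print(to_fp16(70000.0))    # inf  (overflow past 65504)
print(to_fp16(3e-5) > 0)   # True (kept as a subnormal, with fewer bits)
print(to_fp16(1e-8))       # 0.0  (underflow to zero)

# Loss scaling: multiply before rounding to fp16, divide again in wider precision.
LOSS_SCALE = 65536.0       # illustrative power-of-two scale factor
grad = 1e-8
scaled = to_fp16(grad * LOSS_SCALE)   # now a normal fp16 value
recovered = scaled / LOSS_SCALE       # unscaled outside fp16
print(recovered > 0)       # True
```

The unscaled gradient is lost entirely, while the scaled one survives the round trip with the usual fp16 relative error of roughly one part in two thousand.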