cl_khr_fp16
cl_khr_fp16
Status: External standard — Khronos
cl_khr_fp16extension
cl_khr_fp16 is the Khronos extension that adds 16-bit floating-point
storage and arithmetic to OpenCL. netcl uses it to enable
AMP on devices that expose the extension in their
supported extensions list.
The extension is defined in the OpenCL 1.2 specification, section
5.1.1 (and the corresponding section of later specifications). The
relevant data type is half (also spelled cl_half in the C
binding), a 16-bit IEEE 754 binary floating-point number. The
extension adds a corresponding set of vector types (half2,
half4, half8, half16) and a number of built-in functions
(half_hcos, half_hsin, etc.).
Overview
A netcl tensor's dtype is one of "float", "float32", "half",
"float16", "float64", "double". The "half" and "float16"
spelling both map to the fp16 format. When a tensor
is created with dtype="float16", its underlying buffer is
allocated as a cl.Buffer of half the size of an equivalent
fp32 buffer. The OpenCL kernel that reads or writes the buffer
must be compiled with #pragma OPENCL EXTENSION cl_khr_fp16 :
enable (which netcl prepends automatically when it sees a
half-precision kernel).
cl_khr_fp16 is optional. Devices that do not expose it cannot
allocate half-precision buffers. The netcl
DeviceProfile exposes a has_fp16 flag that the
AMP module uses to decide whether to enable autocast:
import netcl as nc
profile = nc.device.profile(queue)
profile.has_fp16 # True on most modern GPUs; False on many CPUs
Where It Lives
- The OpenCL extension itself is defined by the Khronos Group in the OpenCL specification, not in netcl.
- The netcl-side detection is in
amp.py:supports_fp16(queue)and incore/capabilities.py:device_profile. - The OpenCL C macro for the extension is
cl_khr_fp16, which is set by the OpenCL implementation when the extension is supported.
How It Works
When netcl generates an OpenCL kernel for a half-precision op, the preamble is extended with:
#pragma OPENCL EXTENSION cl_khr_fp16 : enable
The kernel arguments are typed as __global half* (or the
appropriate vector type). The build will fail at clBuildProgram
time if the device does not expose cl_khr_fp16; netcl catches
this and falls back to an fp32 implementation of the same op.
The actual cl_half storage format is the IEEE 754 binary16
layout: 1 sign bit, 5 exponent bits, 10 mantissa bits. The
fp16 article has the bit-level details.
Limitations
- No
cl_khr_fp16on most CPUs. POCL exposes it on some configurations, but the typicallibOpenCL.sofrom Intel's CPU OpenCL runtime does not. In that case AMP silently falls back to fp32. - Half-precision atomics are missing on most devices even when
cl_khr_fp16is present. netcl's cross-replica all-reduce therefore runs in fp32, with the fp16 inputs cast on entry. - Reduced range and precision.
halfhas a max value of about 65504 and only about 3 decimal digits of precision. NaNs and infinities are supported, soisnan/isinfchecks are still valid.
Code Example
Detecting the extension from a netcl program:
import netcl as nc
from netcl.amp import supports_fp16
ctx, queue = nc.device.manager.default()
if supports_fp16(queue):
with nc.autocast():
# half-precision forward
y = model(x)
else:
# fp32 fallback
y = model(x)
Compiling a custom OpenCL kernel that uses fp16:
#pragma OPENCL EXTENSION cl_khr_fp16 : enable
__kernel void scale_half(__global half* x, half s) {
int gid = get_global_id(0);
x[gid] *= s;
}
Performance & Trade-offs
- On most modern GPUs, fp16 is roughly 2x the throughput of fp32 for memory-bound ops, and up to 4x for the matrix-multiply pipelines on tensor cores (where the OpenCL driver maps to vendor intrinsics).
- For BatchNorm and reductions, the accumulator should be fp32 even when the input is fp16, to avoid catastrophic cancellation. netcl's fused BN kernel always accumulates in fp32.
- Use AMP's
GradScalerto keep the loss in a range where the fp16 representation does not underflow. The defaultinit_scale=2**16is a good starting point; raise it if you seefound_inf=Trueon every step.
See also
- cl_khr_fp16 — the architecture page on precision selection.
- OpenCL — the underlying standard.
- PyOpenCL — the Python binding.
- fp16 — the binary16 format.
- AMP — the autocast / scale wrapper.
- DeviceProfile — the device capabilities record.
- cl_khr_fp16 — this article.