netcl wiki
concepts

cl_khr_fp16

cl_khr_fp16

Status: External standard — Khronos cl_khr_fp16 extension

cl_khr_fp16 is the Khronos extension that adds 16-bit floating-point storage and arithmetic to OpenCL. netcl uses it to enable AMP on devices that expose the extension in their supported extensions list.

The extension is defined in the OpenCL 1.2 specification, section 5.1.1 (and the corresponding section of later specifications). The relevant data type is half (also spelled cl_half in the C binding), a 16-bit IEEE 754 binary floating-point number. The extension adds a corresponding set of vector types (half2, half4, half8, half16) and a number of built-in functions (half_hcos, half_hsin, etc.).

Overview

A netcl tensor's dtype is one of "float", "float32", "half", "float16", "float64", "double". The "half" and "float16" spelling both map to the fp16 format. When a tensor is created with dtype="float16", its underlying buffer is allocated as a cl.Buffer of half the size of an equivalent fp32 buffer. The OpenCL kernel that reads or writes the buffer must be compiled with #pragma OPENCL EXTENSION cl_khr_fp16 : enable (which netcl prepends automatically when it sees a half-precision kernel).

cl_khr_fp16 is optional. Devices that do not expose it cannot allocate half-precision buffers. The netcl DeviceProfile exposes a has_fp16 flag that the AMP module uses to decide whether to enable autocast:

import netcl as nc
profile = nc.device.profile(queue)
profile.has_fp16     # True on most modern GPUs; False on many CPUs

Where It Lives

  • The OpenCL extension itself is defined by the Khronos Group in the OpenCL specification, not in netcl.
  • The netcl-side detection is in amp.py:supports_fp16(queue) and in core/capabilities.py:device_profile.
  • The OpenCL C macro for the extension is cl_khr_fp16, which is set by the OpenCL implementation when the extension is supported.

How It Works

When netcl generates an OpenCL kernel for a half-precision op, the preamble is extended with:

#pragma OPENCL EXTENSION cl_khr_fp16 : enable

The kernel arguments are typed as __global half* (or the appropriate vector type). The build will fail at clBuildProgram time if the device does not expose cl_khr_fp16; netcl catches this and falls back to an fp32 implementation of the same op.

The actual cl_half storage format is the IEEE 754 binary16 layout: 1 sign bit, 5 exponent bits, 10 mantissa bits. The fp16 article has the bit-level details.

Limitations

  • No cl_khr_fp16 on most CPUs. POCL exposes it on some configurations, but the typical libOpenCL.so from Intel's CPU OpenCL runtime does not. In that case AMP silently falls back to fp32.
  • Half-precision atomics are missing on most devices even when cl_khr_fp16 is present. netcl's cross-replica all-reduce therefore runs in fp32, with the fp16 inputs cast on entry.
  • Reduced range and precision. half has a max value of about 65504 and only about 3 decimal digits of precision. NaNs and infinities are supported, so isnan / isinf checks are still valid.

Code Example

Detecting the extension from a netcl program:

import netcl as nc
from netcl.amp import supports_fp16

ctx, queue = nc.device.manager.default()
if supports_fp16(queue):
    with nc.autocast():
        # half-precision forward
        y = model(x)
else:
    # fp32 fallback
    y = model(x)

Compiling a custom OpenCL kernel that uses fp16:

#pragma OPENCL EXTENSION cl_khr_fp16 : enable
__kernel void scale_half(__global half* x, half s) {
    int gid = get_global_id(0);
    x[gid] *= s;
}

Performance & Trade-offs

  • On most modern GPUs, fp16 is roughly 2x the throughput of fp32 for memory-bound ops, and up to 4x for the matrix-multiply pipelines on tensor cores (where the OpenCL driver maps to vendor intrinsics).
  • For BatchNorm and reductions, the accumulator should be fp32 even when the input is fp16, to avoid catastrophic cancellation. netcl's fused BN kernel always accumulates in fp32.
  • Use AMP's GradScaler to keep the loss in a range where the fp16 representation does not underflow. The default init_scale=2**16 is a good starting point; raise it if you see found_inf=True on every step.

See also

  • cl_khr_fp16 — the architecture page on precision selection.
  • OpenCL — the underlying standard.
  • PyOpenCL — the Python binding.
  • fp16 — the binary16 format.
  • AMP — the autocast / scale wrapper.
  • DeviceProfile — the device capabilities record.
  • cl_khr_fp16 — this article.