OpenCL
Status: External standard — Khronos OpenCL 1.2 (and later)
OpenCL (Open Computing Language) is the cross-vendor parallel programming standard maintained by the Khronos Group. netcl uses OpenCL as its only GPU compute API: the entire tensor backend, the kernel selector, the JIT compiler, the BufferPool, and the autograd engine all dispatch to OpenCL on the GPU queue.
The relevant version for netcl is OpenCL 1.2 (the baseline Khronos
ratified in 2011; required for full netcl support), with the
cl_khr_fp16 extension for fp16 support and
cl_khr_int64_base_atomics for cross-replica reductions. Most
modern drivers expose OpenCL 3.0 in addition, but netcl does not
require it.
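The requirements above can be checked from the strings a driver reports. The following is a minimal sketch, not netcl's own check: the helper names `parse_cl_version` and `check_device` are hypothetical, and treating the two extensions as optional capabilities rather than hard requirements is an assumption. It relies only on the standard `CL_DEVICE_VERSION` format, `OpenCL <major>.<minor> <vendor-specific text>`, and on the space-separated extension list.

```python
import re

# Extension names taken from the text above.
FP16_EXT = "cl_khr_fp16"
INT64_ATOMICS_EXT = "cl_khr_int64_base_atomics"


def parse_cl_version(version_string):
    """Parse a CL_DEVICE_VERSION string such as 'OpenCL 1.2 CUDA' into (major, minor)."""
    match = re.match(r"OpenCL (\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError(f"not an OpenCL version string: {version_string!r}")
    return int(match.group(1)), int(match.group(2))


def check_device(version_string, extensions_string):
    """Report whether a device meets the 1.2 baseline and which extensions it exposes."""
    extensions = set(extensions_string.split())
    return {
        "meets_baseline": parse_cl_version(version_string) >= (1, 2),
        "has_fp16": FP16_EXT in extensions,
        "has_int64_atomics": INT64_ATOMICS_EXT in extensions,
    }
```

With PyOpenCL, the two inputs would come from `device.version` and `device.extensions`.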
Overview
An OpenCL program has four components:
- Platform — the vendor's OpenCL implementation (NVIDIA, Intel, AMD, Apple, POCL). One machine may have several.
- Device — a single GPU, CPU, or accelerator. The DeviceManager enumerates the devices on each platform.
- Context — the boundary at which memory and command-queue operations are valid. A cl.Context is created from one or more devices.
- Command queue — the submission point for kernel launches and memory copies. A cl.CommandQueue is associated with a single device.
A kernel is a function declared with the __kernel qualifier and a void
return type; it runs on the device. Kernels are written in OpenCL C (a
C99-based language with some restrictions and a few extensions) and are
compiled for each device by the clBuildProgram call.
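In plain PyOpenCL, the four components and the build step map onto a handful of calls. This is a sketch of the underlying API rather than of netcl's internals; the function names and the index-based device choice are illustrative assumptions. The PyOpenCL imports sit inside the functions so the module can be loaded on a machine without an OpenCL driver.

```python
KERNEL_SRC = """
__kernel void scale_by_two(__global float* x) {
    int gid = get_global_id(0);
    x[gid] *= 2.0f;
}
"""


def list_devices():
    """Return (platform name, device name) pairs for every OpenCL device found."""
    import pyopencl as cl  # lazy import: only needed when actually enumerating

    return [
        (platform.name, device.name)
        for platform in cl.get_platforms()
        for device in platform.get_devices()
    ]


def make_context_and_queue(platform_index=0, device_index=0):
    """Create a context and an in-order command queue for one chosen device."""
    import pyopencl as cl

    platform = cl.get_platforms()[platform_index]
    device = platform.get_devices()[device_index]
    context = cl.Context(devices=[device])    # context spans the listed devices
    queue = cl.CommandQueue(context, device)  # queue is tied to a single device
    return context, queue


def build_kernel(context):
    """Compile KERNEL_SRC for the context's devices (PyOpenCL calls clBuildProgram)."""
    import pyopencl as cl

    return cl.Program(context, KERNEL_SRC).build()
```

Calling `list_devices()` on a machine with several vendors' drivers installed returns entries from more than one platform, which is why a device is always identified together with its platform.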
Where It Lives (in netcl)
- netcl does not implement OpenCL — it wraps it through PyOpenCL.
- The OpenCL backend is at core/backend/opencl.py.
- The device enumeration is in core/device.py.
- The kernel selector (which kernel runs on which device) is in core/kernel_selector.py.
How It Works
A netcl program looks like this:
import netcl as nc
import numpy as np

# netcl picks an OpenCL device for you.
ctx, queue = nc.device.manager.default()

# Build a small custom op.
prg = nc.KernelSpec(
    name="scale_by_two",
    src="""
    __kernel void scale_by_two(__global float* x) {
        int gid = get_global_id(0);
        x[gid] *= 2.0f;
    }
    """,
).build(queue)

x = nc.Tensor.from_host(np.arange(1024, dtype=np.float32))
prg.run(queue, (x.size,), None, x)
netcl hides the OpenCL boilerplate (the clBuildProgram call, the
buffer allocation, the queue submission) and lets the user think
in terms of tensors and ops.
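For comparison, here is roughly what the same operation looks like when the three hidden steps are written out by hand in PyOpenCL. It is a sketch of the raw API, not a description of what netcl does internally; `scale_by_two_raw` is a hypothetical name, and letting `create_some_context` pick the device is an assumption made to keep the example short.

```python
SCALE_SRC = """
__kernel void scale_by_two(__global float* x) {
    int gid = get_global_id(0);
    x[gid] *= 2.0f;
}
"""


def scale_by_two_raw(values):
    """Double every element of `values` on an OpenCL device and return the result."""
    import numpy as np
    import pyopencl as cl

    host = np.asarray(values, dtype=np.float32)
    context = cl.create_some_context(interactive=False)
    queue = cl.CommandQueue(context)

    # Build step: PyOpenCL calls clBuildProgram here.
    program = cl.Program(context, SCALE_SRC).build()

    # Buffer allocation, initialised with a copy of the host data.
    flags = cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR
    device_buf = cl.Buffer(context, flags, hostbuf=host)

    # Queue submission: one work-item per element; None lets the driver
    # choose the work-group size.
    program.scale_by_two(queue, host.shape, None, device_buf)

    # Copy the result back; a copy into host memory blocks by default.
    result = np.empty_like(host)
    cl.enqueue_copy(queue, result, device_buf)
    return result
```

The `(x.size,), None` arguments passed to `prg.run` in the netcl snippet occupy the same positions as the global and local work sizes do here.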
The execution model
- Host code runs on the CPU. The user's Python program is host code.
- Device code runs on the OpenCL device. The kernels are device code.
- Buffers (cl.Buffer) are the device-side memory. The host cannot read or write a buffer directly; it must use cl.enqueue_copy to move data.
- Command queues serialize the work submitted to a single device. Out-of-order execution is supported but is opt-in.
- Events (cl.Event) are the synchronization primitives. A kernel launch can wait on a set of events; it produces one event that subsequent work can wait on.
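The event mechanics can be sketched in PyOpenCL as a chain of copy, kernel, kernel, copy, where each step waits on the event produced by the previous one. The two-kernel program and the function name `add_then_scale` are illustrative assumptions, not part of netcl.

```python
CHAIN_SRC = """
__kernel void add_one(__global float* x) {
    x[get_global_id(0)] += 1.0f;
}
__kernel void scale_by_two(__global float* x) {
    x[get_global_id(0)] *= 2.0f;
}
"""


def add_then_scale(values):
    """Compute (x + 1) * 2 on the device, ordering the steps with explicit events."""
    import numpy as np
    import pyopencl as cl

    host = np.asarray(values, dtype=np.float32)
    context = cl.create_some_context(interactive=False)
    queue = cl.CommandQueue(context)
    program = cl.Program(context, CHAIN_SRC).build()

    device_buf = cl.Buffer(context, cl.mem_flags.READ_WRITE, size=host.nbytes)

    # Host -> device copy; the returned event marks its completion.
    upload = cl.enqueue_copy(queue, device_buf, host, is_blocking=False)

    # Each launch waits on the previous event and produces a new one.
    added = program.add_one(queue, host.shape, None, device_buf, wait_for=[upload])
    scaled = program.scale_by_two(queue, host.shape, None, device_buf, wait_for=[added])

    # Device -> host copy that waits on the last kernel; block on its event.
    result = np.empty_like(host)
    download = cl.enqueue_copy(
        queue, result, device_buf, wait_for=[scaled], is_blocking=False
    )
    download.wait()
    return result
```

On the default in-order queue the `wait_for` lists are redundant, because commands already run in submission order. They become necessary once a queue is created with out-of-order execution enabled or when work is spread across several queues.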
Code Example
The most common user-facing pattern in netcl is "create a device, allocate tensors, run ops". The OpenCL specifics are hidden:
import netcl as nc
import numpy as np

# netcl picks an OpenCL device for you.
ctx, queue = nc.device.manager.default()

# Build a small custom op.
prg = nc.KernelSpec(
    name="scale_by_two",
    src="""
    __kernel void scale_by_two(__global float* x) {
        int gid = get_global_id(0);
        x[gid] *= 2.0f;
    }
    """,
).build(queue)

# Any float32 host array works as input; this one is just sample data.
numpy_array = np.arange(1024, dtype=np.float32)
x = nc.Tensor.from_host(numpy_array)

# Launch one work-item per element.
prg.run(queue, (x.size,), None, x)
Performance & Trade-offs
- First kernel launch is slow. clBuildProgram typically takes 10 ms to 500 ms, and the first clEnqueueNDRangeKernel call pays an extra driver warm-up. The JIT Compiler amortises this with a cache.
- Different vendors, different perf. NVIDIA's OpenCL driver is generally slower than its CUDA driver (sometimes by 2x); Intel's OpenCL is fast on integrated GPUs and competitive on discrete; AMD's OpenCL is fast on discrete GPUs and slow on integrated; Apple's OpenCL is deprecated in favour of Metal. netcl's kernel selector is aware of these differences and picks strategies that are known to work well on each.
- fp16 is optional. A device that does not expose cl_khr_fp16 cannot run AMP in fp16. The DeviceProfile.has_fp16 flag tells you whether the device supports it.
- Subgroup / wave intrinsics. OpenCL 2.0 added subgroups through the cl_khr_subgroups extension (they became a core feature in OpenCL 2.1); the netcl kernel selector uses them on devices that expose has_subgroups = True.
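The build-cost point is the easiest one to illustrate. The sketch below shows the general idea of memoising compiled programs so that only the first request for a given source pays the compile time. It is not netcl's JIT Compiler: the class `BuildCache`, its `build_fn` hook, and the choice of cache key (device name, build options, source) are all assumptions made for illustration.

```python
import hashlib


class BuildCache:
    """Memoise compiled programs by a digest of (device name, options, source)."""

    def __init__(self, build_fn):
        # build_fn(source, options) performs the expensive compile (hypothetical hook).
        self._build_fn = build_fn
        self._programs = {}
        self.misses = 0  # number of real builds performed

    @staticmethod
    def key(device_name, source, options=""):
        digest = hashlib.sha256()
        for part in (device_name, options, source):
            digest.update(part.encode("utf-8"))
            digest.update(b"\0")  # separator keeps field boundaries unambiguous
        return digest.hexdigest()

    def get(self, device_name, source, options=""):
        cache_key = self.key(device_name, source, options)
        if cache_key not in self._programs:
            self.misses += 1
            self._programs[cache_key] = self._build_fn(source, options)
        return self._programs[cache_key]
```

With PyOpenCL, `build_fn` could be something like `lambda src, opts: cl.Program(ctx, src).build(options=opts)`. The device name is part of the key because programs are built per device, so a binary compiled for one device cannot be assumed valid on another.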
See also
- OpenCL — the netcl device-management API.
- PyOpenCL — the Python wrapper netcl uses.
- cl_khr_fp16 — the fp16 extension.
- fp16 — the floating-point format itself.
- BufferPool — how netcl caches cl.Buffer allocations.
- Tensor Backend — the netcl stack on top of OpenCL.
- OpenCL — this article.