OpenCL

Status: External standard — Khronos OpenCL 1.2 (and later)

OpenCL (Open Computing Language) is the cross-vendor parallel programming standard maintained by the Khronos Group. netcl uses OpenCL as its only GPU compute API: the entire tensor backend, the kernel selector, the JIT compiler, the BufferPool, and the autograd engine all dispatch to OpenCL on the GPU queue.

netcl targets OpenCL 1.2, the version Khronos ratified in 2011, as the baseline required for full netcl support. It additionally uses the cl_khr_fp16 extension for fp16 support and cl_khr_int64_base_atomics for cross-replica reductions. Most modern drivers also report OpenCL 3.0, but netcl does not require it.
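
As a rough illustration of the version and extension requirements above, the following sketch checks a device's reported version string and extension list against the 1.2 baseline. It is not netcl's own check: the names `parse_cl_version`, `check_device`, `REQUIRED_VERSION`, and `EXTENSIONS_OF_INTEREST` are hypothetical. The two input strings correspond to what PyOpenCL exposes as `device.version` and `device.extensions`.

```python
import re

REQUIRED_VERSION = (1, 2)  # assumed baseline, taken from the text above
EXTENSIONS_OF_INTEREST = ("cl_khr_fp16", "cl_khr_int64_base_atomics")


def parse_cl_version(version_string):
    """Parse 'OpenCL <major>.<minor> <vendor text>' into (major, minor)."""
    match = re.match(r"OpenCL (\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError(f"unrecognised version string: {version_string!r}")
    return int(match.group(1)), int(match.group(2))


def check_device(version_string, extensions_string):
    """Report whether a device meets the baseline and which extensions it has."""
    version = parse_cl_version(version_string)
    # The extension list is a single space-separated string.
    available = set(extensions_string.split())
    return {
        "meets_baseline": version >= REQUIRED_VERSION,
        "extensions": {name: name in available for name in EXTENSIONS_OF_INTEREST},
    }
```

With PyOpenCL installed, a call would look like `check_device(device.version, device.extensions)` for any device returned by `platform.get_devices()`.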

Overview

An OpenCL program has four components:

Platform — the vendor's OpenCL implementation (NVIDIA, Intel, AMD, Apple, POCL). One machine may have several.
Device — a single GPU, CPU, or accelerator. The DeviceManager enumerates the devices on each platform.
Context — the boundary at which memory and command-queue operations are valid. A cl.Context is created from one or more devices.
Command queue — the submission point for kernel launches and memory copies. A cl.CommandQueue is associated with a single device.
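
The four components above map directly onto PyOpenCL objects. The sketch below uses plain PyOpenCL rather than netcl's DeviceManager, whose API is not shown here; the function names `list_opencl_devices` and `make_context_and_queue` are hypothetical. It assumes PyOpenCL is installed and returns an empty list when it is not, or when no OpenCL platform is present.

```python
try:
    import pyopencl as cl
except ImportError:  # lets the sketch import cleanly without PyOpenCL
    cl = None


def list_opencl_devices():
    """Return (platform name, device name, device type) for each visible device."""
    if cl is None:
        return []
    try:
        platforms = cl.get_platforms()
    except cl.Error:
        return []  # the ICD loader found no installed platform
    found = []
    for platform in platforms:
        try:
            devices = platform.get_devices()
        except cl.Error:
            continue  # this platform exposes no usable device
        for device in devices:
            kind = cl.device_type.to_string(device.type)
            found.append((platform.name, device.name, kind))
    return found


def make_context_and_queue(device):
    """Create a context holding one device and a command queue bound to it."""
    ctx = cl.Context(devices=[device])
    queue = cl.CommandQueue(ctx, device=device)
    return ctx, queue


if __name__ == "__main__":
    for platform_name, device_name, kind in list_opencl_devices():
        print(f"{platform_name}: {device_name} ({kind})")
```

On a machine with several vendor drivers, the loop prints one line per device, which makes the platform/device distinction visible.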

A kernel is a function declared with the __kernel qualifier and a void return type; it runs on the device. Kernels are written in OpenCL C (a subset of C99 with restrictions and a few extensions) and compiled for each target device by the clBuildProgram call.

Where It Lives (in netcl)

netcl does not implement OpenCL — it wraps it through PyOpenCL.
The OpenCL backend is at core/backend/opencl.py.
The device enumeration is in core/device.py.
The kernel selector (which kernel runs on which device) is in core/kernel_selector.py.

How It Works

A netcl program looks like this:

import netcl as nc
import numpy as np

# netcl picks an OpenCL device for you.
ctx, queue = nc.device.manager.default()

# Build a small custom op.
prg = nc.KernelSpec(
    name="scale_by_two",
    src="""
        __kernel void scale_by_two(__global float* x) {
            int gid = get_global_id(0);
            x[gid] *= 2.0f;
        }
    """,
).build(queue)

x = nc.Tensor.from_host(np.arange(1024, dtype=np.float32))
prg.run(queue, (x.size,), None, x)

netcl hides the OpenCL boilerplate (the clBuildProgram call, the buffer allocation, the queue submission) and lets the user think in terms of tensors and ops.
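
For comparison, here is roughly what the same scale_by_two operation looks like in plain PyOpenCL, with the build, buffer allocation, and queue submission written out. This is a sketch of the boilerplate being hidden, not netcl's internal implementation; the function name `scale_by_two_raw` is hypothetical, and the function returns None when NumPy, PyOpenCL, or an OpenCL device is unavailable.

```python
try:
    import numpy as np
    import pyopencl as cl
except ImportError:  # lets the sketch import cleanly without NumPy/PyOpenCL
    np = cl = None

KERNEL_SRC = """
__kernel void scale_by_two(__global float* x) {
    int gid = get_global_id(0);
    x[gid] *= 2.0f;
}
"""


def scale_by_two_raw(values):
    """Double `values` on the first OpenCL device found; None if unavailable."""
    if cl is None:
        return None
    try:
        device = cl.get_platforms()[0].get_devices()[0]
    except (cl.Error, IndexError):
        return None  # no OpenCL platform or device is installed

    host = np.asarray(values, dtype=np.float32).copy()
    ctx = cl.Context(devices=[device])
    queue = cl.CommandQueue(ctx)

    # Buffer allocation: create device memory initialised from the host array.
    flags = cl.mem_flags.READ_WRITE | cl.mem_flags.COPY_HOST_PTR
    buf = cl.Buffer(ctx, flags, hostbuf=host)

    # Program build: this is the step that calls clBuildProgram.
    program = cl.Program(ctx, KERNEL_SRC).build()

    # Queue submission: one work-item per element, default work-group size.
    program.scale_by_two(queue, host.shape, None, buf)

    # Copy the result back to the host (blocking by default).
    cl.enqueue_copy(queue, host, buf)
    return host
```

Each commented step corresponds to one of the pieces of boilerplate named above.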

The execution model

Host code runs on the CPU. The user's Python program is host code.
Device code runs on the OpenCL device. The kernels are device code.
Buffers (cl.Buffer) are the device-side memory. Host code does not access a buffer directly; it enqueues a transfer with cl.enqueue_copy (or maps the buffer into host memory) to move data.
Command queues serialize the work submitted to a single device. Out-of-order execution is supported but is opt-in.
Events (cl.Event) are the synchronization primitives. A kernel launch can wait on a set of events; it produces one event that subsequent work can wait on.
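
The buffer, queue, and event concepts above can be seen together in a short PyOpenCL sketch. It chains an upload, two kernel launches, and a download through explicit events. The kernel `add_one` and the function `add_then_scale` are hypothetical names for illustration, and the function returns None when NumPy, PyOpenCL, or an OpenCL device is unavailable. On the default in-order queue the `wait_for` arguments are redundant; they are spelled out because they become necessary on an out-of-order queue.

```python
try:
    import numpy as np
    import pyopencl as cl
except ImportError:  # lets the sketch import cleanly without NumPy/PyOpenCL
    np = cl = None

SRC = """
__kernel void add_one(__global float* x) { x[get_global_id(0)] += 1.0f; }
__kernel void scale_by_two(__global float* x) { x[get_global_id(0)] *= 2.0f; }
"""


def add_then_scale(values):
    """Compute (x + 1) * 2 on the device, chaining each step through events."""
    if cl is None:
        return None
    try:
        device = cl.get_platforms()[0].get_devices()[0]
    except (cl.Error, IndexError):
        return None  # no OpenCL platform or device is installed

    host = np.asarray(values, dtype=np.float32).copy()
    result = np.empty_like(host)
    ctx = cl.Context(devices=[device])
    queue = cl.CommandQueue(ctx)
    buf = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, size=host.nbytes)
    program = cl.Program(ctx, SRC).build()

    # Every enqueue returns an event; later work lists it in wait_for.
    upload = cl.enqueue_copy(queue, buf, host, is_blocking=False)
    first = program.add_one(queue, host.shape, None, buf, wait_for=[upload])
    second = program.scale_by_two(queue, host.shape, None, buf, wait_for=[first])
    download = cl.enqueue_copy(
        queue, result, buf, wait_for=[second], is_blocking=False
    )

    download.wait()  # block the host until the final copy has completed
    return result
```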

Code Example

The most common user-facing pattern in netcl is "create a device, allocate tensors, run ops". The OpenCL specifics are hidden:

import netcl as nc
import numpy as np

# netcl picks an OpenCL device for you.
ctx, queue = nc.device.manager.default()

# Build a small custom op.
prg = nc.KernelSpec(
    name="scale_by_two",
    src="""
        __kernel void scale_by_two(__global float* x) {
            int gid = get_global_id(0);  // one work-item per element
            x[gid] *= 2.0f;
        }
    """,
).build(queue)

# Copy a host array to the device, then launch one work-item per element.
numpy_array = np.arange(1024, dtype=np.float32)
x = nc.Tensor.from_host(numpy_array)
prg.run(queue, (x.size,), None, x)

Performance & Trade-offs

First kernel launch is slow. clBuildProgram typically takes 10 ms to 500 ms, and the first clEnqueueNDRangeKernel call pays an extra driver warm-up. The JIT Compiler amortises this with a cache.
Different vendors, different perf. NVIDIA's OpenCL driver is generally slower than its CUDA driver (sometimes by 2x); Intel's OpenCL is fast on integrated GPUs and competitive on discrete; AMD's OpenCL is fast on discrete GPUs and slow on integrated; Apple's OpenCL is deprecated in favour of Metal. netcl's kernel selector is aware of these differences and picks strategies that are known to work well on each.
fp16 is optional. A device that does not expose cl_khr_fp16 cannot run AMP in fp16. The DeviceProfile.has_fp16 flag tells you whether the device supports it.
Subgroup / wave intrinsics. Subgroups entered OpenCL as the cl_khr_subgroups extension for OpenCL 2.0 and became a core feature in OpenCL 2.1; the netcl kernel selector uses them on devices that report has_subgroups = True.
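
To make the build-cost point in the first item above concrete, here is a minimal sketch of a program cache keyed by kernel source, build options, and device name. It illustrates the general memoisation technique only and is not a description of netcl's JIT Compiler; the class `BuildCache` and its `build_fn` parameter are hypothetical.

```python
import hashlib


class BuildCache:
    """Memoise compiled programs by (source, build options, device name)."""

    def __init__(self, build_fn):
        # build_fn(source, options) does the expensive compile, for example
        # lambda src, opts: cl.Program(ctx, src).build(opts) with PyOpenCL.
        self._build_fn = build_fn
        self._programs = {}
        self.misses = 0

    @staticmethod
    def key(source, options, device_name):
        """Hash the three inputs that decide which binary is produced."""
        digest = hashlib.sha256()
        for part in (source, options, device_name):
            digest.update(part.encode("utf-8"))
            digest.update(b"\0")  # separator so ("ab", "c") != ("a", "bc")
        return digest.hexdigest()

    def get(self, source, options="", device_name=""):
        """Return the cached program, compiling only on the first request."""
        cache_key = self.key(source, options, device_name)
        if cache_key not in self._programs:
            self.misses += 1
            self._programs[cache_key] = self._build_fn(source, options)
        return self._programs[cache_key]
```

Including the device name in the key matters because programs are built per device, so a binary compiled for one device is not assumed to be valid on another.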