PyOpenCL

Status: External library — Andreas Kloeckner's PyOpenCL

PyOpenCL is the Python binding to OpenCL maintained by Andreas Kloeckner. netcl uses PyOpenCL for all of its GPU interaction: buffer allocation, kernel build, kernel launch, event handling, context and queue management.

PyOpenCL is a thin wrapper around the OpenCL C API. It exposes cl.Context, cl.CommandQueue, cl.Buffer, cl.Program, cl.Kernel, cl.Event as Python objects, and a numpy-friendly cl.array.Array class for high-level buffer arithmetic. The Tensor type in netcl is essentially a hand-written wrapper around a cl.Buffer plus a numpy view.

Overview

netcl depends on PyOpenCL at runtime. If PyOpenCL is not installed, import netcl still succeeds, but every call into a GPU op raises ImportError: pyopencl required. Both the [cpu] and [opencl] extras in netcl's pyproject.toml install PyOpenCL.

PyOpenCL is licensed under the MIT license; the version netcl targets is pyopencl >= 2024.1 (which exposes OpenCL 3.0 features on drivers that support them). The bundled cl.array module is used as a reference for netcl's own high-level arithmetic but is not used at runtime — netcl's ops are dispatched through netcl's own op system.

Where It Lives

netcl code that uses PyOpenCL is in core/backend/opencl.py and core/tensor.py.
The PyOpenCL import is guarded with a try / except ImportError in every file that uses it, so the CPU-only build is importable.
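
The guard described above might look roughly like the following. This is a minimal sketch, not netcl's actual code: optional_import and require_pyopencl are hypothetical names chosen for illustration, and only the error message is taken from the Overview.

```python
import importlib


def optional_import(name):
    """Return the imported module, or None if it is not installed."""
    try:
        return importlib.import_module(name)
    except ImportError:
        return None


# Module-level guard: importing this file never fails on a CPU-only install.
cl = optional_import("pyopencl")


def require_pyopencl():
    """Hypothetical check a GPU op could run before touching the device."""
    if cl is None:
        raise ImportError("pyopencl required")
    return cl
```

Deferring the error to the first GPU call keeps the package importable everywhere while still failing loudly at the point where the device is actually needed.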

How It Works

A typical PyOpenCL pattern in netcl:

import pyopencl as cl

# Context + queue — netcl's DeviceManager wraps this.
ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

# Build a kernel.
src = """
    __kernel void relu(__global const float* in, __global float* out) {
        int gid = get_global_id(0);
        out[gid] = in[gid] > 0.0f ? in[gid] : 0.0f;
    }
"""
prg = cl.Program(ctx, src).build()
relu = prg.relu

# Allocate device buffers: n float32 elements, 4 bytes each.
n = 1024
in_buf  = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, n * 4)
out_buf = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, n * 4)

# Copy input.
import numpy as np
host_in = np.random.randn(n).astype(np.float32)
cl.enqueue_copy(queue, in_buf, host_in)

# Launch.
relu(queue, (n,), None, in_buf, out_buf)

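# Copy back. enqueue_copy blocks by default for host transfers;
# the explicit .wait() just makes the synchronization visible.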
host_out = np.empty(n, dtype=np.float32)
cl.enqueue_copy(queue, host_out, out_buf).wait()

netcl's equivalent:

import netcl as nc
x = nc.Tensor.from_host(host_in)
y = nc.relu(x)
host_out = y.to_host()

The netcl version produces the same result with far less code at the call site; its implementation is built on the PyOpenCL calls shown above.
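
Either version can be sanity-checked against a CPU reference. The sketch below uses only NumPy; relu_reference is a hypothetical helper name, and the final comparison is left as a comment because it needs host_out from a real device run.

```python
import numpy as np


def relu_reference(x):
    """CPU reference: elementwise max(x, 0), preserving the dtype of x."""
    return np.maximum(x, x.dtype.type(0))


# Same shape and dtype as the input used in the PyOpenCL example.
rng = np.random.default_rng(0)
host_in = rng.standard_normal(1024).astype(np.float32)
expected = relu_reference(host_in)

# With host_out read back from the device, the check would be:
#   np.testing.assert_array_equal(host_out, expected)
```

Exact equality is reasonable here because ReLU only selects between the input value and zero and performs no arithmetic. Inputs containing NaN would differ: the kernel above maps NaN to 0.0f, while np.maximum propagates it.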

Code Example

A common way to drop down to PyOpenCL from within netcl is the KernelSpec helper, which lets you write a one-off kernel and run it from netcl:

import netcl as nc

spec = nc.KernelSpec(
    name="scale_by_two",
    src="__kernel void scale_by_two(__global float* x) {"
        "  int gid = get_global_id(0);"
        "  x[gid] *= 2.0f;"
        "}",
)
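# queue, n, and x are reused from the examples under How It Works:
# queue = cl.CommandQueue(ctx), n = 1024, x = nc.Tensor.from_host(host_in).
prg = spec.build(queue)
prg.run(queue, (n,), None, x)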

The Tutorial: Custom OpenCL Kernel walks through a more complete example.

Performance & Trade-offs

PyOpenCL adds little overhead. The Python cost per call is in the low microseconds, so for all but the smallest kernels the GPU work is the bottleneck, not the binding.
The cl.array module is convenient for prototyping, but netcl does not use it at runtime: netcl's own op system integrates more tightly with the BufferPool and the JIT Compiler.
If you need to drop down to PyOpenCL for a one-off kernel, prefer nc.KernelSpec over cl.Program — netcl's spec records the work-group size that the autotuner picked, so subsequent runs use the right size automatically.
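PyOpenCL does not install a SIGINT handler on import, and pyopencl.tools has no install_sigint_handler() function. If pending work must complete when the user presses Ctrl-C, catch KeyboardInterrupt (or install your own handler) and call queue.finish() before exiting.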
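
The per-call overhead figure is easy to measure on your own setup rather than take on trust. The following is a minimal sketch using only the standard library; mean_call_time is a hypothetical helper, and the no-op lambda is a stand-in for a real kernel launch.

```python
import time


def mean_call_time(fn, repeats=1000):
    """Average wall-clock seconds per call of fn over `repeats` calls."""
    fn()  # warm-up call so one-time setup cost is excluded
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats


# Stand-in workload. With PyOpenCL you would pass something like
#   lambda: relu(queue, (n,), None, in_buf, out_buf)
# and call queue.finish() afterwards so no queued work is left running.
overhead = mean_call_time(lambda: None)
print(f"{overhead * 1e6:.3f} us per call")
```

Timing only the enqueue call measures host-side cost, which is the binding overhead in question. It says nothing about how long the kernel itself runs on the device.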