PyOpenCL

Status: External library — Andreas Kloeckner's PyOpenCL

PyOpenCL is the Python binding to OpenCL maintained by Andreas Kloeckner. netcl uses PyOpenCL for all of its GPU interaction: buffer allocation, kernel build, kernel launch, event handling, context and queue management.

PyOpenCL is a thin wrapper around the OpenCL C API. It exposes cl.Context, cl.CommandQueue, cl.Buffer, cl.Program, cl.Kernel, cl.Event as Python objects, and a numpy-friendly cl.array.Array class for high-level buffer arithmetic. The Tensor type in netcl is essentially a hand-written wrapper around a cl.Buffer plus a numpy view.

Overview

netcl depends on PyOpenCL at runtime for all GPU work. If PyOpenCL is not installed, importing netcl still succeeds, but every call into a GPU op raises ImportError: pyopencl required. Both the [cpu] and [opencl] extras in netcl's pyproject.toml install PyOpenCL.

PyOpenCL is licensed under the MIT license; the version netcl targets is pyopencl >= 2024.1 (which exposes OpenCL 3.0 features on drivers that support them). The bundled cl.array module is used as a reference for netcl's own high-level arithmetic but is not used at runtime — netcl's ops are dispatched through netcl's own op system.

Where It Lives

  • netcl code that uses PyOpenCL is in core/backend/opencl.py and core/tensor.py.
  • The PyOpenCL import is guarded with a try / except ImportError in every file that uses it, so the CPU-only build is importable.
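
The guarded import and the lazy ImportError described above can be sketched as follows. This is a minimal illustration, not netcl's actual code: the names require_pyopencl and gpu_relu_stub are hypothetical, and only the error message "pyopencl required" is taken from the text.

```python
# Guarded import: the module stays importable when PyOpenCL is missing.
try:
    import pyopencl as cl
except ImportError:
    cl = None  # GPU ops check this and fail lazily, at call time.


def require_pyopencl(module=cl):
    """Return the PyOpenCL module, or raise the error the text describes.

    Hypothetical helper name; not part of netcl's documented API.
    """
    if module is None:
        raise ImportError("pyopencl required")
    return module


def gpu_relu_stub(module=cl):
    """Hypothetical GPU op: checks for PyOpenCL before doing any work."""
    backend = require_pyopencl(module)
    # A real op would build and launch a kernel through `backend` here.
    return backend
```

The check runs inside the op rather than at import time, which is what keeps a plain import working on machines without an OpenCL stack.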

How It Works

A typical PyOpenCL pattern in netcl:

import numpy as np
import pyopencl as cl

# Context + queue; netcl's DeviceManager wraps this.
ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

# Build a kernel.
src = """
    __kernel void relu(__global const float* in, __global float* out) {
        int gid = get_global_id(0);
        out[gid] = in[gid] > 0.0f ? in[gid] : 0.0f;
    }
"""
prg = cl.Program(ctx, src).build()
relu = prg.relu

# Allocate device buffers: n float32 values, 4 bytes each.
n = 1024
in_buf  = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, n * 4)
out_buf = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, n * 4)

# Copy input (host to device).
host_in = np.random.randn(n).astype(np.float32)
cl.enqueue_copy(queue, in_buf, host_in)

# Launch.
relu(queue, (n,), None, in_buf, out_buf)

# Copy back.
host_out = np.empty(n, dtype=np.float32)
cl.enqueue_copy(queue, host_out, out_buf).wait()
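
The launch above passes None as the local size, which leaves the work-group size to the OpenCL implementation. When an explicit local size is supplied instead, OpenCL versions before 2.0 require the global size to be a multiple of it, so the global size is usually rounded up and the kernel guards against the extra work-items (for example with an added n argument and an if (gid < n) check, which the relu kernel above does not have). A minimal sketch of the rounding step; round_up_global_size is a hypothetical helper name, not part of PyOpenCL or netcl:

```python
def round_up_global_size(n: int, local_size: int) -> int:
    """Return the smallest multiple of local_size that is >= n."""
    if n < 0 or local_size <= 0:
        raise ValueError("n must be >= 0 and local_size must be > 0")
    return ((n + local_size - 1) // local_size) * local_size


n = 1000          # number of elements to process
local_size = 64   # assumed work-group size, chosen for illustration
global_size = round_up_global_size(n, local_size)

# Shapes that would be passed as the global and local size of a launch.
launch_shape = ((global_size,), (local_size,))
```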

netcl's equivalent:

import netcl as nc
x = nc.Tensor.from_host(host_in)
y = nc.relu(x)
host_out = y.to_host()

The netcl version produces the same result with far less code at the call site; its implementation is built on top of the same PyOpenCL calls shown above.
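
Either version can be checked against a host-side reference. The sketch below mirrors the kernel's ternary (x > 0 ? x : 0) in plain Python; relu_reference and all_close are hypothetical helper names introduced for illustration, and the comparison tolerance is an assumption.

```python
import random


def relu_reference(values):
    """Host-side ReLU that mirrors the kernel: v if v > 0, else 0."""
    return [v if v > 0.0 else 0.0 for v in values]


def all_close(a, b, tol=1e-6):
    """True when both sequences have equal length and match within tol."""
    return len(a) == len(b) and all(abs(x - y) <= tol for x, y in zip(a, b))


random.seed(0)
sample_in = [random.gauss(0.0, 1.0) for _ in range(1024)]
expected = relu_reference(sample_in)
```

In practice, the values copied back from the device (host_out in either version) would be passed to all_close together with the reference computed from the same input.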

Code Example

A common way to drop down to raw OpenCL from within netcl is the KernelSpec helper, which lets you write a one-off kernel and run it from netcl:

import netcl as nc

spec = nc.KernelSpec(
    name="scale_by_two",
    src="__kernel void scale_by_two(__global float* x) {"
        "  int gid = get_global_id(0);"
        "  x[gid] *= 2.0f;"
        "}",
)
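# queue and n come from the PyOpenCL example above; x is assumed to be a
# netcl Tensor such as the one created with nc.Tensor.from_host(host_in).
prg = spec.build(queue)
prg.run(queue, (n,), None, x)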

The Tutorial: Custom OpenCL Kernel walks through a more complete example.

Performance & Trade-offs

  • PyOpenCL adds little overhead. The Python cost per call is on the order of microseconds, so for all but the smallest kernels the GPU work is the bottleneck, not the binding.
  • The cl.array module is convenient for prototyping but is not used at runtime in netcl. netcl's own op system has tighter integration with the BufferPool and the JIT Compiler.
  • If you need to drop down to PyOpenCL for a one-off kernel, prefer nc.KernelSpec over cl.Program — netcl's spec records the work-group size that the autotuner picked, so subsequent runs use the right size automatically.
  • PyOpenCL does not install a SIGINT handler of its own, and pyopencl.tools has no install_sigint_handler() function. If pending transfers must complete before the process exits on Ctrl-C, call queue.finish() in your own cleanup path.
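
The KernelSpec bullet above says the spec records the work-group size the autotuner picked so that later runs reuse it. The idea can be sketched as a small cache keyed by global size. Everything here is hypothetical and for illustration only: TunedKernelSketch, local_size_for, and fake_autotune are invented names, and this is not netcl's KernelSpec or its autotuner.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple

Shape = Tuple[int, ...]


@dataclass
class TunedKernelSketch:
    """Hypothetical stand-in that remembers a tuned local size per launch."""
    name: str
    tuned: Dict[Shape, Shape] = field(default_factory=dict)

    def local_size_for(self, global_size: Shape,
                       autotune: Callable[[Shape], Shape]) -> Shape:
        # Run the (expensive) autotuner only on the first request.
        if global_size not in self.tuned:
            self.tuned[global_size] = autotune(global_size)
        return self.tuned[global_size]


autotune_calls = []


def fake_autotune(global_size: Shape) -> Shape:
    """Placeholder tuner: records the call and returns a fixed size."""
    autotune_calls.append(global_size)
    return (64,)


spec = TunedKernelSketch("scale_by_two")
first = spec.local_size_for((1024,), fake_autotune)
second = spec.local_size_for((1024,), fake_autotune)  # served from the cache
```

A bare cl.Program has no such record, so code that uses it directly has to carry the chosen work-group size itself.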

See also

  • OpenCL — the underlying standard.
  • PyOpenCL — the netcl-side use of PyOpenCL.
  • Tensor — the netcl wrapper around cl.Buffer.
  • BufferPool — the pool of cl.Buffers.
  • KernelSpec — the netcl helper for one-off kernels.