
KernelSpec

Status: Public API in netcl.core.kernel_selector.KernelSpec

KernelSpec is the netcl record that describes a single OpenCL kernel: its name, the OpenCL C source, the shape signature it expects, and (after autotuning) the optimal work-group size. The spec is the boundary between the user-written kernel and the rest of the runtime.

KernelSpec is also how a custom kernel is registered with the netcl op system. Once a spec has been built with spec.build(queue), calling spec.run(queue, global_size, local_size, *args) launches the kernel through the same interface as the built-in ops.

Overview

from dataclasses import dataclass

@dataclass
class KernelSpec:
    name:           str
    src:            str
    shape_sig:      tuple = ()
    workgroup_size: int | None = None
    extra_flags:    tuple = ()
    tune:           bool = True

The fields are:

name — the registered name. The build produces a cl.Program with a kernel named <name> (matching the __kernel void <name>(...) function in src).
src — the OpenCL C source. May contain a single kernel or multiple kernels; the build processes all of them.
shape_sig — the shape signature the op dispatch uses to look up the kernel. The convention is the tuple of (input_shapes, output_shapes) reduced to a hash. Two calls with the same shape_sig hit the same cache slot.
workgroup_size — the autotuned local size. None means "not autotuned, use the device default of 64".
extra_flags — extra build flags (e.g. ("-cl-fast-relaxed-math",)).
tune — whether to autotune the work-group size on the first spec.run call after the build. Default True.
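The shape_sig convention can be illustrated with a small standalone sketch. The helper make_shape_sig, the kernel_cache dict, and get_or_build are hypothetical names invented for this illustration; netcl's actual hashing and cache layout are not shown in this document and may differ. The point is only that equal shapes normalize to equal, hashable signatures and so land in the same cache slot.

```python
# Hypothetical helper; netcl's real signature/hash scheme may differ.
def make_shape_sig(input_shapes, output_shapes):
    """Normalize shapes to nested tuples so the signature is hashable."""
    ins = tuple(tuple(int(d) for d in shape) for shape in input_shapes)
    outs = tuple(tuple(int(d) for d in shape) for shape in output_shapes)
    return (ins, outs)

# Stand-in for the dispatch cache, keyed by (kernel name, shape signature).
kernel_cache = {}

def get_or_build(name, shape_sig, build_fn):
    """Return the cached entry for this key, building it on a miss."""
    key = (name, shape_sig)
    if key not in kernel_cache:
        kernel_cache[key] = build_fn()
    return kernel_cache[key]

sig_a = make_shape_sig([(64, 128), (128, 32)], [(64, 32)])
sig_b = make_shape_sig([[64, 128], [128, 32]], [[64, 32]])  # same shapes, given as lists
sig_c = make_shape_sig([(32, 128), (128, 32)], [(32, 32)])  # different shapes

first = get_or_build("matmul_tiled", sig_a, lambda: object())
second = get_or_build("matmul_tiled", sig_b, lambda: object())  # cache hit
third = get_or_build("matmul_tiled", sig_c, lambda: object())   # new slot
```

Normalizing to tuples before hashing means callers can pass lists or tuples without splitting the cache.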

Where It Lives

File path: core/kernel_selector.py.
Module path: netcl.core.kernel_selector.
Public re-export: from netcl.core import KernelSpec.

How It Works

spec.build(queue) compiles the OpenCL source for the device on queue. The build:

Prepends the netcl preamble (defines ADD, MUL, RELU, etc.).
Adds #pragma OPENCL EXTENSION cl_khr_fp16 : enable if any of the kernel arguments is __global half*.
Runs cl.Program.build() with extra_flags.
Caches the resulting cl.Program in the spec.

spec.run(queue, global_size, local_size, *args) enqueues a launch of the named kernel with the given arguments. An explicit local_size always takes precedence. If local_size is None, the spec's workgroup_size is used when it is set; if that is also None, the default of 64 is used.
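The fallback order for the local size can be written as a small function. The name resolve_local_size is hypothetical and the sketch treats sizes as plain integers; how netcl maps a scalar workgroup_size onto a multi-dimensional launch is not described here.

```python
DEFAULT_LOCAL_SIZE = 64  # the default named in the text

def resolve_local_size(local_size, workgroup_size):
    """Pick the local size: explicit argument, then spec value, then default."""
    if local_size is not None:
        return local_size
    if workgroup_size is not None:
        return workgroup_size
    return DEFAULT_LOCAL_SIZE

explicit = resolve_local_size(256, 128)     # caller's value is used
from_spec = resolve_local_size(None, 128)   # falls back to the spec
fallback = resolve_local_size(None, None)   # falls back to the default
```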

Code Example

A minimal custom kernel:

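import netcl as nc
from netcl.core import KernelSpec

# `queue` is assumed to be an existing OpenCL command queue.
spec = KernelSpec(
    name="scale_by_two",
    src="""__kernel void scale_by_two(__global float* x) {
        int gid = get_global_id(0);
        x[gid] *= 2.0f;
    }""",
)
spec.build(queue)  # compiles the source and caches the cl.Program in the spec

# `numpy_array` is assumed to be an existing float32 NumPy array.
x = nc.Tensor.from_host(numpy_array)
# local_size=None falls back to spec.workgroup_size, then to 64.
spec.run(queue, (x.size,), None, x)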

A kernel with the autotuner enabled:

from netcl.core import KernelSpec

spec = KernelSpec(
    name="matmul_tiled",
    src="""__kernel void matmul_tiled(
        __global const float* a, __global const float* b,
        __global float* c, int M, int N, int K) {
        // ... tiled GEMM body ...
    }""",
    workgroup_size=None,    # no known size yet
    tune=True,              # the default; the autotuner picks the size
)
spec.build(queue)
# First call autotunes; subsequent calls use the cached size.
spec.run(queue, (M, N), None, a, b, c, M, N, K)
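The "tune once, then reuse" behaviour can be sketched independently of OpenCL. Everything here is an assumption made for illustration: the candidate sizes, the measure callable (which stands in for timing a real kernel launch), and the class name TunedLauncher. netcl's actual autotuner is not shown in this document.

```python
def autotune(measure, candidates=(32, 64, 128, 256)):
    """Return the candidate local size with the lowest measured cost."""
    timings = {size: measure(size) for size in candidates}
    return min(timings, key=timings.get)

class TunedLauncher:
    """Tunes on the first request, then reuses the cached size."""

    def __init__(self, measure, tune=True):
        self.measure = measure
        self.tune = tune
        self.workgroup_size = None
        self.tune_runs = 0

    def local_size(self):
        if self.workgroup_size is None and self.tune:
            self.workgroup_size = autotune(self.measure)
            self.tune_runs += 1
        # Without a tuned size, fall back to the default of 64.
        return 64 if self.workgroup_size is None else self.workgroup_size

# Fake cost model for the demo: pretend 128 is the fastest local size.
fake_cost = lambda size: abs(size - 128)

tuned = TunedLauncher(fake_cost)
first_size = tuned.local_size()    # triggers the tuning pass
second_size = tuned.local_size()   # served from the cached value

untuned = TunedLauncher(fake_cost, tune=False)
untuned_size = untuned.local_size()
```

A real measure function would launch the kernel with each candidate size and time it, which is why tuning is worth skipping when the best size is already known.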

A spec that registers a custom op in the autograd system:

from netcl.autograd.compiler import register_primitive
from netcl.core import KernelSpec

spec = KernelSpec(name="my_op", src="...")
spec.build(queue)  # the compiled program is cached in the spec

# Tell the autograd system how to differentiate through my_op.
register_primitive(
    name="my_op",
    forward=lambda args, attrs: f"MY_OP({args[0]}, {args[1]})",
    backward=lambda args, grad_var, attrs, out_var: [
        f"MUL({grad_var}, {args[1]})",
        f"MUL({grad_var}, {args[0]})",
    ],
    arity=2,
    fusible=True,
)

Performance & Trade-offs

tune=True is the default. The first call to spec.run runs the autotuner, which typically adds a few hundred microseconds to that call. Set tune=False and pass workgroup_size explicitly for kernels whose optimal local size is known in advance.
The cached cl.Program lives in the spec for the lifetime of the Python process. Long-running training jobs should not re-build specs in a hot loop.
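extra_flags=("-cl-fast-relaxed-math",) passes the standard OpenCL fast-math build option, which implies -cl-finite-math-only and -cl-unsafe-math-optimizations. The compiler may then assume that no value is NaN or infinite, so the kernel is usually faster but NaN and Inf handling is no longer reliable. Use it only when you know the kernel's inputs and intermediate values are finite.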
The spec is not thread-safe: two threads that build the same spec concurrently will both call clBuildProgram, and only one of the resulting programs ends up in the spec's cache. Use a lock if you build specs from multiple threads.
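One way to apply the locking advice above is a small wrapper that serializes the build. This is a sketch under stated assumptions: LockedBuilder is a hypothetical helper, not part of netcl, and build_fn stands in for a call such as lambda: spec.build(queue). A dummy build function is used here so the example runs without an OpenCL device.

```python
import threading

class LockedBuilder:
    """Wraps a build callable so that concurrent callers trigger one build."""

    def __init__(self, build_fn):
        self._build_fn = build_fn
        self._lock = threading.Lock()
        self._program = None
        self.build_count = 0

    def build(self):
        # The lock covers both the check and the build, so only the first
        # caller compiles and the rest reuse the cached result.
        with self._lock:
            if self._program is None:
                self._program = self._build_fn()
                self.build_count += 1
            return self._program

# Dummy stand-in for an expensive spec.build(queue) call.
builder = LockedBuilder(lambda: object())

results = []
threads = [
    threading.Thread(target=lambda: results.append(builder.build()))
    for _ in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Holding the lock for the whole build keeps the logic simple; the cost is that other threads wait while the first one compiles, which is the intended behaviour here.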