KernelSpec
Status: Public API in
netcl.core.kernel_selector.KernelSpec
KernelSpec is the netcl record that describes a single OpenCL
kernel: its name, the OpenCL C source, the shape signature it
expects, and (after autotuning) the optimal work-group size. The
spec is the boundary between the user-written kernel and the
rest of the runtime.
KernelSpec is also the place to register a custom kernel with
the netcl op system. Once a spec is built, the user can call
spec.run(queue, global_size, local_size, *args) to launch the
kernel with the same ergonomic interface as the built-in ops.
Overview
@dataclass
class KernelSpec:
    name: str
    src: str
    shape_sig: tuple = ()
    workgroup_size: int | None = None
    extra_flags: tuple = ()
    tune: bool = True
The fields are:
- name: the registered name. The build produces a cl.Program with a kernel named <name> (matching the __kernel void <name>(...) function in src).
- src: the OpenCL C source. May contain a single kernel or multiple kernels; the build processes all of them.
- shape_sig: the shape signature the op dispatch uses to look up the kernel. The convention is the tuple of (input_shapes, output_shapes) reduced to a hash. Two calls with the same shape_sig hit the same cache slot.
- workgroup_size: the autotuned local size. None means "not autotuned, use the device default of 64".
- extra_flags: extra build flags (e.g. ("-cl-fast-relaxed-math",)).
- tune: whether to autotune on first build. Default True.
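To make the shape_sig convention concrete, here is a minimal sketch of how a signature could be derived and used as a cache key. The helper make_shape_sig, the kernel_cache dict, and the lookup function are hypothetical names invented for illustration; netcl's actual hashing and cache layout may differ. The sketch keeps the normalized tuple and lets Python's dict do the hashing.

```python
from typing import Dict, Sequence


def make_shape_sig(input_shapes: Sequence[Sequence[int]],
                   output_shapes: Sequence[Sequence[int]]) -> tuple:
    """Hypothetical helper: normalize shapes into a hashable tuple."""
    ins = tuple(tuple(int(d) for d in shape) for shape in input_shapes)
    outs = tuple(tuple(int(d) for d in shape) for shape in output_shapes)
    return (ins, outs)


# Hypothetical cache keyed by (kernel name, shape signature).
kernel_cache: Dict[tuple, str] = {}


def lookup(name: str, sig: tuple) -> str:
    """Return the cached entry for this name and signature, creating it once."""
    key = (name, sig)
    if key not in kernel_cache:
        # Placeholder for a compiled program; no real build happens here.
        kernel_cache[key] = f"compiled<{name}>"
    return kernel_cache[key]


# Lists and tuples describing the same shapes normalize to the same signature.
sig_a = make_shape_sig([(128, 64), (64, 32)], [(128, 32)])
sig_b = make_shape_sig([[128, 64], [64, 32]], [[128, 32]])
# A different batch dimension yields a different signature, hence a new slot.
sig_c = make_shape_sig([(256, 64), (64, 32)], [(256, 32)])
```

Two calls whose shapes match land on the same key, which is the behavior the shape_sig field describes.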
Where It Lives
- File path: core/kernel_selector.py.
- Module path: netcl.core.kernel_selector.
- Public re-export: from netcl.core import KernelSpec.
How It Works
spec.build(queue) compiles the OpenCL source for the device
on queue. The build:
- Prepends the netcl preamble (defines ADD, MUL, RELU, etc.).
- Adds #pragma OPENCL EXTENSION cl_khr_fp16 : enable if any of the kernel arguments is __global half*.
- Runs cl.Program.build() with extra_flags.
- Caches the resulting cl.Program in the spec.
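The source-assembly part of the build steps can be sketched in plain Python. This is an illustration under stated assumptions, not netcl's implementation: the macro bodies in PREAMBLE are invented (the article only names ADD, MUL, and RELU), the function assemble_source is hypothetical, and placing the pragma ahead of the preamble is an assumption about ordering.

```python
import re

# Hypothetical preamble: the real netcl macro definitions are not shown here.
PREAMBLE = (
    "#define ADD(a, b) ((a) + (b))\n"
    "#define MUL(a, b) ((a) * (b))\n"
    "#define RELU(a) fmax((a), 0.0f)\n"
)

FP16_PRAGMA = "#pragma OPENCL EXTENSION cl_khr_fp16 : enable\n"

# Matches a __global half pointer, with an optional const qualifier.
_HALF_ARG = re.compile(r"__global\s+(?:const\s+)?half\s*\*")


def assemble_source(src: str) -> str:
    """Build the final OpenCL C text: optional fp16 pragma, preamble, user source."""
    parts = []
    if _HALF_ARG.search(src):
        # Assumed ordering: enable fp16 before any code that might use half.
        parts.append(FP16_PRAGMA)
    parts.append(PREAMBLE)
    parts.append(src)
    return "".join(parts)


half_src = "__kernel void f(__global half* x) {}"
float_src = "__kernel void f(__global float* x) {}"
half_out = assemble_source(half_src)
float_out = assemble_source(float_src)
```

The assembled string is what would then be handed to the program build together with extra_flags.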
spec.run(queue, global_size, local_size, *args) enqueues a
launch of the named kernel with the given arguments. If
local_size is None, the spec's workgroup_size is used
(if set); otherwise the device default of 64 is used.
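The fallback order for the local size can be written as a small function. The name resolve_local_size is hypothetical and the sketch only mirrors the rule described above: an explicit local_size wins, then the spec's workgroup_size, then the device default of 64.

```python
DEVICE_DEFAULT_LOCAL_SIZE = 64  # default stated in this article


def resolve_local_size(local_size, workgroup_size,
                       default=DEVICE_DEFAULT_LOCAL_SIZE):
    """Pick the local size for a launch using the documented fallback order."""
    if local_size is not None:
        # The caller passed an explicit local size; use it unchanged.
        return local_size
    if workgroup_size is not None:
        # Fall back to the size stored on the spec (for example, from autotuning).
        return workgroup_size
    # Nothing specified anywhere: use the device default.
    return default


explicit = resolve_local_size((16, 16), 128)
from_spec = resolve_local_size(None, 128)
fallback = resolve_local_size(None, None)
```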
Code Example
A minimal custom kernel:
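import numpy as np
import netcl as nc

# `queue` is assumed to be an existing command queue for the target device.
spec = nc.KernelSpec(
    name="scale_by_two",
    src="""__kernel void scale_by_two(__global float* x) {
        int gid = get_global_id(0);
        x[gid] *= 2.0f;
    }""",
)
spec.build(queue)  # compile and cache the program in the spec

numpy_array = np.arange(1024, dtype=np.float32)
x = nc.Tensor.from_host(numpy_array)
# One work-item per element; local_size=None defers to the spec or device default.
spec.run(queue, (x.size,), None, x)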
A kernel with the autotuner enabled:
spec = nc.KernelSpec(
    name="matmul_tiled",
    src="""__kernel void matmul_tiled(
        __global const float* a, __global const float* b,
        __global float* c, int M, int N, int K) {
        // ... tiled GEMM body ...
    }""",
    workgroup_size=None,  # autotune
)
spec.build(queue)
# a, b, c are device tensors; M, N, K are the matrix dimensions.
# First call autotunes; subsequent calls use the cached size.
spec.run(queue, (M, N), None, a, b, c, M, N, K)
A spec that registers a custom op in the autograd system:
from netcl.autograd.compiler import register_primitive
from netcl.core import KernelSpec

spec = KernelSpec(name="my_op", src="...")
prg = spec.build(queue)

# Tell the autograd system how to differentiate through my_op.
register_primitive(
    name="my_op",
    forward=lambda args, attrs: f"MY_OP({args[0]}, {args[1]})",
    backward=lambda args, grad_var, attrs, out_var: [
        f"MUL({grad_var}, {args[1]})",
        f"MUL({grad_var}, {args[0]})",
    ],
    arity=2,
    fusible=True,
)
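To see what the forward and backward callbacks in the registration above produce, they can be called directly without netcl. This sketch assumes the autograd compiler passes operand names as strings; the names "a", "b", "g", and "y" are placeholders chosen for illustration. The backward expressions match an elementwise product, where the gradient with respect to each input is the incoming gradient times the other input.

```python
# The same callbacks as in the registration above, bound to local names.
forward = lambda args, attrs: f"MY_OP({args[0]}, {args[1]})"
backward = lambda args, grad_var, attrs, out_var: [
    f"MUL({grad_var}, {args[1]})",
    f"MUL({grad_var}, {args[0]})",
]

# Placeholder operand names; a real compiler would supply its own.
fwd_expr = forward(["a", "b"], {})
bwd_exprs = backward(["a", "b"], "g", {}, "y")

print(fwd_expr)   # MY_OP(a, b)
print(bwd_exprs)  # ['MUL(g, b)', 'MUL(g, a)']
```

The backward list has one entry per input, in the same order as args, which is consistent with arity=2.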
Performance & Trade-offs
- tune=True is the default. The first call to spec.run runs the autotuner; this is typically a few hundred microseconds. Set tune=False for kernels where the optimal local size is known in advance.
- The cached cl.Program lives in the spec for the lifetime of the Python process. Long-running training jobs should not re-build specs in a hot loop.
- extra_flags=("-cl-fast-relaxed-math",) enables the OpenCL compiler's relaxed fast-math optimizations. It is faster but can break NaN propagation in some edge cases. Use it only when you know the kernel is NaN-free.
- The spec is not thread-safe: two threads building the same spec concurrently will both call clBuildProgram, and one of them will lose. Use a lock if you build specs from multiple threads.
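One way to follow the locking advice is a build-once guard. This is a minimal sketch with a stand-in class: FakeSpec, its program attribute, and build_once are hypothetical and exist only so the example runs without netcl or an OpenCL device. With a real KernelSpec, the check would depend on however the spec exposes its cached program.

```python
import threading


class FakeSpec:
    """Stand-in for KernelSpec that counts how many times build() runs."""

    def __init__(self):
        self.build_count = 0
        self.program = None  # hypothetical attribute for the cached program

    def build(self, queue):
        self.build_count += 1
        self.program = object()  # placeholder for a compiled program
        return self.program


_build_lock = threading.Lock()


def build_once(spec, queue):
    """Build the spec at most once, even when called from several threads."""
    with _build_lock:
        if spec.program is None:
            spec.build(queue)
        return spec.program


spec = FakeSpec()
threads = [threading.Thread(target=build_once, args=(spec, None))
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The same guard also keeps a hot loop from rebuilding: after the first call, build_once only returns the cached program.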
See also
- KernelSpec — the API page.
- WorkGroupTuner — the per-kernel tuner.
- Autotuner — the higher-level autotuner.
- AutogradPrimitive — the primitive used to differentiate a custom kernel.
- JIT Compiler — the runtime use of the spec.