
WorkGroupTuner

Status: Public API in netcl.profiling.autotuner.WorkGroupTuner

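WorkGroupTuner is the autotuner that picks the best-performing OpenCL work-group size for a kernel from a set of candidates. The work-group size is one of the most important launch parameters of an OpenCL kernel: too small and the device is under-utilised; too large and the work-group can exceed the registers or local memory available per compute unit, which lowers occupancy or forces register spills to private memory (which is backed by slow global memory, not by on-chip local memory).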

The tuner works by running the kernel with a small set of candidate work-group sizes and measuring the runtime of each. The candidate that produces the shortest runtime is returned to the caller, who records it in the KernelSpec and reuses it for all subsequent calls.

Overview

WorkGroupTuner is a small dataclass. The constructor takes the kernel, the global size, and (optionally) the list of candidate local sizes; tune() runs each candidate and returns the AutotuneResult containing the best local size and the measured runtime.

The tuner uses the device profile to constrain the candidate set. For example, a device with CL_DEVICE_MAX_WORK_GROUP_SIZE = 256 will only be tested with local_size <= 256. The default candidate set is:

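Powers of two from 1 to 256 (1, 2, 4, ..., 256).
A few "natural" sizes for common patterns: 64, 128, 256. These are themselves powers of two, so they overlap with the first group rather than adding new candidates.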

The full set is configurable; pass candidates=[...] to override it.
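The way a device limit narrows the candidate set can be sketched in plain Python. This is only an illustration under stated assumptions: `default_candidates` and `filter_candidates` are hypothetical helper names, not part of netcl, and the divisibility check reflects the OpenCL 1.x rule that the local size must divide the global size evenly, which the tuner may or may not enforce itself.

```python
def default_candidates(limit=256):
    """Powers of two from 1 up to `limit` (hypothetical helper, not netcl API)."""
    sizes = []
    size = 1
    while size <= limit:
        sizes.append(size)
        size *= 2
    return sizes


def filter_candidates(candidates, max_work_group_size, global_size):
    """Keep candidates that are launchable for a 1-D global size.

    Assumption: a local size must not exceed CL_DEVICE_MAX_WORK_GROUP_SIZE
    and, as in OpenCL 1.x, must divide the global size evenly.
    """
    return [
        c for c in candidates
        if c <= max_work_group_size and global_size % c == 0
    ]


candidates = filter_candidates(default_candidates(), 256, 1_000_000)
print(candidates)
```

Under this rule a global size of 1,000,000 rules out 128 and 256, because neither divides it evenly. A common workaround is to round the global size up to a multiple of the local size and guard the kernel body with a bounds check.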

Where It Lives

File path: profiling/autotuner.py.
Module path: netcl.profiling.autotuner.
Public re-export: from netcl.profiling import WorkGroupTuner.
Siblings: Autotuner (the higher-level kernel-by-kernel autotuner), AutotuneResult (the result record).

How It Works

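# prg, queue and N are assumed to exist already (program, command queue, element count).
tuner = WorkGroupTuner(prg.scale_by_two,
                       global_size=(N,),
                       candidates=[32, 64, 128, 256])  # candidates go to the constructor
result = tuner.tune(queue)
print(result.best_local_size, result.runtime_us)
# e.g. best local size 128, mean runtime 47.3 us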

Under the hood, the tuner:

Issues a clFinish(queue) to make sure no prior work is pending.
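For each candidate local size, runs the kernel four times with an OpenCL event-based timer. The first run is excluded because it pays one-off warm-up costs (for example, deferred compilation on some drivers, or cold caches); the reported runtime is the mean of the next three runs.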
Picks the candidate with the shortest mean runtime.
Returns the result as an AutotuneResult dataclass.

The tuning cost is typically a few hundred microseconds per candidate, paid once. For a kernel that runs millions of times, that one-off cost is usually recovered within a single training step.
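The amortisation argument can be made concrete with a small break-even calculation. The numbers below are purely illustrative assumptions, not measurements, and `break_even_launches` is a hypothetical helper rather than part of netcl.

```python
import math


def break_even_launches(tuning_cost_us, untuned_us, tuned_us):
    """Launches needed before the tuning cost is repaid (None if no gain)."""
    saving_per_launch = untuned_us - tuned_us
    if saving_per_launch <= 0:
        return None
    return math.ceil(tuning_cost_us / saving_per_launch)


# Illustrative numbers only: 4 candidates at 300 us each, and a tuned
# kernel that is 12 us faster per launch than the untuned one.
launches = break_even_launches(4 * 300, untuned_us=60, tuned_us=48)
print(launches)
```

With these assumed figures the tuning pays for itself after 100 launches, which is small next to a kernel that is launched millions of times.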

Code Example

Tuning a custom kernel:

import netcl as nc
from netcl.profiling import WorkGroupTuner

# `queue` is an existing command queue; creating it is not shown here.
prg = nc.KernelSpec(
    name="my_kernel",
    src="""__kernel void my_kernel(__global float* x) {
        int gid = get_global_id(0);
        x[gid] = x[gid] * 2.0f;
    }""",
).build(queue)

tuner = WorkGroupTuner(prg.my_kernel, global_size=(1_000_000,))
result = tuner.tune(queue)

print(f"best local size: {result.best_local_size}")
print(f"runtime: {result.runtime_us:.1f} us")

Integrating the tuned size into a KernelSpec:

spec = nc.KernelSpec(
    name="my_kernel",
    src="...",
    workgroup_size=result.best_local_size,   # cached
)

Performance & Trade-offs

Invoke the tuner at most once per (kernel, global_size) pair. Once the caller has stored the result in the KernelSpec, subsequent calls use that local size without re-tuning.
The candidate set is the main knob. Too small and the tuner misses the best size; too large and the tuning itself takes too long. The default is a reasonable starting point.
The tuner assumes an otherwise idle device. If the GPU is already busy (e.g. another training step is in flight), the kernels compete for resources and the measured runtimes are inflated and noisy, so the ranking of candidates becomes unreliable. Call the tuner before the hot loop, not in the middle of it.
For very small kernels (where the launch overhead dominates the runtime), the tuner is less useful: the differences between candidates are smaller than the measurement noise. Pick a fixed local size of 64 and move on.
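When one kernel is launched with several global sizes, the once-per-pair rule above could be enforced with a small memo keyed by (kernel name, global size). This is a sketch under assumptions: `tuned_local_size` and the `tune` callable are hypothetical, and the real caching mechanism is the KernelSpec field described earlier.

```python
_tuned = {}  # (kernel_name, global_size) -> best local size


def tuned_local_size(kernel_name, global_size, tune):
    """Return the cached local size, calling `tune()` only on a cache miss.

    `tune` is a zero-argument callable supplied by the caller (hypothetical);
    in practice it might wrap a call such as WorkGroupTuner(...).tune(queue).
    """
    key = (kernel_name, tuple(global_size))
    if key not in _tuned:
        _tuned[key] = tune()
    return _tuned[key]


# Demonstration with a stand-in tuner that counts how often it runs.
tune_calls = []


def fake_tune():
    tune_calls.append(1)
    return 64


first = tuned_local_size("my_kernel", (1_000_000,), fake_tune)
second = tuned_local_size("my_kernel", (1_000_000,), fake_tune)
print(first, second, len(tune_calls))
```

The second lookup returns the stored value without running the stand-in tuner again, so the tuning cost is paid only once per pair.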