WorkGroupTuner
Status: Public API in netcl.profiling.autotuner.WorkGroupTuner
WorkGroupTuner is the autotuner that picks the fastest OpenCL
work-group size for a kernel from a set of candidates. The
work-group size is one of the most important launch parameters of
an OpenCL kernel: too small and the device is under-utilised; too
large and the kernel can run out of registers and spill to private
memory, which is typically backed by slow global memory.
The tuner works by running the kernel with a small set of
candidate work-group sizes and measuring the runtime of each.
The candidate that produces the shortest runtime is returned to
the caller, who records it in the KernelSpec and reuses it for
all subsequent calls.
Overview
WorkGroupTuner is a small dataclass. The constructor takes the
kernel, the global size, and (optionally) the list of candidate
local sizes; tune() runs each candidate and returns the
AutotuneResult containing the best local size and the
measured runtime.
The tuner uses the device profile to constrain the candidate
set. For example, a device with CL_DEVICE_MAX_WORK_GROUP_SIZE =
256 will only be tested with local_size <= 256. The default
candidate set is:
- Powers of two from 1 to 256.
- A few "natural" sizes for common patterns: 64, 128, 256.
The full set is configurable; pass candidates=[...] to
override it.
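As a rough illustration of how such a candidate set could be built, the sketch below generates the powers of two, merges in the "natural" sizes, and drops anything above the device limit. The function name `candidate_local_sizes` and its parameters are hypothetical and not part of the netcl API; the real tuner may construct its defaults differently.

```python
def candidate_local_sizes(max_work_group_size, upper=256, extra=(64, 128, 256)):
    """Return sorted candidate local sizes no larger than the device limit.

    Hypothetical helper: `upper` and `extra` mirror the defaults described
    in the text, but this is not the library's implementation.
    """
    limit = min(upper, max_work_group_size)
    sizes = set()
    size = 1
    while size <= limit:  # powers of two: 1, 2, 4, ...
        sizes.add(size)
        size *= 2
    # Merge in the "natural" sizes; the set removes duplicates.
    sizes.update(s for s in extra if s <= limit)
    return sorted(sizes)


print(candidate_local_sizes(256))
print(candidate_local_sizes(128))
```

With the defaults described above, the "natural" sizes are already powers of two, so merging them adds no new entries; they only matter if the list is customised.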
Where It Lives
- File path: profiling/autotuner.py.
- Module path: netcl.profiling.autotuner.
- Public re-export: from netcl.profiling import WorkGroupTuner.
- Sibling: Autotuner (the higher-level kernel-by-kernel autotuner), AutotuneResult (the result record).
How It Works
tuner = WorkGroupTuner(prg.scale_by_two,
                       global_size=(N,))
result = tuner.tune(queue, candidates=[32, 64, 128, 256])
print(result.best_local_size, result.runtime_us)
# prints the best local size and its mean runtime in microseconds,
# e.g. 128 and 47.3
Under the hood, the tuner:
- Issues a clFinish(queue) to make sure no prior work is pending.
- For each candidate local size, runs the kernel with an OpenCL event-based timer. The first run is excluded (it pays the JIT cost); the runtime is the average of the next three runs.
- Picks the candidate with the shortest mean runtime.
- Returns the result as an AutotuneResult dataclass.
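The selection logic in these steps can be sketched in plain Python. This is a minimal sketch, not the library's code: `pick_best_local_size` and `make_fake_runner` are hypothetical names, and the fake runner stands in for a real event-timed kernel launch so the example runs without an OpenCL device.

```python
from statistics import mean


def pick_best_local_size(run_once, candidates, timed_runs=3):
    """Return (local_size, mean_runtime_us) for the fastest candidate.

    `run_once(local_size)` is assumed to launch the kernel once and
    return its runtime in microseconds.
    """
    if not candidates:
        raise ValueError("at least one candidate local size is required")
    best = None
    for local_size in candidates:
        run_once(local_size)  # warm-up run, discarded
        avg = mean(run_once(local_size) for _ in range(timed_runs))
        if best is None or avg < best[1]:
            best = (local_size, avg)
    return best


def make_fake_runner(steady_us, warmup_penalty_us=500.0):
    """Simulated timer: the first launch per size pays a one-off penalty."""
    seen = set()

    def run_once(local_size):
        cost = steady_us[local_size]
        if local_size not in seen:
            seen.add(local_size)
            cost += warmup_penalty_us
        return cost

    return run_once


runner = make_fake_runner({32: 80.0, 64: 55.0, 128: 47.5, 256: 60.0})
best_size, best_us = pick_best_local_size(runner, [32, 64, 128, 256])
print(best_size, best_us)
```

Discarding the first launch keeps the one-off penalty out of the mean, so the comparison reflects steady-state runtime only.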
The tuning cost is typically a few hundred microseconds per candidate. For a kernel that runs millions of times, this one-off cost is recovered within a single training step.
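To make the amortisation argument concrete, the following back-of-the-envelope sketch estimates the tuning cost and the number of launches needed to recover it. The numbers are purely illustrative, the helper names are hypothetical, and the estimate ignores launch overhead and the initial clFinish.

```python
import math


def tuning_cost_us(num_candidates, runtime_us, runs_per_candidate=4):
    """One warm-up plus three timed runs per candidate, by assumption."""
    return num_candidates * runs_per_candidate * runtime_us


def break_even_calls(cost_us, saving_per_call_us):
    """Launches needed before the tuned size has paid for the tuning."""
    return math.ceil(cost_us / saving_per_call_us)


# Illustrative inputs: 4 candidates, about 50 us per launch, and a tuned
# size that saves 8 us per launch compared with an untuned choice.
cost = tuning_cost_us(num_candidates=4, runtime_us=50.0)
calls = break_even_calls(cost, saving_per_call_us=8.0)
print(cost, calls)
```

Under these assumed numbers the tuning pays for itself after about a hundred launches, which is negligible for a kernel launched millions of times.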
Code Example
Tuning a custom kernel:
import netcl as nc
from netcl.profiling import WorkGroupTuner

# `queue` is assumed to be an existing command queue for the target device.
prg = nc.KernelSpec(
    name="my_kernel",
    src="""__kernel void my_kernel(__global float* x) {
        int gid = get_global_id(0);
        x[gid] = x[gid] * 2.0f;
    }""",
).build(queue)

# Tune with the default candidate set.
tuner = WorkGroupTuner(prg.my_kernel, global_size=(1_000_000,))
result = tuner.tune(queue)
print(f"best local size: {result.best_local_size}")
print(f"runtime: {result.runtime_us:.1f} us")
Integrating the tuned size into a KernelSpec:
spec = nc.KernelSpec(
    name="my_kernel",
    src="...",
    workgroup_size=result.best_local_size,  # cached
)
Performance & Trade-offs
- The tuner is invoked at most once per (kernel, global_size) pair. The result is cached in the KernelSpec, and subsequent calls use the cached local size without re-tuning.
- The candidate set is the main knob. Too small and the tuner misses the best size; too large and the tuning itself takes too long. The default is a reasonable starting point.
- The tuner times kernels on an idle device. If the GPU is already busy (e.g. another training step is in flight), the measured runtime will be wrong. Call the tuner before the hot loop, not in the middle of it.
- For very small kernels (where the launch overhead dominates the runtime), the tuner is less useful; pick a fixed local size of 64 and move on.
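The once-per-(kernel, global_size) behaviour from the first point can be pictured as a small memo table. The sketch below is an assumption about the shape of such a cache, not the KernelSpec implementation; `TunedSizeCache` and `fake_tune` are hypothetical names, and `fake_tune` stands in for a real call to the tuner.

```python
class TunedSizeCache:
    """Memoise tuned local sizes per (kernel name, global size) pair."""

    def __init__(self, tune_fn):
        self._tune_fn = tune_fn  # called only on a cache miss
        self._cache = {}
        self.tune_calls = 0

    def local_size_for(self, kernel_name, global_size):
        key = (kernel_name, tuple(global_size))
        if key not in self._cache:
            self.tune_calls += 1
            self._cache[key] = self._tune_fn(kernel_name, global_size)
        return self._cache[key]


def fake_tune(kernel_name, global_size):
    # Placeholder for a real tuning run; always "finds" 128.
    return (128,)


cache = TunedSizeCache(fake_tune)
first = cache.local_size_for("my_kernel", (1_000_000,))
second = cache.local_size_for("my_kernel", (1_000_000,))
other = cache.local_size_for("my_kernel", (4096,))
print(first, second, other, cache.tune_calls)
```

Keying on the global size as well as the kernel means a change in problem size triggers a fresh tuning run, while repeated launches with the same shape reuse the stored value.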
See also
- WorkGroupTuner — the API page.
- Autotuner — the higher-level kernel autotuner.
- KernelSpec — the kernel record that stores the tuned local size.
- Profiling — the perf-timer API used by the tuner.