Autotuner
Status: Public API in netcl.profiling.autotuner.Autotuner
Autotuner is the higher-level, kernel-by-kernel tuner. It
takes a registry of named kernels and runs each one through the
WorkGroupTuner, recording the best local size for each. The rest of
netcl then uses the recorded sizes transparently: any op that needs a
kernel looks up its autotuned size and uses it.
The autotuner is invoked once per (device, kernel-name, global-size-shape)
triple. After that, the result is cached and the tuner never runs
again unless the user explicitly calls clear().
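The tune-once behaviour can be pictured as a memo table keyed on that triple. The following is a minimal sketch under stated assumptions, not netcl's implementation: `TuneCache`, `get_or_tune`, and the `tune_fn` callback are hypothetical names, and the device is represented by a plain string.

```python
from typing import Callable, Dict, Tuple

# (device, kernel name, global-size shape)
Key = Tuple[str, str, Tuple[int, ...]]


class TuneCache:
    """Hypothetical memo table: tune once per key, then reuse the result."""

    def __init__(self) -> None:
        self._best: Dict[Key, int] = {}

    def get_or_tune(self, device: str, name: str, shape: Tuple[int, ...],
                    tune_fn: Callable[[], int]) -> int:
        key = (device, name, tuple(shape))
        if key not in self._best:
            # First request for this triple: pay the tuning cost once.
            self._best[key] = tune_fn()
        return self._best[key]

    def clear(self) -> None:
        # Forget every result so the next request tunes again.
        self._best.clear()


calls = []


def fake_tune() -> int:
    # Stand-in for a real tuning run; records that it was invoked.
    calls.append(1)
    return 64


cache = TuneCache()
first = cache.get_or_tune("gpu0", "matmul_small", (256, 256), fake_tune)
second = cache.get_or_tune("gpu0", "matmul_small", (256, 256), fake_tune)
runs_before_clear = len(calls)

cache.clear()
cache.get_or_tune("gpu0", "matmul_small", (256, 256), fake_tune)
runs_after_clear = len(calls)
```

In this sketch the second request is served from the table, and only the request after `clear()` triggers a second tuning run.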
Overview
The autotuner is a thin wrapper around WorkGroupTuner that:
- Iterates over a user-supplied list of (name, kernel, global_size) tuples.
- Runs WorkGroupTuner.tune() for each.
- Stores the best local size in a dict keyed on (name, global_size).
- Exposes the dict as autotuner.results.
The op dispatch in netcl looks up the dict by (kernel_name,
global_size_shape). On a hit, the cached local size is used. On a
miss, the op falls back to a conservative default (usually
local_size = 64).
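The wrapper and the fallback lookup described in this section can be sketched roughly as follows. This is an illustrative stand-in, not the netcl source: `SketchAutotuner` and the `tune_one` callback (which plays the role of `WorkGroupTuner.tune()`) are hypothetical, and the queue argument of the real `tune_all` is omitted.

```python
from typing import Callable, Dict, List, Tuple

Shape = Tuple[int, ...]
DEFAULT_LOCAL_SIZE = 64  # the usual fallback mentioned in the text


class SketchAutotuner:
    """Illustrative stand-in for the wrapper; not the netcl Autotuner."""

    def __init__(self, tune_one: Callable[[object, Shape], int]) -> None:
        self._tune_one = tune_one  # assumed per-kernel tuner callback
        self._entries: List[Tuple[str, object, Shape]] = []
        self.results: Dict[Tuple[str, Shape], int] = {}

    def add(self, name: str, kernel: object, global_size: Shape) -> None:
        self._entries.append((name, kernel, tuple(global_size)))

    def tune_all(self) -> None:
        # Tune each registered kernel and keep its best local size.
        for name, kernel, global_size in self._entries:
            self.results[(name, global_size)] = self._tune_one(kernel, global_size)

    def lookup(self, name: str, global_size: Shape) -> int:
        # Hit: cached size. Miss: conservative default.
        return self.results.get((name, tuple(global_size)), DEFAULT_LOCAL_SIZE)


def fake_tune_one(kernel: object, shape: Shape) -> int:
    # Made-up rule standing in for a real measurement.
    return 128 if shape[0] >= 1024 else 32


tuner = SketchAutotuner(fake_tune_one)
tuner.add("matmul_small", None, (256, 256))
tuner.add("matmul_large", None, (4096, 4096))
tuner.tune_all()

hit_small = tuner.lookup("matmul_small", (256, 256))
hit_large = tuner.lookup("matmul_large", (4096, 4096))
miss = tuner.lookup("conv2d", (64, 64))  # never registered
```

The sizes returned by `fake_tune_one` are arbitrary; the point is only that an unregistered kernel falls through to the default.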
Where It Lives
- File path: profiling/autotuner.py.
- Module path: netcl.profiling.autotuner.
- Public re-export: from netcl.profiling import Autotuner.
How It Works
from netcl.profiling import Autotuner
autotuner = Autotuner(device=queue.device)
autotuner.add("matmul_small", prg.matmul_small, (256, 256))
autotuner.add("matmul_medium", prg.matmul_medium, (1024, 1024))
autotuner.add("matmul_large", prg.matmul_large, (4096, 4096))
autotuner.tune_all(queue)
After tune_all, the autotuner has the best local size for
each kernel. The op dispatch code uses these results:
local_size = autotuner.lookup("matmul_small", (256, 256))
# local_size = 64
prg.matmul_small(queue, (256, 256), (local_size,),
in_a, in_b, out_c)
Code Example
A full autotuning session for a small model:
import netcl as nc
from netcl.profiling import Autotuner
ctx, queue = nc.device.manager.default()
autotuner = Autotuner(device=queue.device)
# Register all the kernels the model uses.
for name, prg, shape in model.kernels():
    autotuner.add(name, prg, shape)
# Run the tuner once.
autotuner.tune_all(queue)
# Save the results to a file for later re-use.
autotuner.save("autotune_results.json")
Restoring cached results:
autotuner = Autotuner(device=queue.device)
autotuner.load("autotune_results.json")
# Now the dispatch uses the cached sizes without re-tuning.
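The on-disk format used by save and load is not documented here. As a sketch of one way such a round trip could work, assuming results are a dict keyed on (name, global_size): JSON object keys must be strings, so the tuple keys are written as a list of records. The helpers `save_results` and `load_results` and the record field names are hypothetical.

```python
import json
import os
import tempfile
from typing import Dict, Tuple

Results = Dict[Tuple[str, Tuple[int, ...]], int]


def save_results(results: Results, path: str) -> None:
    # JSON object keys must be strings, so store a list of records instead.
    records = [
        {"name": name, "global_size": list(shape), "local_size": local}
        for (name, shape), local in results.items()
    ]
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(records, fh, indent=2)


def load_results(path: str) -> Results:
    with open(path, encoding="utf-8") as fh:
        records = json.load(fh)
    # Rebuild the tuple keys that JSON flattened into lists.
    return {
        (r["name"], tuple(r["global_size"])): r["local_size"]
        for r in records
    }


original = {
    ("matmul_small", (256, 256)): 64,
    ("matmul_large", (4096, 4096)): 128,
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "autotune_results.json")
    save_results(original, path)
    restored = load_results(path)
```

Because results are device-specific, a real file would presumably also need to record which device it was tuned on; that is left out of this sketch.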
Performance & Trade-offs
- The autotuner is off by default. The op dispatch uses conservative default local sizes until the user explicitly calls tune_all. This is the right default: most users should not pay the tuning cost on first run.
- Tuning cost is proportional to the number of registered kernels times the number of candidate local sizes (a few hundred microseconds each). For a model with 20 kernels, the total tune time is around 100 ms.
- The autotuner's results are device-specific. A new device needs a new tuning session. The save/load mechanism exists to amortise the cost across runs on the same device.
- The autotuner is not aware of dtype. A matmul kernel for fp16 has a different optimal local size than the same kernel for fp32; the autotuner currently treats them as the same kernel. This is a known limitation.
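The tuning-cost estimate above is a simple product. A back-of-the-envelope check, where the 16 candidates per kernel and 300 microseconds per candidate are assumed values chosen to match "a few hundred microseconds each" and are not measured figures:

```python
def estimate_tune_time_ms(num_kernels: int, num_candidates: int,
                          us_per_candidate: int) -> float:
    # Total cost = kernels x candidate local sizes x time per candidate.
    total_us = num_kernels * num_candidates * us_per_candidate
    return total_us / 1000


# 20 kernels from the text; 16 candidates and 300 us are assumptions.
total_ms = estimate_tune_time_ms(num_kernels=20, num_candidates=16,
                                 us_per_candidate=300)
```

Under these assumptions the total comes to 96 ms, which agrees with the "around 100 ms" figure for a 20-kernel model.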
See also
- Autotuner — the API page.
- WorkGroupTuner — the per-kernel tuner.
- AutotuneResult — the result dataclass.
- Profiling — the perf-timer API.