
Autotuner

Status: Public API in netcl.profiling.autotuner.Autotuner

Autotuner is the higher-level, kernel-by-kernel tuner. It takes a registry of named kernels, runs each one through WorkGroupTuner, and records the best local size for each. The rest of netcl then uses the recorded sizes transparently: any op that launches a registered kernel looks up its autotuned local size and uses it.

Tuning runs once per (device, kernel-name, global-size-shape) triple. The result is then cached, and the tuner does not run again for that triple unless the user explicitly calls clear().
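
The once-per-triple behaviour can be pictured as a memoising cache. The sketch below is illustrative only: TuneCache, get_or_tune, and the tune_fn callback are hypothetical names that are not part of netcl, and only clear() is taken from the text above.

```python
from typing import Callable, Dict, Tuple

# Hypothetical cache key: (device name, kernel name, global-size shape).
Key = Tuple[str, str, Tuple[int, ...]]


class TuneCache:
    """Illustrative cache that runs the tuning callback once per key."""

    def __init__(self) -> None:
        self._best: Dict[Key, int] = {}

    def get_or_tune(self, device: str, kernel_name: str,
                    global_size: Tuple[int, ...],
                    tune_fn: Callable[[], int]) -> int:
        key = (device, kernel_name, tuple(global_size))
        if key not in self._best:
            # First request for this triple: pay the tuning cost once.
            self._best[key] = tune_fn()
        return self._best[key]

    def clear(self) -> None:
        # Forget every cached result so the next request re-tunes.
        self._best.clear()


calls = []


def fake_tune() -> int:
    # Stand-in for a real tuning run; records each invocation.
    calls.append(1)
    return 128


cache = TuneCache()
first = cache.get_or_tune("gpu0", "matmul_small", (256, 256), fake_tune)
second = cache.get_or_tune("gpu0", "matmul_small", (256, 256), fake_tune)
runs_before_clear = len(calls)  # two lookups, one tuning run

cache.clear()
cache.get_or_tune("gpu0", "matmul_small", (256, 256), fake_tune)
runs_after_clear = len(calls)   # clearing forces one more run
```

Because the global-size shape is part of the key, the same kernel launched with a different shape would be tuned separately.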

Overview

The autotuner is a thin wrapper around WorkGroupTuner that:

Iterates over a user-supplied list of (name, kernel, global_size) tuples.
Runs WorkGroupTuner.tune() for each.
Stores the best local size in a dict keyed on (name, global_size); the device is fixed per Autotuner instance.
Exposes the dict as autotuner.results.
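
The four steps above can be modelled without a device. This is a minimal sketch under stated assumptions, not the netcl implementation: MiniAutotuner and pick_best_local_size are hypothetical, and a synthetic cost function stands in for timing a real kernel through WorkGroupTuner.tune().

```python
from typing import Callable, Dict, List, Tuple

Shape = Tuple[int, ...]
CostFn = Callable[[int], float]  # maps a local size to a measured cost


def pick_best_local_size(cost_fn: CostFn, candidates: List[int]) -> int:
    """Stand-in for WorkGroupTuner.tune(): return the cheapest candidate."""
    return min(candidates, key=cost_fn)


class MiniAutotuner:
    """Hypothetical, device-free model of the four steps listed above."""

    def __init__(self, candidates: List[int]) -> None:
        self.candidates = candidates
        self._entries: List[Tuple[str, CostFn, Shape]] = []
        self.results: Dict[Tuple[str, Shape], int] = {}

    def add(self, name: str, cost_fn: CostFn, global_size: Shape) -> None:
        # Step 1: collect (name, kernel, global_size) entries.
        self._entries.append((name, cost_fn, tuple(global_size)))

    def tune_all(self) -> None:
        for name, cost_fn, global_size in self._entries:
            # Step 2: tune each entry. Step 3: store the winner.
            self.results[(name, global_size)] = pick_best_local_size(
                cost_fn, self.candidates)


tuner = MiniAutotuner(candidates=[32, 64, 128, 256])
# Synthetic cost curves with minima at 128 and 64 respectively.
tuner.add("matmul_small", lambda ls: abs(ls - 128), (256, 256))
tuner.add("matmul_large", lambda ls: abs(ls - 64), (4096, 4096))
tuner.tune_all()
print(tuner.results)  # Step 4: results are exposed as a plain dict.
```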

The op dispatch in netcl looks up this dict by (kernel_name, global_size_shape). On a hit, the cached local size is used; on a miss, the op falls back to a sensible default (usually local_size = 64).
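
The hit-or-fallback rule amounts to a dict lookup with a default. The helper below is a hypothetical sketch (resolve_local_size is not a netcl function); it assumes the results dict has the (name, global_size) key layout described above and uses 64 as the fallback.

```python
from typing import Dict, Sequence, Tuple

DEFAULT_LOCAL_SIZE = 64  # the "usual" fallback mentioned in the text

Results = Dict[Tuple[str, Tuple[int, ...]], int]


def resolve_local_size(results: Results, kernel_name: str,
                       global_size: Sequence[int],
                       default: int = DEFAULT_LOCAL_SIZE) -> int:
    """Return the tuned local size on a hit, or the default on a miss."""
    # Normalise to a tuple so that list and tuple shapes hit the same key.
    return results.get((kernel_name, tuple(global_size)), default)


results: Results = {("matmul_small", (256, 256)): 128}

hit = resolve_local_size(results, "matmul_small", [256, 256])
miss = resolve_local_size(results, "matmul_small", (512, 512))
```

Here the first call finds the tuned entry, while the second call uses a shape that was never tuned and so receives the default.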

Where It Lives

File path: profiling/autotuner.py.
Module path: netcl.profiling.autotuner.
Public re-export: from netcl.profiling import Autotuner.

How It Works

from netcl.profiling import Autotuner

autotuner = Autotuner(device=queue.device)
autotuner.add("matmul_small", prg.matmul_small, (256, 256))
autotuner.add("matmul_medium", prg.matmul_medium, (1024, 1024))
autotuner.add("matmul_large", prg.matmul_large, (4096, 4096))
autotuner.tune_all(queue)

After tune_all returns, the autotuner holds the best local size for each registered kernel. The op dispatch code uses these results:

local_size = autotuner.lookup("matmul_small", (256, 256))
# e.g. local_size == 64 (the actual value depends on the device)
prg.matmul_small(queue, (256, 256), (local_size,),
                 in_a, in_b, out_c)

Code Example

A full autotuning session for a small model:

import netcl as nc
from netcl.profiling import Autotuner

ctx, queue = nc.device.manager.default()
autotuner = Autotuner(device=queue.device)

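# Register all the kernels the model uses.
# `model` is assumed to be defined elsewhere; kernels() yields
# (name, kernel, global_size) tuples.
for name, kernel, global_size in model.kernels():
    autotuner.add(name, kernel, global_size)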

# Run the tuner once.
autotuner.tune_all(queue)

# Save the results to a file for later re-use.
autotuner.save("autotune_results.json")

Restoring cached results:

autotuner = Autotuner(device=queue.device)
autotuner.load("autotune_results.json")
# Now the dispatch uses the cached sizes without re-tuning.
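
The on-disk layout written by save() is not described here. As an illustration of one possible layout, the sketch below round-trips a results dict through JSON. JSON object keys must be strings, so the (name, global_size) tuple keys are flattened into a list of records. The function names and the record fields are assumptions, not the netcl format.

```python
import json
from typing import Dict, Tuple

Results = Dict[Tuple[str, Tuple[int, ...]], int]


def dump_results(results: Results) -> str:
    """Serialise tuple-keyed results as a JSON list of records."""
    records = [
        {"name": name, "global_size": list(shape), "local_size": local}
        for (name, shape), local in sorted(results.items())
    ]
    return json.dumps(records, indent=2)


def parse_results(text: str) -> Results:
    """Rebuild the tuple-keyed dict from the JSON produced above."""
    return {
        (rec["name"], tuple(rec["global_size"])): rec["local_size"]
        for rec in json.loads(text)
    }


original: Results = {
    ("matmul_small", (256, 256)): 64,
    ("matmul_large", (4096, 4096)): 128,
}
restored = parse_results(dump_results(original))
```

Converting the shape back to a tuple on load matters: a list is not hashable and could not serve as a dict key.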

Performance & Trade-offs

The autotuner is off by default. The op dispatch uses conservative default local sizes until the user explicitly calls tune_all. This is the right default: most users should not pay the tuning cost on first run.
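Tuning cost is proportional to the number of registered kernels times the number of candidate local sizes, at a few hundred microseconds per candidate. For a model with 20 kernels, the total tune time is around 100 ms, i.e. about 5 ms per kernel, which corresponds to roughly 10 to 25 candidates per kernel at 200 to 500 µs each.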
The autotuner's results are device-specific. A new device needs a new tuning session. The save / load mechanism exists to amortise the cost across runs on the same device.
The autotuner is not aware of dtype. A matmul kernel for fp16 can have a different optimal local size than the same kernel for fp32, but the autotuner currently treats them as the same kernel. This is a known limitation.
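
One possible workaround for the dtype limitation is to encode the dtype in the kernel name, so that each variant gets its own entry. This assumes that the names passed to add() and lookup() are free-form strings, which the examples above suggest but do not confirm. The tuned_name helper is hypothetical.

```python
from typing import Dict, Tuple


def tuned_name(base_name: str, dtype: str) -> str:
    """Build a per-dtype kernel name, e.g. 'matmul_small[float16]'."""
    return f"{base_name}[{dtype}]"


# A plain dict stands in for autotuner.results in this sketch.
results: Dict[Tuple[str, Tuple[int, ...]], int] = {}
shape = (256, 256)

# Illustrative values only; real sizes would come from a tuning run.
results[(tuned_name("matmul_small", "float16"), shape)] = 128
results[(tuned_name("matmul_small", "float32"), shape)] = 64

fp16_size = results[(tuned_name("matmul_small", "float16"), shape)]
fp32_size = results[(tuned_name("matmul_small", "float32"), shape)]
```

With real kernels, the same helper would be applied to the name at both registration and lookup time, so the two sides always agree on the key.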