Autotuner

Status: Public API in netcl.profiling.autotuner.Autotuner

Autotuner is the higher-level, kernel-by-kernel tuning layer. It takes a registry of named kernels and runs each one through the WorkGroupTuner, recording the best local size for each. Once tuning has run, the rest of netcl uses the recorded sizes transparently: any op that needs a kernel looks up its autotuned size and uses it.

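Each (device, kernel name, global size shape) triple is tuned at most once. The result is then cached, and the tuner does not run again for that triple unless the user explicitly calls clear(). Because an Autotuner instance is bound to a single device, its results are keyed on (name, global_size) only.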
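The tune-once-then-cache behaviour can be illustrated with a small stand-in. This is a sketch under stated assumptions, not netcl's implementation: TuneCache, its tune method, and the tune_calls counter are hypothetical names, and the tuner is replaced by a fake function that always returns 64.

```python
from typing import Callable, Dict, Tuple

# Hypothetical cache key: (device name, kernel name, global size shape).
Key = Tuple[str, str, Tuple[int, ...]]


class TuneCache:
    """Illustrative stand-in, not netcl's implementation."""

    def __init__(self, tune_fn: Callable[[str, Tuple[int, ...]], int]) -> None:
        self._tune_fn = tune_fn
        self._cache: Dict[Key, int] = {}
        self.tune_calls = 0  # counts how often the tuner actually ran

    def tune(self, device: str, name: str, global_size: Tuple[int, ...]) -> int:
        """Tune the triple if it has no cached result, then return the result."""
        key = (device, name, tuple(global_size))
        if key not in self._cache:
            self.tune_calls += 1
            self._cache[key] = self._tune_fn(name, key[2])
        return self._cache[key]

    def clear(self) -> None:
        """Drop all cached results so the next tune() call measures again."""
        self._cache.clear()


cache = TuneCache(lambda name, shape: 64)  # fake tuner: always picks 64
cache.tune("gpu0", "matmul_small", (256, 256))
cache.tune("gpu0", "matmul_small", (256, 256))  # cached, tuner not re-run
calls_before_clear = cache.tune_calls
cache.clear()
cache.tune("gpu0", "matmul_small", (256, 256))  # measured again after clear()
calls_after_clear = cache.tune_calls
```

In this sketch the second call for the same triple does not invoke the tuner, and only clear() makes it run again.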

Overview

The autotuner is a thin wrapper around WorkGroupTuner that:

  1. Iterates over a user-supplied list of (name, kernel, global_size) tuples.
  2. Runs WorkGroupTuner.tune() for each.
  3. Stores the best local size in a dict keyed on (name, global_size).
  4. Exposes the dict as autotuner.results.

The op dispatch in netcl looks up this dict by (kernel_name, global_size_shape). On a hit, the cached local size is used; on a miss, the op falls back to a conservative default (usually local_size = 64).
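The four steps and the dispatch fallback can be sketched in a few lines of plain Python. MiniAutotuner and fake_tune below are hypothetical stand-ins: the real Autotuner delegates to WorkGroupTuner.tune() and needs a queue, and whether the fallback lives in lookup or in the op itself is an assumption made here for brevity.

```python
from typing import Callable, Dict, List, Tuple

Shape = Tuple[int, ...]
DEFAULT_LOCAL_SIZE = 64  # the usual fallback mentioned in the text


class MiniAutotuner:
    """Illustrative stand-in for the behaviour described above."""

    def __init__(self, tune_one: Callable[[object, Shape], int]) -> None:
        self._tune_one = tune_one  # plays the role of WorkGroupTuner.tune()
        self._entries: List[Tuple[str, object, Shape]] = []
        self.results: Dict[Tuple[str, Shape], int] = {}

    def add(self, name: str, kernel: object, global_size: Shape) -> None:
        self._entries.append((name, kernel, tuple(global_size)))

    def tune_all(self) -> None:
        # Steps 1-3: iterate, tune each kernel, store the best local size.
        for name, kernel, global_size in self._entries:
            self.results[(name, global_size)] = self._tune_one(kernel, global_size)

    def lookup(self, name: str, global_size: Shape) -> int:
        # Dispatch-side view: cached size on a hit, default on a miss.
        return self.results.get((name, tuple(global_size)), DEFAULT_LOCAL_SIZE)


def fake_tune(kernel: object, global_size: Shape) -> int:
    # Pretend that larger problems prefer larger work-groups.
    return 128 if global_size[0] >= 1024 else 32


tuner = MiniAutotuner(fake_tune)
tuner.add("matmul_small", None, (256, 256))
tuner.add("matmul_large", None, (4096, 4096))
tuner.tune_all()

hit = tuner.lookup("matmul_small", (256, 256))  # tuned entry
miss = tuner.lookup("conv2d", (64, 64))         # never registered
```

Here the registered kernel returns its tuned size, while the unregistered one gets the default of 64.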

Where It Lives

  • File path: profiling/autotuner.py.
  • Module path: netcl.profiling.autotuner.
  • Public re-export: from netcl.profiling import Autotuner.

How It Works

from netcl.profiling import Autotuner

autotuner = Autotuner(device=queue.device)
autotuner.add("matmul_small", prg.matmul_small, (256, 256))
autotuner.add("matmul_medium", prg.matmul_medium, (1024, 1024))
autotuner.add("matmul_large", prg.matmul_large, (4096, 4096))
autotuner.tune_all(queue)

After tune_all, the autotuner has the best local size for each kernel. The op dispatch code uses these results:

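local_size = autotuner.lookup("matmul_small", (256, 256))
# e.g. local_size == 64
# The local size must have as many dimensions as the global size,
# so the scalar is padded to a 2-D work-group (64 x 1) here.
prg.matmul_small(queue, (256, 256), (local_size, 1),
                 in_a, in_b, out_c)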
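OpenCL expects the local work size to have the same number of dimensions as the global work size, and before OpenCL 2.0 each global dimension must also be a multiple of the matching local dimension. If lookup returns a scalar, as the comment above suggests, it has to be expanded before launching a 2-D kernel. A minimal sketch of such a helper follows; pad_local_size is a hypothetical name and not part of netcl, and applying the scalar to the first dimension is one possible choice among several.

```python
from typing import Tuple


def pad_local_size(local_size: int, global_size: Tuple[int, ...]) -> Tuple[int, ...]:
    """Expand a scalar local size to the dimensionality of global_size.

    The scalar is applied to the first dimension; all other dimensions get
    a local size of 1. Raises ValueError if the first global dimension is
    not a multiple of the local size.
    """
    if local_size <= 0:
        raise ValueError("local_size must be positive")
    if global_size[0] % local_size != 0:
        raise ValueError(
            f"global size {global_size[0]} is not a multiple of local size {local_size}"
        )
    return (local_size,) + (1,) * (len(global_size) - 1)


local_2d = pad_local_size(64, (256, 256))
local_1d = pad_local_size(64, (1024,))
```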

Code Example

A full autotuning session for a small model:

import netcl as nc
from netcl.profiling import Autotuner

ctx, queue = nc.device.manager.default()
autotuner = Autotuner(device=queue.device)

# Register all the kernels the model uses.
# model.kernels() is assumed to yield (name, kernel, global_size) tuples.
for name, kernel, shape in model.kernels():
    autotuner.add(name, kernel, shape)

# Run the tuner once.
autotuner.tune_all(queue)

# Save the results to a file for later re-use.
autotuner.save("autotune_results.json")

Restoring cached results:

autotuner = Autotuner(device=queue.device)
autotuner.load("autotune_results.json")
# Now the dispatch uses the cached sizes without re-tuning.
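The on-disk format used by save and load is not specified here. One practical detail is that JSON object keys must be strings, so a dict keyed on (name, global_size) tuples cannot be dumped directly. The sketch below shows one way to round-trip such a dict through a list of records; save_results, load_results, and the record field names are assumptions for illustration, not netcl's actual format.

```python
import json
import os
import tempfile
from typing import Dict, Tuple

Results = Dict[Tuple[str, Tuple[int, ...]], int]


def save_results(results: Results, path: str) -> None:
    # Tuple keys are not valid JSON keys, so store a list of records instead.
    records = [
        {"name": name, "global_size": list(shape), "local_size": local}
        for (name, shape), local in results.items()
    ]
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(records, fh, indent=2)


def load_results(path: str) -> Results:
    with open(path, encoding="utf-8") as fh:
        records = json.load(fh)
    # Rebuild the tuple keys that JSON turned into lists.
    return {(r["name"], tuple(r["global_size"])): r["local_size"] for r in records}


original = {("matmul_small", (256, 256)): 64, ("matmul_large", (4096, 4096)): 128}
path = os.path.join(tempfile.mkdtemp(), "autotune_results.json")
save_results(original, path)
restored = load_results(path)
```

Since results are device-specific, a real file would also need some way to identify the device it was produced on; that is left out of this sketch.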

Performance & Trade-offs

  • The autotuner is off by default. The op dispatch uses conservative default local sizes until the user explicitly calls tune_all. This is the right default: most users should not pay the tuning cost on first run.
  • Tuning cost is proportional to the number of registered kernels times the number of candidate local sizes, with each candidate measurement taking a few hundred microseconds. For a model with 20 kernels, the total tune time is around 100 ms, or roughly 5 ms per kernel.
  • The autotuner's results are device-specific. A new device needs a new tuning session. The save / load mechanism exists to amortise the cost across runs on the same device.
  • The autotuner is not aware of dtype. A matmul kernel for fp16 can have a different optimal local size than the same kernel for fp32, but because results are keyed only on (name, global_size), the autotuner currently treats them as the same kernel. This is a known limitation.
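The tuning-cost figure above can be checked with a back-of-the-envelope estimate. The candidate count (16) and the per-candidate time (300 microseconds) below are assumed values chosen to be consistent with "a few hundred microseconds each" and "around 100 ms" for 20 kernels; they are not measured netcl numbers.

```python
def estimate_tune_time_ms(num_kernels: int, num_candidates: int,
                          per_candidate_us: float) -> float:
    """Total tune time in milliseconds: kernels x candidates x time per candidate."""
    return num_kernels * num_candidates * per_candidate_us / 1000.0


# Assumed values: 20 kernels, 16 candidate local sizes, 300 us per measurement.
estimate_ms = estimate_tune_time_ms(20, 16, 300.0)  # 96.0 ms
```

Doubling either the number of kernels or the number of candidates doubles the estimate, which is why the save / load mechanism matters more for larger models.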

See also