netcl.profiling — Timing, Regions, Autotuner

The profiling API is the performance-measurement layer of netcl. It sits on top of the runtime API and the Tensor Backend and exposes a small, opinionated toolkit for measuring GPU time, attributing it to named regions of a training loop, and discovering good kernel-launch parameters for the active OpenCL device. If you have ever wanted a one-liner to time a function, a context manager to wrap a training step, or a tuner that sweeps workgroup sizes for you, this is the package that provides it.

Note — Long-form imports. netcl/profiling/__init__.py is empty in the current code; every public symbol lives in a submodule and must be imported by its full submodule path. There is no from netcl import profiling shortcut. The recommended imports are:

from netcl.profiling.timing import timing, region, timing_region, EventTimer
from netcl.profiling.autotuner import autotune, AutotuneResult, Autotuner

Both modules are also used internally by the JIT Compiler and the Trainer; in user code, the long-form import is the supported way to reach them.

Overview

The profiling API has three responsibilities, each backed by a different symbol family:

  • Quick latency measurement: timing wraps a single function call and returns mean, standard deviation, and per-call timings.
  • Region-based profiling: region, timing_region, and EventTimer let you attribute GPU time to named blocks of a training loop.
  • Autotuning: autotune and Autotuner sweep kernel-launch parameters and feed the winning configuration back into the runtime API program cache.

All three layers read OpenCL profiling events, which requires a queue created with cl.command_queue_properties.PROFILING_ENABLE. The profiling API enables that flag automatically on every queue it is given, so the numbers you see are real GPU time, not wall-clock time.

Symbol Table

| Symbol | Path | Purpose |
| --- | --- | --- |
| timing | profiling/timing.py | Synchronous timing helper; returns mean, std, and per-call timings |
| region | profiling/timing.py | Alias for timing_region |
| timing_region | profiling/timing.py | Context manager that records GPU time for a named region |
| EventTimer | profiling/timing.py | Lower-level cl.Event-based timer for fine-grained kernel attribution |
| autotune | profiling/autotuner.py | Functional wrapper around Autotuner for one-shot sweeps |
| AutotuneResult | profiling/autotuner.py | Frozen result holding the winning best_config and the full measurements table |
| Autotuner | profiling/autotuner.py | Stateful tuner that can be reused across multiple kernel signatures |

timing

timing is the synchronous, host-side timing helper. It is the right tool when you want a single latency number for a callable — for example, the cost of one matmul, one conv2d, or one full step of a tiny MLP.

from netcl.profiling.timing import timing

mean_s, std_s, per_call = timing(
    matmul, a, b,
    n_warmup=3,
    n=100,
)
print(f"matmul: {mean_s*1e6:.2f} us ± {std_s*1e6:.2f} us")

| Parameter | Default | Purpose |
| --- | --- | --- |
| fn | | The callable to time. |
| *args | | Positional arguments forwarded to fn on every call. |
| n_warmup | 3 | Number of warm-up calls discarded before measurement starts. |
| n | 100 | Number of timed calls. The per-call timings of these calls are averaged into the reported mean. |
| **kwargs | {} | Keyword arguments forwarded to fn on every call. |

Returns (mean_s, std_s, per_call_s_list) — a three-tuple of a mean in seconds, a sample standard deviation in seconds, and the raw per-call timings as a list of floats. The helper creates its own EventTimer under the hood, so the reported numbers are GPU time, not wall-clock. Use timing for quick latency measurement when you do not need region aggregation.
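The return shape can be illustrated with a host-side stand-in. The sketch below is not netcl's implementation: it measures wall-clock time with time.perf_counter instead of an EventTimer, and the name timing_sketch is hypothetical. It only shows how warm-up calls are discarded and how the mean and sample standard deviation follow from the per-call list.

```python
import statistics
import time


def timing_sketch(fn, *args, n_warmup=3, n=100, **kwargs):
    """Hypothetical stand-in for timing(); measures wall-clock, not GPU time."""
    # Warm-up calls run but are not recorded.
    for _ in range(n_warmup):
        fn(*args, **kwargs)

    per_call_s = []
    for _ in range(n):
        t0 = time.perf_counter()
        fn(*args, **kwargs)
        per_call_s.append(time.perf_counter() - t0)

    mean_s = statistics.mean(per_call_s)
    # The sample standard deviation needs at least two measurements.
    std_s = statistics.stdev(per_call_s) if n > 1 else 0.0
    return mean_s, std_s, per_call_s


mean_s, std_s, per_call = timing_sketch(sum, range(1000), n_warmup=2, n=10)
print(f"sum: {mean_s * 1e6:.2f} us ± {std_s * 1e6:.2f} us over {len(per_call)} calls")
```

Keeping the raw per-call list alongside the summary statistics makes it possible to spot outliers, such as a slow first timed call, that a mean alone would hide.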

region / timing_region

timing_region is the context manager for named timing regions; region is exported as an alias for callers who prefer the shorter name. It is the right tool for end-to-end training-loop profiling: wrap each interesting block in with timing_region("..."): and the runtime API aggregates the per-region GPU time into a tree.

from netcl.profiling.timing import timing_region

with timing_region("forward"):
    logits = model(x)

with timing_region("backward"):
    loss.backward()

with timing_region("optimizer"):
    opt.step()

timing_region emits cl.Event-based measurements on the active queue. Nested timing_region blocks are supported: an inner block is attributed to its parent and reported as a sub-region. The aggregation is per-queue and is read out with PerfCounter.report(); see the runtime API page for the full reporting surface. Use this when you want to break a training step into named phases and see where the time goes.

EventTimer

EventTimer is the lower-level wrapper around cl.Event for fine-grained GPU timing. It is the right tool when you need to attribute time to a specific kernel rather than to a named region of Python code.

from netcl.profiling.timing import EventTimer

timer = EventTimer(queue)
timer.start()
matmul(a, b)
elapsed_ms = timer.stop()       # cl_event-based, real GPU time

EventTimer is also the building block that timing and timing_region use internally, so its numbers are consistent with the higher-level helpers. It owns a pair of cl.Event objects, lazily creates them, and reuses them across calls to avoid per-measurement allocation overhead. Use EventTimer when you need the raw event API — for example, when you are writing a custom kernel wrapper and want to record the per-launch latency.
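For readers working with raw events, the arithmetic behind an event-based measurement looks roughly like the sketch below. It assumes an object shaped like pyopencl.Event, whose profile.start and profile.end attributes are device timestamps in nanoseconds; whether EventTimer reads exactly these fields is an assumption. The helper name event_elapsed_ms is hypothetical, and a SimpleNamespace stands in for a real event so the example runs without an OpenCL device.

```python
from types import SimpleNamespace


def event_elapsed_ms(event):
    """Elapsed device time of one completed event, in milliseconds.

    Expects an object shaped like pyopencl.Event: event.profile.start and
    event.profile.end are device timestamps in nanoseconds.
    """
    return (event.profile.end - event.profile.start) / 1e6


# Stand-in object so the sketch runs without an OpenCL device.
fake_event = SimpleNamespace(profile=SimpleNamespace(start=1_000_000, end=3_500_000))
print(event_elapsed_ms(fake_event))
```

With a real pyopencl event, the profiling fields are only valid once the event has completed (for example after event.wait()) and only on a queue created with profiling enabled.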

autotune / Autotuner

autotune and Autotuner sweep kernel-launch parameters for a single kernel and pick the fastest configuration. They search over the intra-kernel tile parameters that the JIT Compiler exposes through WorkGroupTuner — workgroup sizes, tile sizes, and unroll factors.

from netcl.profiling.autotuner import autotune

result = autotune(
    kernel_fn=my_matmul,
    args=(a, b),
    workgroup_sizes=[(8, 8), (16, 16), (32, 8)],
    tile_sizes=[(4, 4), (8, 8)],
    unroll_factors=[1, 2, 4],
    n_warmup=3,
    n=20,
)

print("best config:", result.best_config)
print("trials:", len(result.measurements))

| Parameter | Default | Purpose |
| --- | --- | --- |
| kernel_fn | | The kernel callable to time. Should accept the same *args on every call. |
| args | | The positional argument tuple used for every trial. Inputs must be representative. |
| workgroup_sizes | None | Optional list of (gx, gy) tuples. When None, the tuner queries WorkGroupTuner. |
| tile_sizes | None | Optional list of (tx, ty) tuples. When None, the tuner also falls back to WorkGroupTuner. |
| unroll_factors | None | Optional list of integer unroll factors. |
| n_warmup | 3 | Warm-up trials per configuration, discarded. |
| n | 20 | Timed trials per configuration; the mean is used to pick the winner. |

Returns an AutotuneResult with two fields: best_config (the winning parameter dictionary) and measurements (the full per-trial table, useful for inspecting variance or for plotting a sweep). Use the functional autotune form for one-off sweeps; use the Autotuner class directly when you want to amortize setup across multiple kernel signatures.
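The sweep itself can be pictured as an exhaustive grid search that keeps the configuration with the lowest mean time. The sketch below is a minimal illustration, not the Autotuner implementation: sweep_sketch, the measure callback, the dictionary keys, and the synthetic cost model in fake_measure are all assumptions made so the example runs without a GPU. It also stores one row per configuration, which may differ from the layout of the real measurements table.

```python
import itertools
import statistics


def sweep_sketch(measure, workgroup_sizes, tile_sizes, unroll_factors):
    """Hypothetical grid sweep: try every combination, keep the lowest mean.

    `measure(config)` is an assumed callback returning a list of per-trial
    timings in seconds for one configuration.
    """
    measurements = []
    for wg, tile, unroll in itertools.product(workgroup_sizes, tile_sizes, unroll_factors):
        config = {"workgroup_size": wg, "tile_size": tile, "unroll_factor": unroll}
        trials = measure(config)
        measurements.append(
            {"config": config, "mean_s": statistics.mean(trials), "trials_s": trials}
        )

    best = min(measurements, key=lambda m: m["mean_s"])
    return best["config"], measurements


def fake_measure(config):
    # Synthetic cost model (not real timings): pretend 16x16 workgroups,
    # 8x8 tiles, and an unroll factor of 2 are fastest.
    gx, gy = config["workgroup_size"]
    tx, _ = config["tile_size"]
    penalty = abs(gx - 16) + abs(gy - 16) + abs(tx - 8) + abs(config["unroll_factor"] - 2)
    cost = 1e-4 + penalty * 1e-6
    return [cost, cost * 1.1, cost * 0.9]


best_config, measurements = sweep_sketch(
    fake_measure,
    workgroup_sizes=[(8, 8), (16, 16), (32, 8)],
    tile_sizes=[(4, 4), (8, 8)],
    unroll_factors=[1, 2, 4],
)
print("best config:", best_config)
print("configurations tried:", len(measurements))
```

A full grid grows multiplicatively with each parameter list (3 × 2 × 3 = 18 configurations here), which is one reason to keep the candidate lists short and the inputs representative.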

Integration with the Runtime Cache

The autotuner writes its results into the runtime API program cache, keyed on the kernel's (source_hash, build_flags) tuple. Once a sweep has finished, the JIT Compiler automatically picks up the winning configuration the next time it encounters the same kernel — there is no second import or re-registration step. This means a single autotune call at the top of a training script is enough to teach the rest of netcl to use the right tile sizes for the active device.
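The keying scheme can be sketched with a plain dictionary. Everything here is illustrative: the function names, the in-memory _tuned_configs table, and the choice of SHA-256 for source_hash are assumptions, not the actual runtime API cache.

```python
import hashlib

# Hypothetical in-memory stand-in for the program cache's tuning table.
_tuned_configs = {}


def cache_key(kernel_source, build_flags):
    """Build a (source_hash, build_flags) key; SHA-256 is an assumed hash choice."""
    source_hash = hashlib.sha256(kernel_source.encode("utf-8")).hexdigest()
    # Flags are stored as a tuple so the key is hashable.
    return (source_hash, tuple(build_flags))


def store_config(kernel_source, build_flags, config):
    _tuned_configs[cache_key(kernel_source, build_flags)] = config


def lookup_config(kernel_source, build_flags):
    # Returns None on a miss, e.g. for a kernel that was never tuned.
    return _tuned_configs.get(cache_key(kernel_source, build_flags))


src = "__kernel void scale(__global float *x) { x[get_global_id(0)] *= 2.0f; }"
store_config(src, ["-cl-fast-relaxed-math"], {"workgroup_size": (16, 16)})

print(lookup_config(src, ["-cl-fast-relaxed-math"]))  # same source and flags: hit
print(lookup_config(src, []))                          # different flags: miss
```

Because the build flags are part of the key, the same kernel source compiled with different flags is treated as a separate entry and would need its own sweep.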

Limitations

The autotuner does not search over algorithm choice — for example, it will not pick between a direct convolution and a Winograd-style conv, or between a standard matmul and an im2col-based one. Those choices are made earlier, in KernelSelector and the JIT Compiler. What autotune does is sweep the intra-kernel parameters (workgroup shape, tile shape, unroll factor) of a single already-chosen kernel implementation. If you need a different algorithm, change the kernel itself or pick a different KernelVariant; the autotuner cannot do that for you.
