netcl.profiling — Timing, Regions, Autotuner
The profiling API is the performance-measurement layer of netcl. It sits on top of the runtime API and the Tensor Backend and exposes a small, opinionated toolkit for measuring GPU time, attributing it to named regions of a training loop, and discovering good kernel-launch parameters for the active OpenCL device. If you have ever wanted a one-liner to time a function, a context manager to wrap a training step, or a tuner that sweeps workgroup sizes for you, this is the package that provides it.
Note: Long-form imports. `netcl/profiling/__init__.py` is empty in the current code; every public symbol lives in a submodule and must be imported by its full submodule path. There is no `from netcl import profiling` shortcut. The recommended imports are:

```python
from netcl.profiling.timing import timing, region, timing_region, EventTimer
from netcl.profiling.autotuner import autotune, AutotuneResult, Autotuner
```

Both modules are also touched by the JIT Compiler and the `Trainer` internally; in user code the long-form import is the supported way to reach them.
Overview
The profiling API has three responsibilities, each backed by a different symbol family:
- Quick latency measurement: `timing` wraps a single function call and returns mean, standard deviation, and per-call timings.
- Region-based profiling: `region`, `timing_region`, and `EventTimer` let you attribute GPU time to named blocks of a training loop.
- Autotuning: `autotune` and `Autotuner` sweep kernel-launch parameters and feed the winning configuration back into the runtime API program cache.
All three layers emit OpenCL profiling events when the
active queue was created with cl.command_queue_properties.PROFILING_ENABLE. The
profiling API enables that flag automatically on every queue it is
given, so the numbers you see are real GPU time, not wall-clock.
Symbol Table
| Symbol | Path | Purpose |
|---|---|---|
| `timing` | `profiling/timing.py` | Synchronous timing helper; returns mean, std, and per-call timings |
| `region` | `profiling/timing.py` | Alias for `timing_region` |
| `timing_region` | `profiling/timing.py` | Context manager that records GPU time for a named region |
| `EventTimer` | `profiling/timing.py` | Lower-level `cl.Event`-based timer for fine-grained kernel attribution |
| `autotune` | `profiling/autotuner.py` | Functional wrapper around `Autotuner` for one-shot sweeps |
| `AutotuneResult` | `profiling/autotuner.py` | Frozen result holding the winning `best_config` and the full `measurements` table |
| `Autotuner` | `profiling/autotuner.py` | Stateful tuner that can be reused across multiple kernel signatures |
timing
timing is the synchronous, host-side timing helper. It is the right
tool when you want a single latency number for a callable — for example, the cost of one
matmul, one conv2d, or one full step of a tiny
MLP.
```python
from netcl.profiling.timing import timing

# matmul, a, and b are assumed to be defined elsewhere.
mean_s, std_s, per_call = timing(
    matmul, a, b,
    n_warmup=3,
    n=100,
)
print(f"matmul: {mean_s*1e6:.2f} us ± {std_s*1e6:.2f} us")
```
| Parameter | Default | Purpose |
|---|---|---|
| `fn` | — | The callable to time. |
| `*args` | — | Positional arguments forwarded to `fn` on every call. |
| `n_warmup` | `3` | Number of warm-up calls discarded before measurement starts. |
| `n` | `100` | Number of timed calls. The timings of these calls are averaged to produce the mean. |
| `**kwargs` | `{}` | Keyword arguments forwarded to `fn` on every call. |
Returns (mean_s, std_s, per_call_s_list) — a three-tuple of a mean in seconds, a
sample standard deviation in seconds, and the raw per-call timings as a list of floats.
The helper creates its own EventTimer under the hood, so the reported
numbers are GPU time, not wall-clock. Use timing for quick latency
measurement when you do not need region aggregation.
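The shape of the return value can be illustrated with a host-side stand-in. The sketch below is not netcl's implementation: `timing_sketch` is a hypothetical name, and it measures with `time.perf_counter` (host wall-clock) where the text above says netcl uses GPU events. It only shows the warm-up/measure split and how a mean and a sample standard deviation fall out of the per-call list.

```python
import statistics
import time


def timing_sketch(fn, *args, n_warmup=3, n=100, **kwargs):
    """Hypothetical host-side stand-in for the documented timing() contract."""
    # Warm-up calls run but are not recorded.
    for _ in range(n_warmup):
        fn(*args, **kwargs)

    per_call_s = []
    for _ in range(n):
        t0 = time.perf_counter()
        fn(*args, **kwargs)
        per_call_s.append(time.perf_counter() - t0)

    mean_s = statistics.fmean(per_call_s)
    # A sample standard deviation needs at least two data points.
    std_s = statistics.stdev(per_call_s) if n > 1 else 0.0
    return mean_s, std_s, per_call_s


calls = []
mean_s, std_s, per_call = timing_sketch(lambda x: calls.append(x), 1, n_warmup=2, n=10)
print(f"mean {mean_s * 1e6:.2f} us, std {std_s * 1e6:.2f} us over {len(per_call)} calls")
```

The callable runs `n_warmup + n` times in total, but only the last `n` calls contribute to the statistics.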
region / timing_region
timing_region is the context manager for named timing regions.
region is exported as an alias for callers that prefer the shorter
name. It is the right tool for end-to-end training-loop profiling — the loop wraps
every interesting block in a with timing_region("..."): and the runtime
API aggregates the per-region GPU time into a tree.
```python
from netcl.profiling.timing import timing_region

with timing_region("forward"):
    logits = model(x)
with timing_region("backward"):
    loss.backward()
with timing_region("optimizer"):
    opt.step()
```
timing_region emits cl.Event-based measurements on the active queue. Nested
timing_region blocks are supported: an inner block is attributed to
its parent and reported as a sub-region. The aggregation is per-queue and is read out
with PerfCounter.report(); see the runtime API page for
the full reporting surface. Use this when you want to break a training step into named
phases and see where the time goes.
EventTimer
EventTimer is the lower-level wrapper around cl.Event for
fine-grained GPU timing. It is the right tool when you need to attribute time to a
specific kernel rather than to a named region of Python code.
```python
from netcl.profiling.timing import EventTimer

# queue, matmul, a, and b are assumed to be defined elsewhere.
timer = EventTimer(queue)
timer.start()
matmul(a, b)
elapsed_ms = timer.stop()  # cl_event-based, real GPU time
```
EventTimer is also the building block that timing and
timing_region use internally, so its numbers are consistent
with the higher-level helpers. It owns a pair of cl.Event objects, lazily creates
them, and reuses them across calls to avoid per-measurement allocation overhead. Use
EventTimer when you need the raw event API — for example, when you
are writing a custom kernel wrapper and want to record the per-launch latency.
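OpenCL reports event profiling timestamps in nanoseconds, and PyOpenCL exposes them as `event.profile.start` and `event.profile.end`. A timer that returns milliseconds therefore has to do a conversion along these lines. This is only a sketch: `elapsed_ms` is a hypothetical helper, not part of netcl, and a `SimpleNamespace` stands in for a real event so the snippet runs without a device.

```python
from types import SimpleNamespace


def elapsed_ms(event):
    """Convert an event's start/end profiling timestamps (ns) to milliseconds."""
    return (event.profile.end - event.profile.start) / 1e6


# Stand-in for a completed event. A real pyopencl.Event exposes the same
# profile.start / profile.end attributes when the queue has profiling enabled.
fake_event = SimpleNamespace(profile=SimpleNamespace(start=1_000_000, end=3_500_000))
print(elapsed_ms(fake_event))
```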
autotune / Autotuner
autotune and Autotuner sweep kernel-launch
parameters for a single kernel and pick the fastest configuration. They search over
the intra-kernel tile parameters that the JIT Compiler
exposes through WorkGroupTuner — workgroup sizes, tile sizes, and unroll
factors.
```python
from netcl.profiling.autotuner import autotune

# my_matmul, a, and b are assumed to be defined elsewhere.
result = autotune(
    kernel_fn=my_matmul,
    args=(a, b),
    workgroup_sizes=[(8, 8), (16, 16), (32, 8)],
    tile_sizes=[(4, 4), (8, 8)],
    unroll_factors=[1, 2, 4],
    n_warmup=3,
    n=20,
)
print("best config:", result.best_config)
print("trials:", len(result.measurements))
```
| Parameter | Default | Purpose |
|---|---|---|
| `kernel_fn` | — | The kernel callable to time. Should accept the same `*args` on every call. |
| `args` | — | The positional argument tuple used for every trial. Inputs must be representative. |
| `workgroup_sizes` | `None` | Optional list of `(gx, gy)` tuples. When `None`, the tuner queries `WorkGroupTuner`. |
| `tile_sizes` | `None` | Optional list of `(tx, ty)` tuples. When `None`, the tuner also falls back to `WorkGroupTuner`. |
| `unroll_factors` | `None` | Optional list of integer unroll factors. |
| `n_warmup` | `3` | Warm-up trials per configuration, discarded. |
| `n` | `20` | Timed trials per configuration; the mean is used to pick the winner. |
Returns an AutotuneResult with two fields: best_config (the
winning parameter dictionary) and measurements (the full per-trial table, useful for
inspecting variance or for plotting a sweep). Use the functional autotune
form for one-off sweeps; use the Autotuner class directly when you
want to amortize setup across multiple kernel signatures.
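The selection rule described above (time every configuration, pick the lowest mean) can be sketched as an exhaustive sweep over the Cartesian product of the candidate lists. Everything here is illustrative rather than netcl's implementation: `sweep_sketch`, `fake_measure`, and the config keys `workgroup_size`, `tile_size`, and `unroll_factor` are hypothetical, and the cost model is synthetic so that no kernel is launched.

```python
import itertools
import statistics


def sweep_sketch(measure, workgroup_sizes, tile_sizes, unroll_factors):
    """Hypothetical exhaustive sweep: time every combination, keep the fastest mean."""
    measurements = []
    for wg, tile, unroll in itertools.product(workgroup_sizes, tile_sizes, unroll_factors):
        config = {"workgroup_size": wg, "tile_size": tile, "unroll_factor": unroll}
        timings_s = measure(config)  # per-trial seconds for this configuration
        measurements.append({"config": config, "mean_s": statistics.fmean(timings_s)})
    best = min(measurements, key=lambda row: row["mean_s"])
    return best["config"], measurements


def fake_measure(config):
    # Synthetic cost model for illustration only; it favors (16, 16) workgroups,
    # (8, 8) tiles, and larger unroll factors.
    gx, gy = config["workgroup_size"]
    tx, _ = config["tile_size"]
    base = (abs(gx - 16) + abs(gy - 16) + abs(tx - 8) + 4 / config["unroll_factor"]) * 1e-6
    return [base, base * 1.01, base * 0.99]


best_config, measurements = sweep_sketch(
    fake_measure,
    workgroup_sizes=[(8, 8), (16, 16), (32, 8)],
    tile_sizes=[(4, 4), (8, 8)],
    unroll_factors=[1, 2, 4],
)
print("best config:", best_config)
print("trials:", len(measurements))
```

The number of configurations grows multiplicatively with each candidate list (3 × 2 × 3 = 18 here), which is worth keeping in mind when choosing how many timed trials to run per configuration.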
Integration with the Runtime Cache
The autotuner writes its results into the runtime API program cache, keyed
on the kernel's (source_hash, build_flags) tuple. Once a sweep has finished, the JIT
Compiler automatically picks up the winning configuration
the next time it encounters the same kernel — there is no second import or
re-registration step. This means a single autotune call at the top
of a training script is enough to teach the rest of netcl to use the right tile sizes
for the active device.
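The keying scheme can be pictured as a dictionary lookup on a `(source_hash, build_flags)` tuple. The sketch below is an assumption-laden illustration, not the runtime API's cache: the helper names are hypothetical, SHA-256 is an arbitrary choice of hash, and sorting the flags is a design choice made here so that flag order does not create distinct keys.

```python
import hashlib

_config_cache = {}  # (source_hash, build_flags) -> winning config dict


def cache_key(kernel_source, build_flags):
    """Build a (source_hash, build_flags) key; SHA-256 is an assumption."""
    source_hash = hashlib.sha256(kernel_source.encode("utf-8")).hexdigest()
    # Sort the flags so that their order does not produce distinct keys.
    return (source_hash, tuple(sorted(build_flags)))


def store_config(kernel_source, build_flags, config):
    _config_cache[cache_key(kernel_source, build_flags)] = config


def lookup_config(kernel_source, build_flags):
    # Returns None when the kernel has not been tuned with these flags.
    return _config_cache.get(cache_key(kernel_source, build_flags))


src = "__kernel void add(__global float *a) { a[get_global_id(0)] += 1.0f; }"
store_config(src, ["-cl-fast-relaxed-math"], {"workgroup_size": (16, 16)})
print(lookup_config(src, ["-cl-fast-relaxed-math"]))
print(lookup_config(src, []))
```

Under this scheme, changing either the kernel source or the build flags yields a different key, so a tuned configuration is never applied to a kernel it was not measured on.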
Limitations
The autotuner does not search over algorithm choice — for example, it will not pick
between a direct convolution and a Winograd-style conv, or between a standard matmul
and an im2col-based one. Those choices are made earlier, in
KernelSelector and the JIT Compiler. What
autotune does is sweep the intra-kernel parameters (workgroup
shape, tile shape, unroll factor) of a single already-chosen kernel implementation. If
you need a different algorithm, change the kernel itself or pick a different
KernelVariant; the autotuner cannot do that for you.
See also
- Quickstart: install netcl, run a smoke test, dispatch a fused kernel.
- FAQ: common pitfalls, ICD selection, fp16 fallback behavior.
- runtime API: the cache, capture, scheduler, and `PerfCounter` that the profiling API builds on.
- Writing a Custom OpenCL Kernel: the `EventTimer` and `autotune` helpers are demonstrated end-to-end on a hand-written kernel.
- JIT Compiler: how autotuned configs reach the program cache and how the JIT Compiler consumes them.
- Tensor Backend: the OpenCL device model, the `OpenCLBackend` queue, and the `cl.Event` semantics.
- Memory Pool: the `BufferPool` and the per-device pool stats that autotuning reads from to pick representative inputs.