netcl.runtime — Cache, Graph, Scheduler

The `netcl.runtime` package is the runtime support layer that the JIT Compiler, the Autograd & Tape, the Tensor Backend, and the high-level profiling API all build on. It owns four things: the in-memory OpenCL program cache; the capture-and-replay machinery that lets CompiledGraph hit zero Python overhead per kernel launch; the multi-stream scheduler that the Trainer uses to overlap H2D copies and compute; and the lightweight performance counters that the profiling API wraps.

Note: Long-form imports. `netcl/runtime/__init__.py` is empty in the current code; every public symbol lives in a submodule and must be imported by its full submodule path. There is no `from netcl import runtime` shortcut that exposes them. The six recommended imports are:

from netcl.runtime.cache import program_cache, KernelHandle
from netcl.runtime.graph import Graph
from netcl.runtime.scheduler import Stream
from netcl.runtime.perf import PerfCounter, timing_region
from netcl.runtime.capture import capture_graph
from netcl.runtime.graph_fusion import fuse_graph

The JIT Compiler also reaches into `runtime.cache` and `runtime.capture` internally. You should never need to call into them directly in user code, but the long-form imports are how you get to them when debugging.

Module Map

| Symbol | Path | Purpose |
| --- | --- | --- |
| `program_cache` | `runtime/cache.py` | LRU cache keyed on `(source_hash, build_flags)`; returns `KernelHandle` |
| `KernelHandle` | `runtime/cache.py` | Bundle of a compiled `cl.Program` plus the resolved `cl.Kernel` objects |
| `Graph` | `runtime/graph.py` | User-facing capture / replay wrapper for static-shape kernels |
| `capture_graph` | `runtime/capture.py` | Decorator / context manager that begins recording launches |
| `fuse_graph` | `runtime/graph_fusion.py` | High-level entry point that calls the JIT Compiler and returns a `Graph` |
| `Stream` | `runtime/scheduler.py` | A priority-aware OpenCL command queue with `wait_for()` dependency edges |
| `PerfCounter` | `runtime/perf.py` | Per-region timing counter (cl_event-based) with `report()` and `section()` |
| `timing_region` | `runtime/perf.py` | Lightweight `with` block that measures GPU time via `cl.Event` |

Kernel Cache

from netcl.runtime.cache import program_cache, KernelHandle

# `source_code` (the OpenCL C source string) and `ctx` (the OpenCL context to
# build for) are supplied by the caller.
key = ("matmul", "fp32", 1024, 1024, 1024)
prog = program_cache.get_or_compile(key, source_code, ctx, build_flags=("-cl-fast-relaxed-math",))
kernel: KernelHandle = prog.kernel("matmul")

| Method | Purpose |
| --- | --- |
| `program_cache.get_or_compile(key, src, ctx, build_flags=...)` | Returns a cached `cl.Program` or builds + caches a new one. The key is the caller's responsibility. |
| `program_cache.evict(key)` | Drop a single entry. |
| `program_cache.clear()` | Drop every entry. Useful between contexts / devices. |

`program_cache` is an LRU cache keyed on the source hash plus the build flags. Two requests with the same source string and the same `build_flags` tuple (compared order-insensitively) hit the same `cl.Program`; requests with different flags or different sources get separate entries. On multi-GPU systems the cache is per `cl.Context`, so the program for a discrete NVIDIA card does not collide with the one for an integrated Intel GPU. The cache is not persisted to disk in this module; building a disk-backed variant is the job of the profiling API.
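
The keying rule above can be modeled in a few lines of plain Python. This is only a sketch of the idea, not netcl's implementation: `ToyProgramCache`, `cache_key`, the `builds` counter, and the placeholder "program" tuple are all hypothetical names introduced for illustration, and the real cache additionally scopes entries per `cl.Context`.

```python
import hashlib
from collections import OrderedDict


def cache_key(source, build_flags=()):
    """Key on the source hash plus an order-insensitive view of the flags."""
    source_hash = hashlib.sha256(source.encode("utf-8")).hexdigest()
    return (source_hash, frozenset(build_flags))


class ToyProgramCache:
    """Illustrative LRU cache (hypothetical; NOT netcl's program_cache)."""

    def __init__(self, capacity=64):
        self.capacity = capacity
        self._entries = OrderedDict()
        self.builds = 0  # counts how often the (expensive) build step ran

    def get_or_compile(self, source, build_flags=()):
        key = cache_key(source, build_flags)
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            return self._entries[key]
        # Placeholder for the real OpenCL program build.
        program = ("program", key)
        self.builds += 1
        self._entries[key] = program
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict the least recently used
        return program


cache = ToyProgramCache(capacity=2)
src = "__kernel void k() {}"
p1 = cache.get_or_compile(src, ("-cl-mad-enable", "-cl-fast-relaxed-math"))
p2 = cache.get_or_compile(src, ("-cl-fast-relaxed-math", "-cl-mad-enable"))
p3 = cache.get_or_compile(src, ("-cl-mad-enable",))
```

Here `p1` and `p2` resolve to the same entry because the flag order is ignored, while `p3` triggers a second build because its flag set differs.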

A KernelHandle bundles the resolved cl.Kernel together with the source string and build flags, so a replay of a captured launch sequence can find the right kernel without re-resolving the name.

Graph Capture

The capture layer is what makes steady-state training drop into a tight C-level loop. Two entry points are exposed:

from netcl.runtime.graph import Graph
from netcl.runtime.capture import capture_graph

# Lower-level: capture any sequence of OpenCL kernel launches
g = Graph()
with g.capture(queue):
    c = matmul(a, b)
    relu(c)
g.execute()    # capture once, replay many times

| Method | Purpose |
| --- | --- |
| `Graph.capture(queue)` | Context manager. Every `cl.enqueue_nd_range_kernel` on `queue` between `__enter__` and `__exit__` is recorded. |
| `Graph.execute(*new_buffers)` | Replay the captured sequence. Optional positional buffers replace the dynamic input slots. |
| `capture_graph(queue, fn, *args, **kwargs)` | Functional form: run `fn(*args, **kwargs)` while capturing, return the `Graph`. |

Replay Semantics

Replay issues the same kernel launches as the original forward pass, with two optimizations:

1. Static kernel arguments (constants, scalar enums) are pre-resolved via `kernel.set_arg(idx, value)`. Dynamic arguments (input buffers that change every iteration) are tracked by buffer identity at capture time and rebound to the new buffer at replay time.
2. The replay path does not consult the JIT Compiler or the Tape at all; it is a pure OpenCL-side dispatch loop.

This is the right tool for inference paths with static shapes and no autograd (a forward pass through a ResNet backbone, for example). For training, use CompiledGraph instead — it manages the gradient wrapper re-binding that a replayed backward pass needs.
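
To make the static/dynamic argument split concrete, here is a minimal host-side model in plain Python. It is a sketch under stated assumptions, not the `Graph` class itself: `ToyGraph`, `record`, and the `scale` "kernel" are hypothetical, and ordinary Python lists stand in for OpenCL buffers.

```python
class ToyGraph:
    """Illustrative capture/replay recorder (hypothetical; NOT netcl's Graph)."""

    def __init__(self, dynamic_inputs):
        # Keeping the originals alive makes identity checks reliable;
        # the list index doubles as the dynamic slot number.
        self._inputs = list(dynamic_inputs)
        self._launches = []

    def record(self, kernel, *args):
        spec = []
        for arg in args:
            slot = next(
                (i for i, buf in enumerate(self._inputs) if buf is arg), None
            )
            if slot is None:
                spec.append(("static", arg))    # resolved once, at capture time
            else:
                spec.append(("dynamic", slot))  # rebound on every replay
        self._launches.append((kernel, spec))

    def execute(self, *new_buffers):
        bound = list(new_buffers) if new_buffers else self._inputs
        if len(bound) != len(self._inputs):
            raise ValueError("expected %d buffers" % len(self._inputs))
        results = []
        for kernel, spec in self._launches:
            args = [bound[v] if kind == "dynamic" else v for kind, v in spec]
            results.append(kernel(*args))
        return results


def scale(buf, factor):
    # Stand-in for a kernel launch: multiplies every element by `factor`.
    return [x * factor for x in buf]


x = [1.0, 2.0]
g = ToyGraph(dynamic_inputs=[x])
g.record(scale, x, 3.0)
first = g.execute()               # replays against the captured buffer
second = g.execute([10.0, 20.0])  # same launch, new buffer in slot 0
```

The replay loop touches only the recorded launch list, which mirrors the point that replay needs neither the compiler nor the tape.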

Scheduler

The scheduler exposes `Stream`, a thin wrapper around `cl.CommandQueue` that the Trainer uses to overlap H2D copies with compute. It does no work stealing and no priority inheritance; it only provides explicit dependency edges between Streams.

from netcl.runtime.scheduler import Stream

s1 = Stream(priority="high")
s2 = Stream(priority="normal")
s1.wait_for(s2)        # enqueue a cl.enqueue_barrier on s1 that fires when s2 is done

| Method | Purpose |
| --- | --- |
| `Stream(priority="high"\|"normal"\|"low")` | Construct a new queue with the given OpenCL queue priority (where the device supports it). |
| `Stream.wait_for(other)` | Enqueue a barrier event on `self` that depends on every pending op in `other`. |
| `Stream.submit(kernel, *args)` | Convenience wrapper around `cl.enqueue_nd_range_kernel`. |
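
The dependency-edge behaviour of `wait_for` can be simulated without a GPU. The following is a toy model only, assuming in-order queues and a simple round-robin driver: `ToyStream`, `step`, and `run` are hypothetical names, and real streams enqueue OpenCL barriers rather than Python tuples.

```python
from collections import deque


class ToyStream:
    """Illustrative in-order queue with barriers (hypothetical; NOT netcl's Stream)."""

    def __init__(self, name):
        self.name = name
        self.submitted = 0   # ops enqueued so far
        self.completed = 0   # ops that have finished
        self._queue = deque()

    def submit(self, label):
        self.submitted += 1
        self._queue.append(("op", label))

    def wait_for(self, other):
        # Later ops on self wait for everything `other` has enqueued so far.
        self._queue.append(("barrier", (other, other.submitted)))

    def step(self, log):
        """Try to retire the head of the queue; return True on progress."""
        if not self._queue:
            return False
        kind, payload = self._queue[0]
        if kind == "barrier":
            other, target = payload
            if other.completed < target:
                return False  # dependency not met yet; stay blocked
        else:
            log.append(f"{self.name}:{payload}")
            self.completed += 1
        self._queue.popleft()
        return True


def run(streams):
    log = []
    while any(s._queue for s in streams):
        # Step every stream once per round; a round with no progress is a deadlock.
        if not any([s.step(log) for s in streams]):
            raise RuntimeError("deadlock: circular wait_for edges")
    return log


copy = ToyStream("copy")
compute = ToyStream("compute")
copy.submit("h2d_batch0")
compute.wait_for(copy)            # compute must see batch0 on the device first
compute.submit("matmul_batch0")
copy.submit("h2d_batch1")         # free to overlap with the matmul
order = run([compute, copy])
```

Even though `compute` is stepped first, its matmul is logged only after `copy:h2d_batch0`, while the second copy is not held back by the barrier.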

In-Order vs. Out-of-Order Queues

Stream defaults to in-order execution, the safest contract for code that does not insert explicit barriers. Setting `out_of_order=True` on a Stream switches it to out-of-order mode, in which commands on the queue may run in any order or concurrently, so every kernel must be safe for parallel execution (the OpenCL backend ensures this for kernels that use the standard `get_global_id(0)` index pattern). The inference `Graph.execute` path produces out-of-order-safe launches.

Performance

Two layers of timing are available: a low-overhead event-based one (the PerfCounter) and an even cheaper with block (the timing_region context manager).

from netcl.runtime.perf import PerfCounter, timing_region

pc = PerfCounter(["matmul", "conv2d", "h2d", "d2h"])

with pc.section("matmul"):
    c = matmul(a, b)

with timing_region("copy_h2d"):
    x = Tensor.from_host(queue, host_arr)

pc.report()    # -> ms and GB/s for each region

| Method | Purpose |
| --- | --- |
| `PerfCounter(names)` | Construct a counter with one named region per name. |
| `PerfCounter.section(name)` | Context manager. Allocates a `cl.Event` on `__enter__` and a second on `__exit__`, then records the elapsed time. |
| `PerfCounter.report(stream=sys.stdout)` | Print a table of per-region min/avg/max time and an estimated bandwidth for known regions. |
| `timing_region(name)` | Single-shot context manager. Cheaper than `PerfCounter` because it does not retain per-call history. |
| `event_based_timing(queue, fn, *args, **kwargs)` | Functional helper: run `fn(*args, **kwargs)`, return `(result, elapsed_ms)`. |

PerfCounter uses OpenCL events (cl_event with profiling enabled), so the numbers it reports are actual GPU time — not wall-clock and not host-side measurement. To opt in to event profiling, the queue must be created with cl.command_queue_properties.PROFILING_ENABLE; the profiling API does this automatically when its enable_profiling() helper is on.
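
OpenCL profiling timestamps are reported in nanoseconds, so turning a pair of event timestamps into the milliseconds and GB/s that a report shows is simple arithmetic. The sketch below illustrates that arithmetic and a min/avg/max aggregation; `elapsed_ms`, `bandwidth_gb_s`, and `RegionStats` are hypothetical helpers, not part of `netcl.runtime.perf`, and the timestamps are made-up sample values rather than measurements.

```python
def elapsed_ms(start_ns, end_ns):
    """Convert a pair of nanosecond profiling timestamps to milliseconds."""
    return (end_ns - start_ns) / 1e6


def bandwidth_gb_s(num_bytes, ms):
    """Estimated bandwidth in GB/s (1 GB = 1e9 bytes) for a timed transfer."""
    return (num_bytes / 1e9) / (ms / 1e3)


class RegionStats:
    """Illustrative per-region history (hypothetical; NOT netcl's PerfCounter)."""

    def __init__(self):
        self.samples_ms = []

    def record(self, start_ns, end_ns):
        self.samples_ms.append(elapsed_ms(start_ns, end_ns))

    def summary(self):
        # Returns (min, avg, max) in milliseconds over all recorded samples.
        n = len(self.samples_ms)
        return (min(self.samples_ms), sum(self.samples_ms) / n, max(self.samples_ms))


stats = RegionStats()
# Made-up (start, end) timestamp pairs in nanoseconds.
for start, end in [(0, 2_500_000), (10_000_000, 11_500_000), (20_000_000, 22_000_000)]:
    stats.record(start, end)

lo, avg, hi = stats.summary()
h2d_rate = bandwidth_gb_s(1_000_000_000, 500.0)  # 1 GB moved in 500 ms
```

With a real queue, the two timestamps would come from the start and end profiling fields of the events recorded around the region.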

Graph Fusion

fuse_graph is the high-level entry point that the rest of the system uses when it wants "the JIT-compiled, replayable version of a function". It is what the Trainer and the JIT Compiler reach for when they need a single CompiledGraph that captures both the forward and the backward pass.

from netcl.runtime.graph_fusion import fuse_graph

# Returns a runtime.Graph that already contains the fused forward + backward kernels.
fused = fuse_graph(model, sample_input, target=sample_target, loss_fn=cross_entropy)

Internally fuse_graph does the following:

1. Calls `autograd.compiler.jit_compile` on the function it has extracted from the model so the elementwise subgraph collapses into a single kernel pair.
2. Wraps the resulting kernels in a CompiledGraph so subsequent iterations are replayable.
3. Attaches the resulting Graph to the right OpenCL queue (taken from the model parameters).
4. Returns the Graph object. Caller code can now run `Graph.execute(...)` per training step or per inference call.
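
The first step above relies on elementwise fusion. As a rough host-side illustration of that idea only, the sketch below composes a chain of scalar operations into one function that makes a single pass over the data instead of one pass per operation. It assumes nothing about how `jit_compile` actually generates kernels; `fuse_elementwise` and the three sample ops are hypothetical.

```python
def fuse_elementwise(*ops):
    """Compose scalar ops into one function that walks the data once."""

    def fused(values):
        out = []
        for v in values:
            for op in ops:  # apply the whole chain while the element is "hot"
                v = op(v)
            out.append(v)
        return out

    return fused


def scale2(x):
    return 2.0 * x


def add_bias(x):
    return x + 1.0


def relu(x):
    return x if x > 0 else 0.0


data = [-2.0, 0.5, 3.0]

# Unfused: three separate passes, each producing an intermediate list.
step1 = [scale2(x) for x in data]
step2 = [add_bias(x) for x in step1]
unfused = [relu(x) for x in step2]

# Fused: one pass, no intermediates.
fused_result = fuse_elementwise(scale2, add_bias, relu)(data)
```

Both paths produce the same values; the fused one avoids materializing `step1` and `step2`, which is the memory-traffic saving that fusing elementwise kernels aims for.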

For the architecture-level view of how the JIT Compiler, the CompiledGraph, the Tape, and the Stream scheduler interact, see Architecture: JIT Compiler.