netcl.runtime — Cache, Graph, Scheduler
The `netcl.runtime` package is the runtime support layer that the
JIT Compiler, the Autograd & Tape,
the Tensor Backend, and the high-level
profiling API all build on. It owns the in-memory
OpenCL program cache, the capture-and-replay machinery
that lets `CompiledGraph` avoid Python overhead on each kernel launch, the
multi-stream scheduler that the Trainer uses to overlap H2D copies and compute, and
the lightweight performance counters that the profiling API wraps.
Note: Long-form imports.
`netcl/runtime/__init__.py` is empty in the current code; every public symbol lives in a submodule and must be imported by its full submodule path. There is no `from netcl import runtime` shortcut. The six recommended imports are:

```python
from netcl.runtime.cache import program_cache, KernelHandle
from netcl.runtime.graph import Graph
from netcl.runtime.scheduler import Stream
from netcl.runtime.perf import PerfCounter, timing_region
from netcl.runtime.capture import capture_graph
from netcl.runtime.graph_fusion import fuse_graph
```

The JIT Compiler also reaches into `runtime.cache` and `runtime.capture` internally; you should never need to use them directly in user code, but the long-form imports are how you get to them when debugging.
Module Map
| Symbol | Path | Purpose |
|---|---|---|
| `program_cache` | `runtime/cache.py` | LRU cache keyed on `(source_hash, build_flags)`; returns `KernelHandle` |
| `KernelHandle` | `runtime/cache.py` | Bundle of a compiled `cl.Program` plus the resolved `cl.Kernel` objects |
| `Graph` | `runtime/graph.py` | User-facing capture / replay wrapper for static-shape kernels |
| `capture_graph` | `runtime/capture.py` | Decorator / context manager that begins recording launches |
| `fuse_graph` | `runtime/graph_fusion.py` | High-level entry point that calls the JIT Compiler and returns a `Graph` |
| `Stream` | `runtime/scheduler.py` | A priority-aware OpenCL command queue with `wait_for()` dependency edges |
| `PerfCounter` | `runtime/perf.py` | Per-region timing counter (`cl_event`-based) with `report()` and `section()` |
| `timing_region` | `runtime/perf.py` | Lightweight `with` block that measures GPU time via `cl.Event` |

Kernel Cache
```python
from netcl.runtime.cache import program_cache, KernelHandle

key = ("matmul", "fp32", 1024, 1024, 1024)
prog = program_cache.get_or_compile(key, source_code, ctx, build_flags=("-cl-fast-relaxed-math",))
kernel: KernelHandle = prog.kernel("matmul")
```
| Method | Purpose |
|---|---|
| `program_cache.get_or_compile(key, src, ctx, build_flags=...)` | Returns a cached `cl.Program` or builds + caches a new one. The key is the caller's responsibility. |
| `program_cache.evict(key)` | Drop a single entry. |
| `program_cache.clear()` | Drop every entry. Useful between contexts / devices. |

`program_cache` is an LRU cache keyed on source hash plus build flags. Two requests with the same
source string and the same `build_flags` tuple (order-insensitive) hit the same
`cl.Program`; requests with different flags or different sources get separate entries.
On multi-GPU systems the cache is per `cl.Context`, so the program for a discrete NVIDIA
card does not collide with the one for an integrated Intel GPU. The cache is not
persisted to disk in this module; building a disk-backed variant is the job of the
profiling API.
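
The keying rule above can be illustrated with a minimal pure-Python sketch. It is not the `netcl` implementation: `LRUProgramCache`, `cache_key`, and `fake_build` are hypothetical names, the compile step is replaced by a stand-in callable, and one cache instance is assumed to exist per context.

```python
import hashlib
from collections import OrderedDict


def cache_key(source, build_flags=()):
    """Key on the source hash plus an order-insensitive view of the flags."""
    source_hash = hashlib.sha256(source.encode("utf-8")).hexdigest()
    return source_hash, frozenset(build_flags)


class LRUProgramCache:
    """Toy LRU cache; `build` stands in for the real OpenCL compile step."""

    def __init__(self, capacity=64):
        self.capacity = capacity
        self._entries = OrderedDict()

    def get_or_compile(self, source, build_flags, build):
        key = cache_key(source, build_flags)
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            return self._entries[key]
        program = build(source, build_flags)  # cache miss: compile once
        self._entries[key] = program
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict the least recently used
        return program

    def clear(self):
        self._entries.clear()


builds = []


def fake_build(source, flags):
    # Hypothetical stand-in for building a cl.Program from source.
    builds.append((source, tuple(flags)))
    return object()


cache = LRUProgramCache(capacity=2)
p1 = cache.get_or_compile("kernel A", ("-cl-mad-enable", "-cl-fast-relaxed-math"), fake_build)
p2 = cache.get_or_compile("kernel A", ("-cl-fast-relaxed-math", "-cl-mad-enable"), fake_build)
p3 = cache.get_or_compile("kernel A", (), fake_build)
```

Here `p1` and `p2` are the same object because the flag order does not affect the key, while `p3` triggers a second build because its flag set differs.
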
A KernelHandle bundles the resolved cl.Kernel together with the source string and
build flags, so a replay of a captured launch sequence can find the right kernel without
re-resolving the name.
Graph Capture
The capture layer is what makes steady-state training drop into a tight C-level loop. Two entry points are exposed:
```python
from netcl.runtime.graph import Graph
from netcl.runtime.capture import capture_graph

# Lower-level: capture any sequence of OpenCL kernel launches
g = Graph()
with g.capture(queue):
    c = matmul(a, b)  # keep the result so the next launch can consume it
    d = relu(c)

g.execute()  # enqueue once, replay many times
```
| Method | Purpose |
|---|---|
| `Graph.capture(queue)` | Context manager. Every `cl.enqueue_nd_range_kernel` on `queue` between `__enter__` and `__exit__` is recorded. |
| `Graph.execute(*new_buffers)` | Replay the captured sequence. Optional positional buffers replace the dynamic input slots. |
| `capture_graph(queue, fn, *args, **kwargs)` | Functional form: run `fn(*args, **kwargs)` while capturing, return the `Graph`. |

Replay Semantics
Replay enqueues exactly the same kernel launches as the original forward pass, with two optimizations:
- Static kernel arguments (constants, scalar enums) are pre-resolved via `kernel.set_arg(idx, value)`. Dynamic arguments (input buffers that change every iteration) are tracked by buffer identity at capture time and rebound to the new buffer at replay time.
- The replay path does not consult the JIT Compiler or the Tape at all; it is a pure OpenCL-side dispatch loop.
This is the right tool for inference paths with static shapes and no autograd (a
forward pass through a ResNet backbone, for example). For training, use
CompiledGraph instead — it manages the gradient wrapper re-binding that
a replayed backward pass needs.
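
The split between static and dynamic arguments can be sketched in plain Python. This is a toy model under stated assumptions, not the `Graph` implementation: `ToyGraph`, `record`, `scale`, and `add` are hypothetical names, "kernels" are ordinary functions, and intermediate buffers passed between launches are not modelled.

```python
class ToyGraph:
    """Records launches once, then replays them with rebound input buffers."""

    def __init__(self, dynamic_inputs):
        # Dynamic slots are identified by object identity at capture time.
        self._captured = list(dynamic_inputs)  # holding references keeps ids stable
        self._slot_of = {id(buf): i for i, buf in enumerate(self._captured)}
        self._launches = []

    def record(self, fn, *args):
        # Classify every argument once: dynamic slot index or frozen static value.
        plan = []
        for arg in args:
            slot = self._slot_of.get(id(arg))
            plan.append(("static", arg) if slot is None else ("slot", slot))
        self._launches.append((fn, plan))

    def execute(self, *new_buffers):
        # With no arguments, replay against the buffers seen at capture time.
        bound = list(new_buffers) if new_buffers else self._captured
        if len(bound) != len(self._captured):
            raise ValueError("expected one buffer per dynamic slot")
        return [
            fn(*[bound[v] if kind == "slot" else v for kind, v in plan])
            for fn, plan in self._launches
        ]


def scale(buf, factor):
    return [x * factor for x in buf]


def add(a, b):
    return [x + y for x, y in zip(a, b)]


a = [1.0, 2.0]
b = [10.0, 20.0]
g = ToyGraph(dynamic_inputs=[a, b])
g.record(scale, a, 2.0)  # 2.0 is static and frozen at capture time
g.record(add, a, b)      # both arguments are dynamic slots

first = g.execute()                                # replay with the captured buffers
second = g.execute([3.0, 4.0], [100.0, 200.0])     # replay with rebound buffers
```

The second call reuses the recorded launch plan unchanged; only the two slot bindings differ.
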
Scheduler
The scheduler exposes a thin wrapper around `cl.CommandQueue` that the
Trainer uses to overlap H2D copies with compute. It does no
work stealing and no priority inheritance; it only provides explicit dependency edges between
`Stream`s.
```python
from netcl.runtime.scheduler import Stream

s1 = Stream(priority="high")
s2 = Stream(priority="normal")
s1.wait_for(s2)  # enqueue a cl.enqueue_barrier on s1 that fires when s2 is done
```
| Method | Purpose |
|---|---|
| `Stream(priority="high"\|"normal"\|"low")` | Construct a new queue with the given OpenCL queue priority (where the device supports it). |
| `Stream.wait_for(other)` | Enqueue a barrier event on `self` that depends on every pending op in `other`. |
| `Stream.submit(kernel, *args)` | Convenience wrapper around `cl.enqueue_nd_range_kernel`. |

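
To make the `wait_for` contract concrete, here is a small host-side simulation of two in-order streams joined by one dependency edge. It is only a sketch: `ToyStream` and `run` are hypothetical names, no OpenCL is involved, and the round-robin loop merely stands in for the device picking whichever queue has runnable work.

```python
class ToyStream:
    """In-order queue of ops, plus barriers that point at another stream."""

    def __init__(self, name):
        self.name = name
        self.ops = []   # ("op", label) or ("barrier", other_stream, entry_count)
        self.done = 0   # number of entries already retired

    def submit(self, label):
        self.ops.append(("op", label))

    def wait_for(self, other):
        # The barrier depends on everything queued on `other` at this moment.
        self.ops.append(("barrier", other, len(other.ops)))

    def ready(self):
        if self.done >= len(self.ops):
            return False
        entry = self.ops[self.done]
        return entry[0] == "op" or entry[1].done >= entry[2]


def run(streams):
    """Retire head entries round-robin whenever their dependencies are met."""
    order = []
    progressed = True
    while progressed:
        progressed = False
        for s in streams:
            if s.ready():
                entry = s.ops[s.done]
                s.done += 1
                if entry[0] == "op":
                    order.append(f"{s.name}:{entry[1]}")
                progressed = True
    return order


copy = ToyStream("copy")
compute = ToyStream("compute")

copy.submit("h2d batch0")
compute.wait_for(copy)            # forward must not start before batch0 is uploaded
compute.submit("forward batch0")
copy.submit("h2d batch1")         # queued after the barrier, so it is free to overlap

order = run([compute, copy])
```

Even though `compute` is polled first, its forward pass is held back until the upload it depends on has retired, while the second upload is not blocked by the barrier.
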
In-Order vs. Out-of-Order Queues
Stream defaults to in-order execution (the safest contract for code
that does not insert explicit barriers). Setting out_of_order=True on a
Stream flips it into out-of-order mode, in which case every kernel must
be made safe for parallel execution (the OpenCL backend
ensures this for kernels that use the standard get_global_id(0) index pattern). The
inference Graph.execute path produces out-of-order-safe launches.
Performance
Two layers of timing are available: a low-overhead event-based one (the
PerfCounter) and an even cheaper with block (the
timing_region context manager).
```python
from netcl.runtime.perf import PerfCounter, timing_region

pc = PerfCounter(["matmul", "conv2d", "h2d", "d2h"])
with pc.section("matmul"):
    c = matmul(a, b)
with timing_region("copy_h2d"):
    x = Tensor.from_host(queue, host_arr)
pc.report()  # -> ms and GB/s for each region
```
| Method | Purpose |
|---|---|
| `PerfCounter(names)` | Construct a counter with one named region per name. |
| `PerfCounter.section(name)` | Context manager. Allocates a `cl.Event` on `__enter__` and a second on `__exit__`, then records the elapsed time. |
| `PerfCounter.report(stream=sys.stdout)` | Print a table of per-region min/avg/max time and an estimated bandwidth for known regions. |
| `timing_region(name)` | Single-shot context manager. Cheaper than `PerfCounter` because it does not retain per-call history. |
| `event_based_timing(queue, fn, *args, **kwargs)` | Functional helper: run `fn(*args, **kwargs)`, return `(result, elapsed_ms)`. |

`PerfCounter` uses OpenCL events (`cl_event` with profiling enabled),
so the numbers it reports are actual GPU time, not wall-clock time and not a host-side
measurement. To opt in to event profiling, the queue must be created with
`cl.command_queue_properties.PROFILING_ENABLE`; the
profiling API does this automatically when its `enable_profiling()`
helper is enabled.
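
The arithmetic behind a min/avg/max report can be sketched without a device. OpenCL profiling timestamps are reported in nanoseconds (PyOpenCL exposes them as `event.profile.start` and `event.profile.end`), so the sketch below takes plain `(start_ns, end_ns)` pairs. `elapsed_ms` and `summarize` are hypothetical helpers, and deriving bandwidth from the average time is an assumption, not a statement about how `PerfCounter.report()` computes it.

```python
def elapsed_ms(start_ns, end_ns):
    """Convert a pair of nanosecond timestamps to elapsed milliseconds."""
    return (end_ns - start_ns) / 1e6


def summarize(samples_ns, bytes_moved=None):
    """Reduce (start_ns, end_ns) pairs to min/avg/max ms and optional GB/s."""
    times = [elapsed_ms(start, end) for start, end in samples_ns]
    stats = {
        "min_ms": min(times),
        "avg_ms": sum(times) / len(times),
        "max_ms": max(times),
    }
    if bytes_moved is not None:
        # Assumption: bandwidth = bytes per call / average seconds per call.
        avg_seconds = stats["avg_ms"] / 1e3
        stats["gb_per_s"] = bytes_moved / avg_seconds / 1e9
    return stats


# Two hypothetical H2D copies of 48 MB each, taking 2 ms and 4 ms on the device.
h2d = summarize([(0, 2_000_000), (5_000_000, 9_000_000)], bytes_moved=48_000_000)
```

Because both timestamps come from the device clock, the gap between the two copies (3 ms of idle time in this example) does not inflate the per-call figures the way a host-side timer around the whole loop would.
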
Graph Fusion
fuse_graph is the high-level entry point that the rest of the system
uses when it wants "the JIT-compiled, replayable version of a function". It is what the
Trainer and the JIT Compiler reach for when they
need a single CompiledGraph that captures both the forward and the
backward pass.
```python
from netcl.runtime.graph_fusion import fuse_graph

# Returns a runtime.Graph that already contains the fused forward + backward kernels.
fused = fuse_graph(model, sample_input, target=sample_target, loss_fn=cross_entropy)
```
Internally `fuse_graph` does the following:
- Calls `autograd.compiler.jit_compile` on the function it has extracted from the model so the elementwise subgraph collapses into a single kernel pair.
- Wraps the resulting kernels in a `CompiledGraph` so subsequent iterations are replayable.
- Attaches the resulting `Graph` to the right OpenCL queue (taken from the model parameters).
- Returns the `Graph` object. Caller code can now run `Graph.execute(...)` per training step or per inference call.
For the architecture-level view of how the
JIT Compiler, the CompiledGraph, the
Tape, and the Stream scheduler interact, see
Architecture: JIT Compiler.
See also
- JIT Compiler: how the elementwise subgraph becomes a single OpenCL kernel pair, and how `fuse_graph` sits on top of it.
- Autograd & Tape: the design of the `CompiledGraph` replay path.
- Tensor Backend: the OpenCL queues and contexts that every `Stream` wraps.
- Tensor: the value type whose buffer identities `Graph.execute` rebinds at replay time.
- profiling API: the higher-level wrapper around `PerfCounter` and `timing_region` for end-to-end microbenchmarks.
- Understanding Autograd: shows the `Tape` / `jit_compile` interplay that `fuse_graph` automates.