api

netcl.core — Devices, Tensor, Memory, Kernels

`netcl.core` — Devices, Tensor, Memory, Kernels

The core package is the foundation of netcl. It owns the device discovery layer, the Tensor lifecycle, the BufferPool memory recycler, the OpenCLBackend and CPUBackend pluggable transports, and the small set of helpers that turn a string body into a real OpenCL kernel via the JIT Compiler.

Note — Top-level re-exports. netcl/core/__init__.py re-exports DeviceManager, DeviceHandle, manager, Tensor, BufferPool, BufferHandle only. Other symbols (CPUQueue, PoolStats, OpenCLBackend, CPUBackend, KernelSpec, PRIMITIVE_PREAMBLE, WorkGroupTuner, KernelSelector, …) are imported from their full submodule paths as shown below. The root netcl/__init__.py does surface matmul, build_matmul_kernel, elementwise_binary, relu, bias_add, reduce_sum, softmax, and conv2d as convenience shortcuts, but for anything else use the explicit submodule import.

Overview

Symbol	Path	Purpose
`DeviceManager`	`netcl.core.device`	Discovers OpenCL platforms and devices, returns a default
`DeviceHandle`	`netcl.core.device`	Immutable handle: `(platform, device, context, queue, backend)`
`manager` (singleton)	`netcl.core.device`	Process-wide `DeviceManager` instance, ready to use
`CPUQueue`	`netcl.core.device`	Stand-in queue object for the CPU/NumPy backend
`Tensor`	`netcl.core.tensor`	N-dim array stored on a device
`BufferPool`	`netcl.core.memory`	Power-of-two bucketed device-buffer pool
`BufferHandle`	`netcl.core.memory`	Single lease out of a `BufferPool` (auto-released)
`PoolStats`	`netcl.core.memory`	Hit/miss/bytes counters with computed `hit_rate`
`PinnedBufferPool`	`netcl.core.memory`	Page-locked host-buffer pool for fast H2D/D2H
`PersistentBufferPool`	`netcl.core.memory`	Bounded bucket pool with size limits and per-bucket caps
`OpenCLBackend`	`netcl.core.backend.opencl`	Async H2D/D2H, pinned memory, fork-safe queue cleanup
`CPUBackend`	`netcl.core.backend.cpu`	NumPy fallback (used when no OpenCL device is present)
`DeviceProfile`	`netcl.core.capabilities`	Frozen summary of device caps (FP16, subgroups, local mem)
`kernel_strategy`	`netcl.core.capabilities`	Returns `"portable"` or `"optimized"` for a profile
`KernelSpec`	`netcl.core.kernels.primitives`	Declarative OpenCL kernel source generator
`PRIMITIVE_PREAMBLE`	`netcl.core.kernels.primitives`	C-macro preamble injected into generated kernels
`WorkGroupTuner`	`netcl.core.kernels.primitives`	Heuristic local/global size chooser
`PrimitiveBuilder`	`netcl.core.kernels.primitives`	Builds elementwise kernels from expressions
`KernelSelector`	`netcl.core.kernel_selector`	Auto-selects matmul/conv/elementwise/reduction variants
`KernelVariant`	`netcl.core.kernel_selector`	Enum of available kernel implementations
`KernelConfig`	`netcl.core.kernel_selector`	Tile sizes, work-per-thread, local memory flags
`is_integrated_gpu`	`netcl.core.memory`	Heuristic: does host and device share memory?
`has_svm_support`	`netcl.core.memory`	Probes `cl_svm_*` capability
`create_pinned_buffer`	`netcl.core.memory`	Allocates a single `ALLOC_HOST_PTR` buffer
`create_zero_copy_buffer`	`netcl.core.memory`	`USE_HOST_PTR` zero-copy mapping for integrated GPUs

`DeviceManager` and the `manager` Singleton

DeviceManager is a small but stateful object: it caches one DeviceHandle per requested selector string so repeated calls inside a hot loop do not pay the discovery cost twice.

from netcl.core.device import manager

# Returns the first GPU/CPU/Accelerator that was discovered.
dev = manager.default("auto")
print(dev.platform_name, dev.device_name, dev.backend, dev.device_type)
print(type(dev.queue), type(dev.context))

Method	Returns	Purpose
`DeviceManager.discover()`	`list[(Platform, Device)]`	Enumerate every `(platform, device)` pair the driver reports.
`DeviceManager.default(s)`	`DeviceHandle` \| `None`	Return a cached handle for `"auto"`, `"gpu"`, `"cpu"`, or a device id. Returns `None` when no device is available.
`DeviceManager._device_type(dev)`	`str`	Maps a `cl.Device.type` bitfield to `"gpu"`, `"cpu"`, `"accel"`, or `"other"`.

The manager singleton is constructed at import time and is what most user code should reach for. It is safe to call manager.default("auto") once at program start and pass the handle around.

Thread-Local Active Device

netcl.core.device also exposes a small device context manager and a thread-local active device. This is what allows helper functions inside Tensor to find the right queue without you having to thread it through every call.

from netcl.core.device import device

with device("gpu") as dev:
    # get_active_device() now returns this handle for the current thread.
    ...
# On exit, the previous active device is restored.

For an end-to-end introduction to the device model and how it maps onto real OpenCL platforms, see Architecture: Tensor Backend.

`DeviceHandle`

@dataclass
class DeviceHandle:
    platform_name: str           # e.g. "NVIDIA CUDA", "Intel(R) OpenCL"
    device_name:   str           # e.g. "Tesla T4", "Iris Xe"
    backend:       str           # "cl" or "cpu"
    device_type:   str           # "gpu" | "cpu" | "accel" | "other"
    context:       cl.Context    # OpenCL context (None for CPU)
    queue:         cl.CommandQueue  # OpenCL queue, or a CPUQueue placeholder

DeviceHandle is the only thing every other layer in the stack needs. Backends take a queue; the Tensor constructor takes a queue (and optionally the handle's context); the BufferPool takes a context. Treat it as immutable.

`Tensor` (See Tensor)

The full Tensor reference lives on its own page. From core's perspective, Tensor is the user-facing handle that ties a buffer, a shape, a dtype, a queue, and an optional BufferPool lease into one object. The two factory entry points are:

from netcl.core.tensor import Tensor
from netcl.core.device import manager

q = manager.default("auto").queue
a = Tensor.from_host(q, host_ndarray)                 # copy host -> device
b = Tensor.from_shape(q, (4, 8), dtype="float32")      # zero-filled on device

Read the Tensor page for the rest of the lifecycle, the dtype mapping, the pool_handle interaction with BufferPool, and the autograd integration.

`BufferPool`, `BufferHandle`, `PoolStats`

The device-side memory recycler. BufferPool keeps a per-bucket free list keyed by power-of-two sizes. BufferHandle is a one-shot lease; calling release() puts the buffer back into the pool instead of letting the OpenCL runtime reap it. PoolStats exposes hit/miss and bytes counters so you can confirm the pool is doing useful work.

from netcl.core.memory import BufferPool
pool = BufferPool(context)
h = pool.allocate(nbytes)          # -> BufferHandle, possibly reused
h.release()                         # back to the pool
print(pool.stats.hits, pool.stats.misses, pool.stats.bytes_cached)

`PinnedBufferPool`

Same shape as BufferPool, but the buffers are allocated with cl.mem_flags.ALLOC_HOST_PTR so the host side is page-locked. The OpenCLBackend uses this pool for high-throughput H2D/D2H transfers on discrete GPUs. Access it indirectly via get_pinned_pool(queue).

`PersistentBufferPool`

A bounded alternative to BufferPool with a fixed bucket table (1 KB … 256 MB) and a hard max_cached_bytes cap (default 4 GB, overridable via the NETCL_MAX_CACHED_GB environment variable). Use this in long-running training loops where the simple BufferPool would unboundedly grow. It is the pool that Tensor.from_shape reaches for by default on the OpenCL path (via get_persistent_pool(queue)).

from netcl.core.memory import get_persistent_pool
pool = get_persistent_pool(queue)
print(pool.get_stats())   # {hits, misses, hit_rate, bytes_allocated, bytes_cached, buckets}

For the rationale and tradeoffs, see Architecture: Memory Pool.

`OpenCLBackend`

OpenCLBackend (in netcl.core.backend.opencl) is the production transport. It owns a cl.CommandQueue and implements from_host, from_shape, to_host, to_host_async, wait, and reshape. The most important pieces of behavior are:

Async H2D. from_host enqueues a non-blocking enqueue_copy when async_copy=True (the default; can also be set globally via NETCL_ASYNC_H2D=1). The returned (buffer, event, pending_release, host_ref) tuple lets the Tensor object wait() on the copy at the right point.
Pinned H2D. When NETCL_PINNED_H2D=1 (the default), the backend routes the H2D copy through a PinnedBufferPool using a synchronous map → unmap → copy pattern. This is faster than pageable memory on most discrete GPUs.
Zero-copy on integrated GPUs. is_integrated_gpu(queue) reports True for devices that share host and device memory (Intel iGPUs, AMD APUs, OpenCL CPU devices). The backend uses USE_HOST_PTR in that case so the device operates directly on the host ndarray.
Fork safety. When the process is forked (for example to spawn a data-loader worker), _before_fork snapshots every live cl.Buffer and cl.CommandQueue and _after_fork_in_child bumps their Python refcount so the child's OpenCL driver state stays untouched. This avoids the well-known "GPU hang after fork" failure mode.
Clean shutdown. The first time an OpenCLBackend is constructed, it registers an atexit hook that calls queue.finish() on every registered queue. It also installs a custom SIGINT handler so a Ctrl+C during a stuck kernel cleanly os._exit(130)s after a 3-second grace window.
Interruptible waits. interruptible_finish(queue, timeout=30.0) runs queue.finish() in a worker thread and polls in 10 ms increments, so the process can still respond to SIGINT while a long kernel is in flight.

See Architecture: Tensor Backend for diagrams of how the backend, the BufferPool, and the JIT Compiler interact.

`CPUBackend`

A tiny NumPy-backed stand-in used when no OpenCL device is available (or when the user explicitly asked for "cpu" via manager.default("cpu")). It implements the same interface as OpenCLBackend but every method is just a numpy call:

from netcl.core.backend.cpu import CPUBackend
be = CPUBackend()
x = be.from_host(host_arr, dtype="float32")
y = be.from_shape((4, 8), dtype="float32")  # zeros
print(be.to_host(x, (4, 8), "float32"))    # same ndarray

CPU tensors carry their data in Tensor.array instead of Tensor.buffer; the Tensor constructor routes reads and writes through this field automatically.

`KernelSpec`, `PRIMITIVE_PREAMBLE`, `WorkGroupTuner`

netcl.core.kernels.primitives is the codegen layer that the JIT Compiler sits on top of.

from netcl.core.kernels.primitives import KernelSpec, PRIMITIVE_PREAMBLE

src = KernelSpec(
    name="vec_add",
    params=[
        "__global const float* a",
        "__global const float* b",
        "__global float* c",
    ],
    preamble=PRIMITIVE_PREAMBLE,
    body="int gid = get_global_id(0); c[gid] = a[gid] + b[gid];",
).to_source()

KernelSpec.to_source() returns a complete __kernel definition with the preamble spliced in. PRIMITIVE_PREAMBLE is a small set of C macros (LOAD, STORE, ADD, SUB, MUL, DIV, RELU) that make generated bodies easier to write and easier to constant-fold.

WorkGroupTuner is the heuristic that picks local_size and rounds global_size up to a multiple of it. It is intentionally simple — the heavy lifting is done by KernelSelector — but it is the default for hand-written kernels.

PrimitiveBuilder is a thin convenience over KernelSpec:

from netcl.core.kernels.primitives import PrimitiveBuilder, WorkGroupTuner

builder = PrimitiveBuilder(dtype="float", tuner=WorkGroupTuner())
spec = builder.elementwise_spec(name="mul_add", arity=2, expression="MUL(ADD(v0, v1), 2.0f)")
# Returns a KernelSpec; pass a cl.Context to .build() to compile it.

build_elementwise_kernel is the one-shot helper that fuses the above:

from netcl.core.kernels.primitives import build_elementwise_kernel

spec, kernel = build_elementwise_kernel(
    context, name="mul_add", arity=2,
    expression="MUL(ADD(v0, v1), 2.0f)",
    dtype="float",
)

When dtype is "half" or "float16", the helper automatically prepends #pragma OPENCL EXTENSION cl_khr_fp16 : enable and adds the relevant -cl-fast-relaxed-math build options. This is how the ops API gets correct fp16 kernels without requiring every call site to remember the pragma.

`KernelSelector`

netcl.core.kernel_selector.KernelSelector (and the get_kernel_selector() factory) is the autotuner that decides which kernel variant to run for a given problem size and device. It was added to make the matmul and Conv2d code paths GPU-aware without hard-coding vendor-specific tuning inside each op.

The selector is a thin policy layer over a richer ExtendedDeviceProfile (from netcl.core.device_profile). It maintains an in-memory autotune cache keyed on (op, shape, dtype, flags) so a given problem shape is decided only once per process.

`KernelVariant` and `KernelConfig`

KernelVariant is an Enum of every implementation the selector knows about:

MatMul: MATMUL_NAIVE, MATMUL_TILED, MATMUL_REGISTER_TILED, MATMUL_VECTORIZED
Conv2D: CONV2D_NAIVE, CONV2D_IM2COL, CONV2D_IMPLICIT_GEMM, CONV2D_TILED_LOCAL, CONV2D_WINOGRAD
Elementwise: ELEMENTWISE_SCALAR, ELEMENTWISE_VECTORIZED
Reduction: REDUCTION_SEQUENTIAL, REDUCTION_PARALLEL, REDUCTION_WORKGROUP
BatchNorm: BATCHNORM_NAIVE, BATCHNORM_FUSED

KernelConfig is a dataclass of the actual tile and launch parameters:

@dataclass
class KernelConfig:
    variant: KernelVariant
    tile_m: int = 32
    tile_n: int = 32
    tile_k: int = 8
    work_per_thread: int = 4
    local_size_1d: int = 256
    local_size_2d: Tuple[int, int] = (16, 16)
    use_local_memory: bool = True
    use_vectorization: bool = True
    vector_width: int = 4
    extra: Dict[str, Any] = field(default_factory=dict)

Selection Methods

Method	Purpose
`select_matmul_kernel(M, N, K, dtype)`	Picks naive for tiny problems, tiled for small/medium, register-tiled for large, with vendor-specific overrides for NVIDIA Ampere/Turing and AMD RDNA2/3.
`select_conv2d_kernel(...)`	Routes 1×1 to implicit-GEMM, 3×3 stride-1 to Winograd (disable with `NETCL_CONV_WINOGRAD=0`), falls back to implicit-GEMM otherwise; CPU is forced to im2col.
`select_elementwise_kernel(n, dtype)`	Uses vectorized loads (width 4 or 8) above 1024 elements.
`select_reduction_kernel(n, kind)`	Sequential below 256 elements, parallel below 64 K, multi-stage workgroup above.
`select_batchnorm_kernel(...)`	Returns the fused BN+ReLU variant when `fuse_relu=True` and `training=False`.
`clear_cache()`	Drops the autotune cache (useful after switching devices).

Two module-level convenience functions — select_matmul_config(M, N, K) and select_conv2d_config(...) — wrap the singleton returned by get_kernel_selector() and are what the higher-level ops API calls.

Environment Variables

Variable	Default	Effect
`NETCL_KERNEL_STRATEGY`	`auto`	Forces `portable` or `optimized` (or auto-detect).
`NETCL_CONV_WINOGRAD`	`1`	Set to `0` to disable Winograd 3×3 selection.
`NETCL_CONV_TILED_LOCAL`	`0`	Set to `1` to allow the tiled-local conv2d variant on small outputs.
`NETCL_MAX_CACHED_GB`	`4`	Default cap for `PersistentBufferPool.max_cached_bytes`.
`NETCL_ASYNC_H2D`	`1`	Default for `Tensor.from_host(async_copy=...)`.
`NETCL_PINNED_H2D`	`1`	Default for `Tensor.from_host(use_pinned=...)`.
`NETCL_PROFILE_EVENTS`	`0`	Set to `1` to request `PROFILING_ENABLE` on the OpenCL queue.

`cl_khr_fp16` Capability Probe

OpenCL exposes optional features through extension strings. netcl.core.capabilities inspects the device's extensions list and exposes two booleans on DeviceProfile:

has_fp16 / supports_fp16 — True if any extension mentions "fp16" or "cl_khr_fp16" is in the list. Half-precision math is only safe to use when this is True.
has_subgroups — True if any extension mentions "subgroup". Subgroup operations can be a substantial win for reductions and scans.
fast_atomics — True if the device advertises cl_khr_int64_base_atomics or the extended atomics extension.

Use device_profile(cl_device) to get a frozen DeviceProfile, then check the relevant flag before launching fp16 code paths:

from netcl.core.capabilities import device_profile
prof = device_profile(queue.device)
if prof.has_fp16:
    # safe to use half / float16 tensors and fp16 kernels
    ...

The kernel_strategy(profile) helper returns "optimized" when the device reports subgroups or fp16, or when the vendor is NVIDIA or AMD; it returns "portable" for CPUs and devices with very small local memory.

netcl.core — Devices, Tensor, Memory, Kernels

Overview

DeviceManager and the manager Singleton

Thread-Local Active Device

DeviceHandle

Tensor (See Tensor)

BufferPool, BufferHandle, PoolStats

PinnedBufferPool

PersistentBufferPool

OpenCLBackend

CPUBackend

KernelSpec, PRIMITIVE_PREAMBLE, WorkGroupTuner

KernelSelector

KernelVariant and KernelConfig