netcl wiki
api

netcl.core — Devices, Tensor, Memory, Kernels

netcl.core — Devices, Tensor, Memory, Kernels

The core package is the foundation of netcl. It owns the device discovery layer, the Tensor lifecycle, the BufferPool memory recycler, the OpenCLBackend and CPUBackend pluggable transports, and the small set of helpers that turn a string body into a real OpenCL kernel via the JIT Compiler.

Note — Top-level re-exports. netcl/core/__init__.py re-exports DeviceManager, DeviceHandle, manager, Tensor, BufferPool, BufferHandle only. Other symbols (CPUQueue, PoolStats, OpenCLBackend, CPUBackend, KernelSpec, PRIMITIVE_PREAMBLE, WorkGroupTuner, KernelSelector, …) are imported from their full submodule paths as shown below. The root netcl/__init__.py does surface matmul, build_matmul_kernel, elementwise_binary, relu, bias_add, reduce_sum, softmax, and conv2d as convenience shortcuts, but for anything else use the explicit submodule import.

Overview

Symbol Path Purpose
DeviceManager netcl.core.device Discovers OpenCL platforms and devices, returns a default
DeviceHandle netcl.core.device Immutable handle: (platform, device, context, queue, backend)
manager (singleton) netcl.core.device Process-wide DeviceManager instance, ready to use
CPUQueue netcl.core.device Stand-in queue object for the CPU/NumPy backend
Tensor netcl.core.tensor N-dim array stored on a device
BufferPool netcl.core.memory Power-of-two bucketed device-buffer pool
BufferHandle netcl.core.memory Single lease out of a BufferPool (auto-released)
PoolStats netcl.core.memory Hit/miss/bytes counters with computed hit_rate
PinnedBufferPool netcl.core.memory Page-locked host-buffer pool for fast H2D/D2H
PersistentBufferPool netcl.core.memory Bounded bucket pool with size limits and per-bucket caps
OpenCLBackend netcl.core.backend.opencl Async H2D/D2H, pinned memory, fork-safe queue cleanup
CPUBackend netcl.core.backend.cpu NumPy fallback (used when no OpenCL device is present)
DeviceProfile netcl.core.capabilities Frozen summary of device caps (FP16, subgroups, local mem)
kernel_strategy netcl.core.capabilities Returns "portable" or "optimized" for a profile
KernelSpec netcl.core.kernels.primitives Declarative OpenCL kernel source generator
PRIMITIVE_PREAMBLE netcl.core.kernels.primitives C-macro preamble injected into generated kernels
WorkGroupTuner netcl.core.kernels.primitives Heuristic local/global size chooser
PrimitiveBuilder netcl.core.kernels.primitives Builds elementwise kernels from expressions
KernelSelector netcl.core.kernel_selector Auto-selects matmul/conv/elementwise/reduction variants
KernelVariant netcl.core.kernel_selector Enum of available kernel implementations
KernelConfig netcl.core.kernel_selector Tile sizes, work-per-thread, local memory flags
is_integrated_gpu netcl.core.memory Heuristic: does host and device share memory?
has_svm_support netcl.core.memory Probes cl_svm_* capability
create_pinned_buffer netcl.core.memory Allocates a single ALLOC_HOST_PTR buffer
create_zero_copy_buffer netcl.core.memory USE_HOST_PTR zero-copy mapping for integrated GPUs

DeviceManager and the manager Singleton

DeviceManager is a small but stateful object: it caches one DeviceHandle per requested selector string so repeated calls inside a hot loop do not pay the discovery cost twice.

from netcl.core.device import manager

# Returns the first GPU/CPU/Accelerator that was discovered.
dev = manager.default("auto")
print(dev.platform_name, dev.device_name, dev.backend, dev.device_type)
print(type(dev.queue), type(dev.context))
Method Returns Purpose
DeviceManager.discover() list[(Platform, Device)] Enumerate every (platform, device) pair the driver reports.
DeviceManager.default(s) DeviceHandle | None Return a cached handle for "auto", "gpu", "cpu", or a device id. Returns None when no device is available.
DeviceManager._device_type(dev) str Maps a cl.Device.type bitfield to "gpu", "cpu", "accel", or "other".

The manager singleton is constructed at import time and is what most user code should reach for. It is safe to call manager.default("auto") once at program start and pass the handle around.

Thread-Local Active Device

netcl.core.device also exposes a small device context manager and a thread-local active device. This is what allows helper functions inside Tensor to find the right queue without you having to thread it through every call.

from netcl.core.device import device

with device("gpu") as dev:
    # get_active_device() now returns this handle for the current thread.
    ...
# On exit, the previous active device is restored.

For an end-to-end introduction to the device model and how it maps onto real OpenCL platforms, see Architecture: Tensor Backend.

DeviceHandle

@dataclass
class DeviceHandle:
    platform_name: str           # e.g. "NVIDIA CUDA", "Intel(R) OpenCL"
    device_name:   str           # e.g. "Tesla T4", "Iris Xe"
    backend:       str           # "cl" or "cpu"
    device_type:   str           # "gpu" | "cpu" | "accel" | "other"
    context:       cl.Context    # OpenCL context (None for CPU)
    queue:         cl.CommandQueue  # OpenCL queue, or a CPUQueue placeholder

DeviceHandle is the only thing every other layer in the stack needs. Backends take a queue; the Tensor constructor takes a queue (and optionally the handle's context); the BufferPool takes a context. Treat it as immutable.

Tensor (See Tensor)

The full Tensor reference lives on its own page. From core's perspective, Tensor is the user-facing handle that ties a buffer, a shape, a dtype, a queue, and an optional BufferPool lease into one object. The two factory entry points are:

from netcl.core.tensor import Tensor
from netcl.core.device import manager

q = manager.default("auto").queue
a = Tensor.from_host(q, host_ndarray)                 # copy host -> device
b = Tensor.from_shape(q, (4, 8), dtype="float32")      # zero-filled on device

Read the Tensor page for the rest of the lifecycle, the dtype mapping, the pool_handle interaction with BufferPool, and the autograd integration.

BufferPool, BufferHandle, PoolStats

The device-side memory recycler. BufferPool keeps a per-bucket free list keyed by power-of-two sizes. BufferHandle is a one-shot lease; calling release() puts the buffer back into the pool instead of letting the OpenCL runtime reap it. PoolStats exposes hit/miss and bytes counters so you can confirm the pool is doing useful work.

from netcl.core.memory import BufferPool
pool = BufferPool(context)
h = pool.allocate(nbytes)          # -> BufferHandle, possibly reused
h.release()                         # back to the pool
print(pool.stats.hits, pool.stats.misses, pool.stats.bytes_cached)

PinnedBufferPool

Same shape as BufferPool, but the buffers are allocated with cl.mem_flags.ALLOC_HOST_PTR so the host side is page-locked. The OpenCLBackend uses this pool for high-throughput H2D/D2H transfers on discrete GPUs. Access it indirectly via get_pinned_pool(queue).

PersistentBufferPool

A bounded alternative to BufferPool with a fixed bucket table (1 KB … 256 MB) and a hard max_cached_bytes cap (default 4 GB, overridable via the NETCL_MAX_CACHED_GB environment variable). Use this in long-running training loops where the simple BufferPool would unboundedly grow. It is the pool that Tensor.from_shape reaches for by default on the OpenCL path (via get_persistent_pool(queue)).

from netcl.core.memory import get_persistent_pool
pool = get_persistent_pool(queue)
print(pool.get_stats())   # {hits, misses, hit_rate, bytes_allocated, bytes_cached, buckets}

For the rationale and tradeoffs, see Architecture: Memory Pool.

OpenCLBackend

OpenCLBackend (in netcl.core.backend.opencl) is the production transport. It owns a cl.CommandQueue and implements from_host, from_shape, to_host, to_host_async, wait, and reshape. The most important pieces of behavior are:

  • Async H2D. from_host enqueues a non-blocking enqueue_copy when async_copy=True (the default; can also be set globally via NETCL_ASYNC_H2D=1). The returned (buffer, event, pending_release, host_ref) tuple lets the Tensor object wait() on the copy at the right point.
  • Pinned H2D. When NETCL_PINNED_H2D=1 (the default), the backend routes the H2D copy through a PinnedBufferPool using a synchronous map → unmap → copy pattern. This is faster than pageable memory on most discrete GPUs.
  • Zero-copy on integrated GPUs. is_integrated_gpu(queue) reports True for devices that share host and device memory (Intel iGPUs, AMD APUs, OpenCL CPU devices). The backend uses USE_HOST_PTR in that case so the device operates directly on the host ndarray.
  • Fork safety. When the process is forked (for example to spawn a data-loader worker), _before_fork snapshots every live cl.Buffer and cl.CommandQueue and _after_fork_in_child bumps their Python refcount so the child's OpenCL driver state stays untouched. This avoids the well-known "GPU hang after fork" failure mode.
  • Clean shutdown. The first time an OpenCLBackend is constructed, it registers an atexit hook that calls queue.finish() on every registered queue. It also installs a custom SIGINT handler so a Ctrl+C during a stuck kernel cleanly os._exit(130)s after a 3-second grace window.
  • Interruptible waits. interruptible_finish(queue, timeout=30.0) runs queue.finish() in a worker thread and polls in 10 ms increments, so the process can still respond to SIGINT while a long kernel is in flight.

See Architecture: Tensor Backend for diagrams of how the backend, the BufferPool, and the JIT Compiler interact.

CPUBackend

A tiny NumPy-backed stand-in used when no OpenCL device is available (or when the user explicitly asked for "cpu" via manager.default("cpu")). It implements the same interface as OpenCLBackend but every method is just a numpy call:

from netcl.core.backend.cpu import CPUBackend
be = CPUBackend()
x = be.from_host(host_arr, dtype="float32")
y = be.from_shape((4, 8), dtype="float32")  # zeros
print(be.to_host(x, (4, 8), "float32"))    # same ndarray

CPU tensors carry their data in Tensor.array instead of Tensor.buffer; the Tensor constructor routes reads and writes through this field automatically.

KernelSpec, PRIMITIVE_PREAMBLE, WorkGroupTuner

netcl.core.kernels.primitives is the codegen layer that the JIT Compiler sits on top of.

from netcl.core.kernels.primitives import KernelSpec, PRIMITIVE_PREAMBLE

src = KernelSpec(
    name="vec_add",
    params=[
        "__global const float* a",
        "__global const float* b",
        "__global float* c",
    ],
    preamble=PRIMITIVE_PREAMBLE,
    body="int gid = get_global_id(0); c[gid] = a[gid] + b[gid];",
).to_source()

KernelSpec.to_source() returns a complete __kernel definition with the preamble spliced in. PRIMITIVE_PREAMBLE is a small set of C macros (LOAD, STORE, ADD, SUB, MUL, DIV, RELU) that make generated bodies easier to write and easier to constant-fold.

WorkGroupTuner is the heuristic that picks local_size and rounds global_size up to a multiple of it. It is intentionally simple — the heavy lifting is done by KernelSelector — but it is the default for hand-written kernels.

PrimitiveBuilder is a thin convenience over KernelSpec:

from netcl.core.kernels.primitives import PrimitiveBuilder, WorkGroupTuner

builder = PrimitiveBuilder(dtype="float", tuner=WorkGroupTuner())
spec = builder.elementwise_spec(name="mul_add", arity=2, expression="MUL(ADD(v0, v1), 2.0f)")
# Returns a KernelSpec; pass a cl.Context to .build() to compile it.

build_elementwise_kernel is the one-shot helper that fuses the above:

from netcl.core.kernels.primitives import build_elementwise_kernel

spec, kernel = build_elementwise_kernel(
    context, name="mul_add", arity=2,
    expression="MUL(ADD(v0, v1), 2.0f)",
    dtype="float",
)

When dtype is "half" or "float16", the helper automatically prepends #pragma OPENCL EXTENSION cl_khr_fp16 : enable and adds the relevant -cl-fast-relaxed-math build options. This is how the ops API gets correct fp16 kernels without requiring every call site to remember the pragma.

KernelSelector

netcl.core.kernel_selector.KernelSelector (and the get_kernel_selector() factory) is the autotuner that decides which kernel variant to run for a given problem size and device. It was added to make the matmul and Conv2d code paths GPU-aware without hard-coding vendor-specific tuning inside each op.

The selector is a thin policy layer over a richer ExtendedDeviceProfile (from netcl.core.device_profile). It maintains an in-memory autotune cache keyed on (op, shape, dtype, flags) so a given problem shape is decided only once per process.

KernelVariant and KernelConfig

KernelVariant is an Enum of every implementation the selector knows about:

  • MatMul: MATMUL_NAIVE, MATMUL_TILED, MATMUL_REGISTER_TILED, MATMUL_VECTORIZED
  • Conv2D: CONV2D_NAIVE, CONV2D_IM2COL, CONV2D_IMPLICIT_GEMM, CONV2D_TILED_LOCAL, CONV2D_WINOGRAD
  • Elementwise: ELEMENTWISE_SCALAR, ELEMENTWISE_VECTORIZED
  • Reduction: REDUCTION_SEQUENTIAL, REDUCTION_PARALLEL, REDUCTION_WORKGROUP
  • BatchNorm: BATCHNORM_NAIVE, BATCHNORM_FUSED

KernelConfig is a dataclass of the actual tile and launch parameters:

@dataclass
class KernelConfig:
    variant: KernelVariant
    tile_m: int = 32
    tile_n: int = 32
    tile_k: int = 8
    work_per_thread: int = 4
    local_size_1d: int = 256
    local_size_2d: Tuple[int, int] = (16, 16)
    use_local_memory: bool = True
    use_vectorization: bool = True
    vector_width: int = 4
    extra: Dict[str, Any] = field(default_factory=dict)

Selection Methods

Method Purpose
select_matmul_kernel(M, N, K, dtype) Picks naive for tiny problems, tiled for small/medium, register-tiled for large, with vendor-specific overrides for NVIDIA Ampere/Turing and AMD RDNA2/3.
select_conv2d_kernel(...) Routes 1×1 to implicit-GEMM, 3×3 stride-1 to Winograd (disable with NETCL_CONV_WINOGRAD=0), falls back to implicit-GEMM otherwise; CPU is forced to im2col.
select_elementwise_kernel(n, dtype) Uses vectorized loads (width 4 or 8) above 1024 elements.
select_reduction_kernel(n, kind) Sequential below 256 elements, parallel below 64 K, multi-stage workgroup above.
select_batchnorm_kernel(...) Returns the fused BN+ReLU variant when fuse_relu=True and training=False.
clear_cache() Drops the autotune cache (useful after switching devices).

Two module-level convenience functions — select_matmul_config(M, N, K) and select_conv2d_config(...) — wrap the singleton returned by get_kernel_selector() and are what the higher-level ops API calls.

Environment Variables

Variable Default Effect
NETCL_KERNEL_STRATEGY auto Forces portable or optimized (or auto-detect).
NETCL_CONV_WINOGRAD 1 Set to 0 to disable Winograd 3×3 selection.
NETCL_CONV_TILED_LOCAL 0 Set to 1 to allow the tiled-local conv2d variant on small outputs.
NETCL_MAX_CACHED_GB 4 Default cap for PersistentBufferPool.max_cached_bytes.
NETCL_ASYNC_H2D 1 Default for Tensor.from_host(async_copy=...).
NETCL_PINNED_H2D 1 Default for Tensor.from_host(use_pinned=...).
NETCL_PROFILE_EVENTS 0 Set to 1 to request PROFILING_ENABLE on the OpenCL queue.

cl_khr_fp16 Capability Probe

OpenCL exposes optional features through extension strings. netcl.core.capabilities inspects the device's extensions list and exposes two booleans on DeviceProfile:

  • has_fp16 / supports_fp16True if any extension mentions "fp16" or "cl_khr_fp16" is in the list. Half-precision math is only safe to use when this is True.
  • has_subgroupsTrue if any extension mentions "subgroup". Subgroup operations can be a substantial win for reductions and scans.
  • fast_atomicsTrue if the device advertises cl_khr_int64_base_atomics or the extended atomics extension.

Use device_profile(cl_device) to get a frozen DeviceProfile, then check the relevant flag before launching fp16 code paths:

from netcl.core.capabilities import device_profile
prof = device_profile(queue.device)
if prof.has_fp16:
    # safe to use half / float16 tensors and fp16 kernels
    ...

The kernel_strategy(profile) helper returns "optimized" when the device reports subgroups or fp16, or when the vendor is NVIDIA or AMD; it returns "portable" for CPUs and devices with very small local memory.

See also

  • Tensor — the Tensor API and lifecycle that uses all of the above.
  • ops API — elementary operations built on these primitives.
  • autograd API — how the Tape wraps a Tensor for automatic differentiation.
  • JIT Compiler — how KernelSpec is composed into fused graphs.
  • Tensor Backend — the bigger picture of backends, queues, and pools.
  • Memory Pool — bucket sizing, hit rates, and the choice between BufferPool and PersistentBufferPool.