netcl.core — Devices, Tensor, Memory, Kernels
netcl.core — Devices, Tensor, Memory, Kernels
The core package is the foundation of netcl. It owns the device discovery layer, the
Tensor lifecycle, the BufferPool memory recycler, the
OpenCLBackend and CPUBackend
pluggable transports, and the small set of helpers that turn a string body into a real
OpenCL kernel via the JIT Compiler.
Note — Top-level re-exports.
netcl/core/__init__.pyre-exportsDeviceManager, DeviceHandle, manager, Tensor, BufferPool, BufferHandleonly. Other symbols (CPUQueue,PoolStats,OpenCLBackend,CPUBackend,KernelSpec,PRIMITIVE_PREAMBLE,WorkGroupTuner,KernelSelector, …) are imported from their full submodule paths as shown below. The rootnetcl/__init__.pydoes surfacematmul,build_matmul_kernel,elementwise_binary,relu,bias_add,reduce_sum,softmax, andconv2das convenience shortcuts, but for anything else use the explicit submodule import.
Overview
| Symbol | Path | Purpose |
|---|---|---|
DeviceManager |
netcl.core.device |
Discovers OpenCL platforms and devices, returns a default |
DeviceHandle |
netcl.core.device |
Immutable handle: (platform, device, context, queue, backend) |
manager (singleton) |
netcl.core.device |
Process-wide DeviceManager instance, ready to use |
CPUQueue |
netcl.core.device |
Stand-in queue object for the CPU/NumPy backend |
Tensor |
netcl.core.tensor |
N-dim array stored on a device |
BufferPool |
netcl.core.memory |
Power-of-two bucketed device-buffer pool |
BufferHandle |
netcl.core.memory |
Single lease out of a BufferPool (auto-released) |
PoolStats |
netcl.core.memory |
Hit/miss/bytes counters with computed hit_rate |
PinnedBufferPool |
netcl.core.memory |
Page-locked host-buffer pool for fast H2D/D2H |
PersistentBufferPool |
netcl.core.memory |
Bounded bucket pool with size limits and per-bucket caps |
OpenCLBackend |
netcl.core.backend.opencl |
Async H2D/D2H, pinned memory, fork-safe queue cleanup |
CPUBackend |
netcl.core.backend.cpu |
NumPy fallback (used when no OpenCL device is present) |
DeviceProfile |
netcl.core.capabilities |
Frozen summary of device caps (FP16, subgroups, local mem) |
kernel_strategy |
netcl.core.capabilities |
Returns "portable" or "optimized" for a profile |
KernelSpec |
netcl.core.kernels.primitives |
Declarative OpenCL kernel source generator |
PRIMITIVE_PREAMBLE |
netcl.core.kernels.primitives |
C-macro preamble injected into generated kernels |
WorkGroupTuner |
netcl.core.kernels.primitives |
Heuristic local/global size chooser |
PrimitiveBuilder |
netcl.core.kernels.primitives |
Builds elementwise kernels from expressions |
KernelSelector |
netcl.core.kernel_selector |
Auto-selects matmul/conv/elementwise/reduction variants |
KernelVariant |
netcl.core.kernel_selector |
Enum of available kernel implementations |
KernelConfig |
netcl.core.kernel_selector |
Tile sizes, work-per-thread, local memory flags |
is_integrated_gpu |
netcl.core.memory |
Heuristic: does host and device share memory? |
has_svm_support |
netcl.core.memory |
Probes cl_svm_* capability |
create_pinned_buffer |
netcl.core.memory |
Allocates a single ALLOC_HOST_PTR buffer |
create_zero_copy_buffer |
netcl.core.memory |
USE_HOST_PTR zero-copy mapping for integrated GPUs |
DeviceManager and the manager Singleton
DeviceManager is a small but stateful object: it caches one DeviceHandle per requested
selector string so repeated calls inside a hot loop do not pay the discovery cost twice.
from netcl.core.device import manager
# Returns the first GPU/CPU/Accelerator that was discovered.
dev = manager.default("auto")
print(dev.platform_name, dev.device_name, dev.backend, dev.device_type)
print(type(dev.queue), type(dev.context))
| Method | Returns | Purpose |
|---|---|---|
DeviceManager.discover() |
list[(Platform, Device)] |
Enumerate every (platform, device) pair the driver reports. |
DeviceManager.default(s) |
DeviceHandle | None |
Return a cached handle for "auto", "gpu", "cpu", or a device id. Returns None when no device is available. |
DeviceManager._device_type(dev) |
str |
Maps a cl.Device.type bitfield to "gpu", "cpu", "accel", or "other". |
The manager singleton is constructed at import time and is what most user code should reach
for. It is safe to call manager.default("auto") once at program start and pass the handle
around.
Thread-Local Active Device
netcl.core.device also exposes a small device context manager and a thread-local active
device. This is what allows helper functions inside Tensor to find the right queue without
you having to thread it through every call.
from netcl.core.device import device
with device("gpu") as dev:
# get_active_device() now returns this handle for the current thread.
...
# On exit, the previous active device is restored.
For an end-to-end introduction to the device model and how it maps onto real OpenCL platforms, see Architecture: Tensor Backend.
DeviceHandle
@dataclass
class DeviceHandle:
platform_name: str # e.g. "NVIDIA CUDA", "Intel(R) OpenCL"
device_name: str # e.g. "Tesla T4", "Iris Xe"
backend: str # "cl" or "cpu"
device_type: str # "gpu" | "cpu" | "accel" | "other"
context: cl.Context # OpenCL context (None for CPU)
queue: cl.CommandQueue # OpenCL queue, or a CPUQueue placeholder
DeviceHandle is the only thing every other layer in the stack needs. Backends take a
queue; the Tensor constructor takes a queue (and optionally the handle's
context); the BufferPool takes a context. Treat it as immutable.
Tensor (See Tensor)
The full Tensor reference lives on its own page. From core's perspective,
Tensor is the user-facing handle that ties a buffer, a shape, a dtype, a queue, and an
optional BufferPool lease into one object. The two factory entry points are:
from netcl.core.tensor import Tensor
from netcl.core.device import manager
q = manager.default("auto").queue
a = Tensor.from_host(q, host_ndarray) # copy host -> device
b = Tensor.from_shape(q, (4, 8), dtype="float32") # zero-filled on device
Read the Tensor page for the rest of the lifecycle, the dtype mapping, the
pool_handle interaction with BufferPool, and the autograd integration.
BufferPool, BufferHandle, PoolStats
The device-side memory recycler. BufferPool keeps a per-bucket free list keyed by
power-of-two sizes. BufferHandle is a one-shot lease; calling release() puts the buffer
back into the pool instead of letting the OpenCL runtime reap it. PoolStats exposes
hit/miss and bytes counters so you can confirm the pool is doing useful work.
from netcl.core.memory import BufferPool
pool = BufferPool(context)
h = pool.allocate(nbytes) # -> BufferHandle, possibly reused
h.release() # back to the pool
print(pool.stats.hits, pool.stats.misses, pool.stats.bytes_cached)
PinnedBufferPool
Same shape as BufferPool, but the buffers are allocated with cl.mem_flags.ALLOC_HOST_PTR
so the host side is page-locked. The OpenCLBackend uses this
pool for high-throughput H2D/D2H transfers on discrete GPUs. Access it indirectly via
get_pinned_pool(queue).
PersistentBufferPool
A bounded alternative to BufferPool with a fixed bucket table (1 KB … 256 MB) and a hard
max_cached_bytes cap (default 4 GB, overridable via the NETCL_MAX_CACHED_GB environment
variable). Use this in long-running training loops where the simple BufferPool would
unboundedly grow. It is the pool that Tensor.from_shape reaches for by default on the
OpenCL path (via get_persistent_pool(queue)).
from netcl.core.memory import get_persistent_pool
pool = get_persistent_pool(queue)
print(pool.get_stats()) # {hits, misses, hit_rate, bytes_allocated, bytes_cached, buckets}
For the rationale and tradeoffs, see Architecture: Memory Pool.
OpenCLBackend
OpenCLBackend (in netcl.core.backend.opencl) is the production transport. It owns a
cl.CommandQueue and implements from_host, from_shape, to_host, to_host_async,
wait, and reshape. The most important pieces of behavior are:
- Async H2D.
from_hostenqueues a non-blockingenqueue_copywhenasync_copy=True(the default; can also be set globally viaNETCL_ASYNC_H2D=1). The returned(buffer, event, pending_release, host_ref)tuple lets the Tensor objectwait()on the copy at the right point. - Pinned H2D. When
NETCL_PINNED_H2D=1(the default), the backend routes the H2D copy through a PinnedBufferPool using a synchronous map → unmap → copy pattern. This is faster than pageable memory on most discrete GPUs. - Zero-copy on integrated GPUs.
is_integrated_gpu(queue)reportsTruefor devices that share host and device memory (Intel iGPUs, AMD APUs, OpenCL CPU devices). The backend usesUSE_HOST_PTRin that case so the device operates directly on the hostndarray. - Fork safety. When the process is forked (for example to spawn a data-loader worker),
_before_forksnapshots every livecl.Bufferandcl.CommandQueueand_after_fork_in_childbumps their Python refcount so the child's OpenCL driver state stays untouched. This avoids the well-known "GPU hang after fork" failure mode. - Clean shutdown. The first time an
OpenCLBackendis constructed, it registers anatexithook that callsqueue.finish()on every registered queue. It also installs a customSIGINThandler so aCtrl+Cduring a stuck kernel cleanlyos._exit(130)s after a 3-second grace window. - Interruptible waits.
interruptible_finish(queue, timeout=30.0)runsqueue.finish()in a worker thread and polls in 10 ms increments, so the process can still respond toSIGINTwhile a long kernel is in flight.
See Architecture: Tensor Backend for diagrams of how the backend, the BufferPool, and the JIT Compiler interact.
CPUBackend
A tiny NumPy-backed stand-in used when no OpenCL device is available (or when the user
explicitly asked for "cpu" via manager.default("cpu")). It implements the same interface
as OpenCLBackend but every method is just a numpy call:
from netcl.core.backend.cpu import CPUBackend
be = CPUBackend()
x = be.from_host(host_arr, dtype="float32")
y = be.from_shape((4, 8), dtype="float32") # zeros
print(be.to_host(x, (4, 8), "float32")) # same ndarray
CPU tensors carry their data in Tensor.array instead of Tensor.buffer; the Tensor
constructor routes reads and writes through this field automatically.
KernelSpec, PRIMITIVE_PREAMBLE, WorkGroupTuner
netcl.core.kernels.primitives is the codegen layer that the JIT Compiler
sits on top of.
from netcl.core.kernels.primitives import KernelSpec, PRIMITIVE_PREAMBLE
src = KernelSpec(
name="vec_add",
params=[
"__global const float* a",
"__global const float* b",
"__global float* c",
],
preamble=PRIMITIVE_PREAMBLE,
body="int gid = get_global_id(0); c[gid] = a[gid] + b[gid];",
).to_source()
KernelSpec.to_source() returns a complete __kernel definition with the preamble spliced
in. PRIMITIVE_PREAMBLE is a small set of C macros (LOAD, STORE, ADD, SUB, MUL,
DIV, RELU) that make generated bodies easier to write and easier to constant-fold.
WorkGroupTuner is the heuristic that picks local_size and rounds global_size up to a
multiple of it. It is intentionally simple — the heavy lifting is done by
KernelSelector — but it is the default for hand-written kernels.
PrimitiveBuilder is a thin convenience over KernelSpec:
from netcl.core.kernels.primitives import PrimitiveBuilder, WorkGroupTuner
builder = PrimitiveBuilder(dtype="float", tuner=WorkGroupTuner())
spec = builder.elementwise_spec(name="mul_add", arity=2, expression="MUL(ADD(v0, v1), 2.0f)")
# Returns a KernelSpec; pass a cl.Context to .build() to compile it.
build_elementwise_kernel is the one-shot helper that fuses the above:
from netcl.core.kernels.primitives import build_elementwise_kernel
spec, kernel = build_elementwise_kernel(
context, name="mul_add", arity=2,
expression="MUL(ADD(v0, v1), 2.0f)",
dtype="float",
)
When dtype is "half" or "float16", the helper automatically prepends
#pragma OPENCL EXTENSION cl_khr_fp16 : enable and adds the relevant -cl-fast-relaxed-math
build options. This is how the ops API gets correct fp16 kernels without
requiring every call site to remember the pragma.
KernelSelector
netcl.core.kernel_selector.KernelSelector (and the get_kernel_selector() factory) is the
autotuner that decides which kernel variant to run for a given problem size and device.
It was added to make the matmul and Conv2d code paths GPU-aware
without hard-coding vendor-specific tuning inside each op.
The selector is a thin policy layer over a richer ExtendedDeviceProfile (from
netcl.core.device_profile). It maintains an in-memory autotune cache keyed on
(op, shape, dtype, flags) so a given problem shape is decided only once per process.
KernelVariant and KernelConfig
KernelVariant is an Enum of every implementation the selector knows about:
- MatMul:
MATMUL_NAIVE,MATMUL_TILED,MATMUL_REGISTER_TILED,MATMUL_VECTORIZED - Conv2D:
CONV2D_NAIVE,CONV2D_IM2COL,CONV2D_IMPLICIT_GEMM,CONV2D_TILED_LOCAL,CONV2D_WINOGRAD - Elementwise:
ELEMENTWISE_SCALAR,ELEMENTWISE_VECTORIZED - Reduction:
REDUCTION_SEQUENTIAL,REDUCTION_PARALLEL,REDUCTION_WORKGROUP - BatchNorm:
BATCHNORM_NAIVE,BATCHNORM_FUSED
KernelConfig is a dataclass of the actual tile and launch parameters:
@dataclass
class KernelConfig:
variant: KernelVariant
tile_m: int = 32
tile_n: int = 32
tile_k: int = 8
work_per_thread: int = 4
local_size_1d: int = 256
local_size_2d: Tuple[int, int] = (16, 16)
use_local_memory: bool = True
use_vectorization: bool = True
vector_width: int = 4
extra: Dict[str, Any] = field(default_factory=dict)
Selection Methods
| Method | Purpose |
|---|---|
select_matmul_kernel(M, N, K, dtype) |
Picks naive for tiny problems, tiled for small/medium, register-tiled for large, with vendor-specific overrides for NVIDIA Ampere/Turing and AMD RDNA2/3. |
select_conv2d_kernel(...) |
Routes 1×1 to implicit-GEMM, 3×3 stride-1 to Winograd (disable with NETCL_CONV_WINOGRAD=0), falls back to implicit-GEMM otherwise; CPU is forced to im2col. |
select_elementwise_kernel(n, dtype) |
Uses vectorized loads (width 4 or 8) above 1024 elements. |
select_reduction_kernel(n, kind) |
Sequential below 256 elements, parallel below 64 K, multi-stage workgroup above. |
select_batchnorm_kernel(...) |
Returns the fused BN+ReLU variant when fuse_relu=True and training=False. |
clear_cache() |
Drops the autotune cache (useful after switching devices). |
Two module-level convenience functions — select_matmul_config(M, N, K) and
select_conv2d_config(...) — wrap the singleton returned by get_kernel_selector() and
are what the higher-level ops API calls.
Environment Variables
| Variable | Default | Effect |
|---|---|---|
NETCL_KERNEL_STRATEGY |
auto |
Forces portable or optimized (or auto-detect). |
NETCL_CONV_WINOGRAD |
1 |
Set to 0 to disable Winograd 3×3 selection. |
NETCL_CONV_TILED_LOCAL |
0 |
Set to 1 to allow the tiled-local conv2d variant on small outputs. |
NETCL_MAX_CACHED_GB |
4 |
Default cap for PersistentBufferPool.max_cached_bytes. |
NETCL_ASYNC_H2D |
1 |
Default for Tensor.from_host(async_copy=...). |
NETCL_PINNED_H2D |
1 |
Default for Tensor.from_host(use_pinned=...). |
NETCL_PROFILE_EVENTS |
0 |
Set to 1 to request PROFILING_ENABLE on the OpenCL queue. |
cl_khr_fp16 Capability Probe
OpenCL exposes optional features through extension strings. netcl.core.capabilities
inspects the device's extensions list and exposes two booleans on DeviceProfile:
has_fp16/supports_fp16—Trueif any extension mentions"fp16"or"cl_khr_fp16"is in the list. Half-precision math is only safe to use when this isTrue.has_subgroups—Trueif any extension mentions"subgroup". Subgroup operations can be a substantial win for reductions and scans.fast_atomics—Trueif the device advertisescl_khr_int64_base_atomicsor the extended atomics extension.
Use device_profile(cl_device) to get a frozen DeviceProfile, then check the relevant flag
before launching fp16 code paths:
from netcl.core.capabilities import device_profile
prof = device_profile(queue.device)
if prof.has_fp16:
# safe to use half / float16 tensors and fp16 kernels
...
The kernel_strategy(profile) helper returns "optimized" when the device reports
subgroups or fp16, or when the vendor is NVIDIA or AMD; it returns "portable" for CPUs
and devices with very small local memory.
See also
- Tensor — the
TensorAPI and lifecycle that uses all of the above. - ops API — elementary operations built on these primitives.
- autograd API — how the Tape wraps a Tensor for automatic differentiation.
- JIT Compiler — how
KernelSpecis composed into fused graphs. - Tensor Backend — the bigger picture of backends, queues, and pools.
- Memory Pool — bucket sizing, hit rates, and the choice between
BufferPool and
PersistentBufferPool.