Architecture: Tensor Backend
A Tensor is a thin Python wrapper around a
cl.Buffer plus shape, dtype, and
device metadata. The real work — issuing kernels, copying data,
synchronizing queues, handling Ctrl+C cleanly — is done by the backend
in core/backend/. netcl ships two backends:
- OpenCLBackend (file core/backend/opencl.py, ~440 lines): the default. Talks to the GPU through pyopencl.
- CPUBackend (file core/backend/cpu.py): a thin NumPy wrapper used when pyopencl is not installed or when the user explicitly requests a CPU device.
The choice is made when the Tensor is constructed
(backend="cl" vs. backend="cpu") and is recorded on the tensor
itself (Tensor.backend); you can inspect it later but cannot change it
in place.
Stack
The layers, from highest to lowest, are Tensor, BufferPool, backend, and DeviceManager. The Tensor holds a buffer; the buffer is owned by a BufferPool; the pool is owned by the backend; the backend is owned by a DeviceManager.
Selecting a backend
# Explicit
Tensor.from_host(queue, data, backend="cl")
Tensor.from_host(queue, data, backend="cpu")
# Implicit — follow the queue's backend
Tensor.from_host(queue, data) # -> queue.backend
Tensor.backend is a string ("cl" or "cpu") that you can read at
any time to dispatch your own logic. The
core API page documents the full factory surface.
The CPUBackend is also used automatically in three situations:
- pyopencl is not installed (e.g. CI without GPU drivers).
- manager.default("cpu") was explicitly requested.
- The Tensor constructor was given backend="cpu".
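The selection and fallback rules above can be sketched as a small resolver. This is an illustration only: resolve_backend and its parameters are hypothetical names, not part of netcl, and the real selection logic may differ in detail.

```python
import importlib.util
from typing import Optional


def resolve_backend(requested: Optional[str] = None,
                    queue_backend: str = "cl",
                    pyopencl_available: Optional[bool] = None) -> str:
    """Hypothetical helper: return "cl" or "cpu" following the rules above."""
    if requested not in (None, "cl", "cpu"):
        raise ValueError(f"unknown backend: {requested!r}")
    if pyopencl_available is None:
        # find_spec returns None when the top-level package cannot be found.
        pyopencl_available = importlib.util.find_spec("pyopencl") is not None
    if not pyopencl_available:
        return "cpu"          # no pyopencl: the CPU backend is the only option
    if requested is not None:
        return requested      # explicit backend="cl" or backend="cpu"
    return queue_backend      # implicit: follow the queue's backend
```

Passing pyopencl_available explicitly keeps the helper testable on machines without GPU drivers.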
OpenCLBackend — features
The OpenCLBackend in core/backend/opencl.py is responsible
for the entire lifecycle of every cl.Context and cl.CommandQueue
created by netcl. The features below are environment-driven so that
production jobs and CI runs can opt in or out per-process.
| Feature | How to enable | What it does |
|---|---|---|
| Async H2D | NETCL_ASYNC_H2D=1 (default) | Host→Device copies are enqueued without synchronizing the queue. Loss values are the natural sync point. |
| Pinned memory | NETCL_PINNED_H2D=1 (default) | Uses PinnedBufferPool and cl.mem_flags.ALLOC_HOST_PTR so DMA transfers are faster than with pageable memory. |
| atexit cleanup | always on | _flush_all_queues() is registered with atexit; it calls cl.CommandQueue.finish() on every queue that was ever registered, preventing VRAM leaks on normal interpreter shutdown. |
| SIGINT handler | always on | First Ctrl+C: 3 s grace window for the GPU to finish; second Ctrl+C: os._exit(130) to release VRAM through the OS. |
| Fork snapshot | always on | os.register_at_fork snapshots live cl.Buffer / cl.CommandQueue objects in the parent so the child does not need to call gc.get_objects() (which crashes if torch C-extensions are loaded). |
| interruptible_finish | always on | queue.finish() runs in a daemon thread; the main thread polls SIGINT so Ctrl+C is responsive even when the GPU is busy. |
| Auto fp16 | detected per device | If cl_khr_fp16 is reported by the device, fp16 paths become available to AMP and the JIT Compiler. |
Note — the env vars above are checked at queue-creation time, not at every kernel enqueue. If you change them, restart the interpreter.
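To illustrate why a changed variable has no effect on an existing queue, here is a minimal sketch of flags being snapshotted once at queue-creation time. QueueOptions and _flag are hypothetical names, and the way values other than "1" are parsed is an assumption; netcl's own parsing is not documented on this page.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


def _flag(env: Mapping[str, str], name: str, default: bool = True) -> bool:
    """Assumed parsing: "", "0", "false", "no", "off" disable; anything else enables."""
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() not in ("", "0", "false", "no", "off")


@dataclass(frozen=True)
class QueueOptions:
    """Hypothetical snapshot of the feature flags, taken once per queue."""
    async_h2d: bool
    pinned_h2d: bool

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "QueueOptions":
        env = os.environ if env is None else env
        return cls(
            async_h2d=_flag(env, "NETCL_ASYNC_H2D"),
            pinned_h2d=_flag(env, "NETCL_PINNED_H2D"),
        )
```

Because the snapshot is frozen, later changes to the environment only show up in options built afterwards, which mirrors the restart advice in the note.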
Teardown contract
A netcl process that owns an OpenCL queue must clean up before it exits; otherwise the GPU driver will leak VRAM. The OpenCLBackend handles four exit paths:
- Normal interpreter shutdown (atexit): every registered queue has finish() called on it. This is fast (a few milliseconds) and releases all cl.Buffer objects in the right order.
- Single Ctrl+C (SIGINT): the handler arms a 3-second grace window. If the GPU finishes in time, the handler raises KeyboardInterrupt so the user can save state and exit. If the GPU is still busy after 3 s, the handler calls os._exit(130). The OS reclaims VRAM because the process is gone.
- Double Ctrl+C: the handler calls os._exit(130) immediately (no grace). This is the "I'm stuck, kill it" path.
- SIGTERM: the handler calls cl.CommandQueue.finish() and then defers to the previous SIGTERM handler (typically process exit).
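The two Ctrl+C paths above can be sketched as a small state machine. This is not the code of _opencl_sigint_handler, which is not reproduced here; TwoStageInterrupt, wait_for_gpu, and hard_exit are hypothetical names, and the exit function is injectable only so the logic can be exercised without killing the process.

```python
import os
from typing import Callable, Optional

GRACE_SECONDS = 3.0   # grace window described above
EXIT_CODE = 130       # 128 + SIGINT (2), the conventional shell exit status


class TwoStageInterrupt:
    """Hypothetical SIGINT policy: first Ctrl+C waits, second one hard-exits."""

    def __init__(self, wait_for_gpu: Callable[[float], bool],
                 hard_exit: Callable[[int], None] = os._exit) -> None:
        self._wait_for_gpu = wait_for_gpu   # returns True if the GPU went idle in time
        self._hard_exit = hard_exit
        self._armed = False                 # True while a grace window is running

    def __call__(self, signum: int, frame: Optional[object]) -> None:
        if self._armed:
            # Second Ctrl+C inside the grace window: give up immediately.
            self._hard_exit(EXIT_CODE)
            return
        self._armed = True
        try:
            finished = self._wait_for_gpu(GRACE_SECONDS)
        finally:
            self._armed = False
        if not finished:
            # GPU still busy after the grace window: let the OS reclaim VRAM.
            self._hard_exit(EXIT_CODE)
            return
        # GPU idle: hand control back so user code can save state and exit.
        raise KeyboardInterrupt
```

Such an object could be installed with signal.signal(signal.SIGINT, TwoStageInterrupt(wait_fn)) from the main thread, where wait_fn is whatever blocks on the queue for at most the given number of seconds.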
All four paths are implemented in core/backend/opencl.py; the
docstrings on _opencl_sigint_handler and _ensure_cleanup_registered
explain the rationale in detail. If you write code that takes the
process down a non-standard path (e.g. os._exit from your own
handler), make sure to call _flush_all_queues() first.
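A minimal sketch of the registry idea behind this advice follows. It assumes one finish callable per queue; register_queue and flush_all_queues are illustrative stand-ins for netcl's private helpers, whose actual implementation is not shown on this page.

```python
import atexit
import threading
from typing import Callable, List

_finishers: List[Callable[[], None]] = []   # one finish callable per registered queue
_lock = threading.Lock()
_registered = False


def register_queue(finish: Callable[[], None]) -> None:
    """Remember a queue's finish() and install the atexit hook exactly once."""
    global _registered
    with _lock:
        _finishers.append(finish)
        if not _registered:
            atexit.register(flush_all_queues)
            _registered = True


def flush_all_queues() -> int:
    """Call every registered finish(); return how many completed without error."""
    with _lock:
        pending = list(_finishers)   # copy so finish() runs outside the lock
    done = 0
    for finish in pending:
        try:
            finish()
            done += 1
        except Exception:
            # Keep going: one broken queue must not block the others.
            pass
    return done
```

The reason for the explicit call is that os._exit terminates the process without running atexit handlers, so a custom exit path would call the flush function first and os._exit second.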
CPUBackend
core/backend/cpu.py is small: a Tensor with
backend="cpu" stores its data in a NumPy array rather than a
cl.Buffer. to_host() is a
no-op (or a copy when the dtype changes) and dtype/broadcasting
follow NumPy exactly. This makes the CPU backend a useful reference
implementation when porting a new op — the kernel can be a single
NumPy expression.
BufferPool interaction
The backend is the owner of the BufferPool that the Tensor ultimately draws from. The handshake is:
- OpenCLBackend.__init__ creates a cl.Context and cl.CommandQueue and caches the pair in a DeviceHandle.
- The first Tensor allocated on that queue constructs (or attaches to) a BufferPool that holds the same context.
- Each pool.allocate(nbytes) returns a BufferHandle whose buffer is the actual cl.Buffer.
- The Tensor holds a reference to the BufferHandle in its pool_handle field. When the tensor is garbage-collected, the handle's release() is called, returning the buffer to the pool.
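The allocate and release steps of this handshake can be sketched in plain Python. The classes below are simplified stand-ins that reuse the names from the text: a bytearray takes the place of cl.Buffer, the size-bucketed free list is an assumption, and weakref.finalize is one possible way to trigger release() on garbage collection, not necessarily the one netcl uses.

```python
import weakref
from collections import defaultdict
from typing import DefaultDict, List


class BufferHandle:
    """Stand-in handle: owns one buffer until release() hands it back."""

    def __init__(self, pool: "BufferPool", buffer: bytearray) -> None:
        self.pool = pool
        self.buffer = buffer          # stand-in for the real cl.Buffer
        self._released = False

    def release(self) -> None:
        if not self._released:        # idempotent: safe to call twice
            self._released = True
            self.pool._free[len(self.buffer)].append(self.buffer)


class BufferPool:
    """Stand-in pool that reuses released buffers of the same size."""

    def __init__(self) -> None:
        self._free: DefaultDict[int, List[bytearray]] = defaultdict(list)

    def allocate(self, nbytes: int) -> BufferHandle:
        bucket = self._free[nbytes]
        buffer = bucket.pop() if bucket else bytearray(nbytes)
        return BufferHandle(self, buffer)


class Tensor:
    """Stand-in tensor that returns its buffer to the pool when collected."""

    def __init__(self, pool: BufferPool, nbytes: int) -> None:
        self.pool_handle = pool.allocate(nbytes)
        # The callback references only the handle, never the tensor itself,
        # so it does not keep the tensor alive.
        weakref.finalize(self, self.pool_handle.release)
```

In this sketch a second tensor of the same size reuses the buffer freed by the first, which is the behavior that makes a pooled Tensor different from a bare buffer.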
This means a raw cl.Buffer from pyopencl and a netcl
Tensor are not interchangeable — the tensor is the
higher-level object that knows about the pool. The
Tensor API page documents the
Tensor.from_buffer factory for the rare case where you already have
a cl.Buffer and want to wrap it.
DeviceHandle
DeviceHandle is the dataclass returned by manager.default(...).
It is the only object you should need to keep around for a long-lived
session; everything else (pool, tensor, queue) is reachable from it.
from dataclasses import dataclass
from typing import Optional

import pyopencl as cl


@dataclass
class DeviceHandle:
    platform_name: str               # "NVIDIA CUDA", "Apple", "Intel(R) OpenCL", …
    device_name: str                 # "GeForce RTX 4090", "Apple M2 Max", …
    backend: str                     # "cl" or "cpu"
    device_type: str                 # "gpu" | "cpu" | "accel" | "other"
    context: Optional[cl.Context]    # OpenCL context (None for CPU)
    queue: cl.CommandQueue           # OpenCL queue (or CPUQueue)
device — context manager (not a singleton)
core/device.py exports both manager (a DeviceManager instance)
and device (a class used as a context manager):
from netcl.core.device import device, manager
# Module-level default: built lazily on first call.
default = manager.default("auto")
# Use a specific device for the duration of a block.
with device("gpu") as dev:
    t = Tensor.from_host(dev.queue, data)  # uses dev.queue
# The previously active device is restored on exit.
Important: device is a class, not a singleton. The German version of this page incorrectly described it as a module-level singleton instance. The actual code in core/device.py defines class device: with __enter__ and __exit__ methods. The module-level active device is stored in a thread-local ThreadLocalState, so different threads can have different active devices at the same time.
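The pattern described here, a class-based context manager backed by thread-local state, can be sketched as follows. This is not the code from core/device.py: threading.local stands in for ThreadLocalState, plain strings stand in for DeviceHandle objects, and active_device is a hypothetical helper.

```python
import threading
from typing import Optional

_state = threading.local()   # each thread sees its own "active" attribute


def active_device() -> Optional[str]:
    """Return the calling thread's active device name, or None if unset."""
    return getattr(_state, "active", None)


class device:
    """Sketch of a context manager that activates a device for one block."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._previous: Optional[str] = None

    def __enter__(self) -> "device":
        self._previous = active_device()   # remember what was active before
        _state.active = self.name
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # Restore the previous device even if the block raised;
        # returning None lets any exception propagate.
        _state.active = self._previous
```

Because every with block creates a fresh instance, blocks can be nested, and a thread started inside a block begins with no active device of its own.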
When to use which backend
| Scenario | Recommended backend |
|---|---|
| Training a real model on a discrete GPU | OpenCLBackend (default) |
| Running on an integrated GPU (Intel, AMD APU) | OpenCLBackend with zero-copy buffers via is_integrated_gpu() in core/memory |
| CI / unit tests with no pyopencl | CPUBackend (backend="cpu") |
| Debugging a new op kernel | CPUBackend first (NumPy reference), then port to OpenCL |
| Mixing CPU and GPU in the same process | OpenCLBackend for the GPU, separate Tensors with backend="cpu" for the host-side NumPy |
See also
- core API: DeviceManager, DeviceHandle, device context manager, and the BufferPool factories.
- Memory Pool: the BufferPool and BufferHandle that the Tensor sits on top of.
- Tensor API: the user-facing Tensor class and the Tensor.from_host, Tensor.from_buffer, Tensor.to_host factories.
- JIT Compiler: the layer that actually issues kernels through this backend.
- AMP API: uses the backend's fp16 detection to decide which precision to run at.