Architecture: Tensor Backend

A Tensor is a thin Python wrapper around a cl.Buffer plus shape, dtype, and device metadata. The real work — issuing kernels, copying data, synchronizing queues, handling Ctrl+C cleanly — is done by the backend in core/backend/. netcl ships two backends:

  • OpenCLBackend (file core/backend/opencl.py, ~440 lines) — the default. Talks to the GPU through pyopencl.
  • CPUBackend (file core/backend/cpu.py) — a thin NumPy wrapper used when pyopencl is not installed or when the user explicitly requests a CPU device.

The choice is made when the Tensor is constructed (backend="cl" vs. backend="cpu") and is recorded on the tensor itself (Tensor.backend); you can inspect it later but cannot change it in place.

Stack

The layers, from highest to lowest, are Tensor, BufferPool, backend, and DeviceManager. The Tensor holds a buffer; the buffer is owned by a BufferPool; the pool is owned by the backend; the backend is owned by a DeviceManager.

Selecting a backend

# Explicit
Tensor.from_host(queue, data, backend="cl")
Tensor.from_host(queue, data, backend="cpu")

# Implicit — follow the queue's backend
Tensor.from_host(queue, data)   # -> queue.backend

Tensor.backend is a string ("cl" or "cpu") that you can read at any time to dispatch your own logic. The core API page documents the full factory surface.

The CPUBackend is selected in three situations:

  • pyopencl is not installed (e.g. CI without GPU drivers).
  • manager.default("cpu") was explicitly requested.
  • The Tensor constructor was given backend="cpu".
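The selection rules above can be sketched as a small resolver. This is an illustrative sketch, not netcl's actual code: the function name resolve_backend and its parameters are hypothetical, and the choice to fall back to "cpu" even when "cl" was requested without pyopencl is an assumption based on the first bullet.

```python
import importlib.util


def resolve_backend(requested=None, queue_backend="cl", pyopencl_available=None):
    """Return "cl" or "cpu" following the selection rules described above.

    Hypothetical helper for illustration; not part of netcl's API.
    """
    if requested is not None and requested not in ("cl", "cpu"):
        raise ValueError(f"unknown backend: {requested!r}")
    if pyopencl_available is None:
        # find_spec returns None when the package cannot be located.
        pyopencl_available = importlib.util.find_spec("pyopencl") is not None
    if not pyopencl_available:
        # Assumption: without pyopencl the CPU backend is the only option.
        return "cpu"
    if requested is not None:
        return requested      # explicit backend="cl" / backend="cpu"
    return queue_backend      # implicit: follow the queue's backend
```

Passing pyopencl_available explicitly keeps the function easy to exercise in tests on machines with or without GPU drivers.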

OpenCLBackend — features

The OpenCLBackend in core/backend/opencl.py is responsible for the entire lifecycle of every cl.Context and cl.CommandQueue created by netcl. The features below are environment-driven so that production jobs and CI runs can opt in or out per-process.

| Feature | How to enable | What it does |
|---|---|---|
| Async H2D | NETCL_ASYNC_H2D=1 (default) | Host→Device copy enqueues without synchronizing the queue. Loss values are the natural sync point. |
| Pinned memory | NETCL_PINNED_H2D=1 (default) | Uses PinnedBufferPool and cl.mem_flags.ALLOC_HOST_PTR so DMA is faster than pageable memory. |
| atexit cleanup | always on | _flush_all_queues() is registered with atexit; it calls cl.CommandQueue.finish() on every queue that was ever registered, preventing VRAM leaks on normal interpreter shutdown. |
| SIGINT handler | always on | First Ctrl+C: 3 s grace window for the GPU to finish; second Ctrl+C: os._exit(130) to release VRAM through the OS. |
| Fork snapshot | always on | os.register_at_fork snapshots live cl.Buffer / cl.CommandQueue objects in the parent so the child does not need to call gc.get_objects() (which crashes if torch C-extensions are loaded). |
| interruptible_finish | always on | queue.finish() runs in a daemon thread; the main thread polls SIGINT so Ctrl+C is responsive even when the GPU is busy. |
| Auto fp16 | detected per device | If cl_khr_fp16 is reported by the device, fp16 paths become available to AMP and the JIT Compiler. |

Note — the env vars above are checked at queue-creation time, not at every kernel enqueue. If you change them, restart the interpreter.
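To make the "checked at queue-creation time" behaviour concrete, here is a minimal sketch of flags being snapshotted once. The names _env_flag and QueueOptions are hypothetical, and the set of values treated as "off" is an assumption; the source only shows the =1 form.

```python
import os


def _env_flag(name, default=True, environ=None):
    """Read a boolean feature flag; unset means `default`."""
    environ = os.environ if environ is None else environ
    raw = environ.get(name)
    if raw is None:
        return default
    # Assumption: these spellings disable a feature; anything else enables it.
    return raw.strip().lower() not in ("", "0", "false", "no", "off")


class QueueOptions:
    """Snapshot of the feature flags, taken once when a queue is created."""

    def __init__(self, environ=None):
        self.async_h2d = _env_flag("NETCL_ASYNC_H2D", environ=environ)
        self.pinned_h2d = _env_flag("NETCL_PINNED_H2D", environ=environ)
```

Because the values are copied into the object, later changes to the environment do not affect a queue that already exists, which matches the restart advice in the note.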

Teardown contract

A netcl process that owns an OpenCL queue must clean up before it exits; otherwise the GPU driver will leak VRAM. The OpenCLBackend handles four exit paths:

  1. Normal interpreter shutdown (atexit) — every registered queue has finish() called on it. This is fast (a few milliseconds) and releases all cl.Buffer objects in the right order.
  2. Single Ctrl+C (SIGINT) — the handler arms a 3-second grace window. If the GPU finishes in time, the handler raises KeyboardInterrupt so the user can save state and exit. If the GPU is still busy after 3 s, the handler calls os._exit(130). The OS reclaims VRAM because the process is gone.
  3. Double Ctrl+C — the handler calls os._exit(130) immediately (no grace). This is the "I'm stuck, kill it" path.
  4. SIGTERM — the handler calls cl.CommandQueue.finish() and then defers to the previous SIGTERM handler (typically process exit).
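The two SIGINT paths (2 and 3) amount to a small state machine. The sketch below models only that decision logic with injectable callbacks so it can be tested without a GPU; SigintPolicy, wait_for_gpu, hard_exit, and install are hypothetical names, not the contents of _opencl_sigint_handler. Resetting the armed state after a successful grace period is also an assumption.

```python
import signal

GRACE_SECONDS = 3.0   # grace window described above
EXIT_CODE = 130       # conventional exit status for termination by SIGINT


class SigintPolicy:
    def __init__(self, wait_for_gpu, hard_exit):
        self._wait_for_gpu = wait_for_gpu  # callable(timeout) -> True if GPU went idle
        self._hard_exit = hard_exit        # callable(code); os._exit in real use
        self._armed = False

    def on_sigint(self):
        if self._armed:
            # Second Ctrl+C while the grace window is running: no more waiting.
            self._hard_exit(EXIT_CODE)
            return
        self._armed = True
        if self._wait_for_gpu(GRACE_SECONDS):
            self._armed = False
            raise KeyboardInterrupt    # GPU finished: let the user save state
        self._hard_exit(EXIT_CODE)     # GPU still busy after the grace window


def install(policy):
    """Route SIGINT to the policy; returns the previous handler."""
    return signal.signal(signal.SIGINT, lambda signum, frame: policy.on_sigint())
```

In real use hard_exit would be os._exit, which never returns; passing it in as a parameter is only there to keep the logic testable.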

All four paths are implemented in core/backend/opencl.py; the docstrings on _opencl_sigint_handler and _ensure_cleanup_registered explain the rationale in detail. If you write code that takes the process down a non-standard path (e.g. os._exit from your own handler), make sure to call _flush_all_queues() first.

CPUBackend

core/backend/cpu.py is small: a Tensor with backend="cpu" stores its data in a NumPy array rather than a cl.Buffer. to_host() is a no-op (or a copy when the dtype changes) and dtype/broadcasting follow NumPy exactly. This makes the CPU backend a useful reference implementation when porting a new op — the kernel can be a single NumPy expression.
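A minimal sketch of what "the kernel can be a single NumPy expression" looks like in practice, together with the no-op-or-copy behaviour of to_host(). The function names relu_reference and to_host are illustrative only and are not taken from core/backend/cpu.py.

```python
import numpy as np


def relu_reference(x):
    """Reference ReLU: one NumPy expression, NumPy dtype and broadcasting rules."""
    return np.maximum(x, 0)


def to_host(array, dtype=None):
    """Return the array itself when no conversion is needed, else a converted copy."""
    if dtype is None or np.dtype(dtype) == array.dtype:
        return array               # no-op: the data already lives on the host
    return array.astype(dtype)     # dtype change forces a copy
```

A reference like relu_reference can then serve as the expected value when testing the OpenCL version of the same op.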

BufferPool interaction

The backend is the owner of the BufferPool that the Tensor ultimately draws from. The handshake is:

  1. OpenCLBackend.__init__ creates a cl.Context and cl.CommandQueue and caches the pair in a DeviceHandle.
  2. The first Tensor allocated on that queue constructs (or attaches to) a BufferPool that holds the same context.
  3. Each pool.allocate(nbytes) returns a BufferHandle whose buffer is the actual cl.Buffer.
  4. The Tensor holds a reference to the BufferHandle in its pool_handle field. When the tensor is garbage-collected, the handle's release() is called, returning the buffer to the pool.
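Steps 2 to 4 of the handshake can be sketched without OpenCL by letting a bytearray stand in for the cl.Buffer. The class bodies below are assumptions made for illustration (including the size-keyed free list and the use of weakref.finalize); only the names BufferPool, BufferHandle, allocate, release, and pool_handle come from the text above, and PooledTensor is a hypothetical stand-in for Tensor.

```python
import weakref


class BufferHandle:
    def __init__(self, pool, buffer):
        self.pool = pool
        self.buffer = buffer        # stands in for the actual cl.Buffer
        self.released = False

    def release(self):
        """Return the buffer to the pool; safe to call more than once."""
        if not self.released:
            self.released = True
            self.pool._free.setdefault(len(self.buffer), []).append(self.buffer)


class BufferPool:
    def __init__(self, context):
        self.context = context      # same context as the backend's queue
        self._free = {}             # nbytes -> list of reusable buffers

    def allocate(self, nbytes):
        bucket = self._free.get(nbytes)
        buffer = bucket.pop() if bucket else bytearray(nbytes)
        return BufferHandle(self, buffer)


class PooledTensor:
    def __init__(self, pool, nbytes):
        self.pool_handle = pool.allocate(nbytes)
        # The callback references the handle, not the tensor, so it does not
        # keep the tensor alive; it runs when the tensor is collected.
        weakref.finalize(self, self.pool_handle.release)
```

The point of the sketch is the ownership direction: the tensor keeps the handle, the handle knows its pool, and dropping the tensor is what makes the buffer reusable.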

This means a raw cl.Buffer from pyopencl and a netcl Tensor are not interchangeable: the tensor is the higher-level object that knows about the pool. The Tensor API page documents the Tensor.from_buffer factory for the rare case where you already have a cl.Buffer and want to wrap it.

DeviceHandle

DeviceHandle is the dataclass returned by manager.default(...). It is the only object you should need to keep around for a long-lived session; everything else (pool, tensor, queue) is reachable from it.

from dataclasses import dataclass
from typing import Optional

import pyopencl as cl

@dataclass
class DeviceHandle:
    platform_name: str        # "NVIDIA CUDA", "Apple", "Intel(R) OpenCL", …
    device_name:   str        # "GeForce RTX 4090", "Apple M2 Max", …
    backend:       str        # "cl" or "cpu"
    device_type:   str        # "gpu" | "cpu" | "accel" | "other"
    context:       Optional[cl.Context]  # OpenCL context (None for CPU)
    queue:         cl.CommandQueue       # OpenCL queue (or a CPUQueue for the CPU backend)

device — context manager (not a singleton)

core/device.py exports both manager (a DeviceManager instance) and device (a class used as a context manager):

from netcl.core.device import device, manager

# Module-level default: built lazily on first call.
default = manager.default("auto")

# Use a specific device for the duration of a block.
with device("gpu") as dev:
    t = Tensor.from_host(dev.queue, data)   # uses dev.queue
# The previously active device is restored on exit.

Important: device is a class, not a singleton. The German version of this page incorrectly described it as a module-level singleton instance. The actual code in core/device.py defines class device: with __enter__ and __exit__ methods. The module-level active device is stored in a thread-local ThreadLocalState, so different threads can have different active devices at the same time.
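The combination of a context-manager class and thread-local state can be sketched as follows. This is a simplified model, not the code in core/device.py: it tracks only device names rather than DeviceHandle objects, and _state, _stack, and active_device are hypothetical helpers.

```python
import threading

_state = threading.local()   # each thread gets its own attribute namespace


def _stack():
    # Lazily create a per-thread stack whose bottom entry is the default.
    if not hasattr(_state, "stack"):
        _state.stack = ["auto"]
    return _state.stack


def active_device():
    """Name of the device that is active in the calling thread."""
    return _stack()[-1]


class device:
    """Context manager that activates a device for the duration of a block."""

    def __init__(self, name):
        self.name = name

    def __enter__(self):
        _stack().append(self.name)
        return self

    def __exit__(self, exc_type, exc, tb):
        _stack().pop()       # restore the previously active device
        return False         # do not swallow exceptions
```

Because the stack lives in threading.local, a with device("gpu") block in one thread leaves the active device of every other thread untouched, and nested blocks unwind in order.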

When to use which backend

| Scenario | Recommended backend |
|---|---|
| Training a real model on a discrete GPU | OpenCLBackend (default) |
| Running on an integrated GPU (Intel, AMD APU) | OpenCLBackend with zero-copy buffers via is_integrated_gpu() in core/memory |
| CI / unit tests with no pyopencl | CPUBackend (backend="cpu") |
| Debugging a new op kernel | CPUBackend first (NumPy reference), then port to OpenCL |
| Mixing CPU and GPU in the same process | OpenCLBackend for the GPU, separate Tensors with backend="cpu" for the host-side NumPy |

See also

  • core API: DeviceManager, DeviceHandle, device context manager, and the BufferPool factories.
  • Memory Pool — the BufferPool and BufferHandle that the Tensor sits on top of.
  • Tensor API — the user-facing Tensor class and the Tensor.from_host, Tensor.from_buffer, Tensor.to_host factories.
  • JIT Compiler — the layer that actually issues kernels through this backend.
  • AMP API — uses the backend's fp16 detection to decide which precision to run at.