Architecture: Tensor Backend

A Tensor is a thin Python wrapper around a cl.Buffer plus shape, dtype, and device metadata. The real work — issuing kernels, copying data, synchronizing queues, handling Ctrl+C cleanly — is done by the backend in core/backend/. netcl ships two backends:

OpenCLBackend (file core/backend/opencl.py, ~440 lines) — the default. Talks to the GPU through pyopencl.
CPUBackend (file core/backend/cpu.py) — a thin NumPy wrapper used when pyopencl is not installed or when the user explicitly requests a CPU device.

The choice is made when the Tensor is constructed (backend="cl" vs. backend="cpu") and is recorded on the tensor itself (Tensor.backend); you can inspect it later but cannot change it in place.

Stack

The layers, from highest to lowest, are Tensor, BufferPool, backend, and DeviceManager. The Tensor holds a buffer; the buffer is owned by a BufferPool; the pool is owned by the backend; the backend is owned by a DeviceManager.

Selecting a backend

# Explicit
Tensor.from_host(queue, data, backend="cl")
Tensor.from_host(queue, data, backend="cpu")

# Implicit — follow the queue's backend
Tensor.from_host(queue, data)   # -> queue.backend

Tensor.backend is a string ("cl" or "cpu") that you can read at any time to dispatch your own logic. The core API page documents the full factory surface.
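
As a rough illustration of dispatching on Tensor.backend, the sketch below uses a hypothetical stand-in class (FakeTensor) instead of the real netcl Tensor. Only the "cl" / "cpu" strings and the fact that the attribute can be read but not changed come from the text above; the handler functions are placeholders.

```python
class FakeTensor:
    """Hypothetical stand-in for netcl's Tensor; only `backend` is modelled."""

    def __init__(self, data, backend="cpu"):
        if backend not in ("cl", "cpu"):
            raise ValueError(f"unknown backend: {backend!r}")
        self.data = list(data)
        self._backend = backend  # fixed at construction time

    @property
    def backend(self):
        # Read-only: there is no setter, so assignment raises AttributeError.
        return self._backend


def describe(tensor):
    """Dispatch on the backend string; the handlers are placeholders."""
    handlers = {
        "cl": lambda t: f"device tensor with {len(t.data)} elements",
        "cpu": lambda t: f"host tensor with {len(t.data)} elements",
    }
    return handlers[tensor.backend](tensor)
```

A dictionary keyed by the backend string keeps the dispatch in one place, so supporting a further backend would only need one more entry.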

The CPUBackend is selected in three situations; only the first is an automatic fallback:

pyopencl is not installed (e.g. CI without GPU drivers).
manager.default("cpu") was explicitly requested.
The Tensor constructor was given backend="cpu".
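
The selection rules above can be sketched as a small function. This is not netcl's actual resolution code: resolve_backend and its has_pyopencl parameter are hypothetical names, and silently falling back to "cpu" when "cl" is requested without pyopencl is an assumption.

```python
import importlib.util


def resolve_backend(requested=None, has_pyopencl=None):
    """Return "cl" or "cpu" following the rules listed above.

    requested    -- "cl", "cpu", or None to decide automatically
    has_pyopencl -- override for testing; None means "probe the environment"
    """
    if requested == "cpu":
        return "cpu"  # an explicit CPU request always wins
    if has_pyopencl is None:
        # find_spec returns None when the top-level module cannot be found.
        has_pyopencl = importlib.util.find_spec("pyopencl") is not None
    if not has_pyopencl:
        return "cpu"  # automatic fallback, e.g. CI without GPU drivers
    return "cl"
```

Probing with importlib.util.find_spec avoids importing pyopencl just to find out whether it is installed.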

`OpenCLBackend` — features

The OpenCLBackend in core/backend/opencl.py is responsible for the entire lifecycle of every cl.Context and cl.CommandQueue created by netcl. The features below are environment-driven so that production jobs and CI runs can opt in or out per-process.

Feature	How to enable	What it does
Async H2D	`NETCL_ASYNC_H2D=1` (default)	Host→Device copies are enqueued without synchronizing the queue. Reading a loss value back to the host is the natural sync point.
Pinned memory	`NETCL_PINNED_H2D=1` (default)	Uses `PinnedBufferPool` and `cl.mem_flags.ALLOC_HOST_PTR` so that DMA transfers are faster than from pageable memory.
atexit cleanup	always on	`_flush_all_queues()` is registered with `atexit`; it calls `cl.CommandQueue.finish()` on every queue that was ever registered, preventing VRAM leaks on normal interpreter shutdown.
SIGINT handler	always on	First Ctrl+C: 3 s grace window for the GPU to finish; second Ctrl+C: `os._exit(130)` to release VRAM through the OS.
Fork snapshot	always on	`os.register_at_fork` snapshots live `cl.Buffer` / `cl.CommandQueue` objects in the parent so the child does not need to call `gc.get_objects()` (which crashes if `torch` C-extensions are loaded).
`interruptible_finish`	always on	`queue.finish()` runs in a daemon thread; the main thread polls SIGINT so Ctrl+C is responsive even when the GPU is busy.
Auto fp16	detected per device	If `cl_khr_fp16` is reported by the device, fp16 paths become available to AMP and the JIT Compiler.
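
The idea behind `interruptible_finish` can be sketched in plain Python. This is a generic illustration of the technique, not the netcl implementation: interruptible_call is a hypothetical name, and blocking_fn stands in for a call such as queue.finish().

```python
import threading


def interruptible_call(blocking_fn, poll_interval=0.05):
    """Run blocking_fn in a daemon thread while the main thread stays responsive.

    Python runs signal handlers only in the main thread, so the main thread
    should not sit inside one long blocking call while the GPU is busy.
    """
    result = {}

    def worker():
        try:
            result["value"] = blocking_fn()
        except BaseException as exc:  # hand errors back to the caller
            result["error"] = exc

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    while thread.is_alive():
        # join() with a timeout returns regularly, which gives a pending
        # KeyboardInterrupt the chance to be raised here.
        thread.join(poll_interval)
    if "error" in result:
        raise result["error"]
    return result.get("value")
```

Because the worker is a daemon thread, it does not keep the interpreter alive if the main thread decides to exit while the call is still blocked.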

Note — the env vars above are checked at queue-creation time, not at every kernel enqueue. If you change them, restart the interpreter.
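
To make the note concrete, here is a minimal sketch of reading the flags once and freezing the result. QueueOptions and _env_flag are hypothetical names, and the set of spellings treated as "off" is an assumption; only the two variable names and their defaults come from the table.

```python
import os
from dataclasses import dataclass


def _env_flag(name, default=True, environ=None):
    """Interpret NAME=0/1 style variables; an unset variable means `default`."""
    environ = os.environ if environ is None else environ
    raw = environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() not in ("0", "false", "no", "off", "")


@dataclass(frozen=True)
class QueueOptions:
    """Hypothetical snapshot of the flags, taken once per queue."""
    async_h2d: bool
    pinned_h2d: bool

    @classmethod
    def from_env(cls, environ=None):
        return cls(
            async_h2d=_env_flag("NETCL_ASYNC_H2D", environ=environ),
            pinned_h2d=_env_flag("NETCL_PINNED_H2D", environ=environ),
        )
```

Since the snapshot is taken once, later changes to the environment have no effect on an existing QueueOptions, which mirrors the behaviour described in the note.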

Teardown contract

A netcl process that owns an OpenCL queue must clean up before it exits, otherwise the GPU driver will leak VRAM. The OpenCLBackend handles four exit paths:

Normal interpreter shutdown (atexit) — every registered queue has finish() called on it. This is fast (a few milliseconds) and releases all cl.Buffer objects in the right order.
Single Ctrl+C (SIGINT) — the handler arms a 3-second grace window. If the GPU finishes in time, the handler raises KeyboardInterrupt so the user can save state and exit. If the GPU is still busy after 3 s, the handler calls os._exit(130). The OS reclaims VRAM because the process is gone.
Double Ctrl+C — the handler calls os._exit(130) immediately (no grace). This is the "I'm stuck, kill it" path.
SIGTERM — the handler calls cl.CommandQueue.finish() and then defers to the previous SIGTERM handler (typically process exit).
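
The two Ctrl+C paths can be modelled as a small state machine. The sketch below only decides what should happen and performs no signal handling itself; SigintPolicy and its action strings are hypothetical, while the 3-second window and exit code 130 are taken from the list above.

```python
import time

GRACE_SECONDS = 3.0  # grace window described above
EXIT_CODE = 130      # 128 + 2, the conventional exit code for SIGINT


class SigintPolicy:
    """Decide what a SIGINT handler should do; performs no I/O itself."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._first_press = None  # time of the first Ctrl+C, if any

    def on_sigint(self):
        """Return "grace" for the first Ctrl+C and "force_exit" for the second."""
        if self._first_press is None:
            self._first_press = self._clock()
            return "grace"
        return "force_exit"

    def on_poll(self, gpu_idle):
        """Called periodically; only acts once a Ctrl+C has been seen."""
        if self._first_press is None:
            return "continue"
        if gpu_idle:
            return "raise_keyboard_interrupt"
        if self._clock() - self._first_press >= GRACE_SECONDS:
            return "force_exit"
        return "wait"
```

In a real handler installed with signal.signal(signal.SIGINT, ...), "force_exit" would map to os._exit(EXIT_CODE) and "raise_keyboard_interrupt" to raising KeyboardInterrupt. Keeping the decision logic free of side effects makes it testable with a fake clock.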

All four paths are implemented in core/backend/opencl.py; the docstrings on _opencl_sigint_handler and _ensure_cleanup_registered explain the rationale in detail. If you write code that takes the process down a non-standard path (e.g. os._exit from your own handler), make sure to call _flush_all_queues() first.
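
For a custom exit path, the ordering matters because os._exit terminates the process without running atexit handlers. The sketch below shows the pattern with stand-ins: register_queue, flush_all_queues, and hard_exit are hypothetical and are not the functions in core/backend/opencl.py. Using a WeakSet, which only tracks queues that are still alive, is also an assumption.

```python
import atexit
import os
import weakref

_registered_queues = weakref.WeakSet()  # hypothetical registry of live queues


def register_queue(queue):
    """Remember a queue so it can be finished at shutdown."""
    _registered_queues.add(queue)


def flush_all_queues():
    """Stand-in for _flush_all_queues(): finish every queue still alive."""
    finished = 0
    for queue in list(_registered_queues):
        try:
            queue.finish()
            finished += 1
        except Exception:
            # One broken queue must not stop the remaining queues
            # from being flushed during shutdown.
            pass
    return finished


atexit.register(flush_all_queues)  # covers normal interpreter shutdown


def hard_exit(code, _exit=None):
    """Non-standard exit path: flush first, because os._exit skips atexit."""
    flush_all_queues()
    (_exit or os._exit)(code)
```

The _exit parameter exists only so that the function can be exercised without actually terminating the process.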

`CPUBackend`

core/backend/cpu.py is small: a Tensor with backend="cpu" stores its data in a NumPy array rather than a cl.Buffer. to_host() is a no-op (or a copy when the dtype changes) and dtype/broadcasting follow NumPy exactly. This makes the CPU backend a useful reference implementation when porting a new op — the kernel can be a single NumPy expression.
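
As an example of the reference-first workflow, the sketch below writes ReLU as a single NumPy expression and checks an element-by-element version against it. Both functions are illustrative and not part of netcl; the loop merely imitates the shape of a per-work-item kernel.

```python
import numpy as np


def relu_reference(x):
    """Reference ReLU: a single NumPy expression."""
    return np.maximum(x, 0)


def relu_elementwise(x):
    """Element-by-element version, shaped like a per-work-item kernel."""
    flat = np.asarray(x).ravel()
    out = np.empty_like(flat)
    for i in range(flat.size):  # one iteration plays the role of one work-item
        out[i] = flat[i] if flat[i] > 0 else 0
    return out.reshape(np.shape(x))


x = np.array([[-1.5, 0.0], [2.0, -3.0]], dtype=np.float32)
# The ported version must agree with the reference on every element.
np.testing.assert_array_equal(relu_elementwise(x), relu_reference(x))
```

The same comparison can later be pointed at the OpenCL kernel's output, with a tolerance where floating-point rounding differs between devices.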

`BufferPool` interaction

The backend is the owner of the BufferPool that the Tensor ultimately draws from. The handshake is:

OpenCLBackend.__init__ creates a cl.Context and cl.CommandQueue and caches the pair in a DeviceHandle.
The first Tensor allocated on that queue constructs (or attaches to) a BufferPool that holds the same context.
Each pool.allocate(nbytes) returns a BufferHandle whose buffer is the actual cl.Buffer.
The Tensor holds a reference to the BufferHandle in its pool_handle field. When the tensor is garbage-collected, the handle's release() is called, returning the buffer to the pool.
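
The allocate/release part of the handshake can be sketched without OpenCL. In the code below a bytearray plays the role of the cl.Buffer, PooledTensor is a hypothetical stand-in for Tensor, and reusing buffers only on an exact size match is an assumption about the pooling policy.

```python
import weakref


class BufferHandle:
    """Hypothetical handle; `buffer` stands in for the cl.Buffer."""

    def __init__(self, pool, buffer):
        self.pool = pool
        self.buffer = buffer

    def release(self):
        # Return the buffer to the pool's free list for its size.
        self.pool._free.setdefault(len(self.buffer), []).append(self.buffer)


class BufferPool:
    """Size-bucketed free list; a bytearray plays the role of device memory."""

    def __init__(self):
        self._free = {}

    def allocate(self, nbytes):
        bucket = self._free.get(nbytes)
        buffer = bucket.pop() if bucket else bytearray(nbytes)
        return BufferHandle(self, buffer)


class PooledTensor:
    """Stand-in for Tensor: owns a handle and releases it when collected."""

    def __init__(self, pool, nbytes):
        self.pool_handle = pool.allocate(nbytes)
        # The finalizer references the handle, not the tensor, so it does
        # not keep the tensor alive.
        weakref.finalize(self, self.pool_handle.release)
```

weakref.finalize is used here instead of __del__ because it runs at most once and is also invoked at interpreter exit if the object is still alive.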

This means a raw cl.Buffer from pyopencl and a netcl Tensor are not interchangeable: the tensor is the higher-level object that knows about the pool. The core API page documents the Tensor.from_buffer factory for the rare case where you already have a cl.Buffer and want to wrap it.

`DeviceHandle`

DeviceHandle is the dataclass returned by manager.default(...). It is the only object you should need to keep around for a long-lived session; everything else (pool, tensor, queue) is reachable from it.

from dataclasses import dataclass

import pyopencl as cl

@dataclass
class DeviceHandle:
    platform_name: str        # "NVIDIA CUDA", "Apple", "Intel(R) OpenCL", …
    device_name:   str        # "GeForce RTX 4090", "Apple M2 Max", …
    backend:       str        # "cl" or "cpu"
    device_type:   str        # "gpu" | "cpu" | "accel" | "other"
    context:       cl.Context # OpenCL context (None for CPU)
    queue:         cl.CommandQueue # OpenCL queue (or CPUQueue)

`device` — context manager (not a singleton)

core/device.py exports both manager (a DeviceManager instance) and device (a class used as a context manager):

from netcl.core.device import device, manager

# Module-level default: built lazily on first call.
default = manager.default("auto")

# Use a specific device for the duration of a block.
with device("gpu") as dev:
    t = Tensor.from_host(dev.queue, data)   # uses dev.queue
# The previously active device is restored on exit.

Important — device is a class, not a singleton. The German version of this page incorrectly described it as a module-level singleton instance. The actual code in core/device.py defines class device: with __enter__ and __exit__ methods. The module-level active device is stored in a thread-local ThreadLocalState, so different threads can have different active devices at the same time.
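
A minimal sketch of this pattern, assuming nothing about the real core/device.py beyond what is stated above: a class with __enter__ and __exit__ whose active value lives in thread-local storage. Here threading.local plays the role of ThreadLocalState, a plain string stands in for the device, and current_device is a hypothetical helper.

```python
import threading

_state = threading.local()  # plays the role of ThreadLocalState


def current_device():
    """Return the device name active in the calling thread, or None."""
    return getattr(_state, "active", None)


class device:
    """Context manager sketch: a class, instantiated once per `with` block."""

    def __init__(self, name):
        self.name = name
        self._previous = None

    def __enter__(self):
        self._previous = current_device()
        _state.active = self.name
        return self

    def __exit__(self, exc_type, exc, tb):
        _state.active = self._previous  # restore, even if the block raised
        return False  # do not swallow exceptions
```

Because each `with device(...)` creates a fresh instance that remembers the previous value, blocks nest correctly, and a thread that never entered a block sees no active device.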

When to use which backend

Scenario	Recommended backend
Training a real model on a discrete GPU	`OpenCLBackend` (default)
Running on an integrated GPU (Intel, AMD APU)	`OpenCLBackend` with zero-copy buffers via `is_integrated_gpu()` in core/memory
CI / unit tests with no `pyopencl`	`CPUBackend` (`backend="cpu"`)
Debugging a new op kernel	`CPUBackend` first (NumPy reference), then port to OpenCL
Mixing CPU and GPU in the same process	`OpenCLBackend` for the GPU, separate `Tensor`s with `backend="cpu"` for the host-side NumPy
