Tensor

Status: Public API in netcl.core.tensor.Tensor

Tensor is the central data structure of netcl. It is a thin wrapper that pairs a contiguous block of memory (a cl.Buffer on an OpenCL device, or a numpy.ndarray on the CPU backend) with the metadata the rest of the stack needs to manipulate it: the shape and dtype; the context and queue the memory lives on; a back-reference to a BufferPool, so the buffer can be returned when the tensor is dropped; and the autograd fields (requires_grad, grad, grad_fn, pending_event) that let the Tape walk it during backward.

Unlike a torch.Tensor, a netcl Tensor is not a unified view of a single storage that can be reshaped, sliced and broadcast at will. It is closer to a handle to a concrete buffer with a fixed shape. This is a deliberate choice: it keeps the wrapper small enough to be allocated in hot loops, and it lets the JIT compiler reason about lifetimes when it fuses chains of ops.

Overview

A Tensor carries both the user-visible state (shape, dtype, what the caller will read) and the implementation state (which device, which pool, which OpenCL queue, whether an async copy is in flight). When you read tensor.shape you get the logical shape the user wrote; when you read tensor.buffer you get the raw OpenCL buffer the kernel will read from. Most user code never touches the second group of fields.

The lifecycle is intentionally short. Most tensors are transient: allocated, used by one or two kernels, and released back to the pool when the Python wrapper is garbage-collected. Persistent tensors — model parameters, optimizer state, running-mean buffers in BatchNorm — opt in by setting persistent=True on construction so the pool does not reclaim them.
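
The lifecycle above can be illustrated with a small, self-contained model. This is a sketch under stated assumptions, not netcl's implementation: ToyPool, ToyHandle and ToyTensor are hypothetical stand-ins, a bytearray plays the role of the device buffer, and passing the persistent flag to release() is just one possible way to wire up the short-circuit.

```python
from typing import Dict, List


class ToyPool:
    """Hypothetical size-bucketed pool; a bytearray stands in for cl.Buffer."""

    def __init__(self) -> None:
        self.free: Dict[int, List[bytearray]] = {}
        self.hits = 0
        self.misses = 0

    def acquire(self, nbytes: int) -> "ToyHandle":
        bucket = self.free.get(nbytes)
        if bucket:  # a released buffer of this size is available
            self.hits += 1
            return ToyHandle(self, bucket.pop())
        self.misses += 1
        return ToyHandle(self, bytearray(nbytes))


class ToyHandle:
    """Hypothetical handle that knows which pool its buffer came from."""

    def __init__(self, pool: ToyPool, buf: bytearray) -> None:
        self.pool = pool
        self.buf = buf
        self.released = False

    def release(self, persistent: bool = False) -> None:
        # Persistent buffers are left alone; repeated releases are ignored.
        if persistent or self.released:
            return
        self.released = True
        self.pool.free.setdefault(len(self.buf), []).append(self.buf)


class ToyTensor:
    """Hypothetical wrapper that returns its buffer when it is collected."""

    def __init__(self, pool: ToyPool, nbytes: int, persistent: bool = False) -> None:
        self.pool_handle = pool.acquire(nbytes)
        self.persistent = persistent

    def __del__(self) -> None:
        self.pool_handle.release(self.persistent)


pool = ToyPool()
a = ToyTensor(pool, 4096)                   # miss: a fresh buffer is created
del a                                       # CPython runs __del__ here; the buffer returns to the pool
b = ToyTensor(pool, 4096)                   # hit: the released buffer is reused
w = ToyTensor(pool, 4096, persistent=True)  # miss: the bucket is empty again
del w                                       # persistent: the buffer is not returned
```

The model relies on CPython's reference counting to run __del__ as soon as the last reference disappears; other interpreters may delay the release.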

Where It Lives

  • File path: core/tensor.py (class Tensor).
  • Module path: netcl.core.tensor.
  • Public re-export: none. Tensor is not re-exported at the top level; import it with from netcl.core.tensor import Tensor.
  • Backed by: cl.Buffer (PyOpenCL) on the OpenCL backend, numpy.ndarray on the CPU backend.

How It Works

The Tensor dataclass has the following fields (paraphrased from core/tensor.py):

  • buffer: Optional[cl.Buffer] — the underlying device memory. None means the tensor has no device-side storage, as for a tensor created on the CPU backend, whose data lives in array.
  • shape: Tuple[int, ...] — the logical shape. Always set, even for zero-dimensional tensors.
  • dtype: str — one of "float", "float32", "half", "float16", "float64", "double". The mapping is enforced by _np_dtype / _dtype_nbytes.
  • context: Optional[cl.Context] and queue: Any — the OpenCL context and command queue this tensor was created on. Pass these into any kernel launch.
  • pool_handle: Optional[BufferHandle] — the handle the pool handed out when this tensor was allocated. The tensor's __del__ calls pool_handle.release() so the underlying buffer is reused.
  • persistent: bool — when True, the pool's release method short-circuits and the buffer is left alone.
  • requires_grad: bool — whether this tensor should be tracked by the Tape.
  • grad: Optional[Tensor] — populated by the autograd engine during backward().
  • grad_fn: Optional[Callable] — the function that produces the gradient w.r.t. this tensor's inputs, registered with the Tape when the tensor was created.
  • host_ref: Optional[np.ndarray] — a pinned/staged host copy used to read the tensor back from the device.
  • array: Optional[np.ndarray] — set when backend == "cpu" and the tensor lives entirely in RAM.
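
The dtype strings in the field list come in pairs, and helpers such as _np_dtype and _dtype_nbytes have to resolve them. The sketch below shows one plausible shape for that lookup. The alias table (for example "float" meaning 32-bit, following OpenCL C naming), the function names and the error handling are all assumptions made for illustration, not code from core/tensor.py.

```python
import math
from typing import Tuple

# Assumed alias table: every accepted string maps to a canonical name that
# could be passed to numpy.dtype(...).
_CANONICAL = {
    "float": "float32", "float32": "float32",
    "half": "float16", "float16": "float16",
    "double": "float64", "float64": "float64",
}
# Size in bytes of one element of each canonical dtype.
_NBYTES = {"float16": 2, "float32": 4, "float64": 8}


def canonical_dtype(dtype: str) -> str:
    """Resolve an accepted dtype string to its canonical name."""
    try:
        return _CANONICAL[dtype]
    except KeyError:
        raise ValueError(f"unsupported dtype: {dtype!r}") from None


def dtype_nbytes(dtype: str) -> int:
    """Return the size in bytes of a single element."""
    return _NBYTES[canonical_dtype(dtype)]


def buffer_nbytes(shape: Tuple[int, ...], dtype: str) -> int:
    """Return the buffer size needed for a tensor of this shape and dtype."""
    # math.prod(()) == 1, so a zero-dimensional tensor holds one element.
    return math.prod(shape) * dtype_nbytes(dtype)
```

Under these assumptions, buffer_nbytes((4, 1024), "float32") is the byte count a pool would need to hand out for the tensor allocated in the code example further down.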

The class is a @dataclass, but it defines a custom __init__ to stay backward compatible with the historical positional signature.
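
As a rough illustration of that pattern, the sketch below declares the listed fields on a dataclass and supplies a hand-written __init__. TensorSketch is a hypothetical name, the positional order (buffer, shape, dtype, context, queue) and all defaults are assumptions, and Any stands in for the PyOpenCL and NumPy types so the snippet runs without either library.

```python
from dataclasses import dataclass, fields
from typing import Any, Callable, Optional, Tuple


@dataclass(init=False)  # keep the generated __repr__/__eq__, write __init__ by hand
class TensorSketch:
    buffer: Optional[Any]               # cl.Buffer in the real class
    shape: Tuple[int, ...]
    dtype: str
    context: Optional[Any] = None       # cl.Context in the real class
    queue: Any = None
    pool_handle: Optional[Any] = None   # BufferHandle in the real class
    persistent: bool = False
    requires_grad: bool = False
    grad: Optional["TensorSketch"] = None
    grad_fn: Optional[Callable] = None
    host_ref: Optional[Any] = None      # np.ndarray in the real class
    array: Optional[Any] = None         # np.ndarray in the real class

    def __init__(self, buffer, shape, dtype="float32", context=None, queue=None, **extra):
        # Assumed historical positional order: buffer, shape, dtype, context, queue.
        unknown = set(extra) - {f.name for f in fields(self)}
        if unknown:
            raise TypeError(f"unexpected fields: {sorted(unknown)}")
        self.buffer = buffer
        self.shape = tuple(shape)
        self.dtype = dtype
        self.context = context
        self.queue = queue
        # Fields not passed here fall back to the class-level defaults above.
        for name, value in extra.items():
            setattr(self, name, value)


t = TensorSketch(None, [4, 1024], "float32", requires_grad=True)
```

Passing init=False makes the intent explicit: the dataclass machinery still generates __repr__ and __eq__ from the field list, while construction stays under the class's own control.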

Code Example

import numpy as np

import netcl as nc
from netcl.core.tensor import Tensor

# Allocate a 4 x 1024 fp32 tensor on the active OpenCL device.
ctx, queue = nc.device.manager.default()
t = Tensor.zeros((4, 1024), dtype="float32", context=ctx, queue=queue)
print(t.shape)            # (4, 1024)
print(t.requires_grad)    # False

# Track it in autograd, as for a learnable parameter.
t.requires_grad = True

# Move data to the device and back.
numpy_array = np.ones((4, 1024), dtype=np.float32)  # host data matching t's shape and dtype
t.from_host(numpy_array)  # blocking or non-blocking depending on flags
host = t.to_host()        # numpy.ndarray on the host

# Drop the last reference so the buffer can return to the pool (normally done by GC).
del t

Performance & Trade-offs

  • The pool-backed allocation path (Tensor.zeros, Tensor.from_host) is the only supported way to allocate. Creating a cl.Buffer directly is unsupported and bypasses the pool's hit-rate statistics.
  • Tensors with requires_grad=True are slightly more expensive to destroy, because the autograd engine inserts them into the global graph. Leave the flag False for ephemeral intermediates.
  • to_host_async() is non-blocking; it returns a cl.Event so the caller can wait. Use it from inside a training loop to overlap the D2H copy of the previous loss with the next forward pass.
  • The tensor wrapper itself is tiny; the real memory is the cl.Buffer. It is safe to create millions of small tensors per epoch, provided you let the pool's buckets absorb them.
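
The overlap mentioned for to_host_async() follows a simple one-step-behind loop structure, sketched here without netcl or an OpenCL device. Everything in the snippet is a stand-in chosen for illustration: forward() and copy_to_host() are hypothetical placeholders, and a concurrent.futures.Future plays the role of the cl.Event that the real call returns.

```python
from concurrent.futures import ThreadPoolExecutor


def forward(step: int) -> float:
    """Stand-in for a forward pass that yields a device-side loss."""
    return step * 0.5


def copy_to_host(value: float) -> float:
    """Stand-in for the device-to-host copy of one loss value."""
    return value


losses = []
with ThreadPoolExecutor(max_workers=1) as copier:
    pending = None  # copy of the previous step's loss, possibly still in flight
    for step in range(4):
        loss = forward(step)                  # runs while the previous copy is in flight
        if pending is not None:
            losses.append(pending.result())   # wait for the previous step's copy
        pending = copier.submit(copy_to_host, loss)  # start this step's copy, do not wait
    losses.append(pending.result())           # drain the final copy after the loop
```

The loss for step N is only read on the host during step N+1, so the loop never stalls on a copy it has just started. With a real cl.Event the wait would be event.wait() instead of Future.result().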

See also

  • Tensor API — full method reference.
  • Tensor Backend — how the OpenCL backend wraps cl.Buffer and dispatches kernels.
  • BufferPool — the memory pool the tensor delegates to.
  • Tape — the autograd graph the tensor is attached to.
  • Autograd & Tape — tape walk, gradient accumulation, backward pass.
  • JIT Compiler — how fused chains of tensor ops are compiled.
  • AMP — how the tensor dtype is downcast in autocast.