Tensor
Status: Public API in
netcl.core.tensor.Tensor
Tensor is the central data structure of netcl. It is a thin, lightweight
wrapper that pairs a contiguous block of memory (a cl.Buffer on an OpenCL
device, or a numpy.ndarray on the CPU backend) with the metadata the rest
of the stack needs to manipulate it: shape, dtype, the context and queue
the memory lives on, a back-reference to a BufferPool so the buffer can be
returned when the tensor is dropped, and the autograd fields
(requires_grad, grad, grad_fn, pending_event) that let the
Tape walk it during backward.
Unlike a torch.Tensor, a netcl Tensor is not a unified view of a single
storage that can be reshaped, sliced and broadcast at will. It is closer to
a handle to a concrete buffer with a fixed shape. This is a deliberate
choice: it keeps the wrapper small enough to be allocated in hot loops,
and it lets the JIT compiler reason about lifetimes when it fuses chains
of ops.
Overview
A Tensor carries both the user-visible state (shape, dtype, what the
caller will read) and the implementation state (which device, which pool,
which OpenCL queue, whether an async copy is in flight). When you read
tensor.shape you get the logical shape the user wrote; when you read
tensor.buffer you get the raw OpenCL buffer the kernel will read from.
Most user code never touches the second group of fields.
The lifecycle is intentionally short. Most tensors are transient:
allocated, used by one or two kernels, and released back to the pool
when the Python wrapper is garbage-collected. Persistent tensors — model
parameters, optimizer state, running-mean buffers in
BatchNorm — opt in by setting
persistent=True on construction so the pool does not reclaim them.
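The acquire/release cycle described above can be illustrated with a toy pool. This is only a sketch under stated assumptions: ToyPool and ToyHandle are hypothetical stand-ins for BufferPool and BufferHandle, a bytearray plays the role of the device buffer, and bucketing by exact byte size is an assumption rather than the library's documented policy.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyPool:
    """Hypothetical stand-in for BufferPool: free buffers bucketed by byte size."""
    free: Dict[int, List[bytearray]] = field(default_factory=dict)
    hits: int = 0
    misses: int = 0

    def acquire(self, nbytes: int, persistent: bool = False) -> "ToyHandle":
        bucket = self.free.get(nbytes, [])
        if bucket:
            self.hits += 1            # reuse a previously released buffer
            buf = bucket.pop()
        else:
            self.misses += 1          # nothing to reuse, allocate fresh memory
            buf = bytearray(nbytes)
        return ToyHandle(self, buf, persistent)


class ToyHandle:
    """Hypothetical stand-in for BufferHandle."""

    def __init__(self, pool: ToyPool, buf: bytearray, persistent: bool) -> None:
        self.pool, self.buf, self.persistent = pool, buf, persistent
        self.released = False

    def release(self) -> None:
        # Persistent buffers are never handed back; repeated releases are no-ops.
        if self.persistent or self.released:
            return
        self.released = True
        self.pool.free.setdefault(len(self.buf), []).append(self.buf)


pool = ToyPool()
a = pool.acquire(4096)
a.release()                               # transient: goes back to its bucket
b = pool.acquire(4096)                    # served from the bucket (a pool hit)
w = pool.acquire(4096, persistent=True)   # e.g. a model parameter
w.release()                               # short-circuits, buffer stays with w
```

In this sketch the second acquire reuses the first buffer, while the persistent handle keeps its memory even after release() is called on it.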
Where It Lives
- File path: core/tensor.py (class Tensor).
- Module path: netcl.core.tensor.
- Public re-export: none. Tensor is not re-exported at the top level as netcl.Tensor; import it as from netcl.core.tensor import Tensor.
- Backed by: cl.Buffer (PyOpenCL) on the GPU queue, numpy.ndarray on the CPU queue.
How It Works
The Tensor dataclass has the following fields (paraphrased from
core/tensor.py):
- buffer (Optional[cl.Buffer]): the underlying device memory. None means the tensor has no device side (e.g. a CPU-only numpy.ndarray tensor created on the CPU backend).
- shape (Tuple[int, ...]): the logical shape. Always set, even for zero-dimensional tensors.
- dtype (str): one of "float", "float32", "half", "float16", "float64", "double". The mapping is enforced by _np_dtype / _dtype_nbytes.
- context (Optional[cl.Context]) and queue (Any): the OpenCL context and command queue this tensor was created on. Pass these into any kernel launch.
- pool_handle (Optional[BufferHandle]): the handle the pool handed out when this tensor was allocated. The tensor's __del__ calls pool_handle.release() so the underlying buffer is reused.
- persistent (bool): when True, the pool's release method short-circuits and the buffer is left alone.
- requires_grad (bool): whether this tensor should be tracked by the Tape.
- grad (Optional[Tensor]): populated by the autograd engine during backward().
- grad_fn (Optional[Callable]): the function that produces the gradient w.r.t. this tensor's inputs, registered with the Tape when the tensor was created.
- host_ref (Optional[np.ndarray]): a pinned/staged host copy used to read the tensor back from the device.
- array (Optional[np.ndarray]): set when backend == "cpu" and the tensor lives entirely in RAM.
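To make the relationship between shape, dtype and buffer size concrete, here is a minimal sketch of a byte-size computation. The helper buffer_nbytes and the table _ITEMSIZE are hypothetical names, not the library's _np_dtype / _dtype_nbytes; the aliasing of "float", "half" and "double" to 4, 2 and 8 bytes follows the OpenCL C type sizes and is an assumption about how netcl resolves those names.

```python
import math
from typing import Tuple

# Assumed aliasing: OpenCL C type names share a size with the explicit names.
_ITEMSIZE = {
    "half": 2, "float16": 2,
    "float": 4, "float32": 4,
    "double": 8, "float64": 8,
}


def buffer_nbytes(shape: Tuple[int, ...], dtype: str) -> int:
    """Bytes needed for a dense tensor of `shape` and `dtype` (hypothetical helper)."""
    try:
        itemsize = _ITEMSIZE[dtype]
    except KeyError:
        raise ValueError(f"unsupported dtype: {dtype!r}") from None
    # math.prod(()) == 1, so a zero-dimensional tensor still holds one element.
    return math.prod(shape) * itemsize


print(buffer_nbytes((4, 1024), "float32"))  # 16384
print(buffer_nbytes((), "half"))            # 2
```

Rejecting unknown dtype strings up front, rather than falling back to a default, keeps a typo from silently allocating a buffer of the wrong size.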
The class is a @dataclass; the custom __init__ is provided to keep
backward compatibility with the historical positional signature.
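A dataclass can keep a hand-written constructor by disabling the generated one. The sketch below shows the general pattern only: TensorSketch is a hypothetical, trimmed-down class, and its argument order and defaults are assumptions rather than the real signature in core/tensor.py.

```python
from dataclasses import dataclass, fields
from typing import Any, Optional, Tuple


@dataclass(init=False)  # keep the generated __repr__ and __eq__, skip __init__
class TensorSketch:
    """Hypothetical, trimmed-down stand-in; not the real netcl Tensor."""
    buffer: Optional[Any]
    shape: Tuple[int, ...]
    dtype: str
    requires_grad: bool
    persistent: bool

    def __init__(self, buffer, shape, dtype="float32", *,
                 requires_grad=False, persistent=False):
        # The hand-written constructor accepts a positional call and
        # normalises the shape to a tuple of ints.
        self.buffer = buffer
        self.shape = tuple(int(d) for d in shape)
        self.dtype = dtype
        self.requires_grad = requires_grad
        self.persistent = persistent


t = TensorSketch(None, [4, 1024])
print(t.shape)                       # (4, 1024)
print([f.name for f in fields(t)])   # field metadata is still available
```

The class still gets dataclass field introspection, equality and a readable repr, while callers that rely on the older positional form keep working.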
Code Example
import numpy as np
import netcl as nc
from netcl.core.tensor import Tensor

# Allocate a 4 x 1024 fp32 tensor on the active OpenCL device.
ctx, queue = nc.device.manager.default()
t = Tensor.zeros((4, 1024), dtype="float32", context=ctx, queue=queue)
print(t.shape)          # (4, 1024)
print(t.requires_grad)  # False

# Mark it as a learnable parameter.
t.requires_grad = True

# Move data to the device and back. The host array must match shape and dtype.
numpy_array = np.ones((4, 1024), dtype=np.float32)
t.from_host(numpy_array)  # blocking or non-blocking depending on flags
host = t.to_host()        # numpy.ndarray on the host

# Drop the reference; __del__ returns the buffer to the pool (normally done by GC).
del t
Performance & Trade-offs
- The pool-backed allocation path (Tensor.zeros, Tensor.from_host) is the only supported way to allocate; calling cl.Buffer directly is not supported and bypasses the hit-rate statistics.
- requires_grad=True tensors are slightly more expensive to destroy (__del__) because the autograd engine inserts them into the global graph. For ephemeral intermediates, leave it False.
- to_host_async() is non-blocking; it returns a cl.Event so the caller can wait. Use it from inside a training loop to overlap the D2H copy of the previous loss with the next forward pass.
- The tensor wrapper itself is tiny; the real memory is the cl.Buffer. It is safe to create millions of small tensors per epoch, provided you let the pool's buckets absorb them.
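The overlap pattern mentioned for to_host_async() can be sketched without a device. In this illustration a concurrent.futures.Future stands in for the cl.Event, and slow_readback and forward are hypothetical placeholders for the D2H copy and the forward pass; nothing here calls netcl or PyOpenCL, so treat it as the shape of the loop rather than working training code.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor
from typing import List, Optional


def slow_readback(value: float) -> float:
    """Placeholder for a device-to-host copy of a scalar loss."""
    time.sleep(0.01)
    return value


def forward(step: int) -> float:
    """Placeholder for a forward pass that produces a loss."""
    return 1.0 / (step + 1)


losses: List[float] = []
with ThreadPoolExecutor(max_workers=1) as copier:
    pending: Optional[Future] = None
    for step in range(4):
        loss = forward(step)                 # runs while the previous copy is in flight
        if pending is not None:
            losses.append(pending.result())  # only now wait for the copy from step - 1
        pending = copier.submit(slow_readback, loss)
    if pending is not None:
        losses.append(pending.result())      # drain the last outstanding copy

print(losses)
```

The key point is the ordering: the wait on the previous step's copy happens after the next forward pass has been issued, so the loss is always read one step late but the copy never stalls the loop.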
See also
- Tensor API: full method reference.
- Tensor Backend: how the OpenCL backend wraps cl.Buffer and dispatches kernels.
- BufferPool: the memory pool the tensor delegates to.
- Tape: the autograd graph the tensor is attached to.
- Autograd & Tape: tape walk, gradient accumulation, backward pass.
- JIT Compiler: how fused chains of tensor ops are compiled.
- AMP: how the tensor dtype is downcast in autocast.
- Tensor: this article.