netcl.core.tensor.Tensor
Tensor is the central data structure in netcl. A Tensor wraps an OpenCL buffer (or a
NumPy ndarray on the CPU backend), a logical shape, a dtype, the OpenCLBackend
or CPUBackend that owns it, and an optional lease from a
BufferPool. It is the type that every operator, layer, and optimizer consumes
and produces.
Note: Public factory surface. The real Tensor API is the two classmethods Tensor.from_host(queue, arr) and Tensor.from_shape(queue, shape, dtype, fill=0) plus the instance methods listed below. There is no Tensor.zeros / Tensor.ones / Tensor.reshape / Tensor.transpose / Tensor.view; reshape is a free function reshape(t, shape), and everything else is built on the ops API. If you need zeros, call Tensor.from_shape(q, shape, dtype, fill=0) (the fill argument is accepted and forwarded to the backend).
Construction
Tensor is a dataclass. The two public entry points are:
Tensor.from_host(queue, data, dtype=None, backend=None, async_copy=None, use_pinned=None)
Copies a host-side np.ndarray (or anything np.asarray accepts) onto the device and
returns a Tensor that owns the destination buffer.
import numpy as np
from netcl.core.tensor import Tensor
from netcl.core.device import manager

q = manager.default("auto").queue
a = Tensor.from_host(q, np.eye(4, dtype=np.float32))  # float32 by default
host_array = np.arange(16, dtype=np.float64).reshape(4, 4)  # any array-like works
b = Tensor.from_host(q, host_array, dtype="float16")  # cast on copy
Arguments:
| Parameter | Type | Default | Meaning |
|---|---|---|---|
| queue | cl.CommandQueue \| CPUQueue | required | The destination queue. If None, the thread-local active device is used. |
| data | array-like | required | Anything convertible to a NumPy array. |
| dtype | str \| None | None (resolves to "float32") | One of "float", "float32", "half", "float16", "float64", "double". |
| backend | str \| None | inferred | "cl" or "cpu". Defaults to queue.backend. |
| async_copy | bool \| None | env default | Non-blocking H2D when True. Overridden by NETCL_ASYNC_H2D. |
| use_pinned | bool \| None | env default | Route through the pinned-host pool for faster DMA. Overridden by NETCL_PINNED_H2D. |
The returned Tensor carries a pending_event and (when the copy is async) a
pending_release handle into the pinned pool, both of which are cleared on
wait() or to_host().
Tensor.from_shape(queue, shape, dtype="float32", fill=0, pool=None, backend=None)
Allocates a fresh buffer of the requested shape and returns a Tensor wired to it. The
returned tensor has pool_handle set when the allocation came from a pool (the default
on the OpenCL path), so the buffer is recycled when the tensor is destroyed.
q = manager.default("auto").queue
x = Tensor.from_shape(q, (4, 8), dtype="float32")
| Parameter | Type | Default | Meaning |
|---|---|---|---|
| queue | queue \| None | required | Destination queue; falls back to the thread-local active device. |
| shape | Sequence[int] | required | Logical shape. Each element is cast to int. |
| dtype | str | "float32" | Same set of dtype names as from_host. |
| fill | scalar | 0 | Initial value. The OpenCLBackend zeroes the buffer; the CPUBackend allocates a NumPy zeros array. |
| pool | BufferPool \| None | None | Override the default pool (get_persistent_pool(queue) on the OpenCL path). |
| backend | str \| None | inferred | "cl" or "cpu". |
The __init__ of Tensor is also reachable directly (every field of the dataclass is a
keyword argument) but should not be used by application code: the dataclass does no
allocation, no pool wiring, and no event setup, so the resulting tensor cannot be released
back to a pool. Always go through the two classmethods above.
Attributes
| Attribute | Type | Meaning |
|---|---|---|
| buffer | cl.Buffer \| None | The OpenCL buffer. None on the CPU backend. |
| shape | tuple[int, ...] | Logical shape. |
| dtype | str | One of "float32", "float16", "float64". |
| context | cl.Context \| None | OpenCL context. None for CPU tensors. |
| queue | cl.CommandQueue \| CPUQueue | Queue the tensor was created on. |
| pool_handle | BufferHandle \| None | The BufferPool lease, if any. Released in __del__. |
| persistent | bool | If True, the buffer is not recycled back to the pool on release. Use for weights you want to keep alive across many iterations. |
| requires_grad | bool | Autograd flag. When True, any operation that consumes this tensor records a node in the Tape. |
| grad | Tensor \| None | Accumulated gradient (filled by tape.backward(loss)). |
| grad_fn | callable \| None | The local-gradient callback registered with the Tape. |
| pending_event | cl.Event \| None | Outstanding H2D copy event that wait() will block on. |
| pending_release | BufferHandle \| None | Pinned-pool lease that should be released once pending_event completes. |
| host_ref | np.ndarray \| None | Last host snapshot retained while an async copy is in flight. |
| array | np.ndarray \| None | The actual NumPy array on the CPU backend. |
| backend | str (property) | "cl" or "cpu". Read-only view onto the underlying DeviceBackend. |
| size | int (property) | Total number of elements (prod(shape)). |
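The size property and the int cast that from_shape applies to shape can be pictured in plain Python. This is a minimal sketch and not netcl's code; normalize_shape and element_count are hypothetical names introduced only for illustration.

```python
import math

def normalize_shape(shape):
    """Cast every dimension to int, as from_shape is described as doing."""
    return tuple(int(d) for d in shape)

def element_count(shape):
    """Total number of elements: the product of all dimensions."""
    return math.prod(normalize_shape(shape))

print(normalize_shape([4.0, 8]))   # (4, 8)
print(element_count((4, 8)))       # 32
print(element_count((3, 0, 5)))    # 0
```

A zero anywhere in the shape makes the product zero, which matches the 0-element case listed under Errors and Edge Cases.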
Methods
to_host() -> np.ndarray
Synchronously reads the tensor's storage into a NumPy array. On the OpenCL path this
first calls wait() to flush the pending H2D copy (if any), then issues an
enqueue_copy into a freshly allocated or zero-copy-mapped host buffer. On the CPU path it
returns self.array directly (no copy), so the result aliases the tensor's storage.
arr = a.to_host()
to_host_async() -> tuple[np.ndarray, cl.Event | None]
Like to_host, but returns the array and the OpenCL event that marks the copy as
complete, so the caller can chain further work without waiting.
wait() -> None
Blocks until the pending H2D copy (if any) finishes, then releases the pinned lease and clears the pending fields. Safe to call multiple times.
_clear_pending() -> None
Internal helper called by wait(). Releases the pinned lease and nulls
pending_event, pending_release, host_ref. Application code should call
wait() instead.
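The wait() / _clear_pending() contract described above (block once, release the pinned lease once, then become a no-op) can be modelled without OpenCL. The sketch below is a toy under stated assumptions: FakeEvent, FakeLease, and PendingCopy are hypothetical stand-ins, not netcl classes, and the real method bodies may differ.

```python
class FakeEvent:
    """Stand-in for an OpenCL event; counts how often it is waited on."""
    def __init__(self):
        self.waits = 0

    def wait(self):
        self.waits += 1


class FakeLease:
    """Stand-in for a pinned-pool lease; counts releases."""
    def __init__(self):
        self.released = 0

    def release(self):
        self.released += 1


class PendingCopy:
    """Toy holder for the three pending fields named in the text."""
    def __init__(self, event, lease, host_ref):
        self.pending_event = event
        self.pending_release = lease
        self.host_ref = host_ref

    def wait(self):
        # Block only if a copy is actually outstanding.
        if self.pending_event is not None:
            self.pending_event.wait()
        self._clear_pending()

    def _clear_pending(self):
        # Release the lease at most once, then null all pending fields.
        if self.pending_release is not None:
            self.pending_release.release()
        self.pending_event = None
        self.pending_release = None
        self.host_ref = None


event, lease = FakeEvent(), FakeLease()
copy = PendingCopy(event, lease, host_ref=[1.0, 2.0])
copy.wait()
copy.wait()  # second call finds nothing pending and does nothing
print(event.waits, lease.released)  # 1 1
```

Nulling the fields inside the clearing helper is what makes repeated calls safe: the second wait() sees None everywhere and falls through.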
__del__() -> None
Calls wait() to drain any outstanding copy, then releases the pool_handle if one
exists. This is the moment a BufferPool bucket actually sees the buffer come
back.
Note: __del__ is best-effort. The Python interpreter only calls __del__ when the refcount drops to zero. Long-lived references (including cycles that the GC breaks) can delay release. If you need deterministic release, drop every Python reference to the tensor explicitly and call gc.collect() in tight loops.
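The timing difference between a plain reference drop and a reference cycle is easy to observe with a toy class. This sketch uses only the standard library and describes CPython behaviour; Lease is a hypothetical stand-in for any object whose finalizer returns a resource, not a netcl type.

```python
import gc

released = []

class Lease:
    """Toy object that records when its finalizer runs."""
    def __init__(self, name):
        self.name = name
        self.other = None

    def __del__(self):
        released.append(self.name)

# Case 1: no cycle. In CPython the finalizer runs as soon as the
# last reference is dropped.
a = Lease("plain")
del a
print(released)  # ['plain']

# Case 2: a reference cycle. Dropping the names is not enough; the
# objects stay alive until the cycle collector runs.
gc.disable()  # keep the automatic collector out of the example
b, c = Lease("cyc-1"), Lease("cyc-2")
b.other, c.other = c, b
del b, c
print(len(released))  # 1

gc.collect()  # breaks the cycle and runs both finalizers
gc.enable()
print(sorted(released))  # ['cyc-1', 'cyc-2', 'plain']
```

In a training loop, the same effect means a tensor caught in a cycle holds its buffer until a collection happens, which is why an explicit gc.collect() can help when memory is tight.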
Free Function: reshape(t, shape)
There is no Tensor.reshape method. The reshape primitive is a free function in
netcl.core.tensor:
from netcl.core.tensor import reshape
flat = Tensor.from_shape(q, (16,), dtype="float32")
sq = reshape(flat, (4, 4)) # view, no copy
reshape returns a new Tensor that shares the same storage but has the requested shape
and exposes a _base attribute pointing back to the original tensor. The OpenCL backend
implements this as a metadata-only change; the CPU backend calls ndarray.reshape.
The number of elements must be the same. There is no automatic flatten or infer_size.
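The element-count rule can be checked on the host before calling reshape. The helper below is a sketch under assumptions: check_reshape is a hypothetical name, the text does not say which exception netcl raises on a mismatch (ValueError is assumed here), and "no infer_size" is read as meaning that a -1 placeholder dimension is not supported.

```python
import math

def check_reshape(old_shape, new_shape):
    """Return new_shape as a tuple of ints if it preserves the element count."""
    new_shape = tuple(int(d) for d in new_shape)
    if any(d < 0 for d in new_shape):
        # Assumption: no -1 style inference, every dimension is explicit.
        raise ValueError(f"negative dimension in {new_shape}")
    old_n, new_n = math.prod(old_shape), math.prod(new_shape)
    if old_n != new_n:
        raise ValueError(
            f"cannot reshape {old_n} elements into {new_shape} ({new_n} elements)"
        )
    return new_shape

print(check_reshape((16,), (4, 4)))  # (4, 4)

try:
    check_reshape((16,), (3, 5))
except ValueError as exc:
    print("rejected:", exc)
```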
Dtype Mapping
Tensor accepts the following dtype strings (case-sensitive):
| String | NumPy dtype | Bytes | OpenCL C type | Notes |
|---|---|---|---|---|
| "float" | float32 | 4 | float | Default. |
| "float32" | float32 | 4 | float | Same as "float". |
| "half" | float16 | 2 | half | Requires the device to support cl_khr_fp16 (see core for the capability probe). |
| "float16" | float16 | 2 | half | Same as "half". |
| "double" | float64 | 8 | double | Requires the device to support cl_khr_fp64. |
| "float64" | float64 | 8 | double | Same as "double". |
Any other dtype raises ValueError at construction time.
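The table can be expressed as a lookup that resolves aliases and rejects everything else. This is an illustrative sketch rather than netcl's implementation; canonical_dtype and buffer_nbytes are hypothetical helper names, and the error message wording is an assumption.

```python
# alias -> (canonical name, bytes per element), copied from the table above
_DTYPES = {
    "float": ("float32", 4),
    "float32": ("float32", 4),
    "half": ("float16", 2),
    "float16": ("float16", 2),
    "double": ("float64", 8),
    "float64": ("float64", 8),
}

def canonical_dtype(name):
    """Resolve an alias to its canonical name; lookup is case-sensitive."""
    try:
        return _DTYPES[name][0]
    except KeyError:
        raise ValueError(f"unsupported dtype {name!r}") from None

def buffer_nbytes(shape, dtype):
    """Bytes needed to store a tensor of the given shape and dtype."""
    canonical_dtype(dtype)  # validates the name
    count = 1
    for dim in shape:
        count *= int(dim)
    return count * _DTYPES[dtype][1]

print(canonical_dtype("half"))                # float16
print(buffer_nbytes((128, 128), "float32"))   # 65536

try:
    canonical_dtype("Float32")  # wrong case is rejected
except ValueError as exc:
    print(exc)
```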
Interaction with BufferPool
A Tensor allocated via Tensor.from_shape on the OpenCL backend holds a BufferHandle
from the persistent pool by default. When the tensor is destroyed, that handle is released
back to the pool, putting the buffer into the appropriate bucket for the next allocation.
Tensor.from_host may also produce a pending_release for the PinnedBufferPool
lease that backed the H2D copy; that is released on wait() rather than on __del__.
If you want a tensor to outlive many iterations without recycling its buffer — for example
the weight buffer of a layer you intend to keep across save/load — set
persistent=True after construction:
w = Tensor.from_shape(q, (out_features, in_features), dtype="float32")
w.persistent = True
# Even after the last Python reference drops, the buffer is not pooled.
For pool statistics, bucket sizing, and the difference between BufferPool,
PinnedBufferPool, and PersistentBufferPool, see Architecture: Memory Pool.
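The recycle-on-release behaviour and the effect of persistent can be pictured with a toy pool. Everything here is a hypothetical stand-in: ToyPool and ToyTensor are not netcl classes, bucketing by exact byte size is an assumption (the real bucket sizing is described elsewhere), and an explicit close() replaces the __del__ hook so the example is deterministic.

```python
from collections import defaultdict

class ToyPool:
    """Size-keyed free lists; a stand-in for the idea of a bucketed pool."""
    def __init__(self):
        self.free = defaultdict(list)  # nbytes -> idle buffers
        self.allocations = 0           # count of fresh allocations

    def acquire(self, nbytes):
        if self.free[nbytes]:
            return self.free[nbytes].pop()  # reuse an idle buffer
        self.allocations += 1
        return bytearray(nbytes)

    def release(self, buf):
        self.free[len(buf)].append(buf)


class ToyTensor:
    def __init__(self, pool, nbytes):
        self.pool = pool
        self.buffer = pool.acquire(nbytes)
        self.persistent = False

    def close(self):
        """Explicit stand-in for the release the text places in __del__."""
        if not self.persistent:
            self.pool.release(self.buffer)
        self.buffer = None


pool = ToyPool()
t1 = ToyTensor(pool, 64)
t1.close()                 # buffer goes back to the 64-byte bucket
t2 = ToyTensor(pool, 64)   # served from the bucket, no new allocation
print(pool.allocations)    # 1

t2.persistent = True
t2.close()                 # persistent: buffer is NOT returned
t3 = ToyTensor(pool, 64)   # bucket is empty, so a fresh allocation happens
print(pool.allocations)    # 2
```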
Autograd Integration
Three fields on Tensor are reserved for Autograd & Tape:
- requires_grad: set this on a leaf tensor (typically a parameter) to record operations that consume it.
- grad: populated by tape.backward(loss). If grad is already non-None when backward is called, the new gradient is accumulated into it (the standard semantics; remember to call optimizer.zero_grad() between iterations).
- grad_fn: the local-gradient callback. Each ops API wrapper that is differentiable sets this to a closure that takes the upstream gradient and returns a tuple of input gradients.
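The accumulation rule for grad is the part that most often surprises people, so here is a toy model of it in plain Python. Param, accumulate, and zero_grad are hypothetical names; in particular, resetting grad to None in zero_grad is an assumption, since the text does not say whether netcl's optimizer clears the field or zeroes it in place.

```python
class Param:
    """Toy parameter with a scalar value and an accumulated gradient."""
    def __init__(self, value):
        self.value = value
        self.grad = None

def accumulate(param, new_grad):
    """Add new_grad into param.grad, treating None as 'no gradient yet'."""
    param.grad = new_grad if param.grad is None else param.grad + new_grad

def zero_grad(params):
    """Assumed reset: drop the accumulated gradient entirely."""
    for p in params:
        p.grad = None

w = Param(3.0)

# loss = w**2, so d(loss)/dw = 2 * w = 6.0
accumulate(w, 2 * w.value)
accumulate(w, 2 * w.value)  # second backward without zero_grad
print(w.grad)  # 12.0

zero_grad([w])
accumulate(w, 2 * w.value)
print(w.grad)  # 6.0
```

Skipping the reset doubles the effective gradient on the second step, which is why the reset belongs in every iteration.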
A typical training step looks like:
import netcl.autograd as ag

# model, F, x, y, and opt are assumed to be defined elsewhere.
with ag.Tape() as tape:
    logits = model(x)
    loss = F.cross_entropy(logits, y)
tape.backward(loss)
opt.step()
opt.zero_grad()
See Autograd & Tape for the full contract.
Examples
Allocate, Move Data H2D, Move Back D2H
import numpy as np
from netcl.core.device import manager
from netcl.core.tensor import Tensor
q = manager.default("auto").queue
host = np.random.randn(3, 4).astype(np.float32)
dev = Tensor.from_host(q, host)
back = dev.to_host()
assert np.allclose(host, back)
Use a Custom Pool
from netcl.core.memory import BufferPool
from netcl.core.tensor import Tensor
custom_pool = BufferPool(q.context)
x = Tensor.from_shape(q, (128, 128), dtype="float32", pool=custom_pool)
# x.pool_handle is a BufferHandle from `custom_pool`.
Reshape a View
from netcl.core.tensor import reshape
flat = Tensor.from_shape(q, (12,), dtype="float32")
mat = reshape(flat, (3, 4)) # no copy
print(mat.shape, mat._base is flat) # (3, 4) True
fp16 with Capability Check
from netcl.core.capabilities import device_profile
from netcl.core.device import manager
from netcl.core.tensor import Tensor

q = manager.default("auto").queue
prof = device_profile(q.device)
if prof.has_fp16:
    x = Tensor.from_shape(q, (4, 4), dtype="float16")
else:
    # Fall back to float32 when the device lacks half-precision support.
    x = Tensor.from_shape(q, (4, 4), dtype="float32")
Errors and Edge Cases
- from_host with dtype not in the table above raises ValueError("unsupported dtype …").
- from_shape with a zero in shape is permitted and produces a 0-element tensor; its size is 0, no buffer is allocated, and pool_handle is None.
- Calling to_host() while an unrelated H2D copy into the tensor is still in flight is safe; to_host() calls wait() first.
- __del__ swallows every exception (the interpreter cannot do anything useful with one during teardown), so a malformed tensor will not crash the process.
- On the OpenCL path, Tensor.from_shape calls get_persistent_pool(queue) if no pool is passed. That pool is process-global and bounded by NETCL_MAX_CACHED_GB.
See also
- core API: DeviceManager, BufferPool, OpenCLBackend, CPUBackend, and the KernelSelector autotuner.
- ops API: the operations that consume and produce Tensor.
- Autograd & Tape: how requires_grad, grad, and grad_fn are wired into the backward pass.
- JIT Compiler: how generated kernels are compiled against this tensor's queue.
- Tensor Backend: the bigger picture of buffers, queues, and the OpenCL transport.
- Memory Pool: bucket sizing and the choice between simple and persistent pools.