netcl.core.tensor.Tensor

Tensor is the central data structure in netcl. A Tensor wraps an OpenCL buffer (or a NumPy ndarray on the CPU backend), a logical shape, a dtype, the OpenCLBackend or CPUBackend that owns it, and an optional lease from a BufferPool. It is the type that every operator, layer, and optimizer consumes and produces.

Note: Public factory surface. The real Tensor API consists of the two classmethods Tensor.from_host(queue, arr) and Tensor.from_shape(queue, shape, dtype, fill=0), plus the instance methods listed below. There is no Tensor.zeros / Tensor.ones / Tensor.reshape / Tensor.transpose / Tensor.view: reshape is a free function reshape(t, shape), and everything else is built on the ops API. For a zero-filled tensor, call Tensor.from_shape(q, shape, dtype, fill=0) (the fill argument is accepted and forwarded to the backend).

Construction

Tensor is a dataclass. The two public entry points are:

`Tensor.from_host(queue, data, dtype=None, backend=None, async_copy=None, use_pinned=None)`

Copies a host-side np.ndarray (or anything np.asarray accepts) onto the device and returns a Tensor that owns the destination buffer.

import numpy as np
from netcl.core.tensor import Tensor
from netcl.core.device import manager

q = manager.default("auto").queue
host_array = np.arange(16, dtype=np.float64).reshape(4, 4)  # any array-like works

a = Tensor.from_host(q, np.eye(4, dtype=np.float32))     # float32 by default
b = Tensor.from_host(q, host_array, dtype="float16")     # cast on copy

Arguments:

Parameter	Type	Default	Meaning
`queue`	`cl.CommandQueue` \| `CPUQueue` \| `None`	required	The destination queue. If `None`, the thread-local active device is used.
`data`	array-like	required	Anything convertible to a NumPy array.
`dtype`	`str` \| `None`	`None`	One of `"float"`, `"float32"`, `"half"`, `"float16"`, `"float64"`, `"double"`. `None` is treated as `"float32"`.
`backend`	`str` \| `None`	inferred	`"cl"` or `"cpu"`. Defaults to `queue.backend`.
`async_copy`	`bool` \| `None`	env default	Non-blocking H2D when `True`. Overridden by `NETCL_ASYNC_H2D`.
`use_pinned`	`bool` \| `None`	env default	Route through the pinned-host pool for faster DMA. Overridden by `NETCL_PINNED_H2D`.

The returned Tensor carries a pending_event and (when the copy is async) a pending_release handle into the pinned pool, both of which are cleared on wait() or to_host().

`Tensor.from_shape(queue, shape, dtype="float32", fill=0, pool=None, backend=None)`

Allocates a fresh buffer of the requested shape and returns a Tensor wired to it. The returned tensor has pool_handle set when the allocation came from a pool (the default on the OpenCL path), so the buffer is recycled when the tensor is destroyed.

q = manager.default("auto").queue
x = Tensor.from_shape(q, (4, 8), dtype="float32")

Parameter	Type	Default	Meaning
`queue`	queue \| `None`	required	Destination queue; falls back to the thread-local active device.
`shape`	`Sequence[int]`	required	Logical shape. Each element is cast to `int`.
`dtype`	`str`	`"float32"`	Same set of dtype names as `from_host`.
`fill`	scalar	`0`	Initial value. The `OpenCLBackend` zeroes the buffer; the `CPUBackend` allocates a NumPy `zeros` array.
`pool`	`BufferPool` \| `None`	`None`	Override the default pool (`get_persistent_pool(queue)` on the OpenCL path).
`backend`	`str` \| `None`	inferred	`"cl"` or `"cpu"`.

The __init__ of Tensor is also reachable directly (every field of the dataclass is a keyword argument) but should not be used by application code: the dataclass does no allocation, no pool wiring, and no event setup, so the resulting tensor cannot be released back to a pool. Always go through the two classmethods above.

Attributes

Attribute	Type	Meaning
`buffer`	`cl.Buffer` \| `None`	The OpenCL buffer. `None` on the CPU backend.
`shape`	`tuple[int, ...]`	Logical shape.
`dtype`	`str`	One of `"float32"`, `"float16"`, `"float64"`.
`context`	`cl.Context` \| `None`	OpenCL context. `None` for CPU tensors.
`queue`	`cl.CommandQueue` \| `CPUQueue`	Queue the tensor was created on.
`pool_handle`	`BufferHandle` \| `None`	The BufferPool lease, if any. Released in `__del__`.
`persistent`	`bool`	If `True`, the buffer is not recycled back to the pool on release. Use for weights you want to keep alive across many iterations.
`requires_grad`	`bool`	Autograd flag. When `True`, any operation that consumes this tensor records a node in the Tape.
`grad`	`Tensor` \| `None`	Accumulated gradient (filled by `tape.backward(loss)`).
`grad_fn`	`callable` \| `None`	The local-gradient callback registered with the Tape.
`pending_event`	`cl.Event` \| `None`	Outstanding H2D copy event that `wait()` will block on.
`pending_release`	`BufferHandle` \| `None`	Pinned-pool lease that should be released once `pending_event` completes.
`host_ref`	`np.ndarray` \| `None`	Last host snapshot retained while an async copy is in flight.
`array`	`np.ndarray` \| `None`	The actual NumPy array on the CPU backend.
`backend`	`str` (property)	`"cl"` or `"cpu"`. Read-only view onto the underlying `DeviceBackend`.
`size`	`int` (property)	Total number of elements (`prod(shape)`).

Methods

`to_host() -> np.ndarray`

Synchronously reads the tensor's storage into a NumPy array. On the OpenCL path it first calls wait() to flush any pending H2D copy, then issues an enqueue_copy into a freshly allocated or zero-copy-mapped host buffer. On the CPU path it returns self.array directly (no copy), so mutating the returned array mutates the tensor.

arr = a.to_host()

`to_host_async() -> tuple[np.ndarray, cl.Event | None]`

Like to_host, but returns the array and the OpenCL event that marks the copy as complete, so the caller can chain further work without waiting.

`wait() -> None`

Blocks until the pending H2D copy (if any) finishes, then releases the pinned lease and clears the pending fields. Safe to call multiple times.

`_clear_pending() -> None`

Internal helper called by wait(). Releases the pinned lease and nulls pending_event, pending_release, host_ref. Application code should call wait() instead.
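
The wait() / _clear_pending() contract described above can be illustrated with a small stand-in. This is a sketch under stated assumptions, not netcl's implementation: PendingCopy is a hypothetical class, threading.Event stands in for a cl.Event, and the release callback stands in for returning the pinned lease. It shows why a second wait() is harmless: by then every pending field is already None.

```python
import threading


class PendingCopy:
    """Hypothetical stand-in for the pending-copy state a Tensor carries."""

    def __init__(self, event, release, host_ref=None):
        self.pending_event = event        # stands in for a cl.Event
        self.pending_release = release    # callable that returns the pinned lease
        self.host_ref = host_ref          # host data kept alive during the copy

    def wait(self):
        # Block only if a copy is still outstanding, then drop all pending state.
        if self.pending_event is not None:
            self.pending_event.wait()
        self._clear_pending()

    def _clear_pending(self):
        # Release the lease at most once, then null every pending field.
        if self.pending_release is not None:
            self.pending_release()
        self.pending_event = None
        self.pending_release = None
        self.host_ref = None


releases = []
done = threading.Event()
done.set()  # pretend the H2D copy has already completed

pending = PendingCopy(done, lambda: releases.append("pinned"), host_ref=[1.0, 2.0])
pending.wait()
pending.wait()  # second call is a no-op: nothing left to wait on or release
```

After both calls, the release callback has run exactly once.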

`__del__() -> None`

Calls wait() to drain any outstanding copy, then releases the pool_handle if one exists. This is the moment a BufferPool bucket actually sees the buffer come back.

Note: __del__ is best-effort. CPython runs __del__ as soon as the last reference to the tensor disappears, but a tensor caught in a reference cycle is only finalized when the cycle collector runs, and any long-lived reference postpones release for as long as it exists. If you need deterministic release, drop every Python reference to the tensor explicitly, and call gc.collect() inside tight loops where cycles are possible.
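
The timing difference between a plain reference drop and a reference cycle can be seen with a toy class. This is a sketch that assumes CPython; Lease is a hypothetical stand-in for a tensor holding a pool lease, and the released list stands in for the pool's free list.

```python
import gc

released = []  # stands in for the pool's free list


class Lease:
    """Toy stand-in for a tensor holding a pool lease (not the netcl class)."""

    def __init__(self, name):
        self.name = name
        self.partner = None

    def __del__(self):
        # Hand the "buffer" back when the object is finalized.
        released.append(self.name)


# Case 1: the only reference is dropped, so CPython finalizes immediately.
a = Lease("a")
del a
after_del = list(released)

# Case 2: two objects reference each other, so their refcounts never reach
# zero on their own; they are finalized when the cycle collector runs.
b, c = Lease("b"), Lease("c")
b.partner, c.partner = c, b
del b, c
gc.collect()
after_collect = sorted(released)
```

In the second case the buffers come back only once the collector has run, which is why an explicit gc.collect() helps in tight allocation loops.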

Free Function: `reshape(t, shape)`

There is no Tensor.reshape method. The reshape primitive is a free function in netcl.core.tensor:

from netcl.core.tensor import reshape

flat = Tensor.from_shape(q, (16,), dtype="float32")
sq = reshape(flat, (4, 4))    # view, no copy

reshape returns a new Tensor that shares the same storage but has the requested shape and exposes a _base attribute pointing back to the original tensor. The OpenCL backend implements this as a metadata-only change; the CPU backend calls ndarray.reshape.

The number of elements must be the same. There is no automatic flatten or infer_size.
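
The element-count rule and the shared-storage behaviour can be sketched without a device. ToyTensor and toy_reshape below are hypothetical stand-ins written for illustration; they are not the netcl classes, and the exact exception netcl raises on a mismatch is an assumption.

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ToyTensor:
    """Hypothetical stand-in: shared storage plus a logical shape."""
    storage: list
    shape: Tuple[int, ...]
    _base: Optional["ToyTensor"] = None


def toy_reshape(t, shape):
    new_shape = tuple(int(d) for d in shape)
    # The total element count must match exactly; nothing is inferred.
    if math.prod(new_shape) != math.prod(t.shape):
        raise ValueError(f"cannot reshape {t.shape} into {new_shape}")
    # Same storage object, new shape, back-pointer to the source tensor.
    return ToyTensor(storage=t.storage, shape=new_shape, _base=t)


flat = ToyTensor(storage=[0.0] * 12, shape=(12,))
mat = toy_reshape(flat, (3, 4))
mat.storage[0] = 7.0   # visible through `flat`, because storage is shared
```

Because the view shares storage with its base, a write through either tensor is visible through the other.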

Dtype Mapping

Tensor accepts the following dtype strings (case-sensitive):

String	NumPy dtype	Bytes	OpenCL C type	Notes
`"float"`	`float32`	4	`float`	Default.
`"float32"`	`float32`	4	`float`	Same as `"float"`.
`"half"`	`float16`	2	`half`	Requires the device to support `cl_khr_fp16` (see core for the capability probe).
`"float16"`	`float16`	2	`half`	Same as `"half"`.
`"double"`	`float64`	8	`double`	Requires the device to support `cl_khr_fp64`.
`"float64"`	`float64`	8	`double`	Same as `"double"`.

Any other dtype raises ValueError at construction time.
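
The alias table above can be read as a lookup from dtype string to canonical name and element size. The helper names below (DTYPES, normalize_dtype, nbytes) are hypothetical and exist only for this sketch; netcl's internal helpers may be organised differently.

```python
# Alias -> (canonical name, bytes per element), copied from the table above.
DTYPES = {
    "float":   ("float32", 4),
    "float32": ("float32", 4),
    "half":    ("float16", 2),
    "float16": ("float16", 2),
    "double":  ("float64", 8),
    "float64": ("float64", 8),
}


def normalize_dtype(name):
    """Return the canonical dtype string or raise ValueError (case-sensitive)."""
    try:
        return DTYPES[name][0]
    except KeyError:
        raise ValueError(f"unsupported dtype {name!r}") from None


def nbytes(shape, dtype):
    """Bytes needed for a dense tensor of `shape` and `dtype`."""
    canonical = normalize_dtype(dtype)
    count = 1
    for dim in shape:
        count *= int(dim)
    return count * DTYPES[canonical][1]
```

For example, a (4, 8) tensor of "float" needs 4 * 8 * 4 = 128 bytes, and any shape containing a zero needs none.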

Interaction with `BufferPool`

A Tensor allocated via Tensor.from_shape on the OpenCL backend holds a BufferHandle from the persistent pool by default. When the tensor is destroyed, that handle is released back to the pool, putting the buffer into the appropriate bucket for the next allocation. Tensor.from_host may also produce a pending_release for the PinnedBufferPool lease that backed the H2D copy; that is released on wait() rather than on __del__.

If you want a tensor to outlive many iterations without recycling its buffer — for example the weight buffer of a layer you intend to keep across save/load — set persistent=True after construction:

out_features, in_features = 256, 128   # example layer dimensions
w = Tensor.from_shape(q, (out_features, in_features), dtype="float32")
w.persistent = True
# Even after the last Python reference drops, the buffer is not pooled.

For pool statistics, bucket sizing, and the difference between BufferPool, PinnedBufferPool, and PersistentBufferPool, see Architecture: Memory Pool.
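
The recycle-on-release behaviour and the effect of persistent can be pictured with a toy pool. Everything here is a sketch: ToyPool is hypothetical, bytearrays stand in for device buffers, and the power-of-two bucket sizing is an assumption made for the example rather than a description of how BufferPool sizes its buckets.

```python
from collections import defaultdict


class ToyPool:
    """Hypothetical size-bucketed pool; bytearrays stand in for device buffers."""

    def __init__(self):
        self.buckets = defaultdict(list)   # bucket size -> free buffers

    @staticmethod
    def bucket_size(nbytes):
        # Assumption: round up to the next power of two (minimum 256 bytes).
        size = 256
        while size < nbytes:
            size *= 2
        return size

    def acquire(self, nbytes):
        size = self.bucket_size(nbytes)
        free = self.buckets[size]
        return free.pop() if free else bytearray(size)

    def release(self, buf, persistent=False):
        # A persistent buffer is never handed back for reuse.
        if not persistent:
            self.buckets[len(buf)].append(buf)


pool = ToyPool()
first = pool.acquire(1000)          # fresh 1024-byte buffer
pool.release(first)                 # goes back into the 1024 bucket
second = pool.acquire(900)          # same bucket, so the buffer is reused

weights = pool.acquire(1000)
pool.release(weights, persistent=True)   # not returned to the bucket
third = pool.acquire(1000)               # therefore a fresh allocation
```

The second allocation reuses the released buffer, while the persistent one never re-enters circulation.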

Autograd Integration

Three fields on Tensor are reserved for Autograd & Tape:

requires_grad — set this on a leaf tensor (typically a parameter) to record operations that consume it.
grad — populated by tape.backward(loss). If grad is already non-None when backward is called, the new gradient is accumulated into it (the standard semantics; remember to call optimizer.zero_grad() between iterations).
grad_fn — the local-gradient callback. Each ops API wrapper that is differentiable sets this to a closure that takes the upstream gradient and returns a tuple of input gradients.

A typical training step looks like:

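import netcl.autograd as ag

# model, F, opt, x, and y are assumed to be defined by the surrounding training code.
with ag.Tape() as tape:
    logits = model(x)
    loss = F.cross_entropy(logits, y)
tape.backward(loss)   # populates .grad on the parameters
opt.step()
opt.zero_grad()       # reset so the next iteration does not accumulate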

See Autograd & Tape for the full contract.
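
The roles of requires_grad, grad, and grad_fn can be made concrete with a scalar toy. This is a minimal sketch, not the netcl Tape: Scalar, mul, and backward are hypothetical names, and the backward pass handles a single operation only, which is enough to show gradient accumulation.

```python
class Scalar:
    """Hypothetical scalar stand-in for a Tensor with autograd fields."""

    def __init__(self, value, requires_grad=False):
        self.value = value
        self.requires_grad = requires_grad
        self.grad = None
        self.grad_fn = None     # set on results of differentiable ops
        self.inputs = ()


def mul(a, b):
    out = Scalar(a.value * b.value)
    out.inputs = (a, b)
    # Local gradient: takes the upstream gradient, returns one gradient per input.
    out.grad_fn = lambda upstream: (upstream * b.value, upstream * a.value)
    return out


def backward(loss):
    """One-level backward pass: enough to show accumulation into leaf.grad."""
    for leaf, g in zip(loss.inputs, loss.grad_fn(1.0)):
        if leaf.requires_grad:
            leaf.grad = g if leaf.grad is None else leaf.grad + g


w = Scalar(3.0, requires_grad=True)
x = Scalar(2.0)

backward(mul(w, x))
grad_after_one = w.grad      # d(w*x)/dw = x = 2.0

backward(mul(w, x))          # no reset in between: gradients add up
grad_after_two = w.grad      # 2.0 + 2.0 = 4.0

w.grad = None                # reset, standing in for optimizer.zero_grad()
```

Skipping the reset between iterations silently doubles the gradient here, which is the failure mode the zero_grad() reminder above guards against.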

Examples

Allocate, Move Data H2D, Move Back D2H

import numpy as np
from netcl.core.device import manager
from netcl.core.tensor import Tensor

q = manager.default("auto").queue
host = np.random.randn(3, 4).astype(np.float32)
dev = Tensor.from_host(q, host)
back = dev.to_host()
assert np.allclose(host, back)

Use a Custom Pool

from netcl.core.device import manager
from netcl.core.memory import BufferPool
from netcl.core.tensor import Tensor

q = manager.default("auto").queue
custom_pool = BufferPool(q.context)
x = Tensor.from_shape(q, (128, 128), dtype="float32", pool=custom_pool)
# x.pool_handle is a BufferHandle from `custom_pool`.

Reshape a View

from netcl.core.tensor import Tensor, reshape

# q is the queue obtained from manager.default("auto").queue, as in the first example.
flat = Tensor.from_shape(q, (12,), dtype="float32")
mat = reshape(flat, (3, 4))      # no copy
print(mat.shape, mat._base is flat)   # (3, 4)  True

fp16 with Capability Check

from netcl.core.capabilities import device_profile
from netcl.core.device import manager
from netcl.core.tensor import Tensor

q = manager.default("auto").queue
prof = device_profile(q.device)
# Fall back to float32 when the device lacks cl_khr_fp16.
if prof.has_fp16:
    x = Tensor.from_shape(q, (4, 4), dtype="float16")
else:
    x = Tensor.from_shape(q, (4, 4), dtype="float32")
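
The same fallback can be factored into a small pure function, which keeps the capability decision separate from allocation. This is a sketch: pick_dtype is a hypothetical helper, and has_fp64 is an assumed flag mirroring has_fp16 (the example above only shows has_fp16 on the profile).

```python
def pick_dtype(preferred, has_fp16, has_fp64):
    """Return `preferred` if the device can run it, else fall back to float32."""
    if preferred in ("float16", "half") and not has_fp16:
        return "float32"
    if preferred in ("float64", "double") and not has_fp64:
        return "float32"
    return preferred
```

With the profile from the example above, this would be called as pick_dtype("float16", prof.has_fp16, has_fp64=False), passing False for has_fp64 unless double support has been confirmed some other way.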

Errors and Edge Cases

from_host with dtype not in the table above raises ValueError("unsupported dtype …").
from_shape with a zero in shape is permitted and produces a 0-element tensor; its size is 0, no buffer is allocated, and pool_handle is None.
Calling to_host() while an H2D copy into the tensor is still in flight is safe; to_host() calls wait() first.
__del__ swallows every exception (the interpreter cannot do anything useful with one during teardown), so a malformed tensor will not crash the process.
On the OpenCL path, Tensor.from_shape calls get_persistent_pool(queue) if no pool is passed. That pool is process-global and bounded by NETCL_MAX_CACHED_GB.
