netcl.core.tensor.Tensor

Tensor is the central data structure in netcl. A Tensor wraps an OpenCL buffer (or a NumPy ndarray on the CPU backend), a logical shape, a dtype, the OpenCLBackend or CPUBackend that owns it, and an optional lease from a BufferPool. It is the type that every operator, layer, and optimizer consumes and produces.

Note — Public factory surface. The real Tensor API is the two classmethods Tensor.from_host(queue, arr) and Tensor.from_shape(queue, shape, dtype, fill=0) plus the instance methods listed below. There is no Tensor.zeros / Tensor.ones / Tensor.reshape / Tensor.transpose / Tensor.view — reshape is a free function reshape(t, shape), and everything else is built on the ops API. If you need zeros, just do Tensor.from_shape(q, shape, dtype, fill=0) (the fill argument is accepted and forwarded to the backend).

Construction

Tensor is a dataclass. The two public entry points are:

Tensor.from_host(queue, data, dtype=None, backend=None, async_copy=None, use_pinned=None)

Copies a host-side np.ndarray (or anything np.asarray accepts) onto the device and returns a Tensor that owns the destination buffer.

import numpy as np
from netcl.core.tensor import Tensor
from netcl.core.device import manager

q = manager.default("auto").queue
a = Tensor.from_host(q, np.eye(4, dtype=np.float32))     # float32 by default
host_array = np.arange(6, dtype=np.float64).reshape(2, 3)  # any host ndarray
b = Tensor.from_host(q, host_array, dtype="float16")     # cast on copy

Arguments:

Parameter   Type                        Default           Meaning
queue       cl.CommandQueue | CPUQueue  required          The destination queue. If None is passed, the thread-local active device is used.
data        array-like                  required          Anything convertible to a NumPy array.
dtype       str | None                  None ("float32")  One of "float", "float32", "half", "float16", "float64", "double". None is treated as "float32".
backend     str | None                  inferred          "cl" or "cpu". Defaults to queue.backend.
async_copy  bool | None                 env default       Non-blocking H2D when True. Overridden by NETCL_ASYNC_H2D.
use_pinned  bool | None                 env default       Route through the pinned-host pool for faster DMA. Overridden by NETCL_PINNED_H2D.

The returned Tensor carries a pending_event and (when the copy is async) a pending_release handle into the pinned pool, both of which are cleared on wait() or to_host().

Tensor.from_shape(queue, shape, dtype="float32", fill=0, pool=None, backend=None)

Allocates a fresh buffer of the requested shape and returns a Tensor wired to it. The returned tensor has pool_handle set when the allocation came from a pool (the default on the OpenCL path), so the buffer is recycled when the tensor is destroyed.

q = manager.default("auto").queue
x = Tensor.from_shape(q, (4, 8), dtype="float32")

Arguments:

Parameter  Type               Default    Meaning
queue      queue | None       required   Destination queue; falls back to the thread-local active device.
shape      Sequence[int]      required   Logical shape. Each element is cast to int.
dtype      str                "float32"  Same set of dtype names as from_host.
fill       scalar             0          Initial value. The OpenCLBackend zeroes the buffer; the CPUBackend allocates a NumPy zeros array.
pool       BufferPool | None  None       Override the default pool (get_persistent_pool(queue) on the OpenCL path).
backend    str | None         inferred   "cl" or "cpu".

The __init__ of Tensor is also reachable directly (every field of the dataclass is a keyword argument) but should not be used by application code: the dataclass does no allocation, no pool wiring, and no event setup, so the resulting tensor cannot be released back to a pool. Always go through the two classmethods above.

Attributes

Attribute        Type                        Meaning
buffer           cl.Buffer | None            The OpenCL buffer. None on the CPU backend.
shape            tuple[int, ...]             Logical shape.
dtype            str                         One of "float32", "float16", "float64".
context          cl.Context | None           OpenCL context. None for CPU tensors.
queue            cl.CommandQueue | CPUQueue  Queue the tensor was created on.
pool_handle      BufferHandle | None         The BufferPool lease, if any. Released in __del__.
persistent       bool                        If True, the buffer is not recycled back to the pool on release. Use for weights you want to keep alive across many iterations.
requires_grad    bool                        Autograd flag. When True, any operation that consumes this tensor records a node in the Tape.
grad             Tensor | None               Accumulated gradient (filled by tape.backward(loss)).
grad_fn          callable | None             The local-gradient callback registered with the Tape.
pending_event    cl.Event | None             Outstanding H2D copy event that wait() will block on.
pending_release  BufferHandle | None         Pinned-pool lease that should be released once pending_event completes.
host_ref         np.ndarray | None           Last host snapshot retained while an async copy is in flight.
array            np.ndarray | None           The actual NumPy array on the CPU backend.
backend          str (property)              "cl" or "cpu". Read-only view onto the underlying DeviceBackend.
size             int (property)              Total number of elements (prod(shape)).

Methods

to_host() -> np.ndarray

Synchronously reads the tensor's storage into a NumPy array. On the OpenCL path this calls wait() to flush the pending H2D copy (if any), then issues an enqueue_copy into a freshly allocated or zero-copy-mapped host buffer. On the CPU path it returns self.array directly (no copy), so writing to the returned array also changes the tensor.

arr = a.to_host()

to_host_async() -> tuple[np.ndarray, cl.Event | None]

Like to_host, but returns the array and the OpenCL event that marks the copy as complete, so the caller can chain further work without waiting.
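
The (array, event) return shape suggests a small consumer pattern: wait on the event before reading the array, and tolerate a missing event. The sketch below uses hypothetical stand-ins (DoneEvent, resolve) rather than netcl or pyopencl objects, and it assumes that an event of None means there is nothing left to wait for.

```python
class DoneEvent:
    """Hypothetical stand-in for cl.Event; only records that it was waited on."""

    def __init__(self):
        self.waited = False

    def wait(self):
        self.waited = True


def resolve(result):
    """Unpack an (array, event) pair and make the array safe to read."""
    arr, evt = result
    if evt is not None:  # assumption: None means nothing is pending
        evt.wait()
    return arr


evt = DoneEvent()
with_event = ([1.0, 2.0], evt)       # same shape as a result with a pending copy
without_event = ([3.0, 4.0], None)   # same shape as a result with no event
values = resolve(with_event) + resolve(without_event)
```

Work that does not touch the array can be scheduled between the call and resolve(); only the read itself has to come after the wait.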

wait() -> None

Blocks until the pending H2D copy (if any) finishes, then releases the pinned lease and clears the pending fields. Safe to call multiple times.

_clear_pending() -> None

Internal helper called by wait(). Releases the pinned lease and nulls pending_event, pending_release, host_ref. Application code should call wait() instead.
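
The relationship between wait() and _clear_pending() can be modelled in a few lines. This is a toy model under stated assumptions, not netcl's implementation: FakeEvent, FakeLease, and PendingState are hypothetical stand-ins, and only the three pending fields named above are represented.

```python
from dataclasses import dataclass
from typing import Optional


class FakeEvent:
    """Hypothetical stand-in for cl.Event; counts how often it is waited on."""

    def __init__(self):
        self.waits = 0

    def wait(self):
        self.waits += 1


class FakeLease:
    """Hypothetical stand-in for a pinned-pool lease."""

    def __init__(self):
        self.released = False

    def release(self):
        self.released = True


@dataclass
class PendingState:
    pending_event: Optional[FakeEvent] = None
    pending_release: Optional[FakeLease] = None
    host_ref: Optional[list] = None

    def wait(self):
        # Block on the outstanding copy, if there is one.
        if self.pending_event is not None:
            self.pending_event.wait()
        self._clear_pending()

    def _clear_pending(self):
        # Release the pinned lease, then null every pending field.
        if self.pending_release is not None:
            self.pending_release.release()
        self.pending_event = None
        self.pending_release = None
        self.host_ref = None


event, lease = FakeEvent(), FakeLease()
state = PendingState(pending_event=event, pending_release=lease, host_ref=[1.0, 2.0])
state.wait()
state.wait()  # second call finds nothing pending and does no further work
```

Because the fields are nulled after the first call, repeated calls are harmless, which is the property described for wait() above.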

__del__() -> None

Calls wait() to drain any outstanding copy, then releases the pool_handle if one exists. This is the moment a BufferPool bucket actually sees the buffer come back.

Note — __del__ is best-effort. The Python interpreter only calls __del__ when the refcount drops to zero. Long-lived references (including cycles that the GC breaks) can delay release. If you need deterministic release, drop every Python reference to the tensor explicitly and call gc.collect() in tight loops.
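
The difference between a plain reference and a reference cycle can be seen without any device code. The sketch below assumes CPython (reference counting plus a cycle collector); Leased is a hypothetical stand-in for a tensor that returns its buffer in __del__.

```python
import gc

released = []


class Leased:
    """Hypothetical stand-in for an object that releases a resource in __del__."""

    def __init__(self, name):
        self.name = name
        self.other = None

    def __del__(self):
        released.append(self.name)


# Case 1: no cycle. CPython runs __del__ as soon as the last reference goes away.
t = Leased("plain")
del t
after_del = list(released)

# Case 2: a reference cycle. The refcount never reaches zero on its own,
# so the release waits for the cycle collector.
gc.disable()  # keep the collector from running by itself during the demo
a, b = Leased("cycle_a"), Leased("cycle_b")
a.other, b.other = b, a
del a, b
before_collect = list(released)
gc.collect()
gc.enable()
after_collect = sorted(released)
```

In the second case nothing is released until gc.collect() runs, which is why an explicit collection can matter in loops that create cycles.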

Free Function: reshape(t, shape)

There is no Tensor.reshape method. The reshape primitive is a free function in netcl.core.tensor:

from netcl.core.tensor import reshape

flat = Tensor.from_shape(q, (16,), dtype="float32")
sq = reshape(flat, (4, 4))    # view, no copy

reshape returns a new Tensor that shares the same storage but has the requested shape and exposes a _base attribute pointing back to the original tensor. The OpenCL backend implements this as a metadata-only change; the CPU backend calls ndarray.reshape.

The total number of elements must not change: the product of the new shape has to equal t.size. There is no automatic flatten or infer_size.
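
The element-count rule can be checked before calling reshape. check_reshape below is a hypothetical helper, not part of netcl. It assumes that "no infer_size" means a wildcard dimension such as -1 is not resolved, so it rejects negative dimensions outright.

```python
import math


def check_reshape(old_shape, new_shape):
    """Return new_shape as a tuple of ints, or raise if the element counts differ.

    Hypothetical helper for illustration only.
    """
    new_shape = tuple(int(d) for d in new_shape)
    if any(d < 0 for d in new_shape):
        # Assumption: no wildcard dimension is inferred.
        raise ValueError(f"negative dimension in {new_shape}")
    if math.prod(old_shape) != math.prod(new_shape):
        raise ValueError(f"cannot reshape {tuple(old_shape)} to {new_shape}")
    return new_shape


ok = check_reshape((16,), (4, 4))  # 16 elements on both sides
```

A call such as check_reshape((16,), (3, 5)) raises ValueError, because 15 elements cannot hold the original 16.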

Dtype Mapping

Tensor accepts the following dtype strings (case-sensitive):

String     NumPy dtype  Bytes  OpenCL C type  Notes
"float"    float32      4      float          Default.
"float32"  float32      4      float          Same as "float".
"half"     float16      2      half           Requires the device to support cl_khr_fp16 (see core for the capability probe).
"float16"  float16      2      half           Same as "half".
"double"   float64      8      double         Requires the device to support cl_khr_fp64.
"float64"  float64      8      double         Same as "double".

Any other dtype raises ValueError at construction time.
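
The alias table translates directly into a lookup, which is also a convenient way to estimate buffer sizes. The names _DTYPES, normalize_dtype, and nbytes are hypothetical and exist only for this sketch; the values are copied from the dtype table in this section.

```python
import math

# Alias -> (canonical name, bytes per element), copied from the dtype table.
_DTYPES = {
    "float": ("float32", 4),
    "float32": ("float32", 4),
    "half": ("float16", 2),
    "float16": ("float16", 2),
    "double": ("float64", 8),
    "float64": ("float64", 8),
}


def normalize_dtype(name):
    """Map a dtype string to (canonical name, itemsize); lookup is case-sensitive."""
    try:
        return _DTYPES[name]
    except KeyError:
        raise ValueError(f"unsupported dtype {name!r}") from None


def nbytes(shape, dtype):
    """Bytes needed for a dense buffer of the given shape and dtype."""
    _, itemsize = normalize_dtype(dtype)
    return math.prod(int(d) for d in shape) * itemsize


weights_bytes = nbytes((128, 128), "float32")
```

Because the lookup is case-sensitive, "Float32" is rejected just like an unknown name, and a shape containing a zero yields a size of 0 bytes.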

Interaction with BufferPool

A Tensor allocated via Tensor.from_shape on the OpenCL backend holds a BufferHandle from the persistent pool by default. When the tensor is destroyed, that handle is released back to the pool, putting the buffer into the appropriate bucket for the next allocation. Tensor.from_host may also produce a pending_release for the PinnedBufferPool lease that backed the H2D copy; that is released on wait() rather than on __del__.

If you want a tensor to outlive many iterations without recycling its buffer — for example the weight buffer of a layer you intend to keep across save/load — set persistent=True after construction:

w = Tensor.from_shape(q, (out_features, in_features), dtype="float32")
w.persistent = True
# Even after the last Python reference drops, the buffer is not pooled.

For pool statistics, bucket sizing, and the difference between BufferPool, PinnedBufferPool, and PersistentBufferPool, see Architecture: Memory Pool.
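
The recycle-on-destroy behaviour and the effect of persistent can be illustrated with a toy pool. ToyPool and ToyTensor are hypothetical and deliberately simplified: buckets are keyed by exact byte size, buffers are plain bytearrays, and the example relies on CPython running __del__ as soon as the last reference is dropped. netcl's real pools may size buckets and track leases differently.

```python
class ToyPool:
    """Hypothetical bucketed pool keyed by byte size; not netcl's BufferPool."""

    def __init__(self):
        self.free = {}   # nbytes -> list of recycled buffers
        self.fresh = 0   # number of fresh allocations made so far

    def acquire(self, nbytes):
        bucket = self.free.get(nbytes)
        if bucket:
            return bucket.pop()      # reuse a recycled buffer
        self.fresh += 1
        return bytearray(nbytes)     # nothing to reuse: allocate

    def release(self, buf):
        self.free.setdefault(len(buf), []).append(buf)


class ToyTensor:
    """Hypothetical tensor that returns its buffer to the pool when destroyed."""

    def __init__(self, pool, nbytes):
        self.pool = pool
        self.buf = pool.acquire(nbytes)
        self.persistent = False

    def __del__(self):
        # Recycle on destruction unless the tensor was marked persistent.
        if not self.persistent:
            self.pool.release(self.buf)


pool = ToyPool()
a = ToyTensor(pool, 64)
del a                        # buffer goes back into the 64-byte bucket
b = ToyTensor(pool, 64)      # reuses it, so there is still one fresh allocation
reused = pool.fresh == 1
b.persistent = True
del b                        # not recycled
c = ToyTensor(pool, 64)      # bucket is empty, so a second allocation is needed
```

In this model a persistent buffer simply never re-enters a bucket, so it cannot be handed to another tensor.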

Autograd Integration

Three fields on Tensor are reserved for Autograd & Tape:

  • requires_grad — set this on a leaf tensor (typically a parameter) to record operations that consume it.
  • grad — populated by tape.backward(loss). If grad is already non-None when backward is called, the new gradient is accumulated into it (the standard semantics; remember to call optimizer.zero_grad() between iterations).
  • grad_fn — the local-gradient callback. Each ops API wrapper that is differentiable sets this to a closure that takes the upstream gradient and returns a tuple of input gradients.
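
The three fields can be seen working together in a scalar toy model. Scalar, mul, and backward below are hypothetical stand-ins, not netcl's Tape or ops API; they only illustrate a grad_fn closure that maps an upstream gradient to a tuple of input gradients, and the accumulate-into-grad behaviour.

```python
class Scalar:
    """Hypothetical one-element stand-in for a Tensor with autograd fields."""

    def __init__(self, value, requires_grad=False):
        self.value = value
        self.requires_grad = requires_grad
        self.grad = None
        self.grad_fn = None
        self.inputs = ()


def mul(a, b):
    out = Scalar(a.value * b.value)
    out.inputs = (a, b)
    # Local gradient: upstream gradient in, one gradient per input out.
    out.grad_fn = lambda upstream: (upstream * b.value, upstream * a.value)
    return out


def backward(out, upstream=1.0):
    for inp, g in zip(out.inputs, out.grad_fn(upstream)):
        if inp.requires_grad:
            # Accumulate instead of overwriting an existing gradient.
            inp.grad = g if inp.grad is None else inp.grad + g


w = Scalar(3.0, requires_grad=True)
x = Scalar(2.0)
backward(mul(w, x))
first = w.grad           # 2.0, since d(w*x)/dw = x
backward(mul(w, x))      # gradients are not cleared in between
second = w.grad          # 4.0: the two gradients were summed
w.grad = None            # what clearing gradients amounts to in this toy model
```

The second backward pass doubles the stored gradient, which is the reason gradients need to be cleared between iterations.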

A typical training step looks like:

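import netcl.autograd as ag

# model, F, x, y, and opt are assumed to be defined by the surrounding training code.
with ag.Tape() as tape:
    logits = model(x)
    loss = F.cross_entropy(logits, y)
tape.backward(loss)   # populates .grad on the recorded parameters
opt.step()
opt.zero_grad()       # clear gradients so the next iteration does not accumulate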

See Autograd & Tape for the full contract.

Examples

Allocate, Move Data H2D, Move Back D2H

import numpy as np
from netcl.core.device import manager
from netcl.core.tensor import Tensor

q = manager.default("auto").queue
host = np.random.randn(3, 4).astype(np.float32)
dev = Tensor.from_host(q, host)
back = dev.to_host()
assert np.allclose(host, back)

Use a Custom Pool

from netcl.core.device import manager
from netcl.core.memory import BufferPool
from netcl.core.tensor import Tensor

q = manager.default("auto").queue
custom_pool = BufferPool(q.context)
x = Tensor.from_shape(q, (128, 128), dtype="float32", pool=custom_pool)
# x.pool_handle is a BufferHandle from `custom_pool`.

Reshape a View

from netcl.core.device import manager
from netcl.core.tensor import Tensor, reshape

q = manager.default("auto").queue

flat = Tensor.from_shape(q, (12,), dtype="float32")
mat = reshape(flat, (3, 4))      # no copy
print(mat.shape, mat._base is flat)   # (3, 4)  True

fp16 with Capability Check

from netcl.core.capabilities import device_profile
from netcl.core.device import manager
from netcl.core.tensor import Tensor

q = manager.default("auto").queue
prof = device_profile(q.device)
if prof.has_fp16:
    x = Tensor.from_shape(q, (4, 4), dtype="float16")
else:
    x = Tensor.from_shape(q, (4, 4), dtype="float32")

Errors and Edge Cases

  • from_host with dtype not in the table above raises ValueError("unsupported dtype …").
  • from_shape with a zero in shape is permitted and produces a 0-element tensor; its size is 0, no buffer is allocated, and pool_handle is None.
  • Calling to_host() while an H2D copy into the tensor is still in flight is safe; to_host() calls wait() first.
  • __del__ swallows every exception (the interpreter cannot do anything useful with one during teardown), so a malformed tensor will not crash the process.
  • On the OpenCL path, Tensor.from_shape calls get_persistent_pool(queue) if no pool is passed. That pool is process-global and bounded by NETCL_MAX_CACHED_GB.

See also

  • core API: DeviceManager, BufferPool, OpenCLBackend, CPUBackend, and the KernelSelector autotuner.
  • ops API — the operations that consume and produce Tensor.
  • Autograd & Tape — how requires_grad, grad, and grad_fn are wired into the backward pass.
  • JIT Compiler — how generated kernels are compiled against this tensor's queue.
  • Tensor Backend — the bigger picture of buffers, queues, and the OpenCL transport.
  • Memory Pool — bucket sizing and the choice between simple and persistent pools.