Architecture: Memory Pool

BufferPool (defined in core/memory.py) is netcl's power-of-two-bucketed allocator for OpenCL cl.Buffer objects. Its job is to reduce the number of clCreateBuffer calls a training loop issues. Those calls are expensive on most OpenCL implementations (NVIDIA, Intel, Apple), roughly 10–100 µs each depending on the driver, and the kernel-launch path is much faster when it can reuse a buffer that was created earlier.

Every Tensor on a GPU device owns (or borrows) a BufferHandle issued by a BufferPool. When the tensor is garbage-collected, the handle is returned to the pool; the next allocation that maps to the same bucket size receives the cached buffer without a clCreateBuffer round-trip.

Allocation flow

The decision tree for pool.allocate(nbytes) is shown below. The "hit" branch is the entire reason the pool exists; the "miss" branch falls through to a real clCreateBuffer.

Figure: BufferPool.allocate rounds nbytes up to the next power-of-two bucket, then tries the bucket's free list. A hit returns the popped BufferHandle; a miss calls clCreateBuffer with cl.mem_flags.READ_WRITE (or the flags the caller passed) and wraps it in a fresh BufferHandle.
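The hit/miss flow described above can be sketched in plain Python. This is an illustration under stated assumptions, not netcl's implementation: SketchPool, SketchHandle, and next_pow2 are hypothetical names, and the create_buffer callable stands in for the real clCreateBuffer call so that the example runs without an OpenCL device.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Any, Callable, DefaultDict, List


def next_pow2(nbytes: int) -> int:
    """Smallest power of two that is >= nbytes (1 for nbytes <= 1)."""
    return 1 if nbytes <= 1 else 1 << (nbytes - 1).bit_length()


@dataclass
class SketchHandle:  # hypothetical, mirrors the fields of BufferHandle
    buffer: Any
    nbytes: int
    bucket_size: int


class SketchPool:  # hypothetical stand-in for BufferPool
    def __init__(self, create_buffer: Callable[[int], Any]) -> None:
        self._create = create_buffer  # stands in for clCreateBuffer
        self._free: DefaultDict[int, List[SketchHandle]] = defaultdict(list)
        self.hits = 0
        self.misses = 0

    def allocate(self, nbytes: int) -> SketchHandle:
        bucket = next_pow2(nbytes)
        free_list = self._free[bucket]
        if free_list:
            # Hit: pop a cached handle and record the newly requested size.
            self.hits += 1
            handle = free_list.pop()
            handle.nbytes = nbytes
            return handle
        # Miss: create a buffer of the full bucket size and wrap it.
        self.misses += 1
        return SketchHandle(self._create(bucket), nbytes, bucket)

    def release(self, handle: SketchHandle) -> None:
        self._free[handle.bucket_size].append(handle)


pool = SketchPool(create_buffer=bytearray)  # bytearray(n) fakes a device buffer
first = pool.allocate(1000)   # miss: creates a 1024-byte buffer
pool.release(first)
second = pool.allocate(700)   # hit: the 1024-byte bucket is reused
```

Both requests land in the 1024-byte bucket, so the second call reuses the handle released by the first instead of creating a new buffer.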

`BufferHandle` lifecycle

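from dataclasses import dataclass
from typing import Optional

import pyopencl as cl


@dataclass
class BufferHandle:
    buffer:      cl.Buffer
    nbytes:      int
    bucket_size: int = 0
    pool:        Optional["BufferPool"] = None  # quoted forward reference to the pool class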

    def release(self) -> None:
        if self.pool is not None:
            self.pool.release(self)

A BufferHandle is intentionally tiny — the heavy lifting is done by the pool, not the handle. The fields are:

buffer — the actual cl.Buffer (or np.ndarray for the CPU backend).
nbytes — the size the caller asked for. May be less than bucket_size because the pool rounds up.
bucket_size — the power-of-two the pool actually allocated.
pool — back-pointer; set by the pool on allocate and release, used by release() to return the buffer to the right bucket.

release() is idempotent: the pool keeps an id() set per bucket to make sure the same handle isn't freed twice (which would corrupt the free list).
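One way to get the idempotent behaviour described above is to pair each bucket's free list with a set of id() values. The sketch below is a guess at the shape of that bookkeeping, not the code in core/memory.py; FreeLists and SketchHandle are hypothetical names.

```python
from collections import defaultdict


class SketchHandle:  # hypothetical: only the field the free list needs
    def __init__(self, bucket_size: int) -> None:
        self.bucket_size = bucket_size


class FreeLists:  # hypothetical per-bucket free lists with double-free guard
    def __init__(self) -> None:
        self._free = defaultdict(list)      # bucket size -> cached handles
        self._free_ids = defaultdict(set)   # bucket size -> id() of cached handles

    def release(self, handle: SketchHandle) -> bool:
        """Cache the handle; return False if it was already cached."""
        ids = self._free_ids[handle.bucket_size]
        if id(handle) in ids:
            return False                    # second release is a no-op
        ids.add(id(handle))
        self._free[handle.bucket_size].append(handle)
        return True

    def pop(self, bucket_size: int):
        """Hand out a cached handle, or None when the bucket is empty."""
        free = self._free[bucket_size]
        if not free:
            return None
        handle = free.pop()
        self._free_ids[bucket_size].discard(id(handle))
        return handle


lists = FreeLists()
h = SketchHandle(1024)
first_release = lists.release(h)    # cached
second_release = lists.release(h)   # ignored, free list is unchanged
```

Using id() is safe in this arrangement because the free list holds a strong reference to every handle whose id is in the set, so the id cannot be recycled for another object while it is tracked.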

Hit-rate statistics

PoolStats is a small dataclass that the pool updates on every allocate/release:

@dataclass
class PoolStats:
    hits: int = 0
    misses: int = 0
    bytes_allocated: int = 0
    bytes_cached: int = 0

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total > 0 else 0.0

For a steady-state training step (forward, backward, optimizer all producing same-size intermediates), you should see a hit rate well above 95% after a few warm-up iterations. If you see the cumulative hit rate start at 0% and climb toward 100% during the first step, that is normal: the first call into each bucket always misses.
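The warm-up effect can be reproduced with a toy model. The sketch below assumes a deliberately simple workload that is not taken from netcl: every step allocates the same number of same-bucket buffers, holds them until the end of the step, then releases them all. The simulate function is hypothetical; PoolStats is trimmed to the two counters the hit rate needs.

```python
from dataclasses import dataclass


@dataclass
class PoolStats:
    hits: int = 0
    misses: int = 0

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total > 0 else 0.0


def simulate(steps: int, buffers_per_step: int) -> PoolStats:
    """Replay a toy workload and return the cumulative statistics."""
    stats = PoolStats()
    cached = 0  # buffers currently sitting in the free list
    for _ in range(steps):
        for _ in range(buffers_per_step):
            if cached > 0:
                cached -= 1
                stats.hits += 1
            else:
                stats.misses += 1
        cached += buffers_per_step  # end of step: everything is released
    return stats


cold = simulate(steps=1, buffers_per_step=12)
warm = simulate(steps=100, buffers_per_step=12)
print(f"after 1 step:    {cold.hit_rate:.2%}")
print(f"after 100 steps: {warm.hit_rate:.2%}")
```

In this model the misses are all paid in the first step, so the cumulative rate approaches but never reaches 100%. A workload that releases buffers within a step warms up faster than this one.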

To inspect the stats at runtime:

print(pool.stats)
# PoolStats(hits=8432, misses=12, bytes_allocated=50331648, bytes_cached=41943040)
print(f"hit_rate = {pool.stats.hit_rate:.2%}")

PersistentBufferPool (see below) exposes a richer get_stats() dict including per-bucket counts.

`PinnedBufferPool` — pinned host memory for fast DMA

PinnedBufferPool is a sibling of BufferPool in core/memory.py. It allocates pinned (page-locked) host memory with cl.mem_flags.ALLOC_HOST_PTR so that the DMA controller on a discrete GPU can stream data without bouncing through pageable memory. Pinned buffers are roughly 2–4× faster for H2D/D2H on discrete GPUs, at the cost of higher host-side memory pressure (the OS cannot swap pinned pages out).

from netcl.core.memory import get_pinned_pool

pinned = get_pinned_pool(queue)
h = pinned.allocate(nbytes)         # cl.Buffer with ALLOC_HOST_PTR
# ... use h.buffer for staging H2D copies ...
h.release()

The pool is per-context (keyed by ctx.int_ptr) and is created on first use by get_pinned_pool(queue). There is no need to instantiate it yourself.

Pinned staging is enabled by default in the Tensor Backend (NETCL_PINNED_H2D=1). If you are unsure whether you want it: discrete-GPU users should leave it on; integrated-GPU users can switch it off, because the device shares RAM with the host anyway.

`PersistentBufferPool` — bounded cache with per-bucket limits

PersistentBufferPool is the advanced pool used by long-running training jobs and by the JIT Compiler. It is similar to BufferPool but with two extra guarantees:

Bounded total cache size — max_cached_bytes defaults to NETCL_MAX_CACHED_GB (4 GiB if unset). When the cache exceeds this limit, newly released buffers are not added to the free list; the GC reclaims them.
Bounded per-bucket count — max_buffers_per_bucket (default 16) prevents one bucket (e.g. the 1 MiB bucket) from hoarding the entire cache.
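The two limits amount to an admission check on release. The sketch below is an assumption about how such a check could look: BoundedCache and try_cache are hypothetical names, and the choice to reject a buffer that would push the cache past the byte limit (rather than one released after the limit is already exceeded) is a simplification.

```python
from collections import defaultdict
from typing import Any


class BoundedCache:  # hypothetical stand-in for the cache inside PersistentBufferPool
    def __init__(self, max_cached_bytes: int, max_buffers_per_bucket: int = 16) -> None:
        self.max_cached_bytes = max_cached_bytes
        self.max_buffers_per_bucket = max_buffers_per_bucket
        self.bytes_cached = 0
        self._free = defaultdict(list)  # bucket size -> cached buffers

    def try_cache(self, bucket_size: int, buffer: Any) -> bool:
        """Keep a released buffer if both limits allow it.

        Returning False means the caller drops its reference and lets
        garbage collection free the buffer.
        """
        free = self._free[bucket_size]
        if len(free) >= self.max_buffers_per_bucket:
            return False  # this bucket already holds its share
        if self.bytes_cached + bucket_size > self.max_cached_bytes:
            return False  # the total cache budget would be exceeded
        free.append(buffer)
        self.bytes_cached += bucket_size
        return True


cache = BoundedCache(max_cached_bytes=4096, max_buffers_per_bucket=2)
results = [
    cache.try_cache(1024, object()),  # accepted
    cache.try_cache(1024, object()),  # accepted
    cache.try_cache(1024, object()),  # rejected: per-bucket limit of 2
    cache.try_cache(4096, object()),  # rejected: 2048 + 4096 > 4096
    cache.try_cache(2048, object()),  # accepted: cache is now exactly full
]
```

The small limits here are chosen only to make both rejection paths visible; the defaults quoted above are far larger.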

The bucket set is also fixed, rather than covering every power of two: the sizes step by a factor of four and are aligned to common tensor sizes:

BUCKETS = [
    1024, 4096, 16384, 65536,           # 1KB .. 64KB
    262144, 1048576, 4194304,           # 256KB .. 4MB
    16777216, 67108864, 268435456,      # 16MB .. 256MB
]

Allocations larger than 256 MiB fall back to power-of-two rounding on the fly.
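The lookup rule (smallest fixed bucket that fits, else power-of-two rounding) can be sketched with the standard bisect module. The bucket_for function is a hypothetical name for illustration; the BUCKETS list is copied from above.

```python
import bisect

BUCKETS = [
    1024, 4096, 16384, 65536,
    262144, 1048576, 4194304,
    16777216, 67108864, 268435456,
]


def bucket_for(nbytes: int) -> int:
    """Smallest fixed bucket >= nbytes, or the next power of two above 256 MiB."""
    i = bisect.bisect_left(BUCKETS, nbytes)
    if i < len(BUCKETS):
        return BUCKETS[i]
    # Larger than the biggest fixed bucket: round up to a power of two.
    return 1 << (nbytes - 1).bit_length()


small = bucket_for(1025)           # 4096
medium = bucket_for(40_000_000)    # 67108864 (64 MiB)
large = bucket_for(300_000_000)    # 536870912 (512 MiB, power-of-two fallback)
```

Because the fixed buckets are a factor of four apart, a request just above a bucket boundary can occupy close to four times its size, which is the price of keeping the number of buckets small.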

from netcl.core.memory import get_persistent_pool

pp = get_persistent_pool(queue)
h = pp.allocate(40_000_000)            # rounds to 64 MiB bucket
print(pp.get_stats())
# {'hits': 42, 'misses': 3, 'hit_rate': 0.93, 'bytes_cached': 67108864, ...}

PersistentBufferPool.allocate also detects integrated GPUs (via is_integrated_gpu(queue)) and uses ALLOC_HOST_PTR for the buffer; discrete GPUs get plain READ_WRITE. This is the recommended pool for production training because it caps VRAM growth at the configured limit and reports a hit rate that can be alerted on.

Async `cl.enqueue_copy` handoff

The pool itself does not call cl.enqueue_copy; that is the Tensor Backend's job. The handoff looks like this:

# In the backend (paraphrased)
h = pool.allocate(nbytes)
cl.enqueue_copy(queue, h.buffer, host_array, is_blocking=False)
# ... enqueue kernels that read h.buffer ...
# ... wait for kernels ...
h.release()        # back to pool; not clReleaseBuffer

is_blocking=False is the default in netcl because the loss value pulled at the end of a step acts as a natural sync point. The result is a fully-async H2D/D2H pipeline that overlaps the host-side prefetch with device-side compute — a measurable win on small-batch training where the kernels themselves are short.

Integration with `Tensor`

The Tensor factory wires the pool into the user-facing type:

pool  = BufferPool(context)
h     = pool.allocate(nbytes)
t     = Tensor(buffer=h.buffer, shape=..., dtype=..., context=context,
               queue=q, pool_handle=h, persistent=False)
# ... when t is garbage-collected, t.__del__ calls h.release() ...

persistent=False (the default) means the buffer is returned to the pool on GC; persistent=True is used for parameter tensors that should outlive a single operation (e.g. model weights).
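The effect of the persistent flag on garbage collection can be sketched as follows. The classes are hypothetical stand-ins, and the behaviour shown for persistent=True (the handle is simply not returned to the pool on collection) is an assumption, since the text above only specifies the persistent=False case.

```python
import gc


class RecordingPool:  # hypothetical pool that only records what it gets back
    def __init__(self) -> None:
        self.released = []

    def release(self, handle) -> None:
        self.released.append(handle)


class SketchHandle:  # hypothetical, mirrors BufferHandle.release()
    def __init__(self, pool) -> None:
        self.pool = pool

    def release(self) -> None:
        if self.pool is not None:
            self.pool.release(self)


class SketchTensor:  # hypothetical, models only the pool wiring of Tensor
    def __init__(self, pool_handle, persistent: bool = False) -> None:
        self.pool_handle = pool_handle
        self.persistent = persistent

    def __del__(self) -> None:
        # Assumption: persistent tensors never hand their buffer back.
        if not self.persistent and self.pool_handle is not None:
            self.pool_handle.release()


pool = RecordingPool()
activation = SketchTensor(SketchHandle(pool))                # persistent=False
weights = SketchTensor(SketchHandle(pool), persistent=True)
activation_handle = activation.pool_handle

del activation, weights
gc.collect()  # not needed on CPython's reference counting, harmless elsewhere
```

Only the activation's handle ends up back in the pool; the weight tensor's handle is left alone.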

Trade-offs

| Advantage | Cost |
| --- | --- |
| Fewer `clCreateBuffer` calls (each is ~10–100 µs depending on driver) | Higher peak memory (power-of-two rounding means a 1.1 MiB tensor consumes a 2 MiB bucket) |
| Better cache locality for repeated shapes | First-touch allocation is as expensive as without a pool |
| Predictable latency on steady-state workloads | Buckets can fragment over a long run; `PersistentBufferPool.clear()` resets the cache |

For a typical 4 MiB feature map on a discrete GPU, the clCreateBuffer cost is roughly 80 µs on NVIDIA and 30 µs on Apple; a pool that hits 100% in steady state therefore saves about 80 µs × N per step on NVIDIA (30 µs × N on Apple), where N is the number of intermediate tensors.
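Both sides of the trade-off are simple arithmetic, sketched below. The helper names are hypothetical, and the figure of 200 intermediate tensors per step is an illustrative assumption rather than a measurement.

```python
MIB = 1024 * 1024


def rounding_overhead(nbytes: int):
    """Return (bucket_size, bucket_size / nbytes) under power-of-two rounding."""
    bucket = 1 if nbytes <= 1 else 1 << (nbytes - 1).bit_length()
    return bucket, bucket / nbytes


def saved_us_per_step(create_cost_us: float, n_intermediates: int,
                      hit_rate: float = 1.0) -> float:
    """Allocation time avoided per step, in microseconds."""
    return create_cost_us * n_intermediates * hit_rate


bucket, ratio = rounding_overhead(int(1.1 * MIB))
print(f"1.1 MiB request -> {bucket // MIB} MiB bucket ({ratio:.2f}x the requested size)")

# Assumed workload: 200 intermediates per step at the ~80 us NVIDIA figure.
saved = saved_us_per_step(80, 200)
print(f"saved per step: {saved / 1000:.1f} ms")
```

Under these assumptions the pool trades a little under 2× memory for the 1.1 MiB tensor against about 16 ms of avoided allocation time per step.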
