Architecture: Memory Pool
BufferPool (defined in core/memory.py) is netcl's
power-of-two-bucketed allocator for OpenCL
cl.Buffer objects. Its job is to
reduce the number of clCreateBuffer calls a training loop issues. Those
calls are expensive on most OpenCL implementations (NVIDIA, Intel, Apple),
roughly 10–100 µs each depending on the driver, and the kernel-launch path
is much faster when it can reuse a buffer that was created earlier.
Every Tensor on a GPU device owns (or borrows) a
BufferHandle issued by a BufferPool. When the tensor
is garbage-collected, the handle is returned to the pool; on the next
allocation of the same bucket size, the cached buffer is handed out
again without an actual clCreateBuffer round-trip.
Allocation flow
The decision tree for pool.allocate(nbytes) is shown below. The
"hit" branch is the entire reason the pool exists; the "miss" branch
falls through to a real clCreateBuffer.
Caption — BufferPool.allocate rounds nbytes up to the next
power-of-two bucket, then tries the bucket's free list. A hit returns
the popped BufferHandle; a miss calls clCreateBuffer with
cl.mem_flags.READ_WRITE (or the flags the caller passed) and wraps
it in a fresh BufferHandle.
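The flow described above can be sketched in plain Python. This is an illustration only, not netcl's implementation: `ToyPool` and `next_power_of_two` are hypothetical names, and a `bytearray` stands in for the `cl.Buffer` that a real miss would create through clCreateBuffer.

```python
from collections import defaultdict


def next_power_of_two(nbytes: int) -> int:
    """Smallest power of two that is >= nbytes (minimum 1)."""
    if nbytes <= 1:
        return 1
    return 1 << (nbytes - 1).bit_length()


class ToyPool:
    """Hypothetical stand-in for BufferPool; bytearray replaces cl.Buffer."""

    def __init__(self):
        self.free = defaultdict(list)  # bucket size -> cached buffers
        self.hits = 0
        self.misses = 0

    def allocate(self, nbytes: int):
        bucket = next_power_of_two(nbytes)
        free_list = self.free[bucket]
        if free_list:
            # Hit: pop a cached buffer, no new allocation needed.
            self.hits += 1
            return bucket, free_list.pop()
        # Miss: a real pool would call clCreateBuffer here.
        self.misses += 1
        return bucket, bytearray(bucket)

    def release(self, bucket: int, buf) -> None:
        # Return the buffer to the free list of its bucket.
        self.free[bucket].append(buf)
```

Two requests of different sizes that round to the same bucket share one buffer once the first has been released, which is the case the "hit" branch is designed for.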
BufferHandle lifecycle
@dataclass
class BufferHandle:
    buffer: cl.Buffer
    nbytes: int
    bucket_size: int = 0
    pool: Optional[BufferPool] = None

    def release(self) -> None:
        if self.pool is not None:
            self.pool.release(self)
A BufferHandle is intentionally tiny — the heavy lifting is done by
the pool, not the handle. The fields are:
- buffer: the actual cl.Buffer (or np.ndarray for the CPU backend).
- nbytes: the size the caller asked for. May be less than bucket_size because the pool rounds up.
- bucket_size: the power-of-two size the pool actually allocated.
- pool: back-pointer; set by the pool on allocate and release, used by release() to return the buffer to the right bucket.
release() is idempotent: the pool keeps an id() set per bucket to
make sure the same handle isn't freed twice (which would corrupt the
free list).
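One way to picture the double-free guard is the sketch below. It assumes the pool tracks `id(handle)` per bucket exactly as the paragraph describes; the class names `Handle` and `GuardedPool`, the `pop` method, and the use of `bytearray` in place of `cl.Buffer` are hypothetical choices made for this example, not netcl's actual code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Handle:
    """Hypothetical stand-in for BufferHandle; buffer is a bytearray here."""
    buffer: bytearray
    nbytes: int
    bucket_size: int = 0
    pool: Optional["GuardedPool"] = None

    def release(self) -> None:
        if self.pool is not None:
            self.pool.release(self)


class GuardedPool:
    def __init__(self):
        self.free = {}      # bucket size -> list of cached handles
        self.free_ids = {}  # bucket size -> set of id(handle) currently cached

    def release(self, handle: Handle) -> None:
        ids = self.free_ids.setdefault(handle.bucket_size, set())
        if id(handle) in ids:
            return  # already on the free list: a second release is a no-op
        ids.add(id(handle))
        self.free.setdefault(handle.bucket_size, []).append(handle)

    def pop(self, bucket_size: int) -> Optional[Handle]:
        handles = self.free.get(bucket_size)
        if not handles:
            return None
        handle = handles.pop()
        # Forget the id so the handle can be released again after reuse.
        self.free_ids[bucket_size].discard(id(handle))
        return handle
```

Without the id set, calling `release()` twice would put the same handle on the free list twice, and two later allocations would be handed the same buffer.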
Hit-rate statistics
PoolStats is a small dataclass that the pool updates on every
allocate/release:
@dataclass
class PoolStats:
    hits: int = 0
    misses: int = 0
    bytes_allocated: int = 0
    bytes_cached: int = 0

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total > 0 else 0.0
For a steady-state training step (forward, backward, and optimizer all producing same-size intermediates), you should see a hit rate well above 95% after a few warm-up iterations. A hit rate that starts at 0% during the first step and then climbs toward 100% is normal: the first call into each bucket always misses.
To inspect the stats at runtime:
print(pool.stats)
# PoolStats(hits=8432, misses=12, bytes_allocated=50331648, bytes_cached=41943040)
print(f"hit_rate = {pool.stats.hit_rate:.2%}")
PersistentBufferPool (see below) exposes a richer get_stats()
dict including per-bucket counts.
PinnedBufferPool — pinned host memory for fast DMA
PinnedBufferPool is a sibling of BufferPool in core/memory.py.
It allocates pinned (page-locked) host memory with
cl.mem_flags.ALLOC_HOST_PTR so that the DMA controller on a discrete
GPU can stream data without bouncing through pageable memory. Pinned
buffers are roughly 2–4× faster for H2D/D2H on discrete GPUs, at the
cost of higher host-side memory pressure (the OS cannot swap pinned
pages out).
from netcl.core.memory import get_pinned_pool
pinned = get_pinned_pool(queue)
h = pinned.allocate(nbytes) # handle; h.buffer is a cl.Buffer created with ALLOC_HOST_PTR
# ... use h.buffer for staging H2D copies ...
h.release()
The pool is per-context (keyed by ctx.int_ptr) and is created on
first use by get_pinned_pool(queue). There is no need to instantiate
it yourself.
If you are unsure whether you want pinned memory, keep the Tensor Backend default, which is on (NETCL_PINNED_H2D=1). Discrete-GPU users should leave it on; integrated-GPU users can switch it off because the device shares RAM with the host anyway.
PersistentBufferPool — bounded cache with per-bucket limits
PersistentBufferPool is the advanced pool used by long-running
training jobs and by the JIT Compiler.
It is similar to BufferPool but with two extra guarantees:
- Bounded total cache size: max_cached_bytes defaults to NETCL_MAX_CACHED_GB (4 GiB if unset). When the cache exceeds this limit, newly released buffers are not added to the free list; the GC reclaims them.
- Bounded per-bucket count: max_buffers_per_bucket (default 16) prevents one bucket (e.g. the 1 MiB bucket) from hoarding the entire cache.
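The two limits can be read as an admission check on release. The sketch below is a guess at how such a check might look, not the netcl source: `BoundedCache` is a hypothetical class, the parameter names are borrowed from the text, and it assumes a buffer is rejected when caching it would push the total past `max_cached_bytes`.

```python
from collections import defaultdict


class BoundedCache:
    """Hypothetical sketch of the two limits; not netcl's actual code."""

    def __init__(self, max_cached_bytes: int, max_buffers_per_bucket: int = 16):
        self.max_cached_bytes = max_cached_bytes
        self.max_buffers_per_bucket = max_buffers_per_bucket
        self.free = defaultdict(list)  # bucket size -> cached buffers
        self.bytes_cached = 0

    def release(self, bucket_size: int, buf) -> bool:
        """Return True if the buffer was cached, False if it was dropped."""
        over_total = self.bytes_cached + bucket_size > self.max_cached_bytes
        bucket_full = len(self.free[bucket_size]) >= self.max_buffers_per_bucket
        if over_total or bucket_full:
            # Not cached: the caller drops its reference and the GC reclaims it.
            return False
        self.free[bucket_size].append(buf)
        self.bytes_cached += bucket_size
        return True
```

Either limit alone is enough to reject a buffer, so a single popular bucket stops growing at `max_buffers_per_bucket` even when the total cache is far below its byte limit.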
The bucket set is also a fixed list rather than every power of two: each bucket is 4× the previous one, aligned to common tensor sizes:
BUCKETS = [
    1024, 4096, 16384, 65536,          # 1 KiB .. 64 KiB
    262144, 1048576, 4194304,          # 256 KiB .. 4 MiB
    16777216, 67108864, 268435456,     # 16 MiB .. 256 MiB
]
Allocations larger than 256 MiB fall back to power-of-two rounding on the fly.
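Bucket selection over this list can be sketched with a binary search. The helper name `bucket_for` is hypothetical, and the sketch assumes that requests below 1 KiB land in the smallest bucket, which the text does not state.

```python
from bisect import bisect_left

BUCKETS = [
    1024, 4096, 16384, 65536,
    262144, 1048576, 4194304,
    16777216, 67108864, 268435456,
]


def bucket_for(nbytes: int) -> int:
    """Smallest fixed bucket >= nbytes, else the next power of two."""
    i = bisect_left(BUCKETS, nbytes)
    if i < len(BUCKETS):
        return BUCKETS[i]
    # Larger than the biggest fixed bucket: round up to a power of two.
    return 1 << (nbytes - 1).bit_length()


print(bucket_for(40_000_000))   # 67108864 (the 64 MiB bucket)
print(bucket_for(300_000_000))  # 536870912 (512 MiB, power-of-two fallback)
```

Because each fixed bucket is 4× the previous one, a request just above a bucket boundary can occupy close to four times the memory it asked for; that is the price of keeping the number of distinct buckets small.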
from netcl.core.memory import get_persistent_pool
pp = get_persistent_pool(queue)
h = pp.allocate(40_000_000) # rounds to 64 MiB bucket
print(pp.get_stats())
# {'hits': 42, 'misses': 3, 'hit_rate': 0.93, 'bytes_cached': 67108864, ...}
PersistentBufferPool.allocate also detects integrated GPUs (via
is_integrated_gpu(queue)) and creates their buffers with ALLOC_HOST_PTR;
discrete GPUs get plain READ_WRITE. This is the recommended pool
for production training because it caps VRAM growth at the configured
limit and reports a hit rate that can be alerted on.
Async cl.enqueue_copy handoff
The pool itself does not call cl.enqueue_copy; that is the
Tensor Backend's job. The handoff
looks like this:
# In the backend (paraphrased)
h = pool.allocate(nbytes)
cl.enqueue_copy(queue, h.buffer, host_array, is_blocking=False)
# ... enqueue kernels that read h.buffer ...
# ... wait for kernels ...
h.release() # back to pool; not clReleaseBuffer
is_blocking=False is the default in netcl because the loss value
pulled at the end of a step acts as a natural sync point. The result
is a fully-async H2D/D2H pipeline that overlaps the host-side
prefetch with device-side compute — a measurable win on small-batch
training where the kernels themselves are short.
Integration with Tensor
The Tensor factory wires the pool into the user-facing type:
pool = BufferPool(context)
h = pool.allocate(nbytes)
t = Tensor(buffer=h.buffer, shape=..., dtype=..., context=context,
           queue=q, pool_handle=h, persistent=False)
# ... when t is garbage-collected, t.__del__ calls h.release() ...
persistent=False (the default) means the buffer is returned to the
pool on GC; persistent=True is used for parameter tensors that
should outlive a single operation (e.g. model weights).
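The release-on-GC behaviour can be imitated without OpenCL. In the sketch below, `FakePool`, `FakeHandle`, and `FakeTensor` are hypothetical stand-ins, and it assumes that a tensor created with persistent=True simply skips the pool return in `__del__`; the real Tensor may handle persistent buffers differently.

```python
import gc


class FakePool:
    """Hypothetical pool that only records which handles came back."""

    def __init__(self):
        self.returned = []

    def release(self, handle) -> None:
        self.returned.append(handle)


class FakeHandle:
    def __init__(self, pool: FakePool, nbytes: int):
        self.pool = pool
        self.nbytes = nbytes

    def release(self) -> None:
        self.pool.release(self)


class FakeTensor:
    def __init__(self, pool_handle: FakeHandle, persistent: bool = False):
        self.pool_handle = pool_handle
        self.persistent = persistent

    def __del__(self):
        # Assumption: only non-persistent tensors return their buffer.
        if not self.persistent and self.pool_handle is not None:
            self.pool_handle.release()


pool = FakePool()
activation = FakeTensor(FakeHandle(pool, 4096))
weights = FakeTensor(FakeHandle(pool, 4096), persistent=True)
del activation, weights
gc.collect()
print(len(pool.returned))  # 1: only the non-persistent tensor's handle
```

Relying on `__del__` means the buffer returns to the pool whenever the interpreter finalizes the tensor, so a tensor kept alive by a stray reference also keeps its bucket occupied.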
Trade-offs
| Advantage | Cost |
|---|---|
| Fewer clCreateBuffer calls (each is ~10–100 µs depending on driver) | Higher peak memory (power-of-two rounding means a 1.1 MiB tensor consumes a 2 MiB bucket) |
| Better cache locality for repeated shapes | First-touch allocation is as expensive as without a pool |
| Predictable latency on steady-state workloads | Buckets can fragment over a long run; PersistentBufferPool.clear() resets the cache |
For a typical 4 MiB feature map on a discrete GPU, the
clCreateBuffer cost is roughly 80 µs on NVIDIA and 30 µs on Apple.
On NVIDIA, a pool that hits 100% in steady state therefore saves about
80 µs × N per step, where N is the number of intermediate tensors.
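To put a number on that estimate, the arithmetic can be written out as below. The function name and the value N = 200 are made up for illustration, and the calculation ignores the pool's own bookkeeping cost.

```python
def saved_per_step_us(create_cost_us: float, n_intermediates: int,
                      hit_rate: float = 1.0) -> float:
    """Estimated clCreateBuffer time avoided per training step, in µs."""
    return create_cost_us * n_intermediates * hit_rate


# Hypothetical step with 200 intermediate tensors at 80 µs per allocation.
print(saved_per_step_us(80, 200) / 1000, "ms")  # 16.0 ms
```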
See also
- core API: BufferPool, PinnedBufferPool, PersistentBufferPool, BufferHandle, PoolStats, and the get_pinned_pool / get_persistent_pool factories.
- Tensor Backend: the backend that owns the context/queue that the pool is bound to.
- Tensor API: the user-facing Tensor type that holds a BufferHandle.
- JIT Compiler: uses BufferPool buffers for its intermediate tensors.
- runtime API: kernel cache; the runtime API also keeps its own scratch buffers, but those go through the same pool when they are device-resident.