BufferPool
Status: Public API in
netcl.core.memory.BufferPool
BufferPool is netcl's power-of-two bucketed allocator for OpenCL
cl.Buffer objects. Its job is to reduce the number of clCreateBuffer
calls a training loop issues — those calls are surprisingly expensive on
most OpenCL implementations (NVIDIA, Intel, Apple), and the kernel-launch
path is much faster when it can reuse a buffer that was created earlier.
Every Tensor on a GPU device owns (or borrows) a
BufferHandle issued by a BufferPool. When the tensor is garbage-collected the
handle is returned to the pool; on the next allocation of the same bucket
size, the cached buffer is handed out again without an actual
clCreateBuffer round-trip.
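The hand-back on garbage collection can be pictured with a small, self-contained sketch. It is not netcl's implementation: ToyPool and ToyTensor are hypothetical stand-ins, and the use of weakref.finalize is an assumption about one reasonable way to wire a release to an object's lifetime.

```python
import gc
import weakref


class ToyPool:
    """Hypothetical stand-in for BufferPool: records which handles came back."""

    def __init__(self):
        self.returned = []

    def release(self, handle):
        self.returned.append(handle)


class ToyTensor:
    """Hypothetical tensor that owns one handle from a pool."""

    def __init__(self, pool, handle):
        self.handle = handle
        # The finalizer captures only the pool and the handle, never `self`,
        # so it does not keep the tensor alive.
        self._finalizer = weakref.finalize(self, pool.release, handle)


pool = ToyPool()
t = ToyTensor(pool, handle="handle-0")
del t         # drop the last reference to the tensor
gc.collect()  # redundant under CPython refcounting, harmless elsewhere
print(pool.returned)  # ['handle-0']
```

Once the handle is back in the pool, the next allocation of the same bucket size can reuse it.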
Overview
The pool keeps a single dict: {bucket_size: list[BufferHandle]}. On
allocate(nbytes) it rounds nbytes up to the next power of two
(_bucket_size), pops a handle from the corresponding free list, and
returns it. On release(handle) it appends the handle to the free list
of the handle's bucket. The pool is locked with a threading.Lock
because OpenCL is multi-threaded in netcl (the autograd engine, the data
loader prefetch, and the user code all share the same context).
Two statistics are maintained on the pool: hits and misses. A hit
means the bucket had a free handle; a miss falls through to a
real clCreateBuffer. After a few warm-up iterations of a steady-state
training step, the hit rate is well above 95%.
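To see why the hit rate climbs after warm-up, here is a minimal simulation that needs no OpenCL device. CountingPool is a hypothetical toy whose "handles" are plain integers, the per-step request sizes are made up, and hit_rate is assumed to mean hits / (hits + misses).

```python
from collections import defaultdict


def bucket_size(nbytes):
    """Round nbytes up to the next power of two."""
    size = 1
    while size < nbytes:
        size <<= 1
    return size


class CountingPool:
    """Toy pool: free lists hold plain ints instead of cl.Buffer handles."""

    def __init__(self):
        self.free = defaultdict(list)
        self.hits = 0
        self.misses = 0

    def allocate(self, nbytes):
        bucket = bucket_size(nbytes)
        if self.free[bucket]:
            self.hits += 1
            return self.free[bucket].pop()
        self.misses += 1
        return bucket  # the toy "handle" is just its bucket size

    def release(self, handle):
        self.free[handle].append(handle)

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


pool = CountingPool()
step_sizes = [4096, 4096, 150_000, 1_000_000]  # made-up requests per step
for step in range(50):
    handles = [pool.allocate(n) for n in step_sizes]
    for h in handles:
        pool.release(h)

print(pool.hits, pool.misses, round(pool.hit_rate, 3))  # 196 4 0.98
```

Only the first step misses; every later step finds all four buffers already cached, so the rate approaches 1 as the loop runs longer.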
Where It Lives
- File path: core/memory.py (class BufferPool).
- Module path: netcl.core.memory.
- Public re-export: top-level netcl.BufferPool is available via from netcl.core.memory import BufferPool.
- Sibling classes in the same file: PinnedBufferPool, PersistentBufferPool.
How It Works
import threading
from typing import Dict

import pyopencl as cl

# BufferHandle and PoolStats are defined elsewhere in netcl (not shown here).


class BufferPool:
    def __init__(self, context):
        self.context = context
        # One free list per power-of-two bucket size.
        self._free: Dict[int, list[BufferHandle]] = {}
        self._lock = threading.Lock()
        self.stats = PoolStats()

    @staticmethod
    def _bucket_size(nbytes: int) -> int:
        # Round up to the next power of two (minimum 1).
        size = 1
        while size < nbytes:
            size <<= 1
        return size

    def allocate(self, nbytes, flags=None) -> BufferHandle:
        bucket = self._bucket_size(nbytes)
        with self._lock:
            free_list = self._free.get(bucket)
            if free_list:
                # Hit: reuse a cached buffer of this bucket size.
                self.stats.hits += 1
                handle = free_list.pop()
                handle.pool = self
                return handle
            # Miss: fall through to a real clCreateBuffer.
            self.stats.misses += 1
            buf = cl.Buffer(self.context, flags or cl.mem_flags.READ_WRITE, bucket)
            return BufferHandle(buffer=buf, nbytes=bucket, bucket_size=bucket, pool=self)

    def release(self, handle):
        bucket = self._bucket_size(handle.nbytes)
        with self._lock:
            free_list = self._free.setdefault(bucket, [])
            if handle in free_list:
                # Already on the free list: ignore the repeated release.
                return
            free_list.append(handle)
Key behaviours:
- allocate rounds up: requesting 1025 bytes returns a handle sized to 2048 bytes. The caller must keep in mind the difference between handle.bucket_size and handle.nbytes.
- release is idempotent. The handle in free_list check means that releasing the same handle twice cannot corrupt the free list.
- clear() drops every cached buffer. Use it after a hot-swap of the context (e.g. multi-process) or before del-ing a pool at shutdown.
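The rounding and the repeated-release guard can be checked in isolation. The sketch below is illustrative only: bucket_size_fast is a hypothetical closed-form alternative to the loop (not something the pool is known to use), and the free-standing release helper mimics the guard with plain Python objects.

```python
def bucket_size_loop(nbytes):
    """Same doubling loop as _bucket_size."""
    size = 1
    while size < nbytes:
        size <<= 1
    return size


def bucket_size_fast(nbytes):
    """Hypothetical closed form; max() mirrors the loop's result of 1 for nbytes <= 1."""
    return 1 << (max(nbytes, 1) - 1).bit_length()


for n in (1, 1024, 1025, 1_000_000):
    b = bucket_size_loop(n)
    assert b == bucket_size_fast(n)
    # Padding is the part of the bucket the caller did not ask for.
    print(f"{n:>9} bytes -> {b:>9}-byte bucket, {b - n} bytes of padding")


def release(free_list, handle):
    """Append handle unless it is already on the free list."""
    if handle in free_list:
        return
    free_list.append(handle)


free = []
h = object()
release(free, h)
release(free, h)  # second call is a no-op
print(len(free))  # 1
```

The padding column shows the cost of rounding: a 1025-byte request occupies a 2048-byte buffer, so the caller should size kernels by the requested length rather than the bucket length.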
Code Example
import pyopencl as cl
from netcl.core.memory import BufferPool

ctx = cl.create_some_context()
pool = BufferPool(ctx)

# Allocate three buffers of different sizes; each lands in a different bucket.
h1 = pool.allocate(1024)       # -> 1024-byte bucket (miss)
h2 = pool.allocate(1025)       # -> 2048-byte bucket (miss)
h3 = pool.allocate(1_000_000)  # -> 1_048_576-byte bucket (miss)

# Reuse the buckets.
h1.release(); h1b = pool.allocate(1024)  # hit
h2.release(); h2b = pool.allocate(1025)  # hit: same 2048-byte bucket as h2

# 2 hits and 3 misses: 0.4 if hit_rate is hits / (hits + misses).
print(pool.stats.hit_rate)
Performance & Trade-offs
- Bucket fragmentation: a tensor that is 1 KiB and one that is 1 MiB land in different buckets; the 1 MiB bucket cannot satisfy a 512 KiB request. This is by design: the alternative is variable-sized free lists, which are more flexible but slower to scan.
- Memory ceiling: the pool has no upper bound. Long-running jobs that allocate many distinct sizes can balloon VRAM. Use PersistentBufferPool for that case.
- Thread-safety: a single threading.Lock covers the whole pool. This is fine for typical workloads; for very high allocation pressure, switch to PersistentBufferPool, which uses a per-bucket lock.
- CPU backend: the pool is still useful for the CPU queue because it bounds NumPy ndarray allocations to a small set of sizes, but the win is smaller (a few percent) than on the GPU queue.
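The single-lock trade-off can be exercised with a toy pool and a few threads. This is a sketch under stated assumptions, not netcl code: LockedToyPool, worker, and cached_bytes are hypothetical names, and the handles are plain integers rather than OpenCL buffers.

```python
import threading
from collections import defaultdict


class LockedToyPool:
    """Toy pool guarded by one lock, mirroring the single-lock design."""

    def __init__(self):
        self._free = defaultdict(list)
        self._lock = threading.Lock()
        self.hits = 0
        self.misses = 0

    def allocate(self, bucket):
        with self._lock:
            if self._free[bucket]:
                self.hits += 1
                return self._free[bucket].pop()
            self.misses += 1
        return bucket  # the toy "handle" is just its bucket size

    def release(self, handle):
        with self._lock:
            self._free[handle].append(handle)

    def cached_bytes(self):
        # Nothing ever evicts entries, so this only grows or stays flat.
        with self._lock:
            return sum(sum(handles) for handles in self._free.values())


def worker(pool, rounds):
    for _ in range(rounds):
        h = pool.allocate(4096)
        pool.release(h)


pool = LockedToyPool()
threads = [threading.Thread(target=worker, args=(pool, 1000)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(pool.hits + pool.misses)  # 4000: no counter update was lost
print(pool.cached_bytes())      # at most 4 * 4096, one buffer per thread
```

Because every counter update happens under the lock, the totals always add up; the price is that all threads queue on the same lock, which is the contention a per-bucket lock is meant to reduce.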
See also
- BufferPool: architecture deep-dive.
- PinnedBufferPool: pinned host memory variant.
- PersistentBufferPool: bounded pool with per-bucket limits.
- Tensor: the tensor factory wires the pool in.
- Tensor Backend: how the pool hands off to cl.enqueue_copy.
- BufferPool: this article.