BufferPool

Status: Public API in netcl.core.memory.BufferPool

BufferPool is netcl's power-of-two bucketed allocator for OpenCL cl.Buffer objects. Its job is to reduce the number of clCreateBuffer calls a training loop issues. Those calls are expensive on most OpenCL implementations (NVIDIA, Intel, Apple), and the kernel-launch path is much faster when it can reuse a buffer that was created earlier.

Every Tensor on a GPU device owns (or borrows) a BufferPool BufferHandle. When the tensor is garbage-collected the handle is returned to the pool; on the next allocation of the same bucket size, the cached buffer is handed out again without an actual clCreateBuffer round-trip.
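
The page does not say how the handle finds its way back to the pool when a tensor is collected. One possible mechanism, shown here purely as an illustration, is weakref.finalize from the standard library. RecordingPool and TensorSketch are hypothetical stand-ins, not netcl classes.

```python
import gc
import weakref


class RecordingPool:
    """Hypothetical stand-in for BufferPool that only records releases."""

    def __init__(self):
        self.released = []

    def release(self, handle):
        self.released.append(handle)


class TensorSketch:
    """Hypothetical tensor that returns its handle when it is collected."""

    def __init__(self, pool, handle):
        self.handle = handle
        # The callback must not reference self, or the tensor would never die.
        self._finalizer = weakref.finalize(self, pool.release, handle)


pool = RecordingPool()
tensor = TensorSketch(pool, "handle-1")
del tensor      # drop the last reference to the tensor
gc.collect()    # makes collection prompt on runtimes without refcounting
print(pool.released)
```

The finalizer runs at most once, so a tensor released this way cannot hand the same handle back twice.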

Overview

The pool keeps a single dict, {bucket_size: list[BufferHandle]}. On allocate(nbytes) it rounds nbytes up to the next power of two (_bucket_size) and pops a handle from the corresponding free list; if that list is empty, it creates a new buffer of the bucket size. On release(handle) it appends the handle to the free list of the handle's bucket. The dict is guarded by a threading.Lock because several threads in netcl (the autograd engine, the data loader prefetch, and user code) share the same context and can allocate concurrently.
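
The rounding rule can be tried in isolation. The sketch below restates the loop used by _bucket_size and, as an assumption of this example rather than something netcl is known to do, an equivalent closed form based on int.bit_length. Both function names are illustrative.

```python
def bucket_size_loop(nbytes: int) -> int:
    """Smallest power of two >= nbytes, by repeated doubling."""
    size = 1
    while size < nbytes:
        size <<= 1
    return size


def bucket_size_bits(nbytes: int) -> int:
    """Same result without a loop (illustrative alternative)."""
    if nbytes <= 1:
        return 1
    return 1 << (nbytes - 1).bit_length()


for request in (1, 1024, 1025, 1_000_000):
    print(request, "->", bucket_size_loop(request))
```

Requests that are already a power of two stay in their own bucket; anything one byte larger moves to the next bucket up.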

Two statistics are maintained on the pool: hits and misses. A hit means the bucket had a free handle; a miss falls through to a real clCreateBuffer. After a few warm-up iterations of a steady-state training step, the hit rate is well above 95%.
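
To see why the hit rate climbs so quickly, here is a small model of the counters. PoolStatsSketch and simulate are hypothetical. The sketch assumes hit_rate is hits / (hits + misses) and that every step requests the same set of sizes and releases all of them before the next step.

```python
from dataclasses import dataclass


@dataclass
class PoolStatsSketch:
    """Hypothetical counters; the real PoolStats may differ."""
    hits: int = 0
    misses: int = 0

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def simulate(steps: int, buffers_per_step: int) -> PoolStatsSketch:
    """Model a steady-state loop with identical requests in every step."""
    stats = PoolStatsSketch()
    cached = 0  # buffers sitting in free lists between steps
    for _ in range(steps):
        hits = min(cached, buffers_per_step)
        stats.hits += hits
        stats.misses += buffers_per_step - hits
        cached = buffers_per_step  # everything is released at end of step
    return stats


stats = simulate(steps=100, buffers_per_step=10)
print(stats.hits, stats.misses, round(stats.hit_rate, 3))
```

Under these assumptions only the first step misses, so 100 steps of 10 buffers give 990 hits and 10 misses.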

Where It Lives

  • File path: core/memory.py (class BufferPool).
  • Module path: netcl.core.memory.
  • Public import: from netcl.core.memory import BufferPool (also re-exported at the top level as netcl.BufferPool).
  • Sibling classes in the same file: PinnedBufferPool, PersistentBufferPool.

How It Works

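import threading

import pyopencl as cl

# BufferHandle and PoolStats are defined elsewhere in netcl (not shown here).


class BufferPool:
    def __init__(self, context):
        self.context = context
        # Free lists keyed by bucket size (always a power of two).
        self._free: dict[int, list[BufferHandle]] = {}
        self._lock = threading.Lock()
        self.stats = PoolStats()

    @staticmethod
    def _bucket_size(nbytes: int) -> int:
        # Smallest power of two >= nbytes (1 for nbytes <= 1).
        size = 1
        while size < nbytes:
            size <<= 1
        return size

    def allocate(self, nbytes, flags=None) -> BufferHandle:
        bucket = self._bucket_size(nbytes)
        with self._lock:
            free_list = self._free.get(bucket)
            if free_list:
                # Hit: reuse a cached buffer. It keeps the mem_flags it was
                # created with; flags only applies to newly created buffers.
                self.stats.hits += 1
                handle = free_list.pop()
                handle.pool = self
                handle.nbytes = nbytes
                return handle
            # Miss: count it under the lock, create the buffer outside it.
            self.stats.misses += 1
        buf = cl.Buffer(self.context, flags or cl.mem_flags.READ_WRITE, bucket)
        return BufferHandle(buffer=buf, nbytes=nbytes, bucket_size=bucket, pool=self)

    def release(self, handle):
        # Use the stored bucket size; nbytes holds the requested size.
        bucket = handle.bucket_size
        with self._lock:
            free_list = self._free.setdefault(bucket, [])
            # Identity check: ignore a handle that is already cached.
            if any(h is handle for h in free_list):
                return
            free_list.append(handle)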

Key behaviours:

  • allocate rounds up: requesting 1025 bytes returns a handle backed by a 2048-byte buffer. The caller must keep handle.nbytes and handle.bucket_size apart and must not assume the buffer is exactly as large as the request.
  • release is idempotent while the handle is still cached. An identity check against the bucket's free list turns a repeated release of the same handle into a no-op, so a double-free cannot put it on the list twice.
  • clear() drops every cached buffer. Call it after the context is replaced (e.g. in a multi-process setup) or before deleting a pool at shutdown.
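
The three behaviours can be exercised without an OpenCL device. The following is a minimal sketch, not netcl's implementation: ToyPool and ToyHandle are hypothetical, a bytearray stands in for cl.Buffer, and the body of clear() is an assumption, since the page only describes its effect.

```python
import threading
from dataclasses import dataclass


@dataclass(eq=False)  # eq=False keeps comparisons identity-based
class ToyHandle:
    buffer: bytearray
    nbytes: int        # size that was requested
    bucket_size: int   # capacity of the backing buffer


class ToyPool:
    def __init__(self):
        self._free: dict[int, list[ToyHandle]] = {}
        self._lock = threading.Lock()

    @staticmethod
    def _bucket_size(nbytes: int) -> int:
        size = 1
        while size < nbytes:
            size <<= 1
        return size

    def allocate(self, nbytes: int) -> ToyHandle:
        bucket = self._bucket_size(nbytes)
        with self._lock:
            free_list = self._free.get(bucket)
            if free_list:
                handle = free_list.pop()
                handle.nbytes = nbytes
                return handle
        return ToyHandle(bytearray(bucket), nbytes, bucket)

    def release(self, handle: ToyHandle) -> None:
        with self._lock:
            free_list = self._free.setdefault(handle.bucket_size, [])
            if any(h is handle for h in free_list):
                return  # already cached: repeated release is a no-op
            free_list.append(handle)

    def clear(self) -> None:
        # Assumed behaviour: forget every cached handle.
        with self._lock:
            self._free.clear()

    def cached_bytes(self) -> int:
        with self._lock:
            return sum(b * len(hs) for b, hs in self._free.items())


pool = ToyPool()
first = pool.allocate(1025)
pool.release(first)
pool.release(first)                # second release is ignored
cached_after_release = pool.cached_bytes()
second = pool.allocate(2000)       # same 2048-byte bucket, same handle
pool.release(second)
pool.clear()
cached_after_clear = pool.cached_bytes()
```

Note that the identity check only protects a handle that is still in the free list. Once the handle has been handed out again, a stale release from its previous owner would look like a normal release.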

Code Example

import pyopencl as cl
from netcl.core.memory import BufferPool

ctx = cl.create_some_context()
pool = BufferPool(ctx)

# Allocate three buffers of different sizes; each lands in a different bucket.
h1 = pool.allocate(1024)        # -> 1024-byte bucket (miss)
h2 = pool.allocate(1025)        # -> 2048-byte bucket (miss)
h3 = pool.allocate(1_000_000)   # -> 1_048_576-byte bucket (miss)

# Reuse the buckets.
h1.release(); h1b = pool.allocate(1024)   # hit in the 1024-byte bucket
h2.release(); h2b = pool.allocate(2000)   # hit in the 2048-byte bucket
# 2 hits and 3 misses: 0.4 if hit_rate is hits / (hits + misses)
print(pool.stats.hit_rate)

Performance & Trade-offs

  • Bucket fragmentation: a 1 KiB tensor and a 1 MiB tensor land in different buckets, and a free 1 MiB buffer is never used to satisfy a 512 KiB request. This is by design: the alternative is variable-sized free lists, which are more flexible but slower to scan. Rounding up also means that a request just above a power of two (e.g. 1025 bytes) occupies almost twice its size.
  • Memory ceiling: the pool has no upper bound. Long-running jobs that allocate many distinct sizes can balloon VRAM. Use PersistentBufferPool for that case.
  • Thread-safety: a single threading.Lock covers the whole pool. This is fine for typical workloads; for very high allocation pressure, switch to PersistentBufferPool which uses a per-bucket lock.
  • CPU backend: the pool is still useful for the CPU queue because it bounds NumPy ndarray allocations to a small set of sizes, but the win is smaller (a few percent) than on the GPU queue.
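
The cost of rounding up can be put in numbers. This sketch reuses the rounding loop from _bucket_size and reports how much of each bucket goes unused. rounding_waste is a hypothetical helper written for this example.

```python
def bucket_size(nbytes: int) -> int:
    """Smallest power of two >= nbytes."""
    size = 1
    while size < nbytes:
        size <<= 1
    return size


def rounding_waste(nbytes: int) -> tuple[int, float]:
    """Return (unused bytes, unused fraction of the bucket)."""
    bucket = bucket_size(nbytes)
    wasted = bucket - nbytes
    return wasted, wasted / bucket


for request in (1024, 1025, 1_000_000):
    wasted, fraction = rounding_waste(request)
    print(f"{request:>9} B -> bucket {bucket_size(request):>9} B, "
          f"unused {wasted:>6} B ({fraction:.1%})")
```

The unused fraction is always below 50%, and it is worst for sizes just past a power of two, which is worth keeping in mind when choosing tensor shapes for memory-bound jobs.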
