PersistentBufferPool

Status: Public API in netcl.core.memory.PersistentBufferPool

PersistentBufferPool is the bounded alternative to BufferPool. It is similar in structure (size-bucketed free lists, per-pool lock) but adds two guarantees that make it the right pool for long-running training jobs:

Bounded total cache size: max_cached_bytes defaults to the value of NETCL_MAX_CACHED_GB (4 GiB if unset). Once the cache has reached this limit, newly released buffers are not added to the free list; the GC reclaims them.
Bounded per-bucket count: max_buffers_per_bucket (default 16) prevents one bucket (e.g. the 1 MiB bucket) from hoarding the entire cache.
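
As a rough illustration of the default described above, the cap could be resolved from the environment along these lines. The helper name resolve_max_cached_bytes is hypothetical, and reading NETCL_MAX_CACHED_GB as a (possibly fractional) number of GiB is an assumption; the library's own parsing may differ.

```python
import os

GIB = 1024 ** 3

def resolve_max_cached_bytes(environ=None, default_gib=4.0):
    """Hypothetical helper: cache cap in bytes from NETCL_MAX_CACHED_GB."""
    env = os.environ if environ is None else environ
    raw = env.get("NETCL_MAX_CACHED_GB")
    if raw is None or raw.strip() == "":
        # Variable unset or empty: fall back to the documented 4 GiB default.
        return int(default_gib * GIB)
    # Assumption: the variable holds a number of GiB, e.g. "8" or "0.5".
    return int(float(raw) * GIB)

print(resolve_max_cached_bytes({}))                             # 4294967296
print(resolve_max_cached_bytes({"NETCL_MAX_CACHED_GB": "8"}))   # 8589934592
```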

The pool is used by the JIT Compiler and by long-running data-parallel training jobs where the unbounded BufferPool would balloon VRAM.

Overview

Unlike a pool that keeps a bucket for every power of two, PersistentBufferPool uses a fixed bucket set that steps by a factor of four and is aligned to common tensor sizes:

BUCKETS = [
    1024, 4096, 16384, 65536,           # 1 KiB .. 64 KiB
    262144, 1048576, 4194304,           # 256 KiB .. 4 MiB
    16777216, 67108864, 268435456,      # 16 MiB .. 256 MiB
]

Allocations larger than 256 MiB fall back to power-of-two rounding on the fly. The fixed bucket set makes the cache behaviour predictable and the per-bucket limit meaningful.
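
The rounding rule can be sketched as a small lookup. This is a minimal illustration rather than the pool's actual code: the helper name bucket_for is hypothetical, and rounding oversized requests up to the next power of two is one reading of "power-of-two rounding".

```python
import bisect

BUCKETS = [
    1024, 4096, 16384, 65536,
    262144, 1048576, 4194304,
    16777216, 67108864, 268435456,
]

def bucket_for(nbytes):
    """Hypothetical helper: size class that a request of nbytes maps to."""
    if nbytes <= 0:
        raise ValueError("nbytes must be positive")
    if nbytes <= BUCKETS[-1]:
        # Smallest fixed bucket that is >= nbytes.
        return BUCKETS[bisect.bisect_left(BUCKETS, nbytes)]
    # Above 256 MiB: round up to the next power of two.
    return 1 << (nbytes - 1).bit_length()

print(bucket_for(40_000_000))    # 67108864 (the 64 MiB bucket)
print(bucket_for(300_000_000))   # 536870912 (512 MiB, outside the fixed set)
```

Because the fixed set has only ten size classes, a limit of max_buffers_per_bucket buffers per class translates into a small, predictable number of cached buffers overall.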

PersistentBufferPool.allocate also detects integrated GPUs (via is_integrated_gpu(queue)) and uses ALLOC_HOST_PTR for the buffer; discrete GPUs get plain READ_WRITE. This is the recommended pool for production training because it caps VRAM growth at the configured limit and reports a hit rate that can be alerted on.
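
The flag choice can be sketched without an OpenCL runtime by using the numeric constants from the OpenCL headers. The helper buffer_flags is hypothetical, and combining READ_WRITE with ALLOC_HOST_PTR on integrated GPUs is an assumption; the text above only says that ALLOC_HOST_PTR is used. In PyOpenCL the same constants are available as cl.mem_flags.READ_WRITE and cl.mem_flags.ALLOC_HOST_PTR.

```python
# Numeric values as defined in the OpenCL headers (CL/cl.h).
CL_MEM_READ_WRITE = 1 << 0
CL_MEM_ALLOC_HOST_PTR = 1 << 4

def buffer_flags(integrated):
    """Hypothetical helper: memory flags for a new pool buffer."""
    flags = CL_MEM_READ_WRITE
    if integrated:
        # Assumption: integrated GPUs share host memory, so the driver is
        # asked to allocate the buffer in host-accessible memory.
        flags |= CL_MEM_ALLOC_HOST_PTR
    return flags

print(buffer_flags(integrated=False))  # 1
print(buffer_flags(integrated=True))   # 17
```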

Where It Lives

File path: core/memory.py (class PersistentBufferPool).
Module path: netcl.core.memory.
Public re-export: from netcl.core.memory import PersistentBufferPool, get_persistent_pool.

How It Works

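import threading
from collections import deque
from typing import Dict

# BufferHandle and PoolStats are assumed to be defined elsewhere in this module.
class PersistentBufferPool: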
    def __init__(self, context, queue,
                 max_cached_bytes=4 * 1024**3,
                 max_buffers_per_bucket=16):
        self.context = context
        self.queue = queue
        self.max_cached_bytes = max_cached_bytes
        self.max_buffers_per_bucket = max_buffers_per_bucket
        self._free: Dict[int, deque[BufferHandle]] = {}
        self._lock = threading.RLock()
        self.stats = PoolStats()

    def get_stats(self) -> dict:
        return {
            "hits": self.stats.hits,
            "misses": self.stats.misses,
            "hit_rate": self.stats.hit_rate,
            "bytes_cached": self.stats.bytes_cached,
            "bytes_allocated": self.stats.bytes_allocated,
            "per_bucket": {
                b: len(v) for b, v in self._free.items()
            },
        }

release(handle) checks the per-bucket count and the total cache size before adding the handle to the free list. If either limit is exceeded, the handle is not added — the underlying cl.Buffer is released to the driver instead.
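
The release-side checks can be modelled with a toy free list that tracks sizes instead of real cl.Buffer objects. The class BoundedFreeList is hypothetical. Rejecting a buffer when caching it would push the total past the cap is also an assumption: the real pool may instead compare the current total against the limit.

```python
import threading
from collections import deque

class BoundedFreeList:
    """Toy model of the release checks; stores bucket sizes, not buffers."""

    def __init__(self, max_cached_bytes, max_buffers_per_bucket):
        self.max_cached_bytes = max_cached_bytes
        self.max_buffers_per_bucket = max_buffers_per_bucket
        self.bytes_cached = 0
        self._free = {}
        self._lock = threading.RLock()

    def release(self, bucket_size):
        """Return True if cached, False if the buffer should be freed."""
        with self._lock:
            bucket = self._free.setdefault(bucket_size, deque())
            # Check 1: per-bucket count.
            if len(bucket) >= self.max_buffers_per_bucket:
                return False
            # Check 2: total cached bytes.
            if self.bytes_cached + bucket_size > self.max_cached_bytes:
                return False
            bucket.append(bucket_size)
            self.bytes_cached += bucket_size
            return True

pool = BoundedFreeList(max_cached_bytes=3 * 1048576, max_buffers_per_bucket=2)
print([pool.release(1048576) for _ in range(3)])  # [True, True, False]
print(pool.release(4194304))                      # False
```

The third 1 MiB release is rejected by the per-bucket limit, and the 4 MiB release is rejected because 2 MiB is already cached and the cap is 3 MiB.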

Code Example

import netcl as nc
from netcl.core.memory import get_persistent_pool

ctx, queue = nc.device.manager.default()
pp = get_persistent_pool(queue)

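h = pp.allocate(40_000_000)            # rounds up to the 64 MiB bucket
pp.release(h)                          # hand the buffer back to the pool
print(pp.get_stats())
# Illustrative output from a warmed-up pool:
# {'hits': 42, 'misses': 3, 'hit_rate': 0.93,
#  'bytes_cached': 67108864, ...}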

Raising both limits for a long-running job:

pp = get_persistent_pool(queue,
                         max_cached_bytes=8 * 1024**3,
                         max_buffers_per_bucket=32)

Performance & Trade-offs

VRAM ceiling: the max_cached_bytes cap is the binding constraint. Once the cache is full, released buffers are dropped rather than cached, so a later allocation of the same size misses. The hit rate is therefore a useful alerting signal: a steady hit rate above 90% is healthy, while a drop to around 50% mid-training suggests the cache is too small.
Per-bucket fairness: the per-bucket limit prevents one common bucket size from hogging the entire cache. Without it, a model that produces many 1 MiB intermediates could fill the cache and leave no room for 4 MiB intermediates.
Cost vs. BufferPool: the per-release check is two integer comparisons; the overhead is negligible.
Recommended for production training: yes. The unbounded BufferPool is fine for short-running scripts and unit tests, but a long-running training job should use PersistentBufferPool to avoid OOM.
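
A hit-rate alert along the lines described under "VRAM ceiling" might look like the sketch below. The helper names are hypothetical, the stats keys follow get_stats(), and computing the rate as hits / (hits + misses) is an assumption that matches the example figures (42 hits, 3 misses, 0.93).

```python
def hit_rate(hits, misses):
    """Fraction of allocations served from the cache."""
    total = hits + misses
    return hits / total if total else 0.0

def check_pool_health(stats, warn_below=0.9):
    """Hypothetical check: 'ok' at or above the threshold, else 'warn'."""
    rate = hit_rate(stats["hits"], stats["misses"])
    return ("ok" if rate >= warn_below else "warn", rate)

status, rate = check_pool_health({"hits": 42, "misses": 3})
print(status, round(rate, 2))   # ok 0.93
```

In a training loop the same check could be run every few hundred steps against pp.get_stats(), with a warning prompting a larger max_cached_bytes.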