PersistentBufferPool
Status: Public API in
netcl.core.memory.PersistentBufferPool
PersistentBufferPool is the bounded alternative to
BufferPool. It has a similar structure
(size buckets, free lists, a per-pool lock) but adds two
guarantees that make it the right pool for long-running
training jobs:
- Bounded total cache size: max_cached_bytes defaults to NETCL_MAX_CACHED_GB (4 GiB if unset). When the cache exceeds this limit, newly released buffers are not added to the free list; the GC reclaims them.
- Bounded per-bucket count: max_buffers_per_bucket (default 16) prevents one bucket (e.g. the 1 MiB bucket) from hoarding the entire cache.
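How netcl parses NETCL_MAX_CACHED_GB is not shown on this page. The following is a minimal sketch of resolving the default, assuming the variable holds a number of GiB and that fractional values are allowed; the helper name resolve_max_cached_bytes is hypothetical.

```python
import os

GIB = 1024 ** 3
DEFAULT_MAX_CACHED_GB = 4.0  # documented fallback when the variable is unset


def resolve_max_cached_bytes(environ=None):
    """Return the cache limit in bytes from NETCL_MAX_CACHED_GB, or 4 GiB if unset.

    Hypothetical helper: the real parsing rules in netcl may differ.
    """
    env = os.environ if environ is None else environ
    raw = env.get("NETCL_MAX_CACHED_GB")
    if raw is None or raw.strip() == "":
        return int(DEFAULT_MAX_CACHED_GB * GIB)
    gib = float(raw)  # assumption: values such as "0.5" are accepted
    if gib < 0:
        raise ValueError("NETCL_MAX_CACHED_GB must be non-negative")
    return int(gib * GIB)
```

Passing a plain dict instead of os.environ makes the helper easy to exercise without touching the process environment.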
The pool is used by the JIT Compiler
and by long-running data-parallel training jobs where the
unbounded BufferPool would balloon VRAM.
Overview
The bucket set is fixed rather than derived by power-of-two rounding, and it is aligned to common tensor sizes. Each bucket is four times the size of the previous one:
BUCKETS = [
    1024, 4096, 16384, 65536,         # 1 KiB .. 64 KiB
    262144, 1048576, 4194304,         # 256 KiB .. 4 MiB
    16777216, 67108864, 268435456,    # 16 MiB .. 256 MiB
]
Allocations larger than 256 MiB fall back to power-of-two rounding on the fly. The fixed bucket set makes the cache behaviour predictable and the per-bucket limit meaningful.
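The rounding rule described above can be sketched as a small lookup. This is an illustration under stated assumptions, not the pool's actual code: the function name bucket_for is hypothetical, and it assumes a request is rounded up to the smallest bucket that fits.

```python
from bisect import bisect_left

BUCKETS = [
    1024, 4096, 16384, 65536,
    262144, 1048576, 4194304,
    16777216, 67108864, 268435456,
]


def bucket_for(nbytes):
    """Round a request up to a fixed bucket, or to a power of two above 256 MiB."""
    if nbytes <= 0:
        raise ValueError("allocation size must be positive")
    if nbytes <= BUCKETS[-1]:
        # First bucket that is >= nbytes (BUCKETS is sorted ascending).
        return BUCKETS[bisect_left(BUCKETS, nbytes)]
    # Fallback for oversized requests: next power of two >= nbytes.
    return 1 << (nbytes - 1).bit_length()
```

With this rule a 40,000,000-byte request lands in the 64 MiB bucket, and a 300 MiB request is rounded to 512 MiB.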
PersistentBufferPool.allocate also detects integrated GPUs
(via is_integrated_gpu(queue)) and uses ALLOC_HOST_PTR for
the buffer; discrete GPUs get plain READ_WRITE. This is the
recommended pool for production training because it caps VRAM
growth at the configured limit and reports a hit rate that can
be alerted on.
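The flag choice can be illustrated without an OpenCL runtime by using the numeric cl_mem_flags values from the OpenCL headers. This is a sketch only: the helper name buffer_flags is hypothetical, and combining ALLOC_HOST_PTR with READ_WRITE on integrated GPUs is an assumption about how the pool builds its flags.

```python
# Numeric values of the OpenCL cl_mem_flags bits (as defined in CL/cl.h).
CL_MEM_READ_WRITE = 1 << 0
CL_MEM_ALLOC_HOST_PTR = 1 << 4


def buffer_flags(integrated):
    """Pick buffer flags for a device.

    integrated: result of a check such as is_integrated_gpu(queue).
    Assumption: integrated GPUs get READ_WRITE | ALLOC_HOST_PTR,
    discrete GPUs get plain READ_WRITE.
    """
    flags = CL_MEM_READ_WRITE
    if integrated:
        # Ask the driver to allocate from host-accessible memory.
        flags |= CL_MEM_ALLOC_HOST_PTR
    return flags
```

In pyopencl the same bits are exposed as cl.mem_flags.READ_WRITE and cl.mem_flags.ALLOC_HOST_PTR.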
Where It Lives
- File path: core/memory.py (class PersistentBufferPool).
- Module path: netcl.core.memory.
- Public re-export: from netcl.core.memory import PersistentBufferPool, get_persistent_pool.
How It Works
import threading
from collections import deque
from typing import Dict

# BufferHandle and PoolStats come from the surrounding module.


class PersistentBufferPool:
    def __init__(self, context, queue,
                 max_cached_bytes=4 * 1024**3,
                 max_buffers_per_bucket=16):
        self.context = context
        self.queue = queue
        self.max_cached_bytes = max_cached_bytes              # total cache cap, in bytes
        self.max_buffers_per_bucket = max_buffers_per_bucket  # cap per bucket
        # Bucket size in bytes -> free handles of that size.
        self._free: Dict[int, deque[BufferHandle]] = {}
        self._lock = threading.RLock()
        self.stats = PoolStats()

    def get_stats(self) -> dict:
        return {
            "hits": self.stats.hits,
            "misses": self.stats.misses,
            "hit_rate": self.stats.hit_rate,
            "bytes_cached": self.stats.bytes_cached,
            "bytes_allocated": self.stats.bytes_allocated,
            # Number of free buffers currently cached in each bucket.
            "per_bucket": {
                b: len(v) for b, v in self._free.items()
            },
        }
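get_stats() reads its counters from a PoolStats object whose definition is not shown on this page. A minimal sketch of such a record, assuming hit_rate is hits divided by total requests and is 0.0 before any request; the real class may carry more fields.

```python
from dataclasses import dataclass


@dataclass
class PoolStats:
    """Hypothetical stand-in for the stats record used by the pool."""
    hits: int = 0             # allocations served from a free list
    misses: int = 0           # allocations that needed a fresh buffer
    bytes_cached: int = 0     # bytes currently held in free lists
    bytes_allocated: int = 0  # bytes handed out over the pool's lifetime

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        # Avoid division by zero before the first allocation.
        return self.hits / total if total else 0.0
```

Under this definition, 42 hits and 3 misses give a hit rate of 42 / 45, which rounds to 0.93.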
release(handle) checks the per-bucket count and the total
cache size before adding the handle to the free list. If either
limit is exceeded, the handle is not added — the underlying
cl.Buffer is released to the driver instead.
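The admission check in release can be sketched with a toy free list that tracks bucket sizes instead of real buffers. The class name BoundedFreeList is hypothetical, and the sketch assumes a buffer is rejected when caching it would push either limit over its cap; the real pool may apply the comparison slightly differently.

```python
from collections import deque


class BoundedFreeList:
    """Toy model of the pool's free lists; stores sizes, not cl.Buffer objects."""

    def __init__(self, max_cached_bytes, max_buffers_per_bucket):
        self.max_cached_bytes = max_cached_bytes
        self.max_buffers_per_bucket = max_buffers_per_bucket
        self.bytes_cached = 0
        self._free = {}  # bucket size -> deque of cached entries

    def release(self, bucket_size):
        """Return True if the buffer was cached, False if it was dropped."""
        bucket = self._free.setdefault(bucket_size, deque())
        # Check 1: per-bucket count.
        if len(bucket) >= self.max_buffers_per_bucket:
            return False
        # Check 2: total cache size.
        if self.bytes_cached + bucket_size > self.max_cached_bytes:
            return False
        bucket.append(bucket_size)
        self.bytes_cached += bucket_size
        return True
```

A False return corresponds to the case where the handle is not added and the underlying buffer is given back to the driver.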
Code Example
import netcl as nc
from netcl.core.memory import get_persistent_pool
ctx, queue = nc.device.manager.default()
pp = get_persistent_pool(queue)
h = pp.allocate(40_000_000) # rounds to 64 MiB bucket
print(pp.get_stats())
# {'hits': 42, 'misses': 3, 'hit_rate': 0.93,
# 'bytes_cached': 67108864, ...}
A pool with larger limits for a long-running job:
pp = get_persistent_pool(queue,
max_cached_bytes=8 * 1024**3,
max_buffers_per_bucket=32)
Performance & Trade-offs
- VRAM ceiling: the max_cached_bytes cap is the binding constraint. Once the cache is full, allocations miss even if a buffer of the right size was previously released. The hit rate is therefore a useful alerting signal: a steady hit rate above 90% is healthy; a hit rate that drops to 50% mid-training means the cache is too small.
- Per-bucket fairness: the per-bucket limit prevents one common bucket size from hogging the entire cache. Without it, a model that produces many 1 MiB intermediates would push 4 MiB intermediates out of the cache.
- Cost vs. BufferPool: the per-release check is two integer comparisons; the overhead is negligible.
- Recommended for production training: yes. The unbounded BufferPool is fine for short-running scripts and unit tests, but a long-running training job should use PersistentBufferPool to avoid OOM.
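The hit-rate alerting idea from the list above can be sketched as a check over the dictionary returned by get_stats(). The function name check_hit_rate, the 0.9 threshold default, and the min_requests warm-up guard are assumptions for illustration, not part of netcl.

```python
def check_hit_rate(stats, threshold=0.9, min_requests=100):
    """Return a warning string if the pool's hit rate is below threshold, else None.

    stats: a dict shaped like the result of get_stats().
    min_requests: skip the check during warm-up, when the rate is noisy.
    """
    requests = stats["hits"] + stats["misses"]
    if requests < min_requests:
        return None
    if stats["hit_rate"] < threshold:
        return (
            f"buffer pool hit rate {stats['hit_rate']:.0%} is below "
            f"{threshold:.0%}; consider raising max_cached_bytes"
        )
    return None
```

Calling this once per epoch or every few hundred steps with pp.get_stats() would be enough to catch a cache that has become too small.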
See also
- PersistentBufferPool: the architecture page.
- BufferPool: the unbounded alternative.
- PinnedBufferPool: the pinned host-memory pool.
- JIT Compiler: the JIT uses PersistentBufferPool for its compiled-kernel working set.
- PersistentBufferPool: this article.