PinnedBufferPool
Status: Public API in
netcl.core.memory.PinnedBufferPool
PinnedBufferPool is a sibling of BufferPool
for pinned (page-locked) host memory. Pinned memory cannot be
swapped out by the OS, which lets the DMA controller on a
discrete GPU stream data directly without bouncing through
pageable memory. The result is roughly 2x to 4x faster host /
device transfers on discrete GPUs, at the cost of higher
host-side memory pressure.
The pool is per-context (keyed on cl.Context.int_ptr) and is
created on first use by get_pinned_pool(queue). There is no
need to instantiate it yourself.
Overview
The pool is structurally identical to BufferPool — a dict
{bucket_size: list[BufferHandle]} with a per-pool lock and a
PoolStats record — but the buffers it hands out are allocated
with cl.mem_flags.ALLOC_HOST_PTR. This flag asks the OpenCL
runtime to allocate the buffer in host-accessible memory, which
discrete-GPU drivers typically back with a page-locked region
and use as a DMA source / target.
The pool is not the default for device memory; it is the
default for staging host memory used for H2D / D2H copies.
The Tensor factory uses it under the hood
when pin_memory=True is passed.
Where It Lives
- File path: core/memory.py (class PinnedBufferPool).
- Module path: netcl.core.memory.
- Public re-export: from netcl.core.memory import PinnedBufferPool, get_pinned_pool.
How It Works
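import threading
from typing import Dict

import pyopencl as cl

# BufferHandle and PoolStats are defined alongside this class in core/memory.py.


class PinnedBufferPool:
    def __init__(self, context: cl.Context) -> None:
        self.context = context
        # Free buffers, grouped by bucket size.
        self._free: Dict[int, list[BufferHandle]] = {}
        self._lock = threading.Lock()
        self.stats = PoolStats()

    def allocate(self, nbytes: int) -> BufferHandle:
        bucket = self._bucket_size(nbytes)
        with self._lock:
            free_list = self._free.get(bucket)
            if free_list:
                # Hit: reuse a previously released buffer of this bucket size.
                self.stats.hits += 1
                handle = free_list.pop()
                handle.pool = self
                return handle
            self.stats.misses += 1
        # Miss: allocate a new host-pinned buffer outside the lock.
        buf = cl.Buffer(self.context, cl.mem_flags.ALLOC_HOST_PTR, bucket)
        return BufferHandle(buffer=buf, nbytes=bucket, bucket_size=bucket, pool=self)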
The interface mirrors BufferPool exactly. The only difference
is the ALLOC_HOST_PTR flag in the buffer allocation.
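To make the hit / miss bookkeeping concrete, here is a device-free model of the allocate and release cycle. It is a sketch, not the library's code: ModelPool, Handle, Stats, and _put_back are hypothetical names, and both the power-of-two rounding in _bucket_size and the 256-byte floor are assumptions, since the source does not show how buckets are chosen or how release is implemented.

```python
import threading
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Stats:
    hits: int = 0
    misses: int = 0


@dataclass
class Handle:
    nbytes: int
    bucket_size: int
    pool: Optional["ModelPool"] = None

    def release(self) -> None:
        # Hand the handle back to its pool; a second release is a no-op.
        if self.pool is not None:
            pool, self.pool = self.pool, None
            pool._put_back(self)


class ModelPool:
    MIN_BUCKET = 256  # assumed smallest bucket, in bytes

    def __init__(self) -> None:
        self._free: Dict[int, List[Handle]] = {}
        self._lock = threading.Lock()
        self.stats = Stats()

    @staticmethod
    def _bucket_size(nbytes: int) -> int:
        # Assumption: round up to the next power of two.
        size = max(nbytes, ModelPool.MIN_BUCKET)
        return 1 << (size - 1).bit_length()

    def allocate(self, nbytes: int) -> Handle:
        bucket = self._bucket_size(nbytes)
        with self._lock:
            free_list = self._free.get(bucket)
            if free_list:
                self.stats.hits += 1
                handle = free_list.pop()
                handle.pool = self
                return handle
            self.stats.misses += 1
        # A real pool would create a cl.Buffer here; the model skips that.
        return Handle(nbytes=bucket, bucket_size=bucket, pool=self)

    def _put_back(self, handle: Handle) -> None:
        with self._lock:
            self._free.setdefault(handle.bucket_size, []).append(handle)


pool = ModelPool()
first = pool.allocate(1000)   # miss: nothing cached yet
first.release()               # goes onto the free list for its bucket
second = pool.allocate(900)   # hit: same bucket, so the handle is reused
```

Because requests are rounded up to a shared bucket size, the second request reuses the first buffer even though the byte counts differ.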
Code Example
import netcl as nc
from netcl.core.memory import get_pinned_pool

ctx, queue = nc.device.manager.default()
pinned = get_pinned_pool(queue)

nbytes = 4 * 1024 * 1024  # example size: a 4 MiB staging buffer
h = pinned.allocate(nbytes)  # cl.Buffer with ALLOC_HOST_PTR
# ... use h.buffer for staging H2D copies ...
h.release()  # hand the buffer back to the pool
Via the Tensor factory:
x = nc.Tensor.from_host(numpy_array, pin_memory=True)
# Internally, x.host_ref is allocated from get_pinned_pool(queue).
Performance & Trade-offs
- Discrete GPU: 2x to 4x faster H2D / D2H than pageable memory. This is the single biggest perf win in the data pipeline.
- Integrated GPU: 1x to 1.5x. The device shares RAM with the host, so the DMA advantage is smaller. The NETCL_PINNED_H2D=0 env var disables the pinned path on integrated devices.
- Memory cost: pinned memory is not swappable. Pinning a lot of memory can starve the OS of pageable pages. The default Tensor.from_host allocates the pinned buffer and frees it as soon as the H2D copy completes, so the cost is transient.
- Thread-safety: same as BufferPool, a single threading.Lock covers the whole pool.
- NUMA effects: on multi-socket systems, pinning a buffer on the wrong node can hurt. The pool does not currently handle this; use the OS's numactl if you care.
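For the NETCL_PINNED_H2D switch mentioned in the list above, a check along these lines is one plausible shape. This is a guess at the behaviour, not the library's code: pinned_h2d_enabled is a hypothetical helper, and it assumes that an unset variable leaves the pinned path on and that only the exact value 0 turns it off.

```python
import os
from typing import Mapping, Optional


def pinned_h2d_enabled(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Return False only when NETCL_PINNED_H2D is set to "0" (assumed semantics)."""
    env = os.environ if environ is None else environ
    return env.get("NETCL_PINNED_H2D", "1").strip() != "0"


# Passing a mapping keeps the check testable without touching the real environment.
enabled_by_default = pinned_h2d_enabled({})
disabled = pinned_h2d_enabled({"NETCL_PINNED_H2D": "0"})
```

From a shell, the equivalent is to set NETCL_PINNED_H2D=0 in the environment before launching the process.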
See also
- PinnedBufferPool — the architecture page.
- BufferPool — the device-memory pool.
- PersistentBufferPool — the bounded alternative.
- Tensor — the user-facing type that uses pinned memory for H2D copies.
- DataLoader — the data pipeline that benefits most from pinned memory.