PinnedBufferPool

Status: Public API in netcl.core.memory.PinnedBufferPool

PinnedBufferPool is a sibling of BufferPool for pinned (page-locked) host memory. Pinned memory cannot be swapped out by the OS, so the GPU's DMA engine can read from or write to it directly, without the driver first copying the data through an intermediate staging buffer. The result is roughly 2x to 4x faster host-to-device (H2D) and device-to-host (D2H) transfers on discrete GPUs, at the cost of higher host-side memory pressure.

The pool is per-context (keyed on cl.Context.int_ptr) and is created on first use by get_pinned_pool(queue). There is no need to instantiate it yourself.

Overview

The pool is structurally identical to BufferPool (a dict {bucket_size: list[BufferHandle]} with a per-pool lock and a PoolStats record), but the buffers it hands out are allocated with cl.mem_flags.ALLOC_HOST_PTR. The flag asks the OpenCL runtime to allocate the buffer from host-accessible memory, which drivers typically place in a page-locked region that can then serve as a DMA source / target.
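The bucketing rule itself is not shown in this page. As an illustration only, the sketch below assumes power-of-two buckets with a small minimum size. The function name bucket_size and the min_bucket default are hypothetical, and the real _bucket_size may round differently.

```python
def bucket_size(nbytes: int, min_bucket: int = 256) -> int:
    """Round nbytes up to the next power of two, with a floor of min_bucket.

    Assumed policy for illustration; not taken from netcl's source.
    """
    if nbytes < 0:
        raise ValueError("nbytes must be non-negative")
    size = max(nbytes, min_bucket)
    # (size - 1).bit_length() is the exponent of the smallest power of two >= size.
    return 1 << (size - 1).bit_length()


# Requests of different sizes collapse onto a small set of bucket sizes,
# which is what makes released buffers reusable for later requests.
requests = [100, 256, 257, 1000, 4096]
buckets = {n: bucket_size(n) for n in requests}
```

Under this assumed policy a request can receive a buffer up to about twice as large as asked for. That trades some pinned memory for a higher chance that a released buffer fits the next request.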

The pool is not the default for device memory; it is the default for staging host memory used for H2D / D2H copies. The Tensor factory uses it under the hood when pin_memory=True is passed.

Where It Lives

File path: core/memory.py (class PinnedBufferPool).
Module path: netcl.core.memory.
Public re-export: from netcl.core.memory import PinnedBufferPool, get_pinned_pool.

How It Works

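import threading
from typing import Dict, List

import pyopencl as cl  # cl is assumed to be PyOpenCL, matching the calls below

# BufferHandle and PoolStats are defined alongside this class in core/memory.py.

class PinnedBufferPool:
    def __init__(self, context: cl.Context) -> None:
        self.context = context
        # Free handles, grouped by bucket size in bytes.
        self._free: Dict[int, List[BufferHandle]] = {}
        self._lock = threading.Lock()
        self.stats = PoolStats()

    def allocate(self, nbytes: int) -> BufferHandle:
        bucket = self._bucket_size(nbytes)
        with self._lock:
            free_list = self._free.get(bucket)
            if free_list:
                # Hit: reuse a previously released pinned buffer.
                self.stats.hits += 1
                handle = free_list.pop()
                handle.pool = self
                return handle
            # Miss: count it while still holding the lock so the stats stay consistent.
            self.stats.misses += 1
        # Allocate outside the lock so other threads are not blocked by the driver call.
        buf = cl.Buffer(self.context, cl.mem_flags.ALLOC_HOST_PTR, bucket)
        return BufferHandle(buffer=buf, nbytes=bucket, bucket_size=bucket, pool=self)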

The interface mirrors BufferPool exactly. The only difference is the ALLOC_HOST_PTR flag in the buffer allocation.
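The hit and miss paths of allocate can be exercised without a GPU. The sketch below is a toy model, not netcl code: ToyPool, Handle and Stats are hypothetical stand-ins, a bytearray replaces cl.Buffer, and release() is assumed to push the handle back onto the free list of its bucket (this page only shows that handles have a release() method).

```python
import threading
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Stats:
    hits: int = 0
    misses: int = 0


@dataclass
class Handle:
    buffer: bytearray
    bucket_size: int
    pool: Optional["ToyPool"] = None

    def release(self) -> None:
        # Hand the buffer back to its owning pool, at most once.
        if self.pool is not None:
            pool, self.pool = self.pool, None
            pool._put_back(self)


class ToyPool:
    def __init__(self) -> None:
        self._free: Dict[int, List[Handle]] = {}
        self._lock = threading.Lock()
        self.stats = Stats()

    def allocate(self, bucket: int) -> Handle:
        with self._lock:
            free_list = self._free.get(bucket)
            if free_list:
                self.stats.hits += 1
                handle = free_list.pop()
                handle.pool = self
                return handle
            self.stats.misses += 1
        # Fresh allocation; bytearray stands in for a pinned cl.Buffer.
        return Handle(buffer=bytearray(bucket), bucket_size=bucket, pool=self)

    def _put_back(self, handle: Handle) -> None:
        with self._lock:
            self._free.setdefault(handle.bucket_size, []).append(handle)


pool = ToyPool()
first = pool.allocate(1024)    # miss: nothing cached yet
first.release()                # buffer returns to the 1024-byte bucket
second = pool.allocate(1024)   # hit: the released buffer is reused
third = pool.allocate(2048)    # miss: different bucket
reused = second.buffer is first.buffer
```

A high hit rate is the point of the pool: pinned allocations are comparatively expensive, so reusing a released buffer avoids repeating that cost on every transfer.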

Code Example

import netcl as nc
from netcl.core.memory import get_pinned_pool

ctx, queue = nc.device.manager.default()
pinned = get_pinned_pool(queue)

nbytes = 4 * 1024 * 1024            # example size: a 4 MiB staging buffer
h = pinned.allocate(nbytes)         # BufferHandle wrapping a cl.Buffer created with ALLOC_HOST_PTR
# ... use h.buffer for staging H2D copies ...
h.release()

Via the Tensor factory:

import numpy as np
numpy_array = np.ones((1024, 1024), dtype=np.float32)  # example host data
x = nc.Tensor.from_host(numpy_array, pin_memory=True)
# Internally, x.host_ref is allocated from get_pinned_pool(queue).

Performance & Trade-offs

Discrete GPU: 2x to 4x faster H2D / D2H than pageable memory. This is the single biggest perf win in the data pipeline.
Integrated GPU: 1x to 1.5x — the device shares RAM with the host, so the DMA advantage is smaller. The NETCL_PINNED_H2D=0 env var disables the pinned path on integrated devices.
Memory cost: pinned memory is not swappable. Pinning a lot of memory can starve the OS of pageable pages. The default Tensor.from_host allocates the pinned buffer and frees it as soon as the H2D copy completes, so the cost is transient.
Thread-safety: same as BufferPool — a single threading.Lock covers the whole pool.
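NUMA effects: on multi-socket systems, pinning a buffer on a NUMA node far from the GPU can reduce transfer throughput. The pool does not currently handle this; on Linux, use numactl to bind the process to the appropriate node if this matters for your workload.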
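To keep the memory cost transient in your own staging code, as described under "Memory cost" above, one option is to wrap allocate / release in a context manager so the buffer goes back to the pool even if the copy raises. The sketch below is a suggestion and not part of netcl: staging_buffer is a hypothetical helper, and StubPool / StubHandle are stand-ins that mimic only the allocate(nbytes) and release() calls shown in the code example.

```python
from contextlib import contextmanager


class StubHandle:
    """Stand-in for BufferHandle; records whether release() was called."""

    def __init__(self, nbytes: int) -> None:
        self.nbytes = nbytes
        self.released = False

    def release(self) -> None:
        self.released = True


class StubPool:
    """Stand-in for PinnedBufferPool; remembers every handle it hands out."""

    def __init__(self) -> None:
        self.handles = []

    def allocate(self, nbytes: int) -> StubHandle:
        handle = StubHandle(nbytes)
        self.handles.append(handle)
        return handle


@contextmanager
def staging_buffer(pool, nbytes: int):
    """Borrow a pinned staging buffer and always give it back."""
    handle = pool.allocate(nbytes)
    try:
        yield handle
    finally:
        handle.release()


pool = StubPool()

# Normal path: the buffer is held for the duration of the block only.
with staging_buffer(pool, 1 << 20) as h:
    held_during_copy = not h.released

# Failure path: the buffer is still released when the copy raises.
try:
    with staging_buffer(pool, 4096):
        raise RuntimeError("simulated copy failure")
except RuntimeError:
    pass

all_released = all(handle.released for handle in pool.handles)
```

With a real queue, the body of the with block should also wait for the copy to finish (for example by waiting on the copy's event) before leaving, so the buffer is not handed to another caller while the DMA transfer is still in flight.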