concepts

DataLoader

Status: Public API in netcl.data.dataloader.DataLoader

DataLoader is netcl's data-pipeline primitive. It is responsible for fetching the next batch of training examples from a Dataset, applying any user-supplied transforms, copying the batch to the device (or to a pinned-memory staging buffer), and yielding it to the training loop.

netcl's DataLoader is worker-based and asynchronous: by default it spawns a small pool of worker processes that load and preprocess the next batch in parallel with the GPU compute. The workers write their finished batches into a shared-memory ring buffer; the main process picks up finished batches without ever blocking on a worker.

Overview

The standard pattern is:

from netcl.data import DataLoader
from my_dataset import MyDataset

dataset = MyDataset(root="data/", train=True)
loader  = DataLoader(dataset, batch_size=128, shuffle=True,
                     num_workers=4, pin_memory=True,
                     drop_last=False)

for x, y in loader:
    # x, y are device tensors
    ...

Each iteration of the for loop returns a tuple of Tensors, one per dataset output. The conversion from the host numpy representation to the device tensor is done with Tensor.from_host() (which goes through the pinned-memory path when pin_memory=True).

Where It Lives

File path: data/dataloader.py.
Module path: netcl.data.dataloader.
Public re-export: from netcl.data import DataLoader.
Sibling: data.shared_batch.SharedBatchRingBuffer (the shared-memory ring used by the workers), data.augment (the CPU augmentation primitives), data.augment_gpu (the GPU augmentation primitives).

Diagram

How It Works

The DataLoader.__iter__ method sets up a pool of worker processes, each of which runs the user's collate function on a slice of the dataset. The workers write their finished batches to a SharedBatchRingBuffer, a fixed-size ring of pinned-memory buffers. The main process polls the ring; when a finished batch is available, it is copied from the pinned buffer to the device asynchronously (cl.enqueue_copy with is_blocking=False), and yielded to the user.

The pin_memory=True flag enables the pinned-memory path: the workers write directly into a pinned host buffer, and the asynchronous H2D copy from pinned memory is about 2x faster than from pageable memory on a discrete GPU. On an integrated GPU the win is smaller (the device shares RAM with the host) and pin_memory can be turned off.

The num_workers parameter controls the worker count. A good default is the number of CPU cores available to the process divided by the per-worker CPU cost. 4 is a safe starting point for most image-classification workloads.

The shuffle=True flag enables index shuffling. The shuffle is done at the dataset level (via torch.utils.data.SubsetRandomSampler-style indices), not in the worker; the workers process the indices in order. For deterministic shuffling across epochs, pass a generator argument.

Code Example

A minimal DataLoader use:

from netcl.data import DataLoader
from torchvision.datasets import MNIST

dataset = MNIST(root="data/", train=True, download=True)
loader  = DataLoader(dataset, batch_size=128, shuffle=True,
                     num_workers=4, pin_memory=True)

for x, y in loader:
    # x.shape == (128, 1, 28, 28), y.shape == (128,)
    ...

A custom collate function for variable-length sequences:

def collate(batch):
    xs, ys = zip(*batch)
    max_len = max(len(x) for x in xs)
    padded = np.zeros((len(xs), max_len), dtype=np.float32)
    for i, x in enumerate(xs):
        padded[i, :len(x)] = x
    return padded, np.array(ys)

loader = DataLoader(dataset, batch_size=32, collate_fn=collate)

A GPU-augmentation pipeline (augment_gpu):

from netcl.data.augment_gpu import random_crop, horizontal_flip

def gpu_collate(batch):
    x, y = zip(*batch)
    x = nc.concatenate([nc.from_host(b) for b in x], axis=0)
    x = random_crop(x, (224, 224))
    x = horizontal_flip(x, p=0.5)
    return x, nc.stack([nc.from_host(b) for b in y])

loader = DataLoader(dataset, batch_size=128, collate_fn=gpu_collate)

Performance & Trade-offs

num_workers=0 (the default) runs the data loading in the main process. This is the simplest configuration and is fine for small models or for unit tests. For production training, set num_workers=4 or higher.
Pinned memory is the cheapest single perf win. On a discrete GPU, pin_memory=True typically cuts the H2D copy time in half. On an integrated GPU, the win is smaller and the pinned allocation is a real cost; set it to False.
Worker lifetime is one epoch. At the end of each epoch, the workers are torn down and a new pool is spawned. This is cheap (no state survives), but it means a long-running training job will see a small pause at every epoch boundary.
Shared-memory ring size is 2 * num_workers by default. This is enough to absorb small bursts; for very uneven per-batch compute (e.g. one batch is 10x slower than the others), raise it to 4 * num_workers to give the main process more slack.