DataLoader
Status: Public API in netcl.data.dataloader.DataLoader
DataLoader is netcl's data-pipeline primitive. It is responsible
for fetching the next batch of training examples from a Dataset,
applying any user-supplied transforms, copying the batch to the
device (or to a pinned-memory staging buffer), and yielding it to
the training loop.
netcl's DataLoader is worker-based and asynchronous: when
num_workers is greater than 0, it spawns a pool of worker processes
that load and preprocess upcoming batches in parallel with the GPU
compute. The workers write their finished batches into a
shared-memory ring buffer; the main process picks up finished
batches from the ring and waits only when no batch is ready yet.
Overview
The standard pattern is:
from netcl.data import DataLoader
from my_dataset import MyDataset

dataset = MyDataset(root="data/", train=True)
loader = DataLoader(dataset, batch_size=128, shuffle=True,
                    num_workers=4, pin_memory=True,
                    drop_last=False)

for x, y in loader:
    # x, y are device tensors
    ...
Each iteration of the for loop returns a tuple of
Tensors, one per dataset output. The
conversion from the host numpy representation to the device
tensor is done with Tensor.from_host() (which goes through the
pinned-memory path when pin_memory=True).
Where It Lives
- File path: data/dataloader.py.
- Module path: netcl.data.dataloader.
- Public re-export: from netcl.data import DataLoader.
- Sibling: data.shared_batch.SharedBatchRingBuffer (the shared-memory ring used by the workers), data.augment (the CPU augmentation primitives), data.augment_gpu (the GPU augmentation primitives).
How It Works
The DataLoader.__iter__ method sets up a pool of worker
processes, each of which runs the user's collate function on a
slice of the dataset. The workers write their finished batches to
a SharedBatchRingBuffer, a fixed-size ring of pinned-memory
buffers. The main process polls the ring; when a finished batch
is available, it is copied from the pinned buffer to the device
asynchronously (cl.enqueue_copy with is_blocking=False), and
yielded to the user.
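The hand-off between workers and the main process can be sketched with standard-library pieces. This is a minimal illustration, not netcl's implementation: threads stand in for worker processes, a bounded queue.Queue stands in for the SharedBatchRingBuffer, and the names worker, iterate, load_batch and ring_slots are hypothetical. The device copy step is omitted.

```python
import queue
import threading


def worker(index_batches, load_batch, ring):
    """Load each assigned batch and publish it to the bounded ring."""
    for batch_indices in index_batches:
        ring.put(load_batch(batch_indices))  # blocks while the ring is full
    ring.put(None)  # sentinel: this worker has no more batches


def iterate(index_batches, load_batch, num_workers=2, ring_slots=4):
    """Yield batches as workers finish them (completion order, not index order)."""
    ring = queue.Queue(maxsize=ring_slots)
    # Worker i gets every num_workers-th batch of indices.
    shards = [index_batches[i::num_workers] for i in range(num_workers)]
    threads = [
        threading.Thread(target=worker, args=(shard, load_batch, ring), daemon=True)
        for shard in shards
    ]
    for t in threads:
        t.start()
    finished = 0
    while finished < num_workers:
        item = ring.get()  # waits only when no finished batch is available
        if item is None:
            finished += 1
        else:
            yield item
    for t in threads:
        t.join()


if __name__ == "__main__":
    batches = [[0, 1], [2, 3], [4, 5], [6]]
    for batch in iterate(batches, lambda idx: [i * i for i in idx]):
        print(batch)
```

The bounded ring is what gives back-pressure: a worker that runs ahead stalls on put() instead of filling memory, and the consumer stalls on get() only when every slot is empty.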
The pin_memory=True flag enables the pinned-memory path: the
workers write directly into a pinned host buffer, and the
asynchronous H2D copy from pinned memory is about 2x faster than
from pageable memory on a discrete GPU. On an integrated GPU the
win is smaller (the device shares RAM with the host) and
pin_memory can be turned off.
The num_workers parameter controls the worker count. A reasonable
starting value is the number of CPU cores available to the process
divided by the number of cores each worker keeps busy. 4 is a safe
starting point for most image-classification workloads.
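The sizing rule above is simple arithmetic and can be written down as a small helper. This is a sketch only: suggest_num_workers and cores_per_worker are hypothetical names, not part of netcl, and the per-worker cost has to be measured for the actual pipeline.

```python
import os


def available_cores():
    """CPU cores this process may run on (honours affinity masks where supported)."""
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


def suggest_num_workers(cores_per_worker=1.0, cores=None):
    """Hypothetical helper: available cores divided by per-worker CPU cost, at least 1."""
    if cores_per_worker <= 0:
        raise ValueError("cores_per_worker must be positive")
    if cores is None:
        cores = available_cores()
    return max(int(cores // cores_per_worker), 1)


if __name__ == "__main__":
    # Example: each worker keeps roughly two cores busy (decode plus augmentation).
    print(suggest_num_workers(cores_per_worker=2.0))
```

With 8 available cores and a cost of 2 cores per worker this gives 4, which matches the suggested starting point.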
The shuffle=True flag enables index shuffling. The shuffle is
done at the dataset level (via
torch.utils.data.SubsetRandomSampler-style indices), not in the
worker; the workers process the indices in the order they receive
them. For reproducible shuffling across epochs, pass a seeded
generator argument.
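Index-level shuffling can be illustrated without any loader at all. The sketch below assumes the conventional meaning of shuffle and drop_last, and uses a NumPy Generator as a stand-in for whatever object netcl accepts as generator; epoch_index_batches is a hypothetical name.

```python
import numpy as np


def epoch_index_batches(dataset_len, batch_size, shuffle=True, drop_last=False, rng=None):
    """Return the list of index arrays that make up one epoch."""
    indices = np.arange(dataset_len)
    if shuffle:
        rng = rng if rng is not None else np.random.default_rng()
        indices = rng.permutation(indices)  # shuffle indices, never the data itself
    batches = [indices[i:i + batch_size] for i in range(0, dataset_len, batch_size)]
    if drop_last and batches and len(batches[-1]) < batch_size:
        batches.pop()  # discard the trailing partial batch
    return batches


if __name__ == "__main__":
    rng = np.random.default_rng(0)  # seeded, so every run sees the same epochs
    for epoch in range(2):
        print(epoch, [b.tolist() for b in epoch_index_batches(10, 4, rng=rng)])
```

Reusing one seeded generator across epochs gives a different permutation each epoch while keeping the whole run reproducible; creating a fresh generator with the same seed every epoch would repeat the same order.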
Code Example
A minimal DataLoader use:
from netcl.data import DataLoader
from torchvision.datasets import MNIST

dataset = MNIST(root="data/", train=True, download=True)
loader = DataLoader(dataset, batch_size=128, shuffle=True,
                    num_workers=4, pin_memory=True)

for x, y in loader:
    # Full batches: x.shape == (128, 1, 28, 28), y.shape == (128,).
    # The final batch is smaller unless drop_last=True (60000 % 128 == 96).
    ...
A custom collate function for variable-length sequences:
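import numpy as np

def collate(batch):
    # batch is a list of (sequence, label) pairs of differing lengths
    xs, ys = zip(*batch)
    max_len = max(len(x) for x in xs)
    # Zero-pad every sequence on the right to the longest length in the batch
    padded = np.zeros((len(xs), max_len), dtype=np.float32)
    for i, x in enumerate(xs):
        padded[i, :len(x)] = x
    return padded, np.array(ys)

loader = DataLoader(dataset, batch_size=32, collate_fn=collate)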
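To see what a padding collate function produces, it can be called directly on a hand-built batch, with no loader involved. The variant below is a sketch: pad_collate is a hypothetical name, and returning the original lengths is an addition that lets downstream code tell padding apart from genuine zeros. Whether a three-element return fits a given netcl pipeline is an assumption to check.

```python
import numpy as np


def pad_collate(batch):
    """Right-pad variable-length sequences with zeros and keep their true lengths."""
    xs, ys = zip(*batch)
    lengths = np.array([len(x) for x in xs], dtype=np.int64)
    padded = np.zeros((len(xs), int(lengths.max())), dtype=np.float32)
    for i, x in enumerate(xs):
        padded[i, :len(x)] = x
    return padded, lengths, np.array(ys)


samples = [
    (np.array([1.0, 2.0, 3.0]), 0),
    (np.array([4.0]), 1),
    (np.array([5.0, 6.0]), 0),
]
padded, lengths, labels = pad_collate(samples)
# Boolean mask: True where a position holds real data, False where it is padding.
mask = np.arange(padded.shape[1]) < lengths[:, None]

if __name__ == "__main__":
    print(padded)
    print(mask)
```

Here padded has shape (3, 3), and the mask marks six real positions out of nine.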
A GPU-augmentation pipeline (augment_gpu):
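import netcl as nc  # assumed alias; the original snippet uses `nc` without importing it
from netcl.data import DataLoader
from netcl.data.augment_gpu import random_crop, horizontal_flip

def gpu_collate(batch):
    x, y = zip(*batch)
    # Move each sample to the device and concatenate along axis 0
    x = nc.concatenate([nc.from_host(b) for b in x], axis=0)
    x = random_crop(x, (224, 224))
    x = horizontal_flip(x, p=0.5)
    return x, nc.stack([nc.from_host(b) for b in y])

loader = DataLoader(dataset, batch_size=128, collate_fn=gpu_collate)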
Performance & Trade-offs
- num_workers=0 (the default) runs the data loading in the main process. This is the simplest configuration and is fine for small models or for unit tests. For production training, set num_workers=4 or higher.
- Pinned memory is the cheapest single perf win. On a discrete GPU, pin_memory=True typically cuts the H2D copy time in half. On an integrated GPU, the win is smaller and the pinned allocation is a real cost; set it to False.
- Worker lifetime is one epoch. At the end of each epoch, the workers are torn down and a new pool is spawned. This is cheap (no state survives), but it means a long-running training job will see a small pause at every epoch boundary.
- Shared-memory ring size is 2 * num_workers by default. This is enough to absorb small bursts; for very uneven per-batch compute (e.g. one batch is 10x slower than the others), raise it to 4 * num_workers to give the main process more slack.
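Because the ring is made of pinned buffers, its size translates directly into pinned host memory. The estimate below is a rough sketch under two assumptions: each ring slot holds one full input batch, and labels are small enough to ignore. ring_bytes and slots_per_worker are hypothetical names.

```python
import numpy as np


def ring_bytes(batch_shape, dtype, num_workers, slots_per_worker=2):
    """Approximate pinned memory used by a ring of slots_per_worker * num_workers batches."""
    batch_bytes = int(np.prod(batch_shape)) * np.dtype(dtype).itemsize
    return batch_bytes * slots_per_worker * num_workers


# 128 float32 images of 3 x 224 x 224, 4 workers.
default_ring = ring_bytes((128, 3, 224, 224), np.float32, num_workers=4)
larger_ring = ring_bytes((128, 3, 224, 224), np.float32, num_workers=4, slots_per_worker=4)

if __name__ == "__main__":
    print(default_ring / 2**20)  # 588.0 MiB with 2 * num_workers slots
    print(larger_ring / 2**20)   # 1176.0 MiB with 4 * num_workers slots
```

Doubling the ring to absorb uneven batches doubles this footprint, which is worth weighing on hosts where pinned memory is scarce.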
See also
- DataLoader — the API page.
- Tensor — the destination type for each batch.
- BufferPool — the pool the device side uses.
- DistributedDataParallel — the multi-replica training wrapper.
- Tutorial: Data-Parallel Training — how the loader integrates with multi-replica training.