Tutorial: MNIST with an MLP

In this tutorial we train a small MLP with two hidden layers on MNIST, entirely in netcl, with no PyTorch and no CUDA. The walk-through is deliberately explicit: it is the smallest end-to-end example that exercises Tensors, the Tape recorder, an Optimizer, an LR schedule, AMP mixed precision, and save_model / load_model checkpointing. Once you have MNIST running, the same skeleton applies (with a different model) to any small classification dataset.

The training script is ~50 lines and lives in a single file.

Prerequisites

You should be comfortable with:

The Quickstart page, which shows how to install netcl and run a single kernel launch.
The Tensor data model — what a Tensor is, how it carries a queue and a buffer, and how to_host() synchronizes.
The Tensor Backend page, for the difference between the OpenCL and CPU backends, the BufferPool, and the asynchronous H2D copy.

You do not need to have read the autograd or JIT Compiler pages in detail; this tutorial explains the parts of the Tape it touches.

What You'll Build

By the end of the tutorial you will have a script that:

Loads MNIST with the DataLoader worker pool and a normalize filter.
Defines a 784 → 256 → 128 → 10 MLP using stock Linear + activation + Dropout blocks.
Trains the model with CrossEntropyLoss, AdamW, and a CosineAnnealingLR schedule, all inside an explicit Tape loop.
Optionally wraps the forward in autocast and the step in GradScaler for half-precision training on devices that advertise cl_khr_fp16.
Saves the trained weights with save_model and reloads them with load_model for inference.

Step-by-Step

1. Load the Data

The DataLoader accepts any object with __len__ and __getitem__. We wrap the in-memory MNIST arrays in a tiny dataset class and let the loader handle batching, shuffling, and worker processes.

import numpy as np
from netcl.data.dataloader import DataLoader
from netcl.data.filters import normalize, to_float

class MNISTInMemory:
    """Minimal dataset: each item is a {"x": ..., "y": ...} dict of NumPy values."""
    def __init__(self, x: np.ndarray, y: np.ndarray):
        self.x = x.astype(np.float32) / 255.0     # scale to [0, 1]
        self.y = y.astype(np.int64)
    def __len__(self):
        return len(self.x)
    def __getitem__(self, i):
        return {"x": self.x[i].reshape(-1), "y": self.y[i]}

# Load MNIST (e.g. via tensorflow_datasets, torchvision, or your own loader).
# x_train: (60000, 28, 28), y_train: (60000,)
ds = MNISTInMemory(x_train, y_train)

# Per-channel MNIST normalization. mean=0.1307, std=0.3081 are the standard values.
pipeline = [normalize(mean=(0.1307,), std=(0.3081,))]

loader = DataLoader(
    ds,
    batch_size=128,
    prefetch=4,
    shuffle=True,
    num_workers=2,
    transforms=pipeline,
)

Cross-platform note. On Linux the worker pool uses fork so the in-memory dataset is shared copy-on-write at almost no cost. On Windows / macOS the start method is spawn, and the dataset is sent to each worker once through the pool initializer. Either way, the loader interface is the same.

The transforms=[normalize(...)] argument is a list of FilterFns, each shaped like (xb, yb) -> (xb, yb). The normalize helper here is the dataset-level filter that subtracts the mean and divides by the standard deviation per channel.
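A filter of this shape is easy to picture in plain NumPy. The sketch below is illustrative only: make_normalize is a hypothetical stand-in, not netcl's normalize, and it assumes a single-channel batch laid out as (batch, features).

```python
import numpy as np

def make_normalize(mean: float, std: float):
    """Return a filter (xb, yb) -> (xb, yb) that standardizes xb. Hypothetical helper."""
    def _filter(xb: np.ndarray, yb: np.ndarray):
        # Labels pass through untouched; only the inputs are rescaled.
        return (xb - mean) / std, yb
    return _filter

norm = make_normalize(0.1307, 0.3081)
xb = np.full((4, 784), 0.1307, dtype=np.float32)   # a batch sitting exactly at the mean
yb = np.arange(4, dtype=np.int64)
xb_out, yb_out = norm(xb, yb)                      # inputs at the mean map to zero
```

Because every filter takes and returns the same pair, filters can be chained in a list without any glue code.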

2. Build the Model

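A 784→256→128→10 MLP is built from Linear + ReLU + optional Dropout blocks. netcl.nn also provides build_sequential and example_mlp_config for building models from a config; here the layers are passed to Sequential directly. The result is a Sequential, which is a Module subclass.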

from netcl.core.device import manager
from netcl.nn import build_sequential, Linear, ReLU, Dropout, Sequential

dev = manager.default("auto")

model = Sequential(
    Linear(dev.queue, 28 * 28, 256),
    ReLU(),
    Dropout(p=0.1),
    Linear(dev.queue, 256, 128),
    ReLU(),
    Dropout(p=0.1),
    Linear(dev.queue, 128, 10),
)
print(model)

model.parameters() returns every Parameter on the active device. We hand that iterator to the optimizer below.
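To make the layer shapes concrete, here is a NumPy-only sketch of the same forward pass. It is not how netcl implements Linear or Sequential. The weight initialization is an arbitrary assumption, and Dropout is left out because it acts as the identity at inference time.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [784, 256, 128, 10]

# One (weight, bias) pair per Linear layer; small random weights for illustration.
layers = [
    (rng.normal(0.0, 0.01, size=(n_in, n_out)).astype(np.float32),
     np.zeros(n_out, dtype=np.float32))
    for n_in, n_out in zip(sizes[:-1], sizes[1:])
]

def forward(x: np.ndarray) -> np.ndarray:
    """Inference-style forward pass through the three Linear layers."""
    for i, (w, b) in enumerate(layers):
        x = x @ w + b
        if i < len(layers) - 1:        # ReLU after every layer except the last
            x = np.maximum(x, 0.0)
    return x

logits = forward(rng.random((128, 784), dtype=np.float32))   # one batch of 128 samples
```

A batch of shape (128, 784) comes out as (128, 10): one row of raw logits per sample.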

3. Choose a Loss, an Optimizer, and a Schedule

The recommended defaults for a small classification task are:

CrossEntropyLoss: log-softmax + negative log-likelihood fused into one op. Accepts raw logits and integer class labels. The training loop below uses ag.cross_entropy for this loss.
AdamW: adaptive moments with decoupled weight decay. On MNIST it typically converges in fewer epochs than plain SGD and is less sensitive to the learning rate.
CosineAnnealingLR: smoothly decays the learning rate from the initial value to eta_min over T_max epochs.
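"Decoupled weight decay" means the decay shrinks the parameter directly instead of being added to the gradient. The NumPy sketch below shows one textbook AdamW update for a single array. The function adamw_step and its betas and eps defaults are assumptions chosen for illustration, not netcl's implementation or defaults.

```python
import numpy as np

def adamw_step(p, g, m, v, t, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=1e-4):
    """One AdamW update for a single parameter array. Returns (p, m, v)."""
    b1, b2 = betas
    m = b1 * m + (1.0 - b1) * g            # first-moment estimate
    v = b2 * v + (1.0 - b2) * g * g        # second-moment estimate
    m_hat = m / (1.0 - b1 ** t)            # bias correction, t starts at 1
    v_hat = v / (1.0 - b2 ** t)
    p = p - lr * weight_decay * p          # decoupled decay: applied to p, not folded into g
    p = p - lr * m_hat / (np.sqrt(v_hat) + eps)
    return p, m, v

# With a zero gradient only the weight-decay term moves the parameter.
p_decayed, _, _ = adamw_step(np.ones(3), np.zeros(3), np.zeros(3), np.zeros(3), t=1)

# With a unit gradient and no decay, the first step has magnitude close to lr.
p_stepped, _, _ = adamw_step(np.ones(3), np.ones(3), np.zeros(3), np.zeros(3), t=1,
                             weight_decay=0.0)
```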

from netcl.optim import AdamW, CosineAnnealingLR
import netcl.autograd as ag

opt = AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
sched = CosineAnnealingLR(opt, T_max=10, eta_min=1e-5)

T_max=10 matches the 10-epoch budget below. If you train for longer, set T_max to the total number of epochs.
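For intuition about the learning rates this produces, the standard closed-form cosine annealing rule can be evaluated by hand. This sketch assumes netcl's CosineAnnealingLR follows that standard formula; cosine_lr is a hypothetical helper, not part of the library.

```python
import math

def cosine_lr(epoch: int, base_lr: float = 1e-3, eta_min: float = 1e-5, t_max: int = 10) -> float:
    """Closed-form cosine annealing: base_lr at epoch 0, eta_min at epoch t_max."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * epoch / t_max))

# Learning rate at the start of each epoch, plus the final value at epoch 10.
schedule = [cosine_lr(e) for e in range(11)]
```

The rate starts at 1e-3, passes through the midpoint of the two bounds halfway through, and reaches eta_min exactly at T_max.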

4. Train with an Explicit Tape Loop

Why no Trainer? netcl does not ship a high-level Trainer class. The distributed module exposes data_parallel_step as a function-based training helper, but for single-device training the loop is written out explicitly. The explicit loop is short, and writing it out is useful while learning the framework.

import netcl.autograd as ag

for epoch in range(10):
    for batch in loader:
        x, y = batch["x"], batch["y"]
        with ag.Tape() as tape:
            logits = model(x)
            loss = ag.cross_entropy(logits, y)
        tape.backward(loss)
        opt.step()
        opt.zero_grad()
    sched.step()
    print(f"epoch {epoch}: loss = {float(loss.to_host()):.4f}")

Five things are happening on every step:

with ag.Tape() as tape: installs the thread-local current tape so that every op inside the block is recorded.
logits = model(x) runs the forward pass — each Linear / Dropout call goes through apply_op and is appended to tape.nodes.
loss = ag.cross_entropy(logits, y) registers the loss op.
tape.backward(loss) walks the graph in reverse topological order, calls each op's grad_fn, and accumulates the per-parameter gradient into param.grad.
opt.step() mutates the parameters in place; opt.zero_grad() clears the gradient buffers. sched.step() is called once per epoch, not per batch.
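To see what the loss op and its grad_fn compute, here is a NumPy reference for cross-entropy on raw logits together with its gradient with respect to the logits. It is the textbook formulation, not netcl's kernel; cross_entropy_with_grad is a hypothetical name.

```python
import numpy as np

def cross_entropy_with_grad(logits: np.ndarray, labels: np.ndarray):
    """Mean cross-entropy over a batch, plus its gradient with respect to the logits."""
    shifted = logits - logits.max(axis=1, keepdims=True)      # stabilize exp()
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    n = logits.shape[0]
    loss = -log_probs[np.arange(n), labels].mean()
    grad = np.exp(log_probs)                                  # softmax probabilities
    grad[np.arange(n), labels] -= 1.0                         # softmax - one_hot
    return loss, grad / n

# Identical logits for all 10 classes mean maximal uncertainty,
# so the loss equals ln(10), the value an untrained model starts near.
loss, grad = cross_entropy_with_grad(np.zeros((4, 10)), np.array([3, 1, 4, 1]))
```

The gradient rows sum to zero, which is a quick sanity check when writing a custom loss op.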

5. Optional: Mixed Precision with AMP

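If your device advertises cl_khr_fp16, wrap the forward in autocast and the step in GradScaler; on hardware with fast half-precision arithmetic this can give a meaningful speedup. Support for the extension varies by vendor and driver version, so check the extension list the device reports (for example with clinfo) rather than inferring it from the GPU model.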

import netcl.amp as amp

scaler = amp.GradScaler(init_scale=2.0**16, enabled=True)

for epoch in range(10):
    for batch in loader:
        x, y = batch["x"], batch["y"]
        with ag.Tape() as tape:
            with amp.autocast(enabled=True):
                logits = model(x)
                loss = ag.cross_entropy(logits, y)
            scaled = scaler.scale_loss(loss)
        tape.backward(scaled)
        scaler.step(opt, model.parameters())
        opt.zero_grad()
    sched.step()
    print(f"epoch {epoch}: loss = {float(loss.to_host()):.4f}")

Two new pieces:

The autocast context manager flips a thread-local flag. The autograd ops read the flag and cast fp32 Tensors to fp16 where the device allows it.
GradScaler.scale_loss multiplies the loss by a running scale (starting at 2**16) on-device. The backward is then computed on the scaled loss, and scaler.step divides the gradient by scale before calling opt.step(). If any gradient is inf/nan, the step is skipped and the scale is reduced; the next scaler.update() may grow it again.

On devices without cl_khr_fp16 the autocast context manager silently degrades to fp32 (it re-probes supports_fp16(queue) on __enter__).
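The skip-and-rescale behaviour described above can be sketched in a few lines of NumPy. ToyGradScaler is a hypothetical illustration of dynamic loss scaling in general, not netcl's GradScaler. The backoff factor, growth factor, and growth interval are assumed values.

```python
import numpy as np

class ToyGradScaler:
    """Illustrative dynamic loss scaler; not netcl's GradScaler."""
    def __init__(self, init_scale=2.0 ** 16, backoff=0.5, growth=2.0, growth_interval=2000):
        self.scale = init_scale
        self.backoff, self.growth, self.growth_interval = backoff, growth, growth_interval
        self.good_steps = 0

    def step(self, grads, apply_update) -> bool:
        """Unscale grads and call apply_update(grads) unless any value is inf/nan."""
        unscaled = [g / self.scale for g in grads]
        if not all(np.isfinite(g).all() for g in unscaled):
            self.scale *= self.backoff          # overflow: skip the step, shrink the scale
            self.good_steps = 0
            return False
        apply_update(unscaled)
        self.good_steps += 1
        if self.good_steps >= self.growth_interval:
            self.scale *= self.growth           # long run of clean steps: try a larger scale
            self.good_steps = 0
        return True

toy = ToyGradScaler()
applied = []                                    # records the unscaled grads that were applied
ok_first = toy.step([np.array([65536.0, 131072.0])], applied.append)
ok_overflow = toy.step([np.array([np.inf, 1.0])], applied.append)
```

The first call applies gradients of [1.0, 2.0]; the second is skipped and halves the scale to 2**15.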

6. Inference

For inference, turn grad mode off so that the Tape does not record the forward pass and no gradient buffer is allocated for the output tensor.

# x_test must be preprocessed like the training data: flattened to (N, 784), scaled, normalized.
x_test = x_test[:8]
with ag.no_grad():
    out = model(x_test).to_host()       # (8, 10) logits
pred = out.argmax(axis=1)

no_grad (or its functional twin set_grad_enabled(False)) skips the recording pass on every op; the output Tensor is the same shape and dtype as in training but its grad slot is never allocated.
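Once the logits are on the host as a NumPy array, turning them into predictions and an accuracy figure needs no framework code. A minimal sketch with hand-made values; accuracy is a hypothetical helper.

```python
import numpy as np

def accuracy(logits: np.ndarray, labels: np.ndarray) -> float:
    """Fraction of rows whose arg-max class matches the label."""
    return float((logits.argmax(axis=1) == labels).mean())

# Hand-made logits for three samples and three classes (illustrative values).
logits = np.array([[2.0, 0.1, 0.3],
                   [0.2, 0.1, 1.5],
                   [0.9, 1.1, 0.4]])
labels = np.array([0, 2, 0])
acc = accuracy(logits, labels)      # two of the three predictions match
```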

7. Save and Load the Model

save_model writes a single self-contained .netcl file (a NumPy .npz archive) holding every parameter plus an embedded JSON metadata entry describing the model architecture.

from netcl.io import save_model, load_model

save_model(model, "mnist_mlp.netcl")

# In a fresh process:
model = load_model("mnist_mlp.netcl")

The file format is documented in detail on the io page; the short version is that each parameter becomes a {layer_index}:{state_dict_key} entry, and the __netcl_meta__ entry carries the model type, layer config, and version. load_model is backwards-compatible with older two-file layouts (.json + .npz) and silently keeps the freshly-built layer's initialization for any missing key.
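The key layout can be pictured with plain NumPy. The sketch below writes and reads an in-memory .npz that follows the naming scheme described above. The metadata fields and the way the JSON is stored (as a string array) are assumptions for illustration; the io page remains the authority on the real format.

```python
import io
import json
import numpy as np

# Hypothetical state for a single Linear layer at index 0.
state = {
    "0:weight": np.zeros((784, 256), dtype=np.float32),
    "0:bias": np.zeros(256, dtype=np.float32),
}
meta = {"model_type": "Sequential", "version": 1}      # illustrative fields only

buf = io.BytesIO()                                      # stands in for the file on disk
np.savez(buf, __netcl_meta__=np.array(json.dumps(meta)), **state)
buf.seek(0)

with np.load(buf) as archive:
    keys = sorted(archive.files)
    restored_meta = json.loads(archive["__netcl_meta__"].item())
    weight_shape = archive["0:weight"].shape
```

Because the archive is an ordinary .npz, np.load is enough to inspect a checkpoint when debugging a missing or misnamed key.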

Troubleshooting

Common symptoms and their usual causes:

NaN loss from step 1. The two usual culprits are (a) the learning rate is too high for the AdamW defaults — try lr=3e-4 or lr=1e-4, and (b) mixed precision on a device that does not advertise cl_khr_fp16. Disable AMP by setting amp.autocast(enabled=False) and GradScaler(enabled=False); if the NaNs go away, your device does not support fp16 and the autocast probe was bypassed.
Very slow first step, then normal speed. The first call to a function decorated with @jit_compile runs the JIT pass — it traces the op chain, generates an OpenCL kernel pair, and waits for the device to build it. This can take 200–800 ms the first time per unique shape; subsequent steps reuse the cached program. The JIT Compiler page explains the warm-up budget.
Crash on Windows with clBuildProgram failure or a black-screen driver reset. The AMD/NVIDIA OpenCL driver is older than the ICD that netcl was tested against. Update the vendor driver first; if the issue persists, set NETCL_KERNEL_STRATEGY=portable to force the conservative kernel variants and try again. On Intel iGPUs, the OpenCL runtime is bundled with the GPU driver; install the latest one from the Intel Arc & Iris Xe Graphics driver page.
RuntimeError: detect_anomaly: Gradient w.r.t. parent N of op 'X' contains NaN or Inf. This is the detect_anomaly diagnostic firing: as the message says, a gradient produced during the backward pass contains NaN or Inf. The error message includes the creation_trace of the offending Node, so jump to the frame it points at. Most often the cause is a divide-by-zero in a custom op, or an eps that is too small in a normalization.
Loss is stuck near log(10) ≈ 2.30. The model is not learning at all. Check that the DataLoader was created with shuffle=True and that transforms=[normalize(...)] is actually being applied (a typo in the filter name silently falls back to the identity).
The save_model file is much larger than expected. The MLP has only ~235 K parameters (~940 KB at fp32); if your file is tens of MB, you are probably saving the full Optimizer state as well. Use save_params from netcl.io.checkpoint for a raw NPZ of just the weights.
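The expected checkpoint size follows directly from the layer widths. This short calculation assumes every Linear layer has a bias and that nothing but fp32 weights is stored.

```python
sizes = [784, 256, 128, 10]

# Each Linear layer holds an (n_in x n_out) weight matrix plus n_out biases.
per_layer = [n_in * n_out + n_out for n_in, n_out in zip(sizes[:-1], sizes[1:])]
total_params = sum(per_layer)
fp32_bytes = total_params * 4          # 4 bytes per float32 value
```

That gives 200,960 + 32,896 + 1,290 = 235,146 parameters, or 940,584 bytes, so a weights-only file should be a little under 1 MB before any archive overhead.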