Tutorial: MNIST with an MLP
In this tutorial we train a small MLP with two hidden layers on MNIST, entirely in netcl, with no PyTorch and no CUDA. The walk-through is deliberately explicit: it is the smallest end-to-end example that exercises Tensors, the Tape recorder, an Optimizer, an LR schedule, AMP mixed precision, and save_model / load_model checkpointing. Once you have MNIST running, the same skeleton applies (with a different model) to any small classification dataset.
The training script is ~50 lines and lives in a single file.
Prerequisites
You should be comfortable with:
- The Quickstart page, which shows how to install netcl and run a single kernel launch.
- The Tensor data model: what a Tensor is, how it carries a queue and a buffer, and how to_host() synchronizes.
- The Tensor Backend page, for the difference between the OpenCL and CPU backends, the BufferPool, and the asynchronous H2D copy.
You do not need to have read the autograd or JIT Compiler pages in detail; this tutorial explains the parts of the Tape it touches.
What You'll Build
By the end of the tutorial you will have a script that:
- Loads MNIST with the DataLoader worker pool and a normalize filter.
- Defines a 784 → 256 → 128 → 10 MLP using stock Linear + activation + Dropout blocks.
- Trains the model with CrossEntropyLoss, AdamW, and a CosineAnnealingLR schedule, all inside an explicit Tape loop.
- Optionally wraps the forward in autocast and the step in GradScaler for half-precision training on devices that advertise cl_khr_fp16.
- Saves the trained weights with save_model and reloads them with load_model for inference.
Step-by-Step
1. Load the Data
The DataLoader accepts any object with __len__ and __getitem__. We wrap the
in-memory MNIST arrays in a tiny dataset class and let the loader handle batching, shuffling,
and worker processes.
import numpy as np
from netcl.data.dataloader import DataLoader
from netcl.data.filters import normalize, to_float


class MNISTInMemory:
    """Minimal dataset: yields {"x": ..., "y": ...} dicts of NumPy arrays."""

    def __init__(self, x: np.ndarray, y: np.ndarray):
        self.x = x.astype(np.float32) / 255.0  # scale to [0, 1]
        self.y = y.astype(np.int64)

    def __len__(self):
        return len(self.x)

    def __getitem__(self, i):
        # Flatten each 28x28 image to a 784-vector for the MLP.
        return {"x": self.x[i].reshape(-1), "y": self.y[i]}


# Load MNIST (e.g. via tensorflow_datasets, torchvision, or your own loader).
# x_train: (60000, 28, 28), y_train: (60000,)
ds = MNISTInMemory(x_train, y_train)

# Per-channel MNIST normalization. mean=0.1307, std=0.3081 are the standard values.
pipeline = [normalize(mean=(0.1307,), std=(0.3081,))]

loader = DataLoader(
    ds,
    batch_size=128,
    prefetch=4,
    shuffle=True,
    num_workers=2,
    transforms=pipeline,
)
Cross-platform note. On Linux the worker pool uses fork, so the in-memory dataset is shared copy-on-write at almost no cost. On Windows / macOS the start method is spawn, and the dataset is sent to each worker once through the pool initializer. Either way, the loader interface is the same.
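To see which start methods your own interpreter offers, the standard library can report them. This is a minimal sketch that uses only Python's multiprocessing module; the helper name describe_start_methods is made up for illustration, and the sketch says nothing about how netcl configures its pool, which may request a specific context explicitly.

```python
import multiprocessing as mp


def describe_start_methods() -> dict:
    """Report the start methods this interpreter supports and its default."""
    return {
        "available": mp.get_all_start_methods(),
        "default": mp.get_start_method(),
    }


info = describe_start_methods()
print(info)
```

The interpreter's default can differ between Python versions and platforms, so treat the printed value as a property of your environment rather than of the loader.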
The transforms=[normalize(...)] argument is a list of FilterFns, each shaped like
(xb, yb) -> (xb, yb). The normalize helper here is the dataset-level filter
that subtracts the mean and divides by the standard deviation per channel.
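The filter shape described above can be sketched in plain NumPy. This is an illustration under assumptions, not netcl's implementation: make_normalize is a hypothetical stand-in for the library's normalize helper, and a single scalar mean and std are used because MNIST has one channel.

```python
import numpy as np


def make_normalize(mean: float, std: float):
    """Return a filter with the (xb, yb) -> (xb, yb) shape described above."""
    def _filter(xb: np.ndarray, yb: np.ndarray):
        # Labels pass through untouched; only the inputs are rescaled.
        return (xb - mean) / std, yb
    return _filter


norm = make_normalize(0.1307, 0.3081)
xb = np.full((4, 784), 0.1307, dtype=np.float32)  # a batch sitting exactly at the mean
yb = np.array([0, 1, 2, 3], dtype=np.int64)
xb_out, yb_out = norm(xb, yb)
```

A batch that sits exactly at the mean maps to zeros, which is a quick sanity check that the filter is actually being applied.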
2. Build the Model
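A 784→256→128→10 MLP is built from Linear + ReLU +
optional Dropout blocks. The code below composes the blocks directly with Sequential;
netcl.nn also provides build_sequential and example_mlp_config for building a model from a config.
The result is a Sequential Module subclass.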
from netcl.core.device import manager
from netcl.nn import build_sequential, Linear, ReLU, Dropout, Sequential

dev = manager.default("auto")

model = Sequential(
    Linear(dev.queue, 28 * 28, 256),
    ReLU(),
    Dropout(p=0.1),
    Linear(dev.queue, 256, 128),
    ReLU(),
    Dropout(p=0.1),
    Linear(dev.queue, 128, 10),
)
print(model)
model.parameters() returns every Parameter on the active device. We hand that
iterator to the optimizer below.
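It can help to know how many values that iterator covers. The following is plain arithmetic, assuming each Linear layer holds a weight matrix plus one bias per output unit (the tutorial does not state whether bias is enabled by default, so treat that as an assumption); linear_param_count is a name made up for this sketch.

```python
def linear_param_count(in_features: int, out_features: int) -> int:
    """Weights plus one bias per output unit."""
    return in_features * out_features + out_features


layer_sizes = [784, 256, 128, 10]
per_layer = [
    linear_param_count(n_in, n_out)
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]
total = sum(per_layer)
fp32_bytes = total * 4  # 4 bytes per fp32 value

print(per_layer)   # [200960, 32896, 1290]
print(total)       # 235146
print(fp32_bytes)  # 940584
```

Under that assumption the model has roughly 235 K parameters, or a little under 1 MB at fp32, and almost all of them sit in the first layer.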
3. Choose a Loss, an Optimizer, and a Schedule
The recommended defaults for a small classification task are:
- CrossEntropyLoss: log-softmax + negative log-likelihood fused into one op. Accepts raw logits and integer class labels.
- AdamW: adaptive moments with decoupled weight decay. Typically converges in fewer epochs than SGD on MNIST and is less sensitive to the learning rate.
- CosineAnnealingLR: smoothly decays the learning rate from the initial value to eta_min over T_max epochs.
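To make "decoupled weight decay" concrete, here is one AdamW update written out in NumPy. It follows the commonly published form of the algorithm and is only a sketch: netcl's kernel may order or fuse the operations differently, and adamw_step with its default betas and eps is an assumption of this example, not the library's signature.

```python
import numpy as np


def adamw_step(p, g, m, v, t, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=1e-4):
    """One AdamW update; returns the new (p, m, v). t is the 1-based step count."""
    b1, b2 = betas
    # Decoupled weight decay: shrink the weights directly, not via the gradient.
    p = p * (1.0 - lr * weight_decay)
    # Exponential moving averages of the gradient and its square.
    m = b1 * m + (1.0 - b1) * g
    v = b2 * v + (1.0 - b2) * g * g
    # Bias correction compensates for the zero-initialised averages.
    m_hat = m / (1.0 - b1 ** t)
    v_hat = v / (1.0 - b2 ** t)
    p = p - lr * m_hat / (np.sqrt(v_hat) + eps)
    return p, m, v


p = np.array([1.0, -2.0])
g = np.array([0.5, 0.5])
p1, m1, v1 = adamw_step(p, g, np.zeros(2), np.zeros(2), t=1)
```

On the first step the bias-corrected ratio is close to 1 for every coordinate, so each parameter moves by roughly lr regardless of the gradient's magnitude.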
from netcl.optim import AdamW, CosineAnnealingLR
import netcl.autograd as ag
opt = AdamW(model.parameters(), lr=1e-3, weight_decay=1e-4)
sched = CosineAnnealingLR(opt, T_max=10, eta_min=1e-5)
T_max=10 matches the 10-epoch budget below. If you train for longer, set T_max to the
total number of epochs.
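The shape of the schedule can be previewed without the library. This sketch uses the standard closed-form cosine annealing formula; whether netcl's CosineAnnealingLR uses this exact closed form or a recursive variant is an assumption, and cosine_lr is a name invented for the example.

```python
import math


def cosine_lr(epoch: int, base_lr: float = 1e-3, eta_min: float = 1e-5, t_max: int = 10) -> float:
    """Closed-form cosine annealing from base_lr (epoch 0) to eta_min (epoch t_max)."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * epoch / t_max))


# Learning rate at the start of each epoch, plus the final value at t_max.
schedule = [cosine_lr(e) for e in range(11)]
for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch:2d}: lr = {lr:.6f}")
```

The decay is slow at the start and end and fastest around the midpoint, where the rate is about halfway between the initial value and eta_min.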
4. Train with an Explicit Tape Loop
Why no Trainer? netcl does not ship a high-level Trainer class. The distributed module exposes data_parallel_step as a function-based training helper, but for single-device training the loop is written out explicitly. The explicit loop is short, and writing it out is useful while learning the framework.
import netcl.autograd as ag

for epoch in range(10):
    for batch in loader:
        x, y = batch["x"], batch["y"]
        with ag.Tape() as tape:
            logits = model(x)
            loss = ag.cross_entropy(logits, y)
        tape.backward(loss)
        opt.step()
        opt.zero_grad()
    sched.step()
    print(f"epoch {epoch}: loss = {float(loss.to_host()):.4f}")
Five things are happening on every step:
- with ag.Tape() as tape: installs the thread-local current tape so that every op inside the block is recorded.
- logits = model(x) runs the forward pass; each Linear / Dropout call goes through apply_op and is appended to tape.nodes.
- loss = ag.cross_entropy(logits, y) registers the loss op.
- tape.backward(loss) walks the graph in reverse topological order, calls each op's grad_fn, and accumulates the per-parameter gradient into param.grad.
- opt.step() mutates the parameters in place; opt.zero_grad() clears the gradient buffers.
In addition, sched.step() is called once per epoch, not per batch.
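The record-then-walk-backwards behaviour described above can be illustrated with a toy scalar tape in pure Python. This is not netcl's Tape: Value and ToyTape are hypothetical classes written for this sketch, and they only mirror the idea of a nodes list whose grad_fn entries are replayed in reverse.

```python
class Value:
    """A scalar that remembers its gradient."""
    def __init__(self, data: float):
        self.data = data
        self.grad = 0.0


class ToyTape:
    """Records (output, parents, grad_fn) entries in execution order."""
    def __init__(self):
        self.nodes = []

    def mul(self, a: Value, b: Value) -> Value:
        out = Value(a.data * b.data)
        # grad_fn maps the upstream gradient to one gradient per parent.
        self.nodes.append((out, (a, b), lambda g: (g * b.data, g * a.data)))
        return out

    def add(self, a: Value, b: Value) -> Value:
        out = Value(a.data + b.data)
        self.nodes.append((out, (a, b), lambda g: (g, g)))
        return out

    def backward(self, loss: Value) -> None:
        loss.grad = 1.0
        # Execution order is a valid topological order, so walk it in reverse.
        for out, parents, grad_fn in reversed(self.nodes):
            for parent, g in zip(parents, grad_fn(out.grad)):
                parent.grad += g


tape = ToyTape()
w, x, b = Value(3.0), Value(2.0), Value(1.0)
loss = tape.add(tape.mul(w, x), b)   # loss = w * x + b
tape.backward(loss)
print(w.grad, x.grad, b.grad)        # 2.0 3.0 1.0
```

Because gradients are accumulated with +=, a second backward pass without resetting grad to zero would add to the first, which is the reason the training loop clears gradients after every optimizer step.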
5. Optional: Mixed Precision with AMP
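If your device advertises cl_khr_fp16 (reported by many discrete GPUs since 2017, including NVIDIA, AMD RDNA, and Intel Arc parts; check your driver's extension list to confirm), wrap the forward in autocast and the step in GradScaler. On fp16-capable hardware this can give a meaningful speedup.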
import netcl.amp as amp

scaler = amp.GradScaler(init_scale=2.0**16, enabled=True)

for epoch in range(10):
    for batch in loader:
        x, y = batch["x"], batch["y"]
        with ag.Tape() as tape:
            with amp.autocast(enabled=True):
                logits = model(x)
                loss = ag.cross_entropy(logits, y)
            scaled = scaler.scale_loss(loss)
        tape.backward(scaled)
        scaler.step(opt, model.parameters())
        opt.zero_grad()
    sched.step()
    print(f"epoch {epoch}: loss = {float(loss.to_host()):.4f}")
Two new pieces:
- The autocast context manager flips a thread-local flag. The autograd ops read the flag and cast fp32 Tensors to fp16 where the device allows it.
- GradScaler.scale_loss multiplies the loss by a running scale (starting at 2**16) on-device. The backward is then computed on the scaled loss, and scaler.step divides the gradient by scale before calling opt.step(). If any gradient is inf/nan, the step is skipped and the scale is reduced; the next scaler.update() may grow it again.
On devices without cl_khr_fp16 the autocast context manager
silently degrades to fp32 (it re-probes supports_fp16(queue) on __enter__).
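Why the loss is scaled at all is easiest to see with numbers. The sketch below uses NumPy's float16 to show a small gradient underflowing to zero, surviving once it is multiplied by 2**16, and being recovered after unscaling. The helper unscale_and_check is hypothetical and only mimics the finite check described above; it is not GradScaler's code.

```python
import numpy as np


def unscale_and_check(scaled_grads, scale):
    """Divide gradients by the scale; report whether all of them are finite."""
    grads = [g.astype(np.float32) / scale for g in scaled_grads]
    finite = all(np.isfinite(g).all() for g in grads)
    return grads, finite


tiny = np.float32(1e-8)                  # a gradient value too small for fp16
scale = np.float32(2.0 ** 16)

unscaled_fp16 = np.float16(tiny)         # underflows to zero in half precision
scaled_fp16 = np.float16(tiny * scale)   # large enough to be representable

# Normal case: the unscaled gradient is recovered in fp32 and the step may run.
grads, ok = unscale_and_check([np.array([scaled_fp16])], scale)

# Overflow case: a non-finite gradient means the step should be skipped.
_, ok_overflow = unscale_and_check([np.array([np.inf], dtype=np.float16)], scale)
```

The same check explains the skip-and-reduce behaviour: when the scale is too large, the scaled gradients overflow to inf, the finite check fails, and a smaller scale is tried on a later step.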
6. Inference
For inference you want grad mode off so the Tape does not record the forward, and the output tensor is not allocated a gradient buffer.
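# x_test must already match the training inputs: flattened to (N, 784),
# scaled and normalized the same way, and in the form the model accepts.
x_test = x_test[:8]
with ag.no_grad():
    out = model(x_test).to_host()  # (8, 10) logits
pred = out.argmax(axis=1)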
no_grad (or its functional twin set_grad_enabled(False))
skips the recording pass on every op; the output Tensor is the same shape
and dtype as in training but its grad slot is never allocated.
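Once the logits are on the host they are an ordinary NumPy array, so turning them into predictions and an accuracy figure needs no library support. A minimal sketch, with made-up logits standing in for out and a hypothetical accuracy helper:

```python
import numpy as np


def accuracy(logits: np.ndarray, labels: np.ndarray) -> float:
    """Fraction of rows whose arg-max logit matches the integer label."""
    pred = logits.argmax(axis=1)
    return float((pred == labels).mean())


# Hypothetical logits for 4 samples over 3 classes, standing in for `out`.
logits = np.array([
    [2.0, 0.1, -1.0],
    [0.0, 3.0, 0.5],
    [1.0, 0.2, 0.9],
    [-0.5, 0.0, 0.4],
])
labels = np.array([0, 1, 2, 2])
acc = accuracy(logits, labels)
print(acc)  # 0.75
```

No softmax is needed for the prediction itself, because softmax preserves the ordering of the logits.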
7. Save and Load the Model
save_model writes a single self-contained .netcl file (a NumPy .npz) holding
every parameter plus a metadata entry describing the model architecture.
from netcl.io import save_model, load_model
save_model(model, "mnist_mlp.netcl")
# In a fresh process:
model = load_model("mnist_mlp.netcl")
The file format is documented in detail on the io page; the short version is that
each parameter becomes a {layer_index}:{state_dict_key} entry, and the __netcl_meta__
entry carries the model type, layer config, and version. load_model is
backwards-compatible with older two-file layouts (.json + .npz) and silently keeps the
freshly-built layer's initialization for any missing key.
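The key-naming idea can be tried with plain NumPy. This sketch is an illustration only: the array shapes, the metadata fields, and the choice to store the metadata as a JSON string are assumptions made for the example, and the authoritative layout is the one documented on the io page.

```python
import io
import json

import numpy as np

# Hypothetical contents: two entries for layer 0 plus a metadata entry.
params = {
    "0:weight": np.zeros((4, 3), dtype=np.float32),
    "0:bias": np.zeros(3, dtype=np.float32),
}
meta = {"model_type": "Sequential", "version": 1}  # illustrative fields only

# Write the archive to memory instead of disk to keep the example side-effect free.
buf = io.BytesIO()
np.savez(buf, __netcl_meta__=np.array(json.dumps(meta)), **params)

buf.seek(0)
with np.load(buf) as archive:
    keys = sorted(archive.files)
    loaded_meta = json.loads(archive["__netcl_meta__"].item())
    weight_shape = archive["0:weight"].shape

print(keys)
```

Listing archive.files on a real checkpoint in the same way is a quick method for checking which keys are present before calling load_model.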
Troubleshooting
The narrative version:
- NaN loss from step 1. The two usual culprits are (a) the learning rate is too high for the AdamW defaults (try lr=3e-4 or lr=1e-4), and (b) mixed precision on a device that does not advertise cl_khr_fp16. Disable AMP by setting amp.autocast(enabled=False) and GradScaler(enabled=False); if the NaNs go away, your device does not support fp16 and the autocast probe was bypassed.
- Very slow first step, then normal speed. The first call to a function decorated with @jit_compile runs the JIT pass: it traces the op chain, generates an OpenCL kernel pair, and waits for the device to build it. This can take 200–800 ms the first time per unique shape; subsequent steps reuse the cached program. The JIT Compiler page explains the warm-up budget.
- Crash on Windows with a clBuildProgram failure or a black-screen driver reset. The AMD/NVIDIA OpenCL driver is older than the ICD that netcl was tested against. Update the vendor driver first; if the issue persists, set NETCL_KERNEL_STRATEGY=portable to force the conservative kernel variants and try again. On Intel iGPUs, the OpenCL runtime is bundled with the GPU driver; install the latest one from the Intel Arc & Iris Xe Graphics driver page.
- RuntimeError: detect_anomaly: Gradient w.r.t. parent N of op 'X' contains NaN or Inf. This is the detect_anomaly diagnostic firing. It compares the analytical gradient (the kernel chain) against a finite-difference check. The error message includes the creation_trace of the offending Node, so jump to the frame it points at. Most often the cause is a divide-by-zero in a custom op, or an eps that is too small in a normalization.
- Loss is stuck near log(10) ≈ 2.30. The model is not learning at all: log(10) is the cross-entropy of a uniform guess over 10 classes. Check that the DataLoader has shuffle=True and that transforms=[normalize(...)] is being applied (a typo in the filter name silently falls back to the identity).
- The save_model file is much larger than expected. This MLP has only ~235 K parameters (~940 KB at fp32); if your file is tens of MB, you are probably saving the full Optimizer state as well. Use save_params from netcl.io.checkpoint for a raw NPZ of just the weights.
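The log(10) figure in the list above can be reproduced directly: when every class receives the same logit, the predicted distribution is uniform and the cross-entropy equals the natural log of the number of classes. This NumPy sketch uses a hypothetical cross_entropy helper written for the example, not netcl's fused op.

```python
import numpy as np


def cross_entropy(logits: np.ndarray, labels: np.ndarray) -> float:
    """Mean negative log-likelihood from raw logits and integer labels."""
    # Subtract the row maximum so that exp() cannot overflow.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())


uniform_logits = np.zeros((128, 10))  # every class equally likely
labels = np.random.default_rng(0).integers(0, 10, size=128)
baseline = cross_entropy(uniform_logits, labels)
print(round(baseline, 4))  # 2.3026
```

A training loss that hovers at this value means the model's outputs carry no information about the labels, which points at the data pipeline or the optimizer wiring rather than at model capacity.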
See also
- Quickstart — install netcl and run a single kernel launch.
- Tensor — the on-device value type that the DataLoader and the model both produce.
- Tensor Backend — the OpenCL / CPU backends, the BufferPool, and the asynchronous H2D copy.
- Understanding Autograd — the Tape and Node internals, the operator overloading story, and how to write a custom op.
- Data-Parallel Training — extend the same MLP to multiple devices with the distributed module.
- Writing a Custom OpenCL Kernel — drop down to the OpenCL layer for an op netcl does not ship.
- AMP — full reference for autocast, GradScaler, and the fp16 capability probe.
- io: the save_model / load_model format and the save_checkpoint / load_checkpoint training-state helpers.