Quickstart

This guide takes you from a clean pip install to a working netcl training loop in about ten minutes. It assumes basic familiarity with NumPy and Python.

Prerequisites

Python 3.10 or newer. netcl uses structural pattern matching (the match statement, introduced in Python 3.10) in several places; 3.9 and earlier will not work.
A working OpenCL driver for your hardware. On Linux this is typically the vendor ICD (intel-opencl-icd, mesa-opencl-icd, or amdgpu-pro). On Windows it ships with the GPU driver. On macOS the system OpenCL framework is built in. A CPU OpenCL runtime is enough if you have no GPU.
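
Part of this can be checked before installing anything. The sketch below uses only the standard library; the helper names python_at_least and has_module are illustrative and not part of netcl. It confirms that the pyopencl binding is importable, which does not by itself prove that a working driver is present.

```python
import importlib.util
import sys


def python_at_least(major, minor):
    """Return True if the running interpreter is at least major.minor."""
    return sys.version_info[:2] >= (major, minor)


def has_module(name):
    """Return True if the top-level module `name` can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    print("python:", ".".join(map(str, sys.version_info[:3])))
    # Without pyopencl, netcl uses its CPU backend instead of OpenCL.
    print("pyopencl importable:", has_module("pyopencl"))
```

Call python_at_least with the minimum version listed above to fail early in a setup script.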

Install

python3 -m venv .venv
source .venv/bin/activate        # Windows: .venv\Scripts\activate
pip install netcl

Smoke test

import numpy as np
from netcl.core.device import manager
from netcl.core.tensor import Tensor

dev = manager.default("auto")
a   = Tensor.from_host(dev.queue, np.eye(4, dtype=np.float32))
b   = Tensor.from_host(dev.queue, np.eye(4, dtype=np.float32))
out = a @ b
print("shape:", out.shape, "backend:", out.backend)
assert np.allclose(out.to_host(), np.eye(4))
print("OK")

Expected output: shape: (4, 4) backend: cl, followed by OK on the next line. The backend reads cpu instead of cl if pyopencl is not installed. If you see an OpenCL ICD error on Windows, see Common pitfalls.

Your first tensor

Tensor.from_host copies a NumPy array onto the device. to_host copies it back. Between those two calls, the tensor lives on the device and all ops dispatch to OpenCL kernels (or NumPy on the CPU backend):

import numpy as np
from netcl.core.device import manager
from netcl.core.tensor import Tensor

q = manager.default("auto").queue
a = Tensor.from_host(q, np.arange(12, dtype=np.float32).reshape(3, 4))
b = Tensor.from_host(q, np.ones((3, 4), dtype=np.float32))

print(a.shape, a.dtype, a.backend)   # (3, 4) float32 cl
print((a + b).to_host())             # back to NumPy for inspection

The backend field reports "cl" when the OpenCL path served the tensor and "cpu" when NumPy did. The rest of the API is identical either way.

A complete training loop

The snippet below trains a small MLP on a synthetic classification task. It shows the full netcl pattern: model construction, the autograd Tape, optimizer step, and scalar extraction.

import numpy as np
import netcl.autograd as ag
from netcl.core.device import manager
from netcl.core.tensor import Tensor
from netcl.nn import Linear, ReLU, Dropout, Sequential, cross_entropy
from netcl.optim import Adam, CosineAnnealingLR, clip_grad_norm

dev   = manager.default("auto")
q     = dev.queue

# Model
model = Sequential(
    Linear(q, 16, 64), ReLU(), Dropout(p=0.1),
    Linear(q, 64, 64), ReLU(),
    Linear(q, 64, 4),
)
opt   = Adam(model.parameters(), lr=3e-4)
sched = CosineAnnealingLR(opt, T_max=10)

# Synthetic data — 512 samples, 4 classes
rng   = np.random.default_rng(0)
x_all = rng.standard_normal((512, 16)).astype(np.float32)
y_all = rng.integers(0, 4, size=512).astype(np.int32)

for epoch in range(10):
    # Mini-batch loop
    for i in range(0, 512, 64):
        x = Tensor.from_host(q, x_all[i:i+64])
        y = Tensor.from_host(q, y_all[i:i+64])

        with ag.Tape() as tape:
            logits = model(x)
            loss   = cross_entropy(logits, y)

        tape.backward(loss)
        clip_grad_norm(model.parameters(), max_norm=1.0)
        opt.step()
        opt.zero_grad()

    sched.step()
    print(f"epoch {epoch+1:2d}  loss = {float(loss.to_host()):.4f}")

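The printed value is the loss on the last mini-batch of each epoch. It should start around 1.4 and decrease as training proceeds. Because the labels are drawn at random, independently of the inputs, the network can only memorize the 512 samples, so how far the loss falls in 10 epochs depends on the seed and backend.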
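
The starting value of about 1.4 is the chance level for four classes: a classifier that assigns equal probability to every class has a cross-entropy of ln 4, roughly 1.386. The pure-Python sketch below verifies that number. The function softmax_cross_entropy is written here for illustration and is not netcl's cross_entropy.

```python
import math


def softmax_cross_entropy(logits, target):
    """Cross-entropy of one sample from raw logits and an integer class index."""
    # Subtract the maximum logit for numerical stability (log-sum-exp trick).
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


# An untrained model is roughly uniform over the 4 classes: all logits equal.
chance_loss = softmax_cross_entropy([0.0, 0.0, 0.0, 0.0], target=2)
print(f"{chance_loss:.4f}")  # 1.3863
```

A first-epoch loss far above this value usually points to a problem with the labels or the learning rate rather than to slow learning.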

Saving and loading

from netcl.io import save_model, load_model

save_model(model, "my_model.netcl")

# Later — rebuild the model architecture, then load weights
model2 = Sequential(
    Linear(q, 16, 64), ReLU(), Dropout(p=0.1),
    Linear(q, 64, 64), ReLU(),
    Linear(q, 64, 4),
)
load_model(model2, "my_model.netcl", queue=q)

For training checkpoints that include optimizer state and epoch number, use netcl.io.checkpoint.save_checkpoint / load_checkpoint — see Checkpointing.

What just happened

Device discovery. manager.default("auto") enumerates the OpenCL platforms on the host and returns the highest-priority GPU. If pyopencl is not installed it falls back to the CPU backend transparently.
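
The discovery step can be approximated directly with pyopencl. This is a sketch of the idea rather than netcl's implementation: list_devices and pick_device are hypothetical helpers, and the rule of preferring the first GPU is an assumption about what "highest-priority" means.

```python
def list_devices():
    """Return (platform name, device name, kind) tuples, or [] without OpenCL."""
    try:
        import pyopencl as cl
    except ImportError:
        return []  # binding not installed: the caller should use a CPU path
    found = []
    try:
        for platform in cl.get_platforms():
            for device in platform.get_devices():
                # device.type is a bitfield; test the GPU bit.
                is_gpu = bool(device.type & cl.device_type.GPU)
                found.append((platform.name, device.name, "gpu" if is_gpu else "other"))
    except cl.Error:
        return []  # binding present but no usable driver or ICD
    return found


def pick_device(devices):
    """Prefer the first GPU, otherwise the first device; None if the list is empty."""
    for entry in devices:
        if entry[2] == "gpu":
            return entry
    return devices[0] if devices else None


print(pick_device(list_devices()))
```

On a machine without pyopencl the last line prints None, which corresponds to the case where netcl falls back to its CPU backend.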

Tape recording. with ag.Tape() as tape: installs a thread-local tracer. Every op that fires inside the block (matmul, relu, cross_entropy, etc.) registers a node with the tape, capturing the inputs and any saved tensors needed for the backward pass.

Backward pass. tape.backward(loss) walks the node graph in reverse topological order, dispatches the gradient kernels, and accumulates .grad on each leaf parameter.
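
To make recording and replay concrete, here is a toy scalar tape in pure Python. It is a teaching sketch and not netcl's implementation: the Var and ToyTape classes are invented for this example, only add and multiply are supported, and the active tape is kept in a class attribute rather than in thread-local storage.

```python
class ToyTape:
    """Records one backward closure per op, in execution order."""
    active = None  # the tape currently recording, if any

    def __enter__(self):
        self.nodes = []
        ToyTape.active = self
        return self

    def __exit__(self, *exc):
        ToyTape.active = None
        return False

    def backward(self, loss):
        loss.grad = 1.0
        # Execution order is a valid topological order, so replay it reversed.
        for node in reversed(self.nodes):
            node()


class Var:
    def __init__(self, value):
        self.value = value
        self.grad = 0.0

    def _record(self, out, backward_fn):
        # Ops outside a `with ToyTape()` block are not recorded.
        if ToyTape.active is not None:
            ToyTape.active.nodes.append(backward_fn)
        return out

    def __add__(self, other):
        out = Var(self.value + other.value)

        def backward_fn():
            self.grad += out.grad
            other.grad += out.grad

        return self._record(out, backward_fn)

    def __mul__(self, other):
        out = Var(self.value * other.value)

        def backward_fn():
            # The captured inputs play the role of saved tensors.
            self.grad += other.value * out.grad
            other.grad += self.value * out.grad

        return self._record(out, backward_fn)


w, x, b = Var(3.0), Var(2.0), Var(1.0)
with ToyTape() as tape:
    loss = w * x + b
tape.backward(loss)
print(w.grad, x.grad, b.grad)  # 2.0 3.0 1.0
```

Gradients are added with += rather than assigned, which is why they accumulate across uses of the same variable and why they must be cleared between iterations.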

Optimizer step. Adam.step() reads each parameter's .grad and applies the Adam update. zero_grad() clears gradients before the next iteration.
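
For reference, the update that an Adam step applies can be written out for plain Python floats. This follows the standard Adam formulation with bias correction. The hyperparameter defaults below (beta1=0.9, beta2=0.999, eps=1e-8) are the commonly published ones and are assumed here, not read from netcl; ToyAdam is a hypothetical name.

```python
import math


class ToyAdam:
    def __init__(self, params, lr=3e-4, beta1=0.9, beta2=0.999, eps=1e-8):
        self.params = params              # list of floats, updated in place
        self.lr, self.beta1, self.beta2, self.eps = lr, beta1, beta2, eps
        self.m = [0.0] * len(params)      # first-moment (mean) estimates
        self.v = [0.0] * len(params)      # second-moment estimates
        self.t = 0                        # step counter for bias correction

    def step(self, grads):
        self.t += 1
        for i, g in enumerate(grads):
            self.m[i] = self.beta1 * self.m[i] + (1 - self.beta1) * g
            self.v[i] = self.beta2 * self.v[i] + (1 - self.beta2) * g * g
            # Bias correction compensates for the zero initialisation of m and v.
            m_hat = self.m[i] / (1 - self.beta1 ** self.t)
            v_hat = self.v[i] / (1 - self.beta2 ** self.t)
            self.params[i] -= self.lr * m_hat / (math.sqrt(v_hat) + self.eps)


params = [1.0, -2.0]
opt = ToyAdam(params, lr=0.1)
opt.step([0.5, -4.0])
print([round(p, 6) for p in params])  # [0.9, -1.9]
```

On the first step each parameter moves by almost exactly lr in the direction opposite its gradient, regardless of the gradient's magnitude.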

JIT Compiler. The first forward pass compiles and caches OpenCL kernels for each op. Subsequent passes reuse the cache, which is why there is a noticeable latency spike on the first step.
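
The compile-once behaviour is ordinary memoization keyed on whatever distinguishes one kernel from another. The sketch below stands in for it with a dictionary. KernelCache, the (op, dtype) key, and the fake_build callback are assumptions for illustration, not netcl internals.

```python
class KernelCache:
    """Builds a kernel on first request and reuses it afterwards."""

    def __init__(self, build):
        self._build = build      # expensive function: (op, dtype) -> kernel
        self._cache = {}
        self.builds = 0          # how many times the expensive path ran

    def get(self, op, dtype):
        key = (op, dtype)
        if key not in self._cache:
            self.builds += 1
            self._cache[key] = self._build(op, dtype)
        return self._cache[key]


def fake_build(op, dtype):
    # Stand-in for an OpenCL program build; returns a label instead of a kernel.
    return f"{op}_{dtype}_kernel"


cache = KernelCache(fake_build)
for _ in range(3):
    cache.get("matmul", "float32")   # built once, then served from the cache
cache.get("relu", "float32")         # a new op triggers one more build
print(cache.builds)  # 2
```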

Common pitfalls

Windows ICD selection. If you have both an Intel and an NVIDIA GPU, DeviceManager picks whichever ICD is registered first. Call manager.discover() to list all detected (platform, device) pairs, then pass device="gpu" or device="cpu" to manager.default().

macOS OpenCL deprecation. Apple deprecated OpenCL in macOS 10.14, but it still works through macOS 14. On macOS 15 some operations may fail at runtime.

Fork after context creation. Forking after manager.default() has been called is unsafe. The DataLoader avoids this by forking workers before any OpenCL context exists. If you fork manually, do it before the first Tensor.from_host() call.
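
One way to catch this mistake early is to remember which process created a resource and refuse to use it from any other. The guard below is a hypothetical helper, not something netcl provides; it relies only on os.getpid(), and the string resource is a placeholder for a real context or queue.

```python
import os


class ProcessBoundHandle:
    """Wraps a resource and rejects access from a process that did not create it."""

    def __init__(self, resource):
        self._resource = resource
        self._owner_pid = os.getpid()   # recorded at creation time

    def get(self):
        if os.getpid() != self._owner_pid:
            raise RuntimeError(
                "resource was created in another process; "
                "create it after forking instead"
            )
        return self._resource


handle = ProcessBoundHandle(resource="opencl-context-placeholder")
print(handle.get())  # same process, so access is allowed
```

A forked child inherits the wrapper with the parent's PID recorded, so get() raises there instead of handing out a handle that is unsafe to share.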

fp16 device requirement. autocast from netcl.amp falls back silently to fp32 if the device does not advertise cl_khr_fp16.
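
Because the fallback is silent, it can help to check for the extension yourself. OpenCL reports extensions as one space-separated string, which pyopencl exposes as device.extensions. The helper name supports_fp16 is made up for this sketch, and the extension strings are example inputs.

```python
def supports_fp16(extensions):
    """Return True if a space-separated OpenCL extension string lists cl_khr_fp16."""
    # Split into whole names so a longer name containing the text cannot match.
    return "cl_khr_fp16" in extensions.split()


# With pyopencl installed, pass `device.extensions` for the device in use.
print(supports_fp16("cl_khr_fp64 cl_khr_fp16 cl_khr_icd"))   # True
print(supports_fp16("cl_khr_fp64 cl_khr_icd"))               # False
```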

First-step latency. The very first op on a fresh process can take 200 ms to 2 s while the JIT Compiler builds and caches the kernel. This is normal. Run at least 5 warm-up iterations before benchmarking.
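
A benchmarking helper that discards warm-up iterations might look like the following. It is a generic sketch built on time.perf_counter, and benchmark is a hypothetical name. OpenCL work is typically enqueued asynchronously, so for device code the timed function should block until the result is ready, for example by ending with to_host().

```python
import time


def benchmark(fn, warmup=5, iters=20):
    """Return the mean seconds per call of fn(), excluding warm-up calls."""
    for _ in range(warmup):
        fn()                      # absorbs one-time costs such as kernel builds
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters


calls = []
mean_s = benchmark(lambda: calls.append(sum(range(1000))), warmup=5, iters=20)
print(f"{mean_s * 1e6:.1f} us per call over 20 timed calls ({len(calls)} total)")
```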

Silent CPU fallback. If import pyopencl raises ImportError, netcl switches to the CPU backend without error. Check dev.backend == "cpu" to confirm which path you are on.
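
To turn the silent fallback into a hard failure, for example in CI or before a long run, a small guard is enough. It uses only the backend attribute described above. The function require_opencl is a hypothetical helper, and the stand-in objects below replace real netcl devices so the snippet runs without the library.

```python
from types import SimpleNamespace


def require_opencl(dev):
    """Raise if `dev` reports the CPU backend; otherwise return it unchanged."""
    if dev.backend == "cpu":
        raise RuntimeError("netcl is on the CPU backend; is pyopencl installed?")
    return dev


# Stand-ins for the object returned by manager.default("auto").
gpu_like = SimpleNamespace(backend="cl")
cpu_like = SimpleNamespace(backend="cpu")

require_opencl(gpu_like)          # passes silently
try:
    require_opencl(cpu_like)
except RuntimeError as err:
    print("refused:", err)
```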

Where to go next

MNIST with MLP — real data, evaluation loop, and checkpointing.
Data-Parallel Training — multi-GPU on a single host.
Architecture Overview — the full internal module map.
Tensor API — every constructor and method on Tensor.
Autograd and Tape — how the dynamic graph and backward pass work in detail.