netcl

netcl is a Python deep-learning framework built on PyOpenCL. It runs on GPUs, CPUs, and accelerators from any vendor that ships an OpenCL 1.2/2.0 driver: Intel, AMD, NVIDIA, Apple Silicon, ARM Mali, and most embedded SoCs. There is no CUDA dependency anywhere in the stack.

On top of the device layer, netcl adds a tape-based autograd engine (Tape), a runtime JIT compiler that fuses chains of elementwise ops into single OpenCL kernels, native AMP (automatic mixed precision) with automatic cl_khr_fp16 detection, and a host-mediated distributed stack for single-node multi-device training.

Hello, netcl

The block below covers device discovery, tensor creation, a model forward pass, and one full training step:

import numpy as np
import netcl.autograd as ag
from netcl.core.device import manager
from netcl.core.tensor import Tensor
from netcl.nn import Linear, ReLU, Sequential, cross_entropy
from netcl.optim import Adam

# Pick a device (auto-detects GPU, falls back to CPU)
dev = manager.default("auto")
q   = dev.queue
print(dev.device_name, dev.backend)   # e.g. "Intel Arc A770"  "cl"

# Build a model
model = Sequential(
    Linear(q, 784, 256), ReLU(),
    Linear(q, 256, 128), ReLU(),
    Linear(q, 128, 10),
)
opt = Adam(model.parameters(), lr=1e-3)

# Synthetic batch — 32 images, 10 classes
x = Tensor.from_host(q, np.random.randn(32, 784).astype(np.float32))
y = Tensor.from_host(q, np.zeros(32, dtype=np.int32))

# One training step
with ag.Tape() as tape:
    logits = model(x)
    loss   = cross_entropy(logits, y)
tape.backward(loss)
opt.step()
opt.zero_grad()

print("loss:", float(loss.to_host()))

For a guided walkthrough see the Quickstart. For a real model with data loading and checkpointing, jump to MNIST with MLP.
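
The training step above records the forward pass inside ag.Tape() and then calls tape.backward(loss). As a rough illustration of the general technique (tape-based reverse-mode autodiff), here is a minimal NumPy sketch. MiniTape, Var, mul, and add are hypothetical names made up for this example; they are not netcl's Tape or Node classes, and netcl's real implementation dispatches OpenCL kernels rather than NumPy calls.

```python
import numpy as np


class Var:
    """Hypothetical value wrapper (stands in for a tensor)."""
    def __init__(self, data):
        self.data = np.asarray(data, dtype=np.float64)


class MiniTape:
    """Hypothetical tape: records ops in execution order, replays them in reverse."""
    def __init__(self):
        self.entries = []  # (output, inputs, backward_fn)

    def record(self, out, inputs, backward_fn):
        self.entries.append((out, inputs, backward_fn))

    def backward(self, loss):
        # Seed d(loss)/d(loss) = 1, then walk the recorded ops backwards.
        grads = {id(loss): np.ones_like(loss.data)}
        for out, inputs, backward_fn in reversed(self.entries):
            g_out = grads.get(id(out))
            if g_out is None:
                continue  # this op did not contribute to the loss
            for inp, g in zip(inputs, backward_fn(g_out)):
                # Accumulate, because a value may feed several ops.
                grads[id(inp)] = grads.get(id(inp), 0) + g
        return grads


def mul(tape, a, b):
    out = Var(a.data * b.data)
    tape.record(out, (a, b), lambda g: (g * b.data, g * a.data))
    return out


def add(tape, a, b):
    out = Var(a.data + b.data)
    tape.record(out, (a, b), lambda g: (g, g))
    return out


# f(x) = x * x + x, so df/dx = 2x + 1
tape = MiniTape()
x = Var(3.0)
f = add(tape, mul(tape, x, x), x)
grads = tape.backward(f)
print(float(grads[id(x)]))
```

Because the graph is just the list of recorded entries, it is rebuilt on every forward pass, which is what "dynamic graph construction" usually refers to.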

Features

Device and tensor layer

Tensor backed by cl.Buffer with an integrated BufferPool, fp16/fp32/fp64 dtypes, automatic H2D/D2H transfers, and zero-copy where the driver supports it.
OpenCLBackend with async command queues and pinned host memory; it is SIGINT-safe and fork-safe. CPUBackend uses NumPy and exposes an identical op surface, which makes it a good fit for CI and for environments without a GPU.
DeviceManager that auto-discovers every GPU, CPU, and accelerator ICD on the system and exposes them as named DeviceHandle objects.
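
To give a feel for what device discovery involves at the PyOpenCL level, the sketch below enumerates every platform and device and checks for the cl_khr_fp16 extension. It uses only plain PyOpenCL calls (get_platforms, get_devices, and the name, type, and extensions device attributes). DeviceInfo, enumerate_opencl_devices, and pick_default are hypothetical names for this example, and the GPU-first preference order is an assumption; none of this is netcl's DeviceManager or DeviceHandle.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceInfo:
    """Hypothetical record for one OpenCL device (not netcl's DeviceHandle)."""
    name: str
    kind: str              # "GPU", "ACCELERATOR", "CPU", or "OTHER"
    extensions: frozenset  # extension names reported by the driver

    @property
    def supports_fp16(self):
        return "cl_khr_fp16" in self.extensions


def enumerate_opencl_devices():
    """List every device on every installed OpenCL platform (ICD)."""
    import pyopencl as cl  # imported lazily so the rest runs without a driver

    kinds = (
        (cl.device_type.GPU, "GPU"),
        (cl.device_type.ACCELERATOR, "ACCELERATOR"),
        (cl.device_type.CPU, "CPU"),
    )
    found = []
    for platform in cl.get_platforms():
        for dev in platform.get_devices():
            # dev.type is a bitfield, so test each flag in turn.
            kind = next((label for flag, label in kinds if dev.type & flag), "OTHER")
            found.append(DeviceInfo(dev.name, kind, frozenset(dev.extensions.split())))
    return found


def pick_default(devices, preference=("GPU", "ACCELERATOR", "CPU")):
    """Return the first device matching the preference order (assumed: GPU first)."""
    for kind in preference:
        for dev in devices:
            if dev.kind == kind:
                return dev
    raise RuntimeError("no OpenCL device found")
```

On a machine with PyOpenCL and at least one OpenCL driver installed, pick_default(enumerate_opencl_devices()) returns a GPU when one is present and a CPU device otherwise.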

Compute

Ops: matmul, Conv2d, elementwise, pooling, BatchNorm — implemented as OpenCL kernels with per-device tuning.
Fused ops: linear+relu, conv+relu, batchnorm+relu, add+relu, matmul+bias+relu, and more.
JIT Compiler that detects fusable chains in the forward trace and emits single-kernel replacements at runtime.
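
The idea behind elementwise fusion can be sketched in a few lines: find runs of consecutive elementwise ops in a trace, then compose their expressions into one OpenCL C kernel so intermediate results never touch global memory. The op table, function names, and the flat list-of-strings trace format below are all assumptions for illustration; netcl's actual JIT pipeline and trace representation are not shown in this document.

```python
# Hypothetical op table: op name -> OpenCL C expression over a placeholder {v}.
ELEMENTWISE = {
    "relu": "fmax({v}, 0.0f)",
    "neg": "-({v})",
    "exp": "exp({v})",
    "double": "({v} * 2.0f)",
}


def find_fusable_chains(trace, min_len=2):
    """Split a linear op trace into maximal runs of elementwise ops."""
    chains, run = [], []
    for op in trace:
        if op in ELEMENTWISE:
            run.append(op)
            continue
        if len(run) >= min_len:
            chains.append(run)
        run = []  # a non-elementwise op (e.g. matmul) ends the current run
    if len(run) >= min_len:
        chains.append(run)
    return chains


def emit_fused_kernel(name, chain):
    """Compose the chain into a single expression and wrap it in one kernel."""
    expr = "x[i]"
    for op in chain:
        expr = ELEMENTWISE[op].format(v=expr)
    return (
        f"__kernel void {name}(__global const float *x,\n"
        f"                     __global float *y, const int n) {{\n"
        f"    int i = get_global_id(0);\n"
        f"    if (i < n) y[i] = {expr};\n"
        f"}}\n"
    )


trace = ["matmul", "neg", "relu", "conv2d", "exp"]
for idx, chain in enumerate(find_fusable_chains(trace)):
    print(emit_fused_kernel(f"fused_{idx}", chain))
```

In this trace only neg followed by relu forms a chain; the lone exp is left alone because fusing a single op saves nothing. The generated source string could then be handed to pyopencl.Program(ctx, src).build() on a machine with an OpenCL driver.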

Training

Autograd via Tape and Node with dynamic graph construction and a backward-pass kernel dispatcher.
nn: Linear, Conv2d, BatchNorm2d, ResNet18, Sequential, factory functions, init routines, pooling, dropout.
optim: SGD, Momentum, Adam, AdamW, RMSProp, CosineAnnealingLR, WarmupCosine, ReduceLROnPlateau, gradient clipping.
DataLoader with multiprocess prefetching and a shared-memory ring buffer (Linux fork and Windows spawn workers).
Distributed host-mediated collectives (all_reduce, broadcast, scatter, gather) and a data_parallel_step loop for single-node multi-GPU training.
AMP: autocast and AMPGradScaler with automatic cl_khr_fp16 detection and a fallback to fp32 when the extension is unavailable.
IO: save_model / load_model for weights and a portable checkpoint format with optimizer state.
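
"Host-mediated" collectives mean that devices exchange data through host memory rather than over a direct device-to-device link. The NumPy sketch below shows the shape of a host-mediated all_reduce that averages gradients: copy each device's buffer to the host, reduce there, and send the result back to every device. The function name and the use of plain arrays as stand-ins for device buffers are assumptions for this example; it does not reproduce netcl's all_reduce signature.

```python
import numpy as np


def host_all_reduce_mean(per_device_grads):
    """Average same-shaped gradient arrays via the host (illustrative sketch).

    Each list entry stands in for one device's gradient buffer.
    """
    first = per_device_grads[0]
    # Accumulate in float64 on the host to limit rounding error.
    host_sum = np.zeros_like(first, dtype=np.float64)
    for grad in per_device_grads:
        host_sum += grad  # in a real stack: D2H copy, then accumulate
    mean = (host_sum / len(per_device_grads)).astype(first.dtype)
    # In a real stack: one H2D copy per device. Here: one independent array each.
    return [mean.copy() for _ in per_device_grads]


grads = [
    np.full(4, 1.0, dtype=np.float32),  # "device 0"
    np.full(4, 3.0, dtype=np.float32),  # "device 1"
]
synced = host_all_reduce_mean(grads)
print(synced[0])
```

After the call every device holds the same averaged gradient, so each replica can apply an identical optimizer step. The trade-off is that every step pays for a round trip through host memory, which is why this approach targets single-node setups.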

When to use netcl

netcl is the right tool if you are:

Running on Apple Silicon, AMD, Intel, ARM Mali, or any other GPU stack that is not CUDA-only.
Writing research code where you want to write a custom OpenCL kernel, fuse ops by hand, or instrument the JIT pipeline.
Experimenting with fp16 / AMP on hardware that lacks cuBLAS/cuDNN.
Training a multi-device single-node job on a workstation with two or more GPUs.
Maintaining a portable pipeline that must build on macOS, Linux, and Windows from one source tree.
Teaching a deep-learning systems class where students should be able to read the kernel, the Tape, and the BufferPool in the source.

netcl is probably not the right tool if you need massive transformer training with NVLink/RDMA, full cuDNN throughput on NVIDIA-only hardware, multi-node distributed training, or a beginner-friendly high-level API. For those cases, use PyTorch, JAX, or TensorFlow.

Where to next

Quickstart — install, smoke test, and a 30-line training loop.
MNIST with MLP — first complete model with real data and checkpointing.
Architecture Overview — internal module map at a glance.
Autograd and Tape — how the dynamic graph works.
JIT Compiler — runtime kernel fusion explained.
Data-Parallel Training — multi-GPU on a single host.