Architecture — Module Overview
netcl is organized as a stack of cooperating subpackages. Every top-level subpackage has a single, clearly scoped responsibility, and dependencies always point from high to low: high-level code (models, trainers, data-parallel loops) imports from the core API; the core never imports from the trainer. This keeps the build graph acyclic and makes it possible to swap out a subpackage (for example, a different Tensor backend) without touching the layers above it.
The eight top-level subpackages and their import direction are summarized below.
Caption — solid arrows are direct imports. The core layer is at the bottom because it is the only layer that talks to the OpenCL driver through PyOpenCL; the trainer layer is at the top because it is the only layer that end users are expected to call directly in long-running jobs.
Subpackages
core — Foundation
The foundation. Owns device discovery, the Tensor type, and all memory management. Nothing in netcl can run without the core layer, and nothing in core imports from higher layers.
- core/device.py: DeviceManager plus a thread-local device context manager. DeviceManager enumerates OpenCL platforms and devices and caches a DeviceHandle per request.
- core/tensor.py: the Tensor class itself: shape, dtype, device buffer, and an optional BufferPool handle.
- core/memory.py: BufferPool, PinnedBufferPool, PersistentBufferPool, BufferHandle, and PoolStats for hit-rate tracking.
- core/backend/opencl.py: the Tensor Backend; async H2D/D2H copies, SIGINT/atexit teardown contract, fork-safety snapshot.
- core/backend/cpu.py: NumPy fallback used when pyopencl is missing or when the user explicitly asks for backend="cpu".
- core/kernels/primitives.py: KernelSpec, WorkGroupTuner, and the shared PRIMITIVE_PREAMBLE injected into every JIT Compiler source string.
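To make the hit-rate tracking idea concrete, here is a minimal sketch of a size-bucketed pool. It is not netcl's BufferPool: ToyBufferPool and ToyPoolStats are hypothetical names, and plain bytearrays stand in for device buffers.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ToyPoolStats:
    """Hypothetical counters: how often a request was served from the free list."""
    hits: int = 0
    misses: int = 0

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


class ToyBufferPool:
    """Hypothetical pool; bytearrays play the role of device buffers."""

    def __init__(self) -> None:
        self._free = defaultdict(list)  # size in bytes -> idle buffers of that size
        self.stats = ToyPoolStats()

    def acquire(self, nbytes: int) -> bytearray:
        bucket = self._free[nbytes]
        if bucket:
            self.stats.hits += 1
            return bucket.pop()          # reuse an idle buffer
        self.stats.misses += 1
        return bytearray(nbytes)         # nothing idle: allocate a fresh one

    def release(self, buf: bytearray) -> None:
        self._free[len(buf)].append(buf)


pool = ToyBufferPool()
a = pool.acquire(1024)   # miss: nothing cached yet
pool.release(a)
b = pool.acquire(1024)   # hit: the released buffer is reused
c = pool.acquire(2048)   # miss: no idle buffer of this size
```

A real pool would likely round sizes up to buckets and cap the amount of idle memory; the sketch only shows where hits and misses are counted.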
ops — Element-wise and linalg primitives
Each op is a single function with Tensor inputs and outputs. The ops layer is forward-only: it knows nothing about gradients, it creates no autograd Node, and it tracks no metadata beyond shape/dtype. All differentiable variants live in autograd/ops.py and forward to these primitives.
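The split between a forward-only primitive and its differentiable variant can be sketched on plain floats. Everything below is illustrative: mul_forward, ToyTape, Var, mul, and run_backward are hypothetical names, not netcl symbols.

```python
def mul_forward(a: float, b: float) -> float:
    """Forward-only primitive: computes the value, records nothing."""
    return a * b


class ToyTape:
    """Hypothetical tape: an ordered list of backward closures."""

    def __init__(self) -> None:
        self.nodes = []


class Var:
    """A value plus its accumulated gradient, bound to one tape."""

    def __init__(self, value: float, tape: ToyTape) -> None:
        self.value = value
        self.grad = 0.0
        self.tape = tape


def mul(a: Var, b: Var) -> Var:
    """Differentiable variant: forwards to the primitive, then records a node."""
    out = Var(mul_forward(a.value, b.value), a.tape)

    def _backward() -> None:
        a.grad += b.value * out.grad
        b.grad += a.value * out.grad

    a.tape.nodes.append(_backward)
    return out


def run_backward(loss: Var) -> None:
    """Seed d(loss)/d(loss) = 1 and replay the recorded nodes in reverse."""
    loss.grad = 1.0
    for node in reversed(loss.tape.nodes):
        node()


tape = ToyTape()
x, w = Var(3.0, tape), Var(4.0, tape)
y = mul(mul(x, w), w)    # y = x * w * w
run_backward(y)          # dy/dx = w * w, dy/dw = 2 * x * w
```

Keeping the primitive free of bookkeeping means the same function can serve inference-only code paths without paying for gradient tracking.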
autograd — Tape + JIT
The automatic-differentiation layer. It owns the Tape singleton, every differentiable Node, and the JIT Compiler that fuses element-wise op chains into a single OpenCL kernel. Higher layers (nn, distributed) consume this layer; the JIT Compiler in turn depends on ops and the Tensor Backend.
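As a rough illustration of what fusing an element-wise chain means, the sketch below turns a list of ops into one OpenCL C kernel source string. The op names, the TEMPLATES table, and fuse_elementwise are assumptions made for this example; netcl's actual JIT Compiler interface is not shown on this page.

```python
# Each hypothetical op maps to a C expression over the running value {v}.
TEMPLATES = {
    "add_scalar": "({v} + {c}f)",
    "mul_scalar": "({v} * {c}f)",
    "relu": "fmax({v}, 0.0f)",
}


def fuse_elementwise(chain, name="fused"):
    """Nest the op expressions so the whole chain runs in a single kernel."""
    expr = "x[i]"
    for op, *args in chain:
        constant = float(args[0]) if args else ""
        expr = TEMPLATES[op].format(v=expr, c=constant)
    return (
        f"__kernel void {name}(__global const float* x,\n"
        f"                     __global float* out,\n"
        f"                     const int n) {{\n"
        f"    int i = get_global_id(0);\n"
        f"    if (i < n) out[i] = {expr};\n"
        f"}}\n"
    )


src = fuse_elementwise([("mul_scalar", 2), ("add_scalar", 1.0), ("relu",)])
print(src)
```

Three separate kernels would read and write the intermediate buffers three times; the fused kernel reads x once and writes out once, which is the usual motivation for this kind of fusion.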
nn — Modules, Layers, ResNet
The neural-network module library. Implements Linear, Conv2d, BatchNorm2d, the ResNet family, and friends. Built entirely on top of autograd: nn.Module never calls into ops directly, so all nn modules are differentiable end-to-end. The AMP and Distributed Architecture hooks plug in at this layer.
optim — Optimizers and Schedules
SGD, Adam, AdamW, RMSProp, Momentum, and LR schedulers (CosineAnnealingLR, WarmupCosine, ReduceLROnPlateau). Optimizers consume the nn layer's Module.parameters() and touch autograd only for the optional clip_grad_norm and clip_grad_norm_device helpers in optim.
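The usual meaning of gradient-norm clipping followed by a plain SGD update can be sketched on Python lists. This is an assumption about what a helper with this name typically does (global L2-norm clipping), not a copy of netcl's clip_grad_norm; the signatures below are hypothetical.

```python
import math


def clip_grad_norm(grads, max_norm):
    """Scale grads in place so their global L2 norm is at most max_norm.

    Returns the norm measured before clipping.
    """
    total = math.sqrt(sum(g * g for g in grads))
    if total > max_norm:
        scale = max_norm / total
        for i in range(len(grads)):
            grads[i] *= scale
    return total


def sgd_step(params, grads, lr):
    """Vanilla SGD: move each parameter against its gradient."""
    for i in range(len(params)):
        params[i] -= lr * grads[i]


params = [1.0, -2.0]
grads = [3.0, 4.0]                        # global L2 norm is 5.0
norm_before = clip_grad_norm(grads, max_norm=1.0)
sgd_step(params, grads, lr=0.1)
```

Clipping happens before the update so that a single oversized gradient cannot move the parameters by more than lr * max_norm in norm.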
runtime — Cross-cutting services
Cache, replay-graph capture, and stream scheduling. Used by the trainer, the JIT Compiler, and the AMP loss-scaler. Treat this as the "library" that the high-level layers reach for when they need a kernel cache, a captured graph, or a scheduler.
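One common way to build a kernel cache is to key compiled objects by a hash of the source and build options. The sketch below assumes that design; ToyKernelCache and its build callable are hypothetical, and a tuple stands in for a compiled program.

```python
import hashlib


class ToyKernelCache:
    """Hypothetical cache: compile each (source, options) pair at most once."""

    def __init__(self, build):
        self._build = build      # callable(source, options) -> compiled object
        self._entries = {}
        self.builds = 0          # number of real compilations performed

    def get(self, source, options=""):
        # The NUL separator keeps (options, source) pairs from colliding.
        key = hashlib.sha256(f"{options}\0{source}".encode()).hexdigest()
        if key not in self._entries:
            self._entries[key] = self._build(source, options)
            self.builds += 1
        return self._entries[key]


# A stand-in "compiler" so the example runs without an OpenCL driver.
cache = ToyKernelCache(build=lambda src, opts: ("compiled", len(src), opts))

k1 = cache.get("__kernel void f() {}")
k2 = cache.get("__kernel void f() {}")                                   # cache hit
k3 = cache.get("__kernel void f() {}", options="-cl-fast-relaxed-math")  # new key
```

Including the build options in the key matters because the same source compiled with different flags is a different binary.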
data — DataLoader, augmentation, shared-memory ring
The input pipeline: DataLoader, augment, augment_gpu, filters, and a shared-memory ring (shared_batch) for low-latency host-side prefetch. The data layer has no compile-time dependency on the rest of the stack; it emits NumPy batches, and any compute layer can consume them.
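The ring idea behind a prefetch buffer can be sketched as a fixed number of fixed-size slots over one contiguous buffer. This is a single-process illustration only: ToyBatchRing is a hypothetical name, a bytearray stands in for the shared-memory segment, and no cross-process synchronization is shown.

```python
class ToyBatchRing:
    """Hypothetical ring of equally sized slots over one contiguous buffer."""

    def __init__(self, slots: int, slot_bytes: int) -> None:
        self.slots = slots
        self.slot_bytes = slot_bytes
        self.buf = bytearray(slots * slot_bytes)  # stand-in for shared memory
        self.head = 0     # next slot to write
        self.tail = 0     # next slot to read
        self.count = 0    # number of filled slots

    def put(self, payload: bytes) -> bool:
        """Copy one batch into the next free slot; return False when full."""
        if len(payload) != self.slot_bytes:
            raise ValueError("payload must fill exactly one slot")
        if self.count == self.slots:
            return False
        start = self.head * self.slot_bytes
        self.buf[start:start + self.slot_bytes] = payload
        self.head = (self.head + 1) % self.slots
        self.count += 1
        return True

    def get(self):
        """Return the oldest batch, or None when the ring is empty."""
        if self.count == 0:
            return None
        start = self.tail * self.slot_bytes
        payload = bytes(self.buf[start:start + self.slot_bytes])
        self.tail = (self.tail + 1) % self.slots
        self.count -= 1
        return payload


ring = ToyBatchRing(slots=2, slot_bytes=4)
accepted = [ring.put(b"aaaa"), ring.put(b"bbbb"), ring.put(b"cccc")]
first = ring.get()            # oldest batch comes out first
ring.put(b"cccc")             # the freed slot is reused
rest = [ring.get(), ring.get(), ring.get()]
```

Because slots are preallocated and reused, the producer never allocates per batch; it only waits (or drops) when the consumer falls behind.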
distributed — Collectives and data-parallel
Multi-device, multi-process support on a single host. Provides host-based all_reduce, broadcast, scatter, and gather (see the Distributed Architecture page), plus the high-level data_parallel_step helper that wires together DataLoader → Tensor → nn → Optimizer → Checkpoint.
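A host-based all-reduce can be pictured as: gather each replica's gradients on the host, reduce them, and write the result back to every replica. The sketch below assumes a mean reduction over plain Python lists; all_reduce_mean is a hypothetical name, and whether netcl's all_reduce sums or averages is not stated here.

```python
def all_reduce_mean(replica_grads):
    """Average gradients element-wise and write the mean back to each replica."""
    if not replica_grads:
        raise ValueError("need at least one replica")
    n = len(replica_grads)
    length = len(replica_grads[0])
    if any(len(g) != length for g in replica_grads):
        raise ValueError("all replicas must hold gradients of the same length")
    mean = [sum(g[i] for g in replica_grads) / n for i in range(length)]
    for g in replica_grads:
        g[:] = mean          # in-place update, like writing back to each device
    return mean


replicas = [[1.0, 2.0], [3.0, 6.0]]   # two replicas, two gradient entries each
mean = all_reduce_mean(replicas)
```

After the call every replica holds identical gradients, so each one can apply the same optimizer update and the model copies stay in sync.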
Data flow during a single training step
The diagram below is a faithful walk-through of one forward / backward / optimizer step. Boxes are real subpackages; arrows are the direction in which Tensor objects (or gradients) move.
A concrete code outline that matches the picture (the autograd tape is opened explicitly around the forward pass):
import netcl.autograd as ag
from netcl.io import save_model
# Tensor and sync_grads are used below; their import lines are not shown in this outline.
# loader, queue, model, opt, and replicas are assumed to be set up beforehand.

# 1. DataLoader yields a NumPy batch
for x, y in loader:                               # data API
    # 2. Tensor factory copies to device
    x_t = Tensor.from_host(queue, x)              # core API
    y_t = Tensor.from_host(queue, y)
    # 3. nn.Module forward, differentiable and recorded by the Tape
    with ag.Tape() as tape:
        logits = model(x_t)                       # nn API
        loss = ag.cross_entropy(logits, y_t)      # autograd API
    # 4. Tape backward
    tape.backward(loss)                           # autograd API
    # 5. Optimizer step
    opt.step()                                    # optim API
    opt.zero_grad()
    # 6. Checkpoint / sync
    save_model(model, "ckpt.netcl")               # io API
    sync_grads(replicas)                          # distributed API
Dependency direction, in one sentence
End user → trainer → nn + distributed + data → autograd → ops + runtime → core → Tensor Backend → Memory Pool → pyopencl.
Following that direction tells you which file in netcl is the right place to look for a given feature: if you are chasing a missing kernel argument, look in Tensor Backend; if you are chasing a wrong gradient, look in Autograd & Tape; if you are chasing a slowdown, look in JIT Compiler and Memory Pool; if you are chasing a multi-device stall, look in Distributed Architecture.
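The claim that imports only point downward can be checked mechanically with a topological sort. The edge list below is a simplified reading of the dependency sentence above (optim and io are left out, and data is given no dependencies, as its own section states); it is an illustration, not a dump of netcl's real import graph.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical edges: each key imports from the layers in its value set.
IMPORTS = {
    "trainer": {"nn", "distributed", "data"},
    "nn": {"autograd"},
    "distributed": {"autograd"},
    "data": set(),
    "autograd": {"ops", "runtime"},
    "ops": {"core"},
    "runtime": {"core"},
    "core": set(),
}


def import_order(graph):
    """Return layers bottom-up (dependencies first); raise CycleError on a cycle."""
    return list(TopologicalSorter(graph).static_order())


order = import_order(IMPORTS)

# Introducing an upward import (core importing trainer) creates a cycle.
broken = dict(IMPORTS, core={"trainer"})
try:
    import_order(broken)
    cycle_detected = False
except CycleError:
    cycle_detected = True
```

Running a check like this in CI is one way to keep the layering rule from eroding as the codebase grows.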
See also
- Tensor Backend: the OpenCL/CPU device layer beneath Tensor.
- Memory Pool: the BufferPool that backs every Tensor.
- Autograd & Tape: the Tape, Node, and apply_op machinery.
- JIT Compiler: the source generator and cache used by @jit_compile and TrainingGraphCompiler.
- Distributed Architecture: the host-based multi-device design.
- core API: the full symbol list for the core layer.
- Quickstart: end-to-end example that touches most of the subpackages above.