
Architecture — Module Overview

netcl is organized as a stack of cooperating subpackages. Every top-level subpackage has a single, clearly-scoped responsibility, and the dependency direction is always from high to low: high-level code (models, trainers, data-parallel loops) imports from the core API; the core never imports from the trainer. This keeps the build graph acyclic and makes it possible to swap or replace a subpackage (for example, a different Tensor backend) without touching the layers above it.

The eight top-level subpackages and their import direction are summarized below.

Figure: import direction between subpackages. Solid arrows are direct imports. The core layer is at the bottom because it is the only layer that talks to the OpenCL driver through PyOpenCL; the trainer layer is at the top because it is the only layer that end users are expected to call directly in long-running jobs.
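The acyclic, high-to-low import rule can be checked mechanically. The sketch below is illustrative only: the edge set is an assumption read off the subpackage descriptions on this page (the runtime → core edge in particular is a guess), and `build_order` is a hypothetical helper, not part of netcl.

```python
from graphlib import TopologicalSorter

# Assumed import edges; each key imports from the subpackages in its value set.
IMPORTS = {
    "core": set(),
    "data": set(),                       # emits NumPy batches, imports nothing from netcl
    "ops": {"core"},
    "runtime": {"core"},                 # assumption: not stated on this page
    "autograd": {"ops", "runtime", "core"},
    "nn": {"autograd"},
    "optim": {"nn", "autograd"},
    "distributed": {"nn", "optim", "data", "core"},
}

def build_order(imports):
    """Return a low-to-high ordering; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(imports).static_order())

order = build_order(IMPORTS)
rank = {name: i for i, name in enumerate(order)}

# Every import must point at a subpackage that sorts lower in the stack.
assert all(rank[dep] < rank[pkg] for pkg, deps in IMPORTS.items() for dep in deps)
```

A check like this could run in CI against the real import graph, so that a reverse import (for example, core importing nn) fails the build instead of surfacing later as a circular-import error.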

Subpackages

core — Foundation

The foundation. Owns device discovery, the Tensor type, and all memory management. Nothing in netcl can run without the core layer, and nothing in core imports from higher layers.

core/device.py — DeviceManager plus a thread-local device context manager. DeviceManager enumerates OpenCL platforms and devices and caches a DeviceHandle per request.
core/tensor.py — the Tensor class itself: shape, dtype, device buffer, and an optional BufferPool handle.
core/memory.py — BufferPool, PinnedBufferPool, PersistentBufferPool, BufferHandle, and PoolStats for hit-rate tracking.
core/backend/opencl.py — the Tensor Backend; async H2D/D2H copies, SIGINT/atexit teardown contract, fork-safety snapshot.
core/backend/cpu.py — NumPy fallback used when pyopencl is missing or when the user explicitly asks for backend="cpu".
core/kernels/primitives.py — KernelSpec, WorkGroupTuner, and the shared PRIMITIVE_PREAMBLE injected into every JIT Compiler source string.
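To make the pooling idea concrete, here is a minimal sketch of a size-bucketed pool with hit-rate tracking. Only the names BufferPool and PoolStats come from this page; the `acquire`, `release`, and `hit_rate` members are hypothetical, and a `bytearray` stands in for a device buffer.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class PoolStats:
    hits: int = 0
    misses: int = 0

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

class BufferPool:
    """Size-bucketed free list; bytearray stands in for a device buffer."""

    def __init__(self):
        self._free = defaultdict(list)   # nbytes -> idle buffers of that size
        self.stats = PoolStats()

    def acquire(self, nbytes: int) -> bytearray:
        bucket = self._free[nbytes]
        if bucket:
            self.stats.hits += 1
            return bucket.pop()          # reuse an idle buffer
        self.stats.misses += 1
        return bytearray(nbytes)         # real code would allocate on the device

    def release(self, buf: bytearray) -> None:
        self._free[len(buf)].append(buf)

pool = BufferPool()
a = pool.acquire(1024)   # miss: nothing cached yet
pool.release(a)
b = pool.acquire(1024)   # hit: the same buffer is handed back
```

Exact-size bucketing keeps the sketch short; a production pool would more likely round sizes up to a bucket boundary so that near-miss requests still hit.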

ops — Element-wise and linalg primitives

Each op is a single function with Tensor inputs and outputs. The ops layer is forward-only — it knows nothing about gradients, no autograd Node is created here, and no metadata beyond shape/dtype is tracked. All differentiable variants live in autograd/ops.py and forward to these primitives.
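The split between a forward-only primitive and its differentiable wrapper can be sketched as follows. This is a toy under stated assumptions: plain Python lists stand in for Tensor, and `mul`, `ag_mul`, and this `Node` class are hypothetical names, not netcl's actual signatures.

```python
# Forward-only primitive, in the spirit of the ops layer: values in, values out.
def mul(a, b):
    return [x * y for x, y in zip(a, b)]

# Minimal graph node, in the spirit of the autograd layer.
class Node:
    def __init__(self, value, parents=(), backward_fn=None):
        self.value = value
        self.parents = parents
        self.backward_fn = backward_fn

# Differentiable wrapper: forwards to the primitive and records what
# the backward pass needs. The primitive itself stays gradient-free.
def ag_mul(a: Node, b: Node) -> Node:
    out = mul(a.value, b.value)

    def backward_fn(grad):
        # d(a*b)/da = b and d(a*b)/db = a, applied element-wise
        return mul(grad, b.value), mul(grad, a.value)

    return Node(out, parents=(a, b), backward_fn=backward_fn)

x = Node([1.0, 2.0, 3.0])
w = Node([4.0, 5.0, 6.0])
y = ag_mul(x, w)
grad_x, grad_w = y.backward_fn([1.0, 1.0, 1.0])
```

Note that the backward function is itself built from the forward primitive, which is one reason to keep the ops layer free of gradient bookkeeping.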

autograd — Tape + JIT

The automatic-differentiation layer. It owns the Tape singleton, every differentiable Node, and the JIT Compiler that fuses element-wise op chains into a single OpenCL kernel. Higher layers (nn, distributed) consume this layer; the JIT Compiler in turn depends on ops and the Tensor Backend.
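As an illustration of what fusing an element-wise chain means, the sketch below composes a chain of ops into a single OpenCL C kernel string. The op table, the `fuse` function, and the kernel signature are all hypothetical; netcl's JIT Compiler may generate source quite differently.

```python
# Hypothetical op table: name -> C expression template over the running value "v".
ELEMENTWISE = {
    "relu": "fmax({v}, 0.0f)",
    "scale2": "({v} * 2.0f)",
    "add1": "({v} + 1.0f)",
}

def fuse(chain, kernel_name="fused"):
    """Compose a chain of element-wise ops into one OpenCL C kernel string."""
    expr = "x[i]"
    for op in chain:
        # Nest each op around the expression built so far.
        expr = ELEMENTWISE[op].format(v=expr)
    return (
        f"__kernel void {kernel_name}(__global const float* x,\n"
        f"                    __global float* out, const int n) {{\n"
        f"    int i = get_global_id(0);\n"
        f"    if (i < n) out[i] = {expr};\n"
        f"}}\n"
    )

src = fuse(["scale2", "add1", "relu"])
print(src)
```

The benefit is that three ops become one kernel launch with one read and one write per element, instead of three launches with intermediate buffers in between.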

nn — Modules, Layers, ResNet

The neural-network module library. Implements Linear, Conv2d, BatchNorm2d, the ResNet family, and friends. Built entirely on top of autograd — nn.Module never calls into ops directly, so all nn modules are differentiable end-to-end. The AMP and Distributed Architecture hooks plug in at this layer.

optim — Optimizers and Schedules

Provides the SGD, Adam, AdamW, RMSProp, and Momentum optimizers and the LR schedulers CosineAnnealingLR, WarmupCosine, and ReduceLROnPlateau. Optimizers consume nn.Module.parameters() and touch autograd only through the optional clip_grad_norm and clip_grad_norm_device helpers.
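For readers unfamiliar with the pieces named above, here is a small sketch of standard cosine annealing, global-norm gradient clipping, and a plain SGD update. Plain Python lists stand in for Tensors, and the function signatures are assumptions for illustration, not netcl's actual API.

```python
import math

def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Standard cosine annealing from base_lr down to min_lr."""
    t = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * t))

def clip_grad_norm(grads, max_norm):
    """Scale gradients in place so their global L2 norm is at most max_norm.

    Returns the norm measured before clipping.
    """
    total = math.sqrt(sum(g * g for vec in grads for g in vec))
    if total > max_norm:
        scale = max_norm / total
        for vec in grads:
            for i in range(len(vec)):
                vec[i] *= scale
    return total

def sgd_step(params, grads, lr):
    """Vanilla SGD: p <- p - lr * g, in place."""
    for p, g in zip(params, grads):
        for i in range(len(p)):
            p[i] -= lr * g[i]

params = [[1.0, 2.0]]
grads = [[3.0, 4.0]]                 # global L2 norm is 5.0
norm_before = clip_grad_norm(grads, max_norm=1.0)
sgd_step(params, grads, lr=cosine_lr(0, 100, base_lr=0.1))
```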

runtime — Cross-cutting services

Cache, replay-graph capture, and stream scheduling. Used by the trainer, the JIT Compiler, and the AMP loss-scaler. Treat this as the "library" that the high-level layers reach for when they need a kernel cache, a captured graph, or a scheduler.
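A kernel cache of the kind described here can be as simple as a dictionary keyed by a hash of the kernel source and build options. The sketch below is an assumption about the general shape, not netcl's implementation; `KernelCache` and `fake_compile` are hypothetical names, and the stand-in compiler returns a tuple where a real one would build an OpenCL program.

```python
import hashlib

class KernelCache:
    """Memoize compiled kernels by a hash of (build options, source)."""

    def __init__(self, compile_fn):
        self._compile = compile_fn
        self._entries = {}
        self.compiles = 0                # number of real compilations performed

    def get(self, source: str, options: str = ""):
        key = hashlib.sha256(f"{options}\0{source}".encode()).hexdigest()
        if key not in self._entries:
            self._entries[key] = self._compile(source, options)
            self.compiles += 1
        return self._entries[key]

# Stand-in compiler; a real one would build an OpenCL program.
def fake_compile(source, options):
    return ("compiled", len(source), options)

cache = KernelCache(fake_compile)
k1 = cache.get("__kernel void f() {}")
k2 = cache.get("__kernel void f() {}")                           # served from the cache
k3 = cache.get("__kernel void f() {}", "-cl-fast-relaxed-math")  # new options, new entry
```

Including the build options in the key matters: the same source compiled with different flags is a different binary.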

data — DataLoader, augmentation, shared-memory ring

The input pipeline. Contains DataLoader, augment, augment_gpu, filters, and a shared-memory ring (shared_batch) for low-latency host-side prefetch. It has no import dependency on the rest of the stack: it emits NumPy batches, and any compute layer can consume them.
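The ring used for prefetch can be pictured as a fixed number of batch slots with a read cursor and a write cursor. The sketch below shows only that cursor logic in a single process; `BatchRing` is a hypothetical name, and an actual shared-memory ring would additionally place the slots in memory visible to worker processes and synchronize access to the cursors.

```python
class BatchRing:
    """Fixed-capacity ring of batch slots with producer/consumer cursors."""

    def __init__(self, capacity: int):
        self._slots = [None] * capacity
        self._head = 0      # next slot to read
        self._tail = 0      # next slot to write
        self._count = 0     # number of filled slots

    def put(self, batch) -> bool:
        if self._count == len(self._slots):
            return False                    # ring full: producer must wait
        self._slots[self._tail] = batch
        self._tail = (self._tail + 1) % len(self._slots)
        self._count += 1
        return True

    def get(self):
        if self._count == 0:
            return None                     # ring empty: consumer must wait
        batch = self._slots[self._head]
        self._slots[self._head] = None      # free the slot for reuse
        self._head = (self._head + 1) % len(self._slots)
        self._count -= 1
        return batch

ring = BatchRing(capacity=2)
accepted = [ring.put(("x%d" % i, "y%d" % i)) for i in range(3)]
first = ring.get()
```

Because the capacity is fixed, a slow consumer applies back-pressure to the producer instead of letting prefetched batches grow without bound.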

distributed — Collectives and data-parallel

Multi-device, multi-process support on a single host. Provides host-based all_reduce, broadcast, scatter, and gather (see the Distributed Architecture page), plus the high-level data_parallel_step helper that wires together DataLoader → Tensor → nn → Optimizer → Checkpoint.
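A host-based all_reduce amounts to collecting one gradient vector per device on the host, reducing them, and handing every device the same result. The sketch below averages plain Python lists to show the data movement; `all_reduce_mean` is a hypothetical name, and the real collective (including whether it sums or averages) is described on the Distributed Architecture page.

```python
def all_reduce_mean(per_device_grads):
    """Average one gradient vector per device on the host, then give every
    device its own copy of the result (reduce followed by broadcast)."""
    n_dev = len(per_device_grads)
    length = len(per_device_grads[0])
    # Reduce: element-wise sum on the host.
    summed = [sum(g[i] for g in per_device_grads) for i in range(length)]
    mean = [s / n_dev for s in summed]
    # Broadcast: each device receives an independent copy.
    return [list(mean) for _ in range(n_dev)]

grads = [[1.0, 2.0], [3.0, 6.0]]     # two devices, two parameters each
synced = all_reduce_mean(grads)
```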

Data flow during a single training step

The diagram below is a faithful walk-through of one forward / backward / optimizer step. Boxes are real subpackages; arrows are the direction in which Tensor objects (or gradients) move.

A concrete code outline that matches the picture. The Tape is opened explicitly around the forward pass and the loss computation, so every differentiable op in that block is recorded:

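import netcl.autograd as ag
from netcl.core.tensor import Tensor      # Tensor lives in core/tensor.py
from netcl.io import save_model

# Assumed to exist already: loader, queue, model, opt, replicas, and
# sync_grads (distributed API; its import path is not shown on this page).

# 1. DataLoader yields a NumPy batch
for x, y in loader:                       # data API

    # 2. Tensor factory copies to device
    x_t = Tensor.from_host(queue, x)      # core API
    y_t = Tensor.from_host(queue, y)

    # 3. nn.Module forward: differentiable, recorded by the Tape
    with ag.Tape() as tape:
        logits = model(x_t)               # nn API
        loss = ag.cross_entropy(logits, y_t)  # autograd API

    # 4. Tape backward
    tape.backward(loss)                   # autograd API

    # 5. Gradient sync across replicas, before the optimizer consumes them
    sync_grads(replicas)                  # distributed API

    # 6. Optimizer step, then clear gradients for the next iteration
    opt.step()                            # optim API
    opt.zero_grad()

    # 7. Checkpoint
    save_model(model, "ckpt.netcl")       # io API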

Dependency direction, in one sentence

End user → trainer → nn + distributed + data → autograd → ops + runtime → core → Tensor Backend → Memory Pool → pyopencl.

Following that direction tells you where in netcl to look for a given problem: for a missing kernel argument, look in Tensor Backend; for a wrong gradient, look in Autograd & Tape; for a slowdown, look in JIT Compiler and Memory Pool; for a multi-device stall, look in Distributed Architecture.