Architecture — Module Overview

netcl is organized as a stack of cooperating subpackages. Every top-level subpackage has a single, clearly-scoped responsibility, and the dependency direction is always from high to low: high-level code (models, trainers, data-parallel loops) imports from the core API; the core never imports from the trainer. This keeps the build graph acyclic and makes it possible to swap or replace a subpackage (for example, a different Tensor backend) without touching the layers above it.
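The high-to-low rule can be checked mechanically. The sketch below is illustrative only: the LAYER_RANK table is an assumption read off the one-sentence dependency chain later on this page, and is_allowed_import and find_violations are hypothetical helpers, not part of netcl. Because every allowed import goes to a strictly lower rank, any import graph that passes the check cannot contain a cycle.

```python
# Hypothetical ranks; a higher number means "closer to the end user".
LAYER_RANK = {
    "core": 0,
    "ops": 1, "runtime": 1,
    "autograd": 2,
    "nn": 3, "distributed": 3, "data": 3,
    "trainer": 4,
}


def is_allowed_import(importer: str, imported: str) -> bool:
    """Return True if `importer` sits strictly above `imported`."""
    return LAYER_RANK[importer] > LAYER_RANK[imported]


def find_violations(edges):
    """Return every (importer, imported) pair that breaks the rule."""
    return [(a, b) for a, b in edges if not is_allowed_import(a, b)]


# The last edge is a deliberate violation: core importing from the trainer.
edges = [
    ("trainer", "nn"),
    ("nn", "autograd"),
    ("autograd", "ops"),
    ("ops", "core"),
    ("core", "trainer"),
]
violations = find_violations(edges)
print(violations)
```

The strict comparison also rejects imports between layers of equal rank; whether netcl permits those is not stated here, so treat that as a conservative choice of the sketch.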

The eight top-level subpackages and their import direction are summarized below.

Caption — solid arrows are direct imports. The core layer is at the bottom because it is the only layer that talks to the OpenCL driver through PyOpenCL; the trainer layer is at the top because it is the only layer that end users are expected to call directly in long-running jobs.

Subpackages

core — Foundation

The foundation. Owns device discovery, the Tensor type, and all memory management. Nothing in netcl can run without the core layer, and nothing in core imports from higher layers.

  • core/device.py: DeviceManager plus a thread-local device context manager. DeviceManager enumerates OpenCL platforms and devices and caches a DeviceHandle per request.
  • core/tensor.py: the Tensor class itself: shape, dtype, device buffer, and an optional BufferPool handle.
  • core/memory.py: BufferPool, PinnedBufferPool, PersistentBufferPool, BufferHandle, and PoolStats for hit-rate tracking.
  • core/backend/opencl.py: the Tensor Backend; async H2D/D2H copies, SIGINT/atexit teardown contract, fork-safety snapshot.
  • core/backend/cpu.py: NumPy fallback used when pyopencl is missing or when the user explicitly asks for backend="cpu".
  • core/kernels/primitives.py: KernelSpec, WorkGroupTuner, and the shared PRIMITIVE_PREAMBLE injected into every JIT Compiler source string.
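To make the idea of a buffer pool with hit-rate tracking concrete, here is a minimal sketch. ToyBufferPool and ToyPoolStats are hypothetical stand-ins, not the real BufferPool and PoolStats classes, and a bytearray plays the role of a device buffer. The actual pool may bucket sizes, pin memory, or evict differently.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ToyPoolStats:
    hits: int = 0
    misses: int = 0

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


class ToyBufferPool:
    """Free lists keyed by exact size; a bytearray stands in for a device buffer."""

    def __init__(self):
        self._free = defaultdict(list)
        self.stats = ToyPoolStats()

    def acquire(self, nbytes: int) -> bytearray:
        bucket = self._free[nbytes]
        if bucket:
            self.stats.hits += 1      # reuse a previously released buffer
            return bucket.pop()
        self.stats.misses += 1        # nothing cached: allocate a fresh one
        return bytearray(nbytes)

    def release(self, buf: bytearray) -> None:
        self._free[len(buf)].append(buf)


pool = ToyBufferPool()
a = pool.acquire(1024)   # miss
pool.release(a)
b = pool.acquire(1024)   # hit, the same object comes back
c = pool.acquire(2048)   # miss, different size
print(pool.stats)
```

A low hit rate in a steady-state training loop usually means buffer sizes vary from step to step, which is the kind of signal a stats object like this is meant to surface.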

ops — Element-wise and linalg primitives

Each op is a single function with Tensor inputs and outputs. The ops layer is forward-only — it knows nothing about gradients, no autograd Node is created here, and no metadata beyond shape/dtype is tracked. All differentiable variants live in autograd/ops.py and forward to these primitives.
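The split between forward-only primitives and differentiable wrappers can be sketched with scalars. Everything below is hypothetical (Var, ToyTape, d_mul, d_add are illustration names, not netcl APIs); it only shows the pattern in which the wrapper calls the primitive for the forward value and records a backward closure separately.

```python
# Forward-only primitives (the "ops" role): no gradient bookkeeping at all.
def mul(a: float, b: float) -> float:
    return a * b


def add(a: float, b: float) -> float:
    return a + b


class Var:
    """A value plus an accumulated gradient."""

    def __init__(self, value: float):
        self.value = value
        self.grad = 0.0


class ToyTape:
    """Records backward closures in forward order and replays them in reverse."""

    def __init__(self):
        self.nodes = []

    def record(self, backward_fn):
        self.nodes.append(backward_fn)

    def backward(self, out: Var):
        out.grad = 1.0
        for backward_fn in reversed(self.nodes):
            backward_fn()


# Differentiable wrappers (the "autograd/ops.py" role) forward to the primitives.
def d_mul(tape: ToyTape, a: Var, b: Var) -> Var:
    out = Var(mul(a.value, b.value))

    def backward_fn():
        a.grad += b.value * out.grad
        b.grad += a.value * out.grad

    tape.record(backward_fn)
    return out


def d_add(tape: ToyTape, a: Var, b: Var) -> Var:
    out = Var(add(a.value, b.value))

    def backward_fn():
        a.grad += out.grad
        b.grad += out.grad

    tape.record(backward_fn)
    return out


tape = ToyTape()
x, y = Var(3.0), Var(4.0)
z = d_add(tape, d_mul(tape, x, y), x)   # z = x * y + x
tape.backward(z)
print(z.value, x.grad, y.grad)
```

Here dz/dx is y + 1 and dz/dy is x, so the primitives stay reusable by code that never needs gradients.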

autograd — Tape + JIT

The automatic-differentiation layer. It owns the Tape singleton, every differentiable Node, and the JIT Compiler that fuses element-wise op chains into a single OpenCL kernel. Higher layers (nn, distributed) consume this layer; the JIT Compiler in turn depends on ops and the Tensor Backend.
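As a rough illustration of what fusing an element-wise chain means, the sketch below composes a chain of unary ops into one OpenCL C kernel source string, so that the intermediate results never leave registers. The TEMPLATES table and fuse_elementwise are hypothetical; the real JIT Compiler's input format, preamble handling, and supported ops are not described on this page.

```python
# Hypothetical expression templates for a few unary element-wise ops.
TEMPLATES = {
    "neg": "(-{x})",
    "exp": "exp({x})",
    "relu": "fmax({x}, 0.0f)",
    "double": "(2.0f * {x})",
}


def fuse_elementwise(chain, name="fused"):
    """Nest the op templates into one expression and wrap it in a single kernel."""
    expr = "src[i]"
    for op in chain:
        expr = TEMPLATES[op].format(x=expr)
    return (
        f"__kernel void {name}(__global const float *src,\n"
        f"                     __global float *dst, const int n) {{\n"
        "    int i = get_global_id(0);\n"
        f"    if (i < n) dst[i] = {expr};\n"
        "}\n"
    )


# relu(exp(-x)) becomes one kernel instead of three launches.
kernel_src = fuse_elementwise(["neg", "exp", "relu"])
print(kernel_src)
```

Without fusion, each op in the chain would read and write a full buffer in global memory; the fused form reads src once and writes dst once.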

nn — Modules, Layers, ResNet

The neural-network module library. Implements Linear, Conv2d, BatchNorm2d, the ResNet family, and friends. Built entirely on top of autograd: nn.Module never calls into ops directly, so all nn modules are differentiable end-to-end. The AMP and Distributed Architecture hooks plug in at this layer.

optim — Optimizers and Schedules

SGD, Adam, AdamW, RMSProp, Momentum, and LR schedulers (CosineAnnealingLR, WarmupCosine, ReduceLROnPlateau). Optimizers consume nn.Module.parameters() and touch autograd only through the optional clip_grad_norm and clip_grad_norm_device helpers.
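For readers unfamiliar with gradient-norm clipping, the sketch below shows the common global-norm form on plain Python lists. It is an assumption that netcl's clip_grad_norm follows this form; clip_grad_norm_sketch, its signature, and the eps default are hypothetical and chosen for illustration.

```python
import math


def clip_grad_norm_sketch(grads, max_norm, eps=1e-6):
    """Scale all gradients in place so their joint L2 norm is at most max_norm.

    grads is a list of flat lists of floats, one per parameter.
    Returns the total norm measured before clipping.
    """
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    scale = max_norm / (total_norm + eps)
    if scale < 1.0:
        for grad in grads:
            for i in range(len(grad)):
                grad[i] *= scale
    return total_norm


grads = [[3.0], [4.0]]                      # joint norm is 5
pre_clip = clip_grad_norm_sketch(grads, max_norm=1.0)
post_clip = math.sqrt(sum(g * g for grad in grads for g in grad))
print(pre_clip, post_clip)
```

The scale is computed once over all parameters, so the direction of the overall gradient is preserved and only its length changes.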

runtime — Cross-cutting services

Cache, replay-graph capture, and stream scheduling. Used by the trainer, the JIT Compiler, and the AMP loss-scaler. Treat this as the "library" that the high-level layers reach for when they need a kernel cache, a captured graph, or a scheduler.

data — DataLoader, augmentation, shared-memory ring

The input pipeline. DataLoader, augment, augment_gpu, filters, and a shared-memory ring (shared_batch) for low-latency host-side prefetch. Has no compile-time dependency on the rest of the stack — it emits NumPy batches and any compute layer can consume them.
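The ring idea behind shared_batch can be sketched as a fixed number of equally sized slots over one flat buffer. ToyBatchRing is a hypothetical, single-process stand-in: a bytearray replaces the shared-memory segment, and there is no cross-process synchronization, which the real implementation would need.

```python
class ToyBatchRing:
    """Fixed-slot ring over one flat buffer; a bytearray stands in for shared memory."""

    def __init__(self, num_slots: int, slot_bytes: int):
        self.num_slots = num_slots
        self.slot_bytes = slot_bytes
        self.buf = bytearray(num_slots * slot_bytes)
        self.head = 0    # next slot to write
        self.tail = 0    # next slot to read
        self.count = 0   # number of filled slots

    def put(self, payload: bytes) -> bool:
        """Copy one batch into the next free slot; return False if full or mis-sized."""
        if self.count == self.num_slots or len(payload) != self.slot_bytes:
            return False
        start = self.head * self.slot_bytes
        self.buf[start:start + self.slot_bytes] = payload
        self.head = (self.head + 1) % self.num_slots
        self.count += 1
        return True

    def get(self):
        """Return the oldest batch, or None if the ring is empty."""
        if self.count == 0:
            return None
        start = self.tail * self.slot_bytes
        payload = bytes(self.buf[start:start + self.slot_bytes])
        self.tail = (self.tail + 1) % self.num_slots
        self.count -= 1
        return payload


ring = ToyBatchRing(num_slots=2, slot_bytes=4)
accepted = [ring.put(b"aaaa"), ring.put(b"bbbb"), ring.put(b"cccc")]
first = ring.get()
refill = ring.put(b"cccc")
rest = [ring.get(), ring.get(), ring.get()]
print(accepted, first, refill, rest)
```

Because the slots are preallocated, the producer never allocates per batch, and a full ring gives natural back-pressure to the prefetching side.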

distributed — Collectives and data-parallel

Multi-device, multi-process support on a single host. Provides host-based all_reduce, broadcast, scatter, and gather (see the Distributed Architecture page), plus the high-level data_parallel_step helper that wires together DataLoader → Tensor → nn → Optimizer → Checkpoint.
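A host-based all-reduce can be pictured as: copy each replica's gradients to the host, reduce them there, and write the result back to every replica. The sketch below does this on plain Python lists. host_all_reduce_mean is a hypothetical helper, and averaging (rather than summing) is an assumption; this page does not say which reduction netcl's all_reduce applies.

```python
def host_all_reduce_mean(replica_grads):
    """Average per-replica gradients element-wise and write the result back in place.

    replica_grads holds one flat list of floats per replica.
    """
    if not replica_grads:
        raise ValueError("need at least one replica")
    length = len(replica_grads[0])
    if any(len(g) != length for g in replica_grads):
        raise ValueError("all replicas must hold gradients of the same length")

    n = len(replica_grads)
    reduced = [sum(g[i] for g in replica_grads) / n for i in range(length)]

    # "Broadcast" step: every replica ends up with the same values.
    for g in replica_grads:
        g[:] = reduced
    return reduced


replica_grads = [[1.0, 2.0], [3.0, 6.0]]
reduced = host_all_reduce_mean(replica_grads)
print(reduced, replica_grads)
```

After the call, every replica applies the same update, which is what keeps data-parallel model copies identical from step to step.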

Data flow during a single training step

The diagram below is a faithful walk-through of one forward / backward / optimizer step. Boxes are real subpackages; arrows are the direction in which Tensor objects (or gradients) move.

A concrete code outline that matches the picture (the Tape is opened explicitly around the forward pass so that the loss computation is recorded):

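import netcl.autograd as ag
from netcl.core.tensor import Tensor      # assumed path; Tensor lives in core/tensor.py
from netcl.io import save_model

# Assumed to be set up earlier: loader, queue, model, opt, replicas.
# sync_grads comes from the distributed layer; its import path is not shown here.

# 1. DataLoader yields a NumPy batch
for x, y in loader:                       # data API

    # 2. Tensor factory copies to device
    x_t = Tensor.from_host(queue, x)      # core API
    y_t = Tensor.from_host(queue, y)

    # 3. nn.Module forward: differentiable, recorded by the Tape
    with ag.Tape() as tape:
        logits = model(x_t)               # nn API
        loss = ag.cross_entropy(logits, y_t)  # autograd API

    # 4. Tape backward
    tape.backward(loss)                   # autograd API

    # 5. Sync gradients across replicas while they still hold this step's values
    sync_grads(replicas)                  # distributed API

    # 6. Optimizer step, then clear gradients for the next iteration
    opt.step()                            # optim API
    opt.zero_grad()

    # 7. Checkpoint
    save_model(model, "ckpt.netcl")       # io API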

Dependency direction, in one sentence

End user → trainer → nn + distributed + data → autograd → ops + runtime → core → Tensor Backend → Memory Pool → pyopencl.

Following that direction tells you which file in netcl is the right place to look for a given feature: if you are chasing a missing kernel argument, look in Tensor Backend; if you are chasing a wrong gradient, look in Autograd & Tape; if you are chasing a slowdown, look in JIT Compiler and Memory Pool; if you are chasing a multi-device stall, look in Distributed Architecture.

See also