Data-Parallel Training

netcl's distributed API is in netcl.distributed. The key entry point for data-parallel training is data_parallel_step.

netcl supports data-parallel training across multiple OpenCL devices on a single host. Each device holds a copy of the model parameters, each batch is sharded across the devices, and gradients are averaged after every backward pass. There is no cluster networking: training is single-host, multi-device only.

The high-level pattern uses data_parallel_step:

import netcl.autograd as ag
from netcl.distributed import DeviceManager, prepare_replicas, data_parallel_step
from netcl.optim import AdamW

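# Assumes `model` and `dataloader` are already defined.
dm = DeviceManager()
queues = dm.get_queues()                          # one per device
replicas = prepare_replicas(model.parameters(), queues)
opts = [AdamW(r, lr=3e-4) for r in replicas]      # one optimizer per replica

# Called once per device with that device's shard of the batch.
def forward_fn(queue, xb, yb, params):
    with ag.Tape() as tape:                       # record ops for backward
        logits = model(xb)
        loss = ag.cross_entropy(logits, yb)
    return loss, tape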

for x, y in dataloader:
    loss = data_parallel_step(
        forward_fn=forward_fn,
        param_replicas=replicas,
        optimizers=opts,
        batch=(x, y),
    )

Where It Lives

  • distributed/data_parallel.py: shard_batch, replicate_params, sync_grads, broadcast_params.
  • distributed/trainer.py: prepare_replicas, data_parallel_step.
  • distributed/collectives.py: all_reduce, broadcast, scatter, gather.
  • distributed/device_manager.py: DeviceManager.

How It Works

data_parallel_step drives one full data-parallel iteration:

  1. shard_batch(batch, n_devices) — splits the batch along axis 0.
  2. For each device: runs forward_fn(queue, xb, yb, params) and calls tape.backward(loss).
  3. sync_grads(param_replicas) — in-place mean of every param.grad across all replicas using all_reduce.
  4. For each device: optimizer.step() and optimizer.zero_grad().
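The four steps can be illustrated without any OpenCL device. The following is a minimal NumPy sketch, not netcl's implementation: `shard_batch_np`, `mse_grad`, and `simulated_step` are hypothetical names, the "model" is a single weight vector with a mean-squared-error loss, and plain SGD stands in for AdamW. It shows that when the shards are equally sized, averaging the per-shard gradients gives the same update as one full-batch step.

```python
import numpy as np

def shard_batch_np(batch, n_devices):
    """Step 1 stand-in: split x and y into n_devices pieces along axis 0."""
    x, y = batch
    return list(zip(np.array_split(x, n_devices), np.array_split(y, n_devices)))

def mse_grad(w, xb, yb):
    """Gradient of mean((xb @ w - yb) ** 2) with respect to w."""
    residual = xb @ w - yb
    return 2.0 * xb.T @ residual / len(xb)

def simulated_step(replicas, batch, lr=0.1):
    """One simulated data-parallel SGD step over per-replica weight arrays."""
    shards = shard_batch_np(batch, len(replicas))               # step 1: shard
    grads = [mse_grad(w, xb, yb)                                # step 2: per-replica backward
             for w, (xb, yb) in zip(replicas, shards)]
    mean_grad = np.mean(grads, axis=0)                          # step 3: average gradients
    return [w - lr * mean_grad for w in replicas]               # step 4: identical update

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 3))
y = rng.normal(size=8)
w0 = rng.normal(size=3)

replicas = [w0.copy() for _ in range(4)]        # 4 simulated devices, 2 samples each
new_replicas = simulated_step(replicas, (x, y))

# Reference: a single-device step on the whole batch.
full_batch_w = w0 - 0.1 * mse_grad(w0, x, y)
```

Because every replica applies the same averaged gradient, the replicas stay identical after the step. If the batch size is not divisible by the number of devices, the shards differ in size and an unweighted mean no longer matches the full-batch gradient exactly.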

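The all-reduce in step 3 is the only synchronisation point. It copies each replica's gradients to the host, averages them with np.mean, and copies the result back to every device. For same-host replicas the reduction itself runs entirely in host memory: there is no network traffic, and the only driver overhead is the device-to-host (D2H) and host-to-device (H2D) copies.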
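A host-staged mean all-reduce of this kind can be sketched as follows. This is an illustration under stated assumptions, not the code in distributed/collectives.py: `host_all_reduce_mean` is a hypothetical name, and plain NumPy arrays stand in for device buffers, so the "copies" are ordinary array reads and writes.

```python
import numpy as np

def host_all_reduce_mean(device_grads):
    """Average same-shaped gradient buffers on the host and write the mean back.

    device_grads: list of NumPy arrays, one per replica, standing in for
    device buffers. Stacking them models the device-to-host copies; the
    in-place assignment models the host-to-device copies.
    """
    staged = np.stack(device_grads)      # D2H: gather every replica's gradient
    mean = staged.mean(axis=0)           # reduce in host memory
    for g in device_grads:
        g[...] = mean                    # H2D: every replica receives the mean
    return mean

# Replica i starts with a gradient filled with the value i.
grads = [np.full((2, 2), float(i)) for i in range(4)]
mean = host_all_reduce_mean(grads)
```

After the call, all four buffers hold the same values, which is what lets each replica's optimizer take an identical step.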

Code Example

A data-parallel step using the netcl distributed API:

import netcl.autograd as ag
from netcl.distributed import DeviceManager, prepare_replicas, data_parallel_step
from netcl.nn import Linear, ReLU, Sequential, cross_entropy
from netcl.optim import AdamW

# One queue per device
dm = DeviceManager()
queues = dm.get_queues()

# Build the model on device 0, replicate params to all queues
def make_model(q):
    return Sequential(Linear(q, 784, 256), ReLU(), Linear(q, 256, 10))

model = make_model(queues[0])
replicas = prepare_replicas(model.parameters(), queues)
opt_replicas = [AdamW(r, lr=3e-4) for r in replicas]

def forward_fn(queue, xb, yb, params):
    with ag.Tape() as tape:
        logits = model(xb)
        loss = cross_entropy(logits, yb)
    return loss, tape

# `dataloader` is assumed to be defined elsewhere and to yield (x, y) batches.
for x, y in dataloader:
    loss = data_parallel_step(
        forward_fn=forward_fn,
        param_replicas=replicas,
        optimizers=opt_replicas,
        batch=(x, y),
    )

Performance & Trade-offs

  • The all-reduce in sync_grads is the only synchronisation point. It is synchronous: all replicas must finish their backward pass before any optimizer step starts.
  • Memory budget: each replica holds a full copy of all parameters and their gradients. For a 100 M-parameter model in fp32, that is ~800 MB per device (400 MB of parameters plus 400 MB of gradients), not counting optimizer state or activations. With 4 replicas you need 4× that in total, ~3.2 GB across the devices.
  • Model-parallel (tensor-parallel or pipeline-parallel) is not implemented in netcl. All replicas run the full model.
  • For deterministic runs, seed all per-replica RNGs identically before the first data_parallel_step.
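The memory figure in the list above is simple arithmetic, sketched here so it can be redone for other model sizes. `replica_memory_bytes` is a hypothetical helper, not part of netcl. The AdamW line assumes two extra fp32 buffers per parameter (the first and second moment estimates the algorithm keeps); how netcl's AdamW actually stores its state is not confirmed here.

```python
def replica_memory_bytes(n_params, bytes_per_value=4, buffers_per_param=2):
    """Bytes held by one replica.

    buffers_per_param=2 counts parameters and gradients only;
    bytes_per_value=4 corresponds to fp32.
    """
    return n_params * bytes_per_value * buffers_per_param

n_params = 100_000_000                               # 100 M parameters

per_device = replica_memory_bytes(n_params)          # parameters + gradients
total_4_replicas = 4 * per_device                    # summed over 4 devices

# Assumption: AdamW adds two more fp32 buffers per parameter.
per_device_with_adamw = replica_memory_bytes(n_params, buffers_per_param=4)

print(f"{per_device / 1e6:.0f} MB per device")
print(f"{total_4_replicas / 1e9:.1f} GB across 4 replicas")
print(f"{per_device_with_adamw / 1e6:.0f} MB per device including assumed AdamW state")
```

Activations are not included; they depend on the per-device shard size rather than on the parameter count.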

See also