Data-Parallel Training
Data-Parallel Training
netcl's distributed API is in
netcl.distributed. The key entry point for data-parallel training isdata_parallel_step.
netcl supports data-parallel training across multiple OpenCL devices on a single host. Each device holds a copy of the model parameters; batches are sharded across devices; gradients are averaged after each backward pass. There is no cluster networking — it is single-host, multi-device only.
The high-level pattern uses data_parallel_step:
import netcl.autograd as ag
from netcl.distributed import DeviceManager, prepare_replicas, data_parallel_step
from netcl.optim import AdamW
dm = DeviceManager()
queues = dm.get_queues() # one per device
replicas = prepare_replicas(model.parameters(), queues)
opts = [AdamW(r, lr=3e-4) for r in replicas]
def forward_fn(queue, xb, yb, params):
with ag.Tape() as tape:
logits = model(xb)
loss = ag.cross_entropy(logits, yb)
return loss, tape
for x, y in dataloader:
loss = data_parallel_step(
forward_fn=forward_fn,
param_replicas=replicas,
optimizers=opts,
batch=(x, y),
)
Where It Lives
distributed/data_parallel.py—shard_batch,replicate_params,sync_grads,broadcast_params.distributed/trainer.py—prepare_replicas,data_parallel_step.distributed/collectives.py—all_reduce,broadcast,scatter,gather.distributed/device_manager.py—DeviceManager.
How It Works
data_parallel_step drives one full data-parallel iteration:
shard_batch(batch, n_devices)— splits the batch along axis 0.- For each device: runs
forward_fn(queue, xb, yb, params)and callstape.backward(loss). sync_grads(param_replicas)— in-place mean of everyparam.gradacross all replicas usingall_reduce.- For each device:
optimizer.step()andoptimizer.zero_grad().
The all-reduce in step 3 is the only synchronisation point. It
copies gradients to host, reduces with np.mean, and copies back.
For same-host replicas this is a pure host-memory operation — no
network, no driver overhead beyond the H2D/D2H copies.
Code Example
A data-parallel step using the netcl distributed API:
import netcl.autograd as ag
from netcl.distributed import DeviceManager, prepare_replicas, data_parallel_step
from netcl.nn import Linear, ReLU, Sequential, cross_entropy
from netcl.optim import AdamW
# One queue per device
dm = DeviceManager()
queues = dm.get_queues()
# Build the model on device 0, replicate params to all queues
def make_model(q):
return Sequential(Linear(q, 784, 256), ReLU(), Linear(q, 256, 10))
model = make_model(queues[0])
replicas = prepare_replicas(model.parameters(), queues)
opt_replicas = [AdamW(r, lr=3e-4) for r in replicas]
def forward_fn(queue, xb, yb, params):
with ag.Tape() as tape:
logits = model(xb)
loss = cross_entropy(logits, yb)
return loss, tape
for x, y in dataloader:
loss = data_parallel_step(
forward_fn=forward_fn,
param_replicas=replicas,
optimizers=opt_replicas,
batch=(x, y),
)
Performance & Trade-offs
- The all-reduce in
sync_gradsis the only synchronisation point. It is synchronous; all replicas must finish their backward before any optimizer step starts. - Memory budget: each replica holds a full copy of all parameters and their gradients. For a 100 M-parameter model in fp32, that is ~800 MB per device. With 4 replicas you need 4× that on your host.
- Model-parallel (tensor-parallel or pipeline-parallel) is not implemented in netcl. All replicas run the full model.
- For deterministic runs, seed all per-replica RNGs identically
before the first
data_parallel_step.
See also
- distributed API — the full API page with all collective functions.
- Collectives —
all_reduce,broadcast,scatter,gather. - DataLoader — the data pipeline.
- Tutorial: Data-Parallel Training — a worked example. this article.