netcl.io — Checkpointing & Serialization

The io API is the persistence layer of netcl. It writes a model's parameters (or an entire training state) to disk in a self-contained, framework-agnostic file, and reads them back. The format is NumPy .npz, a ZIP container of named .npy arrays, with a single __netcl_meta__ entry that carries the layer-by-layer architecture as JSON. The same package also exposes a lower-level, parameter-list checkpoint API that can save the optimizer state, scheduler state, GradScaler state, and step counter.

Note — Two submodules, two import shapes. netcl/io/__init__.py re-exports only save_model and load_model (the high-level model files). The training-state checkpoint helpers — save_checkpoint, load_checkpoint, save_params, and load_params — live in io/checkpoint.py and are not re-exported from the package root. Use the long-form imports:

from netcl.io import save_model, load_model                     # model files
from netcl.io.checkpoint import save_checkpoint, load_checkpoint  # training state
from netcl.io.checkpoint import save_params, load_params          # raw parameter NPZ

Public API

| Symbol | Path | Purpose |
| --- | --- | --- |
| save_model(model, path) | io/serialization.py | Save a Sequential model to a single .npz file |
| load_model(path, queue=None, pool=None) | io/serialization.py | Load a Sequential model from a .npz (or legacy two-file) export |
| save_params(params, path, names=None) | io/checkpoint.py | Write an iterable of Tensors to a raw NPZ |
| load_params(queue, params, path, names=None) | io/checkpoint.py | Read a raw NPZ back into existing Tensors |
| save_checkpoint(params, path, optim_state=None, config=None, names=None) | io/checkpoint.py | Write params NPZ + sidecar JSON containing optimizer / config state |
| load_checkpoint(queue, params, path, names=None) | io/checkpoint.py | Read a checkpoint back; returns the parsed optim_state / config dict |

Model File Format (.netcl)

A .netcl file is a single NumPy .npz (a ZIP container of .npy arrays). The arrays are keyed as follows:

  • One entry per parameter, named "{layer_index}:{state_dict_key}". For a Sequential with two Linear layers, the keys look like "0:weight", "0:bias", "1:weight", "1:bias". Buffers that are already ndarrays (e.g. an Embedding.weight) are saved the same way.
  • A single __netcl_meta__ entry whose value is a dtype=np.str_ array wrapping a JSON document. The document has the shape {"type": "Sequential", "config": [...], "version": 2, "format": "netcl.single-file"}.

mnist_mlp.netcl      (single NPZ file, ZIP under the hood)
├── __netcl_meta__   # JSON: {"type": "Sequential", "config": [...], "version": 2, "format": "netcl.single-file"}
├── 0:weight         # ndarray, dtype = model's weight dtype
├── 0:bias           # ndarray
├── 1:weight         # ndarray
└── 1:bias           # ndarray
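
The layout above can be reproduced with plain NumPy. The following is a minimal sketch, not netcl's own writer: the array shapes and the entries inside "config" are placeholders, since the real per-layer schema is not shown here. It builds an archive in memory with the same key scheme and reads it back.

```python
import io
import json

import numpy as np

# Toy stand-ins for two Linear layers; shapes and values are made up.
payload = {
    "0:weight": np.zeros((4, 3), dtype=np.float32),
    "0:bias": np.zeros(3, dtype=np.float32),
    "1:weight": np.zeros((3, 2), dtype=np.float32),
    "1:bias": np.zeros(2, dtype=np.float32),
}

# The "config" entries are placeholders, not the real netcl layer schema.
meta = {
    "type": "Sequential",
    "config": [{"layer": "Linear"}, {"layer": "Linear"}],
    "version": 2,
    "format": "netcl.single-file",
}
# A 0-d unicode array wraps the JSON text, so no pickling is needed.
payload["__netcl_meta__"] = np.array(json.dumps(meta))

buf = io.BytesIO()
np.savez(buf, **payload)
buf.seek(0)

with np.load(buf, allow_pickle=False) as npz:
    loaded_meta = json.loads(npz["__netcl_meta__"].item())
    param_keys = sorted(k for k in npz.files if k != "__netcl_meta__")
    shapes = {k: npz[k].shape for k in param_keys}

print(loaded_meta["format"], param_keys)
```

Because the metadata is stored as a string array rather than a Python object, the file can be opened with allow_pickle=False.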

Note — Legacy two-file format. Older code (and the German original) described a <path>.json + <path>.npz pair. load_model() still accepts that layout as a fallback: if <path> does not exist but <path>.json and <path>.npz do, it reads the two files. New exports from save_model() use the single-file format described above.

Saving

from netcl.io import save_model
from netcl.nn import Linear, ReLU, Sequential
from netcl.core.device import manager

q = manager.default("auto").queue
model = Sequential(Linear(q, 784, 256), ReLU(), Linear(q, 256, 10))
# ... train ...
save_model(model, "mnist_mlp.netcl")

save_model creates parent directories on demand and writes the file in a single np.savez(...) call. The metadata JSON is re-serialized on every save, so its content does not depend on the platform or on OS line endings. The surrounding ZIP container can still differ byte-for-byte between saves, because each ZIP entry records a modification timestamp.
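
As an illustration of "create parent directories, then one np.savez call", here is a hedged sketch using only NumPy and the standard library. write_single_file is a hypothetical helper, not the netcl implementation. One detail worth knowing: when np.savez is given a string path that does not end in .npz, it appends that extension, so the sketch passes an open file object to keep the .netcl name. Whether save_model does the same is an assumption.

```python
import json
import os
import tempfile

import numpy as np


def write_single_file(path, arrays, meta):
    """Hypothetical helper: write arrays plus a JSON metadata entry to `path`."""
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)  # create parent directories on demand
    payload = dict(arrays)
    payload["__netcl_meta__"] = np.array(json.dumps(meta, sort_keys=True))
    # An open file object keeps the exact file name; with a plain string
    # path, np.savez would write "model.netcl.npz" instead.
    with open(path, "wb") as fh:
        np.savez(fh, **payload)


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "runs", "exp1", "model.netcl")
    write_single_file(
        target,
        {"0:weight": np.ones((2, 2), dtype=np.float32)},
        {"type": "Sequential", "config": [], "version": 2},
    )
    exists = os.path.exists(target)
    # np.load detects the ZIP container from its contents, not the extension.
    with np.load(target, allow_pickle=False) as npz:
        keys = sorted(npz.files)
```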

Loading

from netcl.io import load_model

new_model = load_model("mnist_mlp.netcl")

load_model(path, queue=None, pool=None) does the following:

  1. If queue is None, take the default device's queue from core.device.manager.default.
  2. Open the file with np.load(path, allow_pickle=False). If the file does not exist, fall back to the legacy two-file layout (<path>.json + <path>.npz).
  3. Verify the __netcl_meta__ key is present; parse the JSON.
  4. Rebuild the Sequential from the config list using nn.factory.build_sequential.
  5. For each layer, copy the matching "{idx}:{key}" entries into the layer's state_dict. Missing keys are tolerated — they keep whatever the freshly-built layer was initialized with — which makes it safe to load a checkpoint saved from a slightly older model.
  6. Always call weights.close() on exit to release the NPZ file handle.

The pool= argument is currently unused on the open path but is reserved for a future fast-path that will route the new Tensors through a PersistentBufferPool instead of allocating fresh buffers.
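
The numbered steps above can be sketched without netcl. This is an assumption-laden outline, not the real load_model: read_model_file and apply_arrays are hypothetical names, plain dicts stand in for each layer's state_dict, and the model-rebuilding step (nn.factory.build_sequential) is left out.

```python
import json
import os
import tempfile

import numpy as np

META_KEY = "__netcl_meta__"


def read_model_file(path):
    """Return (meta, arrays) from a single-file export, or from the legacy pair."""
    if os.path.exists(path):
        weights = np.load(path, allow_pickle=False)
        try:
            if META_KEY not in weights.files:
                raise ValueError(f"{path}: missing {META_KEY} entry")
            meta = json.loads(weights[META_KEY].item())
            arrays = {k: weights[k] for k in weights.files if k != META_KEY}
        finally:
            weights.close()  # always release the NPZ file handle
        return meta, arrays
    # Legacy fallback: <path>.json holds the metadata, <path>.npz the arrays.
    with open(path + ".json", "r", encoding="utf-8") as fh:
        meta = json.load(fh)
    with np.load(path + ".npz", allow_pickle=False) as weights:
        arrays = {k: weights[k] for k in weights.files}
    return meta, arrays


def apply_arrays(layer_states, arrays):
    """Copy "{idx}:{key}" entries into per-layer dicts; missing keys are left alone."""
    for idx, state in enumerate(layer_states):
        for key in state:
            name = f"{idx}:{key}"
            if name in arrays:
                state[key] = arrays[name]
    return layer_states


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.netcl")
    with open(path, "wb") as fh:
        np.savez(fh, **{
            META_KEY: np.array(json.dumps({"version": 2, "config": []})),
            "0:weight": np.full((2, 2), 7.0, dtype=np.float32),
        })
    meta, arrays = read_model_file(path)

# The file has no "0:bias", so the bias keeps its initial zeros.
layers = apply_arrays([{"weight": np.zeros((2, 2)), "bias": np.zeros(2)}], arrays)
```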

Training Checkpoint Format

The training-state checkpoint is a thin layer on top of the raw NPZ parameter writer. The output is two files: <path>.npz for the parameter values, and <path>.json for the metadata.

from netcl.io.checkpoint import save_checkpoint, load_checkpoint

save_checkpoint(model.parameters(),              # params first
                "ckpt/iter_1000",               # NOTE: no extension; .npz + .json are added
                optim_state={"adam_state": ...},
                config={"lr": 1e-3, "step": 1000},
                names=["fc1.weight", "fc1.bias", "fc2.weight", "fc2.bias"])

# queue is the device command queue, e.g. q from the Saving example
state = load_checkpoint(queue, model.parameters(), "ckpt/iter_1000")
print(state["config"], state["optim_state"])

The JSON sidecar is a single object with two keys:

{
  "optim_state": { "...": "..." },
  "config":      { "...": "..." }
}

Both are opaque to load_checkpoint: it just deserializes the JSON and returns the dict. It is the caller's responsibility to know that optim_state is an Optimizer state dict (compatible with opt.load_state_dict(...)) and that config typically contains a step counter, a Scheduler state, a GradScaler state, and a Python random / NumPy RNG state for exact-resume training.
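
Since the sidecar is plain JSON, everything placed in optim_state and config must be JSON-serializable. The sketch below uses only the standard library to show one way to carry Python's random state through such a file for an exact resume. The helper names and the "t" and "py_rng" keys are hypothetical, not part of netcl.

```python
import json
import os
import random
import tempfile


def write_sidecar(path, optim_state, config):
    """Hypothetical helper: write the two-key JSON object to <path>.json."""
    with open(path + ".json", "w", encoding="utf-8") as fh:
        json.dump({"optim_state": optim_state, "config": config}, fh)


def read_sidecar(path):
    with open(path + ".json", "r", encoding="utf-8") as fh:
        return json.load(fh)


def rng_state_to_json(state):
    # random.getstate() is (version, tuple_of_ints, gauss_next); JSON has no tuples.
    version, internal, gauss_next = state
    return {"version": version, "internal": list(internal), "gauss_next": gauss_next}


def rng_state_from_json(obj):
    # random.setstate() needs the inner sequence back as a tuple.
    return (obj["version"], tuple(obj["internal"]), obj["gauss_next"])


random.seed(0)
with tempfile.TemporaryDirectory() as tmp:
    base = os.path.join(tmp, "iter_1000")
    write_sidecar(
        base,
        optim_state={"t": 1000},
        config={"lr": 1e-3, "step": 1000,
                "py_rng": rng_state_to_json(random.getstate())},
    )
    # What an uninterrupted run would draw next.
    expected = [random.random() for _ in range(3)]
    state = read_sidecar(base)

# Simulate a resume: restore the RNG and draw again.
random.setstate(rng_state_from_json(state["config"]["py_rng"]))
resumed = [random.random() for _ in range(3)]
```

A NumPy RNG state would need a similar conversion, because it contains an ndarray that json cannot serialize directly.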

Raw Parameter NPZ

If you do not need the JSON sidecar — for example, when you only care about the parameters and want to do the bookkeeping yourself — the save_params and load_params functions write and read a bare <path>.npz.

from netcl.io.checkpoint import save_params, load_params

save_params(model.parameters(), "raw/iter_1000.npz",
            names=["fc1.weight", "fc1.bias", "fc2.weight", "fc2.bias"])
load_params(queue, model.parameters(), "raw/iter_1000.npz",
            names=["fc1.weight", "fc1.bias", "fc2.weight", "fc2.bias"])

load_params raises KeyError when a requested name is missing from the file and ValueError when a stored array's shape does not match its target. Both errors are raised at the offending name, so a mistyped name fails loudly and immediately rather than silently.
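
The error behavior described above can be mimicked with NumPy arrays in place of Tensors. load_into is a hypothetical function written for illustration; the real load_params also takes a queue and copies to device buffers, which this sketch omits.

```python
import numpy as np


def load_into(targets, stored, names):
    """Copy stored[name] into each target array in place, failing loudly on mismatch."""
    for target, name in zip(targets, names):
        if name not in stored:
            raise KeyError(f"parameter {name!r} not found in file")
        value = stored[name]
        if value.shape != target.shape:
            raise ValueError(
                f"shape mismatch for {name!r}: file has {value.shape}, "
                f"parameter has {target.shape}"
            )
        target[...] = value  # in-place copy into the existing buffer


stored = {"fc1.weight": np.ones((3, 2), dtype=np.float32)}
param = np.zeros((3, 2), dtype=np.float32)
load_into([param], stored, ["fc1.weight"])
```

Calling load_into with a misspelled name such as "fc1.wieght" raises KeyError before any data is copied for that entry.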

Device & Dtype Behavior on Load

load_model and load_params both honor a few simple invariants:

  • Default device. When queue is not given, the new Tensors are allocated on the default device from core.device.manager.default. If no OpenCL device is available, a RuntimeError is raised.
  • Dtype preserved. The dtype stored in the file is used as-is. A checkpoint saved in float32 is loaded as float32, even on a device that supports cl_khr_fp16; this avoids silent precision loss on load.
  • Shape checked. For load_params, a loaded parameter whose shape differs from the shape of the parameter it is loaded into raises a ValueError. For load_model, the parameter silently keeps its initialized value, the same "load nothing" outcome as a missing key.

After load, if you want the parameters on a specific device or in a specific dtype, use the same model.to(device) / manual Tensor.from_host(...) pattern you would use after fresh construction.
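
The "dtype preserved" invariant comes from the NPZ container, which stores each array's dtype alongside its data. The short NumPy-only sketch below demonstrates the round trip and an explicit conversion afterwards. It says nothing about how netcl's Tensor.from_host handles dtypes; that part is not covered here.

```python
import io

import numpy as np

buf = io.BytesIO()
np.savez(buf, w32=np.ones(4, dtype=np.float32), w16=np.ones(4, dtype=np.float16))
buf.seek(0)

with np.load(buf, allow_pickle=False) as npz:
    w32, w16 = npz["w32"], npz["w16"]  # each array comes back in its stored dtype

# Any precision change is an explicit, visible step after loading.
w32_as_half = w32.astype(np.float16)
```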

Backwards-Compatibility Policy

netcl.io follows a deliberately conservative compatibility policy:

  1. The single-file .netcl format is the only format new code will write. All new training scripts should call save_model(model, path) and let the library decide the exact on-disk layout.
  2. Reads remain backwards-compatible. A file written by an older netcl version (including the legacy two-file <path>.json + <path>.npz layout) is still readable by the current code.
  3. The JSON version field is bumped only on a breaking change (renamed state-dict keys, removed layer type, mandatory new field). Code that needs to know what version it is reading can check meta["version"] before proceeding.
  4. np.savez is forward-compatible by construction. New parameters added to a layer are simply absent from older files; load_model keeps the freshly-built layer's initialization for them. The opposite direction (an older netcl reading a newer file with an extra parameter) raises a clear KeyError at the load call site.
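
A caller that wants to act on meta["version"] before loading could use a guard along these lines. check_meta and SUPPORTED_VERSION are hypothetical names, and the sketch assumes every file carries an integer version field, which may not hold for very old exports.

```python
SUPPORTED_VERSION = 2  # matches the "version": 2 shown in the metadata example


def check_meta(meta, supported=SUPPORTED_VERSION):
    """Hypothetical guard: reject files newer than this reader understands."""
    version = meta.get("version")
    if not isinstance(version, int):
        raise ValueError(f"metadata has no integer 'version' field: {version!r}")
    if version > supported:
        raise ValueError(
            f"file format version {version} is newer than supported version {supported}"
        )
    return version
```

Older versions pass through unchanged, which mirrors the policy that reads stay backwards-compatible while newer, possibly breaking formats are rejected early.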

See also

  • Tensor — the value type saved and loaded by every helper here.
  • nn API — Sequential, MLP, and the state_dict protocol.
  • Optimizer — the per-parameter state that save_checkpoint carries in the optim_state JSON sidecar.
  • Scheduler — the LR scheduler state that lives in config.
  • GradScaler — the AMP loss-scaler state that lives in config.
  • AMP — recommended to wrap the forward pass in autocast before saving a checkpoint, so the saved weights reflect the half-precision forward.
  • MNIST with MLP — the tutorial that uses save_model / load_model end-to-end.
  • Data-Parallel Training — the tutorial that uses save_checkpoint / load_checkpoint to resume a multi-replica run.