netcl.io — Checkpointing & Serialization
The io API is the persistence layer of netcl. It writes a model's
parameters (or an entire training state) to disk in a self-contained, framework-agnostic
file, and reads them back. The format is NumPy .npz (a ZIP container of named
.npy arrays) with a single __netcl_meta__ entry that carries the layer-by-layer
architecture as JSON. The same package also exposes a lower-level, parameter-list
checkpoint API that can save the optimizer state, scheduler state, GradScaler state,
and step counter.
Note: Two submodules, two import shapes.
netcl/io/__init__.py re-exports only save_model and load_model (the high-level model files). The training-state checkpoint helpers (save_checkpoint, load_checkpoint, save_params, and load_params) live in io/checkpoint.py and are not re-exported from the package root. Use the long-form imports:
```python
from netcl.io import save_model, load_model                       # model files
from netcl.io.checkpoint import save_checkpoint, load_checkpoint  # training state
from netcl.io.checkpoint import save_params, load_params          # raw parameter NPZ
```
Public API
| Symbol | Path | Purpose |
|---|---|---|
| save_model(model, path) | io/serialization.py | Save a Sequential model to a single .npz file |
| load_model(path, queue=None, pool=None) | io/serialization.py | Load a Sequential model from a .npz (or legacy two-file) export |
| save_params(params, path, names=None) | io/checkpoint.py | Write an iterable of Tensors to a raw NPZ |
| load_params(queue, params, path, names=None) | io/checkpoint.py | Read a raw NPZ back into existing Tensors |
| save_checkpoint(params, path, optim_state=None, config=None, names=None) | io/checkpoint.py | Write params NPZ + sidecar JSON containing optimizer / config state |
| load_checkpoint(queue, params, path, names=None) | io/checkpoint.py | Read a checkpoint back; returns the parsed optim_state / config dict |
Model File Format (.netcl)
A .netcl file is a single NumPy .npz (a ZIP container of .npy arrays). The
arrays are keyed as follows:
- One entry per parameter, named "{layer_index}:{state_dict_key}". For a Sequential with two Linear layers, the keys look like "0:weight", "0:bias", "1:weight", "1:bias". Buffers that are already ndarrays (e.g. an Embedding.weight) are saved the same way.
- A single __netcl_meta__ entry whose value is a dtype=np.str_ array wrapping a JSON document. The document has the shape {"type": "Sequential", "config": [...], "version": 2, "format": "netcl.single-file"}.
```text
mnist_mlp.netcl          (single NPZ file, ZIP under the hood)
├── __netcl_meta__       # JSON: {"type": "Sequential", "config": [...], "version": 2, "format": "netcl.single-file"}
├── 0:weight             # ndarray, dtype = model's weight dtype
├── 0:bias               # ndarray
├── 1:weight             # ndarray
└── 1:bias               # ndarray
```
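The layout above can be reproduced with plain NumPy, which is a convenient way to inspect a file on a machine without netcl. This is a minimal sketch and not netcl's own writer: the helper names write_demo_file and read_meta are hypothetical, and the config entries are made-up placeholders. It passes an open file object to np.savez because NumPy appends .npz to string paths that lack that suffix.

```python
import json
import os
import tempfile

import numpy as np


def write_demo_file(path):
    """Write a tiny NPZ that mimics the documented key layout (illustration only)."""
    meta = {
        "type": "Sequential",
        "config": [{"layer": "Linear"}, {"layer": "Linear"}],  # placeholder entries
        "version": 2,
        "format": "netcl.single-file",
    }
    arrays = {
        "0:weight": np.zeros((4, 3), dtype=np.float32),
        "0:bias": np.zeros((3,), dtype=np.float32),
        "1:weight": np.zeros((3, 2), dtype=np.float32),
        "1:bias": np.zeros((2,), dtype=np.float32),
        "__netcl_meta__": np.array(json.dumps(meta)),  # 0-d unicode string array
    }
    # A file object keeps the ".netcl" name; a str path would get ".npz" appended.
    with open(path, "wb") as fh:
        np.savez(fh, **arrays)


def read_meta(path):
    """Return (metadata dict, sorted parameter keys) from a single-file export."""
    with np.load(path, allow_pickle=False) as npz:
        meta = json.loads(npz["__netcl_meta__"].item())
        keys = sorted(k for k in npz.files if k != "__netcl_meta__")
    return meta, keys


demo_path = os.path.join(tempfile.mkdtemp(), "demo.netcl")
write_demo_file(demo_path)
meta, keys = read_meta(demo_path)
print(meta["format"], keys)
```

Loading with allow_pickle=False works here because the metadata is stored as a unicode string array, not as a pickled Python object.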
Note: Legacy two-file format. Older code (and the German original) described a <path>.json + <path>.npz pair. load_model() still accepts that layout as a fallback: if <path> does not exist but <path>.json and <path>.npz do, it reads the two files. New exports from save_model() use the single-file format described above.
Saving
```python
from netcl.io import save_model
from netcl.nn import Linear, ReLU, Sequential
from netcl.core.device import manager

q = manager.default("auto").queue
model = Sequential(Linear(q, 784, 256), ReLU(), Linear(q, 256, 10))
# ... train ...
save_model(model, "mnist_mlp.netcl")
```
save_model creates parent directories on demand and writes the file in a single
np.savez(...) call. The metadata JSON is re-serialized on every save, so its content
does not depend on the OS line ending or the platform. The enclosing ZIP container can
still differ between saves, because each ZIP entry records a modification time.
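The two behaviors described here, creating parent directories and producing platform-independent metadata, can be sketched with the standard library alone. This is an illustration under assumptions, not netcl's implementation: the helper names canonical_meta_json and prepare_target are hypothetical, and whether netcl sorts keys or uses compact separators is not stated in this document.

```python
import json
import tempfile
from pathlib import Path


def canonical_meta_json(meta):
    """Serialize metadata so that equal dicts always give the same string."""
    # sort_keys fixes the key order, compact separators remove optional
    # whitespace, and ensure_ascii keeps the output plain 7-bit text.
    return json.dumps(meta, sort_keys=True, separators=(",", ":"), ensure_ascii=True)


def prepare_target(path):
    """Create missing parent directories and return the path as a Path object."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path


meta_a = {"type": "Sequential", "version": 2, "config": []}
meta_b = {"config": [], "version": 2, "type": "Sequential"}  # same content, other order
same = canonical_meta_json(meta_a) == canonical_meta_json(meta_b)

target = prepare_target(Path(tempfile.mkdtemp()) / "runs" / "exp1" / "model.netcl")
print(same, target.parent.is_dir())
```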
Loading
```python
from netcl.io import load_model

new_model = load_model("mnist_mlp.netcl")
```
load_model(path, queue=None, pool=None) does the following:
- If queue is None, take the default device's queue from core.device.manager.default.
- Open the file with np.load(path, allow_pickle=False). If the file does not exist, fall back to the legacy two-file layout (<path>.json + <path>.npz).
- Verify the __netcl_meta__ key is present; parse the JSON.
- Rebuild the Sequential from the config list using nn.factory.build_sequential.
- For each layer, copy the matching "{idx}:{key}" entries into the layer's state_dict. Missing keys are tolerated: they keep whatever the freshly-built layer was initialized with, which makes it safe to load a checkpoint saved from a slightly older model.
- Always call weights.close() on exit to release the NPZ file handle.
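The per-layer copy step, including the tolerance for missing keys, can be sketched in pure Python. Plain dicts stand in for each layer's state_dict and for the opened file, and the function name apply_stored_entries is hypothetical; the real loader works on Tensors and is not shown in this document.

```python
def apply_stored_entries(layer_states, stored):
    """Copy "{idx}:{key}" entries into per-layer state dicts.

    layer_states: list of dicts, one per layer (stand-in for state_dict()).
    stored: flat mapping of "{idx}:{key}" names to values, as found in the file.
    Returns the list of full keys that were absent from the file.
    """
    missing = []
    for idx, state in enumerate(layer_states):
        for key in state:
            full_key = f"{idx}:{key}"
            if full_key in stored:
                state[key] = stored[full_key]
            else:
                # Tolerated: the freshly initialized value stays in place.
                missing.append(full_key)
    return missing


# Two "layers" with freshly initialized values; the file lacks 1:bias.
layers = [{"weight": [0.0, 0.0], "bias": [0.0]}, {"weight": [0.0], "bias": [9.9]}]
stored = {"0:weight": [1.0, 2.0], "0:bias": [3.0], "1:weight": [4.0]}
missing = apply_stored_entries(layers, stored)
print(missing, layers[1]["bias"])
```

Returning the missing keys is a choice made for this sketch; it lets a caller log what kept its initial value, which the documented loader does silently.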
The pool= argument is currently unused on the open path but is reserved for a future
fast-path that will route the new Tensors through a
PersistentBufferPool instead of allocating fresh buffers.
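The fallback to the legacy two-file layout in the loading steps can be sketched with pathlib. This is an assumption-based illustration: resolve_model_files is a hypothetical helper, and the exact order of checks inside load_model is not documented here.

```python
import tempfile
from pathlib import Path


def resolve_model_files(path):
    """Decide which on-disk layout to read.

    Returns ("single", path) when the file itself exists, or
    ("legacy", json_path, npz_path) when only the two-file pair exists.
    """
    path = Path(path)
    if path.is_file():
        return ("single", path)
    # The legacy pair is named by appending suffixes to the full path string.
    json_path = Path(str(path) + ".json")
    npz_path = Path(str(path) + ".npz")
    if json_path.is_file() and npz_path.is_file():
        return ("legacy", json_path, npz_path)
    raise FileNotFoundError(f"no model file or legacy pair found for {path}")


# Usage: fake a legacy export with two empty placeholder files.
workdir = Path(tempfile.mkdtemp())
(workdir / "old_model.json").write_text("{}")
(workdir / "old_model.npz").write_bytes(b"")
layout = resolve_model_files(workdir / "old_model")
print(layout[0])
```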
Training Checkpoint Format
The training-state checkpoint is a thin layer on top of the raw NPZ parameter writer.
The output is two files: <path>.npz for the parameter values, and <path>.json for
the metadata.
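```python
from netcl.io.checkpoint import save_checkpoint, load_checkpoint

# `model` and `q` are the Sequential and device queue from the Saving example.
names = ["fc1.weight", "fc1.bias", "fc2.weight", "fc2.bias"]

save_checkpoint(
    model.parameters(),                 # params first
    "ckpt/iter_1000",                   # NOTE: no extension; .npz + .json are added
    optim_state={"adam_state": ...},    # placeholder: pass the real optimizer state
    config={"lr": 1e-3, "step": 1000},
    names=names,
)

# Pass the same names on load so entries are matched by name.
state = load_checkpoint(q, model.parameters(), "ckpt/iter_1000", names=names)
print(state["config"], state["optim_state"])
```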
The JSON sidecar is a single object with two keys:
```json
{
  "optim_state": { "...": "..." },
  "config": { "...": "..." }
}
```
Both are opaque to load_checkpoint: it just deserializes the JSON and returns the
dict. It is the caller's responsibility to know that optim_state is an
Optimizer state dict (compatible with opt.load_state_dict(...)) and that
config typically contains a step counter, a Scheduler state, a
GradScaler state, and a Python random / NumPy RNG state for exact-resume
training.
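Because the sidecar is plain JSON, the caller has to convert anything that JSON cannot represent exactly. The sketch below writes and reads a sidecar with the two documented top-level keys and restores Python's random state for an exact resume. It uses only the standard library; the helper names and the py_rng key are hypothetical, and netcl's own save_checkpoint is not called.

```python
import json
import random
import tempfile
from pathlib import Path


def write_sidecar(base_path, optim_state, config):
    """Write <base_path>.json with the two documented top-level keys."""
    sidecar = Path(str(base_path) + ".json")
    sidecar.parent.mkdir(parents=True, exist_ok=True)
    sidecar.write_text(json.dumps({"optim_state": optim_state, "config": config}))
    return sidecar


def read_sidecar(base_path):
    """Read <base_path>.json back into a plain dict."""
    return json.loads(Path(str(base_path) + ".json").read_text())


def rng_state_from_json(state):
    """JSON turns tuples into lists; random.setstate() needs tuples again."""
    version, internal, gauss_next = state
    return (version, tuple(internal), gauss_next)


base = Path(tempfile.mkdtemp()) / "ckpt" / "iter_1000"
random.seed(0)
config = {"lr": 1e-3, "step": 1000, "py_rng": random.getstate()}
write_sidecar(base, optim_state={}, config=config)

expected = [random.random() for _ in range(3)]  # values drawn after the save point

state = read_sidecar(base)
random.setstate(rng_state_from_json(state["config"]["py_rng"]))
resumed = [random.random() for _ in range(3)]   # same values after restoring
print(state["config"]["step"], resumed == expected)
```

A NumPy RNG state would need similar care, since it contains arrays that json.dumps does not accept directly.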
Raw Parameter NPZ
If you do not need the JSON sidecar — for example, when you only care about the
parameters and want to do the bookkeeping yourself — the save_params and
load_params functions write and read a bare <path>.npz.
```python
from netcl.io.checkpoint import save_params, load_params

names = ["fc1.weight", "fc1.bias", "fc2.weight", "fc2.bias"]

# `model` and `q` are the Sequential and device queue from the Saving example.
save_params(model.parameters(), "raw/iter_1000.npz", names=names)
load_params(q, model.parameters(), "raw/iter_1000.npz", names=names)
```
load_params raises KeyError on a missing name and ValueError on a shape mismatch. Both
are raised at the first offending name, so a typo in a name fails loudly and immediately
rather than silently.
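The error behavior can be illustrated with NumPy arrays standing in for Tensors. This is a sketch of the documented contract, not the code in io/checkpoint.py; load_named_arrays is a hypothetical name, and the in-memory buffer replaces a file on disk.

```python
import io

import numpy as np


def load_named_arrays(targets, npz, names):
    """Copy arrays from an opened NPZ into preallocated targets, by name.

    Raises KeyError if a name is absent and ValueError if shapes differ.
    """
    for name, target in zip(names, targets):
        if name not in npz.files:
            raise KeyError(f"parameter {name!r} not found in file")
        value = npz[name]
        if value.shape != target.shape:
            raise ValueError(
                f"shape mismatch for {name!r}: file {value.shape}, target {target.shape}"
            )
        target[...] = value  # in-place copy into the existing buffer


# Build an in-memory NPZ with two named arrays.
buffer = io.BytesIO()
np.savez(buffer, **{"fc1.weight": np.ones((2, 3)), "fc1.bias": np.ones(3)})
buffer.seek(0)

targets = [np.zeros((2, 3)), np.zeros(3)]
with np.load(buffer, allow_pickle=False) as npz:
    load_named_arrays(targets, npz, ["fc1.weight", "fc1.bias"])
    try:
        load_named_arrays(targets, npz, ["fc1.weight", "fc1.bais"])  # typo on purpose
    except KeyError as exc:
        error = exc
print(targets[1], type(error).__name__)
```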
Device & Dtype Behavior on Load
load_model and load_params both honor a few simple invariants:
- Default device. When queue is not given, the new Tensors are allocated on the default device from core.device.manager.default. If no OpenCL device is available, a RuntimeError is raised.
- Dtype preserved. The dtype stored in the file is used as-is. A checkpoint saved in float32 is loaded as float32, even on a device that supports cl_khr_fp16; this avoids silent precision loss on load.
- Shape checked. A loaded parameter whose shape differs from the freshly-built layer's parameter is reported as a ValueError (for load_params) or silently kept at its initialized value (for load_model, where the missing key is a "load nothing" case).
After load, if you want the parameters on a specific device or in a specific dtype, use
the same model.to(device) / manual Tensor.from_host(...) pattern you would use after
fresh construction.
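The dtype invariant follows from how NPZ files store arrays, and can be checked with NumPy alone. The sketch below covers only the host side; moving data to a device through netcl is not shown, and the array names w32 and w16 are made up for the example.

```python
import io

import numpy as np

# Save one float32 and one float16 array into an in-memory NPZ.
buffer = io.BytesIO()
np.savez(
    buffer,
    w32=np.ones((2, 2), dtype=np.float32),
    w16=np.ones(2, dtype=np.float16),
)
buffer.seek(0)

with np.load(buffer, allow_pickle=False) as npz:
    loaded = {name: npz[name] for name in npz.files}

# The stored dtype comes back unchanged; nothing is cast on load.
dtypes = {name: str(arr.dtype) for name, arr in loaded.items()}

# Any conversion is an explicit, separate step chosen by the caller.
w_half = loaded["w32"].astype(np.float16)
print(dtypes, w_half.dtype)
```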
Backwards-Compatibility Policy
netcl.io follows a deliberately conservative compatibility policy:
- The single-file .netcl format is the only format new code will write. All new training scripts should call save_model(model, path) and let the library decide the exact on-disk layout.
- Reads remain backwards-compatible. A file written by an older netcl version (including the legacy two-file <path>.json + <path>.npz layout) is still readable by the current code.
- The JSON version field is bumped only on a breaking change (renamed state-dict keys, removed layer type, mandatory new field). Code that needs to know what version it is reading can check meta["version"] before proceeding.
- np.savez is forward-compatible by construction. New parameters added to a layer are simply absent from older files; load_model keeps the freshly-built layer's initialization for them. The opposite direction (an older netcl reading a newer file with an extra parameter) raises a clear KeyError at the load call site.
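A version check on the metadata, as suggested in the policy above, might look like the following. It is a sketch for single-file exports only: check_meta and the two constants are hypothetical, and the assumption that a reader should reject any version higher than the one it knows is a design choice of this example, not documented netcl behavior.

```python
import json

SUPPORTED_VERSION = 2  # assumption: highest metadata version this reader understands
EXPECTED_FORMAT = "netcl.single-file"


def check_meta(meta_json):
    """Parse the metadata JSON and reject files this reader cannot handle."""
    meta = json.loads(meta_json)
    if meta.get("format") != EXPECTED_FORMAT:
        raise ValueError(f"unexpected format: {meta.get('format')!r}")
    version = meta.get("version")
    if not isinstance(version, int) or version > SUPPORTED_VERSION:
        raise ValueError(f"unsupported metadata version: {version!r}")
    return meta


ok = check_meta(
    '{"type": "Sequential", "config": [], "version": 2, "format": "netcl.single-file"}'
)
try:
    check_meta(
        '{"type": "Sequential", "config": [], "version": 3, "format": "netcl.single-file"}'
    )
    rejected = False
except ValueError:
    rejected = True
print(ok["version"], rejected)
```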
See also
- Tensor: the value type saved and loaded by every helper here.
- nn API: Sequential, MLP, and the state_dict protocol.
- Optimizer: the per-parameter state that save_checkpoint carries in the optim_state JSON sidecar.
- Scheduler: the LR scheduler state that lives in config.
- GradScaler: the AMP loss-scaler state that lives in config.
- AMP: recommended to wrap the forward pass in autocast before saving a checkpoint, so the saved weights reflect the half-precision forward.
- MNIST with MLP: the tutorial that uses save_model / load_model end-to-end.
- Data-Parallel Training: the tutorial that uses save_checkpoint / load_checkpoint to resume a multi-replica run.