Checkpoint
Status: Public API in netcl.io.checkpoint (load_checkpoint, save_checkpoint)
A Checkpoint is a serialised snapshot of a model's state. It
is the standard unit of persistence in netcl: every long-running
training job saves a checkpoint at the end of every epoch (or
every N steps), and resumes from the latest checkpoint on
restart.
A checkpoint is a Python dict with at least two keys:
- "model": the model's state_dict (a {name: Tensor} dict).
- "optimizer": the optimizer's state (the moments, the step count, etc.).
- Optionally: "scheduler", "scaler", "epoch", "step", "rng".
The dict is serialised with numpy.savez_compressed (default)
or pickle (when the user requests it). The on-disk format
is a single .npz (or .pkl) file.
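As a rough illustration of how a nested checkpoint dict can end up in a single .npz file, the sketch below flattens the sections into prefixed array names before calling numpy.savez_compressed, then rebuilds the nesting on load. The helper names flatten and unflatten and the ":" separator are assumptions made for this example; netcl's actual on-disk layout may differ.

```python
import io

import numpy as np


def flatten(checkpoint, sep=":"):
    """Turn {"model": {"fc1.weight": arr}} into {"model:fc1.weight": arr}."""
    flat = {}
    for section, value in checkpoint.items():
        if isinstance(value, dict):
            for name, arr in value.items():
                flat[f"{section}{sep}{name}"] = np.asarray(arr)
        else:
            # Top-level scalars such as "epoch" become 0-d arrays.
            flat[section] = np.asarray(value)
    return flat


def unflatten(flat, sep=":"):
    """Inverse of flatten: rebuild the nested dict from prefixed names."""
    nested = {}
    for key, arr in flat.items():
        section, _, name = key.partition(sep)
        if name:
            nested.setdefault(section, {})[name] = arr
        else:
            nested[section] = arr
    return nested


checkpoint = {
    "model": {
        "fc1.weight": np.ones((2, 3), dtype=np.float16),
        "fc1.bias": np.zeros(2, dtype=np.float32),
    },
    "optimizer": {"step": np.int64(1000)},
    "epoch": 5,
}

buf = io.BytesIO()  # in-memory stand-in for a file on disk
np.savez_compressed(buf, **flatten(checkpoint))
buf.seek(0)
with np.load(buf) as archive:
    restored = unflatten({key: archive[key] for key in archive.files})
```

Each array keeps its dtype through the round trip, so a float16 weight comes back as float16 rather than being widened.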
Overview
save_checkpoint and load_checkpoint are the two entry
points. They handle:
- Tensor <-> numpy round-trip (with dtype preservation).
- Mapping-aware state-dict handling (so the user can save / load into a model that has been slightly modified since the last save — only the matching keys are restored).
- Optimizer state (moments, step count, etc.).
- Atomic file replacement (write to a temp file, then
os.replace).
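The default format is .npz. File size is dominated by the tensor data:
a 100 M-parameter model in fp32 holds about 400 MB of weights before
compression (100 M × 4 bytes), and optimizer moments add to that. The
format is portable across machines and Python versions.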
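The atomic-replacement step listed above (write to a temp file, then os.replace) can be sketched with the standard library alone. The function name atomic_write_bytes is hypothetical and this is not netcl's implementation; it only shows the general pattern.

```python
import os
import tempfile


def atomic_write_bytes(path, payload):
    """Write payload so that path holds either the old or the new content."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the target: os.replace is a plain rename
    # only when source and destination are on the same filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(payload)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)
    except BaseException:
        # Remove the partial temp file; the previous checkpoint is untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "iter_1000.npz")
    atomic_write_bytes(target, b"old")
    atomic_write_bytes(target, b"new")
    with open(target, "rb") as f:
        final_bytes = f.read()
    leftovers = [n for n in os.listdir(workdir) if n.endswith(".tmp")]
```

The fsync call costs some save time; it is included so that the rename never becomes visible before the data has been written.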
Where It Lives
- File path: io/checkpoint.py.
- Module path: netcl.io.checkpoint.
- Not re-exported from netcl.io: always use the full path, from netcl.io.checkpoint import save_checkpoint, load_checkpoint.
- The simpler model-only helpers save_model / load_model are re-exported from netcl.io and are preferred when you only need to save the network weights.
Code Example
Saving a training-state checkpoint:
from netcl.io.checkpoint import save_checkpoint
save_checkpoint(
    model.parameters(),          # iterable of Tensor
    "checkpoints/iter_1000",     # no extension; .npz + .json added automatically
    optim_state={"lr": 1e-3, "step": 1000},
    config={"epoch": 5},
    names=["fc1.weight", "fc1.bias", "fc2.weight", "fc2.bias"],
)
Loading a checkpoint back:
from netcl.io.checkpoint import load_checkpoint
from netcl.core.device import manager
# Queue of the default device; it is passed to load_checkpoint below.
q = manager.default("auto").queue
state = load_checkpoint(q, model.parameters(), "checkpoints/iter_1000")
print(state["config"], state["optim_state"])
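To resume from the latest checkpoint on restart, a job first has to find it. The sketch below assumes checkpoints are named iter_<step>.npz, following the "checkpoints/iter_1000" path used above. Both the naming scheme and the helper latest_checkpoint are assumptions for illustration, not part of netcl.

```python
import re
import tempfile
from pathlib import Path

_STEP_RE = re.compile(r"^iter_(\d+)\.npz$")


def latest_checkpoint(directory):
    """Return the extension-less path of the highest-step checkpoint, or None."""
    best_step, best_path = -1, None
    for entry in Path(directory).glob("iter_*.npz"):
        match = _STEP_RE.match(entry.name)
        # Compare steps as integers so iter_1000 sorts after iter_999.
        if match and int(match.group(1)) > best_step:
            best_step = int(match.group(1))
            best_path = entry.with_suffix("")  # drop ".npz"
    return best_path


with tempfile.TemporaryDirectory() as workdir:
    for step in (999, 1000, 250):
        (Path(workdir) / f"iter_{step}.npz").touch()
    found_name = latest_checkpoint(workdir).name

    empty_dir = Path(workdir) / "empty"
    empty_dir.mkdir()
    none_found = latest_checkpoint(empty_dir)
```

The extension-less result matches the path form that the loading example passes to load_checkpoint. A None result means there is nothing to resume from, so training starts fresh.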
For saving only the model architecture + weights (the common case), use
save_model / load_model from netcl.io instead — they are simpler
and the resulting .netcl file is self-describing.
Performance & Trade-offs
- .npz format: portable, small, easy to inspect with numpy.load. Compression adds a few hundred ms at save time and can cut the file size roughly in half; both figures vary with model size and with how compressible the weights are. Use it for the final-epoch checkpoint; for in-training checkpoints, an uncompressed archive saves faster (numpy.savez writes an uncompressed .npz, whereas a .npy file holds only a single array).
- Atomicity: os.replace (available since Python 3.3) overwrites an existing destination on both POSIX and Windows, and the rename is atomic on POSIX when the temp file is on the same filesystem as the target. A crash mid-save leaves either the old or the new file intact, never a half-written one.
- State-dict mapping: load_state_dict is strict by default, so a missing key or a wrong-shape tensor raises. The strict=False argument turns this into a warning, which is useful when loading a checkpoint into a model that has been slightly modified (e.g. adding a new head).
- Cross-device portability: a checkpoint saved on a CUDA-capable box loads on a CPU-only box. The tensor's .buffer field is dropped on save; only the host numpy array is persisted.
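The strict and non-strict loading behaviour described above can be illustrated on plain numpy arrays. This is a standalone sketch rather than netcl's load_state_dict: the function restore_matching is hypothetical, and it checks only the two failure cases the text names (a missing key and a wrong shape).

```python
import warnings

import numpy as np


def restore_matching(target, saved, strict=True):
    """Copy saved arrays into target in place; return the restored names."""
    problems = []
    for name, arr in target.items():
        if name not in saved:
            problems.append(f"missing key: {name}")
        elif saved[name].shape != arr.shape:
            problems.append(
                f"shape mismatch for {name}: {saved[name].shape} vs {arr.shape}"
            )
    if problems and strict:
        raise KeyError("; ".join(problems))
    if problems:
        warnings.warn("; ".join(problems))

    restored = []
    for name, arr in target.items():
        if name in saved and saved[name].shape == arr.shape:
            arr[...] = saved[name]  # in-place copy keeps the target's buffer
            restored.append(name)
    return restored


saved = {"fc1.weight": np.full((2, 2), 7.0), "fc1.bias": np.ones(2)}
# The model gained a new head since the checkpoint was written.
model = {
    "fc1.weight": np.zeros((2, 2)),
    "fc1.bias": np.zeros(2),
    "head.weight": np.zeros((3, 2)),
}

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    restored = restore_matching(model, saved, strict=False)

try:
    restore_matching(model, saved, strict=True)
    strict_raised = False
except KeyError:
    strict_raised = True
```

In non-strict mode the new head keeps its fresh initialisation while the matching layers are restored. In strict mode the same call fails before any tensor is modified.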
See also
- Checkpoint — the API page.
- save_model — the model-only convenience function.
- load_model — the model-only restore function.
- Trainer — the high-level wrapper that handles checkpointing.
- Serialization — the lower-level tensor <-> numpy round-trip.