Checkpoint

Status: Public API. Entry points: netcl.io.checkpoint.save_checkpoint and netcl.io.checkpoint.load_checkpoint.

A Checkpoint is a serialised snapshot of a model's state. It is the standard unit of persistence in netcl: every long-running training job saves a checkpoint at the end of every epoch (or every N steps), and resumes from the latest checkpoint on restart.

A checkpoint is a Python dict with two required keys and several optional ones:

  • "model": the model's state_dict (a {name: Tensor} dict). Required.
  • "optimizer": the optimizer's state (the moments, the step count, etc.). Required.
  • Optional keys: "scheduler", "scaler", "epoch", "step", "rng".

The dict is serialised with numpy.savez_compressed by default, or with pickle when the user requests it. The tensor data is stored in a single .npz (or .pkl) file; as the code example below notes, save_checkpoint also writes a .json file next to it.
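
The round trip through numpy.savez_compressed can be sketched as follows. This is a minimal illustration, not netcl's actual implementation: the flatten / unflatten helpers and the "section/name" key scheme are assumptions made for the example, and an in-memory buffer stands in for the file on disk.

```python
import io
import numpy as np

def flatten(checkpoint):
    """Hypothetical helper: turn {"model": {name: array}} into {"model/name": array}."""
    flat = {}
    for section, entries in checkpoint.items():
        for name, value in entries.items():
            flat[f"{section}/{name}"] = np.asarray(value)
    return flat

def unflatten(flat):
    """Hypothetical helper: inverse of flatten, splitting each key on the first '/'."""
    nested = {}
    for key, value in flat.items():
        section, name = key.split("/", 1)
        nested.setdefault(section, {})[name] = value
    return nested

checkpoint = {
    "model": {
        "fc1.weight": np.ones((4, 3), dtype=np.float32),
        "fc1.bias": np.zeros(4, dtype=np.float16),
    },
    "optimizer": {"step": np.int64(1000)},
}

buffer = io.BytesIO()  # stands in for a .npz file on disk
np.savez_compressed(buffer, **flatten(checkpoint))
buffer.seek(0)

# Read every array while the archive is still open.
with np.load(buffer) as archive:
    restored = unflatten({key: archive[key] for key in archive.files})

print(restored["model"]["fc1.bias"].dtype)
```

Each array keeps its own dtype and shape inside the archive, so mixed-precision state survives the round trip without extra bookkeeping.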

Overview

save_checkpoint and load_checkpoint are the two entry points. They handle:

  • Tensor <-> numpy round-trip (with dtype preservation).
  • Mapping-aware state-dict handling (so the user can save / load into a model that has been slightly modified since the last save — only the matching keys are restored).
  • Optimizer state (moments, step count, etc.).
  • Atomic file replacement (write to a temp file, then os.replace).
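
The atomic-replacement step in the list above can be sketched with the standard library alone. The function name atomic_write_bytes is hypothetical and the code is an illustration of the write-to-temp-then-os.replace pattern, not netcl's own implementation.

```python
import os
import tempfile

def atomic_write_bytes(path, payload):
    """Write payload so readers see the old file or the new one, never a partial one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file is created next to the target so both are on the same filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(payload)
            handle.flush()
            os.fsync(handle.fileno())  # push the data to disk before renaming
        os.replace(tmp_path, path)
    except BaseException:
        # On failure, remove the temp file; any existing checkpoint is untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise

with tempfile.TemporaryDirectory() as workdir:
    target = os.path.join(workdir, "iter_1000.npz")
    atomic_write_bytes(target, b"first")
    atomic_write_bytes(target, b"second")  # overwrites the earlier file in one step
    with open(target, "rb") as handle:
        final_contents = handle.read()
    leftovers = [name for name in os.listdir(workdir) if name.endswith(".tmp")]
```

Creating the temp file in the target's own directory matters: a rename across filesystems is not atomic.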

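The default format is .npz, which is portable across machines and Python versions. File size scales with the parameter count: a 100 M-parameter model in fp32 holds about 400 MB of raw weight data (100 M × 4 bytes) before compression, and optimizer moments add to that.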
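
As a rough sizing aid, the uncompressed payload is the parameter count times the bytes per element, multiplied by the number of same-shaped tensors stored per parameter. The helper below is a hypothetical back-of-the-envelope calculation; the two optimizer slots assume an Adam-style optimizer that keeps two moments per parameter.

```python
BYTES_PER_ELEMENT = {"fp32": 4, "fp16": 2, "bf16": 2}

def raw_checkpoint_bytes(num_params, dtype="fp32", optimizer_slots=0):
    """Uncompressed size: weights plus `optimizer_slots` same-shaped tensors per parameter."""
    return num_params * BYTES_PER_ELEMENT[dtype] * (1 + optimizer_slots)

weights_only = raw_checkpoint_bytes(100_000_000)
with_adam = raw_checkpoint_bytes(100_000_000, optimizer_slots=2)  # assumed Adam-style state

print(weights_only / 1e6, "MB")
print(with_adam / 1e6, "MB")
```

The actual file on disk will differ somewhat because of compression and per-array headers.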

Where It Lives

  • File path: io/checkpoint.py.
  • Module path: netcl.io.checkpoint.
  • Not re-exported from netcl.io — always use the full path: from netcl.io.checkpoint import save_checkpoint, load_checkpoint.
  • The simpler model-only helpers save_model / load_model are re-exported from netcl.io and are preferred when you only need to save the network weights.

Code Example

Saving a training-state checkpoint:

from netcl.io.checkpoint import save_checkpoint

save_checkpoint(
    model.parameters(),               # iterable of Tensor
    "checkpoints/iter_1000",          # no extension; .npz + .json added automatically
    optim_state={"lr": 1e-3, "step": 1000},
    config={"epoch": 5},
    names=["fc1.weight", "fc1.bias", "fc2.weight", "fc2.bias"],
)

Loading a checkpoint back:

from netcl.io.checkpoint import load_checkpoint
from netcl.core.device import manager

q = manager.default("auto").queue
state = load_checkpoint(q, model.parameters(), "checkpoints/iter_1000")
print(state["config"], state["optim_state"])

For saving only the model architecture + weights (the common case), use save_model / load_model from netcl.io instead — they are simpler and the resulting .netcl file is self-describing.

Performance & Trade-offs

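  • .npz format: portable and easy to inspect with numpy.load. Compression adds save-time overhead that grows with model size, and the size reduction depends on the data (trained floating-point weights often compress poorly), so measure both on your own model. Use compression for the final-epoch checkpoint; for in-training checkpoints, an uncompressed .npz (numpy.savez) saves faster. A plain .npy file is not a substitute, since it holds only a single array.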
  • Atomicity: os.replace (available since Python 3.3) renames atomically on POSIX and also overwrites an existing destination on Windows, provided the temp file is on the same filesystem as the target. A crash mid-save therefore leaves either the old or the new file intact, never a half-written one.
  • State-dict mapping: load_state_dict is strict by default — a missing key or a wrong-shape tensor raises. The strict=False argument turns this into a warning, which is useful when loading a checkpoint into a model that has been slightly modified (e.g. adding a new head).
  • Cross-device portability: a checkpoint saved on a CUDA-capable box loads on a CPU-only box. The tensor's .buffer field is dropped on save; only the host numpy array is persisted.
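
The strict versus non-strict behaviour described in the list above can be illustrated with a small stand-alone sketch. Everything here is hypothetical: Array is a stand-in for a tensor, and restore_matching is not netcl's load_state_dict, only a model of the matching rule (restore keys whose name and shape agree, raise or warn on the rest).

```python
import warnings
from dataclasses import dataclass

@dataclass
class Array:
    """Stand-in for a tensor: just a shape and flat values."""
    shape: tuple
    values: list

def restore_matching(model_state, saved_state, strict=True):
    """Copy saved entries into model_state where both name and shape agree."""
    missing = [k for k in model_state if k not in saved_state]
    mismatched = [k for k in model_state
                  if k in saved_state and model_state[k].shape != saved_state[k].shape]
    problems = sorted(missing + mismatched)
    if problems and strict:
        raise KeyError(f"checkpoint does not match model: {problems}")
    if problems:
        warnings.warn(f"skipped keys: {problems}")
    for key in model_state:
        if key in saved_state and key not in mismatched:
            model_state[key] = saved_state[key]
    return missing, mismatched

def fresh_model():
    # "head.weight" plays the role of a head added after the checkpoint was saved.
    return {"fc1.weight": Array((2, 2), [0.0] * 4),
            "head.weight": Array((3, 2), [0.0] * 6)}

saved = {"fc1.weight": Array((2, 2), [1.0] * 4)}

model = fresh_model()
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    missing, mismatched = restore_matching(model, saved, strict=False)

try:
    restore_matching(fresh_model(), saved, strict=True)
    strict_raised = False
except KeyError:
    strict_raised = True
```

In the non-strict call, fc1.weight is restored and head.weight keeps its fresh initialisation; the strict call refuses the same checkpoint.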

See also