FAQ — Frequently Asked Questions
What is netcl?
netcl is a Python deep-learning framework built on PyOpenCL that runs on any OpenCL 1.2/2.0 device: Intel/AMD/NVIDIA GPUs, Intel/AMD CPUs, FPGAs, Apple Silicon GPUs, ARM Mali, and a long tail of embedded SoCs. It includes a Tape-based autograd, a runtime JIT Compiler that fuses chains of elementwise ops into single kernels, automatic mixed precision (AMP), and a host-based Distributed stack for single-node multi-device training. There is no CUDA dependency.
The Overview page has the full one-paragraph pitch.
Why PyOpenCL and not CUDA / ROCm / oneAPI?
Because OpenCL runs on hardware from all of those vendors. CUDA is locked to NVIDIA, ROCm to a curated list of AMD GPUs, and oneAPI/SYCL to a much smaller hardware matrix. OpenCL is the lowest common denominator that covers Intel, AMD, NVIDIA, Apple, ARM, and a wide range of embedded parts from one source tree. The trade-off is that peak kernel throughput is below what a vendor-specific stack can deliver on the same silicon, but the JIT Compiler and the OpenCLBackend close a lot of that gap, and for research code the portability usually wins.
If you have NVIDIA-only hardware and you need every last percent, use PyTorch. If you have one of the dozen other platforms netcl supports, netcl is the only framework that will give you a coherent Python stack there.
Which OpenCL devices are supported?
Any ICD that exposes OpenCL 1.2 or 2.0. In practice this includes:
- Intel GPUs: HD Graphics (Skylake+), Iris Xe, Arc A-series. fp16/AMP via cl_khr_fp16 on Gen 9.5+.
- AMD GPUs: RDNA, RDNA 2, RDNA 3, and CDNA accelerators under the amdgpu-pro or mesa-opencl-icd drivers.
- NVIDIA GPUs: OpenCL 1.2 works on every GeForce since Kepler. The performance is generally below CUDA on the same card; the Tensor Backend page documents the gap.
- Intel CPUs: Xeon and Core i-series via the Intel CPU OpenCL runtime, or via POCL on Linux.
- AMD CPUs: Zen-based CPUs via POCL or AMD's own CPU OpenCL.
- Apple Silicon: M1/M2/M3 GPUs are reachable through the system OpenCL framework. macOS 15 deprecation status is tracked on the Tensor Backend page.
- ARM Mali: Bifrost and Valhall generations. Mid-range fp16 support.
- FPGAs: Intel FPGA OpenCL BSP and Xilinx SDAccel. You will need to vendor the CPUBackend for layers the BSP does not cover.
- Embedded SoCs: Rockchip, Allwinner, etc. with vendor OpenCL.
The DeviceManager enumerates all of them and lets you pick by name, platform, or capability.
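As a rough illustration of picking by name or capability, the sketch below filters plain dictionaries. It does not use the DeviceManager API; `pick_devices`, the record layout, and the device names are all hypothetical. The one real detail it relies on is that OpenCL reports device extensions as a space-separated string.

```python
def pick_devices(devices, name_contains=None, require_ext=None):
    """Filter device records by a name substring and a required extension."""
    picked = []
    for dev in devices:
        if name_contains and name_contains.lower() not in dev["name"].lower():
            continue
        # Split first so that a prefix of a longer extension name cannot match.
        if require_ext and require_ext not in dev["extensions"].split():
            continue
        picked.append(dev)
    return picked

# Hypothetical records standing in for enumerated devices.
devices = [
    {"name": "Example GPU", "extensions": "cl_khr_fp16 cl_khr_icd"},
    {"name": "Example CPU", "extensions": "cl_khr_icd"},
]
fp16_capable = pick_devices(devices, require_ext="cl_khr_fp16")
print([d["name"] for d in fp16_capable])   # ['Example GPU']
```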
Is fp16/AMP actually faster on Intel/AMD GPUs?
Yes, with two caveats. On hardware that supports cl_khr_fp16 natively (Intel Gen 9.5+, AMD RDNA+, Apple M1+), the autocast + GradScaler combination delivers 1.5×–2.2× throughput on bandwidth-bound layers (matmul, conv2d, batchnorm) compared to fp32 eager. The two caveats are:
- Loss scaling is non-optional. fp16 has a 5-bit exponent and its smallest positive value is about 6e-8, so small gradients underflow to zero. The GradScaler dynamically scales the loss and unscales the gradients, which keeps them inside the representable range. Always wrap the backward pass in scaler.scale_loss(loss).backward() and the optimizer step in scaler.step(opt, ...).
- Some ops are fp32-only. Reductions, softmax, and anything that accumulates across the channel axis are kept in fp32 by the autocast policy to limit overflow and accumulated rounding error. The AMP page lists every op and its policy.
On hardware without cl_khr_fp16 (Intel pre-Gen 9.5, some older AMD parts), autocast silently falls back to fp32 and prints a one-line warning.
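The underflow that loss scaling guards against can be reproduced without a GPU. The sketch below uses the standard library's half-precision `struct` format to round values to fp16. The gradient value and the fixed scale of 2**16 are assumptions chosen for illustration; this is not the GradScaler's actual dynamic algorithm.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

grad = 1e-8        # a small gradient value (illustrative)
scale = 65536.0    # hypothetical fixed loss scale, 2**16

# Unscaled: the value is below half of fp16's smallest subnormal
# (2**-24, about 5.96e-8), so it rounds to zero.
unscaled = to_fp16(grad)

# Scaled: the stored value lands in fp16's normal range. Dividing by the
# scale afterwards, in higher precision, recovers the gradient.
scaled = to_fp16(grad * scale)
recovered = scaled / scale

print(unscaled)                               # 0.0
print(abs(recovered - grad) / grad < 1e-3)    # True
```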
How does the JIT compiler compare to TVM / Triton / PyTorch Inductor?
The netcl JIT Compiler is closer to Triton in scope than to TVM. It traces a Python function, captures the AutogradPrimitive ops, generates OpenCL C source text for the trace, compiles it at runtime with the vendor OpenCL driver, and caches the binary keyed on the op signature. It is restricted to elementwise-ish chains (pointwise, broadcast, small reductions); it does not do auto-tuning, search-based optimization, or schedule-space exploration the way TVM does.
Compared to PyTorch Inductor, netcl's compiler is much narrower in scope — it has no TorchDynamo/AOTAutograd equivalent and no FX-graph-fallback. It also does not yet autotune across tile sizes; the profiling module has hooks for that, but the autotuner driver is not wired into the production compile path. See the JIT Compiler page for the current scope.
Compared to TVM, the netcl compiler is faster to build, easier to read (the generated OpenCL C is one screen of code), and integrated with the Tape, but it does not do graph-level optimization across heterogeneous backends and does not support auto-scheduling.
What about distributed training across multiple machines?
netcl is single-node multi-device only. The Distributed stack implements host-mediated Collectives (all_reduce, broadcast, gather, reduce, all_gather) across all OpenCL devices visible to the process, and the DataParallel wrapper uses those collectives to do synchronized data-parallel training. None of that crosses the network boundary.
For multi-node training:
- The cleanest path is to launch one process per node and use a higher-level framework (e.g. a small custom launcher that calls torch.distributed or a sidecar like Horovod) to coordinate.
- The host-mediated collective design was a deliberate choice: it works on any OpenCL device without vendor-specific NCCL/RCCL, but it also means we get no benefit from NVLink, InfiniBand, or RoCE.
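A host-mediated all_reduce boils down to three steps: copy every device's buffer to the host, reduce there, and copy the result back to each device. The sketch below shows only the host-side reduction, with Python lists standing in for buffers that have already been copied off the device. The function name and signature are illustrative, not the netcl Collectives API.

```python
def all_reduce_sum(device_buffers):
    """Sum equally sized per-device buffers on the host.

    Returns one copy of the result per device; a real implementation
    would write each copy back with a host-to-device transfer.
    """
    if not device_buffers:
        return []
    n = len(device_buffers[0])
    if any(len(buf) != n for buf in device_buffers):
        raise ValueError("all buffers must have the same length")
    total = [sum(vals) for vals in zip(*device_buffers)]
    return [list(total) for _ in device_buffers]

grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # gradients from three devices
reduced = all_reduce_sum(grads)
print(reduced[0])   # [9.0, 12.0]
```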
Why so much low-level stuff? Who is the target audience?
netcl is a research framework, not a high-level wrapper. The intended audience is:
- Researchers who want to write a custom OpenCL kernel, fuse ops by hand, and instrument the JIT Compiler trace without fighting an abstraction layer.
- Engineers building a deep-learning stack on heterogeneous hardware (Intel + AMD + ARM) where no off-the-shelf framework works.
- Students in a deep-learning systems class who want to see the kernel, the Tape, the BufferPool, and the Memory Pool in the source.
If you want a beginner-friendly framework that "just works" out of the box, use PyTorch. The trade-off there is that you cannot read the source for the GPU dispatch in a single afternoon.
How do I profile a kernel?
Two options, in increasing depth:
- Wall-time timing. Read time.perf_counter() before the call, call dev.queue.finish() afterwards to wait for pending GPU work to complete, then read the counter again. This gives you end-to-end op cost including H2D/D2H.
- Kernel-level tuning. The autotuner (from netcl.profiling.autotuner import autotune_matmul, autotune_conv2d) sweeps tile sizes and workgroup shapes for a given problem size and returns the best config. Set NETCL_PROFILE_EVENTS=1 before creating the queue to collect cl_event timestamps you can feed to your vendor profiler (Intel VTune, AMD Radeon GPU Profiler, etc.).
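A minimal wall-time helper might look like the following. It assumes you pass in a `sync` callable that blocks until the device queue is empty (for netcl that would presumably be `dev.queue.finish`). The helper name and the warm-up and repeat defaults are illustrative choices, not part of netcl.

```python
import time

def time_op(fn, sync, warmup=5, repeats=20):
    """Return the mean wall-clock seconds per call of `fn`.

    `sync` must block until all queued device work has finished.
    """
    for _ in range(warmup):        # discard cold runs (compile, driver warm-up)
        fn()
    sync()                         # drain the queue before starting the clock
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    sync()                         # wait for the timed work to complete
    return (time.perf_counter() - start) / repeats

# Stand-in workload so the sketch runs without a device.
calls = []
mean_s = time_op(lambda: calls.append(1), sync=lambda: None)
print(f"{mean_s * 1e6:.2f} us per call")
```

Without the second `sync()`, the timer would only measure how long it takes to enqueue the work, since kernel launches return before the device finishes.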
Why is the first step so slow?
The first op on a fresh process pays three one-time costs:
- OpenCL context creation. The driver allocates internal state structures, validates the ICD, and links the runtime. Typically 50–200 ms.
- JIT compile of the kernel. The JIT Compiler generates OpenCL C, the vendor driver compiles it, and the binary is cached on disk. Typically 100 ms–2 s the first time, near-zero on subsequent runs.
- Driver warm-up. Many vendors (notably Intel) do a final validation pass on the first kernel launch. Up to 500 ms.
Subsequent steps in the same process are 10×–100× faster. The BufferPool caches the Tensor cl.Buffer objects, the JIT Compiler cache is hot, and the driver is warmed up. When benchmarking steady-state performance, do at least 5 warm-up runs and discard them so the one-time costs do not skew the result.
Why does my DataLoader hang on Linux?
The DataLoader forks workers on the first call to __iter__ (the first for xb, yb in loader: line). If a PyOpenCL context has already been created before that point — for example because you called manager.default() or created a Tensor before starting iteration — the child process inherits a half-initialized driver state and may deadlock or crash.
Two fixes:
- Start iteration before creating any OpenCL context. Call iter(loader) or begin the training loop before the first manager.default() / Tensor.from_host() call. The workers are then forked before the ICD is loaded, which is safe.
- Use single-process mode. Pass num_workers=0 to the loader. This disables forking entirely and runs everything in the main process. It is slower for large datasets but never has this issue.
How do I checkpoint and resume?
save_model and load_model from netcl.io persist a Sequential model — architecture config plus all weights — into a single .netcl file (NumPy NPZ format):
```python
from netcl.io import save_model, load_model

# Save after training
save_model(model, "checkpoint.netcl")

# Load in a fresh process (reconstructs the Sequential and loads weights)
model = load_model("checkpoint.netcl")
```
Optimizer state is not persisted. If you need to resume training from a mid-run checkpoint, save the optimizer's parameter arrays manually with np.savez and reload them with np.load.
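A sketch of the manual route follows. The layout of `opt_state` (one momentum array per parameter, plus a step counter) is an assumption for illustration; how you extract those arrays from your optimizer depends on its implementation, which this FAQ does not specify.

```python
import os
import tempfile

import numpy as np

# Hypothetical optimizer state: one momentum array per parameter.
opt_state = {
    "momentum_w0": np.zeros((4, 3), dtype=np.float32),
    "momentum_b0": np.ones(3, dtype=np.float32),
}
step = 1200  # training step counter, stored alongside the arrays

path = os.path.join(tempfile.mkdtemp(), "optimizer_state.npz")
np.savez(path, step=np.int64(step), **opt_state)

# Later, in the resumed process:
with np.load(path) as data:
    restored_step = int(data["step"])
    restored = {k: data[k] for k in data.files if k != "step"}

print(restored_step, sorted(restored))   # 1200 ['momentum_b0', 'momentum_w0']
```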
Can I use netcl from C/C++?
No. netcl is a Python-only framework. The OpenCL kernels are written in OpenCL C and exposed to Python via PyOpenCL, but there is no C/C++ public API and there is no C ABI. If you need to call netcl from C++, embed the Python interpreter and call into the Python API — the Quickstart is a fine starting point for that.
Does netcl support dynamic shapes?
Partially. The JIT Compiler recompiles a kernel when it sees a new shape, so dynamic shapes work — the first time you see a new shape the compile cost is paid, and the binary is cached for the next time. What is not yet implemented is an autotuning graph cache that re-uses the best tile size / workgroup shape across recompiles. The current implementation re-picks heuristics on every recompile.
For training workloads with a fixed batch size, this is invisible. For inference workloads with widely varying input sizes, you can pre-warm the cache by running a few representative shapes through the model at startup. The JIT Compiler page documents the cache key format.
Why does my custom OpenCL kernel fail to build?
The four most common reasons, in order of how often we see them in issues:
- __attribute__((vec_type_hint)) is not supported on the device. Some older Intel and AMD GPUs ignore the vec_type_hint and emit a build warning, but on a few embedded parts it is a hard error. Drop the attribute and let the compiler autovectorize.
- Workgroup size too large. The OpenCL spec guarantees almost nothing here (the minimum for CL_DEVICE_MAX_WORK_GROUP_SIZE is 1); many GPUs cap at 256 work-items per workgroup, and some Intel parts cap at 128. Query CL_DEVICE_MAX_WORK_GROUP_SIZE and clamp your __local sizing.
- Missing -cl-fast-relaxed-math. Without it, NaN/Inf handling in 1.0f / sqrt(x) produces a kernel that is correct but two orders of magnitude slower on AMD. Pass the build flag in your clBuildProgram call or via the JIT Compiler fast_math=True option.
- Local memory overflow. A __local array that holds a tile per work-item grows with the workgroup size. On Intel HD Graphics the limit is 64 KB, on AMD GPUs 32–64 KB. Reduce your tile size or use registers.
The Writing a Custom OpenCL Kernel tutorial walks through a working example, and the Tensor Backend page lists per-device build flags and limits.
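For the workgroup-size and local-memory failures, one defensive pattern is to compute the launch size from the device limits instead of hard-coding it. The helper below is a hypothetical sketch that takes the limits as plain integers. In PyOpenCL they are available as `device.max_work_group_size` and `device.local_mem_size`.

```python
def fit_workgroup(requested, max_wg_size, bytes_per_item, local_mem_bytes):
    """Return the largest power-of-two workgroup size within both limits.

    `bytes_per_item` is the __local storage each work-item contributes.
    Returns 1 as a floor even if a single item does not fit.
    """
    size = 1
    while size * 2 <= min(requested, max_wg_size):
        size *= 2
    # Shrink until the per-workgroup __local allocation fits.
    while size > 1 and size * bytes_per_item > local_mem_bytes:
        size //= 2
    return size

# 512 requested, device allows 256, 64 floats of __local per work-item,
# 32 KiB of local memory available.
print(fit_workgroup(512, 256, 64 * 4, 32 * 1024))   # 128
```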