netcl.ops — Elementary Operations
netcl.ops is the library of single-kernel GPU operations. It is the lowest user-facing
layer above the Tensor Backend: every op consumes and
produces Tensor objects, and most ops go through the JIT Compiler
so that the generated OpenCL source is cached per shape and dtype.
Note — Top-level re-exports.
netcl/ops/__init__.py is empty in the current code. Eight names (matmul, build_matmul_kernel, elementwise_binary, relu, bias_add, reduce_sum, softmax, conv2d) are surfaced from the root netcl/__init__.py and can be imported as from netcl import matmul, …. Everything else has to come from the explicit submodule:
from netcl.ops.elementwise import elementwise_unary
from netcl.ops.fused_ops import conv2d_relu_bn
from netcl.ops.reduction import reduce_mean
The table below maps every documented op to its source file and import path.
Op Index
| Op | Submodule / import path | One-line description |
|---|---|---|
| matmul | netcl.ops.matmul (root: netcl.matmul) | GEMM, auto-tuned via KernelSelector. |
| matmul_optimized | netcl.ops.matmul_optimized | Register-tiled GEMM with __local and __private buffers. |
| build_matmul_kernel | netcl.ops.matmul (root: netcl.build_matmul_kernel) | Returns the generated OpenCL source for a custom GEMM kernel. |
| elementwise_binary | netcl.ops.elementwise (root: netcl.elementwise_binary) | Two-arg OpenCL-like expression, JIT-compiled. |
| elementwise_unary | netcl.ops.elementwise | One-arg expression (sin, cos, exp, log, …). |
| elementwise_optimized | netcl.ops.elementwise_optimized | Vectorized paths (float4 / float8) for large elementwise workloads. |
| relu | netcl.ops.elementwise (root: netcl.relu) | Elementwise ReLU. |
| leaky_relu, gelu, swish, sigmoid, tanh, elu, softplus, prelu, clamp, hard_* | netcl.ops.elementwise | Other activations. |
| bias_add | netcl.ops.elementwise (root: netcl.bias_add) | Broadcasting bias addition. |
| reduce_sum | netcl.ops.reduction (root: netcl.reduce_sum) | Sum reduction. |
| reduce_mean | netcl.ops.reduction | Mean reduction. |
| softmax | netcl.ops.softmax (root: netcl.softmax) | Numerically stable softmax (max-subtract before exp). |
| softmax_fused | netcl.ops.softmax_fused | Single-kernel max-subtract → exp → /sum. |
| softmax_fp16 | netcl.ops.softmax_fp16 | fp16-specialized softmax (uses cl_khr_fp16). |
| conv2d | netcl.ops.conv2d (root: netcl.conv2d) | Vanilla 2D conv (im2col + matmul). |
| conv2d_optimized | netcl.ops.conv2d_optimized | Tiled + vectorized conv2d. |
| conv2d_planner | netcl.ops.conv2d_planner | Runtime selection of the conv2d strategy. |
| conv2d_cpu | netcl.ops.conv2d_cpu | NumPy CPU fallback for conv2d. |
| depthwise_conv2d | netcl.ops.depthwise_conv2d | Depthwise-separable 2D conv. |
| conv_transpose2d | netcl.ops.conv_transpose2d | Transposed conv (fractionally strided). |
| transpose2d | netcl.ops.transpose | 2D transpose. |
| permute | netcl.ops.permute | Arbitrary axis permutation. |
| bmm | netcl.ops.bmm | Batched matrix multiplication. |
| broadcast_binary | netcl.ops.broadcast | Elementwise op with NumPy-style broadcasting. |
| im2col | netcl.ops.im2col | Helper that turns a conv2d into a matmul. |
| winograd_fused | netcl.ops.winograd_fused | Winograd F(2×2, 3×3) for 3×3 stride-1 convs. |
| jit_fusion | netcl.ops.jit_fusion | Symbolic kernel fusion across multiple ops. |
| fused_ops | netcl.ops.fused_ops | Collection of pre-fused kernels (BN+ReLU, Conv+ReLU+BN, …). |
matmul
General matrix multiplication, auto-tuned through the KernelSelector.
from netcl.ops.matmul import matmul
# a: (M, K), b: (K, N) -> c: (M, N)
c = matmul(a, b)
At call time matmul queries the KernelSelector for an appropriate variant.
The selector considers the device profile (subgroups, fp16, vendor), the local memory
budget, and the total op count, then dispatches to one of:
- MATMUL_NAIVE: one work-item per output element; used only for tiny shapes.
- MATMUL_TILED: workgroup-level tile, good for small/medium.
- MATMUL_REGISTER_TILED: register-level tile, with vendor-specific tuning for NVIDIA Ampere/Turing and AMD RDNA2/3.
- MATMUL_VECTORIZED: vector loads/stores when the device has a wide preferred float width and the shape is aligned.
matmul_optimized (in netcl.ops.matmul_optimized) is a more aggressive
implementation that uses __local memory explicitly and is selected by the planner on
larger GEMMs.
build_matmul_kernel(M, N, K, dtype, tile_m, tile_n, tile_k, …) returns the generated
OpenCL source for a custom-tiled GEMM kernel, useful when you want to compile a kernel
once and reuse it many times.
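To make the tile_m / tile_n / tile_k parameters concrete, here is a minimal NumPy sketch of a blocked GEMM. It is a reference for the loop structure only, not netcl's kernel: the function name tiled_matmul and the default tile sizes are assumptions chosen for illustration.

```python
import numpy as np

def tiled_matmul(a, b, tile_m=32, tile_n=32, tile_k=16):
    """Reference blocked GEMM: c = a @ b, computed one tile at a time."""
    M, K = a.shape
    K2, N = b.shape
    if K != K2:
        raise ValueError("inner dimensions must match")
    c = np.zeros((M, N), dtype=np.result_type(a, b))
    for m0 in range(0, M, tile_m):        # each (m0, n0) pair plays the role of a workgroup
        for n0 in range(0, N, tile_n):
            rows = min(tile_m, M - m0)    # edge tiles may be smaller
            cols = min(tile_n, N - n0)
            acc = np.zeros((rows, cols), dtype=c.dtype)
            for k0 in range(0, K, tile_k):  # march along K, accumulating partial products
                a_tile = a[m0:m0 + tile_m, k0:k0 + tile_k]
                b_tile = b[k0:k0 + tile_k, n0:n0 + tile_n]
                acc += a_tile @ b_tile
            c[m0:m0 + tile_m, n0:n0 + tile_n] = acc
    return c

rng = np.random.default_rng(0)
a = rng.standard_normal((70, 45)).astype(np.float32)
b = rng.standard_normal((45, 50)).astype(np.float32)
c = tiled_matmul(a, b)
```

On a GPU the a_tile and b_tile slices would be staged in local memory (or registers for the register-tiled variant) so that each loaded value is reused across the whole tile.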
elementwise_binary
The most flexible op: you supply a small OpenCL-like expression and the op builds a specialized kernel for it.
from netcl.ops.elementwise import elementwise_binary
out = elementwise_binary(a, b, expression="MUL(ADD(v0, v1), 2.0f)")
| Token | Meaning |
|---|---|
| v0 | First argument |
| v1 | Second argument |
| ADD(v0, v1) | v0 + v1 |
| SUB(v0, v1) | v0 - v1 |
| MUL(v0, v1) | v0 * v1 |
| DIV(v0, v1) | v0 / v1 |
| MAX(v0, v1) | max(v0, v1) |
| MIN(v0, v1) | min(v0, v1) |
| CMP_EQ / CMP_LT / CMP_GT | Comparisons |
| POW(v0, v1) | pow(v0, v1) |
The expression is parsed into an OpenCL AST and compiled via the JIT Compiler,
so there is no Python-level per-element loop. For unary operations, use
elementwise_unary.
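As a rough illustration of how such an expression can become kernel source, the sketch below tokenizes a small subset of the vocabulary and emits an OpenCL C expression string. It is a simplified stand-in, not netcl's parser: it skips the AST stage, and the names tokenize, to_c, build_kernel_source, and the kernel name ew are hypothetical. The OpenCL built-ins fmax, fmin, and pow are real.

```python
import re

# Assumed mapping from tokens to C syntax (subset of the table above).
INFIX = {"ADD": "+", "SUB": "-", "MUL": "*", "DIV": "/"}
CALLS = {"MAX": "fmax", "MIN": "fmin", "POW": "pow"}

TOKEN_RE = re.compile(r"\s*([A-Za-z_][A-Za-z_0-9]*|[0-9.]+f?|[(),])")

def tokenize(expr):
    """Split an expression into identifiers, numeric literals, and punctuation."""
    expr = expr.strip()
    pos, tokens = 0, []
    while pos < len(expr):
        m = TOKEN_RE.match(expr, pos)
        if m is None:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens

def expect(tokens, want):
    got = tokens.pop(0) if tokens else None
    if got != want:
        raise ValueError(f"expected {want!r}, got {got!r}")

def to_c(tokens):
    """Recursive descent: consume tokens from the front, return a C expression."""
    tok = tokens.pop(0)
    if tok in INFIX or tok in CALLS:
        expect(tokens, "(")
        lhs = to_c(tokens)
        expect(tokens, ",")
        rhs = to_c(tokens)
        expect(tokens, ")")
        if tok in INFIX:
            return f"({lhs} {INFIX[tok]} {rhs})"
        return f"{CALLS[tok]}({lhs}, {rhs})"
    return tok  # a variable (v0, v1) or a numeric literal

def build_kernel_source(expr):
    """Wrap the translated expression in a one-work-item-per-element kernel."""
    body = to_c(tokenize(expr))
    return (
        "__kernel void ew(__global const float* a,\n"
        "                 __global const float* b,\n"
        "                 __global float* out) {\n"
        "    const size_t i = get_global_id(0);\n"
        "    const float v0 = a[i];\n"
        "    const float v1 = b[i];\n"
        f"    out[i] = {body};\n"
        "}\n"
    )

source = build_kernel_source("MUL(ADD(v0, v1), 2.0f)")
```

The generated string is what a JIT layer would hand to the OpenCL compiler; caching it per expression, shape, and dtype avoids recompiling on every call.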
elementwise_unary
One-argument expressions for sin, cos, exp, log, log2, sqrt, abs, neg, and
the various activation functions.
from netcl.ops.elementwise import elementwise_unary
y = elementwise_unary(x, expression="EXP(v0)")
elementwise_optimized
The vectorized path in netcl.ops.elementwise_optimized. Used when the tensor size is
large and the element count is a multiple of the preferred vector width (4 or 8). The
KernelSelector routes large elementwise workloads here automatically.
Activations
netcl.ops.elementwise exports a one-liner for every standard activation. They are
all differentiable — autograd wrappers live in autograd/ops.py.
from netcl.ops.elementwise import (
relu, leaky_relu, gelu, swish, sigmoid, tanh,
elu, softplus, prelu, clamp, hard_sigmoid, hard_swish, hard_tanh,
)
y = relu(x)
y = gelu(x)
y = clamp(x, 0.0, 6.0) # min/max pair
The relu symbol is also re-exported at the package root: from netcl import relu.
bias_add
Broadcasting bias addition. Accepts a bias vector and an N-dim input; the bias is broadcast across the leading batch dimensions.
from netcl.ops.elementwise import bias_add
y = bias_add(x, b) # x: (B, F), b: (F,) -> (B, F)
y = bias_add(x, b) # x: (B, C, H, W), b: (C, 1, 1) -> (B, C, H, W)
On the autograd path this becomes a differentiable op via autograd/ops.py.
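The two shape cases above follow ordinary NumPy broadcasting, so a plain NumPy reference is a convenient way to check results. This is a sketch of the expected semantics only; bias_add_ref is a hypothetical helper, not part of netcl.

```python
import numpy as np

def bias_add_ref(x, b):
    """NumPy reference: add b to x, requiring that the result keeps x's shape."""
    out_shape = np.broadcast_shapes(x.shape, b.shape)  # raises ValueError if incompatible
    if out_shape != x.shape:
        raise ValueError("bias must broadcast to the input shape, not enlarge it")
    return x + b

# Dense case: x is (B, F), b is (F,)
x2 = np.zeros((4, 8), dtype=np.float32)
b2 = np.arange(8, dtype=np.float32)
y2 = bias_add_ref(x2, b2)

# Conv case: x is (B, C, H, W), b is (C, 1, 1)
x4 = np.zeros((2, 3, 5, 5), dtype=np.float32)
b4 = np.array([1.0, 2.0, 3.0], dtype=np.float32).reshape(3, 1, 1)
y4 = bias_add_ref(x4, b4)
```

Note that a per-channel bias for an NCHW tensor has to be shaped (C, 1, 1); a flat (C,) vector would line up with the W axis instead.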
Reductions
from netcl.ops.reduction import reduce_sum, reduce_mean
y = reduce_sum(x, axis=-1) # sum along the last axis
y = reduce_mean(x, axis=(0, 2, 3)) # mean across three axes
The planner picks REDUCTION_SEQUENTIAL (tiny), REDUCTION_PARALLEL (medium), or
REDUCTION_WORKGROUP (large, multi-stage). reduce_sum is also re-exported at the
package root.
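The multi-stage idea behind a workgroup reduction can be simulated in NumPy: each group reduces its slice with a tree of pairwise adds, and the per-group partial sums feed the next stage. The sketch below is illustrative only; workgroup_reduce_sum and the group_size parameter are assumptions, not netcl names.

```python
import numpy as np

def workgroup_reduce_sum(x, group_size=256):
    """Sum a flat array in stages, mimicking a per-workgroup tree reduction."""
    if group_size < 2 or group_size & (group_size - 1):
        raise ValueError("group_size must be a power of two >= 2")
    data = np.asarray(x, dtype=np.float64).ravel()
    if data.size == 0:
        return 0.0
    while data.size > 1:
        pad = (-data.size) % group_size
        data = np.pad(data, (0, pad))            # zero is the identity for sum
        groups = data.reshape(-1, group_size)     # one row per simulated workgroup
        stride = group_size // 2
        while stride > 0:                         # halve the active range each step
            groups[:, :stride] += groups[:, stride:2 * stride]
            stride //= 2
        data = groups[:, 0].copy()                # partial sums feed the next stage
    return float(data[0])

total = workgroup_reduce_sum(np.arange(1000), group_size=64)
```

With 1000 elements and a group size of 64, the first stage produces 16 partial sums and the second stage reduces those to one value.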
softmax
from netcl.ops.softmax import softmax
y = softmax(x, axis=-1) # numerically stable
The reference implementation does a max-subtract in a separate pass. softmax_fused
collapses the entire op into a single kernel, which is what the
JIT Compiler selects when the input is large enough to
amortize the launch overhead. softmax_fp16 is the fp16-specialized variant for devices
that advertise cl_khr_fp16 (see the core capability probe).
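The max-subtract trick is easy to verify in NumPy. The sketch below is a reference for the math only (softmax_ref is a hypothetical name): subtracting the per-row maximum leaves the result unchanged but keeps every exponent at or below zero, so exp cannot overflow.

```python
import numpy as np

def softmax_ref(x, axis=-1):
    """Numerically stable softmax: shift by the max before exponentiating."""
    shifted = x - np.max(x, axis=axis, keepdims=True)  # largest entry becomes 0
    e = np.exp(shifted)                                # every value is in (0, 1]
    return e / np.sum(e, axis=axis, keepdims=True)

x = np.array([[1000.0, 1001.0, 1002.0],
              [-5.0, 0.0, 5.0]])
y = softmax_ref(x, axis=-1)
```

Without the shift, exp(1000.0) overflows to infinity in both float32 and float64 and the row would come out as NaN.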
conv2d
The full 2D convolution family is spread across several files:
from netcl.ops.conv2d import conv2d # vanilla im2col + matmul
from netcl.ops.conv2d_optimized import conv2d_optimized
from netcl.ops.conv2d_cpu import conv2d_cpu # NumPy fallback
from netcl.ops.conv2d_planner import conv2d_planner # runtime strategy chooser
from netcl.ops.depthwise_conv2d import depthwise_conv2d
from netcl.ops.conv_transpose2d import conv_transpose2d
from netcl.ops.im2col import im2col # helper
from netcl.ops.winograd_fused import winograd_fused
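For readers unfamiliar with the im2col + matmul strategy named in the first import, here is a minimal NumPy sketch of the technique. It assumes NCHW inputs and (O, C, kh, kw) weights; im2col_ref and conv2d_im2col_ref are hypothetical reference helpers and do not reflect netcl's actual signatures.

```python
import numpy as np

def im2col_ref(x, kh, kw, stride=1, padding=0):
    """x: (B, C, H, W) -> columns of shape (B, C*kh*kw, out_h*out_w)."""
    B, C, H, W = x.shape
    xp = np.pad(x, ((0, 0), (0, 0), (padding, padding), (padding, padding)))
    out_h = (H + 2 * padding - kh) // stride + 1
    out_w = (W + 2 * padding - kw) // stride + 1
    cols = np.empty((B, C, kh, kw, out_h, out_w), dtype=x.dtype)
    for i in range(kh):
        for j in range(kw):
            # All window positions for kernel offset (i, j), gathered at once
            cols[:, :, i, j] = xp[:, :, i:i + stride * out_h:stride,
                                        j:j + stride * out_w:stride]
    return cols.reshape(B, C * kh * kw, out_h * out_w), out_h, out_w

def conv2d_im2col_ref(x, w, stride=1, padding=0):
    """Cross-correlation as one matmul per batch item. Returns (B, O, out_h, out_w)."""
    O, C, kh, kw = w.shape
    cols, out_h, out_w = im2col_ref(x, kh, kw, stride, padding)
    w_mat = w.reshape(O, C * kh * kw)
    out = w_mat @ cols            # (O, K) @ (B, K, P) broadcasts to (B, O, P)
    return out.reshape(x.shape[0], O, out_h, out_w)

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3, 5, 5))
w = rng.standard_normal((4, 3, 3, 3))
y = conv2d_im2col_ref(x, w, stride=1, padding=1)
```

The trade-off is memory: the column buffer holds kh*kw copies of most input values, which is one reason other strategies exist alongside it.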
conv2d_planner is what user code should normally call: it asks the
KernelSelector for the right variant given the device and shape, then
dispatches. conv2d_planner chooses:
- CONV2D_IMPLICIT_GEMM for the general case.
- CONV2D_WINOGRAD for 3×3 stride-1 (unless NETCL_CONV_WINOGRAD=0).
- CONV2D_TILED_LOCAL for small outputs when NETCL_CONV_TILED_LOCAL=1.
- CONV2D_IM2COL for CPU devices.
- CONV2D_IMPLICIT_GEMM with use_1x1_optimization=True for 1×1 convs.
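The selection rules listed above can be read as a small decision function. The sketch below is one possible encoding, not the planner's real logic: the precedence of the rules, the small_output_elems threshold, and the function name choose_conv2d_variant are all assumptions. Only the variant names and the two environment variables come from the list.

```python
import os

def choose_conv2d_variant(kernel_hw, stride, out_hw, device_is_cpu,
                          env=None, small_output_elems=1024):
    """Return (variant_name, extra_options) for a conv2d call.

    Rule order and the 'small output' threshold are illustrative guesses.
    """
    env = os.environ if env is None else env
    kh, kw = kernel_hw
    if device_is_cpu:
        return "CONV2D_IM2COL", {}
    if (kh, kw) == (1, 1):
        return "CONV2D_IMPLICIT_GEMM", {"use_1x1_optimization": True}
    winograd_enabled = env.get("NETCL_CONV_WINOGRAD", "1") != "0"
    if (kh, kw) == (3, 3) and stride == 1 and winograd_enabled:
        return "CONV2D_WINOGRAD", {}
    tiled_enabled = env.get("NETCL_CONV_TILED_LOCAL", "0") == "1"
    if tiled_enabled and out_hw[0] * out_hw[1] <= small_output_elems:
        return "CONV2D_TILED_LOCAL", {}
    return "CONV2D_IMPLICIT_GEMM", {}

variant, options = choose_conv2d_variant((3, 3), 1, (56, 56), False, env={})
```

Passing env explicitly keeps the function testable; in normal use it would read the process environment.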
Permutations and Transposes
from netcl.ops.transpose import transpose2d
from netcl.ops.permute import permute
y = transpose2d(x) # swaps the last two axes, e.g. (M, N) -> (N, M)
y = permute(x, axes=(0, 2, 3, 1)) # arbitrary permutation, here NCHW -> NHWC
On the CPU backend permute is shape-only: it produces a view with new strides (no data
motion). On the OpenCL backend it makes a copy via an out-of-place kernel.
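The difference between a strided view and a materialized copy can be seen directly in NumPy, which is a reasonable model for the CPU-side behavior described here. The snippet uses only standard NumPy calls and does not touch netcl.

```python
import numpy as np

# NCHW tensor with B=2, C=3, H=4, W=5
x = np.arange(2 * 3 * 4 * 5, dtype=np.float32).reshape(2, 3, 4, 5)

# View: same buffer, permuted strides, no data motion
view = np.transpose(x, (0, 2, 3, 1))      # NHWC

# Copy: what an out-of-place kernel would produce, contiguous in the new order
copy = np.ascontiguousarray(view)

print(x.strides, view.strides, copy.strides)
```

The view is cheap to create but reads memory non-contiguously; the copy costs a pass over the data and then gives unit-stride access along the channel axis.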
bmm
Batched matrix multiplication. Inputs are 3-D tensors (B, M, K) and (B, K, N);
output is (B, M, N).
from netcl.ops.bmm import bmm
y = bmm(a, b)
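A NumPy reference for the shape contract is handy when testing. The sketch below uses einsum to spell out which axes are contracted; bmm_ref is a hypothetical helper name.

```python
import numpy as np

def bmm_ref(a, b):
    """(B, M, K) x (B, K, N) -> (B, M, N), one independent matmul per batch item."""
    if a.ndim != 3 or b.ndim != 3:
        raise ValueError("both inputs must be 3-D")
    if a.shape[0] != b.shape[0] or a.shape[2] != b.shape[1]:
        raise ValueError(f"incompatible shapes {a.shape} and {b.shape}")
    return np.einsum("bmk,bkn->bmn", a, b)

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 6, 5))
b = rng.standard_normal((4, 5, 7))
y = bmm_ref(a, b)
```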
Broadcasting
broadcast_binary is the broadcasting-aware elementwise op used inside
autograd/ops.py for any binary op whose two inputs do not share a
shape.
from netcl.ops.broadcast import broadcast_binary
out = broadcast_binary(a, b, op="ADD") # op is one of "ADD", "SUB", "MUL", "DIV"
The OpenCL backend materializes a broadcast index; the CPU backend uses NumPy broadcasting directly.
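One way to picture a materialized broadcast index is as a lookup table that maps every flat output position to the flat input position it reads from, with stride 0 along broadcast axes. The sketch below builds such a table in NumPy; broadcast_index is a hypothetical helper, and netcl's actual index layout may differ.

```python
import numpy as np

def broadcast_index(in_shape, out_shape):
    """Flat input index for every flat output index (C order)."""
    ndim = len(out_shape)
    padded = (1,) * (ndim - len(in_shape)) + tuple(in_shape)  # left-pad with 1s
    for p, o in zip(padded, out_shape):
        if p != 1 and p != o:
            raise ValueError(f"cannot broadcast {in_shape} to {out_shape}")
    # Contiguous element strides of the padded input shape
    strides, running = [], 1
    for dim in reversed(padded):
        strides.append(running)
        running *= dim
    strides = strides[::-1]
    # Size-1 axes are broadcast: moving along them must not move in the input
    strides = [0 if p == 1 else s for p, s in zip(padded, strides)]
    coords = np.indices(out_shape).reshape(ndim, -1)          # (ndim, n_out)
    return (np.asarray(strides)[:, None] * coords).sum(axis=0)

a = np.arange(6, dtype=np.float32).reshape(2, 3)
b = np.array([10.0, 20.0, 30.0], dtype=np.float32)
out_shape = np.broadcast_shapes(a.shape, b.shape)
idx_a = broadcast_index(a.shape, out_shape)
idx_b = broadcast_index(b.shape, out_shape)
out = (a.ravel()[idx_a] + b.ravel()[idx_b]).reshape(out_shape)
```

A kernel can then compute each output element as a[idx_a[i]] op b[idx_b[i]] without any shape logic on the device.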
jit_fusion
netcl.ops.jit_fusion is the symbolic-fusion layer: it walks a list of compatible
elementwise ops and emits a single OpenCL kernel that performs the whole chain. This is
the mechanism that turns a * b + c into one launch instead of two.
from netcl.ops.jit_fusion import fuse_chain
fused = fuse_chain([(a, b, "MUL"), (c, None, "ADD")])
The result is a Tensor plus, optionally, a backward closure if every operation in the chain is registered with autograd.
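To show what "one launch instead of two" means at the source level, the sketch below folds a chain of elementwise steps into a single kernel body. It is a toy string generator, not the real jit_fusion implementation: fuse_chain_source, its argument format, and the kernel name are hypothetical and differ from the fuse_chain call shown above.

```python
# Assumed op-to-operator mapping for the four basic arithmetic ops.
OPS = {"ADD": "+", "SUB": "-", "MUL": "*", "DIV": "/"}

def fuse_chain_source(first_input, steps, name="fused_chain"):
    """Build OpenCL C source for one kernel that applies every step in order.

    steps is a list of (op, operand_name) pairs applied to the running result.
    """
    expr = f"{first_input}[i]"
    inputs = [first_input]
    for op, operand in steps:
        if op not in OPS:
            raise ValueError(f"unsupported op {op!r}")
        expr = f"({expr} {OPS[op]} {operand}[i])"   # nest instead of storing a temporary
        if operand not in inputs:
            inputs.append(operand)
    params = ", ".join(f"__global const float* {n}" for n in inputs)
    return (
        f"__kernel void {name}({params}, __global float* out) {{\n"
        f"    const size_t i = get_global_id(0);\n"
        f"    out[i] = {expr};\n"
        f"}}\n"
    )

# a * b + c as a single kernel
src = fuse_chain_source("a", [("MUL", "b"), ("ADD", "c")])
```

The intermediate a * b never touches global memory: it lives in a register inside the one kernel, which is where the savings over two separate launches come from.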
Fused Ops
netcl.ops.fused_ops collects the pre-fused kernels that combine a conv with its
activation and normalization, the BN+ReLU fusion, and similar layer-level compositions.
These are dispatched by the JIT Compiler when the model
traversal detects the right pattern.
from netcl.ops.fused_ops import conv2d_relu_bn, batch_norm2d_relu
x = conv2d_relu_bn(x, w, b, gamma, beta, mean, var, eps=1e-5)
x = batch_norm2d_relu(x, gamma, beta, mean, var, eps=1e-5)
For the full layer-level fusion table (which includes the linear_relu, conv2d_relu,
and add_relu that the nn API exposes), see Fused Ops.
Example: Fused Conv+ReLU+BN
from netcl.ops.conv2d_planner import conv2d_planner as conv2d
from netcl.ops.elementwise import bias_add, relu
from netcl.ops.fused_ops import conv2d_relu_bn
# As separate kernels
x = conv2d(x, w, padding=1)
x = bias_add(x, b)
x = relu(x)
# x = batch_norm2d(x, gamma, beta, running_mean, running_var, eps=1e-5)
# As one kernel
x = conv2d_relu_bn(x, w, b, gamma, beta, mean, var, eps=1e-5)
The fused form has one launch and one Tensor allocation per call, which on modern GPUs is significantly cheaper than the three-kernel chain, typically by 20–30% on ResNet stages.
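For checking a fused result numerically, the elementwise tail that follows the convolution can be written as a NumPy reference. The sketch assumes the order bias, then ReLU, then inference-mode batch norm, matching the unfused listing above; the actual order inside conv2d_relu_bn is not confirmed here, and relu_bn_tail_ref is a hypothetical name.

```python
import numpy as np

def relu_bn_tail_ref(z, b, gamma, beta, mean, var, eps=1e-5):
    """z: conv output (B, C, H, W); all other arrays are per-channel, shape (C,)."""
    def per_channel(v):
        return np.asarray(v).reshape(1, -1, 1, 1)   # broadcast along the C axis
    y = np.maximum(z + per_channel(b), 0.0)          # bias + ReLU
    scale = per_channel(gamma) / np.sqrt(per_channel(var) + eps)
    return scale * (y - per_channel(mean)) + per_channel(beta)   # batch norm

z = np.zeros((1, 2, 3, 3))
out = relu_bn_tail_ref(
    z,
    b=np.array([1.0, -1.0]),
    gamma=np.array([2.0, 2.0]),
    beta=np.array([0.5, 0.5]),
    mean=np.array([1.0, 0.0]),
    var=np.array([1.0, 1.0]),
    eps=0.0,
)
```

In a fused kernel all of this happens per output element right after the accumulation loop, so the intermediate tensors never need to be allocated.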
Elementwise Token Reference
The elementwise_binary and elementwise_unary compilers share a token vocabulary.
This is the complete list of tokens recognized at the time of writing:
| Category | Tokens |
|---|---|
| Arithmetic | ADD, SUB, MUL, DIV, POW, NEG |
| Math | SIN, COS, TAN, ASIN, ACOS, ATAN, ATAN2, EXP, LOG, LOG2, LOG10, SQRT, RSQRT |
| Comparisons | CMP_EQ, CMP_NEQ, CMP_LT, CMP_LE, CMP_GT, CMP_GE |
| Reductions | MAX, MIN |
| Activations | RELU, LEAKY_RELU, SIGMOID, TANH, GELU, SWISH, ELU, SOFTPLUS, HARD_SIGMOID, HARD_SWISH, HARD_TANH |
| Constants | numeric literals (1.0f, 2, 0.5, …) |
If a token is not in the table, the op falls back to a generic single-expression kernel that uses the token verbatim (so custom C expressions are still possible, but lose the helper macros).
See also
- Tensor — the type every op produces and consumes.
- core API — the OpenCLBackend, the BufferPool, and the KernelSelector that drives auto-tuning.
- JIT Compiler — how generated source is compiled and cached.
- Tensor Backend — the bigger picture of buffers, queues, and the OpenCL transport.
- Autograd & Tape — the differentiable wrappers around the ops.
- Writing a Custom OpenCL Kernel — extending the op set with your own kernels.
- nn API — the layer-level fusions and the MLP / ResNet containers that use these ops.