netcl.ops — Elementary Operations

netcl.ops is the library of single-kernel GPU operations. It is the lowest user-facing layer above the Tensor Backend: every op consumes and produces Tensor objects, and most ops go through the JIT Compiler so that the generated OpenCL source is cached per shape and dtype.

Note: Top-level re-exports. netcl/ops/__init__.py is empty in the current code. Eight names (matmul, build_matmul_kernel, elementwise_binary, relu, bias_add, reduce_sum, softmax, conv2d) are re-exported from the root netcl/__init__.py and can be imported as from netcl import matmul, …. Everything else must be imported from its explicit submodule:

from netcl.ops.elementwise import elementwise_unary
from netcl.ops.fused_ops import conv2d_relu_bn
from netcl.ops.reduction import reduce_mean

The table below maps every documented op to its source file and import path.

Op Index

Op	Submodule / import path	One-line description
`matmul`	`netcl.ops.matmul` (root: `netcl.matmul`)	GEMM, auto-tuned via KernelSelector.
`matmul_optimized`	`netcl.ops.matmul_optimized`	Register-tiled GEMM with `__local` and `__private` buffers.
`build_matmul_kernel`	`netcl.ops.matmul` (root: `netcl.build_matmul_kernel`)	Returns the generated OpenCL source for a custom GEMM kernel.
`elementwise_binary`	`netcl.ops.elementwise` (root: `netcl.elementwise_binary`)	Two-arg OpenCL-like expression, JIT-compiled.
`elementwise_unary`	`netcl.ops.elementwise`	One-arg expression (`sin`, `cos`, `exp`, `log`, …).
`elementwise_optimized`	`netcl.ops.elementwise_optimized`	Vectorized paths (float4 / float8) for large elementwise workloads.
`relu`	`netcl.ops.elementwise` (root: `netcl.relu`)	Elementwise ReLU.
`leaky_relu`, `gelu`, `swish`, `sigmoid`, `tanh`, `elu`, `softplus`, `prelu`, `clamp`, `hard_*`	`netcl.ops.elementwise`	Other activations.
`bias_add`	`netcl.ops.elementwise` (root: `netcl.bias_add`)	Broadcasting bias addition.
`reduce_sum`	`netcl.ops.reduction` (root: `netcl.reduce_sum`)	Sum reduction.
`reduce_mean`	`netcl.ops.reduction`	Mean reduction.
`softmax`	`netcl.ops.softmax` (root: `netcl.softmax`)	Numerically stable softmax (max-subtract before exp).
`softmax_fused`	`netcl.ops.softmax_fused`	Single-kernel max-subtract → exp → /sum.
`softmax_fp16`	`netcl.ops.softmax_fp16`	fp16-specialized softmax (uses `cl_khr_fp16`).
`conv2d`	`netcl.ops.conv2d` (root: `netcl.conv2d`)	Vanilla 2D conv (im2col + matmul).
`conv2d_optimized`	`netcl.ops.conv2d_optimized`	Tiled + vectorized conv2d.
`conv2d_planner`	`netcl.ops.conv2d_planner`	Runtime selection of the conv2d strategy.
`conv2d_cpu`	`netcl.ops.conv2d_cpu`	NumPy CPU fallback for conv2d.
`depthwise_conv2d`	`netcl.ops.depthwise_conv2d`	Depthwise-separable 2D conv.
`conv_transpose2d`	`netcl.ops.conv_transpose2d`	Transposed conv (fractionally strided).
`transpose2d`	`netcl.ops.transpose`	2D transpose.
`permute`	`netcl.ops.permute`	Arbitrary axis permutation.
`bmm`	`netcl.ops.bmm`	Batched matrix multiplication.
`broadcast_binary`	`netcl.ops.broadcast`	Elementwise op with NumPy-style broadcasting.
`im2col`	`netcl.ops.im2col`	Helper that turns a conv2d into a matmul.
`winograd_fused`	`netcl.ops.winograd_fused`	Winograd F(2×2, 3×3) for 3×3 stride-1 convs.
`jit_fusion`	`netcl.ops.jit_fusion`	Symbolic kernel fusion across multiple ops.
`fused_ops`	`netcl.ops.fused_ops`	Collection of pre-fused kernels (BN+ReLU, Conv+ReLU+BN, …).

`matmul`

General matrix multiplication, auto-tuned through the KernelSelector.

from netcl.ops.matmul import matmul
# a: (M, K), b: (K, N) -> c: (M, N)
c = matmul(a, b)

At call time matmul queries the KernelSelector for an appropriate variant. The selector considers the device profile (subgroups, fp16, vendor), the local memory budget, and the total op count, then dispatches to one of:

MATMUL_NAIVE — one work-item per output element; used only for tiny shapes.
MATMUL_TILED — workgroup-level tile, good for small/medium.
MATMUL_REGISTER_TILED — register-level tile, with vendor-specific tuning for NVIDIA Ampere/Turing and AMD RDNA2/3.
MATMUL_VECTORIZED — vector loads/stores when the device has a wide preferred float width and the shape is aligned.
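
To make the difference between the naive and tiled variants concrete, the following pure-Python reference computes the same product both ways. It is only a sketch of the loop structure: the function names and the tile size are illustrative assumptions, not netcl code, and the real variants are OpenCL kernels chosen by the KernelSelector.

```python
def matmul_naive(a, b):
    """One (i, j) output element per 'work-item'; a is MxK, b is KxN."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def matmul_tiled(a, b, tile=2):
    """Same result, but the loops walk tile x tile blocks, as a workgroup would."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):          # tile row of C
        for j0 in range(0, n, tile):      # tile column of C
            for p0 in range(0, k, tile):  # slide along the shared K dimension
                for i in range(i0, min(i0 + tile, m)):
                    for j in range(j0, min(j0 + tile, n)):
                        acc = c[i][j]
                        for p in range(p0, min(p0 + tile, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c


a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]        # (2, 3)
b = [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]]   # (3, 2)
print(matmul_tiled(a, b))                     # same values as matmul_naive(a, b)
```

On a GPU the point of the tiled loop order is that each block of a and b can be staged once in fast memory and reused by every work-item in the workgroup.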

matmul_optimized (in netcl.ops.matmul_optimized) is a more aggressive implementation that uses __local memory explicitly and is selected by the planner on larger GEMMs.

build_matmul_kernel(M, N, K, dtype, tile_m, tile_n, tile_k, …) returns the generated OpenCL source for a custom-tiled GEMM kernel, useful when you want to compile a kernel once and reuse it many times.

`elementwise_binary`

The most flexible op: you supply a small OpenCL-like expression and the op builds a specialized kernel for it.

from netcl.ops.elementwise import elementwise_binary
out = elementwise_binary(a, b, expression="MUL(ADD(v0, v1), 2.0f)")

Token	Meaning
`v0`	First argument
`v1`	Second argument
`ADD(v0, v1)`	`v0 + v1`
`SUB(v0, v1)`	`v0 - v1`
`MUL(v0, v1)`	`v0 * v1`
`DIV(v0, v1)`	`v0 / v1`
`MAX(v0, v1)`	`max(v0, v1)`
`MIN(v0, v1)`	`min(v0, v1)`
`CMP_EQ` / `CMP_LT` / `CMP_GT`	Comparisons
`POW(v0, v1)`	`pow(v0, v1)`

The expression is parsed into an OpenCL AST and compiled via the JIT Compiler, so there is no Python-level per-element loop. For unary operations, use elementwise_unary.
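
As an illustration of how a macro expression might be lowered to OpenCL C, here is a small recursive-descent translator. It is a sketch under stated assumptions: the token mapping, the function names, and the choice to reject unknown tokens are hypothetical and cover only a subset of the vocabulary; netcl's actual parser is not shown in this document.

```python
import re

# Hypothetical mapping from macro tokens to OpenCL C; not netcl's actual table.
_INFIX = {"ADD": "+", "SUB": "-", "MUL": "*", "DIV": "/"}
_CALL = {"MAX": "max", "MIN": "min", "POW": "pow"}
_TOKEN = re.compile(r"\s*([A-Za-z_]\w*|\d+\.?\d*f?|[(),])")


def tokenize(expr):
    """Split an expression into identifiers, numeric literals, and punctuation."""
    expr = expr.strip()
    pos, out = 0, []
    while pos < len(expr):
        m = _TOKEN.match(expr, pos)
        if m is None:
            raise ValueError(f"unexpected input at {pos}: {expr[pos:]!r}")
        out.append(m.group(1))
        pos = m.end()
    return out


def to_opencl(expr):
    """Translate e.g. 'MUL(ADD(v0, v1), 2.0f)' into an OpenCL C expression."""
    tokens = tokenize(expr)
    pos = 0

    def parse():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if pos < len(tokens) and tokens[pos] == "(":   # macro call
            pos += 1                                   # consume "("
            args = [parse()]
            while tokens[pos] == ",":
                pos += 1
                args.append(parse())
            pos += 1                                   # consume ")"
            if tok in _INFIX and len(args) == 2:
                return f"({args[0]} {_INFIX[tok]} {args[1]})"
            if tok in _CALL and len(args) == 2:
                return f"{_CALL[tok]}({args[0]}, {args[1]})"
            raise ValueError(f"unsupported token: {tok}")
        return tok                                     # v0, v1, or a literal

    result = parse()
    if pos != len(tokens):
        raise ValueError("trailing input")
    return result


print(to_opencl("MUL(ADD(v0, v1), 2.0f)"))   # ((v0 + v1) * 2.0f)
```

The resulting string is the kind of per-element expression that would be spliced into a kernel body, with v0 and v1 bound to the two input elements.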

`elementwise_unary`

One-argument expressions for sin, cos, exp, log, log2, sqrt, abs, neg, and the various activation functions.

from netcl.ops.elementwise import elementwise_unary
y = elementwise_unary(x, expression="EXP(v0)")

`elementwise_optimized`

netcl.ops.elementwise_optimized is the vectorized path (float4 / float8). It is used when the tensor is large and its element count is a multiple of the preferred vector width (4 or 8). The KernelSelector routes large elementwise workloads here automatically.
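
The eligibility rule can be sketched as a small helper. The function name and the size threshold below are assumptions made for illustration; the document does not state what counts as "large" or how the KernelSelector implements the check.

```python
def pick_vector_width(n_elements, preferred_width=4, min_elements=1 << 16):
    """Return the float vector width to use, or 1 for the scalar path.

    min_elements is a hypothetical cut-off for "large"; netcl's real
    threshold is not documented here.
    """
    if n_elements < min_elements:
        return 1                                   # too small to bother
    if preferred_width in (4, 8) and n_elements % preferred_width == 0:
        return preferred_width                     # float4 or float8 path
    return 1                                       # misaligned: scalar path


print(pick_vector_width(1 << 20, preferred_width=8))        # 8
print(pick_vector_width((1 << 20) + 1, preferred_width=4))  # 1
```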

Activations

netcl.ops.elementwise exports a one-line wrapper for every standard activation. All of them are differentiable; the autograd wrappers live in autograd/ops.py.

from netcl.ops.elementwise import (
    relu, leaky_relu, gelu, swish, sigmoid, tanh,
    elu, softplus, prelu, clamp, hard_sigmoid, hard_swish, hard_tanh,
)
y = relu(x)
y = gelu(x)
y = clamp(x, 0.0, 6.0)         # min/max pair

The relu symbol is also re-exported at the package root: from netcl import relu.

`bias_add`

Broadcasting bias addition. Accepts an N-dim input and a bias; the bias is broadcast across the batch dimension and, for a (C, 1, 1) bias on an NCHW input, across the spatial dimensions as well.

from netcl.ops.elementwise import bias_add
y = bias_add(x, b)        # x: (B, F),  b: (F,)   -> (B, F)
y = bias_add(x, b)        # x: (B, C, H, W),  b: (C, 1, 1)  -> (B, C, H, W)

On the autograd path this becomes a differentiable op via autograd/ops.py.

Reductions

from netcl.ops.reduction import reduce_sum, reduce_mean

y = reduce_sum(x, axis=-1)            # sum along the last axis
y = reduce_mean(x, axis=(0, 2, 3))    # mean across three axes

The planner picks REDUCTION_SEQUENTIAL (tiny), REDUCTION_PARALLEL (medium), or REDUCTION_WORKGROUP (large, multi-stage). reduce_sum is also re-exported at the package root.
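
The multi-stage idea behind a workgroup reduction can be sketched in plain Python: each workgroup tree-reduces its own chunk, then the partial sums are reduced again. This is an illustrative model only; the function name and the workgroup size are assumptions, and the workgroup size must be a power of two for the halving loop to work.

```python
def reduce_sum_two_stage(values, workgroup_size=4):
    """Sum `values` the way a multi-stage workgroup reduction would."""
    # Stage 1: each "workgroup" reduces one contiguous chunk to a partial sum.
    partials = []
    for start in range(0, len(values), workgroup_size):
        local = list(values[start:start + workgroup_size])
        # Pad to the workgroup size so the tree reduction halves evenly.
        local += [0.0] * (workgroup_size - len(local))
        stride = workgroup_size // 2
        while stride > 0:                     # tree reduction in "local memory"
            for lid in range(stride):
                local[lid] += local[lid + stride]
            stride //= 2
        partials.append(local[0])
    # Stage 2: reduce the per-workgroup partials (a second launch on a GPU).
    return sum(partials)


print(reduce_sum_two_stage([float(i) for i in range(1, 11)]))   # 55.0
```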

`softmax`

from netcl.ops.softmax import softmax
y = softmax(x, axis=-1)            # numerically stable

The reference implementation does a max-subtract in a separate pass. softmax_fused collapses the entire op into a single kernel, which is what the JIT Compiler selects when the input is large enough to amortize the launch overhead. softmax_fp16 is the fp16-specialized variant for devices that advertise cl_khr_fp16 (see the core capability probe).
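
The reason for the max-subtract step is easiest to see on a single row. The sketch below is a plain-Python model of the three passes (max, shifted exp, normalize); the function names are illustrative and it is not the netcl kernel.

```python
import math


def softmax_stable(row):
    m = max(row)                               # pass 1: row maximum
    exps = [math.exp(v - m) for v in row]      # pass 2: shifted exponentials
    total = sum(exps)
    return [e / total for e in exps]           # pass 3: normalize


def softmax_naive(row):
    exps = [math.exp(v) for v in row]          # overflows for large inputs
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax_stable([1000.0, 1001.0, 1002.0])   # fine: exponents are <= 0
print(sum(probs))
```

Subtracting the maximum leaves the result mathematically unchanged, because the common factor cancels in the ratio, while keeping every exponent at or below zero. Calling softmax_naive on the same row raises OverflowError in Python; in floating-point kernels the equivalent failure is inf or NaN output.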

`conv2d`

The full 2D convolution family is spread across several files:

from netcl.ops.conv2d import conv2d                 # vanilla im2col + matmul
from netcl.ops.conv2d_optimized import conv2d_optimized
from netcl.ops.conv2d_cpu import conv2d_cpu          # NumPy fallback
from netcl.ops.conv2d_planner import conv2d_planner  # runtime strategy chooser
from netcl.ops.depthwise_conv2d import depthwise_conv2d
from netcl.ops.conv_transpose2d import conv_transpose2d
from netcl.ops.im2col import im2col                  # helper
from netcl.ops.winograd_fused import winograd_fused
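
For readers unfamiliar with the im2col approach used by the vanilla conv2d, here is a minimal pure-Python sketch for a single image with stride 1 and no padding. The function names mirror the netcl helpers only loosely, and the signatures are assumptions made for this example, not the library's API.

```python
def im2col(x, kh, kw):
    """x: C x H x W nested lists -> (C*kh*kw) x (out_h*out_w) matrix."""
    c, h, w = len(x), len(x[0]), len(x[0][0])
    out_h, out_w = h - kh + 1, w - kw + 1
    cols = []
    for ch in range(c):
        for dy in range(kh):
            for dx in range(kw):
                # One row per (channel, kernel offset); one column per output pixel.
                cols.append([x[ch][oy + dy][ox + dx]
                             for oy in range(out_h) for ox in range(out_w)])
    return cols, (out_h, out_w)


def conv2d_via_im2col(x, weight):
    """weight: O x C x kh x kw nested lists. Returns O x out_h x out_w."""
    kh, kw = len(weight[0][0]), len(weight[0][0][0])
    cols, (out_h, out_w) = im2col(x, kh, kw)
    out = []
    for filt in weight:
        flat = [v for ch in filt for row in ch for v in row]    # (C*kh*kw,)
        vals = [sum(wv * col_row[p] for wv, col_row in zip(flat, cols))
                for p in range(out_h * out_w)]                  # one GEMM row
        out.append([vals[r * out_w:(r + 1) * out_w] for r in range(out_h)])
    return out


x = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]       # 1 channel, 3x3 image
w = [[[[1, 0], [0, 1]]]]                      # 1 filter, 1 channel, 2x2 kernel
print(conv2d_via_im2col(x, w))                # [[[6, 8], [12, 14]]]
```

Once the patches are laid out as columns, the convolution is a single matrix product between the flattened filters and the column matrix, which is why it can reuse the GEMM kernels.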

conv2d_planner is what user code should normally call: it asks the KernelSelector for the right variant given the device and shape, then dispatches. conv2d_planner chooses:

CONV2D_IMPLICIT_GEMM for the general case.
CONV2D_WINOGRAD for 3×3 stride-1 (unless NETCL_CONV_WINOGRAD=0).
CONV2D_TILED_LOCAL for small outputs when NETCL_CONV_TILED_LOCAL=1.
CONV2D_IM2COL for CPU devices.
CONV2D_IMPLICIT_GEMM with use_1x1_optimization=True for 1×1 convs.
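
The rules above can be read as a decision function. The sketch below is hypothetical: the function name, its parameters, the precedence between rules, and the "small output" threshold are all assumptions made for illustration, and the real planner also consults the KernelSelector's device profile.

```python
def choose_conv2d_variant(kernel_hw, stride, out_hw, is_cpu, env):
    """Return (variant_name, options) following the rules listed above.

    The ordering of the checks and the 16x16 "small output" cut-off are
    assumptions of this sketch, not documented netcl behavior.
    """
    if is_cpu:
        return "CONV2D_IM2COL", {}
    if kernel_hw == (1, 1):
        return "CONV2D_IMPLICIT_GEMM", {"use_1x1_optimization": True}
    winograd_on = env.get("NETCL_CONV_WINOGRAD") != "0"      # on unless disabled
    if kernel_hw == (3, 3) and stride == 1 and winograd_on:
        return "CONV2D_WINOGRAD", {}
    small = out_hw[0] * out_hw[1] <= 16 * 16                 # hypothetical threshold
    if small and env.get("NETCL_CONV_TILED_LOCAL") == "1":   # opt-in
        return "CONV2D_TILED_LOCAL", {}
    return "CONV2D_IMPLICIT_GEMM", {}                        # general case


print(choose_conv2d_variant((3, 3), 1, (56, 56), False, {}))
print(choose_conv2d_variant((3, 3), 1, (56, 56), False, {"NETCL_CONV_WINOGRAD": "0"}))
```

Passing the environment as a plain mapping keeps the function easy to test; a caller could pass os.environ directly.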

Permutations and Transposes

from netcl.ops.transpose import transpose2d
from netcl.ops.permute import permute

y = transpose2d(x)                # swaps the last two axes, e.g. (M, N) -> (N, M)
y = permute(x, axes=(0, 2, 3, 1)) # arbitrary permutation, e.g. NCHW -> NHWC

On the CPU backend, permute is shape-only: it produces a view with new strides and moves no data. On the OpenCL backend, it makes a copy through an out-of-place kernel.
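
A stride-only permutation can be illustrated without any tensor library. The helpers below are hypothetical names for this sketch; they compute the shape and element strides that a permuted view would carry, assuming a row-major contiguous source.

```python
def contiguous_strides(shape):
    """Row-major strides, in elements, for a densely packed tensor."""
    strides, step = [], 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def permute_view(shape, strides, axes):
    """Shape and strides of the permuted view; no element is moved."""
    return tuple(shape[a] for a in axes), tuple(strides[a] for a in axes)


shape = (2, 3, 4, 5)                       # N, C, H, W
strides = contiguous_strides(shape)        # (60, 20, 5, 1)
new_shape, new_strides = permute_view(shape, strides, (0, 2, 3, 1))
print(new_shape, new_strides)              # (2, 4, 5, 3) (60, 5, 1, 20)
```

The permuted view is no longer contiguous, since its innermost stride is 20 rather than 1. That is one plausible reason for a GPU backend to copy into a fresh contiguous buffer instead.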

`bmm`

Batched matrix multiplication. Inputs are 3-D tensors (B, M, K) and (B, K, N); output is (B, M, N).

from netcl.ops.bmm import bmm
y = bmm(a, b)

Broadcasting

broadcast_binary is the broadcasting-aware elementwise op used inside autograd/ops.py for any binary op whose two inputs do not share a shape.

from netcl.ops.broadcast import broadcast_binary
out = broadcast_binary(a, b, op="ADD")    # op is one of "ADD", "SUB", "MUL", "DIV"

The OpenCL backend materializes a broadcast index; the CPU backend uses NumPy broadcasting directly.
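
One way to picture a materialized broadcast index is as a table that maps every output element to the input element it reads. The sketch below builds such a table in plain Python; the function names are illustrative assumptions, and the layout netcl actually uses on the device is not described in this document.

```python
import itertools


def broadcast_shape(sa, sb):
    """NumPy-style broadcast of two shapes (right-aligned)."""
    n = max(len(sa), len(sb))
    sa = (1,) * (n - len(sa)) + tuple(sa)
    sb = (1,) * (n - len(sb)) + tuple(sb)
    out = []
    for da, db in zip(sa, sb):
        if da != db and 1 not in (da, db):
            raise ValueError(f"cannot broadcast {sa} with {sb}")
        out.append(db if da == 1 else da)
    return tuple(out)


def broadcast_index(out_shape, in_shape):
    """For each output element (row-major), the linear input index it reads."""
    pad = (1,) * (len(out_shape) - len(in_shape)) + tuple(in_shape)
    # Stride 0 on broadcast axes makes every position along them reuse one element.
    strides, step = [], 1
    for dim in reversed(pad):
        strides.append(0 if dim == 1 else step)
        step *= dim
    strides.reverse()
    return [sum(c * s for c, s in zip(coord, strides))
            for coord in itertools.product(*(range(d) for d in out_shape))]


a, b = [1, 2, 3, 4, 5, 6], [10, 20, 30]          # shapes (2, 3) and (3,)
shape = broadcast_shape((2, 3), (3,))
ia, ib = broadcast_index(shape, (2, 3)), broadcast_index(shape, (3,))
print([a[i] + b[j] for i, j in zip(ia, ib)])      # [11, 22, 33, 14, 25, 36]
```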

`jit_fusion`

netcl.ops.jit_fusion is the symbolic-fusion layer: it walks a list of compatible elementwise ops and emits a single OpenCL kernel that performs the whole chain. This is the mechanism that turns a * b + c into one launch instead of two.

from netcl.ops.jit_fusion import fuse_chain
fused = fuse_chain([(a, b, "MUL"), (c, None, "ADD")])

The result is a Tensor plus, optionally, a backward closure if every operation in the chain is registered with autograd.
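
To show what "one kernel for the whole chain" means in practice, the sketch below generates OpenCL source for a chain of binary elementwise ops. It is a hypothetical generator written for this example: the function name, its arguments, and the emitted kernel signature are assumptions, and it does not reproduce the fuse_chain interface. It also assumes each buffer name appears only once in the chain.

```python
_OPS = {"ADD": "+", "SUB": "-", "MUL": "*", "DIV": "/"}


def fused_kernel_source(first, chain, name="fused_chain"):
    """Emit one OpenCL kernel computing `first (op operand) (op operand) ...`."""
    inputs = [first] + [operand for _, operand in chain]
    expr = f"{first}[i]"
    for op, operand in chain:
        # Each step wraps the running expression, so no intermediate buffer exists.
        expr = f"({expr} {_OPS[op]} {operand}[i])"
    params = ", ".join(f"__global const float* {n}" for n in inputs)
    return (
        f"__kernel void {name}({params}, __global float* out, const int n) {{\n"
        f"    int i = get_global_id(0);\n"
        f"    if (i < n) out[i] = {expr};\n"
        f"}}\n"
    )


src = fused_kernel_source("a", [("MUL", "b"), ("ADD", "c")])
print(src)
```

The generated body computes ((a[i] * b[i]) + c[i]) per element, so the intermediate product never leaves registers and only one launch is needed.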

Fused Ops

netcl.ops.fused_ops collects the pre-fused kernels that combine a conv with its activation and normalization, the BN+ReLU fusion, and similar layer-level compositions. These are dispatched by the JIT Compiler when the model traversal detects the right pattern.

from netcl.ops.fused_ops import conv2d_relu_bn, batch_norm2d_relu

x = conv2d_relu_bn(x, w, b, gamma, beta, mean, var, eps=1e-5)
x = batch_norm2d_relu(x, gamma, beta, mean, var, eps=1e-5)

For the full layer-level fusion table (which includes the linear_relu, conv2d_relu, and add_relu that the nn API exposes), see Fused Ops.

Example: Fused Conv+ReLU+BN

from netcl.ops.conv2d_planner import conv2d_planner as conv2d
from netcl.ops.elementwise import bias_add, relu
from netcl.ops.fused_ops import conv2d_relu_bn

# As separate kernels
x = conv2d(x, w, padding=1)
x = bias_add(x, b)
x = relu(x)
# x = batch_norm2d(x, gamma, beta, running_mean, running_var, eps=1e-5)

# As one kernel
x = conv2d_relu_bn(x, w, b, gamma, beta, mean, var, eps=1e-5)

The fused form issues one kernel launch and one Tensor allocation per call. On modern GPUs this is significantly cheaper than the chain of separate kernels, typically by 20–30% on ResNet stages.

Elementwise Token Reference

The elementwise_binary and elementwise_unary compilers share a token vocabulary. This is the complete list of tokens recognized at the time of writing:

Category	Tokens
Arithmetic	`ADD`, `SUB`, `MUL`, `DIV`, `POW`, `NEG`
Math	`SIN`, `COS`, `TAN`, `ASIN`, `ACOS`, `ATAN`, `ATAN2`, `EXP`, `LOG`, `LOG2`, `LOG10`, `SQRT`, `RSQRT`
Comparisons	`CMP_EQ`, `CMP_NEQ`, `CMP_LT`, `CMP_LE`, `CMP_GT`, `CMP_GE`
Min/Max (elementwise)	`MAX`, `MIN`
Activations	`RELU`, `LEAKY_RELU`, `SIGMOID`, `TANH`, `GELU`, `SWISH`, `ELU`, `SOFTPLUS`, `HARD_SIGMOID`, `HARD_SWISH`, `HARD_TANH`
Constants	numeric literals (`1.0f`, `2`, `0.5`, …)

If a token is not in the table, the op falls back to a generic single-expression kernel that uses the token verbatim (so custom C expressions are still possible, but lose the helper macros).
