netcl.ops — Elementary Operations

netcl.ops is the library of single-kernel GPU operations. It is the lowest user-facing layer above the Tensor Backend: every op consumes and produces Tensor objects, and most ops go through the JIT Compiler so that the generated OpenCL source is cached per shape and dtype.

Note — Top-level re-exports. netcl/ops/__init__.py is empty in the current code. Eight names — matmul, build_matmul_kernel, elementwise_binary, relu, bias_add, reduce_sum, softmax, conv2d — are surfaced from the root netcl/__init__.py and can be imported as from netcl import matmul, …. Everything else has to come from the explicit submodule:

from netcl.ops.elementwise import elementwise_unary
from netcl.ops.fused_ops import conv2d_relu_bn
from netcl.ops.reduction import reduce_mean

The table below maps every documented op to its source file and import path.

Op Index

| Op | Submodule / import path | One-line description |
| --- | --- | --- |
| matmul | netcl.ops.matmul (root: netcl.matmul) | GEMM, auto-tuned via KernelSelector. |
| matmul_optimized | netcl.ops.matmul_optimized | Register-tiled GEMM with __local and __private buffers. |
| build_matmul_kernel | netcl.ops.matmul (root: netcl.build_matmul_kernel) | Returns the generated OpenCL source for a custom GEMM kernel. |
| elementwise_binary | netcl.ops.elementwise (root: netcl.elementwise_binary) | Two-arg OpenCL-like expression, JIT-compiled. |
| elementwise_unary | netcl.ops.elementwise | One-arg expression (sin, cos, exp, log, …). |
| elementwise_optimized | netcl.ops.elementwise_optimized | Vectorized paths (float4 / float8) for large elementwise workloads. |
| relu | netcl.ops.elementwise (root: netcl.relu) | Elementwise ReLU. |
| leaky_relu, gelu, swish, sigmoid, tanh, elu, softplus, prelu, clamp, hard_* | netcl.ops.elementwise | Other activations. |
| bias_add | netcl.ops.elementwise (root: netcl.bias_add) | Broadcasting bias addition. |
| reduce_sum | netcl.ops.reduction (root: netcl.reduce_sum) | Sum reduction. |
| reduce_mean | netcl.ops.reduction | Mean reduction. |
| softmax | netcl.ops.softmax (root: netcl.softmax) | Numerically stable softmax (max-subtract before exp). |
| softmax_fused | netcl.ops.softmax_fused | Single-kernel max-subtract → exp → /sum. |
| softmax_fp16 | netcl.ops.softmax_fp16 | fp16-specialized softmax (uses cl_khr_fp16). |
| conv2d | netcl.ops.conv2d (root: netcl.conv2d) | Vanilla 2D conv (im2col + matmul). |
| conv2d_optimized | netcl.ops.conv2d_optimized | Tiled + vectorized conv2d. |
| conv2d_planner | netcl.ops.conv2d_planner | Runtime selection of the conv2d strategy. |
| conv2d_cpu | netcl.ops.conv2d_cpu | NumPy CPU fallback for conv2d. |
| depthwise_conv2d | netcl.ops.depthwise_conv2d | Depthwise-separable 2D conv. |
| conv_transpose2d | netcl.ops.conv_transpose2d | Transposed conv (fractionally strided). |
| transpose2d | netcl.ops.transpose | 2D transpose. |
| permute | netcl.ops.permute | Arbitrary axis permutation. |
| bmm | netcl.ops.bmm | Batched matrix multiplication. |
| broadcast_binary | netcl.ops.broadcast | Elementwise op with NumPy-style broadcasting. |
| im2col | netcl.ops.im2col | Helper that turns a conv2d into a matmul. |
| winograd_fused | netcl.ops.winograd_fused | Winograd F(2×2, 3×3) for 3×3 stride-1 convs. |
| jit_fusion | netcl.ops.jit_fusion | Symbolic kernel fusion across multiple ops. |
| fused_ops | netcl.ops.fused_ops | Collection of pre-fused kernels (BN+ReLU, Conv+ReLU+BN, …). |

matmul

General matrix multiplication, auto-tuned through the KernelSelector.

from netcl.ops.matmul import matmul
# a: (M, K), b: (K, N) -> c: (M, N)
c = matmul(a, b)

At call time matmul queries the KernelSelector for an appropriate variant. The selector considers the device profile (subgroups, fp16, vendor), the local memory budget, and the total op count, then dispatches to one of:

  • MATMUL_NAIVE — one work-item per output element; used only for tiny shapes.
  • MATMUL_TILED — workgroup-level tile, good for small/medium.
  • MATMUL_REGISTER_TILED — register-level tile, with vendor-specific tuning for NVIDIA Ampere/Turing and AMD RDNA2/3.
  • MATMUL_VECTORIZED — vector loads/stores when the device has a wide preferred float width and the shape is aligned.
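
To make the difference between the naive and tiled variants concrete, the following is a plain-Python sketch, not netcl's kernels. The function names and the tile size are illustrative assumptions; the point is only that walking the K dimension in tiles (as a workgroup would when staging slices in local memory) produces the same result as the one-element-per-work-item loop.

```python
def matmul_naive(a, b):
    """One output element per 'work-item': c[i][j] = sum_p a[i][p] * b[p][j]."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


def matmul_tiled(a, b, tile=2):
    """Same result, but the loops are blocked into tile x tile chunks.

    Each (i0, j0) block plays the role of a workgroup; each p0 step is one
    tile of the K dimension that a real kernel would stage in local memory.
    """
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            for p0 in range(0, k, tile):
                for i in range(i0, min(i0 + tile, m)):
                    for j in range(j0, min(j0 + tile, n)):
                        acc = c[i][j]
                        for p in range(p0, min(p0 + tile, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c


a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]      # shape (2, 3)
b = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]    # shape (3, 2)
c_naive = matmul_naive(a, b)                # [[7.0, 8.0], [16.0, 17.0]]
c_tiled = matmul_tiled(a, b, tile=2)        # identical to c_naive
```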

matmul_optimized (in netcl.ops.matmul_optimized) is a more aggressive implementation that uses __local memory explicitly and is selected by the planner on larger GEMMs.

build_matmul_kernel(M, N, K, dtype, tile_m, tile_n, tile_k, …) returns the generated OpenCL source for a custom-tiled GEMM kernel, useful when you want to compile a kernel once and reuse it many times.

elementwise_binary

The most flexible op: you supply a small OpenCL-like expression and the op builds a specialized kernel for it.

from netcl.ops.elementwise import elementwise_binary
out = elementwise_binary(a, b, expression="MUL(ADD(v0, v1), 2.0f)")

| Token | Meaning |
| --- | --- |
| v0 | First argument |
| v1 | Second argument |
| ADD(v0, v1) | v0 + v1 |
| SUB(v0, v1) | v0 - v1 |
| MUL(v0, v1) | v0 * v1 |
| DIV(v0, v1) | v0 / v1 |
| MAX(v0, v1) | max(v0, v1) |
| MIN(v0, v1) | min(v0, v1) |
| CMP_EQ / CMP_LT / CMP_GT | Comparisons |
| POW(v0, v1) | pow(v0, v1) |

The expression is parsed into an OpenCL AST and compiled via the JIT Compiler, so there is no Python-level per-element loop. For unary operations, use elementwise_unary.
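
As a rough illustration of what "parsed and compiled" could mean here, the sketch below lowers a token expression to an OpenCL C expression string with a small recursive-descent parser. It is not netcl's parser: the BINARY table, the function names, and the a[gid] / b[gid] naming for v0 / v1 are all assumptions made for the example.

```python
import re

# Hypothetical token -> C template table (a subset of the documented tokens).
BINARY = {
    "ADD": "({0} + {1})", "SUB": "({0} - {1})",
    "MUL": "({0} * {1})", "DIV": "({0} / {1})",
    "MAX": "fmax({0}, {1})", "MIN": "fmin({0}, {1})",
    "POW": "pow({0}, {1})",
}
TOKEN_RE = re.compile(r"\s*([A-Za-z_]\w*|\d+\.?\d*f?|[(),])")


def tokenize(text):
    """Split the expression into names, numeric literals, and punctuation."""
    pos, out = 0, []
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if not m:
            raise ValueError(f"unexpected character at position {pos}")
        out.append(m.group(1))
        pos = m.end()
    return out


def expect(tokens, wanted):
    got = tokens.pop(0) if tokens else None
    if got != wanted:
        raise ValueError(f"expected {wanted!r}, got {got!r}")


def lower(tokens):
    """expr := NAME '(' expr ',' expr ')' | 'v0' | 'v1' | literal"""
    if not tokens:
        raise ValueError("unexpected end of expression")
    tok = tokens.pop(0)
    if tok in BINARY:
        expect(tokens, "(")
        lhs = lower(tokens)
        expect(tokens, ",")
        rhs = lower(tokens)
        expect(tokens, ")")
        return BINARY[tok].format(lhs, rhs)
    if tok == "v0":
        return "a[gid]"
    if tok == "v1":
        return "b[gid]"
    return tok  # numeric literal (or unknown token) passes through verbatim


def to_opencl(expression):
    tokens = tokenize(expression.strip())
    body = lower(tokens)
    if tokens:
        raise ValueError(f"trailing tokens: {tokens}")
    return body


src = to_opencl("MUL(ADD(v0, v1), 2.0f)")   # "((a[gid] + b[gid]) * 2.0f)"
```

The resulting string would be dropped into a per-element kernel body, which is why no Python-level loop over elements is needed.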

elementwise_unary

One-argument expressions for sin, cos, exp, log, log2, sqrt, abs, neg, and the various activation functions.

from netcl.ops.elementwise import elementwise_unary
y = elementwise_unary(x, expression="EXP(v0)")

elementwise_optimized

The vectorized path in netcl.ops.elementwise_optimized. Used when the tensor size is large and the element count is a multiple of the preferred vector width (4 or 8). The KernelSelector routes large elementwise workloads here automatically.
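
The routing rule above can be sketched as a small predicate. This is an assumed reading, not the KernelSelector's code: the function name and the min_elements threshold are hypothetical, since the text only says "large" and "a multiple of the preferred vector width (4 or 8)".

```python
def pick_vector_width(n_elements, preferred=(8, 4), min_elements=1 << 16):
    """Return 8 or 4 when a vectorized path applies, else 1 (scalar path).

    min_elements is a placeholder for "large"; the real cutoff is not
    documented on this page.
    """
    if n_elements < min_elements:
        return 1
    for width in preferred:          # try the widest vector type first
        if n_elements % width == 0:
            return width
    return 1


width_large = pick_vector_width(1 << 20)          # 8 (float8 path)
width_mult4 = pick_vector_width((1 << 16) + 4)    # 4 (float4 path)
width_small = pick_vector_width(100)              # 1 (too small, scalar)
```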

Activations

netcl.ops.elementwise exports a one-liner for every standard activation. They are all differentiable — autograd wrappers live in autograd/ops.py.

from netcl.ops.elementwise import (
    relu, leaky_relu, gelu, swish, sigmoid, tanh,
    elu, softplus, prelu, clamp, hard_sigmoid, hard_swish, hard_tanh,
)
y = relu(x)
y = gelu(x)
y = clamp(x, 0.0, 6.0)         # min/max pair

The relu symbol is also re-exported at the package root: from netcl import relu.

bias_add

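Broadcasting bias addition. Takes an N-dimensional input and a bias whose shape broadcasts against it: a (F,) bias is broadcast across the leading batch dimension, and a (C, 1, 1) bias is broadcast across the batch and spatial dimensions.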

from netcl.ops.elementwise import bias_add
y = bias_add(x, b)        # x: (B, F),  b: (F,)   -> (B, F)
y = bias_add(x, b)        # x: (B, C, H, W),  b: (C, 1, 1)  -> (B, C, H, W)

On the autograd path this becomes a differentiable op via autograd/ops.py.

Reductions

from netcl.ops.reduction import reduce_sum, reduce_mean

y = reduce_sum(x, axis=-1)            # sum along the last axis
y = reduce_mean(x, axis=(0, 2, 3))    # mean across three axes

The planner picks REDUCTION_SEQUENTIAL (tiny), REDUCTION_PARALLEL (medium), or REDUCTION_WORKGROUP (large, multi-stage). reduce_sum is also re-exported at the package root.
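
The "multi-stage" idea behind a workgroup reduction can be shown in a few lines of plain Python. This is a conceptual sketch only; the function name and workgroup_size are assumptions, and a real kernel would run each slice in parallel rather than in a list comprehension.

```python
def reduce_sum_multistage(values, workgroup_size=4):
    """Stage 1: every 'workgroup' sums its own slice into one partial sum.
    Later stages: the partial sums are reduced the same way until one is left.
    """
    if workgroup_size < 2:
        raise ValueError("workgroup_size must be at least 2")
    data = list(values)
    if not data:
        return 0.0
    while len(data) > 1:
        data = [sum(data[i:i + workgroup_size])
                for i in range(0, len(data), workgroup_size)]
    return data[0]


total = reduce_sum_multistage(range(1, 11), workgroup_size=4)   # 55
```

With ten inputs and a workgroup size of 4, the first stage produces three partial sums (10, 26, 19) and the second stage reduces them to the final value.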

softmax

from netcl.ops.softmax import softmax
y = softmax(x, axis=-1)            # numerically stable

The reference implementation does a max-subtract in a separate pass. softmax_fused collapses the entire op into a single kernel, which is what the JIT Compiler selects when the input is large enough to amortize the launch overhead. softmax_fp16 is the fp16-specialized variant for devices that advertise cl_khr_fp16 (see the core capability probe).
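
The reason for the max-subtract step is easiest to see in a scalar reference. The sketch below is plain Python, independent of netcl, with illustrative function names: subtracting the row maximum leaves the result mathematically unchanged but keeps every exponent at or below zero, so exp cannot overflow.

```python
import math


def softmax_stable(row):
    """Softmax with the row maximum subtracted before exponentiation."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]    # every exponent is <= 0
    total = sum(exps)
    return [e / total for e in exps]


def softmax_naive(row):
    """Textbook formula; math.exp overflows once inputs get large."""
    exps = [math.exp(v) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax_stable([1000.0, 1001.0, 1002.0])

try:
    softmax_naive([1000.0, 1001.0, 1002.0])
    naive_overflowed = False
except OverflowError:
    naive_overflowed = True
```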

conv2d

The full 2D convolution family is spread across several files:

from netcl.ops.conv2d import conv2d                 # vanilla im2col + matmul
from netcl.ops.conv2d_optimized import conv2d_optimized
from netcl.ops.conv2d_cpu import conv2d_cpu          # NumPy fallback
from netcl.ops.conv2d_planner import conv2d_planner  # runtime strategy chooser
from netcl.ops.depthwise_conv2d import depthwise_conv2d
from netcl.ops.conv_transpose2d import conv_transpose2d
from netcl.ops.im2col import im2col                  # helper
from netcl.ops.winograd_fused import winograd_fused

conv2d_planner is what user code should normally call: it asks the KernelSelector for the right variant given the device and shape, then dispatches. conv2d_planner chooses:

  • CONV2D_IMPLICIT_GEMM for the general case.
  • CONV2D_WINOGRAD for 3×3 stride-1 (unless NETCL_CONV_WINOGRAD=0).
  • CONV2D_TILED_LOCAL for small outputs when NETCL_CONV_TILED_LOCAL=1.
  • CONV2D_IM2COL for CPU devices.
  • CONV2D_IMPLICIT_GEMM with use_1x1_optimization=True for 1×1 convs.
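
The selection rules above could be expressed as a decision function like the one below. This is a sketch under assumptions, not the planner's source: the function name, the order in which the rules are checked, and the small_output_limit cutoff for "small outputs" are all hypothetical. Only the strategy names and the two environment variables come from the list above.

```python
import os


def choose_conv2d_strategy(kernel_hw, stride, out_hw, is_cpu,
                           env=None, small_output_limit=8):
    """Return (strategy_name, options) for one conv2d call.

    The precedence (CPU, then 1x1, then Winograd, then tiled-local) is an
    assumption made for this example.
    """
    env = os.environ if env is None else env
    if is_cpu:
        return "CONV2D_IM2COL", {}
    if kernel_hw == (1, 1):
        return "CONV2D_IMPLICIT_GEMM", {"use_1x1_optimization": True}
    winograd_on = env.get("NETCL_CONV_WINOGRAD", "1") != "0"
    if kernel_hw == (3, 3) and stride == 1 and winograd_on:
        return "CONV2D_WINOGRAD", {}
    tiled_on = env.get("NETCL_CONV_TILED_LOCAL") == "1"
    if tiled_on and max(out_hw) <= small_output_limit:
        return "CONV2D_TILED_LOCAL", {}
    return "CONV2D_IMPLICIT_GEMM", {}


# Passing env explicitly keeps the example independent of the real environment.
default_3x3 = choose_conv2d_strategy((3, 3), 1, (56, 56), False, env={})
no_winograd = choose_conv2d_strategy((3, 3), 1, (56, 56), False,
                                     env={"NETCL_CONV_WINOGRAD": "0"})
pointwise = choose_conv2d_strategy((1, 1), 1, (56, 56), False, env={})
```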

Permutations and Transposes

from netcl.ops.transpose import transpose2d
from netcl.ops.permute import permute

y = transpose2d(x)                # swaps the last two axes
y = permute(x, axes=(0, 2, 3, 1)) # arbitrary permutation, e.g. NCHW -> NHWC

The cost of permute depends on the backend. On the CPU backend it is shape-only: it produces a view with new strides and moves no data. On the OpenCL backend it makes a copy via an out-of-place kernel.
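
A strided view works because permuting axes only reorders the (shape, strides) pairs; the flat buffer is untouched. The helper names below are illustrative and strides are counted in elements, not bytes.

```python
def contiguous_strides(shape):
    """Row-major strides, in elements, for a densely packed tensor."""
    strides, step = [], 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def permute_view(shape, strides, axes):
    """Reorder shape and strides together; no data is moved."""
    return (tuple(shape[a] for a in axes),
            tuple(strides[a] for a in axes))


def offset(index, strides):
    """Flat buffer offset of a multi-dimensional index."""
    return sum(i * s for i, s in zip(index, strides))


shape = (2, 3, 4, 5)                         # N, C, H, W
strides = contiguous_strides(shape)          # (60, 20, 5, 1)
new_shape, new_strides = permute_view(shape, strides, (0, 2, 3, 1))
# new_shape == (2, 4, 5, 3), new_strides == (60, 5, 1, 20)

# Element (n=1, c=2, h=3, w=4) is element (n=1, h=3, w=4, c=2) in the view
# and resolves to the same flat offset in the underlying buffer.
same_offset = offset((1, 2, 3, 4), strides) == offset((1, 3, 4, 2), new_strides)
```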

bmm

Batched matrix multiplication. Inputs are 3-D tensors (B, M, K) and (B, K, N); output is (B, M, N).

from netcl.ops.bmm import bmm
y = bmm(a, b)

Broadcasting

broadcast_binary is the broadcasting-aware elementwise op used inside autograd/ops.py for any binary op whose two inputs do not share a shape.

from netcl.ops.broadcast import broadcast_binary
out = broadcast_binary(a, b, op="ADD")    # op is one of "ADD", "SUB", "MUL", "DIV"

The OpenCL backend materializes a broadcast index; the CPU backend uses NumPy broadcasting directly.
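
One way to picture a "broadcast index" is as a mapping from each flat output position to the flat position it reads in the smaller input. The sketch below follows NumPy's broadcasting rules in plain Python; the function names are illustrative and this is not claimed to be how netcl builds its index.

```python
def broadcast_shape(a_shape, b_shape):
    """Result shape under NumPy-style rules (align right, size 1 stretches)."""
    rank = max(len(a_shape), len(b_shape))
    a = (1,) * (rank - len(a_shape)) + tuple(a_shape)
    b = (1,) * (rank - len(b_shape)) + tuple(b_shape)
    out = []
    for x, y in zip(a, b):
        if x != y and 1 not in (x, y):
            raise ValueError(f"shapes {a_shape} and {b_shape} do not broadcast")
        out.append(max(x, y))
    return tuple(out)


def source_index(out_index, out_shape, src_shape):
    """Map a flat output index to the flat index it reads from the source."""
    rank = len(out_shape)
    src = (1,) * (rank - len(src_shape)) + tuple(src_shape)
    flat, stride = 0, 1
    for dim in range(rank - 1, -1, -1):
        coord = out_index % out_shape[dim]
        out_index //= out_shape[dim]
        if src[dim] != 1:            # stretched axes always read coordinate 0
            flat += coord * stride
        stride *= src[dim]
    return flat


out_shape = broadcast_shape((2, 3), (3,))                    # (2, 3)
bias_index = [source_index(i, out_shape, (3,)) for i in range(6)]
# bias_index == [0, 1, 2, 0, 1, 2]
```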

jit_fusion

netcl.ops.jit_fusion is the symbolic-fusion layer: it walks a list of compatible elementwise ops and emits a single OpenCL kernel that performs the whole chain. This is the mechanism that turns a * b + c into one launch instead of two.

from netcl.ops.jit_fusion import fuse_chain
fused = fuse_chain([(a, b, "MUL"), (c, None, "ADD")])

The result is a Tensor plus, optionally, a backward closure if every operation in the chain is registered with autograd.
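
To illustrate what "emits a single OpenCL kernel" could look like, the sketch below turns a chain description into kernel source text. It is hypothetical throughout: emit_fused_kernel is not a netcl function, operands are plain buffer names rather than Tensors, and it assumes a step whose second operand is None combines its first operand with the previous step's result.

```python
OPS = {"ADD": "+", "SUB": "-", "MUL": "*", "DIV": "/"}


def emit_fused_kernel(chain, name="fused_chain"):
    """chain: list of (lhs, rhs, op); rhs=None means 'use the previous result'."""
    inputs, expr = [], None
    for lhs, rhs, op in chain:
        for operand in (lhs, rhs):
            if operand is not None and operand not in inputs:
                inputs.append(operand)       # each buffer becomes one argument
        right = f"{rhs}[gid]" if rhs is not None else expr
        if right is None:
            raise ValueError("the first step needs two operands")
        expr = f"({lhs}[gid] {OPS[op]} {right})"
    params = ", ".join(f"__global const float* {n}" for n in inputs)
    return (
        f"__kernel void {name}({params}, __global float* out) {{\n"
        f"    const size_t gid = get_global_id(0);\n"
        f"    out[gid] = {expr};\n"
        f"}}\n"
    )


kernel_src = emit_fused_kernel([("a", "b", "MUL"), ("c", None, "ADD")])
```

The intermediate product a * b never reaches global memory; it exists only as a subexpression inside the one generated kernel.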

Fused Ops

netcl.ops.fused_ops collects the pre-fused kernels that combine a conv with its activation and normalization, the BN+ReLU fusion, and similar layer-level compositions. These are dispatched by the JIT Compiler when the model traversal detects the right pattern.

from netcl.ops.fused_ops import conv2d_relu_bn, batch_norm2d_relu

x = conv2d_relu_bn(x, w, b, gamma, beta, mean, var, eps=1e-5)
x = batch_norm2d_relu(x, gamma, beta, mean, var, eps=1e-5)

For the full layer-level fusion table (which includes the linear_relu, conv2d_relu, and add_relu that the nn API exposes), see Fused Ops.

Example: Fused Conv+ReLU+BN

from netcl.ops.conv2d_planner import conv2d_planner as conv2d
from netcl.ops.elementwise import bias_add, relu
from netcl.ops.fused_ops import conv2d_relu_bn

# As separate kernels
x = conv2d(x, w, padding=1)
x = bias_add(x, b)
x = relu(x)
# x = batch_norm2d(x, gamma, beta, running_mean, running_var, eps=1e-5)

# As one kernel
x = conv2d_relu_bn(x, w, b, gamma, beta, mean, var, eps=1e-5)

The fused form needs one kernel launch and one Tensor allocation per call. On modern GPUs this is cheaper than the unfused chain above, typically by 20–30% on ResNet-style stages.
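
The saving comes from the epilogue: bias, ReLU, and batch norm are all per-element, so they can be applied in one pass over the conv output instead of three. The scalar reference below shows the equivalence for a single channel. It is a sketch, not the fused kernel, and it assumes the order bias, then ReLU, then batch norm, as the op name conv2d_relu_bn suggests.

```python
import math


def epilogue_separate(conv_out, bias, gamma, beta, mean, var, eps=1e-5):
    """Three passes, each producing an intermediate list (one per kernel)."""
    x = [v + bias for v in conv_out]                                  # bias_add
    x = [max(v, 0.0) for v in x]                                      # relu
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in x]


def epilogue_fused(conv_out, bias, gamma, beta, mean, var, eps=1e-5):
    """One pass: batch norm is folded into a per-channel scale and shift."""
    scale = gamma / math.sqrt(var + eps)
    shift = beta - mean * scale
    return [max(v + bias, 0.0) * scale + shift for v in conv_out]


conv_out = [-1.0, 0.5, 2.0]          # one channel's conv outputs
args = dict(bias=0.25, gamma=1.5, beta=0.1, mean=0.2, var=4.0)
separate = epilogue_separate(conv_out, **args)
fused = epilogue_fused(conv_out, **args)
max_diff = max(abs(s - f) for s, f in zip(separate, fused))   # rounding only
```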

Elementwise Token Reference

The elementwise_binary and elementwise_unary compilers share a token vocabulary. This is the complete list of tokens recognized at the time of writing:

| Category | Tokens |
| --- | --- |
| Arithmetic | ADD, SUB, MUL, DIV, POW, NEG |
| Math | SIN, COS, TAN, ASIN, ACOS, ATAN, ATAN2, EXP, LOG, LOG2, LOG10, SQRT, RSQRT |
| Comparisons | CMP_EQ, CMP_NEQ, CMP_LT, CMP_LE, CMP_GT, CMP_GE |
| Reductions | MAX, MIN |
| Activations | RELU, LEAKY_RELU, SIGMOID, TANH, GELU, SWISH, ELU, SOFTPLUS, HARD_SIGMOID, HARD_SWISH, HARD_TANH |
| Constants | numeric literals (1.0f, 2, 0.5, …) |

If a token is not in the table, the op falls back to a generic single-expression kernel that uses the token verbatim (so custom C expressions are still possible, but lose the helper macros).

See also