Conv2d
Status: Public API in netcl.nn.layers.Conv2d and netcl.ops.conv2d.conv2d
A 2D convolution is the workhorse spatial operator of computer vision.
In netcl, Conv2d exists at two layers:
- The high-level nn.Conv2d module: an nn.Module that wraps a learnable weight tensor and a learnable bias tensor, exposes a forward function, and participates in Tape autograd.
- The low-level ops.conv2d.conv2d function: a stateless function that takes two input tensors and a config object, picks a kernel strategy, dispatches to OpenCL, and returns the output tensor.
The high-level module is what the vast majority of user code uses. The low-level function exists for the JIT, for custom kernels, and for ResNet (which sometimes wants a 1x1 conv with no bias, a config the module supports but does not expose as a default).
Overview
Conv2d performs (shown for groups = 1):
out[b, c_out, h_out, w_out] = bias[c_out] + sum_{c_in, kh, kw}
    weight[c_out, c_in, kh, kw]
    * x[b, c_in, h_out * stride_h + kh * dilation_h - pad_h,
                 w_out * stride_w + kw * dilation_w - pad_w]
with h_out = floor((h_in + 2*pad_h - dilation_h * (kernel_h - 1) - 1) / stride_h) + 1
(and the analogous formula for w_out). Input positions that fall in the padded border contribute zero, assuming zero padding.
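As a cross-check of the formula, here is a minimal pure-Python sketch that evaluates it with a direct loop nest. It is not netcl code: the names conv_out_size and conv2d_reference are hypothetical, it assumes groups = 1, zero padding, and equal stride, padding, and dilation on both axes, and it applies the dilation factor to the kernel offset so that the index arithmetic agrees with the output-size formula.

```python
def conv_out_size(size_in, kernel, stride=1, pad=0, dilation=1):
    """Output extent along one spatial axis (integer floor division)."""
    return (size_in + 2 * pad - dilation * (kernel - 1) - 1) // stride + 1


def conv2d_reference(x, weight, bias=None, stride=1, pad=0, dilation=1):
    """Direct loop-nest 2D convolution on nested lists (groups == 1).

    x:      [B][C_in][H][W]
    weight: [C_out][C_in][KH][KW]
    bias:   [C_out] or None
    """
    B, C_in, H, W = len(x), len(x[0]), len(x[0][0]), len(x[0][0][0])
    C_out, KH, KW = len(weight), len(weight[0][0]), len(weight[0][0][0])
    H_out = conv_out_size(H, KH, stride, pad, dilation)
    W_out = conv_out_size(W, KW, stride, pad, dilation)
    out = [[[[0.0] * W_out for _ in range(H_out)]
            for _ in range(C_out)] for _ in range(B)]
    for b in range(B):
        for co in range(C_out):
            for ho in range(H_out):
                for wo in range(W_out):
                    acc = bias[co] if bias is not None else 0.0
                    for ci in range(C_in):
                        for kh in range(KH):
                            hi = ho * stride + kh * dilation - pad
                            if hi < 0 or hi >= H:
                                continue  # tap lands in the zero padding
                            for kw in range(KW):
                                wi = wo * stride + kw * dilation - pad
                                if 0 <= wi < W:
                                    acc += weight[co][ci][kh][kw] * x[b][ci][hi][wi]
                    out[b][co][ho][wo] = acc
    return out
```

A reference like this is far too slow for real workloads, but it is handy for checking the output of an optimized kernel on small inputs.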
netcl's Conv2d runs the convolution on the OpenCL device. The
implementation supports several strategies (CONV2D_NAIVE,
CONV2D_IM2COL, CONV2D_IMPLICIT_GEMM, CONV2D_TILED_LOCAL,
CONV2D_WINOGRAD) and the kernel selector picks the best one for
the input shape and the device profile.
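To give a feel for what an im2col-style strategy does in principle, the sketch below unfolds one image into a matrix and reduces the convolution to a row-by-column product. This is an illustration in pure Python, not the CONV2D_IM2COL kernel itself; the function names are hypothetical, and it assumes a single image, groups = 1, dilation = 1, and zero padding.

```python
def im2col(x, kh, kw, stride=1, pad=0):
    """Unfold one image [C][H][W] into (C*kh*kw) rows by (H_out*W_out) columns.

    Taps that fall outside the image are filled with zeros.
    """
    C, H, W = len(x), len(x[0]), len(x[0][0])
    h_out = (H + 2 * pad - kh) // stride + 1
    w_out = (W + 2 * pad - kw) // stride + 1
    cols = []
    for c in range(C):
        for i in range(kh):
            for j in range(kw):
                row = []
                for ho in range(h_out):
                    for wo in range(w_out):
                        hi = ho * stride + i - pad
                        wi = wo * stride + j - pad
                        inside = 0 <= hi < H and 0 <= wi < W
                        row.append(x[c][hi][wi] if inside else 0.0)
                cols.append(row)
    return cols, h_out, w_out


def conv2d_im2col(x, weight, stride=1, pad=0):
    """Convolve one image with weight [C_out][C_in][KH][KW] via im2col + matmul."""
    kh, kw = len(weight[0][0]), len(weight[0][0][0])
    cols, h_out, w_out = im2col(x, kh, kw, stride, pad)
    out = []
    for filt in weight:
        # Flatten the filter in the same (c, i, j) order used for the rows of cols.
        flat = [v for channel in filt for row in channel for v in row]
        plane = [sum(f * col_row[p] for f, col_row in zip(flat, cols))
                 for p in range(h_out * w_out)]
        out.append([plane[r * w_out:(r + 1) * w_out] for r in range(h_out)])
    return out
```

The trade-off is memory: the unfolded matrix repeats each input element up to kh * kw times, which is the usual reason a selector might prefer an implicit-GEMM or tiled variant for large inputs.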
Where It Lives
- File path: nn/layers.py (class Conv2d), ops/conv2d.py (the low-level function), ops/conv2d_planner.py (the strategy selector).
- Module path: netcl.nn.Conv2d (high-level), netcl.ops.conv2d (low-level).
- Sibling ops: ops/conv_transpose2d, ops/depthwise_conv2d, ops/winograd_fused.
How It Works
The nn.Conv2d constructor allocates two parameters: a weight of
shape (out_channels, in_channels // groups, kernel_h, kernel_w)
and a bias of shape (out_channels,) (or None if bias=False).
Both are requires_grad=True tensors and are picked up by the
optimizer.
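The shape rule can be checked with a few lines of plain Python. The helpers below are hypothetical and only mirror the shapes stated above; the divisibility check on groups is an assumption borrowed from common framework behavior, not something confirmed for netcl.

```python
def conv2d_param_shapes(in_channels, out_channels, kernel_size, groups=1, bias=True):
    """Return (weight_shape, bias_shape) following the shapes described above."""
    # Assumption: both channel counts must divide evenly into groups.
    if in_channels % groups or out_channels % groups:
        raise ValueError("in_channels and out_channels must be divisible by groups")
    if isinstance(kernel_size, int):
        kernel_h = kernel_w = kernel_size
    else:
        kernel_h, kernel_w = kernel_size
    weight_shape = (out_channels, in_channels // groups, kernel_h, kernel_w)
    bias_shape = (out_channels,) if bias else None
    return weight_shape, bias_shape


def param_count(weight_shape, bias_shape):
    """Total number of learnable scalars in the layer."""
    total = 1
    for dim in weight_shape:
        total *= dim
    return total + (bias_shape[0] if bias_shape else 0)
```

For example, a 3x3 conv from 32 to 64 channels has a (64, 32, 3, 3) weight and 18,496 parameters including the bias, while a depthwise 3x3 conv over 32 channels without bias has only 288.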
forward(x) calls ops.conv2d(x, weight, bias, stride, padding,
dilation, groups) which delegates to the strategy selector. For
small kernels and small batch sizes, the im2col + GEMM path is
fastest. For 3x3 kernels with stride 1 and batch > 1, the Winograd
path is typically 1.5x to 2x faster. For depthwise convolutions
(groups == in_channels), the dedicated depthwise_conv2d kernel
is used.
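The selection rules in this paragraph can be read as a small decision function. The sketch below is a toy restatement of the prose, not the code in ops/conv2d_planner.py: pick_strategy, the two thresholds, and the fallback value are all made up for illustration, and the real selector also consults the device profile.

```python
# Hypothetical thresholds; the text only says "small kernels" and "small batch sizes".
SMALL_KERNEL = 3
SMALL_BATCH = 4


def pick_strategy(kernel_size, stride, batch, groups, in_channels):
    """Toy strategy choice that mirrors the rules described in the text."""
    kernel_h, kernel_w = kernel_size
    if groups == in_channels:
        return "depthwise_conv2d"  # dedicated depthwise kernel
    if (kernel_h, kernel_w) == (3, 3) and stride == 1 and batch > 1:
        return "CONV2D_WINOGRAD"
    if max(kernel_h, kernel_w) <= SMALL_KERNEL and batch <= SMALL_BATCH:
        return "CONV2D_IM2COL"
    # Placeholder fallback: the text does not say what is chosen otherwise.
    return "CONV2D_NAIVE"
```

Under these assumptions, a 3x3 stride-1 conv at batch 8 maps to CONV2D_WINOGRAD, and a 1x1 conv at batch 1 maps to CONV2D_IM2COL.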
The backward pass is implemented in the same file: it computes the gradient w.r.t. the input, weight, and bias, all as fused operations that participate in the Tape.
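For intuition about what those three gradients are, here is a minimal pure-Python sketch for the simplest case: one image, one input channel, one output channel, stride 1, no padding. The function names are hypothetical, and the fused OpenCL kernels are not organized this way; the sketch only shows the arithmetic.

```python
def conv2d_forward(x, w, b):
    """Single-channel forward pass: x is [H][W], w is [KH][KW], b is a scalar."""
    H, W, KH, KW = len(x), len(x[0]), len(w), len(w[0])
    return [[b + sum(w[i][j] * x[r + i][c + j]
                     for i in range(KH) for j in range(KW))
             for c in range(W - KW + 1)]
            for r in range(H - KH + 1)]


def conv2d_backward(x, w, grad_out):
    """Gradients of a scalar loss w.r.t. x, w, and b, given d(loss)/d(out)."""
    H, W, KH, KW = len(x), len(x[0]), len(w), len(w[0])
    grad_x = [[0.0] * W for _ in range(H)]
    grad_w = [[0.0] * KW for _ in range(KH)]
    grad_b = 0.0
    for r, row in enumerate(grad_out):
        for c, g in enumerate(row):
            grad_b += g  # the bias is added to every output element
            for i in range(KH):
                for j in range(KW):
                    grad_w[i][j] += g * x[r + i][c + j]  # out is linear in w
                    grad_x[r + i][c + j] += g * w[i][j]  # out is linear in x
    return grad_x, grad_w, grad_b
```

Each output element contributes to the gradient of every weight and every input pixel it touched in the forward pass, which is why the weight and input gradients are themselves convolution-shaped computations.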
Code Example
import netcl as nc
import netcl.nn as nn

# ctx, q, and device are assumed to be an OpenCL context, command queue,
# and device handle created earlier; their setup is not shown here.

# A standard 3x3 conv with 64 output channels.
conv = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3,
                 stride=1, padding=1, bias=True)
conv = conv.to(device)  # move parameters to OpenCL

x = nc.Tensor.zeros((8, 32, 224, 224), dtype="float32",
                    context=ctx, queue=q)
y = conv(x)  # (8, 64, 224, 224)

A 1x1 projection (used in ResNet bottlenecks):

proj = nn.Conv2d(in_channels=256, out_channels=64, kernel_size=1,
                 stride=1, padding=0, bias=False)

A strided downsample:

down = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3,
                 stride=2, padding=1, bias=False)
Performance & Trade-offs
- groups > 1 is the cheap way to reduce parameters and FLOPs at the cost of cross-channel information flow. Use it for mobile inference (MobileNet-style blocks).
- padding=0 is a "valid" convolution; for kernels larger than 1x1 the output is smaller than the input.
- padding=kernel_size // 2 (with stride=1 and an odd kernel size) is a "same" convolution that preserves the spatial size.
- The kernel selector reads the device profile and picks the best strategy. For most 3x3, stride-1 convolutions on modern devices, the Winograd path wins; 1x1 convolutions are plain matrix multiplies and do not benefit from Winograd.
- Under AMP, the conv runs in fp16; the accumulator is fp32. The strategy selector is aware of the precision and may pick a different strategy to avoid precision loss.
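To put a number on the groups trade-off, the sketch below counts multiply-accumulate operations (MACs) for one forward pass. The helper conv2d_macs is hypothetical and counts only the arithmetic of the direct formula; it ignores the bias add, memory traffic, and any savings from Winograd-style transforms, so it is a rough cost model rather than a performance prediction.

```python
def conv2d_macs(batch, in_channels, out_channels, h_out, w_out, kernel_size, groups=1):
    """Multiply-accumulates for one forward pass of a direct convolution.

    Each output element sums over (in_channels // groups) * kh * kw products.
    """
    kernel_h, kernel_w = kernel_size
    per_output = (in_channels // groups) * kernel_h * kernel_w
    return batch * out_channels * h_out * w_out * per_output


# A dense 3x3 conv versus its depthwise counterpart on a 64-channel 56x56 map.
dense = conv2d_macs(1, 64, 64, 56, 56, (3, 3), groups=1)
depthwise = conv2d_macs(1, 64, 64, 56, 56, (3, 3), groups=64)
print(dense // depthwise)  # the depthwise layer needs 64x fewer MACs
```

The reduction factor equals groups, which is why a depthwise layer is usually paired with a 1x1 convolution to restore the mixing across channels.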