Conv2d

Status: Public API in netcl.nn.layers.Conv2d and netcl.ops.conv2d.conv2d

A 2D convolution is the workhorse spatial operator of computer vision. In netcl, Conv2d exists at two layers:

The high-level nn.Conv2d module: an nn.Module that wraps a learnable weight tensor and an optional learnable bias tensor, exposes a forward function, and participates in Tape autograd.
The low-level ops.conv2d.conv2d function: a stateless function that takes the input and weight tensors (plus an optional bias) and the convolution parameters (stride, padding, dilation, groups), picks a kernel strategy, dispatches to OpenCL, and returns the output tensor.

The high-level module is what nearly all user code uses. The low-level function exists for the JIT, for custom kernels, and for cases such as ResNet, which sometimes wants a 1x1 conv with no bias (a configuration the module supports but does not use by default).

Overview

Conv2d performs:

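out[b, c_out, h_out, w_out] = bias[c_out] + sum_{c_in, kh, kw}
    weight[c_out, c_in, kh, kw]
    * x[b, c_in, h_out * stride_h + kh * dilation_h - pad_h,
            w_out * stride_w + kw * dilation_w - pad_w]

The formula is written for groups = 1, and reads of x that fall outside the input are taken as zero (zero padding is assumed here). The output height is h_out = floor((h_in + 2*pad_h - dilation_h * (kernel_h - 1) - 1) / stride_h) + 1 (and the analogous formula for w_out).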
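
The indexing rule and the output-size formula above can be checked against a small reference implementation. The following is a minimal pure-Python sketch, not netcl code: the names conv_out_size and conv2d_reference are hypothetical, it assumes groups = 1, zero padding, and the same stride, padding, and dilation on both spatial axes, and it applies dilation to the kernel offsets to match the output-size formula.

```python
def conv_out_size(size_in, kernel, stride=1, pad=0, dilation=1):
    """Output length along one spatial axis (note the floor division)."""
    return (size_in + 2 * pad - dilation * (kernel - 1) - 1) // stride + 1


def conv2d_reference(x, weight, bias=None, stride=1, pad=0, dilation=1):
    """Direct convolution over nested lists, groups == 1 (illustrative only).

    x:      [batch][c_in][h_in][w_in]
    weight: [c_out][c_in][k_h][k_w]
    bias:   [c_out] or None
    """
    c_in, h_in, w_in = len(x[0]), len(x[0][0]), len(x[0][0][0])
    c_out, k_h, k_w = len(weight), len(weight[0][0]), len(weight[0][0][0])
    h_out = conv_out_size(h_in, k_h, stride, pad, dilation)
    w_out = conv_out_size(w_in, k_w, stride, pad, dilation)

    out = []
    for image in x:
        out_image = []
        for co in range(c_out):
            plane = [[0.0] * w_out for _ in range(h_out)]
            for ho in range(h_out):
                for wo in range(w_out):
                    acc = bias[co] if bias is not None else 0.0
                    for ci in range(c_in):
                        for kh in range(k_h):
                            hi = ho * stride + kh * dilation - pad
                            if not 0 <= hi < h_in:
                                continue  # zero padding contributes nothing
                            for kw in range(k_w):
                                wi = wo * stride + kw * dilation - pad
                                if 0 <= wi < w_in:
                                    acc += weight[co][ci][kh][kw] * image[ci][hi][wi]
                    plane[ho][wo] = acc
            out_image.append(plane)
        out.append(out_image)
    return out


# One image, one channel, 3x3 input of ones, 3x3 kernel of ones, padding 1.
x = [[[[1.0] * 3 for _ in range(3)]]]
w = [[[[1.0] * 3 for _ in range(3)]]]
y = conv2d_reference(x, w, bias=[0.5], stride=1, pad=1)
```

With padding 1 the output keeps the 3x3 size; the centre element sees all nine inputs, while a corner sees only four because the rest of its window falls in the padding.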

netcl's Conv2d runs the convolution on the OpenCL device. The implementation supports several strategies (CONV2D_NAIVE, CONV2D_IM2COL, CONV2D_IMPLICIT_GEMM, CONV2D_TILED_LOCAL, CONV2D_WINOGRAD) and the kernel selector picks the best one for the input shape and the device profile.

Where It Lives

File path: nn/layers.py (class Conv2d), ops/conv2d.py (the low-level function), ops/conv2d_planner.py (the strategy selector).
Module path: netcl.nn.Conv2d (high-level), netcl.ops.conv2d (low-level).
Sibling ops: ops/conv_transpose2d, ops/depthwise_conv2d, ops/winograd_fused.

How It Works

The nn.Conv2d constructor allocates two parameters: a weight of shape (out_channels, in_channels // groups, kernel_h, kernel_w) and a bias of shape (out_channels,) (or None if bias=False). Both are requires_grad=True tensors and are picked up by the optimizer.
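
The parameter shapes described above, and the effect of groups on the parameter count, can be sketched in a few lines of plain Python. The helper names are hypothetical and nothing here calls netcl; the divisibility check is an assumption borrowed from the usual definition of grouped convolution, not something this page states.

```python
import math


def conv2d_param_shapes(in_channels, out_channels, kernel_size, groups=1, bias=True):
    """Return (weight_shape, bias_shape) following the layout described above."""
    # Assumption: both channel counts must divide evenly into the groups.
    if in_channels % groups or out_channels % groups:
        raise ValueError("in_channels and out_channels must be divisible by groups")
    if isinstance(kernel_size, int):
        k_h = k_w = kernel_size
    else:
        k_h, k_w = kernel_size
    weight_shape = (out_channels, in_channels // groups, k_h, k_w)
    bias_shape = (out_channels,) if bias else None
    return weight_shape, bias_shape


def param_count(weight_shape, bias_shape):
    """Total number of learnable scalars in the layer."""
    return math.prod(weight_shape) + (bias_shape[0] if bias_shape else 0)


dense = conv2d_param_shapes(32, 64, 3)                          # groups = 1
depthwise = conv2d_param_shapes(32, 32, 3, groups=32, bias=False)
```

For the dense 32-to-64 layer this gives a (64, 32, 3, 3) weight plus a (64,) bias; the depthwise variant shrinks the second weight dimension to 1.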

forward(x) calls ops.conv2d(x, weight, bias, stride, padding, dilation, groups) which delegates to the strategy selector. For small kernels and small batch sizes, the im2col + GEMM path is fastest. For 3x3 kernels with stride 1 and batch > 1, the Winograd path is typically 1.5x to 2x faster. For depthwise convolutions (groups == in_channels), the dedicated depthwise_conv2d kernel is used.
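
To make the im2col + GEMM path concrete, here is a minimal sketch of the general technique in pure Python. It is not netcl's OpenCL kernel: the function names are hypothetical, it handles a single image with groups = 1 and dilation 1, and the matrix multiply is written as nested comprehensions rather than a tuned GEMM.

```python
def im2col(image, k_h, k_w, stride=1, pad=0):
    """Unfold [c_in][h][w] into a matrix of c_in*k_h*k_w rows by h_out*w_out columns."""
    c_in, h_in, w_in = len(image), len(image[0]), len(image[0][0])
    h_out = (h_in + 2 * pad - k_h) // stride + 1
    w_out = (w_in + 2 * pad - k_w) // stride + 1
    cols = []
    for ci in range(c_in):
        for kh in range(k_h):
            for kw in range(k_w):
                row = []
                for ho in range(h_out):
                    for wo in range(w_out):
                        hi = ho * stride + kh - pad
                        wi = wo * stride + kw - pad
                        inside = 0 <= hi < h_in and 0 <= wi < w_in
                        row.append(image[ci][hi][wi] if inside else 0.0)
                cols.append(row)
    return cols, h_out, w_out


def conv2d_im2col(image, weight, stride=1, pad=0):
    """Convolution as one matrix multiply over the unfolded input."""
    k_h, k_w = len(weight[0][0]), len(weight[0][0][0])
    cols, h_out, w_out = im2col(image, k_h, k_w, stride, pad)
    # Flatten each filter to one row, in the same (c_in, kh, kw) order as im2col.
    w_mat = [[v for plane in filt for row in plane for v in row] for filt in weight]
    # GEMM: (c_out, K) times (K, h_out*w_out), with K = c_in*k_h*k_w.
    flat = [[sum(a * b for a, b in zip(w_row, col)) for col in zip(*cols)]
            for w_row in w_mat]
    # Fold each output row back into an h_out by w_out plane.
    return [[r[ho * w_out:(ho + 1) * w_out] for ho in range(h_out)] for r in flat]


image = [[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]]   # 1 channel, 3x3
weight = [[[[1.0, 0.0], [0.0, 1.0]]]]                           # 1 filter, 2x2
result = conv2d_im2col(image, weight)
```

The unfolded matrix duplicates every input element up to k_h * k_w times, which is the memory cost this strategy pays in exchange for reusing a fast matrix multiply.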

The backward pass is implemented in the same file: it computes the gradient w.r.t. the input, weight, and bias, all as fused operations that participate in the Tape.
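
The three gradients can be illustrated with a deliberately small case. The sketch below is plain Python, not netcl's fused kernels or its Tape integration: it assumes a single channel, stride 1, no padding, and no dilation, and the names forward and backward are hypothetical.

```python
def forward(x, w, b):
    """Single-channel convolution, stride 1, no padding."""
    k_h, k_w = len(w), len(w[0])
    h_out, w_out = len(x) - k_h + 1, len(x[0]) - k_w + 1
    return [[b + sum(w[kh][kw] * x[i + kh][j + kw]
                     for kh in range(k_h) for kw in range(k_w))
             for j in range(w_out)] for i in range(h_out)]


def backward(x, w, grad_out):
    """Return (grad_x, grad_w, grad_b) for an upstream gradient grad_out."""
    k_h, k_w = len(w), len(w[0])
    grad_x = [[0.0] * len(x[0]) for _ in x]
    grad_w = [[0.0] * k_w for _ in range(k_h)]
    grad_b = 0.0
    for i, g_row in enumerate(grad_out):
        for j, g in enumerate(g_row):
            grad_b += g                                   # bias feeds every output
            for kh in range(k_h):
                for kw in range(k_w):
                    grad_w[kh][kw] += g * x[i + kh][j + kw]   # weight sees the input
                    grad_x[i + kh][j + kw] += g * w[kh][kw]   # input sees the weight
    return grad_x, grad_w, grad_b


x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
w = [[1.0, 0.0], [0.0, -1.0]]
y = forward(x, w, 0.5)
grad_x, grad_w, grad_b = backward(x, w, [[1.0, 1.0], [1.0, 1.0]])
```

Each forward multiply-add contributes one term to the weight gradient and one to the input gradient, which is why all three gradients can share a single loop nest.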

Code Example

import netcl as nc
import netcl.nn as nn

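# `device`, `ctx`, and `q` are assumed to have been created beforehand
# (the target device, the OpenCL context, and the command queue).
# A standard 3x3 conv with 64 output channels.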
conv = nn.Conv2d(in_channels=32, out_channels=64, kernel_size=3,
                 stride=1, padding=1, bias=True)
conv = conv.to(device)        # move parameters to OpenCL

x = nc.Tensor.zeros((8, 32, 224, 224), dtype="float32",
                    context=ctx, queue=q)
y = conv(x)                   # (8, 64, 224, 224)

A 1x1 projection (used in ResNet bottlenecks):

proj = nn.Conv2d(in_channels=256, out_channels=64, kernel_size=1,
                 stride=1, padding=0, bias=False)

A strided downsample:

down = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3,
                 stride=2, padding=1, bias=False)

Performance & Trade-offs

groups > 1 is the cheap way to reduce parameters and FLOPs at the cost of cross-channel information flow. Use it for mobile inference (MobileNet-style blocks).
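padding=0 is a valid convolution; for any kernel larger than 1x1 the output is smaller than the input. padding=kernel_size // 2 (with stride=1, dilation=1, and an odd kernel size) is a same convolution that preserves the spatial size.
The kernel selector reads the device profile and picks the best strategy. For most 3x3, stride-1 convolutions on modern devices, the Winograd path wins. A 1x1 convolution is already a plain matrix multiply, so Winograd offers it nothing and a GEMM-based path applies instead.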
Under AMP, the conv runs in fp16; the accumulator is fp32. The strategy selector is aware of the precision and may pick a different strategy to avoid precision loss.
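
Why the accumulator stays in fp32 under AMP can be shown without any GPU. The sketch below is a plain-Python illustration, not netcl code: it uses the standard struct module's half-precision ("e") and single-precision ("f") formats to round a running sum after every addition, and the helper names are hypothetical.

```python
import struct


def to_fp16(v):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", v))[0]


def to_fp32(v):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", v))[0]


def accumulate(values, round_fn):
    """Sum values, rounding the running total to the accumulator precision."""
    acc = 0.0
    for v in values:
        acc = round_fn(acc + v)
    return acc


# 4096 fp16 products, each equal to 1.0 (exactly representable in fp16).
products = [to_fp16(1.0)] * 4096
sum_fp16 = accumulate(products, to_fp16)   # accumulator kept in fp16
sum_fp32 = accumulate(products, to_fp32)   # accumulator kept in fp32
```

Above 2048 the spacing between adjacent fp16 values is 2, so adding 1.0 no longer changes an fp16 running sum, while the fp32 accumulator reaches the exact total. Reductions over c_in * kernel_h * kernel_w terms can easily be this long, which is the case an fp32 accumulator protects against.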