concepts

im2col

Status: Architecture concept in ops/im2col.py

im2col (image-to-column) is the standard trick for implementing a convolution as a matrix multiplication. The idea is to lay out the input image as a 2D matrix where each row corresponds to one patch of the input (one receptive field) and each column to one input channel at one spatial offset. The convolution is then a matrix-multiply of the im2col matrix by the (reshaped) weight matrix, producing an output matrix that, when reshaped back, is the convolution output.

The reverse operation, col2im, takes the output matrix and scatters its columns back into the output tensor. It is used in the backward pass of the im2col convolution.

Overview

Given a 4D input x of shape (N, C_in, H, W), a 4D weight w of shape (C_out, C_in, kH, kW), and the conv parameters (stride, padding, dilation, groups), the im2col matrix is:

cols: shape (N * H_out * W_out, C_in * kH * kW).
cols[n * H_out * W_out + h * W_out + w, c * kH * kW + kh * kW + kw] is x[n, c, h * stride - pad + kh, w * stride - pad + kw] (clipped at the input boundary if the patch extends past the input).

The convolution output is then output = cols @ w.reshape(C_out, C_in * kH * kW).T, reshaped to (N, C_out, H_out, W_out).

Where It Lives

File path: ops/im2col.py.
Module path: netcl.ops.im2col.
Used by: ops/conv2d.py (CONV2D_IM2COL strategy), ops/conv2d_planner.py (the strategy selector).

How It Works

im2col is implemented as a single OpenCL kernel that does the gather-and-write in one pass. The kernel's work-item is the output index (n, h, w); it iterates over the input channels and the kernel positions, reads the input value (with zero-fill for the padding), and writes the value to the corresponding column of the output matrix. The kernel is launched with global_size = (N, H_out, W_out) and local_size = (1, 1, 1).

The reverse col2im is a scatter-add: the same indexing, but the writes are atomic adds (to handle the case where multiple output indices map to the same input index, which happens with stride < kernel size and overlapping receptive fields).

Code Example

import netcl as nc
from netcl.ops.im2col import im2col, col2im

x = nc.Tensor.from_host(numpy_x)             # (N, C_in, H, W)
cols = im2col(x, kernel_h=3, kernel_w=3,
              stride=1, padding=1, dilation=1)
# cols.shape == (N * H_out * W_out, C_in * 3 * 3)

# ... matrix-multiply cols by the weight matrix ...

grad_input = col2im(grad_cols, input_shape=x.shape,
                    kernel_h=3, kernel_w=3,
                    stride=1, padding=1, dilation=1)

Performance & Trade-offs

Memory overhead: the im2col matrix is (N * H_out * W_out, C_in * kH * kW) floats. For a typical ResNet-50 input (N=32, C_in=64, H=W=56, kH=kW=3), H_out = W_out = 56, so the im2col matrix is (32 * 56 * 56, 64 * 9) = (100352, 576) floats, or 230 MB at fp32. This is the main reason the im2col strategy is not always the fastest.
Speed-up vs. the naive loop: the im2col + GEMM strategy converts a 7-loop nested convolution (over N, C_out, C_in, H, W, kH, kW) into a single matrix-multiply, which can use highly-tuned GEMM kernels. The win is 5x to 20x on most devices.
Cache pressure: the im2col matrix does not fit in cache for most realistic convolutions. The matmul kernel handles this with tiling, but the memory-bandwidth cost is real.
Alternative: implicit GEMM (CONV2D_IMPLICIT_GEMM) does the same matrix-multiply without materialising the im2col matrix. It is faster for large convolutions but requires a more sophisticated kernel.
Winograd (CONV2D_WINOGRAD) is the fastest strategy for 3x3 convolutions with stride 1, but it requires the input to be a multiple of 4 in each spatial dimension and the output channels to be a multiple of 8.