im2col
im2col
Status: Architecture concept in
ops/im2col.py
im2col (image-to-column) is the standard trick for implementing
a convolution as a matrix multiplication. The idea is to lay out
the input image as a 2D matrix where each row corresponds to one
patch of the input (one receptive field) and each column to one
input channel at one spatial offset. The convolution is then a
matrix-multiply of the im2col matrix by the (reshaped) weight
matrix, producing an output matrix that, when reshaped back, is
the convolution output.
The reverse operation, col2im, takes the output matrix and
scatters its columns back into the output tensor. It is used in
the backward pass of the im2col convolution.
Overview
Given a 4D input x of shape (N, C_in, H, W), a 4D weight
w of shape (C_out, C_in, kH, kW), and the conv parameters
(stride, padding, dilation, groups), the im2col matrix is:
cols: shape(N * H_out * W_out, C_in * kH * kW).cols[n * H_out * W_out + h * W_out + w, c * kH * kW + kh * kW + kw]isx[n, c, h * stride - pad + kh, w * stride - pad + kw](clipped at the input boundary if the patch extends past the input).
The convolution output is then
output = cols @ w.reshape(C_out, C_in * kH * kW).T,
reshaped to (N, C_out, H_out, W_out).
Where It Lives
- File path:
ops/im2col.py. - Module path:
netcl.ops.im2col. - Used by:
ops/conv2d.py(CONV2D_IM2COLstrategy),ops/conv2d_planner.py(the strategy selector).
How It Works
im2col is implemented as a single OpenCL kernel that does the
gather-and-write in one pass. The kernel's work-item is the
output index (n, h, w); it iterates over the input channels
and the kernel positions, reads the input value (with zero-fill
for the padding), and writes the value to the corresponding
column of the output matrix. The kernel is launched with
global_size = (N, H_out, W_out) and local_size = (1, 1, 1).
The reverse col2im is a scatter-add: the same indexing, but
the writes are atomic adds (to handle the case where multiple
output indices map to the same input index, which happens with
stride < kernel size and overlapping receptive fields).
Code Example
import netcl as nc
from netcl.ops.im2col import im2col, col2im
x = nc.Tensor.from_host(numpy_x) # (N, C_in, H, W)
cols = im2col(x, kernel_h=3, kernel_w=3,
stride=1, padding=1, dilation=1)
# cols.shape == (N * H_out * W_out, C_in * 3 * 3)
# ... matrix-multiply cols by the weight matrix ...
grad_input = col2im(grad_cols, input_shape=x.shape,
kernel_h=3, kernel_w=3,
stride=1, padding=1, dilation=1)
Performance & Trade-offs
- Memory overhead: the im2col matrix is
(N * H_out * W_out, C_in * kH * kW)floats. For a typical ResNet-50 input (N=32, C_in=64, H=W=56, kH=kW=3),H_out = W_out = 56, so the im2col matrix is(32 * 56 * 56, 64 * 9) = (100352, 576)floats, or 230 MB at fp32. This is the main reason the im2col strategy is not always the fastest. - Speed-up vs. the naive loop: the im2col + GEMM strategy
converts a 7-loop nested convolution (over
N, C_out, C_in, H, W, kH, kW) into a single matrix-multiply, which can use highly-tuned GEMM kernels. The win is 5x to 20x on most devices. - Cache pressure: the im2col matrix does not fit in cache for most realistic convolutions. The matmul kernel handles this with tiling, but the memory-bandwidth cost is real.
- Alternative: implicit GEMM (
CONV2D_IMPLICIT_GEMM) does the same matrix-multiply without materialising the im2col matrix. It is faster for large convolutions but requires a more sophisticated kernel. - Winograd (
CONV2D_WINOGRAD) is the fastest strategy for 3x3 convolutions with stride 1, but it requires the input to be a multiple of 4 in each spatial dimension and the output channels to be a multiple of 8.