Winograd
Winograd
Status: Architecture concept in
ops/winograd_fused.py
Winograd is the fast algorithm for 3x3 convolutions with stride
1, originally published by Shmuel Winograd in 1980 and applied
to convnets by Lavin and Gray (2016). For a 3x3 convolution, the
naive implementation does 9 multiplications per output element;
the Winograd algorithm reduces this to 16 multiplications for a
2x2 output tile (a 2.25x reduction in multiplies).
In netcl, the Winograd path is one of the strategies
Conv2d selects via the
KernelSelector; it is exposed as the
CONV2D_WINOGRAD kernel. The implementation is in
ops/winograd_fused.py and uses a precomputed transform matrix
to convert the input tile to the Winograd domain, do the
multiplication there, and convert back.
Overview
The F(2x2, 3x3) Winograd algorithm transforms a 4x4 input tile
to a 4x4 "Winograd-domain" tile, multiplies it element-wise by a
4x4 transform of the 3x3 weight, and inverse-transforms to a
2x2 output tile. The transforms are fixed and can be baked into
the kernel:
out_tile = inverse_transform(mul(transform(in_tile),
transform(weight_tile)))
The transform matrices have only small-integer entries
(1, 0, -1, 1/2), so the transform itself is a sequence of
adds, subtracts, and halvings — much cheaper than the
multiplications they replace.
Where It Lives
- File path:
ops/winograd_fused.py. - Module path:
netcl.ops.winograd_fused. - Used by:
ops/conv2d.py(when theKernelSelectorpicksCONV2D_WINOGRAD).
How It Works
The Winograd kernel is a single OpenCL program with three phases:
- Input transform: for each 2x2 output tile, read the corresponding 4x4 input tile, apply the input transform matrix, and write the transformed tile to local memory.
- Pointwise multiply: for each (output channel, input channel) pair, multiply the transformed input tile element-wise by the pre-transformed weight tile.
- Output transform: for each output tile, apply the output transform matrix to the result of the pointwise multiply and write the final 2x2 output to global memory.
The weight transform is precomputed once at kernel-build time
and stored in the cl.Program's constant memory. The input
and output transforms run on every call.
Code Example
The Winograd path is selected automatically by the KernelSelector. To force it:
from netcl.core.kernel_selector import KernelSelector
selector = KernelSelector()
# Force Winograd for all 3x3 convs.
def always_winograd(shapes, profile):
return shapes[0].numel() % 4 == 0
selector.register("conv2d", always_winograd, "CONV2D_WINOGRAD")
Inspecting the speed-up:
import time
import netcl as nc
x = nc.Tensor.from_host(numpy_x)
# Naive
nc.set_default_conv_strategy("CONV2D_NAIVE")
t0 = time.time(); y = conv(x); naive_time = time.time() - t0
# Winograd
nc.set_default_conv_strategy("CONV2D_WINOGRAD")
t0 = time.time(); y = conv(x); winograd_time = time.time() - t0
print(f"Winograd is {naive_time / winograd_time:.1f}x faster")
Performance & Trade-offs
- Speed-up vs. the naive implementation: typically 1.5x to 2x on 3x3 convs with stride 1. The win is larger on small batch sizes (because the transform overhead is fixed) and smaller on large batch sizes (because the GEMM is already efficient).
- Restrictions: the input spatial dimensions must be a multiple of 4 (the 2x2 tile size), and the output channels must be a multiple of 8. The kernel selector pads the tensors to satisfy these constraints and unpads the output at the end; the padding cost is usually negligible.
- Numerical precision: the transform matrices contain
fractions (
1/2), so the algorithm is slightly less numerically stable than the naive implementation. The kernel is careful to usemad_hi-style rounding on devices that support it; on devices that do not, fp16 Winograd can accumulate error. - Memory pressure: the Winograd kernel needs more local memory than the naive one (to hold the input tile and the transformed input tile simultaneously). On devices with small local-memory limits, the kernel falls back to a tiled variant that uses less memory but is slightly slower.