netcl wiki
concepts

Winograd

Winograd

Status: Architecture concept in ops/winograd_fused.py

Winograd is the fast algorithm for 3x3 convolutions with stride 1, originally published by Shmuel Winograd in 1980 and applied to convnets by Lavin and Gray (2016). For a 3x3 convolution, the naive implementation does 9 multiplications per output element; the Winograd algorithm reduces this to 16 multiplications for a 2x2 output tile (a 2.25x reduction in multiplies).

In netcl, the Winograd path is one of the strategies Conv2d selects via the KernelSelector; it is exposed as the CONV2D_WINOGRAD kernel. The implementation is in ops/winograd_fused.py and uses a precomputed transform matrix to convert the input tile to the Winograd domain, do the multiplication there, and convert back.

Overview

The F(2x2, 3x3) Winograd algorithm transforms a 4x4 input tile to a 4x4 "Winograd-domain" tile, multiplies it element-wise by a 4x4 transform of the 3x3 weight, and inverse-transforms to a 2x2 output tile. The transforms are fixed and can be baked into the kernel:

out_tile = inverse_transform(mul(transform(in_tile),
                                  transform(weight_tile)))

The transform matrices have only small-integer entries (1, 0, -1, 1/2), so the transform itself is a sequence of adds, subtracts, and halvings — much cheaper than the multiplications they replace.

Where It Lives

  • File path: ops/winograd_fused.py.
  • Module path: netcl.ops.winograd_fused.
  • Used by: ops/conv2d.py (when the KernelSelector picks CONV2D_WINOGRAD).

How It Works

The Winograd kernel is a single OpenCL program with three phases:

  1. Input transform: for each 2x2 output tile, read the corresponding 4x4 input tile, apply the input transform matrix, and write the transformed tile to local memory.
  2. Pointwise multiply: for each (output channel, input channel) pair, multiply the transformed input tile element-wise by the pre-transformed weight tile.
  3. Output transform: for each output tile, apply the output transform matrix to the result of the pointwise multiply and write the final 2x2 output to global memory.

The weight transform is precomputed once at kernel-build time and stored in the cl.Program's constant memory. The input and output transforms run on every call.

Code Example

The Winograd path is selected automatically by the KernelSelector. To force it:

from netcl.core.kernel_selector import KernelSelector

selector = KernelSelector()
# Force Winograd for all 3x3 convs.
def always_winograd(shapes, profile):
    return shapes[0].numel() % 4 == 0

selector.register("conv2d", always_winograd, "CONV2D_WINOGRAD")

Inspecting the speed-up:

import time
import netcl as nc

x = nc.Tensor.from_host(numpy_x)

# Naive
nc.set_default_conv_strategy("CONV2D_NAIVE")
t0 = time.time(); y = conv(x); naive_time = time.time() - t0

# Winograd
nc.set_default_conv_strategy("CONV2D_WINOGRAD")
t0 = time.time(); y = conv(x); winograd_time = time.time() - t0

print(f"Winograd is {naive_time / winograd_time:.1f}x faster")

Performance & Trade-offs

  • Speed-up vs. the naive implementation: typically 1.5x to 2x on 3x3 convs with stride 1. The win is larger on small batch sizes (because the transform overhead is fixed) and smaller on large batch sizes (because the GEMM is already efficient).
  • Restrictions: the input spatial dimensions must be a multiple of 4 (the 2x2 tile size), and the output channels must be a multiple of 8. The kernel selector pads the tensors to satisfy these constraints and unpads the output at the end; the padding cost is usually negligible.
  • Numerical precision: the transform matrices contain fractions (1/2), so the algorithm is slightly less numerically stable than the naive implementation. The kernel is careful to use mad_hi-style rounding on devices that support it; on devices that do not, fp16 Winograd can accumulate error.
  • Memory pressure: the Winograd kernel needs more local memory than the naive one (to hold the input tile and the transformed input tile simultaneously). On devices with small local-memory limits, the kernel falls back to a tiled variant that uses less memory but is slightly slower.

See also

  • Winograd — the architecture page on convolution strategies.
  • Conv2d — the high-level conv module.
  • im2col — the alternative fast strategy.
  • KernelSelector — the per-op dispatcher.
  • ResNet — a heavy user of 3x3 convs.
  • Winograd — this article.