ResNet

Status: Public API — ResNet18 is lazy-exported from netcl.nn. The internal classes (BasicBlock, Bottleneck) and factory functions (resnet34, resnet50, …) are in netcl.nn.resnet but are not part of the public API.

ResNet is the family of residual networks introduced by He et al. (2015) for ImageNet classification. The defining idea is the residual connection: instead of learning a mapping H(x), the block learns a residual F(x) = H(x) - x and the output is F(x) + x. This makes it easy for the optimizer to learn an identity mapping (just set F to zero), which is what the very deep variants need to avoid the degradation problem (deeper plain networks have higher training error than shallower ones).
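The identity argument above can be checked with a minimal sketch in plain Python. It is not netcl code: `residual_block` and `zero_branch` are hypothetical names, and vectors are modelled as lists of floats.

```python
from typing import Callable, List

Vector = List[float]


def residual_block(f: Callable[[Vector], Vector], x: Vector) -> Vector:
    """Return F(x) + x, the output of a residual block whose branch is f."""
    fx = f(x)
    if len(fx) != len(x):
        raise ValueError("F(x) and x must have the same shape to be added")
    return [a + b for a, b in zip(fx, x)]


def zero_branch(x: Vector) -> Vector:
    """A residual branch with all weights at zero, so F(x) = 0."""
    return [0.0 for _ in x]


x = [1.0, -2.0, 3.5]
# With F = 0 the block reduces to the identity mapping.
assert residual_block(zero_branch, x) == x
```

A plain block has to learn H(x) = x through its weights. A residual block only has to drive those weights towards zero.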

netcl ships a faithful PyTorch-style ResNet implementation under netcl.nn.resnet. The module defines:

BasicBlock — two 3x3 convolutions, used in ResNet-18 and ResNet-34.
Bottleneck — 1x1, 3x3, 1x1 convolutions, used in ResNet-50 and deeper.
ResNet — the full network, parameterised by the block class, the number of blocks in each of the four layers, and the number of classes.
resnet18, resnet34, resnet50, resnet101, resnet152 — pre-configured factory functions matching the original paper.
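For reference, the configurations in the original paper can be written as a small table. This is a sketch: the block types and per-layer block counts come from He et al., but the `CONFIGS` dictionary and `weighted_depth` helper are illustrative names, not netcl's internal representation.

```python
# Block type and per-layer block counts from He et al. (2015).
# The dictionary layout is illustrative, not netcl's internal representation.
CONFIGS = {
    "resnet18": ("BasicBlock", [2, 2, 2, 2]),
    "resnet34": ("BasicBlock", [3, 4, 6, 3]),
    "resnet50": ("Bottleneck", [3, 4, 6, 3]),
    "resnet101": ("Bottleneck", [3, 4, 23, 3]),
    "resnet152": ("Bottleneck", [3, 8, 36, 3]),
}

# Convolutions on the main path of each block type.
CONVS_PER_BLOCK = {"BasicBlock": 2, "Bottleneck": 3}


def weighted_depth(block: str, blocks_per_layer: list) -> int:
    """Count conv1, the main-path convolutions, and fc (shortcut convs excluded)."""
    return 1 + CONVS_PER_BLOCK[block] * sum(blocks_per_layer) + 1


for name, (block, counts) in CONFIGS.items():
    print(name, block, counts, weighted_depth(block, counts))
```

The number in each name is this weighted depth: the first convolution, the convolutions on the main path of every block, and the final fully connected layer.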

Overview

The macro-architecture is:

conv1   : 7x7, stride 2, output 64
maxpool : 3x3, stride 2
layer1  : N blocks, output 64
layer2  : N blocks, output 128, first block stride 2
layer3  : N blocks, output 256, first block stride 2
layer4  : N blocks, output 512, first block stride 2
avgpool : global average pool
fc      : 512 * expansion -> num_classes

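For BasicBlock, expansion=1. For Bottleneck, expansion=4: the third convolution, a 1x1, widens the block output to four times the listed width, so layer1 to layer4 emit 256, 512, 1024, and 2048 channels. N is the number of blocks in the layer and varies by layer and by variant.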
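The table can be turned into concrete tensor shapes with a short sketch. The table does not state paddings, so the values here (3 for conv1, 1 for maxpool and for the 3x3 convolutions) are assumptions that follow the usual PyTorch-style layout. `conv_out` and `stage_shapes` are hypothetical helper names.

```python
def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output length of a convolution or pooling window along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def stage_shapes(input_size: int, expansion: int):
    """Return (name, channels, spatial size) for each stage of the macro-architecture."""
    shapes = []
    size = conv_out(input_size, kernel=7, stride=2, padding=3)  # conv1, padding assumed
    shapes.append(("conv1", 64, size))
    size = conv_out(size, kernel=3, stride=2, padding=1)        # maxpool, padding assumed
    shapes.append(("maxpool", 64, size))
    for i, width in enumerate([64, 128, 256, 512], start=1):
        stride = 1 if i == 1 else 2                             # layer1 keeps the resolution
        size = conv_out(size, kernel=3, stride=stride, padding=1)
        shapes.append((f"layer{i}", width * expansion, size))
    shapes.append(("avgpool", 512 * expansion, 1))              # global average pool
    return shapes


for name, channels, size in stage_shapes(224, expansion=4):
    print(f"{name:8s} {channels:5d} channels, {size}x{size}")
```

Under these assumptions a 224x224 input reaches layer4 at 7x7, and the fc layer sees 512 * expansion features after the global average pool.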

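The first block of each of layer2, layer3, and layer4 applies a strided 1x1 convolution on the shortcut so that the spatial size and the channel count match the main path. With BasicBlock these are the only places where the shortcut is not a plain identity. With Bottleneck, the first block of layer1 also needs a 1x1 projection (stride 1) in the standard architecture, because its output has 256 channels while its input has 64.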
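The shortcut rule can be sketched as a small planning function. It assumes the standard criterion, which is to project whenever the stride is not 1 or the channel counts differ. Whether netcl uses exactly this check is an assumption, and `shortcut_plan` is a hypothetical name.

```python
def shortcut_plan(expansion: int, blocks_per_layer):
    """List (layer, block index, shortcut kind) for every block in the network."""
    plan = []
    in_channels = 64  # channels leaving conv1 and maxpool
    widths = [64, 128, 256, 512]
    for i, (width, n_blocks) in enumerate(zip(widths, blocks_per_layer), start=1):
        out_channels = width * expansion
        for b in range(n_blocks):
            # Only the first block of layer2..layer4 downsamples.
            stride = 2 if (b == 0 and i > 1) else 1
            if stride != 1 or in_channels != out_channels:
                kind = f"1x1 conv, stride {stride}, {in_channels}->{out_channels}"
            else:
                kind = "identity"
            plan.append((f"layer{i}", b, kind))
            in_channels = out_channels
    return plan


# BasicBlock network with two blocks per layer (the ResNet-18 layout).
for layer, block, kind in shortcut_plan(expansion=1, blocks_per_layer=[2, 2, 2, 2]):
    print(layer, block, kind)
```

With expansion=1 the plan contains three projections, one each at the start of layer2, layer3, and layer4. Under the same rule, expansion=4 adds a stride-1 projection at the start of layer1, since 64 input channels do not match 256 output channels.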

Where It Lives

File path: nn/resnet.py.
Module path: netcl.nn.resnet.
Public re-export: from netcl.nn import ResNet18 (lazy-loaded).

How It Works

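The forward pass of a BasicBlock is:

def forward(self, x):
    identity = x  # shortcut branch
    # Main path: conv -> bn -> relu -> conv -> bn
    out = self.bn1(self.conv1(x))
    out = self.relu(out)
    out = self.bn2(self.conv2(out))
    # Project the shortcut when the stride or channel count changes
    if self.downsample is not None:
        identity = self.downsample(x)
    # Residual addition, then the final activation
    out = out + identity
    return self.relu(out)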

The convolution strategy for each layer is chosen by the same kernel selector that Conv2d uses (im2col, Winograd, etc.). Batch norm uses the fused implementation, and the addition is a single elementwise kernel.

The full forward pass of ResNet-50 is a sequence of about 50 convolutions, with one JIT compiler fusion opportunity per residual tail (the bn-add-relu chain at the end of each block). The JIT compiler does not currently see the whole ResNet; that is a job for the pattern-based TrainingGraphCompiler. The per-block elementwise chains (mostly after the addition) are fused nonetheless.
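The "about 50" figure and the number of fusion sites can be reproduced with a quick count. This sketch uses the ResNet-50 block counts from the paper and assumes one shortcut projection per layer, as in the standard architecture. The variable names are illustrative.

```python
BLOCKS_PER_LAYER = [3, 4, 6, 3]  # ResNet-50 block counts from the paper
CONVS_PER_BOTTLENECK = 3         # 1x1, 3x3, 1x1

n_blocks = sum(BLOCKS_PER_LAYER)

# conv1 plus the convolutions on the main path of every block.
main_path_convs = 1 + CONVS_PER_BOTTLENECK * n_blocks

# Assumed: one 1x1 shortcut projection in the first block of each layer.
projection_convs = len(BLOCKS_PER_LAYER)

total_convs = main_path_convs + projection_convs

# Each block ends in exactly one residual tail, hence one fusion site.
residual_tails = n_blocks

print(total_convs, residual_tails)
```

Under these assumptions the network has 53 convolutions and 16 residual tails, which is an upper bound on what per-block fusion alone can merge.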

Code Example

from netcl.nn import ResNet18
from netcl.core.device import manager

q = manager.default("auto").queue

# ResNet18 with a CIFAR-10 head (default num_classes=10)
model = ResNet18(queue=q, num_classes=10)

# ImageNet head
model = ResNet18(queue=q, num_classes=1000)

Performance & Trade-offs

ResNet-50 runs at roughly 60% of the throughput of a hand-tuned cuDNN implementation on the same hardware, because netcl's OpenCL convolution selector is more conservative than cuDNN's autotuner. This is a known gap and the focus of ongoing work.
The first 7x7 convolution is the most expensive single op in the network on small input sizes. Some practitioners replace it with three 3x3 convolutions for a small accuracy win and a measurable speed-up on small inputs.
Under AMP, ResNet-50 trains comfortably in fp16; the standard recipe uses lr=0.1 for every 256 samples in the batch and a cosine schedule.
The bottleneck variant is faster than the basic variant for the same accuracy on ImageNet; the 1x1 convs are cheap and the 3x3 conv runs on a quarter of the block's output channels.
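The cost argument for the bottleneck can be made concrete by counting weights per block at the same output width. This is a back-of-the-envelope sketch with hypothetical helper names. It ignores batch norm, biases, and shortcut projections, and it is not a measurement of netcl.

```python
def conv_weights(kernel: int, in_ch: int, out_ch: int) -> int:
    """Number of weights in a bias-free kernel x kernel convolution."""
    return kernel * kernel * in_ch * out_ch


def basic_block_weights(channels: int) -> int:
    """Two 3x3 convolutions, both at full width."""
    return 2 * conv_weights(3, channels, channels)


def bottleneck_weights(channels: int, expansion: int = 4) -> int:
    """1x1 reduce, 3x3 at channels / expansion, then 1x1 expand."""
    mid = channels // expansion
    return (
        conv_weights(1, channels, mid)
        + conv_weights(3, mid, mid)
        + conv_weights(1, mid, channels)
    )


basic = basic_block_weights(256)
bottleneck = bottleneck_weights(256)
print(basic, bottleneck, round(basic / bottleneck, 1))
```

At a stride of 1, multiply-accumulates per block scale with the weight count times the number of output positions. The same ratio therefore applies to compute, which is why a 256-channel bottleneck is far cheaper than a 256-channel basic block.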