
Linear

Status: Public API in netcl.nn.layers.Linear

Linear is a fully-connected (a.k.a. dense) layer: a learnable matrix multiply followed by an optional bias add. Given an input of shape (N, in_features) and a weight of shape (out_features, in_features), the layer computes:

y[n, o] = sum_{i} weight[o, i] * x[n, i] + bias[o]

The result has shape (N, out_features). The weight is initialised with Kaiming uniform; the bias is initialised with zeros.
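
As a cross-check of the formula and shapes above, the following NumPy sketch evaluates the same computation on the host. It is not netcl code: kaiming_uniform and linear_forward are illustrative names, and the bound sqrt(6 / fan_in) assumes the ReLU gain of sqrt(2), which this page does not specify for netcl.

```python
import numpy as np

def kaiming_uniform(out_features, in_features, rng):
    # Assumption: gain = sqrt(2) (ReLU), so bound = gain * sqrt(3 / fan_in).
    bound = np.sqrt(6.0 / in_features)
    w = rng.uniform(-bound, bound, size=(out_features, in_features))
    return w.astype(np.float32)

def linear_forward(x, weight, bias=None):
    # y[n, o] = sum_i weight[o, i] * x[n, i] + bias[o]
    y = x @ weight.T          # contract over in_features
    if bias is not None:
        y = y + bias          # bias broadcasts over the N rows
    return y

rng = np.random.default_rng(0)
weight = kaiming_uniform(256, 1024, rng)      # (out_features, in_features)
bias = np.zeros(256, dtype=np.float32)        # zero-initialised bias
x = rng.standard_normal((128, 1024)).astype(np.float32)

y = linear_forward(x, weight, bias)
print(y.shape)  # (128, 256)
```

Note that the weight is stored as (out_features, in_features), so the product is x times the transpose of weight.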

Linear is the building block of MLPs and serves as the classification head of nearly every classifier architecture. It is also frequently the layer that consumes the most GPU time: ignoring the bias add, a 1024 -> 1024 linear costs about 1M multiplies per example (1024 x 1024 = 1,048,576), and a 4096 -> 4096 linear costs about 16M (4096 x 4096 = 16,777,216).
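
The multiply counts quoted above are the product of the two feature dimensions, scaled by the batch size. A small helper (linear_multiplies is a name chosen for this example, not part of netcl) makes the arithmetic explicit:

```python
def linear_multiplies(in_features, out_features, batch=1):
    # One multiply per (input feature, output feature) pair, per example.
    return batch * in_features * out_features

print(linear_multiplies(1024, 1024))       # 1048576  (about 1M)
print(linear_multiplies(4096, 4096))       # 16777216 (about 16M)
print(linear_multiplies(4096, 4096, 128))  # 2147483648 for a batch of 128
```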

Overview

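The Linear constructor takes three arguments: in_features, out_features, and an optional bias flag (default True). It allocates the weight, a Tensor of shape (out_features, in_features), and, when bias=True, the bias, a Tensor of shape (out_features,). Both are created with requires_grad=True. When bias=False, no bias parameter is allocated and self.bias is None, which is what forward checks for.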

forward(x) calls the matmul op on x and weight, then adds bias if it is not None. The matmul is dispatched through the same op system as the rest of the runtime; the strategy selector picks MATMUL_NAIVE, MATMUL_TILED, MATMUL_REGISTER_TILED, or MATMUL_VECTORIZED depending on the input shape and the device.

The backward pass is implemented in the same op: it computes the gradient w.r.t. x, weight, and bias, all as fused operations that participate in the Tape.
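
For reference, the three gradients follow from the chain rule applied to y = x weight^T + bias. The NumPy sketch below writes them out and checks one weight entry against a finite difference. It illustrates the math only; it is not the fused netcl kernel, and linear_backward is a name chosen for this example.

```python
import numpy as np

def linear_backward(x, weight, grad_y):
    # grad_y has the shape of y: (N, out_features).
    grad_x = grad_y @ weight        # (N, in_features)
    grad_w = grad_y.T @ x           # (out_features, in_features)
    grad_b = grad_y.sum(axis=0)     # (out_features,), summed over the batch
    return grad_x, grad_w, grad_b

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3))
weight = rng.standard_normal((2, 3))
bias = rng.standard_normal(2)
grad_y = rng.standard_normal((4, 2))

def loss(w):
    # Scalar whose gradient w.r.t. y is exactly grad_y.
    return float(np.sum((x @ w.T + bias) * grad_y))

grad_x, grad_w, grad_b = linear_backward(x, weight, grad_y)

# Central finite difference on a single weight entry.
eps = 1e-6
w_plus, w_minus = weight.copy(), weight.copy()
w_plus[1, 2] += eps
w_minus[1, 2] -= eps
numeric = (loss(w_plus) - loss(w_minus)) / (2 * eps)
print(numeric, grad_w[1, 2])
```

The two printed values should agree to within floating-point rounding.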

Where It Lives

File path: nn/layers.py (class Linear).
Module path: netcl.nn.layers.
Public re-export: from netcl.nn import Linear.

How It Works

The forward is:

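def forward(self, x):
    # x is (N, in_features); self.weight is (out_features, in_features).
    # The product contracts over in_features (y = x @ weight^T), as in
    # the formula above; the transpose handling inside the op is not shown.
    y = ops.matmul(x, self.weight)
    if self.bias is not None:
        y = y + self.bias   # bias (out_features,) broadcasts over the N rows
    return y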

The matmul op is ops/matmul.py. It supports several strategies:

MATMUL_NAIVE: one work-item per output element. Slow but simple; used as a fallback.
MATMUL_TILED: the output is split into 16x16 tiles, and each work-group computes one 16x16 output sub-block. Good for small matrices.
MATMUL_REGISTER_TILED: same tiling, but each work-item keeps its part of the input tile in registers. Faster on devices with enough registers.
MATMUL_VECTORIZED: uses float4 / float8 vector loads where the device supports them. The fastest strategy on most modern devices.
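
To make the tiling idea concrete, here is a host-side NumPy sketch of a blocked matmul that produces the output one 16x16 tile at a time. It only illustrates the loop structure that a tiled kernel parallelises (on a GPU the two outer loops run concurrently); it is not the netcl kernel, and tiled_matmul and TILE are names chosen for this example.

```python
import numpy as np

TILE = 16  # tile edge, matching the 16x16 tiles described above

def tiled_matmul(a, b, tile=TILE):
    # a: (M, K), b: (K, N) -> (M, N), computed one output tile at a time.
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    out = np.zeros((m, n), dtype=np.float32)
    for i0 in range(0, m, tile):            # tile row of the output
        for j0 in range(0, n, tile):        # tile column of the output
            rows = min(tile, m - i0)        # edge tiles may be smaller
            cols = min(tile, n - j0)
            acc = np.zeros((rows, cols), dtype=np.float32)
            for k0 in range(0, k, tile):    # walk the shared dimension
                a_tile = a[i0:i0 + tile, k0:k0 + tile]
                b_tile = b[k0:k0 + tile, j0:j0 + tile]
                acc += a_tile @ b_tile      # partial product for this tile
            out[i0:i0 + tile, j0:j0 + tile] = acc
    return out

rng = np.random.default_rng(0)
a = rng.standard_normal((40, 50)).astype(np.float32)
b = rng.standard_normal((50, 24)).astype(np.float32)
print(np.allclose(tiled_matmul(a, b), a @ b, atol=1e-4))
```

The shapes are deliberately not multiples of 16, so the edge-tile handling is exercised.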

The strategy selector picks the best one based on the input shape and the device profile. For a 4096 x 4096 matmul on an NVIDIA GPU, MATMUL_VECTORIZED wins; for a 32 x 32 matmul, MATMUL_TILED wins.

Code Example

import netcl as nc
import netcl.nn as nn

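# device, ctx and q are assumed to exist already (a device handle,
# a context and a queue); their setup is not shown here.
fc = nn.Linear(in_features=1024, out_features=256, bias=True)
fc = fc.to(device)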

x = nc.Tensor.zeros((128, 1024), dtype="float32",
                    context=ctx, queue=q)
y = fc(x)             # (128, 256)

A network with two linear layers (an MLP with a ReLU activation between them):

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 256)
        self.fc2 = nn.Linear(256, 10)

    def forward(self, x):
        x = nc.relu(self.fc1(x))
        return self.fc2(x)
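
The relu call between fc1 and fc2 is what makes this model non-linear. Without it, two stacked Linear layers reduce to a single affine map. The NumPy sketch below (independent of netcl, using the same 784 -> 256 -> 10 sizes) shows the collapse; w_merged and b_merged are names chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
w1, b1 = rng.standard_normal((256, 784)), rng.standard_normal(256)
w2, b2 = rng.standard_normal((10, 256)), rng.standard_normal(10)
x = rng.standard_normal((32, 784))

# Two linear layers with no activation in between ...
stacked = (x @ w1.T + b1) @ w2.T + b2

# ... equal one linear layer with a merged weight and bias.
w_merged = w2 @ w1           # (10, 784)
b_merged = w2 @ b1 + b2      # (10,)
merged = x @ w_merged.T + b_merged
print(np.allclose(stacked, merged))

# With a ReLU in between, the merged layer no longer matches.
with_relu = np.maximum(x @ w1.T + b1, 0.0) @ w2.T + b2
print(np.allclose(with_relu, merged))
```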

Performance & Trade-offs

The matmul is the bottleneck. The strategy selector picks the best algorithm for the shape and device, so you rarely need to override it.
Under AMP, the linear runs in fp16; the accumulator is fp32. The JIT Compiler can fuse the linear + relu chain into a single kernel.
For very small out_features (e.g. the final classifier head with 10 outputs), the matmul is a skinny product close to GEMV shape, and the strategy selector picks a GEMV kernel. This is 2x to 3x faster than a generic matmul.
bias=False is the right choice for a Linear layer that is immediately followed by a BatchNorm: the BN subtracts the per-feature mean, which cancels any constant bias, and its own shift parameter takes over that role.
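
The bias point can be checked numerically. In the NumPy sketch below, a training-mode batch normalisation (written by hand as batchnorm_train, a name chosen for this example, with the learnable scale and shift omitted) gives the same output whether or not the preceding linear layer adds a bias:

```python
import numpy as np

def batchnorm_train(y, eps=1e-5):
    # Normalise each feature over the batch; scale and shift are omitted.
    mean = y.mean(axis=0)
    var = y.var(axis=0)
    return (y - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 32))
weight = rng.standard_normal((16, 32))
bias = rng.standard_normal(16)

with_bias = batchnorm_train(x @ weight.T + bias)
without_bias = batchnorm_train(x @ weight.T)
print(np.allclose(with_bias, without_bias))
```

Adding a per-feature constant shifts the batch mean by the same constant and leaves the variance unchanged, so the bias drops out after the mean subtraction.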