Linear
Status: Public API in
netcl.nn.layers.Linear
Linear is a fully-connected (a.k.a. dense) layer: a learnable
matrix multiply followed by an optional bias add. Given an
input of shape (N, in_features) and a weight of shape
(out_features, in_features), the layer computes:
y[n, o] = sum_{i} weight[o, i] * x[n, i] + bias[o]
The result has shape (N, out_features). The weight is
initialised with Kaiming uniform; the bias is initialised with
zeros.
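The formula above can be checked against a small NumPy reference. This is only a sketch of the math, not netcl code: `linear_forward` is a hypothetical helper name, and the arrays are plain NumPy arrays rather than netcl tensors.

```python
import numpy as np

def linear_forward(x, weight, bias=None):
    """Reference forward: y[n, o] = sum_i weight[o, i] * x[n, i] + bias[o]."""
    # weight is stored as (out_features, in_features), so the contraction
    # over in_features is x @ weight.T.
    y = x @ weight.T
    if bias is not None:
        y = y + bias  # broadcasts (out_features,) across the batch dimension
    return y

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3)).astype(np.float32)       # (N, in_features)
weight = rng.standard_normal((2, 3)).astype(np.float32)  # (out_features, in_features)
bias = np.zeros(2, dtype=np.float32)                     # (out_features,)

y = linear_forward(x, weight, bias)
print(y.shape)  # (4, 2)
```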
Linear is the building block of MLP and is
used as the classification head of nearly every classifier
architecture. It is also often the layer that consumes the most
GPU time: a 1024 -> 1024 linear costs about 1M
multiply-accumulates per example (1024 x 1024), and a
4096 -> 4096 linear about 16M (4096 x 4096).
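The arithmetic behind those figures is just in_features times out_features per example. A minimal sketch (`linear_macs` is a hypothetical helper, not part of netcl):

```python
def linear_macs(in_features, out_features, batch=1):
    """Multiply-accumulates for one forward pass of a Linear layer."""
    # Each of the out_features outputs needs in_features multiply-adds,
    # and this is repeated for every example in the batch.
    return batch * in_features * out_features

print(linear_macs(1024, 1024))            # 1048576  (about 1M)
print(linear_macs(4096, 4096))            # 16777216 (about 16M)
print(linear_macs(1024, 256, batch=128))  # 33554432
```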
Overview
The Linear constructor takes three arguments: in_features,
out_features, and (optionally) bias=True. The constructor
allocates two Tensor parameters — weight of shape
(out_features, in_features) and bias of shape
(out_features,) — both with requires_grad=True.
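To make the parameter shapes and initialisation concrete, here is a NumPy sketch of what the constructor sets up. `LinearParams` is a hypothetical stand-in class, and the bound sqrt(6 / fan_in) is an assumption (Kaiming uniform with the ReLU gain); the gain netcl actually uses is not stated here.

```python
import math
import numpy as np

class LinearParams:
    """Hypothetical stand-in for the parameters a Linear layer allocates."""

    def __init__(self, in_features, out_features, bias=True, seed=0):
        rng = np.random.default_rng(seed)
        # Kaiming uniform: U(-bound, bound) with bound = gain * sqrt(3 / fan_in).
        # Assumption: gain = sqrt(2) (ReLU), which gives sqrt(6 / fan_in).
        bound = math.sqrt(6.0 / in_features)
        self.weight = rng.uniform(
            -bound, bound, size=(out_features, in_features)
        ).astype(np.float32)
        # The bias starts at zero, or is absent when bias=False.
        self.bias = np.zeros(out_features, dtype=np.float32) if bias else None

p = LinearParams(1024, 256)
print(p.weight.shape, p.bias.shape)  # (256, 1024) (256,)
```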
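forward(x) calls the matmul op on x and weight, contracting
over in_features (mathematically x times the transpose of
weight, since weight is stored as (out_features, in_features)),
then adds bias if it is not None. The matmul is dispatched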
through the same op system as the rest of the runtime; the
strategy selector picks MATMUL_NAIVE, MATMUL_TILED,
MATMUL_REGISTER_TILED, or MATMUL_VECTORIZED depending on
the input shape and the device.
The backward pass is implemented in the same op: it computes
the gradient w.r.t. x, weight, and bias, all as fused
operations that participate in the Tape.
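The three gradients follow directly from the forward formula. The sketch below shows only the math in NumPy, under the assumption that grad_y is the upstream gradient of shape (N, out_features); it does not model the fused kernels or the Tape, and `linear_backward` is a hypothetical name.

```python
import numpy as np

def linear_backward(x, weight, grad_y):
    """Gradients of y = x @ weight.T + bias with respect to x, weight, bias."""
    grad_x = grad_y @ weight        # (N, in_features)
    grad_weight = grad_y.T @ x      # (out_features, in_features)
    grad_bias = grad_y.sum(axis=0)  # (out_features,), summed over the batch
    return grad_x, grad_weight, grad_bias

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 5))
weight = rng.standard_normal((3, 5))
grad_y = rng.standard_normal((8, 3))

gx, gw, gb = linear_backward(x, weight, grad_y)
print(gx.shape, gw.shape, gb.shape)  # (8, 5) (3, 5) (3,)
```

Each gradient has the same shape as the tensor it belongs to, which is a quick sanity check when writing or debugging a backward pass.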
Where It Lives
- File path: nn/layers.py (class Linear).
- Module path: netcl.nn.layers.
- Public re-export: from netcl.nn import Linear.
How It Works
The forward is:
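def forward(self, x):
    # weight is (out_features, in_features); the product contracts over in_features
    y = ops.matmul(x, self.weight)
    if self.bias is not None:
        y = y + self.bias  # bias broadcasts across the batch dimension
    return y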
The matmul op is implemented in ops/matmul.py. It supports
several strategies:
- MATMUL_NAIVE: one work-item per output element. Slow but simple. Used as a fallback.
- MATMUL_TILED: output tiles of 16x16, each work-item computes a 16x16 output sub-block. Good for small matrices.
- MATMUL_REGISTER_TILED: same tiling, but each work-item uses register storage for the input tile. Faster on devices with enough registers.
- MATMUL_VECTORIZED: uses float4/float8 vector loads where the device supports them. The fastest strategy on most modern devices.
The strategy selector picks the best one based on the input
shape and the device profile. For a 4096 x 4096 matmul on an
NVIDIA GPU, MATMUL_VECTORIZED wins; for a 32 x 32 matmul,
MATMUL_TILED wins.
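The tiling idea behind the tiled strategies can be illustrated on the CPU. The NumPy sketch below is not the netcl kernel: it only shows how a 16x16 output tile is accumulated from matching tiles of the two inputs. `matmul_tiled` is a hypothetical name, and the loop order is an assumption.

```python
import numpy as np

TILE = 16  # tile edge, matching the 16x16 tiles described above

def matmul_tiled(a, b, tile=TILE):
    """Compute a @ b one output tile at a time."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    out = np.zeros((m, n), dtype=np.float32)
    for i0 in range(0, m, tile):          # rows of the output tile
        for j0 in range(0, n, tile):      # columns of the output tile
            # Accumulator for one output tile (smaller at the matrix edges).
            acc = np.zeros((min(tile, m - i0), min(tile, n - j0)), dtype=np.float32)
            for k0 in range(0, k, tile):  # walk along the shared dimension
                a_tile = a[i0:i0 + tile, k0:k0 + tile]
                b_tile = b[k0:k0 + tile, j0:j0 + tile]
                acc += a_tile @ b_tile
            out[i0:i0 + tile, j0:j0 + tile] = acc
    return out

rng = np.random.default_rng(0)
a = rng.standard_normal((40, 50)).astype(np.float32)
b = rng.standard_normal((50, 33)).astype(np.float32)
c = matmul_tiled(a, b)
print(c.shape)  # (40, 33)
```

On a GPU, the point of tiling is that each input tile is loaded once into fast memory and reused for many output elements. The sketch keeps the same loop structure without modelling memory.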
Code Example
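import netcl as nc
import netcl.nn as nn

# device, ctx, and q are assumed to have been created earlier
# (the target device, its context, and a command queue).
fc = nn.Linear(in_features=1024, out_features=256, bias=True)
fc = fc.to(device)

x = nc.Tensor.zeros((128, 1024), dtype="float32",
                    context=ctx, queue=q)
y = fc(x)  # (128, 256)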
A network with two linear layers (an MLP with a ReLU between them):
class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 256)
        self.fc2 = nn.Linear(256, 10)

    def forward(self, x):
        x = nc.relu(self.fc1(x))
        return self.fc2(x)
Performance & Trade-offs
- The matmul is the bottleneck. The strategy selector picks the best algorithm; you rarely need to override.
- Under AMP, the linear runs in fp16 with an fp32 accumulator. The JIT Compiler can fuse the linear + relu chain into a single kernel.
- For very small out_features (e.g. the final classifier head with 10 outputs), the matmul is GEMV-shaped and the strategy selector picks a GEMV kernel. This is 2x to 3x faster than a generic matmul.
- bias=False is the right choice for a Linear that is immediately followed by a BatchNorm (the BN absorbs the bias).
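The last point, that BatchNorm absorbs the bias, can be verified numerically. The sketch below assumes training-mode batch normalisation (batch statistics, affine scale and shift omitted) and uses plain NumPy. `batchnorm` is a hypothetical helper, not the netcl layer.

```python
import numpy as np

def batchnorm(y, eps=1e-5):
    """Normalise each feature over the batch dimension (no scale or shift)."""
    mean = y.mean(axis=0)
    var = y.var(axis=0)
    return (y - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
x = rng.standard_normal((128, 64))
weight = rng.standard_normal((10, 64))  # (out_features, in_features)
bias = rng.standard_normal(10)

with_bias = batchnorm(x @ weight.T + bias)
without_bias = batchnorm(x @ weight.T)

# A per-feature constant shifts the batch mean by the same amount and leaves
# the variance unchanged, so subtracting the mean cancels it.
print(np.allclose(with_bias, without_bias))  # True
```

Since the normalised output is identical either way, the bias parameter of the Linear adds memory and one extra add without changing the result.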