
ReLU

Status: Public API in netcl.nn.functional.relu and netcl.autograd.ops.relu

ReLU (Rectified Linear Unit) is the canonical activation function of deep learning. Given an input x, it returns max(x, 0) element-wise. The non-linearity is what lets a deep network approximate non-linear functions. Compared with sigmoid or tanh, ReLU is cheap to compute and easy to optimise: its gradient is exactly 1 for positive inputs, so it never becomes vanishingly small there, and 0 otherwise.

ReLU is implemented in netcl as a single elementwise op that the JIT Compiler can fuse with neighbouring ops. A linear + relu chain compiles to a single fused kernel.

Overview

ReLU is elementwise and non-linear; it is one of the cheapest ops in the runtime. The kernel runs one work-item per output element, and each work-item does one compare and one select: it either passes the input through or writes zero.

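The backward pass is also elementwise: the local gradient is 1 for positive inputs and 0 otherwise (including at exactly zero, matching the > 0 test in the kernel), so the upstream gradient is either passed through or zeroed. The fused backward kernel is paired with the fused forward kernel in the JIT output.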
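The forward and backward rules described above can be sketched in plain Python. This is a reference illustration only, not netcl code: the function names relu_forward and relu_backward are hypothetical, and lists of floats stand in for device tensors.

```python
def relu_forward(xs):
    """Element-wise max(x, 0) over a list of floats."""
    return [x if x > 0.0 else 0.0 for x in xs]


def relu_backward(xs, grad_out):
    """Pass the upstream gradient through where x > 0, zero it elsewhere."""
    return [g if x > 0.0 else 0.0 for x, g in zip(xs, grad_out)]


xs = [-2.0, -0.5, 0.0, 0.5, 3.0]
grad_out = [1.0, 2.0, 3.0, 4.0, 5.0]

ys = relu_forward(xs)
grad_in = relu_backward(xs, grad_out)

print(ys)       # [0.0, 0.0, 0.0, 0.5, 3.0]
print(grad_in)  # [0.0, 0.0, 0.0, 4.0, 5.0]
```

Note that the input 0.0 gets a zero gradient here, because the comparison is strict.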

Where It Lives

File path: nn/functional.py (the relu function), nn/modules.py (the ReLU module), autograd/ops.py (the autograd registration), autograd/compiler.py (the JIT primitive).
Module path: netcl.nn.functional.relu, netcl.nn.ReLU.
Sibling activations: tanh, sigmoid, leaky_relu, gelu.

How It Works

The kernel is:

__kernel void relu(__global const float* in,
                   __global float* out) {
    int gid = get_global_id(0);
    out[gid] = in[gid] > 0.0f ? in[gid] : 0.0f;
}

The JIT primitive is:

register_primitive(
    name="relu",
    forward=lambda args, attrs: f"RELU({args[0]})",
    backward=lambda args, grad_var, attrs, out_var:
        [f"{grad_var} * ({args[0]} > 0.0f ? 1.0f : 0.0f)"],
    arity=1,
    fusible=True,
)
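To see what the two lambdas emit, the registration can be exercised outside the compiler. The sketch below assumes a stand-in register_primitive that simply stores its arguments in a dict (PRIMITIVES is a hypothetical name); the real function in autograd/compiler.py may do more. The variable names v1, v2, and g2 are also illustrative.

```python
# Hypothetical stand-in for the JIT registry; not netcl's implementation.
PRIMITIVES = {}


def register_primitive(name, forward, backward, arity, fusible):
    """Record a primitive's code generators under its name."""
    PRIMITIVES[name] = {
        "forward": forward,
        "backward": backward,
        "arity": arity,
        "fusible": fusible,
    }


# Same registration as shown above.
register_primitive(
    name="relu",
    forward=lambda args, attrs: f"RELU({args[0]})",
    backward=lambda args, grad_var, attrs, out_var: [
        f"{grad_var} * ({args[0]} > 0.0f ? 1.0f : 0.0f)"
    ],
    arity=1,
    fusible=True,
)

relu_prim = PRIMITIVES["relu"]

# The forward generator returns one expression string for the output.
fwd_expr = relu_prim["forward"](["v1"], {})
# The backward generator returns one expression string per input.
bwd_exprs = relu_prim["backward"](["v1"], "g2", {}, "v2")

print(fwd_expr)   # RELU(v1)
print(bwd_exprs)  # ['g2 * (v1 > 0.0f ? 1.0f : 0.0f)']
```

Both generators return source text rather than values, which is what lets the compiler splice them into a larger fused kernel body.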

A linear + relu chain compiles to a single fused kernel:

__kernel void fused_linear_relu(
    __global const float* in,
    __global const float* w,
    __global const float* b,
    __global float* out
) {
    int gid = get_global_id(0);
    float v0 = in[gid];
    float v1 = 0.0f;
    for (int k = 0; k < W; ++k) v1 += v0 * w[k];
    out[gid] = RELU(v1 + b[0]);
}

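The fused kernel is one launch, three buffer reads (in, w, b), and one write (out). The intermediate linear result stays in a register (v1) and is never written to device memory.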
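A plain-Python emulation can make the effect of fusion concrete. The sketch below mirrors the per-work-item arithmetic of the fused kernel above and compares it with a two-pass version that materialises the intermediate. The function names are hypothetical, len(w) stands in for the kernel's W constant, and Python lists stand in for device buffers.

```python
def relu(v):
    return v if v > 0.0 else 0.0


def linear_then_relu_unfused(inp, w, b):
    """Two passes: the linear result is written to an intermediate buffer."""
    tmp = [sum(x * wk for wk in w) + b[0] for x in inp]  # pass 1
    return [relu(t) for t in tmp]                        # pass 2


def linear_relu_fused(inp, w, b):
    """One pass: each iteration plays the role of one work-item."""
    out = [0.0] * len(inp)
    for gid in range(len(inp)):
        v0 = inp[gid]
        v1 = 0.0
        for k in range(len(w)):
            v1 += v0 * w[k]
        out[gid] = relu(v1 + b[0])  # v1 is never stored to a buffer
    return out


inp = [-1.0, 0.5, 2.0]
w = [0.5, 0.25]
b = [-0.25]

fused = linear_relu_fused(inp, w, b)
unfused = linear_then_relu_unfused(inp, w, b)

print(fused)  # [0.0, 0.125, 1.25]
```

Both paths produce the same values; the fused one simply avoids the write and re-read of tmp.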

Code Example

The functional API:

import netcl as nc
import netcl.nn.functional as F

x = nc.Tensor.from_host(numpy_x)
y = F.relu(x)

The module API:

import netcl.nn as nn

relu = nn.ReLU()
y = relu(x)

In an MLP:

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 256)
        self.fc2 = nn.Linear(256, 10)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        return self.fc2(x)

Performance & Trade-offs

ReLU is the cheapest activation. The kernel is memory-bound (one read, one write) and the compute is negligible.
Under the JIT Compiler, ReLU chains are fused into a single kernel. The result is one read, one write, no intermediate device tensor.
The "dying ReLU" problem: a ReLU unit whose pre-activation is negative for every input outputs zero and receives zero gradient, so its weights stop updating and it may never activate again. Mitigations include LeakyReLU (small negative slope) and better initialisation (Kaiming, used by default in Linear).
ReLU is sensitive to fp16 underflow: a positive value that is tiny but non-zero in fp32 may round to zero in fp16, which flips its gradient mask from 1 to 0. The GradScaler is the standard defence.
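The fp16 underflow point can be demonstrated with the standard library alone. The sketch below round-trips a value through IEEE 754 half precision using struct's "e" format; to_fp16 and SCALE are hypothetical names, and the fixed power-of-two scale only illustrates the idea behind scaling, not how netcl's GradScaler chooses or applies its factor.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def relu(v):
    return v if v > 0.0 else 0.0


tiny = 1e-8        # positive, but below the smallest fp16 subnormal
SCALE = 65536.0    # hypothetical scale factor (2**16)

# Without scaling, the value underflows to zero before ReLU sees it.
unscaled = relu(to_fp16(tiny))

# With scaling, the value survives the fp16 cast and is unscaled afterwards.
scaled = relu(to_fp16(tiny * SCALE)) / SCALE

print(unscaled)      # 0.0
print(scaled > 0.0)  # True
```

The recovered value is not bit-identical to the original, since fp16 carries only about three decimal digits, but it stays on the positive side of the ReLU threshold.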