ReLU

Status: Public API in netcl.nn.functional.relu and netcl.autograd.ops.relu

ReLU (Rectified Linear Unit) is the most widely used activation function in deep learning. Given an input x, it returns max(x, 0) element-wise. The non-linearity is what lets a deep network approximate non-linear functions. Compared with sigmoid or tanh, ReLU is cheap to compute and easy to optimise: its gradient is exactly 1 for positive inputs, so it does not shrink as it passes back through many layers, and 0 otherwise.

ReLU is implemented in netcl as a single elementwise op that is fully fusible with the JIT Compiler. A linear + relu chain compiles to a single fused kernel.

Overview

ReLU is elementwise and non-linear; it is one of the cheapest ops in the runtime. The kernel runs one work-item per output element, and each work-item does one compare and one select (pass the input through or write zero).

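The backward pass is also elementwise: the upstream gradient is multiplied by 1 where the input is positive and by 0 elsewhere. The input x = 0 falls in the zero branch, because the comparison is a strict x > 0. The fused backward kernel is paired with the fused forward kernel in the JIT output.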
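The forward and backward rules can be written out as a plain-Python reference, which is handy for checking kernel output on small inputs. This is only a sketch and not netcl code: relu_forward and relu_backward are hypothetical names, and the lists stand in for device tensors.

```python
def relu_forward(xs):
    """Element-wise max(x, 0) over a list of floats."""
    return [x if x > 0.0 else 0.0 for x in xs]


def relu_backward(xs, grad_out):
    """Chain rule: pass the upstream gradient where x > 0, block it elsewhere."""
    return [g if x > 0.0 else 0.0 for x, g in zip(xs, grad_out)]


xs = [-2.0, -0.5, 0.0, 0.5, 3.0]
upstream = [0.1, 0.2, 0.3, 0.4, 0.5]

ys = relu_forward(xs)
grads = relu_backward(xs, upstream)

# A unit whose inputs are all non-positive passes no gradient back at all.
dead = relu_backward([-1.0, -3.0, 0.0], [1.0, 1.0, 1.0])

print(ys)
print(grads)
print(dead)
```

The last call shows why the zero branch matters in practice: when every input to a unit is non-positive, nothing flows back through it.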

Where It Lives

  • File path: nn/functional.py (the relu function), nn/modules.py (the ReLU module), autograd/ops.py (the autograd registration), autograd/compiler.py (the JIT primitive).
  • Module path: netcl.nn.functional.relu, netcl.nn.ReLU.
  • Sibling activations: tanh, sigmoid, leaky_relu, gelu.

How It Works

The kernel is:

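__kernel void relu(__global const float* in,
                   __global float* out) {
    // One work-item per element. There is no bounds check, so the
    // global work size must equal the number of elements.
    int gid = get_global_id(0);
    float x = in[gid];               // single global read
    out[gid] = x > 0.0f ? x : 0.0f;  // compare and select
}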

The JIT primitive is:

register_primitive(
    name="relu",
    forward=lambda args, attrs: f"RELU({args[0]})",
    backward=lambda args, grad_var, attrs, out_var:
        [f"{grad_var} * ({args[0]} > 0.0f ? 1.0f : 0.0f)"],
    arity=1,
    fusible=True,
)
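To see what the two callbacks produce, the registration can be exercised outside the compiler. The sketch below uses a hypothetical stand-in for register_primitive that only stores its arguments in a dict; netcl's real registry is not shown in this article and may behave differently. The variable names v1, v2 and g_out are likewise made up for illustration.

```python
# Hypothetical stand-in for the JIT registry; not netcl's implementation.
PRIMITIVES = {}


def register_primitive(name, forward, backward, arity, fusible):
    """Store the code-generation callbacks under the primitive's name."""
    PRIMITIVES[name] = {
        "forward": forward,
        "backward": backward,
        "arity": arity,
        "fusible": fusible,
    }


register_primitive(
    name="relu",
    forward=lambda args, attrs: f"RELU({args[0]})",
    backward=lambda args, grad_var, attrs, out_var:
        [f"{grad_var} * ({args[0]} > 0.0f ? 1.0f : 0.0f)"],
    arity=1,
    fusible=True,
)

relu = PRIMITIVES["relu"]

# Forward: one OpenCL C expression in terms of the input variable name.
fwd_expr = relu["forward"](["v1"], {})

# Backward: one gradient expression per input (arity 1, so a one-element list).
bwd_exprs = relu["backward"](["v1"], "g_out", {}, "v2")

print(fwd_expr)      # RELU(v1)
print(bwd_exprs[0])  # g_out * (v1 > 0.0f ? 1.0f : 0.0f)
```

Both callbacks return source text rather than values, which is what lets the compiler splice them into a larger fused kernel body.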

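A linear + relu chain compiles to a single fused kernel. The listing does not define W or RELU; they are presumably supplied by the JIT as a compile-time constant and a macro: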

__kernel void fused_linear_relu(
    __global const float* in,
    __global const float* w,
    __global const float* b,
    __global float* out
) {
    int gid = get_global_id(0);
    float v0 = in[gid];
    float v1 = 0.0f;
    for (int k = 0; k < W; ++k) v1 += v0 * w[k];
    out[gid] = RELU(v1 + b[0]);
}

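The fused kernel is one launch that reads three buffers (in, w, b) and writes one (out). The pre-activation value v1 stays in a register and is never stored to global memory.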
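The saving from fusion can be estimated with simple arithmetic. The sketch below counts only the global-memory traffic for the activation tensor; reads of the input, weights and bias are the same with or without fusion and are left out. The function names and the example shape are assumptions made for illustration, not measurements from netcl.

```python
BYTES_PER_FLOAT = 4


def unfused_traffic(n_out):
    """Bytes moved for the activation tensor when linear and relu run separately."""
    linear_write = n_out * BYTES_PER_FLOAT  # linear stores its result
    relu_read = n_out * BYTES_PER_FLOAT     # relu loads it back
    relu_write = n_out * BYTES_PER_FLOAT    # relu stores the final output
    return linear_write + relu_read + relu_write


def fused_traffic(n_out):
    """Bytes moved when relu is applied in registers before the single store."""
    return n_out * BYTES_PER_FLOAT


n_out = 64 * 256  # assumed example: batch of 64, 256 output features
saved = unfused_traffic(n_out) - fused_traffic(n_out)

print(unfused_traffic(n_out), fused_traffic(n_out), saved)
```

For this shape the unfused path moves three times as many bytes for the activation tensor as the fused one, and it also needs a second kernel launch.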

Code Example

The functional API:

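import numpy as np

import netcl as nc
import netcl.nn.functional as F

# Example host data; any float32 array works.
numpy_x = np.array([-1.0, 0.0, 2.0], dtype=np.float32)

x = nc.Tensor.from_host(numpy_x)
y = F.relu(x)  # element-wise max(x, 0)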

The module API:

import netcl.nn as nn

relu = nn.ReLU()
y = relu(x)

In an MLP:

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 256)
        self.fc2 = nn.Linear(256, 10)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        return self.fc2(x)

Performance & Trade-offs

  • ReLU is the cheapest activation. The kernel is memory-bound (one read, one write) and the compute is negligible.
  • Under the JIT Compiler, ReLU chains are fused into a single kernel. The result is one read, one write, no intermediate device tensor.
  • The "dying ReLU" problem: a unit whose pre-activation is negative for every training input outputs zero and receives zero gradient, so its weights stop updating and it may never activate again. Mitigations include LeakyReLU (small negative slope) and better initialisation (Kaiming, used by default in Linear).
  • ReLU is sensitive to fp16 underflow: a small positive value that is representable in fp32 rounds to zero in fp16 once it falls below the fp16 subnormal range (about 6e-8). The GradScaler is the standard defence.
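The fp16 underflow point can be reproduced with the standard library alone, since Python's struct module supports IEEE 754 half precision through the "e" format. The sketch below illustrates the general loss-scaling idea, not netcl's GradScaler, whose internals are not described here. The gradient value and the scale factor are assumptions chosen for the example.

```python
import struct


def to_fp16(x):
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


grad = 1e-8     # fine in fp32, but below the smallest fp16 subnormal
scale = 1024.0  # assumed loss-scale factor (a power of two)

unscaled = to_fp16(grad)        # underflows to zero
scaled = to_fp16(grad * scale)  # survives as an fp16 subnormal
recovered = scaled / scale      # unscale in higher precision

print(unscaled, scaled, recovered)
```

Without scaling the value is lost entirely. With scaling it survives the round trip through fp16 and is recovered with a small rounding error.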

See also

  • ReLU — the API page.
  • Linear — the typical pre-ReLU layer.
  • JIT Compiler — fuses the linear + ReLU chain.
  • MLP — the canonical user.
  • AMP — fp16 underflow considerations.