ReLU
Status: Public API in netcl.nn.functional.relu and netcl.autograd.ops.relu
ReLU (Rectified Linear Unit) is the canonical activation
function of deep learning. Given an input x, it returns
max(x, 0) element-wise. The non-linearity is what lets a
deep network approximate non-linear functions. ReLU is
preferred over sigmoid or tanh because it is cheap to compute
and easy to optimise: its gradient is exactly 1 for positive
inputs and 0 otherwise, so it never becomes vanishingly small
for positive inputs.
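To make the gradient claim concrete, here is a minimal sketch in plain Python, independent of netcl. The helper names are illustrative and not part of any API. It compares the derivative of ReLU with that of the logistic sigmoid at a few positive inputs.

```python
import math

def relu(x: float) -> float:
    """max(x, 0) for a single scalar."""
    return x if x > 0.0 else 0.0

def relu_grad(x: float) -> float:
    """Derivative of ReLU: 1 for x > 0, otherwise 0."""
    return 1.0 if x > 0.0 else 0.0

def sigmoid_grad(x: float) -> float:
    """Derivative of the logistic sigmoid: s(x) * (1 - s(x))."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)

# The sigmoid derivative shrinks as x grows; the ReLU derivative stays at 1.
for x in (0.5, 5.0, 20.0):
    print(f"x={x:5.1f}  relu'={relu_grad(x):.1f}  sigmoid'={sigmoid_grad(x):.3e}")
```

The sigmoid derivative peaks at 0.25 and decays towards zero for large inputs, whereas the ReLU derivative is constant on the positive side.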
ReLU is implemented in netcl as a single elementwise op
that is fully fusible with the JIT Compiler.
A linear + relu chain compiles to a single fused kernel.
Overview
ReLU is elementwise and non-linear; it is one of the
cheapest ops in the runtime. The kernel runs one work-item
per output element, and each work-item does one compare
followed by a select: pass the input through or write zero.
The backward pass is also elementwise: the gradient is 1 for positive inputs and 0 for all other inputs (including exactly zero, since the JIT primitive tests x > 0.0f). The fused backward kernel is paired with the fused forward kernel in the JIT output.
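The forward and backward rules can be sketched as a pure-Python reference, with one list element standing in for one work-item. The function names are illustrative. The treatment of an input of exactly zero follows the `> 0.0f` test that appears in the JIT primitive on this page.

```python
from typing import List

def relu_forward(xs: List[float]) -> List[float]:
    # One "work-item" per element: compare against zero, then select.
    return [x if x > 0.0 else 0.0 for x in xs]

def relu_backward(xs: List[float], grad_out: List[float]) -> List[float]:
    # Pass the upstream gradient through where the input was positive;
    # block it (0) everywhere else, including at exactly zero.
    return [g if x > 0.0 else 0.0 for x, g in zip(xs, grad_out)]

xs = [-2.0, 0.0, 3.0]
grad_out = [10.0, 10.0, 10.0]
print(relu_forward(xs))             # [0.0, 0.0, 3.0]
print(relu_backward(xs, grad_out))  # [0.0, 0.0, 10.0]
```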
Where It Lives
- File path: nn/functional.py (the relu function), nn/modules.py (the ReLU module), autograd/ops.py (the autograd registration), autograd/compiler.py (the JIT primitive).
- Module path: netcl.nn.functional.relu, netcl.nn.ReLU.
- Sibling activations: tanh, sigmoid, leaky_relu, gelu.
How It Works
The kernel is:
__kernel void relu(__global const float* in,
                   __global float* out) {
    // One work-item per element.
    int gid = get_global_id(0);
    // Compare against zero, then select the input or zero.
    out[gid] = in[gid] > 0.0f ? in[gid] : 0.0f;
}
The JIT primitive is:
register_primitive(
    name="relu",
    forward=lambda args, attrs: f"RELU({args[0]})",
    backward=lambda args, grad_var, attrs, out_var:
        [f"{grad_var} * ({args[0]} > 0.0f ? 1.0f : 0.0f)"],
    arity=1,
    fusible=True,
)
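To show what the two lambdas produce, here is a self-contained sketch with a hypothetical stand-in registry. The `PRIMITIVES` dictionary and the `emit_forward` and `emit_backward` helpers are assumptions made for illustration, not netcl's actual compiler internals. Only the `register_primitive` call itself is taken from the listing above.

```python
from typing import Dict, List

# Hypothetical stand-in for the compiler's primitive table (not netcl's real one).
PRIMITIVES: Dict[str, dict] = {}

def register_primitive(name, forward, backward, arity, fusible):
    PRIMITIVES[name] = {
        "forward": forward,
        "backward": backward,
        "arity": arity,
        "fusible": fusible,
    }

register_primitive(
    name="relu",
    forward=lambda args, attrs: f"RELU({args[0]})",
    backward=lambda args, grad_var, attrs, out_var: [
        f"{grad_var} * ({args[0]} > 0.0f ? 1.0f : 0.0f)"
    ],
    arity=1,
    fusible=True,
)

def emit_forward(name: str, args: List[str]) -> str:
    """Render the forward expression for a primitive as a source string."""
    prim = PRIMITIVES[name]
    if len(args) != prim["arity"]:
        raise ValueError(f"{name} expects {prim['arity']} argument(s)")
    return prim["forward"](args, {})

def emit_backward(name: str, args: List[str], grad_var: str, out_var: str) -> List[str]:
    """Render one gradient expression per input of the primitive."""
    return PRIMITIVES[name]["backward"](args, grad_var, {}, out_var)

print(emit_forward("relu", ["v1 + b[0]"]))           # RELU(v1 + b[0])
print(emit_backward("relu", ["v1"], "g_out", "v2"))  # ['g_out * (v1 > 0.0f ? 1.0f : 0.0f)']
```

Because the forward rule is a string template over its argument expression, an upstream expression such as `v1 + b[0]` can be substituted directly. That is one plausible reading of how a fusible primitive ends up inside a larger kernel body.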
A linear + relu chain compiles to a single fused kernel:
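__kernel void fused_linear_relu(
    __global const float* in,
    __global const float* w,
    __global const float* b,
    __global float* out
) {
    // W and RELU are not defined in this listing; presumably the
    // JIT-generated source supplies them.
    int gid = get_global_id(0);
    float v0 = in[gid];
    float v1 = 0.0f;
    for (int k = 0; k < W; ++k) v1 += v0 * w[k];
    // The activation is applied in-register, before the single store.
    out[gid] = RELU(v1 + b[0]);
}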
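The fused kernel is one launch, reads from three input buffers
(in, w, b), and one write (out). The intermediate linear output
(v1) stays in a register and is never written to device memory.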
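The saving from fusion can be estimated with simple arithmetic. The sketch below is a back-of-the-envelope model, not a measurement. It assumes 4-byte floats, as in the listings, and that an unfused pipeline would write the linear output to device memory and read it back in a separate relu launch.

```python
FLOAT_BYTES = 4  # the kernels above operate on 32-bit floats

def intermediate_traffic(n_outputs: int, fused: bool) -> int:
    """Bytes moved to and from device memory for the pre-activation tensor."""
    if fused:
        # The pre-activation stays in a register (v1 in the fused listing).
        return 0
    # Unfused: the linear kernel writes it, then the relu kernel reads it back.
    return 2 * n_outputs * FLOAT_BYTES

def launches(fused: bool) -> int:
    """Kernel launches needed for linear followed by relu."""
    return 1 if fused else 2

n = 256 * 1024  # example output size
saved_bytes = intermediate_traffic(n, fused=False) - intermediate_traffic(n, fused=True)
saved_launches = launches(fused=False) - launches(fused=True)
print(saved_bytes, saved_launches)  # 2097152 1
```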
Code Example
The functional API:
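import numpy as np
import netcl as nc
import netcl.nn.functional as F

# Example host data; any float32 array works here.
numpy_x = np.array([-1.0, 0.0, 2.0], dtype=np.float32)
x = nc.Tensor.from_host(numpy_x)
y = F.relu(x)  # element-wise max(x, 0)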
The module API:
import netcl.nn as nn
relu = nn.ReLU()
y = relu(x)
In an MLP:
class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 256)
        self.fc2 = nn.Linear(256, 10)

    def forward(self, x):
        # Use the documented functional entry point, netcl.nn.functional.relu.
        x = F.relu(self.fc1(x))
        return self.fc2(x)
Performance & Trade-offs
- ReLU is the cheapest activation. The kernel is memory-bound (one read, one write) and the compute is negligible.
- Under the JIT Compiler, ReLU chains are fused into a single kernel. The result is one read, one write, no intermediate device tensor.
- The "dying ReLU" problem: a ReLU unit whose pre-activation is negative for every input outputs zero and receives zero gradient, so its weights stop updating and it may never activate again. Mitigations include LeakyReLU (small negative slope) and better initialisation (Kaiming, used by default in Linear).
- ReLU is sensitive to fp16 underflow: a small positive value just above zero in fp32 may round to zero in fp16. The GradScaler is the standard defence.
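The last two points can be illustrated with the Python standard library alone. This is a sketch of the underlying ideas, not of netcl's LeakyReLU or GradScaler. The negative slope of 0.01 and the scale factor of 65536 are arbitrary illustrative choices, and `to_fp16` is a hypothetical helper that rounds through the `struct` module's half-precision format.

```python
import struct

def to_fp16(x: float) -> float:
    """Round x to IEEE 754 half precision and back, via the stdlib 'e' format."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def leaky_relu_grad(x: float, negative_slope: float = 0.01) -> float:
    """Gradient of leaky ReLU: non-zero on the negative side, so a unit can recover."""
    return 1.0 if x > 0.0 else negative_slope

small = 1e-8     # representable in fp32, below the smallest fp16 subnormal
scale = 65536.0  # illustrative scale factor (an assumption, not a netcl default)

print(to_fp16(small))                  # 0.0: the value underflows in fp16
print(to_fp16(small * scale) / scale)  # close to 1e-8: scaling first preserves it
print(leaky_relu_grad(-3.0))           # 0.01 instead of 0.0
```

Scaling before the cast and unscaling afterwards in higher precision is the general idea behind gradient scaling. The non-zero negative slope is what lets a leaky unit keep receiving gradient when its pre-activation is negative.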