
KernelSelector

Status: Public API in netcl.core.kernel_selector.KernelSelector

KernelSelector is the per-op dispatcher that picks which OpenCL kernel runs for an op, based on the device profile and the input shapes. It is a small registry mapping an op name to an ordered list of (condition, kernel_name) rules; at dispatch time, the first rule whose condition(shapes, profile) returns True wins, and the corresponding kernel is launched.

The selector is the reason netcl can run the same model on NVIDIA, Intel, AMD, and Apple GPUs without the user writing per-vendor code. The per-op rule list encodes the "which-strategy-is-best-where" knowledge that the netcl team has accumulated over time.

Overview

from typing import Callable


class KernelSelector:
    def __init__(self):
        self._rules: dict[str, list[tuple[Callable, str]]] = {}
        self._builtins: dict[str, str] = {}

    def register(self, op_name: str, condition: Callable, kernel: str) -> None:
        ...

    def select(self, op_name: str, shapes, profile) -> str:
        ...

The register method adds a rule of the form "for op op_name, if condition(shapes, profile) is True, use kernel kernel". Rules are tried in registration order; the first match wins. The select method is called by the op dispatch code to find the kernel to launch.
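The first-match-wins behaviour described above can be sketched in a few lines. This is a minimal illustration, not netcl's actual implementation: the class name SketchSelector, the RELU_* kernel names, the use of plain tuples as shapes, and the LookupError raised when nothing matches are all assumptions made for the example.

```python
from typing import Any, Callable

# A condition takes (shapes, profile) and returns True when its rule applies.
Condition = Callable[[Any, Any], bool]


class SketchSelector:
    """Illustrative first-match-wins registry (not the real KernelSelector)."""

    def __init__(self) -> None:
        self._rules: dict[str, list[tuple[Condition, str]]] = {}

    def register(self, op_name: str, condition: Condition, kernel: str) -> None:
        # Append, so rules registered earlier keep priority.
        self._rules.setdefault(op_name, []).append((condition, kernel))

    def select(self, op_name: str, shapes: Any, profile: Any) -> str:
        # Walk the rules in registration order and return the first match.
        for condition, kernel in self._rules.get(op_name, []):
            if condition(shapes, profile):
                return kernel
        # Assumption: what the real selector does here is not documented.
        raise LookupError(f"no rule matched for op {op_name!r}")


sel = SketchSelector()
# Hypothetical kernel names; shapes are plain tuples in this sketch.
sel.register("relu", lambda shapes, profile: shapes[0][0] >= 1024, "RELU_VECTORIZED")
sel.register("relu", lambda shapes, profile: True, "RELU_SCALAR")

large = sel.select("relu", [(4096,)], profile=None)
small = sel.select("relu", [(16,)], profile=None)
```

Because the catch-all rule is registered last, it only applies when the more specific rule does not match. Any rule placed after a catch-all would never be reached in this sketch.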

Where It Lives

File path: core/kernel_selector.py.
Module path: netcl.core.kernel_selector.
Public re-export: from netcl.core.kernel_selector import KernelSelector.
Siblings: KernelSpec (the per-kernel record), DeviceProfile (the per-device capabilities record).

How It Works

The selector is consulted at every op dispatch. For example, the conv2d op in ops/conv2d.py:

Calls selector.select("conv2d", shapes, profile).
The selector walks the rule list for "conv2d". Typical rules are:
    * If profile is Apple and shape is 3x3 / stride 1: use CONV2D_WINOGRAD.
    * If shape is 1x1: use CONV2D_IM2COL.
    * If profile is Intel integrated: use CONV2D_TILED_LOCAL.
    * Otherwise: use CONV2D_IMPLICIT_GEMM.
Returns the kernel name.
The op builds a KernelSpec for that kernel (or fetches the cached one) and launches it.
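The typical conv2d rules listed above could be written as an ordered list of (condition, kernel_name) pairs. The sketch below is illustrative only: FakeProfile and ConvShape are hypothetical stand-ins for DeviceProfile and the real shape objects, their field names (vendor, integrated, kh, kw, stride) are assumptions, and each condition receives a single shape record rather than the full shapes argument.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeProfile:
    # Hypothetical stand-in for DeviceProfile; the real fields may differ.
    vendor: str
    integrated: bool = False


@dataclass(frozen=True)
class ConvShape:
    # Hypothetical shape record: filter height, filter width, and stride.
    kh: int
    kw: int
    stride: int = 1


# Ordered rules mirroring the typical conv2d list; order matters.
CONV2D_RULES = [
    (lambda s, p: p.vendor == "apple" and (s.kh, s.kw) == (3, 3) and s.stride == 1,
     "CONV2D_WINOGRAD"),
    (lambda s, p: (s.kh, s.kw) == (1, 1), "CONV2D_IM2COL"),
    (lambda s, p: p.vendor == "intel" and p.integrated, "CONV2D_TILED_LOCAL"),
    (lambda s, p: True, "CONV2D_IMPLICIT_GEMM"),  # catch-all fallback
]


def pick_conv2d(shape: ConvShape, profile: FakeProfile) -> str:
    # The first rule whose condition holds wins.
    for condition, kernel in CONV2D_RULES:
        if condition(shape, profile):
            return kernel
    raise LookupError("unreachable: the last rule always matches")


apple_3x3 = pick_conv2d(ConvShape(3, 3), FakeProfile("apple"))
intel_1x1 = pick_conv2d(ConvShape(1, 1), FakeProfile("intel", integrated=True))
amd_5x5 = pick_conv2d(ConvShape(5, 5, stride=2), FakeProfile("amd"))
```

Note the effect of ordering: a 1x1 convolution on an Intel integrated GPU selects CONV2D_IM2COL, because the 1x1 rule precedes the Intel rule.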

The rule list is hard-coded for the built-in ops, but the user can add their own rules via selector.register(...). This is the standard way to plug a custom kernel into the dispatch system.

Code Example

Registering a custom rule:

from netcl.core.kernel_selector import KernelSelector

selector = KernelSelector()

# If the shape is small, use a custom kernel.
def is_small(shapes, profile):
    return shapes[0].numel() < 1024

selector.register("conv2d", is_small, "CONV2D_NAIVE_SMALL")

Inspecting the built-in rules:

# _rules is a private attribute; treat this as a debugging aid, not a stable API.
for op, rules in selector._rules.items():
    print(f"{op}: {len(rules)} rules")
# conv2d: 5 rules
# matmul: 4 rules
# ...

Performance & Trade-offs

The selector is a pure function of (op_name, shapes, profile). The first call for a given input is O(rules_per_op); subsequent calls with the same input hit a small cache.
The rule list is the only place where the per-vendor heuristics live. Adding a new device is a matter of adding new rules; no other code needs to change.
The selector does not consult the autotuner. The autotuner picks the local size for a given kernel; the selector picks the kernel itself for a given op. The two compose.
Custom rules should be cheap to evaluate. If the condition is expensive, cache the decision yourself and register a precomputed answer.
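One way to keep an expensive custom condition cheap is to memoize its decision and register only the memoized wrapper. The sketch below uses functools.lru_cache from the standard library. The names expensive_check, CALLS, and is_small_cached are hypothetical, and shapes are assumed to be plain tuples of ints so that they are hashable; real netcl shape objects may need converting to a hashable key first.

```python
import functools

# Counts how often the slow path actually runs (for demonstration only).
CALLS = {"n": 0}


def expensive_check(shape: tuple) -> bool:
    # Hypothetical stand-in for a slow heuristic, e.g. a benchmark run.
    CALLS["n"] += 1
    numel = 1
    for dim in shape:
        numel *= dim
    return numel < 1024


@functools.lru_cache(maxsize=256)
def _cached_decision(shape: tuple) -> bool:
    # lru_cache requires hashable arguments, hence the tuple key.
    return expensive_check(shape)


def is_small_cached(shapes, profile) -> bool:
    # Condition in the (shapes, profile) form expected by register().
    return _cached_decision(tuple(shapes[0]))


first = is_small_cached([(8, 8)], None)
second = is_small_cached([(8, 8)], None)  # served from the cache
slow_path_runs = CALLS["n"]
```

The wrapper would then be registered like any other condition, for example selector.register("conv2d", is_small_cached, "CONV2D_NAIVE_SMALL"). This sketch ignores the profile when building the cache key, which is only safe if the decision does not depend on the device.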