KernelSelector
Status: Public API in
netcl.core.kernel_selector.KernelSelector
KernelSelector is the per-op dispatcher that picks which
OpenCL kernel runs for a given op, given the device profile and
the input shapes. The selector is a small registry mapping an
op name to a list of (condition, kernel_name) rules; at
dispatch time, the first rule whose condition(shapes,
profile) is True wins, and the corresponding kernel is
launched.
The selector is the reason netcl can run the same model on NVIDIA, Intel, AMD, and Apple GPUs without the user writing per-vendor code. The per-op rule list encodes the "which-strategy-is-best-where" knowledge that the netcl team has accumulated over time.
Overview
from typing import Callable

class KernelSelector:
    def __init__(self):
        self._rules: dict[str, list[tuple[Callable, str]]] = {}
        self._builtins: dict[str, str] = {}

    def register(self, op_name: str, condition: Callable, kernel: str) -> None:
        ...

    def select(self, op_name: str, shapes, profile) -> str:
        ...
The selector's register method adds a rule of the form
"for op op_name, if condition(shapes, profile) is True, use
kernel kernel". Rules are tried in registration order; the
first match wins. The selector's select method is called by
the op dispatch code to find the right kernel.
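The first-match-wins behaviour described above can be sketched with a small stand-in class. This is an illustration only, not the netcl implementation: ToySelector, the MATMUL_SMALL and MATMUL_GENERIC kernel names, and the choice to raise LookupError when no rule matches are all assumptions made for the example.

```python
from typing import Callable

# A condition receives (shapes, profile) and returns True when its rule applies.
Condition = Callable[[object, object], bool]


class ToySelector:
    """Illustrative stand-in for KernelSelector; not the netcl implementation."""

    def __init__(self) -> None:
        self._rules: dict[str, list[tuple[Condition, str]]] = {}

    def register(self, op_name: str, condition: Condition, kernel: str) -> None:
        # Append to the op's list, so earlier registrations keep priority.
        self._rules.setdefault(op_name, []).append((condition, kernel))

    def select(self, op_name: str, shapes, profile) -> str:
        # Walk the rules in registration order; the first match wins.
        for condition, kernel in self._rules.get(op_name, []):
            if condition(shapes, profile):
                return kernel
        # Assumption: the no-match behaviour is not documented here, so the toy raises.
        raise LookupError(f"no rule matched for op {op_name!r}")


toy = ToySelector()
# Hypothetical kernel names, used only for this example.
toy.register("matmul", lambda shapes, profile: shapes[0][0] < 64, "MATMUL_SMALL")
toy.register("matmul", lambda shapes, profile: True, "MATMUL_GENERIC")

print(toy.select("matmul", [(32, 32)], None))    # MATMUL_SMALL
print(toy.select("matmul", [(512, 512)], None))  # MATMUL_GENERIC
```

Because the catch-all rule is registered last, it only applies when no earlier rule matches. Registering it first would shadow every rule after it.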
Where It Lives
- File path: core/kernel_selector.py.
- Module path: netcl.core.kernel_selector.
- Public re-export: from netcl.core.kernel_selector import KernelSelector.
- Siblings: KernelSpec (the per-kernel record), DeviceProfile (the per-device capabilities record).
How It Works
The selector is consulted at every op dispatch. For example, the
ops/conv2d.py op:
- Calls selector.select("conv2d", shapes, profile).
- The selector walks the rule list for "conv2d". Typical rules are:
  * If profile is Apple and shape is 3x3 / stride 1: use CONV2D_WINOGRAD.
  * If shape is 1x1: use CONV2D_IM2COL.
  * If profile is Intel integrated: use CONV2D_TILED_LOCAL.
  * Otherwise: use CONV2D_IMPLICIT_GEMM.
- Returns the kernel name.
- The op builds a KernelSpec for that kernel (or fetches the cached one) and launches it.
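The typical conv2d rules listed above can be written as an ordered list of (condition, kernel_name) pairs. The sketch below is a hedged illustration: ToyProfile, ToyConvShape, and their fields (vendor, integrated, kernel, stride) are hypothetical stand-ins, since the real DeviceProfile attributes and shape representation are not shown in this article. Only the kernel names come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyProfile:
    vendor: str               # hypothetical field, e.g. "apple", "intel", "nvidia"
    integrated: bool = False  # hypothetical flag for integrated GPUs


@dataclass(frozen=True)
class ToyConvShape:
    kernel: tuple[int, int]   # filter height and width
    stride: int = 1


# (condition, kernel_name) pairs in priority order, mirroring the typical rules.
CONV2D_RULES = [
    (lambda s, p: p.vendor == "apple" and s.kernel == (3, 3) and s.stride == 1,
     "CONV2D_WINOGRAD"),
    (lambda s, p: s.kernel == (1, 1), "CONV2D_IM2COL"),
    (lambda s, p: p.vendor == "intel" and p.integrated, "CONV2D_TILED_LOCAL"),
    (lambda s, p: True, "CONV2D_IMPLICIT_GEMM"),  # the "otherwise" rule
]


def pick_conv2d(shape: ToyConvShape, profile: ToyProfile) -> str:
    # The first rule whose condition holds decides the kernel.
    for condition, kernel in CONV2D_RULES:
        if condition(shape, profile):
            return kernel
    raise LookupError("unreachable while the catch-all rule is last")


print(pick_conv2d(ToyConvShape((3, 3)), ToyProfile("apple")))             # CONV2D_WINOGRAD
print(pick_conv2d(ToyConvShape((1, 1)), ToyProfile("intel", True)))       # CONV2D_IM2COL
print(pick_conv2d(ToyConvShape((5, 5), stride=2), ToyProfile("nvidia")))  # CONV2D_IMPLICIT_GEMM
```

The second call shows why ordering matters: a 1x1 convolution on an Intel integrated GPU matches the 1x1 rule before the Intel rule is ever tried.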
The rule list is hard-coded for the built-in ops, but the user
can add their own rules via selector.register(...). This is
the standard way to plug a custom kernel into the dispatch
system.
Code Example
Registering a custom rule:
from netcl.core.kernel_selector import KernelSelector

selector = KernelSelector()

# If the shape is small, use a custom kernel.
def is_small(shapes, profile):
    return shapes[0].numel() < 1024

selector.register("conv2d", is_small, "CONV2D_NAIVE_SMALL")
Inspecting the built-in rules:
for op, rules in selector._rules.items():
    print(f"{op}: {len(rules)} rules")
# conv2d: 5 rules
# matmul: 4 rules
# ...
Performance & Trade-offs
- The selector is a pure function of (op_name, shapes, profile). The first call is O(rules_per_op); subsequent calls hit a small per-call cache.
- The rule list is the only place where the per-vendor heuristics live. Adding a new device is a matter of adding new rules; no other code needs to change.
- The selector does not consult the autotuner. The autotuner picks the local size for a given kernel; the selector picks the kernel itself for a given op. The two compose.
- Custom rules should be cheap to evaluate. If the condition is expensive, cache the decision yourself and register a precomputed answer.
See also
- KernelSelector — the API page.
- DeviceProfile — the per-device capabilities record the selector consults.
- KernelSpec — the per-kernel record.
- WorkGroupTuner — the per-kernel local-size tuner.
- Autotuner — the higher-level autotuner.