
Array computing: expression templates and SIMD

Why NumPy expressions are slow

For an expression like b = 2*x + 4*x**2 + np.sin(x), NumPy suffers from two compounding problems:

Memory bandwidth waste. Each sub-expression allocates a temporary array and triggers a full pass over memory. For the expression above that is roughly 10 memory passes (3 reads of x, 3 temporary writes, 3 temporary reads, 1 final write). If x is large enough to overflow L2 cache, each pass is a cache miss storm.

No cross-operation SIMD. Because each operation runs as a separate C loop, the compiler never sees the full expression and cannot vectorize across operations. Transcendental functions (sin, exp) are called one element at a time via scalar libm.
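The temporary-array problem is easy to make concrete. Below is a small instrumented sketch in pure Python (the `CountingArray` class is illustrative, not NumPy's implementation) that counts how many arrays an eager, NumPy-style evaluator allocates while computing the expression above:

```python
import math

# Illustrative eager array: every operation allocates a fresh array,
# just as NumPy does for each sub-expression.
class CountingArray:
    allocations = 0
    def __init__(self, data):
        CountingArray.allocations += 1            # one allocation per new array
        self.data = list(data)
    def _zip(self, other, op):
        return CountingArray(op(a, b) for a, b in zip(self.data, other.data))
    def __add__(self, other):  return self._zip(other, lambda a, b: a + b)
    def __rmul__(self, k):     return CountingArray(k * v for v in self.data)
    def __pow__(self, p):      return CountingArray(v ** p for v in self.data)

def sin(arr):
    return CountingArray(math.sin(v) for v in arr.data)

x = CountingArray([0.1, 0.2, 0.3])
CountingArray.allocations = 0                     # count only the expression
b = 2*x + 4*x**2 + sin(x)
assert CountingArray.allocations == 6             # five temporaries + the result
```

Each of those six allocations corresponds to a full read/write pass over memory; a fused evaluator would make exactly one.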

Tools that address this include expression-template libraries on the C++ side (xtensor, used together with xsimd) and Pythran's Pythonic library, both covered below.


How C++ expression templates work

Expression templates are a compile-time metaprogramming pattern. The key idea: operator overloading returns a proxy type encoding the pending operation, not a computed value. Evaluation is deferred until the result is assigned to a concrete array.

// User writes:
auto b = 2*x + xt::sin(x);

// Compiler sees (simplified):
auto b = Add<Mul<Scalar<int>, ArrayRef>, Sin<ArrayRef>>(2, x, x);
// No allocation yet; b is a lightweight proxy object whose type encodes the expression.

// Assignment triggers evaluation:
xt::xarray<float> result = b;
// Compiler emits ONE fused loop:
for (size_t i = 0; i < N; ++i)
    result[i] = 2*x[i] + std::sin(x[i]);

Properties of this approach:

- No temporary arrays and no extra memory passes: evaluation is one fused loop.
- Zero runtime overhead: the expression tree exists only as a compile-time type, so the emitted code is the loop a programmer would write by hand.
- User code is unchanged: the proxy machinery is invisible at the call site.

Pythran’s Pythonic library implements essentially all of NumPy’s API as C++ expression templates, so transpiled Python code that looks like NumPy compiles to a fused loop.
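The proxy pattern translates directly into plain Python (the classes below are illustrative, not xtensor's API): operators build an expression tree, each node exposes `__getitem__`, and a single loop materializes the result.

```python
import math

class Expr:
    # Operator overloads return proxy nodes, never computed values.
    def __add__(self, other):  return Add(self, wrap(other))
    def __radd__(self, other): return Add(wrap(other), self)
    def __mul__(self, other):  return Mul(self, wrap(other))
    def __rmul__(self, other): return Mul(wrap(other), self)

class Array(Expr):
    def __init__(self, data): self.data = list(data)
    def __getitem__(self, i): return self.data[i]

class Scalar(Expr):
    def __init__(self, v): self.v = v
    def __getitem__(self, i): return self.v       # broadcasts the constant

class Add(Expr):
    def __init__(self, l, r): self.l, self.r = l, r
    def __getitem__(self, i): return self.l[i] + self.r[i]

class Mul(Expr):
    def __init__(self, l, r): self.l, self.r = l, r
    def __getitem__(self, i): return self.l[i] * self.r[i]

class Sin(Expr):
    def __init__(self, arg): self.arg = arg
    def __getitem__(self, i): return math.sin(self.arg[i])

def wrap(v): return v if isinstance(v, Expr) else Scalar(v)

def materialize(expr, n):
    # the "assignment": the only loop, and the only allocation
    return [expr[i] for i in range(n)]

x = Array([0.0, 0.5, 1.0])
b = 2 * x + Sin(x)                 # builds Add(Mul(Scalar(2), x), Sin(x))
assert materialize(b, 3)[0] == 0.0
```

The difference from C++ is that here the tree is walked at runtime via dynamic dispatch; in C++ (and, as shown later, in SPy) the walk is resolved at compile time.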


How SIMD fits in

Expression templates and SIMD are orthogonal concerns that compose cleanly. Expression templates fix the loop structure (one fused loop, no temps). SIMD fixes the loop body (process 8–16 elements per iteration instead of 1).

The standard pattern in xsimd/xtensor:

constexpr size_t W = xsimd::batch<float>::size; // e.g. 8 on AVX2

size_t i = 0;
for (; i + W <= N; i += W) {
    auto x_vec = xsimd::load_aligned(&x[i]);       // load 8 floats
    auto res   = 2.0f * x_vec + xsimd::sin(x_vec); // 8-wide fused ops
    xsimd::store_aligned(&out[i], res);
}
for (; i < N; ++i)                                 // scalar tail
    out[i] = 2.0f * x[i] + std::sin(x[i]);

xsimd::batch<float> overloads all arithmetic operators and math functions, so the fused loop body is written once and works for both scalar and SIMD code by substituting the element type. Vectorized math (xsimd::sin, xsimd::exp) is implemented with hand-tuned polynomial approximations; computing 8 sin values costs roughly as much as 1.5 scalar libm calls.
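The main-loop/tail structure itself is language-independent. A Python sketch, with W-wide slices standing in for SIMD registers (W is the batch width, e.g. 8 on AVX2):

```python
import math

def fused(x, W=8):
    N = len(x)
    out = [0.0] * N
    i = 0
    while i + W <= N:                       # "vector" body: whole W-wide chunks
        out[i:i+W] = [2.0*v + math.sin(v) for v in x[i:i+W]]
        i += W
    while i < N:                            # scalar tail: leftover elements
        out[i] = 2.0*x[i] + math.sin(x[i])
        i += 1
    return out

x = [j * 0.1 for j in range(19)]            # 19 = 2 full chunks + a 3-element tail
res = fused(x)
assert all(abs(r - (2.0*v + math.sin(v))) < 1e-12 for r, v in zip(res, x))
```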


SPy’s architecture (from the blog post)

SPy (github.com/spylang/spy) is a statically-typed Python variant that is both interpreted and compiled. Its central mechanism is a blue/red evaluation model: blue code is evaluated once, at compile time, during a phase called redshifting; red code is what remains afterwards and is compiled to C.

Key properties:

- Types, generic instantiations, and metafunction results are blue values: ordinary values manipulated by ordinary code, not a separate template language.
- Redshifting eliminates all blue computation, so the residual red program contains only concrete, statically-typed operations.

The spy-opspec.md document details the dispatch machinery:

The interpreter (ASTFrame) and redshifter (DopplerFrame) share the same dispatch and type-resolution logic. DopplerFrame only overrides what happens with the result: instead of executing the resolved W_OpImpl, it emits an ast.Call node pointing to the concrete function.


The myarray.spy example

from unsafe import gc_alloc, gc_ptr
from operator import OpSpec
from types import StructType

@struct
class ArrayData[DTYPE, NDIM]:
    shape: gc_ptr[i32]
    items: gc_ptr[DTYPE]

@blue.generic
def ndarray(DTYPE, NDIM):

    if NDIM > 2:
        raise StaticError("NDIM>2 not supported")

    ArrayData_c = ArrayData[DTYPE, NDIM]

    @struct
    class Self:
        __ll__: gc_ptr[ArrayData_c]

        @blue.metafunc
        def __new__(*args_m):

            if len(args_m) != NDIM + 1:
                raise StaticError("Wrong number of arguments")

            def new(type_: StructType, *args: Args[NDIM, i32]) -> Self:
                ll = gc_alloc[ArrayData_c](1)
                ll.shape = gc_alloc[i32](NDIM)
                for i, n in unroll(enumerate(args)):
                    ll.shape[i] = n
                total = 1                      # element count, computed at blue time
                for n in unroll(args):
                    total = total * n
                ll.items = gc_alloc[DTYPE](total)
                return Self.__make__(ll)

            return new

    return Self

def main() -> None:
    arr1 = ndarray[f64, 1](4)
    arr2 = ndarray[f64, 2](4, 4)

What happens at blue time

ndarray is a @blue.generic function. Calling ndarray[f64, 1] runs the entire function body at compile time and returns a freshly constructed @struct type. ndarray[f64, 1] and ndarray[f64, 2] are two distinct types, each specialized for their DTYPE and NDIM. This is the SPy equivalent of a C++ class template.
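The blue-generic mechanism can be approximated in plain Python (illustrative code, not SPy) with a memoized class factory: one specialized class per (dtype, ndim) pair, created once and cached, like instantiating a C++ class template:

```python
from functools import lru_cache
import array

@lru_cache(maxsize=None)                       # stands in for blue memoization
def ndarray(dtype, ndim):
    if ndim > 2:
        raise TypeError("NDIM>2 not supported")   # the "StaticError" check

    class Self:
        DTYPE, NDIM = dtype, ndim
        def __init__(self, *dims):
            if len(dims) != ndim:
                raise TypeError("Wrong number of arguments")
            self.shape = dims
            n = 1
            for d in dims:
                n *= d
            self.items = array.array(dtype, [0.0] * n)   # contiguous buffer

    Self.__name__ = f"ndarray_{dtype}_{ndim}"
    return Self

A1 = ndarray('d', 1)               # two distinct specialized classes
A2 = ndarray('d', 2)
assert A1 is ndarray('d', 1)       # memoized: one class per specialization
assert A1 is not A2
assert len(A2(4, 4).items) == 16
```

The crucial difference is that SPy runs this factory at compile time, so the specialization cost and all the checks disappear from the compiled program.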

ArrayData is the low-level backing store: a shape pointer and a data pointer, with a fixed C-compatible layout. __ll__ is the SPy convention for the underlying raw representation field.

The __new__ metafunction

__new__ is a @blue.metafunc: it receives MetaArg objects describing the call-site arguments and returns an OpSpec selecting the concrete constructor.

The generated C for ndarray[f64, 1]'s constructor looks roughly like:

ndarray_f64_1 spy_ndarray_f64_1__new__(int32_t n0) {
    ArrayData_f64_1* ll = gc_alloc(sizeof(ArrayData_f64_1));
    ll->shape = gc_alloc(1 * sizeof(int32_t));
    ll->shape[0] = n0;   // statically unrolled
    ll->items = gc_alloc(n0 * sizeof(double));
    return (ndarray_f64_1){ .ll = ll };
}

No virtual dispatch, no Python object overhead, no dynamic type checks.


Expression fusion in SPy: the path forward

SPy’s @blue.metafunc / OpSpec machinery makes it possible to implement expression templates in user-space SPy code, without any compiler magic.

Naïve (eager) operator

@blue.metafunc
def __add__(m_self, m_other):
    def add(a: Self, b: Self) -> Self:
        out = Self.__new__(Self, a.shape[0])
        for i in range(a.shape[0]):
            out.items[i] = a.items[i] + b.items[i]
        return out
    return OpSpec(add)

This works but allocates a temporary array for each +. Same problem as NumPy.

Lazy / fused operator (expression template style)

The key idea: __add__ returns a lazy expression node instead of triggering computation. The loop is only materialized when the result is assigned to a concrete ndarray, via __convert_to__.

@blue.generic
def LazyAdd(DTYPE, NDIM):
    A = ndarray[DTYPE, NDIM]

    @struct
    class Self:
        left:  A
        right: A

        @blue.metafunc
        def __convert_to__(m_expT, m_gotT, m_x):
            if m_expT.blueval == A:
                def materialize(expr: Self) -> A:
                    out = A.__new__(A, expr.left.shape[0])
                    for i in range(expr.left.shape[0]):
                        out.items[i] = expr.left.items[i] + expr.right.items[i]
                    return out
                return OpSpec(materialize, [m_x])
            return OpSpec.NULL

    return Self

And ndarray.__add__ becomes zero-cost at the call site:

@blue.metafunc
def __add__(m_self, m_other):
    Lazy = LazyAdd[DTYPE, NDIM]
    def add(a: Self, b: Self) -> Lazy:
        return Lazy.__make__(a, b)   # just stores two pointers, no allocation
    return OpSpec(add)

An assignment like result: ndarray[f64, 1] = a + b + c chains through __convert_to__ and emits a single fused loop in the redshifted AST. No temporary arrays; the C backend sees tight straight-line code per element.

In C++, the expression tree is a compile-time type. In SPy, it is a compile-time value (a blue LazyAdd struct holding references). Both achieve the same result after redshifting: one fused loop, zero overhead.
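A pure-Python sketch of the lazy-add pattern (illustrative names, not SPy semantics) makes the division of labor visible: `__add__` only stores its operands, and `materialize()`, standing in for `__convert_to__`, runs one fused loop over the whole chain:

```python
class Arr:
    def __init__(self, items): self.items = list(items)
    def __len__(self): return len(self.items)
    def __getitem__(self, i): return self.items[i]
    def __add__(self, other): return LazyAdd(self, other)

class LazyAdd:
    def __init__(self, left, right):
        self.left, self.right = left, right      # two references, no allocation
    def __len__(self): return len(self.left)
    def __getitem__(self, i): return self.left[i] + self.right[i]
    def __add__(self, other): return LazyAdd(self, other)   # chains a + b + c
    def materialize(self):
        # the single fused loop: one output array, zero temporaries
        return Arr([self[i] for i in range(len(self))])

a, b, c = Arr([1, 2]), Arr([10, 20]), Arr([100, 200])
res = (a + b + c).materialize()
assert res.items == [111, 222]
```

In SPy the `self[i]` calls are resolved away during redshifting; here they remain runtime dispatch, which is exactly the overhead blue evaluation removes.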


Extending laziness to functions: sin, exp, etc.

Operators are straightforward because __add__ has a well-defined dispatch hook. Free functions like sin need a different entry point — but the mechanism is identical: make sin itself a @blue.metafunc that inspects the static type of its argument.

sin as a @blue.metafunc

@blue.metafunc
def sin(m_x):
    T = m_x.static_type

    if T == f64:
        # scalar — eager, direct C call
        def sin_eager(x: f64) -> f64:
            return libc.sin(x)
        return OpSpec(sin_eager)

    elif T == ndarray[f64, 1]:
        # array — return a lazy node
        Lazy = LazySin[f64, 1]
        def sin_lazy(x: ndarray[f64, 1]) -> Lazy:
            return Lazy.__make__(x)   # just stores a reference, no computation
        return OpSpec(sin_lazy)

    raise TypeError(f"sin not defined for {T}")

After redshifting, y = sin(x) where x: ndarray[f64, 1] becomes a direct call to sin_lazy, which constructs a LazySin node at runtime — no computation, just a pointer to x. The same pattern applies to every elementwise function in the math library (cos, exp, log, etc.).
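The dispatch shape can be sketched in plain Python (again illustrative: here the type check happens at runtime, whereas a @blue.metafunc performs it once at blue time):

```python
import math

class LazySin:
    def __init__(self, arg): self.arg = arg        # just a reference
    def __getitem__(self, i): return math.sin(self.arg[i])

def sin(x):
    if isinstance(x, float):       # scalar: eager, direct call
        return math.sin(x)
    return LazySin(x)              # array-like: lazy node, no computation

assert sin(0.0) == 0.0
node = sin([0.0, 1.0])
assert isinstance(node, LazySin)   # nothing computed yet
assert node[0] == 0.0
```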

The problem: lazy nodes don’t compose naïvely

A LazyAdd node stores left: A and right: A — concrete ndarray instances. But in 2*x + sin(x), the right child is a LazySin, not a concrete array. The materialize function in LazyAdd would call expr.right.items[i], which doesn’t exist on LazySin.

The solution: a common __getitem__ interface

Every lazy node implements __getitem__ as a @blue.metafunc that computes one element on demand. This is the SPy equivalent of C++'s operator[] on proxy types.

For LazySin:

@blue.generic
def LazySin(DTYPE, NDIM):
    A = ndarray[DTYPE, NDIM]

    @struct
    class Self:
        arg: A

        @blue.metafunc
        def __getitem__(m_self, m_i):
            def get(expr: Self, i: i64) -> DTYPE:
                return libc.sin(expr.arg.items[i])
            return OpSpec(get)

        @blue.metafunc
        def __convert_to__(m_expT, m_gotT, m_x):
            if m_expT.blueval == A:
                def materialize(expr: Self) -> A:
                    out = A.__new__(A, expr.arg.shape[0])
                    for i in range(expr.arg.shape[0]):
                        out.items[i] = expr[i]   # via __getitem__
                    return out
                return OpSpec(materialize, [m_x])
            return OpSpec.NULL

    return Self

LazyAdd is updated to use the same interface for its children:

@blue.metafunc
def __getitem__(m_self, m_i):
    def get(expr: Self, i: i64) -> DTYPE:
        return expr.left[i] + expr.right[i]   # recurses into children
    return OpSpec(get)

@blue.metafunc
def __convert_to__(m_expT, m_gotT, m_x):
    if m_expT.blueval == A:
        def materialize(expr: Self) -> A:
            out = A.__new__(A, ...)
            for i in range(...):
                out.items[i] = expr[i]   # walks the full tree
            return out
        return OpSpec(materialize, [m_x])

Now expr[i] on a LazyAdd dispatches (at blue time, type statically known) to expr.left[i] + expr.right[i], which in turn dispatches to the __getitem__ of each child. The entire tree is resolved by blue dispatch — no runtime polymorphism. The redshifted loop body is a single flat expression:

// Redshifted C for: result = 2*x + sin(x)
for (int64_t i = 0; i < N; i++)
    out[i] = 2.0 * x[i] + sin(x[i]);

Making LazyAdd generic over child types

The left and right fields need to hold any lazy node type, not just concrete arrays. The solution is to parametrize LazyAdd over its child types — exactly as C++ expression templates do:

@blue.generic
def LazyAdd(DTYPE, NDIM, LEFT_T, RIGHT_T):
    A = ndarray[DTYPE, NDIM]

    @struct
    class Self:
        left:  LEFT_T
        right: RIGHT_T

        @blue.metafunc
        def __getitem__(m_self, m_i):
            def get(expr: Self, i: i64) -> DTYPE:
                return expr.left[i] + expr.right[i]
            return OpSpec(get)

        ...

    return Self

And ndarray.__add__ becomes:

@blue.metafunc
def __add__(m_self, m_other):
    LEFT_T  = m_self.static_type
    RIGHT_T = m_other.static_type
    Lazy = LazyAdd[DTYPE, NDIM, LEFT_T, RIGHT_T]

    def add(a: LEFT_T, b: RIGHT_T) -> Lazy:
        return Lazy.__make__(a, b)
    return OpSpec(add)

LEFT_T and RIGHT_T are blue constants — the static types of the operands, known at blue time. LazyAdd[f64, 1, LazyMul[...], LazySin[...]] is a fully specialized type generated at blue time, directly analogous to Add<Mul<Scalar, ArrayRef>, Sin<ArrayRef>> in C++. Blue memoization of @blue.generic ensures each unique combination is instantiated only once.
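The "specialize over child types, instantiate each combination once" idea again maps onto a memoized factory in plain Python (illustrative, not SPy syntax):

```python
from functools import lru_cache

@lru_cache(maxsize=None)                       # stands in for blue memoization
def LazyAdd(left_t, right_t):
    class Self:
        LEFT_T, RIGHT_T = left_t, right_t
        def __init__(self, left, right):
            self.left, self.right = left, right    # just two references
        def __getitem__(self, i):
            return self.left[i] + self.right[i]    # recurses into children
    Self.__name__ = f"LazyAdd_{left_t.__name__}_{right_t.__name__}"
    return Self

L1 = LazyAdd(list, list)
assert L1 is LazyAdd(list, list)   # each unique combination instantiated once
node = L1([1, 2], [10, 20])
assert node[1] == 22
```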

The full composition for result = 2*x + sin(x)

At runtime the lazy tree builds up as:

sin(x)    →  LazySin[f64,1]            { arg   = &x }
2*x       →  LazyMul[f64,1,Scalar,Arr] { left  = 2, right = &x }
... + ...  →  LazyAdd[f64,1,LazyMul,LazySin] { left = LazyMul{...},
                                               right = LazySin{...} }

Assignment to ndarray[f64, 1] triggers __convert_to__ on the outermost LazyAdd, which calls expr[i], which the redshifter resolves — at blue time — into the flat expression 2.0 * x[i] + sin(x[i]). The C compiler sees a single loop with no indirection, no virtual calls, and no temporary arrays.

Summary of the lazy node design

Every lazy node, whether created by an operator or by a free function, implements two blue interfaces:

- __getitem__: a @blue.metafunc that computes one element on demand, recursing into child nodes.
- __convert_to__: a @blue.metafunc that, on assignment to a concrete ndarray, emits the single fused loop calling expr[i].

Free functions (sin, cos, exp, …) become @blue.metafunc free functions that dispatch on the static type of their argument: scalar → eager C call; array → lazy node wrapping the input. The function library and the lazy node library are written independently and compose through the __getitem__ interface.


SIMD in the SPy/C context

SPy emits C, not LLVM IR, so SIMD is handled differently from xtensor. Three realistic approaches, in increasing complexity:

Path 1: rely on auto-vectorization

A clean fused C loop with contiguous memory access and no opaque function calls will be auto-vectorized by clang/gcc at -O2 -march=native. This covers simple arithmetic well. Transcendental functions (sin, exp) are usually not vectorized this way, because the loop body calls scalar libm, which the compiler treats as opaque.

Path 2: emit calls to a C SIMD math library

SLEEF provides vectorized math callable directly from C:

__m256 Sleef_sinf8_u10(__m256 x);   // 8-wide sin, AVX2, float32

SPy’s C backend can, when materializing a fused loop over a contiguous array, emit explicit SIMD intrinsic code with SLEEF for transcendentals. The loop structure:

int64_t i = 0;
for (; i + 8 <= N; i += 8) {
    __m256 x_vec = _mm256_load_ps(&x[i]);
    __m256 res   = Sleef_sinf8_u10(x_vec);
    _mm256_store_ps(&out[i], res);
}
for (; i < N; i++)
    out[i] = sinf(x[i]);

Path 3: OpenMP SIMD pragmas

A lighter-weight middle ground:

#pragma omp simd
for (int64_t i = 0; i < N; i++)
    out[i] = 2.0f * x[i] + sinf(x[i]);

With -fopenmp-simd and a vector math library linked, this can vectorize transcendentals without writing intrinsics by hand.

The clean SPy design

SIMD is a concern of the C backend and the stdlib, not of user code. The materialization function emitted by __convert_to__ knows, at blue time:

- the element type (DTYPE) and dimensionality (NDIM),
- the full structure of the expression tree being materialized,
- the target architecture, via the --target flag.

It can therefore decide at blue time which codepath to emit:

@blue.metafunc
def __convert_to__(m_expT, m_gotT, m_x):
    if m_expT.blueval == A:
        if DTYPE == f32 and simd_available[AVX2]:
            def materialize(expr: LazyExpr) -> A:
                # emits the AVX2 + SLEEF loop
                ...
        else:
            def materialize(expr: LazyExpr) -> A:
                # emits the plain C loop (auto-vec friendly)
                ...
        return OpSpec(materialize, [m_x])

simd_available is itself a blue query — resolved at compile time from the --target flag. No runtime CPU detection needed; one optimal codepath is emitted for the target. User-written SPy array code remains entirely free of SIMD concerns: out = 2*x + sin(x) compiles to a fused SIMD loop automatically.


Summary

| Concern | C++ (xtensor/Pythran) | SPy |
| --- | --- | --- |
| Loop fusion | Expression templates (compile-time types) | LazyExpr + __convert_to__ (blue-time values) |
| Zero temps | Proxy types, no allocation until assignment | Same, via deferred materialization |
| Element-wise interface | operator[] and load_simd() on each proxy type | __getitem__ as @blue.metafunc on each lazy node |
| Function dispatch (sin, …) | xt::sin overloaded on proxy types; returns UnaryExpr<sin_functor, Arg> | sin as @blue.metafunc; dispatches on static type; returns LazySin node for arrays |
| Composing operators + functions | Same operator[] interface on all nodes; template instantiation walks the tree | Same __getitem__ interface on all nodes; blue dispatch walks the tree at compile time |
| Generic child types | LazyAdd<Left, Right>: tree shape encoded in a compile-time type | LazyAdd[DTYPE, NDIM, LEFT_T, RIGHT_T]: tree shape encoded in a blue-time type |
| SIMD dispatch | xsimd::batch<T> type substitution | Blue-time decision in __convert_to__, emits C intrinsics or auto-vec loop |
| Vectorized math | xsimd polynomial approximations | SLEEF callable from C, or OpenMP SIMD pragma |
| Target selection | Runtime CPU dispatch (SSE2/AVX2/AVX-512) | Compile-time, from --target flag |
| User code | Identical to naïve; template magic is invisible | Identical to naïve; blue magic is invisible |