Array computing like C++: expression templates and SIMD

Why NumPy expressions are slow

For an expression like b = 2*x + 4*x**2 + np.sin(x), NumPy suffers from two compounding problems:

Memory bandwidth waste. Each sub-expression allocates a temporary array and triggers a full pass over memory. For the expression above that is roughly 14 memory passes (3 reads of x, 5 temporary writes, 5 reads of those temporaries, 1 final write). If x is large enough to overflow L2 cache, each pass is a cache-miss storm.

No cross-operation SIMD. Because each operation runs as a separate C loop, the compiler never sees the full expression and cannot vectorize across operations. Transcendental functions (sin, exp) are called one element at a time via scalar libm.
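
As a minimal sketch of the memory-traffic problem (plain NumPy, illustrative variable names): the one-liner allocates a fresh temporary per sub-expression, while the hand-rewritten version reuses two scratch buffers. The allocations drop, but each ufunc call is still a separate pass over memory; only a truly fused loop removes the passes themselves.

import numpy as np

N = 10_000_000
x = np.random.rand(N)

# Naive: every sub-expression allocates and fills a temporary array.
b = 2*x + 4*x**2 + np.sin(x)

# Hand-rewritten: reuse `out` and `tmp` as scratch buffers (fewer allocations,
# but still one separate C loop, i.e. one memory pass, per ufunc call).
out = np.sin(x)
tmp = np.square(x)
tmp *= 4
out += tmp
np.multiply(x, 2, out=tmp)
out += tmp

assert np.allclose(b, out)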

Tools that address this do so by fusing the whole expression into one loop: C++ expression-template libraries such as xtensor and Pythran's Pythonic runtime (covered next), and SPy's lazy arrays (covered in the second half of this note).


How C++ expression templates work

Expression templates are a compile-time metaprogramming pattern. The key idea: operator overloading returns a proxy type encoding the pending operation, not a computed value. Evaluation is deferred until the result is assigned to a concrete array.

// User writes:
auto b = 2*x + xt::sin(x);

// Compiler sees (simplified):
auto b = Add<Mul<Scalar<int>, ArrayRef>, Sin<ArrayRef>>(2, x, x);
// No allocation yet; b is just a lightweight expression object whose type encodes the computation.

// Assignment triggers evaluation:
xt::xarray<float> result = b;
// Compiler emits ONE fused loop:
for (size_t i = 0; i < N; ++i)
    result[i] = 2*x[i] + std::sin(x[i]);
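
The same deferral idea can be sketched in plain Python (hypothetical classes, not part of xtensor or Pythran): operator overloading builds a description of the work, and nothing is computed until the tree is walked. The crucial difference is that this Python version walks the tree at runtime, whereas in C++ the tree lives in the type system and the walk is resolved at compile time; SPy's blue-time kernel composition, described later, recovers exactly that property.

import math

class Expr:
    def __add__(self, other):   return Node("add", self, other)
    def __rmul__(self, scalar): return Node("mul", Const(scalar), self)

class Const(Expr):
    def __init__(self, value): self.value = value

class ArrayRef(Expr):
    def __init__(self, data): self.data = data

class Node(Expr):
    def __init__(self, op, *children): self.op, self.children = op, children

def sin(e): return Node("sin", e)

def evaluate(e, i):
    # Interpret the expression tree for element i (a runtime walk; in C++ the
    # walk is resolved at compile time because the tree is a type).
    if isinstance(e, ArrayRef):
        return e.data[i]
    if isinstance(e, Const):
        return e.value
    a = [evaluate(c, i) for c in e.children]
    if e.op == "add":
        return a[0] + a[1]
    if e.op == "mul":
        return a[0] * a[1]
    return math.sin(a[0])               # e.op == "sin"

x = ArrayRef([0.0, 0.5, 1.0])
b = 2*x + sin(x)                        # builds a tree; no loop has run, no result array exists
result = [evaluate(b, i) for i in range(3)]   # ONE pass over the data, no temporaries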

Properties of this approach: no temporaries are allocated, the whole expression is visible to the compiler as a single loop body (so it can optimize and vectorize across operations), and everything is resolved at compile time, with no runtime dispatch.

Pythran’s Pythonic library implements essentially all of NumPy’s API as C++ expression templates, so transpiled Python code that looks like NumPy compiles to a fused loop.
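
As a rough sketch of the Pythran workflow (the function name and export signature here are illustrative): the export comment declares the accepted argument types, pythran translates the NumPy-looking body into Pythonic expression templates, and the C++ compiler then emits one fused loop.

# compute.py
#pythran export compute(float64[])
import numpy as np

def compute(x):
    return 2*x + 4*x**2 + np.sin(x)

# Build the native extension with:  pythran compute.py
# then `import compute` from Python as usual.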


How SIMD fits in

Expression templates and SIMD are orthogonal concerns that compose cleanly. Expression templates fix the loop structure (one fused loop, no temps). SIMD fixes the loop body (process 8–16 elements per iteration instead of 1).

The standard pattern in xsimd/xtensor:

constexpr size_t W = xsimd::batch<float>::size; // e.g. 8 on AVX2

size_t i = 0;
for (; i + W <= N; i += W) {
    auto x_vec = xsimd::load_aligned(&x[i]);       // load 8 floats
    auto res   = 2.0f * x_vec + xsimd::sin(x_vec); // 8-wide fused ops
    xsimd::store_aligned(&out[i], res);
}
for (; i < N; ++i)                                 // scalar tail
    out[i] = 2.0f * x[i] + std::sin(x[i]);

xsimd::batch<float> overloads all arithmetic operators and math functions, so the fused loop body is written once and works for both the scalar and the SIMD code path simply by substituting the element type. Vectorized math (xsimd::sin, xsimd::exp) is implemented as hand-tuned polynomial approximations: computing 8 sin values costs only about 1.5x as much as computing a single scalar one.
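
The "write the body once, substitute the element type" trick has a direct analogue familiar from NumPy itself: the same kernel works on a scalar and on a whole array because the operators and math functions are overloaded, which is what xsimd::batch<float> does for the C++ loop body (illustrative sketch).

import numpy as np

def kernel(x):
    # Written once; `x` may be a Python float or a NumPy array.
    return 2.0 * x + np.sin(x)

print(kernel(0.5))                         # scalar element type
print(kernel(np.array([0.0, 0.5, 1.0])))   # "wide" element type, vectorized under the hood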


SPy’s architecture (from the blog post)

SPy (github.com/spylang/spy) is a statically-typed Python variant that is both interpreted and compiled. Its central mechanism is a blue/red evaluation model: blue code runs at compile time and can compute types, constants, and specialized functions, while red code is ordinary code that survives into the compiled program.

Key properties:

The spy-opspec.md document details the dispatch machinery:

The interpreter (ASTFrame) and redshifter (DopplerFrame) share the same dispatch and type-resolution logic. DopplerFrame only overrides what happens with the result: instead of executing the resolved W_OpImpl, it emits an ast.Call node pointing to the concrete function.


The myarray.spy example

from unsafe import gc_alloc, gc_ptr
from operator import OpSpec
from types import StructType

@struct
class ArrayData[DTYPE, NDIM]:
    shape: gc_ptr[i32]
    items: gc_ptr[DTYPE]

@blue.generic
def ndarray(DTYPE, NDIM):

    if NDIM > 2:
        raise StaticError("NDIM>2 not supported")

    ArrayData_c = ArrayData[DTYPE, NDIM]

    @struct
    class Self:
        __ll__: gc_ptr[ArrayData_c]

        @blue.metafunc
        def __new__(*args_m):

            if len(args_m) != NDIM + 1:
                raise StaticError("Wrong number of arguments")

            def new(type_: StructType, *args: Args[NDIM, i32]) -> Self:
                ll = gc_alloc[ArrayData_c](1)
                ll.shape = gc_alloc[i32](NDIM)
                size = 1
                for i, n in unroll(enumerate(args)):
                    ll.shape[i] = n
                    size = size * n
                ll.items = gc_alloc[DTYPE](size)   # allocate the element buffer (cf. generated C below)
                return Self.__make__(ll)

            return OpSpec(new)

    return Self

def main() -> None:
    arr1 = ndarray[f64, 1](4)
    arr2 = ndarray[f64, 2](4, 4)

What happens at blue time

ndarray is a @blue.generic function. Calling ndarray[f64, 1] runs the entire function body at compile time and returns a freshly constructed @struct type. ndarray[f64, 1] and ndarray[f64, 2] are two distinct types, each specialized for their DTYPE and NDIM. This is the SPy equivalent of a C++ class template.
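
A rough plain-Python analogy of this (hypothetical code, no SPy semantics): a memoized factory that manufactures one specialized class per (DTYPE, NDIM) pair. The obvious difference is that Python builds the class at runtime, while @blue.generic runs at compile time and the resulting type has a fixed C layout.

from functools import lru_cache
import numpy as np

@lru_cache(maxsize=None)       # memoization: each (dtype, ndim) combination is built once
def make_ndarray(dtype, ndim):
    if ndim > 2:
        raise TypeError("NDIM>2 not supported")   # analogue of StaticError at "blue time"

    class Specialized:
        DTYPE, NDIM = dtype, ndim

        def __init__(self, *shape):
            assert len(shape) == ndim, "wrong number of arguments"
            self.shape = shape
            self.items = np.empty(shape, dtype=dtype)

    Specialized.__name__ = f"ndarray_{np.dtype(dtype).name}_{ndim}"
    return Specialized

arr1 = make_ndarray(np.float64, 1)(4)      # ~ ndarray[f64, 1](4)
arr2 = make_ndarray(np.float64, 2)(4, 4)   # ~ ndarray[f64, 2](4, 4), a distinct class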

ArrayData is the low-level backing store: a shape pointer and a data pointer, with a fixed C-compatible layout. __ll__ is the SPy convention for the underlying raw representation field.

The __new__ metafunction

__new__ is a @blue.metafunc: it receives MetaArg objects describing the call-site arguments and returns an OpSpec selecting the concrete constructor.

The generated C for ndarray[f64, 1]'s constructor looks roughly like:

ndarray_f64_1 spy_ndarray_f64_1__new__(int32_t n0) {
    ArrayData_f64_1* ll = gc_alloc(sizeof(ArrayData_f64_1));
    ll->shape = gc_alloc(1 * sizeof(int32_t));
    ll->shape[0] = n0;   // statically unrolled
    ll->items = gc_alloc(n0 * sizeof(double));
    return (ndarray_f64_1){ .ll = ll };
}

No virtual dispatch, no Python object overhead, no dynamic type checks.


Expression fusion in SPy

SPy’s @blue.metafunc / OpSpec machinery makes it possible to implement expression templates in user-space SPy code, without any compiler magic. A working implementation is in lib_ndarray.spy in the demo repository.

Naïve (eager) operator

@blue.metafunc
def __add__(m_self, m_other):
    def add(a: Self, b: Self) -> Self:
        out = Self(a.size())
        for i in range(a.size()):
            out[i] = a[i] + b[i]
        return out
    return OpSpec(add)

This works but allocates a temporary array for each +. Same problem as NumPy.

Lazy operator: building the expression tree

The key idea: __add__ returns a lazy expression node instead of triggering computation. The loop is only materialized when the result is assigned to a concrete ndarray, via __convert_to__. The lazy node is parametrized over its child types so that any combination of ndarray, LazyAdd, or LazySin can appear as operands:

@blue.generic
def LazyAdd(DTYPE, NDIM, LEFT_T, RIGHT_T):
    A = ndarray[DTYPE, NDIM]

    @struct
    class Self:
        left:  LEFT_T
        right: RIGHT_T
        ...
    return Self

ndarray.__add__ then builds the lazy node at essentially zero cost, just storing two references:

@blue.metafunc
def __add__(m_self, m_other):
    LEFT_T  = m_self.static_type
    RIGHT_T = m_other.static_type
    Lazy = LazyAdd[DTYPE, NDIM, LEFT_T, RIGHT_T]

    def add(a: LEFT_T, b: RIGHT_T) -> Lazy:
        return Lazy.__make__(a, b)   # no allocation, no computation
    return OpSpec(add)

LEFT_T and RIGHT_T are blue constants — the static types of the operands, known at blue time. LazyAdd[f64, 1, ndarray[f64,1], LazySin[f64,1,...]] is a fully specialized type generated at blue time, directly analogous to Add<ArrayRef, Sin<ArrayRef>> in C++. Blue memoization of @blue.generic ensures each unique combination is instantiated only once.

The make_kernel pattern: composing the fused loop body

The central question is: how does LazyAdd.__convert_to__ emit a loop body that correctly evaluates the full expression tree, including arbitrary child types like LazySin?

The working solution (from the demo) is a make_kernel blue classmethod on every lazy node. Called at blue time, it recursively composes concrete red functions — one per node — into a single kernel function that computes one element of the output:

# ndarray — leaf node: kernel just reads items[i]
@blue
@classmethod
def make_kernel(cls):
    def kernel(self: Self, i: i32) -> DTYPE:
        return self.__ll__.items[i]
    return kernel

# LazyAdd — recursively composes the two child kernels
@blue
@classmethod
def make_kernel(cls):
    left_kernel  = LEFT_T.make_kernel()
    right_kernel = RIGHT_T.make_kernel()

    def kernel(expr: Self, i: i32) -> DTYPE:
        return left_kernel(expr.left, i) + right_kernel(expr.right, i)
    return kernel

# LazySin — wraps its argument's kernel in math.sin
@blue
@classmethod
def make_kernel(cls):
    arg_kernel = ARG_TYPE.make_kernel()

    def kernel(expr: Self, i: i32) -> DTYPE:
        return math.sin(arg_kernel(expr.arg, i))
    return kernel

make_kernel runs entirely at blue time. By the time __convert_to__ emits the materialization loop, kernel is a fully resolved, concrete red function. The loop body is a direct call to it — no dynamic dispatch:

@blue.metafunc
def __convert_to__(m_expT, m_gotT, m_x):
    if m_expT.blueval == A:
        kernel = Self.make_kernel()   # blue-time recursion

        def materialize(expr: Self) -> A:
            n = expr.left.size()
            out = A(n)
            for i in range(n):
                out[i] = kernel(expr, i)   # direct call, known at blue time
            return out
        return OpSpec(materialize, [m_x])
    return OpSpec.NULL
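
Stripped of the blue/red machinery, make_kernel is ordinary recursive closure composition. A plain-Python sketch of the same pattern (hypothetical classes, no SPy semantics; here the composition happens over the instance tree at runtime, whereas SPy composes over the static child types at blue time):

import math
import numpy as np

class Leaf:                               # analogue of ndarray: kernel reads items[i]
    def __init__(self, items): self.items = items
    def make_kernel(self):
        return lambda node, i: node.items[i]

class LazyAdd:                            # composes the two child kernels
    def __init__(self, left, right): self.left, self.right = left, right
    def make_kernel(self):
        lk, rk = self.left.make_kernel(), self.right.make_kernel()
        return lambda node, i: lk(node.left, i) + rk(node.right, i)

class LazySin:                            # wraps its argument's kernel in math.sin
    def __init__(self, arg): self.arg = arg
    def make_kernel(self):
        ak = self.arg.make_kernel()
        return lambda node, i: math.sin(ak(node.arg, i))

def materialize(expr, n):                 # the single fused loop
    kernel = expr.make_kernel()           # composed once, before the loop
    out = np.empty(n)
    for i in range(n):
        out[i] = kernel(expr, i)
    return out

x = Leaf(np.linspace(0.0, 1.0, 8))
expr = LazyAdd(x, LazySin(x))             # x + sin(x): two references, no temporaries
assert np.allclose(materialize(expr, 8), x.items + np.sin(x.items))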

Extending to free functions: sin

Free functions like sin are themselves @blue.metafuncs that dispatch on the static type of their argument: for an ndarray argument they return a LazySin node (no computation); for a scalar they call math.sin immediately.

In the demo, DTYPE and NDIM are read from the type object directly using blue attribute access (T._DTYPE, T._NDIM), which are stored as blue class variables on the @struct:

@blue.metafunc
def sin(m_x):
    T = m_x.static_type

    if T == f64:
        def sin_scalar(x: f64) -> f64:
            return math.sin(x)
        return OpSpec(sin_scalar)

    # Read DTYPE and NDIM from the type object (stored as blue class vars)
    DTYPE = T._DTYPE
    NDIM  = T._NDIM
    A     = ndarray[DTYPE, NDIM]
    Lazy  = LazySin[DTYPE, NDIM, A]

    def sin_lazy(x: A) -> Lazy:
        return Lazy.__make__(x)   # zero-cost, just stores a reference
    return OpSpec(sin_lazy)

Note that LazySin takes ARG_TYPE as an explicit type parameter (not just DTYPE, NDIM) so that it can call ARG_TYPE.make_kernel() to recurse into the argument’s own kernel. This means sin(sin(x)) works correctly: LazySin[f64, 1, LazySin[f64, 1, ndarray[f64,1]]] composes two kernel functions.

The full composition for x + sin(x)

At runtime the lazy tree builds up as:

sin(x)   →  LazySin[f64, 1, ndarray[f64,1]]  { arg  = &x }
x + ...  →  LazyAdd[f64, 1, ndarray[f64,1], LazySin[...]]
                                              { left = &x, right = LazySin{...} }

At blue time, make_kernel is called recursively:

LazyAdd.make_kernel()
  → left_kernel  = ndarray.make_kernel()      # reads items[i]
  → right_kernel = LazySin.make_kernel()
      → arg_kernel = ndarray.make_kernel()    # reads items[i]
      → kernel(expr, i) = math.sin(arg_kernel(expr.arg, i))
  → kernel(expr, i) = left_kernel(expr.left, i) + right_kernel(expr.right, i)

The materialize loop then calls this composed kernel once per element. The redshifted AST contains one loop with direct calls to the composed kernel chain — no dynamic dispatch, no temporary arrays.

Redshift output and the @force_inline issue

The redshift output currently produces a loop body that calls the composed kernel functions as separate named functions:

# spy redshift output (simplified) for: out = x + sin(x)
for i in range(n):
    out[i] = `LazyAdd[...]::make_kernel::kernel`(expr, i)

where kernel in turn calls ndarray::make_kernel::kernel and LazySin::make_kernel::kernel. The loop structure is correct — one loop, no temporaries — but the kernel calls are not inlined into the loop body during redshifting. This makes the redshift output harder to read and reason about compared to the ideal flat form:

# ideal (inlined) redshift output
for i in range(n):
    out[i] = x.__ll__.items[i] + math.sin(x.__ll__.items[i])

This is tracked as SPy issue #464, which proposes a @force_inline decorator. The idea is to decorate the small one-statement kernel functions with @force_inline, which would instruct the redshifter to substitute the function body at call sites rather than emitting a call node. In lib_ndarray.spy the candidate sites are already marked with # @force_inline comments.


SIMD in the SPy/C context

SPy emits C, not LLVM IR, so SIMD is handled differently from xtensor. Three realistic approaches, in increasing complexity:

Path 1: rely on auto-vectorization

A clean fused C loop with contiguous memory access and no opaque function calls is typically auto-vectorized by clang and gcc at -O2/-O3 with -march=native. This covers simple arithmetic well. Transcendental functions (sin, exp) are usually not vectorized this way, because the compiler falls back to calls into scalar libm.

Path 2: emit calls to a C SIMD math library

SLEEF provides vectorized math callable directly from C:

__m256 Sleef_sinf8_u10(__m256 x);   // 8-wide sin, AVX2, float32

SPy’s C backend can, when materializing a fused loop over a contiguous array, emit explicit SIMD intrinsic code with SLEEF for transcendentals. The loop structure:

int64_t i = 0;
for (; i + 8 <= N; i += 8) {
    __m256 x_vec = _mm256_load_ps(&x[i]);
    __m256 res   = Sleef_sinf8_u10(x_vec);
    _mm256_store_ps(&out[i], res);
}
for (; i < N; i++)
    out[i] = sinf(x[i]);

Path 3: OpenMP SIMD pragmas

A lighter-weight middle ground:

#pragma omp simd
for (int64_t i = 0; i < N; i++)
    out[i] = 2.0f * x[i] + sinf(x[i]);

With -fopenmp-simd and a vector math library linked, this can vectorize transcendentals without writing intrinsics by hand.

The clean SPy design

SIMD is a concern of the C backend and the stdlib, not of user code. The materialization function emitted by __convert_to__ knows, at blue time, the element type (DTYPE), the dimensionality and contiguity of the output, and the compilation target (and hence which SIMD instruction sets are available).

It can therefore decide at blue time which code path to emit:

@blue.metafunc
def __convert_to__(m_expT, m_gotT, m_x):
    if m_expT.blueval == A:
        if DTYPE == f32 and simd_available[AVX2]:
            def materialize(expr: LazyExpr) -> A:
                # emits the AVX2 + SLEEF loop
                ...
        else:
            def materialize(expr: LazyExpr) -> A:
                # emits the plain C loop (auto-vec friendly)
                ...
        return OpSpec(materialize, [m_x])

simd_available is itself a blue query — resolved at compile time from the --target flag. No runtime CPU detection needed; one optimal codepath is emitted for the target. User-written SPy array code remains entirely free of SIMD concerns: out = 2*x + sin(x) compiles to a fused SIMD loop automatically.


Summary

Concern | C++ (xtensor/Pythran) | SPy
--- | --- | ---
Loop fusion | Expression templates (compile-time types) | LazyExpr + __convert_to__ (blue-time values)
Zero temps | Proxy types, no allocation until assignment | Same, via deferred materialization
Element-wise composition | operator[] on each proxy type; template instantiation recurses | make_kernel blue classmethod; recursively composes concrete red functions at blue time
Function dispatch (sin, …) | xt::sin overloaded on proxy types; returns UnaryExpr<sin_functor, Arg> | sin as @blue.metafunc; dispatches on static type via T._DTYPE/T._NDIM; returns LazySin node
Generic child types | LazyAdd<Left, Right> — tree shape in compile-time type | LazyAdd[DTYPE, NDIM, LEFT_T, RIGHT_T] — tree shape in blue-time type
Redshift output readability | C++ inlines operator[] everywhere; flat expression visible in IR | Kernel calls not yet inlined during redshift; see @force_inline proposal (issue #464)
Inlining guarantee | C++ compiler inlines aggressively at -O2 | C compiler inlines at -O2/--release; @force_inline would make it explicit at redshift time
SIMD dispatch | xsimd::batch<T> type substitution | Blue-time decision in __convert_to__, emits C intrinsics or auto-vec loop
Vectorized math | xsimd polynomial approximations | SLEEF callable from C, or OpenMP SIMD pragma
Target selection | Runtime CPU dispatch (SSE2/AVX2/AVX-512) | Compile-time, from --target flag
User code | Identical to naïve — template magic is invisible | Identical to naïve — blue magic is invisible