Array computing like C++: expression templates and SIMD

Why NumPy expressions are slow

For an expression like b = 2*x + 4*x**2 + np.sin(x), NumPy suffers from two compounding problems:

Memory bandwidth waste. Each sub-expression allocates a temporary array and triggers a full pass over memory. For the expression above that is roughly 14 memory passes (3 reads of x, 5 temporary writes, 5 reads of those temporaries, 1 final write). If x is large enough to overflow L2 cache, each pass is a cache-miss storm.

No cross-operation SIMD. Because each operation runs as a separate C loop, the compiler never sees the full expression and cannot vectorize across operations. Transcendental functions (sin, exp) are called one element at a time via scalar libm.
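
As a minimal sketch of the memory-traffic problem (plain NumPy, illustrative variable names): the one-liner allocates a fresh temporary per sub-expression, while the hand-rewritten version reuses two scratch buffers. The allocations drop, but each ufunc call is still a separate pass over memory; only a truly fused loop removes the passes themselves.

import numpy as np

N = 10_000_000
x = np.random.rand(N)

# Naive: every sub-expression allocates and fills a temporary array.
b = 2*x + 4*x**2 + np.sin(x)

# Hand-rewritten: reuse `out` and `tmp` as scratch buffers (fewer allocations,
# but still one separate C loop, i.e. one memory pass, per ufunc call).
out = np.sin(x)
tmp = np.square(x)
tmp *= 4
out += tmp
np.multiply(x, 2, out=tmp)
out += tmp

assert np.allclose(b, out)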

Tools that address this do so by fusing the whole expression into one loop: C++ expression-template libraries such as xtensor and Pythran's Pythonic runtime (covered next), and SPy's lazy arrays (covered in the second half of this note).


How C++ expression templates work

Expression templates are a compile-time metaprogramming pattern. The key idea: operator overloading returns a proxy type encoding the pending operation, not a computed value. Evaluation is deferred until the result is assigned to a concrete array.

// User writes:
auto b = 2*x + xt::sin(x);

// Compiler sees (simplified):
auto b = Add<Mul<Scalar<int>, ArrayRef>, Sin<ArrayRef>>(2, x, x);
// No allocation yet; b is just a lightweight expression object whose type encodes the computation.

// Assignment triggers evaluation:
xt::xarray<float> result = b;
// Compiler emits ONE fused loop:
for (size_t i = 0; i < N; ++i)
    result[i] = 2*x[i] + std::sin(x[i]);
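
The same deferral idea can be sketched in plain Python (hypothetical classes, not part of xtensor or Pythran): operator overloading builds a description of the work, and nothing is computed until the tree is walked. The crucial difference is that this Python version walks the tree at runtime, whereas in C++ the tree lives in the type system and the walk is resolved at compile time; SPy's blue-time kernel composition, described later, recovers exactly that property.

import math

class Expr:
    def __add__(self, other):   return Node("add", self, other)
    def __rmul__(self, scalar): return Node("mul", Const(scalar), self)

class Const(Expr):
    def __init__(self, value): self.value = value

class ArrayRef(Expr):
    def __init__(self, data): self.data = data

class Node(Expr):
    def __init__(self, op, *children): self.op, self.children = op, children

def sin(e): return Node("sin", e)

def evaluate(e, i):
    # Interpret the expression tree for element i (a runtime walk; in C++ the
    # walk is resolved at compile time because the tree is a type).
    if isinstance(e, ArrayRef):
        return e.data[i]
    if isinstance(e, Const):
        return e.value
    a = [evaluate(c, i) for c in e.children]
    if e.op == "add":
        return a[0] + a[1]
    if e.op == "mul":
        return a[0] * a[1]
    return math.sin(a[0])               # e.op == "sin"

x = ArrayRef([0.0, 0.5, 1.0])
b = 2*x + sin(x)                        # builds a tree; no loop has run, no result array exists
result = [evaluate(b, i) for i in range(3)]   # ONE pass over the data, no temporaries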

Properties of this approach: no temporaries are allocated, the whole expression is visible to the compiler as a single loop body (so it can optimize and vectorize across operations), and everything is resolved at compile time, with no runtime dispatch.

Pythran’s Pythonic library implements essentially all of NumPy’s API as C++ expression templates, so transpiled Python code that looks like NumPy compiles to a fused loop.
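
As a rough sketch of the Pythran workflow (the function name and export signature here are illustrative): the export comment declares the accepted argument types, pythran translates the NumPy-looking body into Pythonic expression templates, and the C++ compiler then emits one fused loop.

# compute.py
#pythran export compute(float64[])
import numpy as np

def compute(x):
    return 2*x + 4*x**2 + np.sin(x)

# Build the native extension with:  pythran compute.py
# then `import compute` from Python as usual.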


How SIMD fits in

Expression templates and SIMD are orthogonal concerns that compose cleanly. Expression templates fix the loop structure (one fused loop, no temps). SIMD fixes the loop body (process 8–16 elements per iteration instead of 1).

The standard pattern in xsimd/xtensor:

constexpr size_t W = xsimd::batch<float>::size; // e.g. 8 on AVX2

size_t i = 0;
for (; i + W <= N; i += W) {
    auto x_vec = xsimd::load_aligned(&x[i]);       // load 8 floats
    auto res   = 2.0f * x_vec + xsimd::sin(x_vec); // 8-wide fused ops
    xsimd::store_aligned(&out[i], res);
}
for (; i < N; ++i)                                 // scalar tail
    out[i] = 2.0f * x[i] + std::sin(x[i]);

xsimd::batch<float> overloads all arithmetic operators and math functions, so the fused loop body is written once and works for both the scalar and the SIMD code path simply by substituting the element type. Vectorized math (xsimd::sin, xsimd::exp) is implemented as hand-tuned polynomial approximations: computing 8 sin values costs only about 1.5x as much as computing a single scalar one.
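
The "write the body once, substitute the element type" trick has a direct analogue familiar from NumPy itself: the same kernel works on a scalar and on a whole array because the operators and math functions are overloaded, which is what xsimd::batch<float> does for the C++ loop body (illustrative sketch).

import numpy as np

def kernel(x):
    # Written once; `x` may be a Python float or a NumPy array.
    return 2.0 * x + np.sin(x)

print(kernel(0.5))                         # scalar element type
print(kernel(np.array([0.0, 0.5, 1.0])))   # "wide" element type, vectorized under the hood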


SPy’s architecture (from the blog post)

SPy (github.com/spylang/spy) is a statically-typed Python variant that is both interpreted and compiled. Its central mechanism is a blue/red evaluation model: blue code runs at compile time and can compute types, constants, and specialized functions, while red code is ordinary code that survives into the compiled program.

Key properties:

The spy-opspec.md document details the dispatch machinery:

The interpreter (ASTFrame) and redshifter (DopplerFrame) share the same dispatch and type-resolution logic. DopplerFrame only overrides what happens with the result: instead of executing the resolved W_OpImpl, it emits an ast.Call node pointing to the concrete function.


The myarray.spy example

from unsafe import gc_alloc, gc_ptr
from operator import OpSpec
from types import StructType

@struct
class ArrayData[DTYPE, NDIM]:
    shape: gc_ptr[i32]
    items: gc_ptr[DTYPE]

@blue.generic
def ndarray(DTYPE, NDIM):

    if NDIM > 2:
        raise StaticError("NDIM>2 not supported")

    ArrayData_c = ArrayData[DTYPE, NDIM]

    @struct
    class Self:
        __ll__: gc_ptr[ArrayData_c]

        @blue.metafunc
        def __new__(*args_m):

            if len(args_m) != NDIM + 1:
                raise StaticError("Wrong number of arguments")

            def new(type_: StructType, *args: Args[NDIM, i32]) -> Self:
                ll = gc_alloc[ArrayData_c](1)
                ll.shape = gc_alloc[i32](NDIM)
                size = 1
                for i, n in unroll(enumerate(args)):
                    ll.shape[i] = n
                    size = size * n
                ll.items = gc_alloc[DTYPE](size)   # allocate the element buffer (cf. generated C below)
                return Self.__make__(ll)

            return OpSpec(new)

    return Self

def main() -> None:
    arr1 = ndarray[f64, 1](4)
    arr2 = ndarray[f64, 2](4, 4)

What happens at blue time

ndarray is a @blue.generic function. Calling ndarray[f64, 1] runs the entire function body at compile time and returns a freshly constructed @struct type. ndarray[f64, 1] and ndarray[f64, 2] are two distinct types, each specialized for their DTYPE and NDIM. This is the SPy equivalent of a C++ class template.
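
A rough plain-Python analogy of this (hypothetical code, no SPy semantics): a memoized factory that manufactures one specialized class per (DTYPE, NDIM) pair. The obvious difference is that Python builds the class at runtime, while @blue.generic runs at compile time and the resulting type has a fixed C layout.

from functools import lru_cache
import numpy as np

@lru_cache(maxsize=None)       # memoization: each (dtype, ndim) combination is built once
def make_ndarray(dtype, ndim):
    if ndim > 2:
        raise TypeError("NDIM>2 not supported")   # analogue of StaticError at "blue time"

    class Specialized:
        DTYPE, NDIM = dtype, ndim

        def __init__(self, *shape):
            assert len(shape) == ndim, "wrong number of arguments"
            self.shape = shape
            self.items = np.empty(shape, dtype=dtype)

    Specialized.__name__ = f"ndarray_{np.dtype(dtype).name}_{ndim}"
    return Specialized

arr1 = make_ndarray(np.float64, 1)(4)      # ~ ndarray[f64, 1](4)
arr2 = make_ndarray(np.float64, 2)(4, 4)   # ~ ndarray[f64, 2](4, 4), a distinct class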

ArrayData is the low-level backing store: a shape pointer and a data pointer, with a fixed C-compatible layout. __ll__ is the SPy convention for the underlying raw representation field.

The __new__ metafunction

__new__ is a @blue.metafunc: it receives MetaArg objects describing the call-site arguments and returns an OpSpec selecting the concrete constructor.

The generated C for ndarray[f64, 1]'s constructor looks roughly like:

ndarray_f64_1 spy_ndarray_f64_1__new__(int32_t n0) {
    ArrayData_f64_1* ll = gc_alloc(sizeof(ArrayData_f64_1));
    ll->shape = gc_alloc(1 * sizeof(int32_t));
    ll->shape[0] = n0;   // statically unrolled
    ll->items = gc_alloc(n0 * sizeof(double));
    return (ndarray_f64_1){ .ll = ll };
}

No virtual dispatch, no Python object overhead, no dynamic type checks.


Expression fusion in SPy

SPy’s @blue.metafunc / OpSpec machinery makes it possible to implement expression templates in user-space SPy code, without any compiler magic. A working implementation is in lib_ndarray.spy in the demo repository.

Naïve (eager) operator

@blue.metafunc
def __add__(m_self, m_other):
    def add(a: Self, b: Self) -> Self:
        out = Self(a.size())
        for i in range(a.size()):
            out[i] = a[i] + b[i]
        return out
    return OpSpec(add)

This works but allocates a temporary array for each +. Same problem as NumPy.

Lazy operator: building the expression tree

The key idea: __add__ returns a lazy expression node instead of triggering computation. The loop is only materialized when the result is assigned to a concrete ndarray, via __convert_to__. The lazy node is parametrized over its child types so that any combination of ndarray, LazyAdd, or LazySin can appear as operands:

@blue.generic
def LazyAdd(DTYPE, NDIM, LEFT_T, RIGHT_T):
    A = ndarray[DTYPE, NDIM]

    @struct
    class Self:
        left:  LEFT_T
        right: RIGHT_T
        ...
    return Self

ndarray.__add__ then builds the lazy node at essentially zero cost, just storing two references:

@blue.metafunc
def __add__(m_self, m_other):
    LEFT_T  = m_self.static_type
    RIGHT_T = m_other.static_type
    Lazy = LazyAdd[DTYPE, NDIM, LEFT_T, RIGHT_T]

    def add(a: LEFT_T, b: RIGHT_T) -> Lazy:
        return Lazy.__make__(a, b)   # no allocation, no computation
    return OpSpec(add)

LEFT_T and RIGHT_T are blue constants — the static types of the operands, known at blue time. LazyAdd[f64, 1, ndarray[f64,1], LazySin[f64,1,...]] is a fully specialized type generated at blue time, directly analogous to Add<ArrayRef, Sin<ArrayRef>> in C++. Blue memoization of @blue.generic ensures each unique combination is instantiated only once.

The make_kernel pattern: composing the fused loop body

The central question is: how does LazyAdd.__convert_to__ emit a loop body that correctly evaluates the full expression tree, including arbitrary child types like LazySin?

The working solution (from the demo) is a make_kernel blue classmethod on every lazy node. Called at blue time, it recursively composes concrete red functions — one per node — into a single kernel function that computes one element of the output:

# ndarray — leaf node: kernel just reads items[i]
@blue
@classmethod
def make_kernel(cls):
    def kernel(self: Self, i: i32) -> DTYPE:
        return self.__ll__.items[i]
    return kernel

# LazyAdd — recursively composes the two child kernels
@blue
@classmethod
def make_kernel(cls):
    left_kernel  = LEFT_T.make_kernel()
    right_kernel = RIGHT_T.make_kernel()

    def kernel(expr: Self, i: i32) -> DTYPE:
        return left_kernel(expr.left, i) + right_kernel(expr.right, i)
    return kernel

# LazySin — wraps its argument's kernel in math.sin
@blue
@classmethod
def make_kernel(cls):
    arg_kernel = ARG_TYPE.make_kernel()

    def kernel(expr: Self, i: i32) -> DTYPE:
        return math.sin(arg_kernel(expr.arg, i))
    return kernel

make_kernel runs entirely at blue time. By the time __convert_to__ emits the materialization loop, kernel is a fully resolved, concrete red function. The loop body is a direct call to it — no dynamic dispatch:

@blue.metafunc
def __convert_to__(m_expT, m_gotT, m_x):
    if m_expT.blueval == A:
        kernel = Self.make_kernel()   # blue-time recursion

        def materialize(expr: Self) -> A:
            n = expr.left.size()
            out = A(n)
            for i in range(n):
                out[i] = kernel(expr, i)   # direct call, known at blue time
            return out
        return OpSpec(materialize, [m_x])
    return OpSpec.NULL
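
Stripped of the blue/red machinery, make_kernel is ordinary recursive closure composition. A plain-Python sketch of the same pattern (hypothetical classes, no SPy semantics; here the composition happens over the instance tree at runtime, whereas SPy composes over the static child types at blue time):

import math
import numpy as np

class Leaf:                               # analogue of ndarray: kernel reads items[i]
    def __init__(self, items): self.items = items
    def make_kernel(self):
        return lambda node, i: node.items[i]

class LazyAdd:                            # composes the two child kernels
    def __init__(self, left, right): self.left, self.right = left, right
    def make_kernel(self):
        lk, rk = self.left.make_kernel(), self.right.make_kernel()
        return lambda node, i: lk(node.left, i) + rk(node.right, i)

class LazySin:                            # wraps its argument's kernel in math.sin
    def __init__(self, arg): self.arg = arg
    def make_kernel(self):
        ak = self.arg.make_kernel()
        return lambda node, i: math.sin(ak(node.arg, i))

def materialize(expr, n):                 # the single fused loop
    kernel = expr.make_kernel()           # composed once, before the loop
    out = np.empty(n)
    for i in range(n):
        out[i] = kernel(expr, i)
    return out

x = Leaf(np.linspace(0.0, 1.0, 8))
expr = LazyAdd(x, LazySin(x))             # x + sin(x): two references, no temporaries
assert np.allclose(materialize(expr, 8), x.items + np.sin(x.items))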

Extending to free functions: sin

Free functions like sin are themselves @blue.metafuncs that dispatch on the static type of their argument: for an ndarray argument they return a LazySin node (no computation); for a scalar they call math.sin immediately.

In the demo, DTYPE and NDIM are read from the type object directly using blue attribute access (T._DTYPE, T._NDIM), which are stored as blue class variables on the @struct:

@blue.metafunc
def sin(m_x):
    T = m_x.static_type

    if T == f64:
        def sin_scalar(x: f64) -> f64:
            return math.sin(x)
        return OpSpec(sin_scalar)

    # Read DTYPE and NDIM from the type object (stored as blue class vars)
    DTYPE = T._DTYPE
    NDIM  = T._NDIM
    A     = ndarray[DTYPE, NDIM]
    Lazy  = LazySin[DTYPE, NDIM, A]

    def sin_lazy(x: A) -> Lazy:
        return Lazy.__make__(x)   # zero-cost, just stores a reference
    return OpSpec(sin_lazy)

Note that LazySin takes ARG_TYPE as an explicit type parameter (not just DTYPE, NDIM) so that it can call ARG_TYPE.make_kernel() to recurse into the argument’s own kernel. This means sin(sin(x)) works correctly: LazySin[f64, 1, LazySin[f64, 1, ndarray[f64,1]]] composes two kernel functions.

The full composition for x + sin(x)

At runtime the lazy tree builds up as:

sin(x)   →  LazySin[f64, 1, ndarray[f64,1]]  { arg  = &x }
x + ...  →  LazyAdd[f64, 1, ndarray[f64,1], LazySin[...]]
                                              { left = &x, right = LazySin{...} }

At blue time, make_kernel is called recursively:

LazyAdd.make_kernel()
  → left_kernel  = ndarray.make_kernel()      # reads items[i]
  → right_kernel = LazySin.make_kernel()
      → arg_kernel = ndarray.make_kernel()    # reads items[i]
      → kernel(expr, i) = math.sin(arg_kernel(expr.arg, i))
  → kernel(expr, i) = left_kernel(expr.left, i) + right_kernel(expr.right, i)

The materialize loop then calls this composed kernel once per element. The redshifted AST contains one loop with direct calls to the composed kernel chain — no dynamic dispatch, no temporary arrays.

Redshift output and the @force_inline issue

The redshift output currently produces a loop body that calls the composed kernel functions as separate named functions:

# spy redshift output (simplified) for: out = x + sin(x)
for i in range(n):
    out[i] = `LazyAdd[...]::make_kernel::kernel`(expr, i)

where kernel in turn calls ndarray::make_kernel::kernel and LazySin::make_kernel::kernel. The loop structure is correct — one loop, no temporaries — but the kernel calls are not inlined into the loop body during redshifting. This makes the redshift output harder to read and reason about compared to the ideal flat form:

# ideal (inlined) redshift output
for i in range(n):
    out[i] = x.__ll__.items[i] + math.sin(x.__ll__.items[i])

This is tracked as SPy issue #464, which proposes a @force_inline decorator. The idea is to decorate the small one-statement kernel functions with @force_inline, which would instruct the redshifter to substitute the function body at call sites rather than emitting a call node. In lib_ndarray.spy the candidate sites are already marked with # @force_inline comments.


SIMD in the SPy/C context

SPy emits C, not LLVM IR, so SIMD is handled differently from xtensor. Three realistic approaches, in increasing complexity:

Path 1: rely on auto-vectorization

A clean fused C loop with contiguous memory access and no opaque function calls is typically auto-vectorized by clang and gcc at -O2/-O3 with -march=native. This covers simple arithmetic well. Transcendental functions (sin, exp) are usually not vectorized this way, because the compiler falls back to calls into scalar libm.

Path 2: emit calls to a C SIMD math library

SLEEF provides vectorized math callable directly from C:

__m256 Sleef_sinf8_u10(__m256 x);   // 8-wide sin, AVX2, float32

SPy’s C backend can, when materializing a fused loop over a contiguous array, emit explicit SIMD intrinsic code with SLEEF for transcendentals. The loop structure:

int64_t i = 0;
for (; i + 8 <= N; i += 8) {
    __m256 x_vec = _mm256_load_ps(&x[i]);
    __m256 res   = Sleef_sinf8_u10(x_vec);
    _mm256_store_ps(&out[i], res);
}
for (; i < N; i++)
    out[i] = sinf(x[i]);

Path 3: OpenMP SIMD pragmas

A lighter-weight middle ground:

#pragma omp simd
for (int64_t i = 0; i < N; i++)
    out[i] = 2.0f * x[i] + sinf(x[i]);

With -fopenmp-simd and a vector math library linked, this can vectorize transcendentals without writing intrinsics by hand.

The clean SPy design

SIMD is a concern of the C backend and the stdlib, not of user code. The materialization function emitted by __convert_to__ knows, at blue time, the element type (DTYPE), the dimensionality and contiguity of the output, and the compilation target (and hence which SIMD instruction sets are available).

It can therefore decide at blue time which code path to emit:

@blue.metafunc
def __convert_to__(m_expT, m_gotT, m_x):
    if m_expT.blueval == A:
        if DTYPE == f32 and simd_available[AVX2]:
            def materialize(expr: LazyExpr) -> A:
                # emits the AVX2 + SLEEF loop
                ...
        else:
            def materialize(expr: LazyExpr) -> A:
                # emits the plain C loop (auto-vec friendly)
                ...
        return OpSpec(materialize, [m_x])

simd_available is itself a blue query — resolved at compile time from the --target flag. No runtime CPU detection needed; one optimal codepath is emitted for the target. User-written SPy array code remains entirely free of SIMD concerns: out = 2*x + sin(x) compiles to a fused SIMD loop automatically.


Summary

Concern | C++ (xtensor/Pythran) | SPy
--- | --- | ---
Loop fusion | Expression templates (compile-time types) | LazyExpr + __convert_to__ (blue-time values)
Zero temps | Proxy types, no allocation until assignment | Same, via deferred materialization
Element-wise composition | operator[] on each proxy type; template instantiation recurses | make_kernel blue classmethod; recursively composes concrete red functions at blue time
Function dispatch (sin, …) | xt::sin overloaded on proxy types; returns UnaryExpr<sin_functor, Arg> | sin as @blue.metafunc; dispatches on static type via T._DTYPE/T._NDIM; returns LazySin node
Generic child types | LazyAdd<Left, Right> — tree shape in compile-time type | LazyAdd[DTYPE, NDIM, LEFT_T, RIGHT_T] — tree shape in blue-time type
Redshift output readability | C++ inlines operator[] everywhere; flat expression visible in IR | Kernel calls not yet inlined during redshift; see @force_inline proposal (issue #464)
Inlining guarantee | C++ compiler inlines aggressively at -O2 | C compiler inlines at -O2/--release; @force_inline would make it explicit at redshift time
SIMD dispatch | xsimd::batch<T> type substitution | Blue-time decision in __convert_to__, emits C intrinsics or auto-vec loop
Vectorized math | xsimd polynomial approximations | SLEEF callable from C, or OpenMP SIMD pragma
Target selection | Runtime CPU dispatch (SSE2/AVX2/AVX-512) | Compile-time, from --target flag
User code | Identical to naïve — template magic is invisible | Identical to naïve — blue magic is invisible