Array computing: expression templates and SIMD
Why NumPy expressions are slow¶
For an expression like b = 2*x + 4*x**2 + np.sin(x), NumPy suffers from two compounding
problems:
Memory bandwidth waste. Each sub-expression allocates a temporary array and triggers
a full pass over memory. The expression above runs as six separate operations (x**2,
4*·, 2*x, sin(x), and two additions), adding up to roughly 15 memory passes (3 reads
of x, 5 temporary writes, 6 temporary reads, 1 final write). If x is large enough to
overflow L2 cache, each pass is a cache miss storm.
No cross-operation SIMD. Because each operation runs as a separate C loop, the
compiler never sees the full expression and cannot vectorize across operations.
Transcendental functions (sin, exp) are called one element at a time via scalar libm.
Tools that address this:
numexpr — chunks the computation to fit in cache and interprets the expression at runtime; good gains, but no compile-time knowledge of the expression.
Pythran — transpiles Python to C++; uses expression templates + xsimd; gets full compile-time fusion and SIMD.
xtensor — pure C++ array library; same technique as Pythran’s backend.
Numba — JIT compilation; effective but usually requires rewriting as an explicit scalar loop.
How C++ expression templates work¶
Expression templates are a compile-time metaprogramming pattern. The key idea: operator overloading returns a proxy type encoding the pending operation, not a computed value. Evaluation is deferred until the result is assigned to a concrete array.
// User writes:
auto b = 2*x + xt::sin(x);
// Compiler sees (simplified):
auto b = Add<Mul<Scalar<int>, ArrayRef>, Sin<ArrayRef>>(2, x, x);
// No allocation yet — b is a lightweight proxy value; its type encodes the whole expression.
// Assignment triggers evaluation:
xt::xarray<float> result = b;
// Compiler emits ONE fused loop:
for (size_t i = 0; i < N; ++i)
    result[i] = 2*x[i] + std::sin(x[i]);

Properties of this approach:
Zero temporary arrays — proxy types hold references, not copies.
Zero virtual dispatch — the expression tree is a compile-time type; everything is inlined by the compiler.
One memory pass — x[i] is read once, all operations applied, and the result written; the working set per iteration fits in L1 cache.
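The same deferred-evaluation idea can be sketched in plain Python. This is only an analogy: Python resolves the tree at runtime via dynamic dispatch, whereas the C++ compiler resolves it at compile time. All class names here (Arr, Scalar, Add, Mul, Sin, materialize) are invented for illustration.

```python
import math

class Expr:
    """Base proxy: operator overloads build a tree instead of computing."""
    def __add__(self, other): return Add(self, wrap(other))
    def __radd__(self, other): return Add(wrap(other), self)
    def __mul__(self, other): return Mul(self, wrap(other))
    def __rmul__(self, other): return Mul(wrap(other), self)

def wrap(v):
    # scalars get wrapped so every tree node supports indexing
    return v if isinstance(v, Expr) else Scalar(v)

class Arr(Expr):
    def __init__(self, data): self.data = list(data)
    def __getitem__(self, i): return self.data[i]
    def __len__(self): return len(self.data)

class Scalar(Expr):
    def __init__(self, v): self.v = v
    def __getitem__(self, i): return self.v  # broadcast

class Add(Expr):
    def __init__(self, l, r): self.l, self.r = l, r
    def __getitem__(self, i): return self.l[i] + self.r[i]

class Mul(Expr):
    def __init__(self, l, r): self.l, self.r = l, r
    def __getitem__(self, i): return self.l[i] * self.r[i]

class Sin(Expr):
    def __init__(self, a): self.a = a
    def __getitem__(self, i): return math.sin(self.a[i])

def materialize(expr, n):
    # the single fused loop: one pass, no temporary arrays
    return [expr[i] for i in range(n)]

x = Arr([0.0, 0.5, 1.0])
b = 2 * x + Sin(x)              # builds a tree; nothing is computed yet
result = materialize(b, len(x))  # one pass over the data
```

Indexing a node recursively evaluates one element of the whole expression, which is exactly what the fused C++ loop body does per iteration.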
Pythran’s Pythonic library implements essentially all of NumPy’s API as C++ expression
templates, so transpiled Python code that looks like NumPy compiles to a fused loop.
How SIMD fits in¶
Expression templates and SIMD are orthogonal concerns that compose cleanly. Expression templates fix the loop structure (one fused loop, no temps). SIMD fixes the loop body (process 8–16 elements per iteration instead of 1).
The standard pattern in xsimd/xtensor:
constexpr size_t W = xsimd::batch<float>::size; // e.g. 8 on AVX2
size_t i = 0;
for (; i + W <= N; i += W) {
    auto x_vec = xsimd::load_aligned(&x[i]);       // load 8 floats
    auto res = 2.0f * x_vec + xsimd::sin(x_vec);   // 8-wide fused ops
    xsimd::store_aligned(&out[i], res);
}
for (; i < N; ++i)                                 // scalar tail
    out[i] = 2.0f * x[i] + std::sin(x[i]);

xsimd::batch<float> overloads all arithmetic operators and math functions, so the fused
loop body is written once and works for both scalar and SIMD by substituting the element
type. Vectorized math (xsimd::sin, xsimd::exp) is implemented as hand-tuned polynomial
approximations — computing 8 sin values costs roughly as much as one or two scalar libm calls.
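The main-loop/tail-loop control flow can be mimicked in Python to make the chunking structure concrete. The names W and fused_kernel are illustrative; Python list slices stand in for aligned loads and stores.

```python
import math

W = 8  # pretend vector width, e.g. 8 floats on AVX2

def fused_kernel(x):
    """One pass: main loop in W-wide chunks, then a scalar tail."""
    out = [0.0] * len(x)
    i = 0
    # main loop: whole chunks of W elements per iteration
    while i + W <= len(x):
        chunk = x[i:i+W]                             # "load"
        res = [2.0*v + math.sin(v) for v in chunk]   # "W-wide" fused ops
        out[i:i+W] = res                             # "store"
        i += W
    # scalar tail: leftover elements one at a time
    while i < len(x):
        out[i] = 2.0*x[i] + math.sin(x[i])
        i += 1
    return out

data = [k * 0.1 for k in range(19)]  # length 19: two chunks + tail of 3
result = fused_kernel(data)
```

The important structural property is that both loops share one fused body, mirroring how xsimd lets the same expression be instantiated for batch and scalar element types.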
SPy’s architecture (from the blog post)¶
SPy (source on GitHub) distinguishes two colors of expressions:
Blue expressions are known at compile time and evaluated during redshifting (the partial-evaluation pass).
Red expressions are emitted into the final AST and run at runtime.
Redshifting eliminates all blue computation, leaving a typed AST of direct function calls with no dispatch overhead.
Key properties:
Static typing is not a separate analysis pass — it falls out of @blue evaluation. x + y implicitly calls operator.ADD(STATIC_TYPE(x), STATIC_TYPE(y)) at blue time; if that fails, you get a compile-time error.
@blue functions run in the interpreter, so they are fully debuggable with breakpoint().
@blue.generic implements generics: a generic function is a @blue function that takes types as arguments and returns a specialized nested function. add[int] is syntactic sugar for a memoized blue function call.
The compiler backend emits C code, not LLVM IR — simpler to implement, easy to target new platforms (WASI, Emscripten), good performance via clang/gcc.
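The generics-as-memoized-blue-calls idea can be modeled in ordinary Python. This is a sketch, not SPy: functools.lru_cache stands in for blue memoization, and plain closures stand in for specialized compiled functions.

```python
from functools import lru_cache

class Generic:
    """add[int]-style sugar: subscription is a memoized specialization call."""
    def __init__(self, fn):
        self.fn = lru_cache(maxsize=None)(fn)
    def __getitem__(self, T):
        return self.fn(T)

@Generic
def add(T):
    # runs once per T (the "blue time" part); returns a specialized function
    def add_impl(a, b):
        return a + b
    add_impl.__name__ = f"add_{T.__name__}"
    return add_impl

add_int = add[int]      # specializes and memoizes
add_float = add[float]  # a distinct specialization
```

Because the outer call is memoized, add[int] always returns the same specialized object, matching the claim that each unique instantiation happens only once.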
The spy-opspec.md document details the dispatch machinery:
MetaArg — wraps a value with its static type and color (blue/red).
OpSpec — returned by a blue dispatch function; names the concrete runtime function to call, and optionally rewrites the argument list to bake in compile-time constants (partial application at blue time).
@blue.metafunc — idiomatic decorator for type-based dispatch; receives MetaArg objects, returns an OpSpec, and runs entirely at compile time.
The interpreter (ASTFrame) and redshifter (DopplerFrame) share the same dispatch and
type-resolution logic. DopplerFrame only overrides what happens with the result:
instead of executing the resolved W_OpImpl, it emits an ast.Call node pointing to the
concrete function.
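That sharing can be sketched as a tiny class pair. The names and the dispatch table below are invented stand-ins for ASTFrame/DopplerFrame; only the result handling differs between the two subclasses.

```python
class BaseFrame:
    """Shared dispatch/type-resolution logic; subclasses decide what
    to do with the resolved concrete function."""
    TABLE = {("add", int, int): ("int_add", lambda a, b: a + b)}

    def dispatch(self, op, a, b):
        # type resolution: pick the concrete implementation
        name, fn = self.TABLE[(op, type(a), type(b))]
        return self.handle(name, fn, a, b)

class InterpFrame(BaseFrame):
    # interpreter (ASTFrame analog): execute the resolved function
    def handle(self, name, fn, a, b):
        return fn(a, b)

class RedshiftFrame(BaseFrame):
    # redshifter (DopplerFrame analog): emit a call node instead
    def handle(self, name, fn, a, b):
        return ("call", name, a, b)

val = InterpFrame().dispatch("add", 2, 3)       # executes
node = RedshiftFrame().dispatch("add", 2, 3)    # emits an AST-like node
```

Both frames run the identical resolution path, so the interpreter and the compiler cannot disagree about which function a call resolves to.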
The myarray.spy example¶
from unsafe import gc_alloc, gc_ptr
from operator import OpSpec
from types import StructType

@struct
class ArrayData[DTYPE, NDIM]:
    shape: gc_ptr[i32]
    items: gc_ptr[DTYPE]

@blue.generic
def ndarray(DTYPE, NDIM):
    if NDIM > 2:
        raise StaticError("NDIM>2 not supported")
    ArrayData_c = ArrayData[DTYPE, NDIM]

    @struct
    class Self:
        __ll__: gc_ptr[ArrayData_c]

        @blue.metafunc
        def __new__(*args_m):
            if len(args_m) != NDIM + 1:
                raise StaticError("Wrong number of arguments")
            def new(type_: StructType, *args: Args[NDIM, i32]) -> Self:
                ll = gc_alloc[ArrayData_c](1)
                ll.shape = gc_alloc[i32](NDIM)
                for i, n in unroll(enumerate(args)):
                    ll.shape[i] = n
                return Self.__make__(ll)
            return new

    return Self

def main() -> None:
    arr1 = ndarray[f64, 1](4)
    arr2 = ndarray[f64, 2](4, 4)

What happens at blue time¶
ndarray is a @blue.generic function. Calling ndarray[f64, 1] runs the entire
function body at compile time and returns a freshly constructed @struct type.
ndarray[f64, 1] and ndarray[f64, 2] are two distinct types, each specialized for
their DTYPE and NDIM. This is the SPy equivalent of a C++ class template.
ArrayData is the low-level backing store: a shape pointer and a data pointer, with a
fixed C-compatible layout. __ll__ is the SPy convention for the underlying raw
representation field.
The __new__ metafunction¶
__new__ is a @blue.metafunc: it receives MetaArg objects describing the call-site
arguments and returns an OpSpec selecting the concrete constructor.
The arity check len(args_m) != NDIM + 1 is a compile-time check — passing the wrong number of shape arguments is caught at spy build time.
*args: Args[NDIM, i32] is a blue-generic type that expands to exactly NDIM i32 arguments, statically known.
for i, n in unroll(enumerate(args)) — unroll is a blue directive that statically unrolls the loop at compile time. The redshifted C output is a straight-line sequence of assignments with no loop counter.
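What unroll buys can be imitated in Python by generating straight-line source at "build time". This is a toy: exec stands in for redshifting, and make_ctor is an invented helper whose generated assignments mirror the ll->shape[k] = nk lines of the C output.

```python
def make_ctor(ndim):
    """Generate a constructor with the shape loop unrolled at build time."""
    params = ", ".join(f"n{k}" for k in range(ndim))
    lines = [f"def ctor({params}):",
             f"    shape = [0] * {ndim}"]
    for k in range(ndim):             # loop runs here, at build time...
        lines.append(f"    shape[{k}] = n{k}")   # ...not in the output
    lines.append("    return shape")
    src = "\n".join(lines)
    ns = {}
    exec(src, ns)
    return ns["ctor"], src

ctor2, src = make_ctor(2)  # straight-line code, no loop counter
```

The generated source contains only direct assignments, just as the redshifted C does.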
The generated C for ndarray[f64, 1]'s constructor looks roughly like:
ndarray_f64_1 spy_ndarray_f64_1__new__(int32_t n0) {
    ArrayData_f64_1* ll = gc_alloc(sizeof(ArrayData_f64_1));
    ll->shape = gc_alloc(1 * sizeof(int32_t));
    ll->shape[0] = n0;  // statically unrolled
    ll->items = gc_alloc(n0 * sizeof(double));
    return (ndarray_f64_1){ .ll = ll };
}

No virtual dispatch, no Python object overhead, no dynamic type checks.
Expression fusion in SPy: the path forward¶
SPy’s @blue.metafunc / OpSpec machinery makes it possible to implement expression
templates in user-space SPy code, without any compiler magic.
Naïve (eager) operator¶
@blue.metafunc
def __add__(m_self, m_other):
    def add(a: Self, b: Self) -> Self:
        out = Self.__new__(Self, a.shape[0])
        for i in range(a.shape[0]):
            out.items[i] = a.items[i] + b.items[i]
        return out
    return OpSpec(add)

This works but allocates a temporary array for each +. Same problem as NumPy.
Lazy / fused operator (expression template style)¶
The key idea: __add__ returns a lazy expression node instead of triggering
computation. The loop is only materialized when the result is assigned to a concrete
ndarray, via __convert_to__.
@blue.generic
def LazyAdd(DTYPE, NDIM):
    A = ndarray[DTYPE, NDIM]

    @struct
    class Self:
        left: A
        right: A

        @blue.metafunc
        def __convert_to__(m_expT, m_gotT, m_x):
            if m_expT.blueval == A:
                def materialize(expr: Self) -> A:
                    out = A.__new__(A, expr.left.shape[0])
                    for i in range(expr.left.shape[0]):
                        out.items[i] = expr.left.items[i] + expr.right.items[i]
                    return out
                return OpSpec(materialize, [m_x])
            return OpSpec.NULL

    return Self

And ndarray.__add__ becomes zero-cost at the call site:
@blue.metafunc
def __add__(m_self, m_other):
    Lazy = LazyAdd[DTYPE, NDIM]
    def add(a: Self, b: Self) -> Lazy:
        return Lazy.__make__(a, b)  # just stores two pointers, no allocation
    return OpSpec(add)

The assignment result: ndarray[f64, 1] = a + b + c chains through __convert_to__ and
emits a single fused loop in the redshifted AST. No temporary arrays. The C backend sees
tight straight-line code per element.
In C++, the expression tree is a compile-time type. In SPy, it is a compile-time
value (a blue LazyAdd struct holding references). Both achieve the same result after
redshifting: one fused loop, zero overhead.
Extending laziness to functions: sin, exp, etc.¶
Operators are straightforward because __add__ has a well-defined dispatch hook. Free
functions like sin need a different entry point — but the mechanism is identical: make
sin itself a @blue.metafunc that inspects the static type of its argument.
sin as a @blue.metafunc¶
@blue.metafunc
def sin(m_x):
    T = m_x.static_type
    if T == f64:
        # scalar — eager, direct C call
        def sin_eager(x: f64) -> f64:
            return libc.sin(x)
        return OpSpec(sin_eager)
    elif T == ndarray[f64, 1]:
        # array — return a lazy node
        Lazy = LazySin[f64, 1]
        def sin_lazy(x: ndarray[f64, 1]) -> Lazy:
            return Lazy.__make__(x)  # just stores a reference, no computation
        return OpSpec(sin_lazy)
    raise TypeError(f"sin not defined for {T}")

After redshifting, y = sin(x) where x: ndarray[f64, 1] becomes a direct call to
sin_lazy, which constructs a LazySin node at runtime — no computation, just a pointer
to x. The same pattern applies to every elementwise function in the math library
(cos, exp, log, etc.).
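The scalar-vs-array split can be mimicked with a Python function that branches on its argument's type. The difference from SPy is important: here the branch runs at call time, while in SPy it runs once, at blue time. math.sin plays the role of libc.sin, and Arr/LazySin are bare stand-in records.

```python
import math

class Arr:
    def __init__(self, items):
        self.items = list(items)

class LazySin:
    def __init__(self, arg):
        self.arg = arg  # just a reference; no computation happens here

def sin(x):
    if isinstance(x, float):   # scalar: eager, direct call
        return math.sin(x)
    elif isinstance(x, Arr):   # array: return a lazy node
        return LazySin(x)
    raise TypeError(f"sin not defined for {type(x).__name__}")

s = sin(0.0)                 # computed immediately
node = sin(Arr([0.0, 1.0]))  # no computation, just a wrapper
```

One entry point, two behaviors, selected purely by the argument's type.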
The problem: lazy nodes don’t compose naïvely¶
A LazyAdd node stores left: A and right: A — concrete ndarray instances. But in
2*x + sin(x), the right child is a LazySin, not a concrete array. The materialize
function in LazyAdd would call expr.right.items[i], which doesn’t exist on LazySin.
The solution: a common __getitem__ interface¶
Every lazy node implements __getitem__ as a @blue.metafunc that computes one element
on demand. This is the SPy equivalent of C++'s operator[] on proxy types.
For LazySin:
@blue.generic
def LazySin(DTYPE, NDIM):
    A = ndarray[DTYPE, NDIM]

    @struct
    class Self:
        arg: A

        @blue.metafunc
        def __getitem__(m_self, m_i):
            def get(expr: Self, i: i64) -> DTYPE:
                return libc.sin(expr.arg.items[i])
            return OpSpec(get)

        @blue.metafunc
        def __convert_to__(m_expT, m_gotT, m_x):
            if m_expT.blueval == A:
                def materialize(expr: Self) -> A:
                    out = A.__new__(A, expr.arg.shape[0])
                    for i in range(expr.arg.shape[0]):
                        out.items[i] = expr[i]  # via __getitem__
                    return out
                return OpSpec(materialize, [m_x])
            return OpSpec.NULL

    return Self

LazyAdd is updated to use the same interface for its children:
@blue.metafunc
def __getitem__(m_self, m_i):
    def get(expr: Self, i: i64) -> DTYPE:
        return expr.left[i] + expr.right[i]  # recurses into children
    return OpSpec(get)

@blue.metafunc
def __convert_to__(m_expT, m_gotT, m_x):
    if m_expT.blueval == A:
        def materialize(expr: Self) -> A:
            out = A.__new__(A, ...)
            for i in range(...):
                out.items[i] = expr[i]  # walks the full tree
            return out
        return OpSpec(materialize, [m_x])

Now expr[i] on a LazyAdd dispatches (at blue time, type statically known) to
expr.left[i] + expr.right[i], which in turn dispatches to the __getitem__ of each
child. The entire tree is resolved by blue dispatch — no runtime polymorphism. The
redshifted loop body is a single flat expression:
// Redshifted C for: result = 2*x + sin(x)
for (int64_t i = 0; i < N; i++)
    out[i] = 2.0 * x[i] + sin(x[i]);

Making LazyAdd generic over child types¶
The left and right fields need to hold any lazy node type, not just concrete arrays.
The solution is to parametrize LazyAdd over its child types — exactly as C++ expression
templates do:
@blue.generic
def LazyAdd(DTYPE, NDIM, LEFT_T, RIGHT_T):
    A = ndarray[DTYPE, NDIM]

    @struct
    class Self:
        left: LEFT_T
        right: RIGHT_T

        @blue.metafunc
        def __getitem__(m_self, m_i):
            def get(expr: Self, i: i64) -> DTYPE:
                return expr.left[i] + expr.right[i]
            return OpSpec(get)

        ...

    return Self

And ndarray.__add__ becomes:
@blue.metafunc
def __add__(m_self, m_other):
    LEFT_T = m_self.static_type
    RIGHT_T = m_other.static_type
    Lazy = LazyAdd[DTYPE, NDIM, LEFT_T, RIGHT_T]
    def add(a: LEFT_T, b: RIGHT_T) -> Lazy:
        return Lazy.__make__(a, b)
    return OpSpec(add)

LEFT_T and RIGHT_T are blue constants — the static types of the operands, known at
blue time. LazyAdd[f64, 1, LazyMul[...], LazySin[...]] is a fully specialized type
generated at blue time, directly analogous to Add<Mul<Scalar, ArrayRef>, Sin<ArrayRef>>
in C++. Blue memoization of @blue.generic ensures each unique combination is
instantiated only once.
The full composition for result = 2*x + sin(x)¶
At runtime the lazy tree builds up as:
sin(x)    → LazySin[f64,1] { arg = &x }
2*x       → LazyMul[f64,1,Scalar,Arr] { left = 2, right = &x }
... + ... → LazyAdd[f64,1,LazyMul,LazySin] { left = LazyMul{...},
                                             right = LazySin{...} }

Assignment to ndarray[f64, 1] triggers __convert_to__ on the outermost LazyAdd,
which calls expr[i], which the redshifter resolves — at blue time — into the flat
expression 2.0 * x[i] + sin(x[i]). The C compiler sees a single loop with no
indirection, no virtual calls, and no temporary arrays.
Summary of the lazy node design¶
Every lazy node — whether from an operator or a free function — implements two blue interfaces:
__getitem__(i) — element-wise evaluation; used when this node is a child of another lazy node.
__convert_to__(ndarray[...]) — materialization into a concrete array; used when this node is the root of the expression being assigned.
Free functions (sin, cos, exp, …) become @blue.metafunc free functions that
dispatch on the static type of their argument: scalar → eager C call; array → lazy node
wrapping the input. The function library and the lazy node library are written
independently and compose through the __getitem__ interface.
SIMD in the SPy/C context¶
SPy emits C, not LLVM IR, so SIMD is handled differently from xtensor. Three realistic approaches, in increasing complexity:
Path 1: rely on auto-vectorization¶
A clean fused C loop with contiguous memory access and no opaque function calls will be
auto-vectorized by clang/gcc at -O2 -march=native. Covers simple arithmetic well.
Transcendental functions (sin, exp) are usually not vectorized this way because the
compiler calls scalar libm.
Path 2: emit calls to a C SIMD math library¶
SLEEF provides vectorized math callable directly from C:
__m256 Sleef_sinf8_u10(__m256 x);  // 8-wide sin, AVX2, float32

SPy’s C backend can, when materializing a fused loop over a contiguous array, emit explicit SIMD intrinsic code with SLEEF for transcendentals. The loop structure:
int64_t i = 0;
for (; i + 8 <= N; i += 8) {
    __m256 x_vec = _mm256_load_ps(&x[i]);
    __m256 res = Sleef_sinf8_u10(x_vec);
    _mm256_store_ps(&out[i], res);
}
for (; i < N; i++)
    out[i] = sinf(x[i]);

Path 3: OpenMP SIMD pragmas¶
A lighter-weight middle ground:
#pragma omp simd
for (int64_t i = 0; i < N; i++)
    out[i] = 2.0f * x[i] + sinf(x[i]);

With -fopenmp-simd and a vector math library linked, this can vectorize transcendentals
without writing intrinsics by hand.
The clean SPy design¶
SIMD is a concern of the C backend and the stdlib, not of user code. The
materialization function emitted by __convert_to__ knows, at blue time:
the element type (DTYPE — a blue constant),
the memory layout (contiguous, known stride),
the full operation tree.
It can therefore decide at blue time which codepath to emit:
@blue.metafunc
def __convert_to__(m_expT, m_gotT, m_x):
    if m_expT.blueval == A:
        if DTYPE == f32 and simd_available[AVX2]:
            def materialize(expr: LazyExpr) -> A:
                # emits the AVX2 + SLEEF loop
                ...
        else:
            def materialize(expr: LazyExpr) -> A:
                # emits the plain C loop (auto-vec friendly)
                ...
        return OpSpec(materialize, [m_x])

simd_available is itself a blue query — resolved at compile time from the --target
flag. No runtime CPU detection needed; one optimal codepath is emitted for the target.
User-written SPy array code remains entirely free of SIMD concerns: out = 2*x + sin(x)
compiles to a fused SIMD loop automatically.
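The build-time codepath choice can be sketched in Python: a builder reads a target constant once and returns one specialized function, so the returned "loop" contains no CPU check. TARGET and build_materialize are invented for illustration; math.sin models the vectorized math call.

```python
import math

TARGET = "avx2"  # stands in for the --target flag: fixed at build time

def build_materialize(target):
    """Choose the codepath once, at 'blue time'; the returned function
    has no runtime CPU detection inside its loop."""
    if target == "avx2":
        def materialize(x):
            # models the 8-wide SIMD + SLEEF loop: process chunks of 8
            out = []
            for i in range(0, len(x), 8):
                out.extend(math.sin(v) for v in x[i:i+8])
            return out
    else:
        def materialize(x):
            # plain scalar loop (auto-vectorization friendly)
            return [math.sin(v) for v in x]
    return materialize

materialize = build_materialize(TARGET)
```

Both specializations compute the same result; the point is that the branch between them is taken once, before any data is processed.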
Summary¶
| Concern | C++ (xtensor/Pythran) | SPy |
|---|---|---|
| Loop fusion | Expression templates (compile-time types) | LazyExpr + __convert_to__ (blue-time values) |
| Zero temps | Proxy types, no allocation until assignment | Same, via deferred materialization |
| Element-wise interface | operator[] and load_simd() on each proxy type | __getitem__ as @blue.metafunc on each lazy node |
| Function dispatch (sin, …) | xt::sin overloaded on proxy types; returns UnaryExpr<sin_functor, Arg> | sin as @blue.metafunc; dispatches on static type; returns LazySin node for arrays |
| Composing operators + functions | Same operator[] interface on all nodes; template instantiation walks the tree | Same __getitem__ interface on all nodes; blue dispatch walks the tree at compile time |
| Generic child types | LazyAdd<Left, Right> — tree shape encoded in compile-time type | LazyAdd[DTYPE, NDIM, LEFT_T, RIGHT_T] — tree shape encoded in blue-time type |
| SIMD dispatch | xsimd::batch<T> type substitution | Blue-time decision in __convert_to__, emits C intrinsics or auto-vec loop |
| Vectorized math | xsimd polynomial approximations | SLEEF callable from C, or OpenMP SIMD pragma |
| Target selection | Runtime CPU dispatch (SSE2/AVX2/AVX-512) | Compile-time, from --target flag |
| User code | Identical to naïve — template magic is invisible | Identical to naïve — blue magic is invisible |