SPy architecture, WASM and MLIR

Background: What is SPy?¶

SPy is an experiment by Antonio Cuni (Anaconda) to create a Python variant that can be easily interpreted — for a good development experience — and compiled to native code or WebAssembly for performance. It is emphatically not a drop-in CPython replacement: the language design deliberately diverges from Python where necessary to enable full ahead-of-time compilation.

The architecture rests on two pillars:

libspy — a small C runtime library that handles memory, panics, primitive types, and other low-level concerns. It is compiled to three targets: a static native library (libspy.a), a WASI module (libspy.wasm), and an Emscripten module for the browser.
The SPy compiler — written in Python, it translates .spy source to C (via a cwrite backend), then invokes clang + ninja to produce a native or WASM binary.

The interpreter runs .spy code by loading libspy.wasm into the Python process via wasmtime. An llwasm abstraction layer makes this transparent: on CPython it uses wasmtime; inside Pyodide/PyScript it reuses the browser’s own WASM engine.

Why `libspy.wasm` and Not a Native `libspy.so`?¶

This is one of the most elegant architectural decisions in SPy, and it pays multiple dividends simultaneously.

1. Sandboxing unsafe code¶

SPy has an “unsafe” mode that allows raw pointers and low-level struct manipulation — constructs that can trivially segfault a process. By running this code inside a WASM sandbox via wasmtime, any crash is contained within the WASM linear memory. CPython survives and receives a proper SPyError exception rather than dying. A native .so would offer no such protection without expensive subprocess tricks.

2. Multiple isolated VM instances¶

Each WASM instance has its own linear memory. This means you can instantiate multiple independent SPy VMs inside the same Python process at zero extra cost — something that would require careful global-state management with a shared library.

3. One build artefact, two uses¶

WASM is already a first-class compilation target for SPy (the whole point is to produce .wasm for browser and edge deployment). Using the same libspy.wasm artefact for the interpreter means there is a single build path to maintain. A libspy.so would be a second, diverging artefact.

4. Environment portability for free¶

A native .so cannot be loaded inside a browser or Pyodide environment. A .wasm file works everywhere — on native CPython via wasmtime and inside a browser via its built-in WASM engine. This is how Antonio Cuni and Hood Chatham were able to build the SPy playground running entirely in the browser.

Because each WASM instance has its own linear memory, two SPy VMs cannot accidentally share state. But intentional sharing is possible through the WebAssembly threads proposal, which wasmtime implements.

A SharedMemory object can be created by the host (Python) and passed as an import to two separate libspy.wasm instances. Both instances then read and write the same underlying bytes, with atomic access primitives available for synchronisation.

The limitations are important though:

Only raw bytes are shared — a flat SharedArrayBuffer-like region. There is no concept of sharing heap-allocated SPy objects, reference counts, or GC roots across instances.
The shared-everything threads proposal — which would allow sharing tables, functions, and GC-managed references — is still under development and not yet available in wasmtime.
Wasmtime’s Store architecture adds another constraint: objects from different stores cannot interact directly; SharedMemory is the designated bridge.

Practical upshot: sharing large typed data buffers (arrays of i32, f64, etc.) between two SPy VMs is feasible today and efficient. Sharing higher-level SPy values requires either serialisation or waiting for shared-everything threads.

OpenMP and WASM¶

OpenMP and WASM are not fundamentally incompatible, but they are an awkward fit.

A proof-of-concept (wasm-openmp-examples) already demonstrates compiling libomp.a to WebAssembly using wasi-threads and running it in wasmtime. However several friction points exist:

The architectural mismatch. OpenMP’s fork-join model assumes threads share an address space and a module instance. WASM’s threading model is “instance-per-thread” — each worker thread is a separately instantiated WASM module that shares only the linear memory, not globals or the function table. The OpenMP runtime must re-implement its fork-join barrier entirely inside WASM using atomics, which works but is not how it was designed.

wasi-threads is now legacy. The proposal that enables pthreads-style threading in WASM outside the browser is now considered legacy for WASI preview1. Future work on threads will go through the shared-everything-threads proposal targeting WASI v0.2.

A WASM-native alternative: wasi-parallel. Rather than mapping OpenMP onto WASM, the ecosystem is building wasi-parallel — a WASI proposal that provides a parallel for construct designed from scratch for WASM’s constraints. This is likely a cleaner long-term path than OpenMP-on-WASM.

For SPy specifically, libspy.wasm is single-threaded today, and OpenMP is not a near-term target. Explicit multi-instance concurrency or wasi-parallel are more natural future paths.

WASM and GPU¶

WASM and GPU are orthogonal technologies by design — WASM is a sandboxed CPU abstraction with no notion of a GPU. The ecosystem has two answers:

In the browser: WebGPU. WASM code calls out to WebGPU (a W3C API available in Chrome, Edge, and experimentally Firefox) to dispatch work to the GPU. Emscripten already has bindings for WebGPU. The division of labour is: WASM handles CPU logic, WebGPU handles GPU kernels.

Outside the browser: wasi-gfx. For runtimes like wasmtime, wasi-gfx is a phase-2 WASI proposal that exposes GPU access through WebGPU semantics, providing component bindings via wasi-webgpu. It is not yet production-ready.

MLIR and WASM¶

MLIR currently compiles to WASM by lowering through the LLVM dialect and then using LLVM’s existing WASM backend. This works, but it loses structural information: WASM’s control flow is structured (block/loop/if) whereas LLVM IR is flat, so the LLVM WASM backend has to reconstruct structure using algorithms like Relooper.

Active research (the WAMI project and a 2025 RFC to the LLVM community) proposes a native WASM dialect in MLIR — lifting WASM from being an LLVM backend target to a full citizen of the MLIR ecosystem. This would allow implementing new WASM proposals by adding a dialect and a lowering pass, without needing complicated reconstruction logic.

The MLIR → GPU → WASM gap¶

MLIR has a mature gpu dialect with a full pipeline for generating GPU kernels (PTX/NVVM for CUDA, SPIR-V for OpenCL). What does not exist is a unified compilation path that combines WASM for CPU and GPU kernels in one target. The WebGPU path uses WGSL (WebGPU Shading Language) — a completely different IR from PTX or SPIR-V — and no bridge between MLIR’s GPU dialects and WGSL exists today.

A realistic future “SPy on the web with GPU” would probably require compiling CPU orchestration code to WASM and writing GPU kernels separately in WGSL, with the SPy compiler eventually knowing how to emit both.

numbacc: A SPy→MLIR Compiler Under the Numba Umbrella¶

numba/numbacc is described as “the Numba ahead-of-time compiler”, but it has essentially nothing to do with Numba the library at a technical level.

The repository contains:

.spy source files (including an e2e.spy end-to-end demo)
A nbcc/ Python package implementing the compiler pipeline
.mlir files as intermediate output
Dependencies on spy and mlir — not on numba

The actual pipeline is SPy source → SPy type inference → MLIR (linalg/affine dialects) → native binary or GPU kernels. Compare this with numba-mlir, which explicitly reuses Numba’s CPython bytecode frontend and type inference alongside its LLVM infrastructure — numbacc shares none of that.

The connection to “Numba” is organisational and aspirational: the project lives under the numba/ GitHub organisation (Antonio Cuni works at Anaconda, which sponsors both projects) and signals “this is the direction Numba could evolve towards” — a clean-slate reimagining with SPy’s type-safe frontend and MLIR’s backend, rather than a dependency on the existing Numba codebase.

numbacc and interpreter mode¶

numbacc is purely an AOT tool — there is no interpreter mode, by design. The two tools are complementary:

Mode	Tool	GPU
Development / debugging	SPy interpreter + `libspy.wasm`	No
High-performance compiled output	numbacc + MLIR pipeline	Yes (CUDA/NVVM)
“Interpreted” GPU	—	Fundamental gap

GPU kernels need a physical GPU (or a deprecated software emulator) to run. There is no lightweight interpreter-mode equivalent of libspy.wasm for GPU execution — this is not a numbacc limitation, it is a property of GPU hardware.

Summary¶

SPy’s use of WASM as its interpreter substrate is not a performance trick — it is a deliberate architectural choice that buys sandboxing, isolation, portability, and build-system simplicity in one move. The same .wasm artefact that gets deployed to edge and browser runtimes is the one that runs inside the development interpreter.

The ecosystem around SPy (numbacc, the WASM dialect RFC, wasi-parallel, wasi-gfx) is young but coherent: each piece addresses a real gap, and the pieces fit together in a way that suggests the overall architecture is sound. The main unknowns are in the packaging and distribution layer — an area that tends to lag compiler research by several years in any language ecosystem.