How to Fix: Cranelift: Inconsistent execution results of .clif code and corresponding wasm program across x86_64 and aarch64

Updated June 10, 2026 8 min read

Aldawsari

Cranelift cross-architecture mismatch: why the same .clif and Wasm produce different results on x86_64 and aarch64

A mismatch between a .clif test case and its corresponding WebAssembly program across x86_64 and aarch64 usually points to one of three low-level causes: undefined lane semantics, ABI or calling-convention differences for SIMD values, or a bug in legalization/lowering where Cranelift rewrites IR differently per backend. When the same logical program returns different vector contents or scalar side results depending on the target ISA, the right fix is not to guess at the output but to isolate whether the inconsistency starts in CLIF construction, Wasm translation, mid-end optimization, or backend lowering.

Table of Contents

Reproduce the inconsistency
Understanding the Root Cause
Step-by-Step Solution
Common Edge Cases
FAQ

Reproduce the inconsistency

The issue description shows a multi-target .clif test that includes s390x, aarch64, riscv64, and x86_64, with a function returning several SIMD and scalar values. This kind of test is especially sensitive because vector return values and multi-value returns are handled differently across backends.

Start by reducing the failing test to the smallest possible reproducer. Keep the exact targets that expose the bug and remove anything not required to trigger the mismatch.

test interpret
test run
set opt_level=none
set preserve_frame_pointers=true
target aarch64
target x86_64

function %main() -> i8x16, i8x16, i8x16, i32 {
    ; keep only the instructions required to reproduce the mismatch
    ; remove dead blocks, extra constants, and unrelated lanes
}

Then verify the result in two paths:

Run the .clif test directly.
Compile and run the equivalent Wasm module that should produce the same semantics.

If those diverge only on one backend, that strongly suggests a backend-specific lowering problem. If they diverge everywhere, the issue may be in the Wasm translation layer or the CLIF test itself.
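The triage rule above can be written down as a small decision function. This is only a sketch and not part of Cranelift: `Observation`, `Verdict`, and `triage` are hypothetical names, and it assumes you have already collected the result of each path on each target as comparable bytes.

```rust
/// Result of running both paths on one target (hypothetical record type).
#[derive(Debug, Clone)]
pub struct Observation {
    pub target: &'static str,
    /// Raw bytes returned by the direct .clif run.
    pub clif_result: Vec<u8>,
    /// Raw bytes returned by the equivalent Wasm run.
    pub wasm_result: Vec<u8>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Verdict {
    /// Both paths agree on every observed target.
    Consistent,
    /// The paths disagree on some targets only: suspect those backends.
    BackendSpecific(Vec<&'static str>),
    /// The paths disagree on every observed target: suspect the Wasm
    /// translation or the CLIF test itself.
    TranslationOrTest,
}

/// Classifies where to look first, following the rule in the text.
/// With a single observation, a divergence is reported as
/// `TranslationOrTest` because there is no second backend to compare with.
pub fn triage(observations: &[Observation]) -> Verdict {
    let diverging: Vec<&'static str> = observations
        .iter()
        .filter(|o| o.clif_result != o.wasm_result)
        .map(|o| o.target)
        .collect();

    if diverging.is_empty() {
        Verdict::Consistent
    } else if diverging.len() == observations.len() {
        Verdict::TranslationOrTest
    } else {
        Verdict::BackendSpecific(diverging)
    }
}
```

The verdict is a starting point for investigation rather than a proof: a backend-specific divergence can still be caused by a test that observes underspecified bits.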

Understanding the Root Cause

The technical reason this happens is that Cranelift IR and Wasm SIMD semantics are not always exercised identically once code reaches machine-specific lowering. A program can appear equivalent at a high level while producing different observable results because of backend rules for lane ordering, register assignment, bitcasts, sign/zero extension, or multi-register return conventions.

The most common root causes are:

Backend legalization differences: Cranelift may expand or legalize a vector instruction differently on x86_64 than on aarch64. If one legalization path mishandles lane extraction, shuffle masks, or narrowing/widening behavior, the generated machine code becomes semantically inconsistent.
ABI mismatches for SIMD returns: Returning multiple i8x16 values plus scalars can expose calling-convention edge cases. One backend may return a vector in registers while another uses a different placement rule, and a bug in ABI lowering can make the caller interpret the result incorrectly.
Undefined or underspecified test construction: If the CLIF test uses uninitialized values, relies on unspecified bits after a conversion, or observes lanes that were never semantically defined, each backend may produce different bits. The difference is real, but such a test is not a valid basis for deterministic comparison.
Wasm-to-CLIF translation asymmetry: The generated CLIF from Wasm may include canonicalization steps that the handwritten .clif test omits. That can make two snippets look equivalent while actually differing in signedness, lane interpretation, or trapping behavior.

In practice, this class of issue often comes down to one question: is the source program fully defined at the IR level? If yes, and x86_64 and aarch64 disagree, the bug is usually in instruction selection, legalization, or ABI lowering. If not, the observed difference is a test bug rather than a compiler bug.

Step-by-Step Solution

The safest way to solve this issue is to validate semantics layer by layer and then patch the exact stage that diverges.

1. Minimize the failing CLIF program

Strip the function down until only the problematic vector operation and return sequence remain.

# Print only the body of %main from the failing test
sed -n '/function %main/,/}/p' failing.clif

Keep only the architectures that show the mismatch:

test interpret
test run
set opt_level=none
target aarch64
target x86_64

function %main() -> i8x16, i32 {
block0:
    v0 = vconst.i8x16 [0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15]
    v1 = iconst.i32 42
    return v0, v1
}

If this reduced case stops failing, add instructions back one at a time until the mismatch returns.
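Adding instructions back one at a time can be automated. The sketch below is a hypothetical helper, not a Cranelift tool: `first_triggering_instruction` and the `mismatches` predicate are assumed names, and in practice the predicate would write the candidate program to a .clif file and run it on both targets.

```rust
/// Starting from a passing `base` program, appends `candidates` one at a time
/// (in order) and returns the index of the first candidate whose addition
/// makes `mismatches` report a cross-target difference.
///
/// `mismatches` is a caller-supplied predicate (hypothetical here); it receives
/// the instruction lines of the program built so far.
pub fn first_triggering_instruction<F>(
    base: &[String],
    candidates: &[String],
    mismatches: F,
) -> Option<usize>
where
    F: Fn(&[String]) -> bool,
{
    let mut program: Vec<String> = base.to_vec();
    for (index, inst) in candidates.iter().enumerate() {
        program.push(inst.clone());
        if mismatches(&program) {
            return Some(index);
        }
    }
    // No single addition brought the mismatch back.
    None
}
```

This linear scan assumes instructions can be appended in their original order without breaking the function; a candidate that uses a value defined by a later candidate would need to be added together with its definition.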

2. Compare generated CLIF from Wasm against handwritten CLIF

If you have a corresponding Wasm file, dump the translated Cranelift IR and compare it to the handwritten test. Look specifically for:

bitcast vs splat differences
signed vs unsigned lane extensions
extra shuffle or swizzle normalization
inserted iconst, bmask, or extractlane operations

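# Sketch of the workflow; module.wasm, clif-out/, and the file names are placeholders
# 1. translate Wasm to CLIF (recent Wasmtime CLIs accept --emit-clif on `compile`;
#    confirm with `wasmtime compile --help` for your version)
wasmtime compile --emit-clif clif-out/ module.wasm
# 2. diff one emitted function against the handwritten .clif
diff -u handwritten.clif clif-out/emitted_function.clif
# 3. identify semantic mismatches before lowering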

If the translated CLIF already differs from the handwritten one, fix the test first. The issue may not be backend-specific at all.
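A textual diff of two CLIF files is noisy because value numbers rarely line up. One rough alternative is to compare how often each opcode appears. The sketch below is an assumption-laden helper, not a Cranelift API: `opcode_of`, `opcode_counts`, and `opcode_diff` are hypothetical names, and the parsing is a simple line-based heuristic rather than a real CLIF parser.

```rust
use std::collections::BTreeMap;

/// Extracts the opcode from one CLIF instruction line,
/// e.g. `v1 = iconst.i32 42` yields `iconst` and `return v0` yields `return`.
fn opcode_of(line: &str) -> Option<&str> {
    // Drop trailing `;` comments, then surrounding whitespace.
    let code = line.split(';').next()?.trim();
    let is_header = ["function", "test ", "set ", "target "]
        .iter()
        .any(|prefix| code.starts_with(prefix));
    if code.is_empty() || code == "}" || code.ends_with(':') || is_header {
        return None;
    }
    // Instructions with results look like `v1 = op.ty args`.
    let rhs = match code.split_once('=') {
        Some((_, rhs)) => rhs.trim(),
        None => code,
    };
    // Strip the controlling type suffix (`iconst.i32` becomes `iconst`).
    rhs.split_whitespace().next()?.split('.').next()
}

/// Counts opcode occurrences in a CLIF text.
pub fn opcode_counts(clif: &str) -> BTreeMap<String, usize> {
    let mut counts = BTreeMap::new();
    for op in clif.lines().filter_map(opcode_of) {
        *counts.entry(op.to_string()).or_insert(0) += 1;
    }
    counts
}

/// Returns `(opcode, count in handwritten, count in translated)` for every
/// opcode whose counts differ, sorted by opcode name.
pub fn opcode_diff(handwritten: &str, translated: &str) -> Vec<(String, usize, usize)> {
    let a = opcode_counts(handwritten);
    let b = opcode_counts(translated);
    let mut names: Vec<&String> = a.keys().chain(b.keys()).collect();
    names.sort();
    names.dedup();
    names
        .into_iter()
        .filter_map(|name| {
            let left = a.get(name).copied().unwrap_or(0);
            let right = b.get(name).copied().unwrap_or(0);
            if left != right {
                Some((name.clone(), left, right))
            } else {
                None
            }
        })
        .collect()
}
```

A report such as `sextend` appearing only in the handwritten file and `uextend` only in the translated one points directly at a signedness difference. Equal counts do not prove equivalence, so treat this as a filter before reading the IR by hand.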

3. Inspect target-specific lowering

Once the CLIF is confirmed semantically equivalent, inspect how each backend lowers the same instruction sequence. The critical goal is to detect where aarch64 and x86_64 stop agreeing.

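# Run the filetest from a wasmtime checkout (failing.clif is the reduced reproducer)
cargo run -p cranelift-tools --bin clif-util -- test failing.clif

# Print the machine code each backend generates for the same function
# (confirm the exact flags with `clif-util compile --help` on your checkout)
cargo run -p cranelift-tools --bin clif-util -- compile --target aarch64 -D failing.clif
cargo run -p cranelift-tools --bin clif-util -- compile --target x86_64 -D failing.clif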

Check whether the mismatch appears:

in the interpreter
after legalization
only in final machine code execution

If the interpreter agrees with Wasm semantics but native execution differs, the bug is almost certainly in backend lowering or ABI handling.

4. Validate return-value ABI handling

Because the sample signature returns multiple vectors and scalars, verify how values are returned per architecture. A bug here can look like a computation bug even when the internal math is correct.

function %main() -> i8x16, i8x16, i8x16, i32 {
block0:
    v0 = vconst.i8x16 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]
    v1 = vconst.i8x16 [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
    v2 = vconst.i8x16 [2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
    v3 = iconst.i32 7
    return v0, v1, v2, v3
}

If this simple return-only test fails on one architecture, the defect is likely in the calling convention implementation rather than in SIMD instruction lowering.
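When a return-only test does fail, it helps to report which returned value and which lane differ, because a swapped or misplaced return slot looks different from a single corrupted lane. The following sketch models the three i8x16 results as byte arrays; `ReturnValues`, `Mismatch`, and `first_mismatch` are hypothetical names used for illustration and are not part of any Cranelift test harness.

```rust
/// The four results of the return-only test, as seen by the caller.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ReturnValues {
    pub vectors: [[u8; 16]; 3],
    pub scalar: i32,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Mismatch {
    /// A lane of one returned vector differs.
    Lane { vector: usize, lane: usize, expected: u8, actual: u8 },
    /// All vectors match but the scalar differs.
    Scalar { expected: i32, actual: i32 },
}

/// Returns the first difference between expected and actual results,
/// checking vectors in return order and then the scalar.
pub fn first_mismatch(expected: &ReturnValues, actual: &ReturnValues) -> Option<Mismatch> {
    for (vector, (exp, act)) in expected.vectors.iter().zip(actual.vectors.iter()).enumerate() {
        for lane in 0..16 {
            if exp[lane] != act[lane] {
                return Some(Mismatch::Lane {
                    vector,
                    lane,
                    expected: exp[lane],
                    actual: act[lane],
                });
            }
        }
    }
    if expected.scalar != actual.scalar {
        return Some(Mismatch::Scalar { expected: expected.scalar, actual: actual.scalar });
    }
    None
}
```

Using distinct constants per vector, as the test above does, is what makes this useful: a mismatch at lane 0 of the second vector whose actual value equals the third vector's constant suggests swapped return slots rather than a wrong computation.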

5. Fix undefined or underspecified test behavior

If the test depends on partially defined vector contents, rewrite it so every observed lane is explicitly set before use. Also avoid relying on architecture-specific behavior for narrowing conversions or bit reinterpretation unless the CLIF operation guarantees it.

; Bad pattern: observe lanes from a value not fully defined by the IR
; Better pattern: define all lanes before extract/return
v0 = vconst.i8x16 [10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25]
v1 = extractlane v0, 0

This step resolves a surprising number of cross-target discrepancies.

6. Patch the lowering or legalization rule

If the bug is confirmed in backend codegen, patch the machine-specific rule. Typical fixes include:

correcting shuffle mask interpretation
using the correct lane-to-byte mapping for the target's endianness
fixing sign extension during scalar extraction from vectors
repairing register assignment for multiple SIMD return values

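// Pseudocode example of the kind of fix to look for.
// In current Cranelift most lowering rules live in each backend's ISLE files,
// and ABI handling in the backend's abi module, not in a Rust match like this.
match inst {
    Opcode::Extractlane => {
        // ensure lane index and element type are lowered consistently
        // for both aarch64 and x86_64 backends
    }
    Opcode::Return => {
        // verify ABI return slots for vector + scalar multi-value returns
    }
    _ => {}
}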
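Of the fixes listed above, sign extension during scalar extraction is the easiest to pin down with a reference model. The sketch below shows the two possible widenings of an i8 lane to 32 bits; the function names are hypothetical, and the model mirrors the distinction WebAssembly makes between `i8x16.extract_lane_s` and `i8x16.extract_lane_u`.

```rust
/// Reads lane `index` of an i8x16 and widens it to 32 bits with sign extension.
pub fn extract_lane_sext(v: [u8; 16], index: usize) -> i32 {
    // Reinterpret the byte as signed first, then widen.
    v[index] as i8 as i32
}

/// Reads lane `index` of an i8x16 and widens it to 32 bits with zero extension.
pub fn extract_lane_zext(v: [u8; 16], index: usize) -> i32 {
    // Widening an unsigned byte fills the upper bits with zeros.
    v[index] as i32
}
```

The two functions agree for lanes below 0x80 and disagree above it, so a regression test that only uses small lane values cannot catch a backend that picks the wrong extension. Include at least one lane with the high bit set.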

7. Add regression tests for both CLIF and Wasm

Do not stop at fixing the single failing file. Add two regression tests:

a .clif backend regression test for the minimized reproducer
a Wasm translation or execution test proving semantic equivalence

test interpret
test run
set opt_level=none
target aarch64
target x86_64

function %main() -> i8x16, i32 {
block0:
    v0 = vconst.i8x16 [0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15]
    v1 = iconst.i32 42
    return v0, v1
}

; Add expected results or backend verification checks as supported by the test harness

This ensures future backend changes do not reintroduce the discrepancy.

Common Edge Cases

Lane numbering assumptions: Developers often assume lane indexing behaves identically across all internal representations. If a lowering path flips interpretation during shuffle or extract operations, only some architectures expose it.
Endianness-sensitive vector construction: x86_64 and aarch64 both run little-endian in practice (s390x, which the original target list also includes, is big-endian), but IR transformations involving byte-level reinterpretation can still reveal mistakes that look like endianness bugs.
Multi-value return packing: Returning several vectors and scalars in one signature can trigger corner cases not seen with a single return value.
Opt level differences: Even with opt_level=none, legalization still happens. A bug may disappear or change shape at higher optimization levels because instruction selection patterns differ.
Interpreter vs native mismatch: If the interpreter agrees with Wasm but native execution does not, focus on codegen. If both disagree with Wasm, focus on translation or the test program itself.
Target list masking the real failure: Including many architectures in one file can make it harder to see whether the bug is generic or isolated to one backend pair. Reduce aggressively while debugging.
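The endianness point in the list above is easier to reason about with a concrete model of byte-level reinterpretation. The sketch below reinterprets the same 16 bytes as four 32-bit lanes under both byte orders. It is a simplified illustration with hypothetical function names, not a model of how any particular Cranelift backend implements a bitcast.

```rust
/// Reinterprets 16 bytes as four 32-bit lanes, little-endian within each lane.
pub fn bytes_to_u32_lanes_le(bytes: [u8; 16]) -> [u32; 4] {
    let mut lanes = [0u32; 4];
    for (i, lane) in lanes.iter_mut().enumerate() {
        let chunk = [bytes[4 * i], bytes[4 * i + 1], bytes[4 * i + 2], bytes[4 * i + 3]];
        *lane = u32::from_le_bytes(chunk);
    }
    lanes
}

/// Same reinterpretation, but big-endian within each lane.
pub fn bytes_to_u32_lanes_be(bytes: [u8; 16]) -> [u32; 4] {
    let mut lanes = [0u32; 4];
    for (i, lane) in lanes.iter_mut().enumerate() {
        let chunk = [bytes[4 * i], bytes[4 * i + 1], bytes[4 * i + 2], bytes[4 * i + 3]];
        *lane = u32::from_be_bytes(chunk);
    }
    lanes
}
```

If a mismatching result on one target equals the other byte order's output for the same input, the bug is likely in how a reinterpretation or lane mapping is lowered rather than in the arithmetic.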

FAQ

Why does the .clif test fail on aarch64 but not x86_64 when both represent the same logic?

Because the same Cranelift IR can be legalized and lowered differently per backend. If one target mishandles SIMD lane operations, return-value ABI assignment, or scalar extraction, the machine code result diverges even though the high-level logic is identical.

How do I know whether this is a Cranelift bug or an invalid test?

First verify that every observed value is fully defined in the IR and that the handwritten .clif is truly equivalent to the Wasm-generated CLIF. If the interpreter and Wasm agree but one native backend does not, that is strong evidence of a Cranelift backend bug.

What is the best permanent fix for this class of issue?

The best fix is a combination of backend patching and regression coverage: repair the incorrect legalization or ABI lowering rule, then add minimized tests covering both the direct .clif path and the equivalent Wasm path across the affected architectures.

In short, solve this issue by treating it as a semantic equivalence audit: reduce the test, compare handwritten and translated CLIF, isolate whether the break begins in interpretation or lowering, patch the architecture-specific rule, and lock it down with regression tests. That workflow covers the usual sources of Cranelift cross-architecture mismatches involving SIMD, multi-value returns, and backend-specific execution differences.