How to Fix: Cranelift: s390x wrong execution of i16x8.extmul_high_i8x16_s with spilled arguments

Cranelift s390x miscompiles i16x8.extmul_high_i8x16_s when arguments are spilled: root cause and fix

On s390x, this bug shows up as a silent wrong-result failure: a WebAssembly SIMD operation, i16x8.extmul_high_i8x16_s, produces incorrect values only when register pressure forces its input arguments to be spilled to the stack. That combination makes the issue dangerous because simple tests often pass, while optimized or larger functions can fail in production.

Symptoms and Scope

The issue affects code paths where Cranelift lowers the signed high-half widening multiply from an i8x16 vector into an i16x8 result on the IBM Z / s390x backend. The bug is triggered only when all of the following conditions hold:

  • The operation being compiled is i16x8.extmul_high_i8x16_s.
  • The backend is Cranelift.
  • The target architecture is s390x.
  • One or more SIMD inputs are not kept in registers and are instead reloaded from spill slots.

In practice, that means a minimal reproducer may need enough surrounding code, parameters, or temporary values to create register pressure. If you are seeing a mismatch between expected and actual SIMD lane values only on s390x, spilled argument handling is the first place to inspect.

Understanding the Root Cause

The core problem is a mismatch between the semantics that extmul_high requires and the instruction sequence the backend actually emits once lowering and register allocation interact.

i16x8.extmul_high_i8x16_s means:

  1. Take the upper 8 lanes (lanes 8 through 15) of each i8x16 input.
  2. Sign-extend each 8-bit lane to 16 bits.
  3. Multiply the corresponding extended lanes.
  4. Return the result as i16x8.
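
The four steps above can be written down as a scalar reference model, which is handy as an oracle when comparing against machine output. This is a minimal sketch, not Cranelift code: the function name `extmul_high_i8x16_s` and the array-based vector representation are assumptions made for illustration.

```rust
/// Scalar reference model of i16x8.extmul_high_i8x16_s.
/// Hypothetical helper for illustration; not a Cranelift or Wasmtime API.
pub fn extmul_high_i8x16_s(a: [i8; 16], b: [i8; 16]) -> [i16; 8] {
    let mut out = [0i16; 8];
    for i in 0..8 {
        // Steps 1 and 2: read lane 8 + i and sign-extend it to 16 bits.
        let x = a[8 + i] as i16;
        let y = b[8 + i] as i16;
        // Step 3: the product of two i8 values lies in -16256..=16384,
        // so it always fits in an i16 without wrapping.
        out[i] = x * y;
    }
    // Step 4: the eight products form the i16x8 result.
    out
}
```

Because the model is plain scalar arithmetic, it does not depend on any target, register allocation decision, or lane layout, so it can serve as the expected value in tests on any host.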

On s390x, SIMD lowering for this pattern depends on getting three details exactly right:

  • Selecting the high half of the vector, not the low half.
  • Applying signed extension, not zero extension.
  • Reloading spilled vector values with the correct lane layout and register class before the multiply sequence runs.

When arguments are spilled, the generated code may need to reconstruct vector operands from stack slots. If the reload path or instruction selection path assumes the wrong half-selection, uses an operand form that does not preserve the intended lane positioning, or feeds the multiply from a temporary that was prepared for a different SIMD pattern, the resulting multiplication is valid machine code but semantically wrong.

That is why the failure is not a crash or trap. It is a miscompile: the backend emits code that executes normally but computes the wrong answer.

At a lower level, bugs of this kind usually come from one of the following implementation mistakes:

  • A legalization rule for extmul_high reuses logic from extmul_low and forgets to adjust lane extraction.
  • A spill reload enters the lowering sequence as a generic vector operand, but the backend later assumes it is already aligned for a high-half widening operation.
  • The signed widening path uses an instruction sequence that is correct only when the source operand remains in a particular register form and breaks after stack round-tripping.

Because Cranelift splits compilation into IR transforms, lowering, register allocation, and machine emission, this class of bug often appears only after those phases interact. The operation itself is correct in the IR, but the backend implementation for a specific target architecture is not preserving those semantics after spilling.

Step-by-Step Solution

The fix is to make the s390x lowering path for i16x8.extmul_high_i8x16_s robust when operands come from spill slots. In other words, force the backend to explicitly materialize the correct high-half signed-widened inputs before multiplication, rather than relying on an operand form that is only correct when values stay in registers.

The exact patch depends on your local Cranelift revision, but the workflow is consistent.

1. Reproduce the failure on s390x

Start from the issue reproducer and ensure you are compiling with Cranelift enabled on an s390x machine or emulator. Build and run the test case enough times to confirm the result is deterministic.

cargo run --release

If the issue is difficult to trigger, increase register pressure in the Wasm function or Rust host wrapper by introducing additional live SIMD temporaries.

2. Reduce the case to a backend-visible pattern

Create a focused Wasm or CLIF test that contains the single problematic instruction and enough surrounding values to force a spill. This is critical because backend SIMD bugs are easiest to fix when the generated sequence is isolated.

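; Pseudocode sketch of the intent, not exact CLIF syntax for any particular Cranelift revision.
; Spill slots are assigned by the register allocator and are not written in CLIF itself;
; the loads below stand in for operands that end up being reloaded from the stack.
v0 = load.i8x16 arg0_spill_slot
v1 = load.i8x16 arg1_spill_slot
v2 = i16x8.extmul_high_i8x16_s v0, v1
store v2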

Then inspect the generated machine code or backend lowering trace. You are looking for evidence that the compiler is either:

  • Using the wrong half of the vector,
  • Reloading the vector in a form incompatible with the widening step, or
  • Applying an unsigned or otherwise incorrect conversion path.
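
When reading disassembly is inconclusive, the observed output lanes can narrow down which of these failure modes is in play. The sketch below compares an observed result against scalar models of the correct operation, the low-half variant, and the zero-extended variant. The names `Diagnosis`, `model`, and `diagnose` are hypothetical, and a reload problem that scrambles lanes in some other way simply falls through to `Unknown`.

```rust
/// Possible explanations for an observed result. Hypothetical type for illustration.
#[derive(Debug, PartialEq)]
pub enum Diagnosis {
    Correct,
    WrongHalf,
    ZeroExtended,
    Unknown,
}

/// Scalar model of a widening multiply over eight lanes starting at `first`,
/// using either sign extension or zero extension of the 8-bit lanes.
pub fn model(first: usize, signed: bool, a: &[i8; 16], b: &[i8; 16]) -> [i16; 8] {
    std::array::from_fn(|i| {
        let (x, y) = (a[first + i], b[first + i]);
        if signed {
            (x as i16) * (y as i16)
        } else {
            // Zero-extend, multiply, and keep the low 16 bits of the product.
            (x as u8 as u16).wrapping_mul(y as u8 as u16) as i16
        }
    })
}

/// Classify `actual` by checking it against each model in turn.
pub fn diagnose(a: &[i8; 16], b: &[i8; 16], actual: &[i16; 8]) -> Diagnosis {
    if *actual == model(8, true, a, b) {
        Diagnosis::Correct
    } else if *actual == model(0, true, a, b) {
        Diagnosis::WrongHalf
    } else if *actual == model(8, false, a, b) {
        Diagnosis::ZeroExtended
    } else {
        Diagnosis::Unknown
    }
}
```

The classification is only meaningful when the inputs make the three models disagree, so choose inputs with distinct halves and at least one negative lane in the high half.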

3. Patch the s390x lowering sequence

In the s390x backend, update the implementation of extmul_high so that spilled operands are treated exactly the same as register-resident operands.

The safe pattern is:

  1. Reload full vector operands from spill slots.
  2. Explicitly isolate the high 8 lanes.
  3. Explicitly perform signed widening to 16-bit lanes.
  4. Multiply the widened vectors.

If the current code combines some of these steps through a target-specific shortcut, replace that shortcut with a semantically explicit sequence for the signed high-half case.

// Before: target-specific shortcut that fails with spilled operands
lower_extmul_high_i8x16_s(a, b) {
    // buggy path: assumes operand layout survives spilling
    return emit_backend_specific_high_mul(a, b);
}

// After: explicit, spill-safe lowering
lower_extmul_high_i8x16_s(a, b) {
    let va = reload_vector_if_needed(a);
    let vb = reload_vector_if_needed(b);

    let a_hi = extract_high_i8x8(va);
    let b_hi = extract_high_i8x8(vb);

    let a_wide = sign_extend_i8x8_to_i16x8(a_hi);
    let b_wide = sign_extend_i8x8_to_i16x8(b_hi);

    return mul_i16x8(a_wide, b_wide);
}

The exact helper names will differ in Cranelift, but the semantic structure above is what matters.

4. Add a regression test that forces spilling

This step is non-negotiable. A regression test that does not force a spill can pass while the bug remains.

Add a backend or filetest case that keeps enough vector values live to exceed the vector registers available at that point in the function; the s390x vector facility provides 32 of them.

// Example strategy:
// 1. Accept several vector parameters.
// 2. Keep unrelated SIMD temporaries live.
// 3. Execute i16x8.extmul_high_i8x16_s late enough that one input is spilled.
// 4. Assert exact lane output.
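
For the final step of that strategy, the exact lane output has to come from somewhere other than the compiler under test. One option, sketched here under the assumption that vectors are represented as plain arrays, is to fix a pair of inputs and derive the expected lanes with scalar arithmetic. The function names and the specific input values are illustrative choices, not taken from the Cranelift test suite.

```rust
/// Hypothetical regression inputs: the low and high halves differ, and the
/// high half mixes -128, -1, 0, and 127 so that sign handling is exercised.
pub fn regression_inputs() -> ([i8; 16], [i8; 16]) {
    let a = [1, 2, 3, 4, 5, 6, 7, 8, -128, -128, -1, -1, 127, 127, 0, -64];
    let b = [1, 1, 1, 1, 1, 1, 1, 1, -128, 127, -1, 2, 127, -1, -128, 3];
    (a, b)
}

/// Expected i16x8 lanes, computed from lanes 8..16 with sign extension.
pub fn expected_lanes(a: [i8; 16], b: [i8; 16]) -> [i16; 8] {
    std::array::from_fn(|i| (a[8 + i] as i16) * (b[8 + i] as i16))
}
```

For these inputs the expected lanes are 16384, -16256, 1, -2, 16129, -127, 0, and -192, which can be pasted into a filetest or Wasm assertion as literal values.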

If you maintain an integration-style reproducer using Wasmtime, keep that too. It protects against regressions across the full pipeline from Wasm translation to machine code generation.

5. Verify with debug and optimized builds

Because spilling depends on optimization and register allocation decisions, validate the fix under multiple optimization levels.

cargo test
cargo test --release

If your local setup supports backend-specific test commands, also run the Cranelift test suite that covers s390x ISA lowering and SIMD legalization.

6. Confirm no low-half regression was introduced

Whenever you touch extmul_high, retest related operations:

  • i16x8.extmul_low_i8x16_s
  • i16x8.extmul_high_i8x16_u
  • i16x8.extmul_low_i8x16_u
  • Equivalent widening multiply forms for other lane widths, such as the i32x4.extmul_*_i16x8_* and i64x2.extmul_*_i32x4_* families

It is common for a backend to share helper logic between signed/unsigned and low/high variants. A narrow fix that ignores those relationships can move the bug rather than eliminate it.

Common Edge Cases

1. Signed vs unsigned confusion

The failing instruction is the signed form. If your patch accidentally uses zero extension, test vectors whose lanes are all non-negative still pass, because sign extension and zero extension agree for values 0 through 127. Always include negative 8-bit lane values in regression tests, such as -128, -1, and mixed-sign combinations.
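
To see which lane values actually expose a sign-handling mistake, it helps to compare the two extension modes on a single lane pair. The sketch below is scalar Rust with hypothetical helper names; it models the arithmetic only, not any s390x instruction.

```rust
/// Multiply one lane pair after sign extension, as the signed form requires.
pub fn mul_sign_extended(x: i8, y: i8) -> i16 {
    (x as i16) * (y as i16)
}

/// Multiply one lane pair after zero extension (the mistake),
/// keeping the low 16 bits of the product.
pub fn mul_zero_extended(x: i8, y: i8) -> i16 {
    let xu = x as u8 as u16;
    let yu = y as u8 as u16;
    xu.wrapping_mul(yu) as i16
}
```

With inputs 3 and 5 both functions return 15, so that pair cannot detect the bug. With -1 and 2 the signed result is -2 while the zero-extended result is 510. Note that -128 times -128 yields 16384 in both modes, so that pair on its own is not a sufficient test; combine it with mixed-sign pairs.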

2. High-half vs low-half lane selection

Many SIMD bugs come from extracting lanes 0-7 instead of 8-15. Use test vectors where low and high halves are intentionally different so the mistake is obvious.
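
One way to build such vectors, shown here as a sketch with hypothetical helper names, is to give every lane a unique value so that the low-half and high-half products cannot coincide.

```rust
/// Signed widening multiply over eight lanes starting at `first_lane`.
/// Passing 0 models extmul_low and passing 8 models extmul_high.
/// Hypothetical helper for illustration; not a Cranelift API.
pub fn extmul_from(first_lane: usize, a: [i8; 16], b: [i8; 16]) -> [i16; 8] {
    std::array::from_fn(|i| (a[first_lane + i] as i16) * (b[first_lane + i] as i16))
}

/// Inputs where lane i of `a` holds i + 1 and every lane of `b` holds -1,
/// so each lane product is unique and negative.
pub fn distinct_halves() -> ([i8; 16], [i8; 16]) {
    let a = std::array::from_fn(|i| i as i8 + 1);
    (a, [-1; 16])
}
```

With these inputs the high-half result is -9 through -16 and the low-half result is -1 through -8, so a wrong-half lowering is visible in every output lane rather than in just one.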

3. Spills that disappear after small refactors

If your test stops reproducing after unrelated edits, the register allocator may simply no longer be spilling at that point. Keep artificial register pressure in the test so the failure mode remains covered.

4. Correctness only failing on one optimization level

Some users assume a fix is complete because debug mode works. Spilling patterns differ across optimization settings, so verify at least one optimized configuration where the original issue occurred.

5. Backend helper reuse across SIMD widths

If the s390x backend shares code between i8x16, i16x8, or wider vector transforms, inspect all callers. The same bug pattern may exist in another widening multiply variant but simply lacks a reproducer.

FAQ

Why does this bug only appear when arguments are spilled?

Because the semantic error is tied to how the s390x backend reloads or reinterprets vector operands after they are moved to stack slots. When values remain in registers, a shortcut sequence may happen to preserve the expected lane layout; once spilled, that assumption breaks.

Why is i16x8.extmul_high_i8x16_s more fragile than simpler SIMD ops?

It combines multiple transformations: high-half extraction, signed widening, and vector multiply. Any backend mistake in lane selection, sign handling, or operand reconstruction can corrupt the result even though each individual instruction is legal.

What is the best regression test for this issue?

The best regression test is one that forces at least one input vector through a spill slot, uses distinct low and high halves, includes negative byte lanes, and checks exact per-lane output on s390x. That combination verifies half selection, signed extension, and reload correctness in one test.

In short, the durable fix is not just changing one instruction mnemonic. It is making sure the s390x Cranelift lowering for i16x8.extmul_high_i8x16_s preserves the WebAssembly SIMD semantics even after register allocation introduces spills. Once you patch the lowering sequence and lock it down with a spill-forcing regression test, this class of wrong-execution bug becomes both understandable and preventable.
