Cranelift vselect.i16x8 Wrong Result on x64: Root Cause and Fix
When Cranelift produces the wrong value for vselect.i16x8 on x86_64 while AArch64, S390X, and the interpreter all agree, the problem is almost never in the test itself. It usually means the x64 backend is lowering the vector select with the wrong mask semantics. This is a classic cross-ISA SIMD bug: one architecture treats lane masks one way, while another instruction sequence silently assumes a different representation.
Bug Overview
The issue centers on vselect.i16x8, which selects each 16-bit lane from one of two input vectors based on a mask vector. In Cranelift IR, the semantics are lane-wise and exact. If the interpreter and non-x64 backends agree, but x64 diverges, that strongly indicates a backend lowering mismatch rather than an IR or frontend bug.
On x64, vector select is often implemented through combinations of AND, ANDN, and OR, or via blending instructions when the mask shape is compatible. The bug appears when the lowering assumes a byte-wise or sign-bit-based mask format that does not match the actual vselect.i16x8 lane representation required by Cranelift.
Understanding the Root Cause
Why this happens: x64 SIMD instructions do not all interpret masks the same way.
Cranelift’s vselect conceptually means:
result[i] = mask[i] ? a[i] : b[i]
But the backend must map that into target-specific machine instructions. The subtle failure happens when one of these assumptions is wrong:
- The lowering treats the mask as a bitmask when the IR expects an all-ones/all-zeros per-lane boolean vector.
- The lowering uses a byte-granular select sequence for a 16-bit lane operation without first canonicalizing the mask.
- The lowering relies on an x64 instruction whose semantics are based on the most significant bit of each element, while the IR mask values are not normalized for that expectation.
- The backend reuses logic from another vector type, such as i8x16 or i32x4, where the mask handling happens to work, but it breaks for i16x8.
In practice, the most likely root cause is that the x64 lowering for vselect.i16x8 failed to canonicalize the boolean mask into the form expected by the chosen SIMD instruction sequence. AArch64 and S390X often have cleaner or different lowering paths for vector selects, which is why they can agree with the interpreter while x64 does not.
A correct lane-wise select on x64 typically needs the mask to behave like this for each 16-bit lane:
0xFFFF => select from lhs lane
0x0000 => select from rhs lane
If the mask instead contains partially set bits, sign-only information, or a representation borrowed from another operation, then a naive (mask & a) | (~mask & b) sequence selects bit by bit, so a single 16-bit lane can end up mixing bits from both inputs instead of coming entirely from one of them.
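The difference between the two readings can be modeled in plain Rust. This is a minimal sketch, assuming the "truthy" interpretation of the select formula given earlier (any non-zero mask lane picks the lhs lane). The function names are hypothetical and are not part of Cranelift.

```rust
/// Reference model: a lane comes from `a` when its mask lane is non-zero.
/// (Assumption: "truthy" lane semantics, as in result[i] = mask[i] ? a[i] : b[i].)
pub fn select_truthy(mask: [u16; 8], a: [u16; 8], b: [u16; 8]) -> [u16; 8] {
    std::array::from_fn(|i| if mask[i] != 0 { a[i] } else { b[i] })
}

/// Naive bitwise model: (mask & a) | (!mask & b), applied to each lane.
/// Correct only when every mask lane is 0xFFFF or 0x0000.
pub fn select_bitwise(mask: [u16; 8], a: [u16; 8], b: [u16; 8]) -> [u16; 8] {
    std::array::from_fn(|i| (mask[i] & a[i]) | (!mask[i] & b[i]))
}
```

With a = 0x1111 and b = 0x2222 in every lane, a mask lane of 0x00FF yields 0x1111 under the truthy model but 0x2211 under the bitwise model: the low byte comes from a and the high byte from b.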
Step-by-Step Solution
The fix is to make the x64 backend lower vselect.i16x8 using a mask representation that exactly matches Cranelift IR semantics.
1. Reproduce the issue with the failing CLIF test
Start by running the relevant test on x64 and compare it with the interpreter and other backends.
cargo test -p cranelift-codegen
# or run the specific filetest if available in your workflow
If you are working from a .clif test case, keep the architecture targets explicit so the mismatch is visible across backends.
2. Inspect the x64 lowering for vselect
Look in the x64 backend for the lowering path handling vector select operations. Depending on the Cranelift revision, this may be in ISLE lowering rules, legacy lowering code, or instruction selection helpers.
You are looking for logic equivalent to:
vselect(mask, x, y) => (mask & x) | (~mask & y)
The critical question is whether mask is guaranteed to be all ones or all zeros per 16-bit lane.
3. Canonicalize the mask before selection
If the mask is not in canonical lane form, normalize it first. The exact implementation depends on the backend representation, but the idea is:
// Pseudocode
canonical_mask = icmp_ne(mask, 0)
canonical_mask = splat_lane_bits(canonical_mask) // each true lane => 0xFFFF
result = (canonical_mask & x) | (~canonical_mask & y)
For x64 SIMD lowering, this often means generating a compare that produces a proper boolean vector instead of reusing a value whose bit pattern only incidentally acted as a mask elsewhere.
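The canonicalization step can be illustrated with a scalar Rust model. This is only a sketch of the idea, not backend code: `canonicalize_mask` and `select_canonical` are hypothetical names, and the model assumes that any non-zero mask lane counts as true.

```rust
/// Turn every non-zero ("truthy") lane into 0xFFFF and every zero lane into 0x0000.
pub fn canonicalize_mask(mask: [u16; 8]) -> [u16; 8] {
    mask.map(|lane| if lane != 0 { 0xFFFF } else { 0x0000 })
}

/// Select with the bitwise formula, but only after the mask has been normalized,
/// so each 16-bit lane is taken whole from either `x` or `y`.
pub fn select_canonical(mask: [u16; 8], x: [u16; 8], y: [u16; 8]) -> [u16; 8] {
    let m = canonicalize_mask(mask);
    std::array::from_fn(|i| (m[i] & x[i]) | (!m[i] & y[i]))
}
```

Once the mask is normalized, partially set lanes such as 0x0001 or 0x8000 behave the same as 0xFFFF, which is what makes the bitwise sequence safe.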
4. Avoid using the wrong blend primitive
If the backend uses a blend instruction, verify that its lane semantics match i16x8. Some x64 blend instructions are immediate-controlled rather than vector-mask-controlled (PBLENDW, for example), and others read only specific mask bits: PBLENDVB tests the most significant bit of each mask byte, while BLENDVPS and BLENDVPD test the sign bit of each 32-bit or 64-bit element. If the instruction expects sign bits or byte masks, feeding it raw Cranelift mask lanes will cause silent corruption.
A safer fallback is the explicit boolean select sequence:
// Safe conceptual lowering
selected_a = band(canonical_mask, a)
selected_b = band(bnot(canonical_mask), b)
result = bor(selected_a, selected_b)
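For readers who want to try the sequence on real hardware, the same idea can be written with SSE2 intrinsics from Rust's `std::arch`. This is a hedged sketch, not the code Cranelift emits: the function name `vselect_i16x8` is hypothetical, the truthy mask semantics are an assumption, and a scalar fallback is included so the example also builds on other targets.

```rust
/// Lane-wise select for eight 16-bit lanes: non-zero mask lanes pick from `a`.
#[cfg(target_arch = "x86_64")]
pub fn vselect_i16x8(mask: [u16; 8], a: [u16; 8], b: [u16; 8]) -> [u16; 8] {
    use std::arch::x86_64::*;
    let mut out = [0u16; 8];
    // SAFETY: SSE2 is part of the x86_64 baseline, and the unaligned
    // load/store intrinsics accept any valid 16-byte region.
    unsafe {
        let m = _mm_loadu_si128(mask.as_ptr() as *const __m128i);
        let va = _mm_loadu_si128(a.as_ptr() as *const __m128i);
        let vb = _mm_loadu_si128(b.as_ptr() as *const __m128i);
        // PCMPEQW against zero: 0xFFFF where the mask lane is zero,
        // i.e. an inverted canonical mask.
        let is_zero = _mm_cmpeq_epi16(m, _mm_setzero_si128());
        // PANDN computes (!first) & second, so this keeps `a` in truthy lanes.
        let from_a = _mm_andnot_si128(is_zero, va);
        // PAND keeps `b` in the lanes where the mask was zero.
        let from_b = _mm_and_si128(is_zero, vb);
        _mm_storeu_si128(out.as_mut_ptr() as *mut __m128i, _mm_or_si128(from_a, from_b));
    }
    out
}

/// Scalar fallback with the same semantics for non-x86_64 targets.
#[cfg(not(target_arch = "x86_64"))]
pub fn vselect_i16x8(mask: [u16; 8], a: [u16; 8], b: [u16; 8]) -> [u16; 8] {
    std::array::from_fn(|i| if mask[i] != 0 { a[i] } else { b[i] })
}
```

Comparing against zero and swapping the roles of AND and ANDN avoids a separate NOT: the compare already produces the inverted canonical mask.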
5. Add or update a regression test
Once fixed, add a targeted test that proves x64 now matches the interpreter and other architectures.
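test interpret
test run
target x86_64
target aarch64
target s390x
; Sketch only: assumes a Cranelift revision in which vselect takes an i16x8 mask.
function %vselect_i16x8(i16x8, i16x8, i16x8) -> i16x8 {
block0(v0: i16x8, v1: i16x8, v2: i16x8):
    ; v0 = mask, v1 = lhs, v2 = rhs
    v3 = vselect v0, v1, v2
    return v3
}
; Canonical mask lanes only; the expected lanes alternate between lhs and rhs.
; run: %vselect_i16x8([0xffff 0 0xffff 0 0xffff 0 0xffff 0], [1 2 3 4 5 6 7 8], [11 12 13 14 15 16 17 18]) == [1 12 3 14 5 16 7 18]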
The best regression test uses mask values that expose non-canonical behavior: lanes that are non-zero but not all ones, such as 0x0001 or 0x8000. If the IR guarantees that masks are always all-ones/all-zeros, restrict the test to those canonical values instead.
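Outside the filetest framework, the same idea can be prototyped as a small differential check in Rust. The sketch below is an illustration under assumptions: `reference` encodes the truthy reading of the select, `naive_bitwise` stands in for a suspect lowering, and the pattern list is a hypothetical starting set rather than an exhaustive one.

```rust
/// Signature shared by the reference model and any candidate lowering model.
pub type Select = fn([u16; 8], [u16; 8], [u16; 8]) -> [u16; 8];

/// Reference model (assumption: any non-zero mask lane selects from `a`).
pub fn reference(mask: [u16; 8], a: [u16; 8], b: [u16; 8]) -> [u16; 8] {
    std::array::from_fn(|i| if mask[i] != 0 { a[i] } else { b[i] })
}

/// Example candidate: the bitwise sequence applied without canonicalizing the mask.
pub fn naive_bitwise(mask: [u16; 8], a: [u16; 8], b: [u16; 8]) -> [u16; 8] {
    std::array::from_fn(|i| (mask[i] & a[i]) | (!mask[i] & b[i]))
}

/// Returns the first mask lane pattern on which `candidate` disagrees with `reference`.
pub fn first_mismatch(candidate: Select) -> Option<u16> {
    // Canonical values first, then patterns that stress low bits, byte sign bits,
    // the lane sign bit, and a half-set lane.
    const PATTERNS: [u16; 6] = [0x0000, 0xFFFF, 0x0001, 0x0080, 0x8000, 0x00FF];
    let a = [0xAAAAu16; 8];
    let b = [0x5555u16; 8];
    PATTERNS
        .iter()
        .copied()
        .find(|&p| candidate([p; 8], a, b) != reference([p; 8], a, b))
}
```

Choosing inputs with complementary bit patterns (0xAAAA and 0x5555) makes any bit-level mixing of the two inputs visible in the result.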
6. Validate generated machine code
After implementing the fix, inspect the generated x64 sequence. You want to confirm that the backend now emits either:
- a compare-producing canonical boolean mask followed by AND/ANDN/OR, or
- a target instruction whose documented semantics exactly match the lane mask shape.
cargo test -p cranelift-codegen -- --nocapture
If your local workflow includes backend dumps, use them to verify the mask normalization step is present.
Common Edge Cases
Non-canonical boolean vectors
This is the most important edge case. If another optimization or lowering phase passes through a vector that is merely truthy instead of a proper boolean lane mask, x64 select lowering may break again.
Byte-wise versus lane-wise selection
A sequence that works for i8x16 is not automatically correct for i16x8. If the backend uses byte-level mask operations, verify that each 16-bit lane still selects as a whole and not as two independently selected bytes.
Sign-bit-based mask instructions
Some SIMD operations treat only the top bit of each element as meaningful. If the mask was not created by a compare or another canonical boolean producer, a lane whose mask is non-zero but has a clear top bit will be selected from the wrong input.
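A scalar Rust model makes this failure mode concrete. The sketch below imitates a byte-granular, sign-bit-controlled blend in the style of PBLENDVB. It is a simplified model for illustration, the function name is hypothetical, and it does not claim to be the sequence Cranelift actually used.

```rust
/// Model of a byte-granular, sign-bit-controlled blend: each result byte comes
/// from `a` when the top bit of the matching mask byte is set, otherwise from `b`.
pub fn blend_bytes_by_sign(mask: [u16; 8], a: [u16; 8], b: [u16; 8]) -> [u16; 8] {
    std::array::from_fn(|i| {
        let m = mask[i].to_le_bytes();
        let x = a[i].to_le_bytes();
        let y = b[i].to_le_bytes();
        // Each byte of the 16-bit lane is decided independently.
        let lo = if (m[0] & 0x80) != 0 { x[0] } else { y[0] };
        let hi = if (m[1] & 0x80) != 0 { x[1] } else { y[1] };
        u16::from_le_bytes([lo, hi])
    })
}
```

With a = 0xAAAA and b = 0x5555, canonical mask lanes behave as expected, but a mask lane of 0x8000 produces 0xAA55 (the lane is torn across its two bytes) and a mask lane of 0x0001 produces 0x5555 even though the lane is non-zero.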
Reusing lowering across vector widths
Backend code shared between i16x8, i32x4, and i64x2 can hide type-specific bugs. Always verify the legal instruction sequence for the exact lane width.
Optimization passes folding masks incorrectly
Even after fixing instruction selection, an optimization pass could simplify mask-producing code into a form that no longer preserves boolean-lane semantics. Regression tests should cover both direct and transformed mask creation.
FAQ
Why does this only fail on x64 if the IR is the same?
Because Cranelift IR is target-independent, but each backend implements its own lowering. x64 has different SIMD mask conventions than AArch64 or S390X, so a backend-specific bug can appear even when the IR and interpreter are correct.
Why would i16x8 fail when other vector types seem fine?
Mask handling is highly sensitive to lane width. A lowering that accidentally works for byte lanes or 32-bit lanes can still be wrong for 16-bit lanes, especially if the implementation assumes the wrong per-element granularity.
What is the safest implementation strategy for vselect.i16x8 on x64?
The safest approach is to first generate a canonical boolean vector mask and then lower the select as (mask & a) | (~mask & b). This is less fragile than relying on an x64 blend instruction unless the mask format is proven to match that instruction exactly.
Conclusion
The wrong result for vselect.i16x8 on x64 comes from a mismatch between Cranelift’s lane-wise boolean mask semantics and the x64 SIMD instruction sequence chosen by the backend. The fix is straightforward in principle: normalize the mask, lower with semantics-preserving vector boolean operations, and lock the behavior down with a regression test that compares x64 against the interpreter and other architectures. Once that is done, this class of cross-architecture SIMD miscompile becomes much easier to prevent.
For related backend and code generation context, see the Wasmtime repository and the Cranelift source tree.