How to Fix: Cranelift: Little/Big endian has no effect on `sload8x8` in s390x
Cranelift on s390x: why sload8x8 ignored little vs big endian and how to fix it
This bug is a classic backend-lowering mistake: the IR operation sload8x8 should honor the byte order requested by the load, but on s390x it behaved the same regardless of the configured endianness. The generated code can therefore have the right shape while still mapping bytes to lanes in the wrong order during lane construction or extension.
Understanding the Root Cause
On s390x, endianness matters whenever a scalar or vector load maps bytes from memory into lanes or larger scalar values. Semantically, sload8x8 means: load 8 bytes, treat each byte as signed, and widen each one into a 16-bit lane, producing an i16x8 vector. If backend lowering or instruction selection assumes a single lane order independent of endianness, the little-endian and big-endian forms of the load end up producing identical results.
Technically, this usually happens in one of these places:
- The ISLE lowering rule or legalization path converts sload8x8 into an instruction sequence that does not account for byte order.
- The backend uses a helper that is correct for generic vector widening, but incorrect for a target where memory byte order and lane numbering interact differently.
- A post-load shuffle or lane-reversal step that should exist for one endian mode is missing.
- The test only validates instruction shape, not the semantic ordering of the loaded signed bytes.
For s390x, the critical distinction is this: a widening load from memory is not just a load plus sign extension. It is also a mapping from memory byte positions into vector lanes. If the lowering emits the same expansion for both endian modes, then one mode is semantically wrong by construction.
In practice, the faulty pattern often looks like this:
load 8 bytes -> widen bytes to lanes -> no endian-specific shuffle
But the correct lowering should look more like:
load 8 bytes -> if endian mode requires it, reorder bytes/lane view -> sign-extend into lanes
The reason this bug is easy to miss is that s390x is natively big-endian, so cross-checking alternate endian expectations often depends on IR-level semantics, legalization rules, and test harness behavior rather than real-world host execution alone.
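The two pipelines contrasted above can be modeled in plain Rust, independent of Cranelift. This is only a sketch: both function names are hypothetical, and whether either endian mode actually calls for the reversed mapping is an assumption that step 3 below is meant to settle.

```rust
/// Sign-extend each of the 8 loaded bytes into a 16-bit lane, keeping memory order.
/// (Hypothetical helper for illustration; not a Cranelift API.)
fn sign_extend_in_order(bytes: [u8; 8]) -> [i16; 8] {
    // `as i8` reinterprets the byte as signed; `as i16` then sign-extends it.
    bytes.map(|b| b as i8 as i16)
}

/// Same widening, but with the byte-to-lane mapping reversed first.
fn sign_extend_reversed(mut bytes: [u8; 8]) -> [i16; 8] {
    bytes.reverse();
    sign_extend_in_order(bytes)
}
```

A lowering that yields the same one of these two results for both the little and the big form of the load is the symptom described in this section.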
Step-by-Step Solution
The fix is to make lowering for sload8x8 explicitly endian-aware in the s390x backend, then add regression tests that prove byte order affects the result.
1. Reproduce the issue with a focused .clif test
Start from the failing test and reduce it to the smallest possible case that uses sload8x8 on s390x. You want the test to observe lane ordering, not just instruction emission.
; test compile runs the s390x lowering, so the emitted sequence can be inspected
test compile
set opt_level=none
set preserve_frame_pointers=true
set enable_multi_ret_implicit_sret=true
target s390x
function %main(i64) -> i16x8 {
block0(v0: i64):
    v1 = sload8x8 little v0
    return v1
}
Create a sibling case for big endianness and ensure the expected output differs meaningfully when lane order should differ.
function %main_be(i64) -> i16x8 {
block0(v0: i64):
    v1 = sload8x8 big v0
    return v1
}
2. Locate the lowering path for sload8x8
Search the Cranelift codebase for sload8x8 handling in:
- backend-specific ISLE rules
- legalization code
- vector load expansion helpers
- s390x instruction selection patterns
rg "sload8x8|uload8x8|load.*8x8|widen.*8" cranelift/
You are looking for logic that either:
- shares the same path for both endian variants, or
- normalizes the load into a target instruction without a compensating shuffle.
3. Verify how lane order is modeled
Before changing code, confirm the intended semantics in Cranelift IR:
- Does sload8x8 little map the first memory byte to lane 0?
- Does sload8x8 big reverse the effective lane interpretation?
- Does the backend expect canonical lane numbering independent of architectural vector register display?
This matters because the fix may belong either:
- at load time, or
- as an explicit shuffle/reverse after the load but before sign extension, depending on existing backend conventions.
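When answering these questions from observed output, it can help to classify the lanes mechanically. The sketch below is a hypothetical helper, not part of Cranelift: it compares observed lanes against the two candidate mappings and reports which one matches. It assumes the only candidates are "lane i holds byte i" and "lane i holds byte 7 - i".

```rust
/// How observed lanes relate to the bytes in memory (names are illustrative).
#[derive(Debug, PartialEq)]
enum Mapping {
    InOrder,   // lane i holds byte i
    Reversed,  // lane i holds byte 7 - i
    Ambiguous, // the probe bytes read the same in both directions
    Neither,   // something other than lane order is wrong
}

fn classify(memory: [u8; 8], observed: [i16; 8]) -> Mapping {
    // Expected lanes for the in-order mapping, then the same lanes reversed.
    let in_order = memory.map(|b| b as i8 as i16);
    let mut reversed = in_order;
    reversed.reverse();
    match (observed == in_order, observed == reversed) {
        (true, true) => Mapping::Ambiguous,
        (true, false) => Mapping::InOrder,
        (false, true) => Mapping::Reversed,
        (false, false) => Mapping::Neither,
    }
}
```

If both endian forms of the load classify the same way, the lowering is not distinguishing them.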
4. Implement endian-aware lowering
If the current lowering emits identical code for both modes, split it. Conceptually, the fix is:
match sload8x8(mem, Endianness::Little):
    tmp = load_8_bytes(mem)
    tmp = reorder_for_little_if_needed(tmp)
    result = sign_extend_each_byte_to_lane(tmp)
match sload8x8(mem, Endianness::Big):
    tmp = load_8_bytes(mem)
    tmp = reorder_for_big_if_needed(tmp)
    result = sign_extend_each_byte_to_lane(tmp)
Depending on the backend design, this may become one of these implementation styles:
- Emit a different instruction sequence per endian mode.
- Emit the same load, followed by an endian-specific permute or reverse.
- Adjust the helper that constructs the widened vector so the lane extraction order is correct.
A representative pseudo-patch structure might look like this:
// Pseudocode only
fn lower_sload8x8(addr, endian) -> Value {
    let bytes = emit_load64(addr);
    let ordered = match endian {
        Endianness::Little => maybe_reverse_bytes_for_lane_order(bytes),
        Endianness::Big => bytes,
    };
    emit_sign_extend_bytes_to_vector_lanes(ordered)
}
If the backend already canonicalizes values in the opposite direction, invert the conditional. The point is not that little-endian always needs reversal on s390x, but that one path must differ when the semantics differ.
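To make the pseudo-patch concrete without touching backend internals, here is a register-level model in standalone Rust. It is a sketch under stated assumptions: the 8 bytes are loaded as a big-endian 64-bit value (as a native s390x load would), lane 0 is taken from the most significant byte, and the little-endian path is the one that gets the byte swap. The names model_lower and unpack_from_msb are hypothetical.

```rust
/// Unpack 8 signed bytes from a 64-bit register, most significant byte first.
/// (Assumption for this model: lane 0 comes from the top byte.)
fn unpack_from_msb(reg: u64) -> [i16; 8] {
    let mut lanes = [0i16; 8];
    for (i, lane) in lanes.iter_mut().enumerate() {
        let byte = (reg >> (56 - 8 * i)) as u8;
        *lane = byte as i8 as i16; // sign-extend the byte to 16 bits
    }
    lanes
}

/// Model of an endian-aware lowering: same load, endian-specific reorder, then widen.
fn model_lower(mem: [u8; 8], little: bool) -> [i16; 8] {
    let reg = u64::from_be_bytes(mem); // native big-endian load
    // Assumption: only the little-endian form needs the byte reversal.
    let ordered = if little { reg.swap_bytes() } else { reg };
    unpack_from_msb(ordered)
}
```

As with the pseudo-patch, the branch may need inverting to match the backend's conventions; the model only demonstrates that the two forms must not share one unconditional path.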
5. Add regression tests that check semantics, not just syntax
Instruction-checking tests alone are too weak here. Add tests that verify the actual vector lane contents or the emitted lane transformation sequence.
; Pseudocode expectations
; memory bytes: [0x80, 0x01, 0xFE, 0x7F, 0xAA, 0x55, 0x00, 0xFF]
; signed bytes: [-128, 1, -2, 127, -86, 85, 0, -1]
; little-endian expected lanes: ...
; big-endian expected lanes: ...
If the test framework cannot directly assert lane values, validate using a sequence of extracts or stores after the load:
function %check(i64) -> i16, i16, i16, i16, i16, i16, i16, i16 {
block0(v0: i64):
    v1 = sload8x8 little v0
    v2 = extractlane v1, 0
    v3 = extractlane v1, 1
    v4 = extractlane v1, 2
    v5 = extractlane v1, 3
    v6 = extractlane v1, 4
    v7 = extractlane v1, 5
    v8 = extractlane v1, 6
    v9 = extractlane v1, 7
    return v2, v3, v4, v5, v6, v7, v8, v9
}
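The choice of probe bytes matters as much as the test shape. A possible sanity check, sketched in Rust with a hypothetical helper name, is to require that the bytes are not a palindrome (so a lane reversal is visible) and that they include values both with and without the top bit set (so a sign-extension mistake is visible).

```rust
/// Returns true if `bytes` can expose both a lane-order bug and a sign-extension bug.
/// (Hypothetical helper; the criteria are a suggestion, not a Cranelift rule.)
fn is_good_probe(bytes: [u8; 8]) -> bool {
    let mut reversed = bytes;
    reversed.reverse();
    let order_sensitive = reversed != bytes; // reversal changes the sequence
    let has_negative = bytes.iter().any(|&b| b >= 0x80); // top bit set
    let has_non_negative = bytes.iter().any(|&b| b < 0x80); // top bit clear
    order_sensitive && has_negative && has_non_negative
}
```

The sample memory bytes listed earlier in this step satisfy all three conditions.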
6. Run the full backend test suite
After applying the fix, run targeted and full tests to catch collateral regressions:
cargo test -p cranelift-codegen s390x
cargo test -p cranelift-filetests
cargo test
If available in your environment, run specific filetests for the modified lowering path:
cargo test -p cranelift-filetests -- s390x
7. Review related operations
Once sload8x8 is fixed, inspect sibling ops because the same bug pattern often affects them too:
- uload8x8
- sload16x4
- uload16x4
- any vector load widening op with explicit endianness
rg "little|big" cranelift/codegen/src/isa/s390x
If several ops share the same helper, fixing the helper may be safer and more maintainable than patching a single opcode path.
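The 16-bit siblings deserve particular care because, unlike single bytes, each element has an internal byte order of its own. The Rust sketch below models only that aspect. It assumes element i sits at byte offset 2 * i in both modes, which may not match the IR's lane-order rules; the function name is hypothetical.

```rust
/// Model of a signed 16x4 widening load: four 16-bit elements widened to 32-bit lanes.
/// Assumes element i sits at byte offset 2 * i; only the byte order inside each element varies.
fn sload16x4_model(mem: [u8; 8], little: bool) -> [i32; 4] {
    let mut lanes = [0i32; 4];
    for (i, pair) in mem.chunks_exact(2).enumerate() {
        let raw = [pair[0], pair[1]];
        let elem = if little {
            i16::from_le_bytes(raw)
        } else {
            i16::from_be_bytes(raw)
        };
        lanes[i] = i32::from(elem); // sign-extend 16 -> 32 bits
    }
    lanes
}
```

Here the two modes differ even without any lane reordering, so a test for these ops can fail for reasons a byte-lane test would never exercise.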
Common Edge Cases
1. Sign extension is correct, but lane order is still wrong
This is the most likely follow-up bug. You may confirm that bytes like 0x80 become -128, yet lanes appear reversed. That means the sign extension logic is fine, but the byte-to-lane mapping is not.
2. Tests pass on big-endian native s390x but fail in abstract IR expectations
That usually indicates backend assumptions are coupled too tightly to architecture-native ordering instead of Cranelift IR semantics. The IR contract must win.
3. A fix for sload8x8 breaks uload8x8
If the implementation shares a widening helper, changing ordering in only one place can accidentally alter unsigned behavior. Review both signed and unsigned loads together.
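One way to picture a shared helper is a single routine that takes the extension kind and the ordering decision as separate parameters, so an ordering change applies to both opcodes at once. This is an illustrative Rust model with hypothetical names, not the backend's actual helper; lanes are returned as raw 16-bit patterns.

```rust
#[derive(Clone, Copy)]
enum Extend {
    Signed,   // models sload8x8
    Unsigned, // models uload8x8
}

/// One widening routine for both opcodes; lane order is decided once, by `reverse`.
fn widen_8x8(mut bytes: [u8; 8], ext: Extend, reverse: bool) -> [u16; 8] {
    if reverse {
        bytes.reverse();
    }
    bytes.map(|b| match ext {
        Extend::Signed => b as i8 as i16 as u16, // 0x80 -> 0xFF80
        Extend::Unsigned => u16::from(b),        // 0x80 -> 0x0080
    })
}
```

With this shape, a regression test can assert that the signed and unsigned results always agree in their low bytes, whatever the ordering.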
4. Optimization level hides the bug
At higher optimization levels, instruction combining or vector canonicalization can fold away a visible shuffle. Keep a regression test with opt_level=none so the intended lowering remains observable.
5. Multi-result or ABI-related noise masks the real failure
The issue body hints at a larger function signature. Strip the test down aggressively. ABI complexity, implicit sret handling, and frame-pointer preservation can distract from the actual endian bug.
6. Wrong fix location
If you patch only the pretty-printed output or a late combine pass, the semantic error can survive elsewhere. The best fix point is usually where the backend first commits to a concrete load-and-widen strategy.
FAQ
Why does sload8x8 need explicit endian handling at all?
Because it is not just widening bytes. It also defines how bytes loaded from memory become vector lanes. Endianness influences that mapping, especially on targets like s390x.
Why can both little and big endian appear to generate valid code?
Because the generated instruction sequence may be structurally valid while still being semantically identical for both modes. The backend can emit legal machine code that interprets byte order incorrectly.
Should the fix be in IR legalization or in the s390x backend?
If the bug is specific to how s390x lowers or selects instructions, the fix belongs in the backend. If multiple ISAs share the same incorrect endian-neutral transformation, then the legalization or shared lowering layer is the right place.
In short, the real solution is to stop treating sload8x8 as an endian-neutral widening load on s390x. Make byte order visible in lowering, verify lane semantics with regression tests, and audit sibling widening loads so the same defect does not reappear under a different opcode.