How to Fix: Cranelift: Little/Big endian has no effect on `sload8x8` in s390x

Updated June 10, 2026 7 min read

Aldawsari

Cranelift on s390x: why sload8x8 ignored little vs big endian and how to fix it

This bug is a classic backend-lowering mistake: the IR operation sload8x8 should honor the byte order requested by its little or big memory flag, but on s390x it behaved the same regardless of which flag was set. The generated code can therefore look structurally correct while still placing bytes in the wrong order during lane construction or extension.

Table of Contents

Understanding the Root Cause
Step-by-Step Solution
Common Edge Cases
FAQ

Understanding the Root Cause

On s390x, endianness matters whenever a scalar or vector load maps bytes from memory into lanes or larger scalar values. Semantically, sload8x8 means: load 8 bytes, treat each byte as signed, and widen them into eight 16-bit lanes (an i16x8 result). If backend lowering or instruction selection assumes a single lane order independent of endianness, the little-endian and big-endian configurations end up producing identical results.
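The widening half of that definition can be written down as a small reference model. This is a minimal sketch, not Cranelift code: the function names are hypothetical, and it assumes lane i is taken from the byte at offset i, which is exactly the mapping the rest of this article asks you to verify.

```rust
/// Reference model of the widening step of `sload8x8`:
/// eight bytes become eight signed 16-bit lanes.
/// Assumption (hypothetical): lane `i` comes from the byte at offset `i`.
fn sload8x8_reference(bytes: [u8; 8]) -> [i16; 8] {
    // `b as i8` reinterprets the bit pattern as signed,
    // and `as i16` then sign-extends it.
    bytes.map(|b| b as i8 as i16)
}

/// Same model for the unsigned sibling `uload8x8`, for contrast:
/// the bytes are zero-extended instead of sign-extended.
fn uload8x8_reference(bytes: [u8; 8]) -> [i16; 8] {
    bytes.map(|b| i16::from(b))
}
```

The two functions differ only for bytes with the top bit set, which is why a useful test input needs at least one byte at or above 0x80.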

Technically, this usually happens in one of these places:

The ISLE lowering rule or legalization path converts sload8x8 into an instruction sequence that does not account for byte order.
The backend uses a helper that is correct for generic vector widening, but incorrect for a target where memory byte order and lane numbering interact differently.
A post-load shuffle or lane-reversal step that should exist for one endian mode is missing.
The test only validates instruction shape, not the semantic ordering of the loaded signed bytes.

For s390x, the critical distinction is this: a widening load from memory is not just a load plus sign extension. It is also a mapping from memory byte positions into vector lanes. If the lowering emits the same expansion for both endian modes, then wherever the two modes are supposed to yield different lanes, one of them is wrong by construction.

In practice, the faulty pattern often looks like this:

load 8 bytes -> widen bytes to lanes -> no endian-specific shuffle

But the correct lowering should look more like:

load 8 bytes -> if endian mode requires it, reorder bytes/lane view -> sign-extend into lanes

The reason this bug is easy to miss is that s390x is natively big-endian, so cross-checking alternate endian expectations often depends on IR-level semantics, legalization rules, and test harness behavior rather than real-world host execution alone.

Step-by-Step Solution

The fix is to make lowering for sload8x8 explicitly endian-aware in the s390x backend, then add regression tests that prove byte order affects the result.

1. Reproduce the issue with a focused .clif test

Start from the failing test and reduce it to the smallest possible case that uses sload8x8 on s390x. You want the test to observe lane ordering, not just instruction emission.

test optimize
    set opt_level=none
    set preserve_frame_pointers=true
    set enable_multi_ret_implicit_sret=true
    target s390x

function %main(i64) -> i16x8 {
block0(v0: i64):
    v1 = sload8x8 little v0
    return v1
}

Create a sibling case for big endianness and ensure the expected output differs meaningfully when lane order should differ.

function %main_be(i64) -> i16x8 {
block0(v0: i64):
    v1 = sload8x8 big v0
    return v1
}

2. Locate the lowering path for sload8x8

Search the Cranelift codebase for sload8x8 handling in:

backend-specific ISLE rules
legalization code
vector load expansion helpers
s390x instruction selection patterns

rg "sload8x8|uload8x8|load.*8x8|widen.*8" cranelift/

You are looking for logic that either:

shares the same path for both endian variants, or
normalizes the load into a target instruction without a compensating shuffle.

3. Verify how lane order is modeled

Before changing code, confirm the intended semantics in Cranelift IR:

Does sload8x8 little map the first memory byte to lane 0?
Does sload8x8 big reverse the effective lane interpretation?
Does the backend expect canonical lane numbering independent of architectural vector register display?

This matters because the fix may belong either:

at load time, or
as an explicit shuffle/reverse after the load but before sign extension, depending on existing backend conventions.

4. Implement endian-aware lowering

If the current lowering emits identical code for both modes, split it. Conceptually, the fix is:

match sload8x8(mem, Endianness::Little):
    tmp = load_8_bytes(mem)
    tmp = reorder_for_little_if_needed(tmp)
    result = sign_extend_each_byte_to_lane(tmp)

match sload8x8(mem, Endianness::Big):
    tmp = load_8_bytes(mem)
    tmp = reorder_for_big_if_needed(tmp)
    result = sign_extend_each_byte_to_lane(tmp)

Depending on the backend design, this may become one of these implementation styles:

Emit a different instruction sequence per endian mode.
Emit the same load, followed by an endian-specific permute or reverse.
Adjust the helper that constructs the widened vector so the lane extraction order is correct.

A representative pseudo-patch structure might look like this:

// Pseudocode only
fn lower_sload8x8(addr, endian) -> Value {
    let bytes = emit_load64(addr);
    let ordered = match endian {
        Endianness::Little => maybe_reverse_bytes_for_lane_order(bytes),
        Endianness::Big => bytes,
    };
    emit_sign_extend_bytes_to_vector_lanes(ordered)
}

If the backend already canonicalizes values in the opposite direction, invert the conditional. The point is not that little-endian always needs reversal on s390x, but that one path must differ when the semantics differ.
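To make the "invert the conditional" point concrete, here is a minimal executable model of the pseudo-patch that works on plain byte arrays rather than backend values. Everything in it is hypothetical: `Endianness`, `lower_sload8x8_model`, and the `reversed_mode` parameter are illustrative names, and the model assumes the endian-specific step is a full reversal of the eight bytes.

```rust
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Endianness {
    Little,
    Big,
}

/// Hypothetical model of an endian-aware lowering.
/// `reversed_mode` names the endian mode that needs the reversal.
/// Which mode that is depends on backend conventions, so it is a
/// parameter here instead of a hard-coded choice.
fn lower_sload8x8_model(
    bytes: [u8; 8],
    endian: Endianness,
    reversed_mode: Endianness,
) -> [i16; 8] {
    let mut ordered = bytes;
    if endian == reversed_mode {
        // Endian-specific step: reorder before widening.
        ordered.reverse();
    }
    // Common step: sign-extend each byte into a 16-bit lane.
    ordered.map(|b| b as i8 as i16)
}
```

Whatever value `reversed_mode` takes, the two endian modes give different lanes for any input that is not a palindrome. That difference is the property a regression test should pin down.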

5. Add regression tests that check semantics, not just syntax

Instruction-checking tests alone are too weak here. Add tests that verify the actual vector lane contents or the emitted lane transformation sequence.

; Pseudocode expectations
; memory bytes: [0x80, 0x01, 0xFE, 0x7F, 0xAA, 0x55, 0x00, 0xFF]
; signed bytes: [-128, 1, -2, 127, -86, 85, 0, -1]

; little-endian expected lanes: ...
; big-endian expected lanes: ...
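Before relying on a test vector like the one above, it helps to check that it can actually expose the bug. The sketch below uses hypothetical helper names to confirm the signed interpretation of the bytes and to check two properties: the sequence changes when reversed, and at least one byte has its top bit set. It deliberately says nothing about which lane order each endian mode should produce.

```rust
/// Signed interpretation of each byte, in memory order.
fn signed_bytes(bytes: [u8; 8]) -> [i8; 8] {
    bytes.map(|b| b as i8)
}

/// True if reversing the bytes changes the sequence, so the vector
/// can reveal a lane-order mistake.
fn detects_reversal(bytes: [u8; 8]) -> bool {
    let mut reversed = bytes;
    reversed.reverse();
    reversed != bytes
}

/// True if at least one byte has its top bit set, so sign extension
/// and zero extension give different lanes.
fn detects_signedness(bytes: [u8; 8]) -> bool {
    bytes.iter().any(|&b| b >= 0x80)
}
```

A palindromic input such as [1, 2, 3, 4, 4, 3, 2, 1] fails the first check and would let a missing reversal pass unnoticed.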

If the test framework cannot directly assert lane values, validate using a sequence of extracts or stores after the load:

function %check(i64) -> i16, i16, i16, i16, i16, i16, i16, i16 {
block0(v0: i64):
    v1 = sload8x8 little v0
    v2 = extractlane v1, 0
    v3 = extractlane v1, 1
    v4 = extractlane v1, 2
    v5 = extractlane v1, 3
    v6 = extractlane v1, 4
    v7 = extractlane v1, 5
    v8 = extractlane v1, 6
    v9 = extractlane v1, 7
    return v2, v3, v4, v5, v6, v7, v8, v9
}

6. Run the full backend test suite

After applying the fix, run targeted and full tests to catch collateral regressions:

cargo test -p cranelift-codegen s390x
cargo test -p cranelift-filetests
cargo test

If available in your environment, run specific filetests for the modified lowering path:

cargo test -p cranelift-filetests -- s390x

Once sload8x8 is fixed, inspect sibling ops because the same bug pattern often affects them too:

uload8x8
sload16x4
uload16x4
any vector load widening op with explicit endianness

rg "little|big" cranelift/codegen/src/isa/s390x

If several ops share the same helper, fixing the helper may be safer and more maintainable than patching a single opcode path.
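For the 16-bit siblings, byte order also matters inside each element, not only between lanes. The following sketch models that with one shared decoding helper used by both the signed and unsigned variants. All names are hypothetical, and the model assumes lane i is built from bytes 2i and 2i+1, leaving the lane-order question aside.

```rust
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Endianness {
    Little,
    Big,
}

/// Shared helper: decode one 16-bit element according to `endian`.
fn decode_u16(pair: [u8; 2], endian: Endianness) -> u16 {
    match endian {
        Endianness::Little => u16::from_le_bytes(pair),
        Endianness::Big => u16::from_be_bytes(pair),
    }
}

/// Split eight bytes into four adjacent two-byte elements.
fn element_pairs(b: [u8; 8]) -> [[u8; 2]; 4] {
    [[b[0], b[1]], [b[2], b[3]], [b[4], b[5]], [b[6], b[7]]]
}

/// Model of `uload16x4`: zero-extend each element to a 32-bit lane.
fn uload16x4_model(bytes: [u8; 8], endian: Endianness) -> [u32; 4] {
    element_pairs(bytes).map(|p| u32::from(decode_u16(p, endian)))
}

/// Model of `sload16x4`: sign-extend each element to a 32-bit lane.
fn sload16x4_model(bytes: [u8; 8], endian: Endianness) -> [i32; 4] {
    element_pairs(bytes).map(|p| i32::from(decode_u16(p, endian) as i16))
}
```

Because both variants go through `decode_u16`, an ordering fix lands in one place and the signed and unsigned paths cannot drift apart.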

Common Edge Cases

1. Sign extension is correct, but lane order is still wrong

This is the most likely follow-up bug. You may confirm that bytes like 0x80 become -128, yet lanes appear reversed. That means the sign extension logic is fine, but the byte-to-lane mapping is not.
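One way to tell these two failure modes apart is to compare the observed lanes against both the expected order and its reverse. This is a minimal sketch with hypothetical names (`LaneDiagnosis`, `diagnose`). It only recognizes a full eight-lane reversal, so any other permutation falls into the catch-all case.

```rust
#[derive(Debug, PartialEq, Eq)]
enum LaneDiagnosis {
    /// Values and order both match.
    Correct,
    /// Values match, but the lanes are in reverse order.
    Reversed,
    /// Something else is wrong, e.g. the extension itself.
    Other,
}

/// Classify observed lanes relative to the expected lanes.
fn diagnose(expected: [i16; 8], observed: [i16; 8]) -> LaneDiagnosis {
    let mut reversed = expected;
    reversed.reverse();
    if observed == expected {
        LaneDiagnosis::Correct
    } else if observed == reversed {
        LaneDiagnosis::Reversed
    } else {
        LaneDiagnosis::Other
    }
}
```

A `Reversed` result points at the byte-to-lane mapping, while `Other` suggests looking at the extension step or at a different permutation.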

2. Tests pass on big-endian native s390x but fail in abstract IR expectations

That usually indicates backend assumptions are coupled too tightly to architecture-native ordering instead of Cranelift IR semantics. The IR contract must win.

3. A fix for sload8x8 breaks uload8x8

If the implementation shares a widening helper, changing ordering in only one place can accidentally alter unsigned behavior. Review both signed and unsigned loads together.

4. Optimization level hides the bug

At higher optimization levels, instruction combining or vector canonicalization can fold away a visible shuffle. Keep a regression test with opt_level=none so the intended lowering remains observable.

5. The reproducer carries unrelated ABI noise

The issue body hints at a larger function signature. Strip the test down aggressively. ABI complexity, implicit sret handling, and frame-pointer preservation can distract from the actual endian bug.

6. Wrong fix location

If you patch only the pretty-printed output or a late combine pass, the semantic error can survive elsewhere. The best fix point is usually where the backend first commits to a concrete load-and-widen strategy.

FAQ

Why does sload8x8 need explicit endian handling at all?

Because it is not just widening bytes. It also defines how bytes loaded from memory become vector lanes. Endianness influences that mapping, especially on a natively big-endian target like s390x.

Why can both little and big endian appear to generate valid code?

Because each instruction sequence is legal machine code on its own. The defect only shows when you compare the two modes: if the backend emits the same sequence for both, at most one of them can match the requested byte order.

Should the fix be in IR legalization or in the s390x backend?

If the bug is specific to how s390x lowers or selects instructions, the fix belongs in the backend. If multiple ISAs share the same incorrect endian-neutral transformation, then the legalization or shared lowering layer is the right place.

In short, the real solution is to stop treating sload8x8 as an endian-neutral widening load on s390x. Make byte order visible in lowering, verify lane semantics with regression tests, and audit sibling widening loads so the same defect does not reappear under a different opcode.