How to Fix: Cranelift: `umulhi.i8`/`smulhi.i8` is not implemented on x86_64

Updated June 10, 2026 6 min read

Aldawsari

Cranelift on x86_64 Fails for `umulhi.i8`/`smulhi.i8` Because the Backend Never Learned How to Produce the High Byte Safely

If your .clif test passes in the interpreter and on aarch64 but breaks on x86_64, the problem is not your test case. Cranelift’s x86_64 backend has no lowering for the 8-bit high-half multiplies umulhi.i8 and smulhi.i8. In practical terms, the IR instruction exists, but the target-specific instruction selection has no rule that turns it into x86_64 machine code.

Table of Contents

Understanding the Root Cause
Step-by-Step Solution
Common Edge Cases
FAQ

Understanding the Root Cause

The Cranelift IR instructions umulhi and smulhi return the upper half of a multiplication result. For i8, multiplying two 8-bit values produces a conceptual 16-bit result, and the operation returns the upper 8 bits. For example, 255 × 255 = 65025 = 0xFE01, so umulhi.i8 returns 0xFE (254).

On some architectures, this is straightforward to lower. On x86_64, however, 8-bit multiplication has awkward constraints:

mul r/m8 and imul r/m8 implicitly use AL as one operand.
The 16-bit result is returned in AX, where AH contains the high 8 bits.
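Code generators for x86_64 generally avoid the high-byte registers AH, BH, CH, and DH: an instruction that carries a REX prefix cannot encode them (the same encodings select SPL, BPL, SIL, and DIL instead), and they do not fit a register allocation model in which every allocatable register is a full-width register.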
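
To make the AL/AX/AH relationship concrete, here is a small Rust model of what the one-operand 8-bit multiplies compute. It only illustrates the arithmetic: the names `mul_r8`, `imul_r8`, and `ah` are made up for this sketch, and none of it is Cranelift code.

```rust
/// Models `mul r/m8`: AX = AL * src, with both operands treated as unsigned bytes.
fn mul_r8(al: u8, src: u8) -> u16 {
    (al as u16) * (src as u16) // at most 255 * 255 = 65025, so it fits in 16 bits
}

/// Models `imul r/m8`: AX = AL * src, with both operands treated as signed bytes.
/// The return value is the raw 16-bit pattern that would end up in AX.
fn imul_r8(al: i8, src: i8) -> u16 {
    ((al as i16) * (src as i16)) as u16
}

/// AH is bits 8..16 of AX, which is the byte that umulhi.i8 / smulhi.i8 return.
fn ah(ax: u16) -> u8 {
    (ax >> 8) as u8
}
```

In this model, `ah(mul_r8(255, 255))` is `0xFE` and `ah(imul_r8(-128, 2))` is `0xFF` (that is, -1 as a signed byte). The hardware already produces the right value; the difficulty is only that it lands in AH.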

That creates a backend gap: Cranelift can represent umulhi.i8 and smulhi.i8 in IR, but the x86_64 lowering layer does not have a safe, implemented rule to materialize the result. This is why the issue appears target-specific rather than IR-specific.

The clean fix is to legalize the operation before x86_64 lowering by widening the inputs to i16, performing a wider multiply, shifting right by 8 bits, and then reducing back to i8. That avoids direct use of AH and keeps the implementation consistent with Cranelift’s backend design.

Step-by-Step Solution

The most robust approach is to expand umulhi.i8 and smulhi.i8 into a sequence of existing, legal operations during legalization or lowering.

1. Reproduce the Failure

A minimal failing test looks like this:

test interpret
test run
target x86_64
target aarch64

function %umulhi_i8(i8) -> i8 {
block0(v0: i8):
    v1 = umulhi.i8 v0, v0
    return v1
}

This usually works in the interpreter because the interpreter executes IR semantics directly. It fails when the x86_64 backend must lower the instruction to machine code.

2. Expand `umulhi.i8` via Zero-Extension

For the unsigned case, the transformation is:

umulhi.i8 a, b
==>
ua = uextend.i16 a
ub = uextend.i16 b
prod = imul ua, ub
hi = ushr_imm prod, 8
result = ireduce.i8 hi

Equivalent CLIF-style logic:

function %umulhi_i8_lowered(i8, i8) -> i8 {
block0(v0: i8, v1: i8):
    v2 = uextend.i16 v0
    v3 = uextend.i16 v1
    v4 = imul v2, v3
    v5 = ushr_imm v4, 8
    v6 = ireduce.i8 v5
    return v6
}
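
Because an i8 has only 256 possible values, the unsigned expansion can be checked exhaustively outside Cranelift. The following Rust sketch mirrors the CLIF steps one by one and compares them with a reference computed in 32 bits. The function names are hypothetical and exist only for this check.

```rust
/// Mirrors the CLIF sequence: uextend.i16, imul, ushr_imm 8, ireduce.i8.
fn umulhi_i8_expanded(a: u8, b: u8) -> u8 {
    let ua = a as u16;              // uextend.i16
    let ub = b as u16;              // uextend.i16
    let prod = ua.wrapping_mul(ub); // imul (wrapping 16-bit multiply)
    let hi = prod >> 8;             // ushr_imm 8
    hi as u8                        // ireduce.i8
}

/// Independent reference computed in 32 bits, where nothing can wrap.
fn umulhi_i8_reference(a: u8, b: u8) -> u8 {
    (((a as u32) * (b as u32)) >> 8) as u8
}

/// Counts the input pairs for which the expansion disagrees with the reference.
fn count_unsigned_mismatches() -> usize {
    let mut mismatches = 0;
    for a in 0..=u8::MAX {
        for b in 0..=u8::MAX {
            if umulhi_i8_expanded(a, b) != umulhi_i8_reference(a, b) {
                mismatches += 1;
            }
        }
    }
    mismatches
}
```

A result of zero from `count_unsigned_mismatches()` covers all 65,536 input pairs, which is stronger evidence than a handful of hand-picked run lines.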

3. Expand `smulhi.i8` via Sign-Extension

For the signed case, the inputs must be sign-extended:

smulhi.i8 a, b
==>
sa = sextend.i16 a
sb = sextend.i16 b
prod = imul sa, sb
hi = sshr_imm prod, 8
result = ireduce.i8 hi

Equivalent CLIF-style logic:

function %smulhi_i8_lowered(i8, i8) -> i8 {
block0(v0: i8, v1: i8):
    v2 = sextend.i16 v0
    v3 = sextend.i16 v1
    v4 = imul v2, v3
    v5 = sshr_imm v4, 8
    v6 = ireduce.i8 v5
    return v6
}
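
The signed expansion can be checked the same way. In this Rust sketch the multiply is the same wrapping 16-bit operation as in the unsigned case; only the extension and the shift differ. As before, the function names are assumptions made for the example.

```rust
/// Mirrors the CLIF sequence: sextend.i16, imul, sshr_imm 8, ireduce.i8.
fn smulhi_i8_expanded(a: i8, b: i8) -> i8 {
    let sa = a as i16;              // sextend.i16
    let sb = b as i16;              // sextend.i16
    let prod = sa.wrapping_mul(sb); // imul (product range is -16256..=16384)
    let hi = prod >> 8;             // sshr_imm 8 (arithmetic shift on i16)
    hi as i8                        // ireduce.i8
}

/// Independent reference computed in 32 bits.
fn smulhi_i8_reference(a: i8, b: i8) -> i8 {
    (((a as i32) * (b as i32)) >> 8) as i8
}

/// Counts the input pairs for which the expansion disagrees with the reference.
fn count_signed_mismatches() -> usize {
    let mut mismatches = 0;
    for a in i8::MIN..=i8::MAX {
        for b in i8::MIN..=i8::MAX {
            if smulhi_i8_expanded(a, b) != smulhi_i8_reference(a, b) {
                mismatches += 1;
            }
        }
    }
    mismatches
}
```

Note that the largest product, (-128) × (-128) = 16384, still fits in an i16, so the 16-bit multiply never loses information for i8 inputs.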

4. Implement the Legalization in Cranelift

Depending on the current Cranelift code layout in your checkout, this belongs in the legalization or ISLE lowering path for x86_64. The intent is the same: intercept umulhi.i8 and smulhi.i8 and rewrite them before final instruction selection.

Pseudocode for the rewrite logic:

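// Pseudocode: the helper names below are illustrative, not Cranelift APIs.
match inst {
    // Unsigned: zero-extend, multiply in 16 bits, keep bits 8..16.
    umulhi.i8(x, y) => {
        let x16 = uextend.i16(x);
        let y16 = uextend.i16(y);
        let p16 = imul(x16, y16);
        let hi16 = ushr_imm(p16, 8);
        replace_with(ireduce.i8(hi16));
    }
    // Signed: same shape, but sign-extend and use an arithmetic shift.
    smulhi.i8(x, y) => {
        let x16 = sextend.i16(x);
        let y16 = sextend.i16(y);
        let p16 = imul(x16, y16);
        let hi16 = sshr_imm(p16, 8);
        replace_with(ireduce.i8(hi16));
    }
    // Every other instruction is left untouched.
    _ => {}
}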

The key detail is that the multiply itself can still be a normal 16-bit integer multiply. You are not trying to emit an x86 8-bit multiply and extract AH; you are expressing the same math in a form the backend already understands well.
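
For readers who want to experiment with the shape of the rewrite without building Cranelift, here is a self-contained toy model in Rust. Everything in it (`Op`, `MulHi`, `expand_mulhi_i8`, `eval`) is hypothetical and deliberately far simpler than Cranelift's real data structures. It only shows a high-half multiply being replaced by a list of simpler operations that are then evaluated.

```rust
// Toy model only: these types are hypothetical and are not Cranelift's API.
// Operands are value numbers: 0 and 1 are the inputs, op i defines value i + 2.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Op {
    Uextend16(usize),      // zero-extend an 8-bit value to 16 bits
    Sextend16(usize),      // sign-extend an 8-bit value to 16 bits
    Imul16(usize, usize),  // wrapping 16-bit multiply
    UshrImm16(usize, u32), // logical shift right by an immediate
    SshrImm16(usize, u32), // arithmetic shift right by an immediate
    Ireduce8(usize),       // keep only the low 8 bits
}

#[derive(Clone, Copy)]
enum MulHi {
    Unsigned,
    Signed,
}

/// Expands a high-half multiply of values 0 and 1 into simpler operations.
fn expand_mulhi_i8(kind: MulHi) -> Vec<Op> {
    let (extend_a, extend_b, shift) = match kind {
        MulHi::Unsigned => (Op::Uextend16(0), Op::Uextend16(1), Op::UshrImm16(4, 8)),
        MulHi::Signed => (Op::Sextend16(0), Op::Sextend16(1), Op::SshrImm16(4, 8)),
    };
    vec![extend_a, extend_b, Op::Imul16(2, 3), shift, Op::Ireduce8(5)]
}

/// Evaluates an expanded sequence. All values are stored as raw 16-bit patterns.
fn eval(ops: &[Op], a: u8, b: u8) -> u8 {
    let mut vals: Vec<u16> = vec![a as u16, b as u16];
    for op in ops {
        let v = match *op {
            Op::Uextend16(x) => vals[x] & 0xFF,
            Op::Sextend16(x) => (vals[x] as u8 as i8) as i16 as u16,
            Op::Imul16(x, y) => vals[x].wrapping_mul(vals[y]),
            Op::UshrImm16(x, n) => vals[x] >> n,
            Op::SshrImm16(x, n) => ((vals[x] as i16) >> n) as u16,
            Op::Ireduce8(x) => vals[x] & 0xFF,
        };
        vals.push(v);
    }
    *vals.last().unwrap() as u8
}
```

Calling `eval(&expand_mulhi_i8(MulHi::Unsigned), 255, 255)` yields 254, and the signed expansion of 0x80 (-128) and 2 yields 0xFF (-1). The only difference between the two expansions is the choice of extension and shift, which is the property the real rewrite needs to preserve.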

5. Add Regression Tests

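Add dedicated tests for both unsigned and signed cases on x86_64 and verify they still work on aarch64. If you keep both functions in one .clif file, the test and target lines should appear only once, at the top of the file.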

test interpret
test run
target x86_64
target aarch64

function %umulhi_i8(i8, i8) -> i8 {
block0(v0: i8, v1: i8):
    v2 = umulhi.i8 v0, v1
    return v2
}
; run: %umulhi_i8(255, 255) == 254
; run: %umulhi_i8(16, 16) == 1
; run: %umulhi_i8(3, 7) == 0

test interpret
test run
target x86_64
target aarch64

function %smulhi_i8(i8, i8) -> i8 {
block0(v0: i8, v1: i8):
    v2 = smulhi.i8 v0, v1
    return v2
}
; run: %smulhi_i8(-128, 2) == -1
; run: %smulhi_i8(127, 127) == 63
; run: %smulhi_i8(-64, -64) == 16
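
Expected values in run lines are easy to get wrong by hand, especially for negative inputs. One option is to generate them from a reference computation. The Rust helpers below are a hypothetical convenience, not part of Cranelift's tooling. They format a run directive for the %umulhi_i8 and %smulhi_i8 functions used above.

```rust
/// Formats a `; run:` directive for %umulhi_i8 with the expected high byte.
fn umulhi_run_line(a: u8, b: u8) -> String {
    let hi = (((a as u16) * (b as u16)) >> 8) as u8;
    format!("; run: %umulhi_i8({}, {}) == {}", a, b, hi)
}

/// Formats a `; run:` directive for %smulhi_i8 with the expected high byte.
fn smulhi_run_line(a: i8, b: i8) -> String {
    let hi = (((a as i16) * (b as i16)) >> 8) as i8;
    format!("; run: %smulhi_i8({}, {}) == {}", a, b, hi)
}
```

Printing a few boundary pairs such as (0, 0), (255, 255), (-128, -128), and (-128, 127) gives lines that can be pasted under the corresponding function.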

6. Validate the Backend Output

After the rewrite, inspect the generated x86_64 code if possible. You should see code based on ordinary widening, multiply, shift, and truncate operations rather than any fragile dependency on high-byte register extraction.

cargo test -p cranelift-codegen
cargo test -p cranelift-filetests

If your local workflow includes filetests, rerun the exact failing test and confirm that both interpretation and native execution now pass.

Common Edge Cases

Using the wrong extension operation: umulhi.i8 must use uextend; smulhi.i8 must use sextend. Mixing them produces incorrect upper-half bits.
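Using logical shift for signed multiply-high: because the shifted value is immediately reduced to i8, ushr_imm and sshr_imm by 8 produce the same final byte. Prefer sshr_imm anyway, so the intermediate i16 value remains a correctly signed quantity if a later pass inspects or reuses it.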
Reducing too early: if you truncate before shifting, you destroy the high byte you are trying to compute.
Backend-only patching: trying to special-case x86 byte registers directly can introduce register allocation bugs, especially around REX encodings and partial-register hazards.
Missing two-operand test coverage: even if your original issue uses the same input twice, the implementation should be validated with independent operands to catch legalization mistakes.
Assuming this affects all widths: this issue is specifically painful for i8 on x86_64 because of byte-register behavior. Wider forms may already lower correctly.
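
Two of the pitfalls above are easy to demonstrate in plain Rust. The sketch below uses hypothetical function names and models the CLIF operations with ordinary integer casts. It shows that the wrong extension changes the result, while the kind of shift does not affect the byte that survives the final reduce.

```rust
/// Correct expansion: sextend, imul, sshr_imm 8, ireduce.
fn smulhi_correct(a: i8, b: i8) -> i8 {
    let prod = (a as i16).wrapping_mul(b as i16);
    (prod >> 8) as i8
}

/// Pitfall: zero-extending the operands of a signed multiply-high.
fn smulhi_with_uextend(a: i8, b: i8) -> i8 {
    let prod = (a as u8 as u16).wrapping_mul(b as u8 as u16);
    (prod >> 8) as i8
}

/// Logical shift instead of arithmetic shift, followed by the same reduce.
fn smulhi_with_ushr(a: i8, b: i8) -> i8 {
    let prod = (a as i16).wrapping_mul(b as i16) as u16;
    (prod >> 8) as i8
}

/// True when the logical-shift variant matches the correct one for every input pair.
fn ushr_matches_sshr_everywhere() -> bool {
    (i8::MIN..=i8::MAX).all(|a| {
        (i8::MIN..=i8::MAX).all(|b| smulhi_with_ushr(a, b) == smulhi_correct(a, b))
    })
}
```

For the inputs (-128, 2), `smulhi_correct` returns -1 while `smulhi_with_uextend` returns 1, so a regression test with at least one negative operand catches the extension mistake.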

FAQ

Why does this work in the interpreter but not on x86_64?

The interpreter executes Cranelift IR semantics directly and does not require target-specific instruction lowering. The x86_64 backend must map the IR to legal machine instructions, and that mapping is what is missing for umulhi.i8 and smulhi.i8.

Why not just use x86 `mul`/`imul` and read `AH`?

Because high-byte registers are awkward in 64-bit x86 code generation. They conflict with common register allocation and encoding strategies, especially when REX prefixes are involved. Legalizing to a wider multiply is simpler and more maintainable.

Is this an x86_64 bug or a Cranelift IR bug?

It is primarily a Cranelift x86_64 backend implementation gap, not a flaw in the IR operation itself. The IR instruction is valid; the backend just needs a legalization path for the 8-bit form.

The best long-term fix is to treat umulhi.i8 and smulhi.i8 as operations that should be expanded before final x86_64 lowering. That keeps the backend correct, avoids brittle byte-register handling, and gives you consistent results across targets.