How to Fix: Cranelift: `umulhi.i8`/`smulhi.i8` is not implemented on x86_64
Cranelift on x86_64 Fails for umulhi.i8/smulhi.i8 Because the Backend Never Learned How to Produce the High Byte Safely
If your .clif test passes in the interpreter and on aarch64 but breaks on x86_64, the problem is not your test case. The issue is that Cranelift’s x86_64 backend does not implement lowering for 8-bit high-half multiply operations like umulhi.i8 and smulhi.i8. In practical terms, Cranelift knows the IR instruction exists, but the target-specific instruction selection path cannot legally generate machine code for it on x86_64.
Understanding the Root Cause
The Cranelift IR instructions umulhi and smulhi return the upper half of a multiplication result. For i8, multiplying two 8-bit values produces a conceptual 16-bit result, and the operation returns the upper 8 bits.
On some architectures, this is straightforward to lower. On x86_64, however, 8-bit multiplication has awkward constraints:
- mul r/m8 and imul r/m8 implicitly use AL as one operand.
- The 16-bit result is returned in AX, where AH contains the high 8 bits.
- Register allocators and code generators typically avoid depending on the high-byte registers AH, BH, CH, and DH, because those registers cannot be encoded in an instruction that carries a REX prefix and therefore complicate general register allocation.
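To make the AL/AX/AH convention concrete, here is a minimal Rust model of the one-operand 8-bit multiplies. It models only the architectural result of the instructions, not any Cranelift code, and the function names mul_r8 and imul_r8 are hypothetical helpers invented for this sketch.

```rust
/// Models the one-operand `mul r/m8`: AX = AL * src (unsigned).
/// Returns (AL, AH) as they would read after the multiply.
fn mul_r8(al: u8, src: u8) -> (u8, u8) {
    // The product always fits in 16 bits: 255 * 255 = 65025.
    let ax: u16 = (al as u16) * (src as u16);
    // The low byte lands in AL, the high byte in AH.
    (ax as u8, (ax >> 8) as u8)
}

/// Models the one-operand `imul r/m8`: AX = AL * src (signed).
/// Returns (AL, AH) as they would read after the multiply.
fn imul_r8(al: i8, src: i8) -> (i8, i8) {
    // The magnitude is at most 128 * 128 = 16384, so i16 never overflows.
    let ax: i16 = (al as i16) * (src as i16);
    (ax as i8, (ax >> 8) as i8)
}
```

The second element of each tuple is the value umulhi.i8 or smulhi.i8 should produce; the difficulty on x86_64 is only that the hardware leaves it in AH.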
That creates a backend gap: Cranelift can represent umulhi.i8 and smulhi.i8 in IR, but the x86_64 lowering layer does not have a safe, implemented rule to materialize the result. This is why the issue appears target-specific rather than IR-specific.
The clean fix is to legalize the operation before x86_64 lowering by widening the inputs to i16, performing a wider multiply, shifting right by 8 bits, and then reducing back to i8. That avoids direct use of AH and keeps the implementation consistent with Cranelift’s backend design.
Step-by-Step Solution
The most robust approach is to expand umulhi.i8 and smulhi.i8 into a sequence of existing, legal operations during legalization or lowering.
1. Reproduce the Failure
A minimal failing test looks like this:
test interpret
test run
target x86_64
target aarch64
function %umulhi_i8(i8) -> i8 {
block0(v0: i8):
    v1 = umulhi.i8 v0, v0
    return v1
}
This usually works in the interpreter because the interpreter executes IR semantics directly. It fails when the x86_64 backend must lower the instruction to machine code.
2. Expand umulhi.i8 via Zero-Extension
For the unsigned case, the transformation is:
umulhi.i8 a, b
==>
ua = uextend.i16 a
ub = uextend.i16 b
prod = imul ua, ub
hi = ushr_imm prod, 8
result = ireduce.i8 hi
Equivalent CLIF-style logic:
function %umulhi_i8_lowered(i8, i8) -> i8 {
block0(v0: i8, v1: i8):
    v2 = uextend.i16 v0
    v3 = uextend.i16 v1
    v4 = imul v2, v3
    v5 = ushr_imm v4, 8
    v6 = ireduce.i8 v5
    return v6
}
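Because i8 has only 65,536 operand pairs, the unsigned expansion can be checked exhaustively outside of Cranelift. The following Rust sketch mirrors the CLIF sequence step by step and compares it with a reference computed by division. All function names here are hypothetical, and the code is a model of the arithmetic, not part of the Cranelift code base.

```rust
/// Reference semantics: upper 8 bits of the 16-bit unsigned product,
/// computed with a division so it shares no code with the expansion.
fn umulhi_i8_reference(a: u8, b: u8) -> u8 {
    ((a as u32 * b as u32) / 256) as u8
}

/// Mirrors the CLIF sequence: uextend, imul, ushr_imm 8, ireduce.
fn umulhi_i8_expanded(a: u8, b: u8) -> u8 {
    let ua = a as u16; // uextend.i16
    let ub = b as u16; // uextend.i16
    let prod = ua.wrapping_mul(ub); // imul (wrapping 16-bit multiply)
    let hi = prod >> 8; // ushr_imm 8 (logical shift on u16)
    hi as u8 // ireduce.i8
}

/// Checks every operand pair and returns the number of mismatches.
fn count_unsigned_mismatches() -> usize {
    let mut mismatches = 0;
    for a in 0..=u8::MAX {
        for b in 0..=u8::MAX {
            if umulhi_i8_reference(a, b) != umulhi_i8_expanded(a, b) {
                mismatches += 1;
            }
        }
    }
    mismatches
}
```

A mismatch count of zero means the widened sequence agrees with the reference on every input, which is the property the legalization relies on.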
3. Expand smulhi.i8 via Sign-Extension
For the signed case, the inputs must be sign-extended:
smulhi.i8 a, b
==>
sa = sextend.i16 a
sb = sextend.i16 b
prod = imul sa, sb
hi = sshr_imm prod, 8
result = ireduce.i8 hi
Equivalent CLIF-style logic:
function %smulhi_i8_lowered(i8, i8) -> i8 {
block0(v0: i8, v1: i8):
    v2 = sextend.i16 v0
    v3 = sextend.i16 v1
    v4 = imul v2, v3
    v5 = sshr_imm v4, 8
    v6 = ireduce.i8 v5
    return v6
}
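The signed expansion can be checked the same way. This Rust sketch, again using hypothetical function names, compares the sign-extended sequence against a reference defined as the floor of the product divided by 256. It is a model of the arithmetic only.

```rust
/// Reference semantics: floor(a * b / 256), which equals the upper 8 bits
/// of the 16-bit signed product. div_euclid rounds toward negative
/// infinity when the divisor is positive.
fn smulhi_i8_reference(a: i8, b: i8) -> i8 {
    (a as i32 * b as i32).div_euclid(256) as i8
}

/// Mirrors the CLIF sequence: sextend, imul, sshr_imm 8, ireduce.
fn smulhi_i8_expanded(a: i8, b: i8) -> i8 {
    let sa = a as i16; // sextend.i16
    let sb = b as i16; // sextend.i16
    let prod = sa.wrapping_mul(sb); // imul (wrapping 16-bit multiply)
    let hi = prod >> 8; // sshr_imm 8 (arithmetic shift on i16)
    hi as i8 // ireduce.i8
}

/// Checks every operand pair and returns the number of mismatches.
fn count_signed_mismatches() -> usize {
    let mut mismatches = 0;
    for a in i8::MIN..=i8::MAX {
        for b in i8::MIN..=i8::MAX {
            if smulhi_i8_reference(a, b) != smulhi_i8_expanded(a, b) {
                mismatches += 1;
            }
        }
    }
    mismatches
}
```

The 16-bit product cannot overflow here, since the largest magnitude is (-128) * (-128) = 16384, so the wrapping multiply never actually wraps.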
4. Implement the Legalization in Cranelift
Depending on the current Cranelift code layout in your checkout, this belongs in the legalization or ISLE lowering path for x86_64. The intent is the same: intercept umulhi.i8 and smulhi.i8 and rewrite them before final instruction selection.
Pseudocode for the rewrite logic:
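match inst {
    // Unsigned: zero-extend so the upper byte of each 16-bit input is 0.
    umulhi.i8(x, y) => {
        let x16 = uextend.i16(x);
        let y16 = uextend.i16(y);
        let p16 = imul(x16, y16);
        let hi16 = ushr_imm(p16, 8); // move the high byte into bits 0..7
        replace_with(ireduce.i8(hi16));
    }
    // Signed: sign-extend so the 16-bit product carries the correct sign.
    smulhi.i8(x, y) => {
        let x16 = sextend.i16(x);
        let y16 = sextend.i16(y);
        let p16 = imul(x16, y16);
        let hi16 = sshr_imm(p16, 8); // arithmetic shift keeps the sign
        replace_with(ireduce.i8(hi16));
    }
    // Other instructions and wider types are left unchanged.
    _ => {}
}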
The key detail is that the multiply itself can still be a normal 16-bit integer multiply. You are not trying to emit an x86 8-bit multiply and extract AH; you are expressing the same math in a form the backend already understands well.
5. Add Regression Tests
Add dedicated tests for both unsigned and signed cases on x86_64 and verify they still work on aarch64.
test interpret
test run
target x86_64
target aarch64
function %umulhi_i8(i8, i8) -> i8 {
block0(v0: i8, v1: i8):
    v2 = umulhi.i8 v0, v1
    return v2
}
; run: %umulhi_i8(255, 255) == 254
; run: %umulhi_i8(16, 16) == 1
; run: %umulhi_i8(3, 7) == 0
test interpret
test run
target x86_64
target aarch64
function %smulhi_i8(i8, i8) -> i8 {
block0(v0: i8, v1: i8):
    v2 = smulhi.i8 v0, v1
    return v2
}
; run: %smulhi_i8(-128, 2) == -1
; run: %smulhi_i8(127, 127) == 63
; run: %smulhi_i8(-64, -64) == 16
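If you want more run lines than the handful above, the expected values can be computed rather than worked out by hand. The Rust sketch below formats run lines for every pair drawn from a small set of boundary operands. The helper names and the choice of boundary values are assumptions made for this example; the function names %umulhi_i8 and %smulhi_i8 match the tests above.

```rust
/// Upper byte of the unsigned 16-bit product (expected umulhi.i8 result).
fn umulhi_i8(a: u8, b: u8) -> u8 {
    ((a as u16 * b as u16) >> 8) as u8
}

/// Upper byte of the signed 16-bit product (expected smulhi.i8 result).
fn smulhi_i8(a: i8, b: i8) -> i8 {
    ((a as i16 * b as i16) >> 8) as i8
}

/// Formats one run line per pair of unsigned boundary operands.
fn unsigned_run_lines() -> Vec<String> {
    let edges: [u8; 5] = [0, 1, 127, 128, 255];
    let mut lines = Vec::new();
    for &a in &edges {
        for &b in &edges {
            lines.push(format!("; run: %umulhi_i8({}, {}) == {}", a, b, umulhi_i8(a, b)));
        }
    }
    lines
}

/// Formats one run line per pair of signed boundary operands.
fn signed_run_lines() -> Vec<String> {
    let edges: [i8; 5] = [-128, -1, 0, 1, 127];
    let mut lines = Vec::new();
    for &a in &edges {
        for &b in &edges {
            lines.push(format!("; run: %smulhi_i8({}, {}) == {}", a, b, smulhi_i8(a, b)));
        }
    }
    lines
}
```

Printing the returned strings gives lines that can be pasted under the corresponding function in the filetest.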
6. Validate the Backend Output
After the rewrite, inspect the generated x86_64 code if possible. You should see code based on ordinary widening, multiply, shift, and truncate operations rather than any fragile dependency on high-byte register extraction.
cargo test -p cranelift-codegen
cargo test -p cranelift-filetests
If your local workflow includes filetests, rerun the exact failing test and confirm that both interpretation and native execution now pass.
Common Edge Cases
- Using the wrong extension operation: umulhi.i8 must use uextend; smulhi.i8 must use sextend. Mixing them produces incorrect upper-half bits.
- Using logical shift for signed multiply-high: after signed multiplication, prefer sshr_imm so the intermediate 16-bit value keeps its sign. Because ireduce.i8 keeps only what were bits 8 through 15 of the product, ushr_imm yields the same final byte, but the distinction matters if anything else consumes the intermediate value.
- Reducing too early: if you truncate before shifting, you destroy the high byte you are trying to compute.
- Backend-only patching: trying to special-case x86 byte registers directly can introduce register allocation bugs, especially around REX encodings and partial-register hazards.
- Missing two-operand test coverage: even if your original issue uses the same input twice, the implementation should be validated with independent operands to catch legalization mistakes.
- Assuming this affects all widths: this issue is specifically painful for i8 on x86_64 because of byte-register behavior. Wider forms may already lower correctly.
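The first three edge cases can be seen on concrete inputs. This Rust sketch models the correct signed expansion next to each variant; every function name is hypothetical and the code only models the arithmetic. The early-reduce variant uses wrapping_shr, which masks the shift amount to the type width, so shifting an 8-bit value by 8 leaves it unchanged.

```rust
/// Correct signed expansion: sextend, imul, sshr by 8, ireduce.
fn smulhi_ok(a: i8, b: i8) -> i8 {
    let prod = (a as i16).wrapping_mul(b as i16);
    (prod >> 8) as i8
}

/// Mistake: zero-extending signed inputs (uextend where sextend is needed).
fn smulhi_wrong_extend(a: i8, b: i8) -> i8 {
    let prod = (a as u8 as u16).wrapping_mul(b as u8 as u16);
    (prod >> 8) as u8 as i8
}

/// Mistake: reducing to 8 bits before shifting. The high byte of the
/// product is already gone by the time the shift runs.
fn smulhi_reduce_first(a: i8, b: i8) -> i8 {
    let prod = (a as i16).wrapping_mul(b as i16);
    let low = prod as i8; // ireduce too early
    low.wrapping_shr(8) // shift amount is masked, so this returns `low`
}

/// Logical instead of arithmetic shift on the signed product. The final
/// byte is the same because the truncation keeps only bits 8 through 15.
fn smulhi_logical_shift(a: i8, b: i8) -> i8 {
    let prod = (a as i16).wrapping_mul(b as i16);
    ((prod as u16) >> 8) as u8 as i8
}

/// Counts inputs where the logical-shift variant differs from the correct one.
fn count_shift_mismatches() -> usize {
    let mut mismatches = 0;
    for a in i8::MIN..=i8::MAX {
        for b in i8::MIN..=i8::MAX {
            if smulhi_ok(a, b) != smulhi_logical_shift(a, b) {
                mismatches += 1;
            }
        }
    }
    mismatches
}
```

For example, with operands -1 and -1 the correct result is 0 while the zero-extended variant returns -2, and with operands -128 and 2 the correct result is -1 while the early-reduce variant returns 0.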
FAQ
Why does this work in the interpreter but not on x86_64?
The interpreter executes Cranelift IR semantics directly and does not require target-specific instruction lowering. The x86_64 backend must map the IR to legal machine instructions, and that mapping is what is missing for umulhi.i8 and smulhi.i8.
Why not just use x86 mul/imul and read AH?
High-byte registers are awkward in 64-bit x86 code generation. An instruction that carries a REX prefix cannot address AH, BH, CH, or DH (the same encodings select SPL, BPL, SIL, and DIL instead), so a value held in AH restricts which registers the surrounding instructions may use. Legalizing to a wider multiply is simpler and more maintainable.
Is this an x86_64 bug or a Cranelift IR bug?
It is primarily a Cranelift x86_64 backend implementation gap, not a flaw in the IR operation itself. The IR instruction is valid; the backend just needs a legalization path for the 8-bit form.
The best long-term fix is to treat umulhi.i8 and smulhi.i8 as operations that should be expanded before final x86_64 lowering. That keeps the backend correct, avoids brittle byte-register handling, and gives you consistent results across the interpreter, aarch64, and x86_64.