How to Fix: Cranelift: Fuzz failure with egraphs on AArch64
Cranelift fuzz failure with egraphs on AArch64: root cause, fix strategy, and validation workflow
This failure has the signature of a target-specific optimization mismatch. The same .clif test passes on x86 but fails on AArch64 when egraphs are enabled, which suggests that the optimizer introduces a rewrite that holds under one lowering model but is incorrect, incomplete, or inconsistently legalized under another.
Symptoms and reproduction
The issue description points to a fuzz-generated .clif case with:
test interpret
test run
set opt_level=speed_and_size
set use_egraphs=true
target aarch64 ...
The important signal is the combination of:
- AArch64-only failure
- Passes on x86
- Requires egraphs
- Triggered under optimization
That pattern usually means one of four things:
- An egraph rewrite is too aggressive and assumes semantics that do not hold after AArch64 legalization.
- A transformed instruction sequence is legal in the IR but lowered incorrectly for AArch64.
- AArch64 has stricter behavior around flags, lanes, immediates, shifts, extends, or bit-width canonicalization.
- The interpreter and machine backend disagree because the optimization changed the IR into a shape that exposes a backend bug.
To reproduce locally, run the exact test case through Cranelift's filetest runner. If you have the original generated file, run it directly instead of the whole suite so you can inspect the transformed IR before and after egraph extraction. In a wasmtime checkout, clif-util accepts a single file; the path below is a placeholder for wherever you saved the repro:
cargo run -p cranelift-tools --bin clif-util -- test path/to/repro.clif
Understanding the Root Cause
This happens because egraphs perform equivalence-based rewriting rather than purely local peephole optimization. That is powerful, but it also raises the risk that a rewrite which is semantically equivalent at the IR level stops being equivalent once target-specific lowering rules are applied.
For AArch64, common problem areas include:
- Integer extension semantics, especially mixing sign-extension and zero-extension.
- Shift and rotate rewrites where out-of-range behavior, masked shift amounts, or bit-width assumptions differ.
- Condition code materialization and compare folding.
- Vector lane transformations that are legal in generic IR but not preserved identically by the backend.
- Narrow/wide value rewrites where intermediate truncation or extension is silently changed.
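As a rough illustration of the first item, the sketch below models ireduce followed by either sextend or uextend using plain Rust integer casts. It is not Cranelift code, and the function names are invented for this example. The two forms agree only while bit 7 of the input is clear, which is why a rewrite that swaps one for the other can survive casual testing.

```rust
/// Models `ireduce.i8` then `sextend.i32`: keep the low 8 bits, then copy bit 7 upward.
fn reduce_then_sextend(x: u32) -> u32 {
    (x as u8 as i8) as i32 as u32
}

/// Models `ireduce.i8` then `uextend.i32`: keep the low 8 bits, then fill with zeros.
fn reduce_then_uextend(x: u32) -> u32 {
    (x as u8) as u32
}

fn main() {
    // Inputs chosen so that some have bit 7 set and some do not.
    for x in [0x01u32, 0x7f, 0x80, 0xff, 0x1234_5680] {
        let s = reduce_then_sextend(x);
        let u = reduce_then_uextend(x);
        println!("{x:#010x}: sextend={s:#010x} uextend={u:#010x} agree={}", s == u);
    }
}
```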
The reason it passes on x86 is not necessarily that the rewrite is correct. More often, x86 either:
- has a lowering path that accidentally preserves the intended behavior,
- accepts a broader set of instruction forms, or
- does not expose the semantic discrepancy due to different legalization decisions.
In practice, the root cause is usually one of these:
- A missing target guard on an egraph rewrite.
- A rewrite that should only fire for a specific bit-width or value domain.
- A backend legalization bug revealed by a newly rewritten form.
- An extraction cost model choosing a shape that is theoretically equivalent but backend-hostile on AArch64.
If the fuzz case only fails when use_egraphs=true, the fastest path is to inspect the IR immediately before and after egraph optimization, then compare the AArch64-lowered result against the pre-egraph baseline.
Step-by-Step Solution
The safest fix is to identify the offending rewrite, constrain or disable it for AArch64, and then add a regression test so the fuzz case stays fixed.
1. Isolate the optimization delta
Run the same test with and without egraphs and compare the resulting IR.
; baseline.clif
set use_egraphs=false
; repro.clif
set use_egraphs=true
If your local Cranelift tooling supports pass dumping, capture both forms. The key question is: what expression shape changed?
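Once both dumps are captured as text, even a naive positional comparison can point at the changed expression. The helper below is a minimal sketch, not part of Cranelift; the IR strings in main are hypothetical stand-ins for real dumps. For dumps whose line counts shift, a proper diff tool will do better.

```rust
/// Returns (1-based line number, before, after) for each position where the dumps differ.
/// A line present in only one dump is compared against the empty string.
fn changed_lines(before: &str, after: &str) -> Vec<(usize, String, String)> {
    let b: Vec<&str> = before.lines().map(str::trim).collect();
    let a: Vec<&str> = after.lines().map(str::trim).collect();
    let mut out = Vec::new();
    for i in 0..b.len().max(a.len()) {
        let lb = b.get(i).copied().unwrap_or("");
        let la = a.get(i).copied().unwrap_or("");
        if lb != la {
            out.push((i + 1, lb.to_string(), la.to_string()));
        }
    }
    out
}

fn main() {
    // Hypothetical dumps; real ones would come from your own tooling.
    let baseline = "v2 = ishl v0, v1\nv3 = uextend.i64 v2\nreturn v3";
    let egraph = "v2 = ishl v0, v1\nv3 = sextend.i64 v2\nreturn v3";
    for (n, before, after) in changed_lines(baseline, egraph) {
        println!("line {n}:\n  - {before}\n  + {after}");
    }
}
```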
2. Minimize the fuzz case
Shrink the test until only the failing transformation remains. Keep:
- the same target triple,
- the same opt_level,
- the same value widths,
- the same opcode family involved in the rewrite.
A minimal case makes it much easier to prove whether the issue is in:
- egraph rewrite rules,
- instruction legalization, or
- AArch64 lowering.
3. Inspect suspicious rewrite classes
Focus first on rules touching these areas:
- ireduce / uextend / sextend
- band / bor / bxor canonicalization
- ishl / ushr / sshr reassociation
- icmp folding and boolean normalization
- select / bint / flags-related simplification
- vector splat, extractlane, insertlane rewrites
If one rule rewrites a narrow-width operation into a wider form plus masking, verify that AArch64 lowering preserves the exact semantics.
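The narrow-to-wide concern can be made concrete with ordinary integers. The sketch below assumes an 8-bit logical right shift whose amount is taken modulo 8, and models a 32-bit register whose upper bits hold leftover data. The function names are illustrative only. Dropping the mask lets stale upper bits leak into the result.

```rust
/// Reference semantics of an 8-bit logical right shift (amount taken modulo 8).
fn ushr_i8_reference(x: u8, amt: u32) -> u8 {
    x >> (amt % 8)
}

/// Wide form WITH masking: clear the upper bits of the 32-bit register before shifting.
fn ushr_i8_wide_masked(reg: u32, amt: u32) -> u8 {
    ((reg & 0xff) >> (amt % 8)) as u8
}

/// Wide form WITHOUT masking: stale upper bits are shifted into the low byte.
fn ushr_i8_wide_unmasked(reg: u32, amt: u32) -> u8 {
    (reg >> (amt % 8)) as u8
}

fn main() {
    // The low byte is 0x80; bit 8 is leftover data from an earlier wide operation.
    let reg: u32 = 0x0000_0180;
    println!("reference: {:#04x}", ushr_i8_reference(reg as u8, 1));
    println!("masked   : {:#04x}", ushr_i8_wide_masked(reg, 1));
    println!("unmasked : {:#04x}", ushr_i8_wide_unmasked(reg, 1));
}
```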
4. Add a target-specific guard or semantic predicate
Once you find the bad rewrite, do not just remove it blindly. Prefer one of these fixes:
- Add a bit-width predicate.
- Add a target predicate so it does not fire on AArch64.
- Require proof that the transformed expression preserves sign/zero extension behavior.
- Move the rewrite later or earlier so legalization sees a safer form.
Conceptually, the change looks like this. The notation is schematic rather than real ISLE syntax, and some-canonical-form and safe_for_bitwidth_and_target are placeholder names:
// Before: rewrite always fires
(rewrite (ishl x y) => (some-canonical-form x y))
// After: rewrite fires only when semantics are preserved
(rewrite (ishl x y) => (some-canonical-form x y)
  :when (safe_for_bitwidth_and_target x y target))
If the bug is in lowering rather than rewriting, fix the AArch64 backend to correctly legalize the egraph-produced form instead.
5. Validate with interpreter and backend
Because the file includes both test interpret and test run, validate against both the IR interpreter and generated machine code.
cargo test -p cranelift-filetests -- --nocapture
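Then run the exact reduced file across targets where possible. A filetest can list more than one target line, and test run only executes on a host that matches the target, so the AArch64 check needs an AArch64 machine or emulator:
# Expected: pass on x86 and AArch64 after the fix
# Compare the use_egraphs=true and use_egraphs=false variants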
Your goal is to confirm:
- the interpreter still agrees with expected semantics,
- AArch64 machine code now matches interpreter behavior,
- the fix does not regress x86.
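The same three checks can be mirrored in a small differential harness. The sketch below is an assumption-laden stand-in: the two closures play the roles of "interpreter result" and "machine-code result", and find_mismatches is a helper written for this example rather than anything Cranelift provides.

```rust
/// Runs `reference` and `candidate` on every input and returns the inputs where they disagree.
fn find_mismatches<R, C>(inputs: &[u32], reference: R, candidate: C) -> Vec<u32>
where
    R: Fn(u32) -> u32,
    C: Fn(u32) -> u32,
{
    inputs
        .iter()
        .copied()
        .filter(|&x| reference(x) != candidate(x))
        .collect()
}

fn main() {
    // Boundary values: zero, one, largest positive, sign bit only, all ones.
    let inputs = [0u32, 1, 0x7fff_ffff, 0x8000_0000, 0xffff_ffff];

    // Stand-ins for the two sides being compared.
    let reference = |x: u32| ((x as i32) >> 1) as u32; // arithmetic shift
    let candidate = |x: u32| x >> 1; // logical shift: differs once the sign bit is set

    let bad = find_mismatches(&inputs, reference, candidate);
    println!("mismatching inputs: {bad:x?}");
}
```

Feeding boundary values first tends to surface sign-bit and width bugs faster than random inputs alone.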
6. Add a permanent regression test
Create a dedicated .clif regression file using the minimized repro. Keep the original settings that triggered the bug:
test interpret
test run
set opt_level=speed_and_size
set use_egraphs=true
target aarch64
function %repro(...) -> ... {
    ; minimized body here
}
This is critical because fuzz failures often reappear when new rewrites are added nearby.
7. If needed, temporarily disable the rewrite
If you need an immediate stabilization patch, it is acceptable to disable the specific optimization on AArch64 while preparing a proper semantic fix. That is better than leaving a known miscompile path enabled. In the sketch below, is_aarch64 and disable_problematic_egraph_rule are placeholders, not existing Cranelift APIs:
// Temporary mitigation strategy (pseudocode)
if target.is_aarch64() {
    disable_problematic_egraph_rule();
}
Use this only as a short-term measure and document why the rule is being gated.
Common Edge Cases
Even after fixing the main bug, several adjacent cases can still fail if they are not explicitly tested.
1. Sign-extension versus zero-extension confusion
A rewrite may preserve value bits for positive numbers while breaking negative values. Always test with inputs that exercise the sign bit.
2. Shift counts at or beyond type width
Backend behavior can diverge if the rewrite assumes shift counts are masked or normalized differently than the target actually lowers them.
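The following sketch shows the divergence with plain Rust integers. It assumes the IR-level 8-bit shift masks its amount to the type width, and it models the lowered form as a 32-bit shift that takes the amount modulo 32 with no extra masking. Both function names are made up for this example.

```rust
/// 8-bit left shift with the amount masked to the type width (modulo 8),
/// the behavior assumed here for the IR-level operation.
fn ishl_i8_masked_to_type(x: u8, amt: u32) -> u8 {
    x.wrapping_shl(amt) // wrapping_shl masks the amount by the bit width of u8
}

/// The same shift carried out in a 32-bit register whose shifter masks modulo 32.
/// Without an explicit `amt & 7`, amounts of 8 or more push the bits out of the low byte.
fn ishl_i8_in_32bit_register(x: u8, amt: u32) -> u8 {
    (x as u32).wrapping_shl(amt) as u8
}

fn main() {
    for amt in [1u32, 7, 8, 9] {
        let narrow = ishl_i8_masked_to_type(1, amt);
        let wide = ishl_i8_in_32bit_register(1, amt);
        println!("amt={amt}: narrow={narrow:#04x} wide={wide:#04x} agree={}", narrow == wide);
    }
}
```

The two functions agree for amounts below 8 and diverge from 8 upward, so a regression test should include at least one amount equal to the type width and one above it.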
3. Boolean canonicalization
Some rewrites assume booleans are always 0 or 1. If the target path materializes all-ones masks or uses flags-based representations, equality-preserving rewrites can still produce backend-visible differences.
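A branch-free select is a simple way to see why the encoding matters. The sketch below is illustrative only: select_by_mask is a hypothetical helper that is correct when "true" is an all-ones mask and wrong when "true" is the integer 1.

```rust
/// Branch-free select: correct only when `mask` is all-ones (true) or zero (false).
fn select_by_mask(mask: u32, a: u32, b: u32) -> u32 {
    (a & mask) | (b & !mask)
}

fn main() {
    let (a, b) = (0xdead_beef_u32, 0u32);
    let with_all_ones = select_by_mask(u32::MAX, a, b); // "true" encoded as all-ones
    let with_one = select_by_mask(1, a, b); // "true" encoded as 1
    println!("all-ones true: {with_all_ones:#010x}");
    println!("0/1 true     : {with_one:#010x}");
}
```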
4. Narrow integer legalization
AArch64 often legalizes subword operations through wider registers. If a rewrite changes where truncation happens, the resulting code can differ only on certain inputs.
5. Vector lane semantics
If the fuzz case touches vectors, verify lane ordering, extraction, insertion, and splat behavior. These are common places where target-specific lowering bugs hide.
6. Cost-model extraction issues
Sometimes every rewrite is individually valid, but the egraph extractor picks a form that stresses an incomplete backend path. In that case, the fix may belong in extraction heuristics or legalization, not the rewrite itself.
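To make the extraction point concrete, here is a toy model of picking the cheapest member of an equivalence class. Everything in it is hypothetical: the Candidate struct, the cost numbers, and the idea of a per-target penalty are assumptions for illustration and do not describe how Cranelift's extractor is actually implemented.

```rust
/// One candidate form in an equivalence class, with made-up costs.
struct Candidate {
    form: &'static str,
    generic_cost: u32,
    aarch64_penalty: u32,
}

/// Picks the cheapest candidate. With `target_aware` set, the penalty is added to the
/// generic cost, steering extraction away from shapes the backend handles poorly.
fn extract(class: &[Candidate], target_aware: bool) -> &'static str {
    class
        .iter()
        .min_by_key(|c| c.generic_cost + if target_aware { c.aarch64_penalty } else { 0 })
        .map(|c| c.form)
        .unwrap_or("<empty class>")
}

fn main() {
    // Two forms treated as equivalent for this illustration; the costs are invented.
    let class = [
        Candidate { form: "band x, 0xff", generic_cost: 2, aarch64_penalty: 0 },
        Candidate { form: "uextend (ireduce x)", generic_cost: 1, aarch64_penalty: 4 },
    ];
    println!("generic choice     : {}", extract(&class, false));
    println!("target-aware choice: {}", extract(&class, true));
}
```

In this model both forms are valid; only the choice between them changes, which is the situation the paragraph above describes.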
FAQ
Why does this fail only on AArch64 if the IR rewrite is supposed to be target-independent?
Because a target-independent rewrite is only safe in practice if every backend lowers the rewritten form correctly. AArch64 lowering or legalization may contain a bug or an unstated assumption that the rewritten form exposes and that x86 lowering does not share.
Should I disable egraphs entirely for AArch64?
No. Disable only the specific problematic rewrite or fix the backend path. A broad disable loses optimization coverage and hides the real bug.
How do I tell whether the bug is in egraphs or the AArch64 backend?
Compare three stages: the original IR, the post-egraph IR, and the final backend behavior. If interpreting the post-egraph IR already gives a different result from interpreting the original IR, the bug is in the rewrite. If both interpret identically but the generated machine code produces a different result, the bug is in AArch64 lowering or legalization.
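That decision procedure is small enough to write down. The sketch below assumes you have already collected three results for the same inputs; the Suspect enum and triage function are names invented for this example.

```rust
#[derive(Debug, PartialEq)]
enum Suspect {
    NoDivergence,
    EgraphRewrite,
    BackendLowering,
}

/// Classifies a failure from three observed results for the same inputs:
/// interpreting the original IR, interpreting the post-egraph IR, and running
/// machine code compiled from the post-egraph IR.
fn triage(original_interp: u64, post_egraph_interp: u64, machine_code: u64) -> Suspect {
    if original_interp != post_egraph_interp {
        Suspect::EgraphRewrite
    } else if post_egraph_interp != machine_code {
        Suspect::BackendLowering
    } else {
        Suspect::NoDivergence
    }
}

fn main() {
    // Hypothetical observations for one input vector.
    println!("{:?}", triage(0x40, 0x40, 0xc0));
}
```

A single input vector can mislead, so it is worth running the classification over several inputs before assigning blame.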
The practical resolution for this GitHub issue is to treat it as a rewrite-to-lowering contract violation: isolate the egraph transformation introduced by the fuzz case, constrain it with the right semantic or target predicate, and lock in the fix with a minimized AArch64 regression test.