How to Fix: Cranelift: Wrong result for `atomic_rmw.i16 or` on RISC-V
Cranelift on RISC-V returned the wrong value for atomic_rmw.i16 or because the backend widened a 16-bit atomic operation incorrectly and failed to preserve the language-level result semantics.
This bug is subtle: the memory update may appear correct, but the value returned by the instruction can be wrong on riscv64gc. That makes it especially dangerous for code generated from Wasm or IR relying on atomic read-modify-write operations to return the previous value exactly as specified.
What the Bug Looks Like
The issue reported against Cranelift shows a failing case for atomic_rmw.i16 or on RISC-V. In Cranelift IR, this operation must:
- Atomically load the old 16-bit value from memory.
- Compute
old | arg. - Store the updated 16-bit result atomically.
- Return the old 16-bit value, not the new one, and not a widened incorrectly transformed value.
On architectures without native 16-bit atomic RMW instructions, the backend usually lowers this into a compare-and-swap loop over a larger naturally supported width, often 32-bit or XLEN-sized. That widening is where these bugs usually appear.
Understanding the Root Cause
The root cause is a mismatch between subword atomic semantics and the way RISC-V backends emulate them when only larger atomic primitives are available.
For an i16 atomic OR, the lowering generally follows this pattern:
- Align the target address down to the containing word.
- Load the full word atomically.
- Extract the target 16-bit lane using shifts and masks.
- Compute the new 16-bit lane:
old16 | rhs16. - Merge the updated lane back into the full word.
- Attempt a CAS on the full word and retry on failure.
- Return the original 16-bit lane value.
The bug happens when the lowering gets one of these details wrong:
- The backend returns the merged word or the new lane instead of the required old 16-bit lane.
- The extracted old value is not masked back to 16 bits before being returned.
- Sign extension or zero extension is applied at the wrong stage.
- The shift or mask for the 16-bit lane is computed incorrectly for the address offset within the containing word.
On RISC-V, this class of bug is common in backend legalization because the ISA provides atomic operations for certain widths, while smaller widths may require emulation. If the legalization code treats the result as a full register-width value without correctly restoring i16 semantics, the operation can update memory correctly but still return the wrong result.
In practical terms, the fix is to ensure the lowering returns:
old16 = (old_word >> shift) & 0xffff
and then converts that result exactly according to Cranelift’s expected i16 value rules, rather than returning any intermediate full-width value.
Step-by-Step Solution
The correct fix is implemented in the lowering/legalization path for subword atomic RMW operations on RISC-V. The exact file names may vary slightly by Cranelift revision, but the strategy is consistent.
1. Locate the lowering for subword atomic RMW on RISC-V
Search the backend for handling of:
atomic_rmwAtomicRmwOp::Or- subword atomic lowering or CAS loop generation
- RISC-V atomic legalization code
git grep "atomic_rmw
git grep "AtomicRmwOp"
git grep "cmpxchg"
git grep "riscv" cranelift
You are looking for the place where i8/i16 atomics are expanded into a loop using a larger atomic primitive.
2. Inspect how the old lane value is extracted
The correct extraction should conceptually look like this:
addr_base = addr & !0x3 ; or !0x7 depending on lowering width
byte_off = addr & 0x3
shift = byte_off * 8
mask16 = 0xffff << shift
old_word = atomic_load(addr_base)
old16 = (old_word >> shift) & 0xffff
new16 = old16 | rhs16
new_word = (old_word & ~mask16) | ((new16 & 0xffff) << shift)
If the code instead returns new16, new_word, or an unmasked shifted value, that is the bug.
3. Ensure the CAS loop returns the original 16-bit lane
The retry loop must preserve the old lane value from the successful compare-and-swap iteration:
loop:
old_word = amo_or_load_or_lr(addr_base)
old16 = (old_word >> shift) & 0xffff
new16 = old16 | rhs16
new_word = (old_word & ~mask16) | ((new16 & 0xffff) << shift)
success = cas(addr_base, old_word, new_word)
if !success goto loop
return old16
The most important rule is simple: return old16 after successful atomic replacement.
4. Fix extension behavior explicitly
Because Cranelift tracks integer widths precisely, make sure the returned value is reconstructed as an i16, not just left in a machine register as i32 or i64 without truncation.
// Pseudocode in backend/legalizer terms
let old16 = band(ushr(old_word, shift), iconst.i32(0xffff));
let new16 = bor(old16, zext(rhs16));
let merged = bor(band(old_word, bnot(mask16)), ishl(band(new16, iconst(0xffff)), shift));
// after successful CAS
let result = ireduce.i16(old16);
return result;
If the backend currently does a widen-then-return sequence, replace it with an explicit truncate/reduce step at the return site.
5. Add or update the regression test
Preserve the original failing .clif file from the issue and add a focused regression test that verifies both:
- the memory value is updated correctly
- the returned old i16 value is correct
test interpret
test run
target riscv64gc
function %atomic_or_i16_returns_old_value(i64, i16) -> i16 {
ss0 = explicit_slot 8
block0(v0: i64, v1: i16):
stack_store v1, ss0
vaddr = stack_addr.i64 ss0
vold = atomic_rmw.or.i16 vaddr, v0
return vold
}
Also add variants where the original memory contains:
0x00000x00ff0x80000xffff
These help catch masking and extension bugs.
6. Run the relevant test suites
cargo test -p cranelift-codegen
cargo test -p cranelift-filetests
cargo test atomic
cargo test riscv
If your local workflow uses filetests directly, run the failing case under the RISC-V target until the returned value matches the expected old 16-bit value exactly.
Common Edge Cases
Fixing atomic_rmw.i16 or often exposes similar bugs nearby. Check these cases before sending the patch.
1. Other subword atomic ops
If or is wrong, and, xor, xchg, add, and signed/unsigned min/max lowering may share the same return-value bug.
2. i8 and i16 lane offset math
For unaligned subword positions inside a larger word, the shift amount must reflect the byte offset exactly. A one-byte mistake causes the wrong lane to be extracted or rewritten.
3. Endianness assumptions
Most current RISC-V targets are little-endian, but legalization logic should still be written so lane extraction is clearly tied to target endianness assumptions. Hardcoded shift formulas without documented assumptions can break later.
4. Sign-extension vs zero-extension
An OR operation is bitwise, so the extraction should usually be masked first. Returning a sign-extended widened value instead of a true i16 can make test results look random for values with the high bit set, such as 0x8000.
5. Retry-loop value reuse
In a CAS loop, do not compute the returned value from a stale failed iteration. The returned old value must correspond to the successful exchange iteration.
6. Memory ordering preservation
While fixing data-path logic, do not accidentally weaken the required atomic ordering semantics in the emitted LR/SC or equivalent sequence.
How to Validate the Fix
A good validation strategy checks more than the original reproducer.
- Run the original
.cliftest case from the issue. - Add a matrix of initial 16-bit values and OR operands.
- Verify the returned value equals the pre-operation memory contents.
- Verify the final memory equals
old | argmasked to 16 bits. - Run neighboring subword atomic tests for i8 and i16.
// Expected semantics example
old = 0x8001
arg = 0x0002
ret = atomic_rmw_or_i16(ptr, arg)
assert(ret == 0x8001)
assert(*ptr == 0x8003)
If ret becomes 0x8003, the backend returned the new value instead of the old one. If it becomes something like 0xffff8001 in an intermediate widened path, extension handling is still wrong.
FAQ
Why does this bug show up on RISC-V specifically?
Because subword atomics such as i16 atomic RMW are often lowered into wider atomic sequences on RISC-V. That widening introduces extra mask, shift, merge, and truncate steps, which are easy to get wrong.
Is the memory update wrong too, or only the returned value?
In many cases, only the returned old value is wrong while the memory update is correct. That still violates Cranelift IR semantics and can break user code relying on the returned value.
Should I audit only atomic_rmw.i16 or?
No. You should audit the entire subword atomic lowering path, especially other i8/i16 operations sharing the same CAS-loop helper or extraction logic. These bugs often come in families.
The safest long-term fix is to centralize subword atomic extraction and return-value handling so every operation follows the same tested pattern: extract old lane, compute new lane, merge, CAS, return old lane truncated to the IR width.