How to Fix: Cranelift: Unexpected negative optimization of switch statement

Cranelift switch lowering can make evalloop.c slower instead of faster

When a large switch statement is compiled through a Cranelift-based pipeline, the more aggressively optimized build can end up slower than a less optimized one. In evalloop.c, this usually shows up when an interpreter-style dispatch loop is transformed into code that looks better in theory but interacts poorly with WebAssembly code generation, branch prediction, and backend lowering decisions. The result is a negative optimization: the -O3 build can benchmark worse than the -O0 build or a simpler lowering path.

Symptoms

The issue typically appears with benchmark-style opcode interpreters such as evalloop.c in llvm-test-suite. A build using aggressive optimization may produce slower runtime even though the source code is unchanged. Common signs include:

  • A large dispatch switch over many cases.
  • Noticeable slowdown only in Wasm output or only in a specific backend using Cranelift.
  • Performance counters showing more branch overhead, worse locality, or increased instruction count after optimization.
  • Different behavior between -O0, -O2, and -O3 when using Emscripten or a Cranelift-based pipeline.

Understanding the Root Cause

The root problem is not that the switch itself is invalid. The problem is how the compiler decides to lower that switch into machine-level or Wasm-level control flow.

In interpreter-like code, the original C switch often has properties that matter at run time but are invisible to a generic cost model:

  • The distribution of case values may be sparse or biased.
  • Some cases are dramatically hotter than others.
  • The source order may accidentally help branch prediction or code locality.
  • The dispatch loop may be so tight that even small backend changes dominate runtime.

At higher optimization levels, Cranelift may apply transformations such as:

  • Converting the switch into a jump table.
  • Reordering blocks based on generic heuristics.
  • Merging or splitting control-flow regions.
  • Producing extra bounds checks or index materialization before table dispatch.

Those transformations are usually beneficial, but in this benchmark they can backfire. A jump table is only a clear win when the case space is sufficiently dense and the backend can lower it efficiently. In a WebAssembly pipeline, switch lowering may introduce extra instructions, less favorable structured control flow, or poorer locality than a simpler branch chain. If the hot cases are no longer cheap to reach, total runtime increases.
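To make the density point concrete, here is a minimal sketch of the kind of check a lowering heuristic might perform. The names `switch_density` and `looks_jump_table_friendly` and the 0.5 threshold are illustrative assumptions, not Cranelift's actual cost model.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: fraction of the [min, max] case range that is
 * covered by case labels. Assumes the labels are distinct. */
double switch_density(const int *cases, size_t n) {
    assert(cases != NULL && n > 0);
    int lo = cases[0], hi = cases[0];
    for (size_t i = 1; i < n; i++) {
        if (cases[i] < lo) lo = cases[i];
        if (cases[i] > hi) hi = cases[i];
    }
    /* Widen to long long so hi - lo + 1 cannot overflow int. */
    long long range = (long long)hi - (long long)lo + 1;
    return (double)n / (double)range;
}

/* Illustrative threshold only; real compilers use their own cost models. */
int looks_jump_table_friendly(const int *cases, size_t n) {
    return switch_density(cases, n) >= 0.5;
}
```

Note that a check like this only looks at the case labels. It says nothing about which cases are hot at run time, which is exactly the information the dispatch loop depends on.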

This is why the issue is described as an unexpected negative optimization: the optimization pass is legal and well-intentioned, but its cost model does not match the actual runtime characteristics of evalloop.c.

In practice, the slowdown usually comes from one or more of these technical causes:

  • Bad switch density assumptions: the optimizer treats the case range as jump-table friendly when real execution frequency says otherwise.
  • Hot-path disruption: common opcodes become more expensive to reach after lowering.
  • Backend mismatch: Cranelift IR transformation looks profitable, but the final Wasm or target code is not.
  • Code size inflation: more dispatch machinery harms instruction cache behavior.
  • Structured control-flow overhead in Wasm: the target form may not preserve the efficiency expected from native jump-table lowering.

Step-by-Step Solution

The safest fix is to first verify that switch lowering is the source of the regression, then force a dispatch shape that avoids the harmful optimization. Depending on how much of the toolchain you control, you can address it at the source level, the IR level, or the backend configuration level.

1. Reproduce the regression reliably

Build both a low-optimization and high-optimization version and compare them under the same runtime conditions.

emcc -O0 -s WASM=1 -s TOTAL_MEMORY=512MB evalloop.c -o evalloop_O0.html
emcc -O3 -s WASM=1 -s TOTAL_MEMORY=512MB evalloop.c -o evalloop_O3.html

Then benchmark both builds several times. If possible, use the same browser, machine state, and warm-up sequence.
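If you prefer to time the workload from inside the program rather than from the browser, a small harness along these lines can help. This is a sketch under assumptions: `best_seconds` and `sample_workload` are hypothetical names, and `sample_workload` is only a stand-in for the real evalloop.c entry point.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Stand-in workload; replace with a call into the real dispatch loop. */
int workload_calls = 0;
void sample_workload(void) {
    volatile unsigned long sink = 0;
    for (unsigned long i = 0; i < 100000UL; i++) sink += i;
    workload_calls++;
}

/* Runs `fn` once as an untimed warm-up, then `runs` timed iterations,
 * and returns the fastest CPU time in seconds. */
double best_seconds(void (*fn)(void), int runs) {
    assert(fn != NULL && runs > 0);
    fn(); /* warm-up run, not timed */
    double best = -1.0;
    for (int i = 0; i < runs; i++) {
        clock_t start = clock();
        fn();
        clock_t end = clock();
        double s = (double)(end - start) / CLOCKS_PER_SEC;
        if (best < 0.0 || s < best) best = s;
    }
    return best;
}
```

Taking the minimum over several runs is one common way to reduce scheduling noise. Compile the same harness into both the -O0 and -O3 builds so the two numbers are comparable.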

2. Inspect generated control flow

Check whether the optimized build replaced the original branch structure with a jump-table-like dispatch or a more complex control-flow shape.

emcc -O3 -s WASM=1 -s TOTAL_MEMORY=512MB -S -emit-llvm evalloop.c -o evalloop_O3.ll

If your pipeline exposes Cranelift IR or backend dumps, compare the switch lowering between debug and optimized paths. You are looking for:

  • Extra dispatch index calculations
  • Bounds checks before branch selection
  • Large block tables or reordered case blocks
  • Hot cases no longer placed on the fall-through path

3. Replace the monolithic switch with a table-driven dispatch

If source changes are allowed, replacing the giant switch with a function-pointer or handler table often avoids poor switch lowering heuristics. This gives the compiler a more explicit dispatch model.

typedef int (*op_handler)(int a, int b);

static int op_add(int a, int b) { return a + b; }
static int op_sub(int a, int b) { return a - b; }
static int op_mul(int a, int b) { return a * b; }

static op_handler handlers[] = {
    op_add,
    op_sub,
    op_mul
};

int eval_op(int opcode, int a, int b) {
    if (opcode < 0 || opcode >= (int)(sizeof(handlers) / sizeof(handlers[0]))) {
        return 0;
    }
    return handlers[opcode](a, b);
}

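This approach is not always faster, but it prevents the backend from making the specific switch-to-jump-table choice that caused the regression. Keep in mind that in Wasm a call through a function pointer becomes call_indirect, which carries its own table bounds and signature checks, so measure before adopting it.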
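A handler table is typically driven from a loop. The following is a minimal sketch of such a loop; the `struct insn` encoding (an opcode plus one immediate operand) and the `run_program` name are assumptions made for illustration and do not come from evalloop.c.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*op_handler)(int a, int b);

static int op_add(int a, int b) { return a + b; }
static int op_sub(int a, int b) { return a - b; }
static int op_mul(int a, int b) { return a * b; }

static const op_handler handlers[] = { op_add, op_sub, op_mul };
#define NUM_HANDLERS (sizeof(handlers) / sizeof(handlers[0]))

/* Hypothetical instruction format: an opcode plus one immediate operand. */
struct insn { int opcode; int operand; };

/* Folds the program into an accumulator; unknown opcodes are skipped. */
int run_program(const struct insn *prog, size_t n, int acc) {
    assert(prog != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        int op = prog[i].opcode;
        if (op < 0 || (size_t)op >= NUM_HANDLERS) continue;
        acc = handlers[op](acc, prog[i].operand);
    }
    return acc;
}
```

For example, starting from an accumulator of 1, the program add 5, mul 3, sub 4 yields 14. The bounds check stays inside the loop, so an out-of-range opcode never indexes past the table.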

4. Split hot and cold cases

If profiling shows a small subset of cases dominates runtime, isolate them before the main switch. This improves predictability and reduces the optimizer’s need to build a generic dispatch structure.

int eval_op(int opcode, int a, int b) {
    if (opcode == 0) return a + b;
    if (opcode == 1) return a - b;
    if (opcode == 2) return a * b;

    switch (opcode) {
        case 3: return a / (b ? b : 1);
        case 4: return a & b;
        case 5: return a | b;
        default: return 0;
    }
}

This is often the most practical fix for interpreter-style loops because it preserves readability while making the hot path explicit.
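Before benchmarking a rewrite like this, it is worth confirming that it still computes the same results as the plain switch. The sketch below compares the two over a small input grid; `eval_switch`, `eval_split`, and `check_equivalence` are hypothetical names, and the grid bounds are arbitrary.

```c
#include <assert.h>

/* Reference version: the plain switch. */
int eval_switch(int opcode, int a, int b) {
    switch (opcode) {
        case 0: return a + b;
        case 1: return a - b;
        case 2: return a * b;
        case 3: return a / (b ? b : 1);
        case 4: return a & b;
        case 5: return a | b;
        default: return 0;
    }
}

/* Rewritten version: hot cases peeled off ahead of the switch. */
int eval_split(int opcode, int a, int b) {
    if (opcode == 0) return a + b;
    if (opcode == 1) return a - b;
    if (opcode == 2) return a * b;
    switch (opcode) {
        case 3: return a / (b ? b : 1);
        case 4: return a & b;
        case 5: return a | b;
        default: return 0;
    }
}

/* Aborts via assert if the two versions ever disagree on the grid,
 * which includes out-of-range opcodes (-1 and 6). */
void check_equivalence(void) {
    for (int op = -1; op <= 6; op++)
        for (int a = -8; a <= 8; a++)
            for (int b = -8; b <= 8; b++)
                assert(eval_switch(op, a, b) == eval_split(op, a, b));
}
```

Small operand values keep the arithmetic free of signed overflow, so the comparison itself does not rely on undefined behavior.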

5. Reduce optimizer freedom around the dispatch loop

If the regression only appears in one function, compile that function or file with a different optimization level. This avoids sacrificing performance globally.

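/* GCC-specific attribute. Clang, which Emscripten uses, does not
 * implement optimize("...") and ignores it with a warning. */
__attribute__((optimize("O2")))
int eval_loop(/* args */) {
    /* original dispatch logic */
}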

If per-function attributes are not supported in your exact toolchain path, move the dispatch code into a separate translation unit and compile that file with a safer level such as -O2.

6. Use profile-guided decisions where available

If your environment supports profiling, feed real runtime data into the optimization pipeline. Generic switch heuristics are often wrong for synthetic dispatch code, but profile-guided optimization can preserve hot-case efficiency.

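# Conceptual workflow, shown with native Clang flags (Emscripten support may differ)
# 1. build instrumented binary
clang -O2 -fprofile-instr-generate evalloop.c -o evalloop_instr
# 2. run benchmark workload (writes default.profraw)
./evalloop_instr
# 3. rebuild with collected profile data
llvm-profdata merge -output=evalloop.profdata default.profraw
clang -O2 -fprofile-instr-use=evalloop.profdata evalloop.c -o evalloop_pgo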

Even if Cranelift itself is not directly consuming the profile in your stack, profile results still help you decide whether to keep a switch, split cases, or use a handler table.
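When no profiling toolchain is available, a hand-rolled opcode histogram can provide the same kind of evidence. This is a minimal sketch: `NUM_OPCODES`, `record_opcode`, `hottest_opcode`, and `opcode_share` are hypothetical names, and the opcode count of 6 is an assumption.

```c
#include <assert.h>

#define NUM_OPCODES 6 /* assumed opcode count for this sketch */

/* One counter per opcode, zero-initialized. */
static unsigned long opcode_hits[NUM_OPCODES];

/* Call once per dispatched instruction; out-of-range opcodes are ignored. */
void record_opcode(int opcode) {
    if (opcode >= 0 && opcode < NUM_OPCODES) opcode_hits[opcode]++;
}

/* Returns the most frequently recorded opcode (lowest index wins ties). */
int hottest_opcode(void) {
    int best = 0;
    for (int i = 1; i < NUM_OPCODES; i++)
        if (opcode_hits[i] > opcode_hits[best]) best = i;
    return best;
}

/* Fraction of all recorded dispatches that hit `opcode`; 0.0 if none. */
double opcode_share(int opcode) {
    assert(opcode >= 0 && opcode < NUM_OPCODES);
    unsigned long total = 0;
    for (int i = 0; i < NUM_OPCODES; i++) total += opcode_hits[i];
    return total ? (double)opcode_hits[opcode] / (double)total : 0.0;
}
```

Call `record_opcode` at the top of the dispatch loop in an instrumented build only, since the extra counter update would itself distort timing in a benchmark build. If one or two opcodes account for most of the share, splitting hot and cold cases is a reasonable candidate.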

7. Validate the fix with assembly or Wasm inspection

After changing the source or backend settings, verify that the generated dispatch structure is actually different. A successful fix usually produces one of these outcomes:

  • Fewer instructions before reaching common cases
  • No oversized jump table
  • Better placement of hot blocks
  • Reduced code size in the dispatch loop

wasm-objdump -d evalloop_O3.wasm

If code generation still looks the same, the source rewrite may not be strong enough to defeat the problematic optimization.

Common Edge Cases

  • Sparse opcode ranges: if the switch spans a wide numeric range with only a few active cases, jump-table lowering is especially likely to be wasteful.
  • Benchmark noise: Wasm startup, JIT warm-up, and browser scheduling can hide or exaggerate the regression. Always benchmark after warm-up.
  • Inlining side effects: moving logic into helper functions may accidentally help or hurt due to inlining changes rather than the switch fix itself.
  • Different behavior across engines: one browser or runtime may execute the generated Wasm much faster than another, even with identical source.
  • Code size trade-offs: a fix that speeds up the hot path can still increase total binary size. Measure both runtime and artifact size.
  • Undefined or unusual benchmark assumptions: some historical benchmark code relies on patterns that modern backends optimize differently than expected.

FAQ

Why is -O3 slower than -O0 for this switch?

Because the optimized build may choose a lowering strategy such as a jump table or block reordering that is theoretically efficient but worse for the actual runtime profile of evalloop.c. This is a classic cost-model miss, not a correctness bug.

Is the real problem in C, Cranelift, or WebAssembly?

Usually it is the interaction of all three. The C source creates an interpreter-like dispatch pattern, Cranelift applies a generic optimization, and the final Wasm backend representation may amplify the downside.

What is the best practical fix if I need performance now?

Start by isolating hot cases outside the main switch or replacing the dispatch with a handler table. If you control compilation granularity, compile the problematic function at -O2 instead of -O3 and confirm the generated code with inspection tools.

The key takeaway is simple: this issue happens when a compiler’s generic switch optimization heuristic does not match the behavior of a tight interpreter loop. The durable solution is to make the dispatch structure more explicit, reduce harmful lowering choices, and validate the generated Wasm instead of assuming a higher optimization level will always be faster.
