How to Fix: Inconsistent results for wasi-nn different backends

wasi-nn image classification can return different labels or confidence ordering across backends even when the same model and input image are used. The issue is usually not a broken model. It is a mismatch in backend preprocessing, tensor layout assumptions, numeric precision, or output interpretation between implementations such as OpenVINO and ONNX-based paths.

Problem Overview

The affected test cases, including the linked nn_image_classification flow, expose a common portability problem: the WASI-NN API standardizes graph loading and inference execution, but it does not guarantee identical backend behavior unless every backend consumes the exact same model representation, input tensor format, normalization rule, and output decoding logic.

In practice, one backend may expect NCHW tensors while another was exported or wrapped as NHWC. One implementation may feed u8 pixel values directly, while another assumes normalized f32 tensors. Small numerical differences are expected, but large label disagreements usually mean the backend adapters are not applying the same inference contract.

Understanding the Root Cause

The root cause is backend inconsistency at the boundaries of the inference pipeline, not just inside model execution.

Here are the main technical reasons:

  • Different tensor layouts: A model exported for one runtime may expect channels-first data, while another runtime adapter may pass channels-last buffers. This does not always crash; it often produces valid but wrong predictions.

  • Preprocessing drift: Resizing, color channel order, mean subtraction, scaling, and normalization must be identical. If one backend uses RGB and another effectively feeds BGR, top-1 results can change dramatically.

  • Model artifact mismatch: Backends may load different serialized graph formats generated from the same source model, but with different conversion steps, operator fusions, or implicit metadata. That can alter output ordering or precision behavior.

  • Output tensor interpretation: The inference result may be numerically correct, but the host code may decode it differently across backends. For example, reading the wrong output shape, assuming an incorrect element type, or comparing raw logits from one backend against post-processed probabilities from another.

  • Floating-point and quantization differences: Minor variance is normal between CPU runtimes, optimized kernels, or quantized vs non-quantized execution. However, when the variance is large enough to change the top-1 label or reorder the top-k results, it usually indicates one of the earlier issues rather than acceptable numerical drift.

For wasi-nn, this matters because the API abstracts execution, but the embedding application still controls model bytes, tensor descriptors, and post-processing. If those are not normalized, different backends will appear inconsistent.
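One way to make that shared contract concrete is to write it down as data and check every input buffer against it before inference. The sketch below is only an illustration: `ElemType` and `InputContract` are hypothetical names invented here, not types from the wasi-nn bindings.

```rust
/// Element types known to this sketch (hypothetical, not the wasi-nn enum).
#[derive(Debug, Clone, Copy, PartialEq)]
enum ElemType {
    U8,
    F32,
}

impl ElemType {
    /// Size of one element in bytes.
    fn size_in_bytes(self) -> usize {
        match self {
            ElemType::U8 => 1,
            ElemType::F32 => 4,
        }
    }
}

/// The input contract every backend adapter is expected to honor.
#[derive(Debug, Clone, PartialEq)]
struct InputContract {
    dims: Vec<usize>,
    elem: ElemType,
}

impl InputContract {
    /// Number of bytes a buffer must contain to satisfy this contract.
    fn expected_len_bytes(&self) -> usize {
        self.dims.iter().product::<usize>() * self.elem.size_in_bytes()
    }

    /// Returns an error message when the buffer length does not match.
    fn check(&self, buffer: &[u8]) -> Result<(), String> {
        let expected = self.expected_len_bytes();
        if buffer.len() == expected {
            Ok(())
        } else {
            Err(format!(
                "expected {} bytes for {:?} {:?}, got {}",
                expected,
                self.dims,
                self.elem,
                buffer.len()
            ))
        }
    }
}
```

A length check like this cannot detect a wrong layout or channel order, since those produce buffers of the same size, but it does catch the u8-versus-f32 confusion early: a [1, 3, 224, 224] f32 contract expects 602112 bytes, while the same image as u8 is only 150528 bytes.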

Step-by-Step Solution

The fix is to make inference deterministic at the integration level. Standardize the entire input/output contract and verify each backend against it.

1. Use the same semantic model across backends

Start by ensuring that every backend loads graph artifacts converted from the same source checkpoint and that class label files are identical. If a backend requires a different file format, document the conversion path and confirm input/output tensor names and shapes after conversion.

# Conceptual verification workflow (file names are examples)
# 1. Start from one canonical model
# 2. Convert to backend-specific format
# 3. Record expected input shape, dtype, and output shape

canonical_model=mobile_net
backend_a_artifact=model.xml
backend_b_artifact=model.onnx
labels=imagenet_labels.txt
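Checking that the label files really are identical is easy to automate. The helper below is a minimal sketch; `first_label_mismatch` is a hypothetical name, and it assumes each label file holds one class name per line.

```rust
/// Returns the zero-based index of the first line where the two label
/// files disagree, or None when they match line for line.
/// Trailing whitespace and CRLF line endings are ignored.
fn first_label_mismatch(a: &str, b: &str) -> Option<usize> {
    let a_lines: Vec<&str> = a.lines().map(str::trim_end).collect();
    let b_lines: Vec<&str> = b.lines().map(str::trim_end).collect();
    let common = a_lines.len().min(b_lines.len());
    for i in 0..common {
        if a_lines[i] != b_lines[i] {
            return Some(i);
        }
    }
    // One file is a strict prefix of the other: report the first extra line.
    if a_lines.len() != b_lines.len() {
        Some(common)
    } else {
        None
    }
}
```

A mismatch at index 0 often points to an extra leading entry such as a "background" class in one file, which shifts every displayed label by one.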

2. Normalize preprocessing in one shared path

Move image preprocessing into a single shared function so every backend receives the exact same tensor bytes. The most important checks are resize dimensions, channel order, data type, and normalization.

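// Sketch: `Image` is a hypothetical holder for an already-resized, tightly
// packed RGB u8 image in HWC order. MEAN and STD are the commonly used
// ImageNet values and are an assumption; use the ones your model was trained with.
struct Image { width: usize, height: usize, rgb: Vec<u8> }

const MEAN: [f32; 3] = [0.485, 0.456, 0.406];
const STD: [f32; 3] = [0.229, 0.224, 0.225];

fn preprocess_image_to_nchw_f32(img: Image) -> Vec<f32> {
    let plane = img.width * img.height;
    assert_eq!(img.rgb.len(), plane * 3, "expected packed RGB input");
    let mut out = vec![0.0f32; plane * 3];
    for (i, px) in img.rgb.chunks_exact(3).enumerate() {
        for c in 0..3 {
            // Scale to [0, 1], normalize, and store channels-first (NCHW).
            out[c * plane + i] = (px[c] as f32 / 255.0 - MEAN[c]) / STD[c];
        }
    }
    out
}

let input_tensor: Vec<f32> = preprocess_image_to_nchw_f32(image);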

If the current test program passes raw bytes directly, confirm whether each backend adapter expects that exact representation. A frequent bug is feeding HWC u8 to a backend that actually expects NCHW f32.

3. Explicitly define tensor metadata when setting input

Do not rely on assumptions in backend wrappers. Make the tensor dimensions and element type explicit and consistent with the model.

// Pseudocode reflecting the important idea
let dimensions = [1, 3, 224, 224];
let tensor_type = F32;
let tensor_data = preprocess_image_to_nchw_f32(image);

context.set_input(0, tensor_type, &dimensions, bytemuck::cast_slice(&tensor_data))?;

If a backend only accepts another layout, convert once in code and document it. Do not let each backend silently choose its own layout rule.
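As an illustration of converting once in code, the following sketch reorders a single-image NCHW buffer into NHWC. The function name is hypothetical and a batch size of 1 is assumed.

```rust
/// Reorders a single-image NCHW buffer (batch size 1) into NHWC.
fn nchw_to_nhwc(src: &[f32], channels: usize, height: usize, width: usize) -> Vec<f32> {
    assert_eq!(
        src.len(),
        channels * height * width,
        "buffer does not match the given dimensions"
    );
    let mut dst = vec![0.0f32; src.len()];
    for c in 0..channels {
        for y in 0..height {
            for x in 0..width {
                // NCHW: all pixels of one channel are stored contiguously.
                let src_idx = c * height * width + y * width + x;
                // NHWC: all channels of one pixel are stored contiguously.
                let dst_idx = (y * width + x) * channels + c;
                dst[dst_idx] = src[src_idx];
            }
        }
    }
    dst
}
```

Keeping this conversion in one function, and updating the declared dimensions to [1, height, width, channels] alongside it, means the layout decision is visible in the test program rather than hidden in a backend wrapper.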

4. Standardize output decoding

Different backends may emit logits, probabilities, or tensors with slightly different shapes. Always inspect the output size and decode using one shared implementation.

fn top_k(output: &[f32], k: usize) -> Vec<(usize, f32)> {
    let mut indexed: Vec<(usize, f32)> = output.iter().copied().enumerate().collect();
    // Sort descending by score. total_cmp avoids the panic that
    // partial_cmp(..).unwrap() would hit if a backend emits NaN.
    indexed.sort_by(|a, b| b.1.total_cmp(&a.1));
    indexed.into_iter().take(k).collect()
}

let results = top_k(&output_tensor, 5);

If one backend outputs logits and another outputs softmax probabilities, convert them into a common comparison format before asserting equality.
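A minimal sketch of that conversion, assuming the logits arrive as a flat f32 slice, is a numerically stable softmax:

```rust
/// Numerically stable softmax: subtracting the maximum logit before
/// exponentiating avoids overflow without changing the result.
fn softmax(logits: &[f32]) -> Vec<f32> {
    let max = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    exps.iter().map(|&e| e / sum).collect()
}
```

Softmax preserves the ranking of the classes, so top-k class ids can be compared on either representation; only the score comparison needs both sides in the same form.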

5. Compare with tolerance, not exact equality

For cross-backend tests, asserting identical floating-point arrays is too strict. Compare top-k class overlap, top-1 agreement, or confidence values within an acceptable tolerance.

fn approx_eq(a: f32, b: f32, eps: f32) -> bool {
    (a - b).abs() <= eps
}

assert_eq!(top1_backend_a.class_id, top1_backend_b.class_id);
assert!(approx_eq(top1_backend_a.score, top1_backend_b.score, 0.02));

This avoids false failures from expected numerical noise while still catching real integration bugs.
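Top-k overlap can be measured with a small helper like the one below. It is a sketch with a hypothetical name, and it assumes both inputs are lists of (class id, score) pairs such as those a top-k function returns.

```rust
/// Counts how many class ids appear in both top-k lists, ignoring order
/// and ignoring the scores themselves.
fn top_k_overlap(a: &[(usize, f32)], b: &[(usize, f32)]) -> usize {
    a.iter()
        .filter(|pair| b.iter().any(|other| other.0 == pair.0))
        .count()
}
```

A test might then require, for example, that at least four of the top five classes overlap. The right threshold depends on the model and image set and has to be chosen per test suite.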

6. Add backend diagnostics around graph and tensor contracts

When debugging Wasmtime test programs, log the resolved input shape, tensor type, and output length for each backend. This quickly reveals silent mismatches.

println!("backend={backend}");
println!("input_dims={:?}", dimensions);
println!("input_type={:?}", tensor_type);
println!("input_len_bytes={}", input_bytes.len());
println!("output_len={}", output_tensor.len());

If available, also inspect backend-native model metadata before loading through wasi-nn. That confirms the exported artifacts truly describe the same network contract.

7. Update the test expectation

If the issue is in the repository test itself, do not assert a single exact label-and-score string for all backends unless preprocessing and artifact parity are proven. Instead, test for stable semantic behavior.

// Better cross-backend assertion idea
// The expected class must appear somewhere in the top-k results.
assert!(results.iter().any(|(class_id, _)| *class_id == expected_class));
// The 0.5 threshold only makes sense for softmax probabilities, not raw logits.
assert!(results[0].1 > 0.5);

This is especially important for integration tests covering multiple execution providers.

Common Edge Cases

  • Channel order confusion: The source image library may decode into RGB, but a backend conversion path may assume BGR.

  • Shape includes batch dimension on one side only: A tensor of [1, 3, 224, 224] versus [3, 224, 224] can be accepted differently by adapters.

  • Quantized model on one backend, float model on another: This can produce materially different confidence scores even when the predicted class stays the same.

  • Incorrect output buffer sizing: Reading too few or too many bytes from the output tensor can make predictions appear random.

  • Different resize algorithms: Bilinear versus nearest-neighbor preprocessing can change classification for borderline images.

  • Label file mismatch: The numeric output may match, but the displayed class names differ because labels are offset or sourced from different files.
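The output buffer sizing case above can be guarded with an explicit length check before decoding. This sketch assumes the output arrives as raw little-endian f32 bytes, which matches WebAssembly's byte order; the function name is hypothetical.

```rust
/// Decodes a little-endian byte buffer into f32 values, rejecting
/// buffers whose length does not match the expected element count.
fn decode_f32_output(bytes: &[u8], expected_elems: usize) -> Result<Vec<f32>, String> {
    let expected_bytes = expected_elems * std::mem::size_of::<f32>();
    if bytes.len() != expected_bytes {
        return Err(format!(
            "expected {} bytes, got {}",
            expected_bytes,
            bytes.len()
        ));
    }
    Ok(bytes
        .chunks_exact(4)
        .map(|c| f32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect())
}
```

Failing loudly on a length mismatch turns "predictions look random" into an error message that names the expected and actual sizes.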

FAQ

Why do two wasi-nn backends disagree on the top prediction for the same image?

Usually because the backends are not receiving the same effective tensor. The most common causes are layout mismatch, normalization differences, or loading different converted model artifacts.

Is small score variance between backends considered a bug?

No. Small floating-point differences are normal across runtimes and optimized kernels. It becomes a bug when the variance changes ranking significantly or when outputs differ because the integration contract is inconsistent.

What is the best way to test wasi-nn portability across backends?

Use one canonical model, one shared preprocessing function, explicit tensor metadata, and tolerant assertions based on top-k behavior rather than exact score equality. Also log shapes, dtypes, and output lengths during test runs.

The practical resolution for this GitHub issue is to treat wasi-nn backend consistency as a contract-enforcement problem. Once model conversion, preprocessing, tensor layout, and result decoding are unified, the inconsistent classification results disappear or shrink to acceptable numerical drift.
