When the wasm_f32x4_convert_i32x4 intrinsic gets its input from an instruction that clears the top bits, the conversion is compiled into the i32x4_u variant instead of i32x4_s; for example:
#include <wasm_simd128.h>

v128_t plsno(v128_t x)
{
    // The u32x4 shift here changes the convert instruction; it's a problem
    // because u32->f32 conversion is much slower on pre-AVX512 hardware.
    x = wasm_u32x4_shr(x, 1);
    return wasm_f32x4_convert_i32x4(x);
}

With -msimd128 -O2 this compiles into:
        local.get       0
        i32.const       1
        i32x4.shr_u
        f32x4.convert_i32x4_u
        end_function
This is a problem because on x64 hardware, convert_i32x4_u is lowered into a long multi-instruction sequence unless the browser implements an AVX-512 code path and the hardware supports it (x64 only gained a native unsigned conversion, vcvtudq2ps, with AVX-512). This needlessly slows down otherwise efficient SIMD kernels.
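For comparison, the expected codegen (my assumption about the intended lowering: after the unsigned shift the value is known non-negative, so the signed and unsigned conversions are equivalent, and the signed form maps to a single cvtdq2ps on x64) would keep the signed variant:

        local.get       0
        i32.const       1
        i32x4.shr_u
        f32x4.convert_i32x4_s
        end_function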